Commit Graph

27 Commits

Author SHA1 Message Date
Rusty Russell
06f18b1d1d memleak: make notleak() work even before memleak is initalized.
It now simply renames tal names, so it's harmless to do even if we're
not going to do memleak detection.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2025-10-24 11:30:17 +10:30
Rusty Russell
6e5cb299dd global: remove unnecessary includes from C files.
Basically, `devtools/reduce-includes.sh */*.c`.

Build time from make clean (RUST=0) (includes building external libs):

Before:
	real    0m38.944000-40.416000(40.1131+/-0.4)s
	user    3m6.790000-17.159000(15.0571+/-2.8)s
	sys     0m35.304000-37.336000(36.8942+/-0.57)s
After:
	real    0m37.872000-39.974000(39.5466+/-0.59)s
	user    3m1.211000-14.968000(12.4556+/-3.9)s
	sys     0m35.008000-36.830000(36.4143+/-0.5)s

Build time after touch config.vars (RUST=0):

Before:
	real    0m19.831000-21.862000(21.5528+/-0.58)s
	user    2m15.361000-30.731000(28.4798+/-4.4)s
	sys     0m21.056000-22.339000(22.0346+/-0.35)s

After:
	real    0m18.384000-21.307000(20.8605+/-0.92)s
	user    2m5.585000-26.843000(23.6017+/-6.7)s
	sys     0m19.650000-22.003000(21.4943+/-0.69)s

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2025-10-23 06:44:04 +10:30
Rusty Russell
f6a4e79420 global: remove unnecessary includes from headers.
Each header should only include the other headers it needs to compile;
`devtools/reduce-includes.sh */*.h` does this.  The C files then need
additional includes if they don't compile.

And remove the entirely useless wire/onion_wire.h, which only serves to include wire/onion_wiregen.h.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2025-10-23 06:44:04 +10:30
Rusty Russell
8cc86f3d88 trace: prevent memleak report.
notleak() doesn't work for lightningd since the first span is created before
memleak (or anything else!) is initialized, so we have to mark it manually.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2025-04-24 17:34:12 +09:30
Rusty Russell
3f136ef42d trace: don't mess up pointers when we reallocate.
It's convenient to have pointers, but we have to do fixups.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2025-04-24 17:34:12 +09:30
Rusty Russell
55f870e963 trace: double allocation if we run out.
This doesn't happen very often, but can with autoclean.

However, we rarely traverse to the end, since we always expect to find
what we're looking for, and we fill from the front.  So even a large
array (unless it's used) is fine.

Subtle: when we injected a parent, we used "active_spans" as the (arbitrary)
key.  That can now change with reallocation, and so if that memory were reused
we could have a key clash.  So we use "&active_spans" which doesn't change.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2025-04-23 13:45:18 +09:30
Rusty Russell
858e3b074f common: remove unnecessary parent_id and remote fields.
We don't ever actually close the remote span (we don't have its key,
after all), and we keep a pointer to the parent.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2025-04-23 13:45:18 +09:30
Rusty Russell
6a8e586ea8 trace: don't use randombytes_buf(), use pseudorand.
This is much faster to give 64 bits of data, and we don't need
cryptographic randomness.

This brings us back to 413ns per trace.

Before:
	real	0m5.819000-6.472000(6.2064+/-0.26)s
	user	0m3.779000-4.101000(3.956+/-0.12)s
	sys	0m2.040000-2.431000(2.2496+/-0.15)s

After:
	real	0m3.981000-4.247000(4.1276+/-0.11)s
	user	0m3.979000-4.245000(4.126+/-0.11)s
	sys	0m0.000000-0.002000(0.001+/-0.00063)s

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Fixed: lightingd: trimmed overhead of tracing infrastructure.
2025-04-23 13:45:18 +09:30
Rusty Russell
b4dcf4e55d trace: use a static buffer instead of tal_fmt().
There's an EBPF limit anyway, so stick with a 512-byte buffer.

This brings us back to 621ns per trace:

Before:
	real	0m13.441000-14.592000(14.2686+/-0.43)s
	user	0m11.265000-12.289000(11.9626+/-0.37)s
	sys	0m2.175000-2.381000(2.3048+/-0.072)s

After:
	real	0m5.819000-6.472000(6.2064+/-0.26)s
	user	0m3.779000-4.101000(3.956+/-0.12)s
	sys	0m2.040000-2.431000(2.2496+/-0.15)s

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2025-04-23 13:45:18 +09:30
Rusty Russell
3973c35013 trace: we only ever add two tags, so use a static array.
Avoids allocations.  Also assume that name and value parameters
outlive the trace span, so don't copy.

Before:
	real	0m16.421000-18.407000(17.8128+/-0.72)s
	user	0m14.242000-16.041000(15.5382+/-0.67)s
	sys	0m2.179000-2.363000(2.273+/-0.061)s

After:
	real	0m13.441000-14.592000(14.2686+/-0.43)s
	user	0m11.265000-12.289000(11.9626+/-0.37)s
	sys	0m2.175000-2.381000(2.3048+/-0.072)s

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2025-04-23 13:45:18 +09:30
Rusty Russell
cdb74434f2 trace: don't copy strings where we don't need to.
1. trace_span_start() is always called with a string literal, so
   no copy needed (and we can use a macro to enforce this).
2. trace_span_tag() name and value are always longer-lived than
   the span, so no need to copy these either.

Before:
	real	0m18.524000-19.100000(18.7674+/-0.21)s
	user	0m16.171000-16.833000(16.424+/-0.26)s
	sys	0m2.259000-2.400000(2.337+/-0.059)s

After:
	real	0m16.421000-18.407000(17.8128+/-0.72)s
	user	0m14.242000-16.041000(15.5382+/-0.67)s
	sys	0m2.179000-2.363000(2.273+/-0.061)s

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2025-04-23 13:45:18 +09:30
Rusty Russell
e951155002 trace: fix parent handling (and test it!).
Testing parenting handling revealed several issues:

1. By calling "trace_span_start" when CLN_TRACEPARENT is set produces a bogus
   entry, for which the span_id is overwritten so we never end it.
2. We don't need to close the remote parent when we close the first child: in
   fact, this causes the remaining traces to be detached from the parent!
3. Suspension should return current to the parent, not to NULL.

Now the traces balance as we expect.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2025-04-23 13:45:18 +09:30
Rusty Russell
924f28adcb pytest: add test hooks so we can test tracing.
Suggested-by: Christian Decker
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2025-04-23 13:45:18 +09:30
Rusty Russell
c69a636fef trace: track suspensions, check they match.
I added this debugging because the next test revealed a mismatch, so
I wanted to see where it was happening.

The comment in lightningd suggests it's possible, but I can't see any
code which suspends in the lightningd io_loop, so I cannot see how
this is triggered.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2025-04-23 13:45:18 +09:30
Rusty Russell
2e6ad3ffc8 trace: handle key being freed while suspended.
This happens with autoclean, which does a datastore request then frees
the parent command without waiting for a response (see clean_finished).

This leaks a trace, and causes a crash if the pointer is later reused.

My solution is to create a trace variant which declares the trace key
to be a tal ptr and then we can clean up in the destructor if this happens.
This fixes the issue for me.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Fixed: autoclean: fixed occasional crash when tracepoints compiled in.
2025-04-03 08:27:27 -05:00
Rusty Russell
ee2ec8befa trace: minimal fix to avoid crash when > 128 traces active.
chanbackup with many peers can do more than 128 concurrent rpc commands.
autoclean is the other plugin which can do many requests at once, so I
expect a similar issue there.

I tested this by rebuilding with `MAX_ACTIVE_SPANS` 1, which autoclean
tests triggered immediately.

The real fix is probably to use a hash table with a large initial size.

```
Mar 24 06:30:45 mlbb2 sh[28000]: chanbackup: common/trace.c:190: trace_span_slot: Assertion `s' failed.
Mar 24 06:30:45 mlbb2 sh[28000]: chanbackup: FATAL SIGNAL 6 (version v25.02)
Mar 24 06:30:45 mlbb2 sh[28000]: 0x5575232bac4f send_backtrace
Mar 24 06:30:45 mlbb2 sh[28000]:         common/daemon.c:33
Mar 24 06:30:45 mlbb2 sh[28000]: 0x5575232baceb crashdump
Mar 24 06:30:45 mlbb2 sh[28000]:         common/daemon.c:78
Mar 24 06:30:45 mlbb2 sh[28000]: 0x7f2958cd851f ???
Mar 24 06:30:45 mlbb2 sh[28000]:         ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
Mar 24 06:30:45 mlbb2 sh[28000]: 0x7f2958d2c9fc __pthread_kill_implementation
Mar 24 06:30:45 mlbb2 sh[28000]:         ./nptl/pthread_kill.c:44
Mar 24 06:30:45 mlbb2 sh[28000]: 0x7f2958d2c9fc __pthread_kill_internal
Mar 24 06:30:45 mlbb2 sh[28000]:         ./nptl/pthread_kill.c:78
Mar 24 06:30:45 mlbb2 sh[28000]: 0x7f2958d2c9fc __GI___pthread_kill
Mar 24 06:30:45 mlbb2 sh[28000]:         ./nptl/pthread_kill.c:89
Mar 24 06:30:45 mlbb2 sh[28000]: 0x7f2958cd8475 __GI_raise
Mar 24 06:30:45 mlbb2 sh[28000]:         ../sysdeps/posix/raise.c:26
Mar 24 06:30:45 mlbb2 sh[28000]: 0x7f2958cbe7f2 __GI_abort
Mar 24 06:30:45 mlbb2 sh[28000]:         ./stdlib/abort.c:79
Mar 24 06:30:45 mlbb2 sh[28000]: 0x7f2958cbe71a __assert_fail_base
Mar 24 06:30:45 mlbb2 sh[28000]:         ./assert/assert.c:94
Mar 24 06:30:45 mlbb2 sh[28000]: 0x7f2958ccfe95 __GI___assert_fail
Mar 24 06:30:45 mlbb2 sh[28000]:         ./assert/assert.c:103
Mar 24 06:30:45 mlbb2 sh[28000]: 0x5575232ab7fa trace_span_slot
Mar 24 06:30:45 mlbb2 sh[28000]:         common/trace.c:190
Mar 24 06:30:45 mlbb2 sh[28000]: 0x5575232abc9f trace_span_start
Mar 24 06:30:45 mlbb2 sh[28000]:         common/trace.c:267
Mar 24 06:30:45 mlbb2 sh[28000]: 0x5575232a7c34 send_outreq
Mar 24 06:30:45 mlbb2 sh[28000]:         plugins/libplugin.c:1112
```

Changelog-Fixed: autoclean/chanbackup: fixed tracepoint crash on large number of requests.
Fixes: https://github.com/ElementsProject/lightning/issues/8177
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2025-04-01 09:20:07 -05:00
Christian Decker
c596550de1 common: Make trace debugging output configurate at compile time
Just added a couple of compile-time guards and sprinkled the invariant
checking in a couple of places (disabled if compile time guard is
unset).
2024-11-24 10:24:31 +10:30
Christian Decker
557142627f common: Fix a potential cycle in the trace structure
It turns out that under some circumstances we end up clearing the
pointee of `current` but not the pointer. Thus when we select the next
slot we can end up reusing the same slot, making it its own parent.

We forcefull break these cycles by enforcing that `current` should
never be returned and be set as its own parent.

Changelog-None
2024-11-24 10:24:31 +10:30
Christian Decker
a6c81d4174 common: Add a tree checker for trace spans
Trace spans form a tree, but we don't actually check that the
structure doesn't break. Breakage can for example come if we use the
same key accidentally, making a new span its own ancestor.
2024-11-24 10:24:31 +10:30
Christian Decker
2e59ab8f15 common: Remove the recursive parent resolution in traces
We have the space in memory set aside anyway, so let's just copy the
`trace_id` into the span itself, rather than resolving the `root` at
time of emission.
2024-11-24 10:24:31 +10:30
Christian Decker
4f3ea8c048 common: Add some debuggig capabilities to the trace subsystem
After adding the DB query instrumentation we ran into a couple of
issues, with spans not being resumed correctly, and it was rather hard
to identify the problem. This adds debug statements so we can trace
the tracing (traception if you will).

Changelog-None
2024-11-24 10:24:31 +10:30
Jesse de Wit
2c09f9ddf0 trace: trim quotes from tag values 2024-11-13 13:15:52 +01:00
Ken Sedgwick
5b2c237718 drive-by: fix valgrind complaint about uninitialized memory 2024-05-08 21:05:49 -05:00
Christian Decker
5af9d19e94 misc: Improve generated traces from common/trace.c
The integration with opentelemetry was sub-optimal: it was generating
jaeger-style traces, with short traceIds and we were considering the
entire lifetime as a single trace. This PR changes that to a trace for
startup and then a trace for any event that doesn't already have a
parent.

We also allow using the `CLN_TRACEPARENT` envvar to attach the startup
to a remote / external trace, potentially by whatever started the main
process. This is useful to see the startup trace in the wider context
of whatever tooling is built around it.

Changelog-Added: tracing: It is now possible to inject a parent for the startup trace by setting the `CLN_TRACEPARENT` envvar
2023-12-23 15:19:31 +01:00
Christian Decker
7ce0942ee2 common: Add trace_cleanup() function for unit tests 2023-12-15 13:23:51 +10:30
Christian Decker
ca808b8476 trace: Use User Statically-Defined Tracepoints to emit tracing spans 2023-07-28 09:21:42 +09:30
Christian Decker
ed8d3336c9 trace: First draft of API 2023-07-28 09:21:42 +09:30