palladum-lightning

Author	SHA1	Message	Date
Rusty Russell	cbd52d49af	common: avoid allocations for small numbers of traces. If we only have 8 or fewer spans at once (as is the normal case), don't do allocation, which might interfere with tracing. This doesn't change our test_generate_coinmoves() benchmark. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2025-11-20 16:30:50 +10:30
Rusty Russell	7f55a8ea1a	common: remove tracing exponential behaviour from large numbers of requests. If we have USDT compiled in, scanning the array of spans becomes prohibitive if we have really large numbers of requests. In the bookkeeper code, when catching up with 1.6M channel events, this became clear in profiling. Use a hash table instead. Before: tests/test_coinmoves.py::test_generate_coinmoves (100,000, sqlite3): Time (from start to end of l2 node): 269 seconds (vs 14 with HAVE_USDT=0) Worst latency: 4.0 seconds After: tests/test_coinmoves.py::test_generate_coinmoves (100,000, sqlite3): Time (from start to end of l2 node): 14 seconds Worst latency: 4.3 seconds Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2025-11-20 16:30:50 +10:30
Rusty Russell	8b9020d7b9	global: use clock_time in place of time_now(). Except for tracing, that sticks with time_now(). Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2025-11-13 21:21:29 +10:30
Rusty Russell	06f18b1d1d	memleak: make notleak() work even before memleak is initialized. It now simply renames tal names, so it's harmless to do even if we're not going to do memleak detection. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2025-10-24 11:30:17 +10:30
Rusty Russell	6e5cb299dd	global: remove unnecessary includes from C files. Basically, `devtools/reduce-includes.sh */*.c`. Build time from make clean (RUST=0) (includes building external libs): Before: real 0m38.944000-40.416000(40.1131+/-0.4)s user 3m6.790000-17.159000(15.0571+/-2.8)s sys 0m35.304000-37.336000(36.8942+/-0.57)s After: real 0m37.872000-39.974000(39.5466+/-0.59)s user 3m1.211000-14.968000(12.4556+/-3.9)s sys 0m35.008000-36.830000(36.4143+/-0.5)s Build time after touch config.vars (RUST=0): Before: real 0m19.831000-21.862000(21.5528+/-0.58)s user 2m15.361000-30.731000(28.4798+/-4.4)s sys 0m21.056000-22.339000(22.0346+/-0.35)s After: real 0m18.384000-21.307000(20.8605+/-0.92)s user 2m5.585000-26.843000(23.6017+/-6.7)s sys 0m19.650000-22.003000(21.4943+/-0.69)s Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2025-10-23 06:44:04 +10:30
Rusty Russell	f6a4e79420	global: remove unnecessary includes from headers. Each header should only include the other headers it needs to compile; `devtools/reduce-includes.sh */*.h` does this. The C files then need additional includes if they don't compile. And remove the entirely useless wire/onion_wire.h, which only serves to include wire/onion_wiregen.h. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2025-10-23 06:44:04 +10:30
Rusty Russell	8cc86f3d88	trace: prevent memleak report. notleak() doesn't work for lightningd since the first span is created before memleak (or anything else!) is initialized, so we have to mark it manually. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2025-04-24 17:34:12 +09:30
Rusty Russell	3f136ef42d	trace: don't mess up pointers when we reallocate. It's convenient to have pointers, but we have to do fixups. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2025-04-24 17:34:12 +09:30
Rusty Russell	55f870e963	trace: double allocation if we run out. This doesn't happen very often, but can with autoclean. However, we rarely traverse to the end, since we always expect to find what we're looking for, and we fill from the front. So even a large array (unless it's used) is fine. Subtle: when we injected a parent, we used "active_spans" as the (arbitrary) key. That can now change with reallocation, and so if that memory were reused we could have a key clash. So we use "&active_spans" which doesn't change. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2025-04-23 13:45:18 +09:30
Rusty Russell	858e3b074f	common: remove unnecessary parent_id and remote fields. We don't ever actually close the remote span (we don't have its key, after all), and we keep a pointer to the parent. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2025-04-23 13:45:18 +09:30
Rusty Russell	6a8e586ea8	trace: don't use randombytes_buf(), use pseudorand. This is much faster to give 64 bits of data, and we don't need cryptographic randomness. This brings us back to 413ns per trace. Before: real 0m5.819000-6.472000(6.2064+/-0.26)s user 0m3.779000-4.101000(3.956+/-0.12)s sys 0m2.040000-2.431000(2.2496+/-0.15)s After: real 0m3.981000-4.247000(4.1276+/-0.11)s user 0m3.979000-4.245000(4.126+/-0.11)s sys 0m0.000000-0.002000(0.001+/-0.00063)s Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Changelog-Fixed: lightningd: trimmed overhead of tracing infrastructure.	2025-04-23 13:45:18 +09:30
Rusty Russell	b4dcf4e55d	trace: use a static buffer instead of tal_fmt(). There's an EBPF limit anyway, so stick with a 512-byte buffer. This brings us back to 621ns per trace: Before: real 0m13.441000-14.592000(14.2686+/-0.43)s user 0m11.265000-12.289000(11.9626+/-0.37)s sys 0m2.175000-2.381000(2.3048+/-0.072)s After: real 0m5.819000-6.472000(6.2064+/-0.26)s user 0m3.779000-4.101000(3.956+/-0.12)s sys 0m2.040000-2.431000(2.2496+/-0.15)s Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2025-04-23 13:45:18 +09:30
Rusty Russell	3973c35013	trace: we only ever add two tags, so use a static array. Avoids allocations. Also assume that name and value parameters outlive the trace span, so don't copy. Before: real 0m16.421000-18.407000(17.8128+/-0.72)s user 0m14.242000-16.041000(15.5382+/-0.67)s sys 0m2.179000-2.363000(2.273+/-0.061)s After: real 0m13.441000-14.592000(14.2686+/-0.43)s user 0m11.265000-12.289000(11.9626+/-0.37)s sys 0m2.175000-2.381000(2.3048+/-0.072)s Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2025-04-23 13:45:18 +09:30
Rusty Russell	cdb74434f2	trace: don't copy strings where we don't need to. 1. trace_span_start() is always called with a string literal, so no copy needed (and we can use a macro to enforce this). 2. trace_span_tag() name and value are always longer-lived than the span, so no need to copy these either. Before: real 0m18.524000-19.100000(18.7674+/-0.21)s user 0m16.171000-16.833000(16.424+/-0.26)s sys 0m2.259000-2.400000(2.337+/-0.059)s After: real 0m16.421000-18.407000(17.8128+/-0.72)s user 0m14.242000-16.041000(15.5382+/-0.67)s sys 0m2.179000-2.363000(2.273+/-0.061)s Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2025-04-23 13:45:18 +09:30
Rusty Russell	e951155002	trace: fix parent handling (and test it!). Testing parent handling revealed several issues: 1. Calling "trace_span_start" when CLN_TRACEPARENT is set produces a bogus entry, for which the span_id is overwritten so we never end it. 2. We don't need to close the remote parent when we close the first child: in fact, this causes the remaining traces to be detached from the parent! 3. Suspension should return current to the parent, not to NULL. Now the traces balance as we expect. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2025-04-23 13:45:18 +09:30
Rusty Russell	924f28adcb	pytest: add test hooks so we can test tracing. Suggested-by: Christian Decker Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2025-04-23 13:45:18 +09:30
Rusty Russell	c69a636fef	trace: track suspensions, check they match. I added this debugging because the next test revealed a mismatch, so I wanted to see where it was happening. The comment in lightningd suggests it's possible, but I can't see any code which suspends in the lightningd io_loop, so I cannot see how this is triggered. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2025-04-23 13:45:18 +09:30
Rusty Russell	2e6ad3ffc8	trace: handle key being freed while suspended. This happens with autoclean, which does a datastore request then frees the parent command without waiting for a response (see clean_finished). This leaks a trace, and causes a crash if the pointer is later reused. My solution is to create a trace variant which declares the trace key to be a tal ptr and then we can clean up in the destructor if this happens. This fixes the issue for me. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Changelog-Fixed: autoclean: fixed occasional crash when tracepoints compiled in.	2025-04-03 08:27:27 -05:00
Rusty Russell	ee2ec8befa	trace: minimal fix to avoid crash when > 128 traces active. chanbackup with many peers can do more than 128 concurrent rpc commands. autoclean is the other plugin which can do many requests at once, so I expect a similar issue there. I tested this by rebuilding with `MAX_ACTIVE_SPANS` 1, which autoclean tests triggered immediately. The real fix is probably to use a hash table with a large initial size. ``` Mar 24 06:30:45 mlbb2 sh[28000]: chanbackup: common/trace.c:190: trace_span_slot: Assertion `s' failed. Mar 24 06:30:45 mlbb2 sh[28000]: chanbackup: FATAL SIGNAL 6 (version v25.02) Mar 24 06:30:45 mlbb2 sh[28000]: 0x5575232bac4f send_backtrace Mar 24 06:30:45 mlbb2 sh[28000]: common/daemon.c:33 Mar 24 06:30:45 mlbb2 sh[28000]: 0x5575232baceb crashdump Mar 24 06:30:45 mlbb2 sh[28000]: common/daemon.c:78 Mar 24 06:30:45 mlbb2 sh[28000]: 0x7f2958cd851f ??? Mar 24 06:30:45 mlbb2 sh[28000]: ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0 Mar 24 06:30:45 mlbb2 sh[28000]: 0x7f2958d2c9fc __pthread_kill_implementation Mar 24 06:30:45 mlbb2 sh[28000]: ./nptl/pthread_kill.c:44 Mar 24 06:30:45 mlbb2 sh[28000]: 0x7f2958d2c9fc __pthread_kill_internal Mar 24 06:30:45 mlbb2 sh[28000]: ./nptl/pthread_kill.c:78 Mar 24 06:30:45 mlbb2 sh[28000]: 0x7f2958d2c9fc __GI___pthread_kill Mar 24 06:30:45 mlbb2 sh[28000]: ./nptl/pthread_kill.c:89 Mar 24 06:30:45 mlbb2 sh[28000]: 0x7f2958cd8475 __GI_raise Mar 24 06:30:45 mlbb2 sh[28000]: ../sysdeps/posix/raise.c:26 Mar 24 06:30:45 mlbb2 sh[28000]: 0x7f2958cbe7f2 __GI_abort Mar 24 06:30:45 mlbb2 sh[28000]: ./stdlib/abort.c:79 Mar 24 06:30:45 mlbb2 sh[28000]: 0x7f2958cbe71a __assert_fail_base Mar 24 06:30:45 mlbb2 sh[28000]: ./assert/assert.c:94 Mar 24 06:30:45 mlbb2 sh[28000]: 0x7f2958ccfe95 __GI___assert_fail Mar 24 06:30:45 mlbb2 sh[28000]: ./assert/assert.c:103 Mar 24 06:30:45 mlbb2 sh[28000]: 0x5575232ab7fa trace_span_slot Mar 24 06:30:45 mlbb2 sh[28000]: common/trace.c:190 Mar 24 06:30:45 mlbb2 sh[28000]: 0x5575232abc9f trace_span_start Mar 24 06:30:45 mlbb2 sh[28000]: common/trace.c:267 Mar 24 06:30:45 mlbb2 sh[28000]: 0x5575232a7c34 send_outreq Mar 24 06:30:45 mlbb2 sh[28000]: plugins/libplugin.c:1112 ``` Changelog-Fixed: autoclean/chanbackup: fixed tracepoint crash on large number of requests. Fixes: https://github.com/ElementsProject/lightning/issues/8177 Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2025-04-01 09:20:07 -05:00
Christian Decker	c596550de1	common: Make trace debugging output configurable at compile time Just added a couple of compile-time guards and sprinkled the invariant checking in a couple of places (disabled if compile time guard is unset).	2024-11-24 10:24:31 +10:30
Christian Decker	557142627f	common: Fix a potential cycle in the trace structure It turns out that under some circumstances we end up clearing the pointee of `current` but not the pointer. Thus when we select the next slot we can end up reusing the same slot, making it its own parent. We forcefully break these cycles by enforcing that `current` should never be returned and be set as its own parent. Changelog-None	2024-11-24 10:24:31 +10:30
Christian Decker	a6c81d4174	common: Add a tree checker for trace spans Trace spans form a tree, but we don't actually check that the structure doesn't break. Breakage can for example come if we use the same key accidentally, making a new span its own ancestor.	2024-11-24 10:24:31 +10:30
Christian Decker	2e59ab8f15	common: Remove the recursive parent resolution in traces We have the space in memory set aside anyway, so let's just copy the `trace_id` into the span itself, rather than resolving the `root` at time of emission.	2024-11-24 10:24:31 +10:30
Christian Decker	4f3ea8c048	common: Add some debugging capabilities to the trace subsystem After adding the DB query instrumentation we ran into a couple of issues, with spans not being resumed correctly, and it was rather hard to identify the problem. This adds debug statements so we can trace the tracing (traception if you will). Changelog-None	2024-11-24 10:24:31 +10:30
Jesse de Wit	2c09f9ddf0	trace: trim quotes from tag values	2024-11-13 13:15:52 +01:00
Ken Sedgwick	5b2c237718	drive-by: fix valgrind complaint about uninitialized memory	2024-05-08 21:05:49 -05:00
Christian Decker	5af9d19e94	misc: Improve generated traces from common/trace.c The integration with opentelemetry was sub-optimal: it was generating jaeger-style traces, with short traceIds and we were considering the entire lifetime as a single trace. This PR changes that to a trace for startup and then a trace for any event that doesn't already have a parent. We also allow using the `CLN_TRACEPARENT` envvar to attach the startup to a remote / external trace, potentially by whatever started the main process. This is useful to see the startup trace in the wider context of whatever tooling is built around it. Changelog-Added: tracing: It is now possible to inject a parent for the startup trace by setting the `CLN_TRACEPARENT` envvar	2023-12-23 15:19:31 +01:00
Christian Decker	7ce0942ee2	common: Add trace_cleanup() function for unit tests	2023-12-15 13:23:51 +10:30
Christian Decker	ca808b8476	trace: Use User Statically-Defined Tracepoints to emit tracing spans	2023-07-28 09:21:42 +09:30
Christian Decker	ed8d3336c9	trace: First draft of API	2023-07-28 09:21:42 +09:30

30 Commits
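
Commits 55f870e963 and 3f136ef42d describe doubling the span array when it fills up and then fixing up the pointers that spans hold to one another. The following is a minimal sketch of that idea only, not the code in common/trace.c: the names `demo_span`, `demo_spans`, `demo_find`, `demo_spans_grow` and `demo_span_start` are hypothetical, a key of 0 is assumed to mark a free slot, and plain `calloc`/`free` stand in for whatever allocator the real code uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified span: only the fields needed for the sketch. */
struct demo_span {
	size_t key;               /* assumption: 0 marks a free slot */
	struct demo_span *parent; /* points into the same array, or NULL */
};

struct demo_spans {
	struct demo_span *arr;
	size_t len;
};

/* Linear scan from the front; key 0 finds the first free slot. */
static struct demo_span *demo_find(struct demo_spans *s, size_t key)
{
	for (size_t i = 0; i < s->len; i++)
		if (s->arr[i].key == key)
			return &s->arr[i];
	return NULL;
}

/* Double the array. Parent pointers refer to the old block, so each one
 * is rebased onto the new block while the old block is still valid. */
static int demo_spans_grow(struct demo_spans *s)
{
	size_t newlen = s->len ? s->len * 2 : 1;
	struct demo_span *n = calloc(newlen, sizeof(*n));

	if (!n)
		return -1;
	for (size_t i = 0; i < s->len; i++) {
		n[i] = s->arr[i];
		if (s->arr[i].key && s->arr[i].parent)
			n[i].parent = n + (s->arr[i].parent - s->arr);
	}
	free(s->arr);
	s->arr = n;
	s->len = newlen;
	return 0;
}

/* Take the first free slot (growing if there is none) and link it to the
 * span with parent_key, or to no parent when parent_key is 0. */
static struct demo_span *demo_span_start(struct demo_spans *s,
					 size_t key, size_t parent_key)
{
	struct demo_span *slot = demo_find(s, 0);

	if (!slot) {
		size_t oldlen = s->len;
		if (demo_spans_grow(s) != 0)
			return NULL;
		slot = &s->arr[oldlen]; /* first slot of the new half */
	}
	slot->parent = parent_key ? demo_find(s, parent_key) : NULL;
	slot->key = key;
	return slot;
}
```

Starting three nested spans on an empty `struct demo_spans` grows the array twice in this sketch, and each span's `parent` still resolves to the correct slot afterwards. Note that any pointer a caller keeps to a span becomes stale after growth, which is the same reason the 55f870e963 message switches the injected parent's key from "active_spans" to "&active_spans".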