Also added the missing "added" annotation. This meant that I had to manually
change contrib/msggen/msggen/patch.py to insert that "added" annotation where it
was missing from .msggen.json.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-None: introduced this release.
The original method name was lsps-lsps2-invoice, but I somehow messed it
up and renamed it during a rebase.
Changelog-Changed: lsps-jitchannel is now lsps-lsps2-invoice
Signed-off-by: Peter Neuroth <pet.v.ne@gmail.com>
We are seeing this in the CI logs, e.g. tests/test_connection.py::test_reconnect_sender_add1:
lightningd-1 2025-11-17T05:48:00.665Z DEBUG jsonrpc#84: Pausing parsing after 1 requests
followed by:
lightningd-1 2025-11-17T05:48:02.068Z **BROKEN** 022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-connectd: wake delay for WIRE_CHANNEL_REESTABLISH: 8512msec
So, what is consuming lightningd for 8 or so seconds?
This message helped diagnose that the issue was dev-memleak: fixed in a different branch.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Since we are the only writer, we don't need one.
Name (time in s) Min Max Mean StdDev Median
sqlite: test_spam_listcommands (before) 2.1193 2.4524 2.2343 0.1341 2.2229
sqlite: test_spam_listcommands (after) 2.0140 2.2349 2.1001 0.0893 2.0644
Postgres: test_spam_listcommands (before) 6.5572 6.8440 6.7067 0.1032 6.6967
Postgres: test_spam_listcommands (after) 4.4237 5.0024 4.6495 0.2278 4.6717
A nice 31% speedup!
Changelog-Changed: Postgres: significant speedup on read-only operations (e.g. 30% on empty SELECTs)
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
I had forgotten this file existed, but it needs tqdm and pytest-benchmark, so add those dev
requirements.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We always start a transaction before processing, but there are cases where
we don't need to. Switch to doing it on-demand.
This doesn't make a big difference for sqlite3, but it can for Postgres because
of the latency: 12% or so. Every bit helps!
30 runs, min-max(mean+/-stddev):
Postgres before: 8.842773-9.769030(9.19531+/-0.21)
Postgres after: 8.007967-8.321856(8.14172+/-0.066)
sqlite3 before: 7.486042-8.371831(8.15544+/-0.19)
sqlite3 after: 7.973411-8.576135(8.3025+/-0.12)
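A minimal sketch of the on-demand idea (hypothetical names, not the actual lightningd db wrapper): BEGIN lazily the first time something actually needs a transaction during processing, and COMMIT only if one was started.

    #include <stdbool.h>
    #include <stdio.h>

    struct db {
        bool in_transaction;
    };

    /* Only issue BEGIN the first time someone actually needs it. */
    static void db_begin_if_needed(struct db *db)
    {
        if (db->in_transaction)
            return;
        printf("BEGIN;\n");
        db->in_transaction = true;
    }

    static void db_exec(struct db *db, const char *sql, bool is_write)
    {
        /* Read-only processing no longer pays for a transaction. */
        if (is_write)
            db_begin_if_needed(db);
        printf("%s\n", sql);
    }

    /* After processing: commit only if a transaction was started. */
    static void db_commit_if_needed(struct db *db)
    {
        if (!db->in_transaction)
            return;
        printf("COMMIT;\n");
        db->in_transaction = false;
    }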
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This is slow, but will make sure we find out if we add latency spikes in future.
tests/test_coinmoves.py::test_generate_coinmoves (5,000,000, sqlite3):
Time (from start to end of l2 node): 223 seconds
Latency min/median/max: 0.0023 / 0.0033 / 0.113 seconds
tests/test_coinmoves.py::test_generate_coinmoves (5,000,000, Postgres):
Time (from start to end of l2 node): 470 seconds
Latency min/median/max: 0.0024 / 0.0098 / 0.124 seconds
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Fixed: lightningd: multiple significant speedups for large nodes, especially preventing "freezes" under exceptionally high load.
This avoids latency spikes when we ask lightningd to give us 2M entries.
tests/test_coinmoves.py::test_generate_coinmoves (2,000,000, sqlite3):
Time (from start to end of l2 node): 88 seconds (was 95)
Worst latency: 0.028 seconds **WAS 4.5**
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Now that we've found all the other issues, the latency spike (4 seconds on my laptop)
for querying 2M elements remains.
Restore the limited sampling which we reverted, but make it 10,000 now.
This doesn't help our worst-case latency, because the sql plugin still asks for all 2M entries on
first access. We address that next.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We have a reasonable number of commands now, and we *already* keep a
strmap for the usage strings. So simply keep the usage and the command
in the map, and skip the array.
tests/test_coinmoves.py::test_generate_coinmoves (2,000,000, sqlite3):
Time (from start to end of l2 node): 95 seconds (was 102)
Worst latency: 4.5 seconds
tests/test_coinmoves.py::test_generate_coinmoves (2,000,000, Postgres):
Time (from start to end of l2 node): 231 seconds
Worst latency: 4.8 seconds
For comparison, the values for the 25.09.2 release are:
sqlite3:
Time (from start to end of l2 node): 403 seconds
Postgres:
Time (from start to end of l2 node): 671 seconds
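For illustration, a rough sketch of the strmap idea (the structure names here are made up, not the real lightningd ones): the command pointer lives next to the usage string in the map, so dispatch is one strmap_get() instead of a walk over a separate array.

    #include <ccan/strmap/strmap.h>

    struct json_command;                    /* opaque for this sketch */

    struct cmd_entry {
        const struct json_command *command;
        const char *usage;
    };

    /* One map keyed by command name replaces map-plus-array. */
    struct cmd_map {
        STRMAP_MEMBERS(struct cmd_entry *);
    };

    static const struct json_command *find_command(struct cmd_map *map,
                                                   const char *name)
    {
        struct cmd_entry *e = strmap_get(map, name);
        return e ? e->command : NULL;
    }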
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
tests/test_coinmoves.py::test_generate_coinmoves (2,000,000, sqlite3):
Time (from start to end of l2 node): 102 seconds **WAS 126**
Worst latency: 4.5 seconds **WAS 5.1**
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Now that ccan/io rotates through callbacks, we can call io_always() to
yield.
We're now fast enough that this doesn't have any effect on this test,
but it's still good to have.
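A rough sketch of the pattern (illustrative names, assuming ccan/io as shipped in the tree): after handling one chunk of work, return io_always() so we get re-invoked on a fresh pass through the event loop instead of monopolizing it.

    #include <ccan/io/io.h>

    struct writer;          /* whatever per-connection state we keep */

    static struct io_plan *write_next(struct io_conn *conn, struct writer *w)
    {
        /* ... emit one chunk of output for w ... */

        /* Yield: io_always() calls us back on the next loop iteration,
         * after other connections have had their turn. */
        return io_always(conn, write_next, w);
    }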
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Now that ccan/io rotates through callbacks, we can call io_always() to yield.
Though it doesn't fire on our benchmark, it's a good thing to do.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This rotates through fds explicitly, to avoid unfairness.
This doesn't really make a difference until we start using it.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We would keep parsing if we were out of tokens, even if we had actually
finished one object!
These numbers are a comparison against the "xpay: use filtering on rpc_command
so we only get called on "pay"." commit, not the disastrous previous one!
tests/test_coinmoves.py::test_generate_coinmoves (2,000,000, sqlite3):
Time (from start to end of l2 node): 126 seconds (was 135)
Worst latency: 5.1 seconds **WAS 12.1**
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
A client can trigger this by sending a large request, so this lets us see what
happens when they do, even though 1MB (a 2MB buffer) is more than we need.
This drives our performance through the floor: see next patch which gets
us back on track.
tests/test_coinmoves.py::test_generate_coinmoves (2,000,000, sqlite3):
Time (from start to end of l2 node): 271 seconds **WAS 135**
Worst latency: 105 seconds **WAS 12.1**
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Added: Plugins: "filters" can be specified on the `custommsg` hook to limit what message types the hook will be called for.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Added: pyln-client: optional filters can be given when hooks are registered (for supported hooks)
tests/test_coinmoves.py::test_generate_coinmoves (2,000,000, sqlite3):
Time (from start to end of l2 node): 135 seconds **WAS 227**
Worst latency: 12.1 seconds **WAS 62.4**
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
xpay relies on the destructor to send another request. This means
that it doesn't actually submit the request until the *next time* we wake.
This has been in xpay from the start, but it wasn't noticeable until
xpay stopped subscribing to every command on the rpc_command hook.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Added: Plugins: the `rpc_command` hook can now specify a "filter" on what commands it is interested in.
We're going to use this on the "rpc_command" hook, to allow xpay to specify that it
only wants to be called on "pay" commands.
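As a sketch of the mechanism on the lightningd side (field and function names here are hypothetical, not the actual hook code): a registration can carry a list of command names, and we skip plugins whose filter doesn't mention the command being run.

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    struct hook_registration {
        /* NULL-terminated list of command names, or NULL for "all". */
        const char **cmd_filter;
    };

    static bool hook_wants_command(const struct hook_registration *r,
                                   const char *cmd)
    {
        if (!r->cmd_filter)
            return true;
        for (size_t i = 0; r->cmd_filter[i]; i++)
            if (strcmp(r->cmd_filter[i], cmd) == 0)
                return true;
        return false;
    }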
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This potentially saves us some reads (not measurably, though), at the cost
of less fairness. It's still important to measure, because a single
large request will increase the buffer size for successive requests, so we
can see this pattern in real usage.
tests/test_coinmoves.py::test_generate_coinmoves (2,000,000, sqlite3):
Time (from start to end of l2 node): 227 seconds (was 239)
Worst latency: 62.4 seconds (was 56.9)
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Profiling shows us spending all our time in tal_arr_remove when dealing
with a giant number of output streams. This applies both for RPC output
and plugin output.
Use linked list instead.
tests/test_coinmoves.py::test_generate_coinmoves (2,000,000, sqlite3):
Time (from start to end of l2 node): 239 seconds **WAS 518**
Worst latency: 56.9 seconds **WAS 353**
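A condensed sketch of the change (illustrative names, assuming ccan/list): each pending output stream sits on a doubly-linked list, so removing a finished stream is O(1) instead of an O(n) shuffle through a huge tal array.

    #include <ccan/list/list.h>

    struct out_stream {
        struct list_node list;          /* linked into the owner's out_list */
        /* ... buffered JSON output ... */
    };

    struct out_owner {
        struct list_head out_list;
    };

    static void out_stream_done(struct out_owner *owner, struct out_stream *os)
    {
        /* O(1): unlink, no shuffling of the remaining streams. */
        list_del_from(&owner->out_list, &os->list);
    }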
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Now that we've rid ourselves of the worst offenders, we can make this a real
stress test. We remove plugin io saving and low-level logging, to avoid
benchmarking test-harness artifacts.
Here are the results:
tests/test_coinmoves.py::test_generate_coinmoves (2,000,000, sqlite3):
Time (from start to end of l2 node): 518 seconds
Worst latency: 353 seconds
tests/test_coinmoves.py::test_generate_coinmoves (2,000,000, Postgres):
Time (from start to end of l2 node): 417 seconds
Worst latency: 96.6 seconds
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
If we only have 8 or fewer spans at once (as is the normal case), don't
do allocation, which might interfere with tracing.
This doesn't change our test_generate_coinmoves() benchmark.
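Roughly, the idea is a small static slot array (sizes and names below are illustrative): the common case of at most 8 live spans never touches the allocator, and only overflow falls back to allocation.

    #include <stddef.h>

    #define NUM_STATIC_SPANS 8

    struct span;

    static struct span *static_spans[NUM_STATIC_SPANS];

    /* Returns a free static slot, or NULL if all 8 are busy and the
     * caller must fall back to allocating. */
    static struct span **span_slot(void)
    {
        for (size_t i = 0; i < NUM_STATIC_SPANS; i++)
            if (!static_spans[i])
                return &static_spans[i];
        return NULL;
    }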
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
If we have USDT compiled in, scanning the array of spans becomes
prohibitive if we have really large numbers of requests. In the
bookkeeper code, when catching up with 1.6M channel events, this
became clear in profiling.
Use a hash table instead.
Before:
tests/test_coinmoves.py::test_generate_coinmoves (100,000, sqlite3):
Time (from start to end of l2 node): 269 seconds (vs 14 with HAVE_USDT=0)
Worst latency: 4.0 seconds
After:
tests/test_coinmoves.py::test_generate_coinmoves (100,000, sqlite3):
Time (from start to end of l2 node): 14 seconds
Worst latency: 4.3 seconds
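A minimal sketch of the lookup change (not the actual tracing code; a simple chained table stands in for the real hash table): spans are found by hashing their id rather than scanning an ever-growing array.

    #include <stdint.h>

    struct span {
        uint64_t id;
        struct span *next;      /* chain within one bucket */
    };

    #define SPAN_BUCKETS 512

    static struct span *span_buckets[SPAN_BUCKETS];

    static struct span *span_find(uint64_t id)
    {
        for (struct span *s = span_buckets[id % SPAN_BUCKETS]; s; s = s->next)
            if (s->id == id)
                return s;
        return NULL;
    }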
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
When we have many commands, this is where we spend all our time, and it's
just for an old assertion.
tests/test_coinmoves.py::test_generate_coinmoves (100,000, sqlite3):
Time (from start to end of l2 node): 13 seconds **WAS 34**
Worst latency: 4.0 seconds **WAS 24**
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
If we add a new hook, not at the end, while hooks are getting called,
then iteration could be messed up (e.g. calling a plugin twice, or
skipping one).
The simplest thing is to defer updates until nobody is calling the
hook. In theory this could livelock, in practice it won't.
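Sketched below with fixed-size arrays just to keep it short (all names are made up): additions are queued while any call is iterating the subscriber list, and merged in once the last caller finishes.

    #include <assert.h>
    #include <stddef.h>

    #define MAX_SUBS 64

    struct hook_state {
        size_t calls_in_flight;         /* callers currently iterating subs[] */
        const void *subs[MAX_SUBS];
        size_t num_subs;
        const void *pending[MAX_SUBS];  /* queued while calls are in flight */
        size_t num_pending;
    };

    static void hook_subscribe(struct hook_state *h, const void *sub)
    {
        if (h->calls_in_flight) {
            assert(h->num_pending < MAX_SUBS);
            h->pending[h->num_pending++] = sub;
        } else {
            assert(h->num_subs < MAX_SUBS);
            h->subs[h->num_subs++] = sub;
        }
    }

    static void hook_call_done(struct hook_state *h)
    {
        if (--h->calls_in_flight != 0)
            return;
        /* Nobody is iterating any more: apply the deferred additions. */
        for (size_t i = 0; i < h->num_pending; i++) {
            assert(h->num_subs < MAX_SUBS);
            h->subs[h->num_subs++] = h->pending[i];
        }
        h->num_pending = 0;
    }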
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We make a copy, then attach a destructor to the hook in case that plugin exits, so we
can NULL it out in the local copy. When we have 300,000 requests pending, that means
we have 300,000 destructors, which don't scale (they're kept on a singly-linked list).
Simply NULL out (rather than shrink) the array in the `plugin_hook`.
Then we can keep using that.
tests/test_coinmoves.py::test_generate_coinmoves (100,000, sqlite3):
Time (from start to end of l2 node): 34 seconds **WAS 85**
Worst latency: 24 seconds **WAS 75**
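Schematically (illustrative names): the plugin_hook keeps one shared array, a dying plugin's slot is simply set to NULL, and callers skip NULL entries, so 300,000 pending calls need no per-copy destructors.

    #include <stddef.h>

    struct plugin;

    struct hook_subscribers {
        struct plugin **plugins;        /* slot becomes NULL when its plugin exits */
        size_t num;
    };

    static void hook_forget_plugin(struct hook_subscribers *subs,
                                   const struct plugin *p)
    {
        /* Callers iterating the hook just skip NULL entries. */
        for (size_t i = 0; i < subs->num; i++)
            if (subs->plugins[i] == p)
                subs->plugins[i] = NULL;
    }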
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We start with 100,000 entries. We will scale this to 2M as we fix the
O(N^2) bottlenecks.
I measure the node time after we modify the db, like so:
while guilt push && rm -rf /tmp/ltests* && uv run make -s RUST=0; do RUST=0 VALGRIND=0 TIMEOUT=100 TEST_DEBUG=1 eatmydata uv run pytest -vvv -p no:logging tests/test_coinmoves.py::test_generate_coinmoves > /tmp/`guilt top`-sql 2>&1; done
Then analyzed the results with:
FILE=/tmp/synthetic-data.patch-sql; START=$(grep 'lightningd-2 .* Server started with public key' $FILE | tail -n1 | cut -d\ -f2 | cut -d. -f1); END=$(grep 'lightningd-2 .* JSON-RPC shutdown' $FILE | tail -n1 | cut -d\ -f2 | cut -d. -f1); echo $(( $(date +%s -d $END) - $(date +%s -d $START) )); grep 'E assert' $FILE;
tests/test_coinmoves.py::test_generate_coinmoves (100,000, sqlite3):
Time (from start to end of l2 node): 85 seconds
Worst latency: 75 seconds
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This reverts commit 1dda0c0753 so we can test
what it's like to be flooded with logs again.
This benefits from other improvements we've made this release to plugin
input handling (i.e. converting to use common/jsonrpc_io), so this doesn't
make much difference.
tests/test_coinmoves.py::test_generate_coinmoves (100,000, sqlite3):
Time (from start to end of l2 node): 211 seconds
Worst latency: 108 seconds
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This reverts the `bookkeeper: only read listchannelmoves 1000 entries at a time.` commit,
so we can properly fix the scalability issues in the coming patches.
tests/test_coinmoves.py::test_generate_coinmoves (100,000):
Time (from start to end of l2 node): 207 seconds
Worst latency: 106 seconds
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>