Basically, `devtools/reduce-includes.sh */*.c`.
Build time from make clean (RUST=0) (includes building external libs):
Before:
real 0m38.944000-40.416000(40.1131+/-0.4)s
user 3m6.790000-17.159000(15.0571+/-2.8)s
sys 0m35.304000-37.336000(36.8942+/-0.57)s
After:
real 0m37.872000-39.974000(39.5466+/-0.59)s
user 3m1.211000-14.968000(12.4556+/-3.9)s
sys 0m35.008000-36.830000(36.4143+/-0.5)s
Build time after touch config.vars (RUST=0):
Before:
real 0m19.831000-21.862000(21.5528+/-0.58)s
user 2m15.361000-30.731000(28.4798+/-4.4)s
sys 0m21.056000-22.339000(22.0346+/-0.35)s
After:
real 0m18.384000-21.307000(20.8605+/-0.92)s
user 2m5.585000-26.843000(23.6017+/-6.7)s
sys 0m19.650000-22.003000(21.4943+/-0.69)s
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Each header should only include the other headers it needs to compile;
`devtools/reduce-includes.sh */*.h` does this. The C files then need
additional includes if they don't compile.
And remove the entirely useless wire/onion_wire.h, which only serves to include wire/onion_wiregen.h.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This means we don't have to manually choose what to link against,
which is much of the complexity of our Makefiles: the compiler will
automatically use any object files it needs to link.
We already do this for ccan as libccan.a, now we have libcommon.a.
We don't link against it for *everything*, as some tests require their own
versions.
Notes:
1. I get rid of the weird plugins/test/Makefile2 (accidental commit?)
2. Many tests change due to update-mocks.
3. In some places I added the missing dependency on the Makefile itself, though most are in the next
patch.
Before:
Total program size: 221366528
Total tests size: 364243856
After:
Total program size: 190733656
Total tests size: 337880888
Build time from make clean (RUST=0) (includes building external libs):
Before:
real 0m38.227000-44.245000(41.8222+/-1.6)s
user 3m2.105000-33.696000(23.1442+/-8.4)s
sys 0m35.054000-42.269000(39.7231+/-2)s
After:
real 0m38.944000-40.416000(40.1131+/-0.4)s
user 3m6.790000-17.159000(15.0571+/-2.8)s
sys 0m35.304000-37.336000(36.8942+/-0.57)s
Build time after touch config.vars (RUST=0):
Before:
real 0m18.928000-22.776000(21.5084+/-1.1)s
user 2m8.613000-36.567000(27.7281+/-7.7)s
sys 0m20.458000-23.436000(22.3963+/-0.77)s
After:
real 0m19.831000-21.862000(21.5528+/-0.58)s
user 2m15.361000-30.731000(28.4798+/-4.4)s
sys 0m21.056000-22.339000(22.0346+/-0.35)s
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
rusty@rusty-Framework:~/devel/cvs/lightni
openingd sends an ERROR, and exits. lightningd tells us to
disconnect. We read from lightningd first, and don't read from
openingd.
We need to drain subds when we're told to disconnect.
One issue we have in CI is reconnection races: if an incoming
connection arrives while an outgoing one is negotiated, we close the
outgoing one and issue a disconnect, which fails any connect attempts.
By sending a "reconnected" message instead of disconnect/connect we
can avoid disturbing in-progress connection attempts which happens in CI
quite a bit.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We can stop listening on the incoming peer while we are closing, so we don't notice if they close:
```
['lightningd-2 2025-09-03T09:48:19.555Z **BROKEN** 0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-connectd: Peer did not close, forcing close', 'lightningd-2 2025-09-03T09:48:22.918Z **BROKEN** 0266e4598d1d3c415f572a8488830b60f7e744ed9235eb0b1ba93283b315c03518-connectd: Peer did not close, forcing close']
=========================== short test summary info ============================
ERROR tests/test_misc.py::test_even_sendcustommsg - ValueError:
```
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This message is logged when connectd tries to shut down a peer
connection but the transmit buffer remains full for too long, maybe
because the peer has crashed or has lost connectivity. Logging this
message at the BROKEN level is inappropriate because BROKEN is intended
to flag logic errors that imply incorrect code in CLN. The error in
question here is actually a runtime error, which does not imply
incorrect code (at least on our side), so demote the log message to the
UNUSUAL level. (Even this is still probably too severe, as this message
is logged rather more frequently than "unusual" would suggest.)
Changelog-None
Closes: https://github.com/ElementsProject/lightning/issues/5678
In a0fd72eb5e I added a diagnostic message if messages cause large
delays, *but* I didn't set the "peer_in_lasttime" variable in the case
of locally-handled packets.
I really want this in the release: the point of this was to try to
diagnose some high-latency ping issues we've seen on the real network.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We haven't seen the "excessive queue length" backtrace since we fixed gossipd,
so it's safe to drop excess messages without worrying about losing gossip.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We cannot use subd_req() here: replies will come out of order, and the
we should not simply assign the reponses in FIFO order.
Changelog-Fixed: lightningd: don't get confused with parallel ping commands.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
One reason why ping processing could be slow is that, once we receive
a message from the peer to send to a subdaemon, we don't listen for
others until we've drained that subdaemon queue entirely.
This can happens for reestablish: slow machines can take a while to
set that subdaemon up.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Implement the sending of `start_batch` and `protocol_batch_element` from `channeld` to `connectd`.
Each real peer wire message is prefixed with `protocol_batch_element` so connectd can know the size of the message that were batched together.
`connectd` intercepts `protocol_batch_element` messages and eats them (doesn’t forward them to peer) to get individual messages out of the batch.
It needs this to be able to encrypt them individiaully. Afterwards it recombines the now encrypted messages into a single message to send over the wire to the peer.
`channeld` remains responsible for making `start_batch` the first message of the message bundle.
We add `start_batch` to match t-bast’s splicing spec and we add a new internal wire type `WIRE_PROTOCOL_BATCH_ELEMENT` using the type number 0
Changelog-Added: support for `start_batch`
Our CORK logic was wrong, and it's better to use Nagle anyway:
we disable Nagle just before sending timing-critical messages.
Time for 100 (failed) payments:
Before:
148.8573575
After:
10.7356977
Note this revealed a timing problem in test_reject_invalid_payload: we would
miss the cause of the sendonion failure, and waitsendpay would be called
*after* it had failed, so simply returns "Payment failure reason unknown".
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Fixed: Protocol: Removed 200ms latency from sending commit/revoke messages.
DNS seeds have been down/offline for a while, and this code (which
blocks!) has been a source of trouble. We should probably use a
canned set of "known nodes" if we want to bootstrap.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Changed: Protocol: we no longer use DNS seeds for peer lookup fallbacks.
Fixes: https://github.com/ElementsProject/lightning/issues/7913
From grubles' logs:
```
2025-01-06T15:30:31.449Z DEBUG lightningd: attempting connection to 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923 for additional gossip
2025-01-06T15:30:31.449Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Adding 0 addresses to important peer
2025-01-06T15:30:31.449Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Failed connected out: Unable to connect, no address known for peer
2025-01-06T15:30:31.449Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Will try reconnect in 300 seconds
2025-01-06T15:30:32.037Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Failed connected out: Unable to connect, no address known for peer
2025-01-06T15:30:32.037Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Will try reconnect in 300 seconds
2025-01-06T15:30:32.428Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Failed connected out: Unable to connect, no address known for peer
2025-01-06T15:30:32.428Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Will try reconnect in 300 seconds
2025-01-06T15:30:32.680Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Failed connected out: Unable to connect, no address known for peer
2025-01-06T15:30:32.681Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Will try reconnect in 300 seconds
2025-01-06T15:30:33.468Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Failed connected out: Unable to connect, no address known for peer
2025-01-06T15:30:33.469Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Will try reconnect in 300 seconds
2025-01-06T15:30:33.471Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Failed connected out: Unable to connect, no address known for peer
2025-01-06T15:30:33.471Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Will try reconnect in 300 seconds
2025-01-06T15:30:33.935Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Failed connected out: Unable to connect, no address known for peer
2025-01-06T15:30:33.935Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Will try reconnect in 300 seconds
2025-01-06T15:30:34.125Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Failed connected out: Unable to connect, no address known for peer
2025-01-06T15:30:34.125Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Will try reconnect in 300 seconds
2025-01-06T15:30:35.496Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Failed connected out: Unable to connect, no address known for peer
2025-01-06T15:30:35.497Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Will try reconnect in 300 seconds
2025-01-06T15:30:35.623Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Failed connected out: Unable to connect, no address known for peer
2025-01-06T15:30:35.623Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Will try reconnect in 300 seconds
2025-01-06T15:30:35.751Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Failed connected out: Unable to connect, no address known for peer
2025-01-06T15:30:35.751Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Will try reconnect in 300 seconds
2025-01-06T15:30:35.892Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Failed connected out: Unable to connect, no address known for peer
2025-01-06T15:30:35.892Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Will try reconnect in 300 seconds
```
We promised to wait 300 seconds!
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Unfortunately a spec typo means the data fields are missing (PR pending),
so we still patch those in.
The message "your_peer_storage" got renamed to "peer_storage_retrieval",
and the option "want_peer_backup_storage" was removed.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-EXPERIMENTAL: `experimental-peer-storage` now only advertizes feature 43, not 41.
Default goes to stderr for LOG_UNUSUAL and higher.
We have to whitelist more cases in map_catchup so we don't spam the logs
with perfectly-expected (but ignored) messages though.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The updated API requires typed htables to explicitly state whether they
allow duplicates: for most cases we don't, but we've had issues in the
past.
This is a big patch, but mainly mechanical.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We ratelimited DEBUG messages, but that can be annoying and cause us to miss things.
We demoted the worst offenders in the last release, to TRACE level.
Now, only log trace if it's wanted, and never suppress DEBUG.
Changelog-Changed: Logging: we no longer suppress DEBUG messages from subdaemons.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Fixes: https://github.com/ElementsProject/lightning/issues/7917
```
Program received signal SIGSEGV, Segmentation fault.
0x000000001014e9d8 in io_set_finish_ (conn=0x0, finish=0x0, arg=0x0) at ccan/ccan/io/io.c:137
137 conn->finish = finish;
(gdb) bt
incoming=true) at connectd/connectd.c:394
```
Fixes: #7871
Reported-by: grubles
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-None: broken in this release
Large nodes were not always getting their own channel gossip out
reliably. The number of peers we spam our own channel gossip to
is limited to save large nodes on startup, but this should be
relaxed slightly to ensure propagation.
Changelog-Fixed: Own-channel gossip is broadcast to more peers on connect.
We wait until a connection fails, or a subd is connected to the peer,
before letting another one through. This should prevent us from
overwhelming lightningd on large nodes, but unlike the previous back-off,
it's based on how fast lightningd is, not an arbitrary time.
We also let one through each second, in case we're connecting to many,
but not doing anything but gossip (e.g. 100 explicit connect
commands).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Changed: Reconnecting to peers at startup should be significantly faster (dependent on machine speed).
Rather than have lightningd call us repeatedly to try to connect, have
it tell us what peers are transient and aren't, and connectd will
automatically try to maintain that connection.
There's a new "downgrade_peer" message to tell it a peer is now
transient: to make it non-transient we simply tell connectd to
connect as a non-transient.
The first time, I missed that dual_open_control does its own state
transitions :(
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Changed: `connectd` now handles maintaining/reconnecting to important peers, and we remember the last successful address we connected to.
Let lightningd feed us hints to try first, but we can extract the
addresses from node_announcement messages ourselves.
(Lightningd used to ask gossipd on our behalf: this is far simpler!)
One side effect of this is that we don't hand back address hints given to us
by lightningd: it would use these again for reconnecting. This is breaks
test_sendpay_grouping, so we disable it temporarily.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
tmpctx may not get cleaned immediately, so the timeout (a child of
the struct early_peer at this point) can still outlast the conn.
Do the clearer thing, and explicitly free the timeout.
Changelog-Fixed: connectd: crash on erroneous timeout.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Things are often equivalent but different types:
1. u8 arrays in libwally.
2. sha256
3. Secrets derived via sha256
4. txids
Rather than open-coding a BUILD_ASSERT & memcpy, create a macro to do it.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
However fast we can handle them, it's antisocial to allow others to
make us spam the rest of the network.
Changelog-Protocol: onion messages: we limit incoming to 4 per second, allowing a little burst.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This is now permitted in the offers PR, so we should support it. But
we can't just look up in the gossmap, since the "short_channel_id"
could be an alias. So we get lightningd to tell us all scid->peer
mappings, and look up in that.
Changelog-Added: Protocol: onion messages can now be forwarded by short_channel_id.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Unlike "sendonionmessage" which instructs us to send to a peer, this
process it locally (presumably, it contains the next hop). This is
useful because it allows us to process an onion message which starts
with us (a legal case for a blinded path supplied by someone else!).
It also opens the door to bolt12 self-pay.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Allows for caller to log, but more importantly, when we add a command to
inject onion messages, allows for us to capture the error.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
A bit tricky, since we get more than one message at a time. However,
this just means we go over quota for a bit, and will get caught when
those are sent (we do this for a single message already, so it's not
that much worse).
Note: this not only limits sending, but it limits the actuall query
processing, which is nice.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This basically means moving the code from gossipd to connectd to handle
these queries.
This will get connectd have finer control over ratelimiting them.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This is more efficient in a few ways:
1. It's trivial to get to the end of the gossip_store, we don't have
to iterate.
2. It tends to be mmaped so we don't have to call pread().
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We currently stream gossip as fast as we can, even if they start at
timestamp 0. Instead, use a simple token bucket filter and only let
them have 1MB per second (500 bytes per second for testing).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Protocol: connectd: we now throttle outgoing gossip at 1MB/second per peer.