Unfortunately the effect of leaving Nagle enabled is subtle. Here it
is in v25.12:
Normal:
```
tests/test_connection.py::test_no_delay PASSED
====================================================================== 1 passed in 13.87s
```
Nagle enabled:
```
tests/test_connection.py::test_no_delay PASSED
====================================================================== 1 passed in 21.70s
```
So it's hard to catch this issue without false positives. Improve the
test by deliberately running with Nagle enabled, so we can make a
direct comparison.
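For context, toggling Nagle is a single socket option; a minimal
sketch in C (the wrapper name is ours, not connectd's):
```
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdbool.h>
#include <sys/socket.h>

/* Nagle coalesces small writes while waiting for ACKs; TCP_NODELAY
 * turns that off so small messages go out immediately. */
static int set_tcp_nodelay(int fd, bool nodelay)
{
	int val = nodelay ? 1 : 0;
	return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &val, sizeof(val));
}
```
The improved test can then deliberately leave Nagle on for one run
and compare the two timings directly.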
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We reimplemented this redundantly: hash_scid was called
short_channel_id_hash, so I obviously missed it.
Rename it, and implement a hash_scidd helper too.
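For illustration, a sketch of what such helpers can look like,
assuming a short_channel_id wraps a u64 and the _dir variant adds a
direction bit (the struct layouts here are simplified):
```
#include <stddef.h>
#include <stdint.h>

struct short_channel_id { uint64_t u64; };
struct short_channel_id_dir {
	struct short_channel_id scid;
	int dir;	/* 0 or 1 */
};

/* scids are costly to create (they commit to a funding output),
 * so cheap mixing is enough for a hash table here. */
static inline size_t hash_scid(struct short_channel_id scid)
{
	return (scid.u64 >> 32) ^ (scid.u64 >> 16) ^ scid.u64;
}

/* Fold in the direction so the two directions don't collide. */
static inline size_t hash_scidd(const struct short_channel_id_dir *scidd)
{
	return hash_scid(scidd->scid) + scidd->dir;
}
```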
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Messages are now constant-length.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Added: Protocol: we now pad all peer messages to make them the same length.
We had a flake of form:
```
2025-11-18T04:42:23.489Z **BROKEN** 022d223620a359a47ff7f7ac447c85c46c923da53389221a0054c11c1e3ca31d59-connectd: wake delay for WIRE_CHANNEL_REESTABLISH: 6789msec
```
Which happened as we're shutting down. Some investigation revealed
the cause: `dev-memleak` can be extremely slow. Fair enough.
So we change `dev-memleak` to call connectd first, and connectd uses
that as a trigger to stop complaining about delays.
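Schematically, the trigger is just a flag (all names below are
invented for illustration):
```
#include <stdbool.h>
#include <stdio.h>

/* Set when lightningd tells us a dev-memleak pass is starting:
 * everything can stall for seconds, so delay complaints are noise. */
static bool memleak_in_progress;

static void handle_dev_memleak_start(void)
{
	memleak_in_progress = true;
	/* ... run the actual leak scan ... */
}

/* Called when a message sat in our queue suspiciously long. */
static void maybe_complain_about_delay(const char *msgname, unsigned msec)
{
	if (memleak_in_progress)
		return;
	fprintf(stderr, "**BROKEN** wake delay for %s: %umsec\n",
		msgname, msec);
}
```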
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This is immune to things like clock changes, and has the convenient side-effect that
it will *not* be overridden when we override time for developer purposes.
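The underlying idea in plain POSIX (CLN wraps its timers differently,
but the principle is the same):
```
#include <stdio.h>
#include <time.h>

int main(void)
{
	struct timespec start, end;

	/* CLOCK_MONOTONIC cannot be stepped by NTP or by someone
	 * setting the wall clock, unlike CLOCK_REALTIME. */
	clock_gettime(CLOCK_MONOTONIC, &start);
	/* ... do the work being timed ... */
	clock_gettime(CLOCK_MONOTONIC, &end);

	printf("elapsed: %ld ns\n",
	       (end.tv_sec - start.tv_sec) * 1000000000L
	       + (end.tv_nsec - start.tv_nsec));
	return 0;
}
```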
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This is important: if it's tor-only and we don't have a proxy, we will fail
to connect, but it's no indication that the node is unreachable. Same with
IPv6.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Each header should only include the other headers it needs to compile;
`devtools/reduce-includes.sh */*.h` does this. The C files then need
additional includes added wherever they no longer compile.
And remove the entirely useless wire/onion_wire.h, which only serves to include wire/onion_wiregen.h.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
openingd sends an ERROR, and exits. lightningd tells us to
disconnect. We read from lightningd first, and don't read from
openingd.
We need to drain subds when we're told to disconnect.
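The shape of the fix, as a toy model (the real code is event-driven;
all names here are invented):
```
#include <stddef.h>

/* Toy model: each subd keeps a queue of pending messages. */
struct msg { struct msg *next; /* payload omitted */ };
struct subd { struct msg *queue; };
struct peer { struct subd **subds; size_t num_subds; };

static void forward_to_peer(struct peer *peer, struct msg *msg)
{
	(void)peer; (void)msg;	/* write to the peer's connection */
}

/* When told to disconnect, first empty every subd's queue, so a
 * final ERROR from e.g. openingd reaches the peer instead of
 * being dropped. */
static void drain_subds(struct peer *peer)
{
	for (size_t i = 0; i < peer->num_subds; i++) {
		struct subd *subd = peer->subds[i];
		while (subd->queue) {
			struct msg *m = subd->queue;
			subd->queue = m->next;
			forward_to_peer(peer, m);
		}
	}
}
```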
One issue we have in CI is reconnection races: if an incoming
connection arrives while an outgoing one is being negotiated, we close
the outgoing one and issue a disconnect, which fails any connect
attempts. By sending a "reconnected" message instead of
disconnect/connect, we avoid disturbing in-progress connection
attempts; this race happens quite a bit in CI.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We cannot use subd_req() here: replies will come out of order, and
we should not simply assign the responses in FIFO order.
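A minimal sketch of the alternative (names invented): tag each
outstanding command with an id and match the reply on that id rather
than on arrival order:
```
#include <stdint.h>
#include <stdlib.h>

struct pending_ping {
	uint64_t id;			/* unique per ping command */
	void (*cb)(const char *result, void *arg);
	void *arg;
	struct pending_ping *next;
};

static struct pending_ping *pending;
static uint64_t next_ping_id;

/* Tag each outgoing ping with a fresh id. */
static uint64_t send_ping(void (*cb)(const char *, void *), void *arg)
{
	struct pending_ping *p = malloc(sizeof(*p));
	p->id = ++next_ping_id;
	p->cb = cb;
	p->arg = arg;
	p->next = pending;
	pending = p;
	/* ... actually transmit, carrying p->id ... */
	return p->id;
}

/* Match the reply on the id, so out-of-order arrival is fine. */
static void handle_ping_reply(uint64_t id, const char *result)
{
	for (struct pending_ping **pp = &pending; *pp; pp = &(*pp)->next) {
		if ((*pp)->id == id) {
			struct pending_ping *p = *pp;
			*pp = p->next;
			p->cb(result, p->arg);
			free(p);
			return;
		}
	}
}
```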
Changelog-Fixed: lightningd: don't get confused with parallel ping commands.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
One reason why ping processing could be slow is that, once we receive
a message from the peer to send to a subdaemon, we don't listen for
others until we've drained that subdaemon's queue entirely.
This can happen with reestablish: slow machines can take a while to
set that subdaemon up.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
DNS seeds have been down/offline for a while, and this code (which
blocks!) has been a source of trouble. We should probably use a
canned set of "known nodes" if we want to bootstrap.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Changed: Protocol: we no longer use DNS seeds for peer lookup fallbacks.
Fixes: https://github.com/ElementsProject/lightning/issues/7913
From grubles' logs:
```
2025-01-06T15:30:31.449Z DEBUG lightningd: attempting connection to 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923 for additional gossip
2025-01-06T15:30:31.449Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Adding 0 addresses to important peer
2025-01-06T15:30:31.449Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Failed connected out: Unable to connect, no address known for peer
2025-01-06T15:30:31.449Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Will try reconnect in 300 seconds
2025-01-06T15:30:32.037Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Failed connected out: Unable to connect, no address known for peer
2025-01-06T15:30:32.037Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Will try reconnect in 300 seconds
2025-01-06T15:30:32.428Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Failed connected out: Unable to connect, no address known for peer
2025-01-06T15:30:32.428Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Will try reconnect in 300 seconds
2025-01-06T15:30:32.680Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Failed connected out: Unable to connect, no address known for peer
2025-01-06T15:30:32.681Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Will try reconnect in 300 seconds
2025-01-06T15:30:33.468Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Failed connected out: Unable to connect, no address known for peer
2025-01-06T15:30:33.469Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Will try reconnect in 300 seconds
2025-01-06T15:30:33.471Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Failed connected out: Unable to connect, no address known for peer
2025-01-06T15:30:33.471Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Will try reconnect in 300 seconds
2025-01-06T15:30:33.935Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Failed connected out: Unable to connect, no address known for peer
2025-01-06T15:30:33.935Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Will try reconnect in 300 seconds
2025-01-06T15:30:34.125Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Failed connected out: Unable to connect, no address known for peer
2025-01-06T15:30:34.125Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Will try reconnect in 300 seconds
2025-01-06T15:30:35.496Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Failed connected out: Unable to connect, no address known for peer
2025-01-06T15:30:35.497Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Will try reconnect in 300 seconds
2025-01-06T15:30:35.623Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Failed connected out: Unable to connect, no address known for peer
2025-01-06T15:30:35.623Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Will try reconnect in 300 seconds
2025-01-06T15:30:35.751Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Failed connected out: Unable to connect, no address known for peer
2025-01-06T15:30:35.751Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Will try reconnect in 300 seconds
2025-01-06T15:30:35.892Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Failed connected out: Unable to connect, no address known for peer
2025-01-06T15:30:35.892Z DEBUG 035ca2fe4793a5e789ce846062eb4834f573c060d9200ce77544a29b48a0aa5923-connectd: Will try reconnect in 300 seconds
```
We promised to wait 300 seconds!
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The updated API requires typed htables to explicitly state whether they
allow duplicates: for most cases we don't, but we've had issues in the
past.
This is a big patch, but mainly mechanical.
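For example, a table that must not contain duplicates now says so in
the macro name; a sketch assuming the ccan-style spelling (the
peer/node_id types here are simplified):
```
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <ccan/htable/htable_type.h>

struct node_id { unsigned char k[33]; };
struct peer { struct node_id id; /* ... */ };

static const struct node_id *peer_keyof(const struct peer *peer)
{
	return &peer->id;
}

static size_t node_id_hash(const struct node_id *id)
{
	/* Illustrative only: real code uses a seeded siphash. */
	size_t h = 0;
	for (size_t i = 0; i < sizeof(id->k); i++)
		h = h * 31 + id->k[i];
	return h;
}

static bool peer_eq_node_id(const struct peer *peer,
			    const struct node_id *id)
{
	return memcmp(&peer->id, id, sizeof(*id)) == 0;
}

/* Previously just HTABLE_DEFINE_TYPE: now the duplicate policy is
 * explicit (there's a _DUPS_ variant for tables that want them). */
HTABLE_DEFINE_NODUPS_TYPE(struct peer,
			  peer_keyof, node_id_hash, peer_eq_node_id,
			  peer_htable);
```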
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We wait until a connection fails, or a subd is connected to the peer,
before letting another one through. This should prevent us from
overwhelming lightningd on large nodes, but unlike the previous back-off,
it's based on how fast lightningd is, not an arbitrary time.
We also let one through each second, in case we're connecting to many
peers but not doing anything except gossip (e.g. 100 explicit connect
commands).
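In outline, it's a tiny semaphore plus a once-a-second escape valve
(names invented):
```
#include <stdbool.h>
#include <time.h>

static unsigned int connections_in_flight; /* caller bumps on start */
static time_t last_free_pass;

/* May we start another connection attempt right now? */
static bool can_start_connection(void)
{
	time_t now = time(NULL);

	/* Let one through each second regardless, so gossip-only
	 * peers or a burst of explicit connects can't stall us. */
	if (now != last_free_pass) {
		last_free_pass = now;
		return true;
	}
	/* Otherwise wait for the previous attempt to resolve:
	 * it failed, or a subd got connected to the peer. */
	return connections_in_flight == 0;
}

/* Called when an attempt fails or a subd is attached. */
static void connection_resolved(void)
{
	if (connections_in_flight > 0)
		connections_in_flight--;
}
```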
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Changed: Reconnecting to peers at startup should be significantly faster (dependent on machine speed).
Rather than have lightningd call us repeatedly to try to connect, have
it tell us which peers are transient and which aren't, and connectd
will automatically try to maintain those connections.
There's a new "downgrade_peer" message to tell connectd a peer is now
transient; to make a peer non-transient, we simply tell connectd to
connect to it as non-transient.
The first time, I missed that dual_open_control does its own state
transitions :(
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Changed: `connectd` now handles maintaining/reconnecting to important peers, and we remember the last successful address we connected to.
Let lightningd feed us hints to try first, but we can extract the
addresses from node_announcement messages ourselves.
(Lightningd used to ask gossipd on our behalf: this is far simpler!)
One side effect of this is that we don't hand back address hints
given to us by lightningd: it would use these again for reconnecting.
This breaks test_sendpay_grouping, so we disable it temporarily.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
However fast we can handle them, it's antisocial to allow others to
make us spam the rest of the network.
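The limiter's shape, sketched (only the 4-per-second figure comes
from this change; the burst size and names are invented): a token
bucket that drops when empty:
```
#include <stdbool.h>
#include <time.h>

#define TOKENS_PER_SEC	4.0
#define BUCKET_MAX	8.0	/* small burst allowance (invented) */

static double tokens = BUCKET_MAX;
static struct timespec last_refill;

/* True if we should process this onion message, false to drop it. */
static bool onionmsg_allowed(void)
{
	struct timespec now;
	double elapsed;

	clock_gettime(CLOCK_MONOTONIC, &now);
	elapsed = (now.tv_sec - last_refill.tv_sec)
		+ (now.tv_nsec - last_refill.tv_nsec) / 1e9;
	last_refill = now;

	tokens += elapsed * TOKENS_PER_SEC;
	if (tokens > BUCKET_MAX)
		tokens = BUCKET_MAX;
	if (tokens < 1.0)
		return false;
	tokens -= 1.0;
	return true;
}
```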
Changelog-Protocol: onion messages: we limit incoming to 4 per second, allowing a little burst.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This basically means moving the code from gossipd to connectd to handle
these queries.
This gives connectd finer control over rate-limiting them.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This is more efficient in a few ways:
1. It's trivial to get to the end of the gossip_store: we don't have
to iterate.
2. It tends to be mmapped, so we don't have to call pread().
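On point 2, the win looks roughly like this (a sketch; the real
gossip_store record layout is elided):
```
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map the store once; afterwards the end of the store is simply the
 * mapping's length, and each record read is pointer arithmetic
 * instead of a pread() syscall. */
static const uint8_t *map_store(const char *path, size_t *len)
{
	struct stat st;
	int fd = open(path, O_RDONLY);
	const uint8_t *map;

	if (fd < 0)
		return NULL;
	if (fstat(fd, &st) != 0) {
		close(fd);
		return NULL;
	}
	*len = st.st_size;
	map = mmap(NULL, *len, PROT_READ, MAP_SHARED, fd, 0);
	close(fd);
	return map == MAP_FAILED ? NULL : map;
}
```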
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We currently stream gossip as fast as we can, even if they start at
timestamp 0. Instead, use a simple token bucket filter and only let
them have 1MB per second (500 bytes per second for testing).
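Unlike the onion-message limiter above, which drops excess traffic,
this one shapes it: compute how many bytes the bucket currently
allows, send that many, and resume on a timer. A sketch (only the
1MB/500-byte rates come from this change; the burst cap is a choice
made for this sketch):
```
#include <stddef.h>
#include <time.h>

#define GOSSIP_BYTES_PER_SEC 1000000 /* 500 under test, per changelog */

struct byte_bucket {
	double tokens;		/* bytes we may still send */
	struct timespec last;
};

/* How many of `want` bytes may we stream right now?  The caller
 * sends that many and retries the remainder after a timer. */
static size_t bytes_allowed(struct byte_bucket *b, size_t want)
{
	struct timespec now;
	double elapsed;

	clock_gettime(CLOCK_MONOTONIC, &now);
	elapsed = (now.tv_sec - b->last.tv_sec)
		+ (now.tv_nsec - b->last.tv_nsec) / 1e9;
	b->last = now;

	b->tokens += elapsed * GOSSIP_BYTES_PER_SEC;
	if (b->tokens > GOSSIP_BYTES_PER_SEC)	/* cap burst at ~1 sec */
		b->tokens = GOSSIP_BYTES_PER_SEC;
	if ((double)want > b->tokens)
		want = (size_t)b->tokens;
	b->tokens -= (double)want;
	return want;
}
```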
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Protocol: connectd: we now throttle outgoing gossip at 1MB/second per peer.
Currently, anything which doesn't have a live channel is considered
transient. Under stress we free these first, and also those which are
still connecting.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We use a crude heuristic: if we were trying to contact them, it's a
"deliberate" connection, and should be preserved.
Changelog-Changed: connectd: prioritize peers with channels (and log!) if we run low on file descriptors.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
I thought I was going to want a convenient way of counting these, but
it turns out to be unnecessary. Still, this is slightly more efficient
and simpler, so I am including it.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We still refuse to run dev commands if lightningd sends one to us
when we're not in developer mode, but that's mainly paranoia.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This also requires us to expose memleak tracking when !DEVELOPER;
however, we only ever used it when the LIGHTNINGD_DEV_MEMLEAK
environment variable was set, so keep that gate.
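So the gate remains a simple environment check, along these lines
(wrapper name invented):
```
#include <stdbool.h>
#include <stdlib.h>

/* Memleak tracking is now always compiled in, but stays inert
 * unless the LIGHTNINGD_DEV_MEMLEAK environment variable is set. */
static bool memleak_tracking_enabled(void)
{
	return getenv("LIGHTNINGD_DEV_MEMLEAK") != NULL;
}
```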
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Most of this is piping the flag through so we know it's a websocket!
Reported-by: @ShahanaFarooqui
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This allows us to detect when lightningd hasn't seen our latest
disconnect/reconnect; in particular, we would hit the following pattern:
1. lightningd says to connect a subd.
2. connectd disconnects and reconnects.
3. connectd reads message, connects subd.
4. lightningd reads disconnect and reconnect, sends msg to connect to subd again.
5. connectd asserts because subd is already connected.
This way connectd can tell if lightningd is talking about the
previous connection, and ignore it.
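Schematically (field and message names invented): each (re)connection
bumps a counter, lightningd echoes back the counter it saw, and
connectd discards instructions about a stale connection:
```
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct peer {
	uint64_t connectd_counter;	/* bumped on every (re)connect */
	/* ... */
};

/* If the echoed counter doesn't match, lightningd is talking about
 * a connection which no longer exists: ignore the message. */
static bool msg_is_for_current_connection(const struct peer *peer,
					  uint64_t counter_in_msg)
{
	if (counter_in_msg != peer->connectd_counter) {
		fprintf(stderr, "ignoring: lightningd meant conn %llu,"
			" we are on %llu\n",
			(unsigned long long)counter_in_msg,
			(unsigned long long)peer->connectd_counter);
		return false;
	}
	return true;
}
```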
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We don't have to put aside a peer which is reconnecting and wait for
lightningd to remove the old peer: we can now simply free the old one
and add the new one.
Fixes: #5240
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Now that we have separate peer-draining logic, we can simply use it
when connectd tells us to release the peer, without waiting. (We
could simply free the peer, but that's a bit rude, as messages can
get lost.)
This removes various complex flags and logic we had before.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Fixed: `connectd`: various crashes and issues fixed by simplification and rewrite.
This removes it from the hashtable, and forces it to do nothing but
send out any remaining packets, then close.
It is, in effect, reduced to a stub, with no further interactions
with the rest of the system (all subds are freed already).
This also removes the need for an explicit "final_msg".
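In outline (names invented; the real code flushes asynchronously via
the event loop):
```
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

struct peer_stub {
	int fd;
	uint8_t *outbuf;	/* remaining bytes to flush */
	size_t outlen;
};

/* By now the peer is out of the hashtable and its subds are freed,
 * so nothing else can reach it.  Any parting message is simply part
 * of outbuf, which is why no separate "final_msg" is needed. */
static void stub_flush_and_close(struct peer_stub *stub)
{
	size_t off = 0;

	while (off < stub->outlen) {
		ssize_t n = write(stub->fd, stub->outbuf + off,
				  stub->outlen - off);
		if (n <= 0)
			break;	/* peer gone: give up */
		off += (size_t)n;
	}
	close(stub->fd);
	free(stub->outbuf);
	free(stub);
}
```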
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>