I noticed this in the logs:
```
lightningd-1 2026-01-28T00:27:37.504Z DEBUG gossipd: gossip_store: Read 59428/118856/0/0 cannounce/cupdate/nannounce/delete from store in 45521871 bytes, now 45521849 bytes (populated=true)
lightningd-1 2026-01-28T00:27:37.504Z DEBUG gossipd: Got 118856 bad cupdates, ignoring them (expected on mainnet)
```
That's weird, and turns out it counting good updates, not bad ones!
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We reimplemented this redundantly: hash_scid was called
short_channel_id_hash, so I obviously missed it.
Rename, and implement hash_scidd helper too.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
gossmap doesn't care, so gossipd currently has to iterate through the
store to find them at startup. Create a callback for gossipd to use
instead.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
It's used by common/gossip_store.c, which is used by many things other than
gossipd. This file belongs in common.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
It's actually quite quick to load a cache-hot 308,874,377 byte
gossip_store (normal -Og build), but perf does show time spent
in siphash(), which is a bit overkill here, so drop that:
Before:
Time to load: 66718983-78037766(7.00553e+07+/-2.8e+06)nsec
After:
Time to load: 54510433-57991725(5.61457e+07+/-1e+06)nsec
We could save maybe 10% more by disabling checksums, but having
that assurance is nice.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This happens for about 300 channels, from every process that loads the gossmap.
It's not very useful to flood the logs, so just log a summary.
To be fair, on my node, this is only the 11th most common message, so we will
revisit the others too:
```
1589311 DEBUG 02e01367e1d7818a7e9a0e8a52badd5c32615e07568dbe0497b6a47f9bef89d6af-connectd: peer_out WIRE_WARNING
139993 DEBUG lightningd: fixup_scan: block 786151 with 1203 txs
55388 DEBUG plugin-bcli: Log pruned 1001 entries (mem 10508118 -> 10298662)
33000 DEBUG gossipd: Unreasonable timestamp in 0102000a38ec41f9137a5a560dac6effbde059c12cb727344821cbdd4ef46964a4791a0f67cd997499a6062fc8b4284bf1b47a91541fd0e65129505f02e4d08542b16fe28c0ab6f1b372c1a6a246ae63f74f931e8365e15a089c68d61900000000000d9d56000ba40001690fe262010100900000000000000001000003e8000001f30000000000989680
23515 DEBUG hsmd: Client: Received message 14 from client
22269 DEBUG 024b9a1fa8e006f1e3937f65f66c408e6da8e1ca728ea43222a7381df1cc449605-hsmd: Got WIRE_HSMD_ECDH_REQ
14409 DEBUG gossipd: Enqueueing update for announce 0102002f7e4b4deb19947c67292e70cb22f7fac837fa9ee6269393f3c513d0431d52672e7387625856c19299cfd584e1a3f39e0f98df13c99090df9f4d5cca8446776fe28c0ab6f1b372c1a6a246ae63f74f931e8365e15a089c68d61900000000000e216b0008050001692e1c390101009000000000000003e800000000000013880000004526945a00
12534 DEBUG gossipd: Previously-rejected announce for 514127x248x1
===> 12092 DEBUG connectd: Bad cupdate for 641641x1164x1/1, ignoring (delta=80, fee=1073742199/58)
10761 DEBUG 02e01367e1d7818a7e9a0e8a52badd5c32615e07568dbe0497b6a47f9bef89d6af-channeld-chan#70770: Got it!
10761 DEBUG 02e01367e1d7818a7e9a0e8a52badd5c32615e07568dbe0497b6a47f9bef89d6af-channeld-chan#70770: ... , awaiting 1120
10761 DEBUG 02e01367e1d7818a7e9a0e8a52badd5c32615e07568dbe0497b6a47f9bef89d6af-channeld-chan#70770: Sending master 1020
```
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Basically, `devtools/reduce-includes.sh */*.c`.
Build time from make clean (RUST=0) (includes building external libs):
Before:
real 0m38.944000-40.416000(40.1131+/-0.4)s
user 3m6.790000-17.159000(15.0571+/-2.8)s
sys 0m35.304000-37.336000(36.8942+/-0.57)s
After:
real 0m37.872000-39.974000(39.5466+/-0.59)s
user 3m1.211000-14.968000(12.4556+/-3.9)s
sys 0m35.008000-36.830000(36.4143+/-0.5)s
Build time after touch config.vars (RUST=0):
Before:
real 0m19.831000-21.862000(21.5528+/-0.58)s
user 2m15.361000-30.731000(28.4798+/-4.4)s
sys 0m21.056000-22.339000(22.0346+/-0.35)s
After:
real 0m18.384000-21.307000(20.8605+/-0.92)s
user 2m5.585000-26.843000(23.6017+/-6.7)s
sys 0m19.650000-22.003000(21.4943+/-0.69)s
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Each header should only include the other headers it needs to compile;
`devtools/reduce-includes.sh */*.h` does this. The C files then need
additional includes if they don't compile.
And remove the entirely useless wire/onion_wire.h, which only serves to include wire/onion_wiregen.h.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This can happen with other subdaemons too, on ZFS on Linux:
```
2025-09-24T13:51:22.703Z **BROKEN** connectd: Bad checksum on gossmap record @9850670/9851114 should be 3379961343 (01009411e26cd56d68aabc285ee1c8ee43d59be6f939b0ce353d80213918680a7438356b9c5ea6bb001a6bb37a4dea93776f4abc8cd371525b4d1605a74b89d7cb1bfc8865ddf22288c7ea08b9d98b34155b4aed159eb81732957e6bf79b996752bf2a9995aaead1d65e7889e826ea0ba42f7746c176fe12f2fe6c04af1a74b4f0a262d20efd57133eb32693c789eb3f09caf4f4c6ecd2f734b3b36e751ffcc2748c58feabce4173c4ce6098a2c5397aabf1be5442cb67b5030be11ebd8b9841838dae127fe30000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
```
Reported-by: @grubles
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We add `start_batch` to match t-bast’s splicing spec and we add a new internal wire type `WIRE_PROTOCOL_BATCH_ELEMENT` using the type number 0
Changelog-Added: support for `start_batch`
We handed NULL as the logcb, resulting in a very uninformative crash:
```
2025-03-14T03:46:36.447Z INFO lightningd: Server started with public key 03d67f36c4f81789e2fe425028bacc96b199813eae426c517f589a45f1136c1fe5, alias Jubilee (color #dc42f4) and lightningd v25.02
topology: FATAL SIGNAL 11 (version v25.02)
0x560037f64aad send_backtrace
common/daemon.c:33
0x560037f64b49 crashdump
common/daemon.c:78
0x7f6c41ff351f ???
./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
0x0 ???
???:0
```
Changelog-Fixed: `topology` crash on invoice creation if a peer had a really high feerate.
Fixes: https://github.com/ElementsProject/lightning/issues/8156
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Unfortunately a spec typo means the data fields are missing (PR pending),
so we still patch those in.
The message "your_peer_storage" got renamed to "peer_storage_retrieval",
and the option "want_peer_backup_storage" was removed.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-EXPERIMENTAL: `experimental-peer-storage` now only advertizes feature 43, not 41.
After analyzing various weird cases where we ended up with duplicate
gossip_store entries, it could be explained by us not fully processing
the gossip store.
It's not clear that my assumptions that we would always see our own writes
are true: technically this may require an fsync(). So we now add the
check, and do an fsync and try again.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Fixed: gossipd: more sanity checks that we are correctly updating the gossip_store file.
Instead of making a copy.
To measure the performance impact, I timed
tests/test_askrene.py::test_real_biases on my laptop.
No checksum check: 194.52s
Copying for checksum check: 202.81s
Zero-copy checksum check: 194.40s
But these numbers proved noisy. Still, doesn't hurt.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We assume if it's incorrect, we simply need to wait. If this proves incorrect,
we will see a stream of BROKEN log messages.
To measure the performance impact, I timed
tests/test_askrene.py::test_real_biases on my laptop.
Before: 194.52s
After: 202.81s
So it's marginal.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
While this shouldn't happen, it does (pending other fixes), and we stop reading the
gossip store until next time. The result is partial gossip, demonstrated beautifully
by NicolasDorier's report:
```
lightning_gossipd: gossmap: redundant channel_announce for 864063x1306x1, offsets 1272259 and 1784859!"
```
Gossipd stalld there and don't make more progress. So gossipd itself
doesn't see the entire gossip_store.
Then things get really batshit:
```
2025-02-04T05:53:28.582Z DEBUG gossipd: Store compact time: 1429910 msec
```
This took 1429 seconds to process. Why?
Because it hasn't been processing the gossip store fully, gossipd kept adding "new" records to the end:
```
2025-02-04T05:53:28.583Z DEBUG gossipd: gossip_store: Read 62716143/1739952/5158256/0 cannounce/cupdate/nannounce/delete from store in 31634458462 bytes, now 31634458440 bytes (populated=true)
```
It has 31GB of gossip in there! No wonder it took so long...
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Fixes: https://github.com/ElementsProject/lightning/issues/8035
Changelog-Fixed: gossipd: corruption in the gossip_store could cause ever-longer startup times and no gossip updates.
Default goes to stderr for LOG_UNUSUAL and higher.
We have to whitelist more cases in map_catchup so we don't spam the logs
with perfectly-expected (but ignored) messages though.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We only use it in one place, and that was simply to share an fd between
gossipd writing and gossipd reading, which may be causing our zfs problem
anyway.
In fact, it fixes a race if we don't have HAVE_PWRITEV.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We have a report of this happening under ZFS. We cannot do much if
this really is a problem where we can't read back what we write, but
this avoids the immediate crash.
Fixes: https://github.com/ElementsProject/lightning/issues/7971
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Fixed: gossmap: occasional crash (at least on ZFS) reading gossip_store.
The updated API requires typed htables to explicitly state whether they
allow duplicates: for most cases we don't, but we've had issues in the
past.
This is a big patch, but mainly mechanical.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
In particular, this lets you find the exact htlc_maximum_msat/htlc_minimum_msat
values.
This means we actually create real channel_updates for local mods, which
requires a second "local" scratch region.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Since we don't compact the gossmap on the fly (FIXME!) we can
easily surpass 4GB in the gossmap, and 32 bit offsets are not
sufficient.
I'm a bit surprised we don't crash immediately, but we've definitely
seen issues.
Changelog-Fixed: gossipd: crash errors with large gossip_store (>4MB) growth on longer-running nodes.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
It was weird not to have a capacity associated with localmods channels, and
fixing it has some very nice side effects.
Now the gossmap_chan_get_capacity() call never fails (we prevented reading
of channels from gossmap in the partially-written case already), so we
make it return the capacity. We do this in msat, because that's what
all the callers want.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This is actually what we want in several places: to only override one or
two fields in a channel_update.
We add a gossmap_local_setchan() with a similar API to the old
gossmap_local_updatechan(), for the case where we want to set every
field.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We allow adding them, but crash when we remove the localmods. Yet
this could theoretically happen if a channel we modified was removed
from the gossmap, anyway.
Reported-by: Lagrang3 <lagrang3@protonmail.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This simplifies the callers significantly: all channel_announcements now
have an amount, so gossmap_chan_get_capacity() only fails on a local
modification.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We only write these in two places: one where we get a message from lightningd about
our own channel, and one where we get a reply from lightningd about a txout check.
The former case we explicitly check that we don't already have it in gossmap, so
add checks to the latter case, and give verbose detail if it's found.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
It's a u64, we should pass by copy. This is a big sweeping change,
but mainly mechanical (change one, compile, fix breakage, repeat).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Wrote a test program which passed num_channel_updates_rejected as NULL
(which we don't usually do), and valgrind complained:
```
==1048302== Conditional jump or move depends on uninitialised value(s)
==1048302== at 0x118B90: update_channel (gossmap.c:550)
==1048302== by 0x119EEE: map_catchup (gossmap.c:663)
==1048302== by 0x11A299: load_gossip_store (gossmap.c:726)
==1048302== by 0x11A352: gossmap_load (gossmap.c:1052)
==1048302== by 0x125362: main (run-route-infloop.c:90)
```
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Thanks to amazing debugging assistance from grubles, we figured out
that indeed, my memory was correct: write and mmap are not consistent
on all platforms. The easiest fix is to disable mmap on OpenBSD for now:
the better fix is to do in-place updates using the mmap, and only rely
on write() for append (which always causes a remap anyway before it's accessed).
Fixes: https://github.com/ElementsProject/lightning/issues/7109
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We never enabled it, because we seemed to be eliminating valid
channels. We discard zombie-marked records on loading.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
In particular, allow callers to see unknown records we ignore (and let
them fail as a result), and get called if we can't pack a
channel_update into our internal format.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>