Changelog-Added: HSMD: add new wire API to sign messages with bitcoin wallet keys according to BIP137.
Signed-off-by: Lagrang3 <lagrang3@protonmail.com>
This is inspired by a patch from @whitslack, which overlapped with this series.
Most importantly, there was only one call to bitcoin_tx_simple_input_weight(),
and it is better to be explicit with that one.
This also changes our funder calculation to assume our own input is taproot,
which it is likely to be given we've defaulted to taproot for outputs for
change addresses since 23.08.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
One in 256 times, we will grind a signature to 70 bytes (or shorter). This breaks
our feerate tests. Unfortunately the grinding is deterministic, so there doesn't
seem to be a way to avoid it. So we add a log message, and then we skip the
feerate test if it happens.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We previously treated it as a P2WPKH, which is wrong.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Fixed: wallet: fees are much closer to target feerate when doing txprepare/fundchannel.
To actually evaluate spend cost, we need to know whether it's taproot or not.
Using an enum (rather than making callers examine the script) means we can
ensure all cases are handled.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The renaming makes it clear that it's HSM specific.
And it has no pointers, so we can have an array instead of an array of pointers.
I tested this hadn't accidentally changed the wire format by disabling
version checks and using an old hsmd with the altered daemons and
running the test suite.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This is such a simple struct that we can actually define it in csv.
This prevents us from accidentally breaking the ABI in future.
I tested this hadn't accidentally changed the wire format by disabling
version checks and using an old hsmd with the altered daemons and
running the test suite.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
I'm about to update our utxo type, but Christian spotted that this is
part of the ABI for the hsm. So make that a private "hsm_utxo" type,
to insulate it from changes.
In particular, the HSM versions only contain the fields that the
hsm cares about, and the wire format is consistent (even though that
*did* include some of those fields, they are now dummies).
In the long term, this should be removed from the ABI: once we
no longer have "close_info" utxos, this information should already be
in the PSBT.
I tested this hadn't accidentally changed the wire format by disabling
version checks and using an old hsmd with the altered daemons and
running the test suite.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
notleak() doesn't work for lightningd since the first span is created before
memleak (or anything else!) is initialized, so we have to mark it manually.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This doesn't happen very often, but can with autoclean.
However, we rarely traverse to the end, since we always expect to find
what we're looking for, and we fill from the front. So even a large
array (unless it's used) is fine.
Subtle: when we injected a parent, we used "active_spans" as the (arbitrary)
key. That can now change with reallocation, and so if that memory were reused
we could have a key clash. So we use "&active_spans" which doesn't change.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We don't ever actually close the remote span (we don't have its key,
after all), and we keep a pointer to the parent.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This is much faster to give 64 bits of data, and we don't need
cryptographic randomness.
This brings us back to 413ns per trace.
Before:
real 0m5.819000-6.472000(6.2064+/-0.26)s
user 0m3.779000-4.101000(3.956+/-0.12)s
sys 0m2.040000-2.431000(2.2496+/-0.15)s
After:
real 0m3.981000-4.247000(4.1276+/-0.11)s
user 0m3.979000-4.245000(4.126+/-0.11)s
sys 0m0.000000-0.002000(0.001+/-0.00063)s
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Fixed: lightingd: trimmed overhead of tracing infrastructure.
There's an EBPF limit anyway, so stick with a 512-byte buffer.
This brings us back to 621ns per trace:
Before:
real 0m13.441000-14.592000(14.2686+/-0.43)s
user 0m11.265000-12.289000(11.9626+/-0.37)s
sys 0m2.175000-2.381000(2.3048+/-0.072)s
After:
real 0m5.819000-6.472000(6.2064+/-0.26)s
user 0m3.779000-4.101000(3.956+/-0.12)s
sys 0m2.040000-2.431000(2.2496+/-0.15)s
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Avoids allocations. Also assume that name and value parameters
outlive the trace span, so don't copy.
Before:
real 0m16.421000-18.407000(17.8128+/-0.72)s
user 0m14.242000-16.041000(15.5382+/-0.67)s
sys 0m2.179000-2.363000(2.273+/-0.061)s
After:
real 0m13.441000-14.592000(14.2686+/-0.43)s
user 0m11.265000-12.289000(11.9626+/-0.37)s
sys 0m2.175000-2.381000(2.3048+/-0.072)s
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
1. trace_span_start() is always called with a string literal, so
no copy needed (and we can use a macro to enforce this).
2. trace_span_tag() name and value are always longer-lived than
the span, so no need to copy these either.
Before:
real 0m18.524000-19.100000(18.7674+/-0.21)s
user 0m16.171000-16.833000(16.424+/-0.26)s
sys 0m2.259000-2.400000(2.337+/-0.059)s
After:
real 0m16.421000-18.407000(17.8128+/-0.72)s
user 0m14.242000-16.041000(15.5382+/-0.67)s
sys 0m2.179000-2.363000(2.273+/-0.061)s
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Testing parenting handling revealed several issues:
1. By calling "trace_span_start" when CLN_TRACEPARENT is set produces a bogus
entry, for which the span_id is overwritten so we never end it.
2. We don't need to close the remote parent when we close the first child: in
fact, this causes the remaining traces to be detached from the parent!
3. Suspension should return current to the parent, not to NULL.
Now the traces balance as we expect.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
I added this debugging because the next test revealed a mismatch, so
I wanted to see where it was happening.
The comment in lightningd suggests it's possible, but I can't see any
code which suspends in the lightningd io_loop, so I cannot see how
this is triggered.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
With an average runtime of 18.7674, this implies 1876ns
per trace, which is far in excess of the 370ns claimed in
doc/developers-guide/tracing-cln-performance.md.
We also add a tag in there, so we measure that!
Results on my laptop:
real 0m18.524000-19.100000(18.7674+/-0.21)s
user 0m16.171000-16.833000(16.424+/-0.26)s
sys 0m2.259000-2.400000(2.337+/-0.059)s
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
BIP-141 specifies that a witness program must be between 2 and 40 bytes in
length. In our fallback address parsing, we were already checking the upper
bound, but missing the lower bound check. This commit adds validation to
ensure fallback address witness programs are at least 2 bytes long, bringing
our implementation in line with the spec and other implementations like
rust-lightning.
Changelog-Fixed: Enforced minimum witness program length of 2 bytes for
fallback addresses to comply with BIP-141 and prevent invalid decodings.
This happens with autoclean, which does a datastore request then frees
the parent command without waiting for a response (see clean_finished).
This leaks a trace, and causes a crash if the pointer is later reused.
My solution is to create a trace variant which declares the trace key
to be a tal ptr and then we can clean up in the destructor if this happens.
This fixes the issue for me.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Fixed: autoclean: fixed occasional crash when tracepoints compiled in.
chanbackup with many peers can do more than 128 concurrent rpc commands.
autoclean is the other plugin which can do many requests at once, so I
expect a similar issue there.
I tested this by rebuilding with `MAX_ACTIVE_SPANS` 1, which autoclean
tests triggered immediately.
The real fix is probably to use a hash table with a large initial size.
```
Mar 24 06:30:45 mlbb2 sh[28000]: chanbackup: common/trace.c:190: trace_span_slot: Assertion `s' failed.
Mar 24 06:30:45 mlbb2 sh[28000]: chanbackup: FATAL SIGNAL 6 (version v25.02)
Mar 24 06:30:45 mlbb2 sh[28000]: 0x5575232bac4f send_backtrace
Mar 24 06:30:45 mlbb2 sh[28000]: common/daemon.c:33
Mar 24 06:30:45 mlbb2 sh[28000]: 0x5575232baceb crashdump
Mar 24 06:30:45 mlbb2 sh[28000]: common/daemon.c:78
Mar 24 06:30:45 mlbb2 sh[28000]: 0x7f2958cd851f ???
Mar 24 06:30:45 mlbb2 sh[28000]: ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
Mar 24 06:30:45 mlbb2 sh[28000]: 0x7f2958d2c9fc __pthread_kill_implementation
Mar 24 06:30:45 mlbb2 sh[28000]: ./nptl/pthread_kill.c:44
Mar 24 06:30:45 mlbb2 sh[28000]: 0x7f2958d2c9fc __pthread_kill_internal
Mar 24 06:30:45 mlbb2 sh[28000]: ./nptl/pthread_kill.c:78
Mar 24 06:30:45 mlbb2 sh[28000]: 0x7f2958d2c9fc __GI___pthread_kill
Mar 24 06:30:45 mlbb2 sh[28000]: ./nptl/pthread_kill.c:89
Mar 24 06:30:45 mlbb2 sh[28000]: 0x7f2958cd8475 __GI_raise
Mar 24 06:30:45 mlbb2 sh[28000]: ../sysdeps/posix/raise.c:26
Mar 24 06:30:45 mlbb2 sh[28000]: 0x7f2958cbe7f2 __GI_abort
Mar 24 06:30:45 mlbb2 sh[28000]: ./stdlib/abort.c:79
Mar 24 06:30:45 mlbb2 sh[28000]: 0x7f2958cbe71a __assert_fail_base
Mar 24 06:30:45 mlbb2 sh[28000]: ./assert/assert.c:94
Mar 24 06:30:45 mlbb2 sh[28000]: 0x7f2958ccfe95 __GI___assert_fail
Mar 24 06:30:45 mlbb2 sh[28000]: ./assert/assert.c:103
Mar 24 06:30:45 mlbb2 sh[28000]: 0x5575232ab7fa trace_span_slot
Mar 24 06:30:45 mlbb2 sh[28000]: common/trace.c:190
Mar 24 06:30:45 mlbb2 sh[28000]: 0x5575232abc9f trace_span_start
Mar 24 06:30:45 mlbb2 sh[28000]: common/trace.c:267
Mar 24 06:30:45 mlbb2 sh[28000]: 0x5575232a7c34 send_outreq
Mar 24 06:30:45 mlbb2 sh[28000]: plugins/libplugin.c:1112
```
Changelog-Fixed: autoclean/chanbackup: fixed tracepoint crash on large number of requests.
Fixes: https://github.com/ElementsProject/lightning/issues/8177
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We handed NULL as the logcb, resulting in a very uninformative crash:
```
2025-03-14T03:46:36.447Z INFO lightningd: Server started with public key 03d67f36c4f81789e2fe425028bacc96b199813eae426c517f589a45f1136c1fe5, alias Jubilee (color #dc42f4) and lightningd v25.02
topology: FATAL SIGNAL 11 (version v25.02)
0x560037f64aad send_backtrace
common/daemon.c:33
0x560037f64b49 crashdump
common/daemon.c:78
0x7f6c41ff351f ???
./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
0x0 ???
???:0
```
Changelog-Fixed: `topology` crash on invoice creation if a peer had a really high feerate.
Fixes: https://github.com/ElementsProject/lightning/issues/8156
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Since we included the spec for it, this is a good time to implement
it.
I also asked chatgpt to write some unit tests. I had to mangle them a
bit, but it probably saved me a few minutes.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Unfortunately a spec typo means the data fields are missing (PR pending),
so we still patch those in.
The message "your_peer_storage" got renamed to "peer_storage_retrieval",
and the option "want_peer_backup_storage" was removed.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-EXPERIMENTAL: `experimental-peer-storage` now only advertizes feature 43, not 41.
We define a new subtype 'modern_scb_chan' and a new tlvtype 'scb_tlvs' which includes
all the relevant information to create a penalty transaction when the peer tries to cheat.
Key Changes:
- Rename the old format to 'legacy_scb_chan' and define a new type 'modern_scb_chan'
- Include TLVs to 'modern_scb_chan'
- Create a new msgtype 'static_chan_backup_with_tlvs'
- Modify 'struct channel' to include 'struct modern_scb_chan'
- Add these two types to 'varsize_types' in generate.py
The parameter max_htlc_value_in_flight_msat stablished by peers on
channel opening (BOLT02) can now be retrived from the
gossmods_from_listpeerchannels API.
Adapted the corresponding callback functions in renepay and askrene to
take into account that value as a constraint to the value we can send
through a channel.
Changelog-Add: fetch max_htlc_value_in_flight_msat from gossmods_listpeerchannels API
Signed-off-by: Lagrang3 <lagrang3@protonmail.com>
Turns out we weren't wiring them through! And libplugin wasn't reading them anyway.
Changelog-Fixed: lightningd: tell plugins our bolt12 features (so our bolt12 invoices explicitly allow MPP).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Reported by Grubles on ARM64:
```
VALGRIND=1 valgrind -q --error-exitcode=7 --track-origins=yes --leak-check=full --show-reachable=yes --errors-for-leak-kinds=all common/test/run-tal_arr_randomize > /dev/null
==151138== Source and destination overlap in memcpy(0x4d69f08, 0x4d69f08, 8)
==151138== at 0x48CB68C: __GI_memcpy (vg_replace_strmem.c:1147)
==151138== by 0x41B50B: tal_arr_randomize_ (pseudorand.c:84)
==151138== by 0x41BB07: main (run-tal_arr_randomize.c:166)
==151138==
make: *** [Makefile:750: unittest/common/test/run-tal_arr_randomize] Error 7
```
It is correct: you can't overlap src and dst in memcpy. It *probably* works in
this case, but it's undefined!
Fixes: https://github.com/ElementsProject/lightning/issues/7030
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
After analyzing various weird cases where we ended up with duplicate
gossip_store entries, it could be explained by us not fully processing
the gossip store.
It's not clear that my assumptions that we would always see our own writes
are true: technically this may require an fsync(). So we now add the
check, and do an fsync and try again.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Changelog-Fixed: gossipd: more sanity checks that we are correctly updating the gossip_store file.
Instead of making a copy.
To measure the performance impact, I timed
tests/test_askrene.py::test_real_biases on my laptop.
No checksum check: 194.52s
Copying for checksum check: 202.81s
Zero-copy checksum check: 194.40s
But these numbers proved noisy. Still, doesn't hurt.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
We assume if it's incorrect, we simply need to wait. If this proves incorrect,
we will see a stream of BROKEN log messages.
To measure the performance impact, I timed
tests/test_askrene.py::test_real_biases on my laptop.
Before: 194.52s
After: 202.81s
So it's marginal.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
While this shouldn't happen, it does (pending other fixes), and we stop reading the
gossip store until next time. The result is partial gossip, demonstrated beautifully
by NicolasDorier's report:
```
lightning_gossipd: gossmap: redundant channel_announce for 864063x1306x1, offsets 1272259 and 1784859!"
```
Gossipd stalld there and don't make more progress. So gossipd itself
doesn't see the entire gossip_store.
Then things get really batshit:
```
2025-02-04T05:53:28.582Z DEBUG gossipd: Store compact time: 1429910 msec
```
This took 1429 seconds to process. Why?
Because it hasn't been processing the gossip store fully, gossipd kept adding "new" records to the end:
```
2025-02-04T05:53:28.583Z DEBUG gossipd: gossip_store: Read 62716143/1739952/5158256/0 cannounce/cupdate/nannounce/delete from store in 31634458462 bytes, now 31634458440 bytes (populated=true)
```
It has 31GB of gossip in there! No wonder it took so long...
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Fixes: https://github.com/ElementsProject/lightning/issues/8035
Changelog-Fixed: gossipd: corruption in the gossip_store could cause ever-longer startup times and no gossip updates.