perf(sha256): add ARMv8 2-way interleaved transform and scan_4way_direct

Process two independent SHA256 chains simultaneously to hide the 2-cycle latency of vsha256hq_u32 on Cortex-A76, approaching full throughput. Also reduces memcpy from 512 to ~192 bytes per 4-nonce group by reusing block buffers, and adds scan_4way_direct to bypass pthread_once (LDAR barrier) on every inner-loop call.
2026-03-30 10:41:59 +02:00
parent b2f0090236
commit 7d4096749a
3 changed files with 403 additions and 3 deletions
--- a/sha256/sha256_backend.h
+++ b/sha256/sha256_backend.h
@@ -62,4 +62,22 @@ uint32_t sha256d80_scan_4way(
    sha256_state_t out_states[4]
 );

+/*
+ * Ensure the SHA256 backend is initialized. Call once before using
+ * sha256d80_scan_4way_direct() to avoid per-call pthread_once overhead.
+ */
+void sha256_backend_ensure_init(void);
+
+/*
+ * Like sha256d80_scan_4way() but skips the pthread_once check.
+ * Caller MUST have called sha256_backend_ensure_init() (or any other
+ * backend function) before calling this.
+ */
+uint32_t sha256d80_scan_4way_direct(
+    const sha256d80_midstate_t *mid,
+    uint32_t start_nonce,
+    const uint32_t target_words[8],
+    sha256_state_t out_states[4]
+);
+
 #endif