Add sha256_transform_armv8_2way_pass2 which reads the pass1 output state
words directly into MSG0/MSG1 without byte serialization. Previously:
sha256_state_to_digest() → native uint32 → BE bytes (8x write_u32_be)
sha256_transform load → BE bytes → vrev32q_u8 → native uint32 (4x)
These two conversions cancel out. The new path skips both, saving ~52
shift/store/load/vrev ops per 4-nonce group. Also eliminates the two
128-byte block2 stack buffers from sha256d80_hash_4way_armv8_2way.
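The cancellation is easy to see in portable C: the old hand-off serialized each native state word to big-endian bytes, only for the next transform's load to undo that exact conversion. A minimal sketch (helper names `write_u32_be`/`read_u32_be`/`round_trip` are illustrative, not the repo's actual symbols):

```c
#include <stdint.h>

/* Hypothetical helpers mirroring the old pass1 -> pass2 hand-off. */
static void write_u32_be(uint8_t *p, uint32_t w) {
    p[0] = (uint8_t)(w >> 24); p[1] = (uint8_t)(w >> 16);
    p[2] = (uint8_t)(w >> 8);  p[3] = (uint8_t)w;
}
static uint32_t read_u32_be(const uint8_t *p) {
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Old path: native word -> BE bytes -> native word. The store/reload
 * pair is a pure identity, which is why pass2 can read the pass1
 * state words directly and skip both conversions. */
uint32_t round_trip(uint32_t w) {
    uint8_t buf[4];
    write_u32_be(buf, w);    /* sha256_state_to_digest() side   */
    return read_u32_be(buf); /* transform load + vrev32q_u8 side */
}
```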
Replace the time() syscall with clock_gettime(CLOCK_MONOTONIC_COARSE) and
gate timestamp/rate checks behind a batch counter to avoid clock overhead
on every iteration. Set the reporting interval to 2s.
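The gating pattern can be sketched as follows; the function and constant names (`report_due`, `CHECK_EVERY`, `REPORT_NSEC`) are illustrative, not the repo's actual symbols, and CLOCK_MONOTONIC_COARSE assumes Linux:

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <time.h>

#define CHECK_EVERY  4096          /* batches between clock reads (assumed) */
#define REPORT_NSEC  2000000000LL  /* 2 s reporting interval */

static int64_t now_coarse_ns(void) {
    struct timespec ts;
    /* Coarse clock: cheap vDSO read, ~tick resolution, no syscall. */
    clock_gettime(CLOCK_MONOTONIC_COARSE, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Returns nonzero when a rate report is due. Most calls only bump the
 * counter and never touch the clock at all. */
int report_due(uint32_t *batch, int64_t *last_ns) {
    if (++*batch < CHECK_EVERY)
        return 0;                  /* fast path: no clock read */
    *batch = 0;
    int64_t now = now_coarse_ns();
    if (now - *last_ns < REPORT_NSEC)
        return 0;
    *last_ns = now;
    return 1;
}
```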
Process two independent SHA256 chains simultaneously to hide the 2-cycle
latency of vsha256hq_u32 on Cortex-A76, approaching full throughput.
Also reduces memcpy from 512 to ~192 bytes per 4-nonce group by reusing
block buffers, and adds scan_4way_direct to bypass pthread_once (LDAR
barrier) on every inner-loop call.
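The interleaving idea, independent of NEON, is just two dependency chains per loop body so the out-of-order core can overlap them. A portable toy sketch (the `round_step` function stands in for one SHA-256 round; all names here are illustrative):

```c
#include <stdint.h>

/* Toy stand-in for one round's serial dependency chain. */
static uint32_t round_step(uint32_t s, uint32_t w) {
    return ((s << 7) | (s >> 25)) + w;  /* rotate-add, illustrative only */
}

/* One chain at a time: every round waits on the previous result. */
void chains_sequential(uint32_t *a, uint32_t *b,
                       const uint32_t *wa, const uint32_t *wb, int n) {
    for (int i = 0; i < n; i++) *a = round_step(*a, wa[i]);
    for (int i = 0; i < n; i++) *b = round_step(*b, wb[i]);
}

/* Interleaved: chain B's round has no dependency on chain A's, so it
 * can issue while A's result is still in flight -- the same trick used
 * to hide vsha256hq_u32 latency across two SHA-256 message blocks. */
void chains_interleaved(uint32_t *a, uint32_t *b,
                        const uint32_t *wa, const uint32_t *wb, int n) {
    for (int i = 0; i < n; i++) {
        *a = round_step(*a, wa[i]);  /* chain A */
        *b = round_step(*b, wb[i]);  /* chain B */
    }
}
```

Both orderings compute identical results; only the instruction-level parallelism differs.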
Add local Codex skill for Python->C performance-focused translation.
Define modular C architecture and benchmark/correctness gates.
Add references for patterns, profiling, and module design.
Add scaffold_c_module.py to generate include/src/tests/bench skeleton.
Update agent default prompt for benchmark-backed optimizations.