feat(skills): add local python-to-c-efficiency skill with modular C scaffold

- add local Codex skill for Python->C performance-focused translation
- define modular C architecture and benchmark/correctness gates
- add references for patterns, profiling, and module design
- add scaffold_c_module.py to generate include/src/tests/bench skeleton
- update agent default prompt for benchmark-backed optimizations
2026-03-30 00:08:14 +02:00
parent 3ea99d84f4
commit 9a0a170799
6 changed files with 506 additions and 0 deletions
@@ -0,0 +1,111 @@
+---
+name: python-to-c-efficiency
+description: Translate Python code into high-performance C with a focus on modularization, overhead reduction, and repeatable benchmarking. Use when Codex must convert `.py` into `.c/.h` modules, isolate CPU or memory hot paths, design stable C APIs, and deliver implementations validated with equivalence tests and before/after performance measurements.
+---
+
+# Python to C Efficiency
+
+## Goal
+
+Convert performance-critical Python sections to C under three mandatory constraints:
+
+1. Preserve equivalent observable behavior.
+2. Deliver modular, maintainable code.
+3. Demonstrate the improvement with a repeatable benchmark.
+
+## Decision Workflow
+
+1. Measure before translating.
+Profile the Python code and identify the real hotspots. Avoid blanket conversions.
+
+2. Draw the C boundary.
+Translate computational kernels and leave orchestration, I/O, and glue code in Python when they are not critical.
+
+3. Design the C module.
+Define the API, memory ownership, error handling, and data format before implementing.
+
+4. Implement a correct version first.
+Port the Python logic to C without premature optimization.
+
+5. Optimize only confirmed hotspots.
+Apply CPU, cache, and memory techniques to functions measured as dominant.
+
+6. Validate correctness and performance.
+Compare Python and C output and publish before/after metrics on the same dataset.
+
+## Mandatory Modular Architecture
+
+Use this per-module layout:
+
+```text
+<feature>/
+  include/<feature>.h
+  src/<feature>.c
+  src/<feature>_internal.h   # optional
+  tests/test_<feature>.c
+  bench/bench_<feature>.c
+```
+
+Apply these rules:
+
+1. Expose only public APIs, minimal public types, and error codes in the `.h`.
+2. Hide internal details in `_internal.h` or in the `.c`.
+3. Pass buffers as pointer plus length (`ptr`, `size_t n`).
+4. Declare `const` and `restrict` where they are semantically correct.
+5. Avoid mutable global state on the hot path.
+
+To bootstrap the layout quickly, run:
+
+```bash
+python3 scripts/scaffold_c_module.py --name my_kernel --out-dir /path/to/project
+```
+
+## High-Performance Implementation Rules
+
+1. Make types explicit (`int32_t`, `uint64_t`, `size_t`, `float`, `double`).
+2. Preallocate and reuse memory in intensive loops.
+3. Prefer contiguous layouts and sequential, cache-friendly access.
+4. Reduce unpredictable branches on the hot path.
+5. Reduce indirection and dynamic dispatch where they are not needed.
+6. Apply `static inline` only where the profiler justifies the benefit.
+7. Choose a better algorithm before micro-optimizing the code.
+
+## Correctness Contract
+
+Before declaring the translation complete, always run:
+
+1. Regression tests on nominal and edge cases.
+2. A Python vs C output comparison on a shared dataset.
+3. A debug build with sanitizers, with no runtime errors reported.
+
+## Performance Contract
+
+Always measure and report:
+
+1. Wall-clock and CPU time.
+2. Peak memory.
+3. Throughput or latency per unit of input.
+
+Accept the result only if:
+
+1. There is a real speedup on the target workload.
+2. The improvement is repeatable across multiple runs.
+3. Any secondary regressions are documented.
+
+## Required Output
+
+Always produce:
+
+1. Modular `.c/.h` files with a clear API.
+2. A short Python→C mapping note for the critical parts.
+3. The debug/release build commands used.
+4. A compact before/after benchmark table.
+5. Residual risks or open tradeoffs.
+
+## References
+
+Load as the task requires:
+
+- Construct and pattern mapping: [references/patterns-python-c.md](references/patterns-python-c.md)
+- Build, profilers, and benchmarks: [references/compiler-and-profiling.md](references/compiler-and-profiling.md)
+- Modular C design: [references/modular-c-architecture.md](references/modular-c-architecture.md)
@@ -0,0 +1,4 @@
+interface:
+  display_name: "Python to C Efficiency"
+  short_description: "Translate Python into efficient C code"
+  default_prompt: "Use $python-to-c-efficiency to convert this Python hotspot into modular C (include/src/tests/bench) and provide benchmark-backed optimizations."
@@ -0,0 +1,68 @@
+# Compilation, Profiling, and Benchmarking
+
+## 1) Recommended build profiles
+
+### Debug (correctness and safety)
+
+```bash
+gcc -O0 -g3 -fsanitize=address,undefined -fno-omit-frame-pointer -Wall -Wextra -Wpedantic -Iinclude src/*.c tests/*.c -o app_debug
+```
+
+Use this profile to find UB, out-of-bounds accesses, and lifetime bugs.
+
+### Release (throughput)
+
+```bash
+gcc -O3 -march=native -flto -fno-semantic-interposition -DNDEBUG -Wall -Wextra -Iinclude src/*.c -o app_release
+```
+
+Also compare against `clang` on the same workload.
+
+### Release + PGO (optional, for stable workloads)
+
+```bash
+gcc -O3 -fprofile-generate -Iinclude src/*.c -o app_pgo_gen
+./app_pgo_gen <realistic-input>
+gcc -O3 -fprofile-use -fprofile-correction -Iinclude src/*.c -o app_pgo
+```
+
+Apply PGO only when the training dataset is representative.
+
+## 2) Minimum measurements
+
+### Time and memory
+
+```bash
+/usr/bin/time -v ./app_release
+```
+
+### Repeatable benchmark
+
+```bash
+hyperfine --warmup 3 --runs 20 'python3 script.py' './app_release'
+```
+
+Keep the input, CPU governor, and machine load fixed during the runs.
+
+## 3) CPU profiling
+
+```bash
+perf stat ./app_release
+perf record -g ./app_release
+perf report
+```
+
+Use `perf report` to confirm the real hotspots before optimizing.
+
+## 4) Practical interpretation
+
+- First reduce algorithmic complexity.
+- Then optimize memory and cache behavior.
+- Finally apply micro-optimizations (`inline`, branch hints, unrolling), and only if measured.
+
+## 5) Acceptance gates
+
+1. No sanitizer errors in debug.
+2. Output equivalence on the regression dataset.
+3. Repeatable speedup on the real target.
+4. Metrics and compiler flags documented in the result.
@@ -0,0 +1,61 @@
+# Modular C Architecture for Performance
+
+## Goal
+
+Define C modules that are fast, testable, and replaceable at the same time.
+
+## Recommended layout
+
+```text
+feature/
+  include/feature.h
+  src/feature.c
+  src/feature_internal.h
+  tests/test_feature.c
+  bench/bench_feature.c
+```
+
+## API rules
+
+1. Expose small, stable APIs in `include/feature.h`.
+2. Expose opaque types when internal state must not leak.
+3. Always pass buffers with explicit lengths.
+4. Avoid hidden implicit allocations in hot-path functions.
+5. Return predictable error codes.
+
+Example signature:
+
+```c
+feature_status_t feature_process(
+    const uint8_t *restrict in,
+    size_t n,
+    uint8_t *restrict out,
+    size_t out_n);
+```
+
+## Separation of responsibilities
+
+- `feature.c`: public implementation + input validation
+- `feature_internal.h`: static helpers and non-public details
+- `test_feature.c`: functional correctness and edge cases
+- `bench_feature.c`: isolated throughput/latency metrics
+
+## Memory policy
+
+1. Define ownership in the API contract.
+2. Prefer caller-allocated buffers on the hot path.
+3. Use a custom allocator only if the profiler shows contention or overhead.
+
+## Concurrency policy
+
+1. Prefer reentrant modules with no global state.
+2. Keep per-thread state separate when parallelism is needed.
+3. Avoid locks on the hot path; use data partitioning.
+
+## Module review checklist
+
+1. Minimal, consistent API.
+2. No internal detail exported unnecessarily.
+3. Tests and benchmarks present.
+4. Clear input validation at minimal cost.
+5. Inter-module dependencies kept low.
@@ -0,0 +1,61 @@
+# Python -> C Patterns (efficiency)
+
+## 1) Data mapping: choose contiguous structures
+
+- `list[int/float]` -> contiguous buffer (`int32_t*`, `float*`, `double*`) + `size_t n`
+- fixed-field `tuple` -> `struct` with explicit types
+- `dict` with small/known keys -> indexed array or enum + switch
+- `set` over a small domain -> bitmap; over a large domain -> dedicated hash table
+
+Recommended API pattern:
+
+```c
+int kernel_run(const float *restrict in, float *restrict out, size_t n);
+```
+
+## 2) Control-flow mapping
+
+- Numeric list comprehension -> `for` loop with preallocated output
+- `sum(...)` -> typed local accumulator
+- Generator pipeline -> explicit passes over preallocated intermediate buffers
+- Avoid callbacks on the hot path when a direct call is possible
+
+## 3) Numeric semantics mapping
+
+- Define the `float` vs `double` policy before translating
+- Reproduce the Python behavior for NaN/Inf, division, modulo, and rounding
+- Hoist int/float conversions out of the hot loop
+
+## 4) Memory management and ownership
+
+- Define ownership in the function signature and the API comment
+- Avoid `malloc/free` per element or per iteration
+- Reuse an arena/buffer when the workload is batched
+- Validate input sizes and pointers at the start of the function
+
+## 5) Branches and cache locality
+
+- Arrange branches so the frequent case falls on the straight-line path
+- Reduce data dependencies between iterations in long loops
+- Order `struct` fields to reduce padding and cache misses
+
+## 6) Consistent error model
+
+Recommended pattern:
+
+```c
+typedef enum {
+  KERNEL_OK = 0,
+  KERNEL_ERR_NULL = 1,
+  KERNEL_ERR_SIZE = 2
+} kernel_status_t;
+```
+
+Return predictable errors, with no undocumented partial side effects.
+
+## 7) Practical translation sequence
+
+1. Port the code 1:1 to C for functional equivalence.
+2. Add output parity tests.
+3. Profile the C version.
+4. Optimize only the functions that dominate the runtime.
@@ -0,0 +1,201 @@
+#!/usr/bin/env python3
+"""Generate a performance-oriented C module skeleton.
+
+Creates:
+- include/<name>.h
+- src/<name>.c
+- tests/test_<name>.c
+- bench/bench_<name>.c
+"""
+
+from __future__ import annotations
+
+import argparse
+import re
+from pathlib import Path
+
+
+def normalize_name(raw: str) -> str:
+    name = raw.strip().lower().replace("-", "_")
+    name = re.sub(r"[^a-z0-9_]", "_", name)
+    name = re.sub(r"_+", "_", name).strip("_")
+    if not name:
+        raise ValueError("module name is empty after normalization")
+    if name[0].isdigit():
+        name = f"m_{name}"
+    return name
+
+
+def write_file(path: Path, content: str, force: bool) -> None:
+    if path.exists() and not force:
+        raise FileExistsError(f"file exists: {path}")
+    path.parent.mkdir(parents=True, exist_ok=True)
+    path.write_text(content, encoding="utf-8")
+
+
+def render_header(module: str) -> str:
+    guard = f"{module.upper()}_H"
+    return f"""#ifndef {guard}
+#define {guard}
+
+#include <stddef.h>
+#include <stdint.h>
+
+#ifdef __cplusplus
+extern "C" {{
+#endif
+
+typedef enum {{
+    {module}_ok = 0,
+    {module}_err_null = 1,
+    {module}_err_size = 2
+}} {module}_status_t;
+
+{module}_status_t {module}_process_f32(
+    const float *restrict in,
+    float *restrict out,
+    size_t n);
+
+#ifdef __cplusplus
+}}
+#endif
+
+#endif
+"""
+
+
+def render_source(module: str) -> str:
+    return f"""#include "{module}.h"
+
+{module}_status_t {module}_process_f32(
+    const float *restrict in,
+    float *restrict out,
+    size_t n)
+{{
+    if (in == NULL || out == NULL) {{
+        return {module}_err_null;
+    }}
+    if (n == 0) {{
+        return {module}_err_size;
+    }}
+
+    for (size_t i = 0; i < n; ++i) {{
+        out[i] = in[i] * 1.0f; /* placeholder: replace with the real kernel */
+    }}
+
+    return {module}_ok;
+}}
+"""
+
+
+def render_test(module: str) -> str:
+    return f"""#include <assert.h>
+#include <stdio.h>
+
+#include "{module}.h"
+
+int main(void)
+{{
+    float in[4] = {{1.0f, 2.0f, 3.0f, 4.0f}};
+    float out[4] = {{0.0f, 0.0f, 0.0f, 0.0f}};
+
+    {module}_status_t st = {module}_process_f32(in, out, 4);
+    assert(st == {module}_ok);
+    for (size_t i = 0; i < 4; ++i) {{
+        assert(out[i] == in[i]);
+    }}
+
+    printf("test_{module}: ok\\n");
+    return 0;
+}}
+"""
+
+
+def render_bench(module: str) -> str:
+    return f"""#define _POSIX_C_SOURCE 200809L
+
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <time.h>
+
+#include "{module}.h"
+
+static uint64_t ns_now(void)
+{{
+    struct timespec ts;
+    clock_gettime(CLOCK_MONOTONIC, &ts);
+    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
+}}
+
+int main(void)
+{{
+    const size_t n = 1u << 20;
+    float *in = (float *)malloc(n * sizeof(float));
+    float *out = (float *)malloc(n * sizeof(float));
+    if (!in || !out) {{
+        fprintf(stderr, "alloc failed\\n");
+        free(in);
+        free(out);
+        return 1;
+    }}
+
+    for (size_t i = 0; i < n; ++i) {{
+        in[i] = (float)(i & 1023u);
+    }}
+
+    uint64_t t0 = ns_now();
+    {module}_status_t st = {module}_process_f32(in, out, n);
+    uint64_t t1 = ns_now();
+    if (st != {module}_ok) {{
+        fprintf(stderr, "kernel error: %d\\n", (int)st);
+        free(in);
+        free(out);
+        return 1;
+    }}
+
+    double ns_per_elem = (double)(t1 - t0) / (double)n;
+    printf("{module} ns/elem: %.3f\\n", ns_per_elem);
+
+    free(in);
+    free(out);
+    return 0;
+}}
+"""
+
+
+def main() -> int:
+    parser = argparse.ArgumentParser(description="Scaffold a C module with tests and bench")
+    parser.add_argument("--name", required=True, help="Module name (snake_case preferred)")
+    parser.add_argument(
+        "--out-dir",
+        default=".",
+        help="Project root where include/src/tests/bench live",
+    )
+    parser.add_argument("--force", action="store_true", help="Overwrite existing files")
+    args = parser.parse_args()
+
+    module = normalize_name(args.name)
+    root = Path(args.out_dir).resolve()
+
+    write_file(root / "include" / f"{module}.h", render_header(module), args.force)
+    write_file(root / "src" / f"{module}.c", render_source(module), args.force)
+    write_file(root / "tests" / f"test_{module}.c", render_test(module), args.force)
+    write_file(root / "bench" / f"bench_{module}.c", render_bench(module), args.force)
+
+    print(f"created module skeleton: {module}")
+    print(f"root: {root}")
+    print("next:")
+    print(
+        "  gcc -O0 -g3 -fsanitize=address,undefined "
+        f"-Iinclude src/{module}.c tests/test_{module}.c -o test_{module}"
+    )
+    print(
+        "  gcc -O3 -march=native "
+        f"-Iinclude src/{module}.c bench/bench_{module}.c -o bench_{module}"
+    )
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())