fix(verify): recognize www. URLs as valid terminators + multi-document docs

- _URL_TAIL now also matches www.* (not only https://), which avoids false blockers on watermarks such as www.docsity.com
- README: documents --collection / --stems for ingestion, retrieve and rag

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(ingestion): multi-document support in a single ChromaDB collection
2026-05-12 11:21:24 +02:00 · 2026-05-12 11:21:17 +02:00 · 2026-05-12 11:09:28 +02:00 · 2026-05-12 10:43:17 +02:00 · 2026-05-12 10:39:27 +02:00 · 2026-05-12 10:37:39 +02:00
64 changed files with 5809 additions and 3161 deletions
@@ -1,25 +1,23 @@
 ---
-description: Verifica i chunk di step 5, mostra i problemi, propone e applica le fix tramite fix_chunks.py con ri-verifica automatica finale.
+description: Perfeziona i chunk di un documento (verifica, dry-run, fix, ri-verifica) e li prepara per la vettorizzazione.
 allowed-tools: Read Bash Grep
 argument-hint: <stem>
 ---

-## Passo 0 — Verifica fresca (sempre)
+## Passo 0 — Verifica fresca

-Esegui sempre `verify_chunks.py` per avere un report aggiornato (non fidarti di un report.json preesistente):
+Esegui sempre `verify_chunks.py` per un report aggiornato:

 ```bash
-source .venv/bin/activate && python step-6/verify_chunks.py --stem $ARGUMENTS
+source .venv/bin/activate && python chunks/verify_chunks.py --stem $ARGUMENTS
 ```

----
-
 Leggi il report appena generato:

 !`python3 -c "
 import json, sys
 try:
-    r = json.load(open('step-6/$ARGUMENTS/report.json'))
+    r = json.load(open('chunks/$ARGUMENTS/report.json'))
    v = r.get('verdict','?')
    s = r.get('stats', {})
    t = r.get('thresholds', {})
@@ -35,7 +33,7 @@ try:
            print(f'  🔴 {label}: {len(items)}')
            for c in items[:3]:
                print(f'     [{c[\"chunk_id\"]}] {c[\"n_chars\"]} char → {c[\"last_text\"][-60:]!r}')
-    for cat, label in [('too_short','Troppo corti'), ('too_long','Troppo lunghi')]:
+    for cat, label in [('too_short','Troppo corti'), ('too_long','Troppo lunghi'), ('incomplete_math','Math incompleto')]:
        items = wa.get(cat, [])
        if items:
            print(f'  🟡 {label}: {len(items)}')
@@ -48,14 +46,14 @@ except Exception as e: print(f'ERRORE lettura report: {e}')

 ## Se verdict == "ok"

-✅ Nessun problema. Comunica:
+✅ Nessun problema bloccante. Comunica:

 ```
-✅ Chunk puliti — procedi con la vettorizzazione:
-   python step-8/ingest.py --stem $ARGUMENTS
+✅ Chunk pronti — procedi con la vettorizzazione:
+   python ingestion/ingest.py --stem $ARGUMENTS
 ```

-Fermati qui. Non eseguire nessun altro passo.
+Se ci sono solo 🟡, spiega brevemente i warning e chiedi se l'utente vuole risolverli prima o procedere.

 ---

@@ -64,15 +62,16 @@ Fermati qui. Non eseguire nessun altro passo.
 ### Passo 1 — Dry-run

 ```bash
-source .venv/bin/activate && python step-6/fix_chunks.py --stem $ARGUMENTS --dry-run
+source .venv/bin/activate && python chunks/fix_chunks.py --stem $ARGUMENTS --dry-run
 ```

 Spiega in italiano ogni operazione pianificata:
-- **rimuovi chunk vuoti** — chunk privi di testo, non contribuiscono al retrieval
-- **aggiungi prefisso** — il prefisso `[sezione > titolo]` fornisce contesto all'embedding; senza, il chunk è semanticamente decontestualizzato
+
+- **rimuovi chunk vuoti** — privi di testo, non contribuiscono al retrieval
+- **aggiungi prefisso** — `[sezione > titolo]` fornisce contesto all'embedding; senza, il chunk è decontestualizzato
 - **fondi incompleti** — frase spezzata a metà: il chunk corrente e il successivo formano una frase unica
-- **fondi troppo corti** — chunk sotto MIN_CHARS: troppo brevi per portare informazione semantica utile
-- **spezza troppo lunghi** — chunk sopra MAX_CHARS×1.5: troppo densi, degradano la precision del retrieval
+- **fondi troppo corti** — sotto MIN_CHARS: troppo brevi per portare informazione semantica utile
+- **spezza troppo lunghi** — sopra MAX_CHARS×1.5: troppo densi, degradano la precision del retrieval

 Se ci sono solo 🟡 (nessun 🔴), informa che si può procedere anche senza fix e chiedi la preferenza.

@@ -85,16 +84,16 @@ Applica solo su risposta affermativa esplicita.
 ### Passo 3 — Applica

 ```bash
-source .venv/bin/activate && python step-6/fix_chunks.py --stem $ARGUMENTS
+source .venv/bin/activate && python chunks/fix_chunks.py --stem $ARGUMENTS
 ```

 ### Passo 4 — Ri-verifica automatica

 ```bash
-source .venv/bin/activate && python step-6/verify_chunks.py --stem $ARGUMENTS
+source .venv/bin/activate && python chunks/verify_chunks.py --stem $ARGUMENTS
 ```

-Leggi il nuovo `step-6/$ARGUMENTS/report.json` e riporta:
+Leggi il nuovo `chunks/$ARGUMENTS/report.json` e riporta:
 - Nuovo verdict
 - Delta chunk (N prima → N dopo)
 - Problemi residui se presenti
@@ -104,17 +103,17 @@ Leggi il nuovo `step-6/$ARGUMENTS/report.json` e riporta:
 Se verdict finale è `ok` o `warnings_only` senza 🔴:

 ```
-✅ Chunk pronti in step-6/$ARGUMENTS/chunks.json
+✅ Chunk pronti in chunks/$ARGUMENTS/chunks.json
   Procedi con la vettorizzazione:
-   python step-8/ingest.py --stem $ARGUMENTS
+   python ingestion/ingest.py --stem $ARGUMENTS
 ```

-Se rimangono 🔴 dopo il fix (raro — testo non spezzabile o struttura anomala):
+Se rimangono 🔴 dopo il fix (testo non spezzabile o struttura anomala nel sorgente):

 ```
 🔴 X problemi residui non risolvibili automaticamente.
-   Torna a step-4/$ARGUMENTS/clean.md e correggi manualmente le sezioni indicate,
+   Torna a conversione/$ARGUMENTS/clean.md e correggi manualmente le sezioni indicate,
   poi riesegui nell'ordine:
-     python step-5/chunker.py --stem $ARGUMENTS --force
-     python step-6/verify_chunks.py --stem $ARGUMENTS
+     python chunks/chunker.py --stem $ARGUMENTS --force
+     python chunks/verify_chunks.py --stem $ARGUMENTS
 ```
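
The branching this skill describes on the report's `verdict` can be summarized as a small decision function. This is only an illustrative sketch, not code from the commit: the function name `next_action` and the returned labels are hypothetical, and treating every verdict other than `ok` and `warnings_only` as blocking is an assumption (the source only names those two values).

```python
def next_action(report: dict) -> str:
    """Map a verify_chunks report to the next step of the skill (hypothetical helper)."""
    verdict = report.get("verdict", "?")
    if verdict == "ok":
        # No blocking problems: the chunks can go straight to ingestion.
        return "ingest"
    if verdict == "warnings_only":
        # Only yellow warnings: explain them and ask the user how to proceed.
        return "ask_user"
    # Assumption: any other verdict means red blockers, so start with a dry-run.
    return "dry_run_fix"
```

A caller would load `chunks/<stem>/report.json` with `json.load` and pass the resulting dictionary to this function.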
@@ -0,0 +1,199 @@
+---
+description: Legge un file Markdown, individua tutti i problemi che compromettono il chunking (artefatti, sillabazione, header malformati, paragrafi spezzati, gerarchia incoerente, sezioni vuote) e applica le correzioni direttamente sul file senza chiedere conferma per i casi chiari.
+allowed-tools: Read Bash Grep Edit
+argument-hint: <path/to/clean.md oppure stem>
+---
+
+Risolvi il percorso del file da preparare:
+
+!`python3 -c "
+import sys, json, re
+from pathlib import Path
+
+arg = '$ARGUMENTS'.strip()
+root = Path('.')
+
+candidates = [
+    Path(arg),
+    root / arg,
+    root / 'conversione' / arg / 'clean.md',
+    root / 'step-4' / arg / 'clean.md',
+]
+
+md_path = None
+for p in candidates:
+    if p.exists() and p.suffix == '.md':
+        md_path = p
+        break
+
+if not md_path:
+    print('ERRORE: file non trovato per:', arg)
+    sys.exit(1)
+
+print('MD_PATH=' + str(md_path))
+
+# Cerca profilo strutturale (report.json o structure_profile.json)
+stem = md_path.parent.name
+profile_candidates = [
+    md_path.parent / 'report.json',
+    md_path.parent / 'structure_profile.json',
+    root / 'step-4' / stem / 'structure_profile.json',
+    root / 'conversione' / stem / 'report.json',
+]
+for sp in profile_candidates:
+    if sp.exists():
+        try:
+            d = json.load(open(sp))
+            st = d.get('structure', d)
+            print(f'STRATEGIA={st.get(\"strategia_chunking\",\"?\")}')
+            print(f'LINGUA={st.get(\"lingua_rilevata\",\"?\")}')
+            print(f'H1={st.get(\"n_h1\",0)} H2={st.get(\"n_h2\",0)} H3={st.get(\"n_h3\",0)}')
+            for a in st.get('avvertenze', []):
+                print(f'AVVISO: {a}')
+        except Exception:
+            pass
+        break
+
+# Statistiche file
+text = md_path.read_text(encoding='utf-8')
+lines = text.split('\n')
+pua = len(re.findall(r'[\ue000-\uf8ff]', text))
+print(f'RIGHE={len(lines)} CHARS={len(text)}')
+if pua:
+    print(f'PUA_RESIDUI={pua}')
+" 2>/dev/null`
+
+Se l'output contiene `ERRORE`, comunica il percorso non trovato e fermati.
+
+---
+
+Leggi il file completo identificato da `MD_PATH` nell'output sopra. Poi esegui **tutti** i controlli e applica le correzioni nell'ordine indicato.
+
+I parametri di riferimento per il chunking sono: **MIN_CHARS=200, MAX_CHARS=800**.
+
+---
+
+## Controllo 1 — Sillabazione residua
+
+Cerca blocchi di testo (non header) dove una riga termina con `-` e la successiva inizia con lettera minuscola: è un'interruzione di parola non risolta da PDF.
+
+Esempio da correggere:
+```
+...il meccanismo di decen-
+tralizzazione permette...
+```
+→ `...il meccanismo di decentralizzazione permette...`
+
+**Applica** ogni fusione con Edit. Se la parola ricomposta sembra errata, segnala invece di correggere.
+
+---
+
+## Controllo 2 — Artefatti di pagina
+
+Righe standalone che sono esclusivamente:
+- Un numero intero isolato (numero di pagina)
+- Titolo del libro / nome autore che si ripete identico 3+ volte nel documento
+- Intestazioni di capitolo che si ripetono (es. `## 3. Termodinamica` appare sia come header legittimo che come riga di testo duplicata)
+
+**Applica** la rimozione con Edit per le ripetizioni chiaramente decorative. Segnala i casi ambigui.
+
+---
+
+## Controllo 3 — Numeri di pagina in header
+
+Header che terminano con ` | N` o ` N` dove N è un numero isolato (residuo di indice non rimosso):
+- `### 16. Link vari | 109` → `### 16. Link vari`
+- `## Capitolo 3 42` → `## Capitolo 3`
+
+**Applica** con Edit.
+
+---
+
+## Controllo 4 — Header malformati
+
+Per ogni header (`#`, `##`, `###`):
+
+**a) ALL-CAPS non convertito:**
+`## TERMODINAMICA DEI PROCESSI` → `## Termodinamica dei processi`
+Usa sentence case (prima lettera maiuscola, resto minuscolo salvo nomi propri evidenti).
+**Applica**.
+
+**b) Livello h4/h5/h6:**
+`#### Sottosezione` → `### Sottosezione`
+**Applica**.
+
+**c) Testo troppo lungo (> 120 char):**
+Probabilmente non è un header ma testo estratto erroneamente. Rimuovi i `#` iniziali lasciando il testo come paragrafo normale.
+**Applica** se chiaramente non è un titolo. Segnala se ambiguo.
+
+**d) Header duplicati:**
+Se lo stesso header appare due volte, rimuovi la seconda occorrenza (o la prima se è quella fuori contesto).
+**Applica**.
+
+---
+
+## Controllo 5 — Paragrafi spezzati
+
+Blocchi di testo (non header, non liste) che terminano senza punteggiatura finale (`.?!»)`).
+
+Se il blocco successivo non inizia con lettera maiuscola e non è un header/lista, i due blocchi sono parte della stessa frase spezzata da un salto pagina PDF.
+
+**Applica** la fusione solo quando sei certo (la congiunzione è evidente: inizia con congiunzione, continua la frase in modo inequivocabile). Segnala i casi dubbi invece di correggere.
+
+---
+
+## Controllo 6 — Sezioni quasi-vuote o vuote
+
+Sezione (header + corpo) con corpo < 100 caratteri:
+- Se il contenuto è evidentemente una sottosezione o introduzione di ciò che segue (e non ha senso da solo), rimuovi l'header e unisci il testo alla sezione precedente o successiva.
+- Se è un header di capitolo che introduce legittime sottosezioni (`##` seguito da `###`), lascia invariato.
+
+**Applica** le fusioni sicure. Segnala quelle ambigue.
+
+---
+
+## Controllo 7 — Gerarchia heading
+
+Verifica che la gerarchia sia coerente. Problemi da correggere:
+
+- Più di un `# ` (h1) nel documento → il secondo e successivi diventano `## ` salvo che siano chiaramente titoli di parti distinte
+- `### ` prima del primo `## ` → abbassa il `###` a `## ` o aggiungi un `## ` genitore appropriato
+- `## ` prima del primo `# ` in documenti con h1 → lascia invariato (alcuni documenti non hanno h1)
+
+**Applica** solo le correzioni di livello sicure. Segnala le ristrutturazioni che richiedono giudizio.
+
+---
+
+## Controllo 8 — Sezioni troppo lunghe senza struttura
+
+Sezione (## o ###) con corpo > 3000 caratteri e nessun header figlio al suo interno: il chunker la spezzerà su frasi in modo meccanico, perdendo coerenza semantica.
+
+Se il testo contiene chiari cambio-argomento (paragrafi separati da riga vuota, con transizioni come "Inoltre...", "In secondo luogo...", "Un altro aspetto..."), considera di aggiungere un `### ` per suddividere semanticamente.
+
+**Non aggiungere header inventati.** Segnala le sezioni candidate e proponi i titoli: applica solo su risposta affermativa.
+
+---
+
+## Report finale
+
+Dopo aver applicato tutte le correzioni automatiche, mostra:
+
+```
+File: <path>
+Correzioni applicate: N totali
+
+  Sillabazione risolta:       N
+  Artefatti pagina rimossi:   N
+  Numeri pagina in header:    N
+  Header normalizzati:        N (ALL-CAPS, livello, lunghezza, duplicati)
+  Paragrafi fusi:             N
+  Sezioni quasi-vuote risolte:N
+  Gerarchia corretta:         N
+
+Problemi aperti (richiedono giudizio manuale):
+  [riga N] <descrizione precisa>
+  ...
+```
+
+Se non ci sono problemi aperti: **"Markdown pronto per il chunking."**
+Se ci sono problemi aperti: elencali e chiedi quali applicare.
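
Controllo 1 and the ` | N` case of Controllo 3 are mechanical enough to sketch in Python. The block below is a minimal illustration and not part of the commit: the names `dehyphenate` and `strip_header_page_number` are hypothetical. It does not implement the "flag instead of fixing" judgment that the skill asks for, it does not skip header lines, and it deliberately ignores the bare trailing ` N` form, which cannot be told apart from a legitimate title such as `## Capitolo 3` without context.

```python
import re

# A word character, a hyphen at end of line, then a word character on the next line.
_HYPHEN_BREAK = re.compile(r"(\w)-\n(\w)")
# A Markdown header whose text ends with " | N", where N is a page number.
_HEADER_PAGE = re.compile(r"^(#{1,6} .*?)\s*\|\s*\d+\s*$")


def dehyphenate(text: str) -> str:
    """Join words split across lines, only when the next line starts lowercase."""
    def join(match: re.Match) -> str:
        if match.group(2).islower():
            return match.group(1) + match.group(2)
        return match.group(0)  # leave uppercase or numeric continuations untouched

    return _HYPHEN_BREAK.sub(join, text)


def strip_header_page_number(line: str) -> str:
    """Drop a trailing ' | N' page number from a header line."""
    match = _HEADER_PAGE.match(line)
    return match.group(1) if match else line
```

In the skill itself these corrections are applied through Edit, one occurrence at a time, so a function like this would at most help to locate candidates.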
@@ -1,115 +0,0 @@
----
-description: Revisione qualitativa del clean.md dopo il pre-processing automatico (step 4). Trova artefatti residui, paragrafi spezzati e header errati, poi propone le correzioni.
-allowed-tools: Read Bash Grep Edit
-argument-hint: <stem>
----
-
-Esegui la revisione qualitativa di `step-4/$ARGUMENTS/clean.md`.
-
-**Cosa è già stato fatto automaticamente (revision_log):**
-!`grep -A 12 "^## $ARGUMENTS" step-4/revision_log.md 2>/dev/null || echo "(nessun log trovato per questo stem)"`
-
-**Profilo strutturale attuale:**
-!`python3 -c "
-import json, sys
-try:
-    d = json.load(open('step-4/$ARGUMENTS/structure_profile.json'))
-    print(f'Livello: {d[\"livello_struttura\"]}  Strategia: {d[\"strategia_chunking\"]}')
-    print(f'h1={d[\"n_h1\"]}  h2={d[\"n_h2\"]}  h3={d[\"n_h3\"]}  paragrafi={d[\"n_paragrafi\"]}')
-    print(f'Lunghezza media sezione: {d[\"lunghezza_media_sezione\"]} char')
-    for a in d.get('avvertenze', []): print(f'  ⚠️  {a}')
-except Exception as e: print(f'ERRORE: {e}')
-" 2>/dev/null`
-
----
-
-Analizza `step-4/$ARGUMENTS/clean.md` eseguendo i grep seguenti e ragionando sui risultati. Per ogni check: esegui il grep, conta i risultati, riporta i casi concreti (max 5 esempi con numero di riga).
-
-## Check 1 — Sillabazione residua
-
-Righe che terminano con trattino seguito da testo nella riga successiva (artefatto PDF non risolto):
-
-```bash
-grep -n "\-$" step-4/$ARGUMENTS/clean.md | head -20
-```
-
-Segnala se presenti: numero di riga, testo della riga e della riga successiva.
-
-## Check 2 — Righe orfane (artefatti PDF)
-
-Righe standalone (non header `#`, non vuote) di meno di 60 caratteri che sembrano artefatti:
-
-```bash
-grep -n "^[^#\-\*\|].\{1,59\}$" step-4/$ARGUMENTS/clean.md | grep -v "^\s*$" | head -30
-```
-
-Valuta ogni riga: è testo normale breve (legittimo) o artefatto (numero di pagina, nome autore isolato, riga di intestazione ripetuta)?
-
-## Check 3 — Paragrafi con frase spezzata
-
-Blocchi di testo che terminano senza punteggiatura di fine frase (`.?!»)`):
-
-```bash
-grep -n "[^.!?»)\]\'\"]$" step-4/$ARGUMENTS/clean.md | grep -v "^[0-9]*:#" | grep -v "^[0-9]*:\s*$" | grep -v "^\s*[-\*]" | head -20
-```
-
-Riporta i casi più sospetti (righe brevi che finiscono a metà concetto).
-
-## Check 4 — Header sospetti
-
-```bash
-grep -n "^##\? " step-4/$ARGUMENTS/clean.md | head -40
-```
-
-Verifica:
-- `##` o `###` con contenuto interamente MAIUSCOLO non convertito → segnala
-- Header duplicati (stesso testo che appare due volte) → segnala
-- `##` con testo > 80 caratteri (probabile testo che non è un header) → segnala
-- Salti di livello anomali (es. `###` senza un `##` padre) → segnala
-
-## Check 5 — Sezioni quasi vuote
-
-```bash
-python3 -c "
-import re, sys
-text = open('step-4/$ARGUMENTS/clean.md').read()
-sections = re.split(r'^(#{1,3} .+)$', text, flags=re.MULTILINE)
-for i in range(1, len(sections)-1, 2):
-    header = sections[i].strip()
-    body = sections[i+1].strip() if i+1 < len(sections) else ''
-    if len(body) < 80 and body:
-        print(f'{header!r} → {len(body)} char: {body[:60]!r}')
-    elif not body:
-        print(f'{header!r} → VUOTA')
-" 2>/dev/null | head -20
-```
-
-Sezioni con body < 80 char o vuote compromettono il chunking. Segnala quelle che non hanno senso come sezione autonoma.
-
-## Check 6 — Gerarchia strutturale
-
-```bash
-grep -n "^#\{1,3\} " step-4/$ARGUMENTS/clean.md | head -50
-```
-
-Verifica che la gerarchia sia coerente: `# → ## → ###`. Segnala se ci sono `###` prima del primo `##`, o `##` prima del primo `#`, o `#` multipli (più di un h1).
-
----
-
-## Report finale
-
-```
-🔴 BLOCCANTI (compromettono il chunking o il retrieval)
-  [riga N] descrizione precisa del problema
-  ...
-
-🟡 MINORI (artefatti visibili, non bloccanti)
-  [riga N] descrizione
-  ...
-
-🟢 OK — nessun problema rilevato in questa categoria
-```
-
-Poi chiedi: **"Applico le correzioni per i 🔴? E per i 🟡?"**
-
-Applica solo ciò che viene esplicitamente approvato. Usa Edit per ogni modifica — mai riscrivere l'intero file.
@@ -26,26 +26,13 @@ __pycache__/
 .DS_Store
 Thumbs.db

-# Report generati dagli script
-step-0/*_step0_report.txt
-step-1/*_step1_report.txt

-# Output step-2 — MD grezzo generato da marker
-step-2/*/
-
-# Output step-3 — profilo struttura generato da detect_structure.py
-step-3/*/
-
-# Output step-4 — MD revisionato e log generati da revise.py
-step-4/*/
-step-4/revision_log.md
-
-# Output step-5 — chunk generati da chunker.py
-step-5/*/
-
-# Output step-6 — report generati da verify_chunks.py
-step-6/*/
-
-# Output conversione/ — generati da conversione/pipeline.py
+# Output conversione/ — generati dagli script
 conversione/*/
+!conversione/_pipeline/
+!conversione/_pipeline/**
+conversione/_pipeline/__pycache__/
+
+# Output chunks/ — generati da chunks/chunker.py e chunks/verify_chunks.py
+chunks/*/

@@ -1,86 +1,214 @@
-# CLAUDE.md — RAG from Scratch
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Missione
+
+Ricostruire la struttura logica di PDF digitali e serializzarla in Markdown **stabile e valido per la vettorizzazione RAG**, senza LLM né OCR. Il Markdown è solo il formato di output finale — la pipeline deve operare su una rappresentazione intermedia strutturata.
+
+**Non supportato:** PDF scansionati (immagini), PDF protetti da password.
+
+---

 ## Regole invarianti

 - **Lingua:** Rispondi sempre in italiano.
-- **Venv obbligatorio:** Usa `.venv/bin/python` o attiva con `source .venv/bin/activate`. Mai `pip`/`python` di sistema.
-- **Non modificare `raw.md`:** `step-2/<stem>/raw.md` è immutabile. La copia di lavoro è `step-4/<stem>/clean.md`.
+- **Venv:** Usa `.venv/bin/python` o `source .venv/bin/activate`. Mai `pip`/`python` di sistema.
+- **`raw.md` immutabile:** Non modificare mai `raw.md`. La copia di lavoro è sempre `clean.md`.
+- **Niente LLM nella pipeline:** tutta la logica deve essere rule-based e riproducibile.

 ---

-## Pipeline (ordine obbligatorio)
+## Setup

-```
-PDF (sources/) → step-0 → step-1 → step-2 → step-3
-              → step-4 (CRITICO: revisione manuale clean.md)
-              → step-5 → step-6 → step-7 (Ollama) → step-8 → step-9
-```
-
-Il parametro `--stem` identifica il documento (nome PDF senza `.pdf`). Lo stem è anche il nome della collection ChromaDB.
-
-Comandi tipici:
 ```bash
+python -m venv .venv
 source .venv/bin/activate
-python step-4/revise.py --stem <stem>
-python step-5/chunker.py --stem <stem>
-python step-6/verify_chunks.py --stem <stem>
-python step-8/ingest.py --stem <stem>
-python step-9/rag.py --stem <stem>
+pip install -r requirements.txt
 ```

----
+Dipendenze principali:

-## File critici
-
-| File | Ruolo |
-|---|---|
-| `step-9/config.py` | Fonte di verità: `EMBED_MODEL`, `OLLAMA_MODEL`, `TOP_K`, `TEMPERATURE`, `SYSTEM_PROMPT` |
-| `step-5/chunker.py` | Chunking adattivo — `MIN_CHARS=200`, `MAX_CHARS=800`, `OVERLAP_S=2` |
-| `step-6/verify_chunks.py` | Verifica chunk — stesse soglie di `chunker.py` |
-| `step-6/fix_chunks.py` | Fix automatici su chunk anomali |
-| `step-4/revise.py` | Pre-processing MD automatico (8 trasformazioni euristiche) |
-| `step-8/ingest.py` | Vettorizzazione ChromaDB — legge `EMBED_MODEL` da `config.py` |
-| `step-9/rag.py` | Pipeline RAG interattiva |
+- **PyMuPDF** (`fitz`) — estrazione primaria con metadati font e coordinate
+- **pdfplumber** — ricostruzione tabelle (opzionale, non per parsing generico)

 ---

-## Regole di assistenza
+## Comandi

-**Modifica `EMBED_MODEL` in `step-9/config.py`:**
-Avvisa sempre che serve rieseguire la vettorizzazione:
 ```bash
-python step-8/ingest.py --stem <stem> --force
+# Converti un PDF (posizionalo prima in sources/<nome>.pdf)
+.venv/bin/python conversione/ --stem <nome>
+
+# Forza riesecuzione (sovrascrive clean.md esistente)
+.venv/bin/python conversione/ --stem <nome> --force
+
+# Tutti i PDF in sources/
+.venv/bin/python conversione/
+
+# Validazione batch
+.venv/bin/python conversione/ validate
+.venv/bin/python conversione/ validate <stem> --detail
+
+# Rimuove l'output di uno stem
+bash conversione/clear.sh <nome>
+
+# Test suite
+.venv/bin/python -m pytest tests/
 ```
-`ingest.py` importa `EMBED_MODEL` direttamente da `config.py` — la coerenza è critica: se violata non produce errori ma restituisce risultati insensati.

-**Modifica soglie chunking (`MIN_CHARS`, `MAX_CHARS`, `OVERLAP_S`):**
-I valori compaiono in tre file che vanno sincronizzati manualmente:
-1. `step-5/chunker.py`
-2. `step-6/verify_chunks.py`
-3. `step-6/fix_chunks.py`
+`--stem` = nome file PDF senza estensione.

-**Step 4 — revisione clean.md:**
-`revise.py` applica trasformazioni automatiche, ma il risultato va sempre revisionato a mano. La qualità del RAG dipende da `clean.md` più di qualsiasi parametro tecnico. Suggerisci sempre `/step4-review <stem>` dopo `revise.py`.
+---

-**Step 6 — verifica chunk:**
-Dopo `verify_chunks.py`, usa `/step6-fix <stem>` prima di passare a step-8.
+## Architettura
+
+### Principio fondamentale
+
+La pipeline **non converte direttamente PDF → Markdown**.
+
+```
+PDF → Structured Document Tree → Markdown
+```
+
+Il Markdown è generato solo dall'albero documentale. Non dal testo grezzo.
+
+### Modello dati intermedio
+
+```python
+class Block:
+    text: str
+    page: int
+    bbox: tuple
+    font_size: float
+    font_name: str
+    is_bold: bool
+    block_type: str  # "header" | "paragraph" | "list" | "table" | "code"
+
+class Section:
+    title: str
+    level: int          # 1, 2, 3
+    content: list[Block]
+    children: list[Section]
+```
+
+Il Markdown si genera **solo** da `Section`. Mai da `Block` direttamente.
+
+---
+
+### I 9 stadi della pipeline
+
+#### Stage 1 — Metadata Extraction
+
+Usa `page.get_text("dict")` (o `"rawdict"`) di PyMuPDF. **Non usare estrazione plain text.**
+
+Per ogni span estrai: testo, font size, font name, flags, bbox, numero di pagina.
+Estrai anche: TOC del documento, bookmark, dimensioni pagina.
+
+#### Stage 2 — Layout Analysis
+
+Identifica i blocchi strutturali preservando l'ordine di lettura:
+headers, paragrafi, liste, tabelle, code block, interruzioni di pagina.
+
+#### Stage 3 — Font Analysis
+
+Inferisce la gerarchia visiva **per documento** (non hardcoded):
+- calcola il font size dominante del corpo
+- raggruppa i font size in cluster
+- identifica i candidati header per dimensione
+
+#### Stage 4 — Header Detection
+
+Segnali combinati (tutti richiesti):
+- font size > corpo testo
+- boldness / semibold
+- numerazione (`^\d+(\.\d+)*\s+`)
+- spaziatura verticale sopra/sotto
+- lunghezza riga corta
+
+#### Stage 5 — Hierarchy Inference
+
+Priorità delle regole (in ordine):
+
+1. **Numerazione** — `1` → H1, `1.1` → H2, `1.1.1` → H3 (ha precedenza sul font size)
+2. **TOC del PDF** — se presente, è autoritativo; allineare i header rilevati alla sua gerarchia
+3. **Font size clustering** — fallback se né numerazione né TOC esistono
+
+#### Stage 6 — Document Tree Reconstruction
+
+Costruisce l'albero `Section` con relazioni parent-child, ordinamento e nesting. Ogni nodo contiene titolo, livello, contenuto e figli.
+
+#### Stage 7 — Markdown Generation
+
+Serializza l'albero in Markdown valido:
+- Header: `#`/`##`/`###` senza salti di livello
+- Liste: preserva nesting ordered/unordered
+- Tabelle: GitHub-compatible; fallback testo strutturato
+- Code block: fenced con language tag dove rilevabile
+
+#### Stage 8 — Hierarchy Normalization
+
+Ripara le inconsistenze strutturali:
+- salti di livello invalidi (`# A` → `#### B` diventa `# A` → `## B`)
+- header vuoti (rimuovi o mergia)
+- header consecutivi duplicati (collassa)
+- nesting rotto
+
+#### Stage 9 — Structural Validation
+
+Valida il Markdown finale:
+- nessun salto di livello heading
+- nessuna sezione vuota
+- liste correttamente annidate
+- tabelle con colonne consistenti
+- ordine uguale al PDF sorgente
+
+---
+
+## Cosa rende un Markdown perfetto per la vettorizzazione
+
+- **Struttura semantica:** ogni header è un confine naturale di chunk; ogni sezione è un'unità concettuale.
+- **Gerarchia corretta:** h1/h2/h3 riflettono la struttura logica, non il layout tipografico.
+- **Testo pulito:** nessun artefatto di encoding, footnote superscript, `<br>`, dot-leader, PUA.
+- **Paragrafi interi:** nessuna frase troncata da salto pagina.
+- **Output deterministico:** stessa pipeline su stesso PDF produce sempre lo stesso output.
+
+---
+
+## Linee guida per sviluppare la pipeline
+
+- Ogni stage deve essere **indipendentemente testabile**.
+- Le regex per header numbering e simili vanno compilate in `_constants.py`, mai inline.
+- PyMuPDF è il parser primario. pdfplumber si usa solo per tabelle complesse.
+- Ogni stage deve ricevere l'output del precedente come struttura tipizzata, non testo grezzo.
+- Prima di aggiungere un nuovo segnale di detection (Stage 4), validarlo su almeno 3 PDF diversi.
+
+### Categorie di test richieste
+
+| Categoria | Input | Validazione attesa |
+|-----------|-------|-------------------|
+| Header reconstruction | PDF con H1/H2/H3 numerati | gerarchia corretta, no level skip |
+| TOC alignment | PDF con bookmark/TOC | markdown allineato al TOC |
+| Mixed font sizes | Font inconsistenti, bold nel corpo | body non classificato come header |
+| Broken layout | Header multi-riga, spacing irregolare | header mergiati, markdown valido |
+| Tables | Tabelle nel PDF | markdown table con colonne preservate |
+| Lists | Liste ordered/unordered annidate | nesting corretto |
+| Large documents | PDF tecnico voluminoso | output deterministico, memoria stabile |
+| Invalid hierarchy repair | `# A` + `#### B` artificiale | riparazione automatica in `# A` + `## B` |
+
+---
+
+## Pipeline attuale
+
+La pipeline in `conversione/_pipeline/` (basata su trasformazioni testo con `_apply.py`) è **deprecata** e deve essere sostituita dall'architettura a 9 stadi descritta sopra. Durante la migrazione:
+
+- separare estrazione da ricostruzione
+- introdurre strutture intermedie esplicite (`Block`, `Section`)
+- rimuovere l'architettura parser-centrica
+- ogni stage deve essere indipendente e testabile

 ---

 ## Skills custom

-- `/step4-review <stem>` — Revisione qualitativa `clean.md`: artefatti, paragrafi spezzati, header errati.
-- `/step6-fix <stem>` — Dry-run e applicazione fix chunk tramite `fix_chunks.py`.
-
----
-
-## Struttura directory per stem
-
-```
-step-2/<stem>/raw.md                ← immutabile
-step-4/<stem>/clean.md              ← copia di lavoro
-step-4/<stem>/structure_profile.json
-step-5/<stem>/chunks.json
-step-6/<stem>/report.json
-chroma_db/<stem>/                   ← collection ChromaDB
-```
+- `/prepare-md <path|stem>` — corregge `clean.md` quando la pipeline non basta: sillabazione, artefatti residui, header malformati, gerarchia incoerente.
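
The numbering rule of Stage 5 and the level-skip repair of Stage 8 in the CLAUDE.md above can be illustrated with a short Python sketch. It is not the pipeline's implementation: `level_from_numbering`, `normalize_levels`, and `MAX_LEVEL` are hypothetical names. Capping levels at 3 is an assumption taken from the `level: int  # 1, 2, 3` comment on `Section`, and promoting the first header to level 1 is a simplifying assumption of this sketch.

```python
import re
from typing import List, Optional

# Same numbering pattern quoted in Stage 4, with the number captured.
NUMBERING = re.compile(r"^(\d+(?:\.\d+)*)\s+")
MAX_LEVEL = 3  # assumption: Section.level only takes the values 1, 2, 3


def level_from_numbering(title: str) -> Optional[int]:
    """Return 1 for '1 ...', 2 for '1.1 ...', 3 for '1.1.1 ...', None if unnumbered."""
    match = NUMBERING.match(title)
    if match is None:
        return None
    depth = match.group(1).count(".") + 1
    return min(depth, MAX_LEVEL)


def normalize_levels(levels: List[int]) -> List[int]:
    """Repair level skips so that each header is at most one level below its parent."""
    stack = []      # (original_level, repaired_level) of the currently open ancestors
    repaired = []
    for original in levels:
        # Close every ancestor that is not strictly shallower than this header.
        while stack and stack[-1][0] >= original:
            stack.pop()
        parent_level = stack[-1][1] if stack else 0
        new_level = min(parent_level + 1, MAX_LEVEL)
        stack.append((original, new_level))
        repaired.append(new_level)
    return repaired
```

Using a stack of open ancestors, rather than clamping each level to the previous one plus one, keeps sibling headers at the same repaired level: two consecutive `####` headers under `# A` both become `##`.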
@@ -1,785 +1,246 @@
-# RAG from Scratch — Singolo PDF Generico
+# PDF → Chunk RAG-ready

-Sistema RAG (Retrieval-Augmented Generation) costruito da zero, senza framework di alto livello.
-Funziona su qualsiasi PDF digitale. Gira interamente in locale, senza GPU, senza cloud.
+Converte PDF digitali in chunk semantici pronti per la vettorizzazione RAG,
+senza LLM né OCR.

-**Stack:** Python · Ollama · nomic-embed-text · Qwen3.5 · ChromaDB  
-**Compatibile con:** Linux · macOS · Windows (WSL2) · CPU Only · ~8 GB RAM libera
+**Pipeline:** PDF → Markdown strutturato → chunk semantici → embedding ChromaDB  
+**Stack:** Python · PyMuPDF · pdfplumber  
+**Non supportati:** PDF scansionati (solo immagini), PDF protetti da password.

 ---

-## Indice
-
-- [Panoramica](#panoramica)
-- [Struttura del progetto](#struttura-del-progetto)
-- [Gli step](#gli-step)
-  - [Step 0 — Scegli il PDF](#step-0--scegli-il-pdf)
-  - [Step 1 — Ispezione automatica](#step-1--ispezione-automatica)
-  - [Step 2 — Conversione in Markdown grezzo](#step-2--conversione-in-markdown-grezzo)
-  - [Step 3 — Rilevamento struttura](#step-3--rilevamento-struttura)
-  - [Step 4 — Revisione manuale](#step-4--revisione-manuale)
-  - [Step 5 — Chunking adattivo](#step-5--chunking-adattivo)
-  - [Step 6 — Verifica chunk](#step-6--verifica-chunk)
-  - [Step 7 — Installazione ambiente](#step-7--installazione-ambiente)
-  - [Step 8 — Vettorizzazione](#step-8--vettorizzazione)
-  - [Step 9 — Pipeline RAG](#step-9--pipeline-rag)
-  - [Step 10 — Test automatici](#step-10--test-automatici)
-- [Principi di progettazione](#principi-di-progettazione)
-
----
-
-## Panoramica
-
-```
-PDF
- └─► STEP 1  Ispezione automatica
-      └─► STEP 2  Conversione in Markdown grezzo
-           └─► STEP 3  Rilevamento struttura
-                └─► STEP 4  Revisione manuale        ← step più importante
-                     └─► STEP 5  Chunking adattivo
-                          └─► STEP 6  Verifica chunk
-                               └─► STEP 8  Vettorizzazione
-                                    └─► STEP 9  Pipeline RAG
-                                         └─► STEP 10 Test automatici
-
-STEP 0  Prerequisito iniziale (PDF adatto)
-STEP 7  Prerequisito tecnico (ambiente locale)
-```
-
-### Dove si concentra il rischio
-
-| Step | Rischio | Motivo |
-|---|---|---|
-| Step 0 | 🔴 Alto | Un PDF inadatto invalida tutto il lavoro successivo |
-| Step 1 | 🟢 Basso | Automatico, solo osservazione |
-| Step 2 | 🟢 Basso | Automatico, tool maturo |
-| Step 3 | 🟢 Basso | Automatico, solo analisi |
-| Step 4 | 🔴 Alto | Manuale — la qualità del MD determina la qualità del RAG |
-| Step 5 | 🟡 Medio | Logica adattiva, dipende dalla qualità del MD |
-| Step 6 | 🟢 Basso | Automatico, solo verifica |
-| Step 7 | 🟢 Basso | Installazione standard |
-| Step 8 | 🟢 Basso | Meccanico, lento ma affidabile |
-| Step 9 | 🟡 Medio | Qualità del prompt |
-| Step 10 | 🟢 Basso | Test automatici |
-
----
-
-## Struttura del progetto
-
-```
-rag-from-scratch/
-│
-├── sources/                        # PDF originali — non modificare mai
-│   └── documento.pdf
-│
-├── step-0/
-│   └── check_pdf.py                # Verifica requisiti del PDF
-│
-├── step-1/
-│   └── inspect_pdf.py              # Ispezione automatica del PDF
-│
-├── step-2/
-│   ├── convert_pdf.py              # Conversione PDF → Markdown grezzo
-│   └── <stem>/
-│       └── raw.md                  # MD grezzo (non toccare)
-│
-├── step-3/
-│   ├── detect_structure.py         # Rilevamento struttura MD
-│   └── <stem>/
-│       └── structure_profile.json  # Profilo struttura
-│
-├── step-4/
-│   ├── revise.py                   # Pre-processing automatico MD
-│   ├── revision_log.md             # Log modifiche manuali
-│   └── <stem>/
-│       ├── clean.md                # MD revisionato
-│       └── structure_profile.json  # Profilo aggiornato
-│
-├── step-5/
-│   ├── chunker.py                  # Chunking adattivo
-│   └── <stem>/
-│       └── chunks.json             # Chunk pronti per la vettorizzazione
-│
-├── step-6/
-│   ├── verify_chunks.py            # Verifica chunk
-│   ├── fix_chunks.py               # Fix chunk problematici
-│   └── <stem>/
-│       └── chunks.json             # Chunk verificati
-│
-├── step-7/
-│   ├── check_env.py                # Verifica ambiente locale
-│   └── README.md                   # Guida installazione Ollama e dipendenze
-│
-├── step-8/
-│   └── ingest.py                   # Vettorizzazione → ChromaDB
-│
-├── step-9/
-│   ├── config.py                   # Configurazione pipeline RAG ← modifica qui
-│   ├── rag.py                      # Pipeline RAG interattiva
-│   ├── test_ollama.py              # Test chat Ollama senza RAG
-│   └── README.md
-│
-├── chroma_db/                      # Vector store — generato da step-8
-├── requirements.txt
-├── .gitignore
-└── README.md
-```
-
---
-
-## Gli step
-
---
-
-### Step 0 — Scegli il PDF
-
-**Tipo:** prerequisito manuale  
-**Input:** nessuno  
-**Output:** un PDF adatto al sistema
-
-Il PDF deve soddisfare requisiti minimi prima di qualsiasi elaborazione.
-Un PDF inadatto rende tutto il lavoro successivo inutile.
-
-**Criteri obbligatori:**
-
- Il testo è selezionabile nel PDF reader — se non riesci a copiare una parola,
-  pdfplumber non la leggerà
- Non è protetto da password
- È generato digitalmente, non scansionato — una foto di un libro non è un PDF di testo
- Il contenuto importante è nel testo, non nelle immagini
-
-**Criteri desiderabili:**
-
- Ha una struttura logica riconoscibile: capitoli, sezioni, paragrafi
- Le sezioni hanno titoli espliciti
- Non ha layout a colonne multiple
- È in una lingua sola o prevalentemente una
-
-**Come verificarlo:**
-Apri il PDF nel tuo reader, seleziona del testo da pagine diverse e copialo.
-Se il testo copiato è leggibile e nell'ordine giusto, il PDF è adatto.
-Se ottieni caratteri strani o testo nell'ordine sbagliato, il PDF ha problemi.
-
---
-
-### Step 1 — Ispezione automatica
-
-**Tipo:** automatico  
-**Input:** tutti i PDF in `sources/`  
-**Output:** `step-1/<stem>_step1_report.txt`  
-**Script:** `step-1/inspect_pdf.py`
-
-```bash
-python step-1/inspect_pdf.py
-```
-
-Lo script scansiona automaticamente tutti i PDF in `sources/`, analizza ogni documento pagina per pagina e produce un report.
-Serve per capire la qualità del documento e mappare i problemi
-prima di affrontare la revisione manuale.
-
-**Cosa rileva:**
-
- Testo non estraibile (pagine con sole immagini)
- Sillabazioni a fine riga
- Layout a colonne (righe molto corte e numerose)
- Intestazioni e piè di pagina ripetitivi
- Caratteri Unicode anomali
- Pagine vuote
-
-**Output del report:**
-
-```
-Score: 87/100
-Pagine totali:       243
-Pagine con problemi:  12
-
-  Pagina 14: sillabazione rilevata (3 occorrenze)
-  Pagina 67: possibile layout a colonne
-  Pagina 201: caratteri Unicode anomali
-
-PROSSIMI PASSI:
-  → conversione con marker funzionerà bene
-  → attenzione alle pagine 14 e 67 nella revisione manuale
-```
-
-**Decisione:**
-
-| Score | Azione |
-|---|---|
-| ≥ 70 | Procedi allo step 2 |
-| 40–70 | Procedi con cautela, revisione estesa necessaria |
-| < 40 | Valuta una fonte PDF migliore |
-
---
-
-### Step 2 — Conversione in Markdown grezzo
-
-**Tipo:** automatico  
-**Input:** tutti i PDF in `sources/` (o uno solo con `--pdf`)  
-**Output:** `step-2/<stem>/raw.md` + `step-2/<stem>/clean.md`  
-**Script:** `step-2/convert_pdf.py`
-
-```bash
-python step-2/convert_pdf.py                         # tutti i PDF in sources/
-python step-2/convert_pdf.py --pdf sources/doc.pdf   # un solo PDF
-```
-
-Converte il PDF in Markdown usando `pymupdf4llm`. Il risultato non è perfetto — è la base
-su cui lavorerai nello step 4.
-
-Lo script crea due file:
- `raw.md` — conversione grezza, **non modificare mai**. È il punto di partenza di riferimento.
- `clean.md` — copia di lavoro che verrà modificata negli step successivi.
-
-**Cosa produce la conversione:**
-
- Titoli riconosciuti e convertiti in `#` `##` `###`
- Paragrafi separati da righe vuote
- Sillabazione parzialmente risolta
-
-**Cosa non produce:**
-
- Rimozione intestazioni e piè di pagina
- Correzione completa del layout a colonne
- Descrizione del contenuto delle immagini
-
---
-
-### Step 3 — Rilevamento struttura
-
-**Tipo:** automatico  
-**Input:** `step-2/<stem>/`  
-**Output:** `step-3/<stem>/structure_profile.json`  
-**Script:** `step-3/detect_structure.py`
-
-```bash
-python step-3/detect_structure.py                    # tutti i documenti in step-2/
-python step-3/detect_structure.py --stem <nome>      # un solo documento
-python step-3/detect_structure.py --force            # riesegui anche se già presente
-```
-
-Copia `raw.md` e `clean.md` da `step-2/<stem>/` e analizza la struttura del Markdown senza modificarla.
-Il profilo prodotto guida sia la revisione manuale che il chunker.
-
-**I quattro livelli strutturali:**
-
-```
-Livello 3 — struttura ricca
-  Il documento ha ### con regolarità.
-  Ogni ### è un'unità semantica chiara.
-  Esempi: opere filosofiche, manuali tecnici, leggi.
-  Strategia chunking: boundary su ###
-
-Livello 2 — struttura parziale
-  Il documento ha ## ma pochi o nessun ###.
-  Le sezioni sono i capitoli, non le sottosezioni.
-  Esempi: articoli scientifici, report, saggi.
-  Strategia chunking: boundary su ##, split interno su paragrafi
-
-Livello 1 — solo paragrafi
-  Il documento non ha titoli significativi.
-  La struttura è data dalle righe vuote.
-  Esempi: testi narrativi, lettere, trascrizioni.
-  Strategia chunking: boundary su paragrafo
-
-Livello 0 — testo piatto
-  Un blocco continuo senza struttura riconoscibile.
-  Esempi: PDF mal convertiti, testi antichi.
-  Strategia chunking: sliding window su frasi
-```
-
-**Profilo prodotto:**
-
-```json
-{
-  "livello_struttura": 3,
-  "n_h1": 1,
-  "n_h2": 9,
-  "n_h3": 296,
-  "n_paragrafi": 312,
-  "boundary_primario": "h3",
-  "lingua_rilevata": "it",
-  "lunghezza_media_sezione": 420,
-  "strategia_chunking": "h3_aware",
-  "avvertenze": [
-    "14 sezioni sotto i 200 caratteri — verranno accorpate",
-    "8 sezioni sopra i 800 caratteri — verranno divise"
-  ]
-}
-```
-
---
-
-### Step 4 — Revisione manuale
-
-**Tipo:** manuale (con pre-processing automatico)  
-**Input:** `step-3/<stem>/clean.md` + `step-3/<stem>/structure_profile.json`  
-**Output:** `step-4/<stem>/clean.md` — MD revisionato  
-**Script:** `step-4/revise.py`
-
-> Questo è lo step più importante dell'intera pipeline.
-> La qualità del RAG dipende da questo step più di qualsiasi
-> parametro tecnico o scelta di modello.
-
-#### Pre-processing automatico
-
-Prima di qualsiasi revisione manuale, esegui lo script di revisione automatica:
-
-```bash
-python step-4/revise.py --stem documento
-```
-
-Lo script applica le seguenti trasformazioni euristiche, valide per qualsiasi documento:
-
-| Trasformazione | Descrizione |
-|---|---|
-| Rimozione TOC | Righe che iniziano con `INDICE`, `INDEX`, `CONTENTS`, ecc. |
-| ALL-CAPS → `##` | Righe standalone in maiuscolo convertite in header section-case |
-| `N.  testo` → `### N.` | Sezioni numerate (con 1+ spazio dopo il punto) convertite in h3 |
-| Unione paragrafi | Blocchi spezzati da salti pagina PDF uniti automaticamente |
-| Whitespace | Spazi multipli normalizzati, righe vuote ridotte |
-
-Il profilo strutturale aggiornato viene salvato in `step-4/<stem>/structure_profile.json`.
-
-#### Revisione assistita da Claude Code
-
-Dopo il pre-processing, usa la skill integrata per una revisione qualitativa:
-
-```
-/step4-review documento
-```
-
-La skill analizza `step-4/<stem>/clean.md` e produce un report strutturato:
-
-```
-🔴 BLOCCANTI  — problemi che compromettono il chunking
-🟡 MINORI     — artefatti visibili ma non bloccanti
-🟢 OK         — categorie senza problemi
-```
-
-Poi propone le correzioni e le applica solo su tua approvazione.
-
-#### Revisione manuale (senza Claude Code)
-
-Se non usi Claude Code, esegui questi 6 check dal terminale.
-In tutti i comandi sostituisci `<stem>` con il nome reale del documento.
-
-**Check 1 — Sillabazione residua**
-Parole spezzate a fine riga con trattino (artefatto da PDF non risolto):
-```bash
-grep -n "\-$" step-4/<stem>/clean.md | head -20
-```
-Se trovi risultati: unisci la riga con la successiva eliminando il trattino
-e il ritorno a capo.
-
-**Check 2 — Righe orfane**
-Righe brevi (<60 char) isolate che sembrano numeri di pagina, autori, intestazioni:
-```bash
-grep -n "^[^#\-\*\|].\{1,59\}$" step-4/<stem>/clean.md | grep -v "^\s*$" | head -30
-```
-Per ogni riga: valuta se è testo legittimo (frase breve) o artefatto
-(numero di pagina, nome autore ripetuto, intestazione PDF). Gli artefatti vanno eliminati.
-
-**Check 3 — Frasi spezzate**
-Paragrafi che terminano senza punteggiatura di fine frase:
-```bash
-grep -n "[^.!?»)\]\'\"]$" step-4/<stem>/clean.md \
-  | grep -v "^[0-9]*:#" \
-  | grep -v "^[0-9]*:\s*$" \
-  | grep -v "^\s*[-\*]" \
-  | head -20
-```
-Segnala le righe brevi che finiscono a metà concetto. Uniscile alla riga successiva.
-
-**Check 4 — Header sospetti**
-```bash
-grep -n "^##\? " step-4/<stem>/clean.md | head -40
-```
-Verifica:
- Header con testo >80 caratteri → probabilmente è testo normale, non un header
- Header in MAIUSCOLO non convertito → cambia in formato sentence-case
- Header duplicati (stesso testo due volte) → valuta se unire o rinominare
- `###` senza un `##` padre → salto di gerarchia anomalo
-
-**Check 5 — Sezioni quasi vuote**
-```bash
-python3 -c "
-import re
-text = open('step-4/<stem>/clean.md').read()
-sections = re.split(r'^(#{1,3} .+)$', text, flags=re.MULTILINE)
-for i in range(1, len(sections)-1, 2):
-    header = sections[i].strip()
-    body = sections[i+1].strip() if i+1 < len(sections) else ''
-    if not body:
-        print(f'VUOTA: {header!r}')
-    elif len(body) < 80:
-        print(f'CORTA ({len(body)} char): {header!r} → {body[:60]!r}')
-"
-```
-Le sezioni vuote generano chunk inutili. Eliminale o accorpale alla sezione precedente.
-
-**Check 6 — Gerarchia strutturale**
-```bash
-grep -n "^#\{1,3\} " step-4/<stem>/clean.md | head -50
-```
-Deve esserci un solo `# h1` all'inizio. Poi `## h2` e opzionalmente `### h3`.
-Segnala `###` prima del primo `##`, o più di un `#`.
-
---
-
-**Struttura target dopo la revisione:**
-
-```markdown
-# Titolo del documento
-
-## Sezione principale
-
-### Sottosezione o unità atomica
-
-Testo fluente, frasi complete, nessun artefatto.
-Ogni paragrafo è semanticamente autonomo.
-Una riga vuota separa le sezioni.
-```
-
-**Criterio di qualità:**
-Leggi ogni sezione ad alta voce. Se suona naturale è corretta.
-Se si interrompe c'è una riga spezzata. Se suona ripetitiva c'è un artefatto.
-
---
-
-### Step 5 — Chunking adattivo
-
-**Tipo:** automatico  
-**Input:** `step-4/<stem>/clean.md` + `step-4/<stem>/structure_profile.json`  
-**Output:** `step-5/<stem>/chunks.json`  
-**Script:** `step-5/chunker.py`
-
-```bash
-python step-5/chunker.py --stem documento
-```
-
-Divide il Markdown pulito in chunk. Usa il profilo strutturale
-per scegliere la strategia giusta. Non sa nulla del contenuto —
-si basa solo sulla struttura.
-
-**Regole invarianti per qualsiasi documento:**
-
- Un chunk non attraversa mai il confine tra due sezioni diverse
- Un chunk non spezza mai una frase a metà
- Ogni chunk porta il suo contesto nel prefisso
- L'overlap tra chunk avviene solo su frasi intere,
-  mai tra sezioni diverse
-
-**Parametri:**
-
-| Parametro | Default | Significato |
-|---|---|---|
-| `MIN_CHARS` | 200 | Sotto questa soglia, accorpa al chunk successivo |
-| `MAX_CHARS` | 800 | Sopra questa soglia, spezza su frasi |
-| `OVERLAP_S` | 2 | Frasi di overlap tra sotto-chunk dello stesso boundary |
-
-**Struttura di ogni chunk:**
-
-```json
-{
-  "chunk_id": "sezione_principale__sottosezione_3__s0",
-  "text": "[Sezione principale > Sottosezione 3]\nTesto del chunk...",
-  "sezione": "Sezione principale",
-  "titolo": "Sottosezione 3",
-  "sub_index": 0,
-  "n_chars": 412
-}
-```
-
-Il prefisso `[Sezione > Titolo]` è fondamentale: permette all'embedding
-di catturare il contesto topico del chunk anche quando il testo
-da solo sarebbe ambiguo.
-
---
-
-### Step 6 — Verifica e fix chunk
-
-**Tipo:** automatico  
-**Input:** `step-5/<stem>/chunks.json`  
-**Output:** `step-6/<stem>/chunks.json` verificato + `report.json`  
-**Script:** `step-6/verify_chunks.py`, `step-6/fix_chunks.py`
-
-This step is a loop: verify → automatic fix → re-verify. Do not move on to step 8 while any 🔴 remain.
-
-**Workflow completo:**
-
-```
-1. Verifica
-   python step-6/verify_chunks.py --stem documento
-
-2a. Se ✅ OK o solo 🟡 → vai allo step 8
-
-2b. Se ci sono 🔴 → prova il fix automatico:
-   python step-6/fix_chunks.py --stem documento --dry-run   # anteprima
-   python step-6/fix_chunks.py --stem documento             # applica
-
-3. Ri-verifica dopo il fix:
-   python step-6/verify_chunks.py --stem documento
-
-4. Se rimangono 🔴 → torna allo step 4 e correggi clean.md,
-   poi riesegui dall'inizio:
-   python step-5/chunker.py --stem documento --force
-   python step-6/verify_chunks.py --stem documento
-```
-
-> **Shortcut con Claude:** usa `/step6-fix <stem>` — esegue dry-run, spiega le operazioni, chiede conferma e ri-verifica automaticamente.
-
-#### Senza Claude Code — come leggere l'output e decidere
-
-**1. Leggi l'output di `verify_chunks.py`**
-
-L'output termina con una delle tre condizioni:
-
-| Condizione | Significato | Cosa fare |
-|---|---|---|
-| `✅ N/N documenti senza problemi` | Nessun problema | Vai allo step 8 |
-| `🟡 Solo avvisi minori` | Chunk corti o lunghi, non bloccanti | Puoi andare allo step 8 oppure ottimizzare con `fix_chunks.py` |
-| `⚠️ 0/N documenti senza problemi` + 🔴 | Frasi spezzate o chunk vuoti | Esegui `fix_chunks.py`, poi ri-verifica |
-
-**2. Prima di applicare il fix: leggi il dry-run**
-
-```bash
-python step-6/fix_chunks.py --stem <stem> --dry-run
-```
-
-L'output elenca le operazioni pianificate. Significato:
-
-| Operazione | Cosa fa | Sicurezza |
-|---|---|---|
-| `fondi N chunk incompleti` | Unisce il chunk troncato col successivo | Sempre sicura |
-| `fondi N chunk troppo corti` | Unisce chunk <200 char col successivo | Sicura se il risultato non supera MAX×1.5 |
-| `spezza N chunk troppo lunghi` | Divide chunk >1200 char su frasi | Sicura solo se esistono frasi naturali dove spezzare |
-| `rimuovi N chunk vuoti` | Elimina chunk senza testo | Sempre sicura |
-
-**3. Se i 🔴 persistono dopo il fix**
-
-`fix_chunks.py` non riesce ad autocorreggersi quando il problema
-è nella struttura del testo sorgente. I casi tipici e la soluzione in `clean.md`:
-
-| Sintomo nel report | Causa in `clean.md` | Correzione |
-|---|---|---|
-| Chunk finisce con `:` | Intro di un elenco separata dall'elenco da una riga vuota | Rimuovi la riga vuota tra l'intro e la lista |
-| Chunk finisce a metà parola | Salto di pagina PDF con numero di pagina nel mezzo | Trova e rimuovi il numero di pagina, unisci le righe |
-| Chunk con testo artefatto (URL, watermark) | Artefatto non rimosso allo step 4 | Elimina la sezione in `clean.md` |
-| Chunk con frase enorme non spezzabile | Singolo paragrafo >MAX_CHARS senza frasi intermedie | Spezza manualmente il paragrafo su frasi logiche |
-
-Dopo ogni correzione in `clean.md` riesegui dall'inizio dello step 5:
-
-```bash
-python step-5/chunker.py --stem <stem> --force
-rm -f step-6/<stem>/chunks.json          # forza la rilettura da step-5
-python step-6/verify_chunks.py --stem <stem>
-```
-
-**Cosa verifica `verify_chunks.py`:**
-
- Nessun chunk è sotto `MIN_CHARS` 🟡
- Nessun chunk è sopra `MAX_CHARS × 1.5` 🟡
- Ogni chunk finisce con punteggiatura di fine frase 🔴
-
-**Cosa corregge `fix_chunks.py`:**
-
-| Operazione | Quando |
-|---|---|
-| Rimuovi chunk vuoti | Chunk privi di testo |
-| Fondi chunk incompleto col successivo | Chunk che finisce senza punteggiatura |
-| Fondi chunk troppo corto col successivo | Chunk sotto `MIN_CHARS` |
-| Spezza chunk troppo lungo | Chunk sopra `MAX_CHARS × 1.5` |
-
-**Tabella diagnosi — problemi non risolvibili con fix_chunks:**
-
-| Sintomo | Causa probabile | Soluzione |
-|---|---|---|
-| Molti chunk corti dopo il fix | `MIN_CHARS` troppo alto o testo frammentato nel MD | Abbassa `MIN_CHARS` o correggi step 4 |
-| Chunk spezzato creato dal fix stesso | Frase singola > `MAX_CHARS` non spezzabile | Spezza manualmente il paragrafo in step 4 |
-| Chunk che finisce a metà frase non risolvibile | Salto di pagina PDF non sanato nel MD | Correggi la riga spezzata in `clean.md` |
-
-**Output se tutto ok:**
-
-```
-Totale chunk:     301
-✅ OK:            301
-
-Distribuzione lunghezze:
-  Min:    187 char
-  Max:    923 char
-  Media:  401 char
-
-✅ 1/1 documenti senza problemi
-```
-
---
-
-### Step 7 — Installazione ambiente
-
-**Tipo:** manuale (una volta sola)  
-**Input:** nessuno  
-**Output:** ambiente locale funzionante  
-**Script:** `step-7/check_env.py`
-
-Installa Ollama, scarica i modelli e verifica l'ambiente. Si esegue una volta sola.
-
-Vedi [`step-7/README.md`](step-7/README.md) per istruzioni dettagliate e scelta dei modelli.
-
-```bash
-python step-7/check_env.py
-```
-
---
-
-### Step 8 — Vettorizzazione
-
-**Tipo:** automatico (lento)  
-**Input:** `step-6/<stem>/chunks.json`  
-**Output:** `chroma_db/` popolato  
-**Script:** `step-8/ingest.py`
+## Setup

 ```bash
+python -m venv .venv
 source .venv/bin/activate
-python step-8/ingest.py --stem <nome>
+pip install -r requirements.txt
 ```

-Trasforma ogni chunk in un vettore numerico e lo salva in ChromaDB.
-È il processo più lento — su CPU circa 1 secondo per chunk.
-Per 900 chunk aspetta circa 15 minuti.
-
-**Argomenti:**
-
-| Argomento | Descrizione |
-|---|---|
-| `--stem <nome>` | Processa un singolo documento. Senza questo argomento processa tutti gli stem trovati in `step-6/` |
-| `--force` | Cancella e ricrea la collection se esiste già. Senza `--force`, se la collection è presente lo step viene saltato |
-
-**Quando usare `--force`:**
-Se hai modificato i chunk (es. hai rieseguito step-6 dopo correzioni), la collection in ChromaDB
-contiene ancora i vecchi vettori. `--force` la cancella e la ricrea da zero con i chunk aggiornati.
-
-**Cosa succede per ogni chunk:**
-
-```
-testo del chunk
-    │
-    ▼  Ollama (nomic-embed-text)
-vettore di 768 numeri
-[0.23, -0.41, 0.87, 0.12, ...]
-    │
-    ▼  ChromaDB
-salva: testo + vettore + metadati (sezione, titolo, sub_index)
-```
-
-**Perché 768 numeri:**
-Ogni numero rappresenta una dimensione semantica.
-Testi con significato simile producono vettori simili —
-i loro numeri sono vicini nello spazio a 768 dimensioni.
-Questo è ciò che permette il retrieval semantico.
-
-**Output durante l'esecuzione:**
-
-```
-✅ Ollama OK — nomic-embed-text disponibile
-
-📦 872 chunk da ingestire
-
-  [  1/872] ✓ sezione_1__sotto_1__s0                ETA: 870s
-  [  2/872] ✓ sezione_1__sotto_2__s0                ETA: 867s
-  ...
-  [872/872] ✓ sezione_9__sotto_42__s0               ETA: 0s
-
-✅ Ingestione completata in 718s — 872/872 chunk salvati
-   Collection 'nietzsche' in chroma_db/
-```
-
-`chroma_db/` contiene ora tutti i vettori su disco.
-Non è necessario ripetere questo step a meno che il documento cambi.
-
 ---

-### Step 9 — Pipeline RAG
+## Full workflow

-**Tipo:** interattivo  
-**Input:** `chroma_db/` + domanda dell'utente  
-**Output:** risposta basata sul documento  
-**Script:** `step-9/rag.py`
+### 1. Place the PDF
+
+```
+sources/<nome>.pdf
+```
+
+### 2. Convert the PDF to Markdown

 ```bash
-source .venv/bin/activate
-python step-9/rag.py --stem <nome>
+# Single document
+.venv/bin/python conversione/ --stem <nome>
+
+# All PDFs in sources/
+.venv/bin/python conversione/
+
+# Force a re-run (overwrites existing output)
+.venv/bin/python conversione/ --stem <nome> --force
 ```

-Loop interattivo che risponde a domande sul documento. Configura i parametri in `step-9/config.py` prima di avviare.
+Output in `conversione/<nome>/`:

-Vedi [`step-9/README.md`](step-9/README.md) per la configurazione completa.
+| File | Description |
+|------|-------------|
+| `raw.md` | Raw Markdown: **do not edit** |
+| `clean.md` | Cleaned Markdown, input to the chunker |
+| `structure_profile.json` | Detected structure and chunking strategy |
+| `report.json` | Conversion quality metrics |

---
-
-### Step 10 — Test automatici
-
-**Tipo:** automatico  
-**Input:** sistema completo  
-**Output:** tutti i test verdi  
-**Script:** `step-10/test_pipeline.py` *(da implementare)*
+### 3. Check the Markdown quality (optional)

 ```bash
-python step-10/test_pipeline.py --stem <nome>
+.venv/bin/python conversione/ validate <nome> --detail
 ```

-Verifica ogni componente in isolamento e poi nel sistema completo.
-I test non dipendono dal contenuto del documento — usano dati
-fittizi creati e distrutti in memoria.
+If the score is ≥ 80 and `valid=true`, proceed. Otherwise use `/prepare-md` for
+manual fixes (residual hyphenation, malformed headers, and so on).

-**Struttura dei test:**
+### 4. Generate the chunks

+```bash
+.venv/bin/python chunks/chunker.py --stem <nome>
+
+# Force a re-run
+.venv/bin/python chunks/chunker.py --stem <nome> --force
 ```
-Test unitari — ogni componente isolato
-  ✓ split_sentences non spezza le frasi
-  ✓ parse_markdown rileva la struttura corretta
-  ✓ chunk_sezione rispetta i boundary
-  ✓ il prefisso è sempre presente in ogni chunk

-Test integrazione — i componenti parlano tra loro
-  ✓ Ollama è raggiungibile
-  ✓ i modelli sono disponibili
-  ✓ l'embedding produce 768 dimensioni
-  ✓ testi diversi producono vettori diversi
-  ✓ ChromaDB scrive e legge correttamente
+The chunking strategy (`h3_aware`, `h2_paragraph_split`, `paragraph`,
+`sliding_window`) is selected automatically from `structure_profile.json`.
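
A minimal sketch of how a strategy name could be derived from a structure profile. The keys `strategia_chunking` and `livello_struttura`, and the level-to-strategy mapping, are assumptions based on the profile example in the earlier README; this is not a description of what `chunker.py` actually does.

```python
# Hypothetical mapping from detected structure level to strategy name.
# The four names come from the README; the level numbering is an assumption.
STRATEGY_BY_LEVEL = {
    3: "h3_aware",            # regular ### headings
    2: "h2_paragraph_split",  # ## headings, few or no ###
    1: "paragraph",           # only blank-line-separated paragraphs
    0: "sliding_window",      # flat text with no recognisable structure
}


def pick_strategy(profile: dict) -> str:
    """Prefer an explicit strategy in the profile, else derive it from the level."""
    explicit = profile.get("strategia_chunking")
    if explicit:
        return explicit
    level = profile.get("livello_struttura", 0)
    return STRATEGY_BY_LEVEL.get(level, "sliding_window")
```

Falling back to `sliding_window` for an unknown level keeps the sketch usable on profiles that carry neither key.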

-Test qualità — il sistema si comporta bene
-  ✓ il retrieval trova il chunk pertinente
-  ✓ il retrieval non trova il chunk non pertinente
-  ✓ il LLM usa il contesto fornito
-  ✓ il LLM ammette quando la risposta non è nel contesto
+Output in `chunks/<nome>/`:
+
+| File | Description |
+|------|-------------|
+| `chunks.json` | List of chunks with text, section, title, and metadata |
+| `report.json` | Chunking statistics and anomalies |
+
+### 5. Verify the chunks
+
+```bash
+.venv/bin/python chunks/verify_chunks.py --stem <nome>
+```
+
+Possible verdicts:
+
+| Verdict | Meaning | What to do |
+|---------|---------|------------|
+| `ok` | No problems | Proceed to vectorization |
+| `warnings_only` | Minor warnings only | Proceed, or run the fix |
+| `blocked` | Blocking problems (incomplete chunks) | Run the fix |
+
+### 6. Fix the problems (if needed)
+
+```bash
+# Preview the fixes without applying them
+.venv/bin/python chunks/fix_chunks.py --stem <nome> --dry-run
+
+# Apply the fixes (recursive, up to 3 iterations)
+.venv/bin/python chunks/fix_chunks.py --stem <nome>
+```
+
+The fix automatically handles incomplete chunks (broken sentence), chunks that
+are too short (merged into the next one), and chunks that are too long (split
+on punctuation). Every chunk always ends on a sentence boundary.
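
The verify, fix, re-verify cycle with an iteration cap can be sketched as follows. This is an illustrative toy under stated assumptions, not the logic of `fix_chunks.py`: chunks are plain strings, and the names `verify`, `merge_incomplete`, and `fix_until_clean` are hypothetical. Only the "incomplete chunk" case is handled.

```python
SENTENCE_END = (".", "!", "?", "»")


def verify(chunks: list[str]) -> str:
    """Return 'blocked' if any chunk does not end on sentence punctuation."""
    complete = all(c.rstrip().endswith(SENTENCE_END) for c in chunks)
    return "ok" if complete else "blocked"


def merge_incomplete(chunks: list[str]) -> list[str]:
    """One pass: merge each incomplete chunk into the chunk that follows it."""
    out: list[str] = []
    carry = ""
    for c in chunks:
        text = f"{carry} {c}".strip() if carry else c
        if text.rstrip().endswith(SENTENCE_END):
            out.append(text)
            carry = ""
        else:
            carry = text
    if carry:
        out.append(carry)  # trailing fragment has nothing to merge into
    return out


def fix_until_clean(chunks: list[str], max_iterations: int = 3):
    """Alternate verify and fix, stopping when clean or when the cap is hit."""
    for used in range(max_iterations + 1):
        verdict = verify(chunks)
        if verdict != "blocked" or used == max_iterations:
            return chunks, verdict, used
        chunks = merge_incomplete(chunks)
```

A trailing fragment with no following chunk stays `blocked` after the cap is reached, which mirrors the case where the source Markdown itself has to be corrected.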
+
+### 7. Run the ingestion
+
+First check that Ollama and the models are ready:
+
+```bash
+.venv/bin/python ollama/check_env.py
+```
+
+Then generate the embeddings and store them in ChromaDB:
+
+```bash
+# Single document → collection with the same name
+.venv/bin/python ingestion/ingest.py --stem <nome>
+
+# Multiple documents → one shared collection
+.venv/bin/python ingestion/ingest.py --collection <nome-collection> --stems doc1 doc2 doc3
+
+# All documents in chunks/ → separate collections
+.venv/bin/python ingestion/ingest.py
+
+# Regenerate after changing the model or updating the chunks
+.venv/bin/python ingestion/ingest.py --stem <nome> --force
+```
+
+With `--collection`, chunks from different documents are merged into a single
+collection. The `source` metadata field identifies the document each chunk came from.
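
As a rough illustration of the multi-document case, the sketch below tags each chunk with its `source` and builds ids that stay unique across documents. The function name `merge_for_collection` and the `stem::chunk_id` id format are hypothetical and are not taken from `ingest.py`; the chunk fields (`chunk_id`, `text`, `sezione`, `titolo`) follow the chunk structure used elsewhere in this repository.

```python
def merge_for_collection(chunks_by_stem: dict[str, list[dict]]) -> list[dict]:
    """Flatten per-document chunk lists into records for one shared collection."""
    merged: list[dict] = []
    for stem, chunks in chunks_by_stem.items():
        for chunk in chunks:
            merged.append({
                # Prefixing with the stem avoids id clashes between documents
                # that happen to share section titles (assumed id scheme).
                "id": f"{stem}::{chunk['chunk_id']}",
                "text": chunk["text"],
                "metadata": {
                    "source": stem,
                    "sezione": chunk["sezione"],
                    "titolo": chunk["titolo"],
                },
            })
    return merged
```

Keeping `source` in the metadata is what lets a retrieval step report, or filter by, the originating document.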
+
+Output goes to `chroma_db/` (ignored by git).
+
+---
+
+## Chunking configuration
+
+All parameters live in [`chunks/config.py`](chunks/config.py):
+
+```python
+TARGET_CHARS    = 600   # target chunk size
+CHUNK_TOLERANCE = 0.25  # ±25% → acceptable range [450, 750]
+OVERLAP_SENTENCES = 1   # sentences of overlap between consecutive chunks
+PROTECT_TABLES  = True  # tables are emitted as atomic chunks
+FIX_MAX_ITERATIONS = 3  # maximum iterations of the recursive fix
+```
+
+Each strategy can define its own values through `STRATEGY_OVERRIDES`.
+Edit only this file: chunker, verify, and fix pick up the changes automatically.
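
To make the tolerance arithmetic concrete, here is a small sketch that computes the acceptable size range for a strategy. The override keys `target_chars` and `tolerance` match those read by `_ov` in `chunker.py`; the `sliding_window` override value and the helper name `char_range` are hypothetical.

```python
TARGET_CHARS = 600
CHUNK_TOLERANCE = 0.25

# Hypothetical override, for illustration only.
STRATEGY_OVERRIDES = {"sliding_window": {"target_chars": 400}}


def char_range(strategy: str) -> tuple[int, int]:
    """Return (lower, upper) chunk size in characters for a strategy."""
    ov = STRATEGY_OVERRIDES.get(strategy, {})
    target = ov.get("target_chars", TARGET_CHARS)
    tolerance = ov.get("tolerance", CHUNK_TOLERANCE)
    return int(target * (1 - tolerance)), int(target * (1 + tolerance))
```

With the defaults, `char_range("h3_aware")` gives `(450, 750)`, the range quoted in the config comment.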
+
+---
+
+## Model configuration
+
+All LLM and embedding parameters live in [`config.py`](config.py):
+
+```python
+OLLAMA_MODEL = "qwen3.5:4b"      # LLM used for generation
+EMBED_MODEL  = "nomic-embed-text" # embedding model (must match the one used at ingestion)
+TEMPERATURE  = 0.2                # 0.0 = deterministic, higher values = more creative
+NO_THINK     = True               # True = direct answer (faster), False = with reasoning
+TOP_K        = 6                  # number of chunks retrieved per question
+OLLAMA_URL   = "http://localhost:11434"
+```
+
+> If you change `EMBED_MODEL` you must re-run the ingestion with `--force`: the stored
+> embeddings and the query embeddings must come from the same model.
+
+---
+
+## Testing the model (without RAG)
+
+Check that the LLM responds correctly before involving the pipeline:
+
+```bash
+.venv/bin/python ollama/test_ollama.py
+```
+
+The model used is the one configured in `config.py` (`OLLAMA_MODEL`).  
+Type `exit` to quit.
+
+---
+
+## Retrieval only (no generation)
+
+Useful for checking that the right chunks are retrieved before diagnosing
+wrong answers:
+
+```bash
+# Single document
+.venv/bin/python retrieve.py --stem <nome>
+
+# Multi-document collection
+.venv/bin/python retrieve.py --collection <nome-collection>
+
+# Change the number of chunks returned
+.venv/bin/python retrieve.py --stem <nome> --top-k 10
+```
+
+In the interactive loop:
+
+| Command | Effect |
+|---------|--------|
+| `<query>` | Shows the most similar chunks with their similarity score (text truncated) |
+| `<query> -f` | Full text of the chunks |
+| `exit` | Quits |
+
+---
+
+## Interactive RAG
+
+Answers natural-language questions using the chunks indexed in ChromaDB:
+
+```bash
+# Single document
+.venv/bin/python rag.py --stem <nome>
+
+# Multi-document collection
+.venv/bin/python rag.py --collection <nome-collection>
+```
+
+In the interactive loop:
+
+| Command | Effect |
+|---------|--------|
+| `<question>` | Answer generated by the LLM with context from the chunks |
+| `<question> -v` | Answer plus the retrieved chunks, with similarity score and source document |
+| `exit` | Quits |
+
+---
+
+## Test
+
+```bash
+.venv/bin/python -m pytest tests/
 ```

 ---

-## Principi di progettazione
+## References

-**Atomico**
-Ogni step fa una cosa sola. Il chunker non sa niente di Ollama.
-L'ingestione non sa niente del MD originale.
-Se un pezzo si rompe, sai esattamente dove.
-
-**Verificabile**
-Ogni step ha un criterio di completamento oggettivo.
-Non si passa allo step successivo finché il precedente non è verificato.
-
-**Reversibile**
-Puoi tornare indietro senza perdere il lavoro degli altri step.
-Cambi il MD? Riesegui solo step 5 e 8.
-Cambi i parametri del chunker? Riesegui solo step 5 e 8.
-Non si riparte mai da zero.
-
-**Senza assunzioni**
-Il sistema non assume nulla sulla struttura del documento.
-Rileva il livello strutturale e si adatta.
-Funziona su libri, manuali, articoli, contratti, dispense.
-
-**Tutto locale**
-Nessuna chiamata a API esterne.
-Nessun dato trasmesso fuori dalla macchina.
-Nessun costo di utilizzo.
+- [`conversione/README.md`](conversione/README.md): details on the PDF→Markdown pipeline and the supported document types
+- [`ingestion/README.md`](ingestion/README.md): embedding configuration, model choice, `--force` rules
@@ -1,17 +1,17 @@
 #!/usr/bin/env python3
 """
-Step 5 — Chunking adattivo
+Adaptive chunking

-Divide il Markdown revisionato (step 4) in chunk semantici pronti per la
+Splits the revised Markdown into semantic chunks ready for
 vectorization. The strategy depends on the document's structural profile.

-Input:  step-4/<stem>/clean.md + step-4/<stem>/structure_profile.json
-Output: step-5/<stem>/chunks.json
+Input:  conversione/<stem>/clean.md + conversione/<stem>/structure_profile.json
+Output: chunks/<stem>/chunks.json

 Usage:
-    python step-5/chunker.py                    # tutti i documenti in step-4/
-    python step-5/chunker.py --stem documento   # un solo documento
-    python step-5/chunker.py --stem documento --force
+    python chunks/chunker.py                    # all documents in conversione/
+    python chunks/chunker.py --stem documento   # a single document
+    python chunks/chunker.py --stem documento --force
 """

 import argparse
@@ -20,86 +20,129 @@ import re
 import sys
 from pathlib import Path

-
-# ─── Parametri ────────────────────────────────────────────────────────────────
-
-MIN_CHARS = 200   # sotto questa soglia → accorpa al chunk successivo
-MAX_CHARS = 800   # sopra questa soglia → spezza su frasi
-OVERLAP_S = 2     # frasi di overlap tra sotto-chunk dello stesso boundary
+_HERE = Path(__file__).resolve().parent
+if str(_HERE) not in sys.path:
+    sys.path.insert(0, str(_HERE))
+import config as cfg


 # ─── Utilità ──────────────────────────────────────────────────────────────────

 def split_sentences(text: str) -> list[str]:
-    """
-    Divide il testo in frasi senza spezzare abbreviazioni comuni.
-    Split su punteggiatura finale (.!?») seguita da spazio + lettera maiuscola.
-    """
-    # Split conservativo: solo quando la punteggiatura è seguita da spazio
-    # e la parola successiva inizia in maiuscolo (o è fine stringa).
    parts = re.split(r'(?<=[.!?»])\s+(?=[A-ZÀÈÉÌÒÙA-Z\"])', text.strip())
-    # Se non trova nulla con maiuscola, usa split semplice
    if len(parts) <= 1:
        parts = re.split(r'(?<=[.!?»])\s+', text.strip())
    return [p.strip() for p in parts if p.strip()]


 def slugify(s: str, max_len: int = 60) -> str:
-    """Converti una stringa in slug per chunk_id."""
    s = s.lower()
    s = re.sub(r'[^\w\s-]', '', s)
    s = re.sub(r'[\s_-]+', '_', s).strip('_')
    return s[:max_len] if s else "section"


+def _is_table_block(text: str) -> bool:
+    """True se il testo è prevalentemente una tabella Markdown (≥50% righe con |)."""
+    lines = [l for l in text.strip().splitlines() if l.strip()]
+    if not lines:
+        return False
+    table_lines = sum(1 for l in lines if l.strip().startswith("|"))
+    return table_lines / len(lines) >= 0.5
+
+
+def _ov(strategy: str) -> tuple[int, float, int]:
+    """Legge (target_chars, tolerance, overlap) dagli override di strategia."""
+    ov = cfg.STRATEGY_OVERRIDES.get(strategy, {})
+    target    = ov.get("target_chars", cfg.TARGET_CHARS)
+    tolerance = ov.get("tolerance",    cfg.CHUNK_TOLERANCE)
+    overlap   = ov.get("overlap",      cfg.OVERLAP_SENTENCES)
+    return target, tolerance, overlap
+
+
+# ─── Core: split in sotto-chunk orientato al target ───────────────────────────
+
 def make_sub_chunks(
    body: str,
    prefix: str,
    sezione: str,
    titolo: str,
-    max_chars: int,
+    target: int,
+    tolerance: float,
    overlap_s: int,
 ) -> list[dict]:
+    """Divide body in chunk il più vicini possibile a `target` char.
+
+    Logica:
+      lower = target × (1 − tolerance)   → soglia minima per emettere
+      upper = target × (1 + tolerance)   → limite massimo
+
+    Si accumulano frasi intere finché la successiva farebbe superare `upper`.
+    A quel punto si emette (siamo vicini al target) e si riparte con overlap.
+    Ogni chunk termina sempre su un confine di frase; non attraversa mai
+    il boundary dell'header corrente.
    """
-    Suddivide un body in sotto-chunk rispettando max_chars.
-    Aggiunge overlap_s frasi di overlap tra sotto-chunk consecutivi.
-    Non attraversa mai i confini del body.
-    """
+    if cfg.PROTECT_TABLES and _is_table_block(body):
+        chunk_text = prefix + body
+        return [{
+            "chunk_id": f"{slugify(sezione)}__{slugify(titolo)}__s0",
+            "text": chunk_text,
+            "sezione": sezione,
+            "titolo": titolo,
+            "sub_index": 0,
+            "n_chars": len(chunk_text),
+        }]
+
+    # Soglia calcolata sul corpo (n_chars finale = prefix_len + body_len).
+    prefix_len = len(prefix)
+    upper_body = max(1, int(target * (1 + tolerance)) - prefix_len)
+
    sentences = split_sentences(body)
    if not sentences:
        return []

-    chunks = []
+    chunks: list[dict] = []
    current: list[str] = []
    current_len = 0
    sub_index = 0

-    i = 0
-    while i < len(sentences):
-        sent = sentences[i]
-        # +1 per lo spazio di separazione
-        if not current or current_len + len(sent) + 1 <= max_chars:
-            current.append(sent)
-            current_len += len(sent) + (1 if len(current) > 1 else 0)
-            i += 1
-        else:
-            # Flush del chunk corrente
-            chunk_text = prefix + " ".join(current)
-            chunks.append({
-                "chunk_id": f"{slugify(sezione)}__{slugify(titolo)}__s{sub_index}",
-                "text": chunk_text,
-                "sezione": sezione,
-                "titolo": titolo,
-                "sub_index": sub_index,
-                "n_chars": len(chunk_text),
-            })
-            sub_index += 1
-            # Overlap: riparti dalle ultime overlap_s frasi
-            overlap = current[-overlap_s:] if overlap_s and len(current) > overlap_s else []
-            current = overlap[:]
-            current_len = sum(len(s) + 1 for s in current)
+    def _emit() -> None:
+        nonlocal current, current_len, sub_index
+        chunk_text = prefix + " ".join(current)
+        chunks.append({
+            "chunk_id": f"{slugify(sezione)}__{slugify(titolo)}__s{sub_index}",
+            "text": chunk_text,
+            "sezione": sezione,
+            "titolo": titolo,
+            "sub_index": sub_index,
+            "n_chars": len(chunk_text),
+        })
+        overlap = current[-overlap_s:] if overlap_s and len(current) > overlap_s else []
+        current = overlap[:]
+        # Lunghezza corretta dell'overlap (n-1 spazi tra n frasi).
+        current_len = sum(len(s) for s in current) + max(0, len(current) - 1)
+        sub_index += 1
+
+    for sent in sentences:
+        sep     = 1 if current else 0
+        new_len = current_len + sep + len(sent)
+
+        if new_len <= upper_body:
+            # Ancora entro il limite del corpo: aggiungi e continua.
+            current.append(sent)
+            current_len = new_len
+        elif current:
+            # La frase successiva sfora il limite: emetti il chunk corrente
+            # (che termina su frase completa) poi inizia il nuovo con questa frase.
+            _emit()
+            current.append(sent)
+            current_len += (1 if current[:-1] else 0) + len(sent)
+        else:
+            # Chunk vuoto: la singola frase supera già il limite — emettiamo così com'è.
+            current.append(sent)
+            current_len = len(sent)
+            _emit()

-    # Flush delle frasi rimanenti
    if current:
        chunk_text = prefix + " ".join(current)
        chunks.append({
@@ -117,10 +160,6 @@ def make_sub_chunks(
 # ─── Parser Markdown ──────────────────────────────────────────────────────────

 def parse_h3_sections(text: str) -> list[dict]:
-    """
-    Parsa il documento in sezioni (sezione h2, titolo h3, body).
-    Testo prima del primo header viene assegnato a sezione vuota.
-    """
    sections = []
    current_h2 = ""
    current_h3 = ""
@@ -137,7 +176,6 @@ def parse_h3_sections(text: str) -> list[dict]:

    for line in text.splitlines():
        if re.match(r"^# ", line):
-            # h1 = titolo documento, non crea sezione
            flush()
            current_h2 = line[2:].strip()
            current_h3 = ""
@@ -159,7 +197,6 @@ def parse_h3_sections(text: str) -> list[dict]:


 def parse_h2_sections(text: str) -> list[dict]:
-    """Parsa il documento in sezioni h2 con il loro testo completo."""
    sections = []
    current_h2 = ""
    current_body_lines: list[str] = []
@@ -188,15 +225,11 @@ def parse_h2_sections(text: str) -> list[dict]:
 # ─── Strategie di chunking ────────────────────────────────────────────────────

 def chunk_h3_aware(text: str, stem: str) -> list[dict]:
-    """
-    Strategia h3_aware: boundary su ###.
-    Sezioni piccole (< MIN_CHARS) vengono accorpate alla successiva
-    purché appartengano allo stesso ## padre.
-    Sezioni grandi (> MAX_CHARS) vengono suddivise su frasi.
-    """
+    target, tolerance, overlap = _ov("h3_aware")
+    lower = int(target * (1 - tolerance))
+
    sections = parse_h3_sections(text)

-    # Merge greedy: accorpa al successivo se stesso h2 e body piccolo
    merged: list[dict] = []
    pending: dict | None = None

@@ -206,7 +239,7 @@ def chunk_h3_aware(text: str, stem: str) -> list[dict]:
            continue

        if (pending["sezione"] == sec["sezione"]
-                and len(pending["body"]) < MIN_CHARS):
+                and len(pending["body"]) < lower):
            sep_title = " / ".join(filter(None, [pending["titolo"], sec["titolo"]]))
            pending = {
                "sezione": pending["sezione"],
@@ -220,45 +253,39 @@ def chunk_h3_aware(text: str, stem: str) -> list[dict]:
    if pending:
        merged.append(pending)

-    # Genera chunk con eventuale split su frasi
    chunks = []
    for sec in merged:
        sezione = sec["sezione"] or stem
-        titolo = sec["titolo"] or ""
-        body = sec["body"]
-
-        prefix = f"[{sezione} > {titolo}]\n" if titolo else f"[{sezione}]\n"
-        sub = make_sub_chunks(body, prefix, sezione, titolo, MAX_CHARS, OVERLAP_S)
-        chunks.extend(sub)
+        titolo  = sec["titolo"] or ""
+        body    = sec["body"]
+        prefix  = f"[{sezione} > {titolo}]\n" if titolo else f"[{sezione}]\n"
+        chunks.extend(make_sub_chunks(body, prefix, sezione, titolo, target, tolerance, overlap))

    return chunks


 def chunk_h2_paragraph_split(text: str, stem: str) -> list[dict]:
-    """
-    Strategia h2_paragraph_split: boundary su ##.
-    All'interno di ogni ## i paragrafi vengono usati come sotto-unità.
-    """
+    target, tolerance, overlap = _ov("h2_paragraph_split")
+    lower = int(target * (1 - tolerance))
+
    sections = parse_h2_sections(text)
    chunks = []

    for sec in sections:
        sezione = sec["sezione"] or stem
-        body = sec["body"]
-        prefix = f"[{sezione}]\n"
+        body    = sec["body"]
+        prefix  = f"[{sezione}]\n"

-        # Suddividi in paragrafi interni (righe vuote doppie)
        paragraphs = [
            p.strip()
            for p in re.split(r"\n{2,}", body)
            if p.strip() and not re.match(r"^#+\s", p.strip())
        ]

-        # Merge paragrafi piccoli
        merged_pars: list[str] = []
        pending = ""
        for par in paragraphs:
-            if pending and len(pending) < MIN_CHARS:
+            if pending and len(pending) < lower:
                pending = pending + "\n\n" + par
            else:
                if pending:
@@ -268,7 +295,7 @@ def chunk_h2_paragraph_split(text: str, stem: str) -> list[dict]:
            merged_pars.append(pending)

        for idx, par in enumerate(merged_pars):
-            sub = make_sub_chunks(par, prefix, sezione, f"par{idx}", MAX_CHARS, OVERLAP_S)
+            sub = make_sub_chunks(par, prefix, sezione, f"par{idx}", target, tolerance, overlap)
            for c in sub:
                c["chunk_id"] = f"{slugify(sezione)}__p{idx}__s{c['sub_index']}"
            chunks.extend(sub)
@@ -277,9 +304,9 @@ def chunk_h2_paragraph_split(text: str, stem: str) -> list[dict]:


 def chunk_paragraph(text: str, stem: str) -> list[dict]:
-    """
-    Strategia paragraph: boundary su paragrafo (doppia riga vuota).
-    """
+    target, tolerance, overlap = _ov("paragraph")
+    lower = int(target * (1 - tolerance))
+
    paragraphs = [
        p.strip()
        for p in re.split(r"\n{2,}", text)
@@ -287,11 +314,10 @@ def chunk_paragraph(text: str, stem: str) -> list[dict]:
    ]
    prefix = f"[Documento: {stem}]\n"

-    # Merge paragrafi piccoli
    merged: list[str] = []
    pending = ""
    for par in paragraphs:
-        if pending and len(pending) < MIN_CHARS:
+        if pending and len(pending) < lower:
            pending = pending + "\n\n" + par
        else:
            if pending:
@@ -302,7 +328,7 @@ def chunk_paragraph(text: str, stem: str) -> list[dict]:

    chunks = []
    for idx, par in enumerate(merged):
-        sub = make_sub_chunks(par, prefix, stem, f"par{idx}", MAX_CHARS, OVERLAP_S)
+        sub = make_sub_chunks(par, prefix, stem, f"par{idx}", target, tolerance, overlap)
        for c in sub:
            c["chunk_id"] = f"para__{idx}__s{c['sub_index']}"
        chunks.extend(sub)
@@ -311,10 +337,9 @@ def chunk_paragraph(text: str, stem: str) -> list[dict]:


 def chunk_sliding_window(text: str, stem: str) -> list[dict]:
-    """
-    Strategia sliding_window: finestre di MAX_CHARS con OVERLAP_S frasi di overlap.
-    Usata per testi piatti senza struttura (livello 0).
-    """
+    target, tolerance, overlap = _ov("sliding_window")
+    upper = int(target * (1 + tolerance))
+
    sentences = split_sentences(text)
    prefix = f"[Documento: {stem}]\n"

@@ -329,10 +354,11 @@ def chunk_sliding_window(text: str, stem: str) -> list[dict]:
        j = i
        while j < len(sentences):
            s = sentences[j]
-            if window and cur_len + len(s) + 1 > MAX_CHARS:
+            sep = 1 if window else 0
+            if window and cur_len + sep + len(s) > upper:
                break
            window.append(s)
-            cur_len += len(s) + (1 if len(window) > 1 else 0)
+            cur_len += sep + len(s)
            j += 1

        if not window:
@@ -349,8 +375,7 @@ def chunk_sliding_window(text: str, stem: str) -> list[dict]:
            "n_chars": len(chunk_text),
        })
        win_idx += 1
-        # Avanza di (window_size - overlap), almeno 1
-        i += max(1, len(window) - OVERLAP_S)
+        i += max(1, len(window) - overlap)

    return chunks

@@ -358,36 +383,36 @@ def chunk_sliding_window(text: str, stem: str) -> list[dict]:
 # ─── Dispatcher ───────────────────────────────────────────────────────────────

 _STRATEGIES: dict[str, callable] = {
-    "h3_aware": chunk_h3_aware,
-    "h2_paragraph_split": chunk_h2_paragraph_split,
-    "paragraph": chunk_paragraph,
-    "sliding_window": chunk_sliding_window,
+    "h3_aware":            chunk_h3_aware,
+    "h2_paragraph_split":  chunk_h2_paragraph_split,
+    "paragraph":           chunk_paragraph,
+    "sliding_window":      chunk_sliding_window,
 }


 def chunk_document(clean_md: Path, profile: dict, stem: str) -> list[dict]:
-    text = clean_md.read_text(encoding="utf-8")
+    text      = clean_md.read_text(encoding="utf-8")
    strategia = profile.get("strategia_chunking", "paragraph")
-    fn = _STRATEGIES.get(strategia, chunk_paragraph)
+    fn        = _STRATEGIES.get(strategia, chunk_paragraph)
    return fn(text, stem)


 # ─── Per-document processing ──────────────────────────────────────────────────

 def process_stem(stem: str, project_root: Path, force: bool) -> bool:
-    step4_dir = project_root / "step-4" / stem
-    out_dir = project_root / "step-5" / stem
-    clean_md = step4_dir / "clean.md"
-    profile_path = step4_dir / "structure_profile.json"
-    out_file = out_dir / "chunks.json"
+    conv_dir     = project_root / "conversione" / stem
+    out_dir      = project_root / "chunks" / stem
+    clean_md     = conv_dir / "clean.md"
+    profile_path = conv_dir / "structure_profile.json"
+    out_file     = out_dir / "chunks.json"

    print(f"\nDocumento: {stem}")

    if not clean_md.exists():
-        print(f"  ✗ clean.md non trovato in step-4/{stem}/ — skip")
+        print(f"  ✗ clean.md non trovato in conversione/{stem}/ — skip")
        return False
    if not profile_path.exists():
-        print(f"  ✗ structure_profile.json non trovato in step-4/{stem}/ — skip")
+        print(f"  ✗ structure_profile.json non trovato in conversione/{stem}/ — skip")
        return False

    if out_file.exists() and not force:
@@ -395,7 +420,7 @@ def process_stem(stem: str, project_root: Path, force: bool) -> bool:
        print(f"       (usa --force per rieseguire)")
        return True

-    profile = json.loads(profile_path.read_text(encoding="utf-8"))
+    profile   = json.loads(profile_path.read_text(encoding="utf-8"))
    strategia = profile.get("strategia_chunking", "paragraph")
    print(f"  Strategia: {strategia}")

@@ -410,20 +435,32 @@ def process_stem(stem: str, project_root: Path, force: bool) -> bool:
        json.dumps(chunks, ensure_ascii=False, indent=2), encoding="utf-8"
    )

-    lengths = [c["n_chars"] for c in chunks]
-    min_c = min(lengths)
-    max_c = max(lengths)
-    avg_c = int(sum(lengths) / len(lengths))
-    short = sum(1 for l in lengths if l < MIN_CHARS)
-    long_ = sum(1 for l in lengths if l > MAX_CHARS * 1.5)
+    target, tolerance, _ = _ov(strategia)
+    lower = int(target * (1 - tolerance))
+    upper = int(target * (1 + tolerance))

+    meta = {"strategy": strategia, "target_chars": target,
+            "min_chars": lower, "max_chars": upper}
+    (out_dir / "meta.json").write_text(
+        json.dumps(meta, ensure_ascii=False), encoding="utf-8"
+    )
+
+    lengths = [c["n_chars"] for c in chunks]
+    min_c  = min(lengths)
+    max_c  = max(lengths)
+    avg_c  = int(sum(lengths) / len(lengths))
+    short  = sum(1 for l in lengths if l < lower)
+    long_  = sum(1 for l in lengths if l > upper)
+
+    print(f"  Target: {target} char  ±{int(tolerance*100)}%  "
+          f"→ range [{lower}, {upper}]")
    print(f"  Chunk totali: {len(chunks)}")
    print(f"  Min: {min_c} char  Max: {max_c} char  Media: {avg_c} char")
    if short:
-        print(f"  ⚠️  {short} chunk sotto MIN_CHARS ({MIN_CHARS})")
+        print(f"  ⚠️  {short} chunk sotto lower ({lower})")
    if long_:
-        print(f"  ⚠️  {long_} chunk sopra MAX_CHARS×1.5 ({int(MAX_CHARS * 1.5)})")
-    print(f"  ✅ chunks.json salvato in step-5/{stem}/")
+        print(f"  ⚠️  {long_} chunk sopra upper ({upper})")
+    print(f"  ✅ chunks.json salvato in chunks/{stem}/")
    return True


@@ -432,26 +469,29 @@ def process_stem(stem: str, project_root: Path, force: bool) -> bool:
 if __name__ == "__main__":
    project_root = Path(__file__).parent.parent

-    parser = argparse.ArgumentParser(description="Step 5 — Chunking adattivo")
-    parser.add_argument("--stem", help="Nome del documento (sottocartella di step-4/)")
+    parser = argparse.ArgumentParser(description="Chunking adattivo")
+    parser.add_argument("--stem", help="Nome del documento (sottocartella di conversione/)")
    parser.add_argument("--force", action="store_true", help="Riesegui anche se già presente")
    args = parser.parse_args()

    if args.stem:
        stems = [args.stem]
    else:
-        step4_dir = project_root / "step-4"
-        if not step4_dir.exists():
-            print(f"Errore: cartella step-4/ non trovata in {project_root}")
+        conv_dir = project_root / "conversione"
+        if not conv_dir.exists():
+            print(f"Errore: cartella conversione/ non trovata in {project_root}")
            sys.exit(1)
-        stems = sorted(p.name for p in step4_dir.iterdir() if p.is_dir())
+        stems = sorted(
+            p.name for p in conv_dir.iterdir()
+            if p.is_dir() and (p / "clean.md").exists()
+        )
        if not stems:
-            print(f"Errore: nessun documento trovato in step-4/")
+            print(f"Errore: nessun documento trovato in conversione/")
            sys.exit(1)

    results = [process_stem(s, project_root, args.force) for s in stems]

-    ok = sum(results)
+    ok    = sum(results)
    total = len(results)
    print(f"\n{'✅' if all(results) else '⚠️ '} {ok}/{total} documenti processati")
    sys.exit(0 if all(results) else 1)
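The target-oriented packing in `make_sub_chunks` can be sketched in isolation as follows. This is a simplified illustration, not the repository code: `pack_sentences` is a hypothetical name, the prefix and the chunk metadata are left out, and the sentence splitter only assumes the `SENTENCE_SPLIT_PATTERN` regex shown in the config.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Split on whitespace that follows a sentence terminator (. ! ? »)."""
    return [p.strip() for p in re.split(r"(?<=[.!?»])\s+", text) if p.strip()]


def pack_sentences(text: str, upper: int, overlap: int = 1) -> list[str]:
    """Greedily pack whole sentences into chunks of at most `upper` chars.

    When the next sentence would push the chunk past `upper`, the chunk is
    emitted and the last `overlap` sentences are carried into the next one
    (only if the emitted chunk held more sentences than the overlap).
    A single sentence longer than `upper` becomes a chunk of its own.
    """
    chunks: list[str] = []
    current: list[str] = []
    for sent in split_sentences(text):
        if current and len(" ".join(current + [sent])) > upper:
            chunks.append(" ".join(current))
            current = current[-overlap:] if overlap and len(current) > overlap else []
        current.append(sent)
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    sample = "Aaaa bbbb. Cccc dddd. Eeee ffff. Gggg hhhh."
    for chunk in pack_sentences(sample, upper=25, overlap=1):
        print(len(chunk), repr(chunk))
```

With `overlap=1` each chunk repeats the last sentence of the previous one, so neighbouring chunks share context at the cost of some duplicated text. As in the diff, the carried-over sentences count toward the next chunk's length.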
@@ -0,0 +1,88 @@
+#!/usr/bin/env python3
+"""
+Parametri di configurazione della pipeline di chunking.
+
+Modifica questo file per cambiare il comportamento di chunker.py,
+verify_chunks.py e fix_chunks.py senza toccare il codice applicativo.
+"""
+
+# ─── Grandezza target dei chunk ───────────────────────────────────────────────
+#
+# TARGET_CHARS is the ideal size the chunker aims for.
+# CHUNK_TOLERANCE is the relative tolerance (e.g. 0.25 = ±25%).
+#
+#   accepted range = [TARGET × (1 − TOL),  TARGET × (1 + TOL)]
+#
+# Example: TARGET=600 and TOL=0.25 give an accepted range of 450 to 750 chars.
+# The global default below (TARGET=300) gives 225 to 375; STRATEGY_OVERRIDES
+# sets the per-strategy values. Chunks always end on a sentence boundary,
+# so a single very long sentence can still exceed the upper bound.
+#
+TARGET_CHARS    = 300
+CHUNK_TOLERANCE = 0.25
+
+# ─── Overlap ──────────────────────────────────────────────────────────────────
+
+# Numero di frasi ripetute all'inizio del chunk successivo per preservare
+# il contesto tra chunk adiacenti della stessa sezione.
+OVERLAP_SENTENCES = 1
+
+# ─── Soglie di validazione ────────────────────────────────────────────────────
+
+# fix_chunks.py spezza un chunk "too_long" solo se supera upper × questo fattore.
+# Es. upper=750, fattore=1.5 → split solo per chunk > 1125 char.
+# Chunk in [upper, upper×fattore] restano come warning non bloccanti.
+SPLIT_THRESHOLD_FACTOR = 1.5
+
+MATH_SYMS_MIN = 3   # min. simboli math per declassare incomplete → incomplete_math
+
+# ─── Pattern e formato ────────────────────────────────────────────────────────
+
+SENTENCE_SPLIT_PATTERN = r"(?<=[.!?»])\s+"
+PREFIX_TEMPLATE = "[{sezione} > {titolo}]"
+
+# ─── Protezione contenuti speciali ────────────────────────────────────────────
+
+# Se True, un blocco prevalentemente tabella Markdown (≥50% righe |…|)
+# viene emesso come chunk atomico senza sentence-splitting.
+PROTECT_TABLES = True
+
+# Riservato — blocchi LaTeX non spezzabili (implementazione futura).
+PROTECT_MATH = True
+
+# ─── Fix behavior ─────────────────────────────────────────────────────────────
+
+# Numero massimo di iterazioni del loop fix → verify → fix.
+# Con 1 si ottiene il comportamento originale (fix singolo senza re-verifica).
+FIX_MAX_ITERATIONS = 3
+
+# ─── Override per strategia ───────────────────────────────────────────────────
+#
+# Sovrascrivono TARGET_CHARS / CHUNK_TOLERANCE / OVERLAP_SENTENCES
+# per la specifica strategia indicata in structure_profile.json.
+# Chiavi riconosciute: "target_chars", "tolerance", "overlap".
+#
+STRATEGY_OVERRIDES: dict[str, dict] = {
+    "h3_aware": {
+        # Documenti strutturati H2→H3: chunk medi, overlap moderato.
+        "target_chars": 600,
+        "tolerance":    0.25,
+        "overlap":      2,
+    },
+    "h2_paragraph_split": {
+        # Documenti piatti (solo H2): chunk più ampi, overlap ridotto.
+        "target_chars": 800,
+        "tolerance":    0.25,
+        "overlap":      1,
+    },
+    "paragraph": {
+        # Documenti senza header significativi: chunk più corti.
+        "target_chars": 500,
+        "tolerance":    0.30,
+        "overlap":      1,
+    },
+    "sliding_window": {
+        # Testo lineare/narrativo: finestre ampie, overlap generoso.
+        "target_chars": 800,
+        "tolerance":    0.25,
+        "overlap":      3,
+    },
+}
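To make the interaction between the global defaults and `STRATEGY_OVERRIDES` concrete, here is a minimal standalone sketch. The values are copied from the config above (only two strategies, for brevity). The helper names `resolve` and `char_range` are hypothetical: they mirror what `_ov` and the `lower`/`upper` computations do in the chunker, and are not part of the repository.

```python
# Values copied from the config above; helper names are illustrative only.
TARGET_CHARS = 300
CHUNK_TOLERANCE = 0.25
OVERLAP_SENTENCES = 1

STRATEGY_OVERRIDES = {
    "h3_aware": {"target_chars": 600, "tolerance": 0.25, "overlap": 2},
    "sliding_window": {"target_chars": 800, "tolerance": 0.25, "overlap": 3},
}


def resolve(strategy: str) -> tuple[int, float, int]:
    """Return (target, tolerance, overlap), falling back to the globals."""
    ov = STRATEGY_OVERRIDES.get(strategy, {})
    return (
        ov.get("target_chars", TARGET_CHARS),
        ov.get("tolerance", CHUNK_TOLERANCE),
        ov.get("overlap", OVERLAP_SENTENCES),
    )


def char_range(strategy: str) -> tuple[int, int]:
    """Return the accepted (lower, upper) chunk length for a strategy."""
    target, tolerance, _ = resolve(strategy)
    return int(target * (1 - tolerance)), int(target * (1 + tolerance))


if __name__ == "__main__":
    for name in ("h3_aware", "sliding_window", "unknown_strategy"):
        print(name, resolve(name), char_range(name))
```

A strategy missing from the overrides falls back to the global values, which is why an unrecognised name resolves to the 300-char target.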
@@ -0,0 +1,397 @@
+#!/usr/bin/env python3
+"""
+Fix chunks
+
+Applies corrections directly to chunks/<stem>/chunks.json, based on the
+report.json produced by verify_chunks.py. Does not touch clean.md.
+
+Fixes applied:
+  empty      → removes the chunk
+  incomplete → merges with the next chunk (the sentence continues)
+  no_prefix  → adds the [sezione > titolo] prefix if missing
+  too_short  → merges with the next chunk
+  too_long   → splits at the last sentence boundary within max_chars, only
+               for chunks longer than max_chars × SPLIT_THRESHOLD_FACTOR
+
+Input:  chunks/<stem>/chunks.json  +  chunks/<stem>/report.json
+Output: chunks/<stem>/chunks.json  (overwritten)
+
+Usage:
+    python chunks/fix_chunks.py --stem documento
+    python chunks/fix_chunks.py --stem documento --dry-run
+"""
+
+import argparse
+import contextlib
+import io
+import json
+import re
+import sys
+from pathlib import Path
+
+_HERE = Path(__file__).resolve().parent
+if str(_HERE) not in sys.path:
+    sys.path.insert(0, str(_HERE))
+import config as cfg
+from verify_chunks import verify_stem as _verify_stem
+
+MAX_CHARS = int(cfg.TARGET_CHARS * (1 + cfg.CHUNK_TOLERANCE))
+
+
+def _load_thresholds(stem_dir: Path) -> int:
+    """Read max_chars from meta.json (written by the chunker), else use the config default."""
+    meta = stem_dir / "meta.json"
+    if meta.exists():
+        return json.loads(meta.read_text(encoding="utf-8"))["max_chars"]
+    return MAX_CHARS
+
+
+PUNCT_END = re.compile(r"[.!?»)\]'\u2019\"\u201c\u201d\u2018\u2014\u2013-]$")
+
+
+# ─── Helpers ──────────────────────────────────────────────────────────────────
+
+def _prefix(chunk: dict) -> str:
+    sezione = chunk.get("sezione", "")
+    titolo  = chunk.get("titolo", "")
+    if titolo:
+        return f"[{sezione} > {titolo}]"
+    return f"[{sezione}]"
+
+
+def _strip_prefix(text: str) -> str:
+    text = text.lstrip()
+    if text.startswith("["):
+        end = text.find("]")
+        if end != -1:
+            return text[end + 1:].lstrip("\n")
+    return text
+
+
+def _rebuild_text(chunk: dict, body: str) -> str:
+    return f"{_prefix(chunk)}\n{body}"
+
+
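+# Strong sentence end: . ! ? » followed by whitespace and then an uppercase
+# or accented letter, an opening quote, or "(". Weak punctuation (, : ) ])
+# is never used as a split point, to avoid creating incomplete chunks;
+# "; " is only a fallback (see _SECONDARY_END).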
+_STRONG_END = re.compile(
+    r'[.!?\xbb]\s+(?=[A-Z\xc0-\xd6\xd8-\xde\xc0-\xff\xab\x22\x27(])'
+)
+_SECONDARY_END = re.compile(r';\s+')
+
+
+def _split_at_boundary(text: str, max_chars: int) -> list[str]:
+    """Split text into parts of at most max_chars at strong sentence ends (. ! ? »).
+
+    If no strong boundary falls within max_chars, fall back to "; "; if that
+    fails too, stop splitting: a too_long chunk (warning) is better than an
+    incomplete one (blocker). The last part may therefore exceed max_chars.
+    """
+    if len(text) <= max_chars:
+        return [text]
+
+    parts = []
+    remaining = text
+
+    while len(remaining) > max_chars:
+        candidate = remaining[:max_chars]
+
+        last_pos = -1
+        for m in _STRONG_END.finditer(candidate):
+            last_pos = m.start() + 1  # posizione dopo il carattere terminatore
+
+        if last_pos > 0:
+            first = remaining[:last_pos].rstrip()
+            remaining = remaining[last_pos:].lstrip()
+            if first:
+                parts.append(first)
+        else:
+            # Prova confine secondario: ; + spazio (clausole legali)
+            sec_pos = -1
+            for m in _SECONDARY_END.finditer(candidate):
+                sec_pos = m.start() + 1
+            if sec_pos > 0:
+                first = remaining[:sec_pos].rstrip()
+                remaining = remaining[sec_pos:].lstrip()
+                if first:
+                    parts.append(first)
+            else:
+                # Nessun confine: lascia il chunk intero (too_long > incomplete)
+                break
+
+    if remaining:
+        parts.append(remaining)
+
+    return [p for p in parts if p.strip()]
+
+
+# ─── Operazioni sui chunk ─────────────────────────────────────────────────────
+
+def fix_empty(chunks: list[dict], empty_ids: set[str]) -> tuple[list[dict], int]:
+    before = len(chunks)
+    chunks = [c for c in chunks if c["chunk_id"] not in empty_ids]
+    return chunks, before - len(chunks)
+
+
+def fix_no_prefix(chunks: list[dict], no_prefix_ids: set[str]) -> tuple[list[dict], int]:
+    count = 0
+    for c in chunks:
+        if c["chunk_id"] in no_prefix_ids:
+            body = _strip_prefix(c["text"])
+            c["text"] = _rebuild_text(c, body)
+            c["n_chars"] = len(c["text"])
+            count += 1
+    return chunks, count
+
+
+def fix_incomplete_and_short(chunks: list[dict],
+                              problem_ids: set[str]) -> tuple[list[dict], int]:
+    merged = 0
+    i = 0
+    result: list[dict] = []
+
+    while i < len(chunks):
+        c = chunks[i]
+        if c["chunk_id"] in problem_ids and i + 1 < len(chunks):
+            nxt = chunks[i + 1]
+            body_c   = _strip_prefix(c["text"])
+            body_nxt = _strip_prefix(nxt["text"])
+            merged_body = body_c.rstrip() + "\n" + body_nxt.lstrip()
+            nxt["text"]    = _rebuild_text(nxt, merged_body)
+            nxt["n_chars"] = len(nxt["text"])
+            merged += 1
+            i += 1
+            continue
+        result.append(c)
+        i += 1
+
+    return result, merged
+
+
+def fix_too_long(chunks: list[dict],
+                 too_long_ids: set[str],
+                 max_chars: int) -> tuple[list[dict], int]:
+    result: list[dict] = []
+    split_count = 0
+
+    for c in chunks:
+        if c["chunk_id"] not in too_long_ids:
+            result.append(c)
+            continue
+
+        body  = _strip_prefix(c["text"])
+        parts = _split_at_boundary(body, max_chars)
+
+        if len(parts) == 1:
+            result.append(c)
+            continue
+
+        base_id  = re.sub(r"__s\d+$", "", c["chunk_id"])
+        base_sub = c.get("sub_index", 0)
+
+        for j, part in enumerate(parts):
+            new_chunk = dict(c)
+            new_chunk["sub_index"] = base_sub + j
+            new_chunk["chunk_id"]  = f"{base_id}__s{base_sub + j}"
+            new_chunk["text"]      = _rebuild_text(new_chunk, part)
+            new_chunk["n_chars"]   = len(new_chunk["text"])
+            result.append(new_chunk)
+
+        split_count += 1
+
+    return result, split_count
+
+
+def renumber_ids(chunks: list[dict]) -> list[dict]:
+    seen: dict[str, int] = {}
+    for c in chunks:
+        base = re.sub(r"__s\d+$", "", c["chunk_id"])
+        idx  = seen.get(base, 0)
+        c["chunk_id"]  = f"{base}__s{idx}"
+        c["sub_index"] = idx
+        seen[base] = idx + 1
+    return chunks
+
+
+# ─── Core ─────────────────────────────────────────────────────────────────────
+
+def fix_stem(stem: str, project_root: Path, max_chars: int, dry_run: bool,
+             max_iter: int = 10) -> bool:
+    stem_dir    = project_root / "chunks" / stem
+    chunks_path = stem_dir / "chunks.json"
+    report_path = stem_dir / "report.json"
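+    # NOTE: this replaces whatever was passed as max_chars with the threshold
+    # from meta.json, or with the config default when meta.json is missing.
+    max_chars   = _load_thresholds(stem_dir)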
+
+    if not chunks_path.exists():
+        print(f"✗ chunks/{stem}/chunks.json non trovato.")
+        print(f"  Esegui prima: python chunks/chunker.py --stem {stem}")
+        return False
+
+    if not report_path.exists():
+        print(f"✗ chunks/{stem}/report.json non trovato.")
+        print(f"  Esegui prima: python chunks/verify_chunks.py --stem {stem}")
+        return False
+
+    chunks: list[dict] = json.loads(chunks_path.read_text(encoding="utf-8"))
+    report: dict       = json.loads(report_path.read_text(encoding="utf-8"))
+
+    verdict = report.get("verdict", "ok")
+    print(f"\nDocumento: {stem}  (verdict: {verdict})")
+
+    if verdict == "ok":
+        print("  ✅ Nessun problema - nulla da correggere.")
+        return True
+
+    empty_ids      = {e["chunk_id"] for e in report.get("blockers", {}).get("empty", [])}
+    no_prefix_ids  = {e["chunk_id"] for e in report.get("blockers", {}).get("no_prefix", [])}
+    incomplete_ids = {e["chunk_id"] for e in report.get("blockers", {}).get("incomplete", [])}
+    too_short_ids  = {e["chunk_id"] for e in report.get("warnings", {}).get("too_short", [])}
+
+    # Spezza solo chunk che superano upper × SPLIT_THRESHOLD_FACTOR,
+    # non quelli appena oltre upper (che causerebbero split con chunk incompleti).
+    _split_limit = max_chars * cfg.SPLIT_THRESHOLD_FACTOR
+    too_long_ids = {
+        e["chunk_id"]
+        for e in report.get("warnings", {}).get("too_long", [])
+        if e.get("n_chars", 0) > _split_limit
+    }
+
+    ops: list[str] = []
+    if empty_ids:
+        ops.append(f"  🗑  rimuovi {len(empty_ids)} chunk vuoti")
+    if no_prefix_ids:
+        ops.append(f"  🔧 aggiungi prefisso a {len(no_prefix_ids)} chunk")
+    if incomplete_ids:
+        ops.append(f"  🔗 fondi {len(incomplete_ids)} chunk incompleti col successivo")
+    if too_short_ids:
+        ops.append(f"  🔗 fondi {len(too_short_ids)} chunk troppo corti col successivo")
+    if too_long_ids:
+        ops.append(f"  ✂️  spezza {len(too_long_ids)} chunk troppo lunghi")
+
+    if not ops:
+        print("  ✅ Nessuna correzione necessaria.")
+        return True
+
+    print("\n  Operazioni pianificate:")
+    for op in ops:
+        print(op)
+
+    if dry_run:
+        print("\n  [dry-run] Nessuna modifica applicata.")
+        return True
+
+    n_before = len(chunks)
+
+    def _fix_blockers(chunks: list[dict], report: dict) -> list[dict]:
+        """Risolve solo i blockers (incomplete, empty, no_prefix) senza toccare warnings."""
+        empty_ids_      = {e["chunk_id"] for e in report.get("blockers", {}).get("empty", [])}
+        no_prefix_ids_  = {e["chunk_id"] for e in report.get("blockers", {}).get("no_prefix", [])}
+        incomplete_ids_ = {e["chunk_id"] for e in report.get("blockers", {}).get("incomplete", [])}
+        if empty_ids_:
+            chunks, n = fix_empty(chunks, empty_ids_)
+            print(f"  🗑  Rimossi {n} chunk vuoti.")
+        if no_prefix_ids_:
+            chunks, n = fix_no_prefix(chunks, no_prefix_ids_)
+            print(f"  🔧 Aggiunto prefisso a {n} chunk.")
+        if incomplete_ids_:
+            chunks, n = fix_incomplete_and_short(chunks, incomplete_ids_)
+            print(f"  🔗 Fusi {n} chunk incompleti.")
+        return renumber_ids(chunks)
+
+    def _fix_warnings(chunks: list[dict], report: dict) -> list[dict]:
+        """Applica fix opzionali: merge too_short e split too_long."""
+        too_short_ids_ = {e["chunk_id"] for e in report.get("warnings", {}).get("too_short", [])}
+        too_long_ids_  = {
+            e["chunk_id"]
+            for e in report.get("warnings", {}).get("too_long", [])
+            if e.get("n_chars", 0) > max_chars * cfg.SPLIT_THRESHOLD_FACTOR
+        }
+        if too_short_ids_:
+            chunks, n = fix_incomplete_and_short(chunks, too_short_ids_)
+            print(f"  🔗 Fusi {n} chunk troppo corti.")
+        if too_long_ids_:
+            chunks, n = fix_too_long(chunks, too_long_ids_, max_chars)
+            print(f"  ✂️  Spezzati {n} chunk lunghi.")
+        return renumber_ids(chunks)
+
+    # Phase 1: fix blockers until convergence (empty, no_prefix, incomplete)
+    chunks = _fix_blockers(chunks, report)
+
+    _min = int(cfg.TARGET_CHARS * (1 - cfg.CHUNK_TOLERANCE))
+    _max = int(cfg.TARGET_CHARS * (1 + cfg.CHUNK_TOLERANCE))
+    prev_blockers = sum(len(v) for v in report.get("blockers", {}).values())
+
+    for iteration in range(1, max_iter + 1):
+        chunks_path.write_text(
+            json.dumps(chunks, ensure_ascii=False, indent=2), encoding="utf-8"
+        )
+        with contextlib.redirect_stdout(io.StringIO()):
+            _verify_stem(stem, project_root, _min, _max)
+        report = json.loads(report_path.read_text(encoding="utf-8"))
+        new_verdict = report.get("verdict", "ok")
+        curr_blockers = sum(len(v) for v in report.get("blockers", {}).values())
+
+        if new_verdict in ("ok", "warnings_only") or curr_blockers == 0:
+            break
+        if curr_blockers >= prev_blockers:
+            print(f"\n  ⚠️  Nessun miglioramento ({curr_blockers} blockers) - i restanti richiedono correzione manuale del clean.md.")
+            break
+
+        print(f"\n  Iterazione {iteration + 1} - {curr_blockers} blockers residui:")
+        prev_blockers = curr_blockers
+        chunks = _fix_blockers(chunks, report)
+
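+    # Fase 2: fix warnings (too_short merge + too_long split) - una sola passata finale
+    # Persist the in-memory chunks first: if the loop above ran out of iterations,
+    # the last _fix_blockers() result has not been written to disk yet.
+    chunks_path.write_text(
+        json.dumps(chunks, ensure_ascii=False, indent=2), encoding="utf-8"
+    )
+    with contextlib.redirect_stdout(io.StringIO()):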
+        _verify_stem(stem, project_root, _min, _max)
+    report = json.loads(report_path.read_text(encoding="utf-8"))
+    n_short = len(report.get("warnings", {}).get("too_short", []))
+    n_long  = sum(
+        1 for e in report.get("warnings", {}).get("too_long", [])
+        if e.get("n_chars", 0) > max_chars * cfg.SPLIT_THRESHOLD_FACTOR
+    )
+    if n_short or n_long:
+        print(f"\n  Fix warnings: {n_short} corti, {n_long} lunghi da spezzare")
+        chunks = _fix_warnings(chunks, report)
+
+    n_after = len(chunks)
+    print(f"\n  Totale chunk: {n_before} → {n_after}")
+
+    chunks_path.write_text(
+        json.dumps(chunks, ensure_ascii=False, indent=2), encoding="utf-8"
+    )
+    print(f"  ✅ Salvato: chunks/{stem}/chunks.json")
+
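+    # Re-verify the saved chunks so the final verdict reflects the Phase 2 fixes
+    # instead of the report generated before they were applied.
+    with contextlib.redirect_stdout(io.StringIO()):
+        _verify_stem(stem, project_root, _min, _max)
+    report = json.loads(report_path.read_text(encoding="utf-8"))
+    final_verdict = report.get("verdict", "?")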
+    if final_verdict == "ok":
+        print(f"  ✅ Verdict finale: ok - procedi alla vettorizzazione.")
+    elif final_verdict == "warnings_only":
+        print(f"  🟡 Verdict finale: warnings_only - puoi procedere.")
+    else:
+        print(f"  🔴 Verdict finale: {final_verdict} - rilancia la verifica manualmente:")
+        print(f"     python chunks/verify_chunks.py --stem {stem}")
+
+    return True
+
+
+# ─── Entry point ──────────────────────────────────────────────────────────────
+
+if __name__ == "__main__":
+    project_root = Path(__file__).parent.parent
+
+    parser = argparse.ArgumentParser(description="Fix chunk")
+    parser.add_argument("--stem", required=True, help="Nome del documento (sottocartella di chunks/)")
+    _max_def = int(cfg.TARGET_CHARS * (1 + cfg.CHUNK_TOLERANCE))
+    parser.add_argument(
+        "--max", type=int, default=_max_def,
+        help=f"Soglia massima caratteri per lo split (default: TARGET×(1+TOL) = {_max_def})"
+    )
+    parser.add_argument(
+        "--dry-run", action="store_true",
+        help="Mostra le operazioni pianificate senza applicarle"
+    )
+    parser.add_argument(
+        "--max-iter", type=int, default=10, metavar="N",
+        help="Numero massimo di iterazioni automatiche (default: 10)"
+    )
+    args = parser.parse_args()
+
+    ok = fix_stem(args.stem, project_root, args.max, args.dry_run, args.max_iter)
+    sys.exit(0 if ok else 1)
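
Reviewer note: the blocker phase of `fix_stem` is a fix-until-stable loop that stops when no blockers remain, when the count stops decreasing, or when `max_iter` is reached. The sketch below shows only that control flow on toy data. The names `fix_until_stable`, `count_incomplete` and `merge_incomplete` are hypothetical stand-ins for the real verify/fix calls, and a "chunk" here is a plain string considered complete when it ends with a period.

```python
from typing import Callable, List


def count_incomplete(chunks: List[str]) -> int:
    """Toy stand-in for the verifier: count chunks that do not end with '.'."""
    return sum(1 for c in chunks if not c.endswith("."))


def merge_incomplete(chunks: List[str]) -> List[str]:
    """Toy stand-in for the fixer: merge each incomplete chunk with the next one."""
    out, i = [], 0
    while i < len(chunks):
        cur = chunks[i]
        if not cur.endswith(".") and i + 1 < len(chunks):
            cur = cur + " " + chunks[i + 1]
            i += 1  # the following chunk has been absorbed
        out.append(cur)
        i += 1
    return out


def fix_until_stable(
    chunks: List[str],
    count_blockers: Callable[[List[str]], int],
    fix_once: Callable[[List[str]], List[str]],
    max_iter: int = 10,
) -> List[str]:
    """Apply fix_once repeatedly while the blocker count keeps strictly decreasing."""
    prev = count_blockers(chunks)
    chunks = fix_once(chunks)
    for _ in range(max_iter):
        curr = count_blockers(chunks)
        if curr == 0 or curr >= prev:
            break  # either clean, or no progress: the rest needs manual work
        prev = curr
        chunks = fix_once(chunks)
    return chunks
```

The strict-decrease test is what guarantees termination even without `max_iter`: a trailing chunk that can never be merged leaves the count unchanged, so the loop exits instead of spinning.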
@@ -1,18 +1,17 @@
 #!/usr/bin/env python3
 """
-Step 6: chunk verification
+Chunk verification

-Analyzes step-5/<stem>/chunks.json and reports every anomaly that could
-degrade retrieval quality. It modifies nothing: if problems are found,
-go back to step 4 (MD review) or adjust the step 5 parameters.
+Analyzes chunks/<stem>/chunks.json and reports every anomaly that could
+degrade retrieval quality. It modifies nothing.

-Input:  step-5/<stem>/chunks.json
-Output: on-screen report + step-6/<stem>/report.json + exit code (0 = OK, 1 = problems)
+Input:  chunks/<stem>/chunks.json
+Output: on-screen report + chunks/<stem>/report.json + exit code (0 = OK, 1 = problems)

 Usage:
-    python step-6/verify_chunks.py --stem documento
-    python step-6/verify_chunks.py                    # all documents in step-5/
-    python step-6/verify_chunks.py --min 200 --max 800
+    python chunks/verify_chunks.py --stem documento
+    python chunks/verify_chunks.py                    # all documents in chunks/
+    python chunks/verify_chunks.py --min 200 --max 800
 """

 import argparse
@@ -21,18 +20,42 @@ import re
 import sys
 from pathlib import Path

+_HERE = Path(__file__).resolve().parent
+if str(_HERE) not in sys.path:
+    sys.path.insert(0, str(_HERE))
+import config as cfg

-# ─── Soglie (devono coincidere con quelle usate in chunker.py) ────────────────

-MIN_CHARS = 200
-MAX_CHARS = 800
-PUNCT_END = re.compile("[.!?»)\\]'\u2019\"\u201c\u201d\u2018\u2014\u2013\u2026-]$")
+# ─── Soglie (derivate dal target, sovrascrivibili da CLI) ────────────────────

+MIN_CHARS = int(cfg.TARGET_CHARS * (1 - cfg.CHUNK_TOLERANCE))
+MAX_CHARS = int(cfg.TARGET_CHARS * (1 + cfg.CHUNK_TOLERANCE))
+PUNCT_END = re.compile(
+    r"[.!?\xbb)\]'\u2019\"\u201c\u201d\u2018\u2014\u2013\u2026]$"
+    r"|/$"    # URL che finisce con /
+    r"|\|$"   # riga di tabella Markdown
+    r"|;$"    # fine clausola legale (testo giuridico)
+    r"|:$"    # introduzione a lista o formula
+)
+_HEX_END     = re.compile(r"[0-9a-fA-F]{8,}$")
+_URL_TAIL    = re.compile(r"(https?://|www\.)\S+(\s+\S+){0,3}$")  # URL con fino a 3 token extra
+_MATH_SYMS   = re.compile(r"[∈∑≤≥≠∀∃∫√∞∂±×÷→←↔⊂⊃⊆⊇∩∪·°]")
+_ROMAN_END   = re.compile(r"\b(I{1,3}|IV|VI{0,3}|IX|XI{0,3}|XIV|XV|XVI{0,3}|XIX|XX{0,2})$")  # I..XX and XXX, including XIII and XVIII
+
+
+
+def _load_thresholds(stem_dir: "Path") -> "tuple[int, int]":
+    """Legge min/max da meta.json (scritto dal chunker) o usa i default da config."""
+    meta = stem_dir / "meta.json"
+    if meta.exists():
+        m = json.loads(meta.read_text(encoding="utf-8"))
+        return m["min_chars"], m["max_chars"]
+    return MIN_CHARS, MAX_CHARS

 # ─── Checks ───────────────────────────────────────────────────────────────────

 def has_prefix(chunk: dict) -> bool:
-    """Il chunk inizia con il prefisso di contesto '[...'."""
    return chunk.get("text", "").lstrip().startswith("[")


@@ -45,48 +68,52 @@ def is_too_short(chunk: dict, min_chars: int) -> bool:


 def is_too_long(chunk: dict, max_chars: int) -> bool:
-    return chunk.get("n_chars", 0) > max_chars * 1.5
+    return chunk.get("n_chars", 0) > max_chars


 def ends_incomplete(chunk: dict) -> bool:
-    """L'ultima riga di testo non termina con punteggiatura di fine frase."""
    text = chunk.get("text", "").rstrip()
    if not text:
        return False
-    # Rimuovi marcatori markdown finali (_ e *) prima di controllare:
-    # pattern come _parola._ o _parola!_ sono frasi complete.
    text_check = re.sub(r"[_*]+$", "", text).rstrip()
    if not text_check:
        return False
-    return not PUNCT_END.search(text_check)
+    if PUNCT_END.search(text_check):
+        return False
+    if _HEX_END.search(text_check):   # hash SHA / codice hex
+        return False
+    if _ROMAN_END.search(text_check):  # numero romano finale (indice/riferimento PDF)
+        return False
+    if _URL_TAIL.search(text_check[-200:]):  # URL (con eventuale path dopo spazio)
+        return False
+    return True
+
+
+def is_math_incomplete(chunk: dict) -> bool:
+    """Incompleto ma in contesto matematico — degrada a warning invece di blocker."""
+    return ends_incomplete(chunk) and len(_MATH_SYMS.findall(chunk.get("text", ""))) >= cfg.MATH_SYMS_MIN


 # ─── Report ───────────────────────────────────────────────────────────────────

 def _fmt_chunk(c: dict) -> str:
-    cid = c.get("chunk_id", "?")
-    n = c.get("n_chars", 0)
+    cid     = c.get("chunk_id", "?")
+    n       = c.get("n_chars", 0)
    preview = c.get("text", "")[:60].replace("\n", " ")
    return f"  [{cid}] ({n} char) «{preview}»"


 def verify_stem(stem: str, project_root: Path, min_chars: int, max_chars: int) -> bool:
-    """Verifica i chunk di un documento. Ritorna True se nessun bloccante rilevato."""
-    import shutil
-
-    chunks_path = project_root / "step-6" / stem / "chunks.json"
+    stem_dir    = project_root / "chunks" / stem
+    chunks_path = stem_dir / "chunks.json"
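+    # meta.json (written by the chunker) takes precedence; without it, keep the
+    # values passed by the caller so that --min/--max are not silently ignored.
+    if (stem_dir / "meta.json").exists():
+        min_chars, max_chars = _load_thresholds(stem_dir)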

    print(f"\nDocumento: {stem}")

    if not chunks_path.exists():
-        src = project_root / "step-5" / stem / "chunks.json"
-        if src.exists():
-            chunks_path.parent.mkdir(parents=True, exist_ok=True)
-            shutil.copy2(src, chunks_path)
-            print(f"  → Copiato step-5/{stem}/chunks.json → step-6/{stem}/chunks.json")
-        else:
-            print(f"  ✗ chunks.json non trovato in step-5/{stem}/ né in step-6/{stem}/ — skip")
-            return False
+        print(f"  ✗ chunks/{stem}/chunks.json non trovato")
+        print(f"    Esegui prima: python chunks/chunker.py --stem {stem}")
+        return False

    chunks: list[dict] = json.loads(chunks_path.read_text(encoding="utf-8"))

@@ -96,25 +123,27 @@ def verify_stem(stem: str, project_root: Path, min_chars: int, max_chars: int) -

    # ── Raccogli problemi ──────────────────────────────────────────────────────

-    empty_chunks       = [c for c in chunks if is_empty(c)]
-    no_prefix          = [c for c in chunks if not is_empty(c) and not has_prefix(c)]
-    too_short          = [c for c in chunks if is_too_short(c, min_chars)]
-    too_long           = [c for c in chunks if is_too_long(c, max_chars)]
-    incomplete         = [c for c in chunks if not is_empty(c) and ends_incomplete(c)]
+    empty_chunks      = [c for c in chunks if is_empty(c)]
+    no_prefix         = [c for c in chunks if not is_empty(c) and not has_prefix(c)]
+    too_short         = [c for c in chunks if is_too_short(c, min_chars)]
+    too_long          = [c for c in chunks if is_too_long(c, max_chars)]
+    _incomplete_all   = [c for c in chunks if not is_empty(c) and ends_incomplete(c)]
+    incomplete_math   = [c for c in _incomplete_all if is_math_incomplete(c)]
+    incomplete        = [c for c in _incomplete_all if not is_math_incomplete(c)]

-    # ── Statistiche lunghezze ──────────────────────────────────────────────────
+    # ── Statistiche ───────────────────────────────────────────────────────────

    lengths = [c.get("n_chars", 0) for c in chunks]
    n_total = len(chunks)
    n_ok    = n_total - len(set(
-        c["chunk_id"] for lst in [empty_chunks, no_prefix, too_short, too_long, incomplete]
+        c["chunk_id"]
+        for lst in [empty_chunks, no_prefix, too_short, too_long, incomplete]
        for c in lst
    ))
-    min_l   = min(lengths)
-    max_l   = max(lengths)
-    avg_l   = int(sum(lengths) / n_total)
+    min_l = min(lengths)
+    max_l = max(lengths)
+    avg_l = int(sum(lengths) / n_total)

-    # Distribuzione in fasce
    n_under  = sum(1 for l in lengths if l < min_chars)
    n_normal = sum(1 for l in lengths if min_chars <= l <= max_chars)
    n_over   = sum(1 for l in lengths if l > max_chars)
@@ -149,7 +178,7 @@ def verify_stem(stem: str, project_root: Path, min_chars: int, max_chars: int) -
            print(_fmt_chunk(c))
        if len(no_prefix) > 5:
            print(f"  ... e altri {len(no_prefix) - 5}")
-        print(f"  → Causa probabile: header ### mancanti o malformati nel MD (step 4)")
+        print(f"  → Causa probabile: header ### mancanti o malformati nel MD")

    if too_short:
        has_errors = True
@@ -158,18 +187,16 @@ def verify_stem(stem: str, project_root: Path, min_chars: int, max_chars: int) -
            print(_fmt_chunk(c))
        if len(too_short) > 5:
            print(f"  ... e altri {len(too_short) - 5}")
-        print(f"  → Causa probabile: sezione isolata senza successivo nello stesso ##")
-        print(f"  → Soluzione: abbassa MIN_CHARS o revisiona il MD (step 4)")
+        print(f"  → Soluzione: abbassa MIN_CHARS o revisiona il MD")

    if too_long:
        has_errors = True
-        print(f"\n  🟡 {len(too_long)} chunk SOPRA MAX_CHARS×1.5 ({int(max_chars * 1.5)}):")
+        print(f"\n  🟡 {len(too_long)} chunk SOPRA MAX ({max_chars}):")
        for c in too_long[:5]:
            print(_fmt_chunk(c))
        if len(too_long) > 5:
            print(f"  ... e altri {len(too_long) - 5}")
-        print(f"  → Causa probabile: frase singola molto lunga non spezzabile")
-        print(f"  → Soluzione: alza MAX_CHARS o verifica il testo nel MD (step 4)")
+        print(f"  → Causa probabile: frasi singole lunghe (liste/paragrafi non suddivisibili)")

    if incomplete:
        has_errors = True
@@ -179,30 +206,38 @@ def verify_stem(stem: str, project_root: Path, min_chars: int, max_chars: int) -
            print(f"  [{c.get('chunk_id', '?')}] ...{last_line!r}")
        if len(incomplete) > 5:
            print(f"  ... e altri {len(incomplete) - 5}")
-        print(f"  → Causa probabile: paragrafo spezzato nel MD (step 4)")
-        print(f"  → Soluzione: correggi le righe spezzate in step-4/{stem}/clean.md")
+        print(f"  → Soluzione: correggi le righe spezzate in conversione/{stem}/clean.md")

-    # ── Costruisci e salva report.json ───────────────────────────────────────
+    if incomplete_math:
+        has_errors = True
+        print(f"\n  🟡 {len(incomplete_math)} chunk MATEMATICI SENZA PUNTEGGIATURA (formula/espressione):")
+        for c in incomplete_math[:3]:
+            last_line = c.get("text", "").rstrip().split("\n")[-1][-80:]
+            print(f"  [{c.get('chunk_id', '?')}] ...{last_line!r}")
+        if len(incomplete_math) > 3:
+            print(f"  ... e altri {len(incomplete_math) - 3}")
+        print(f"  → Le formule non finiscono con punteggiatura — avviso non bloccante")
+
+    # ── Costruisci e salva report.json ────────────────────────────────────────

    blockers = empty_chunks + no_prefix + incomplete
-    warnings = too_short + too_long
+    warnings = too_short + too_long + incomplete_math

    def _chunk_entry(c: dict) -> dict:
        return {
-            "chunk_id": c.get("chunk_id", ""),
-            "sezione":  c.get("sezione", ""),
-            "titolo":   c.get("titolo", ""),
-            "n_chars":  c.get("n_chars", 0),
+            "chunk_id":  c.get("chunk_id", ""),
+            "sezione":   c.get("sezione", ""),
+            "titolo":    c.get("titolo", ""),
+            "n_chars":   c.get("n_chars", 0),
            "last_text": c.get("text", "").rstrip().split("\n")[-1][-120:],
        }

-    if not blockers:
-        verdict = "ok" if not warnings else "warnings_only"
-    else:
-        verdict = "blocked"
+    verdict = "ok" if not blockers else "blocked"
+    if not blockers and warnings:
+        verdict = "warnings_only"

    report = {
-        "stem": stem,
+        "stem":    stem,
        "verdict": verdict,
        "stats": {
            "total":     n_total,
@@ -211,24 +246,30 @@ def verify_stem(stem: str, project_root: Path, min_chars: int, max_chars: int) -
            "max_chars": max_l,
            "avg_chars": avg_l,
        },
-        "thresholds": {"min_chars": min_chars, "max_chars": max_chars},
+        "thresholds": {
+            "min_chars": min_chars,
+            "max_chars": max_chars,
+            "target_chars": cfg.TARGET_CHARS,
+            "chunk_tolerance": cfg.CHUNK_TOLERANCE,
+        },
        "blockers": {
            "empty":      [_chunk_entry(c) for c in empty_chunks],
            "no_prefix":  [_chunk_entry(c) for c in no_prefix],
            "incomplete": [_chunk_entry(c) for c in incomplete],
        },
        "warnings": {
-            "too_short": [_chunk_entry(c) for c in too_short],
-            "too_long":  [_chunk_entry(c) for c in too_long],
+            "too_short":       [_chunk_entry(c) for c in too_short],
+            "too_long":        [_chunk_entry(c) for c in too_long],
+            "incomplete_math": [_chunk_entry(c) for c in incomplete_math],
        },
    }

-    out_dir = project_root / "step-6" / stem
+    out_dir = project_root / "chunks" / stem
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "report.json").write_text(
        json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8"
    )
-    print(f"\n  report.json salvato in step-6/{stem}/")
+    print(f"\n  report.json salvato in chunks/{stem}/")

    # ── Prossimi passi ────────────────────────────────────────────────────────

@@ -238,14 +279,13 @@ def verify_stem(stem: str, project_root: Path, min_chars: int, max_chars: int) -

    if not blockers and not warnings:
        print(f"  ✅ Tutto OK — procedi alla vettorizzazione:")
-        print(f"       python step-8/ingest.py --stem {stem}")
+        print(f"       python ingestion/ingest.py --stem {stem}")

    elif not blockers:
-        # Solo 🟡: si può procedere, i warning sono facoltativi
        print(f"  🟡 Solo avvisi minori — puoi procedere alla vettorizzazione:")
-        print(f"       python step-8/ingest.py --stem {stem}")
+        print(f"       python ingestion/ingest.py --stem {stem}")
        print()
-        print(f"  Oppure, per ottimizzare prima di procedere:")
+        print(f"  Oppure, per ottimizzare prima:")
        if too_short:
            pct = int(len(too_short) / n_total * 100)
            print(f"    • {len(too_short)} chunk corti ({pct}% del totale)")
@@ -253,27 +293,26 @@ def verify_stem(stem: str, project_root: Path, min_chars: int, max_chars: int) -
            pct = int(len(too_long) / n_total * 100)
            print(f"    • {len(too_long)} chunk lunghi ({pct}% del totale)")
        if too_short or too_long:
-            print(f"      → Esegui: python step-6/fix_chunks.py --stem {stem} --dry-run")
-            print(f"        poi:     python step-6/fix_chunks.py --stem {stem}")
-            print(f"        poi:     python step-6/verify_chunks.py --stem {stem}")
+            print(f"      → Esegui: python chunks/fix_chunks.py --stem {stem} --dry-run")
+            print(f"        poi:     python chunks/fix_chunks.py --stem {stem}")
+            print(f"        poi:     python chunks/verify_chunks.py --stem {stem}")

    else:
-        # Ci sono 🔴: non si procede
        print(f"  🔴 Problemi bloccanti — correggi prima di procedere:")
        print()
        if empty_chunks:
            print(f"    • {len(empty_chunks)} chunk vuoti")
-            print(f"      → Controlla step-4/{stem}/clean.md per sezioni prive di testo")
+            print(f"      → Controlla conversione/{stem}/clean.md per sezioni prive di testo")
        if no_prefix:
            print(f"    • {len(no_prefix)} chunk senza prefisso di contesto")
-            print(f"      → Controlla che gli header ### siano corretti in step-4/{stem}/clean.md")
+            print(f"      → Controlla che gli header ### siano corretti in conversione/{stem}/clean.md")
        if incomplete:
            print(f"    • {len(incomplete)} chunk con frase spezzata")
-            print(f"      → Esegui: python step-6/fix_chunks.py --stem {stem}")
+            print(f"      → Esegui: python chunks/fix_chunks.py --stem {stem}")
        print()
        print(f"  Dopo le correzioni, riesegui nell'ordine:")
-        print(f"       python step-5/chunker.py --stem {stem} --force")
-        print(f"       python step-6/verify_chunks.py --stem {stem}")
+        print(f"       python chunks/chunker.py --stem {stem} --force")
+        print(f"       python chunks/verify_chunks.py --stem {stem}")
        print()
        if warnings:
            print(f"  🟡 Hai anche {len(warnings)} avvisi minori — affrontali dopo aver risolto i 🔴.")
@@ -286,31 +325,33 @@ def verify_stem(stem: str, project_root: Path, min_chars: int, max_chars: int) -
 if __name__ == "__main__":
    project_root = Path(__file__).parent.parent

-    parser = argparse.ArgumentParser(description="Step 6 — Verifica chunk")
-    parser.add_argument("--stem", help="Nome del documento (sottocartella di step-5/)")
+    parser = argparse.ArgumentParser(description="Verifica chunk")
+    parser.add_argument("--stem", help="Nome del documento (sottocartella di chunks/)")
+    _min_def = int(cfg.TARGET_CHARS * (1 - cfg.CHUNK_TOLERANCE))
+    _max_def = int(cfg.TARGET_CHARS * (1 + cfg.CHUNK_TOLERANCE))
    parser.add_argument(
-        "--min", type=int, default=MIN_CHARS,
-        help=f"Soglia minima caratteri (default: {MIN_CHARS})"
+        "--min", type=int, default=_min_def,
+        help=f"Soglia minima caratteri (default: TARGET×(1-TOL) = {_min_def})"
    )
    parser.add_argument(
-        "--max", type=int, default=MAX_CHARS,
-        help=f"Soglia massima caratteri (default: {MAX_CHARS})"
+        "--max", type=int, default=_max_def,
+        help=f"Soglia massima caratteri (default: TARGET×(1+TOL) = {_max_def})"
    )
    args = parser.parse_args()

    if args.stem:
        stems = [args.stem]
    else:
-        step5_dir = project_root / "step-5"
-        if not step5_dir.exists():
-            print(f"Errore: cartella step-5/ non trovata in {project_root}")
+        chunks_dir = project_root / "chunks"
+        if not chunks_dir.exists():
+            print(f"Errore: cartella chunks/ non trovata in {project_root}")
            sys.exit(1)
        stems = sorted(
-            p.name for p in step5_dir.iterdir()
+            p.name for p in chunks_dir.iterdir()
            if p.is_dir() and (p / "chunks.json").exists()
        )
        if not stems:
-            print("Errore: nessun chunks.json trovato in step-5/")
+            print("Errore: nessun chunks.json trovato in chunks/")
            sys.exit(1)

    results = [verify_stem(s, project_root, args.min, args.max) for s in stems]
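
Reviewer note: the new `ends_incomplete` treats a chunk as complete when its tail matches any of several terminator patterns, not only sentence punctuation. The sketch below reproduces that idea in isolation. `URL_TAIL` and `HEX_END` copy the patterns from this diff, while `PUNCT_END` is a deliberately reduced ASCII-only set and `looks_complete` is a hypothetical helper name, so treat it as an illustration rather than the module's actual behaviour.

```python
import re

# Reduced terminator set (assumption: the real PUNCT_END also covers
# typographic quotes, dashes and the ellipsis character).
PUNCT_END = re.compile(r"[.!?)\]:;|/]$")
# Trailing hex run of 8+ characters, e.g. a hash.
HEX_END = re.compile(r"[0-9a-fA-F]{8,}$")
# URL starting with http(s):// or www., followed by at most 3 extra tokens.
URL_TAIL = re.compile(r"(https?://|www\.)\S+(\s+\S+){0,3}$")


def looks_complete(text: str) -> bool:
    """Return True when the chunk tail matches one of the accepted terminators."""
    tail = text.rstrip()
    if not tail:
        return True  # empty chunks are reported by a separate check
    return bool(
        PUNCT_END.search(tail)
        or HEX_END.search(tail)
        or URL_TAIL.search(tail[-200:])  # only the tail is scanned, as in the diff
    )
```

With these patterns a watermark line such as `Document shared on www.docsity.com` counts as a valid ending, which is the false blocker the commit message describes, while a chunk that stops mid-sentence is still flagged.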
@@ -1,9 +1,9 @@
-# ─── Step 9 — Configurazione RAG ─────────────────────────────────────────────
+# ─── Configurazione RAG ───────────────────────────────────────────────────────
 #
 # Modifica questo file per cambiare i parametri della pipeline.
 #
 # Uso:
-#   python step-9/rag.py --stem nietzsche
+#   python rag.py --stem nietzsche
 # ──────────────────────────────────────────────────────────────────────────────

 # ── Retrieval ─────────────────────────────────────────────────────────────────
@@ -18,7 +18,7 @@ TOP_K = 6
 # Temperatura del modello LLM.
 # 0.0 = completamente deterministico (stessa risposta ad ogni run)
 # 0.7 = più creativo e vario
-TEMPERATURE = 0.0
+TEMPERATURE = 0.2

 # Disabilita il "thinking" (ragionamento interno) nei modelli Qwen3/Qwen3.5.
 # True  = risposta diretta, più veloce
@@ -28,8 +28,8 @@ NO_THINK = True
 # ── Embedding ─────────────────────────────────────────────────────────────────

 # Modello di embedding usato da Ollama.
-# Deve corrispondere al modello usato durante la vettorizzazione (step-8).
-# Se cambi questo, devi rieseguire step-8 con --force.
+# Deve corrispondere al modello usato durante la vettorizzazione (ingest.py).
+# Se cambi questo, devi rieseguire ingest.py con --force.
 EMBED_MODEL = "nomic-embed-text"

 # ── Ollama ────────────────────────────────────────────────────────────────────
@@ -38,7 +38,7 @@ EMBED_MODEL = "nomic-embed-text"
 OLLAMA_URL = "http://localhost:11434"

 # Modello LLM. Scegli in base alla RAM disponibile (vedi README).
-OLLAMA_MODEL = "qwen3.5:0.8b"
+OLLAMA_MODEL = "qwen3.5:4b"

 # ── Prompt di sistema ─────────────────────────────────────────────────────────

@@ -33,13 +33,13 @@ Posiziona il PDF in `sources/<nome>.pdf`, poi:

 ```bash
 # Single document
-python conversione/pipeline.py --stem <nome>
+python conversione/ --stem <nome>
 
 # All PDFs in sources/
-python conversione/pipeline.py
+python conversione/
 
 # Force a re-run (overwrites existing output)
-python conversione/pipeline.py --stem <nome> --force
+python conversione/ --stem <nome> --force
 ```

 The `--stem` parameter is the name of the PDF file without its extension.  
@@ -49,12 +49,13 @@ Esempio: `sources/analisi1.pdf` → `--stem analisi1`

 ## Output

-For each stem, three files are produced in `conversione/<stem>/`:
+For each stem, four files are produced in `conversione/<stem>/`:
 
 | File | Description |
 |------|-------------|
 | `raw.md` | Raw Markdown extracted from the PDF (**do not edit**) |
 | `clean.md` | Cleaned, structured Markdown; input for the chunker |
+| `structure_profile.json` | Detected structure and recommended chunking strategy |
 | `report.json` | Full conversion quality metrics |

 ### report.json
@@ -110,11 +111,15 @@ anomalie e problemi residui con esempi.

 ## Batch validation
 
-After converting one or more documents, run `validate.py` to get
+After converting one or more documents, run `validate` to get
 a status table covering all stems:
 
 ```bash
-python conversione/validate.py
+# All documents
+python conversione/ validate
+
+# Single document with penalty breakdown
+python conversione/ validate <stem> --detail
 ```
 
 Example output:
@@ -221,11 +226,19 @@ Durante l'esecuzione la pipeline stampa le statistiche di ogni trasformazione:

 ```
  [3/4] Pulizia strutturale...
-  ✅ Immagini rimosse:      0
+  ✅ Simboli PUA corretti:  0
+     Immagini rimosse:      0
+     Note rimosse:          12
     Accenti corretti:      3701
     Dot-leader rimossi:    53
     Header concat fixati:  0
+     Header num. normaliz.: 8
+     Articoli → ###:        0
+     Ambienti matematici:   0
+     Titoli header uniti:   4
     TOC rimosso:           sì
+     Versi poesia riprist.: 0
+     Header verso demotati: 0
     ALL-CAPS → ##:         14
     Sezioni → ###:         279
     Paragrafi uniti:       12998
@@ -0,0 +1,109 @@
+#!/usr/bin/env python3
+"""
+PDF → clean Markdown pipeline for RAG vectorization.
+
+Usage:
+    # Convert
+    python conversione/ --stem <nome>
+    python conversione/ --stem <nome> --force
+    python conversione/                          # all PDFs in sources/
+
+    # Validate
+    python conversione/ validate
+    python conversione/ validate <stem> [<stem> ...] --detail
+
+Prerequisites:
+    pip install opendataloader-pdf pdfplumber
+    Java 11+ on the PATH (https://adoptium.net/)
+"""
+
+import argparse
+import sys
+from pathlib import Path
+
+# Rende _pipeline importabile da conversione/
+sys.path.insert(0, str(Path(__file__).parent))
+
+from _pipeline import run, validate
+
+
+def _build_parser() -> argparse.ArgumentParser:
+    parser = argparse.ArgumentParser(
+        prog="conversione",
+        description="PDF → clean Markdown strutturato, pronto per chunking RAG",
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog=(
+            "Esempi:\n"
+            "  python conversione/ --stem manuale\n"
+            "  python conversione/ --stem manuale --force\n"
+            "  python conversione/ validate\n"
+            "  python conversione/ validate manuale --detail"
+        ),
+    )
+
+    # ── Subcommand: validate ──────────────────────────────────────────────
+    sub = parser.add_subparsers(dest="cmd", metavar="comando")
+    val = sub.add_parser(
+        "validate",
+        help="valida i report.json prodotti dalla conversione",
+        description="Legge i report.json e assegna un voto 0-100 (A/B/C/D/F).",
+    )
+    val.add_argument(
+        "stems",
+        nargs="*",
+        metavar="STEM",
+        help="stem da validare. Ometti per tutti.",
+    )
+    val.add_argument(
+        "--detail", "-d",
+        action="store_true",
+        help="mostra il dettaglio delle penalità per ogni documento",
+    )
+
+    # ── Opzioni convert (modalità default) ───────────────────────────────
+    parser.add_argument(
+        "--stem",
+        metavar="NOME",
+        help="nome del PDF in sources/ (senza estensione). Ometti per tutti.",
+    )
+    parser.add_argument(
+        "--force",
+        action="store_true",
+        help="riesegui anche se clean.md è già presente",
+    )
+
+    return parser
+
+
+def main() -> None:
+    parser = _build_parser()
+    args   = parser.parse_args()
+    root   = Path(__file__).parent.parent
+
+    # ── Validate ─────────────────────────────────────────────────────────
+    if args.cmd == "validate":
+        validate(args.stems, root, detail=args.detail)
+        return
+
+    # ── Convert (default) ────────────────────────────────────────────────
+    if args.stem:
+        stems = [args.stem]
+    else:
+        sources_dir = root / "sources"
+        if not sources_dir.exists():
+            print("Errore: cartella sources/ non trovata.")
+            sys.exit(1)
+        stems = sorted(p.stem for p in sources_dir.glob("*.pdf"))
+        if not stems:
+            print("Errore: nessun PDF trovato in sources/.")
+            sys.exit(1)
+
+    results = [run(s, root, args.force) for s in stems]
+    ok      = sum(results)
+    total   = len(results)
+    print(f"\n{'✅' if all(results) else '⚠️ '} {ok}/{total} documenti convertiti")
+    sys.exit(0 if all(results) else 1)
+
+
+if __name__ == "__main__":
+    main()
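
Reviewer note: the new entry point combines an optional `validate` subcommand with top-level `--stem`/`--force` options, so that omitting the subcommand means "convert". The sketch below isolates that argparse layout. `build_parser` and `describe` are hypothetical names, and the real `run`/`validate` calls are replaced by a tuple describing what would be dispatched.

```python
import argparse
from typing import List, Tuple


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="conversione")

    # Optional subcommand: when omitted, args.cmd is None and we fall back to convert.
    sub = parser.add_subparsers(dest="cmd", metavar="command")
    val = sub.add_parser("validate")
    val.add_argument("stems", nargs="*", metavar="STEM")
    val.add_argument("--detail", "-d", action="store_true")

    # Options of the default (convert) mode live on the top-level parser.
    parser.add_argument("--stem", metavar="NAME")
    parser.add_argument("--force", action="store_true")
    return parser


def describe(argv: List[str]) -> Tuple:
    """Parse argv and report which mode would run, with its arguments."""
    args = build_parser().parse_args(argv)
    if args.cmd == "validate":
        return ("validate", args.stems, args.detail)
    return ("convert", args.stem, args.force)
```

One consequence of this layout is that the `stems` and `detail` attributes exist only when `validate` is given, so the convert branch should not read them.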
@@ -0,0 +1,30 @@
+from .extract          import validate_pdf, extract_metadata
+from .structure        import analyze
+from .report           import build_report
+from .runner           import run
+from .validator        import validate
+from .models           import Block, Section, FontProfile
+from .stage1_metadata  import extract_raw_data
+from .stage2_layout    import analyze_layout
+from .stage3_font      import build_font_profile
+from .stage4_headers   import classify_blocks
+from .stage5_hierarchy import infer_hierarchy
+from .stage6_tree      import build_tree
+from .stage7_markdown  import serialize_tree
+from .stage8_normalize import normalize_hierarchy
+from .stage9_validate  import validate_markdown, ValidationResult
+
+__all__ = [
+    "validate_pdf", "extract_metadata",
+    "analyze", "build_report", "run", "validate",
+    "Block", "Section", "FontProfile",
+    "extract_raw_data",
+    "analyze_layout",
+    "build_font_profile",
+    "classify_blocks",
+    "infer_hierarchy",
+    "build_tree",
+    "serialize_tree",
+    "normalize_hierarchy",
+    "validate_markdown", "ValidationResult",
+]
@@ -0,0 +1,169 @@
+"""
+Module-level constants shared across the transformation modules.
+All compiled regexes and static maps live here.
+"""
+import re
+
+# ─── Keyword sets ─────────────────────────────────────────────────────────────
+
+_TOC_KEYWORDS = frozenset([
+    "indice", "index", "contents", "table of contents",
+    "sommario", "inhaltsverzeichnis", "inhalt",
+    "indice generale", "indice analitico", "indice dei contenuti",
+    "elenco dei capitoli", "argomenti", "table des matières",
+    "tabla de contenidos", "содержание",
+])
+
+_ORDINALS_IT = {
+    "PRIMO": "I", "SECONDO": "II", "TERZO": "III", "QUARTO": "IV",
+    "QUINTO": "V", "SESTO": "VI", "SETTIMO": "VII", "OTTAVO": "VIII",
+    "NONO": "IX", "DECIMO": "X",
+}
+_ORDINALS_EN = {
+    "ONE": "1", "TWO": "2", "THREE": "3", "FOUR": "4", "FIVE": "5",
+    "SIX": "6", "SEVEN": "7", "EIGHT": "8", "NINE": "9", "TEN": "10",
+}
+
+# ─── PUA Symbol font map ──────────────────────────────────────────────────────
+
+_SYMBOL_PUA_MAP: dict[str, str] = {
+    "": " ",
+    "": "(",
+    "": ")",
+    "": "+",
+    "": "−",
+    "": ".",
+    "": "/",
+    "": "0", "": "1", "": "2", "": "3", "": "4",
+    "": "5", "": "6", "": "7", "": "8", "": "9",
+    "": ":", "": ";", "": "<", "": "=", "": ">",
+    "": "≅",
+    "": "Α", "": "Β", "": "Χ", "": "Δ", "": "Ε",
+    "": "Φ", "": "Γ", "": "Η", "": "Ι", "": "ϑ",
+    "": "Κ", "": "Λ", "": "Μ", "": "Ν", "": "Ο",
+    "": "Π", "": "Θ", "": "Ρ", "": "Σ", "": "Τ",
+    "": "Υ", "": "ς", "": "Ω", "": "Ξ", "": "Ψ",
+    "": "Ζ",
+    "": "[",
+    "": "∴",
+    "": "]",
+    "": "⊥",
+    "": "α", "": "β", "": "χ", "": "δ", "": "ε",
+    "": "φ", "": "γ", "": "η", "": "ι", "": "ϕ",
+    "": "κ", "": "λ", "": "μ", "": "ν", "": "ο",
+    "": "π", "": "θ", "": "ρ", "": "σ", "": "τ",
+    "": "υ", "": "ϖ", "": "ω", "": "ξ", "": "ψ",
+    "": "ζ",
+    "": "{",
+    "": "|",
+    "": "}",
+    "": "~",
+    "": "±",
+    "": "•",
+    "": "√",
+    "": "≤",
+    "": "≥",
+    "": "∝",
+    "": "×",
+    "": "÷",
+    "": "×",
+    "": "≠",
+    "": "≠",
+    "": "≥",
+    "": "′",
+    "": "*",
+    "": ",",
+    "": "≤",
+    "": "•",
+    "": "•",
+    "": "→",
+    "": "÷",
+    "": "",
+    "": "→",
+    "": "",
+    "": "",
+    "": "",
+    "": "",
+    # TeX Computer Modern bracket/delimiter pieces (U+F8EB–F8FE) → stringa vuota
+    "": "",  # TeX large paren left
+    "": "",  # TeX large paren extension
+    "": "",  # TeX large paren right
+    "": "",  # TeX large paren right ext
+    "": "",  # TeX large bracket left
+    "": "",  # TeX large bracket ext
+    "": "",  # TeX brace top-left
+    "": "",  # TeX brace mid
+    "": "",  # TeX brace mid-right
+    "": "",  # TeX brace extension
+    "": "",  # TeX brace right
+    "": "",  # TeX bracket right large
+    "": "",  # TeX bracket right ext
+    "": "",  # TeX bracket right close
+    "": "",  # TeX integral large
+    "": "",  # TeX integral extension
+    "": "",  # TeX integral top
+    "": "",  # TeX radical top
+    "": "",  # TeX radical extension
+    "": "",  # TeX arrowhead
+}
+
+_SYMBOL_PUA_RE = re.compile(
+    "[" + "".join(re.escape(k) for k in _SYMBOL_PUA_MAP) + "]"
+)
+
+# ─── Shared compiled regexes ──────────────────────────────────────────────────
+
+_SUPERSCRIPT_RE = re.compile(r'[¹²³⁰⁴-⁹]+')
+_FOOTNOTE_BODY_RE = re.compile(
+    r'^([¹²³⁰⁴-⁹]+\s+|\[\d{1,3}\]\s+)'
+)
+_NUMBERED_HDR_RE = re.compile(
+    r"^(#{1,6})\s+(\d+(?:\.\d+)*)\.\s+(.+)$",
+    re.MULTILINE,
+)
+_BIB_MARKERS_RE = re.compile(
+    r'\b(pp?\.|vol\.|n\.\s*\d|ed\.|edn\.|ISBN|DOI|arXiv)\b'
+    r'|\b(19|20)\d{2}\b'
+    r'|\b(ibid\.?|ibidem|op\.\s*cit\.?|cit\.|cfr\.|ivi[,;\s])\b',
+    re.IGNORECASE,
+)
+# Academic author pattern: capitalized initial + ALL-CAPS surname (e.g. "A. SMITH")
+_FOOTNOTE_AUTHOR_RE = re.compile(r'(?<![A-Z])[A-Z]\.\s+[A-Z]{3,}')
+_WATERMARK_RE = re.compile(
+    r"^(BOZZA|DRAFT|CONFIDENTIAL|RISERVATO|PROVVISORIO|SAMPLE|SPECIMEN"
+    r"|DO NOT DISTRIBUTE|NON DISTRIBUIRE|COPY|COPIA)\s*$",
+    re.IGNORECASE | re.MULTILINE,
+)
+_TABSEP_RE = re.compile(r"(?m)^\|\s*\|\s*$|^\|---\|?\s*$")
+_DOTLEADER_RE = re.compile(r"^[^\n]*(?:(?:\. ){3,}|\.{4,})[^\n]*$", re.MULTILINE)
+_FM_RE = re.compile(
+    r"https?://|www\.|@[A-Za-z]|\bUniversit[àa]\b|\bDipartimento\b|"
+    r"\bCopyright\b|\bLicenza\b|\bEdizione\b|"
+    r"protetto da|tutti i diritti",
+    re.IGNORECASE,
+)
+_VERSE_NUM_RE = re.compile(
+    r"([.!?\xbb'\"" + "’" + r"]\s+)(\d+)(\s+)(?=[A-Z\xc0-\xd9a-z\xe0-\xf9\xab“”‟])"
+)
+# Math header demotion
+_MATH_SYMBOLS_RE = re.compile(
+    r"[=+∈∀∃≤≥∞∑∫∂→↔⊂⊃∩∪αβγδεζηθικλμνξοπρστυφχψωΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ]"
+)
+_EXERCISE_TRIGGER_RE = re.compile(
+    r"\b(Si dimostri|Si calcoli|Si provi|Si trovi|Trovare|Find|Prove|Show that"
+    r"|Compute|Calculate|Dimostrare|Verificare)\b",
+    re.IGNORECASE,
+)
+_MATH_HDR_RE = re.compile(r"^(#{2,3})\s+(.+)$")
+_NUMBERED_PREFIX_RE = re.compile(r"^(\d+(?:\.\d+)*[.)])\s+(.+)$", re.DOTALL)
+# Orphan TOC: table-of-contents entry without a dot leader (e.g. "3. Funzioni 174")
+_TOC_ITEM_RE = re.compile(
+    r"^\d+(\.\d+)*\.?\s+[A-Za-zÀ-ú\'\(][^\n]{2,70}$"
+)
+_TOC_HDR_WITH_PAGE_RE = re.compile(
+    r"^#{1,3}\s+\d+\.?\s+.{3,60}\s+\d{1,4}$"
+)
+# PDF artifacts: page markers and separators
+_PAGE_MARKER_RE = re.compile(r"(?m)^<!-- page: \d+ -->\s*$")
+_STANDALONE_NUM_RE = re.compile(r"(?m)^(?:- )?\d{1,3}$")
+_UNDERSCORE_SEP_RE = re.compile(r"(?m)^_{4,}\s*$")
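To illustrate how the dot-leader pattern above behaves, here is a minimal sketch that copies `_DOTLEADER_RE` into a standalone script. The sample text and the names `DOTLEADER_RE`, `sample`, and `toc_lines` are made up for the example and do not appear in the commit.

```python
import re

# Same pattern as _DOTLEADER_RE in the constants module above.
DOTLEADER_RE = re.compile(r"^[^\n]*(?:(?:\. ){3,}|\.{4,})[^\n]*$", re.MULTILINE)

# Hypothetical sample: two table-of-contents lines and one ordinary sentence.
sample = "1. Introduzione . . . . . 12\nTesto normale.\nCapitolo 2 ........ 34"

# The pattern has no capturing groups, so findall returns the full matching lines.
toc_lines = DOTLEADER_RE.findall(sample)
print(toc_lines)
```

A line qualifies when it contains at least three spaced dots (`. . . `) or at least four consecutive dots, so a single sentence-ending period is not enough to trigger a match.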
@@ -0,0 +1,153 @@
+"""Pure helper functions shared across the transformation modules."""
+import re
+
+from ._constants import _ORDINALS_IT, _ORDINALS_EN
+
+
+def _sentence_case(s: str) -> str:
+    if not s:
+        return s
+    lower = s.lower()
+    return lower[0].upper() + lower[1:]
+
+
+def _is_allcaps_line(line: str) -> bool:
+    stripped = line.strip()
+    letters  = [c for c in stripped if c.isalpha()]
+    return (
+        len(letters) >= 3
+        and all(c.isupper() for c in letters)
+        and not stripped.startswith("#")
+        and not stripped.startswith("|")
+    )
+
+
+def _allcaps_to_header(raw_line: str) -> str:
+    text = re.sub(r"^[-*+]\s+", "", raw_line.strip())
+    text = text.rstrip(".").rstrip("?").strip()
+
+    _ORD_IT_PAT = "|".join(_ORDINALS_IT.keys())
+    m = re.match(rf"^CAPITOLO ({_ORD_IT_PAT})\. (.+)", text)
+    if m:
+        roman  = _ORDINALS_IT[m.group(1)]
+        titolo = m.group(2).rstrip(".").rstrip("?").strip()
+        return f"## Capitolo {roman} — {_sentence_case(titolo)}"
+
+    _ORD_EN_PAT = "|".join(_ORDINALS_EN.keys())
+    m = re.match(rf"^CHAPTER ({_ORD_EN_PAT}|\d+)\.? (.+)", text)
+    if m:
+        n      = _ORDINALS_EN.get(m.group(1), m.group(1))
+        titolo = m.group(2).rstrip(".").rstrip("?").strip()
+        return f"## Chapter {n} — {_sentence_case(titolo)}"
+
+    m = re.match(r"^([IVXLCDM]+|[0-9]+)\. (.+)", text)
+    if m:
+        return f"## {m.group(1)}. {_sentence_case(m.group(2).rstrip('.').strip())}"
+
+    return f"## {_sentence_case(text)}"
+
+
+def _extract_math_environments(text: str) -> tuple[str, int]:
+    _ENVS = (
+        r"Definizione|Definition|Teorema|Theorem|Lemma|"
+        r"Proposizione|Proposition|Corollario|Corollary|"
+        r"Osservazione|Remark|Nota|Note|Esempio|Example"
+    )
+    count  = 0
+    blocks = text.split("\n\n")
+    result = []
+
+    for block in blocks:
+        stripped = block.strip()
+        if not stripped or stripped.startswith("#"):
+            result.append(block)
+            continue
+
+        m = re.match(
+            rf"^({_ENVS})\s+((?:\d+\.?){{1,4}})\s*(.*)",
+            stripped,
+            re.DOTALL,
+        )
+        if not m:
+            result.append(block)
+            continue
+
+        env  = m.group(1)
+        num  = m.group(2).rstrip(".")
+        rest = m.group(3).strip()
+
+        title_m = re.match(r"^(\([^)]{2,60}\))\s+(.*)", rest, re.DOTALL)
+        if title_m:
+            header = f"### {env} {num} {title_m.group(1)}"
+            body   = title_m.group(2).strip()
+        else:
+            header = f"### {env} {num}."
+            body   = rest
+
+        result.append(f"{header}\n\n{body}" if body else header)
+        count += 1
+
+    return "\n\n".join(result), count
+
+
+def _merge_title_headers(text: str) -> tuple[str, int]:
+    count  = 0
+    blocks = re.split(r"\n{2,}", text)
+    result = []
+    i = 0
+    while i < len(blocks):
+        block    = blocks[i]
+        stripped = block.strip()
+        if (
+            re.match(r"^#{2,3} \d+\.\s*$", stripped)
+            and i + 1 < len(blocks)
+        ):
+            nxt = blocks[i + 1].strip()
+            if (
+                nxt
+                and "\n" not in nxt
+                and len(nxt) <= 80
+                and not nxt.startswith("#")
+                and not re.match(r"^\d+[\.\)]\s", nxt)
+            ):
+                result.append(stripped.rstrip() + " " + nxt)
+                count += 1
+                i += 2
+                continue
+        result.append(block)
+        i += 1
+    return re.sub(r"\n{3,}", "\n\n", "\n\n".join(result)), count
+
+
+def _extract_article_headers(text: str) -> tuple[str, int]:
+    count = 0
+
+    def _repl(m: re.Match) -> str:
+        nonlocal count
+        num  = m.group(1)
+        rest = m.group(2).strip()
+
+        title_m = re.match(
+            r"^([A-Z\xc0\xc8\xc9\xcc\xcd\xd2\xd3\xd9\xda].{1,74}?)\.\s+"
+            r"([A-Z\xc0\xc8\xc9\xcc\xcd\xd2\xd3\xd9\xda\(\d].{4,})",
+            rest,
+        )
+        if title_m:
+            count += 1
+            return (
+                f"### Art. {num}. {title_m.group(1)}.\n\n"
+                f"{title_m.group(2).strip()}"
+            )
+        if rest:
+            count += 1
+            return f"### Art. {num}.\n\n{rest}"
+        count += 1
+        return f"### Art. {num}."
+
+    text = re.sub(
+        r"^-\s+Art\.\s+([\d]+[a-z\-]*)\.\s*(.*)",
+        _repl,
+        text,
+        flags=re.MULTILINE,
+    )
+    return text, count
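The all-caps detection and sentence-casing helpers above can be tried in isolation. The sketch below copies their logic under the hypothetical public names `sentence_case` and `is_allcaps_line`; the sample lines are assumptions chosen for illustration.

```python
def sentence_case(s: str) -> str:
    # Lowercase everything, then capitalize only the first character.
    if not s:
        return s
    lower = s.lower()
    return lower[0].upper() + lower[1:]


def is_allcaps_line(line: str) -> bool:
    # A line counts as all-caps when it has at least 3 letters, all uppercase,
    # and is neither a Markdown header nor a table row.
    stripped = line.strip()
    letters = [c for c in stripped if c.isalpha()]
    return (
        len(letters) >= 3
        and all(c.isupper() for c in letters)
        and not stripped.startswith("#")
        and not stripped.startswith("|")
    )


# Hypothetical sample lines.
for line in ["FUNZIONI CONTINUE", "# TITOLO", "AB", "Testo normale"]:
    if is_allcaps_line(line):
        print(f"## {sentence_case(line)}")
```

Digits and punctuation are ignored by the check, so a line such as `3. FUNZIONI` still counts as all-caps.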
@@ -0,0 +1,82 @@
+"""PDF validation (via pdfplumber) and metadata extraction (via fitz)."""
+import re
+from pathlib import Path
+
+
+def validate_pdf(pdf_path: Path) -> tuple[bool, str]:
+    """Check existence, readability, and presence of extractable digital text."""
+    if not pdf_path.exists():
+        return False, f"File non trovato: {pdf_path}"
+    if pdf_path.suffix.lower() != ".pdf":
+        return False, f"Non è un PDF: {pdf_path.name}"
+    size = pdf_path.stat().st_size
+    if size == 0:
+        return False, "File vuoto"
+    if size < 1024:
+        return False, f"File troppo piccolo ({size} byte) — probabilmente corrotto"
+
+    try:
+        import pdfplumber
+        with pdfplumber.open(pdf_path) as pdf:
+            n_pages = len(pdf.pages)
+            if n_pages == 0:
+                return False, "PDF senza pagine"
+            sample = min(5, n_pages)
+            pages_with_text = sum(
+                1 for i in range(sample)
+                if len((pdf.pages[i].extract_text() or "").strip()) > 50
+            )
+            if pages_with_text == 0:
+                extended = min(15, n_pages)
+                if extended > sample:
+                    ext_with_text = sum(
+                        1 for i in range(sample, extended)
+                        if len((pdf.pages[i].extract_text() or "").strip()) > 50
+                    )
+                    if ext_with_text > 0:
+                        return True, (
+                            f"{n_pages} pagine — prime {sample} vuote, "
+                            f"testo trovato in pagine successive "
+                            f"(possibile copertina immagine)"
+                        )
+                return False, (
+                    f"Nessun testo nelle prime {extended} pagine "
+                    f"— probabilmente scansionato (OCR non supportato)"
+                )
+        return True, f"{n_pages} pagine, testo digitale confermato"
+    except MemoryError:
+        return False, "Memoria esaurita durante l'apertura del PDF"
+    except Exception as e:
+        msg = str(e).lower()
+        if "password" in msg or "encrypted" in msg:
+            return False, "PDF protetto da password"
+        return False, f"Impossibile aprire: {e}"
+
+
+def extract_metadata(pdf_path: Path) -> dict:
+    """
+    Extract title, author, year, and page count from the PDF via fitz.
+    Returns a dict whose keys are always present (empty string when a value
+    is missing; `pages` is 0 when the PDF cannot be read).
+    """
+    try:
+        import fitz
+        doc  = fitz.open(str(pdf_path))
+        meta = doc.metadata or {}  # metadata can be None; stage 1 guards the same way
+        pages = len(doc)
+        doc.close()
+
+        year = ""
+        creation = meta.get("creationDate", "")
+        m = re.match(r"D:(\d{4})", creation)
+        if m:
+            year = m.group(1)
+
+        return {
+            "source": pdf_path.name,
+            "title":  (meta.get("title")  or "").strip(),
+            "author": (meta.get("author") or "").strip(),
+            "year":   year,
+            "pages":  pages,
+        }
+    except Exception:
+        return {"source": pdf_path.name, "title": "", "author": "", "year": "", "pages": 0}
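The year extraction above relies on the PDF date string convention, where creation dates start with `D:` followed by the four-digit year. A minimal sketch of that step on its own, using a hypothetical helper name `year_from_creation_date` and made-up sample values:

```python
import re


def year_from_creation_date(creation: str) -> str:
    # PDF date strings look like "D:YYYYMMDDHHmmSS+HH'mm'"; only the year is kept.
    m = re.match(r"D:(\d{4})", creation)
    return m.group(1) if m else ""


# Hypothetical inputs: a well-formed date, an empty field, and a malformed one.
for raw in ["D:20240512112124+02'00'", "", "2024-05-12"]:
    print(repr(raw), "->", repr(year_from_creation_date(raw)))
```

Because `re.match` anchors at the start of the string, dates that do not begin with the `D:` prefix produce an empty year rather than a guess.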
@@ -0,0 +1,44 @@
+"""Intermediate pipeline data structures: Block, Section, FontProfile."""
+from __future__ import annotations
+
+from dataclasses import dataclass, field
+
+
+@dataclass
+class Block:
+    text: str
+    page: int
+    bbox: tuple[float, float, float, float]  # x0, y0, x1, y1
+    font_size: float
+    font_name: str
+    is_bold: bool
+    block_type: str = "paragraph"  # paragraph|header_candidate|list_item|table|ignore
+    space_before: float = 0.0
+    level: int = 0                 # assigned by stage5 (0 = not a header)
+    origin_spans: list[dict] = field(default_factory=list, repr=False)
+
+    @property
+    def x0(self) -> float: return self.bbox[0]
+    @property
+    def y0(self) -> float: return self.bbox[1]
+    @property
+    def x1(self) -> float: return self.bbox[2]
+    @property
+    def y1(self) -> float: return self.bbox[3]
+
+
+@dataclass
+class Section:
+    title: str
+    level: int           # 1, 2, 3
+    content: list[Block] = field(default_factory=list)
+    children: list[Section] = field(default_factory=list)
+    page_start: int = 0
+    source_block: Block | None = field(default=None, repr=False)
+
+
+@dataclass
+class FontProfile:
+    body_size: float
+    cluster_map: dict[float, int]   # rounded font_size → level (1/2/3)
+    header_sizes: list[float]       # candidate header sizes, sorted descending
@@ -0,0 +1,135 @@
+import json
+import re
+from datetime import datetime
+from pathlib import Path
+
+from .structure import _parse_sections_with_body
+from ._constants import _MATH_SYMBOLS_RE, _EXERCISE_TRIGGER_RE, _MATH_HDR_RE
+
+
+def build_report(
+    stem: str,
+    out_dir: Path,
+    clean_text: str,
+    t_stats: dict,
+    profile: dict,
+    reduction: float,
+) -> Path:
+    text_lines = clean_text.split("\n")
+
+    sections = _parse_sections_with_body(clean_text, 3)
+    lengths  = [len(body) for _, body in sections]
+
+    def _pct(data: list[int], p: float) -> int:
+        if not data:
+            return 0
+        s = sorted(data)
+        return s[max(0, min(len(s) - 1, int(len(s) * p)))]
+
+    distribution = {
+        "min":     min(lengths) if lengths else 0,
+        "p25":     _pct(lengths, 0.25),
+        "mediana": _pct(lengths, 0.50),
+        "p75":     _pct(lengths, 0.75),
+        "max":     max(lengths) if lengths else 0,
+    }
+
+    bare_hdrs = [
+        {"header": hdr, "corpo_inizio": body[:120].replace("\n", " ")}
+        for hdr, body in sections
+        if re.match(r"^### \d+\.\s*$", hdr) and len(body.strip()) < 30
+    ]
+    short_secs = [
+        {"header": hdr, "chars": length, "testo": body[:80].replace("\n", " ")}
+        for (hdr, body), length in zip(sections, lengths)
+        if 0 < length < 150
+    ]
+    long_secs = [
+        {"header": hdr, "chars": length}
+        for (hdr, _), length in zip(sections, lengths)
+        if length > 1500
+    ]
+
+    def _scan(pattern: str, max_n: int = 10) -> list[dict]:
+        hits = []
+        for i, line in enumerate(text_lines):
+            if re.search(pattern, line) and not re.match(r"^#+ ", line):
+                hits.append({"riga": i + 1, "testo": line.strip()[:120]})
+                if len(hits) >= max_n:
+                    break
+        return hits
+
+    def _scan_formula_headers(max_n: int = 10) -> list[dict]:
+        hits = []
+        for i, line in enumerate(text_lines):
+            m = _MATH_HDR_RE.match(line)
+            if not m:
+                continue
+            body = m.group(2)
+            if len(body) <= 100:
+                continue
+            has_math = len(_MATH_SYMBOLS_RE.findall(body)) >= 3
+            has_ex   = bool(_EXERCISE_TRIGGER_RE.search(body))
+            if has_math or has_ex:
+                hits.append({"riga": i + 1, "testo": line.strip()[:120]})
+                if len(hits) >= max_n:
+                    break
+        return hits
+
+    residui = {
+        "backtick":         _scan(r"`"),
+        "dotleader":        _scan(r"(?:\. ){5,}"),
+        "url":              _scan(r"^(https?://|www\.)\S+"),
+        "immagini":         _scan(r"!\[[^\]]*\]\([^)]*\)"),
+        "br_inline":        _scan(r"<br>"),
+        "simboli_encoding": _scan(r'(?<=[0-9A-Za-z])[!"](?=[0-9A-Za-z])'),
+        "formule_inline":   _scan(r"\[\d+\.\d+\]"),
+        "footnote_markers": _scan(r'[¹²³⁰⁴-⁹]'),
+        "pua_markers":      _scan(r'[-]'),
+        "formula_headers":  _scan_formula_headers(),
+    }
+
+    report = {
+        "stem":      stem,
+        "timestamp": datetime.now().strftime("%Y-%m-%d %H:%M"),
+        "transforms": {
+            **t_stats,
+            "riduzione_pct": round(reduction),
+        },
+        "structure":    profile,
+        "distribution": distribution,
+        "anomalie": {
+            "bare_headers":        len(bare_hdrs),
+            "short_sections":      len(short_secs),
+            "long_sections":       len(long_secs),
+            "bare_headers_list":   bare_hdrs,
+            "short_sections_list": short_secs,
+            "long_sections_list":  long_secs,
+        },
+        "residui": {
+            "backtick":                   len(residui["backtick"]),
+            "dotleader":                  len(residui["dotleader"]),
+            "url":                        len(residui["url"]),
+            "immagini":                   len(residui["immagini"]),
+            "br_inline":                  len(residui["br_inline"]),
+            "simboli_encoding":           len(residui["simboli_encoding"]),
+            "formule_inline":             len(residui["formule_inline"]),
+            "footnote_markers":           len(residui["footnote_markers"]),
+            "pua_markers":                len(residui["pua_markers"]),
+            "backtick_esempi":            residui["backtick"],
+            "dotleader_esempi":           residui["dotleader"],
+            "url_esempi":                 residui["url"],
+            "immagini_esempi":            residui["immagini"],
+            "br_inline_esempi":           residui["br_inline"],
+            "simboli_encoding_esempi":    residui["simboli_encoding"],
+            "formule_inline_esempi":      residui["formule_inline"],
+            "footnote_markers_esempi":    residui["footnote_markers"],
+            "pua_markers_esempi":         residui["pua_markers"],
+            "formula_headers":            len(residui["formula_headers"]),
+            "formula_headers_esempi":     residui["formula_headers"],
+        },
+    }
+
+    report_path = out_dir / "report.json"
+    report_path.write_text(json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8")
+    return report_path
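The `_pct` helper inside `build_report` picks an element by index instead of interpolating. The sketch below reproduces that logic under the hypothetical name `pct`, with made-up section lengths, to show which values end up in the `distribution` block.

```python
def pct(data: list[int], p: float) -> int:
    # Index-based percentile: sort, then take the element at floor(len * p),
    # clamped to the valid index range. No interpolation is performed.
    if not data:
        return 0
    s = sorted(data)
    return s[max(0, min(len(s) - 1, int(len(s) * p)))]


# Hypothetical section lengths in characters.
lengths = [120, 80, 1500, 300, 950]

summary = {
    "p25": pct(lengths, 0.25),
    "median": pct(lengths, 0.50),
    "p75": pct(lengths, 0.75),
}
print(summary)
```

One consequence of this design is that, for an even number of sections, the reported median is the upper of the two middle values rather than their average.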
@@ -0,0 +1,220 @@
+"""Orchestration of the 9-stage PDF → Markdown pipeline."""
+import json
+import sys
+import threading
+import time
+from pathlib import Path
+
+from .extract     import validate_pdf, extract_metadata
+from .stage1_metadata import extract_raw_data_with_pdfplumber_fallback as extract_raw_data
+from .stage2_layout   import analyze_layout
+from .stage3_font     import build_font_profile
+from .stage4_headers  import classify_blocks
+from .stage5_hierarchy import infer_hierarchy
+from .stage6_tree     import build_tree
+from .stage7_markdown import serialize_tree
+from .stage8_normalize import normalize_hierarchy
+from .stage9_validate  import validate_markdown
+from .structure   import analyze
+from .report      import build_report
+from .validator   import _score, _grade
+
+
+_LIVELLO_DESC = {3: "ricca (h3)", 2: "parziale (h2)", 1: "paragrafi", 0: "testo piatto"}
+_SPIN_FRAMES  = "⠋⠙⠹⠸⠼⠴⠦⠧⠇⠏"
+
+
+def _build_frontmatter(meta: dict) -> str:
+    lines = ["---", f"source: {meta['source']}"]
+    if meta.get("title"):
+        lines.append(f'title: "{meta["title"]}"')
+    if meta.get("author"):
+        lines.append(f'author: "{meta["author"]}"')
+    if meta.get("year"):
+        lines.append(f"year: {meta['year']}")
+    if meta.get("pages"):
+        lines.append(f"pages: {meta['pages']}")
+    lines += ["---", ""]
+    return "\n".join(lines) + "\n"
+
+
+class _Spinner:
+    def __init__(self, prefix: str):
+        self._prefix = prefix
+        self._stop   = threading.Event()
+        self._thread = threading.Thread(target=self._run, daemon=True)
+        self._t0     = 0.0
+
+    def __enter__(self):
+        self._t0 = time.perf_counter()
+        self._thread.start()
+        return self
+
+    def __exit__(self, *_):
+        self._stop.set()
+        self._thread.join()
+        sys.stdout.write("\r" + " " * 72 + "\r")
+        sys.stdout.flush()
+
+    def _run(self):
+        i = 0
+        while not self._stop.wait(0.1):
+            elapsed = time.perf_counter() - self._t0
+            frame   = _SPIN_FRAMES[i % len(_SPIN_FRAMES)]
+            sys.stdout.write(f"\r  {frame} {self._prefix}  {elapsed:.0f}s")
+            sys.stdout.flush()
+            i += 1
+
+
+def run(stem: str, project_root: Path, force: bool) -> bool:
+    pdf_path  = project_root / "sources" / f"{stem}.pdf"
+    out_dir   = project_root / "conversione" / stem
+    raw_out   = out_dir / "raw.md"
+    clean_out = out_dir / "clean.md"
+
+    print(f"\n{'─' * 52}")
+    print(f"  {stem}")
+    print(f"{'─' * 52}")
+
+    if clean_out.exists() and not force:
+        print(f"  ⚠️  conversione/{stem}/clean.md già presente — skip")
+        print(f"      (usa --force per rieseguire)")
+        return True
+
+    # ── [1] PDF validation ────────────────────────────────────────────────────
+    print("  [1/9] Validazione PDF...")
+    pdf_mb = pdf_path.stat().st_size / (1024 * 1024) if pdf_path.exists() else 0
+    print(f"     File: {pdf_path.name}  ({pdf_mb:.1f} MB)")
+
+    ok, msg = validate_pdf(pdf_path)
+    if not ok:
+        print(f"  ✗ {msg}")
+        return False
+    print(f"  ✅ {msg}")
+
+    meta = extract_metadata(pdf_path)
+    meta["source"] = pdf_path.name
+    if meta.get("title"):
+        print(f"     Titolo:  {meta['title']}")
+    if meta.get("author"):
+        print(f"     Autore:  {meta['author']}")
+
+    # ── [2] Stage 1: span extraction ──────────────────────────────────────────
+    print("  [2/9] Stage 1: Estrazione span PyMuPDF...")
+    with _Spinner("Lettura PDF con PyMuPDF..."):
+        try:
+            raw_blocks, doc_meta = extract_raw_data(pdf_path)
+        except Exception as e:
+            print(f"  ✗ Estrazione fallita: {e}")
+            return False
+
+    print(f"  ✅ {len(raw_blocks)} span estratti da {doc_meta['page_count']} pagine")
+    toc_entries = len(doc_meta.get("toc", []))
+    if toc_entries:
+        print(f"     TOC: {toc_entries} voci")
+
+    # ── [3] Stage 2: layout ───────────────────────────────────────────────────
+    print("  [3/9] Stage 2: Analisi layout e reading order...")
+    with _Spinner("Analisi layout..."):
+        blocks = analyze_layout(raw_blocks, doc_meta)
+    print(f"  ✅ {len(blocks)} blocchi dopo layout analysis")
+
+    # ── [4] Stage 3: font analysis ────────────────────────────────────────────
+    print("  [4/9] Stage 3: Font analysis...")
+    profile = build_font_profile(blocks)
+    print(f"  ✅ Body size: {profile.body_size}pt  "
+          f"Header sizes: {profile.header_sizes}")
+
+    # ── [5] Stage 4: header detection ─────────────────────────────────────────
+    print("  [5/9] Stage 4: Header detection...")
+    blocks = classify_blocks(blocks, profile)
+    n_candidates = sum(1 for b in blocks if b.block_type == "header_candidate")
+    print(f"  ✅ {n_candidates} header candidate rilevati")
+
+    # ── [6] Stage 5: hierarchy inference ─────────────────────────────────────
+    print("  [6/9] Stage 5: Hierarchy inference...")
+    blocks = infer_hierarchy(blocks, profile, doc_meta.get("toc", []))
+    from collections import Counter
+    level_dist = Counter(b.level for b in blocks if b.block_type == "header_candidate")
+    print(f"  ✅ H1={level_dist.get(1,0)}  H2={level_dist.get(2,0)}  H3={level_dist.get(3,0)}")
+
+    # ── [7] Stage 6: document tree ────────────────────────────────────────────
+    print("  [7/9] Stage 6: Document tree reconstruction...")
+    tree = build_tree(blocks)
+    print(f"  ✅ {len(tree)} sezioni radice")
+
+    # ── [8] Stage 7: markdown generation ─────────────────────────────────────
+    print("  [8/9] Stage 7: Markdown generation...")
+    with _Spinner("Serializzazione albero..."):
+        raw_md = serialize_tree(tree, meta, pdf_path=pdf_path)
+
+    size_kb = len(raw_md.encode()) // 1024
+    n_lines = raw_md.count("\n")
+    print(f"  ✅ raw.md: {size_kb} KB, {n_lines} righe")
+
+    # Write raw.md (immutable: overwritten only when --force is given)
+    try:
+        out_dir.mkdir(parents=True, exist_ok=True)
+        if not raw_out.exists() or force:
+            raw_out.write_text(raw_md, encoding="utf-8")
+    except PermissionError as e:
+        print(f"  ✗ Permesso negato durante la scrittura: {e}")
+        return False
+
+    # ── [9] Stage 8+9: normalization + validation ─────────────────────────────
+    print("  [9/9] Stage 8-9: Normalize + validate...")
+    clean_md, norm_stats = normalize_hierarchy(raw_md)
+    validation = validate_markdown(clean_md, meta.get("pages", 0))
+
+    if norm_stats["n_level_jumps_repaired"]:
+        print(f"     Salti livello riparati:    {norm_stats['n_level_jumps_repaired']}")
+    if norm_stats["n_empty_headers_removed"]:
+        print(f"     Header vuoti rimossi:      {norm_stats['n_empty_headers_removed']}")
+    if norm_stats["n_duplicate_headers_removed"]:
+        print(f"     Header duplicati rimossi:  {norm_stats['n_duplicate_headers_removed']}")
+
+    for w in validation.warnings:
+        print(f"     ⚠️  {w}")
+    for e in validation.errors:
+        print(f"     ✗  {e}")
+
+    # Prepend the frontmatter to clean.md
+    frontmatter = _build_frontmatter(meta)
+    full_clean  = frontmatter + clean_md
+
+    try:
+        clean_out.write_text(full_clean, encoding="utf-8")
+    except PermissionError as e:
+        print(f"  ✗ Permesso negato durante la scrittura di clean.md: {e}")
+        return False
+
+    print(f"  ✅ clean.md scritto")
+
+    # ── Structure analysis + report + score ───────────────────────────────────
+    profile_struct = analyze(clean_out)
+    (out_dir / "structure_profile.json").write_text(
+        json.dumps(profile_struct, ensure_ascii=False, indent=2), encoding="utf-8"
+    )
+
+    print(f"     Struttura: livello {profile_struct['livello_struttura']} — "
+          f"{_LIVELLO_DESC[profile_struct['livello_struttura']]}")
+    print(f"     h1={profile_struct['n_h1']}  h2={profile_struct['n_h2']}  "
+          f"h3={profile_struct['n_h3']}  paragrafi={profile_struct['n_paragrafi']}")
+    print(f"     Strategia chunking: {profile_struct['strategia_chunking']}")
+    print(f"     Lingua rilevata:    {profile_struct['lingua_rilevata']}")
+    for w in profile_struct["avvertenze"]:
+        print(f"     ⚠️  {w}")
+
+    t_stats = {
+        **norm_stats,
+        "validation": validation.to_dict(),
+    }
+    reduction = 100.0 * (1 - len(clean_md) / len(raw_md)) if raw_md else 0.0
+    report_path = build_report(stem, out_dir, full_clean, t_stats, profile_struct, reduction)
+    report_data = json.loads(report_path.read_text(encoding="utf-8"))
+    score, _    = _score(report_data)
+
+    print(f"\n  Output → conversione/{stem}/")
+    print(f"    raw.md   (immutabile)  clean.md   report.json")
+    print(f"  Punteggio qualità: {score}/100 {_grade(score)}")
+    return True
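The frontmatter written at the top of `clean.md` can be previewed with a standalone copy of `_build_frontmatter`. In the sketch below the function is renamed `build_frontmatter` and the metadata values are invented for illustration.

```python
def build_frontmatter(meta: dict) -> str:
    # "source" is always emitted; the other fields appear only when non-empty.
    lines = ["---", f"source: {meta['source']}"]
    if meta.get("title"):
        lines.append(f'title: "{meta["title"]}"')
    if meta.get("author"):
        lines.append(f'author: "{meta["author"]}"')
    if meta.get("year"):
        lines.append(f"year: {meta['year']}")
    if meta.get("pages"):
        lines.append(f"pages: {meta['pages']}")
    lines += ["---", ""]
    return "\n".join(lines) + "\n"


# Hypothetical metadata: the author is missing, so that line is skipped.
meta = {"source": "analisi.pdf", "title": "Analisi 1", "author": "", "year": "2024", "pages": 10}
frontmatter = build_frontmatter(meta)
print(frontmatter)
```

Titles and authors are inserted between double quotes without escaping, so a value that itself contains a double quote would likely need extra handling before the block is parsed as YAML.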
@@ -0,0 +1,260 @@
+"""Stage 1: raw span extraction from the PDF with PyMuPDF, plus document metadata."""
+from pathlib import Path
+
+import fitz  # PyMuPDF
+
+from .models import Block
+
+
+_BOLD_FONT_KEYWORDS = ("bold", "heavy", "black", "demi", "semibold")
+
+# PUA (U+F000–U+F0FF) → Unicode map for the Symbol font and LaTeX math fonts.
+# Keys are characters in the PUA range as extracted by PyMuPDF.
+_SYMBOL_PUA_MAP: dict[str, str] = {
+    '': ' ',  '': '!',  '': '∀',  '': '#',
+    '': '∃',  '': '%',  '': '&',  '': '∋',
+    '': '(',  '': ')',  '': '∗',  '': '+',
+    '': ',',  '': '−',  '': '.',  '': '/',
+    '': '0',  '': '1',  '': '2',  '': '3',
+    '': '4',  '': '5',  '': '6',  '': '7',
+    '': '8',  '': '9',  '': ':',  '': ';',
+    '': '<',  '': '=',  '': '>',  '': '?',
+    '': '≅',  '': 'Α',  '': 'Β',  '': 'Χ',
+    '': 'Δ',  '': 'Ε',  '': 'Φ',  '': 'Γ',
+    '': 'Η',  '': 'Ι',  '': 'ϑ',  '': 'Κ',
+    '': 'Λ',  '': 'Μ',  '': 'Ν',  '': 'Ο',
+    '': 'Π',  '': 'Θ',  '': 'Ρ',  '': 'Σ',
+    '': 'Τ',  '': 'Υ',  '': 'ς',  '': 'Ω',
+    '': 'Ξ',  '': 'Ψ',  '': 'Ζ',  '': '[',
+    '': '∴',  '': ']',  '': '⊥',  '': '_',
+    '': 'α',  '': 'β',  '': 'χ',  '': 'δ',
+    '': 'ε',  '': 'φ',  '': 'γ',  '': 'η',
+    '': 'ι',  '': 'ϕ',  '': 'κ',  '': 'λ',
+    '': 'μ',  '': 'ν',  '': 'ο',  '': 'π',
+    '': 'θ',  '': 'ρ',  '': 'σ',  '': 'τ',
+    '': 'υ',  '': 'ϖ',  '': 'ω',  '': 'ξ',
+    '': 'ψ',  '': 'ζ',  '': '{',  '': '|',
+    '': '}',  '': '∼',
+    '': 'ϒ',  '': '′',  '': '≤',  '': '⁄',
+    '': '∞',  '': 'ƒ',  '': '♣',  '': '♦',
+    '': '♥',  '': '♠',  '': '↔',  '': '←',
+    '': '↑',  '': '→',  '': '↓',
+    '': '°',  '': '±',  '': '″',  '': '≥',
+    '': '×',  '': '∝',  '': '∂',  '': '•',
+    '': '÷',  '': '≠',  '': '≡',  '': '≈',
+    '': '…',  '': '|',  '': '–',
+    '': 'ℵ',  '': 'ℑ',  '': 'ℜ',  '': '℘',
+    '': '⊗',  '': '⊕',  '': '∅',  '': '∩',
+    '': '∪',  '': '⊃',  '': '⊇',  '': '⊄',
+    '': '⊂',  '': '⊆',  '': '∈',  '': '∉',
+    '': '∠',  '': '∇',  '': '∏',  '': '©',
+    '': '™',  '': '∏',  '': '√',  '': '⋅',
+    '': '¬',  '': '∧',  '': '∨',
+    '': '◊',  '': '⟨',  '': '∑',
+    '': '⟩',  '': '∫',  '': '⌠',  '': '⌡',
+}
+
+# Fonts that typically contain PUA math characters (LaTeX and Symbol)
+_MATH_FONT_KEYWORDS = ("symbol", "cmmi", "cmsy", "msam", "msbm", "eurm", "cmex", "math")
+
+
+def _clean_pua(text: str) -> str:
+    """
+    Apply the PUA→Unicode mapping to ALL extracted text.
+    Converts characters in the U+F000–U+F0FF range using _SYMBOL_PUA_MAP;
+    unmapped PUA characters are dropped (replaced with the empty string).
+    """
+    result = []
+    for ch in text:
+        cp = ord(ch)
+        if 0xF000 <= cp <= 0xF0FF:
+            mapped = _SYMBOL_PUA_MAP.get(ch)
+            if mapped is not None:
+                result.append(mapped)
+            # unmapped PUA character → discard (unreadable artifact)
+        else:
+            result.append(ch)
+    return ''.join(result)
+
+
+def _is_bold_span(span: dict) -> bool:
+    if span["flags"] & 16:
+        return True
+    return any(kw in span["font"].lower() for kw in _BOLD_FONT_KEYWORDS)
+
+
+def _extract_page_blocks(page: fitz.Page, page_num: int) -> list[Block]:
+    page_dict = page.get_text("dict")
+    blocks: list[Block] = []
+    prev_y1 = 0.0
+
+    for raw_block in page_dict["blocks"]:
+        if raw_block.get("type") != 0:  # skip image blocks
+            continue
+
+        for line in raw_block.get("lines", []):
+            spans = line.get("spans", [])
+            if not spans:
+                continue
+
+            # Group consecutive spans on the same line that share font and size into one Block
+            groups: list[list[dict]] = []
+            current: list[dict] = []
+            for sp in spans:
+                if not current:
+                    current.append(sp)
+                elif (
+                    round(sp["size"], 1) == round(current[0]["size"], 1)
+                    and sp["font"] == current[0]["font"]
+                ):
+                    current.append(sp)
+                else:
+                    groups.append(current)
+                    current = [sp]
+            if current:
+                groups.append(current)
+
+            for group in groups:
+                text = _clean_pua("".join(s["text"] for s in group).strip())
+                if not text:
+                    continue
+
+                first = group[0]
+                bbox = (
+                    min(s["bbox"][0] for s in group),
+                    min(s["bbox"][1] for s in group),
+                    max(s["bbox"][2] for s in group),
+                    max(s["bbox"][3] for s in group),
+                )
+                y0 = bbox[1]
+                space_before = max(0.0, y0 - prev_y1)
+
+                is_bold = _is_bold_span(first)
+                font_size = round(first["size"], 2)
+
+                # Superscript (flags & 1) → provisionally marked "ignore"
+                block_type = "ignore" if (first["flags"] & 1) else "paragraph"
+
+                block = Block(
+                    text=text,
+                    page=page_num,
+                    bbox=bbox,
+                    font_size=font_size,
+                    font_name=first["font"],
+                    is_bold=is_bold,
+                    block_type=block_type,
+                    space_before=space_before,
+                    origin_spans=group,
+                )
+                blocks.append(block)
+                prev_y1 = bbox[3]
+
+    return blocks
+
+
+def extract_raw_data(pdf_path: Path) -> tuple[list[Block], dict]:
+    """
+    Open the PDF with PyMuPDF and extract all Blocks plus document metadata.
+
+    Returns:
+        blocks:   list of Block ordered by page (then by y0/x0 in stage2)
+        doc_meta: dict with toc, page_count, page_dimensions, title, author, year
+    """
+    doc = fitz.open(str(pdf_path))
+
+    toc = doc.get_toc()  # [(level, title, page), ...]
+    page_count = len(doc)
+    page_dimensions = [(p.rect.width, p.rect.height) for p in doc]
+
+    raw_meta = doc.metadata or {}
+
+    import re
+    year = ""
+    creation = raw_meta.get("creationDate", "")
+    m = re.match(r"D:(\d{4})", creation)
+    if m:
+        year = m.group(1)
+
+    doc_meta = {
+        "toc": toc,
+        "page_count": page_count,
+        "page_dimensions": page_dimensions,
+        "title": (raw_meta.get("title") or "").strip(),
+        "author": (raw_meta.get("author") or "").strip(),
+        "year": year,
+    }
+
+    all_blocks: list[Block] = []
+    for page_num, page in enumerate(doc, start=1):
+        page_blocks = _extract_page_blocks(page, page_num)
+        all_blocks.extend(page_blocks)
+
+    doc.close()
+    return all_blocks, doc_meta
+
+
+def extract_raw_data_with_pdfplumber_fallback(pdf_path: Path) -> tuple[list[Block], dict]:
+    """
+    Extract Blocks with PyMuPDF; for pages whose text is < 100 characters
+    (but the page is not blank), use pdfplumber as a fallback and add a
+    synthetic "paragraph" Block carrying the alternative text.
+
+    The original `extract_raw_data` function is left unchanged.
+    """
+    all_blocks, doc_meta = extract_raw_data(pdf_path)
+
+    # Group blocks by page to measure how many characters each page has
+    from collections import defaultdict
+    blocks_by_page: dict[int, list[Block]] = defaultdict(list)
+    for b in all_blocks:
+        blocks_by_page[b.page].append(b)
+
+    page_count = doc_meta["page_count"]
+    sparse_pages = []
+    for page_num in range(1, page_count + 1):
+        page_blocks = blocks_by_page.get(page_num, [])
+        total_chars = sum(len(b.text) for b in page_blocks if b.block_type != "ignore")
+        if total_chars < 100:
+            sparse_pages.append(page_num)
+
+    if not sparse_pages:
+        return all_blocks, doc_meta
+
+    try:
+        import pdfplumber
+    except ImportError:
+        return all_blocks, doc_meta
+
+    try:
+        with pdfplumber.open(str(pdf_path)) as pdf:
+            for page_num in sparse_pages:
+                page_idx = page_num - 1
+                if page_idx >= len(pdf.pages):
+                    continue
+                page = pdf.pages[page_idx]
+                text = page.extract_text() or ""
+                text = text.strip()
+                if not text or len(text) < 20:
+                    continue  # page is genuinely empty
+
+                # Build a synthetic Block for the fallback text
+                w = page.width or 612
+                h = page.height or 792
+                fallback_block = Block(
+                    text=_clean_pua(text),
+                    page=page_num,
+                    bbox=(0.0, 0.0, float(w), float(h)),
+                    font_size=10.0,
+                    font_name="pdfplumber-fallback",
+                    is_bold=False,
+                    block_type="paragraph",
+                    space_before=0.0,
+                    origin_spans=[],
+                )
+                all_blocks.append(fallback_block)
+    except Exception:
+        pass  # if pdfplumber fails, keep the PyMuPDF blocks already collected
+
+    # Re-sort by page and position (fallback blocks were appended at the end)
+    all_blocks.sort(key=lambda b: (b.page, b.bbox[1], b.bbox[0]))
+    return all_blocks, doc_meta
@@ -0,0 +1,184 @@
+"""Stage 2: layout analysis: reading order, multi-column, multi-line header merge."""
+from collections import Counter
+
+from .models import Block
+
+
+_RECURRING_MIN_OCCURRENCES = 3
+_RECURRING_MAX_LEN = 100
+_RECURRING_PAGE_RATIO = 0.05  # minimum threshold: ≥5% of the document's pages
+
+
+def _mark_recurring_lines(blocks: list[Block]) -> list[Block]:
+    """
+    Mark as 'ignore' the short-text blocks that occur many times in the
+    document, typically repeated page headers/footers.
+
+    The threshold scales with document length: max(3, page_count * 5%),
+    so that section titles appearing only a few times in long documents
+    structured in parts (e.g. I/II/III) are not marked as recurring.
+    """
+    if not blocks:
+        return blocks
+    page_count = max(b.page for b in blocks)
+    threshold = max(_RECURRING_MIN_OCCURRENCES, int(page_count * _RECURRING_PAGE_RATIO))
+
+    counts = Counter(
+        b.text.strip()
+        for b in blocks
+        if 3 < len(b.text.strip()) < _RECURRING_MAX_LEN
+    )
+    recurring = {t for t, n in counts.items() if n >= threshold}
+    if not recurring:
+        return blocks
+    for b in blocks:
+        if b.text.strip() in recurring:
+            b.block_type = "ignore"
+    return blocks
+
+
+_COLUMN_GAP_RATIO   = 0.15   # minimum horizontal gap to detect columns (% page_width); not referenced in this module
+_COLUMN_THRESHOLD   = 0.40   # share of blocks per side required to declare a multi-column layout
+_MULTILINE_X_TOL    = 5.0    # tolerance (pt) for x0 alignment of consecutive lines (left-aligned text)
+_MULTILINE_CX_TOL   = 20.0   # tolerance (pt) for center alignment of centered lines
+
+
+def _detect_columns(blocks: list[Block], page_width: float) -> int:
+    """Return 1 (single column) or 2 (two columns)."""
+    if not blocks or page_width <= 0:
+        return 1
+    mid = page_width * 0.5
+    left  = sum(1 for b in blocks if b.x0 < mid)
+    right = sum(1 for b in blocks if b.x0 >= mid)
+    total = left + right
+    if total == 0:
+        return 1
+    if (left / total >= _COLUMN_THRESHOLD) and (right / total >= _COLUMN_THRESHOLD):
+        return 2
+    return 1
+
+
+def _reorder_two_columns(blocks: list[Block], page_width: float) -> list[Block]:
+    """Reorder blocks in a two-column layout: left column first, then right."""
+    mid = page_width * 0.5
+    left  = sorted([b for b in blocks if b.x0 < mid],  key=lambda b: b.y0)
+    right = sorted([b for b in blocks if b.x0 >= mid], key=lambda b: b.y0)
+    return left + right
+
+
+def _merge_multiline_headers(blocks: list[Block]) -> list[Block]:
+    """
+    Merge pairs of consecutive blocks that form a multi-line header:
+    same font_size, same x0 (±5pt) or same center (±20pt), vertical gap < 1.5×font_size.
+    """
+    if not blocks:
+        return blocks
+    result: list[Block] = []
+    i = 0
+    while i < len(blocks):
+        cur = blocks[i]
+        if i + 1 < len(blocks):
+            nxt = blocks[i + 1]
+            same_size  = round(cur.font_size, 1) == round(nxt.font_size, 1)
+            same_page  = cur.page == nxt.page
+            same_x     = abs(cur.x0 - nxt.x0) <= _MULTILINE_X_TOL
+            # Centered titles: different widths → different x0; check the center instead
+            cur_cx     = (cur.x0 + cur.x1) / 2
+            nxt_cx     = (nxt.x0 + nxt.x1) / 2
+            same_cx    = abs(cur_cx - nxt_cx) <= _MULTILINE_CX_TOL
+            aligned    = same_x or same_cx
+            gap        = nxt.y0 - cur.y1
+            # gap >= -3pt: the bboxes of consecutive lines can overlap slightly
+            # with tight-leading fonts; -3pt rules out cross-column merges (gap ≈ -800pt)
+            small_gap  = -3 <= gap < 1.5 * cur.font_size
+            both_short = len(cur.text) < 120 and len(nxt.text) < 120
+            # Do not merge a body-text block with a title: body text ends
+            # with ! or ? and contains lowercase (end of sentence), while a title is ALLCAPS/short.
+            cur_stripped = cur.text.strip()
+            body_sentence_end = (
+                cur_stripped.endswith(("!", "?"))
+                and any(c.islower() for c in cur_stripped)
+            )
+            if same_size and same_page and aligned and small_gap and both_short and not body_sentence_end:
+                merged = Block(
+                    text=cur.text + " " + nxt.text,
+                    page=cur.page,
+                    bbox=(cur.x0, cur.y0, max(cur.x1, nxt.x1), nxt.y1),
+                    font_size=cur.font_size,
+                    font_name=cur.font_name,
+                    is_bold=cur.is_bold or nxt.is_bold,
+                    block_type=cur.block_type,
+                    space_before=cur.space_before,
+                    origin_spans=cur.origin_spans + nxt.origin_spans,
+                )
+                result.append(merged)
+                i += 2
+                continue
+
+        result.append(cur)
+        i += 1
+    return result
+
+
+def _recompute_space_before(blocks: list[Block]) -> list[Block]:
+    """Recompute space_before after any reordering.
+
+    Page break: use b.y0 as an estimate of the gap from the top of the new page
+    (minimum 50pt) so that the first block on each page gets the space_signal
+    even right after a page break (y coordinates reset between pages).
+    """
+    for i, b in enumerate(blocks):
+        if i == 0:
+            b.space_before = 0.0
+        elif b.page != blocks[i - 1].page:
+            b.space_before = max(b.y0, 50.0)
+        else:
+            b.space_before = max(0.0, b.y0 - blocks[i - 1].y1)
+    return blocks
+
+
+def analyze_layout(raw_blocks: list[Block], doc_meta: dict) -> list[Block]:
+    """
+    Arrange the Blocks extracted in Stage 1 into the correct reading order.
+
+    1. Group by page.
+    2. Sort each page by (y0, x0).
+    3. Detect a two-column layout → reorder left column first, then right.
+    4. Merge multi-line headers.
+    5. Recompute space_before, then mark recurring lines as 'ignore'.
+    """
+    if not raw_blocks:
+        return []
+
+    page_dimensions = doc_meta.get("page_dimensions", [])
+
+    # Group by page
+    pages: dict[int, list[Block]] = {}
+    for b in raw_blocks:
+        pages.setdefault(b.page, []).append(b)
+
+    ordered: list[Block] = []
+    for page_num in sorted(pages):
+        page_blocks = pages[page_num]
+        page_idx = page_num - 1
+        page_width = page_dimensions[page_idx][0] if page_idx < len(page_dimensions) else 595.0
+
+        # Sort by (y0, x0) before column detection
+        page_blocks.sort(key=lambda b: (b.y0, b.x0))
+
+        n_cols = _detect_columns(page_blocks, page_width)
+        if n_cols == 2:
+            page_blocks = _reorder_two_columns(page_blocks, page_width)
+
+        ordered.extend(page_blocks)
+
+    # Merge multi-line headers
+    ordered = _merge_multiline_headers(ordered)
+
+    # Recompute space_before
+    ordered = _recompute_space_before(ordered)
+
+    # Mark recurring blocks as ignore (page and chapter headers/footers)
+    ordered = _mark_recurring_lines(ordered)
+
+    return ordered
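The recurring-line threshold in `_mark_recurring_lines` can be checked in isolation. The following is a minimal sketch, assuming plain strings instead of `Block` objects; `recurring_texts` and its parameters are hypothetical names for illustration, not part of the module.

```python
from collections import Counter


def recurring_texts(texts, page_count, min_occurrences=3, page_ratio=0.05, max_len=100):
    """Return the short texts repeated often enough to count as page headers/footers."""
    # The threshold grows with the document: max(3, 5% of the pages).
    threshold = max(min_occurrences, int(page_count * page_ratio))
    counts = Counter(t.strip() for t in texts if 3 < len(t.strip()) < max_len)
    return {t for t, n in counts.items() if n >= threshold}


texts = ["Rivista di Studi"] * 10 + ["PARTE PRIMA"] * 3 + ["Body text that appears once."]
long_doc = recurring_texts(texts, page_count=200)   # threshold = max(3, 10) = 10
short_doc = recurring_texts(texts, page_count=20)   # threshold = max(3, 1) = 3
```

In the 200-page case only the running header is flagged; in the 20-page case the part title repeated three times is flagged as well, which is the trade-off the scaled threshold is meant to soften for long documents.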
@@ -0,0 +1,53 @@
+"""Stage 3: font analysis: detect body size and header clusters per document."""
+from collections import Counter
+
+from .models import Block, FontProfile
+
+
+def build_font_profile(blocks: list[Block]) -> FontProfile:
+    """
+    Determine body_size (mode of the font sizes) and build cluster_map
+    for the header levels (1=H1, 2=H2, 3=H3), inferred dynamically.
+    """
+    sizes = [
+        round(b.font_size, 1)
+        for b in blocks
+        if b.block_type != "ignore"
+    ]
+    if not sizes:
+        return FontProfile(body_size=11.0, cluster_map={}, header_sizes=[])
+
+    counter = Counter(sizes)
+    total = sum(counter.values())
+
+    # Body size = most frequent font size
+    body_size = counter.most_common(1)[0][0]
+
+    # Header candidates: size > body + 1pt, frequency < 30% of the total
+    raw_candidates = sorted(
+        {
+            s for s, c in counter.items()
+            if s > body_size + 1.0 and c / total < 0.30
+        },
+        reverse=True,
+    )
+
+    # Collapse clusters within ±0.5pt
+    collapsed: list[float] = []
+    for s in raw_candidates:
+        if collapsed and abs(s - collapsed[-1]) <= 0.5:
+            continue  # belongs to the previous (larger) cluster
+        collapsed.append(s)
+
+    header_sizes = collapsed[:3]  # at most 3 levels
+
+    # cluster_map: rounded size → level (1=largest, 2=medium, 3=smallest)
+    cluster_map: dict[float, int] = {}
+    for i, s in enumerate(header_sizes, start=1):
+        cluster_map[s] = i
+
+    return FontProfile(
+        body_size=body_size,
+        cluster_map=cluster_map,
+        header_sizes=header_sizes,
+    )
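To see how body size and header clusters fall out of a size histogram, here is a small worked sketch of the same steps on bare floats. It assumes a list of font sizes rather than `Block` objects, and `header_levels` is a hypothetical helper name.

```python
from collections import Counter


def header_levels(sizes, max_levels=3):
    """Return (body_size, {header_size: level}) from a list of font sizes."""
    counter = Counter(round(s, 1) for s in sizes)
    total = sum(counter.values())
    body = counter.most_common(1)[0][0]  # mode of the sizes
    # Candidates: clearly larger than the body and not too frequent.
    candidates = sorted(
        {s for s, c in counter.items() if s > body + 1.0 and c / total < 0.30},
        reverse=True,
    )
    collapsed = []
    for s in candidates:
        if collapsed and abs(s - collapsed[-1]) <= 0.5:
            continue  # within 0.5pt of the previous, larger size: same cluster
        collapsed.append(s)
    return body, {s: i for i, s in enumerate(collapsed[:max_levels], start=1)}


sizes = [11.0] * 80 + [12.0] * 10 + [14.0] * 5 + [17.8] * 2 + [18.0] * 3
body, levels = header_levels(sizes)
```

Here 12.0 is not a candidate because it is not strictly more than 1pt above the body, and 17.8 is absorbed into the 18.0 cluster, so only two header levels remain.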
@@ -0,0 +1,162 @@
+"""Stage 4: block classification: detect header candidates from combined signals."""
+import re
+
+from .models import Block, FontProfile
+
+
+# Hierarchical numbering with an explicit terminator: "1.", "1.2.", "1.2.3)" + UPPERCASE letter.
+# \s alone is not accepted as a separator, to avoid "1 La divisione..." (footnotes).
+_NUMBERED_SECTION_RE = re.compile(r"^\d+(\.\d+)*[.)]\s*[A-ZÀ-Ÿ]")
+_ARTICLE_RE = re.compile(r"^Art(?:icolo|\.)\s+\d+", re.IGNORECASE)
+# "CAPITOLO QUARTO." / "CHAPTER FOUR" / "CANTO XII": ALLCAPS structural keyword + ordinal/number/roman numeral.
+# Uppercase only: catches sections whose font is identical to the body (literary/academic PDFs)
+# but leaves sentence-case references in the body text untouched.
+_CHAPTER_WORD_RE = re.compile(
+    r"^(?:CAPITOLO|CHAPTER|CANTO)\s+(?:[A-ZÀ-Ÿ][A-ZÀ-Ÿ]+|\d+|[IVXLCDM]+)\b"
+)
+# "Capitolo 1: TITOLO" / "Chapter 3 – ..." (matched case-insensitively), used together with bold.
+# Catches chapters in technical/educational PDFs whose headers share the body size.
+_CHAPTER_WORD_BOLD_RE = re.compile(
+    r"^(?:Capitolo|Chapter)\s+\d+\b", re.IGNORECASE
+)
+_PURE_NUMBERS_RE = re.compile(r"^[\d\s\-\./,]+$")   # digits/punctuation only, no letters
+# Section symbol § followed by a number or roman numeral: "§ 1", "§ I.", "§ 12"
+_SECTION_SYMBOL_RE = re.compile(r"^§\s*[\dIVXivx]")
+# Dot leader: typical of TOCs and lists of figures (". . . . .")
+_DOT_LEADER_RE = re.compile(r"(?:\.[ ]){3,}")
+# TOC page reference: ", p. 42" (index entry), anywhere in the text,
+# including multiple page references (TOC lists with several entries)
+_TOC_PAGE_REF_RE = re.compile(r",?\s+p\.\s+\d+")
+# Standalone lowercase roman numeral: front-matter page number (i, ii, vii, xii…)
+_ROMAN_PAGE_RE = re.compile(r"^x{0,3}(?:ix|iv|v?i{0,3})$")
+_SHORT_LINE_THRESHOLD = 80   # characters
+_HEADER_SCORE_THRESHOLD = 3  # minimum score to become a header_candidate
+
+
+def _score_block(block: Block, body_size: float) -> int:
+    score = 0
+    text = block.text.strip()
+
+    # size_signal: font_size significantly larger than the body
+    if block.font_size >= body_size + 1.5:
+        score += 2
+
+    # bold_signal: bold AND font_size strictly larger than the body (at 0.1pt resolution).
+    # round() avoids false positives from floating-point noise in the PDF
+    # (e.g. 11.52 vs body_size 11.5 → same cluster, not a real header).
+    if block.is_bold and round(block.font_size, 1) > round(body_size, 1):
+        score += 1
+
+    # number_signal: hierarchical numbering ONLY if font > body + 0.5pt.
+    # Prevents numbered paragraphs at body font (e.g. "1. Lo spazio non è…")
+    # from being promoted to header just because they start with a number.
+    if _NUMBERED_SECTION_RE.match(text) and block.font_size > body_size + 0.5:
+        score += 2
+
+    # section_symbol_signal: § symbol (typical of philosophical/legal treatises).
+    # Threshold body-2.5pt: catches § at a reduced font (editorial variants of the PDF)
+    # but excludes marginal annotations at 8.2pt (§9, §10 as running notes).
+    if _SECTION_SYMBOL_RE.match(text) and block.font_size >= body_size - 2.5:
+        score += 2
+
+    # allcaps_signal: fully uppercase text with font ≥ body → part/chapter title.
+    # Threshold lowered to >= body_size: catches ALLCAPS sections in literary PDFs
+    # where the title font is identical to the body.
+    # Excluded when bold: bold+ALLCAPS at body_size marks in-text emphasis (cell headings,
+    # labels), not a structural section title.
+    alpha = re.sub(r"[^a-zA-ZÀ-ÿ]", "", text)
+    if (alpha and alpha == alpha.upper() and len(alpha) > 3
+            and block.font_size >= body_size and not block.is_bold):
+        score += 1
+
+    # length_signal: short line (titles are concise)
+    if len(text) < _SHORT_LINE_THRESHOLD:
+        score += 1
+
+    # space_signal: vertical space before the block > 1.5× font size
+    if block.space_before > 1.5 * block.font_size:
+        score += 1
+
+    return score
+
+
+def classify_blocks(blocks: list[Block], profile: FontProfile) -> list[Block]:
+    """
+    Assign block_type to every Block from a combination of signals.
+
+    Additional guards that block promotion to header_candidate:
+    - purely numeric text (page numbers, TOC ranges) and lowercase roman page numbers
+    - text starting with `|` (table-style chapter footers/headers), dot leaders, TOC page refs
+    - text that is too short (< 2 characters)
+    """
+    body_size = profile.body_size
+
+    for block in blocks:
+        # Leave protected earlier classifications untouched
+        if block.block_type in ("table", "ignore"):
+            continue
+
+        text = block.text.strip()
+        if not text or len(text) < 2:
+            block.block_type = "ignore"
+            continue
+
+        # Legal guard: code articles → always header candidate
+        if _ARTICLE_RE.match(text):
+            block.block_type = "header_candidate"
+            continue
+
+        # ALLCAPS literary guard: structural keyword + ordinal/number/roman numeral → always header candidate.
+        if _CHAPTER_WORD_RE.match(text) and len(text) < _SHORT_LINE_THRESHOLD:
+            block.block_type = "header_candidate"
+            continue
+
+        # Bold literary guard: bold "Capitolo 1: TITOLO", even at body size → header candidate.
+        if block.is_bold and _CHAPTER_WORD_BOLD_RE.match(text) and len(text) < _SHORT_LINE_THRESHOLD:
+            block.block_type = "header_candidate"
+            continue
+
+        # Guard: purely numeric text → standalone page number, ignore it
+        if _PURE_NUMBERS_RE.match(text):
+            block.block_type = "ignore"
+            continue
+
+        # Guard: standalone lowercase roman numeral → front-matter page number (vii, xii…)
+        if _ROMAN_PAGE_RE.match(text) and len(text) >= 2:
+            block.block_type = "ignore"
+            continue
+
+        # Guard: dot leader → TOC or list-of-figures line, not document text
+        if _DOT_LEADER_RE.search(text):
+            block.block_type = "ignore"
+            continue
+
+        # Guard: text starting with a pipe → chapter footer/header or table fragment
+        if text.startswith("|"):
+            block.block_type = "ignore"
+            continue
+
+        # Guard: index entry with a page reference → "§ 9. Titolo, p. 90."
+        if _TOC_PAGE_REF_RE.search(text):
+            block.block_type = "ignore"
+            continue
+
+        score = _score_block(block, body_size)
+        if score >= _HEADER_SCORE_THRESHOLD:
+            # Guard: a header candidate must start with an uppercase letter (after any numbers/symbols).
+            # Filters out LaTeX fragments such as "1 segue", "1 allora", "2) prodotto" that have
+            # a large font but are not section titles.
+            stripped_nums = re.sub(r"^[§\d\s\.\)\(\-]+", "", text)
+            if stripped_nums and stripped_nums[0].islower():
+                block.block_type = "paragraph"
+            else:
+                block.block_type = "header_candidate"
+        else:
+            # List detection: line starting with a bullet or a number followed by a period
+            stripped = text.lstrip()
+            if stripped.startswith(("- ", "* ", "• ", "· ")) or re.match(r"^\d+\.\s", stripped):
+                block.block_type = "list_item"
+            else:
+                block.block_type = "paragraph"
+
+    return blocks
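A worked example helps to see why a numbered heading clears the score threshold of 3 while a numbered body paragraph does not. This is a reduced sketch covering five of the seven signals in `_score_block` (the § and ALLCAPS signals are left out); `Blk` and `score` are hypothetical stand-ins for `Block` and `_score_block`.

```python
import re
from dataclasses import dataclass

# Same pattern as _NUMBERED_SECTION_RE in the stage 4 module.
NUMBERED = re.compile(r"^\d+(\.\d+)*[.)]\s*[A-ZÀ-Ÿ]")


@dataclass
class Blk:
    text: str
    font_size: float
    is_bold: bool = False
    space_before: float = 0.0


def score(b: Blk, body_size: float) -> int:
    text = b.text.strip()
    s = 0
    if b.font_size >= body_size + 1.5:                                 # size_signal
        s += 2
    if b.is_bold and round(b.font_size, 1) > round(body_size, 1):      # bold_signal
        s += 1
    if NUMBERED.match(text) and b.font_size > body_size + 0.5:         # number_signal
        s += 2
    if len(text) < 80:                                                 # length_signal
        s += 1
    if b.space_before > 1.5 * b.font_size:                             # space_signal
        s += 1
    return s


heading = score(Blk("2. Metodo", 14.0, space_before=30.0), body_size=11.0)
numbered_paragraph = score(Blk("1. Lo spazio non è vuoto", 11.0), body_size=11.0)
```

The heading collects size (2), numbering (2), length (1) and spacing (1); the body-size paragraph only collects the length point and would fall through to the list-item branch.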
@@ -0,0 +1,147 @@
+"""Stage 5: hierarchy inference: assign a level (1-3) to header candidates."""
+import re
+import unicodedata
+
+from .models import Block, FontProfile
+
+
+_NUMBERED_RE = re.compile(r"^(\d+(?:\.\d+)*)[.)\s]\s*[A-ZÀ-Ÿ]")
+_MIN_NUMBERED_FOR_RULE1 = 3  # threshold to enable Rule 1
+
+# "Capitolo 3 Titolo" / "Chapter 5 – Titolo": numbered sections introduced by the word
+# "Capitolo/Chapter" + integer (in sentence case, typically bold at body size).
+# If ≥3 blocks match, they are promoted to level 2 as primary sections.
+_CHAPTER_NUM_BOLD_RE = re.compile(r"^(?:Capitolo|Chapter)\s+\d+\b", re.IGNORECASE)
+_MIN_CHAPTER_NUM_FOR_PROMOTION = 3
+
+
+def _normalize_title(text: str) -> str:
+    """Normalize a title for fuzzy comparison against the TOC."""
+    text = unicodedata.normalize("NFKC", text)
+    text = text.lower().strip()
+    text = re.sub(r"[^\w\s]", " ", text)
+    text = re.sub(r"\s+", " ", text)
+    return text.strip()
+
+
+def _fuzzy_match(title: str, toc_map: dict[str, int], threshold: float = 0.75) -> int:
+    """
+    Look up the TOC level for a title using fuzzy comparison.
+    Returns the level found, or 0 if nothing matches.
+    """
+    norm = _normalize_title(title)
+    if not norm:
+        return 0
+
+    # Exact match
+    if norm in toc_map:
+        return toc_map[norm]
+
+    # Partial match: compare the leading words (up to 8)
+    norm_words = norm.split()[:8]
+    norm_prefix = " ".join(norm_words)
+
+    best_score = 0.0
+    best_level = 0
+    for toc_norm, level in toc_map.items():
+        toc_words = toc_norm.split()[:8]
+        toc_prefix = " ".join(toc_words)
+        # Position-by-position character overlap, normalized by the shorter prefix length
+        shorter = min(len(norm_prefix), len(toc_prefix))
+        if shorter == 0:
+            continue
+        matches = sum(
+            1 for a, b in zip(norm_prefix, toc_prefix) if a == b
+        )
+        score = matches / shorter
+        if score > best_score:
+            best_score = score
+            best_level = level
+
+    return best_level if best_score >= threshold else 0
+
+
+def _level_from_numbering(text: str) -> int:
+    """Infer the level from hierarchical numbering: "1." → 1, "1.2" → 2, etc."""
+    m = _NUMBERED_RE.match(text.strip())
+    if not m:
+        return 0
+    dots = m.group(1).count(".")
+    return min(dots + 1, 3)
+
+
+def _level_from_font(font_size: float, cluster_map: dict[float, int]) -> int:
+    """Look up the level of the nearest cluster in cluster_map for this font_size."""
+    if not cluster_map:
+        return 2  # fallback: everything is H2
+    rounded = round(font_size, 1)
+    if rounded in cluster_map:
+        return cluster_map[rounded]
+    # Find the nearest cluster
+    best = min(cluster_map.keys(), key=lambda s: abs(s - rounded))
+    return cluster_map[best]
+
+
+def infer_hierarchy(
+    blocks: list[Block],
+    profile: FontProfile,
+    toc: list,
+) -> list[Block]:
+    """
+    Assign block.level to every header_candidate in this priority order:
+      Rule 1: hierarchical numbering (≥3 numbered candidates)
+      Rule 2: TOC alignment (if the TOC is not empty)
+      Rule 3: font size clustering (fallback)
+    """
+    candidates = [b for b in blocks if b.block_type == "header_candidate"]
+    if not candidates:
+        return blocks
+
+    # ── Rule 1: numbering ──────────────────────────────────────────────────────
+    numbered = [b for b in candidates if _NUMBERED_RE.match(b.text.strip())]
+    use_numbering = len(numbered) >= _MIN_NUMBERED_FOR_RULE1
+
+    # ── Rule 2: build the TOC map ─────────────────────────────────────────────
+    toc_map: dict[str, int] = {}
+    for entry in toc:
+        if len(entry) >= 3:
+            level, title, _ = entry[0], entry[1], entry[2]
+            norm = _normalize_title(str(title))
+            if norm:
+                toc_map[norm] = min(int(level), 3)
+    use_toc = bool(toc_map)
+
+    # ── Assign levels ─────────────────────────────────────────────────────────
+    for block in candidates:
+        text = block.text.strip()
+        level = 0
+
+        if use_numbering and _NUMBERED_RE.match(text):
+            level = _level_from_numbering(text)
+        elif use_numbering:
+            # Numbered document but this candidate has no number →
+            # use font size as a secondary hint, then fall back to 2
+            level = _level_from_font(block.font_size, profile.cluster_map) or 2
+        elif use_toc:
+            level = _fuzzy_match(text, toc_map)
+            if level == 0:
+                level = _level_from_font(block.font_size, profile.cluster_map) or 2
+        else:
+            level = _level_from_font(block.font_size, profile.cluster_map) or 2
+
+        block.level = max(1, min(level, 3))
+
+    # ── Post-correction: bold "Capitolo/Chapter N" → primary sections (L2) ────
+    # When the document uses bold "Capitolo N" at body size (no distinct font
+    # for titles), font clustering assigns L3 because the size is below
+    # every cluster. With ≥3 numbered chapters, we promote them to L2.
+    if not use_toc and not use_numbering:
+        chapter_bold = [
+            b for b in candidates
+            if b.is_bold and _CHAPTER_NUM_BOLD_RE.match(b.text.strip()) and b.level > 2
+        ]
+        if len(chapter_bold) >= _MIN_CHAPTER_NUM_FOR_PROMOTION:
+            for b in chapter_bold:
+                b.level = 2
+
+    return blocks
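The partial match in `_fuzzy_match` compares characters position by position, so it behaves like a prefix comparison rather than an edit distance. The sketch below isolates that score on already-normalized strings; `prefix_overlap` is a hypothetical helper written for illustration.

```python
def prefix_overlap(a: str, b: str, max_words: int = 8) -> float:
    """Share of positions where the two 8-word prefixes carry the same character."""
    pa = " ".join(a.split()[:max_words])
    pb = " ".join(b.split()[:max_words])
    shorter = min(len(pa), len(pb))
    if shorter == 0:
        return 0.0
    # zip stops at the shorter prefix, so trailing extra text is ignored.
    return sum(1 for x, y in zip(pa, pb) if x == y) / shorter


same_start = prefix_overlap("introduzione al corso", "introduzione")
shifted = prefix_overlap("capitolo 1 introduzione", "introduzione")
```

A title that extends a TOC entry scores 1.0, while the same entry preceded by extra words scores 0.0 because every character is shifted. With the 0.75 threshold, such shifted titles fall back to font-size clustering.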
@@ -0,0 +1,54 @@
+"""Stage 6: document tree reconstruction: Sections with stack-based parent-child links."""
+from .models import Block, Section
+
+
+def build_tree(blocks: list[Block]) -> list[Section]:
+    """
+    Build the Section tree from the ordered list of Blocks.
+
+    Stack-based algorithm:
+    - header_candidate → new Section; pop the stack while the top level >= the new level.
+    - Other blocks → appended to the content of the Section on top of the stack.
+    - Text before the first header → implicit section (title="", level=0).
+    """
+    roots: list[Section] = []
+    stack: list[Section] = []   # open sections, in increasing level order
+
+    def _current() -> Section | None:
+        return stack[-1] if stack else None
+
+    def _push(section: Section) -> None:
+        """Insert the new section into the tree, respecting the hierarchy."""
+        # Pop sections with level >= the new one (a new header closes open sections at the same or deeper level)
+        while stack and stack[-1].level >= section.level:
+            stack.pop()
+
+        if stack:
+            stack[-1].children.append(section)
+        else:
+            roots.append(section)
+
+        stack.append(section)
+
+    for block in blocks:
+        if block.block_type == "header_candidate" and block.level > 0:
+            new_section = Section(
+                title=block.text.strip(),
+                level=block.level,
+                page_start=block.page,
+                source_block=block,
+            )
+            _push(new_section)
+        elif block.block_type == "ignore":
+            continue
+        else:
+            cur = _current()
+            if cur is None:
+                # Text before the first header → implicit section
+                implicit = Section(title="", level=0, page_start=block.page)
+                roots.append(implicit)
+                stack.append(implicit)
+                cur = implicit
+            cur.content.append(block)
+
+    return roots
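The stack discipline used by `build_tree` can be exercised on bare (title, level) pairs. This is a minimal sketch, assuming a simplified `Node` class in place of `Section` and ignoring content blocks; both `Node` and `build` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    title: str
    level: int
    children: list["Node"] = field(default_factory=list)


def build(headers):
    """Turn an ordered list of (title, level) pairs into a tree of Nodes."""
    roots, stack = [], []
    for title, level in headers:
        node = Node(title, level)
        # Close every open section at the same or a deeper level.
        while stack and stack[-1].level >= level:
            stack.pop()
        # Attach to the nearest open ancestor, or to the root list.
        (stack[-1].children if stack else roots).append(node)
        stack.append(node)
    return roots


roots = build([("A", 1), ("A.1", 2), ("A.1.a", 3), ("A.2", 2), ("B", 1)])
```

Each header is pushed once and popped at most once, so the construction is linear in the number of headers.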
@@ -0,0 +1,224 @@
+"""Stage 7: serialization of the document tree into valid Markdown."""
+import re
+from pathlib import Path
+
+from .models import Block, Section
+
+# Strips trailing artifacts from titles: " | 30", " |", " | "
+# (pipe with an optional page number, typical of chapter footers in PDFs)
+_TITLE_TRAIL_RE = re.compile(r"\s*\|\s*\d*\s*$")
+
+# Front-matter sections to omit entirely from the Markdown output
+# (TOC, list of figures, list of tables: not useful content for RAG)
+_SKIP_SECTION_TITLES = {
+    "indice", "indice generale", "indice analitico",
+    "table of contents", "contents",
+    "elenco delle figure", "lista delle figure", "list of figures",
+    "elenco delle tabelle", "lista delle tabelle", "list of tables",
+    "sommario",
+}
+
+
+_LIST_RE = re.compile(r"^(?:[-*•·]\s|\d+\.\s)")
+
+
+def _split_long_title(title: str) -> tuple[str, str]:
+    """
+    Split a multi-sentence title into (short_title, extra_body).
+
+    Looks for the first sentence boundary ('. ' followed by an uppercase letter) from
+    index 15 onward, so that short abbreviations at the start of the title are not split.
+    Returns (title, '') if there is no sensible split or the title is 120 characters or shorter.
+    """
+    if len(title) <= 120:
+        return title, ''
+    for i in range(15, len(title) - 2):
+        if title[i] == '.' and title[i + 1] == ' ' and title[i + 2].isupper():
+            return title[:i + 1].strip(), title[i + 2:].strip()
+    return title, ''
+
+
+def _serialize_block(block: Block, pdf_path: Path | None = None) -> str:
+    """Serialize a single Block into Markdown text."""
+    if block.block_type == "ignore":
+        return ""
+
+    text = block.text.strip()
+    if not text:
+        return ""
+
+    if block.block_type == "table":
+        return _serialize_table(block, pdf_path)
+
+    if block.block_type == "list_item":
+        return text  # already formatted with bullet/number
+
+    return text  # paragraph
+
+
+def _serialize_table(block: Block, pdf_path: Path | None = None) -> str:
+    """
+    Try to extract the table with pdfplumber; fall back to the raw text.
+    """
+    if pdf_path is not None and block.origin_spans:
+        try:
+            import pdfplumber
+            with pdfplumber.open(str(pdf_path)) as pdf:
+                page_idx = block.page - 1
+                if 0 <= page_idx < len(pdf.pages):
+                    page = pdf.pages[page_idx]
+                    x0, y0, x1, y1 = block.bbox
+                    cropped = page.crop((x0 - 2, y0 - 2, x1 + 2, y1 + 2))
+                    table = cropped.extract_table()
+                    if table:
+                        return _table_to_markdown(table)
+        except Exception:
+            pass
+
+    # Fallback: raw text
+    return block.text.strip()
+
+
+def _table_to_markdown(table: list[list[str | None]]) -> str:
+    """Convert a pdfplumber table into GFM Markdown."""
+    if not table:
+        return ""
+
+    def _cell(c: str | None) -> str:
+        return (c or "").replace("\n", " ").replace("|", "\\|").strip()
+
+    rows = [[_cell(c) for c in row] for row in table]
+    # Normalize the number of columns per row
+    n_cols = max(len(r) for r in rows)
+    rows = [r + [""] * (n_cols - len(r)) for r in rows]
+
+    header = rows[0]
+    sep = ["---"] * n_cols
+    body = rows[1:]
+
+    lines = [
+        "| " + " | ".join(header) + " |",
+        "| " + " | ".join(sep)    + " |",
+    ]
+    for row in body:
+        lines.append("| " + " | ".join(row) + " |")
+    return "\n".join(lines)
+
+
+def _is_para_break(block: Block) -> bool:
+    """
+    Return True if the block starts a new logical paragraph.
+    Threshold: vertical gap > 1× font_size (≈ one full line of margin).
+    Within a paragraph the gap is ≈ 0-4pt; between paragraphs it is ≥ font_size.
+    """
+    return block.space_before > block.font_size
+
+
+def _serialize_section(section: Section, pdf_path: Path | None = None) -> list[str]:
+    """DFS pre-order traversal: header → content → children."""
+    # Skip front-matter sections that are not useful for RAG (TOC, list of figures, etc.)
+    # The CHILDREN are serialized anyway: if the TOC is the wrong parent of the real
+    # chapters (flat hierarchy in the PDF), the chapters still appear in the Markdown.
+    if section.title.strip().lower() in _SKIP_SECTION_TITLES:
+        child_parts: list[str] = []
+        for child in section.children:
+            child_parts.extend(_serialize_section(child, pdf_path))
+        return child_parts
+
+    parts: list[str] = []
+
+    # Header (level 0 = implicit section before the first header → no #)
+    extra_body: str = ''
+    if section.level > 0:
+        title = _TITLE_TRAIL_RE.sub("", section.title).strip()
+        if not title:
+            pass  # empty title: no header, but the content is still emitted
+        else:
+            title, extra_body = _split_long_title(title)
+            hashes = "#" * section.level
+            parts.append(f"{hashes} {title}")
+            parts.append("")
+
+    # Content: accumulate consecutive paragraph lines into a single text block
+    pending: list[str] = []   # pieces of the current paragraph
+    if extra_body:
+        pending.append(extra_body)
+
+    def _flush() -> None:
+        if not pending:
+            return
+        # Join the pieces, repairing hyphenation across line breaks:
+        # "de-" + "stino" → "destino"  (trailing hyphen + lowercase start)
+        joined = pending[0]
+        for part in pending[1:]:
+            if joined.endswith("-") and part and part[0].islower():
+                joined = joined[:-1] + part
+            else:
+                joined = joined + " " + part
+        parts.append(joined)
+        parts.append("")
+        pending.clear()
+
+    for block in section.content:
+        text = _serialize_block(block, pdf_path)
+        if not text:
+            continue
+
+        if block.block_type == "list_item":
+            _flush()
+            parts.append(text)
+        elif block.block_type == "table":
+            _flush()
+            parts.append(text)
+            parts.append("")
+        else:
+            # Paragraph block: merge with the previous one or start a new paragraph
+            if pending and _is_para_break(block):
+                _flush()
+            pending.append(text)
+
+    _flush()
+
+    # Children
+    for child in section.children:
+        parts.extend(_serialize_section(child, pdf_path))
+
+    return parts
+
+
+def serialize_tree(
+    roots: list[Section],
+    meta: dict,
+    pdf_path: Path | None = None,
+    include_frontmatter: bool = False,
+) -> str:
+    """
+    Serialize the list of root Sections into a Markdown document.
+
+    include_frontmatter: if True, prepend a YAML block with metadata.
+    Note: the frontmatter is normally added by the runner, not here, so that
+    raw.md stays free of metadata that is subject to change.
+    """
+    parts: list[str] = []
+
+    if include_frontmatter and meta:
+        fm_lines = ["---", f"source: {meta.get('source', '')}"]
+        if meta.get("title"):
+            fm_lines.append(f'title: "{meta["title"]}"')
+        if meta.get("author"):
+            fm_lines.append(f'author: "{meta["author"]}"')
+        if meta.get("year"):
+            fm_lines.append(f"year: {meta['year']}")
+        if meta.get("pages"):
+            fm_lines.append(f"pages: {meta['pages']}")
+        fm_lines += ["---", ""]
+        parts.extend(fm_lines)
+
+    for root in roots:
+        root_parts = _serialize_section(root, pdf_path)
+        parts.extend(root_parts)
+
+    # Collapse runs of blank lines (3+ newlines become 2, i.e. one blank line)
+    text = "\n".join(parts)
+    text = re.sub(r"\n{3,}", "\n\n", text)
+    return text.strip() + "\n"
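
Review note: the hyphenation repair inside `_flush` can be exercised in isolation. The following is a minimal sketch, assuming only the join rule shown in the diff (trailing hyphen plus lowercase start); `join_fragments` is a hypothetical name, not a function of this codebase.

```python
def join_fragments(pieces: list[str]) -> str:
    """Join paragraph fragments, repairing end-of-line hyphenation.

    Hypothetical helper that mirrors the rule used in _flush:
    a trailing "-" followed by a lowercase start is treated as a split word.
    """
    if not pieces:
        return ""
    joined = pieces[0]
    for part in pieces[1:]:
        if joined.endswith("-") and part and part[0].islower():
            # "de-" + "stino" -> "destino"
            joined = joined[:-1] + part
        else:
            joined = joined + " " + part
    return joined


print(join_fragments(["il de-", "stino", "dell'uomo"]))
print(join_fragments(["stato-", "Nazione"]))
```

One consequence worth keeping in mind: a genuine compound broken at its hyphen (for example "socio-" followed by "economico") is merged as well, because the rule cannot tell the two cases apart.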
@@ -0,0 +1,337 @@
+"""Stage 8: Markdown hierarchy normalization (repairs level jumps, empty headers, duplicates)."""
+import re
+import unicodedata
+
+
+_HEADER_RE = re.compile(r"^(#{1,6})\s+(.+)$")
+
+# Conversion of LaTeX-encoded Italian accents extracted from TeX-compiled PDFs
+# backtick + vowel → grave accent;  ´ + vowel → acute accent
+_GRAVE = {'a': 'à', 'e': 'è', 'i': 'ì', 'o': 'ò', 'u': 'ù', 'ı': 'ì',
+          'A': 'À', 'E': 'È', 'I': 'Ì', 'O': 'Ò', 'U': 'Ù'}
+_ACUTE = {'a': 'á', 'e': 'é', 'i': 'í', 'o': 'ó', 'u': 'ú',
+          'A': 'Á', 'E': 'É', 'I': 'Í', 'O': 'Ó', 'U': 'Ú'}
+
+
+def _fix_latex_accents(text: str) -> str:
+    r"""Convert LaTeX accent encoding: \`e→è, ´e→é, etc."""
+    text = re.sub(r'`([aeiouAEIOUı])', lambda m: _GRAVE.get(m.group(1), m.group(0)), text)
+    text = re.sub(r'´([aeiouAEIOU])',  lambda m: _ACUTE.get(m.group(1), m.group(0)), text)
+    # Font encoding: "1'" → "l'" (glyph 'l' read as digit '1' before an apostrophe)
+    text = re.sub(r"\b1'([a-zA-ZÀ-ÿ])", r"l'\1", text)
+    return text
+
+
+# TeX/PDF hyphenation: "evi- tare" → "evitare" (hyphen-space between two fragments)
+_HYPHEN_SPACE_RE = re.compile(r'([a-zà-ÿ])- ([a-zà-ÿ])')
+
+# Bold markup inside headers: ## **Title** → ## Title
+_HEADER_BOLD_RE = re.compile(r'^(#{1,6})\s+\*\*(.+?)\*\*\s*$', re.MULTILINE)
+
+# Numbered header without a dot: "### 5 Title" → "### 5. Title"
+_HDR_NUM_NO_DOT_RE = re.compile(r'^(#{1,6})\s+(\d{1,3})\s+(.+)$')
+
+# Figure/Table as header (layout caption that ended up in the structural blocks)
+_FIGURE_CAPTION_RE = re.compile(
+    r'^(Figura|Figure|Fig\.|Tabella|Table|Tab\.)\s+\d', re.IGNORECASE
+)
+# Roman numeral used as a section marker: I, II, IV, VII, XXIII, etc.
+_ROMAN_NUMERAL_RE = re.compile(r'^[IVXLCDM]+\.?$', re.IGNORECASE)
+
+
+def _sentence_case(s: str) -> str:
+    if not s:
+        return s
+    low = s.lower()
+    return low[0].upper() + low[1:]
+
+
+def _is_garbage_header(content: str) -> bool:
+    """Detect headers that carry no structural meaning."""
+    stripped = content.strip()
+
+    # § symbol: a valid section marker even when followed only by a number or Roman numeral
+    if stripped.startswith("§"):
+        return False
+
+    if stripped.startswith("..."):
+        return True
+
+    # Text ending with an open parenthesis → truncated text, not a valid title
+    if stripped.endswith("("):
+        return True
+
+    # Text with PUA characters (Symbol/Wingdings font): formula or mathematical symbol
+    if re.search(r'[\ue000-\uf8ff]', stripped):
+        return True
+
+    # Text starting with [ → mathematical/vector notation
+    if stripped.startswith("["):
+        return True
+
+    # Header too short (≤4 non-space characters) → formula, variable or isolated symbol
+    if len(stripped.replace(" ", "")) <= 4 and not _ROMAN_NUMERAL_RE.match(stripped):
+        return True
+
+    # No run of ≥2 letters → pure punctuation/number
+    if not re.search(r'[A-Za-zÀ-ÿ]{2,}', stripped):
+        return True
+
+    # Header of 1-4 letters (e.g. "(a)", "x"), but not section Roman numerals
+    if re.fullmatch(r'\(?\s*[A-Za-z]{1,4}\s*\)?', stripped):
+        if not _ROMAN_NUMERAL_RE.match(stripped.strip("(). ")):
+            return True
+
+    # Short equation as header: "x = y", "f ≤ g"
+    if re.match(r'^[A-Za-zÀ-ÿ_]{1,3}\s*[=<>≤≥]', stripped):
+        return True
+
+    # Figure or table caption extracted as a header
+    if _FIGURE_CAPTION_RE.match(stripped):
+        return True
+
+    # Header starting with a lowercase letter and long text: body fragment
+    first_alpha = next((c for c in content if c.isalpha()), None)
+    if first_alpha and first_alpha.islower() and len(content) > 40:
+        return True
+
+    return False
+
+
+def _header_level(line: str) -> int:
+    m = _HEADER_RE.match(line)
+    return len(m.group(1)) if m else 0
+
+
+def _norm_title(text: str) -> str:
+    text = unicodedata.normalize("NFKC", text).lower().strip()
+    return re.sub(r"\s+", " ", text)
+
+
+def normalize_hierarchy(text: str) -> tuple[str, dict]:
+    """
+    Repair the Markdown produced by Stage 7 in several passes:
+
+    Pass 0:   LaTeX accents (encoding of TeX-compiled PDFs)
+    Pass 0.5: "word- word" hyphenation (TeX/PDF artifact)
+    Pass 1:   bold inside headers: ## **T** → ## T
+    Pass 1.5: garbage headers removed BEFORE the repair (figure captions, equations, symbols).
+              This keeps chemical/mathematical symbols at H1/H2 from skewing the level-jump repair.
+    Pass 2:   level jumps: # A → #### B becomes # A → ## B
+    Pass 3:   consecutive duplicates: identical adjacent headers are collapsed
+    Pass 4:   empty headers with neither content nor child sections are removed
+    Pass 5:   running header that is a prefix of the next one (e.g. "§ 4" before "§ 4. Title")
+    Pass 6:   ALLCAPS → sentence case (≥4 letters, all uppercase)
+    Pass 7:   demote # → ## if the document has ≥5 H1 headers
+    Pass 8:   clamp H4+ → H3; normalize "### 5 Title" → "### 5. Title"
+
+    Returns (repaired_text, stats_dict).
+    """
+    lines = text.split("\n")
+    stats = {
+        "n_level_jumps_repaired": 0,
+        "n_empty_headers_removed": 0,
+        "n_duplicate_headers_removed": 0,
+        "n_hyphenations_repaired": 0,
+        "n_bold_in_headers_removed": 0,
+        "n_allcaps_headers_normalized": 0,
+        "n_h1_demoted": 0,
+        "n_garbage_headers_removed": 0,
+        "n_headers_clamped": 0,
+    }
+
+    # ── Pass 0: fix LaTeX encoding of Italian accents ─────────────────────────
+    lines = [_fix_latex_accents(l) for l in lines]
+
+    # ── Pass 0.5: repair "word- word" hyphenation in paragraphs ───────────────
+    repaired_lines: list[str] = []
+    for line in lines:
+        if not _HEADER_RE.match(line):
+            new_line, n = _HYPHEN_SPACE_RE.subn(r'\1\2', line)
+            stats["n_hyphenations_repaired"] += n
+            repaired_lines.append(new_line)
+        else:
+            repaired_lines.append(line)
+    lines = repaired_lines
+
+    # ── Pass 1: strip bold markup inside headers ──────────────────────────────
+    no_bold: list[str] = []
+    for line in lines:
+        new_line, n = _HEADER_BOLD_RE.subn(r'\1 \2', line)
+        stats["n_bold_in_headers_removed"] += n
+        no_bold.append(new_line)
+    lines = no_bold
+
+    # ── Pass 1.5: remove garbage headers BEFORE the repair ────────────────────
+    # Chemical/mathematical symbols extracted at a large font size (H1/H2) skew
+    # the level-jump repair if they are removed only afterwards. Removing them
+    # first lets the real chapters receive the correct level without distortion.
+    no_garbage_pre: list[str] = []
+    for line in lines:
+        m = _HEADER_RE.match(line)
+        if m and _is_garbage_header(m.group(2)):
+            stats["n_garbage_headers_removed"] += 1
+            continue
+        no_garbage_pre.append(line)
+    lines = no_garbage_pre
+
+    # ── Pass 2: repair level jumps ────────────────────────────────────────────
+    repaired: list[str] = []
+    last_level = 0
+    for line in lines:
+        m = _HEADER_RE.match(line)
+        if m:
+            hashes, title = m.group(1), m.group(2)
+            level = len(hashes)
+            if last_level > 0 and level > last_level + 1:
+                new_level = last_level + 1
+                line = "#" * new_level + " " + title
+                stats["n_level_jumps_repaired"] += 1
+                level = new_level
+            last_level = level
+        repaired.append(line)
+
+    # ── Pass 3: remove consecutive duplicates ─────────────────────────────────
+    no_dup: list[str] = []
+    last_header_norm: str | None = None
+    for line in repaired:
+        m = _HEADER_RE.match(line)
+        if m:
+            norm = _norm_title(m.group(2))
+            if norm == last_header_norm:
+                stats["n_duplicate_headers_removed"] += 1
+                continue
+            last_header_norm = norm
+        else:
+            if line.strip():
+                last_header_norm = None  # reset on real content
+        no_dup.append(line)
+
+    # ── Pass 4: remove empty headers (no content AND no child section) ────────
+    no_empty: list[str] = []
+    i = 0
+    while i < len(no_dup):
+        line = no_dup[i]
+        m = _HEADER_RE.match(line)
+        if m:
+            cur_level = len(m.group(1))
+            j = i + 1
+            has_content = False
+            next_level: int | None = None
+            while j < len(no_dup):
+                ahead = no_dup[j]
+                m2 = _HEADER_RE.match(ahead)
+                if m2:
+                    next_level = len(m2.group(1))
+                    break
+                if ahead.strip():
+                    has_content = True
+                    break
+                j += 1
+            is_empty     = not has_content and j < len(no_dup)
+            is_container = next_level is not None and next_level > cur_level
+            if is_empty and not is_container:
+                stats["n_empty_headers_removed"] += 1
+                i += 1
+                continue
+        no_empty.append(line)
+        i += 1
+
+    # ── Pass 5: remove a running header that is a prefix of the next one ──────
+    # E.g. "§ 4" immediately followed (≤3 lines of content) by "§ 4. Real title".
+    no_prefix: list[str] = []
+    i = 0
+    while i < len(no_empty):
+        line = no_empty[i]
+        m = _HEADER_RE.match(line)
+        if m:
+            cur_norm = _norm_title(m.group(2))
+            if cur_norm:
+                j = i + 1
+                non_blank = 0
+                next_header_norm: str | None = None
+                while j < len(no_empty) and non_blank <= 3:
+                    ahead = no_empty[j]
+                    m2 = _HEADER_RE.match(ahead)
+                    if m2:
+                        next_header_norm = _norm_title(m2.group(2))
+                        break
+                    if ahead.strip():
+                        non_blank += 1
+                    j += 1
+                if (
+                    next_header_norm is not None
+                    and len(cur_norm) < len(next_header_norm)
+                    and next_header_norm.startswith(cur_norm)
+                ):
+                    stats["n_duplicate_headers_removed"] += 1
+                    i += 1
+                    continue
+        no_prefix.append(line)
+        i += 1
+    lines = no_prefix
+
+    # ── Pass 6: ALLCAPS → sentence case ───────────────────────────────────────
+    # Only headers with ≥4 letters, all uppercase; numeric/symbolic prefixes are preserved.
+    normalized: list[str] = []
+    for line in lines:
+        m = _HEADER_RE.match(line)
+        if m:
+            hashes, content = m.group(1), m.group(2).strip()
+            letters = [c for c in content if c.isalpha()]
+            if len(letters) >= 4 and all(c.isupper() for c in letters):
+                # Preserve a numeric/symbolic prefix (§, numbers, punctuation)
+                prefix_m = re.match(r'^([§\d\s\.\)\(\-]+\s+)', content)
+                if prefix_m:
+                    prefix = prefix_m.group(1)
+                    rest = content[len(prefix):]
+                    if rest:
+                        line = f"{hashes} {prefix}{_sentence_case(rest)}"
+                else:
+                    line = f"{hashes} {_sentence_case(content)}"
+                stats["n_allcaps_headers_normalized"] += 1
+        normalized.append(line)
+    lines = normalized
+
+    # ── Pass 7: demote # → ## if the document has ≥5 H1 headers ──────────────
+    # Documents that use H1 for main sections (not as a single title) produce
+    # a flat ## → ### hierarchy with no intermediate level.
+    # When demoting by one level the cascade is total: H1→H2, H2→H3, H3→H3
+    # (clamp: nothing goes below H3), so former H2 and H3 end up at the same level.
+    h1_count = sum(1 for l in lines if re.match(r'^# [A-Za-zÀ-ÿ§\d]', l))
+    if h1_count >= 5:
+        demoted: list[str] = []
+        for line in lines:
+            m = _HEADER_RE.match(line)
+            if m:
+                level = len(m.group(1))
+                if level == 1:
+                    line = f"## {m.group(2)}"
+                    stats["n_h1_demoted"] += 1
+                elif level == 2:
+                    line = f"### {m.group(2)}"
+                    stats["n_h1_demoted"] += 1
+                # level 3 stays at 3 (clamp)
+            demoted.append(line)
+        lines = demoted
+
+    # ── Pass 8: clamp H4+ → H3; normalize header numbering ────────────────────
+    clamped: list[str] = []
+    for line in lines:
+        m = _HEADER_RE.match(line)
+        if m:
+            level = len(m.group(1))
+            content = m.group(2)
+            if level > 3:
+                line = f"### {content}"
+                stats["n_headers_clamped"] += 1
+            else:
+                # "### 5 Title" → "### 5. Title" (numbering without a separating dot)
+                nm = _HDR_NUM_NO_DOT_RE.match(line)
+                if nm and len(nm.group(1)) == 3:
+                    line = f"{nm.group(1)} {nm.group(2)}. {nm.group(3)}"
+        clamped.append(line)
+    lines = clamped
+
+    result = "\n".join(lines)
+    result = re.sub(r"\n{3,}", "\n\n", result)
+    return result, stats
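
Review note: Pass 2 is the step that Stage 9 later re-checks, so it helps to see it on its own. A minimal sketch, assuming the same header regex as the diff; `repair_level_jumps` is a hypothetical standalone name used only for illustration.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.+)$")


def repair_level_jumps(lines: list[str]) -> tuple[list[str], int]:
    """Clamp any header that jumps more than one level below the previous header."""
    repaired: list[str] = []
    last_level = 0   # 0 means "no header seen yet": the first header is never changed
    n_fixed = 0
    for line in lines:
        m = HEADER_RE.match(line)
        if m:
            level = len(m.group(1))
            if last_level > 0 and level > last_level + 1:
                level = last_level + 1
                line = "#" * level + " " + m.group(2)
                n_fixed += 1
            last_level = level
        repaired.append(line)
    return repaired, n_fixed


fixed, n = repair_level_jumps(["# A", "#### B", "##### C", "text"])
print(fixed, n)
```

Because `last_level` is updated with the repaired level, a chain of deep headers is flattened step by step rather than all collapsing to the same depth.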
@@ -0,0 +1,97 @@
+"""Stage 9: structural validation of the final Markdown."""
+import re
+from dataclasses import dataclass, field
+
+
+_HEADER_RE = re.compile(r"^(#{1,6})\s+(.+)$")
+_TABLE_ROW_RE = re.compile(r"^\|.+\|$")
+
+
+@dataclass
+class ValidationResult:
+    is_valid: bool
+    errors: list[str] = field(default_factory=list)
+    warnings: list[str] = field(default_factory=list)
+
+    def to_dict(self) -> dict:
+        return {
+            "valid": self.is_valid,
+            "errors": self.errors,
+            "warnings": self.warnings,
+        }
+
+
+def validate_markdown(text: str, page_count: int = 0) -> ValidationResult:
+    """
+    Validate the structural integrity of the Markdown.
+
+    Check 1: no heading level jumps
+    Check 2: no excessive share of empty sections
+    Check 3: tables with inconsistent column counts
+    Check 4: minimal structure (at least one heading)
+    """
+    lines = text.split("\n")
+    errors: list[str] = []
+    warnings: list[str] = []
+
+    # ── Check 1: level jumps ──────────────────────────────────────────────────
+    last_level = 0
+    level_jumps = 0
+    for line in lines:
+        m = _HEADER_RE.match(line)
+        if m:
+            level = len(m.group(1))
+            if last_level > 0 and level > last_level + 1:
+                level_jumps += 1
+            last_level = level
+    if level_jumps > 0:
+        errors.append(f"Salti di livello heading non riparati: {level_jumps}")
+
+    # ── Check 2: empty sections ───────────────────────────────────────────────
+    header_indices = [i for i, l in enumerate(lines) if _HEADER_RE.match(l)]
+    total_sections = len(header_indices)
+    empty_sections = 0
+    for idx in range(len(header_indices)):
+        start = header_indices[idx] + 1
+        end = header_indices[idx + 1] if idx + 1 < len(header_indices) else len(lines)
+        content_lines = [l for l in lines[start:end] if l.strip() and not _HEADER_RE.match(l)]
+        if not content_lines:
+            empty_sections += 1
+
+    if total_sections > 0:
+        empty_ratio = empty_sections / total_sections
+        if empty_ratio > 0.30:
+            errors.append(
+                f"Troppe sezioni vuote: {empty_sections}/{total_sections} "
+                f"({empty_ratio:.0%})"
+            )
+        elif empty_ratio > 0.10:
+            warnings.append(
+                f"Sezioni vuote: {empty_sections}/{total_sections} ({empty_ratio:.0%})"
+            )
+
+    # ── Check 3: inconsistent table columns ───────────────────────────────────
+    in_table = False
+    table_cols: int | None = None
+    inconsistent_tables = 0
+    for line in lines:
+        if _TABLE_ROW_RE.match(line.strip()):
+            cols = line.count("|") - 1
+            if not in_table:
+                in_table = True
+                table_cols = cols
+            elif table_cols is not None and cols != table_cols:
+                inconsistent_tables += 1
+                table_cols = None  # do not report further rows of the same table
+        else:
+            in_table = False
+            table_cols = None
+    if inconsistent_tables > 0:
+        warnings.append(f"Tabelle con colonne inconsistenti: {inconsistent_tables}")
+
+    # ── Check 4: minimal structure ────────────────────────────────────────────
+    if total_sections == 0:
+        warnings.append("Nessun header rilevato — documento non strutturato")
+
+    is_valid = len(errors) == 0
+    return ValidationResult(is_valid=is_valid, errors=errors, warnings=warnings)
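
Review note: Check 3 counts tables, not rows, which is easy to miss when reading the loop. The sketch below isolates that logic under the same assumptions as the diff (a row is any stripped line that starts and ends with `|`); `count_inconsistent_tables` is a hypothetical name.

```python
import re

TABLE_ROW_RE = re.compile(r"^\|.+\|$")


def count_inconsistent_tables(text: str) -> int:
    """Count tables in which at least one row has a different column count."""
    inconsistent = 0
    expected: int | None = None
    in_table = False
    for line in text.split("\n"):
        row = line.strip()
        if TABLE_ROW_RE.match(row):
            cols = row.count("|") - 1
            if not in_table:
                in_table, expected = True, cols
            elif expected is not None and cols != expected:
                inconsistent += 1
                expected = None  # report each table at most once
        else:
            in_table, expected = False, None
    return inconsistent


good = "| a | b |\n| --- | --- |\n| 1 | 2 |"
bad = "| a | b |\n| --- | --- |\n| 1 | 2 | 3 |\n| 4 |"
print(count_inconsistent_tables(good), count_inconsistent_tables(bad))
```

Since columns are derived from a raw count of `|`, a cell containing an escaped pipe (`\|`) is counted as an extra column by this approach.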
@@ -0,0 +1,141 @@
+import re
+from pathlib import Path
+
+# ─── Language detection ───────────────────────────────────────────────────────
+
+_IT_WORDS = frozenset([
+    "il", "la", "di", "e", "che", "non", "per", "un", "una", "si",
+    "con", "da", "del", "della", "dei", "in", "ma", "se", "lo", "le",
+    "gli", "al", "alla", "ai", "alle", "sono", "ha", "hanno", "era",
+    "erano", "nel", "nella", "nei", "nelle", "questo", "questa", "così",
+])
+_EN_WORDS = frozenset([
+    "the", "of", "and", "to", "in", "is", "that", "it", "was", "for",
+    "on", "are", "as", "with", "his", "they", "at", "be", "this", "have",
+    "from", "or", "an", "but", "not", "by", "he", "she", "we", "you",
+    "which", "their", "been", "has", "would", "there", "when", "will",
+])
+_FR_WORDS = frozenset([
+    "le", "les", "de", "du", "des", "et", "un", "une", "est", "que",
+    "pour", "dans", "sur", "avec", "qui", "par", "pas", "plus", "au",
+    "ce", "se", "ou", "mais", "comme", "aussi",
+])
+_DE_WORDS = frozenset([
+    "der", "die", "das", "und", "in", "von", "zu", "den", "mit", "ist",
+    "auf", "eine", "als", "dem", "des", "sich", "nicht", "auch", "werden",
+    "bei", "nach", "oder", "wenn", "wird", "war",
+])
+_ES_WORDS = frozenset([
+    "el", "los", "las", "de", "en", "un", "una", "es", "que", "por",
+    "con", "del", "para", "como", "pero", "sus", "son", "hay",
+    "todo", "esta", "este", "ser", "más", "ya",
+])
+
+
+def _detect_language(text: str) -> str:
+    words  = re.findall(r"[^\W\d_]+", text.lower())
+    sample = words[:2000]
+    scores = {
+        "it": sum(1 for w in sample if w in _IT_WORDS),
+        "en": sum(1 for w in sample if w in _EN_WORDS),
+        "fr": sum(1 for w in sample if w in _FR_WORDS),
+        "de": sum(1 for w in sample if w in _DE_WORDS),
+        "es": sum(1 for w in sample if w in _ES_WORDS),
+    }
+    best = max(scores, key=scores.get)
+    return best if scores[best] > 0 else "unknown"
+
+
+# ─── Structure analysis ───────────────────────────────────────────────────────
+
+def _count_headers(text: str, level: int) -> int:
+    prefix = "#" * level + " "
+    return len(re.findall(rf"(?m)^{re.escape(prefix)}", text))
+
+
+def _count_paragraphs(text: str) -> int:
+    blocks = re.split(r"\n{2,}", text)
+    return sum(1 for b in blocks if b.strip() and not re.match(r"^#+\s", b.strip()))
+
+
+def _split_sections(text: str, level: int) -> list[str]:
+    prefix = "#" * level + " "
+    parts  = re.split(rf"(?m)^{re.escape(prefix)}.+", text)
+    return [p for p in parts[1:] if p.strip()]
+
+
+def _parse_sections_with_body(text: str, level: int = 3) -> list[tuple[str, str]]:
+    """Return a list of (header_line, body_text) for every header at the given level."""
+    prefix   = "#" * level + " "
+    lines    = text.split("\n")
+    sections: list[tuple[str, str]] = []
+    cur_hdr:  str | None = None
+    cur_body: list[str]  = []
+    for line in lines:
+        if line.startswith(prefix):
+            if cur_hdr is not None:
+                sections.append((cur_hdr, "\n".join(cur_body).strip()))
+            cur_hdr  = line
+            cur_body = []
+        elif cur_hdr is not None:
+            cur_body.append(line)
+    if cur_hdr is not None:
+        sections.append((cur_hdr, "\n".join(cur_body).strip()))
+    return sections
+
+
+def analyze(md_path: Path) -> dict:
+    text        = md_path.read_text(encoding="utf-8")
+    n_h1        = _count_headers(text, 1)
+    n_h2        = _count_headers(text, 2)
+    n_h3        = _count_headers(text, 3)
+    n_paragrafi = _count_paragraphs(text)
+
+    if n_h3 >= 5:
+        livello, boundary, strategia = 3, "h3", "h3_aware"
+        section_bodies = _split_sections(text, 3)
+        # If H3 sections are huge and H2 sections shorter, H2 is the right boundary
+        if n_h2 >= 3:
+            h2_bodies = _split_sections(text, 2)
+            avg_h3 = sum(len(b) for b in section_bodies) / len(section_bodies) if section_bodies else 0
+            avg_h2 = sum(len(b) for b in h2_bodies) / len(h2_bodies) if h2_bodies else 0
+            if avg_h3 > 5000 and avg_h2 < avg_h3 * 0.7:
+                livello, boundary, strategia = 2, "h2", "h2_paragraph_split"
+                section_bodies = h2_bodies
+    elif n_h2 >= 3:
+        livello, boundary, strategia = 2, "h2", "h2_paragraph_split"
+        section_bodies = _split_sections(text, 2)
+    elif n_h1 + n_h2 + n_h3 >= 1:
+        livello, boundary, strategia = 1, "paragrafo", "paragraph"
+        section_bodies = [b for b in re.split(r"\n{2,}", text) if b.strip()]
+    elif n_paragrafi >= 3:
+        livello, boundary, strategia = 1, "paragrafo", "paragraph"
+        section_bodies = [b for b in re.split(r"\n{2,}", text) if b.strip()]
+    else:
+        livello, boundary, strategia = 0, "nessuno", "sliding_window"
+        section_bodies = [text] if text.strip() else []
+
+    lengths          = [len(b) for b in section_bodies if b.strip()]
+    lunghezza_media  = int(sum(lengths) / len(lengths)) if lengths else 0
+    lingua           = _detect_language(text)
+
+    avvertenze = []
+    short = sum(1 for l in lengths if l < 200)
+    long_ = sum(1 for l in lengths if l > 800)
+    if short:
+        avvertenze.append(f"{short} sezioni sotto i 200 caratteri — verranno accorpate")
+    if long_:
+        avvertenze.append(f"{long_} sezioni sopra i 800 caratteri — verranno divise")
+
+    return {
+        "livello_struttura":     livello,
+        "n_h1":                  n_h1,
+        "n_h2":                  n_h2,
+        "n_h3":                  n_h3,
+        "n_paragrafi":           n_paragrafi,
+        "boundary_primario":     boundary,
+        "lingua_rilevata":       lingua,
+        "lunghezza_media_sezione": lunghezza_media,
+        "strategia_chunking":    strategia,
+        "avvertenze":            avvertenze,
+    }
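
Review note: the branch order in `analyze` decides the chunking strategy from header counts alone. A simplified sketch of that decision follows; it deliberately omits the average-length override that can switch from H3 to H2, and `pick_strategy` is a hypothetical name.

```python
def pick_strategy(n_h1: int, n_h2: int, n_h3: int, n_paragraphs: int) -> tuple[int, str]:
    """Return (structure_level, strategy) using the thresholds seen in analyze().

    Simplification: the avg-length check that can prefer H2 over H3 is not modelled.
    """
    if n_h3 >= 5:
        return 3, "h3_aware"
    if n_h2 >= 3:
        return 2, "h2_paragraph_split"
    if n_h1 + n_h2 + n_h3 >= 1 or n_paragraphs >= 3:
        return 1, "paragraph"
    return 0, "sliding_window"


for counts in [(0, 0, 6, 10), (1, 4, 2, 30), (1, 0, 0, 0), (0, 0, 0, 1)]:
    print(counts, pick_strategy(*counts))
```

Written this way it is visible that the two `paragraph` branches in the diff return the same triple, so they could be merged into a single condition without changing behavior.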
@@ -0,0 +1,152 @@
+import json
+import sys
+from pathlib import Path
+
+_GRADES = [(90, "A"), (75, "B"), (60, "C"), (40, "D"), (0, "F")]
+
+
+def _score(r: dict) -> tuple[int, list[str]]:
+    """
+    0-100 score for the quality of clean.md for vectorization.
+
+    Structure penalties:
+      level 0 (absent)     → −40
+      level 1 (flat)       → −15
+
+    Residue penalties (they degrade retrieval):
+      backtick             → −2 each  (max −20)
+      dot-leader           → −5 each  (max −10)
+      URL/watermark        → −5 each  (max −15)
+      images               → −5 each  (max −10)
+      <br> inline          → −2 each  (max −15)
+      encoding symbols     → −1 each  (max −10)
+      inline formulas [N.M] → −1 each (max −8)
+      footnote residue     → −1 each  (max −8)
+      PUA characters       → −2 each  (max −20)
+      formula headers      → −3 each  (max −15)
+
+    Anomaly penalties:
+      bare headers         → −3 each  (max −15)
+    """
+    score     = 100
+    detail    = []
+    structure = r.get("structure", {})
+    anomalie  = r.get("anomalie",  {})
+    residui   = r.get("residui",   {})
+
+    livello = structure.get("livello_struttura", 0)
+    if livello == 0:
+        score -= 40
+        detail.append("struttura assente −40")
+    elif livello == 1:
+        score -= 15
+        detail.append("struttura piatta −15")
+
+    def _pen(key: str, per_item: int, cap: int, label: str) -> None:
+        nonlocal score
+        n = residui.get(key, 0)
+        if n:
+            p = min(cap, n * per_item)
+            score -= p
+            detail.append(f"{label} ×{n} −{p}")
+
+    _pen("backtick",         2, 20, "backtick")
+    _pen("dotleader",        5, 10, "dot-leader")
+    _pen("url",              5, 15, "url")
+    _pen("immagini",         5, 10, "immagini")
+    _pen("br_inline",        2, 15, "<br> inline")
+    _pen("simboli_encoding", 1, 10, "simboli encoding")
+    _pen("formule_inline",   1,  8, "formule inline")
+    _pen("footnote_markers", 1,  8, "footnote residui")
+    _pen("pua_markers",      2, 20, "caratteri PUA font Symbol")
+    _pen("formula_headers",  3, 15, "formula/esercizio come header")
+
+    n_bare = anomalie.get("bare_headers", 0)
+    if n_bare:
+        p = min(15, n_bare * 3)
+        score -= p
+        detail.append(f"bare headers ×{n_bare} −{p}")
+
+    return max(0, score), detail
+
+
+def _grade(score: int) -> str:
+    return next(g for threshold, g in _GRADES if score >= threshold)
+
+
+def validate(stems: list[str], project_root: Path, detail: bool = False) -> None:
+    conv_dir = project_root / "conversione"
+
+    paths = (
+        [conv_dir / s / "report.json" for s in stems]
+        if stems
+        else sorted(conv_dir.glob("*/report.json"))
+    )
+
+    if not paths:
+        print("Nessun report.json trovato in conversione/*/")
+        sys.exit(0)
+
+    rows = [
+        json.loads(p.read_text(encoding="utf-8")) if p.exists()
+        else {"stem": p.parent.name, "_missing": True}
+        for p in paths
+    ]
+
+    col    = max(len(r.get("stem", "stem")) for r in rows) + 2
+    header = (
+        f"{'stem':<{col}}"
+        f"{'h2':>4}{'h3':>5}  "
+        f"{'strategia':<18}"
+        f"{'bare':>5}{'corte':>6}{'lunghe':>7}"
+        f"{'btk':>5}{'br':>4}{'enc':>4}{'url':>4}{'fhdr':>5}"
+        f"{'med':>6}"
+        f"  {'voto':>4}  grade"
+    )
+    sep = "─" * len(header)
+    print(f"\n{header}\n{sep}")
+
+    scores = []
+    for r in rows:
+        if r.get("_missing"):
+            print(f"{r['stem']:<{col}}  (report.json non trovato)")
+            continue
+
+        st   = r.get("structure",    {})
+        an   = r.get("anomalie",     {})
+        res  = r.get("residui",      {})
+        dist = r.get("distribution", {})
+        s, pen = _score(r)
+        scores.append(s)
+
+        print(
+            f"{r['stem']:<{col}}"
+            f"{st.get('n_h2',              0):>4}"
+            f"{st.get('n_h3',              0):>5}  "
+            f"{st.get('strategia_chunking','?'):<18}"
+            f"{an.get('bare_headers',      0):>5}"
+            f"{an.get('short_sections',    0):>6}"
+            f"{an.get('long_sections',     0):>7}"
+            f"{res.get('backtick',         0):>5}"
+            f"{res.get('br_inline',        0):>4}"
+            f"{res.get('simboli_encoding', 0):>4}"
+            f"{res.get('url',              0):>4}"
+            f"{res.get('formula_headers',  0):>5}"
+            f"{dist.get('mediana',         0):>6}"
+            f"  {s:>4}  {_grade(s)}"
+        )
+        if detail and pen:
+            for p in pen:
+                print(f"  {'':>{col}}  ↳ {p}")
+
+    print(sep)
+    if scores:
+        media = sum(scores) / len(scores)
+        print(
+            f"Documenti: {len(scores)}   "
+            f"Media: {media:.0f}/100 {_grade(int(media))}   "
+            f"(A≥90  B≥75  C≥60  D≥40  F<40)"
+        )
+    print(
+        "\nColonne: bare=header vuoti  corte=sez<150ch  lunghe=sez>1500ch  "
+        "btk=backtick  br=<br>inline  enc=simboli encoding  fhdr=formula-header  med=mediana chars\n"
+    )
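
Review note: a worked example makes the capped penalties and the grade lookup easier to verify. This is a sketch using the same grade table as the diff; `capped_penalty` and `grade` are hypothetical standalone names, and the input counts are invented for illustration.

```python
GRADES = [(90, "A"), (75, "B"), (60, "C"), (40, "D"), (0, "F")]


def capped_penalty(n: int, per_item: int, cap: int) -> int:
    """Penalty grows linearly with n but never exceeds cap."""
    return min(cap, n * per_item)


def grade(score: int) -> str:
    """First grade whose threshold is not above the score (table is sorted descending)."""
    return next(g for threshold, g in GRADES if score >= threshold)


score = 100
score -= 15                          # flat structure (level 1)
score -= capped_penalty(14, 2, 20)   # 14 backticks: 28 uncapped, 20 after the cap
score -= capped_penalty(1, 5, 15)    # 1 URL: 5
score = max(0, score)
print(score, grade(score))           # prints: 60 C
```

The cap means that beyond ten backticks the score stops dropping, so a document with 14 and one with 140 receive the same backtick penalty.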
@@ -4,10 +4,30 @@ set -euo pipefail
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 cd "$SCRIPT_DIR"

-mapfile -t dirs < <(find . -maxdepth 1 -mindepth 1 -type d | sort)
+STEM="${1:-}"
+
+if [[ -n "$STEM" ]]; then
+    # ── Single-stem mode ──────────────────────────────────────────────────
+    target="./$STEM"
+    if [[ ! -d "$target" ]]; then
+        echo "Errore: cartella '$STEM' non trovata in conversione/."
+        exit 1
+    fi
+    rm -rf "$target"
+    echo "Rimossa: conversione/$STEM/"
+    exit 0
+fi
+
+# ── Batch mode: all outputs (infrastructure folders excluded) ─────────────
+mapfile -t dirs < <(
+    find . -maxdepth 1 -mindepth 1 -type d \
+        ! -name '_*' \
+        ! -name '__*' \
+    | sort
+)

 if [[ ${#dirs[@]} -eq 0 ]]; then
-    echo "Nessuna cartella da cancellare."
+    echo "Nessuna cartella di output da cancellare."
    exit 0
 fi

@@ -16,10 +36,8 @@ for d in "${dirs[@]}"; do
    echo "  $d"
 done

-if [[ "${1:-}" != "-f" ]]; then
-    read -r -p "Confermi? [s/N] " answer
-    [[ "$answer" =~ ^[sS]$ ]] || { echo "Annullato."; exit 0; }
-fi
+read -r -p "Confermi? [s/N] " answer
+[[ "$answer" =~ ^[sS]$ ]] || { echo "Annullato."; exit 0; }

 for d in "${dirs[@]}"; do
    rm -rf "$d"
@@ -30,6 +30,7 @@ import re
 import subprocess
 import sys
 import tempfile
+from collections import Counter
 from datetime import datetime
 from functools import partial
 from pathlib import Path
@@ -69,8 +70,11 @@ def check_pdf(pdf_path: Path) -> tuple[bool, str]:
        return False, f"File non trovato: {pdf_path}"
    if pdf_path.suffix.lower() != ".pdf":
        return False, f"Non è un PDF: {pdf_path.name}"
-    if pdf_path.stat().st_size == 0:
+    size = pdf_path.stat().st_size
+    if size == 0:
        return False, "File vuoto"
+    if size < 1024:
+        return False, f"File troppo piccolo ({size} byte) — probabilmente corrotto"

    try:
        import pdfplumber
@@ -84,11 +88,26 @@ def check_pdf(pdf_path: Path) -> tuple[bool, str]:
                if len((pdf.pages[i].extract_text() or "").strip()) > 50
            )
            if pages_with_text == 0:
+                # Extend the sample: image covers or blank leading pages
+                extended = min(15, n_pages)
+                if extended > sample:
+                    ext_with_text = sum(
+                        1 for i in range(sample, extended)
+                        if len((pdf.pages[i].extract_text() or "").strip()) > 50
+                    )
+                    if ext_with_text > 0:
+                        return True, (
+                            f"{n_pages} pagine — prime {sample} vuote, "
+                            f"testo trovato in pagine successive "
+                            f"(possibile copertina immagine)"
+                        )
                return False, (
-                    f"Nessun testo nelle prime {sample} pagine "
-                    f"— probabilmente scansionato (usa modalità hybrid)"
+                    f"Nessun testo nelle prime {extended} pagine "
+                    f"— probabilmente scansionato (OCR non supportato)"
                )
        return True, f"{n_pages} pagine, testo digitale confermato"
+    except MemoryError:
+        return False, "Memoria esaurita durante l'apertura del PDF"
    except Exception as e:
        msg = str(e).lower()
        if "password" in msg or "encrypted" in msg:
@@ -131,6 +150,13 @@ def convert_pdf(pdf_path: Path, out_dir: Path) -> Path:
            raise RuntimeError(f"Nessun file .md prodotto in {out_dir}")
        md_file = candidates[0]

+    content = md_file.read_text(encoding="utf-8", errors="replace").strip()
+    if len(content) < 100:
+        raise RuntimeError(
+            f"opendataloader ha prodotto un file .md quasi vuoto ({len(content)} char) "
+            f"— il PDF potrebbe essere corrotto o non supportato"
+        )
+
    return md_file


@@ -139,6 +165,9 @@ def convert_pdf(pdf_path: Path, out_dir: Path) -> Path:
 _TOC_KEYWORDS = frozenset([
    "indice", "index", "contents", "table of contents",
    "sommario", "inhaltsverzeichnis", "inhalt",
+    "indice generale", "indice analitico", "indice dei contenuti",
+    "elenco dei capitoli", "argomenti", "table des matières",
+    "tabla de contenidos", "содержание",
 ])

 _ORDINALS_IT = {
@@ -166,6 +195,7 @@ def _is_allcaps_line(line: str) -> bool:
        len(letters) >= 3
        and all(c.isupper() for c in letters)
        and not stripped.startswith("#")
+        and not stripped.startswith("|")   # exclude Markdown table rows
    )


@@ -208,8 +238,9 @@ def _extract_math_environments(text: str) -> tuple[str, int]:
    Must run BEFORE the paragraph merge (step 5) so that it sees the blocks intact.
    """
    _ENVS = (
-        r"Definizione|Teorema|Lemma|Proposizione|"
-        r"Corollario|Osservazione|Nota|Esempio"
+        r"Definizione|Definition|Teorema|Theorem|Lemma|"
+        r"Proposizione|Proposition|Corollario|Corollary|"
+        r"Osservazione|Remark|Nota|Note|Esempio|Example"
    )
    count  = 0
    blocks = text.split("\n\n")
@@ -343,12 +374,158 @@ def _extract_article_headers(text: str) -> tuple[str, int]:

 # ─── [3a] Transform functions ─────────────────────────────────────────────────

+# PUA Unicode mapping (U+F020-U+F0FF) → correct symbols for the Symbol/Wingdings fonts.
+# The Windows Symbol font encodes Greek letters and math operators in the
+# Private Use Area range instead of the standard Unicode code points.
+_SYMBOL_PUA_MAP: dict[str, str] = {
+    "\uf020": " ",   # space
+    "\uf028": "(",
+    "\uf029": ")",
+    "\uf02b": "+",
+    "\uf02d": "\u2212",  # minus
+    "\uf02e": ".",
+    "\uf02f": "/",
+    "\uf030": "0", "\uf031": "1", "\uf032": "2", "\uf033": "3", "\uf034": "4",
+    "\uf035": "5", "\uf036": "6", "\uf037": "7", "\uf038": "8", "\uf039": "9",
+    "\uf03a": ":", "\uf03b": ";", "\uf03c": "<", "\uf03d": "=", "\uf03e": ">",
+    "\uf040": "\u2245",  # congruent
+    "\uf041": "\u0391",  # Alpha
+    "\uf042": "\u0392",  # Beta
+    "\uf043": "\u03a7",  # Chi
+    "\uf044": "\u0394",  # Delta
+    "\uf045": "\u0395",  # Epsilon
+    "\uf046": "\u03a6",  # Phi
+    "\uf047": "\u0393",  # Gamma
+    "\uf048": "\u0397",  # Eta
+    "\uf049": "\u0399",  # Iota
+    "\uf04a": "\u03d1",  # theta variant
+    "\uf04b": "\u039a",  # Kappa
+    "\uf04c": "\u039b",  # Lambda
+    "\uf04d": "\u039c",  # Mu
+    "\uf04e": "\u039d",  # Nu
+    "\uf04f": "\u039f",  # Omicron
+    "\uf050": "\u03a0",  # Pi
+    "\uf051": "\u0398",  # Theta
+    "\uf052": "\u03a1",  # Rho
+    "\uf053": "\u03a3",  # Sigma
+    "\uf054": "\u03a4",  # Tau
+    "\uf055": "\u03a5",  # Upsilon
+    "\uf056": "\u03c2",  # sigma final
+    "\uf057": "\u03a9",  # Omega
+    "\uf058": "\u039e",  # Xi
+    "\uf059": "\u03a8",  # Psi
+    "\uf05a": "\u0396",  # Zeta
+    "\uf05b": "[",
+    "\uf05c": "\u2234",  # therefore
+    "\uf05d": "]",
+    "\uf05e": "\u22a5",  # perpendicular
+    "\uf061": "\u03b1",  # alpha
+    "\uf062": "\u03b2",  # beta
+    "\uf063": "\u03c7",  # chi
+    "\uf064": "\u03b4",  # delta
+    "\uf065": "\u03b5",  # epsilon
+    "\uf066": "\u03c6",  # phi
+    "\uf067": "\u03b3",  # gamma
+    "\uf068": "\u03b7",  # eta
+    "\uf069": "\u03b9",  # iota
+    "\uf06a": "\u03d5",  # phi variant
+    "\uf06b": "\u03ba",  # kappa
+    "\uf06c": "\u03bb",  # lambda
+    "\uf06d": "\u03bc",  # mu
+    "\uf06e": "\u03bd",  # nu
+    "\uf06f": "\u03bf",  # omicron
+    "\uf070": "\u03c0",  # pi
+    "\uf071": "\u03b8",  # theta
+    "\uf072": "\u03c1",  # rho
+    "\uf073": "\u03c3",  # sigma
+    "\uf074": "\u03c4",  # tau
+    "\uf075": "\u03c5",  # upsilon
+    "\uf076": "\u03d6",  # pi symbol
+    "\uf077": "\u03c9",  # omega
+    "\uf078": "\u03be",  # xi
+    "\uf079": "\u03c8",  # psi
+    "\uf07a": "\u03b6",  # zeta
+    "\uf07b": "{",
+    "\uf07c": "|",
+    "\uf07d": "}",
+    "\uf07e": "~",
+    "\uf0b1": "\u00b1",  # plus-minus
+    "\uf0b7": "\u2022",  # bullet
+    "\uf0ba": "\u2261",  # identical to (Symbol 0xBA)
+    "\uf0bc": "\u2026",  # horizontal ellipsis (Symbol 0xBC)
+    "\uf0bd": "",        # vertical arrow extension piece (Symbol 0xBD)
+    "\uf0be": "",        # horizontal arrow extension piece (Symbol 0xBE)
+    "\uf0d7": "\u22c5",  # dot operator (Symbol 0xD7)
+    "\uf0f7": "",        # right parenthesis extension piece (Symbol 0xF7)
+    "\uf0b4": "\u00d7",  # multiplication (Symbol 0xB4)
+    "\uf0bb": "\u2248",  # almost equal (Symbol 0xBB)
+    "\uf0b9": "\u2260",  # not equal (Symbol 0xB9)
+    "\uf0b3": "\u2265",  # greater or equal (Symbol 0xB3)
+    "\uf0b2": "\u2033",  # double prime (Symbol 0xB2)
+    "\uf02a": "*",
+    "\uf02c": ",",
+    "\uf0a3": "\u2264",  # less or equal (Symbol 0xA3)
+    "\uf0a7": "\u2022",  # bullet (Wingdings 0xA7)
+    "\uf0a8": "\u2022",  # bullet variant
+    "\uf0ae": "\u2192",  # right arrow (Symbol 0xAE)
+    "\uf0b8": "\u00f7",  # division / range separator
+    "\uf0eb": "",        # Wingdings decorative icon (removed)
+    "\uf0f0": "\u2192",  # right arrow variant
+    "\uf0db": "\u21d4",  # left right double arrow (Symbol 0xDB)
+    "\uf0dc": "\u21d0",  # leftwards double arrow (Symbol 0xDC)
+    "\uf0dd": "\u21d1",  # upwards double arrow (Symbol 0xDD)
+    "\uf0de": "\u21d2",  # rightwards double arrow (Symbol 0xDE)
+    "\uf0df": "\u21d3",  # downwards double arrow (Symbol 0xDF)
+}
+
+_SYMBOL_PUA_RE = re.compile(
+    "[" + "".join(re.escape(k) for k in _SYMBOL_PUA_MAP) + "]"
+)
+
+
+def _t_fix_symbol_font(text: str) -> tuple[str, int]:
+    """Remap Symbol-font PUA characters (U+F020-U+F0FF) to the correct Unicode symbols."""
+    count = [0]
+
+    def _repl(m: re.Match) -> str:
+        count[0] += 1
+        return _SYMBOL_PUA_MAP[m.group(0)]
+
+    result = _SYMBOL_PUA_RE.sub(_repl, text)
+    return result, count[0]
+
+
 def _t_remove_images(text: str) -> tuple[str, int]:
    n = len(re.findall(r"!\[[^\]]*\]\([^)]*\)", text))
    text = re.sub(r"!\[[^\]]*\]\([^)]*\)\s*", "", text)
    return text, n


+# Superscript Unicode: ¹²³⁴⁵⁶⁷⁸⁹⁰
+_SUPERSCRIPT_RE = re.compile(r'[\u00b9\u00b2\u00b3\u2070\u2074-\u2079]+')
+# Footnote body line: starts with a superscript or [N]
+_FOOTNOTE_BODY_RE = re.compile(
+    r'^([\u00b9\u00b2\u00b3\u2070\u2074-\u2079]+\s+|\[\d{1,3}\]\s+)'
+)
+
+
+def _t_remove_footnotes(text: str) -> tuple[str, int]:
+    """Remove inline superscript footnote markers and footnote body lines."""
+    lines = text.split("\n")
+    result, count = [], 0
+    for line in lines:
+        stripped = line.strip()
+        # Footnote body: a short line that starts with ¹ or [N]
+        if stripped and _FOOTNOTE_BODY_RE.match(stripped) and len(stripped) < 300:
+            count += 1
+            continue
+        cleaned = _SUPERSCRIPT_RE.sub("", line)
+        if cleaned != line:
+            count += 1
+        result.append(cleaned)
+    return "\n".join(result), count
+
+
 def _t_fix_br(text: str) -> tuple[str, int]:
    n = len(re.findall(r"<br>", text, re.IGNORECASE))
    text = re.sub(r"<br>\s*", " ", text, flags=re.IGNORECASE)
@@ -457,8 +634,50 @@ def _t_extract_capitolo(text: str) -> tuple[str, int]:
    return text, 0


+_NUMBERED_HDR_RE = re.compile(
+    r"^(#{1,6})[ \t]+(\d+(?:\.\d+)*)\.[ \t]+(.+)$",
+    re.MULTILINE,
+)
+
+
+def _t_normalize_numbered_headings(text: str) -> tuple[str, int]:
+    """Fix heading levels for documents with decimal numbering.
+
+    Assigns each heading a level from its numeric depth, using as the base
+    the current level of the shallowest numbered headings.
+    Active only when the document has at least 2 numbering depths.
+    """
+    all_matches = list(_NUMBERED_HDR_RE.finditer(text))
+    if not all_matches:
+        return text, 0
+
+    pairs = [
+        (m.group(2).count(".") + 1, len(m.group(1)))
+        for m in all_matches
+    ]
+    depths = [d for d, _ in pairs]
+    min_depth, max_depth = min(depths), max(depths)
+    if max_depth == min_depth:
+        return text, 0
+
+    base_level = min(lv for d, lv in pairs if d == min_depth)
+    count = 0
+
+    def _repl(m: re.Match) -> str:
+        nonlocal count
+        hashes, num, title = m.group(1), m.group(2), m.group(3)
+        depth = num.count(".") + 1
+        new_level = min(base_level + (depth - min_depth), 6)
+        if new_level == len(hashes):
+            return m.group(0)
+        count += 1
+        return f"{'#' * new_level} {num}. {title}"
+
+    return _NUMBERED_HDR_RE.sub(_repl, text), count
+
+
 def _t_normalize_header_levels(text: str) -> tuple[str, int]:
-    """Normalizza h4+ → h3; rimuove header vuoti."""
+    """Normalize h4+ → h3; remove empty headers; strip a trailing '| N' page number."""
    text = re.sub(r"^#{3,6}\s*$", "", text, flags=re.MULTILINE)
    text = re.sub(
        r"^(#{3,6})\s+(\d{1,3})\s+(.+)$",
@@ -514,11 +733,20 @@ def _t_remove_toc(text: str) -> tuple[str, int]:
        if _in_toc:
            if re.match(r"^\s*$", line) or re.match(r"^\s*[-*+]\s+\d", line):
                continue
+            # TOC entry with a trailing page number (safe: we are already inside a TOC)
+            if re.match(r"^\s*[-*+]\s+.{2,70}\s+\d{1,3}\s*$", line):
+                continue
+            # A long text line is probably an abstract or body text, not a TOC entry
+            if len(line.strip()) > 200:
+                _in_toc = False
+                new_lines.append(line)
+                continue
            _in_toc = False
        new_lines.append(line)
    return "\n".join(new_lines), 1 if removed else 0


+
 def _t_allcaps_to_headers(text: str) -> tuple[str, int]:
    """Convert standalone ALL-CAPS lines → ## header."""
    count = 0
@@ -542,6 +770,13 @@ def _t_allcaps_to_headers(text: str) -> tuple[str, int]:
    return "\n\n".join(new_blocks), count


+_BIB_MARKERS_RE = re.compile(
+    r'\b(?:pp?\.|vol\.|n\.\s*\d|ed\.|edn\.)|\b(?:ISBN|DOI|arXiv)\b'
+    r'|\b(?:19|20)\d{2}\b',
+    re.IGNORECASE,
+)
+
+
 def _t_numbered_sections(text: str, has_exercises: bool = False) -> tuple[str, int]:
    """Convert numbered sections 'N. text' / '- N. text' / '- N text' → ### header."""
    count = 0
@@ -551,6 +786,8 @@ def _t_numbered_sections(text: str, has_exercises: bool = False) -> tuple[str, i
        content = m.group(2).strip()
        if content.endswith(".") and len(content) > 40:
            return m.group(0)
+        if _BIB_MARKERS_RE.search(content):
+            return m.group(0)
        count += 1
        return f"### {m.group(1)}.\n\n{content}"

@@ -568,8 +805,11 @@ def _t_numbered_sections(text: str, has_exercises: bool = False) -> tuple[str, i
    if not has_exercises:
        def _aphorism_repl(m: re.Match) -> str:
            nonlocal count
+            content = m.group(2).strip()
+            if _BIB_MARKERS_RE.search(content):
+                return m.group(0)
            count += 1
-            return f"\n\n### {m.group(1)}.\n\n{m.group(2).strip()}"
+            return f"\n\n### {m.group(1)}.\n\n{content}"

        text = re.sub(
            r"^-\s+(\d{1,3})\.\s+(.{10,})$",
@@ -582,6 +822,8 @@ def _t_numbered_sections(text: str, has_exercises: bool = False) -> tuple[str, i
        nonlocal count
        num = m.group(1)
        content = m.group(2).strip()
+        if _BIB_MARKERS_RE.search(content):
+            return m.group(0)
        count += 1
        split = re.search(r"(?<=[a-zàèéìíòóùú])\s+(?=[A-ZÀÈÉÌÍÒÓÙÚ])", content)
        if split and split.start() >= 3:
@@ -619,10 +861,11 @@ def _t_merge_paragraphs(text: str) -> tuple[str, int]:
            i + 1 < len(blocks)
            and stripped
            and not stripped.startswith("#")
+            and not stripped.startswith("|")   # do not merge table rows forward
            and stripped[-1] not in _SENTENCE_END
        ):
            nxt = blocks[i + 1].strip()
-            if not nxt or nxt.startswith("#") or re.match(r"^\d+\.", nxt):
+            if not nxt or nxt.startswith("#") or nxt.startswith("|") or re.match(r"^\d+\.", nxt) or re.match(r"^[-*+]\s", nxt):
                break
            b = stripped + " " + nxt
            stripped = b.strip()
@@ -651,6 +894,97 @@ def _t_collapse_blank_lines(text: str) -> tuple[str, int]:
    return re.sub(r"\n{3,}", "\n\n", text), 0


+def _t_demote_verse_headers(text: str) -> tuple[str, int]:
+    """Demote headers that are actually tercets/verses.
+
+    opendataloader promotes to ## the inscriptions and highlighted passages of the PDF
+    (larger type, centered). They can be recognized because they:
+      - end with a bare number (the verse number: 3, 6, 9, …)
+      - contain internal end-of-verse punctuation (', ' or '. ')
+    Example: '## «per me si va ne la città dolente, ... gente. 3'
+    → plain paragraph without the trailing number.
+    """
+    count = 0
+
+    def _demote(m: re.Match) -> str:
+        nonlocal count
+        hashes, content = m.group(1), m.group(2).strip()
+        # Must end with a bare number (verse number ≤ 9999)
+        if not re.search(r"\s\d{1,4}\s*$", content):
+            return m.group(0)
+        # Must contain internal punctuation (it is a block of several verses)
+        inner = re.sub(r"\s\d{1,4}\s*$", "", content)
+        if not re.search(r"[,;:.!?»\"\']\s+[A-Za-zÀ-ÿ«\"]", inner):
+            return m.group(0)
+        count += 1
+        # Strip the trailing verse number and return the rest as plain text
+        clean = re.sub(r"\s\d{1,4}\s*$", "", content)
+        return clean
+
+    text = re.sub(
+        r"^(#{1,6})[ \t]+(.{20,})$",
+        _demote,
+        text,
+        flags=re.MULTILINE,
+    )
+    return text, count
+
+
+def _t_restore_poetry_lines(text: str) -> tuple[str, int]:
+    """Restore poetry line breaks destroyed by keep_line_breaks=False.
+
+    When the PDF is poetry (Dantean tercets, sonnets, etc.), opendataloader
+    with keep_line_breaks=False produces a single paragraph with the verse numbers
+    (3, 6, 9 … or 1, 2, 3 …) embedded inline:
+      'smarrita. 3 Ahi quanto a dir qual era è cosa dura … paura! 6 Tant'è …'
+
+    The transform detects blocks whose verse numbers form an arithmetic progression
+    and breaks the text at each number, adding a blank line at every third number.
+    """
+    count = 0
+    blocks = text.split("\n\n")
+    result = []
+
+    # Pattern: a standalone number preceded by end-of-verse punctuation and followed
+    # by a letter or opening quote (start of the next verse).
+    _VERSE_NUM_RE = re.compile(
+        r'([.!?»\'\"]\s+)(\d+)(\s+)(?=[A-ZÀ-Ùa-zà-ù«"‟])'
+    )
+
+    for block in blocks:
+        stripped = block.strip()
+        if not stripped or stripped.startswith("#"):
+            result.append(block)
+            continue
+
+        matches = list(_VERSE_NUM_RE.finditer(stripped))
+        if len(matches) < 2:
+            result.append(block)
+            continue
+
+        nums = [int(m.group(2)) for m in matches]
+        diffs = [nums[i + 1] - nums[i] for i in range(len(nums) - 1)]
+        # Accept progressions whose first step is 1–5, with at most two distinct steps (tercets: 3, hendecasyllables: 1)
+        if not diffs or len(set(diffs)) > 2 or not (1 <= diffs[0] <= 5):
+            result.append(block)
+            continue
+
+        step = diffs[0]
+
+        def _replace_verse_num(m: re.Match) -> str:
+            n = int(m.group(2))
+            # Every third verse number (n divisible by step * 3) → blank line (new stanza)
+            sep = "\n\n" if n % (step * 3) == 0 else "\n"
+            return m.group(1).rstrip() + sep
+
+        new_block = _VERSE_NUM_RE.sub(_replace_verse_num, stripped)
+        if new_block != stripped:
+            count += len(matches)
+        result.append(new_block)
+
+    return "\n\n".join(result), count
+
+
 def _t_remove_urls(text: str) -> tuple[str, int]:
    """Remove lines that are only a URL (watermarks, platform footers)."""
    return re.sub(r"(?m)^(https?://|www\.)\S+\s*$", "", text), 0
@@ -664,7 +998,14 @@ def _t_remove_empty_headers(text: str) -> tuple[str, int]:
        stripped = block.strip()
        if re.match(r"^#{1,6} ", stripped) and "\n" not in stripped:
            next_stripped = blocks[i + 1].strip() if i + 1 < len(blocks) else ""
-            if not next_stripped or re.match(r"^#{1,6} ", next_stripped):
+            # Do not remove a short header when the next block is a very long header
+            # (> 80 chars): it is almost certainly PDF body text misclassified as a heading.
+            next_is_long_header = (
+                re.match(r"^#{1,6} ", next_stripped) and len(next_stripped) > 80
+            )
+            if not next_stripped or (
+                re.match(r"^#{1,6} ", next_stripped) and not next_is_long_header
+            ):
                continue
        cleaned.append(block)
    return re.sub(r"\n{3,}", "\n\n", "\n\n".join(cleaned)), 0
@@ -680,12 +1021,22 @@ def _t_remove_garbage_headers(text: str) -> tuple[str, int]:
    def _is_garbage_header(content: str) -> bool:
        if content.lstrip().startswith("..."):
            return True
-        if not re.search(r"[A-Za-zÀ-ÿ]{2,}", content):
+        if not re.search(r"[A-Za-zÀ-ÿ\u0391-\u03c9]{2,}", content):
            return True
        if re.fullmatch(r"\(?\s*[A-Za-z]{1,4}\s*\)?", content.strip()):
            return True
        if len(content) > 60 and re.search(r"[!%#]\w|\w[!%#]|\b\w+-\s*\w", content):
            return True
+        # Sentence fragment: starts with a lowercase letter and is fairly long
+        first_alpha = next((c for c in content if c.isalpha()), None)
+        if first_alpha and first_alpha.islower() and len(content) > 40:
+            return True
+        # Math formula: a single (or short) variable followed by = or a comparison operator
+        if re.match(r"^[A-Za-z\u0391-\u03c9_]{1,3}\s*[=<>≤≥]", content.strip()):
+            return True
+        # Figure/table caption: "Figura N..." or "Figure N..." or "Tabella N..."
+        if re.match(r"^(Figura|Figure|Fig\.|Tabella|Table|Tab\.)\s+\d", content.strip(), re.IGNORECASE):
+            return True
        return False

    count = 0
@@ -713,8 +1064,14 @@ def _t_remove_frontmatter(text: str) -> tuple[str, int]:
    blocks = re.split(r"\n{2,}", text)
    cleaned = []
    count = 0
+    total = len(blocks)
+    cutoff = max(5, min(15, int(total * 0.20)))
    for i, block in enumerate(blocks):
        stripped = block.strip()
+        # Frontmatter appears only in the first sections of the document
+        if i >= cutoff:
+            cleaned.append(block)
+            continue
        if not re.match(r"^### ", stripped) or re.match(r"^### \d", stripped):
            cleaned.append(block)
            continue
@@ -728,6 +1085,59 @@ def _t_remove_frontmatter(text: str) -> tuple[str, int]:
    return re.sub(r"\n{3,}", "\n\n", "\n\n".join(cleaned)), count


+_WATERMARK_RE = re.compile(
+    r"^(BOZZA|DRAFT|CONFIDENTIAL|RISERVATO|PROVVISORIO|SAMPLE|SPECIMEN"
+    r"|DO NOT DISTRIBUTE|NON DISTRIBUIRE|COPY|COPIA)\s*$",
+    re.IGNORECASE | re.MULTILINE,
+)
+
+
+def _t_remove_watermarks(text: str) -> tuple[str, int]:
+    """Remove standalone lines containing common watermark text."""
+    lines = text.split("\n")
+    result, count = [], 0
+    for line in lines:
+        if _WATERMARK_RE.match(line):
+            count += 1
+        else:
+            result.append(line)
+    return "\n".join(result), count
+
+
+def _t_fix_math_symbols(text: str) -> tuple[str, int]:
+    """Remove lines made up only of box/placeholder symbols (fonts not extracted)."""
+    lines = text.split("\n")
+    result, count = [], 0
+    for line in lines:
+        if line.strip() and re.match(r"^[\s□■▪▫◆◇●○•\u25a0-\u25ff]+$", line):
+            count += 1
+        else:
+            result.append(line)
+    return "\n".join(result), count
+
+
+def _t_remove_recurring_lines(text: str) -> tuple[str, int]:
+    """Remove short lines repeated ≥5 times (page headers/footers)."""
+    lines = text.split("\n")
+    short_lines = [
+        ln.strip() for ln in lines
+        if 3 < len(ln.strip()) < 80
+        and not ln.strip().startswith("#")
+        and not ln.strip().startswith("|")
+    ]
+    freq = Counter(short_lines)
+    recurring = {ln for ln, c in freq.items() if c >= 5}
+    if not recurring:
+        return text, 0
+    result, count = [], 0
+    for line in lines:
+        if line.strip() in recurring:
+            count += 1
+        else:
+            result.append(line)
+    return "\n".join(result), count
+
+
 # ─── [3b] Transform pipeline ──────────────────────────────────────────────────

 def apply_transforms(text: str) -> tuple[str, dict]:
@@ -737,19 +1147,24 @@ def apply_transforms(text: str) -> tuple[str, dict]:
    """
    # Flag computed before the loop: disables transform 4b in documents
    # with exercise sections (the "- N. text" items would be list numbering, not headers).
-    _has_ex = bool(re.search(r"\bEsercizi\b", text, re.IGNORECASE))
+    _has_ex = bool(re.search(r"\b(Esercizi|Exercises|Problems|Homework)\b", text, re.IGNORECASE))

    _transforms: list[tuple[str | None, object]] = [
+        ("n_simboli_pua_corretti",      _t_fix_symbol_font),
        ("n_immagini_rimosse",          _t_remove_images),
        ("n_br_rimossi",                _t_fix_br),
        ("n_tabsep_rimossi",            _t_fix_tabsep),
+        ("n_note_rimosse",              _t_remove_footnotes),
        ("n_accenti_corretti",          _t_fix_accents),
        ("n_moltiplicazioni_corrette",  _t_fix_multiplication),
        ("n_micro_corretti",            _t_fix_micro),
+        ("n_simboli_math_rimossi",      _t_fix_math_symbols),
        ("n_formule_rimossi",           _t_remove_formula_labels),
        ("n_dotleader_rimossi",         _t_remove_dotleaders),
+        ("n_righe_ricorrenti_rimosse",  _t_remove_recurring_lines),
        ("n_header_concat_fixati",      _t_fix_header_concat),
        (None,                          _t_extract_capitolo),
+        ("n_header_numerati_normalizzati", _t_normalize_numbered_headings),
        (None,                          _t_normalize_header_levels),
        ("n_articoli_estratti",         _t_extract_articles),
        (None,                          _t_remove_header_bold),
@@ -761,11 +1176,15 @@ def apply_transforms(text: str) -> tuple[str, dict]:
        ("n_paragrafi_uniti",           _t_merge_paragraphs),
        (None,                          _t_normalize_whitespace),
        (None,                          _t_collapse_blank_lines),
+        ("n_versi_ripristinati",        _t_restore_poetry_lines),
+        ("n_header_verso_demotati",     _t_demote_verse_headers),
        (None,                          _t_remove_urls),
        (None,                          _t_remove_empty_headers),
        ("n_titoli_uniti",              _t_merge_title_headers),
+        (None,                          lambda t: (re.sub(r"(?m)^(#{1,6}.+?)\s*\|\s*\d{1,3}\s*$", r"\1", t), 0)),
        ("n_garbage_headers_rimossi",   _t_remove_garbage_headers),
        ("n_frontmatter_rimossi",       _t_remove_frontmatter),
+        ("n_watermark_rimossi",         _t_remove_watermarks),
    ]

    stats: dict = {}
@@ -792,16 +1211,35 @@ _EN_WORDS = frozenset([
    "from", "or", "an", "but", "not", "by", "he", "she", "we", "you",
    "which", "their", "been", "has", "would", "there", "when", "will",
 ])
+_FR_WORDS = frozenset([
+    "le", "les", "de", "du", "des", "et", "un", "une", "est", "que",
+    "pour", "dans", "sur", "avec", "qui", "par", "pas", "plus", "au",
+    "ce", "se", "ou", "mais", "comme", "aussi",
+])
+_DE_WORDS = frozenset([
+    "der", "die", "das", "und", "in", "von", "zu", "den", "mit", "ist",
+    "auf", "eine", "als", "dem", "des", "sich", "nicht", "auch", "werden",
+    "bei", "nach", "oder", "wenn", "wird", "war",
+])
+_ES_WORDS = frozenset([
+    "el", "los", "las", "de", "en", "un", "una", "es", "que", "por",
+    "con", "del", "para", "como", "pero", "sus", "son", "hay",
+    "todo", "esta", "este", "ser", "muy", "ya",
+])


 def _detect_language(text: str) -> str:
    words = re.findall(r"\b[a-zA-Z]{2,}\b", text.lower())
    sample = words[:2000]
-    it = sum(1 for w in sample if w in _IT_WORDS)
-    en = sum(1 for w in sample if w in _EN_WORDS)
-    if it == 0 and en == 0:
-        return "unknown"
-    return "it" if it >= en else "en"
+    scores = {
+        "it": sum(1 for w in sample if w in _IT_WORDS),
+        "en": sum(1 for w in sample if w in _EN_WORDS),
+        "fr": sum(1 for w in sample if w in _FR_WORDS),
+        "de": sum(1 for w in sample if w in _DE_WORDS),
+        "es": sum(1 for w in sample if w in _ES_WORDS),
+    }
+    best = max(scores, key=scores.get)
+    return best if scores[best] > 0 else "unknown"


 def _count_headers(text: str, level: int) -> int:
@@ -850,6 +1288,17 @@ def analyze(md_path: Path) -> dict:
    if n_h3 >= 5:
        livello, boundary, strategia = 3, "h3", "h3_aware"
        section_bodies = _split_sections(text, 3)
+        # Inverted hierarchy: h3 are huge chapters, h2 are shorter subsections.
+        # This happens when opendataloader classifies chapter titles as h6 (→ normalized
+        # to h3) and the ALL-CAPS subsections become ## (h2). In that case h2 is
+        # the correct boundary for chunking.
+        if n_h2 >= 3:
+            h2_bodies = _split_sections(text, 2)
+            avg_h3 = sum(len(b) for b in section_bodies) / len(section_bodies) if section_bodies else 0
+            avg_h2 = sum(len(b) for b in h2_bodies) / len(h2_bodies) if h2_bodies else 0
+            if avg_h3 > 5000 and avg_h2 < avg_h3 * 0.7:
+                livello, boundary, strategia = 2, "h2", "h2_paragraph_split"
+                section_bodies = h2_bodies
    elif n_h2 >= 3:
        livello, boundary, strategia = 2, "h2", "h2_paragraph_split"
        section_bodies = _split_sections(text, 2)
@@ -955,13 +1404,15 @@ def build_report(
        return hits

    residui = {
-        "backtick":        _scan(r"`"),
-        "dotleader":       _scan(r"(?:\. ){3,}"),
-        "url":             _scan(r"^(https?://|www\.)\S+"),
-        "immagini":        _scan(r"!\[[^\]]*\]\([^)]*\)"),
-        "br_inline":       _scan(r"<br>"),
-        "simboli_encoding":_scan(r'(?<=[0-9A-Za-z])[!"](?=[0-9A-Za-z])'),
-        "formule_inline":  _scan(r"\[\d+\.\d+\]"),
+        "backtick":         _scan(r"`"),
+        "dotleader":        _scan(r"(?:\. ){3,}"),
+        "url":              _scan(r"^(https?://|www\.)\S+"),
+        "immagini":         _scan(r"!\[[^\]]*\]\([^)]*\)"),
+        "br_inline":        _scan(r"<br>"),
+        "simboli_encoding": _scan(r'(?<=[0-9A-Za-z])[!"](?=[0-9A-Za-z])'),
+        "formule_inline":   _scan(r"\[\d+\.\d+\]"),
+        "footnote_markers": _scan(r'[\u00b9\u00b2\u00b3\u2070\u2074-\u2079]'),
+        "pua_markers":      _scan(r'[\ue000-\uf8ff]'),
    }

    # ── Report assembly ───────────────────────────────────────────────────
@@ -990,13 +1441,17 @@ def build_report(
            "br_inline":        len(residui["br_inline"]),
            "simboli_encoding": len(residui["simboli_encoding"]),
            "formule_inline":   len(residui["formule_inline"]),
-            "backtick_esempi":         residui["backtick"],
-            "dotleader_esempi":        residui["dotleader"],
-            "url_esempi":              residui["url"],
-            "immagini_esempi":         residui["immagini"],
-            "br_inline_esempi":        residui["br_inline"],
-            "simboli_encoding_esempi": residui["simboli_encoding"],
-            "formule_inline_esempi":   residui["formule_inline"],
+            "footnote_markers": len(residui["footnote_markers"]),
+            "pua_markers":      len(residui["pua_markers"]),
+            "backtick_esempi":          residui["backtick"],
+            "dotleader_esempi":         residui["dotleader"],
+            "url_esempi":               residui["url"],
+            "immagini_esempi":          residui["immagini"],
+            "br_inline_esempi":         residui["br_inline"],
+            "simboli_encoding_esempi":  residui["simboli_encoding"],
+            "formule_inline_esempi":    residui["formule_inline"],
+            "footnote_markers_esempi":  residui["footnote_markers"],
+            "pua_markers_esempi":       residui["pua_markers"],
        },
    }

@@ -1035,10 +1490,17 @@ def run(stem: str, project_root: Path, force: bool) -> bool:
    with tempfile.TemporaryDirectory() as tmp:
        try:
            md_file = convert_pdf(pdf_path, Path(tmp))
+        except MemoryError:
+            print("  ✗ Memoria esaurita durante la conversione")
+            return False
        except Exception as e:
            print(f"  ✗ Conversione fallita: {e}")
            return False
-        raw_text = md_file.read_text(encoding="utf-8")
+        try:
+            raw_text = md_file.read_text(encoding="utf-8")
+        except UnicodeDecodeError as e:
+            print(f"  ✗ Errore encoding nel file prodotto: {e}")
+            return False

    size_kb = len(raw_text.encode()) // 1024
    n_lines = raw_text.count("\n")
@@ -1048,14 +1510,19 @@ def run(stem: str, project_root: Path, force: bool) -> bool:
    print("  [3/4] Pulizia strutturale...")
    clean_text, t_stats = apply_transforms(raw_text)
    reduction = 100 * (1 - len(clean_text) / len(raw_text)) if raw_text else 0
-    print(f"  ✅ Immagini rimosse:      {t_stats['n_immagini_rimosse']}")
+    print(f"  ✅ Simboli PUA corretti:  {t_stats['n_simboli_pua_corretti']}")
+    print(f"     Immagini rimosse:      {t_stats['n_immagini_rimosse']}")
+    print(f"     Note rimosse:          {t_stats['n_note_rimosse']}")
    print(f"     Accenti corretti:      {t_stats['n_accenti_corretti']}")
    print(f"     Dot-leader rimossi:    {t_stats['n_dotleader_rimossi']}")
    print(f"     Header concat fixati:  {t_stats['n_header_concat_fixati']}")
+    print(f"     Header num. normaliz.: {t_stats['n_header_numerati_normalizzati']}")
    print(f"     Articoli → ###:        {t_stats['n_articoli_estratti']}")
    print(f"     Ambienti matematici:   {t_stats['n_ambienti_matematici']}")
    print(f"     Titoli header uniti:   {t_stats['n_titoli_uniti']}")
    print(f"     TOC rimosso:           {'sì' if t_stats['toc_rimosso'] else 'no'}")
+    print(f"     Versi poesia riprist.: {t_stats['n_versi_ripristinati']}")
+    print(f"     Header verso demotati: {t_stats['n_header_verso_demotati']}")
    print(f"     ALL-CAPS → ##:         {t_stats['n_header_allcaps']}")
    print(f"     Sezioni → ###:         {t_stats['n_sezioni_numerate']}")
    print(f"     Paragrafi uniti:       {t_stats['n_paragrafi_uniti']}")
@@ -1063,9 +1530,13 @@ def run(stem: str, project_root: Path, force: bool) -> bool:

    # ── [4] Profilo strutturale ────────────────────────────────────────────
    print("  [4/4] Analisi struttura...")
-    out_dir.mkdir(parents=True, exist_ok=True)
-    raw_out.write_text(raw_text, encoding="utf-8")
-    clean_out.write_text(clean_text, encoding="utf-8")
+    try:
+        out_dir.mkdir(parents=True, exist_ok=True)
+        raw_out.write_text(raw_text, encoding="utf-8")
+        clean_out.write_text(clean_text, encoding="utf-8")
+    except PermissionError as e:
+        print(f"  ✗ Permesso negato durante la scrittura: {e}")
+        return False
    profile = analyze(clean_out)

    _LIVELLO_DESC = {3: "ricca (h3)", 2: "parziale (h2)", 1: "paragrafi", 0: "testo piatto"}
@@ -86,6 +86,8 @@ def _score(r: dict) -> tuple[int, list[str]]:
    _pen("br_inline",        2, 15, "<br> inline")
    _pen("simboli_encoding", 1, 10, "simboli encoding")
    _pen("formule_inline",   1,  8, "formule inline")
+    _pen("footnote_markers", 1,  8, "footnote residui")
+    _pen("pua_markers",      2, 20, "caratteri PUA font Symbol")

    # ── Anomalie ──────────────────────────────────────────────────────────
    n_bare = anomalie.get("bare_headers", 0)
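Reviewer note: the new `pua_markers` penalty targets Private Use Area code points, which typically survive conversion when a PDF uses the Symbol font without a proper Unicode mapping. A minimal sketch of how such residue could be counted, assuming the check covers the BMP Private Use Area (U+E000 to U+F8FF). `count_pua` is a hypothetical helper and not the function used in this commit.

```python
def count_pua(text: str) -> int:
    """Count characters in the BMP Private Use Area (U+E000..U+F8FF)."""
    return sum(1 for ch in text if 0xE000 <= ord(ch) <= 0xF8FF)


# Hypothetical sample: two Symbol-font leftovers inside ordinary text.
sample = "alpha \uf061 beta \uf062 plain text"
print(count_pua(sample))  # 2
```

A count like this can feed a per-document penalty, but the weight and cap passed to `_pen` remain the ones chosen in the source.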
@@ -1,29 +1,39 @@
-# Step 8 — Vettorizzazione
+# Ingestion — Vettorizzazione

-Legge i chunk prodotti da step-6, genera gli embedding tramite Ollama e li
+Legge i chunk prodotti dal chunker, genera gli embedding tramite Ollama e li
 salva in ChromaDB (vector store persistente su disco).

 ---

 ## Prerequisiti

- Step-6 completato (esiste `step-6/<stem>/chunks.json`)
+- Chunking completato (esiste `chunks/<stem>/chunks.json`)
 - Ollama attivo con il modello di embedding scaricato
 - `chromadb` installato (`pip install -r requirements.txt`)

 ---

+## Verifica ambiente
+
+Prima di eseguire l'ingestion, verifica che Ollama e i modelli siano disponibili:
+
+```bash
+.venv/bin/python ollama/check_env.py
+```
+
+---
+
 ## Configurazione modello

-Il modello di embedding viene letto da **`step-9/config.py`**:
+Il modello di embedding viene letto da **`config.py`**:

 ```python
-# step-9/config.py
+# config.py
 EMBED_MODEL = "nomic-embed-text"   # ← cambia qui
 ```

-> Il modello scelto qui deve corrispondere a quello usato in step-9.
-> Se lo cambi dopo aver già vettorizzato, devi rieseguire step-8 con `--force`.
+> Il modello scelto qui deve corrispondere a quello usato in `rag.py`.
+> Se lo cambi dopo aver già vettorizzato, devi rieseguire `ingestion/ingest.py` con `--force`.

 ---

@@ -31,16 +41,16 @@ EMBED_MODEL = "nomic-embed-text"   # ← cambia qui

 ```bash
 # Vettorizza un singolo documento
-python step-8/ingest.py --stem <nome>
+.venv/bin/python ingestion/ingest.py --stem <nome>

 # Vettorizza tutti i documenti trovati in chunks/
-python step-8/ingest.py
+.venv/bin/python ingestion/ingest.py

 # Sovrascrive una collection già esistente
-python step-8/ingest.py --stem <nome> --force
+.venv/bin/python ingestion/ingest.py --stem <nome> --force

 # Override modello (senza modificare config.py)
-python step-8/ingest.py --stem <nome> --model bge-m3
+.venv/bin/python ingestion/ingest.py --stem <nome> --model bge-m3
 ```

 ---
@@ -54,7 +64,7 @@ distanza coseno. La directory è ignorata da git (generata automaticamente).

 ## Modelli supportati

-Stessi modelli raccomandati nel [README di step-7](../step-7/README.md).
+Stessi modelli raccomandati nel [README di ollama](../ollama/README.md).
 Il modello deve essere scaricato in Ollama prima di eseguire questo script
 (`ollama pull <modello>`).

@@ -81,20 +91,20 @@ Prima scelta: `qwen3-embedding:0.6b`.
 `qwen3-embedding` + `qwen3.5` condividono tokenizer e spazio semantico —
 il retrieval è più coerente rispetto a modelli di famiglie diverse.

-### Coerenza tra step-8 e step-9
+### Coerenza tra ingest e retrieval

-**`EMBED_MODEL` deve essere identico in step-8 e step-9.**
-ChromaDB memorizza i vettori generati con un certo modello. Se step-9 usa un
+**`EMBED_MODEL` deve essere identico in `ingest.py` e `rag.py`.**
+ChromaDB memorizza i vettori generati con un certo modello. Se `rag.py` usa un
 modello diverso per la query di ricerca, gli spazi vettoriali non corrispondono
 e il retrieval restituisce risultati casuali — senza alcun errore visibile.

 **Dopo aver cambiato `EMBED_MODEL`, riesegui sempre con `--force`.**
 Senza `--force` lo script salta la collection già esistente — i vecchi vettori
-(generati col modello precedente) restano e continuano a essere usati da step-9.
+(generati col modello precedente) restano e continuano a essere usati da `rag.py`.

 ```bash
 # Cambio modello → ricrea sempre la collection
-python step-8/ingest.py --stem <nome> --force
+.venv/bin/python ingestion/ingest.py --stem <nome> --force
 ```

 ### Quando usare `--force`
@@ -103,7 +113,7 @@ python step-8/ingest.py --stem <nome> --force
 |---|---|
 | Prima esecuzione | No |
 | Hai cambiato `EMBED_MODEL` | **Sì** |
-| Hai migliorato i chunk in step-6 | **Sì** |
+| Hai migliorato i chunk in chunks/ | **Sì** |
 | Hai aggiunto nuovi documenti (stem diverso) | No |
 | Vuoi solo verificare che funzioni | No |
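
Reviewer note: one way to make a model mismatch visible instead of silent is to record the embedding model in the collection metadata at ingestion time and compare it before querying. The sketch below only illustrates that comparison on plain dictionaries. The `embed_model` metadata key and the `check_embed_model_matches` helper are assumptions and are not part of `ingest.py` or `rag.py`.

```python
def check_embed_model_matches(collection_metadata: dict, query_model: str) -> bool:
    """Return True when the model stored at ingestion time equals the query model.

    `collection_metadata` stands in for the metadata dict of a ChromaDB
    collection; the "embed_model" key is a hypothetical convention.
    """
    stored = collection_metadata.get("embed_model")
    if stored is None:
        # Collections created without the key give nothing to compare against.
        return False
    return stored == query_model


meta = {"hnsw:space": "cosine", "embed_model": "nomic-embed-text"}
print(check_embed_model_matches(meta, "nomic-embed-text"))  # True
print(check_embed_model_matches(meta, "bge-m3"))            # False
```

With a check of this kind, a `False` result would be the signal to re-run the ingestion with `--force`.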

@@ -2,21 +2,21 @@
 """
 Vettorizzazione chunk in ChromaDB

-Legge i chunk prodotti da step-6, genera gli embedding tramite Ollama
+Legge i chunk prodotti da chunks/, genera gli embedding tramite Ollama
 e li indicizza in ChromaDB (persistente).

-Il modello di embedding viene letto da step-9/config.py (EMBED_MODEL).
+Il modello di embedding viene letto da config.py (EMBED_MODEL).
 Puoi sovrascriverlo con --model, ma deve corrispondere al modello che
-userai in step-9 — altrimenti riesegui con --force dopo aver cambiato.
+userai in rag.py — altrimenti riesegui con --force dopo aver cambiato.

-Input:  step-6/<stem>/chunks.json
+Input:  chunks/<stem>/chunks.json
 Output: chroma_db/<stem> (collection ChromaDB)

 Uso:
-    python step-8/ingest.py --stem <nome>            # singolo documento
-    python step-8/ingest.py                          # tutti gli stem trovati
-    python step-8/ingest.py --stem <nome> --force    # sovrascrive collection
-    python step-8/ingest.py --model bge-m3           # override modello
+    python ingestion/ingest.py --stem <nome>            # singolo documento
+    python ingestion/ingest.py                          # tutti gli stem trovati
+    python ingestion/ingest.py --stem <nome> --force    # sovrascrive collection
+    python ingestion/ingest.py --model bge-m3           # override modello
 """

 import argparse
@@ -33,12 +33,10 @@ import chromadb

 project_root = Path(__file__).parent.parent

-CHUNKS_DIR = project_root / "step-6"
+CHUNKS_DIR = project_root / "chunks"
 CHROMA_DIR = project_root / "chroma_db"

-# Legge EMBED_MODEL e OLLAMA_URL da step-9/config.py (fonte di verità).
-# Per spostare config.py alla root: cambia solo la riga qui sotto.
-sys.path.insert(0, str(project_root / "step-9"))
+sys.path.insert(0, str(project_root))
 from config import EMBED_MODEL, OLLAMA_URL  # noqa: E402

 EMBED_ENDPOINT = f"{OLLAMA_URL}/api/embeddings"
@@ -96,40 +94,27 @@ def collection_exists(client: chromadb.PersistentClient, stem: str) -> bool:

 # ─── Ingestione ───────────────────────────────────────────────────────────────

-def ingest(stem: str, force: bool, model: str = EMBED_MODEL) -> bool:
+def _ingest_stem(stem: str, collection: chromadb.Collection,
+                 model: str, offset: int = 0) -> int:
    """
-    Legge step-6/<stem>/chunks.json, genera embedding e popola ChromaDB.
-    Ritorna True se completato con successo, False altrimenti.
+    Aggiunge i chunk di uno stem a una collection esistente.
+    Prefissa chunk_id con stem per evitare collisioni multi-documento.
+    Ritorna il numero di chunk aggiunti.
    """
    chunks_path = CHUNKS_DIR / stem / "chunks.json"
    if not chunks_path.exists():
        print(f"❌ File non trovato: {chunks_path}")
-        return False
+        return 0

    with open(chunks_path, encoding="utf-8") as f:
        chunks = json.load(f)

    if not chunks:
        print(f"⚠️  {stem}: chunks.json è vuoto — skip")
-        return False
-
-    client = get_client()
-
-    if collection_exists(client, stem):
-        if not force:
-            print(f"⚠️  Collection '{stem}' già presente in ChromaDB — skip")
-            print(f"   → usa --force per sovrascrivere")
-            return True  # non è un errore, è uno skip
-        client.delete_collection(stem)
-        print(f"🗑️  Collection '{stem}' rimossa (--force)")
-
-    collection = client.create_collection(
-        name=stem,
-        metadata={"hnsw:space": "cosine"},
-    )
+        return 0

    total = len(chunks)
-    print(f"📦 {total} chunk da ingestire\n")
+    print(f"  📄 {stem}: {total} chunk\n")

    ids        = []
    embeddings = []
@@ -145,10 +130,11 @@ def ingest(stem: str, force: bool, model: str = EMBED_MODEL) -> bool:
        t1 = time.monotonic()
        durations.append(t1 - t0)

-        ids.append(chunk["chunk_id"])
+        ids.append(f"{stem}__{chunk['chunk_id']}")
        embeddings.append(vector)
        documents.append(chunk["text"])
        metadatas.append({
+            "source":    stem,
            "sezione":   chunk.get("sezione", ""),
            "titolo":    chunk.get("titolo", ""),
            "sub_index": chunk.get("sub_index", 0),
@@ -156,41 +142,69 @@ def ingest(stem: str, force: bool, model: str = EMBED_MODEL) -> bool:

        avg  = sum(durations) / len(durations)
        eta  = int(avg * (total - i))
-        done = f"[{i:>{len(str(total))}}/{total}]"
-        cid  = chunk["chunk_id"][:50]
-        line = f"  {done} ✓ {cid:<50}  ETA: {eta}s"
-        print(f"{line:<80}", end="\r", flush=True)
+        done = f"[{offset + i:>6}/{offset + total}]"
+        cid  = chunk["chunk_id"][:40]
+        print(f"  {done} ✓ {stem}/{cid:<40}  ETA: {eta}s", end="\r", flush=True)

-        # Upsert in batch da 100 per non sovraccaricare la memoria
        if len(ids) == 100:
-            collection.add(
-                ids=ids,
-                embeddings=embeddings,
-                documents=documents,
-                metadatas=metadatas,
-            )
+            collection.add(ids=ids, embeddings=embeddings,
+                           documents=documents, metadatas=metadatas)
            ids, embeddings, documents, metadatas = [], [], [], []

-    # Upsert dei rimanenti
    if ids:
-        collection.add(
-            ids=ids,
-            embeddings=embeddings,
-            documents=documents,
-            metadatas=metadatas,
-        )
+        collection.add(ids=ids, embeddings=embeddings,
+                       documents=documents, metadatas=metadatas)

    elapsed = int(time.monotonic() - start)
-    print()  # nuova riga dopo il \r
-    print(f"\n✅ Ingestione completata in {elapsed}s — {total}/{total} chunk salvati")
-    print(f"   Collection '{stem}' in {CHROMA_DIR}/")
+    print()
+    print(f"  ✅ {stem}: {total} chunk in {elapsed}s")
+    return total
+
+
+def ingest(stem: str, force: bool, model: str = EMBED_MODEL) -> bool:
+    """Ingest singolo documento nella sua collection dedicata (retrocompatibile)."""
+    return ingest_multi([stem], collection_name=stem, force=force, model=model)
+
+
+def ingest_multi(stems: list[str], collection_name: str,
+                 force: bool, model: str = EMBED_MODEL) -> bool:
+    """
+    Ingerisce uno o più documenti in una singola collection ChromaDB.
+    I chunk_id sono prefissati con lo stem per evitare collisioni.
+    Il metadato 'source' identifica il documento di provenienza.
+    """
+    client = get_client()
+
+    if collection_exists(client, collection_name):
+        if not force:
+            print(f"⚠️  Collection '{collection_name}' già presente in ChromaDB — skip")
+            print(f"   → usa --force per sovrascrivere")
+            return True
+        client.delete_collection(collection_name)
+        print(f"🗑️  Collection '{collection_name}' rimossa (--force)")
+
+    collection = client.create_collection(
+        name=collection_name,
+        metadata={"hnsw:space": "cosine"},
+    )
+
+    total_chunks = 0
+    for stem in stems:
+        total_chunks += _ingest_stem(stem, collection, model, offset=total_chunks)
+
+    # Nessun chunk ingerito da nessuno stem: la collection resta vuota, segnala il fallimento
+    if total_chunks == 0:
+        return False
+
+    print(f"\n✅ Collection '{collection_name}': {total_chunks} chunk totali")
+    print(f"   Documenti: {', '.join(stems)}")
+    print(f"   Percorso: {CHROMA_DIR}/")
    return True


 # ─── Entry point ──────────────────────────────────────────────────────────────

 def find_stems() -> list[str]:
-    """Ritorna tutti gli stem che hanno un chunks.json in step-6/."""
+    """Ritorna tutti gli stem che hanno un chunks.json in chunks/."""
    return sorted(
        p.parent.name
        for p in CHUNKS_DIR.glob("*/chunks.json")
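Reviewer note: the hunk above prefixes every `chunk_id` with its stem and adds a `source` metadata field, so that several documents can share one collection. A minimal sketch of that scheme on plain data, without ChromaDB. `build_records` and the two sample documents are hypothetical and only mirror the `f"{stem}__{chunk_id}"` pattern from the diff.

```python
def build_records(stem: str, chunks: list[dict]) -> tuple[list[str], list[dict]]:
    """Build collision-free ids and metadata for the chunks of one document."""
    ids = [f"{stem}__{c['chunk_id']}" for c in chunks]
    metadatas = [{"source": stem, "titolo": c.get("titolo", "")} for c in chunks]
    return ids, metadatas


# Two hypothetical documents whose chunker produced the same chunk_id.
doc_a = [{"chunk_id": "intro_0", "titolo": "Introduzione"}]
doc_b = [{"chunk_id": "intro_0", "titolo": "Premessa"}]

ids_a, meta_a = build_records("doc1", doc_a)
ids_b, meta_b = build_records("doc2", doc_b)
print(ids_a + ids_b)  # ['doc1__intro_0', 'doc2__intro_0']
```

Without the prefix both chunks would carry the id `intro_0`, and adding the second one to the same collection would clash with the first.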
@@ -199,26 +213,53 @@ def find_stems() -> list[str]:

 def main() -> int:
    parser = argparse.ArgumentParser(
-        description="Step 8 — Vettorizzazione chunk in ChromaDB"
+        description="Vettorizzazione chunk in ChromaDB",
+        epilog=(
+            "Esempi:\n"
+            "  python ingestion/ingest.py --stem manuale\n"
+            "  python ingestion/ingest.py --collection archivio --stems doc1 doc2 doc3\n"
+            "  python ingestion/ingest.py --collection archivio --stems doc1 doc2 --force\n"
+            "  python ingestion/ingest.py                          # tutti i documenti, collection separate"
+        ),
+        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
-    parser.add_argument("--stem", help="Nome del documento (senza --stem = tutti)")
+    parser.add_argument("--stem",
+                        help="Singolo documento → collection con lo stesso nome")
+    parser.add_argument("--stems", nargs="+", metavar="STEM",
+                        help="Uno o più documenti da unire in --collection")
+    parser.add_argument("--collection",
+                        help="Nome della collection di destinazione (richiesto con --stems)")
    parser.add_argument("--force", action="store_true",
                        help="Sovrascrive la collection se già esistente")
    parser.add_argument("--model", default=EMBED_MODEL,
-                        help=f"Modello embedding Ollama (default da step-9/config.py: {EMBED_MODEL})")
+                        help=f"Modello embedding (default: {EMBED_MODEL})")
    args = parser.parse_args()

-    print("─── Step 8 — Vettorizzazione ─────────────────────────────────────────\n")
+    print("─── Vettorizzazione ──────────────────────────────────────────────────\n")

    if not check_ollama(args.model):
        return 1

+    # ── Modalità multi-documento ─────────────────────────────────────────────
+    if args.stems or args.collection:
+        if not args.stems:
+            print("❌ --collection richiede --stems (es. --stems doc1 doc2 doc3)")
+            return 1
+        if not args.collection:
+            print("❌ --stems richiede --collection (es. --collection archivio)")
+            return 1
+        print(f"  Collection : {args.collection}")
+        print(f"  Documenti  : {', '.join(args.stems)}\n")
+        ok = ingest_multi(args.stems, args.collection,
+                          force=args.force, model=args.model)
+        return 0 if ok else 1
+
+    # ── Modalità singolo / tutti ─────────────────────────────────────────────
    stems = [args.stem] if args.stem else find_stems()
    if not stems:
-        print("❌ Nessun chunks.json trovato in step-6/")
+        print("❌ Nessun chunks.json trovato in chunks/")
        return 1

-    print()
    results = []
    for stem in stems:
        if len(stems) > 1:
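Reviewer note: the multi-document mode requires `--stems` and `--collection` to be given together. The commit checks this by hand and returns exit code 1. A possible alternative, sketched below under the assumption that a standard argparse usage error (exit code 2) is acceptable, delegates the message to `parser.error`, as `rag.py` already does. `parse_cli` is a hypothetical helper.

```python
import argparse


def parse_cli(argv: list[str]) -> argparse.Namespace:
    """Parse the ingestion flags and reject a half-specified multi-document mode."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--stem")
    parser.add_argument("--stems", nargs="+", metavar="STEM")
    parser.add_argument("--collection")
    args = parser.parse_args(argv)
    # --stems and --collection are only meaningful as a pair.
    if bool(args.stems) != bool(args.collection):
        parser.error("--stems and --collection must be used together")
    return args


args = parse_cli(["--collection", "archivio", "--stems", "doc1", "doc2"])
print(args.collection, args.stems)  # archivio ['doc1', 'doc2']
```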
@@ -0,0 +1,113 @@
+# Ollama — Verifica Ambiente
+
+Prima di procedere con la vettorizzazione (`ingestion/ingest.py`) devi avere installato:
+
+- **Ollama** — server locale per LLM e embedding
+- un **modello di embedding** (es. `qwen3-embedding:0.6b`, `bge-m3`)
+- un **modello LLM** (es. `qwen3.5:4b`)
+- **chromadb** — libreria Python per il vector store
+
+---
+
+## 1. Installa Ollama
+
+```bash
+curl -fsSL https://ollama.com/install.sh | sh
+```
+
+Verifica che il servizio sia attivo:
+
+```bash
+ollama list
+```
+
+### Disinstalla Ollama
+
+```bash
+# Ferma e rimuovi il servizio systemd
+sudo systemctl stop ollama
+sudo systemctl disable ollama
+sudo rm /etc/systemd/system/ollama.service
+sudo systemctl daemon-reload
+
+# Rimuovi il binario
+sudo rm /usr/local/bin/ollama
+
+# Rimuovi modelli e dati (opzionale)
+sudo rm -rf /usr/share/ollama
+
+# Rimuovi utente e gruppo di sistema (opzionale)
+sudo userdel ollama
+sudo groupdel ollama
+```
+
+---
+
+## 2. Scarica i modelli
+
+### Modello di embedding (consigliato)
+
+```bash
+ollama pull qwen3-embedding:0.6b
+```
+
+Alternative supportate:
+
+- `nomic-embed-text-v2-moe`
+- `bge-m3`
+- `nomic-embed-text`
+
+Se cambi modello di embedding rispetto a quello usato in fase di ingestion, aggiorna `EMBED_MODEL` in `config.py` e riesegui `ingestion/ingest.py` con `--force`.
+
+### Modello LLM (consigliato per 8 GB RAM)
+
+```bash
+ollama pull qwen3.5:4b
+```
+
+Se usi un modello diverso, aggiorna `OLLAMA_MODEL` in `config.py`.
+
+### Disinstalla un modello
+
+```bash
+ollama rm qwen3.5:4b
+ollama rm qwen3-embedding:0.6b
+```
+
+---
+
+## 3. Installa le dipendenze Python
+
+```bash
+source .venv/bin/activate
+pip install -r requirements.txt
+```
+
+---
+
+## 4. Verifica ambiente
+
+```bash
+source .venv/bin/activate
+python ollama/check_env.py
+```
+
+Output atteso (esempio):
+
+```text
+✅ ollama trovato nel PATH
+✅ ollama risponde correttamente
+✅ embedding disponibile: qwen3-embedding:0.6b
+✅ LLM disponibile: qwen3.5:4b
+✅ chromadb importabile
+✅ Ambiente pronto
+```
+
+---
+
+## Prossimo passo
+
+```bash
+python ingestion/ingest.py --stem <nome>
+```
@@ -1,6 +1,6 @@
 #!/usr/bin/env python3
 """
-Step 7 — Verifica ambiente
+Verifica ambiente Ollama

 Controlla che tutti i prerequisiti per la vettorizzazione siano soddisfatti:
  1. ollama è nel PATH e risponde
@@ -12,7 +12,7 @@ Output: report a schermo con ✅ / ❌ per ogni componente.
 Nessun file scritto. Exit 0 se tutto OK, 1 altrimenti.

 Uso:
-    python step-7/check_env.py
+    python ollama/check_env.py
 """

 import shutil
@@ -22,7 +22,7 @@ from pathlib import Path


 # ─── Lista canonica di modelli embedding supportati ───────────────────────────
-# Ordine: prima scelta → ultima scelta (come da README step-7)
+# Ordine: prima scelta → ultima scelta (come da ollama/README.md)
 EMBED_MODELS = [
    "qwen3-embedding",
    "nomic-embed-text-v2-moe",
@@ -32,17 +32,35 @@ EMBED_MODELS = [
    "paraphrase-multilingual",
    "all-minilm",
 ]
+EMBED_MODEL_PREFIXES = tuple(EMBED_MODELS)
+
+OLLAMA_SERVE_HINT = "   → Avvia il servizio con: ollama serve"
+RECOMMENDED_EMBED_MODEL = "qwen3-embedding:0.6b"
+RECOMMENDED_LLM_MODEL = "qwen3.5:4b"


 def _is_embed(model_name: str) -> bool:
    """True se il modello è riconosciuto come embedding (lista canonica o keyword)."""
    base = model_name.split(":")[0].lower()
-    return any(base == e or base.startswith(e) for e in EMBED_MODELS) or "embed" in base
+    return base.startswith(EMBED_MODEL_PREFIXES) or "embed" in base


-# ─── Modelli configurati in step-9/config.py ─────────────────────────────────
-# Per spostare config.py alla root: cambia solo la riga qui sotto.
-sys.path.insert(0, str(Path(__file__).parent.parent / "step-9"))
+def _parse_ollama_models(raw_output: str) -> list[str]:
+    """Estrae i nomi modello dall'output di `ollama list`."""
+    models: list[str] = []
+    for idx, line in enumerate(raw_output.splitlines()):
+        line = line.strip()
+        if not line:
+            continue
+        # Prima riga: header tabellare ("NAME ...")
+        if idx == 0 and line.lower().startswith("name"):
+            continue
+        model_name = line.split(maxsplit=1)[0]
+        models.append(model_name)
+    return models
+
+
+sys.path.insert(0, str(Path(__file__).parent.parent))
 try:
    from config import EMBED_MODEL as CONFIGURED_EMBED, OLLAMA_MODEL as CONFIGURED_LLM
 except Exception:
@@ -54,6 +72,15 @@ REQUIRED_LIBS = ["chromadb"]

 # ─── Checks ───────────────────────────────────────────────────────────────────

+def _print_model_list(title: str, models: list[str]) -> None:
+    """Stampa in modo uniforme una lista di modelli."""
+    if not models:
+        print(f"   {title}: nessuno")
+        return
+    print(f"   {title} ({len(models)}):")
+    for model in models:
+        print(f"   - {model}")
+
 def check_ollama_in_path() -> bool:
    """Verifica che ollama sia nel PATH."""
    found = shutil.which("ollama") is not None
@@ -77,14 +104,9 @@ def check_ollama_running() -> list[str] | None:
        )
        if result.returncode != 0:
            print("❌ ollama non risponde (errore all'avvio)")
-            print("   → Avvia il servizio con: ollama serve")
+            print(OLLAMA_SERVE_HINT)
            return None
-        lines = result.stdout.strip().splitlines()
-        models = []
-        for line in lines[1:]:  # salta l'header
-            parts = line.split()
-            if parts:
-                models.append(parts[0])
+        models = _parse_ollama_models(result.stdout)
        print("✅ ollama risponde correttamente")
        return models
    except FileNotFoundError:
@@ -92,7 +114,7 @@ def check_ollama_running() -> list[str] | None:
        return None
    except subprocess.TimeoutExpired:
        print("❌ ollama non risponde (timeout)")
-        print("   → Avvia il servizio con: ollama serve")
+        print(OLLAMA_SERVE_HINT)
        return None


@@ -107,45 +129,58 @@ def _match(model_name: str, available: list[str]) -> str | None:
    return None


+def _check_configured_model(
+    configured_name: str | None,
+    available: list[str],
+    label: str,
+) -> bool | None:
+    """
+    Se esiste un modello configurato, lo verifica e ritorna True/False.
+    Se non è configurato, ritorna None (il chiamante userà il fallback).
+    """
+    if not configured_name:
+        return None
+
+    print(f"   modello configurato (config.py): {configured_name}")
+    found = _match(configured_name, available)
+    if found:
+        print(f"✅ {label} disponibile: {found}")
+        return True
+
+    print(f"❌ {configured_name} non trovato in Ollama")
+    print(f"   → ollama pull {configured_name}")
+    return False
+
+
 def check_embed_model(available: list[str]) -> bool:
    """Verifica che il modello di embedding configurato sia disponibile."""
-    if CONFIGURED_EMBED:
-        print(f"   modello configurato (step-9/config.py): {CONFIGURED_EMBED}")
-        found = _match(CONFIGURED_EMBED, available)
-        if found:
-            print(f"✅ embedding disponibile: {found}")
-            return True
-        print(f"❌ {CONFIGURED_EMBED} non trovato in Ollama")
-        print(f"   → ollama pull {CONFIGURED_EMBED}")
-        return False
+    configured_check = _check_configured_model(CONFIGURED_EMBED, available, "embedding")
+    if configured_check is not None:
+        return configured_check
+
    # fallback: config.py non leggibile
    found = next((m for m in available if _is_embed(m)), None)
    if found:
        print(f"✅ modello embedding trovato: {found}")
        return True
    print("❌ nessun modello di embedding trovato")
-    print(f"   → Prima scelta: ollama pull qwen3-embedding:0.6b")
+    print(f"   → Prima scelta: ollama pull {RECOMMENDED_EMBED_MODEL}")
    return False


 def check_llm_model(available: list[str]) -> bool:
    """Verifica che il modello LLM configurato sia disponibile."""
-    if CONFIGURED_LLM:
-        print(f"   modello configurato (step-9/config.py): {CONFIGURED_LLM}")
-        found = _match(CONFIGURED_LLM, available)
-        if found:
-            print(f"✅ LLM disponibile: {found}")
-            return True
-        print(f"❌ {CONFIGURED_LLM} non trovato in Ollama")
-        print(f"   → ollama pull {CONFIGURED_LLM}")
-        return False
+    configured_check = _check_configured_model(CONFIGURED_LLM, available, "LLM")
+    if configured_check is not None:
+        return configured_check
+
    # fallback: config.py non leggibile
-    llm_candidates = [m for m in available if not _is_embed(m)]
-    if llm_candidates:
-        print(f"✅ modello LLM trovato: {llm_candidates[0]}")
+    first_llm = next((m for m in available if not _is_embed(m)), None)
+    if first_llm:
+        print(f"✅ modello LLM trovato: {first_llm}")
        return True
    print("❌ nessun modello LLM trovato")
-    print(f"   → Consigliato per 8 GB RAM: ollama pull qwen3.5:4b")
+    print(f"   → Consigliato per 8 GB RAM: ollama pull {RECOMMENDED_LLM_MODEL}")
    return False


@@ -164,7 +199,7 @@ def check_library(lib: str) -> bool:
 # ─── Entry point ──────────────────────────────────────────────────────────────

 def main() -> int:
-    print("─── Step 7 — Verifica ambiente ───────────────────────────────────────\n")
+    print("─── Verifica ambiente Ollama ─────────────────────────────────────────\n")

    results: list[bool] = []

@@ -179,6 +214,14 @@ def main() -> int:
            results.extend([False, False, False])
        else:
            results.append(True)
+            _print_model_list(
+                "modelli embedding rilevati",
+                [m for m in available if _is_embed(m)],
+            )
+            _print_model_list(
+                "modelli LLM rilevati",
+                [m for m in available if not _is_embed(m)],
+            )
            results.append(check_embed_model(available))
            results.append(check_llm_model(available))
    else:
@@ -195,11 +238,10 @@ def main() -> int:
    print("──────────────────────────────────────────────────────────────────────")
    all_ok = all(results)
    if all_ok:
-        print("✅ Ambiente pronto — procedi con la vettorizzazione:")
-        print("   python step-8/ingest.py --stem <nome>")
+        print("✅ Ambiente pronto")
    else:
        n_fail = sum(1 for r in results if not r)
-        print(f"⚠️  {n_fail} problema/i rilevato/i — risolvi prima di procedere con step-8.")
+        print(f"⚠️  {n_fail} problema/i rilevato/i — risolvi prima di procedere.")

    return 0 if all_ok else 1
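
Reviewer note: the refactored `check_env.py` splits the `ollama list` output into embedding models and LLMs. The sketch below replays that logic on a made-up listing, so it runs without Ollama installed. The sample rows (ids, sizes, dates) are invented, and `parse_model_names` and `is_embedding` are simplified stand-ins for `_parse_ollama_models` and `_is_embed`, with a shortened prefix list.

```python
def parse_model_names(raw_output: str) -> list[str]:
    """Return the first column of each non-empty, non-header line."""
    names = []
    for idx, line in enumerate(raw_output.splitlines()):
        line = line.strip()
        if not line:
            continue
        # The first line of `ollama list` is a table header ("NAME ...").
        if idx == 0 and line.lower().startswith("name"):
            continue
        names.append(line.split(maxsplit=1)[0])
    return names


def is_embedding(name: str) -> bool:
    """Heuristic: known prefix or 'embed' in the part before the tag."""
    base = name.split(":")[0].lower()
    return base.startswith(("bge-m3", "all-minilm")) or "embed" in base


sample = (
    "NAME                    ID        SIZE    MODIFIED\n"
    "qwen3-embedding:0.6b    abc123    639 MB  2 days ago\n"
    "qwen3.5:4b              def456    3.4 GB  2 days ago\n"
)
models = parse_model_names(sample)
print([m for m in models if is_embedding(m)])      # ['qwen3-embedding:0.6b']
print([m for m in models if not is_embedding(m)])  # ['qwen3.5:4b']
```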

@@ -1,7 +1,7 @@
 #!/usr/bin/env python3
 """
 Test chat locale Ollama — senza RAG, senza ChromaDB.
-Uso: python step-9/test_ollama.py
+Uso: python ollama/test_ollama.py
 """

 import json
@@ -10,7 +10,7 @@ import urllib.error
 import urllib.request
 from pathlib import Path

-sys.path.insert(0, str(Path(__file__).parent))
+sys.path.insert(0, str(Path(__file__).parent.parent))
 import config as _cfg

 OLLAMA_URL  = _cfg.OLLAMA_URL
@@ -1,6 +1,6 @@
 #!/usr/bin/env python3
 """
-Step 9 — Pipeline RAG interattiva
+Pipeline RAG interattiva

 Riceve una domanda, recupera i chunk più rilevanti da ChromaDB (retrieval)
 e genera una risposta tramite Ollama (generation).
@@ -9,7 +9,7 @@ Input:  chroma_db/<stem> (collection ChromaDB)
 Output: risposta a schermo

 Uso:
-    python step-9/rag.py --stem <nome>
+    python rag.py --stem <nome>

 Nel loop interattivo:
    Domanda: <testo>       → risposta
@@ -31,7 +31,7 @@ import chromadb
 sys.path.insert(0, str(Path(__file__).parent))
 import config as _cfg

-project_root = Path(__file__).parent.parent
+project_root = Path(__file__).parent
 CHROMA_DIR   = project_root / "chroma_db"

 OLLAMA_URL    = _cfg.OLLAMA_URL
@@ -101,6 +101,7 @@ def retrieve(collection: chromadb.Collection, question: str) -> list[dict]:
    ):
        chunks.append({
            "text":     text,
+            "source":   meta.get("source", ""),
            "sezione":  meta.get("sezione", ""),
            "titolo":   meta.get("titolo", ""),
            "distance": dist,
@@ -143,7 +144,8 @@ def answer(question: str, collection: chromadb.Collection, verbose: bool) -> Non
            if c["titolo"]:
                loc += f" > {c['titolo']}"
            sim = 1 - c["distance"]
-            print(f"  [{i}] {loc}  (similarità: {sim:.3f})")
+            src = f"[{c['source']}] " if c.get("source") else ""
+            print(f"  [{i}] {src}{loc}  (similarità: {sim:.3f})")
            print(f"      {c['text'][:120].replace(chr(10), ' ')}...")
        print("──────────────────────────────────────────────────────────────\n")

@@ -183,7 +185,7 @@ def run_loop(collection: chromadb.Collection) -> None:
 def _build_epilog() -> str:
    lines = [
        "Uso:",
-        "  python step-9/rag.py --stem <nome>",
+        "  python rag.py --stem <nome>",
        "",
        "Loop interattivo:",
        "  <domanda>       risposta basata sul documento",
@@ -197,7 +199,7 @@ def _build_epilog() -> str:
            if names:
                lines += ["", f"Collection disponibili: {', '.join(names)}"]
            else:
-                lines += ["", "Nessuna collection trovata — eseguire prima: python step-8/ingest.py"]
+                lines += ["", "Nessuna collection trovata — eseguire prima: python ingestion/ingest.py"]
        except Exception:
            pass
    return "\n".join(lines)
@@ -206,43 +208,50 @@ def _build_epilog() -> str:
 def main() -> int:
    parser = argparse.ArgumentParser(
        description=(
-            "Step 9 — Pipeline RAG interattiva\n\n"
+            "Pipeline RAG interattiva\n\n"
            "Risponde a domande in linguaggio naturale su un documento\n"
-            "indicizzato in ChromaDB da step-8/ingest.py."
+            "indicizzato in ChromaDB da ingestion/ingest.py."
        ),
        epilog=_build_epilog(),
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    parser.add_argument(
        "--stem",
-        required=True,
-        help=(
-            "Nome della collection ChromaDB da interrogare. "
-            "Le collection vengono create da: python step-8/ingest.py --stem <nome>"
-        ),
+        help="Collection di un singolo documento (retrocompatibile)",
+    )
+    parser.add_argument(
+        "--collection",
+        help="Collection multi-documento creata con: ingest.py --collection <nome> --stems ...",
    )
    args = parser.parse_args()

-    print("─── Step 9 — Pipeline RAG ────────────────────────────────────────────\n")
-    print(f"  Documento : {args.stem}")
-    print(f"  Modello   : {LLM_MODEL}")
-    print(f"  Top-K     : {TOP_K}")
-    print(f"  Thinking  : {'off' if NO_THINK else 'on'}")
+    collection_name = args.collection or args.stem
+    if not collection_name:
+        parser.error("specifica --stem <nome> oppure --collection <nome>")
+
+    print("─── Pipeline RAG ────────────────────────────────────────────\n")
+    print(f"  Collection : {collection_name}")
+    print(f"  Modello    : {LLM_MODEL}")
+    print(f"  Top-K      : {TOP_K}")
+    print(f"  Thinking   : {'off' if NO_THINK else 'on'}")
    print()

    if not CHROMA_DIR.exists():
-        print("❌ chroma_db/ non trovata — esegui prima step-8")
+        print("❌ chroma_db/ non trovata — esegui prima ingestion/ingest.py")
        return 1

    client = chromadb.PersistentClient(path=str(CHROMA_DIR))
    collections = [c.name for c in client.list_collections()]
-    if args.stem not in collections:
-        print(f"❌ Collection '{args.stem}' non trovata in chroma_db/")
-        print(f"   → python step-8/ingest.py --stem {args.stem}")
+    if collection_name not in collections:
+        print(f"❌ Collection '{collection_name}' non trovata in chroma_db/")
+        if not args.collection:
+            print(f"   → python ingestion/ingest.py --stem {collection_name}")
+        else:
+            print(f"   → python ingestion/ingest.py --collection {collection_name} --stems doc1 doc2 ...")
        return 1

-    collection = client.get_collection(args.stem)
-    print(f"✅ Collection '{args.stem}' caricata ({collection.count()} chunk)\n")
+    collection = client.get_collection(collection_name)
+    print(f"✅ Collection '{collection_name}' caricata ({collection.count()} chunk)\n")

    run_loop(collection)
    return 0
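
Review note: the `--stem` / `--collection` fallback in this hunk could be isolated in a small helper so that the precedence (`--collection` wins when both flags are given) and the matching ingest hint are explicit and testable. The sketch below is only an illustration: `resolve_collection` is a hypothetical name that does not exist in the repository, and the hint strings simply mirror the messages printed in `main()`.

```python
from __future__ import annotations


def resolve_collection(stem: str | None, collection: str | None) -> tuple[str, str]:
    """Return (collection_name, ingest_hint) for the two CLI flags.

    Hypothetical helper: --collection takes precedence over --stem,
    matching `args.collection or args.stem` in the diff.
    """
    name = collection or stem
    if not name:
        # main() reports this case through parser.error(); a plain
        # exception keeps the sketch independent of argparse.
        raise ValueError("specify --stem <name> or --collection <name>")
    if collection:
        hint = f"python ingestion/ingest.py --collection {name} --stems doc1 doc2 ..."
    else:
        hint = f"python ingestion/ingest.py --stem {name}"
    return name, hint
```

Choosing the hint from `collection` rather than from `stem` keeps the suggested command consistent with the name that was actually looked up, including when both flags are passed.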
@@ -1,12 +1,4 @@
-# Step 0-1 — Ispezione e verifica PDF
 pdfplumber==0.11.9
-
-# Step 2 — Conversione PDF → Markdown
-pymupdf4llm
-
-# conversione/ — Pipeline automatica PDF → clean Markdown (alternativa a step 0+1+2+3+4)
-# Richiede anche Java 11+ sul PATH: https://adoptium.net/
-opendataloader-pdf
-
-# Step 8 — Vettorizzazione
+PyMuPDF>=1.24.0
 chromadb
+pytest>=8.0
@@ -1,6 +1,6 @@
 #!/usr/bin/env python3
 """
-Step 9 — Retrieval puro (senza generazione LLM)
+Retrieval puro (senza generazione LLM)

 Loop interattivo: inserisci una query, ottieni i chunk più simili dalla
 collection ChromaDB tramite embedding semantico — senza chiamare Ollama
@@ -15,7 +15,7 @@ Input:  chroma_db/<stem> (collection ChromaDB)
 Output: lista chunk con score di similarità

 Uso:
-    python step-9/retrieve.py --stem <nome>
+    python retrieve.py --stem <nome>

 Nel loop interattivo:
    Query: <testo>      → chunk più simili con score
@@ -37,7 +37,7 @@ import chromadb
 sys.path.insert(0, str(Path(__file__).parent))
 import config as _cfg

-project_root = Path(__file__).parent.parent
+project_root = Path(__file__).parent
 CHROMA_DIR   = project_root / "chroma_db"

 OLLAMA_URL  = _cfg.OLLAMA_URL
@@ -85,6 +85,7 @@ def retrieve(collection: chromadb.Collection, query: str, top_k: int) -> list[di
        chunks.append({
            "rank":       rank,
            "similarity": round(1 - dist, 4),
+            "source":     meta.get("source", ""),
            "sezione":    meta.get("sezione", ""),
            "titolo":     meta.get("titolo", ""),
            "text":       text,
@@ -97,10 +98,11 @@ def retrieve(collection: chromadb.Collection, query: str, top_k: int) -> list[di
 def print_results(chunks: list[dict], full: bool = False) -> None:
    print(f"── {len(chunks)} chunk recuperati ─────────────────────────────────\n")
    for c in chunks:
+        src = f"[{c['source']}] " if c.get("source") else ""
        loc = c["sezione"]
        if c["titolo"]:
            loc += f" > {c['titolo']}"
-        print(f"  [{c['rank']}] similarità: {c['similarity']:.4f}  |  {loc}")
+        print(f"  [{c['rank']}] similarità: {c['similarity']:.4f}  |  {src}{loc}")
        if full:
            print()
            print(c["text"])
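
Review note: the new `source` prefix only appears when the chunk metadata carries a non-empty `source`, so single-document collections keep their previous output. A minimal sketch of the line formatting, pulled out as a pure function for illustration (`format_result_line` is a hypothetical name, and the sample values in the usage note are made up):

```python
def format_result_line(chunk: dict) -> str:
    """Build the one-line summary printed for each retrieved chunk."""
    # Prefix with the source document only when the metadata provides it.
    src = f"[{chunk['source']}] " if chunk.get("source") else ""
    loc = chunk["sezione"]
    if chunk["titolo"]:
        loc += f" > {chunk['titolo']}"
    return f"  [{chunk['rank']}] similarità: {chunk['similarity']:.4f}  |  {src}{loc}"
```

For example, a chunk with `source="doc1"`, `sezione="Cap. 1"` and `titolo="Intro"` gets the location `[doc1] Cap. 1 > Intro`, while a chunk ingested without `source` gets just `Cap. 1 > Intro`.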
@@ -145,7 +147,7 @@ def run_loop(collection: chromadb.Collection, top_k: int) -> None:
 def _build_epilog() -> str:
    lines = [
        "Uso:",
-        "  python step-9/retrieve.py --stem <nome>",
+        "  python retrieve.py --stem <nome>",
        "",
        "Nel loop interattivo:",
        "  <query>       chunk più simili con score (testo troncato)",
@@ -159,7 +161,7 @@ def _build_epilog() -> str:
            if names:
                lines += ["", f"Collection disponibili: {', '.join(names)}"]
            else:
-                lines += ["", "Nessuna collection trovata — eseguire prima: python step-8/ingest.py"]
+                lines += ["", "Nessuna collection trovata — eseguire prima: python ingestion/ingest.py"]
        except Exception:
            pass
    return "\n".join(lines)
@@ -168,7 +170,7 @@ def _build_epilog() -> str:
 def main() -> int:
    parser = argparse.ArgumentParser(
        description=(
-            "Step 9 — Retrieval puro (senza LLM)\n\n"
+            "Retrieval puro (senza LLM)\n\n"
            "Loop interattivo: inserisci una query e ottieni i chunk più simili\n"
            "tramite embedding semantico, senza generazione LLM."
        ),
@@ -177,8 +179,11 @@ def main() -> int:
    )
    parser.add_argument(
        "--stem",
-        required=True,
-        help="Nome della collection ChromaDB da interrogare.",
+        help="Collection di un singolo documento (retrocompatibile)",
+    )
+    parser.add_argument(
+        "--collection",
+        help="Collection multi-documento creata con: ingest.py --collection <nome> --stems ...",
    )
    parser.add_argument(
        "--top-k",
@@ -189,25 +194,32 @@ def main() -> int:
    )
    args = parser.parse_args()

-    print("─── Step 9 — Retrieval puro ──────────────────────────────────────────\n")
-    print(f"  Documento    : {args.stem}")
+    collection_name = args.collection or args.stem
+    if not collection_name:
+        parser.error("specifica --stem <nome> oppure --collection <nome>")
+
+    print("─── Retrieval puro ──────────────────────────────────────────\n")
+    print(f"  Collection   : {collection_name}")
    print(f"  Embed model  : {EMBED_MODEL}")
    print(f"  Top-K        : {args.top_k}")
    print()

    if not CHROMA_DIR.exists():
-        print("❌ chroma_db/ non trovata — esegui prima step-8", file=sys.stderr)
+        print("❌ chroma_db/ non trovata — esegui prima ingestion/ingest.py", file=sys.stderr)
        return 1

    client = chromadb.PersistentClient(path=str(CHROMA_DIR))
    collections = [c.name for c in client.list_collections()]
-    if args.stem not in collections:
-        print(f"❌ Collection '{args.stem}' non trovata in chroma_db/", file=sys.stderr)
-        print(f"   → python step-8/ingest.py --stem {args.stem}", file=sys.stderr)
+    if collection_name not in collections:
+        print(f"❌ Collection '{collection_name}' non trovata in chroma_db/", file=sys.stderr)
+        if not args.collection:
+            print(f"   → python ingestion/ingest.py --stem {collection_name}", file=sys.stderr)
+        else:
+            print(f"   → python ingestion/ingest.py --collection {collection_name} --stems doc1 doc2 ...", file=sys.stderr)
        return 1

-    collection = client.get_collection(args.stem)
-    print(f"✅ Collection '{args.stem}' caricata ({collection.count()} chunk)\n")
+    collection = client.get_collection(collection_name)
+    print(f"✅ Collection '{collection_name}' caricata ({collection.count()} chunk)\n")

    run_loop(collection, args.top_k)
    return 0
@@ -1,229 +0,0 @@
-#!/usr/bin/env python3
-"""
-Step 0 — Verifica idoneità PDF
-
-Legge tutti i PDF in sources/ e salva un report per ognuno in step-0/.
-
-Uso:
-    python step-0/check_pdf.py
-
-Output:
-    step-0/<nome_pdf>_step0_report.txt
-"""
-
-import sys
-import statistics
-from datetime import datetime
-from pathlib import Path
-
-
-def check_pdf(pdf_path: str, save: bool = True) -> None:
-    try:
-        import pdfplumber
-    except ImportError:
-        print("Errore: pdfplumber non è installato.")
-        print("       pip install pdfplumber")
-        sys.exit(1)
-
-    path = Path(pdf_path)
-    if not path.exists():
-        print(f"Errore: file non trovato — {pdf_path}")
-        sys.exit(1)
-    if path.suffix.lower() != ".pdf":
-        print(f"Errore: il file non è un PDF — {pdf_path}")
-        sys.exit(1)
-
-    lines = []  # righe del report
-    results = []  # (etichetta, stato, messaggio)
-
-    def out(text=""):
-        lines.append(text)
-        print(text)
-
-    out(f"Step 0 — Verifica idoneità PDF")
-    out(f"File:    {path.name}")
-    out(f"Data:    {datetime.now().strftime('%Y-%m-%d %H:%M')}")
-    out("=" * 50)
-
-    # ------------------------------------------------------------------ #
-    # Criterio 1 — Non protetto da password
-    # ------------------------------------------------------------------ #
-    try:
-        with pdfplumber.open(path) as pdf:
-            n_pages = len(pdf.pages)
-        results.append(("Non protetto da password", "PASS", f"{n_pages} pagine"))
-    except Exception as e:
-        msg = str(e).lower()
-        if "password" in msg or "encrypted" in msg or "decrypt" in msg:
-            results.append(("Non protetto da password", "FAIL",
-                             "Il PDF è cifrato — non può essere elaborato"))
-        else:
-            results.append(("Non protetto da password", "FAIL",
-                             f"Impossibile aprire il file: {e}"))
-        _render_results(results, out)
-        _maybe_save(lines, path, save)
-        return
-
-    # ------------------------------------------------------------------ #
-    # Lettura pagine — una sola passata
-    # ------------------------------------------------------------------ #
-    char_counts = []
-    line_lengths = []
-    all_text = ""
-    empty_pages = 0
-
-    with pdfplumber.open(path) as pdf:
-        for page in pdf.pages:
-            text = page.extract_text() or ""
-            all_text += text + "\n"
-            chars = len(text.strip())
-            char_counts.append(chars)
-            if chars == 0:
-                empty_pages += 1
-            for line in text.splitlines():
-                stripped = line.strip()
-                if stripped:
-                    line_lengths.append(len(stripped))
-
-    total_pages = len(char_counts)
-    pages_with_text = sum(1 for c in char_counts if c > 50)
-    text_coverage = pages_with_text / total_pages if total_pages > 0 else 0
-
-    # ------------------------------------------------------------------ #
-    # Criterio 2 — Testo estraibile
-    # ------------------------------------------------------------------ #
-    if text_coverage >= 0.7:
-        results.append(("Testo estraibile", "PASS",
-                         f"{pages_with_text}/{total_pages} pagine con testo ({text_coverage:.0%})"))
-    elif text_coverage >= 0.4:
-        results.append(("Testo estraibile", "WARN",
-                         f"Solo {pages_with_text}/{total_pages} pagine con testo — revisione estesa necessaria"))
-    else:
-        results.append(("Testo estraibile", "FAIL",
-                         f"Solo {pages_with_text}/{total_pages} pagine con testo — probabilmente scansionato"))
-
-    # ------------------------------------------------------------------ #
-    # Criterio 3 — Generato digitalmente (non scansionato)
-    # ------------------------------------------------------------------ #
-    pages_text_only = [c for c in char_counts if c > 0]
-    avg_chars = statistics.mean(pages_text_only) if pages_text_only else 0
-
-    if avg_chars >= 300:
-        results.append(("Generato digitalmente (non scansionato)", "PASS",
-                         f"Media {avg_chars:.0f} char/pagina"))
-    elif avg_chars >= 100:
-        results.append(("Generato digitalmente (non scansionato)", "WARN",
-                         f"Media bassa: {avg_chars:.0f} char/pagina — alcune pagine potrebbero essere immagini"))
-    else:
-        results.append(("Generato digitalmente (non scansionato)", "FAIL",
-                         f"Media molto bassa: {avg_chars:.0f} char/pagina — il PDF sembra scansionato"))
-
-    # ------------------------------------------------------------------ #
-    # Criterio 4 — Pagine vuote
-    # ------------------------------------------------------------------ #
-    if empty_pages == 0:
-        results.append(("Pagine vuote", "PASS", "Nessuna pagina vuota"))
-    elif empty_pages <= total_pages * 0.05:
-        results.append(("Pagine vuote", "WARN",
-                         f"{empty_pages} pagine vuote (≤ 5%) — probabilmente copertine o separatori"))
-    else:
-        results.append(("Pagine vuote", "WARN",
-                         f"{empty_pages} pagine vuote ({empty_pages/total_pages:.0%}) — controllare"))
-
-    # ------------------------------------------------------------------ #
-    # Criterio desiderabile — Layout a colonne singola
-    # ------------------------------------------------------------------ #
-    if line_lengths:
-        median_len = statistics.median(line_lengths)
-        short_lines = sum(1 for l in line_lengths if l < median_len * 0.4)
-        short_ratio = short_lines / len(line_lengths)
-        if short_ratio < 0.15:
-            results.append(("Layout a colonne singola (desiderabile)", "PASS",
-                             f"Righe corte: {short_ratio:.0%} — struttura lineare"))
-        elif short_ratio < 0.35:
-            results.append(("Layout a colonne singola (desiderabile)", "WARN",
-                             f"Righe corte: {short_ratio:.0%} — possibile layout a colonne parziale"))
-        else:
-            results.append(("Layout a colonne singola (desiderabile)", "WARN",
-                             f"Righe corte: {short_ratio:.0%} — probabile layout a colonne multiple"))
-    else:
-        results.append(("Layout a colonne singola (desiderabile)", "WARN",
-                         "Impossibile analizzare (nessuna riga estratta)"))
-
-    # ------------------------------------------------------------------ #
-    # Criterio desiderabile — Struttura logica (titoli)
-    # ------------------------------------------------------------------ #
-    candidate_headings = [
-        line.strip() for line in all_text.splitlines()
-        if 3 <= len(line.strip()) <= 80
-        and line.strip()[0].isupper()
-        and not line.strip().endswith(".")
-        and not line.strip().endswith(",")
-        and len(line.strip().split()) <= 10
-    ]
-    heading_density = len(candidate_headings) / total_pages if total_pages > 0 else 0
-
-    if heading_density >= 1.0:
-        results.append(("Struttura logica riconoscibile (desiderabile)", "PASS",
-                         f"~{len(candidate_headings)} possibili titoli rilevati ({heading_density:.1f}/pagina)"))
-    elif heading_density >= 0.3:
-        results.append(("Struttura logica riconoscibile (desiderabile)", "WARN",
-                         f"~{len(candidate_headings)} possibili titoli ({heading_density:.1f}/pagina) — struttura parziale"))
-    else:
-        results.append(("Struttura logica riconoscibile (desiderabile)", "WARN",
-                         "Pochi titoli rilevati — testo narrativo o struttura non standard"))
-
-    _render_results(results, out)
-    _maybe_save(lines, path, save)
-
-
-def _render_results(results: list, out) -> None:
-    icons = {"PASS": "✅", "WARN": "⚠️ ", "FAIL": "❌"}
-    out()
-    for label, status, message in results:
-        icon = icons.get(status, "  ")
-        out(f"  {icon} {label}")
-        out(f"       {message}")
-    out()
-
-    fails = [r for r in results if r[1] == "FAIL"]
-    warns = [r for r in results if r[1] == "WARN"]
-
-    if fails:
-        out("ESITO: ❌ PDF NON IDONEO")
-        out("       Criteri obbligatori non soddisfatti — scegli un PDF diverso.")
-    elif warns:
-        out("ESITO: ⚠️  PDF ACCETTABILE CON CAUTELA")
-        out("       Procedi, ma aspettati più lavoro nella revisione manuale (step 4).")
-    else:
-        out("ESITO: ✅ PDF IDONEO")
-        out("       Tutti i criteri soddisfatti — procedi con lo step 1.")
-    out()
-
-
-def _maybe_save(lines: list, pdf_path: Path, save: bool) -> None:
-    if not save:
-        return
-    script_dir = Path(__file__).parent
-    out_file = script_dir / f"{pdf_path.stem}_step0_report.txt"
-    out_file.write_text("\n".join(lines), encoding="utf-8")
-    print(f"Report salvato in: {out_file}")
-
-
-if __name__ == "__main__":
-    project_root = Path(__file__).parent.parent
-    sources_dir = project_root / "sources"
-
-    if not sources_dir.exists():
-        print(f"Errore: cartella sources/ non trovata in {project_root}")
-        sys.exit(1)
-
-    pdfs = sorted(sources_dir.glob("*.pdf"))
-    if not pdfs:
-        print(f"Errore: nessun PDF trovato in {sources_dir}")
-        sys.exit(1)
-
-    for pdf in pdfs:
-        check_pdf(str(pdf), save=True)
-        if len(pdfs) > 1:
-            print("-" * 50)
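
Review note: for reference, the "extractable text" criterion of the removed `check_pdf.py` reduces to a small threshold function. The sketch below restates it under the same thresholds found in the deleted code (a page counts as having text above 50 characters, PASS from 70% coverage, WARN from 40%). `classify_text_coverage` is a hypothetical name, not something the repository defines.

```python
def classify_text_coverage(char_counts: list[int]) -> tuple[str, float]:
    """Classify a PDF from the per-page character counts of extracted text."""
    total = len(char_counts)
    # A page "has text" only if it yields more than 50 characters.
    with_text = sum(1 for c in char_counts if c > 50)
    coverage = with_text / total if total else 0.0
    if coverage >= 0.7:
        status = "PASS"
    elif coverage >= 0.4:
        status = "WARN"
    else:
        status = "FAIL"
    return status, coverage
```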
@@ -1,199 +0,0 @@
-#!/usr/bin/env python3
-"""
-Step 1 — Ispezione automatica PDF
-
-Analizza il PDF pagina per pagina e produce un report con score (0–100)
-e lista dei problemi per pagina. Serve per capire la qualità del documento
-e mappare i problemi prima della revisione manuale (step 4).
-
-Uso:
-    python step-1/inspect.py
-
-Output:
-    step-1/<nome_pdf>_step1_report.txt
-"""
-
-import re
-import sys
-import statistics
-from collections import Counter
-from datetime import datetime
-from pathlib import Path
-
-
-# ── Penalità per il calcolo dello score ───────────────────────────────────
-SYLLABIF_PENALTY  = 0.3   # per occorrenza di sillabazione
-COLUMN_PENALTY    = 3.0   # per pagina con layout a colonne
-UNICODE_PENALTY   = 1.5   # per pagina con caratteri anomali
-EMPTY_PENALTY     = 1.0   # per pagina vuota
-HEADER_FOOTER_PEN = 5.0   # fisso se intestazioni/piè ripetitivi rilevati
-
-
-def inspect_pdf(pdf_path: str, save: bool = True) -> None:
-    try:
-        import pdfplumber
-    except ImportError:
-        print("Errore: pdfplumber non è installato.")
-        print("       pip install pdfplumber")
-        sys.exit(1)
-
-    path = Path(pdf_path)
-    if not path.exists():
-        print(f"Errore: file non trovato — {pdf_path}")
-        sys.exit(1)
-
-    lines = []
-
-    def out(text=""):
-        lines.append(text)
-        print(text)
-
-    out("Step 1 — Ispezione automatica PDF")
-    out(f"File:    {path.name}")
-    out(f"Data:    {datetime.now().strftime('%Y-%m-%d %H:%M')}")
-    out("=" * 50)
-
-    # ── Lettura pagine ─────────────────────────────────────────────────────
-    with pdfplumber.open(path) as pdf:
-        n_pages = len(pdf.pages)
-        pages_text = [page.extract_text() or "" for page in pdf.pages]
-
-    # ── Analisi per pagina ─────────────────────────────────────────────────
-    issues = []       # (page_num, descrizione)  — page_num=0 → problema globale
-    deductions = 0.0
-
-    first_lines = []  # prima riga significativa di ogni pagina (per header)
-    last_lines  = []  # ultima riga significativa di ogni pagina (per footer)
-
-    for i, text in enumerate(pages_text):
-        page_num = i + 1
-        stripped = text.strip()
-
-        # 1. Pagina vuota
-        if len(stripped) < 50:
-            issues.append((page_num, "pagina vuota"))
-            deductions += EMPTY_PENALTY
-            continue
-
-        page_lines = text.splitlines()
-        nonempty   = [l.strip() for l in page_lines if l.strip()]
-
-        # Raccogli prima/ultima riga per il controllo header/footer
-        if nonempty:
-            first_lines.append(nonempty[0])
-            last_lines.append(nonempty[-1])
-
-        # 2. Sillabazione a fine riga  (es. "estra-" + a capo)
-        syllabif = sum(
-            1 for line in page_lines
-            if re.search(r'\b\w{2,}-$', line.rstrip())
-        )
-        if syllabif:
-            label = "occorrenza" if syllabif == 1 else "occorrenze"
-            issues.append((page_num, f"sillabazione rilevata ({syllabif} {label})"))
-            deductions += syllabif * SYLLABIF_PENALTY
-
-        # 3. Layout a colonne  (righe molto corte e numerose)
-        if len(nonempty) >= 10:
-            median_len  = statistics.median(len(l) for l in nonempty)
-            short_ratio = sum(1 for l in nonempty if len(l) < median_len * 0.4) / len(nonempty)
-            if short_ratio > 0.35:
-                issues.append((page_num, f"possibile layout a colonne ({short_ratio:.0%} righe corte)"))
-                deductions += COLUMN_PENALTY
-
-        # 4. Caratteri Unicode anomali
-        #    (control chars esclusi \n \t \r, replacement char, PUA block)
-        anomalies = re.findall(
-            r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f\ufffd\ue000-\uf8ff]', text
-        )
-        if anomalies:
-            issues.append((page_num, f"caratteri Unicode anomali ({len(anomalies)} trovati)"))
-            deductions += UNICODE_PENALTY
-
-    # ── Intestazioni e piè di pagina ripetitivi ────────────────────────────
-    def _check_repetition(line_list: list, label: str) -> None:
-        nonlocal deductions
-        if not line_list:
-            return
-        threshold = max(3, len(line_list) * 0.25)
-        repeated  = [
-            (txt, cnt) for txt, cnt in Counter(line_list).items()
-            if cnt >= threshold and len(txt) > 3
-        ]
-        if repeated:
-            deductions += HEADER_FOOTER_PEN
-            for txt, cnt in repeated[:3]:
-                issues.append((0, f"{label} ripetitivo: \"{txt[:45]}\" ({cnt} volte)"))
-
-    _check_repetition(first_lines, "intestazione")
-    _check_repetition(last_lines,  "piè di pagina")
-
-    # ── Score ──────────────────────────────────────────────────────────────
-    score = max(0, round(100 - deductions))
-
-    # ── Riepilogo ──────────────────────────────────────────────────────────
-    pages_with_issues = len({p for p, _ in issues if p > 0})
-    out()
-    out(f"Score: {score}/100")
-    out(f"Pagine totali:        {n_pages}")
-    out(f"Pagine con problemi:  {pages_with_issues}")
-    out()
-
-    if issues:
-        global_issues = [(p, d) for p, d in issues if p == 0]
-        page_issues   = sorted([(p, d) for p, d in issues if p > 0])
-        for _, desc in global_issues:
-            out(f"  ⚠️  {desc}")
-        for page_num, desc in page_issues:
-            out(f"  Pagina {page_num:>4}: {desc}")
-    else:
-        out("  Nessun problema rilevato.")
-
-    out()
-
-    # ── Prossimi passi ─────────────────────────────────────────────────────
-    out("PROSSIMI PASSI:")
-    if score >= 70:
-        out("  → conversione con marker funzionerà bene")
-    elif score >= 40:
-        out("  → conversione possibile, attendi più errori nella revisione")
-    else:
-        out("  → qualità bassa — valuta una fonte PDF migliore")
-
-    attention_pages = sorted({p for p, _ in issues if p > 0})
-    if attention_pages:
-        sample = ", ".join(str(p) for p in attention_pages[:10])
-        if len(attention_pages) > 10:
-            sample += f" … e altre {len(attention_pages) - 10}"
-        out(f"  → attenzione alle pagine {sample} nella revisione manuale")
-    out()
-
-    _maybe_save(lines, path, save)
-
-
-def _maybe_save(lines: list, pdf_path: Path, save: bool) -> None:
-    if not save:
-        return
-    script_dir = Path(__file__).parent
-    out_file   = script_dir / f"{pdf_path.stem}_step1_report.txt"
-    out_file.write_text("\n".join(lines), encoding="utf-8")
-    print(f"Report salvato in: {out_file}")
-
-
-if __name__ == "__main__":
-    project_root = Path(__file__).parent.parent
-    sources_dir  = project_root / "sources"
-
-    if not sources_dir.exists():
-        print(f"Errore: cartella sources/ non trovata in {project_root}")
-        sys.exit(1)
-
-    pdfs = sorted(sources_dir.glob("*.pdf"))
-    if not pdfs:
-        print(f"Errore: nessun PDF trovato in {sources_dir}")
-        sys.exit(1)
-
-    for pdf in pdfs:
-        inspect_pdf(str(pdf), save=True)
-        if len(pdfs) > 1:
-            print("-" * 50)
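
Review note: the header/footer detection in the removed `inspect.py` is a frequency check over the first (or last) non-empty line of each page. A minimal standalone sketch of that rule, assuming the same defaults as the deleted `_check_repetition` (at least 3 occurrences or 25% of the pages, whichever is larger, and text longer than 3 characters). `repeated_lines` and its parameter names are hypothetical.

```python
from collections import Counter


def repeated_lines(lines: list[str], min_count: int = 3, ratio: float = 0.25) -> list[tuple[str, int]]:
    """Return (text, count) pairs for lines that recur often enough to be
    considered a running header or footer."""
    if not lines:
        return []
    threshold = max(min_count, len(lines) * ratio)
    return [
        (text, count)
        for text, count in Counter(lines).items()
        if count >= threshold and len(text) > 3
    ]
```

The length filter keeps bare page numbers such as `12` from being flagged, which is presumably why the original code required more than 3 characters.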
@@ -1,80 +0,0 @@
-#!/usr/bin/env python3
-"""
-Step 2 — Conversione PDF → Markdown grezzo
-
-Usa pymupdf4llm (PyMuPDF puro C, zero modelli ML, ~30-50 MB RAM)
-per convertire ogni PDF in sources/ e organizza l'output in:
-  step-2/<stem>/raw.md    — MD grezzo, non modificare mai
-  step-2/<stem>/clean.md  — copia di lavoro per lo step 4
-
-Uso:
-    python step-2/convert_pdf.py                        # tutti i PDF in sources/
-    python step-2/convert_pdf.py --pdf sources/doc.pdf  # un solo PDF
-"""
-
-import argparse
-import shutil
-import sys
-from pathlib import Path
-
-import pymupdf4llm
-
-
-def convert_pdf(pdf_path: Path, project_root: Path) -> bool:
-    stem = pdf_path.stem
-    out_dir = project_root / "step-2" / stem
-    raw_md = out_dir / "raw.md"
-    clean_md = out_dir / "clean.md"
-
-    print(f"\nConversione: {pdf_path.name}")
-    print(f"  Output:    step-2/{stem}/")
-
-    if raw_md.exists():
-        print(f"  ⚠️  raw.md già presente — skip")
-        print(f"       (elimina {raw_md} per riconvertire)")
-        return True
-
-    out_dir.mkdir(parents=True, exist_ok=True)
-
-    print(f"  Conversione in corso...")
-    md_text = pymupdf4llm.to_markdown(str(pdf_path))
-
-    raw_md.write_text(md_text, encoding="utf-8")
-    shutil.copy2(raw_md, clean_md)
-
-    size_kb = raw_md.stat().st_size // 1024
-    print(f"  ✅ raw.md salvato ({size_kb} KB)")
-    print(f"  ✅ clean.md creato (copia di lavoro per step 4)")
-    return True
-
-
-if __name__ == "__main__":
-    project_root = Path(__file__).parent.parent
-
-    parser = argparse.ArgumentParser(description="Step 2 — Conversione PDF → Markdown")
-    parser.add_argument("--pdf", help="Percorso di un singolo PDF da convertire")
-    args = parser.parse_args()
-
-    if args.pdf:
-        pdf_path = Path(args.pdf)
-        if not pdf_path.exists():
-            print(f"Errore: file non trovato — {args.pdf}")
-            sys.exit(1)
-        pdfs = [pdf_path]
-    else:
-        sources_dir = project_root / "sources"
-        if not sources_dir.exists():
-            print(f"Errore: cartella sources/ non trovata in {project_root}")
-            sys.exit(1)
-        pdfs = sorted(sources_dir.glob("*.pdf"))
-        if not pdfs:
-            print(f"Errore: nessun PDF trovato in {sources_dir}")
-            sys.exit(1)
-
-    results = [convert_pdf(p, project_root) for p in pdfs]
-
-    ok_count = sum(results)
-    total = len(results)
-    print(f"\n{'✅' if all(results) else '⚠️ '} {ok_count}/{total} PDF convertiti")
-
-    sys.exit(0 if all(results) else 1)
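
Review note: the removed `convert_pdf.py` relied on a "write once, then copy" convention, where `raw.md` is never overwritten and `clean.md` starts as its working copy. The sketch below shows only that convention, with the PDF conversion itself left out. `write_raw_and_clean` is a hypothetical helper, and unlike the original (which returned `True` on skip) it returns `False` when nothing is written, so that the two cases can be told apart.

```python
import shutil
import tempfile
from pathlib import Path


def write_raw_and_clean(md_text: str, out_dir: Path) -> bool:
    """Write raw.md once and seed clean.md from it.

    Returns False without touching anything when raw.md already exists.
    """
    raw_md = out_dir / "raw.md"
    clean_md = out_dir / "clean.md"
    if raw_md.exists():
        return False
    out_dir.mkdir(parents=True, exist_ok=True)
    raw_md.write_text(md_text, encoding="utf-8")
    shutil.copy2(raw_md, clean_md)  # working copy for manual revision
    return True


if __name__ == "__main__":
    # Usage demo in a throwaway directory.
    with tempfile.TemporaryDirectory() as tmp:
        print(write_raw_and_clean("# Title\n", Path(tmp) / "doc"))
```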
@@ -1,223 +0,0 @@
-#!/usr/bin/env python3
-"""
-Step 3 — Rilevamento struttura Markdown
-
-Analizza il Markdown grezzo prodotto dallo step 2 senza modificarlo.
-Copia i file da step-2/<stem>/ e produce structure_profile.json che
-guida la revisione manuale (step 4) e il chunker adattivo (step 5).
-
-Output in step-3/<stem>/:
-  raw.md                  — copia da step-2 (non modificare mai)
-  clean.md                — copia da step-2 (da revisionare nello step 4)
-  structure_profile.json  — profilo strutturale
-
-Uso:
-    python step-3/detect_structure.py                    # tutti i documenti in step-2/
-    python step-3/detect_structure.py --stem nietzsche   # un solo documento
-    python step-3/detect_structure.py --force            # riesegui anche se già presente
-"""
-
-import argparse
-import json
-import re
-import shutil
-import sys
-from pathlib import Path
-
-
-# ─── Language detection ───────────────────────────────────────────────────────
-
-_IT_WORDS = frozenset([
-    "il", "la", "di", "e", "che", "non", "per", "un", "una", "si",
-    "con", "da", "del", "della", "dei", "in", "ma", "se", "lo", "le",
-    "gli", "al", "alla", "ai", "alle", "sono", "ha", "hanno", "era",
-    "erano", "nel", "nella", "nei", "nelle", "questo", "questa", "così",
-])
-
-_EN_WORDS = frozenset([
-    "the", "of", "and", "to", "in", "is", "that", "it", "was", "for",
-    "on", "are", "as", "with", "his", "they", "at", "be", "this", "have",
-    "from", "or", "an", "but", "not", "by", "he", "she", "we", "you",
-    "which", "their", "been", "has", "would", "there", "when", "will",
-])
-
-
-def detect_language(text: str) -> str:
-    words = re.findall(r'\b[a-zA-Z]{2,}\b', text.lower())
-    sample = words[:2000]
-    it = sum(1 for w in sample if w in _IT_WORDS)
-    en = sum(1 for w in sample if w in _EN_WORDS)
-    if it == 0 and en == 0:
-        return "unknown"
-    return "it" if it >= en else "en"
-
-
-# ─── Markdown parsing ─────────────────────────────────────────────────────────
-
-def split_sections(text: str, header_level: int) -> list[str]:
-    """
-    Split text on headers of the given level (1=h1, 2=h2, 3=h3).
-    Returns list of body texts for each matching section.
-    """
-    prefix = "#" * header_level + " "
-    parts = re.split(rf'(?m)^{re.escape(prefix)}.+', text)
-    # parts[0] is preamble, rest are section bodies
-    return [p for p in parts[1:] if p.strip()]
-
-
-def count_headers(text: str, level: int) -> int:
-    prefix = "#" * level + " "
-    return len(re.findall(rf'(?m)^{re.escape(prefix)}', text))
-
-
-def count_paragraphs(text: str) -> int:
-    """Count non-empty, non-header paragraph blocks."""
-    blocks = re.split(r'\n{2,}', text)
-    return sum(1 for b in blocks if b.strip() and not re.match(r'^#+\s', b.strip()))
-
-
-# ─── Core analysis ────────────────────────────────────────────────────────────
-
-def analyze(raw_md_path: Path) -> dict:
-    text = raw_md_path.read_text(encoding="utf-8")
-
-    n_h1 = count_headers(text, 1)
-    n_h2 = count_headers(text, 2)
-    n_h3 = count_headers(text, 3)
-    n_paragrafi = count_paragraphs(text)
-
-    # Determine structural level and primary boundary
-    if n_h3 >= 5:
-        livello = 3
-        boundary = "h3"
-        strategia = "h3_aware"
-        section_bodies = split_sections(text, 3)
-    elif n_h2 >= 3:
-        livello = 2
-        boundary = "h2"
-        strategia = "h2_paragraph_split"
-        section_bodies = split_sections(text, 2)
-    elif n_h1 + n_h2 + n_h3 >= 1:
-        livello = 1
-        boundary = "paragrafo"
-        strategia = "paragraph"
-        section_bodies = [b for b in re.split(r'\n{2,}', text) if b.strip()]
-    else:
-        if n_paragrafi >= 3:
-            livello = 1
-            boundary = "paragrafo"
-            strategia = "paragraph"
-            section_bodies = [b for b in re.split(r'\n{2,}', text) if b.strip()]
-        else:
-            livello = 0
-            boundary = "nessuno"
-            strategia = "sliding_window"
-            section_bodies = [text] if text.strip() else []
-
-    lengths = [len(b) for b in section_bodies if b.strip()]
-    lunghezza_media = int(sum(lengths) / len(lengths)) if lengths else 0
-
-    lingua = detect_language(text)
-
-    avvertenze = []
-    short = sum(1 for l in lengths if l < 200)
-    long_ = sum(1 for l in lengths if l > 800)
-    if short:
-        avvertenze.append(f"{short} sezioni sotto i 200 caratteri — verranno accorpate")
-    if long_:
-        avvertenze.append(f"{long_} sezioni sopra gli 800 caratteri — verranno divise")
-
-    return {
-        "livello_struttura": livello,
-        "n_h1": n_h1,
-        "n_h2": n_h2,
-        "n_h3": n_h3,
-        "n_paragrafi": n_paragrafi,
-        "boundary_primario": boundary,
-        "lingua_rilevata": lingua,
-        "lunghezza_media_sezione": lunghezza_media,
-        "strategia_chunking": strategia,
-        "avvertenze": avvertenze,
-    }
-
-
-# ─── Per-document processing ─────────────────────────────────────────────────
-
-def process_stem(stem: str, project_root: Path, force: bool) -> bool:
-    src_dir = project_root / "step-2" / stem
-    out_dir = project_root / "step-3" / stem
-    raw_src = src_dir / "raw.md"
-    clean_src = src_dir / "clean.md"
-    profile_out = out_dir / "structure_profile.json"
-
-    print(f"\nDocumento: {stem}")
-
-    if not raw_src.exists():
-        print(f"  ✗ raw.md non trovato in step-2/{stem}/ — skip")
-        return False
-
-    if profile_out.exists() and not force:
-        print(f"  ⚠️  structure_profile.json già presente — skip")
-        print(f"       (usa --force per rieseguire)")
-        return True
-
-    out_dir.mkdir(parents=True, exist_ok=True)
-
-    # Copy files from step-2 (clean.md is optional, so report only what was copied)
-    shutil.copy2(raw_src, out_dir / "raw.md")
-    if clean_src.exists():
-        shutil.copy2(clean_src, out_dir / "clean.md")
-    print(f"  File copiati da step-2/{stem}/: raw.md{', clean.md' if clean_src.exists() else ''}")
-
-    # Analyze
-    print(f"  Analisi struttura in corso...")
-    profile = analyze(out_dir / "raw.md")
-
-    profile_out.write_text(json.dumps(profile, ensure_ascii=False, indent=2), encoding="utf-8")
-
-    # Report
-    _LIVELLO_DESC = {
-        3: "struttura ricca (###)",
-        2: "struttura parziale (##)",
-        1: "solo paragrafi",
-        0: "testo piatto",
-    }
-    print(f"  ✅ Livello {profile['livello_struttura']} — {_LIVELLO_DESC[profile['livello_struttura']]}")
-    print(f"     h1={profile['n_h1']}  h2={profile['n_h2']}  h3={profile['n_h3']}  paragrafi={profile['n_paragrafi']}")
-    print(f"     Boundary: {profile['boundary_primario']}  |  Strategia: {profile['strategia_chunking']}")
-    print(f"     Lingua: {profile['lingua_rilevata']}  |  Lunghezza media sezione: {profile['lunghezza_media_sezione']} char")
-    for w in profile["avvertenze"]:
-        print(f"     ⚠️  {w}")
-    print(f"  ✅ structure_profile.json salvato")
-    return True
-
-
-# ─── Entry point ─────────────────────────────────────────────────────────────
-
-if __name__ == "__main__":
-    project_root = Path(__file__).parent.parent
-
-    parser = argparse.ArgumentParser(description="Step 3 — Rilevamento struttura Markdown")
-    parser.add_argument("--stem", help="Nome del documento (sottocartella di step-2/)")
-    parser.add_argument("--force", action="store_true", help="Riesegui anche se già presente")
-    args = parser.parse_args()
-
-    if args.stem:
-        stems = [args.stem]
-    else:
-        step2_dir = project_root / "step-2"
-        if not step2_dir.exists():
-            print(f"Errore: cartella step-2/ non trovata in {project_root}")
-            sys.exit(1)
-        stems = sorted(p.name for p in step2_dir.iterdir() if p.is_dir())
-        if not stems:
-            print(f"Errore: nessun documento trovato in step-2/")
-            sys.exit(1)
-
-    results = [process_stem(s, project_root, args.force) for s in stems]
-
-    ok = sum(results)
-    total = len(results)
-    print(f"\n{'✅' if all(results) else '⚠️ '} {ok}/{total} documenti analizzati")
-
-    sys.exit(0 if all(results) else 1)
@@ -1,433 +0,0 @@
-#!/usr/bin/env python3
-"""
-Step 4: automatic Markdown revision
-
-Transforms clean.md from step-3 to expose the document's latent structure.
-The transformations are general-purpose heuristics meant to work on any PDF:
-
-  - Normalize repeated whitespace (PDF artifact)
-  - Collapse multiple blank lines
-  - Strip **bold** markers from existing headings
-  - Convert standalone ALL-CAPS lines → ## header (heuristic, any language)
-  - Convert numbered sections "N.  text" → ### N. (any numbering)
-  - Remove TOC blocks (lines starting with index keywords)
-
-The structural profile is recomputed for each document: the level can
-rise (e.g. level 1 → 3) when latent structures are detected.
-
-Output in step-4/<stem>/:
-  raw.md                  copy from step-3 (never modify)
-  clean.md                revised Markdown
-  structure_profile.json  profile updated after the revision
-
-Usage:
-    python step-4/revise.py                    # all documents in step-3/
-    python step-4/revise.py --stem nietzsche   # a single document
-    python step-4/revise.py --force            # rerun even if already present
-"""
-
-import argparse
-import json
-import re
-import shutil
-import sys
-from datetime import date
-from pathlib import Path
-
-# Riusa la funzione analyze() già scritta nello step 3
-sys.path.insert(0, str(Path(__file__).parent.parent / "step-3"))
-from detect_structure import analyze  # noqa: E402
-
-
-# ─── Costanti ─────────────────────────────────────────────────────────────────
-
-# Parole-chiave che identificano blocchi TOC (da rimuovere)
-_TOC_KEYWORDS = frozenset([
-    "indice", "index", "contents", "table of contents",
-    "sommario", "inhaltsverzeichnis", "inhalt",
-])
-
-# Prepositions/articles not to capitalize in title case (unused in this file: headers go through _sentence_case)
-_STOP_IT_EN = frozenset([
-    # italiano
-    "di", "del", "della", "dei", "delle", "da", "in", "e", "il", "la",
-    "lo", "le", "gli", "un", "una", "per", "a", "al", "alla", "ai",
-    "alle", "con", "su", "sul", "sulla", "che", "o",
-    # inglese
-    "of", "the", "a", "an", "and", "or", "but", "in", "on", "at",
-    "to", "for", "with", "by", "from", "as",
-])
-
-# Ordinali italiani → romani (per titoli come "CAPITOLO PRIMO")
-_ORDINALS_IT = {
-    "PRIMO": "I", "SECONDO": "II", "TERZO": "III", "QUARTO": "IV",
-    "QUINTO": "V", "SESTO": "VI", "SETTIMO": "VII", "OTTAVO": "VIII",
-    "NONO": "IX", "DECIMO": "X",
-}
-
-# Ordinali inglesi → arabici (per "CHAPTER ONE")
-_ORDINALS_EN = {
-    "ONE": "1", "TWO": "2", "THREE": "3", "FOUR": "4", "FIVE": "5",
-    "SIX": "6", "SEVEN": "7", "EIGHT": "8", "NINE": "9", "TEN": "10",
-}
-
-
-# ─── Utilità ──────────────────────────────────────────────────────────────────
-
-def _sentence_case(s: str) -> str:
-    """
-    Sentence-case: prima lettera maiuscola, resto minuscolo.
-    Corretto per l'italiano e accettabile per l'inglese accademico.
-    """
-    if not s:
-        return s
-    lower = s.lower()
-    return lower[0].upper() + lower[1:]
-
-
-def _is_allcaps_line(line: str) -> bool:
-    """
-    True se la riga è una candidata per conversione a ## header.
-    Criterio: tutti i caratteri alfabetici sono maiuscoli, lunghezza >= 3.
-    """
-    stripped = line.strip()
-    letters = [c for c in stripped if c.isalpha()]
-    return (
-        len(letters) >= 3
-        and all(c.isupper() for c in letters)
-        and not stripped.startswith("#")
-    )
-
-
-def _allcaps_to_header(raw_line: str) -> str:
-    """
-    Converts an ALL-CAPS line into a sentence-case ## header.
-    Recognizes specific patterns (CAPITOLO <ordinal>, CHAPTER N) as a bonus,
-    but falls back to a generic conversion for any other text.
-    """
-    text = raw_line.strip().rstrip('.').rstrip('?').strip()
-
-    # ── Pattern italiano: "CAPITOLO PRIMO. TITOLO DEL CAPITOLO"
-    _ORD_IT_PAT = "|".join(_ORDINALS_IT.keys())
-    m = re.match(rf'^CAPITOLO ({_ORD_IT_PAT})\. (.+)', text)
-    if m:
-        roman = _ORDINALS_IT[m.group(1)]
-        titolo = m.group(2).rstrip('.').rstrip('?').strip()
-        return f"## Capitolo {roman} — {_sentence_case(titolo)}"
-
-    # ── Pattern inglese: "CHAPTER ONE. TITLE" o "CHAPTER 1. TITLE"
-    _ORD_EN_PAT = "|".join(_ORDINALS_EN.keys())
-    m = re.match(rf'^CHAPTER ({_ORD_EN_PAT}|\d+)\.? (.+)', text)
-    if m:
-        n = _ORDINALS_EN.get(m.group(1), m.group(1))
-        titolo = m.group(2).rstrip('.').rstrip('?').strip()
-        return f"## Chapter {n} — {_sentence_case(titolo)}"
-
-    # ── Pattern generico con numerazione romana o arabica nel prefisso
-    m = re.match(r'^([IVXLCDM]+|[0-9]+)\. (.+)', text)
-    if m:
-        n = m.group(1)
-        titolo = m.group(2).rstrip('.').strip()
-        return f"## {n}. {_sentence_case(titolo)}"
-
-    # ── Caso generico: tutto maiuscolo senza pattern riconoscibile
-    return f"## {_sentence_case(text)}"
-
-
-def _is_toc_line(line: str) -> bool:
-    """True se la riga è l'intestazione di un blocco indice/TOC."""
-    first_word = line.strip().split('.')[0].strip().lower()
-    return first_word in _TOC_KEYWORDS
-
-
-# ─── Trasformazioni ────────────────────────────────────────────────────────────
-
-def apply_transforms(text: str) -> tuple[str, dict]:
-    """
-    Applica tutte le trasformazioni strutturali al testo MD.
-    Restituisce (testo_modificato, statistiche).
-    """
-    stats = {
-        "toc_rimosso": False,
-        "n_header_allcaps": 0,
-        "n_sezioni_numerate": 0,
-        "n_paragrafi_uniti": 0,
-    }
-
-    # ── 1. Rimuovi marcatori **bold** nelle intestazioni esistenti
-    #       ## **Titolo** → ## Titolo
-    text = re.sub(
-        r'^(#{1,6})\s+\*\*(.+?)\*\*\s*$',
-        r'\1 \2',
-        text, flags=re.MULTILINE,
-    )
-
-    # ── 1b. Normalizza header esistenti con contenuto ALL-CAPS → sentence-case
-    #        ## AL DI LA' DEL BENE E DEL MALE → ## Al di la' del bene e del male
-    def _norm_allcaps_header(m: re.Match) -> str:
-        hashes = m.group(1)
-        content = m.group(2).strip()
-        letters = [c for c in content if c.isalpha()]
-        if letters and all(c.isupper() for c in letters):
-            return f"{hashes} {_sentence_case(content)}"
-        return m.group(0)
-
-    text = re.sub(
-        r'^(#{1,6}) (.+)$',
-        _norm_allcaps_header,
-        text, flags=re.MULTILINE,
-    )
-
-    # ── 2. Rimuovi blocco TOC (riga indice + contenuto inline sulla stessa riga)
-    #       "INDICE. Capitolo 1 Capitolo 2 ..."  → rimossa
-    lines = text.split('\n')
-    new_lines = []
-    for line in lines:
-        if _is_toc_line(line):
-            stats["toc_rimosso"] = True
-        else:
-            new_lines.append(line)
-    text = '\n'.join(new_lines)
-
-    # ── 3. Converti righe ALL-CAPS standalone → ## header
-    #       Una riga è "standalone" se è preceduta/seguita da riga vuota
-    #       oppure si trova all'inizio/fine del documento.
-    blocks = text.split('\n\n')
-    new_blocks = []
-    for block in blocks:
-        stripped = block.strip()
-        # Blocco standalone = un'unica riga (nessun \n interno rilevante)
-        if '\n' not in stripped and _is_allcaps_line(stripped):
-            new_blocks.append(_allcaps_to_header(stripped))
-            stats["n_header_allcaps"] += 1
-        else:
-            # Controlla riga per riga per righe ALL-CAPS seguite da altri contenuti
-            sub_lines = block.split('\n')
-            converted = []
-            for ln in sub_lines:
-                if _is_allcaps_line(ln) and len(ln.strip()) > 3:
-                    converted.append(_allcaps_to_header(ln))
-                    stats["n_header_allcaps"] += 1
-                else:
-                    converted.append(ln)
-            new_blocks.append('\n'.join(converted))
-    text = '\n\n'.join(new_blocks)
-
-    # ── 4. Convert numbered sections "N. text" → "### N.\n\ntext"
-    #       Matches "1. Text", "42.  Text" (one or more whitespace characters after the period)
-    def _num_repl(m: re.Match) -> str:
-        num = m.group(1)
-        testo = m.group(2).strip()
-        stats["n_sezioni_numerate"] += 1
-        return f"### {num}.\n\n{testo}"
-
-    # Standard pattern: "1.  text" or "1. text"
-    text = re.sub(
-        r'^(\d+)\.\s+(.+)$',
-        _num_repl,
-        text, flags=re.MULTILINE,
-    )
-
-    # Pattern con lettera-suffisso: "65 a. testo" o "65a. testo"
-    def _num_letter_repl(m: re.Match) -> str:
-        num = m.group(1) + m.group(2)
-        testo = m.group(3).strip()
-        stats["n_sezioni_numerate"] += 1
-        return f"### {num}.\n\n{testo}"
-
-    text = re.sub(
-        r'^(\d+)\s*([a-z])\.\s+(.+)$',
-        _num_letter_repl,
-        text, flags=re.MULTILINE,
-    )
-
-    # ── 5. Unisci paragrafi spezzati da salti pagina PDF
-    #       Criterio: blocco A non finisce con punteggiatura di fine frase,
-    #       blocco B non inizia con maiuscola "di sezione" né è un header.
-    #       Unione sicura: mai attraverso confini ###/##.
-    _SENTENCE_END = set('.?!»)\'"')
-    blocks = text.split('\n\n')
-    merged = []
-    i = 0
-    while i < len(blocks):
-        b = blocks[i]
-        stripped = b.strip()
-        # Prova a unire con il successivo se la frase è spezzata
-        while (
-            i + 1 < len(blocks)
-            and stripped
-            and not stripped.startswith('#')
-            and stripped[-1] not in _SENTENCE_END
-        ):
-            nxt = blocks[i + 1].strip()
-            # Non unire se il successivo è un header o è vuoto
-            if not nxt or nxt.startswith('#'):
-                break
-            # Non unire se il successivo inizia con una cifra seguita da punto
-            # (sarebbe l'inizio di un nuovo aforisma non ancora convertito)
-            if re.match(r'^\d+\.', nxt):
-                break
-            b = stripped + ' ' + nxt
-            stripped = b.strip()
-            stats["n_paragrafi_uniti"] += 1
-            i += 1
-        merged.append(b)
-        i += 1
-    text = '\n\n'.join(merged)
-
-    # ── 6. Normalizza whitespace multiplo interno alle righe
-    #       "parola  parola" → "parola parola"  (inclusi gli header)
-    lines = text.split('\n')
-    normalized = []
-    for line in lines:
-        if not line.strip():
-            normalized.append(line)
-        else:
-            normalized.append(re.sub(r'  +', ' ', line))
-    text = '\n'.join(normalized)
-
-    # ── 7. Riduci righe vuote multiple a doppie
-    text = re.sub(r'\n{3,}', '\n\n', text)
-
-    return text, stats
-
-
-# ─── Aggiornamento revision log ────────────────────────────────────────────────
-
-def update_revision_log(
-    log_path: Path,
-    stem: str,
-    profile_before: dict,
-    profile_after: dict,
-    t_stats: dict,
-) -> None:
-    header_exists = log_path.exists() and log_path.stat().st_size > 0
-
-    avv = profile_after.get("avvertenze", [])
-    avv_str = "; ".join(avv) if avv else "nessuna"
-
-    entry = f"""
-## {stem} — {date.today().isoformat()}
-
-**Trasformazioni automatiche:**
- Normalizzazione whitespace multiplo e righe vuote
- Blocco TOC rimosso: {'sì' if t_stats['toc_rimosso'] else 'no'}
- Righe ALL-CAPS → ## header: {t_stats['n_header_allcaps']}
- Sezioni numerate → ### header: {t_stats['n_sezioni_numerate']}
- Paragrafi uniti (salti pagina PDF): {t_stats['n_paragrafi_uniti']}
- Livello struttura: {profile_before.get('livello_struttura', '?')} → {profile_after.get('livello_struttura', '?')}
-
-**Avvertenze residue:** {avv_str}
-
-**Revisioni manuali pendenti:**
- [ ] Verificare conversioni ALL-CAPS errate
- [ ] Controllare sezioni troppo corte o troppo lunghe
-"""
-
-    if not header_exists:
-        log_path.write_text("# Revision log\n" + entry, encoding="utf-8")
-    else:
-        existing = log_path.read_text(encoding="utf-8")
-        log_path.write_text(existing + entry, encoding="utf-8")
-
-
-# ─── Per-document processing ─────────────────────────────────────────────────
-
-def process_stem(stem: str, project_root: Path, force: bool) -> bool:
-    src_dir = project_root / "step-3" / stem
-    out_dir = project_root / "step-4" / stem
-    raw_src = src_dir / "raw.md"
-    clean_src = src_dir / "clean.md"
-    profile_src = src_dir / "structure_profile.json"
-    clean_out = out_dir / "clean.md"
-    profile_out = out_dir / "structure_profile.json"
-
-    print(f"\nDocumento: {stem}")
-
-    if not clean_src.exists():
-        print(f"  ✗ clean.md non trovato in step-3/{stem}/ — skip")
-        return False
-
-    if clean_out.exists() and not force:
-        print(f"  ⚠️  clean.md già presente — skip")
-        print(f"       (usa --force per rieseguire)")
-        return True
-
-    out_dir.mkdir(parents=True, exist_ok=True)
-
-    # Copia raw.md immutabile (riferimento)
-    if raw_src.exists():
-        shutil.copy2(raw_src, out_dir / "raw.md")
-        print(f"  Copiato raw.md da step-3/{stem}/")
-
-    # Leggi profilo step-3 (per confronto nel report)
-    profile_before: dict = {}
-    if profile_src.exists():
-        profile_before = json.loads(profile_src.read_text(encoding="utf-8"))
-
-    # Applica trasformazioni
-    print(f"  Applicazione trasformazioni strutturali...")
-    text = clean_src.read_text(encoding="utf-8")
-    text_revised, t_stats = apply_transforms(text)
-
-    # Salva clean.md revisionato
-    clean_out.write_text(text_revised, encoding="utf-8")
-
-    # Ricalcola profilo sul nuovo clean.md
-    profile_after = analyze(clean_out)
-    profile_out.write_text(
-        json.dumps(profile_after, ensure_ascii=False, indent=2),
-        encoding="utf-8",
-    )
-
-    # Report
-    lv_b = profile_before.get("livello_struttura", "?")
-    lv_a = profile_after["livello_struttura"]
-    _STRAT = {3: "h3_aware", 2: "h2_paragraph_split", 1: "paragraph", 0: "sliding_window"}
-    print(f"  ✅ Livello struttura: {lv_b} → {lv_a}  ({_STRAT.get(lv_a, '?')})")
-    print(f"     h2: {profile_before.get('n_h2','?')} → {profile_after['n_h2']}")
-    print(f"     h3: {profile_before.get('n_h3','?')} → {profile_after['n_h3']}")
-    print(f"     TOC rimosso: {'sì' if t_stats['toc_rimosso'] else 'no'}")
-    print(f"     Righe ALL-CAPS → ##: {t_stats['n_header_allcaps']}")
-    print(f"     Sezioni numerate → ###: {t_stats['n_sezioni_numerate']}")
-    print(f"     Paragrafi uniti (salti pagina): {t_stats['n_paragrafi_uniti']}")
-    for w in profile_after["avvertenze"]:
-        print(f"     ⚠️  {w}")
-
-    # Aggiorna revision log (direttamente in step-4/, non in sottocartella)
-    log_path = project_root / "step-4" / "revision_log.md"
-    update_revision_log(log_path, stem, profile_before, profile_after, t_stats)
-    print(f"  ✅ step-4/revision_log.md aggiornato")
-    print(f"  ✅ structure_profile.json salvato")
-    return True
-
-
-# ─── Entry point ─────────────────────────────────────────────────────────────
-
-if __name__ == "__main__":
-    project_root = Path(__file__).parent.parent
-
-    parser = argparse.ArgumentParser(description="Step 4 — Revisione automatica Markdown")
-    parser.add_argument("--stem", help="Nome del documento (sottocartella di step-3/)")
-    parser.add_argument("--force", action="store_true", help="Riesegui anche se già presente")
-    args = parser.parse_args()
-
-    if args.stem:
-        stems = [args.stem]
-    else:
-        step3_dir = project_root / "step-3"
-        if not step3_dir.exists():
-            print(f"Errore: cartella step-3/ non trovata in {project_root}")
-            sys.exit(1)
-        stems = sorted(p.name for p in step3_dir.iterdir() if p.is_dir())
-        if not stems:
-            print(f"Errore: nessun documento trovato in step-3/")
-            sys.exit(1)
-
-    results = [process_stem(s, project_root, args.force) for s in stems]
-
-    ok = sum(results)
-    total = len(results)
-    print(f"\n{'✅' if all(results) else '⚠️ '} {ok}/{total} documenti revisionati")
-
-    sys.exit(0 if all(results) else 1)
@@ -1,320 +0,0 @@
-#!/usr/bin/env python3
-"""
-Step 6: chunk fixes
-
-Applies fixes directly to step-6/<stem>/chunks.json based on the report.json
-produced by verify_chunks.py. Does not touch clean.md or step-5.
-
-Fixes applied:
-  empty      → removes the chunk
-  incomplete → merges into the next chunk (the sentence continues there)
-  no_prefix  → adds the [sezione > titolo] prefix if missing
-  too_short  → merges into the next chunk (no same-section check is performed)
-  too_long   → splits at the last paragraph/sentence boundary within MAX_CHARS
-
-Input:  step-6/<stem>/chunks.json  +  step-6/<stem>/report.json
-Output: step-6/<stem>/chunks.json  (overwritten)
-
-Usage:
-    python step-6/fix_chunks.py --stem nietzsche
-    python step-6/fix_chunks.py --stem nietzsche --dry-run
-"""
-
-import argparse
-import json
-import re
-import sys
-from pathlib import Path
-
-MAX_CHARS = 800
-PUNCT_END = re.compile(r"[.!?»)\]'\u2019\"\u201c\u201d\u2018\u2014\u2013-]$")
-
-
-# ─── Helpers ──────────────────────────────────────────────────────────────────
-
-def _prefix(chunk: dict) -> str:
-    """Costruisce il prefisso [sezione > titolo] o [sezione]."""
-    sezione = chunk.get("sezione", "")
-    titolo  = chunk.get("titolo", "")
-    if titolo:
-        return f"[{sezione} > {titolo}]"
-    return f"[{sezione}]"
-
-
-def _strip_prefix(text: str) -> str:
-    """Rimuove il prefisso [...] iniziale dal testo, se presente."""
-    text = text.lstrip()
-    if text.startswith("["):
-        end = text.find("]")
-        if end != -1:
-            return text[end + 1:].lstrip("\n")
-    return text
-
-
-def _rebuild_text(chunk: dict, body: str) -> str:
-    """Ricompone il testo completo: prefisso + corpo."""
-    return f"{_prefix(chunk)}\n{body}"
-
-
-def _split_at_boundary(text: str, max_chars: int) -> list[str]:
-    """
-    Splits 'text' at the last boundary within max_chars (double newline, else
-    sentence end). With no such boundary it cuts at the first space after
-    max_chars, so a segment can exceed the limit. Returns non-empty segments.
-    """
-    if len(text) <= max_chars:
-        return [text]
-
-    parts = []
-    remaining = text
-
-    while len(remaining) > max_chars:
-        # Cerca l'ultimo \n\n entro max_chars
-        candidate = remaining[:max_chars]
-        split_pos = candidate.rfind("\n\n")
-
-        if split_pos == -1:
-            # Cerca l'ultimo . ? ! entro max_chars
-            m = None
-            for m in re.finditer(r"[.!?»]\s+", candidate):
-                pass
-            split_pos = m.end() if m else None
-
-        if split_pos is None or split_pos == 0:
-            # Nessun punto naturale: taglia sul primo spazio dopo max_chars
-            sp = remaining.find(" ", max_chars)
-            split_pos = sp if sp != -1 else len(remaining)
-
-        parts.append(remaining[:split_pos].rstrip())
-        remaining = remaining[split_pos:].lstrip()
-
-    if remaining:
-        parts.append(remaining)
-
-    return [p for p in parts if p.strip()]
-
-
-# ─── Operazioni sui chunk ─────────────────────────────────────────────────────
-
-def fix_empty(chunks: list[dict], empty_ids: set[str]) -> tuple[list[dict], int]:
-    """Rimuove i chunk vuoti."""
-    before = len(chunks)
-    chunks = [c for c in chunks if c["chunk_id"] not in empty_ids]
-    return chunks, before - len(chunks)
-
-
-def fix_no_prefix(chunks: list[dict], no_prefix_ids: set[str]) -> tuple[list[dict], int]:
-    """Aggiunge il prefisso mancante ai chunk che ne sono privi."""
-    count = 0
-    for c in chunks:
-        if c["chunk_id"] in no_prefix_ids:
-            body = _strip_prefix(c["text"])
-            c["text"] = _rebuild_text(c, body)
-            c["n_chars"] = len(c["text"])
-            count += 1
-    return chunks, count
-
-
-def fix_incomplete_and_short(chunks: list[dict],
-                              problem_ids: set[str]) -> tuple[list[dict], int]:
-    """
-    Merges each problem chunk (incomplete or too short) into the chunk that
-    immediately follows it; the merged text takes the next chunk's prefix.
-    No same-section check is performed. Used for 'incomplete' and 'too_short'.
-    """
-    merged = 0
-    i = 0
-    result: list[dict] = []
-
-    while i < len(chunks):
-        c = chunks[i]
-        if c["chunk_id"] in problem_ids and i + 1 < len(chunks):
-            nxt = chunks[i + 1]
-            # Merges unconditionally: the section of nxt is not compared with that of c
-            body_c   = _strip_prefix(c["text"])
-            body_nxt = _strip_prefix(nxt["text"])
-            merged_body = body_c.rstrip() + "\n" + body_nxt.lstrip()
-            nxt["text"]   = _rebuild_text(nxt, merged_body)
-            nxt["n_chars"] = len(nxt["text"])
-            # Salta c (è stato assorbito in nxt)
-            merged += 1
-            i += 1
-            continue
-        result.append(c)
-        i += 1
-
-    return result, merged
-
-
-def fix_too_long(chunks: list[dict],
-                 too_long_ids: set[str],
-                 max_chars: int) -> tuple[list[dict], int]:
-    """
-    Spezza i chunk troppo lunghi al confine naturale più vicino a max_chars.
-    Ogni sotto-chunk eredita sezione/titolo e riceve un nuovo sub_index.
-    """
-    result: list[dict] = []
-    split_count = 0
-
-    for c in chunks:
-        if c["chunk_id"] not in too_long_ids:
-            result.append(c)
-            continue
-
-        body = _strip_prefix(c["text"])
-        parts = _split_at_boundary(body, max_chars)
-
-        if len(parts) == 1:
-            # Non spezzabile: lascia invariato
-            result.append(c)
-            continue
-
-        base_id = re.sub(r"__s\d+$", "", c["chunk_id"])
-        base_sub = c.get("sub_index", 0)
-
-        for j, part in enumerate(parts):
-            new_chunk = dict(c)
-            new_chunk["sub_index"] = base_sub + j
-            new_chunk["chunk_id"]  = f"{base_id}__s{base_sub + j}"
-            new_chunk["text"]      = _rebuild_text(new_chunk, part)
-            new_chunk["n_chars"]   = len(new_chunk["text"])
-            result.append(new_chunk)
-
-        split_count += 1
-
-    return result, split_count
-
-
-def renumber_ids(chunks: list[dict]) -> list[dict]:
-    """
-    Rinomina i chunk_id per evitare duplicati dopo merge/split.
-    Mantiene il pattern base__sN dove N è il nuovo indice per sezione/titolo.
-    """
-    seen: dict[str, int] = {}
-    for c in chunks:
-        base = re.sub(r"__s\d+$", "", c["chunk_id"])
-        idx = seen.get(base, 0)
-        c["chunk_id"] = f"{base}__s{idx}"
-        c["sub_index"] = idx
-        seen[base] = idx + 1
-    return chunks
-
-
-# ─── Core ─────────────────────────────────────────────────────────────────────
-
-def fix_stem(stem: str, project_root: Path, max_chars: int, dry_run: bool) -> bool:
-    step6_dir   = project_root / "step-6" / stem
-    chunks_path = step6_dir / "chunks.json"
-    report_path = step6_dir / "report.json"
-
-    if not chunks_path.exists():
-        print(f"✗ step-6/{stem}/chunks.json non trovato.")
-        print(f"  Esegui prima: python step-6/verify_chunks.py --stem {stem}")
-        return False
-
-    if not report_path.exists():
-        print(f"✗ step-6/{stem}/report.json non trovato.")
-        print(f"  Esegui prima: python step-6/verify_chunks.py --stem {stem}")
-        return False
-
-    chunks: list[dict] = json.loads(chunks_path.read_text(encoding="utf-8"))
-    report: dict       = json.loads(report_path.read_text(encoding="utf-8"))
-
-    verdict = report.get("verdict", "ok")
-    print(f"\nDocumento: {stem}  (verdict: {verdict})")
-
-    if verdict == "ok":
-        print("  ✅ Nessun problema — nulla da correggere.")
-        return True
-
-    # Raccogli gli ID per categoria
-    empty_ids      = {e["chunk_id"] for e in report.get("blockers", {}).get("empty", [])}
-    no_prefix_ids  = {e["chunk_id"] for e in report.get("blockers", {}).get("no_prefix", [])}
-    incomplete_ids = {e["chunk_id"] for e in report.get("blockers", {}).get("incomplete", [])}
-    too_short_ids  = {e["chunk_id"] for e in report.get("warnings", {}).get("too_short", [])}
-    too_long_ids   = {e["chunk_id"] for e in report.get("warnings", {}).get("too_long", [])}
-
-    # Riepilogo operazioni pianificate
-    ops: list[str] = []
-    if empty_ids:
-        ops.append(f"  🗑  rimuovi {len(empty_ids)} chunk vuoti")
-    if no_prefix_ids:
-        ops.append(f"  🔧 aggiungi prefisso a {len(no_prefix_ids)} chunk")
-    if incomplete_ids:
-        ops.append(f"  🔗 fondi {len(incomplete_ids)} chunk incompleti col successivo")
-    if too_short_ids:
-        ops.append(f"  🔗 fondi {len(too_short_ids)} chunk troppo corti col successivo")
-    if too_long_ids:
-        ops.append(f"  ✂️  spezza {len(too_long_ids)} chunk troppo lunghi")
-
-    if not ops:
-        print("  ✅ Nessuna correzione necessaria.")
-        return True
-
-    print("\n  Operazioni pianificate:")
-    for op in ops:
-        print(op)
-
-    if dry_run:
-        print("\n  [dry-run] Nessuna modifica applicata.")
-        return True
-
-    n_before = len(chunks)
-
-    # Applica nell'ordine corretto
-    if empty_ids:
-        chunks, n = fix_empty(chunks, empty_ids)
-        print(f"\n  🗑  Rimossi {n} chunk vuoti.")
-
-    if no_prefix_ids:
-        chunks, n = fix_no_prefix(chunks, no_prefix_ids)
-        print(f"  🔧 Aggiunto prefisso a {n} chunk.")
-
-    # incomplete and too_short are merged forward together in a single pass
-    merge_ids = incomplete_ids | too_short_ids
-    if merge_ids:
-        chunks, n = fix_incomplete_and_short(chunks, merge_ids)
-        print(f"  🔗 Fusi {n} chunk (incompleti + corti).")
-
-    if too_long_ids:
-        # No ID update needed: the merge step keeps the original chunk_ids,
-        # so too_long_ids still matches the chunks after merging
-        chunks, n = fix_too_long(chunks, too_long_ids, max_chars)
-        print(f"  ✂️  Spezzati {n} chunk lunghi.")
-
-    # Rinumera per evitare duplicati
-    chunks = renumber_ids(chunks)
-
-    n_after = len(chunks)
-    print(f"\n  Totale chunk: {n_before} → {n_after}")
-
-    # Salva
-    chunks_path.write_text(
-        json.dumps(chunks, ensure_ascii=False, indent=2), encoding="utf-8"
-    )
-    print(f"  ✅ Salvato: step-6/{stem}/chunks.json")
-    print(f"\n  Riesegui la verifica:")
-    print(f"     python step-6/verify_chunks.py --stem {stem}")
-
-    return True
-
-
-# ─── Entry point ──────────────────────────────────────────────────────────────
-
-if __name__ == "__main__":
-    project_root = Path(__file__).parent.parent
-
-    parser = argparse.ArgumentParser(description="Step 6 — Fix chunk")
-    parser.add_argument("--stem", required=True, help="Nome del documento (sottocartella di step-6/)")
-    parser.add_argument(
-        "--max", type=int, default=MAX_CHARS,
-        help=f"Soglia massima caratteri per lo split (default: {MAX_CHARS})"
-    )
-    parser.add_argument(
-        "--dry-run", action="store_true",
-        help="Mostra le operazioni pianificate senza applicarle"
-    )
-    args = parser.parse_args()
-
-    ok = fix_stem(args.stem, project_root, args.max, args.dry_run)
-    sys.exit(0 if ok else 1)
@@ -1,147 +0,0 @@
-# Step 7: Environment check
-
-Before moving on to vectorization (step 8) you need to have installed:
-
- **Ollama**: local server for LLMs and embeddings
- an **embedding model** (e.g. `nomic-embed-text`, `bge-m3`)
- an **LLM** (e.g. `qwen3.5:4b`, `qwen3:4b`)
- **chromadb**: Python library for the vector store
-
---
-
-## 1. Install Ollama
-
-```bash
-curl -fsSL https://ollama.com/install.sh | sh
-```
-
-Check that the service is running:
-
-```bash
-ollama list
-```
-
-### Uninstall Ollama
-
-```bash
-# Stop and remove the systemd service
-sudo systemctl stop ollama
-sudo systemctl disable ollama
-sudo rm /etc/systemd/system/ollama.service
-sudo systemctl daemon-reload
-
-# Remove the binary
-sudo rm /usr/local/bin/ollama
-
-# Remove models and data (optional; frees disk space)
-# Models are stored under the "ollama" system user, not in your home directory
-sudo rm -rf /usr/share/ollama
-
-# Remove the system user and group created by the installer (optional)
-sudo userdel ollama
-sudo groupdel ollama
-```
-
---
-
-## 2. Download the models
-
-### Embedding model
-
-Italian texts need a multilingual model: English-first models produce lower-quality embeddings for languages other than English, which makes retrieval less precise.
-
-Recommended first choice:
-
-```bash
-ollama pull qwen3-embedding:0.6b
-```
-
-Same family as the LLM in use (`qwen3.5`), multilingual, recent, and runs comfortably on CPU.
-
-| Model | Dim | Size | Languages | Recommended |
-|---|---|---|---|---|
-| `qwen3-embedding:0.6b` | 1024 | ~522 MB | multilingual | ✅ first choice |
-| `nomic-embed-text-v2-moe` | 768 | ~523 MB | multilingual | ✅ second choice |
-| `bge-m3` | 1024 | ~1.2 GB | 100+ languages incl. IT | ✅ third choice |
-| `nomic-embed-text` | 768 | ~274 MB | mainly EN | ⚠️ current default |
-| `mxbai-embed-large` | 1024 | ~670 MB | mainly EN | ❌ |
-| `paraphrase-multilingual` | 768 | ~278 MB | multilingual | ❌ obsolete |
-| `all-minilm` | 384 | ~46 MB | mainly EN | ❌ too small |
-
-If you switch to a model other than the one used in step-8, you must rerun vectorization with `--force` and update `EMBED_MODEL` in `step-9/config.py`.
-
-### LLM
-
-RAG over Italian texts needs good instruction following, multilingual support, and a large context window (RAG prompts include several chunks).
-
-Recommended first choice for 8 GB RAM:
-
-```bash
-ollama pull qwen3.5:4b
-```
-
-The project is designed around the **Qwen3.5** family: same family as the recommended embedding model (`qwen3-embedding`), 256K context window, excellent Italian. Other models are compatible but untested.
-
-| Model | RAM | Notes |
-|---|---|---|
-| `qwen3.5:0.8b` | ≥ 1 GB | absolute minimum |
-| `qwen3.5:2b` | ≥ 3 GB | lightweight |
-| `qwen3.5:4b` | ≥ 5 GB | **recommended for 8 GB** |
-| `qwen3.5:9b` | ≥ 8 GB | slow on CPU, better with a GPU |
-
-If you use a model other than `qwen3.5:4b`, update `OLLAMA_MODEL` in `step-9/config.py`.
-
-### Uninstall a model
-
-```bash
-ollama rm qwen3.5:4b
-ollama rm nomic-embed-text
-```
-
-To list all installed models:
-
-```bash
-ollama list
-```
-
---
-
-## 3. Install the dependencies in the venv
-
-Make sure `chromadb` is installed in `.venv`:
-
-```bash
-source .venv/bin/activate
-pip install -r requirements.txt
-```
-
---
-
-## 4. Verify everything
-
-```bash
-source .venv/bin/activate
-python step-7/check_env.py
-```
-
-Expected output if everything is in order:
-
-```
-✅ ollama trovato nel PATH
-✅ ollama risponde correttamente
-✅ modello embedding trovato: nomic-embed-text:latest
-✅ modello LLM trovato: qwen3.5:4b
-
-✅ chromadb importabile
-
-✅ Ambiente pronto — procedi con la vettorizzazione:
-   python step-8/ingest.py --stem <nome>
-```
-
---
-
-## Next step
-
-```bash
-python step-8/ingest.py --stem <name>
-```
@@ -1,109 +0,0 @@
-# Step 9: Querying the document
-
-Two query modes, both with an interactive loop:
-
-| Script | Mode | When to use it |
-|---|---|---|
-| `rag.py` | Retrieval + LLM generation | Natural-language answer |
-| `retrieve.py` | Retrieval only (no LLM) | Debugging, chunk verification, semantic search |
-
---
-
-## Prerequisites
-
- Step 8 completed (`chroma_db/` populated)
- Ollama running with the embedding model downloaded
- For `rag.py`: the LLM model downloaded as well
-
---
-
-## rag.py: natural-language answers
-
-```bash
-source .venv/bin/activate
-python step-9/rag.py --stem <name>
-```
-
-For each question, it embeds the query, retrieves the most relevant chunks from ChromaDB, and generates the answer through Ollama.
-
-```
-── Loop RAG ─────────────────────────────────────── (exit per uscire)
-
-Domanda:
-```
-
-| Syntax | Behavior |
-|---|---|
-| `<text>` | Answer based on the document |
-| `<text> -v` | Answer plus the retrieved chunks with their similarity scores |
-| `exit` | Exits the program |
-
-Internal flow:
-
-```
-question
-    │
-    ▼  embed (EMBED_MODEL, Ollama)
-N-dim vector
-    │
-    ▼  query ChromaDB (cosine similarity, top-K)
-relevant chunks
-    │
-    ▼  build_prompt (SYSTEM_PROMPT + contexts + question)
-    │
-    ▼  generate (OLLAMA_MODEL, Ollama)
-answer
-```
-
-The LLM answers exclusively from the supplied context. If the context is irrelevant to the question, it says so explicitly.
-
---
-
-## retrieve.py: pure retrieval (no LLM)
-
-```bash
-source .venv/bin/activate
-python step-9/retrieve.py --stem <name>
-```
-
-Embeds the query and returns the most similar chunks with their similarity scores, without calling Ollama for generation. Useful for checking retrieval quality and diagnosing wrong answers.
-
-```
-── Loop retrieval ──────────────────────── (exit per uscire, -f per testo completo)
-
-Query:
-```
-
-| Syntax | Behavior |
-|---|---|
-| `<text>` | Most similar chunks with similarity scores (text truncated to 200 characters) |
-| `<text> -f` | Most similar chunks with full text |
-| `exit` | Exits the program |
-
-Accepts `--top-k N` to override the `config.py` value for that session.
-
---
-
-## Configuration (`config.py`)
-
-| Parameter | Default | Description |
-|---|---|---|
-| `TOP_K` | `6` | Chunks retrieved for each question. Recommended values: `3`–`10` |
-| `TEMPERATURE` | `0.0` | Deterministic at `0.0`, more creative toward `1.0`. `0.0` is recommended for RAG |
-| `NO_THINK` | `True` | Disables the internal chain-of-thought of Qwen3/Qwen3.5 models. `True` = direct, faster answer |
-| `EMBED_MODEL` | `"nomic-embed-text"` | Must match the model used in step-8. If changed, rerun step-8 with `--force` |
-| `OLLAMA_URL` | `"http://localhost:11434"` | Change only if Ollama runs on a different host or port |
-| `OLLAMA_MODEL` | `"qwen3.5:0.8b"` | LLM model. See `step-7/README.md` for how to choose |
-| `SYSTEM_PROMPT` | *(see file)* | Behavior instructions sent to the LLM. Edit to change tone, language, or the fallback condition |
-
---
-
-## Test without RAG
-
-To check that Ollama responds correctly before querying the document:
-
-```bash
-python step-9/test_ollama.py
-```
-
-Direct chat with the model, without ChromaDB. Uses the same parameters as `config.py`.
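The retrieval and prompt-building steps of the flow described in the removed Step 9 README can be illustrated without any external service. This is a minimal sketch under stated assumptions: ChromaDB and Ollama are replaced by in-memory stand-ins, and `cosine_similarity`, `top_k_chunks` and the layout produced by `build_prompt` are hypothetical, not the code in `rag.py`.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_chunks(query_vec, chunks, k=6):
    """chunks is a list of (text, vector); return the k best as (score, text)."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


def build_prompt(system_prompt, contexts, question):
    """Assumed layout: system prompt, numbered contexts, then the question."""
    context_block = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(contexts, start=1))
    return f"{system_prompt}\n\nContext:\n{context_block}\n\nQuestion: {question}"
```

In the real pipeline the vectors would come from the embedding model and the similarity search would be delegated to the vector store; the sketch only shows the order of operations.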
@@ -0,0 +1,96 @@
+"""Shared fixtures for the whole test suite."""
+import pytest
+from conversione._pipeline.models import Block, Section
+
+
+@pytest.fixture
+def make_block():
+    """Factory for test Blocks with sensible default values."""
+    def _make(
+        text="testo di prova",
+        page=1,
+        font_size=12.0,
+        font_name="Helvetica",
+        is_bold=False,
+        block_type="paragraph",
+        space_before=5.0,
+        bbox=(50.0, 100.0, 400.0, 114.0),
+        level=0,
+    ):
+        return Block(
+            text=text,
+            page=page,
+            bbox=bbox,
+            font_size=font_size,
+            font_name=font_name,
+            is_bold=is_bold,
+            block_type=block_type,
+            space_before=space_before,
+            level=level,
+        )
+    return _make
+
+
+@pytest.fixture
+def mock_fitz_page():
+    """Dict simulating the output of page.get_text('dict') for one page."""
+    return {
+        "width": 595.0,
+        "height": 842.0,
+        "blocks": [
+            {
+                "type": 0,
+                "bbox": (50, 50, 450, 70),
+                "lines": [{
+                    "bbox": (50, 50, 450, 70),
+                    "spans": [{
+                        "text": "1. Capitolo Primo",
+                        "font": "Helvetica-Bold",
+                        "size": 18.0,
+                        "flags": 16,
+                        "bbox": (50, 50, 450, 70),
+                        "origin": (50, 68),
+                        "color": 0,
+                    }],
+                }],
+            },
+            {
+                "type": 0,
+                "bbox": (50, 90, 500, 104),
+                "lines": [{
+                    "bbox": (50, 90, 500, 104),
+                    "spans": [{
+                        "text": "Testo del primo paragrafo del capitolo.",
+                        "font": "Helvetica",
+                        "size": 12.0,
+                        "flags": 0,
+                        "bbox": (50, 90, 500, 104),
+                        "origin": (50, 102),
+                        "color": 0,
+                    }],
+                }],
+            },
+        ],
+    }
+
+
+@pytest.fixture
+def simple_hierarchy_blocks(make_block):
+    """List of Blocks with a simple numbered H1→H2→H3 hierarchy."""
+    return [
+        make_block("1. Introduzione", font_size=18, is_bold=True, space_before=20.0),
+        make_block("Testo del paragrafo di introduzione.", font_size=12),
+        make_block("1.1 Contesto", font_size=15, is_bold=True, space_before=15.0),
+        make_block("Testo della sezione di contesto.", font_size=12),
+        make_block("1.1.1 Dettaglio", font_size=13, is_bold=True, space_before=10.0),
+        make_block("Testo del dettaglio specifico.", font_size=12),
+        make_block("2. Conclusioni", font_size=18, is_bold=True, space_before=20.0),
+        make_block("Testo conclusivo.", font_size=12),
+    ]
+
+
+@pytest.fixture
+def sources_dir():
+    from pathlib import Path
+    d = Path(__file__).parent.parent / "sources"
+    return d if d.exists() else None
@@ -0,0 +1,68 @@
+"""End-to-end tests: full pipeline on real PDFs from sources/."""
+import json
+import shutil
+import pytest
+from pathlib import Path
+
+from conversione._pipeline import run
+
+
+PROJECT_ROOT = Path(__file__).parent.parent.parent
+
+
+def _sources_available(stem: str) -> bool:
+    return (PROJECT_ROOT / "sources" / f"{stem}.pdf").exists()
+
+
+@pytest.mark.skipif(not _sources_available("bitcoin"), reason="sources/bitcoin.pdf not available")
+def test_bitcoin_produces_clean_md(tmp_path):
+    """Full pipeline on bitcoin.pdf: checks for structured output."""
+    # Use tmp_path as the output root so the repo is not polluted
+    out_dir = tmp_path / "conversione" / "bitcoin"
+    out_dir.mkdir(parents=True)
+    sources_dir = tmp_path / "sources"
+    sources_dir.mkdir()
+    shutil.copy(PROJECT_ROOT / "sources" / "bitcoin.pdf", sources_dir / "bitcoin.pdf")
+
+    ok = run("bitcoin", tmp_path, force=True)
+    assert ok, "The pipeline must complete without errors"
+
+    clean_md = out_dir / "clean.md"
+    assert clean_md.exists(), "clean.md must be created"
+
+    text = clean_md.read_text(encoding="utf-8")
+    assert len(text) > 1000, "clean.md must have substantial content"
+    assert "#" in text, "clean.md must have at least one header"
+
+    report = json.loads((out_dir / "report.json").read_text(encoding="utf-8"))
+    assert report["structure"]["livello_struttura"] >= 1, "Structure must be at least level 1"
+
+
+@pytest.mark.skipif(not _sources_available("bitcoin"), reason="sources/bitcoin.pdf not available")
+def test_determinism(tmp_path):
+    """Two consecutive runs on the same PDF produce identical output."""
+    sources_dir = tmp_path / "sources"
+    sources_dir.mkdir()
+    shutil.copy(PROJECT_ROOT / "sources" / "bitcoin.pdf", sources_dir / "bitcoin.pdf")
+
+    run("bitcoin", tmp_path, force=True)
+    first = (tmp_path / "conversione" / "bitcoin" / "clean.md").read_text(encoding="utf-8")
+
+    run("bitcoin", tmp_path, force=True)
+    second = (tmp_path / "conversione" / "bitcoin" / "clean.md").read_text(encoding="utf-8")
+
+    assert first == second, "Output must be deterministic across two runs"
+
+
+@pytest.mark.skipif(not _sources_available("codice_civile"), reason="sources/codice_civile.pdf not available")
+def test_codice_civile_has_articles(tmp_path):
+    """The Codice Civile must produce headers containing 'Art.'."""
+    sources_dir = tmp_path / "sources"
+    sources_dir.mkdir()
+    shutil.copy(PROJECT_ROOT / "sources" / "codice_civile.pdf", sources_dir / "codice_civile.pdf")
+
+    ok = run("codice_civile", tmp_path, force=True)
+    assert ok
+
+    text = (tmp_path / "conversione" / "codice_civile" / "clean.md").read_text(encoding="utf-8")
+    assert "Art." in text, "clean.md for the Codice Civile must contain articles"
@@ -0,0 +1,40 @@
+"""Category 8 tests: automatic repair of a broken hierarchy (todo.md Cat.8)."""
+from conversione._pipeline.stage8_normalize import normalize_hierarchy
+
+
+def test_cat8_invalid_hierarchy_auto_repaired():
+    """
+    Category 8 from todo.md:
+    Input:    # A \\n\\n#### B
+    Expected: # A \\n\\n## B   (jump repaired to at most +1)
+    """
+    md_input = "# A\n\n#### B\n\nContenuto di B.\n"
+    result, stats = normalize_hierarchy(md_input)
+
+    assert "## B" in result, "#### must become ## (+1 jump from the parent #)"
+    assert "#### B" not in result, "The original level must not remain"
+    assert stats["n_level_jumps_repaired"] >= 1
+
+
+def test_multiple_jumps_all_repaired():
+    """Chain of jumps: # → #### → ######."""
+    md_input = "# Root\n\n#### Middle\n\nTesto\n\n###### Deep\n\nTesto\n"
+    result, stats = normalize_hierarchy(md_input)
+
+    lines = [line for line in result.split("\n") if line.startswith("#")]
+    levels = [len(line) - len(line.lstrip("#")) for line in lines]
+
+    # Check that no jump is greater than 1
+    for i in range(1, len(levels)):
+        assert levels[i] <= levels[i - 1] + 1, \
+            f"Unrepaired jump: {levels[i-1]} → {levels[i]}"
+
+
+def test_valid_hierarchy_not_touched():
+    """A valid hierarchy must not be modified."""
+    md_valid = "# H1\n\nTesto\n\n## H2\n\nTesto\n\n### H3\n\nTesto\n"
+    result, stats = normalize_hierarchy(md_valid)
+    assert stats["n_level_jumps_repaired"] == 0
+    assert "# H1" in result
+    assert "## H2" in result
+    assert "### H3" in result
@@ -0,0 +1,47 @@
+"""Test dataclass Block, Section, FontProfile."""
+from conversione._pipeline.models import Block, Section, FontProfile
+
+
+def test_block_creation():
+    b = Block(
+        text="Titolo", page=1,
+        bbox=(0, 0, 100, 14),
+        font_size=16.0, font_name="Arial-Bold",
+        is_bold=True,
+    )
+    assert b.text == "Titolo"
+    assert b.is_bold
+    assert b.block_type == "paragraph"
+    assert b.level == 0
+    assert b.x0 == 0.0
+    assert b.y1 == 14.0
+
+
+def test_block_properties():
+    b = Block("x", 1, (10.0, 20.0, 110.0, 34.0), 12.0, "Helvetica", False)
+    assert b.x0 == 10.0
+    assert b.y0 == 20.0
+    assert b.x1 == 110.0
+    assert b.y1 == 34.0
+
+
+def test_section_defaults():
+    s = Section(title="Intro", level=1)
+    assert s.content == []
+    assert s.children == []
+    assert s.page_start == 0
+
+
+def test_section_nesting():
+    parent = Section("Parent", level=1)
+    child  = Section("Child", level=2)
+    parent.children.append(child)
+    assert len(parent.children) == 1
+    assert parent.children[0].title == "Child"
+
+
+def test_font_profile():
+    fp = FontProfile(body_size=11.0, cluster_map={18.0: 1, 15.0: 2}, header_sizes=[18.0, 15.0])
+    assert fp.body_size == 11.0
+    assert fp.cluster_map[18.0] == 1
+    assert len(fp.header_sizes) == 2
@@ -0,0 +1,44 @@
+"""Test Stage 3: font analysis."""
+from conversione._pipeline.models import Block
+from conversione._pipeline.stage3_font import build_font_profile
+
+
+def _make_block(font_size, n=1):
+    return [
+        Block(f"testo {i}", 1, (0, i*14.0, 100, (i+1)*14.0), font_size, "Helvetica", False)
+        for i in range(n)
+    ]
+
+
+def test_body_size_is_most_frequent():
+    blocks = _make_block(12.0, 20) + _make_block(18.0, 2) + _make_block(15.0, 3)
+    profile = build_font_profile(blocks)
+    assert profile.body_size == 12.0
+
+
+def test_header_sizes_above_body():
+    blocks = _make_block(12.0, 20) + _make_block(18.0, 2) + _make_block(15.0, 3)
+    profile = build_font_profile(blocks)
+    assert all(s > profile.body_size for s in profile.header_sizes)
+
+
+def test_cluster_map_levels():
+    blocks = _make_block(12.0, 20) + _make_block(24.0, 2) + _make_block(18.0, 3) + _make_block(14.0, 4)
+    profile = build_font_profile(blocks)
+    # The largest size must get level 1
+    if profile.header_sizes:
+        assert profile.cluster_map[profile.header_sizes[0]] == 1
+
+
+def test_empty_blocks():
+    profile = build_font_profile([])
+    assert profile.body_size == 11.0
+    assert profile.header_sizes == []
+
+
+def test_single_font_size():
+    blocks = _make_block(11.0, 50)
+    profile = build_font_profile(blocks)
+    assert profile.body_size == 11.0
+    assert profile.header_sizes == []
+    assert profile.cluster_map == {}
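The Stage 3 tests above pin down the observable behavior of `build_font_profile`: the body size is the most frequent font size, header sizes are strictly larger than the body, and the largest header size maps to level 1. The following is a minimal sketch of that idea, assuming plain float sizes instead of `Block` objects. `sketch_font_profile` and the `DEFAULT_BODY_SIZE` constant are hypothetical stand-ins inferred from the assertions, not the project's implementation.

```python
from collections import Counter

DEFAULT_BODY_SIZE = 11.0  # assumption: mirrors the value test_empty_blocks expects


def sketch_font_profile(font_sizes):
    """Return (body_size, header_sizes, cluster_map) from a list of font sizes."""
    if not font_sizes:
        return DEFAULT_BODY_SIZE, [], {}
    counts = Counter(font_sizes)
    # Body text is assumed to be the most frequent size.
    body_size = counts.most_common(1)[0][0]
    # Header candidates: distinct sizes larger than the body, largest first.
    header_sizes = sorted((s for s in counts if s > body_size), reverse=True)
    # Rank the sizes so the largest gets level 1, the next level 2, and so on.
    cluster_map = {size: level for level, size in enumerate(header_sizes, start=1)}
    return body_size, header_sizes, cluster_map
```

A real implementation would probably cluster nearby sizes (for example 17.9 and 18.0) before ranking them; the sketch treats every distinct size as its own cluster.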
@@ -0,0 +1,52 @@
+"""Test Stage 4: header detection with combined signals."""
+import pytest
+from conversione._pipeline.models import Block, FontProfile
+from conversione._pipeline.stage4_headers import classify_blocks
+
+
+def _profile(body=12.0):
+    return FontProfile(body_size=body, cluster_map={18.0: 1, 15.0: 2}, header_sizes=[18.0, 15.0])
+
+
+def _block(text, font_size=12.0, is_bold=False, space_before=5.0, block_type="paragraph"):
+    return Block(text, 1, (50, 100, 400, 114), font_size, "Helvetica", is_bold,
+                 block_type=block_type, space_before=space_before)
+
+
+def test_numbered_large_bold_short_becomes_header():
+    # All signals positive
+    b = _block("1. Introduzione", font_size=18, is_bold=True, space_before=30.0)
+    classify_blocks([b], _profile())
+    assert b.block_type == "header_candidate"
+
+
+def test_body_text_stays_paragraph():
+    b = _block("Questo è un lungo paragrafo di testo normale che non deve diventare un header.", font_size=12)
+    classify_blocks([b], _profile())
+    assert b.block_type == "paragraph"
+
+
+def test_bold_body_text_not_header():
+    # Bold, but same size as the body and long text → NOT a header (bold_signal requires size > body+0.5)
+    b = _block("Testo importante in grassetto nel corpo del documento.", font_size=12, is_bold=True)
+    classify_blocks([b], _profile())
+    assert b.block_type == "paragraph"
+
+
+def test_article_forced_header():
+    # "Art. N" → always a header candidate
+    b = _block("Art. 1423. Nullità del contratto.", font_size=12)
+    classify_blocks([b], _profile())
+    assert b.block_type == "header_candidate"
+
+
+def test_table_preserved():
+    b = _block("Colonna A | Colonna B", font_size=12, block_type="table")
+    classify_blocks([b], _profile())
+    assert b.block_type == "table"
+
+
+def test_list_item_detection():
+    b = _block("- primo elemento della lista", font_size=12)
+    classify_blocks([b], _profile())
+    assert b.block_type == "list_item"
@@ -0,0 +1,95 @@
+"""Test Stage 5: hierarchy inference (numbering, TOC, font fallback)."""
+from conversione._pipeline.models import Block, FontProfile
+from conversione._pipeline.stage5_hierarchy import infer_hierarchy, _level_from_numbering
+
+
+def _profile():
+    return FontProfile(body_size=12.0, cluster_map={18.0: 1, 15.0: 2, 13.0: 3}, header_sizes=[18.0, 15.0, 13.0])
+
+
+def _hblock(text, font_size=18.0, is_bold=True):
+    b = Block(text, 1, (50, 100, 400, 114), font_size, "Helvetica-Bold", is_bold)
+    b.block_type = "header_candidate"
+    return b
+
+
+def _pblock(text):
+    b = Block(text, 1, (50, 120, 400, 134), 12.0, "Helvetica", False)
+    b.block_type = "paragraph"
+    return b
+
+
+# ── Test _level_from_numbering ────────────────────────────────────────────────
+
+def test_numbering_level1():
+    assert _level_from_numbering("1. Titolo") == 1
+
+def test_numbering_level2():
+    assert _level_from_numbering("1.2 Sottotitolo") == 2
+
+def test_numbering_level3():
+    assert _level_from_numbering("1.2.3 Dettaglio") == 3
+
+def test_numbering_deep_capped_at_3():
+    assert _level_from_numbering("1.2.3.4 Troppo profondo") == 3
+
+def test_numbering_no_match():
+    assert _level_from_numbering("Testo senza numero") == 0
+
+
+# ── Test infer_hierarchy with numbering ──────────────────────────────────────
+
+def test_numbered_sections_get_correct_levels():
+    blocks = [
+        _hblock("1. Introduzione", font_size=18),
+        _pblock("Testo."),
+        _hblock("1.1 Contesto", font_size=15),
+        _pblock("Testo."),
+        _hblock("1.1.1 Dettaglio", font_size=13),
+        _pblock("Testo."),
+        _hblock("2. Conclusioni", font_size=18),
+    ]
+    result = infer_hierarchy(blocks, _profile(), toc=[])
+    headers = [b for b in result if b.block_type == "header_candidate"]
+    assert headers[0].level == 1  # "1."
+    assert headers[1].level == 2  # "1.1"
+    assert headers[2].level == 3  # "1.1.1"
+    assert headers[3].level == 1  # "2."
+
+
+# ── Test infer_hierarchy with TOC ────────────────────────────────────────────
+
+def test_toc_alignment():
+    toc = [[1, "Introduzione", 1], [2, "Contesto storico", 3], [1, "Conclusioni", 10]]
+    blocks = [
+        _hblock("Introduzione", font_size=14),
+        _hblock("Contesto storico", font_size=13),
+        _hblock("Conclusioni", font_size=14),
+    ]
+    result = infer_hierarchy(blocks, _profile(), toc=toc)
+    headers = [b for b in result if b.block_type == "header_candidate"]
+    assert headers[0].level == 1
+    assert headers[1].level == 2
+    assert headers[2].level == 1
+
+
+# ── Test infer_hierarchy with font fallback ──────────────────────────────────
+
+def test_font_fallback_no_numbering_no_toc():
+    blocks = [
+        _hblock("Capitolo Grande", font_size=18),
+        _pblock("Testo."),
+        _hblock("Sezione Media", font_size=15),
+        _pblock("Testo."),
+    ]
+    result = infer_hierarchy(blocks, _profile(), toc=[])
+    headers = [b for b in result if b.block_type == "header_candidate"]
+    assert headers[0].level == 1  # 18pt → cluster level 1
+    assert headers[1].level == 2  # 15pt → cluster level 2
+
+
+def test_empty_cluster_map_defaults_to_2():
+    profile_empty = FontProfile(body_size=12.0, cluster_map={}, header_sizes=[])
+    blocks = [_hblock("Titolo qualsiasi", font_size=18)]
+    result = infer_hierarchy(blocks, profile_empty, toc=[])
+    assert result[0].level == 2
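The `_level_from_numbering` tests above describe a mapping from a dotted numeric prefix to a header level, capped at 3 and returning 0 when no prefix is found. A minimal sketch of such a mapping follows. The regular expression and the name `sketch_level_from_numbering` are assumptions made for illustration; the project's private helper may use different rules.

```python
import re

# Assumed pattern: a dotted number ("1", "1.2", "1.2.3") with an optional
# trailing dot, followed by whitespace and the start of the title text.
_NUMBERING = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+\S")
MAX_LEVEL = 3  # the tests cap deeper numbering at level 3


def sketch_level_from_numbering(text):
    """Return the header level implied by a numeric prefix, or 0 if there is none."""
    match = _NUMBERING.match(text)
    if match is None:
        return 0
    depth = match.group(1).count(".") + 1
    return min(depth, MAX_LEVEL)
```

Note that this sketch would also treat a sentence such as "2024 was a good year" as a level-1 header, so in practice it would only be applied to blocks already classified as header candidates.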
@@ -0,0 +1,98 @@
+"""Test Stage 6: document tree reconstruction."""
+import pytest
+from conversione._pipeline.models import Block, Section
+from conversione._pipeline.stage6_tree import build_tree
+
+
+def _hblock(text, level, page=1):
+    b = Block(text, page, (50, 100, 400, 114), 16.0, "Helvetica-Bold", True)
+    b.block_type = "header_candidate"
+    b.level = level
+    return b
+
+
+def _pblock(text, page=1):
+    b = Block(text, page, (50, 120, 400, 134), 12.0, "Helvetica", False)
+    b.block_type = "paragraph"
+    return b
+
+
+def test_simple_hierarchy():
+    blocks = [
+        _hblock("H1", 1),
+        _pblock("p1"),
+        _hblock("H2", 2),
+        _pblock("p2"),
+    ]
+    roots = build_tree(blocks)
+    assert len(roots) == 1
+    h1 = roots[0]
+    assert h1.title == "H1"
+    assert h1.level == 1
+    assert len(h1.content) == 1
+    assert h1.content[0].text == "p1"
+    assert len(h1.children) == 1
+    h2 = h1.children[0]
+    assert h2.title == "H2"
+    assert len(h2.content) == 1
+
+
+def test_two_siblings():
+    blocks = [
+        _hblock("Cap 1", 1),
+        _pblock("testo 1"),
+        _hblock("Cap 2", 1),
+        _pblock("testo 2"),
+    ]
+    roots = build_tree(blocks)
+    assert len(roots) == 2
+    assert roots[0].title == "Cap 1"
+    assert roots[1].title == "Cap 2"
+
+
+def test_pre_header_text_gets_implicit_section():
+    blocks = [
+        _pblock("Testo introduttivo prima del primo header."),
+        _hblock("Primo header", 1),
+    ]
+    roots = build_tree(blocks)
+    # The implicit section (level=0) is the root; it holds the pre-header text
+    # and the first header becomes its child.
+    assert len(roots) == 1
+    implicit = roots[0]
+    assert implicit.title == ""
+    assert implicit.level == 0
+    assert len(implicit.content) == 1
+    assert len(implicit.children) == 1
+    assert implicit.children[0].title == "Primo header"
+
+
+def test_deep_nesting():
+    blocks = [
+        _hblock("H1", 1),
+        _hblock("H2", 2),
+        _hblock("H3", 3),
+        _pblock("testo profondo"),
+    ]
+    roots = build_tree(blocks)
+    assert len(roots) == 1
+    h1 = roots[0]
+    assert len(h1.children) == 1
+    h2 = h1.children[0]
+    assert len(h2.children) == 1
+    h3 = h2.children[0]
+    assert len(h3.content) == 1
+
+
+def test_ignore_blocks_skipped():
+    b_ignore = Block("superscript", 1, (0,0,10,10), 8.0, "Helvetica", False, block_type="ignore")
+    blocks = [
+        _hblock("Titolo", 1),
+        b_ignore,
+        _pblock("paragrafo"),
+    ]
+    roots = build_tree(blocks)
+    h1 = roots[0]
+    # The ignore block must not be in the content
+    assert all(b.block_type != "ignore" for b in h1.content)
+    assert len(h1.content) == 1
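The Stage 6 tests above describe a tree builder that nests headers by level, drops `ignore` blocks, and wraps text that precedes the first header in an implicit level-0 root. One common way to get that behavior is a stack of open sections. The sketch below assumes simplified `(kind, text, level)` tuples instead of `Block` objects; `Node` and `sketch_build_tree` are hypothetical names, not the project's `Section` or `build_tree`.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    title: str
    level: int
    content: list = field(default_factory=list)
    children: list = field(default_factory=list)


def sketch_build_tree(items):
    """Build a section tree from (kind, text, level) tuples.

    kind is "header", "paragraph" or "ignore"; level matters for headers only.
    """
    roots, stack = [], []
    for kind, text, level in items:
        if kind == "ignore":
            continue
        if kind == "header":
            node = Node(text, level)
            # Close every open section that is not a strict ancestor.
            while stack and stack[-1].level >= level:
                stack.pop()
            (stack[-1].children if stack else roots).append(node)
            stack.append(node)
        else:
            if not stack:
                # Text before any header goes into an implicit level-0 root.
                implicit = Node("", 0)
                roots.append(implicit)
                stack.append(implicit)
            stack[-1].content.append(text)
    return roots
```

Because the implicit root has level 0, it is never popped, so every later header becomes its descendant, which matches what `test_pre_header_text_gets_implicit_section` asserts.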
@@ -0,0 +1,62 @@
+"""Test Stage 7: Markdown serialization."""
+from conversione._pipeline.models import Block, Section
+from conversione._pipeline.stage7_markdown import serialize_tree, _table_to_markdown
+
+
+def _section(title, level, texts=None, children=None):
+    blocks = []
+    for t in (texts or []):
+        b = Block(t, 1, (0,0,100,14), 12.0, "Helvetica", False, block_type="paragraph")
+        blocks.append(b)
+    s = Section(title=title, level=level, content=blocks, children=children or [])
+    return s
+
+
+def test_h1_header():
+    roots = [_section("Introduzione", 1, ["Testo."])]
+    md = serialize_tree(roots, {})
+    assert "# Introduzione" in md
+    assert "Testo." in md
+
+
+def test_h2_nested():
+    child = _section("Sezione 1.1", 2, ["Contenuto della sezione."])
+    root  = _section("Capitolo 1", 1, [], [child])
+    md = serialize_tree([root], {})
+    assert "# Capitolo 1" in md
+    assert "## Sezione 1.1" in md
+    assert "Contenuto della sezione." in md
+
+
+def test_implicit_section_no_hash():
+    # Implicit section level=0 → no # header
+    s = Section(title="", level=0)
+    b = Block("Testo iniziale.", 1, (0,0,100,14), 12.0, "Helvetica", False)
+    s.content.append(b)
+    md = serialize_tree([s], {})
+    assert not md.startswith("#")
+    assert "Testo iniziale." in md
+
+
+def test_ignore_blocks_not_serialized():
+    s = Section("Titolo", 1)
+    b_ignore = Block("superscript", 1, (0,0,10,10), 8.0, "Helvetica", False, block_type="ignore")
+    b_para   = Block("Paragrafo valido.", 1, (0,0,100,14), 12.0, "Helvetica", False, block_type="paragraph")
+    s.content.extend([b_ignore, b_para])
+    md = serialize_tree([s], {})
+    assert "superscript" not in md
+    assert "Paragrafo valido." in md
+
+
+def test_table_to_markdown():
+    table = [["Nome", "Età"], ["Alice", "30"], ["Bob", "25"]]
+    md = _table_to_markdown(table)
+    assert "| Nome | Età |" in md
+    assert "| --- | --- |" in md
+    assert "| Alice | 30 |" in md
+
+
+def test_no_excessive_blank_lines():
+    roots = [_section("A", 1, ["p1", "p2", "p3"])]
+    md = serialize_tree(roots, {})
+    assert "\n\n\n" not in md
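`test_table_to_markdown` above expects a pipe table with a `---` separator row after the header. A minimal sketch that produces that shape is shown here. `sketch_table_to_markdown` is a hypothetical name, and the sketch assumes every cell is already a string and every row has the same number of columns.

```python
def sketch_table_to_markdown(rows):
    """Render a list of rows (first row is the header) as a Markdown pipe table."""
    if not rows:
        return ""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)
```

The sketch does not escape `|` characters inside cells or pad ragged rows; a production serializer would need to handle both.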
@@ -0,0 +1,49 @@
+"""Test Stage 8: Markdown hierarchy normalization."""
+from conversione._pipeline.stage8_normalize import normalize_hierarchy
+
+
+def test_level_jump_repaired():
+    md = "# A\n\n#### B\n\nTesto\n"
+    result, stats = normalize_hierarchy(md)
+    assert "## B" in result
+    assert "#### B" not in result
+    assert stats["n_level_jumps_repaired"] == 1
+
+
+def test_valid_hierarchy_unchanged():
+    md = "# A\n\n## B\n\nTesto\n\n### C\n\nTesto\n"
+    result, stats = normalize_hierarchy(md)
+    assert "# A" in result
+    assert "## B" in result
+    assert "### C" in result
+    assert stats["n_level_jumps_repaired"] == 0
+
+
+def test_empty_header_removed():
+    md = "# Titolo\n\n## Vuoto\n\n## Con contenuto\n\nTesto.\n"
+    result, stats = normalize_hierarchy(md)
+    assert "## Vuoto" not in result
+    assert "## Con contenuto" in result
+    assert stats["n_empty_headers_removed"] == 1
+
+
+def test_duplicate_consecutive_header_collapsed():
+    md = "# Titolo\n\n# Titolo\n\nTesto.\n"
+    result, stats = normalize_hierarchy(md)
+    assert result.count("# Titolo") == 1
+    assert stats["n_duplicate_headers_removed"] == 1
+
+
+def test_multiple_jumps():
+    md = "# A\n\n### B\n\nTesto B\n\n##### C\n\nTesto C\n"
+    result, stats = normalize_hierarchy(md)
+    assert stats["n_level_jumps_repaired"] == 2
+    assert "## B" in result
+    assert "### C" in result
+
+
+def test_no_false_positives():
+    md = "# A\n\nTesto.\n\n## B\n\nTesto.\n"
+    result, stats = normalize_hierarchy(md)
+    assert stats["n_level_jumps_repaired"] == 0
+    assert stats["n_empty_headers_removed"] == 0
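The level-jump cases in the Stage 8 tests above can be satisfied by clamping each header to at most one level deeper than the previous header. The sketch below shows only that rule. `sketch_repair_level_jumps` is a hypothetical function; it does not remove empty or duplicate headers as `normalize_hierarchy` is expected to, and it assumes the document contains no fenced code blocks with lines starting with `#`.

```python
import re

_HEADER = re.compile(r"^(#{1,6})\s+(.*)$")


def sketch_repair_level_jumps(markdown):
    """Clamp each header to at most one level deeper than the previous header."""
    out, repaired, prev_level = [], 0, 0
    for line in markdown.split("\n"):
        match = _HEADER.match(line)
        if match:
            level = len(match.group(1))
            if level > prev_level + 1:
                # A jump such as # → #### is reduced to # → ##.
                level = prev_level + 1
                repaired += 1
            prev_level = level
            line = "#" * level + " " + match.group(2)
        out.append(line)
    return "\n".join(out), {"n_level_jumps_repaired": repaired}
```

With the input from `test_multiple_jumps`, `### B` is clamped to `## B` first, and `##### C` is then measured against the repaired level 2, giving `### C` and a count of two repairs. As an assumption of this sketch, a document whose first header is deeper than level 1 is also counted as a repair.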
@@ -0,0 +1,36 @@
+"""Test Stage 9: structural Markdown validation."""
+from conversione._pipeline.stage9_validate import validate_markdown
+
+
+def test_valid_document():
+    md = "# Titolo\n\nTesto.\n\n## Sezione\n\nContenuto.\n"
+    result = validate_markdown(md)
+    assert result.is_valid
+    assert not result.errors
+
+
+def test_level_jump_detected():
+    md = "# A\n\n### B\n\nTesto.\n"
+    result = validate_markdown(md)
+    assert not result.is_valid
+    assert any("salto" in e.lower() or "livello" in e.lower() for e in result.errors)
+
+
+def test_no_headers_warning():
+    md = "Testo senza nessun header.\n\nAltro paragrafo.\n"
+    result = validate_markdown(md)
+    assert any("header" in w.lower() or "strutturato" in w.lower() for w in result.warnings)
+
+
+def test_inconsistent_table_warning():
+    md = "# Titolo\n\nTesto.\n\n| A | B |\n|---|---|\n| 1 | 2 | 3 |\n"
+    result = validate_markdown(md)
+    assert any("tabelle" in w.lower() or "colonne" in w.lower() for w in result.warnings)
+
+
+def test_to_dict():
+    md = "# A\n\nTesto.\n"
+    d = validate_markdown(md).to_dict()
+    assert "valid" in d
+    assert "errors" in d
+    assert "warnings" in d
Author	SHA1	Message	Date
davide	48567fa5e7	fix(verify): riconosce URL www. come terminatori validi + doc multi-documento - _URL_TAIL ora matcha anche www.* (non solo https://) — evita falsi blockers su watermark tipo www.docsity.com - README: documenta --collection / --stems per ingestion, retrieve e rag Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-12 11:21:24 +02:00
davide	8d972fa7c6	feat(ingestion): supporto multi-documento in unica collection ChromaDB Aggiunge la possibilità di unire più documenti in una singola collection ChromaDB, con chunk_id prefissati per stem e metadato source per filtrare. - ingest.py: --stems doc1 doc2 --collection nome (nuovo), --stem (invariato) - rag.py / retrieve.py: --collection, source nei chunk, verbose mostra [source] Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-12 11:21:17 +02:00
davide	5b63c423cc	feat(chunks): ottimizzazione chunking e post-processing - chunker.py: scrive meta.json con strategia e soglie effettive (target, min_chars, max_chars) per ogni documento chunked - verify_chunks.py: * _load_thresholds(): legge min/max da meta.json invece del TARGET_CHARS globale, eliminando il mismatch tra soglie chunker e verify (h3_aware target=600 -> range 450-750, non piu' validato a 225-375) * _ROMAN_END: esclude numeri romani finali (XV, XIV...) dagli incompleti perche' sono artefatti indice PDF, non frasi spezzate * PUNCT_END: aggiunge ; come fine valida (clausole legali italiane) - fix_chunks.py: * _load_thresholds(): usa max_chars da meta.json per split coerente * _SECONDARY_END: split secondario su ; per testo legale multi-clausola * Fase 1 (convergenza): risolve solo blockers (incomplete, empty, no_prefix) senza toccare warnings -- elimina il ciclo merge->too_long->split->incomplete->merge * Fase 2 (finale): una sola passata di merge too_short + split too_long dopo che i blockers sono azzerati Risultato su dirittopenale: da blocked (265 incomplete) a warnings_only in 2 iterazioni, senza cicli infiniti. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-12 11:09:28 +02:00
davide	587238f9f5	docs(conversione): aggiorna README — comandi, output e log di esecuzione Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-12 10:43:17 +02:00
davide	c381d7da3c	docs(readme): aggiunge sezioni configurazione modelli, test ollama, retrieval e RAG Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-12 10:39:27 +02:00
davide	b5fb363104	chore(config): tuning RAG — modello 4b, temperatura 0.2, chunk target 300 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-12 10:37:39 +02:00
davide	602dc87045	fix(ingestion): correggi path chunks da step-6/ a chunks/ Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-12 10:37:35 +02:00
davide	b49ef8edf0	docs: aggiorna README con flusso ingestion completo - README.md: aggiunge step 7 (ingestion) con verifica ambiente, comandi base e --force; aggiorna pipeline header e riferimenti - ingestion/README.md: rinomina da "Step 8" a "Ingestion", aggiorna riferimenti da step-6 a chunks/, aggiunge sezione "Verifica ambiente", corregge comandi con .venv/bin/python Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-11 16:05:23 +02:00
davide	9e1a72a9e6	refactor: rinomina step-8 → ingestion Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-11 15:58:54 +02:00
davide	70b304e1d4	docs(readme): flusso completo conversione → chunking Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-11 15:46:52 +02:00
davide	02c785678d	feat(chunks): target-based chunking con config centralizzata Introduce chunks/config.py come unica fonte di verità per tutti i parametri della pipeline di chunking. TARGET_CHARS + CHUNK_TOLERANCE sostituiscono MIN_CHARS/MAX_CHARS: il chunker mira a una dimensione target e si avvicina il più possibile rispettando il vincolo assoluto di terminare ogni chunk su un confine di frase (punto/punteggiatura). - config.py: TARGET_CHARS, CHUNK_TOLERANCE, SPLIT_THRESHOLD_FACTOR, PROTECT_TABLES, FIX_MAX_ITERATIONS, STRATEGY_OVERRIDES per strategia - chunker.py: algoritmo target-based (emit quando frase successiva sfora upper_body = upper - prefix_len), table protection atomica, override MIN/MAX/overlap per ciascuna delle 4 strategie - verify_chunks.py: soglie derivate da target*(1±tolerance) - fix_chunks.py: _split_at_boundary sempre su punteggiatura finale, loop ricorsivo fix→verify fino a FIX_MAX_ITERATIONS, split solo per chunk > upper × SPLIT_THRESHOLD_FACTOR Risultato su bitcoin: 694 chunk, 0 incompleti, 83% in range [450,750], tutti terminanti su punteggiatura indipendentemente dalla dimensione. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-11 15:45:24 +02:00
davide	508587c5bf	Merge branch 'chunks' into main Integrates the chunking pipeline (chunks/) with all the changes accumulated on the branch, including the 9-stage PDF→Markdown pipeline.	2026-05-11 14:51:53 +02:00
davide	ebd2a43f84	feat: integrate 9-stage PDF→Markdown pipeline and test suite Brings over from main the complete rewrite of conversione/_pipeline/ (9 PyMuPDF stages) and the tests/ suite without modifying chunks/, step-8/, rag.py, ollama/, retrieve.py, config.py. requirements.txt: adds PyMuPDF>=1.24.0 and pytest>=8.0, keeps chromadb, removes opendataloader-pdf and pymupdf4llm. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-11 14:46:16 +02:00
davide	e1b5298b20	feat: integrate 9-stage PDF→Markdown pipeline and test suite Brings over from the marker branch the complete rewrite of conversione/_pipeline/ (9 PyMuPDF stages) and the tests/ suite without modifying the rest of the RAG project (ollama/, step-5/, step-6/, step-8/, rag.py, retrieve.py, config.py). requirements.txt: adds PyMuPDF>=1.24.0 and pytest>=8.0, keeps chromadb, removes opendataloader-pdf and pymupdf4llm. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-11 14:44:16 +02:00
davide	a7b71fa508	refactor(skills): rename step6-fix → post-chunk - replaces .claude/commands/step6-fix.md with post-chunk.md - updates paths from step-6/ to chunks/ throughout the skill - adds incomplete_math handling to the report summary - broadened scope: complete workflow up to vectorization - CLAUDE.md: updates /step6-fix → /post-chunk	2026-04-20 14:25:31 +02:00
davide	fe0ecc24ad	feat(chunks): sentence-boundary flush, incomplete-math detection, structure profile export - chunker: extract _flush_chunk() with extension to the sentence boundary (max 120%) - verify: detect incomplete math chunks as warnings, handle hex hashes and URLs - conversione: export structure_profile.json to the output dir	2026-04-20 12:28:03 +02:00
davide	995a8be735	chore: clean up .gitignore: remove step-2..6, add chunks/	2026-04-20 12:25:04 +02:00
davide	c87a7cb3eb	refactor: remove step-5/ and step-6/, replaced by chunks/	2026-04-20 12:22:21 +02:00
davide	4c0e0db2a5	feat(chunks): add consolidated chunking pipeline New chunks/ folder with chunker.py (step 5), verify_chunks.py and fix_chunks.py (step 6). All I/O goes to chunks/<stem>/ instead of separate step-5/ and step-6/ folders. Input: conversione/<stem>/clean.md	2026-04-20 11:36:24 +02:00
davide	5215f53ad0	docs: condense README: remove verbose sections, keep the essentials	2026-04-20 11:21:01 +02:00
davide	4f28358ec1	feat: consolidated RAG pipeline: unified conversion, structure refactor, minimal CLAUDE.md The ollama branch introduces: - Unified PDF → Markdown conversion pipeline (conversione/pipeline.py) with 30+ transforms, replacing the old step-0..step-4 - Ollama environment (ollama/) with check_env.py and test_ollama.py - Removal of the obsolete step-0..step-4, step-7, step-9 folders - Consolidation of scripts at the root - CLAUDE.md rewritten: real paths, minimal instructions, agnostic to the step-X structure - validate.py with scoring geared toward chunking/vectorization - README realigned with the actual repo structure	2026-04-20 11:06:18 +02:00
davide	6f8785d90a	docs(CLAUDE.md): simplify instructions, remove hardcoded step-X paths	2026-04-20 11:05:20 +02:00
davide	c8167d4f01	fix: update paths step-4/ → conversione/ and step-X references - chunker.py: input from conversione/<stem>/ (was step-4/, which does not exist) - verify_chunks.py: error messages updated to conversione/ - config.py: comments step-8 → ingest.py	2026-04-19 00:03:43 +02:00
davide	e4dc0856bb	refactor: file cleanup	2026-04-17 18:52:37 +02:00
davide	af9ffc0559	docs(README): rewrite for the project's actual structure Replaces the step-0…step-10 structure with the actual pipeline: conversione/, /prepare-md review, chunking, verification, ollama/, vectorization, querying	2026-04-17 18:51:15 +02:00
davide	e02e3496a3	chore(requirements): remove obsolete step-X comments	2026-04-17 18:50:53 +02:00
davide	12effa1a51	refactor: remove step-7 and step-9, consolidate scripts at the root - step-9/: config.py, rag.py, retrieve.py → root; test_ollama.py → ollama/ - step-7/: removed, already covered by ollama/ - sys.path updated in rag.py, retrieve.py, ingest.py, check_env.py (step-7 and ollama) - References step-9/config.py → config.py in all files	2026-04-17 18:50:36 +02:00
davide	fc457e8525	feat(ollama): add step 7 (Ollama environment check) check_env.py script and README to check prerequisites before vectorization: ollama on the PATH, embedding and LLM models available, chromadb importable	2026-04-17 18:16:40 +02:00
davide	610d4db348	feat(conversione): unified PDF → Markdown pipeline, replaces step-0..4 Consolidates into a single module (conversione/pipeline.py) the entire conversion chain that was previously spread across step-0 (PDF validation), step-1 (quality inspection), step-2 (pymupdf4llm conversion), step-3 (structure detection) and step-4 (heuristic review). Main improvements over the separate steps: - XY-Cut++ conversion library (more accurate reading order) - 30+ heuristic transformations vs the original 7 in step-4 - validation, structure and chunking profile produced in a single pass - validate.py with scoring geared toward chunking/vectorization	2026-04-17 16:05:20 +02:00
davide	82f205faa2	chore: remove the now-obsolete step-0..step-4 folders The logic is consolidated in conversione/pipeline.py.	2026-04-17 16:04:59 +02:00
davide	368530bc25	refactor(docs): prepare-md skill replaces step4-review, CLAUDE.md without step-X	2026-04-17 13:44:41 +02:00
davide	cdb2d4cab9	fix(conversione): PUA Symbol, garbage headers, merge+bib guard, math EN	2026-04-17 13:44:30 +02:00
davide	ef8f56fdba	fix(conversione): 5 robustness and precision fixes for transforms - _t_remove_footnotes: removes inline superscript markers and footnote-body lines (¹ text, [N] text); new transform in early position - _t_numbered_sections: excludes bibliographic entries (year, pp., vol., DOI, ISBN) from promotion to ### header - _t_remove_toc: catches entries with a trailing page number in TOC context; removed the standalone _t_remove_toc_page_list - _t_remove_frontmatter: limited to the first ~20% of the document's sections - _t_remove_recurring_lines: threshold 3->5, Counter moved to top level	2026-04-17 12:06:25 +02:00
davide	0a8d98279c	feat(conversione): robustness and 7 new transforms - check_pdf: file < 1KB, sample extended to 15 pages, MemoryError - convert_pdf: output validation ≥ 100 chars - analyze: detection of inverted hierarchy h3 > h2 - _detect_language: FR/DE/ES support - 7 new transforms: fix_math_symbols, remove_recurring_lines, normalize_numbered_headings, remove_toc_page_list, restore_poetry_lines, demote_verse_headers, remove_watermarks - bug fixes: MD tables, lowercase garbage headers, empty headers - run(): MemoryError / UnicodeDecodeError / PermissionError	2026-04-17 11:53:42 +02:00