feat: demota #→## quando il documento usa h1 per sezioni principali

Aggiunge _t_demote_h1 in _headers.py: se il documento contiene ≥5 header # con contenuto testuale (lettere iniziali), i # vengono demotati a ## creando la gerarchia ## (parti) → ### (sezioni) invece di # → ###. Utile per manuali strutturati in parti principali (h1) con sezioni (h3) senza livello intermedio h2. La soglia di 5 evita falsi positivi su documenti con un solo titolo h1 o h1 da artefatti di encoding. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
feat: rileva note bibliografiche e raccolte multi-articolo in pipeline
2026-05-07 16:21:02 +02:00 · 2026-05-07 16:12:50 +02:00 · 2026-05-07 14:58:09 +02:00 · 2026-05-07 14:53:17 +02:00 · 2026-05-07 14:44:40 +02:00 · 2026-05-07 14:30:41 +02:00
34 changed files with 2329 additions and 4320 deletions
@@ -27,8 +27,11 @@ __pycache__/
 Thumbs.db
-# Output conversione/ — generati da conversione/pipeline.py
+# Output conversione/ — generati dagli script
 conversione/*/
 !conversione/_pipeline/
 !conversione/_pipeline/**
 conversione/_pipeline/__pycache__/
 # Output chunks/ — generati da chunks/chunker.py e chunks/verify_chunks.py
 chunks/*/
@@ -1,46 +1,167 @@
-# CLAUDE.md — RAG from Scratch
+# CLAUDE.md
 This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
 ## Missione
 Convertire PDF digitali in Markdown **perfetto per la vettorizzazione RAG**, senza revisione manuale. L'output deve essere testo pulito, strutturato in sezioni semanticamente coerenti, privo di artefatti, pronto per chunking e indicizzazione in un vector store.
 **Non supportato:** PDF scansionati (immagini), PDF protetti da password.
 ---
 ## Regole invarianti
 - **Lingua:** Rispondi sempre in italiano.
 - **Venv:** Usa `.venv/bin/python` o `source .venv/bin/activate`. Mai `pip`/`python` di sistema.
- **`raw.md` immutabile:** La copia di lavoro è sempre `clean.md`.
+- **`raw.md` immutabile:** Non modificare mai `raw.md`. La copia di lavoro è sempre `clean.md`.
 - **Obiettivo zero revisioni:** ogni miglioramento alla pipeline deve ridurre i casi in cui il `clean.md` richiede correzioni manuali.
 ---
-## Pipeline
+## Setup
-```
+```bash
-PDF → conversione → chunking → verifica → vettorizzazione → retrieval
+python -m venv .venv
 source .venv/bin/activate
 pip install -r requirements.txt
 # Java 11+ richiesto da opendataloader-pdf
 sudo apt install default-jdk   # Ubuntu/Debian/WSL
 java -version
 ```
-`--stem` = nome PDF senza estensione = nome collection ChromaDB.
+---
-Per i path degli script e degli output usa `git ls-files` o esplora la root: la struttura è in evoluzione verso un programma unico.
+## Comandi
 ```bash
 # Converti un PDF (posizionalo prima in sources/<nome>.pdf)
 .venv/bin/python conversione/ --stem <nome>
 # Tutti i PDF in sources/
 .venv/bin/python conversione/
 # Forza riesecuzione (sovrascrive clean.md esistente)
 .venv/bin/python conversione/ --stem <nome> --force
 # Validazione batch di tutti gli stem
 .venv/bin/python conversione/ validate
 # Validazione con dettaglio penalità
 .venv/bin/python conversione/ validate <stem> --detail
 # Rimuove l'output di uno stem
 bash conversione/clear.sh <nome>
 ```
 `--stem` = nome file PDF senza estensione.
 ---
-## Configurazione
+## Architettura
-`config.py` è la fonte di verità: `EMBED_MODEL`, `OLLAMA_MODEL`, `TOP_K`, `TEMPERATURE`, `SYSTEM_PROMPT`.
+Il codice è organizzato in `conversione/__main__.py` (entry point) e il package `conversione/_pipeline/` (logica modulare).
-**Se cambi `EMBED_MODEL`:** riesegui ingest con `--force` — embedding incoerenti non producono errori ma risposte insensate.
+```
 conversione/
 ├── __main__.py          # Entry point unificato: convert + validate
 ├── clear.sh             # Rimuove output di uno stem
 └── _pipeline/
    ├── __init__.py      # Re-export pubblico
    ├── extract.py       # _check_deps() + validate_pdf() + convert_pdf()
    ├── runner.py        # run() — orchestrazione 4 fasi
    ├── structure.py     # analyze() + rilevamento lingua e struttura
    ├── report.py        # build_report() → report.json
    ├── validator.py     # validate() + _score() + _grade()
    ├── _apply.py        # apply_transforms() — orchestratore in ordine semantico
    ├── _constants.py    # Regex compilate e mappe statiche condivise
    ├── _encoding.py     # Gruppo 1: PUA font Symbol, accenti LaTeX, simboli SI
    ├── _artifacts.py    # Gruppo 2: immagini, BR, footnote, URL, righe ricorrenti, watermark
    ├── _headers.py      # Gruppo 3: normalizzazione livelli, concat, bold, ALLCAPS
    ├── _structure.py    # Gruppo 4: TOC, ALLCAPS→##, sezioni numerate, math, articoli
    ├── _text.py         # Gruppo 5: merge paragrafi, whitespace, poesia, versi
    ├── _finish.py       # Gruppo 6: header vuoti/garbage, formula-header, frontmatter
    └── _helpers.py      # Funzioni pure condivise (_sentence_case, _is_allcaps_line, ecc.)
 ```
-**Se cambi `MIN_CHARS` / `MAX_CHARS`:** cerca tutte le occorrenze nel repo e sincronizza.
+### `__main__.py` — entry point unificato
 CLI con due modalità: conversione (default, `--stem`, `--force`) e validazione (subcommand `validate`, con stems opzionali e `--detail`). Aggiunge `conversione/` a `sys.path` e delega a `_pipeline`. Uso: `python conversione/ [--stem X] [--force]` oppure `python conversione/ validate [X] [--detail]`.
 ### `_pipeline/extract.py` — fronte PDF
 Raggruppa in un unico modulo le tre responsabilità legate al PDF: `_check_deps()` verifica che `opendataloader-pdf` e Java 11+ siano disponibili; `validate_pdf(pdf_path) -> (bool, str)` controlla esistenza, dimensione e presenza di testo digitale; `convert_pdf(pdf_path, out_dir) -> Path` invoca opendataloader-pdf con i parametri RAG-ottimali e restituisce il percorso del `.md` prodotto. Auto-rileva i PDF taggati (Word/InDesign) per attivare `use_struct_tree`.
 ### `_pipeline/_apply.py` — cuore della pipeline
 Contiene `apply_transforms(text) -> (text, stats)` che chiama ~35 trasformazioni atomiche (`_t_*`) in ordine semantico fisso — **non modificare l'ordine senza capire le dipendenze tra gruppi**. Ogni `_t_*` vive nel modulo del suo gruppo; le costanti regex compilate stanno in `_constants.py`.
 Ordine logico dei gruppi:
 1. **Encoding** (`_encoding.py`) — PUA font Symbol, accenti backtick LaTeX, moltiplicazione, micro
 2. **Pulizia artefatti** (`_artifacts.py`) — immagini, `<br>`, footnote superscript, URL, box symbol, righe ricorrenti, watermark
 3. **Struttura header** (`_headers.py`) — fix header+body concatenati, Capitolo inline, normalizzazione livelli numerati, `####`→`###`, bold, ALL-CAPS
 4. **Costruzione struttura** (`_structure.py`) — TOC rimosso, ALL-CAPS→`##`, sezioni numerate→`###`, ambienti matematici, articoli
 5. **Testo** (`_text.py`) — merge paragrafi spezzati, whitespace, blank lines, poesia, versi
 6. **Rifinitura** (`_finish.py`) — header vuoti, garbage header, merge titoli isolati, frontmatter
 Flag automatico: se il testo contiene "Esercizi/Problems/Homework", `_t_numbered_sections` non converte `- N. testo` in header (sono numerazioni di esercizi, non titoli).
 ### `_pipeline/structure.py` — analisi struttura
 `analyze(md_path) -> dict` conta `#`/`##`/`###`, rileva lingua (it/en/fr/de/es), sceglie `strategia_chunking`:
 | Strategia | Condizione |
 |-----------|------------|
 | `h3_aware` | ≥5 `###` |
 | `h2_paragraph_split` | ≥3 `##`, pochi `###` |
 | `paragraph` | struttura rada |
 | `sliding_window` | testo piatto |
 ### `_pipeline/report.py` — metriche qualità
 `build_report()` genera `report.json` con: statistiche trasformazioni, struttura, distribuzione lunghezze sezioni (`min`/`p25`/`mediana`/`p75`/`max`), anomalie (bare headers, sezioni corte/lunghe), residui con esempi (backtick, dot-leader, URL, `<br>`, simboli encoding, formule inline, footnote, PUA).
 ### `validate.py` — scoring
 Assegna un voto 0–100 (A/B/C/D/F) leggendo `report.json`. Penalità principali:
 | Problema | Penalità | Cap |
 |----------|----------|-----|
 | Struttura assente (livello 0) | −40 | — |
 | Struttura piatta (livello 1) | −15 | — |
 | Backtick residui | −2/cad | −20 |
 | Caratteri PUA font Symbol | −2/cad | −20 |
 | Dot-leader | −5/cad | −10 |
 | URL/watermark | −5/cad | −15 |
 | `<br>` inline | −2/cad | −15 |
 | Bare headers | −3/cad | −15 |
 ---
-## Workflow consigliato
+## Cosa rende un Markdown perfetto per la vettorizzazione
-1. Converti il PDF con lo script di conversione
+- **Struttura semantica:** header Markdown = confini naturali dei chunk; ogni sezione è un'unità concettuale.
-2. `/prepare-md conversione/<stem>/clean.md`
+- **Testo pulito:** nessun backtick, dot-leader, footnote superscript, carattere PUA, `<br>`.
-3. Chunking
+- **Paragrafi interi:** nessuna frase troncata da salto pagina PDF.
-4. Vettorizza con `--stem <stem>`
+- **Formule e simboli:** lettere greche e operatori in Unicode standard, non in font-encoding privato.
-6. `python rag.py --stem <stem>`
+- **Nessun rumore strutturale:** TOC, header/footer ripetuti, URL, watermark — tutto rimosso.
 - **Gerarchia corretta:** h1/h2/h3 riflettono la struttura logica, non il layout tipografico.
 ---
 ## Linee guida per migliorare la pipeline
 Quando si aggiunge una trasformazione in `apply_transforms()`:
 - Ogni `_t_*` deve restituire `(testo, n_modifiche)` — il contatore alimenta `report.json`.
 - Implementare la funzione nel modulo del gruppo corretto (`_encoding.py`, `_artifacts.py`, ecc.), importarla in `_apply.py` e inserire la coppia `("stat_key", _t_nuova)` nella lista `_transforms` nel punto logicamente corretto.
 - Compilare i pattern regex in `_constants.py` come costanti di modulo, mai dentro la funzione.
 - Testare con `.venv/bin/python conversione/ --stem <stem> --force` e confrontare `report.json`.
 - Un nuovo tipo di artefatto: prima aggiungerlo come residuo in `report.py` (funzione `_scan`), poi implementare la `_t_*` che lo rimuove.
 - I residui in `report.py` usano `_MATH_SYMBOLS_RE`, `_EXERCISE_TRIGGER_RE` e `_MATH_HDR_RE` da `transforms._constants` — non ridefinirli localmente.
 ---
 ## Skills custom
- `/prepare-md <path|stem>` — corregge `clean.md`: sillabazione, artefatti, header, paragrafi spezzati, gerarchia.
+- `/prepare-md <path|stem>` — corregge `clean.md` quando la pipeline non basta: sillabazione, artefatti residui, header malformati, gerarchia incoerente.
 - `/step6-fix <stem>` — verifica chunk, dry-run e applicazione fix via `fix_chunks.py`.
@@ -1,62 +1,9 @@
-# RAG from Scratch
+# PDF → Markdown
-Sistema RAG (Retrieval-Augmented Generation) costruito da zero, senza framework di alto livello.
+Converte PDF digitali in Markdown strutturato e pulito.
 Funziona su qualsiasi PDF digitale. Gira interamente in locale, senza GPU, senza cloud.
-**Stack:** Python · Ollama · ChromaDB  
+**Stack:** Python · opendataloader-pdf (XY-Cut++) · Java 11+  
-**Compatibile con:** Linux · macOS · Windows (WSL2) · CPU only · ~8 GB RAM
+**Compatibile con:** Linux · macOS · Windows (WSL2)
 ---
 ## Pipeline
 ```
 PDF → conversione → chunking → verifica → vettorizzazione → retrieval
 ```
 | Fase | Rischio | Motivo |
 |---|---|---|
 | Conversione | 🟡 Medio | Automatica, ma il PDF deve essere digitale e non protetto |
 | Revisione Markdown | 🔴 Alto | La qualità del MD determina la qualità del RAG |
 | Chunking | 🟡 Medio | Adattivo, dipende dalla qualità del MD |
 | Vettorizzazione | 🟢 Basso | Meccanica, lenta ma affidabile |
 | Retrieval | 🟡 Medio | Dipende dai parametri in `config.py` |
 ---
 ## Struttura del progetto
 ```
 rag/
 ├── sources/                  # PDF originali — non modificare
 ├── conversione/              # PDF → Markdown strutturato
 │   ├── pipeline.py
 │   ├── validate.py
 │   └── <stem>/
 │       ├── raw.md            # grezzo — non modificare
 │       ├── clean.md          # copia di lavoro
 │       └── report.json
 ├── step-5/                   # Chunking
 │   ├── chunker.py
 │   └── <stem>/chunks.json
 ├── step-6/                   # Verifica e fix chunk
 │   ├── verify_chunks.py
 │   ├── fix_chunks.py
 │   └── <stem>/
 │       ├── chunks.json
 │       └── report.json
 ├── step-8/                   # Vettorizzazione
 │   └── ingest.py
 ├── ollama/                   # Setup ambiente
 │   ├── check_env.py
 │   └── test_ollama.py
 ├── chroma_db/                # Vector store (generato)
 ├── config.py                 # Configurazione pipeline ← modifica qui
 ├── rag.py                    # Interrogazione RAG interattiva
 └── retrieve.py               # Retrieval puro (senza LLM)
 ```
 `--stem` = nome del PDF senza estensione = nome della collection ChromaDB.
 ---
@@ -68,92 +15,54 @@ source .venv/bin/activate
 pip install -r requirements.txt
 ```
-**Java 11+** richiesto per la conversione (`opendataloader-pdf`):
+**Java 11+** richiesto:
 ```bash
 sudo apt install default-jdk   # Ubuntu/Debian/WSL
-java -version                  # verifica
+java -version
 ```
 Vedi [`ollama/README.md`](ollama/README.md) per l'installazione di Ollama e il download dei modelli.
 ---
-## Workflow
+## Utilizzo
 ### 1. Converti il PDF
 ```bash
 # Singolo PDF
 python conversione/pipeline.py --stem <nome>
 # Tutti i PDF in sources/
 python conversione/pipeline.py
 # Forza riesecuzione
 python conversione/pipeline.py --stem <nome> --force
 ```
-Produce `conversione/<stem>/clean.md`. Vedi [`conversione/README.md`](conversione/README.md).
+`--stem` = nome file PDF senza estensione.  
-
+Esempio: `sources/analisi1.pdf` → `--stem analisi1`
 ### 2. Rivedi il Markdown
 ```
 /prepare-md conversione/<stem>/clean.md
 ```
 Passaggio più importante: la qualità del RAG dipende da questo.
 ### 3. Chunking
 ```bash
 python step-5/chunker.py --stem <nome>
 ```
 ### 4. Verifica e fix chunk
 ```bash
 python step-6/verify_chunks.py --stem <nome>
 python step-6/fix_chunks.py --stem <nome>     # se ci sono 🔴
 python step-6/verify_chunks.py --stem <nome>  # ri-verifica
 ```
 Non procedere alla vettorizzazione se ci sono 🔴.
 ### 5. Vettorizza
 ```bash
 python step-8/ingest.py --stem <nome>
 ```
 Vedi [`step-8/README.md`](step-8/README.md). Usa `--force` se hai cambiato `EMBED_MODEL` o i chunk.
 ### 6. Interroga
 ```bash
 python rag.py --stem <nome>       # risposta LLM
 python retrieve.py --stem <nome>  # retrieval puro (debug)
 ```
 ---
-## Configurazione (`config.py`)
+## Output
-| Parametro | Default | Descrizione |
+Per ogni stem in `conversione/<stem>/`:
 |---|---|---|
 | `EMBED_MODEL` | `"nomic-embed-text"` | Modello embedding — deve corrispondere tra ingest e retrieval |
 | `OLLAMA_MODEL` | `"qwen3.5:0.8b"` | Modello LLM |
 | `OLLAMA_URL` | `"http://localhost:11434"` | Endpoint Ollama |
 | `TOP_K` | `6` | Chunk recuperati per query |
 | `TEMPERATURE` | `0.0` | Deterministico a `0.0` |
 | `NO_THINK` | `True` | Disabilita chain-of-thought (Qwen3/Qwen3.5) |
 | `SYSTEM_PROMPT` | *(vedi file)* | Istruzioni di comportamento per il LLM |
-> Se cambi `EMBED_MODEL`, riesegui `step-8/ingest.py --stem <nome> --force`.
+| File | Descrizione |
 |------|-------------|
 | `raw.md` | Markdown grezzo — **non modificare** |
 | `clean.md` | Markdown pulito — copia di lavoro |
 | `structure_profile.json` | Struttura rilevata e metriche |
 | `report.json` | Statistiche complete della conversione |
 ---
-## Principi
+## Validazione batch
-**Atomico** — ogni fase fa una cosa sola; se si rompe qualcosa sai esattamente dove.
+```bash
 python conversione/validate.py
 ```
-**Verificabile** — ogni fase ha un criterio di completamento oggettivo prima di procedere.
+Stampa una tabella di stato su tutti gli stem convertiti.
-**Reversibile** — puoi tornare indietro senza perdere il lavoro delle altre fasi.
+---
-**Adattivo** — nessuna assunzione sulla struttura del documento; si adatta automaticamente.
+Vedi [`conversione/README.md`](conversione/README.md) per dettagli sulla pipeline e i tipi di documento supportati.
 **Locale** — nessuna API esterna, nessun dato trasmesso fuori dalla macchina.
@@ -1,443 +0,0 @@
 #!/usr/bin/env python3
 """
 Chunking adattivo
 Divide il Markdown revisionato in chunk semantici pronti per la
 vettorizzazione. La strategia dipende dal profilo strutturale del documento.
 Input:  conversione/<stem>/clean.md + conversione/<stem>/structure_profile.json
 Output: chunks/<stem>/chunks.json
 Uso:
    python chunks/chunker.py                    # tutti i documenti in conversione/
    python chunks/chunker.py --stem documento   # un solo documento
    python chunks/chunker.py --stem documento --force
 """
 import argparse
 import json
 import re
 import sys
 from pathlib import Path
 # ─── Parametri ────────────────────────────────────────────────────────────────
 MIN_CHARS = 200   # sotto questa soglia → accorpa al chunk successivo
 MAX_CHARS = 800   # sopra questa soglia → spezza su frasi
 OVERLAP_S = 2     # frasi di overlap tra sotto-chunk dello stesso boundary
 # ─── Utilità ──────────────────────────────────────────────────────────────────
 def split_sentences(text: str) -> list[str]:
    parts = re.split(r'(?<=[.!?»])\s+(?=[A-ZÀÈÉÌÒÙA-Z\"])', text.strip())
    if len(parts) <= 1:
        parts = re.split(r'(?<=[.!?»])\s+', text.strip())
    return [p.strip() for p in parts if p.strip()]
 def slugify(s: str, max_len: int = 60) -> str:
    s = s.lower()
    s = re.sub(r'[^\w\s-]', '', s)
    s = re.sub(r'[\s_-]+', '_', s).strip('_')
    return s[:max_len] if s else "section"
 _SENT_BOUNDARY = re.compile(r"[.!?»)\]'\u2019\"\u201c\u201d/:|\u2026]$")
 def _flush_chunk(
    current: list[str],
    sentences: list[str],
    i: int,
    prefix: str,
    sezione: str,
    titolo: str,
    sub_index: int,
    max_chars: int,
 ) -> tuple[dict, list[str], int, int]:
    """Emette un chunk, estendendo fino a un confine di frase (max +20%)."""
    hard_limit = int(max_chars * 1.2)
    current_len = sum(len(s) + 1 for s in current)
    while i < len(sentences) and not _SENT_BOUNDARY.search(" ".join(current)):
        nxt = sentences[i]
        if current_len + len(nxt) + 1 > hard_limit:
            break
        current.append(nxt)
        current_len += len(nxt) + 1
        i += 1
    chunk_text = prefix + " ".join(current)
    chunk = {
        "chunk_id": f"{slugify(sezione)}__{slugify(titolo)}__s{sub_index}",
        "text": chunk_text,
        "sezione": sezione,
        "titolo": titolo,
        "sub_index": sub_index,
        "n_chars": len(chunk_text),
    }
    return chunk, current, i, sub_index + 1
 def make_sub_chunks(
    body: str,
    prefix: str,
    sezione: str,
    titolo: str,
    max_chars: int,
    overlap_s: int,
 ) -> list[dict]:
    sentences = split_sentences(body)
    if not sentences:
        return []
    chunks = []
    current: list[str] = []
    current_len = 0
    sub_index = 0
    i = 0
    while i < len(sentences):
        sent = sentences[i]
        if not current or current_len + len(sent) + 1 <= max_chars:
            current.append(sent)
            current_len += len(sent) + (1 if len(current) > 1 else 0)
            i += 1
        else:
            chunk, current, i, sub_index = _flush_chunk(
                current, sentences, i, prefix, sezione, titolo, sub_index, max_chars
            )
            chunks.append(chunk)
            overlap = current[-overlap_s:] if overlap_s and len(current) > overlap_s else []
            current = overlap[:]
            current_len = sum(len(s) + 1 for s in current)
    if current:
        chunk_text = prefix + " ".join(current)
        chunks.append({
            "chunk_id": f"{slugify(sezione)}__{slugify(titolo)}__s{sub_index}",
            "text": chunk_text,
            "sezione": sezione,
            "titolo": titolo,
            "sub_index": sub_index,
            "n_chars": len(chunk_text),
        })
    return chunks
 # ─── Parser Markdown ──────────────────────────────────────────────────────────
 def parse_h3_sections(text: str) -> list[dict]:
    sections = []
    current_h2 = ""
    current_h3 = ""
    current_body_lines: list[str] = []
    def flush():
        body = "\n".join(current_body_lines).strip()
        if body:
            sections.append({
                "sezione": current_h2,
                "titolo": current_h3,
                "body": body,
            })
    for line in text.splitlines():
        if re.match(r"^# ", line):
            flush()
            current_h2 = line[2:].strip()
            current_h3 = ""
            current_body_lines = []
        elif re.match(r"^## ", line):
            flush()
            current_h2 = line[3:].strip()
            current_h3 = ""
            current_body_lines = []
        elif re.match(r"^### ", line):
            flush()
            current_h3 = line[4:].strip()
            current_body_lines = []
        else:
            current_body_lines.append(line)
    flush()
    return sections
 def parse_h2_sections(text: str) -> list[dict]:
    sections = []
    current_h2 = ""
    current_body_lines: list[str] = []
    def flush():
        body = "\n".join(current_body_lines).strip()
        if body:
            sections.append({"sezione": current_h2, "body": body})
    for line in text.splitlines():
        if re.match(r"^## ", line):
            flush()
            current_h2 = line[3:].strip()
            current_body_lines = []
        elif re.match(r"^# ", line):
            flush()
            current_h2 = line[2:].strip()
            current_body_lines = []
        else:
            current_body_lines.append(line)
    flush()
    return sections
 # ─── Strategie di chunking ────────────────────────────────────────────────────
 def chunk_h3_aware(text: str, stem: str) -> list[dict]:
    sections = parse_h3_sections(text)
    merged: list[dict] = []
    pending: dict | None = None
    for sec in sections:
        if pending is None:
            pending = dict(sec)
            continue
        if (pending["sezione"] == sec["sezione"]
                and len(pending["body"]) < MIN_CHARS):
            sep_title = " / ".join(filter(None, [pending["titolo"], sec["titolo"]]))
            pending = {
                "sezione": pending["sezione"],
                "titolo": sep_title or pending["titolo"],
                "body": pending["body"] + "\n\n" + sec["body"],
            }
        else:
            merged.append(pending)
            pending = dict(sec)
    if pending:
        merged.append(pending)
    chunks = []
    for sec in merged:
        sezione = sec["sezione"] or stem
        titolo = sec["titolo"] or ""
        body = sec["body"]
        prefix = f"[{sezione} > {titolo}]\n" if titolo else f"[{sezione}]\n"
        sub = make_sub_chunks(body, prefix, sezione, titolo, MAX_CHARS, OVERLAP_S)
        chunks.extend(sub)
    return chunks
 def chunk_h2_paragraph_split(text: str, stem: str) -> list[dict]:
    sections = parse_h2_sections(text)
    chunks = []
    for sec in sections:
        sezione = sec["sezione"] or stem
        body = sec["body"]
        prefix = f"[{sezione}]\n"
        paragraphs = [
            p.strip()
            for p in re.split(r"\n{2,}", body)
            if p.strip() and not re.match(r"^#+\s", p.strip())
        ]
        merged_pars: list[str] = []
        pending = ""
        for par in paragraphs:
            if pending and len(pending) < MIN_CHARS:
                pending = pending + "\n\n" + par
            else:
                if pending:
                    merged_pars.append(pending)
                pending = par
        if pending:
            merged_pars.append(pending)
        for idx, par in enumerate(merged_pars):
            sub = make_sub_chunks(par, prefix, sezione, f"par{idx}", MAX_CHARS, OVERLAP_S)
            for c in sub:
                c["chunk_id"] = f"{slugify(sezione)}__p{idx}__s{c['sub_index']}"
            chunks.extend(sub)
    return chunks
 def chunk_paragraph(text: str, stem: str) -> list[dict]:
    paragraphs = [
        p.strip()
        for p in re.split(r"\n{2,}", text)
        if p.strip() and not re.match(r"^#+\s", p.strip())
    ]
    prefix = f"[Documento: {stem}]\n"
    merged: list[str] = []
    pending = ""
    for par in paragraphs:
        if pending and len(pending) < MIN_CHARS:
            pending = pending + "\n\n" + par
        else:
            if pending:
                merged.append(pending)
            pending = par
    if pending:
        merged.append(pending)
    chunks = []
    for idx, par in enumerate(merged):
        sub = make_sub_chunks(par, prefix, stem, f"par{idx}", MAX_CHARS, OVERLAP_S)
        for c in sub:
            c["chunk_id"] = f"para__{idx}__s{c['sub_index']}"
        chunks.extend(sub)
    return chunks
 def chunk_sliding_window(text: str, stem: str) -> list[dict]:
    sentences = split_sentences(text)
    prefix = f"[Documento: {stem}]\n"
    chunks = []
    i = 0
    win_idx = 0
    while i < len(sentences):
        window: list[str] = []
        cur_len = 0
        j = i
        while j < len(sentences):
            s = sentences[j]
            if window and cur_len + len(s) + 1 > MAX_CHARS:
                break
            window.append(s)
            cur_len += len(s) + (1 if len(window) > 1 else 0)
            j += 1
        if not window:
            window = [sentences[i]]
            j = i + 1
        chunk_text = prefix + " ".join(window)
        chunks.append({
            "chunk_id": f"win__{win_idx}",
            "text": chunk_text,
            "sezione": stem,
            "titolo": f"finestra {win_idx}",
            "sub_index": win_idx,
            "n_chars": len(chunk_text),
        })
        win_idx += 1
        i += max(1, len(window) - OVERLAP_S)
    return chunks
 # ─── Dispatcher ───────────────────────────────────────────────────────────────
 _STRATEGIES: dict[str, callable] = {
    "h3_aware": chunk_h3_aware,
    "h2_paragraph_split": chunk_h2_paragraph_split,
    "paragraph": chunk_paragraph,
    "sliding_window": chunk_sliding_window,
 }
 def chunk_document(clean_md: Path, profile: dict, stem: str) -> list[dict]:
    text = clean_md.read_text(encoding="utf-8")
    strategia = profile.get("strategia_chunking", "paragraph")
    fn = _STRATEGIES.get(strategia, chunk_paragraph)
    return fn(text, stem)
 # ─── Per-document processing ──────────────────────────────────────────────────
 def process_stem(stem: str, project_root: Path, force: bool) -> bool:
    conv_dir  = project_root / "conversione" / stem
    out_dir   = project_root / "chunks" / stem
    clean_md  = conv_dir / "clean.md"
    profile_path = conv_dir / "structure_profile.json"
    out_file  = out_dir / "chunks.json"
    print(f"\nDocumento: {stem}")
    if not clean_md.exists():
        print(f"  ✗ clean.md non trovato in conversione/{stem}/ — skip")
        return False
    if not profile_path.exists():
        print(f"  ✗ structure_profile.json non trovato in conversione/{stem}/ — skip")
        return False
    if out_file.exists() and not force:
        print(f"  ⚠️  chunks.json già presente — skip")
        print(f"       (usa --force per rieseguire)")
        return True
    profile   = json.loads(profile_path.read_text(encoding="utf-8"))
    strategia = profile.get("strategia_chunking", "paragraph")
    print(f"  Strategia: {strategia}")
    chunks = chunk_document(clean_md, profile, stem)
    if not chunks:
        print(f"  ✗ Nessun chunk generato — controlla clean.md")
        return False
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file.write_text(
        json.dumps(chunks, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    lengths = [c["n_chars"] for c in chunks]
    min_c = min(lengths)
    max_c = max(lengths)
    avg_c = int(sum(lengths) / len(lengths))
    short = sum(1 for l in lengths if l < MIN_CHARS)
    long_ = sum(1 for l in lengths if l > MAX_CHARS * 1.5)
    print(f"  Chunk totali: {len(chunks)}")
    print(f"  Min: {min_c} char  Max: {max_c} char  Media: {avg_c} char")
    if short:
        print(f"  ⚠️  {short} chunk sotto MIN_CHARS ({MIN_CHARS})")
    if long_:
        print(f"  ⚠️  {long_} chunk sopra MAX_CHARS×1.5 ({int(MAX_CHARS * 1.5)})")
    print(f"  ✅ chunks.json salvato in chunks/{stem}/")
    return True
 # ─── Entry point ─────────────────────────────────────────────────────────────
 if __name__ == "__main__":
    project_root = Path(__file__).parent.parent
    parser = argparse.ArgumentParser(description="Chunking adattivo")
    parser.add_argument("--stem", help="Nome del documento (sottocartella di conversione/)")
    parser.add_argument("--force", action="store_true", help="Riesegui anche se già presente")
    args = parser.parse_args()
    if args.stem:
        stems = [args.stem]
    else:
        conv_dir = project_root / "conversione"
        if not conv_dir.exists():
            print(f"Errore: cartella conversione/ non trovata in {project_root}")
            sys.exit(1)
        stems = sorted(
            p.name for p in conv_dir.iterdir()
            if p.is_dir() and (p / "clean.md").exists()
        )
        if not stems:
            print(f"Errore: nessun documento trovato in conversione/")
            sys.exit(1)
    results = [process_stem(s, project_root, args.force) for s in stems]
    ok    = sum(results)
    total = len(results)
    print(f"\n{'✅' if all(results) else '⚠️ '} {ok}/{total} documenti processati")
    sys.exit(0 if all(results) else 1)
@@ -1,283 +0,0 @@
 #!/usr/bin/env python3
 """
 Fix chunk
 Applica correzioni dirette su chunks/<stem>/chunks.json basandosi sul
 report.json prodotto da verify_chunks.py. Non tocca clean.md.
 Fixes applicati:
  empty      → rimuove il chunk
  incomplete → fonde con il chunk successivo (la frase continua)
  no_prefix  → aggiunge prefisso [sezione > titolo] se mancante
  too_short  → fonde con il chunk adiacente nello stesso sezione
  too_long   → spezza all'ultimo confine di paragrafo/frase entro MAX_CHARS
 Input:  chunks/<stem>/chunks.json  +  chunks/<stem>/report.json
 Output: chunks/<stem>/chunks.json  (sovrascrive)
 Uso:
    python chunks/fix_chunks.py --stem documento
    python chunks/fix_chunks.py --stem documento --dry-run
 """
 import argparse
 import json
 import re
 import sys
 from pathlib import Path
 MAX_CHARS = 800
 PUNCT_END = re.compile(r"[.!?»)\]'\u2019\"\u201c\u201d\u2018\u2014\u2013-]$")
 # ─── Helpers ──────────────────────────────────────────────────────────────────
 def _prefix(chunk: dict) -> str:
    sezione = chunk.get("sezione", "")
    titolo  = chunk.get("titolo", "")
    if titolo:
        return f"[{sezione} > {titolo}]"
    return f"[{sezione}]"
 def _strip_prefix(text: str) -> str:
    text = text.lstrip()
    if text.startswith("["):
        end = text.find("]")
        if end != -1:
            return text[end + 1:].lstrip("\n")
    return text
 def _rebuild_text(chunk: dict, body: str) -> str:
    return f"{_prefix(chunk)}\n{body}"
 def _split_at_boundary(text: str, max_chars: int) -> list[str]:
    if len(text) <= max_chars:
        return [text]
    parts = []
    remaining = text
    while len(remaining) > max_chars:
        candidate = remaining[:max_chars]
        split_pos = candidate.rfind("\n\n")
        if split_pos == -1:
            m = None
            for m in re.finditer(r"[.!?»]\s+", candidate):
                pass
            split_pos = m.end() if m else None
        if split_pos is None or split_pos == 0:
            sp = remaining.find(" ", max_chars)
            split_pos = sp if sp != -1 else len(remaining)
        parts.append(remaining[:split_pos].rstrip())
        remaining = remaining[split_pos:].lstrip()
    if remaining:
        parts.append(remaining)
    return [p for p in parts if p.strip()]
 # ─── Operazioni sui chunk ─────────────────────────────────────────────────────
 def fix_empty(chunks: list[dict], empty_ids: set[str]) -> tuple[list[dict], int]:
    before = len(chunks)
    chunks = [c for c in chunks if c["chunk_id"] not in empty_ids]
    return chunks, before - len(chunks)
 def fix_no_prefix(chunks: list[dict], no_prefix_ids: set[str]) -> tuple[list[dict], int]:
    count = 0
    for c in chunks:
        if c["chunk_id"] in no_prefix_ids:
            body = _strip_prefix(c["text"])
            c["text"] = _rebuild_text(c, body)
            c["n_chars"] = len(c["text"])
            count += 1
    return chunks, count
 def fix_incomplete_and_short(chunks: list[dict],
                              problem_ids: set[str]) -> tuple[list[dict], int]:
    merged = 0
    i = 0
    result: list[dict] = []
    while i < len(chunks):
        c = chunks[i]
        if c["chunk_id"] in problem_ids and i + 1 < len(chunks):
            nxt = chunks[i + 1]
            body_c   = _strip_prefix(c["text"])
            body_nxt = _strip_prefix(nxt["text"])
            merged_body = body_c.rstrip() + "\n" + body_nxt.lstrip()
            nxt["text"]    = _rebuild_text(nxt, merged_body)
            nxt["n_chars"] = len(nxt["text"])
            merged += 1
            i += 1
            continue
        result.append(c)
        i += 1
    return result, merged
 def fix_too_long(chunks: list[dict],
                 too_long_ids: set[str],
                 max_chars: int) -> tuple[list[dict], int]:
    result: list[dict] = []
    split_count = 0
    for c in chunks:
        if c["chunk_id"] not in too_long_ids:
            result.append(c)
            continue
        body  = _strip_prefix(c["text"])
        parts = _split_at_boundary(body, max_chars)
        if len(parts) == 1:
            result.append(c)
            continue
        base_id  = re.sub(r"__s\d+$", "", c["chunk_id"])
        base_sub = c.get("sub_index", 0)
        for j, part in enumerate(parts):
            new_chunk = dict(c)
            new_chunk["sub_index"] = base_sub + j
            new_chunk["chunk_id"]  = f"{base_id}__s{base_sub + j}"
            new_chunk["text"]      = _rebuild_text(new_chunk, part)
            new_chunk["n_chars"]   = len(new_chunk["text"])
            result.append(new_chunk)
        split_count += 1
    return result, split_count
 def renumber_ids(chunks: list[dict]) -> list[dict]:
    seen: dict[str, int] = {}
    for c in chunks:
        base = re.sub(r"__s\d+$", "", c["chunk_id"])
        idx  = seen.get(base, 0)
        c["chunk_id"]  = f"{base}__s{idx}"
        c["sub_index"] = idx
        seen[base] = idx + 1
    return chunks
 # ─── Core ─────────────────────────────────────────────────────────────────────
 def fix_stem(stem: str, project_root: Path, max_chars: int, dry_run: bool) -> bool:
    stem_dir    = project_root / "chunks" / stem
    chunks_path = stem_dir / "chunks.json"
    report_path = stem_dir / "report.json"
    if not chunks_path.exists():
        print(f"✗ chunks/{stem}/chunks.json non trovato.")
        print(f"  Esegui prima: python chunks/chunker.py --stem {stem}")
        return False
    if not report_path.exists():
        print(f"✗ chunks/{stem}/report.json non trovato.")
        print(f"  Esegui prima: python chunks/verify_chunks.py --stem {stem}")
        return False
    chunks: list[dict] = json.loads(chunks_path.read_text(encoding="utf-8"))
    report: dict       = json.loads(report_path.read_text(encoding="utf-8"))
    verdict = report.get("verdict", "ok")
    print(f"\nDocumento: {stem}  (verdict: {verdict})")
    if verdict == "ok":
        print("  ✅ Nessun problema — nulla da correggere.")
        return True
    empty_ids      = {e["chunk_id"] for e in report.get("blockers", {}).get("empty", [])}
    no_prefix_ids  = {e["chunk_id"] for e in report.get("blockers", {}).get("no_prefix", [])}
    incomplete_ids = {e["chunk_id"] for e in report.get("blockers", {}).get("incomplete", [])}
    too_short_ids  = {e["chunk_id"] for e in report.get("warnings", {}).get("too_short", [])}
    too_long_ids   = {e["chunk_id"] for e in report.get("warnings", {}).get("too_long", [])}
    ops: list[str] = []
    if empty_ids:
        ops.append(f"  🗑  rimuovi {len(empty_ids)} chunk vuoti")
    if no_prefix_ids:
        ops.append(f"  🔧 aggiungi prefisso a {len(no_prefix_ids)} chunk")
    if incomplete_ids:
        ops.append(f"  🔗 fondi {len(incomplete_ids)} chunk incompleti col successivo")
    if too_short_ids:
        ops.append(f"  🔗 fondi {len(too_short_ids)} chunk troppo corti col successivo")
    if too_long_ids:
        ops.append(f"  ✂️  spezza {len(too_long_ids)} chunk troppo lunghi")
    if not ops:
        print("  ✅ Nessuna correzione necessaria.")
        return True
    print("\n  Operazioni pianificate:")
    for op in ops:
        print(op)
    if dry_run:
        print("\n  [dry-run] Nessuna modifica applicata.")
        return True
    n_before = len(chunks)
    if empty_ids:
        chunks, n = fix_empty(chunks, empty_ids)
        print(f"\n  🗑  Rimossi {n} chunk vuoti.")
    if no_prefix_ids:
        chunks, n = fix_no_prefix(chunks, no_prefix_ids)
        print(f"  🔧 Aggiunto prefisso a {n} chunk.")
    merge_ids = incomplete_ids | too_short_ids
    if merge_ids:
        chunks, n = fix_incomplete_and_short(chunks, merge_ids)
        print(f"  🔗 Fusi {n} chunk (incompleti + corti).")
    if too_long_ids:
        chunks, n = fix_too_long(chunks, too_long_ids, max_chars)
        print(f"  ✂️  Spezzati {n} chunk lunghi.")
    chunks = renumber_ids(chunks)
    n_after = len(chunks)
    print(f"\n  Totale chunk: {n_before} → {n_after}")
    chunks_path.write_text(
        json.dumps(chunks, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    print(f"  ✅ Salvato: chunks/{stem}/chunks.json")
    print(f"\n  Riesegui la verifica:")
    print(f"     python chunks/verify_chunks.py --stem {stem}")
    return True
 # ─── Entry point ──────────────────────────────────────────────────────────────
 if __name__ == "__main__":
    project_root = Path(__file__).parent.parent
    parser = argparse.ArgumentParser(description="Fix chunk")
    parser.add_argument("--stem", required=True, help="Nome del documento (sottocartella di chunks/)")
    parser.add_argument(
        "--max", type=int, default=MAX_CHARS,
        help=f"Soglia massima caratteri per lo split (default: {MAX_CHARS})"
    )
    parser.add_argument(
        "--dry-run", action="store_true",
        help="Mostra le operazioni pianificate senza applicarle"
    )
    args = parser.parse_args()
    ok = fix_stem(args.stem, project_root, args.max, args.dry_run)
    sys.exit(0 if ok else 1)
@@ -1,334 +0,0 @@
 #!/usr/bin/env python3
 """
 Verifica chunk
 Analizza chunks/<stem>/chunks.json e segnala ogni anomalia che potrebbe
 degradare la qualità del retrieval. Non modifica nulla.
 Input:  chunks/<stem>/chunks.json
 Output: report a schermo + chunks/<stem>/report.json + exit code (0 = OK, 1 = problemi)
 Uso:
    python chunks/verify_chunks.py --stem documento
    python chunks/verify_chunks.py                    # tutti i documenti in chunks/
    python chunks/verify_chunks.py --min 200 --max 800
 """
 import argparse
 import json
 import re
 import sys
 from pathlib import Path
 # ─── Soglie ───────────────────────────────────────────────────────────────────
 MIN_CHARS = 200
 MAX_CHARS = 800
 PUNCT_END = re.compile(
    r"[.!?»)\]'\u2019\"\u201c\u201d\u2018\u2014\u2013\u2026]$"
    r"|/$"    # URL che finisce con /
    r"|\|$"   # riga di tabella Markdown
    r"|:$"    # introduzione a lista o formula
 )
 _HEX_END     = re.compile(r"[0-9a-fA-F]{8,}$")
 _URL_TAIL    = re.compile(r"https?://\S+(\s+\S+){0,3}$")  # URL con fino a 3 token extra
 _MATH_SYMS   = re.compile(r"[∈∑≤≥≠∀∃∫√∞∂±×÷→←↔⊂⊃⊆⊇∩∪·°]")
 # ─── Checks ───────────────────────────────────────────────────────────────────
 def has_prefix(chunk: dict) -> bool:
    return chunk.get("text", "").lstrip().startswith("[")
 def is_empty(chunk: dict) -> bool:
    return not chunk.get("text", "").strip()
 def is_too_short(chunk: dict, min_chars: int) -> bool:
    return chunk.get("n_chars", 0) < min_chars
 def is_too_long(chunk: dict, max_chars: int) -> bool:
    return chunk.get("n_chars", 0) > max_chars * 1.5
 def ends_incomplete(chunk: dict) -> bool:
    text = chunk.get("text", "").rstrip()
    if not text:
        return False
    text_check = re.sub(r"[_*]+$", "", text).rstrip()
    if not text_check:
        return False
    if PUNCT_END.search(text_check):
        return False
    if _HEX_END.search(text_check):   # hash SHA / codice hex
        return False
    if _URL_TAIL.search(text_check[-200:]):  # URL (con eventuale path dopo spazio)
        return False
    return True
 def is_math_incomplete(chunk: dict) -> bool:
    """Incompleto ma in contesto matematico — degrada a warning invece di blocker."""
    return ends_incomplete(chunk) and len(_MATH_SYMS.findall(chunk.get("text", ""))) >= 3
 # ─── Report ───────────────────────────────────────────────────────────────────
 def _fmt_chunk(c: dict) -> str:
    cid     = c.get("chunk_id", "?")
    n       = c.get("n_chars", 0)
    preview = c.get("text", "")[:60].replace("\n", " ")
    return f"  [{cid}] ({n} char) «{preview}»"
 def verify_stem(stem: str, project_root: Path, min_chars: int, max_chars: int) -> bool:
    chunks_path = project_root / "chunks" / stem / "chunks.json"
    print(f"\nDocumento: {stem}")
    if not chunks_path.exists():
        print(f"  ✗ chunks/{stem}/chunks.json non trovato")
        print(f"    Esegui prima: python chunks/chunker.py --stem {stem}")
        return False
    chunks: list[dict] = json.loads(chunks_path.read_text(encoding="utf-8"))
    if not chunks:
        print(f"  ✗ chunks.json è vuoto")
        return False
    # ── Raccogli problemi ──────────────────────────────────────────────────────
    empty_chunks      = [c for c in chunks if is_empty(c)]
    no_prefix         = [c for c in chunks if not is_empty(c) and not has_prefix(c)]
    too_short         = [c for c in chunks if is_too_short(c, min_chars)]
    too_long          = [c for c in chunks if is_too_long(c, max_chars)]
    _incomplete_all   = [c for c in chunks if not is_empty(c) and ends_incomplete(c)]
    incomplete_math   = [c for c in _incomplete_all if is_math_incomplete(c)]
    incomplete        = [c for c in _incomplete_all if not is_math_incomplete(c)]
    # ── Statistiche ───────────────────────────────────────────────────────────
    lengths = [c.get("n_chars", 0) for c in chunks]
    n_total = len(chunks)
    n_ok    = n_total - len(set(
        c["chunk_id"]
        for lst in [empty_chunks, no_prefix, too_short, too_long, incomplete]
        for c in lst
    ))
    min_l = min(lengths)
    max_l = max(lengths)
    avg_l = int(sum(lengths) / n_total)
    n_under  = sum(1 for l in lengths if l < min_chars)
    n_normal = sum(1 for l in lengths if min_chars <= l <= max_chars)
    n_over   = sum(1 for l in lengths if l > max_chars)
    # ── Output ────────────────────────────────────────────────────────────────
    print(f"  Totale chunk:  {n_total}")
    print(f"  ✅ OK:         {n_ok}")
    print()
    print(f"  Distribuzione lunghezze:")
    print(f"    Min:   {min_l} char")
    print(f"    Max:   {max_l} char")
    print(f"    Media: {avg_l} char")
    print(f"    < {min_chars} char (sotto MIN): {n_under}")
    print(f"    {min_chars}–{max_chars} char (ideale):  {n_normal}")
    print(f"    > {max_chars} char (sopra MAX): {n_over}")
    has_errors = False
    if empty_chunks:
        has_errors = True
        print(f"\n  🔴 {len(empty_chunks)} chunk VUOTI:")
        for c in empty_chunks[:5]:
            print(f"  [{c.get('chunk_id', '?')}]")
        if len(empty_chunks) > 5:
            print(f"  ... e altri {len(empty_chunks) - 5}")
    if no_prefix:
        has_errors = True
        print(f"\n  🔴 {len(no_prefix)} chunk SENZA PREFISSO DI CONTESTO:")
        for c in no_prefix[:5]:
            print(_fmt_chunk(c))
        if len(no_prefix) > 5:
            print(f"  ... e altri {len(no_prefix) - 5}")
        print(f"  → Causa probabile: header ### mancanti o malformati nel MD")
    if too_short:
        has_errors = True
        print(f"\n  🟡 {len(too_short)} chunk SOTTO MIN_CHARS ({min_chars}):")
        for c in too_short[:5]:
            print(_fmt_chunk(c))
        if len(too_short) > 5:
            print(f"  ... e altri {len(too_short) - 5}")
        print(f"  → Soluzione: abbassa MIN_CHARS o revisiona il MD")
    if too_long:
        has_errors = True
        print(f"\n  🟡 {len(too_long)} chunk SOPRA MAX_CHARS×1.5 ({int(max_chars * 1.5)}):")
        for c in too_long[:5]:
            print(_fmt_chunk(c))
        if len(too_long) > 5:
            print(f"  ... e altri {len(too_long) - 5}")
        print(f"  → Soluzione: alza MAX_CHARS o verifica il testo nel MD")
    if incomplete:
        has_errors = True
        print(f"\n  🔴 {len(incomplete)} chunk CHE FINISCONO SENZA PUNTEGGIATURA (frase spezzata):")
        for c in incomplete[:5]:
            last_line = c.get("text", "").rstrip().split("\n")[-1][-80:]
            print(f"  [{c.get('chunk_id', '?')}] ...{last_line!r}")
        if len(incomplete) > 5:
            print(f"  ... e altri {len(incomplete) - 5}")
        print(f"  → Soluzione: correggi le righe spezzate in conversione/{stem}/clean.md")
    if incomplete_math:
        has_errors = True
        print(f"\n  🟡 {len(incomplete_math)} chunk MATEMATICI SENZA PUNTEGGIATURA (formula/espressione):")
        for c in incomplete_math[:3]:
            last_line = c.get("text", "").rstrip().split("\n")[-1][-80:]
            print(f"  [{c.get('chunk_id', '?')}] ...{last_line!r}")
        if len(incomplete_math) > 3:
            print(f"  ... e altri {len(incomplete_math) - 3}")
        print(f"  → Le formule non finiscono con punteggiatura — avviso non bloccante")
    # ── Costruisci e salva report.json ────────────────────────────────────────
    blockers = empty_chunks + no_prefix + incomplete
    warnings = too_short + too_long + incomplete_math
    def _chunk_entry(c: dict) -> dict:
        return {
            "chunk_id":  c.get("chunk_id", ""),
            "sezione":   c.get("sezione", ""),
            "titolo":    c.get("titolo", ""),
            "n_chars":   c.get("n_chars", 0),
            "last_text": c.get("text", "").rstrip().split("\n")[-1][-120:],
        }
    verdict = "ok" if not blockers else "blocked"
    if not blockers and warnings:
        verdict = "warnings_only"
    report = {
        "stem":    stem,
        "verdict": verdict,
        "stats": {
            "total":     n_total,
            "ok":        n_ok,
            "min_chars": min_l,
            "max_chars": max_l,
            "avg_chars": avg_l,
        },
        "thresholds": {"min_chars": min_chars, "max_chars": max_chars},
        "blockers": {
            "empty":      [_chunk_entry(c) for c in empty_chunks],
            "no_prefix":  [_chunk_entry(c) for c in no_prefix],
            "incomplete": [_chunk_entry(c) for c in incomplete],
        },
        "warnings": {
            "too_short":       [_chunk_entry(c) for c in too_short],
            "too_long":        [_chunk_entry(c) for c in too_long],
            "incomplete_math": [_chunk_entry(c) for c in incomplete_math],
        },
    }
    out_dir = project_root / "chunks" / stem
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "report.json").write_text(
        json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    print(f"\n  report.json salvato in chunks/{stem}/")
    # ── Prossimi passi ────────────────────────────────────────────────────────
    print(f"\n  {'─' * 50}")
    print(f"  PROSSIMI PASSI")
    print(f"  {'─' * 50}")
    if not blockers and not warnings:
        print(f"  ✅ Tutto OK — procedi alla vettorizzazione:")
        print(f"       python step-8/ingest.py --stem {stem}")
    elif not blockers:
        print(f"  🟡 Solo avvisi minori — puoi procedere alla vettorizzazione:")
        print(f"       python step-8/ingest.py --stem {stem}")
        print()
        print(f"  Oppure, per ottimizzare prima:")
        if too_short:
            pct = int(len(too_short) / n_total * 100)
            print(f"    • {len(too_short)} chunk corti ({pct}% del totale)")
        if too_long:
            pct = int(len(too_long) / n_total * 100)
            print(f"    • {len(too_long)} chunk lunghi ({pct}% del totale)")
        if too_short or too_long:
            print(f"      → Esegui: python chunks/fix_chunks.py --stem {stem} --dry-run")
            print(f"        poi:     python chunks/fix_chunks.py --stem {stem}")
            print(f"        poi:     python chunks/verify_chunks.py --stem {stem}")
    else:
        print(f"  🔴 Problemi bloccanti — correggi prima di procedere:")
        print()
        if empty_chunks:
            print(f"    • {len(empty_chunks)} chunk vuoti")
            print(f"      → Controlla conversione/{stem}/clean.md per sezioni prive di testo")
        if no_prefix:
            print(f"    • {len(no_prefix)} chunk senza prefisso di contesto")
            print(f"      → Controlla che gli header ### siano corretti in conversione/{stem}/clean.md")
        if incomplete:
            print(f"    • {len(incomplete)} chunk con frase spezzata")
            print(f"      → Esegui: python chunks/fix_chunks.py --stem {stem}")
        print()
        print(f"  Dopo le correzioni, riesegui nell'ordine:")
        print(f"       python chunks/chunker.py --stem {stem} --force")
        print(f"       python chunks/verify_chunks.py --stem {stem}")
        print()
        if warnings:
            print(f"  🟡 Hai anche {len(warnings)} avvisi minori — affrontali dopo aver risolto i 🔴.")
    return not blockers
 # ─── Entry point ──────────────────────────────────────────────────────────────
 if __name__ == "__main__":
    project_root = Path(__file__).parent.parent
    parser = argparse.ArgumentParser(description="Verifica chunk")
    parser.add_argument("--stem", help="Nome del documento (sottocartella di chunks/)")
    parser.add_argument(
        "--min", type=int, default=MIN_CHARS,
        help=f"Soglia minima caratteri (default: {MIN_CHARS})"
    )
    parser.add_argument(
        "--max", type=int, default=MAX_CHARS,
        help=f"Soglia massima caratteri (default: {MAX_CHARS})"
    )
    args = parser.parse_args()
    if args.stem:
        stems = [args.stem]
    else:
        chunks_dir = project_root / "chunks"
        if not chunks_dir.exists():
            print(f"Errore: cartella chunks/ non trovata in {project_root}")
            sys.exit(1)
        stems = sorted(
            p.name for p in chunks_dir.iterdir()
            if p.is_dir() and (p / "chunks.json").exists()
        )
        if not stems:
            print("Errore: nessun chunks.json trovato in chunks/")
            sys.exit(1)
    results = [verify_stem(s, project_root, args.min, args.max) for s in stems]
    ok    = sum(results)
    total = len(results)
    print(f"\n{'✅' if all(results) else '⚠️ '} {ok}/{total} documenti senza problemi")
    sys.exit(0 if all(results) else 1)
@@ -1,54 +0,0 @@
 # ─── Configurazione RAG ───────────────────────────────────────────────────────
 #
 # Modifica questo file per cambiare i parametri della pipeline.
 #
 # Uso:
 #   python rag.py --stem nietzsche
 # ──────────────────────────────────────────────────────────────────────────────
 # ── Retrieval ─────────────────────────────────────────────────────────────────
 # Numero di chunk da recuperare per ogni domanda.
 # Valori più alti = più contesto, risposte potenzialmente più complete,
 # ma prompt più lunghi e generazione più lenta.
 TOP_K = 6
 # ── Generazione ───────────────────────────────────────────────────────────────
 # Temperatura del modello LLM.
 # 0.0 = completamente deterministico (stessa risposta ad ogni run)
 # 0.7 = più creativo e vario
 TEMPERATURE = 0.0
 # Disabilita il "thinking" (ragionamento interno) nei modelli Qwen3/Qwen3.5.
 # True  = risposta diretta, più veloce
 # False = ragionamento interno abilitato (più lento ma potenzialmente più accurato)
 NO_THINK = True
 # ── Embedding ─────────────────────────────────────────────────────────────────
 # Modello di embedding usato da Ollama.
 # Deve corrispondere al modello usato durante la vettorizzazione (ingest.py).
 # Se cambi questo, devi rieseguire ingest.py con --force.
 EMBED_MODEL = "nomic-embed-text"
 # ── Ollama ────────────────────────────────────────────────────────────────────
 # URL del server Ollama (default: locale sulla porta 11434).
 OLLAMA_URL = "http://localhost:11434"
 # Modello LLM. Scegli in base alla RAM disponibile (vedi README).
 OLLAMA_MODEL = "qwen3.5:0.8b"
 # ── Prompt di sistema ─────────────────────────────────────────────────────────
 # Istruzioni di comportamento inviate al LLM prima del contesto e della domanda.
 # Modifica per cambiare il tono, la lingua, il grado di libertà interpretativa
 # o le condizioni di fallback ("non so rispondere").
 SYSTEM_PROMPT = (
    "Sei un assistente che risponde usando il contesto fornito. "
    "Sintetizza e interpreta liberamente i passaggi del contesto per rispondere alla domanda. "
    "Se il contesto contiene informazioni pertinenti, anche indirette, usale per costruire una risposta. "
    "Solo se il contesto è completamente irrilevante, rispondi: "
    "\"Non trovo questa informazione nel documento.\""
 )
@@ -0,0 +1,111 @@
 #!/usr/bin/env python3
 """
 Pipeline PDF → clean Markdown per vettorizzazione RAG.
 Uso:
    # Converti
    python conversione/ --stem <nome>
    python conversione/ --stem <nome> --force
    python conversione/                          # tutti i PDF in sources/
    # Valida
    python conversione/ validate
    python conversione/ validate <stem> [<stem> ...] --detail
 Prerequisiti:
    pip install opendataloader-pdf pdfplumber
    Java 11+ sul PATH (https://adoptium.net/)
 """
 import argparse
 import sys
 from pathlib import Path
 # Rende _pipeline importabile da conversione/
 sys.path.insert(0, str(Path(__file__).parent))
 from _pipeline import _check_deps, run, validate
 def _build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="conversione",
        description="PDF → clean Markdown strutturato, pronto per chunking RAG",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog=(
            "Esempi:\n"
            "  python conversione/ --stem manuale\n"
            "  python conversione/ --stem manuale --force\n"
            "  python conversione/ validate\n"
            "  python conversione/ validate manuale --detail"
        ),
    )
    # ── Subcommand: validate ──────────────────────────────────────────────
    sub = parser.add_subparsers(dest="cmd", metavar="comando")
    val = sub.add_parser(
        "validate",
        help="valida i report.json prodotti dalla conversione",
        description="Legge i report.json e assegna un voto 0-100 (A/B/C/D/F).",
    )
    val.add_argument(
        "stems",
        nargs="*",
        metavar="STEM",
        help="stem da validare. Ometti per tutti.",
    )
    val.add_argument(
        "--detail", "-d",
        action="store_true",
        help="mostra il dettaglio delle penalità per ogni documento",
    )
    # ── Opzioni convert (modalità default) ───────────────────────────────
    parser.add_argument(
        "--stem",
        metavar="NOME",
        help="nome del PDF in sources/ (senza estensione). Ometti per tutti.",
    )
    parser.add_argument(
        "--force",
        action="store_true",
        help="riesegui anche se clean.md è già presente",
    )
    return parser
 def main() -> None:
    parser = _build_parser()
    args   = parser.parse_args()
    root   = Path(__file__).parent.parent
    # ── Validate ─────────────────────────────────────────────────────────
    if args.cmd == "validate":
        validate(args.stems, root, detail=args.detail)
        return
    # ── Convert (default) ────────────────────────────────────────────────
    _check_deps()
    if args.stem:
        stems = [args.stem]
    else:
        sources_dir = root / "sources"
        if not sources_dir.exists():
            print("Errore: cartella sources/ non trovata.")
            sys.exit(1)
        stems = sorted(p.stem for p in sources_dir.glob("*.pdf"))
        if not stems:
            print("Errore: nessun PDF trovato in sources/.")
            sys.exit(1)
    results = [run(s, root, args.force) for s in stems]
    ok      = sum(results)
    total   = len(results)
    print(f"\n{'✅' if all(results) else '⚠️ '} {ok}/{total} documenti convertiti")
    sys.exit(0 if all(results) else 1)
 if __name__ == "__main__":
    main()
@@ -0,0 +1,17 @@
 from .extract   import _check_deps, validate_pdf, convert_pdf
 from ._apply    import apply_transforms
 from .structure import analyze
 from .report     import build_report
 from .runner     import run
 from .validator  import validate
 __all__ = [
    "_check_deps",
    "validate_pdf",
    "convert_pdf",
    "apply_transforms",
    "analyze",
    "build_report",
    "run",
    "validate",
 ]
@@ -0,0 +1,110 @@
 """Orchestratore: applica le trasformazioni in ordine semantico."""
 import re
 from functools import partial
 from ._encoding  import (
    _t_fix_symbol_font, _t_fix_accents,
    _t_fix_multiplication, _t_fix_micro,
 )
 from ._artifacts import (
    _t_remove_images, _t_fix_br, _t_fix_tabsep, _t_remove_footnotes,
    _t_remove_formula_labels, _t_remove_dotleaders, _t_remove_recurring_lines,
    _t_fix_math_symbols, _t_remove_watermarks, _t_remove_urls,
    _t_remove_page_markers, _t_remove_page_numbers, _t_remove_separators,
 )
 from ._headers   import (
    _t_fix_header_concat, _t_extract_capitolo,
    _t_normalize_numbered_headings, _t_normalize_header_levels,
    _t_remove_header_bold, _t_normalize_allcaps_headers, _t_demote_h1,
 )
 from ._structure import (
    _t_remove_toc, _t_remove_orphan_toc, _t_allcaps_to_headers,
    _t_numbered_sections, _t_promote_chapter_headers,
    _t_extract_math, _t_extract_articles,
 )
 from ._text      import (
    _t_merge_paragraphs, _t_normalize_whitespace, _t_collapse_blank_lines,
    _t_restore_poetry_lines, _t_demote_verse_headers,
 )
 from ._finish    import (
    _t_remove_empty_headers, _t_merge_title_headers,
    _t_remove_garbage_headers, _t_math_header_demotion,
    _t_remove_frontmatter,
 )
 def apply_transforms(text: str, on_step=None) -> tuple[str, dict]:
    """
    Applica le trasformazioni strutturali al Markdown grezzo.
    Restituisce (testo_modificato, statistiche).
    L'ordine è semantico: encoding → artefatti → struttura header →
    costruzione struttura → testo → rifinitura.
    on_step(i, total, label) — callback opzionale chiamato dopo ogni step.
    """
    _has_ex = bool(re.search(r"\b(Esercizi|Exercises|Problems|Homework)\b", text, re.IGNORECASE))
    # (stat_key, fn, label)
    _transforms: list[tuple[str | None, object, str]] = [
        # 1. Encoding
        ("n_simboli_pua_corretti",         _t_fix_symbol_font,              "encoding PUA Symbol"),
        ("n_accenti_corretti",             _t_fix_accents,                  "accenti backtick LaTeX"),
        ("n_moltiplicazioni_corrette",     _t_fix_multiplication,           "simbolo moltiplicazione"),
        ("n_micro_corretti",               _t_fix_micro,                    "simbolo micro SI"),
        # 2. Pulizia artefatti
        ("n_page_markers_rimossi",         _t_remove_page_markers,          "rimozione page markers PDF"),
        ("n_separatori_rimossi",           _t_remove_separators,            "rimozione separatori underscore"),
        ("n_immagini_rimosse",             _t_remove_images,                "rimozione immagini"),
        ("n_br_rimossi",                   _t_fix_br,                       "fix <br> inline"),
        ("n_tabsep_rimossi",               _t_fix_tabsep,                   "fix separatori tabella"),
        ("n_note_rimosse",                 _t_remove_footnotes,             "rimozione footnote"),
        ("n_simboli_math_rimossi",         _t_fix_math_symbols,             "rimozione box math"),
        ("n_formule_rimossi",              _t_remove_formula_labels,        "rimozione label formula"),
        ("n_dotleader_rimossi",            _t_remove_dotleaders,            "rimozione dot-leader TOC"),
        ("n_righe_ricorrenti_rimosse",     _t_remove_recurring_lines,       "rimozione righe ricorrenti"),
        ("n_numeri_pagina_rimossi",        _t_remove_page_numbers,          "rimozione numeri pagina isolati"),
        # 3. Struttura header
        ("n_header_concat_fixati",         _t_fix_header_concat,            "fix header+corpo concatenati"),
        (None,                             _t_extract_capitolo,             "estrazione Capitolo inline"),
        ("n_header_numerati_normalizzati", _t_normalize_numbered_headings,  "normalizzazione livelli numerati"),
        (None,                             _t_normalize_header_levels,      "normalizzazione livelli ####→###"),
        (None,                             _t_demote_h1,                    "demozione #→## (sezioni principali)"),
        (None,                             _t_remove_header_bold,           "rimozione bold negli header"),
        (None,                             _t_normalize_allcaps_headers,    "normalizzazione ALL-CAPS header"),
        # 4. Costruzione struttura
        ("toc_rimosso",                    _t_remove_toc,                   "rimozione TOC"),
        ("n_toc_orfani_rimossi",           _t_remove_orphan_toc,            "rimozione voci TOC orfane"),
        ("n_header_allcaps",               _t_allcaps_to_headers,           "ALL-CAPS → ##"),
        ("n_sezioni_numerate",             partial(_t_numbered_sections, has_exercises=_has_ex), "sezioni numerate → ###"),
        ("n_capitoli_promossi",            _t_promote_chapter_headers,      "promozione capitoli ### → ##"),
        ("n_ambienti_matematici",          _t_extract_math,                 "estrazione ambienti matematici"),
        ("n_articoli_estratti",            _t_extract_articles,             "estrazione articoli → ###"),
        # 5. Testo
        ("n_paragrafi_uniti",              _t_merge_paragraphs,             "merge paragrafi spezzati"),
        (None,                             _t_normalize_whitespace,         "normalizzazione whitespace"),
        (None,                             _t_collapse_blank_lines,         "riduzione righe vuote"),
        ("n_versi_ripristinati",           _t_restore_poetry_lines,         "ripristino versi poesia"),
        ("n_header_verso_demotati",        _t_demote_verse_headers,         "demozione header-verso"),
        ("n_url_rimossi",                  _t_remove_urls,                  "rimozione URL"),
        # 6. Rifinitura
        (None,                             lambda t: (re.sub(r"(?m)^(#{1,6}.+?)\s+pag\.\s*\d{1,4}\s*$", r"\1", t), 0), "strip pag.N dagli header"),
        (None,                             _t_remove_empty_headers,         "rimozione header vuoti"),
        ("n_titoli_uniti",                 _t_merge_title_headers,          "merge titoli isolati"),
        (None,                             lambda t: (re.sub(r"(?m)^(#{1,6}.+?)\s*\|\s*\d{1,3}\s*$", r"\1", t), 0), "fix header|pagina"),
        ("n_garbage_headers_rimossi",      _t_remove_garbage_headers,       "rimozione garbage header"),
        ("n_formula_headers_demotati",     _t_math_header_demotion,         "demozione formula-header"),
        ("n_frontmatter_rimossi",          _t_remove_frontmatter,           "rimozione frontmatter"),
        ("n_watermark_rimossi",            _t_remove_watermarks,            "rimozione watermark"),
    ]
    total = len(_transforms)
    stats: dict = {}
    for i, (stat_key, fn, label) in enumerate(_transforms, 1):
        text, n = fn(text)
        if stat_key:
            stats[stat_key] = stats.get(stat_key, 0) + n
        if on_step:
            on_step(i, total, label)
    stats["toc_rimosso"] = bool(stats.get("toc_rimosso", 0))
    return text, stats
@@ -0,0 +1,127 @@
 """Rimozione artefatti: immagini, BR, footnote, URL, righe ricorrenti, watermark."""
 import re
 from collections import Counter
 from ._constants import (
    _WATERMARK_RE, _TABSEP_RE, _SUPERSCRIPT_RE, _FOOTNOTE_BODY_RE, _DOTLEADER_RE,
    _PAGE_MARKER_RE, _STANDALONE_NUM_RE, _UNDERSCORE_SEP_RE,
 )
 def _t_remove_images(text: str) -> tuple[str, int]:
    n    = len(re.findall(r"!\[[^\]]*\]\([^)]*\)", text))
    text = re.sub(r"!\[[^\]]*\]\([^)]*\)\s*", "", text)
    return text, n
 def _t_fix_br(text: str) -> tuple[str, int]:
    n    = len(re.findall(r"<br>", text, re.IGNORECASE))
    text = re.sub(r"<br>\s*", " ", text, flags=re.IGNORECASE)
    return text, n
 def _t_fix_tabsep(text: str) -> tuple[str, int]:
    n    = len(_TABSEP_RE.findall(text))
    text = _TABSEP_RE.sub("", text)
    return text, n
 def _t_remove_footnotes(text: str) -> tuple[str, int]:
    lines  = text.split("\n")
    result, count = [], 0
    for line in lines:
        stripped = line.strip()
        if stripped and _FOOTNOTE_BODY_RE.match(stripped) and len(stripped) < 300:
            count += 1
            continue
        cleaned = _SUPERSCRIPT_RE.sub("", line)
        if cleaned != line:
            count += 1
        result.append(cleaned)
    return "\n".join(result), count
 def _t_remove_formula_labels(text: str) -> tuple[str, int]:
    n    = len(re.findall(r"\[\d+\.\d+\]", text))
    text = re.sub(r"\s*\[\d+\.\d+\]\s*", " ", text)
    return text, n
 def _t_remove_dotleaders(text: str) -> tuple[str, int]:
    n    = len(_DOTLEADER_RE.findall(text))
    text = _DOTLEADER_RE.sub("", text)
    text = re.sub(r"(?m)^(i{1,3}|iv|vi{0,3}|ix|xi{0,2}|x)$", "", text, flags=re.IGNORECASE)
    return text, n
 def _t_remove_recurring_lines(text: str) -> tuple[str, int]:
    lines       = text.split("\n")
    short_lines = [
        ln.strip() for ln in lines
        if 3 < len(ln.strip()) < 80
        and not ln.strip().startswith("#")
        and not ln.strip().startswith("|")
    ]
    freq      = Counter(short_lines)
    recurring = {ln for ln, c in freq.items() if c >= 5}
    if not recurring:
        return text, 0
    result, count = [], 0
    for line in lines:
        if line.strip() in recurring:
            count += 1
        else:
            result.append(line)
    return "\n".join(result), count
 def _t_fix_math_symbols(text: str) -> tuple[str, int]:
    lines  = text.split("\n")
    result, count = [], 0
    for line in lines:
        if line.strip() and re.match(r"^[\s■-◿☐-☒•▪▫◆◇●○•]+$", line):
            count += 1
        else:
            result.append(line)
    return "\n".join(result), count
 def _t_remove_watermarks(text: str) -> tuple[str, int]:
    lines  = text.split("\n")
    result, count = [], 0
    for line in lines:
        if _WATERMARK_RE.match(line):
            count += 1
        else:
            result.append(line)
    return "\n".join(result), count
 def _t_remove_urls(text: str) -> tuple[str, int]:
    n    = len(re.findall(r"(?m)^(https?://|www\.)\S+\s*$", text))
    text = re.sub(r"(?m)^(https?://|www\.)\S+\s*$", "", text)
    return text, n
 def _t_remove_page_markers(text: str) -> tuple[str, int]:
    """Rimuove i marcatori <!-- page: N --> e i separatori --- adiacenti."""
    n = len(_PAGE_MARKER_RE.findall(text))
    # Rimuovi ---\n<!-- page: N --> come blocco unico (separatori di pagina PDF)
    text = re.sub(r"(?m)^---\s*\n<!-- page: \d+ -->\s*\n?", "", text)
    # Rimuovi eventuali <!-- page: N --> rimasti senza ---
    text = _PAGE_MARKER_RE.sub("", text)
    return text, n
 def _t_remove_page_numbers(text: str) -> tuple[str, int]:
    """Rimuove numeri di pagina isolati (1-3 cifre su una riga solitaria)."""
    n    = len(_STANDALONE_NUM_RE.findall(text))
    text = _STANDALONE_NUM_RE.sub("", text)
    return text, n
 def _t_remove_separators(text: str) -> tuple[str, int]:
    """Rimuove linee di separazione formate solo da underscore (___...)."""
    n    = len(_UNDERSCORE_SEP_RE.findall(text))
    text = _UNDERSCORE_SEP_RE.sub("", text)
    return text, n
@@ -0,0 +1,169 @@
 """
 Costanti di modulo condivise tra i moduli di trasformazione.
 Tutte le regex compilate e le mappe statiche vivono qui.
 """
 import re
 # ─── Keyword sets ─────────────────────────────────────────────────────────────
 _TOC_KEYWORDS = frozenset([
    "indice", "index", "contents", "table of contents",
    "sommario", "inhaltsverzeichnis", "inhalt",
    "indice generale", "indice analitico", "indice dei contenuti",
    "elenco dei capitoli", "argomenti", "table des matières",
    "tabla de contenidos", "содержание",
 ])
 _ORDINALS_IT = {
    "PRIMO": "I", "SECONDO": "II", "TERZO": "III", "QUARTO": "IV",
    "QUINTO": "V", "SESTO": "VI", "SETTIMO": "VII", "OTTAVO": "VIII",
    "NONO": "IX", "DECIMO": "X",
 }
 _ORDINALS_EN = {
    "ONE": "1", "TWO": "2", "THREE": "3", "FOUR": "4", "FIVE": "5",
    "SIX": "6", "SEVEN": "7", "EIGHT": "8", "NINE": "9", "TEN": "10",
 }
 # ─── PUA Symbol font map ──────────────────────────────────────────────────────
 _SYMBOL_PUA_MAP: dict[str, str] = {
    "": " ",
    "": "(",
    "": ")",
    "": "+",
    "": "−",
    "": ".",
    "": "/",
    "": "0", "": "1", "": "2", "": "3", "": "4",
    "": "5", "": "6", "": "7", "": "8", "": "9",
    "": ":", "": ";", "": "<", "": "=", "": ">",
    "": "≅",
    "": "Α", "": "Β", "": "Χ", "": "Δ", "": "Ε",
    "": "Φ", "": "Γ", "": "Η", "": "Ι", "": "ϑ",
    "": "Κ", "": "Λ", "": "Μ", "": "Ν", "": "Ο",
    "": "Π", "": "Θ", "": "Ρ", "": "Σ", "": "Τ",
    "": "Υ", "": "ς", "": "Ω", "": "Ξ", "": "Ψ",
    "": "Ζ",
    "": "[",
    "": "∴",
    "": "]",
    "": "⊥",
    "": "α", "": "β", "": "χ", "": "δ", "": "ε",
    "": "φ", "": "γ", "": "η", "": "ι", "": "ϕ",
    "": "κ", "": "λ", "": "μ", "": "ν", "": "ο",
    "": "π", "": "θ", "": "ρ", "": "σ", "": "τ",
    "": "υ", "": "ϖ", "": "ω", "": "ξ", "": "ψ",
    "": "ζ",
    "": "{",
    "": "|",
    "": "}",
    "": "~",
    "": "±",
    "": "•",
    "": "√",
    "": "≤",
    "": "≥",
    "": "∝",
    "": "×",
    "": "÷",
    "": "×",
    "": "≠",
    "": "≠",
    "": "≥",
    "": "′",
    "": "*",
    "": ",",
    "": "≤",
    "": "•",
    "": "•",
    "": "→",
    "": "÷",
    "": "",
    "": "→",
    "": "",
    "": "",
    "": "",
    "": "",
    # TeX Computer Modern bracket/delimiter pieces (U+F8EB–F8FE) → stringa vuota
    "": "",  # TeX large paren left
    "": "",  # TeX large paren extension
    "": "",  # TeX large paren right
    "": "",  # TeX large paren right ext
    "": "",  # TeX large bracket left
    "": "",  # TeX large bracket ext
    "": "",  # TeX brace top-left
    "": "",  # TeX brace mid
    "": "",  # TeX brace mid-right
    "": "",  # TeX brace extension
    "": "",  # TeX brace right
    "": "",  # TeX bracket right large
    "": "",  # TeX bracket right ext
    "": "",  # TeX bracket right close
    "": "",  # TeX integral large
    "": "",  # TeX integral extension
    "": "",  # TeX integral top
    "": "",  # TeX radical top
    "": "",  # TeX radical extension
    "": "",  # TeX arrowhead
 }
 _SYMBOL_PUA_RE = re.compile(
    "[" + "".join(re.escape(k) for k in _SYMBOL_PUA_MAP) + "]"
 )
 # ─── Regex compilate condivise ────────────────────────────────────────────────
 _SUPERSCRIPT_RE = re.compile(r'[¹²³⁰⁴-⁹]+')
 _FOOTNOTE_BODY_RE = re.compile(
    r'^([¹²³⁰⁴-⁹]+\s+|\[\d{1,3}\]\s+)'
 )
 _NUMBERED_HDR_RE = re.compile(
    r"^(#{1,6})\s+(\d+(?:\.\d+)*)\.\s+(.+)$",
    re.MULTILINE,
 )
 _BIB_MARKERS_RE = re.compile(
    r'\b(pp?\.|vol\.|n\.\s*\d|ed\.|edn\.|ISBN|DOI|arXiv)\b'
    r'|\b(19|20)\d{2}\b'
    r'|\b(ibid\.?|ibidem|op\.\s*cit\.?|cit\.|cfr\.|ivi[,;\s])\b',
    re.IGNORECASE,
 )
 # Pattern autore accademico: iniziale maiuscola + cognome TUTTO-MAIUSCOLO (es. A. PAJNO, G. GUZZETTA)
 _FOOTNOTE_AUTHOR_RE = re.compile(r'(?<![A-Z])[A-Z]\.\s+[A-Z]{3,}')
 _WATERMARK_RE = re.compile(
    r"^(BOZZA|DRAFT|CONFIDENTIAL|RISERVATO|PROVVISORIO|SAMPLE|SPECIMEN"
    r"|DO NOT DISTRIBUTE|NON DISTRIBUIRE|COPY|COPIA)\s*$",
    re.IGNORECASE | re.MULTILINE,
 )
 _TABSEP_RE = re.compile(r"(?m)^\|\s*\|\s*$|^\|---\|?\s*$")
 _DOTLEADER_RE = re.compile(r"^[^\n]*(?:(?:\. ){3,}|\.{4,})[^\n]*$", re.MULTILINE)
 _FM_RE = re.compile(
    r"https?://|www\.|@[A-Za-z]|\bUniversit[àa]\b|\bDipartimento\b|"
    r"\bCopyright\b|\bLicenza\b|\bEdizione\b|"
    r"protetto da|tutti i diritti",
    re.IGNORECASE,
 )
 _VERSE_NUM_RE = re.compile(
    r"([.!?\xbb'\"" + "’" + r"]\s+)(\d+)(\s+)(?=[A-Z\xc0-\xd9a-z\xe0-\xf9\xab“”‟])"
 )
 # Math header demotion
 _MATH_SYMBOLS_RE = re.compile(
    r"[=+∈∀∃≤≥∞∑∫∂→↔⊂⊃∩∪αβγδεζηθικλμνξοπρστυφχψωΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ]"
 )
 _EXERCISE_TRIGGER_RE = re.compile(
    r"\b(Si dimostri|Si calcoli|Si provi|Si trovi|Trovare|Find|Prove|Show that"
    r"|Compute|Calculate|Dimostrare|Verificare)\b",
    re.IGNORECASE,
 )
 _MATH_HDR_RE = re.compile(r"^(#{2,3})\s+(.+)$")
 _NUMBERED_PREFIX_RE = re.compile(r"^(\d+(?:\.\d+)*[.)])\s+(.+)$", re.DOTALL)
 # Orphan TOC: voce di indice senza dot-leader (es. "3. Funzioni 174")
 _TOC_ITEM_RE = re.compile(
    r"^\d+(\.\d+)*\.?\s+[A-Za-zÀ-ú\'\(][^\n]{2,70}$"
 )
 _TOC_HDR_WITH_PAGE_RE = re.compile(
    r"^#{1,3}\s+\d+\.?\s+.{3,60}\s+\d{1,4}$"
 )
 # Artefatti PDF: page markers e separatori
 _PAGE_MARKER_RE = re.compile(r"(?m)^<!-- page: \d+ -->\s*$")
 _STANDALONE_NUM_RE = re.compile(r"(?m)^(?:- )?\d{1,3}$")
 _UNDERSCORE_SEP_RE = re.compile(r"(?m)^_{4,}\s*$")
@@ -0,0 +1,45 @@
 """Trasformazioni di encoding: PUA font Symbol, accenti LaTeX, simboli SI."""
 import re
 from ._constants import _SYMBOL_PUA_MAP, _SYMBOL_PUA_RE
 def _t_fix_symbol_font(text: str) -> tuple[str, int]:
    count = [0]
    def _repl(m: re.Match) -> str:
        count[0] += 1
        return _SYMBOL_PUA_MAP[m.group(0)]
    result = _SYMBOL_PUA_RE.sub(_repl, text)
    return result, count[0]
 def _t_fix_accents(text: str) -> tuple[str, int]:
    _ACCENT_MAP = {
        "e": "\xe8", "E": "\xc8", "a": "\xe0", "A": "\xc0",
        "u": "\xf9", "U": "\xd9", "i": "\xec", "I": "\xcc",
        "o": "\xf2", "O": "\xd2",
    }
    n_bt_before = text.count("`")
    text = re.sub(r"`([eEaAuUiIoO])", lambda m: _ACCENT_MAP[m.group(1)], text)
    text = re.sub(r"([eEaAuUiIoO])`", lambda m: _ACCENT_MAP[m.group(1)], text)
    n_accenti   = n_bt_before - text.count("`")
    n_bt_orfani = text.count("`")
    if n_bt_orfani:
        text = re.sub(r"`", "", text)
        n_accenti += n_bt_orfani
    return text, n_accenti
 def _t_fix_multiplication(text: str) -> tuple[str, int]:
    n    = len(re.findall(r'(?<=[0-9])"(?=[0-9(])', text))
    text = re.sub(r'(?<=[0-9])"(?=[0-9(])', '×', text)
    return text, n
 def _t_fix_micro(text: str) -> tuple[str, int]:
    _SI_UNITS_RE = r'[mAsgVWFHTKNJClΩ]'
    n    = len(re.findall(rf'\d\s*!(?={_SI_UNITS_RE})', text))
    text = re.sub(rf'(\d)\s*!({_SI_UNITS_RE})', r'\1 µ\2', text)
    return text, n
@@ -0,0 +1,116 @@
 """Trasformazioni di rifinitura: header vuoti, garbage, demozione formula-header, frontmatter."""
 import re
 from ._constants import (
    _FM_RE, _MATH_HDR_RE, _MATH_SYMBOLS_RE,
    _EXERCISE_TRIGGER_RE, _NUMBERED_PREFIX_RE,
 )
 from ._helpers import _merge_title_headers
 def _t_remove_empty_headers(text: str) -> tuple[str, int]:
    blocks  = re.split(r"\n{2,}", text)
    cleaned = []
    for i, block in enumerate(blocks):
        stripped = block.strip()
        if re.match(r"^#{1,6} ", stripped) and "\n" not in stripped:
            next_stripped    = blocks[i + 1].strip() if i + 1 < len(blocks) else ""
            next_is_long_hdr = (
                re.match(r"^#{1,6} ", next_stripped) and len(next_stripped) > 80
            )
            if not next_stripped or (
                re.match(r"^#{1,6} ", next_stripped) and not next_is_long_hdr
            ):
                continue
        cleaned.append(block)
    return re.sub(r"\n{3,}", "\n\n", "\n\n".join(cleaned)), 0
 def _t_merge_title_headers(text: str) -> tuple[str, int]:
    return _merge_title_headers(text)
 def _t_remove_garbage_headers(text: str) -> tuple[str, int]:
    def _is_garbage(content: str) -> bool:
        if content.lstrip().startswith("..."):
            return True
        if not re.search(r"[A-Za-z\xc0-\xffΑ-ω]{2,}", content):
            return True
        if re.fullmatch(r"\(?\s*[A-Za-z]{1,4}\s*\)?", content.strip()):
            return True
        if len(content) > 60 and re.search(r"[!%#]\w|\w[!%#]|\b\w+-\s*\w", content):
            return True
        first_alpha = next((c for c in content if c.isalpha()), None)
        if first_alpha and first_alpha.islower() and len(content) > 40:
            return True
        if re.match(r"^[A-Za-zΑ-ω_]{1,3}\s*[=<>≤≥]", content.strip()):
            return True
        if re.match(
            r"^(Figura|Figure|Fig\.|Tabella|Table|Tab\.)\s+\d",
            content.strip(), re.IGNORECASE,
        ):
            return True
        return False
    count     = 0
    lines     = text.split("\n")
    new_lines = []
    for line in lines:
        m = re.match(r"^#{1,6} (.+)$", line)
        if m and _is_garbage(m.group(1)):
            count += 1
            continue
        new_lines.append(line)
    text = "\n".join(new_lines)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text, count
 def _t_math_header_demotion(text: str) -> tuple[str, int]:
    lines = text.split("\n")
    result, count = [], 0
    for line in lines:
        m = _MATH_HDR_RE.match(line)
        if not m:
            result.append(line)
            continue
        body = m.group(2)
        if len(body) <= 100:
            result.append(line)
            continue
        has_math     = len(_MATH_SYMBOLS_RE.findall(body)) >= 3
        has_exercise = bool(_EXERCISE_TRIGGER_RE.search(body))
        if not (has_math or has_exercise):
            result.append(line)
            continue
        nm = _NUMBERED_PREFIX_RE.match(body)
        if nm:
            result.append(f"**{nm.group(1)}** {nm.group(2)}")
        else:
            result.append(body)
        count += 1
    return "\n".join(result), count
 def _t_remove_frontmatter(text: str) -> tuple[str, int]:
    blocks  = re.split(r"\n{2,}", text)
    cleaned = []
    count   = 0
    total   = len(blocks)
    cutoff  = max(5, min(15, int(total * 0.20)))
    for i, block in enumerate(blocks):
        stripped = block.strip()
        if i >= cutoff:
            cleaned.append(block)
            continue
        if not re.match(r"^### ", stripped) or re.match(r"^### \d", stripped):
            cleaned.append(block)
            continue
        body       = blocks[i + 1].strip() if i + 1 < len(blocks) else ""
        is_fm_body = len(body) < 250 and _FM_RE.search(body)
        is_fm_hdr  = _FM_RE.search(stripped)
        if is_fm_body or is_fm_hdr:
            count += 1
            continue
        cleaned.append(block)
    return re.sub(r"\n{3,}", "\n\n", "\n\n".join(cleaned)), count
@@ -0,0 +1,127 @@
 """Trasformazioni sulla struttura degli header: normalizzazione livelli, concat, bold."""
 import re
 from ._constants import _NUMBERED_HDR_RE
 from ._helpers import _sentence_case
 def _t_fix_header_concat(text: str) -> tuple[str, int]:
    count = 0
    def _fix(m: re.Match) -> str:
        nonlocal count
        hashes = m.group(1)
        full   = m.group(2).strip()
        if len(full) < 60:
            return m.group(0)
        skip  = min(10, len(full) // 3)
        split = re.search(
            r"(?<=[a-z\xe0\xe8\xe9\xec\xed\xf2\xf3\xf9\xfa\xe4])"
            r"(?=[A-Z\xc0\xc8\xc9\xcc\xcd\xd2\xd3\xd9\xda])",
            full[skip:],
        )
        if split:
            pos   = skip + split.start()
            title = full[:pos].strip()
            body  = full[pos:].strip()
            if len(title) >= 5 and len(body) >= 15:
                count += 1
                return f"{hashes} {title}\n\n{body}"
        return m.group(0)
    text = re.sub(r"^(#{2,6})\s+(.{40,})$", _fix, text, flags=re.MULTILINE)
    return text, count
 def _t_extract_capitolo(text: str) -> tuple[str, int]:
    def _repl(m: re.Match) -> str:
        num    = m.group(1)
        titolo = _sentence_case(m.group(2).strip().rstrip("- ").strip())
        return f"\n\n## Capitolo {num}: {titolo}\n\n"
    text = re.sub(
        r"\bCapitolo\s+(\d+)\s*[:\s]\s*([A-Z\xc0\xc8\xc9\xcc\xcd\xd2\xd3\xd9\xda\'L]"
        r"[A-Z\xc0\xc8\xc9\xcc\xcd\xd2\xd3\xd9\xda\s\'\.,\(\)]{5,80}?)"
        r"(?=\s*[-–]\s*\d|\s*\n|\s*$)",
        _repl,
        text,
    )
    return text, 0
 def _t_normalize_numbered_headings(text: str) -> tuple[str, int]:
    all_matches = list(_NUMBERED_HDR_RE.finditer(text))
    if not all_matches:
        return text, 0
    pairs     = [(m.group(2).count(".") + 1, len(m.group(1))) for m in all_matches]
    depths    = [d for d, _ in pairs]
    min_depth = min(depths)
    max_depth = max(depths)
    if max_depth == min_depth:
        return text, 0
    base_level = min(lv for d, lv in pairs if d == min_depth)
    count = 0
    def _repl(m: re.Match) -> str:
        nonlocal count
        hashes, num, title = m.group(1), m.group(2), m.group(3)
        depth     = num.count(".") + 1
        new_level = min(base_level + (depth - min_depth), 6)
        if new_level == len(hashes):
            return m.group(0)
        count += 1
        return f"{'#' * new_level} {num}. {title}"
    return _NUMBERED_HDR_RE.sub(_repl, text), count
 def _t_normalize_header_levels(text: str) -> tuple[str, int]:
    text = re.sub(r"^#{3,6}\s*$", "", text, flags=re.MULTILINE)
    text = re.sub(
        r"^(#{3,6})\s+(\d{1,3})\s+(.+)$",
        lambda m: f"### {m.group(2)}. {m.group(3)}",
        text,
        flags=re.MULTILINE,
    )
    text = re.sub(r"^#{4,6}\s+(.+)$", r"### \1", text, flags=re.MULTILINE)
    return text, 0
 def _t_remove_header_bold(text: str) -> tuple[str, int]:
    text = re.sub(
        r"^(#{1,6})\s+\*\*(.+?)\*\*\s*$",
        r"\1 \2",
        text, flags=re.MULTILINE,
    )
    return text, 0
 def _t_demote_h1(text: str) -> tuple[str, int]:
    """
    Demota # → ## quando il documento usa # per sezioni principali (≥5 h1
    con contenuto testuale). Crea gerarchia ## → ### invece di # → ###.
    """
    h1_count = len(re.findall(r"^# [A-Za-z\xc0-\xff]", text, re.MULTILINE))
    if h1_count < 5:
        return text, 0
    count = 0
    def _repl(m: re.Match) -> str:
        nonlocal count
        count += 1
        return f"## {m.group(1)}"
    text = re.sub(r"^# (.+)$", _repl, text, flags=re.MULTILINE)
    return text, count
 def _t_normalize_allcaps_headers(text: str) -> tuple[str, int]:
    def _norm(m: re.Match) -> str:
        hashes, content = m.group(1), m.group(2).strip()
        letters = [c for c in content if c.isalpha()]
        if letters and all(c.isupper() for c in letters):
            return f"{hashes} {_sentence_case(content)}"
        return m.group(0)
    text = re.sub(r"^(#{1,6}) (.+)$", _norm, text, flags=re.MULTILINE)
    return text, 0
@@ -0,0 +1,153 @@
 """Funzioni helper pure condivise tra i moduli di trasformazione."""
 import re
 from ._constants import _ORDINALS_IT, _ORDINALS_EN
 def _sentence_case(s: str) -> str:
    if not s:
        return s
    lower = s.lower()
    return lower[0].upper() + lower[1:]
 def _is_allcaps_line(line: str) -> bool:
    stripped = line.strip()
    letters  = [c for c in stripped if c.isalpha()]
    return (
        len(letters) >= 3
        and all(c.isupper() for c in letters)
        and not stripped.startswith("#")
        and not stripped.startswith("|")
    )
 def _allcaps_to_header(raw_line: str) -> str:
    text = re.sub(r"^[-*+]\s+", "", raw_line.strip())
    text = text.rstrip(".").rstrip("?").strip()
    _ORD_IT_PAT = "|".join(_ORDINALS_IT.keys())
    m = re.match(rf"^CAPITOLO ({_ORD_IT_PAT})\. (.+)", text)
    if m:
        roman  = _ORDINALS_IT[m.group(1)]
        titolo = m.group(2).rstrip(".").rstrip("?").strip()
        return f"## Capitolo {roman} — {_sentence_case(titolo)}"
    _ORD_EN_PAT = "|".join(_ORDINALS_EN.keys())
    m = re.match(rf"^CHAPTER ({_ORD_EN_PAT}|\d+)\.? (.+)", text)
    if m:
        n      = _ORDINALS_EN.get(m.group(1), m.group(1))
        titolo = m.group(2).rstrip(".").rstrip("?").strip()
        return f"## Chapter {n} — {_sentence_case(titolo)}"
    m = re.match(r"^([IVXLCDM]+|[0-9]+)\. (.+)", text)
    if m:
        return f"## {m.group(1)}. {_sentence_case(m.group(2).rstrip('.').strip())}"
    return f"## {_sentence_case(text)}"
 def _extract_math_environments(text: str) -> tuple[str, int]:
    _ENVS = (
        r"Definizione|Definition|Teorema|Theorem|Lemma|"
        r"Proposizione|Proposition|Corollario|Corollary|"
        r"Osservazione|Remark|Nota|Note|Esempio|Example"
    )
    count  = 0
    blocks = text.split("\n\n")
    result = []
    for block in blocks:
        stripped = block.strip()
        if not stripped or stripped.startswith("#"):
            result.append(block)
            continue
        m = re.match(
            rf"^({_ENVS})\s+((?:\d+\.?){{1,4}})\s*(.*)",
            stripped,
            re.DOTALL,
        )
        if not m:
            result.append(block)
            continue
        env  = m.group(1)
        num  = m.group(2).rstrip(".")
        rest = m.group(3).strip()
        title_m = re.match(r"^(\([^)]{2,60}\))\s+(.*)", rest, re.DOTALL)
        if title_m:
            header = f"### {env} {num} {title_m.group(1)}"
            body   = title_m.group(2).strip()
        else:
            header = f"### {env} {num}."
            body   = rest
        result.append(f"{header}\n\n{body}" if body else header)
        count += 1
    return "\n\n".join(result), count
 def _merge_title_headers(text: str) -> tuple[str, int]:
    count  = 0
    blocks = re.split(r"\n{2,}", text)
    result = []
    i = 0
    while i < len(blocks):
        block    = blocks[i]
        stripped = block.strip()
        if (
            re.match(r"^#{2,3} \d+\.\s*$", stripped)
            and i + 1 < len(blocks)
        ):
            nxt = blocks[i + 1].strip()
            if (
                nxt
                and "\n" not in nxt
                and len(nxt) <= 80
                and not nxt.startswith("#")
                and not re.match(r"^\d+[\.\)]\s", nxt)
            ):
                result.append(stripped.rstrip() + " " + nxt)
                count += 1
                i += 2
                continue
        result.append(block)
        i += 1
    return re.sub(r"\n{3,}", "\n\n", "\n\n".join(result)), count
 def _extract_article_headers(text: str) -> tuple[str, int]:
    count = 0
    def _repl(m: re.Match) -> str:
        nonlocal count
        num  = m.group(1)
        rest = m.group(2).strip()
        title_m = re.match(
            r"^([A-Z\xc0\xc8\xc9\xcc\xcd\xd2\xd3\xd9\xda].{1,74}?)\.\s+"
            r"([A-Z\xc0\xc8\xc9\xcc\xcd\xd2\xd3\xd9\xda\(\d].{4,})",
            rest,
        )
        if title_m:
            count += 1
            return (
                f"### Art. {num}. {title_m.group(1)}.\n\n"
                f"{title_m.group(2).strip()}"
            )
        if rest:
            count += 1
            return f"### Art. {num}.\n\n{rest}"
        count += 1
        return f"### Art. {num}."
    text = re.sub(
        r"^-\s+Art\.\s+([\d]+[a-z\-]*)\.\s*(.*)",
        _repl,
        text,
        flags=re.MULTILINE,
    )
    return text, count
@@ -0,0 +1,248 @@
 """Costruzione struttura: TOC, ALLCAPS→##, sezioni numerate, ambienti matematici, articoli."""
 import re
 from ._constants import (
    _TOC_KEYWORDS, _BIB_MARKERS_RE, _FOOTNOTE_AUTHOR_RE,
    _TOC_ITEM_RE, _TOC_HDR_WITH_PAGE_RE,
 )
 from ._helpers import (
    _is_allcaps_line, _allcaps_to_header,
    _extract_math_environments, _extract_article_headers,
 )
 def _t_remove_toc(text: str) -> tuple[str, int]:
    lines     = text.split("\n")
    new_lines = []
    _in_toc   = False
    removed   = False
    for line in lines:
        bare       = re.sub(r"^#+\s*", "", line.strip())
        first_word = bare.split(".")[0].strip().lower()
        if first_word in _TOC_KEYWORDS:
            removed = True
            _in_toc = True
            continue
        if _in_toc:
            if re.match(r"^\s*$", line) or re.match(r"^\s*[-*+]\s+\d", line):
                continue
            if re.match(r"^\s*[-*+]\s+.{2,70}\s+\d{1,3}\s*$", line):
                continue
            # Righe brevi con riferimento pagina (es. "Prefazione pag. 4")
            if re.match(r"^.{3,80}\s+pag\.\s*\d{1,4}\s*$", line.strip()):
                continue
            if len(line.strip()) > 200:
                _in_toc = False
                new_lines.append(line)
                continue
            _in_toc = False
        new_lines.append(line)
    return "\n".join(new_lines), 1 if removed else 0
 def _t_remove_orphan_toc(text: str) -> tuple[str, int]:
    """
    Rimuove voci di sommario senza dot-leader che sfuggono a _t_remove_toc.
    Rileva: (a) blocchi di 3+ righe consecutive che matchano il pattern TOC
    nei primi 25% del documento; (b) header ### N. Titolo PAGINA il cui corpo
    è una lista di voci numerate.
    """
    blocks  = re.split(r"\n{2,}", text)
    total   = len(blocks)
    cutoff  = max(10, min(40, int(total * 0.25)))
    to_drop = set()
    i = 0
    while i < cutoff and i < total:
        b = blocks[i].strip()
        # (a) Sequenza di 3+ blocchi TOC consecutivi
        if _TOC_ITEM_RE.match(b):
            j = i
            while j < min(cutoff, i + 60) and j < len(blocks) and _TOC_ITEM_RE.match(blocks[j].strip()):
                j += 1
            if j - i >= 3:
                for k in range(i, j):
                    to_drop.add(k)
                # Rimuovi anche l'header ### precedente se ha numero di pagina
                if i > 0 and _TOC_HDR_WITH_PAGE_RE.match(blocks[i - 1].strip()):
                    to_drop.add(i - 1)
                i = j
                continue
        # (b) Header ### N. Titolo PAGINA con corpo che è lista di voci numerate
        if _TOC_HDR_WITH_PAGE_RE.match(b):
            body = blocks[i + 1].strip() if i + 1 < len(blocks) else ""
            # Il corpo contiene 2+ occorrenze di "N. Titolo"
            toc_hits = re.findall(r"\d+\.?\s+[A-Za-zÀ-ú]", body)
            if len(toc_hits) >= 2 and len(body) < 300:
                to_drop.add(i)
                if i + 1 < total:
                    to_drop.add(i + 1)
                i += 2
                continue
        i += 1
    if not to_drop:
        return text, 0
    kept = [b for idx, b in enumerate(blocks) if idx not in to_drop]
    return re.sub(r"\n{3,}", "\n\n", "\n\n".join(kept)), len(to_drop)
 def _t_allcaps_to_headers(text: str) -> tuple[str, int]:
    count      = 0
    blocks     = text.split("\n\n")
    new_blocks = []
    for block in blocks:
        stripped = block.strip()
        if "\n" not in stripped and _is_allcaps_line(stripped):
            new_blocks.append(_allcaps_to_header(stripped))
            count += 1
        else:
            sub_lines = block.split("\n")
            converted = []
            for ln in sub_lines:
                if _is_allcaps_line(ln) and len(ln.strip()) > 3:
                    converted.append(_allcaps_to_header(ln))
                    count += 1
                else:
                    converted.append(ln)
            new_blocks.append("\n".join(converted))
    return "\n\n".join(new_blocks), count
 def _t_numbered_sections(text: str, has_exercises: bool = False) -> tuple[str, int]:
    count = 0
    def _num_repl(m: re.Match) -> str:
        nonlocal count
        content = m.group(2).strip()
        if content.endswith(".") and len(content) > 40:
            return m.group(0)
        # Paragrafo lungo: non è un titolo di sezione
        if len(content) > 130:
            return m.group(0)
        if _BIB_MARKERS_RE.search(content) or _FOOTNOTE_AUTHOR_RE.search(content):
            return m.group(0)
        count += 1
        # Prova a separare titolo dal corpo alla prima transizione minusc→Maiusc
        split = re.search(
            r"(?<=[a-z\xe0\xe8\xe9\xec\xed\xf2\xf3\xf9\xfa])\s+"
            r"(?=[A-Z\xc0\xc8\xc9\xcc\xcd\xd2\xd3\xd9\xda])",
            content,
        )
        if split and 3 <= split.start() and len(content) - split.end() >= 40:
            title = content[: split.start()].strip()
            body  = content[split.end():].strip()
            return f"### {m.group(1)}. {title}\n\n{body}"
        return f"### {m.group(1)}. {content}"
    text = re.sub(r"^(\d+)\.\s+(.+)$", _num_repl, text, flags=re.MULTILINE)
    def _num_letter_repl(m: re.Match) -> str:
        nonlocal count
        count += 1
        return f"### {m.group(1)}{m.group(2)}.\n\n{m.group(3).strip()}"
    text = re.sub(r"^(\d+)\s*([a-z])\.\s+(.+)$", _num_letter_repl, text, flags=re.MULTILINE)
    if not has_exercises:
        def _aphorism_repl(m: re.Match) -> str:
            nonlocal count
            content = m.group(2).strip()
            if _BIB_MARKERS_RE.search(content) or _FOOTNOTE_AUTHOR_RE.search(content):
                return m.group(0)
            count += 1
            split = re.search(
                r"(?<=[a-z\xe0\xe8\xe9\xec\xed\xf2\xf3\xf9\xfa])\s+"
                r"(?=[A-Z\xc0\xc8\xc9\xcc\xcd\xd2\xd3\xd9\xda])",
                content,
            )
            if split and 3 <= split.start() and len(content) - split.end() >= 40:
                title = content[: split.start()].strip()
                body  = content[split.end():].strip()
                return f"\n\n### {m.group(1)}. {title}\n\n{body}"
            return f"\n\n### {m.group(1)}. {content}"
        text = re.sub(
            r"^-[ \t]+(\d{1,3})\.[ \t]+(.{10,})$",
            _aphorism_repl,
            text,
            flags=re.MULTILINE,
        )
    def _list_section_repl(m: re.Match) -> str:
        nonlocal count
        num     = m.group(1)
        content = m.group(2).strip()
        if _BIB_MARKERS_RE.search(content) or _FOOTNOTE_AUTHOR_RE.search(content):
            return m.group(0)
        count += 1
        split = re.search(
            r"(?<=[a-z\xe0\xe8\xe9\xec\xed\xf2\xf3\xf9\xfa])\s+"
            r"(?=[A-Z\xc0\xc8\xc9\xcc\xcd\xd2\xd3\xd9\xda])",
            content,
        )
        if split and split.start() >= 3:
            title = content[: split.start()].strip()
            body  = content[split.end():].strip()
            if len(body) >= 20:
                return f"\n\n### {num}. {title}\n\n{body}"
        return f"\n\n### {num}. {content}"
    text = re.sub(
        r"^-\s+(\d{1,3})\s+([A-Z\xc0\xc8\xc9\xcc\xcd\xd2\xd3\xd9\xda\'L].{10,})$",
        _list_section_repl,
        text,
        flags=re.MULTILINE,
    )
    return text, count
 def _t_promote_chapter_headers(text: str) -> tuple[str, int]:
    """
    Promuove ### N. Titolo → ## N. Titolo quando sembrano capitoli principali.
    Condizioni: ≥3 headers ### con numero 1–50, nessun ## già presente,
    numeri di capitolo sequenziali e NON duplicati.
    Numeri duplicati indicano una raccolta multi-articolo: non promuovere.
    """
    if re.search(r"^## \d", text, re.MULTILINE):
        return text, 0
    pattern = re.compile(r"^### (\d+)\. (.+)$", re.MULTILINE)
    matches = list(pattern.finditer(text))
    chapter_matches = [m for m in matches if int(m.group(1)) <= 50]
    if len(chapter_matches) < 3:
        return text, 0
    chapter_nums_list = [int(m.group(1)) for m in chapter_matches]
    # Se qualche numero appare ≥3 volte è una raccolta multi-articolo: non promuovere
    num_counter: dict[int, int] = {}
    for n in chapter_nums_list:
        num_counter[n] = num_counter.get(n, 0) + 1
    if max(num_counter.values()) >= 3:
        return text, 0
    chapter_nums = set(chapter_nums_list)
    count = 0
    def _repl(m: re.Match) -> str:
        nonlocal count
        if int(m.group(1)) in chapter_nums:
            count += 1
            return f"## {m.group(1)}. {m.group(2)}"
        return m.group(0)
    return pattern.sub(_repl, text), count
 def _t_extract_math(text: str) -> tuple[str, int]:
    return _extract_math_environments(text)
 def _t_extract_articles(text: str) -> tuple[str, int]:
    return _extract_article_headers(text)
@@ -0,0 +1,109 @@
 """Trasformazioni sul testo: merge paragrafi, whitespace, poesia, versi."""
 import re
 from ._constants import _VERSE_NUM_RE
 def _t_merge_paragraphs(text: str) -> tuple[str, int]:
    _SENTENCE_END = set(".?!\xbb)\"'")
    blocks = text.split("\n\n")
    merged = []
    count  = 0
    i = 0
    while i < len(blocks):
        b        = blocks[i]
        stripped = b.strip()
        while (
            i + 1 < len(blocks)
            and stripped
            and not stripped.startswith("#")
            and not stripped.startswith("|")
            and stripped[-1] not in _SENTENCE_END
        ):
            nxt = blocks[i + 1].strip()
            if (
                not nxt
                or nxt.startswith("#")
                or nxt.startswith("|")
                or re.match(r"^\d+\.", nxt)
                or re.match(r"^[-*+]\s", nxt)
            ):
                break
            b        = stripped + " " + nxt
            stripped = b.strip()
            count   += 1
            i       += 1
        merged.append(b)
        i += 1
    text = "\n\n".join(merged)
    text = re.sub(r"(?m)^\|---\|\s*", "", text)
    return text, count
 def _t_normalize_whitespace(text: str) -> tuple[str, int]:
    lines = text.split("\n")
    text  = "\n".join(
        re.sub(r"  +", " ", line) if line.strip() else line
        for line in lines
    )
    return text, 0
 def _t_collapse_blank_lines(text: str) -> tuple[str, int]:
    return re.sub(r"\n{3,}", "\n\n", text), 0
 def _t_restore_poetry_lines(text: str) -> tuple[str, int]:
    count  = 0
    blocks = text.split("\n\n")
    result = []
    for block in blocks:
        stripped = block.strip()
        if not stripped or stripped.startswith("#"):
            result.append(block)
            continue
        matches = list(_VERSE_NUM_RE.finditer(stripped))
        if len(matches) < 2:
            result.append(block)
            continue
        nums  = [int(m.group(2)) for m in matches]
        diffs = [nums[i + 1] - nums[i] for i in range(len(nums) - 1)]
        if not diffs or len(set(diffs)) > 2 or not (1 <= diffs[0] <= 5):
            result.append(block)
            continue
        step = diffs[0]
        def _replace_verse_num(m: re.Match) -> str:
            n   = int(m.group(2))
            sep = "\n\n" if n % (step * 3) == 0 else "\n"
            return m.group(1).rstrip() + sep
        new_block = _VERSE_NUM_RE.sub(_replace_verse_num, stripped)
        if new_block != stripped:
            count += len(matches)
        result.append(new_block)
    return "\n\n".join(result), count
 def _t_demote_verse_headers(text: str) -> tuple[str, int]:
    count = 0
    def _demote(m: re.Match) -> str:
        nonlocal count
        hashes, content = m.group(1), m.group(2).strip()
        if not re.search(r"\s\d{1,4}\s*$", content):
            return m.group(0)
        inner = re.sub(r"\s\d{1,4}\s*$", "", content)
        if not re.search(r'[,;:.!?\xbb"\'][\ ]+[A-Za-z\xc0-\xff\xab""]', inner):
            return m.group(0)
        count += 1
        clean = re.sub(r"\s\d{1,4}\s*$", "", content)
        return clean
    text = re.sub(r"^(#{1,6})\s+(.{20,})$", _demote, text, flags=re.MULTILINE)
    return text, count
@@ -0,0 +1,177 @@
 """Estrazione PDF: verifica dipendenze, validazione, metadati, conversione → raw Markdown."""
 import re
 import subprocess
 import sys
 from pathlib import Path
 # ─── Dipendenze ───────────────────────────────────────────────────────────────
 def _check_deps() -> None:
    try:
        import opendataloader_pdf  # noqa: F401
    except ImportError:
        print("Errore: opendataloader-pdf non installato.")
        print("       pip install opendataloader-pdf")
        sys.exit(1)
    try:
        result = subprocess.run(["java", "-version"], capture_output=True, text=True)
        if result.returncode != 0:
            raise FileNotFoundError
    except FileNotFoundError:
        print("Errore: Java 11+ non trovato sul PATH.")
        print("       Installa da https://adoptium.net/")
        sys.exit(1)
 # ─── Validazione PDF ──────────────────────────────────────────────────────────
 def validate_pdf(pdf_path: Path) -> tuple[bool, str]:
    """Verifica esistenza, leggibilità e presenza di testo digitale estraibile."""
    if not pdf_path.exists():
        return False, f"File non trovato: {pdf_path}"
    if pdf_path.suffix.lower() != ".pdf":
        return False, f"Non è un PDF: {pdf_path.name}"
    size = pdf_path.stat().st_size
    if size == 0:
        return False, "File vuoto"
    if size < 1024:
        return False, f"File troppo piccolo ({size} byte) — probabilmente corrotto"
    try:
        import pdfplumber
        with pdfplumber.open(pdf_path) as pdf:
            n_pages = len(pdf.pages)
            if n_pages == 0:
                return False, "PDF senza pagine"
            sample = min(5, n_pages)
            pages_with_text = sum(
                1 for i in range(sample)
                if len((pdf.pages[i].extract_text() or "").strip()) > 50
            )
            if pages_with_text == 0:
                extended = min(15, n_pages)
                if extended > sample:
                    ext_with_text = sum(
                        1 for i in range(sample, extended)
                        if len((pdf.pages[i].extract_text() or "").strip()) > 50
                    )
                    if ext_with_text > 0:
                        return True, (
                            f"{n_pages} pagine — prime {sample} vuote, "
                            f"testo trovato in pagine successive "
                            f"(possibile copertina immagine)"
                        )
                return False, (
                    f"Nessun testo nelle prime {extended} pagine "
                    f"— probabilmente scansionato (OCR non supportato)"
                )
        return True, f"{n_pages} pagine, testo digitale confermato"
    except MemoryError:
        return False, "Memoria esaurita durante l'apertura del PDF"
    except Exception as e:
        msg = str(e).lower()
        if "password" in msg or "encrypted" in msg:
            return False, "PDF protetto da password"
        return False, f"Impossibile aprire: {e}"
 # ─── Metadati PDF ────────────────────────────────────────────────────────────
 def extract_metadata(pdf_path: Path) -> dict:
    """
    Estrae title, author, year e page count dal PDF tramite fitz.
    Restituisce un dict con chiavi sempre presenti (stringa vuota se assenti).
    """
    try:
        import fitz
        doc  = fitz.open(str(pdf_path))
        meta = doc.metadata
        pages = len(doc)
        doc.close()
        def _clean(s: str) -> str:
            return s.strip() if s else ""
        year = ""
        creation = meta.get("creationDate", "")
        m = re.match(r"D:(\d{4})", creation)
        if m:
            year = m.group(1)
        return {
            "source": pdf_path.name,
            "title":  _clean(meta.get("title",  "")),
            "author": _clean(meta.get("author", "")),
            "year":   year,
            "pages":  pages,
        }
    except Exception:
        return {"source": pdf_path.name, "title": "", "author": "", "year": "", "pages": 0}
 # ─── Conversione PDF → Markdown ───────────────────────────────────────────────
 def _is_tagged_pdf(pdf_path: Path) -> bool:
    try:
        import fitz
        doc = fitz.open(str(pdf_path))
        tagged = "StructTreeRoot" in doc.pdf_catalog()
        doc.close()
        return tagged
    except Exception:
        return False
 def convert_pdf(pdf_path: Path, out_dir: Path) -> Path:
    """
    Converte il PDF in Markdown tramite opendataloader-pdf (XY-Cut++).
    Parametri per output RAG-ottimale:
      keep_line_breaks=False        → testo fluente, elimina hard-wrap del PDF
      reading_order="xycut"         → ricostruisce ordine di lettura multi-colonna
      sanitize=False                → preserva il testo originale senza filtri
      image_output="off"            → nessuna immagine estratta né referenziata
      table_method="cluster"        → rileva tabelle anche senza bordi visibili
      content_safety_off            → non scarta footnote (tiny) né layer OCG nascosti
      use_struct_tree               → attivo solo per PDF taggati (Word/InDesign)
      markdown_page_separator       → inserisce separatore + marker pagina tra pagine
      replace_invalid_chars         → sostituisce caratteri non validi con spazio
    """
    import opendataloader_pdf
    out_dir.mkdir(parents=True, exist_ok=True)
    tagged = _is_tagged_pdf(pdf_path)
    opendataloader_pdf.convert(
        input_path=str(pdf_path),
        output_dir=str(out_dir),
        format="markdown",
        keep_line_breaks=False,
        reading_order="xycut",
        sanitize=False,
        image_output="off",
        table_method="cluster",
        content_safety_off=["tiny", "hidden-ocg"],
        use_struct_tree=tagged,
        markdown_page_separator="\n\n---\n<!-- page: %page-number% -->\n\n",
        replace_invalid_chars=" ",
        quiet=True,
    )
    md_file = out_dir / f"{pdf_path.stem}.md"
    if not md_file.exists():
        candidates = list(out_dir.glob("*.md"))
        if not candidates:
            raise RuntimeError(f"Nessun file .md prodotto in {out_dir}")
        md_file = candidates[0]
    content = md_file.read_text(encoding="utf-8", errors="replace").strip()
    if len(content) < 100:
        raise RuntimeError(
            f"opendataloader ha prodotto un file .md quasi vuoto ({len(content)} char) "
            f"— il PDF potrebbe essere corrotto o non supportato"
        )
    return md_file
@@ -0,0 +1,135 @@
 import json
 import re
 from datetime import datetime
 from pathlib import Path
 from .structure import _parse_sections_with_body
 from ._constants import _MATH_SYMBOLS_RE, _EXERCISE_TRIGGER_RE, _MATH_HDR_RE
 def build_report(
    stem: str,
    out_dir: Path,
    clean_text: str,
    t_stats: dict,
    profile: dict,
    reduction: float,
 ) -> Path:
    text_lines = clean_text.split("\n")
    sections = _parse_sections_with_body(clean_text, 3)
    lengths  = [len(body) for _, body in sections]
    def _pct(data: list[int], p: float) -> int:
        if not data:
            return 0
        s = sorted(data)
        return s[max(0, min(len(s) - 1, int(len(s) * p)))]
    distribution = {
        "min":     min(lengths) if lengths else 0,
        "p25":     _pct(lengths, 0.25),
        "mediana": _pct(lengths, 0.50),
        "p75":     _pct(lengths, 0.75),
        "max":     max(lengths) if lengths else 0,
    }
    bare_hdrs = [
        {"header": hdr, "corpo_inizio": body[:120].replace("\n", " ")}
        for hdr, body in sections
        if re.match(r"^### \d+\.\s*$", hdr) and len(body.strip()) < 30
    ]
    short_secs = [
        {"header": hdr, "chars": length, "testo": body[:80].replace("\n", " ")}
        for (hdr, body), length in zip(sections, lengths)
        if 0 < length < 150
    ]
    long_secs = [
        {"header": hdr, "chars": length}
        for (hdr, _), length in zip(sections, lengths)
        if length > 1500
    ]
    def _scan(pattern: str, max_n: int = 10) -> list[dict]:
        hits = []
        for i, line in enumerate(text_lines):
            if re.search(pattern, line) and not re.match(r"^#+ ", line):
                hits.append({"riga": i + 1, "testo": line.strip()[:120]})
                if len(hits) >= max_n:
                    break
        return hits
    def _scan_formula_headers(max_n: int = 10) -> list[dict]:
        hits = []
        for i, line in enumerate(text_lines):
            m = _MATH_HDR_RE.match(line)
            if not m:
                continue
            body = m.group(2)
            if len(body) <= 100:
                continue
            has_math = len(_MATH_SYMBOLS_RE.findall(body)) >= 3
            has_ex   = bool(_EXERCISE_TRIGGER_RE.search(body))
            if has_math or has_ex:
                hits.append({"riga": i + 1, "testo": line.strip()[:120]})
                if len(hits) >= max_n:
                    break
        return hits
    residui = {
        "backtick":         _scan(r"`"),
        "dotleader":        _scan(r"(?:\. ){3,}"),
        "url":              _scan(r"^(https?://|www\.)\S+"),
        "immagini":         _scan(r"!\[[^\]]*\]\([^)]*\)"),
        "br_inline":        _scan(r"<br>"),
        "simboli_encoding": _scan(r'(?<=[0-9A-Za-z])[!"](?=[0-9A-Za-z])'),
        "formule_inline":   _scan(r"\[\d+\.\d+\]"),
        "footnote_markers": _scan(r'[¹²³⁰⁴-⁹]'),
        "pua_markers":      _scan(r'[-]'),
        "formula_headers":  _scan_formula_headers(),
    }
    report = {
        "stem":      stem,
        "timestamp": datetime.now().strftime("%Y-%m-%d %H:%M"),
        "transforms": {
            **t_stats,
            "riduzione_pct": round(reduction),
        },
        "structure":    profile,
        "distribution": distribution,
        "anomalie": {
            "bare_headers":        len(bare_hdrs),
            "short_sections":      len(short_secs),
            "long_sections":       len(long_secs),
            "bare_headers_list":   bare_hdrs,
            "short_sections_list": short_secs,
            "long_sections_list":  long_secs,
        },
        "residui": {
            "backtick":                   len(residui["backtick"]),
            "dotleader":                  len(residui["dotleader"]),
            "url":                        len(residui["url"]),
            "immagini":                   len(residui["immagini"]),
            "br_inline":                  len(residui["br_inline"]),
            "simboli_encoding":           len(residui["simboli_encoding"]),
            "formule_inline":             len(residui["formule_inline"]),
            "footnote_markers":           len(residui["footnote_markers"]),
            "pua_markers":                len(residui["pua_markers"]),
            "backtick_esempi":            residui["backtick"],
            "dotleader_esempi":           residui["dotleader"],
            "url_esempi":                 residui["url"],
            "immagini_esempi":            residui["immagini"],
            "br_inline_esempi":           residui["br_inline"],
            "simboli_encoding_esempi":    residui["simboli_encoding"],
            "formule_inline_esempi":      residui["formule_inline"],
            "footnote_markers_esempi":    residui["footnote_markers"],
            "pua_markers_esempi":         residui["pua_markers"],
            "formula_headers":            len(residui["formula_headers"]),
            "formula_headers_esempi":     residui["formula_headers"],
        },
    }
    report_path = out_dir / "report.json"
    report_path.write_text(json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8")
    return report_path
@@ -0,0 +1,193 @@
 import json
 import sys
 import tempfile
 import threading
 import time
 from pathlib import Path
 from .extract    import validate_pdf, convert_pdf, extract_metadata
 from ._apply     import apply_transforms
 from .structure  import analyze
 from .report     import build_report
 from .validator  import _score, _grade
 _LIVELLO_DESC = {3: "ricca (h3)", 2: "parziale (h2)", 1: "paragrafi", 0: "testo piatto"}
 def _build_frontmatter(meta: dict) -> str:
    lines = ["---", f"source: {meta['source']}"]
    if meta["title"]:
        lines.append(f'title: "{meta["title"]}"')
    if meta["author"]:
        lines.append(f'author: "{meta["author"]}"')
    if meta["year"]:
        lines.append(f"year: {meta['year']}")
    if meta["pages"]:
        lines.append(f"pages: {meta['pages']}")
    lines += ["---", ""]
    return "\n".join(lines) + "\n"
 _SPIN_FRAMES = "⠋⠙⠹⠸⠼⠴⠦⠧⠇⠏"
 class _Spinner:
    """Spinner animato in un thread separato — mostra frame + tempo trascorso."""
    def __init__(self, prefix: str):
        self._prefix = prefix
        self._stop   = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._t0     = 0.0
    def __enter__(self):
        self._t0 = time.perf_counter()
        self._thread.start()
        return self
    def __exit__(self, *_):
        self._stop.set()
        self._thread.join()
        sys.stdout.write("\r" + " " * 72 + "\r")
        sys.stdout.flush()
    def _run(self):
        i = 0
        while not self._stop.wait(0.1):
            elapsed = time.perf_counter() - self._t0
            frame   = _SPIN_FRAMES[i % len(_SPIN_FRAMES)]
            sys.stdout.write(f"\r  {frame} {self._prefix}  {elapsed:.0f}s")
            sys.stdout.flush()
            i += 1
 def run(stem: str, project_root: Path, force: bool) -> bool:
    pdf_path  = project_root / "sources" / f"{stem}.pdf"
    out_dir   = project_root / "conversione" / stem
    raw_out   = out_dir / "raw.md"
    clean_out = out_dir / "clean.md"
    print(f"\n{'─' * 52}")
    print(f"  {stem}")
    print(f"{'─' * 52}")
    if clean_out.exists() and not force:
        print(f"  ⚠️  conversione/{stem}/clean.md già presente — skip")
        print(f"      (usa --force per rieseguire)")
        return True
    # [1] Validazione + metadati
    print("  [1/4] Validazione PDF...")
    pdf_mb = pdf_path.stat().st_size / (1024 * 1024) if pdf_path.exists() else 0
    print(f"     File: {pdf_path.name}  ({pdf_mb:.1f} MB)")
    ok, msg = validate_pdf(pdf_path)
    if not ok:
        print(f"  ✗ {msg}")
        return False
    print(f"  ✅ {msg}")
    meta = extract_metadata(pdf_path)
    if meta["title"]:
        print(f"     Titolo:  {meta['title']}")
    if meta["author"]:
        print(f"     Autore:  {meta['author']}")
    # [2] Conversione
    print("  [2/4] Conversione PDF → Markdown (opendataloader-pdf)...")
    with _Spinner("opendataloader-pdf in esecuzione...") as spinner:
        t0 = time.perf_counter()
        with tempfile.TemporaryDirectory() as tmp:
            try:
                md_file = convert_pdf(pdf_path, Path(tmp))
            except MemoryError:
                print("  ✗ Memoria esaurita durante la conversione")
                return False
            except Exception as e:
                print(f"  ✗ Conversione fallita: {e}")
                return False
            try:
                raw_text = md_file.read_text(encoding="utf-8")
            except UnicodeDecodeError as e:
                print(f"  ✗ Errore encoding nel file prodotto: {e}")
                return False
        elapsed = time.perf_counter() - t0
    size_kb = len(raw_text.encode()) // 1024
    n_lines = raw_text.count("\n")
    print(f"  ✅ Markdown grezzo: {size_kb} KB, {n_lines} righe  ({elapsed:.1f}s)")
    # [3] Pulizia strutturale
    print("  [3/4] Pulizia strutturale...")
    def _on_step(i: int, total: int, label: str) -> None:
        sys.stdout.write(f"\r     [{i}/{total}] {label:<45}")
        sys.stdout.flush()
    clean_text, t = apply_transforms(raw_text, on_step=_on_step)
    sys.stdout.write("\r" + " " * 72 + "\r")
    sys.stdout.flush()
    clean_text = _build_frontmatter(meta) + clean_text
    reduction = 100 * (1 - len(clean_text) / len(raw_text)) if raw_text else 0
    print(f"  ✅ Encoding")
    print(f"     Simboli PUA corretti:  {t['n_simboli_pua_corretti']}")
    print(f"     Accenti corretti:      {t['n_accenti_corretti']}")
    print(f"     Artefatti")
    print(f"     Immagini rimosse:      {t['n_immagini_rimosse']}")
    print(f"     <br> rimossi:          {t['n_br_rimossi']}")
    print(f"     Note rimosse:          {t['n_note_rimosse']}")
    print(f"     Dot-leader rimossi:    {t['n_dotleader_rimossi']}")
    print(f"     Righe ricorrenti rim.: {t['n_righe_ricorrenti_rimosse']}")
    print(f"     URL rimossi:           {t['n_url_rimossi']}")
    print(f"     Watermark rimossi:     {t['n_watermark_rimossi']}")
    print(f"     Header")
    print(f"     Header concat fixati:  {t['n_header_concat_fixati']}")
    print(f"     Header num. normaliz.: {t['n_header_numerati_normalizzati']}")
    print(f"     Struttura")
    print(f"     TOC rimosso:           {'sì' if t['toc_rimosso'] else 'no'}")
    print(f"     TOC orfani rimossi:    {t['n_toc_orfani_rimossi']}")
    print(f"     ALL-CAPS → ##:         {t['n_header_allcaps']}")
    print(f"     Sezioni → ###:         {t['n_sezioni_numerate']}")
    print(f"     Ambienti matematici:   {t['n_ambienti_matematici']}")
    print(f"     Articoli → ###:        {t['n_articoli_estratti']}")
    print(f"     Testo")
    print(f"     Paragrafi uniti:       {t['n_paragrafi_uniti']}")
    print(f"     Versi poesia riprist.: {t['n_versi_ripristinati']}")
    print(f"     Header verso demotati: {t['n_header_verso_demotati']}")
    print(f"     Rifinitura")
    print(f"     Garbage header rim.:   {t['n_garbage_headers_rimossi']}")
    print(f"     Titoli header uniti:   {t['n_titoli_uniti']}")
    print(f"     Formula-hdr demotati:  {t['n_formula_headers_demotati']}")
    print(f"     Frontmatter rimossi:   {t['n_frontmatter_rimossi']}")
    print(f"     Riduzione testo:       {reduction:.0f}%")
    # [4] Profilo strutturale
    print("  [4/4] Analisi struttura...")
    try:
        out_dir.mkdir(parents=True, exist_ok=True)
        raw_out.write_text(raw_text, encoding="utf-8")
        clean_out.write_text(clean_text, encoding="utf-8")
    except PermissionError as e:
        print(f"  ✗ Permesso negato durante la scrittura: {e}")
        return False
    profile = analyze(clean_out)
    (out_dir / "structure_profile.json").write_text(
        json.dumps(profile, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    print(f"  ✅ Struttura: livello {profile['livello_struttura']} — "
          f"{_LIVELLO_DESC[profile['livello_struttura']]}")
    print(f"     h1={profile['n_h1']}  h2={profile['n_h2']}  h3={profile['n_h3']}  "
          f"paragrafi={profile['n_paragrafi']}")
    print(f"     Strategia chunking: {profile['strategia_chunking']}")
    print(f"     Lingua rilevata:    {profile['lingua_rilevata']}")
    for w in profile["avvertenze"]:
        print(f"     ⚠️  {w}")
    report_path = build_report(stem, out_dir, clean_text, t, profile, reduction)
    report_data = json.loads(report_path.read_text(encoding="utf-8"))
    score, _    = _score(report_data)
    print(f"\n  Output → conversione/{stem}/")
    print(f"    raw.md   (immutabile)  clean.md   report.json")
    print(f"  Punteggio qualità: {score}/100 {_grade(score)}")
    return True
@@ -0,0 +1,141 @@
 import re
 from pathlib import Path
 # ─── Rilevamento lingua ───────────────────────────────────────────────────────
 _IT_WORDS = frozenset([
    "il", "la", "di", "e", "che", "non", "per", "un", "una", "si",
    "con", "da", "del", "della", "dei", "in", "ma", "se", "lo", "le",
    "gli", "al", "alla", "ai", "alle", "sono", "ha", "hanno", "era",
    "erano", "nel", "nella", "nei", "nelle", "questo", "questa", "così",
 ])
 _EN_WORDS = frozenset([
    "the", "of", "and", "to", "in", "is", "that", "it", "was", "for",
    "on", "are", "as", "with", "his", "they", "at", "be", "this", "have",
    "from", "or", "an", "but", "not", "by", "he", "she", "we", "you",
    "which", "their", "been", "has", "would", "there", "when", "will",
 ])
 _FR_WORDS = frozenset([
    "le", "les", "de", "du", "des", "et", "un", "une", "est", "que",
    "pour", "dans", "sur", "avec", "qui", "par", "pas", "plus", "au",
    "ce", "se", "ou", "mais", "comme", "aussi",
 ])
 _DE_WORDS = frozenset([
    "der", "die", "das", "und", "in", "von", "zu", "den", "mit", "ist",
    "auf", "eine", "als", "dem", "des", "sich", "nicht", "auch", "werden",
    "bei", "nach", "oder", "wenn", "wird", "war",
 ])
 _ES_WORDS = frozenset([
    "el", "los", "las", "de", "en", "un", "una", "es", "que", "por",
    "con", "del", "para", "como", "pero", "sus", "son", "los", "hay",
    "todo", "esta", "este", "ser", "más", "ya",
 ])
 def _detect_language(text: str) -> str:
    words  = re.findall(r"\b[a-zA-Z]{2,}\b", text.lower())
    sample = words[:2000]
    scores = {
        "it": sum(1 for w in sample if w in _IT_WORDS),
        "en": sum(1 for w in sample if w in _EN_WORDS),
        "fr": sum(1 for w in sample if w in _FR_WORDS),
        "de": sum(1 for w in sample if w in _DE_WORDS),
        "es": sum(1 for w in sample if w in _ES_WORDS),
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"
 # ─── Analisi struttura ────────────────────────────────────────────────────────
 def _count_headers(text: str, level: int) -> int:
    prefix = "#" * level + " "
    return len(re.findall(rf"(?m)^{re.escape(prefix)}", text))
 def _count_paragraphs(text: str) -> int:
    blocks = re.split(r"\n{2,}", text)
    return sum(1 for b in blocks if b.strip() and not re.match(r"^#+\s", b.strip()))
 def _split_sections(text: str, level: int) -> list[str]:
    prefix = "#" * level + " "
    parts  = re.split(rf"(?m)^{re.escape(prefix)}.+", text)
    return [p for p in parts[1:] if p.strip()]
 def _parse_sections_with_body(text: str, level: int = 3) -> list[tuple[str, str]]:
    """Restituisce lista di (header_line, body_text) per tutti gli header al livello dato."""
    prefix   = "#" * level + " "
    lines    = text.split("\n")
    sections: list[tuple[str, str]] = []
    cur_hdr:  str | None = None
    cur_body: list[str]  = []
    for line in lines:
        if line.startswith(prefix):
            if cur_hdr is not None:
                sections.append((cur_hdr, "\n".join(cur_body).strip()))
            cur_hdr  = line
            cur_body = []
        elif cur_hdr is not None:
            cur_body.append(line)
    if cur_hdr is not None:
        sections.append((cur_hdr, "\n".join(cur_body).strip()))
    return sections
 def analyze(md_path: Path) -> dict:
    text        = md_path.read_text(encoding="utf-8")
    n_h1        = _count_headers(text, 1)
    n_h2        = _count_headers(text, 2)
    n_h3        = _count_headers(text, 3)
    n_paragrafi = _count_paragraphs(text)
    if n_h3 >= 5:
        livello, boundary, strategia = 3, "h3", "h3_aware"
        section_bodies = _split_sections(text, 3)
        # Se h3 sono enormi e h2 più brevi, h2 è il boundary corretto
        if n_h2 >= 3:
            h2_bodies = _split_sections(text, 2)
            avg_h3 = sum(len(b) for b in section_bodies) / len(section_bodies) if section_bodies else 0
            avg_h2 = sum(len(b) for b in h2_bodies) / len(h2_bodies) if h2_bodies else 0
            if avg_h3 > 5000 and avg_h2 < avg_h3 * 0.7:
                livello, boundary, strategia = 2, "h2", "h2_paragraph_split"
                section_bodies = h2_bodies
    elif n_h2 >= 3:
        livello, boundary, strategia = 2, "h2", "h2_paragraph_split"
        section_bodies = _split_sections(text, 2)
    elif n_h1 + n_h2 + n_h3 >= 1:
        livello, boundary, strategia = 1, "paragrafo", "paragraph"
        section_bodies = [b for b in re.split(r"\n{2,}", text) if b.strip()]
    elif n_paragrafi >= 3:
        livello, boundary, strategia = 1, "paragrafo", "paragraph"
        section_bodies = [b for b in re.split(r"\n{2,}", text) if b.strip()]
    else:
        livello, boundary, strategia = 0, "nessuno", "sliding_window"
        section_bodies = [text] if text.strip() else []
    lengths          = [len(b) for b in section_bodies if b.strip()]
    lunghezza_media  = int(sum(lengths) / len(lengths)) if lengths else 0
    lingua           = _detect_language(text)
    avvertenze = []
    short = sum(1 for l in lengths if l < 200)
    long_ = sum(1 for l in lengths if l > 800)
    if short:
        avvertenze.append(f"{short} sezioni sotto i 200 caratteri — verranno accorpate")
    if long_:
        avvertenze.append(f"{long_} sezioni sopra i 800 caratteri — verranno divise")
    return {
        "livello_struttura":     livello,
        "n_h1":                  n_h1,
        "n_h2":                  n_h2,
        "n_h3":                  n_h3,
        "n_paragrafi":           n_paragrafi,
        "boundary_primario":     boundary,
        "lingua_rilevata":       lingua,
        "lunghezza_media_sezione": lunghezza_media,
        "strategia_chunking":    strategia,
        "avvertenze":            avvertenze,
    }
@@ -0,0 +1,152 @@
 import json
 import sys
 from pathlib import Path
 _GRADES = [(90, "A"), (75, "B"), (60, "C"), (40, "D"), (0, "F")]
 def _score(r: dict) -> tuple[int, list[str]]:
    """
    Voto 0-100 sulla qualità del clean.md per vettorizzazione.
    Penalità struttura:
      livello 0 (assente)  → −40
      livello 1 (piatto)   → −15
    Penalità residui (degradano il retrieval):
      backtick             → −2/cad  (max −20)
      dot-leader           → −5/cad  (max −10)
      URL/watermark        → −5/cad  (max −15)
      immagini             → −5/cad  (max −10)
      <br> inline          → −2/cad  (max −15)
      simboli encoding     → −1/cad  (max −10)
      formule inline [N.M] → −1/cad  (max −8)
      footnote residui     → −1/cad  (max −8)
      caratteri PUA        → −2/cad  (max −20)
    Penalità anomalie:
      bare headers         → −3/cad  (max −15)
    """
    score     = 100
    detail    = []
    structure = r.get("structure", {})
    anomalie  = r.get("anomalie",  {})
    residui   = r.get("residui",   {})
    livello = structure.get("livello_struttura", 0)
    if livello == 0:
        score -= 40
        detail.append("struttura assente −40")
    elif livello == 1:
        score -= 15
        detail.append("struttura piatta −15")
    def _pen(key: str, per_item: int, cap: int, label: str) -> None:
        n = residui.get(key, 0)
        if n:
            p = min(cap, n * per_item)
            nonlocal score
            score -= p
            detail.append(f"{label} ×{n} −{p}")
    _pen("backtick",         2, 20, "backtick")
    _pen("dotleader",        5, 10, "dot-leader")
    _pen("url",              5, 15, "url")
    _pen("immagini",         5, 10, "immagini")
    _pen("br_inline",        2, 15, "<br> inline")
    _pen("simboli_encoding", 1, 10, "simboli encoding")
    _pen("formule_inline",   1,  8, "formule inline")
    _pen("footnote_markers", 1,  8, "footnote residui")
    _pen("pua_markers",      2, 20, "caratteri PUA font Symbol")
    _pen("formula_headers",  3, 15, "formula/esercizio come header")
    n_bare = anomalie.get("bare_headers", 0)
    if n_bare:
        p = min(15, n_bare * 3)
        score -= p
        detail.append(f"bare headers ×{n_bare} −{p}")
    return max(0, score), detail
 def _grade(score: int) -> str:
    return next(g for threshold, g in _GRADES if score >= threshold)
 def validate(stems: list[str], project_root: Path, detail: bool = False) -> None:
    conv_dir = project_root / "conversione"
    paths = (
        [conv_dir / s / "report.json" for s in stems]
        if stems
        else sorted(conv_dir.glob("*/report.json"))
    )
    if not paths:
        print("Nessun report.json trovato in conversione/*/")
        sys.exit(0)
    rows = [
        json.loads(p.read_text(encoding="utf-8")) if p.exists()
        else {"stem": p.parent.name, "_missing": True}
        for p in paths
    ]
    col    = max(len(r.get("stem", "stem")) for r in rows) + 2
    header = (
        f"{'stem':<{col}}"
        f"{'h2':>4}{'h3':>5}  "
        f"{'strategia':<18}"
        f"{'bare':>5}{'corte':>6}{'lunghe':>7}"
        f"{'btk':>5}{'br':>4}{'enc':>4}{'url':>4}{'fhdr':>5}"
        f"{'med':>6}"
        f"  {'voto':>4}  grade"
    )
    sep = "─" * len(header)
    print(f"\n{header}\n{sep}")
    scores = []
    for r in rows:
        if r.get("_missing"):
            print(f"{r['stem']:<{col}}  (report.json non trovato)")
            continue
        st   = r.get("structure",    {})
        an   = r.get("anomalie",     {})
        res  = r.get("residui",      {})
        dist = r.get("distribution", {})
        s, pen = _score(r)
        scores.append(s)
        print(
            f"{r['stem']:<{col}}"
            f"{st.get('n_h2',              0):>4}"
            f"{st.get('n_h3',              0):>5}  "
            f"{st.get('strategia_chunking','?'):<18}"
            f"{an.get('bare_headers',      0):>5}"
            f"{an.get('short_sections',    0):>6}"
            f"{an.get('long_sections',     0):>7}"
            f"{res.get('backtick',         0):>5}"
            f"{res.get('br_inline',        0):>4}"
            f"{res.get('simboli_encoding', 0):>4}"
            f"{res.get('url',              0):>4}"
            f"{res.get('formula_headers',  0):>5}"
            f"{dist.get('mediana',         0):>6}"
            f"  {s:>4}  {_grade(s)}"
        )
        if detail and pen:
            for p in pen:
                print(f"  {'':>{col}}  ↳ {p}")
    print(sep)
    if scores:
        media = sum(scores) / len(scores)
        print(
            f"Documenti: {len(scores)}   "
            f"Media: {media:.0f}/100 {_grade(int(media))}   "
            f"(A≥90  B≥75  C≥60  D≥40  F<40)"
        )
    print(
        "\nColonne: bare=header vuoti  corte=sez<150ch  lunghe=sez>1500ch  "
        "btk=backtick  br=<br>inline  enc=simboli encoding  fhdr=formula-header  med=mediana chars\n"
    )
@@ -4,10 +4,30 @@ set -euo pipefail
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 cd "$SCRIPT_DIR"
-mapfile -t dirs < <(find . -maxdepth 1 -mindepth 1 -type d | sort)
+STEM="${1:-}"
 if [[ -n "$STEM" ]]; then
    # ── Modalità singolo stem ─────────────────────────────────────────────
    target="./$STEM"
    if [[ ! -d "$target" ]]; then
        echo "Errore: cartella '$STEM' non trovata in conversione/."
        exit 1
    fi
    rm -rf "$target"
    echo "Rimossa: conversione/$STEM/"
    exit 0
 fi
 # ── Modalità batch: tutti gli output (escluse cartelle infrastruttura) ────
 mapfile -t dirs < <(
    find . -maxdepth 1 -mindepth 1 -type d \
        ! -name '_*' \
        ! -name '__*' \
    | sort
 )
 if [[ ${#dirs[@]} -eq 0 ]]; then
-    echo "Nessuna cartella da cancellare."
+    echo "Nessuna cartella di output da cancellare."
    exit 0
 fi
@@ -16,10 +36,8 @@ for d in "${dirs[@]}"; do
    echo "  $d"
 done
-if [[ "${1:-}" != "-f" ]]; then
+read -r -p "Confermi? [s/N] " answer
-    read -r -p "Confermi? [s/N] " answer
+[[ "$answer" =~ ^[sS]$ ]] || { echo "Annullato."; exit 0; }
    [[ "$answer" =~ ^[sS]$ ]] || { echo "Annullato."; exit 0; }
 fi
 for d in "${dirs[@]}"; do
    rm -rf "$d"
@@ -1,210 +0,0 @@
 #!/usr/bin/env python3
 """
 conversione/validate.py — Validazione qualità Markdown
 Legge i report.json prodotti da pipeline.py, stampa una tabella di stato
 e assegna un voto (0-100) a ogni documento.
  90-100  A  — ottimo, pronto per il chunker
  75-89   B  — buono, qualche sezione lunga ma accettabile
  60-74   C  — accettabile, anomalie minori da verificare
  40-59   D  — da rivedere, problemi strutturali o residui evidenti
   0-39   F  — da riprocessare, struttura assente o testo corrotto
 Uso:
    python conversione/validate.py              # tutti gli stem
    python conversione/validate.py analisi1     # stem specifico
    python conversione/validate.py a b c        # stem multipli
    python conversione/validate.py --detail analisi1  # mostra dettaglio penalità
 """
 import argparse
 import json
 import sys
 from pathlib import Path
 # ─── Punteggio ───────────────────────────────────────────────────────────────
 _GRADES = [(90, "A"), (75, "B"), (60, "C"), (40, "D"), (0, "F")]
 def _score(r: dict) -> tuple[int, list[str]]:
    """
    Calcola un punteggio 0-100 sulla qualità del clean.md ai fini della
    suddivisione in chunk e vettorizzazione.
    Restituisce (score, lista_penalità_applicate).
    Penalità struttura (il chunker non può operare senza header):
      struttura assente (livello 0)    → −40
      struttura piatta (livello 1)     → −15
    Penalità residui (finiscono nei vettori e degradano il retrieval):
      backtick                         → −2/cad  (max −20)
      dot-leader                       → −5/cad  (max −10)
      URL / watermark                  → −5/cad  (max −15)
      immagini residue                 → −5/cad  (max −10)
      <br> inline (artefatti tabelle)  → −2/cad  (max −15)
      simboli encoding (!/" residui)   → −1/cad  (max −10)
      formule inline [N.M]             → −1/cad  (max −8)
    Penalità anomalie:
      bare headers                     → −3/cad  (max −15)
    Non penalizzate (il chunker le normalizza):
      sezioni corte, sezioni lunghe, mediana, p25
    """
    score  = 100
    detail = []
    structure = r.get("structure", {})
    anomalie  = r.get("anomalie",  {})
    residui   = r.get("residui",   {})
    livello = structure.get("livello_struttura", 0)
    # ── Struttura ─────────────────────────────────────────────────────────
    if livello == 0:
        score -= 40
        detail.append("struttura assente −40")
    elif livello == 1:
        score -= 15
        detail.append("struttura piatta −15")
    # ── Residui ───────────────────────────────────────────────────────────
    def _pen(key: str, per_item: int, cap: int, label: str) -> None:
        n = residui.get(key, 0)
        if n:
            p = min(cap, n * per_item)
            nonlocal score
            score -= p
            detail.append(f"{label} ×{n} −{p}")
    _pen("backtick",         2, 20, "backtick")
    _pen("dotleader",        5, 10, "dot-leader")
    _pen("url",              5, 15, "url")
    _pen("immagini",         5, 10, "immagini")
    _pen("br_inline",        2, 15, "<br> inline")
    _pen("simboli_encoding", 1, 10, "simboli encoding")
    _pen("formule_inline",   1,  8, "formule inline")
    _pen("footnote_markers", 1,  8, "footnote residui")
    _pen("pua_markers",      2, 20, "caratteri PUA font Symbol")
    # ── Anomalie ──────────────────────────────────────────────────────────
    n_bare = anomalie.get("bare_headers", 0)
    if n_bare:
        p = min(15, n_bare * 3)
        score -= p
        detail.append(f"bare headers ×{n_bare} −{p}")
    return max(0, score), detail
 def _grade(score: int) -> str:
    return next(g for threshold, g in _GRADES if score >= threshold)
 # ─── Validazione ─────────────────────────────────────────────────────────────
 def validate(stems: list[str], project_root: Path, detail: bool = False) -> None:
    conv_dir = project_root / "conversione"
    paths = (
        [conv_dir / s / "report.json" for s in stems]
        if stems
        else sorted(conv_dir.glob("*/report.json"))
    )
    if not paths:
        print("Nessun report.json trovato in conversione/*/")
        sys.exit(0)
    rows = [
        json.loads(p.read_text(encoding="utf-8")) if p.exists()
        else {"stem": p.parent.name, "_missing": True}
        for p in paths
    ]
    # ── Intestazione ─────────────────────────────────────────────────────
    col = max(len(r.get("stem", "stem")) for r in rows) + 2
    header = (
        f"{'stem':<{col}}"
        f"{'h2':>4}{'h3':>5}  "
        f"{'strategia':<18}"
        f"{'bare':>5}{'corte':>6}{'lunghe':>7}"
        f"{'btk':>5}{'br':>4}{'enc':>4}{'url':>4}"
        f"{'med':>6}"
        f"  {'voto':>4}  grade"
    )
    sep = "─" * len(header)
    print(f"\n{header}\n{sep}")
    scores = []
    # ── Righe ─────────────────────────────────────────────────────────────
    for r in rows:
        if r.get("_missing"):
            print(f"{r['stem']:<{col}}  (report.json non trovato)")
            continue
        st   = r.get("structure",    {})
        an   = r.get("anomalie",     {})
        res  = r.get("residui",      {})
        dist = r.get("distribution", {})
        s, pen = _score(r)
        scores.append(s)
        print(
            f"{r['stem']:<{col}}"
            f"{st.get('n_h2',              0):>4}"
            f"{st.get('n_h3',              0):>5}  "
            f"{st.get('strategia_chunking','?'):<18}"
            f"{an.get('bare_headers',      0):>5}"
            f"{an.get('short_sections',    0):>6}"
            f"{an.get('long_sections',     0):>7}"
            f"{res.get('backtick',         0):>5}"
            f"{res.get('br_inline',        0):>4}"
            f"{res.get('simboli_encoding', 0):>4}"
            f"{res.get('url',              0):>4}"
            f"{dist.get('mediana',         0):>6}"
            f"  {s:>4}  {_grade(s)}"
        )
        if detail and pen:
            for p in pen:
                print(f"  {'':>{col}}  ↳ {p}")
    # ── Riepilogo ─────────────────────────────────────────────────────────
    print(sep)
    if scores:
        media = sum(scores) / len(scores)
        print(
            f"Documenti: {len(scores)}   "
            f"Media: {media:.0f}/100 {_grade(int(media))}   "
            f"(A≥90  B≥75  C≥60  D≥40  F<40)"
        )
    print(
        "\nColonne: bare=header vuoti  corte=sez<150ch  lunghe=sez>1500ch  "
        "btk=backtick  br=<br>inline  enc=simboli encoding  med=mediana chars\n"
    )
 # ─── Entry point ─────────────────────────────────────────────────────────────
 if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Valida i report Markdown prodotti da pipeline.py",
        epilog="Senza argomenti valida tutti gli stem in conversione/*/",
    )
    parser.add_argument(
        "stems",
        nargs="*",
        metavar="STEM",
        help="stem da validare (es: analisi1). Ometti per tutti.",
    )
    parser.add_argument(
        "--detail", "-d",
        action="store_true",
        help="mostra dettaglio penalità per ogni documento",
    )
    args = parser.parse_args()
    validate(args.stems, Path(__file__).parent.parent, detail=args.detail)
@@ -1,113 +0,0 @@
 # Ollama — Verifica Ambiente
 Prima di procedere con la vettorizzazione (step 8) devi avere installato:
 - **Ollama** — server locale per LLM e embedding
 - un **modello di embedding** (es. `qwen3-embedding:0.6b`, `bge-m3`)
 - un **modello LLM** (es. `qwen3.5:4b`)
 - **chromadb** — libreria Python per il vector store
 ---
 ## 1. Installa Ollama
 ```bash
 curl -fsSL https://ollama.com/install.sh | sh
 ```
 Verifica che il servizio sia attivo:
 ```bash
 ollama list
 ```
 ### Disinstalla Ollama
 ```bash
 # Ferma e rimuovi il servizio systemd
 sudo systemctl stop ollama
 sudo systemctl disable ollama
 sudo rm /etc/systemd/system/ollama.service
 sudo systemctl daemon-reload
 # Rimuovi il binario
 sudo rm /usr/local/bin/ollama
 # Rimuovi modelli e dati (opzionale)
 sudo rm -rf /usr/share/ollama
 # Rimuovi utente e gruppo di sistema (opzionale)
 sudo userdel ollama
 sudo groupdel ollama
 ```
 ---
 ## 2. Scarica i modelli
 ### Modello di embedding (consigliato)
 ```bash
 ollama pull qwen3-embedding:0.6b
 ```
 Alternative supportate:
 - `nomic-embed-text-v2-moe`
 - `bge-m3`
 - `nomic-embed-text`
 Se cambi embedding model rispetto a quello usato in step-8, riesegui ingest con `--force` e aggiorna `EMBED_MODEL` in `config.py`.
 ### Modello LLM (consigliato per 8 GB RAM)
 ```bash
 ollama pull qwen3.5:4b
 ```
 Se usi un modello diverso, aggiorna `OLLAMA_MODEL` in `config.py`.
 ### Disinstalla un modello
 ```bash
 ollama rm qwen3.5:4b
 ollama rm qwen3-embedding:0.6b
 ```
 ---
 ## 3. Installa le dipendenze Python
 ```bash
 source .venv/bin/activate
 pip install -r requirements.txt
 ```
 ---
 ## 4. Verifica ambiente
 ```bash
 source .venv/bin/activate
 python ollama/check_env.py
 ```
 Output atteso (esempio):
 ```text
 ✅ ollama trovato nel PATH
 ✅ ollama risponde correttamente
 ✅ embedding disponibile: qwen3-embedding:0.6b
 ✅ LLM disponibile: qwen3.5:4b
 ✅ chromadb importabile
 ✅ Ambiente pronto — procedi con la vettorizzazione:
   python step-8/ingest.py --stem <nome>
 ```
 ---
 ## Prossimo step
 ```bash
 python step-8/ingest.py --stem <nome>
 ```
@@ -1,250 +0,0 @@
 #!/usr/bin/env python3
 """
 Verifica ambiente Ollama
 Controlla che tutti i prerequisiti per la vettorizzazione siano soddisfatti:
  1. ollama è nel PATH e risponde
  2. Almeno un modello di embedding è scaricato
  3. Almeno un modello LLM è scaricato
  4. chromadb è importabile
 Output: report a schermo con ✅ / ❌ per ogni componente.
 Nessun file scritto. Exit 0 se tutto OK, 1 altrimenti.
 Uso:
    python ollama/check_env.py
 """
 import shutil
 import subprocess
 import sys
 from pathlib import Path
 # ─── Lista canonica di modelli embedding supportati ───────────────────────────
 # Ordine: prima scelta → ultima scelta (come da ollama/README.md)
 EMBED_MODELS = [
    "qwen3-embedding",
    "nomic-embed-text-v2-moe",
    "bge-m3",
    "nomic-embed-text",
    "mxbai-embed-large",
    "paraphrase-multilingual",
    "all-minilm",
 ]
 EMBED_MODEL_PREFIXES = tuple(EMBED_MODELS)
 OLLAMA_SERVE_HINT = "   → Avvia il servizio con: ollama serve"
 RECOMMENDED_EMBED_MODEL = "qwen3-embedding:0.6b"
 RECOMMENDED_LLM_MODEL = "qwen3.5:4b"
 def _is_embed(model_name: str) -> bool:
    """True se il modello è riconosciuto come embedding (lista canonica o keyword)."""
    base = model_name.split(":")[0].lower()
    return base.startswith(EMBED_MODEL_PREFIXES) or "embed" in base
 def _parse_ollama_models(raw_output: str) -> list[str]:
    """Estrae i nomi modello dall'output di `ollama list`."""
    models: list[str] = []
    for idx, line in enumerate(raw_output.splitlines()):
        line = line.strip()
        if not line:
            continue
        # Prima riga: header tabellare ("NAME ...")
        if idx == 0 and line.lower().startswith("name"):
            continue
        model_name = line.split(maxsplit=1)[0]
        models.append(model_name)
    return models
 sys.path.insert(0, str(Path(__file__).parent.parent))
 try:
    from config import EMBED_MODEL as CONFIGURED_EMBED, OLLAMA_MODEL as CONFIGURED_LLM
 except Exception:
    CONFIGURED_EMBED = None
    CONFIGURED_LLM = None
 REQUIRED_LIBS = ["chromadb"]
 # ─── Checks ───────────────────────────────────────────────────────────────────
 def _print_model_list(title: str, models: list[str]) -> None:
    """Stampa in modo uniforme una lista di modelli."""
    if not models:
        print(f"   {title}: nessuno")
        return
    print(f"   {title} ({len(models)}):")
    for model in models:
        print(f"   - {model}")
 def check_ollama_in_path() -> bool:
    """Verifica che ollama sia nel PATH."""
    found = shutil.which("ollama") is not None
    if found:
        print("✅ ollama trovato nel PATH")
    else:
        print("❌ ollama non trovato nel PATH")
        print("   → Installa con: curl -fsSL https://ollama.com/install.sh | sh")
    return found
 def check_ollama_running() -> list[str] | None:
    """
    Esegue 'ollama list' e ritorna la lista dei modelli disponibili.
    Ritorna None se ollama non risponde.
    """
    try:
        result = subprocess.run(
            ["ollama", "list"],
            capture_output=True, text=True, timeout=10
        )
        if result.returncode != 0:
            print("❌ ollama non risponde (errore all'avvio)")
            print(OLLAMA_SERVE_HINT)
            return None
        models = _parse_ollama_models(result.stdout)
        print("✅ ollama risponde correttamente")
        return models
    except FileNotFoundError:
        print("❌ ollama non trovato (FileNotFoundError)")
        return None
    except subprocess.TimeoutExpired:
        print("❌ ollama non risponde (timeout)")
        print(OLLAMA_SERVE_HINT)
        return None
 def _match(model_name: str, available: list[str]) -> str | None:
    """
    Ritorna il nome completo del modello trovato in 'available' che corrisponde
    a 'model_name' (confronto per prefisso), oppure None.
    """
    for m in available:
        if m == model_name or m.startswith(model_name + ":") or m.startswith(model_name + "-"):
            return m
    return None
 def _check_configured_model(
    configured_name: str | None,
    available: list[str],
    label: str,
 ) -> bool | None:
    """
    Se esiste un modello configurato, lo verifica e ritorna True/False.
    Se non è configurato, ritorna None (il chiamante userà il fallback).
    """
    if not configured_name:
        return None
    print(f"   modello configurato (config.py): {configured_name}")
    found = _match(configured_name, available)
    if found:
        print(f"✅ {label} disponibile: {found}")
        return True
    print(f"❌ {configured_name} non trovato in Ollama")
    print(f"   → ollama pull {configured_name}")
    return False
 def check_embed_model(available: list[str]) -> bool:
    """Verifica che il modello di embedding configurato sia disponibile."""
    configured_check = _check_configured_model(CONFIGURED_EMBED, available, "embedding")
    if configured_check is not None:
        return configured_check
    # fallback: config.py non leggibile
    found = next((m for m in available if _is_embed(m)), None)
    if found:
        print(f"✅ modello embedding trovato: {found}")
        return True
    print("❌ nessun modello di embedding trovato")
    print(f"   → Prima scelta: ollama pull {RECOMMENDED_EMBED_MODEL}")
    return False
 def check_llm_model(available: list[str]) -> bool:
    """Verifica che il modello LLM configurato sia disponibile."""
    configured_check = _check_configured_model(CONFIGURED_LLM, available, "LLM")
    if configured_check is not None:
        return configured_check
    # fallback: config.py non leggibile
    first_llm = next((m for m in available if not _is_embed(m)), None)
    if first_llm:
        print(f"✅ modello LLM trovato: {first_llm}")
        return True
    print("❌ nessun modello LLM trovato")
    print(f"   → Consigliato per 8 GB RAM: ollama pull {RECOMMENDED_LLM_MODEL}")
    return False
 def check_library(lib: str) -> bool:
    """Verifica che una libreria Python sia importabile."""
    try:
        __import__(lib)
        print(f"✅ {lib} importabile")
        return True
    except ImportError:
        print(f"❌ {lib} non importabile")
        print(f"   → Installa con: pip install {lib}")
        return False
 # ─── Entry point ──────────────────────────────────────────────────────────────
 def main() -> int:
    print("─── Verifica ambiente Ollama ─────────────────────────────────────────\n")
    results: list[bool] = []
    # 1. ollama nel PATH
    in_path = check_ollama_in_path()
    results.append(in_path)
    # 2. ollama risponde + modelli
    if in_path:
        available = check_ollama_running()
        if available is None:
            results.extend([False, False, False])
        else:
            results.append(True)
            _print_model_list(
                "modelli embedding rilevati",
                [m for m in available if _is_embed(m)],
            )
            _print_model_list(
                "modelli LLM rilevati",
                [m for m in available if not _is_embed(m)],
            )
            results.append(check_embed_model(available))
            results.append(check_llm_model(available))
    else:
        results.extend([False, False, False])
        print("⚠️  modelli non verificabili (ollama non trovato)")
    # 3. Librerie Python
    print()
    for lib in REQUIRED_LIBS:
        results.append(check_library(lib))
    # ── Riepilogo ─────────────────────────────────────────────────────────────
    print()
    print("──────────────────────────────────────────────────────────────────────")
    all_ok = all(results)
    if all_ok:
        print("✅ Ambiente pronto")
    else:
        n_fail = sum(1 for r in results if not r)
        print(f"⚠️  {n_fail} problema/i rilevato/i — risolvi prima di procedere.")
    return 0 if all_ok else 1
 if __name__ == "__main__":
    sys.exit(main())
@@ -1,66 +0,0 @@
 #!/usr/bin/env python3
 """
 Test chat locale Ollama — senza RAG, senza ChromaDB.
 Uso: python ollama/test_ollama.py
 """
 import json
 import sys
 import urllib.error
 import urllib.request
 from pathlib import Path
 sys.path.insert(0, str(Path(__file__).parent.parent))
 import config as _cfg
 OLLAMA_URL  = _cfg.OLLAMA_URL
 MODEL       = _cfg.OLLAMA_MODEL
 TEMPERATURE = _cfg.TEMPERATURE
 NO_THINK    = _cfg.NO_THINK
 def chat(prompt: str) -> str:
    payload = json.dumps({
        "model": MODEL,
        "prompt": prompt,
        "stream": False,
        "think": not NO_THINK,
        "options": {"temperature": TEMPERATURE},
    }).encode()
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.loads(resp.read())["response"].strip()
 def main() -> int:
    print(f"─── Chat Ollama ──────────────────────────────── (exit per uscire)")
    print(f"  Modello   : {MODEL}")
    print(f"  Thinking  : {'off' if NO_THINK else 'on'}")
    print()
    while True:
        try:
            user = input("Tu: ").strip()
        except (EOFError, KeyboardInterrupt):
            print("\nUscita.")
            break
        if not user:
            continue
        if user.lower() == "exit":
            break
        try:
            reply = chat(user)
            print(f"\nAssistente: {reply}\n")
        except (urllib.error.URLError, OSError) as e:
            print(f"❌ Errore: {e}")
    return 0
 if __name__ == "__main__":
    sys.exit(main())
@@ -1,252 +0,0 @@
 #!/usr/bin/env python3
 """
 Pipeline RAG interattiva
 Riceve una domanda, recupera i chunk più rilevanti da ChromaDB (retrieval)
 e genera una risposta tramite Ollama (generation).
 Input:  chroma_db/<stem> (collection ChromaDB)
 Output: risposta a schermo
 Uso:
    python rag.py --stem <nome>
 Nel loop interattivo:
    Domanda: <testo>       → risposta
    Domanda: <testo> -v    → risposta + chunk recuperati
    Domanda: exit          → uscita
 """
 import argparse
 import json
 import sys
 import urllib.error
 import urllib.request
 from pathlib import Path
 import chromadb
 # ─── Configurazione ───────────────────────────────────────────────────────────
 sys.path.insert(0, str(Path(__file__).parent))
 import config as _cfg
 project_root = Path(__file__).parent
 CHROMA_DIR   = project_root / "chroma_db"
 OLLAMA_URL    = _cfg.OLLAMA_URL
 EMBED_MODEL   = _cfg.EMBED_MODEL
 LLM_MODEL     = _cfg.OLLAMA_MODEL
 TOP_K         = _cfg.TOP_K
 TEMPERATURE   = _cfg.TEMPERATURE
 NO_THINK      = _cfg.NO_THINK
 SYSTEM_PROMPT = _cfg.SYSTEM_PROMPT
 # ─── Embedding ────────────────────────────────────────────────────────────────
 def embed(text: str) -> list[float]:
    """Genera il vettore della domanda tramite Ollama."""
    payload = json.dumps({"model": EMBED_MODEL, "prompt": text}).encode()
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())["embedding"]
 # ─── Generazione ──────────────────────────────────────────────────────────────
 def call_ollama(prompt: str, system: str = "") -> str:
    """Chiama Ollama /api/generate e ritorna la risposta."""
    payload = json.dumps({
        "model": LLM_MODEL,
        "system": system,
        "prompt": prompt,
        "stream": False,
        "think": not NO_THINK,
        "options": {"temperature": TEMPERATURE},
    }).encode()
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.loads(resp.read())["response"].strip()
 # ─── Retrieval ────────────────────────────────────────────────────────────────
 def retrieve(collection: chromadb.Collection, question: str) -> list[dict]:
    """
    Genera l'embedding della domanda e recupera i TOP_K chunk più simili.
    Ritorna lista di dict con chiavi: text, sezione, titolo, distance.
    """
    vector = embed(question)
    results = collection.query(
        query_embeddings=[vector],
        n_results=TOP_K,
        include=["documents", "metadatas", "distances"],
    )
    chunks = []
    for text, meta, dist in zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0],
    ):
        chunks.append({
            "text":     text,
            "sezione":  meta.get("sezione", ""),
            "titolo":   meta.get("titolo", ""),
            "distance": dist,
        })
    return chunks
 # ─── Prompt ───────────────────────────────────────────────────────────────────
 def build_prompt(question: str, chunks: list[dict]) -> str:
    """Ritorna (system, user_prompt) separati per l'API Ollama."""
    context_parts = []
    for i, c in enumerate(chunks, start=1):
        header = f"[Contesto {i}"
        if c["sezione"]:
            header += f" — {c['sezione']}"
            if c["titolo"]:
                header += f" > {c['titolo']}"
        header += "]"
        context_parts.append(f"{header}\n{c['text']}")
    context = "\n\n".join(context_parts)
    user_prompt = f"{context}\n\nDomanda: {question}"
    return SYSTEM_PROMPT, user_prompt
 # ─── Loop interattivo ─────────────────────────────────────────────────────────
 def answer(question: str, collection: chromadb.Collection, verbose: bool) -> None:
    try:
        chunks = retrieve(collection, question)
    except (urllib.error.URLError, OSError) as e:
        print(f"❌ Errore embedding: {e}")
        return
    if verbose:
        print("\n── Chunk recuperati ──────────────────────────────────────────")
        for i, c in enumerate(chunks, start=1):
            loc = c["sezione"]
            if c["titolo"]:
                loc += f" > {c['titolo']}"
            sim = 1 - c["distance"]
            print(f"  [{i}] {loc}  (similarità: {sim:.3f})")
            print(f"      {c['text'][:120].replace(chr(10), ' ')}...")
        print("──────────────────────────────────────────────────────────────\n")
    system, prompt = build_prompt(question, chunks)
    try:
        response = call_ollama(prompt, system=system)
    except (urllib.error.URLError, OSError) as e:
        print(f"❌ Errore generazione: {e}")
        return
    print(f"\n{response}\n")
 def run_loop(collection: chromadb.Collection) -> None:
    print("── Loop RAG ─────────────────────────────────────── (exit per uscire)\n")
    while True:
        try:
            raw = input("Domanda: ").strip()
        except (EOFError, KeyboardInterrupt):
            print("\nUscita.")
            break
        if not raw:
            continue
        if raw.lower() == "exit":
            break
        verbose = raw.endswith(" -v")
        question = raw[:-3].strip() if verbose else raw
        answer(question, collection, verbose)
 # ─── Entry point ──────────────────────────────────────────────────────────────
 def _build_epilog() -> str:
    lines = [
        "Uso:",
        "  python rag.py --stem <nome>",
        "",
        "Loop interattivo:",
        "  <domanda>       risposta basata sul documento",
        "  <domanda> -v    risposta + chunk recuperati con score di similarità",
        "  exit            termina",
    ]
    if CHROMA_DIR.exists():
        try:
            client = chromadb.PersistentClient(path=str(CHROMA_DIR))
            names = [c.name for c in client.list_collections()]
            if names:
                lines += ["", f"Collection disponibili: {', '.join(names)}"]
            else:
                lines += ["", "Nessuna collection trovata — eseguire prima: python step-8/ingest.py"]
        except Exception:
            pass
    return "\n".join(lines)
 def main() -> int:
    parser = argparse.ArgumentParser(
        description=(
            "Pipeline RAG interattiva\n\n"
            "Risponde a domande in linguaggio naturale su un documento\n"
            "indicizzato in ChromaDB da step-8/ingest.py."
        ),
        epilog=_build_epilog(),
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    parser.add_argument(
        "--stem",
        required=True,
        help=(
            "Nome della collection ChromaDB da interrogare. "
            "Le collection vengono create da: python step-8/ingest.py --stem <nome>"
        ),
    )
    args = parser.parse_args()
    print("─── Pipeline RAG ────────────────────────────────────────────\n")
    print(f"  Documento : {args.stem}")
    print(f"  Modello   : {LLM_MODEL}")
    print(f"  Top-K     : {TOP_K}")
    print(f"  Thinking  : {'off' if NO_THINK else 'on'}")
    print()
    if not CHROMA_DIR.exists():
        print("❌ chroma_db/ non trovata — esegui prima step-8")
        return 1
    client = chromadb.PersistentClient(path=str(CHROMA_DIR))
    collections = [c.name for c in client.list_collections()]
    if args.stem not in collections:
        print(f"❌ Collection '{args.stem}' non trovata in chroma_db/")
        print(f"   → python step-8/ingest.py --stem {args.stem}")
        return 1
    collection = client.get_collection(args.stem)
    print(f"✅ Collection '{args.stem}' caricata ({collection.count()} chunk)\n")
    run_loop(collection)
    return 0
 if __name__ == "__main__":
    sys.exit(main())
@@ -1,4 +1,3 @@
 pdfplumber==0.11.9
 pymupdf4llm
 opendataloader-pdf
 chromadb
@@ -1,217 +0,0 @@
 #!/usr/bin/env python3
 """
 Retrieval puro (senza generazione LLM)
 Loop interattivo: inserisci una query, ottieni i chunk più simili dalla
 collection ChromaDB tramite embedding semantico — senza chiamare Ollama
 per la generation.
 Utile per:
  - verificare la qualità del retrieval prima di diagnosticare risposte sbagliate
  - controllare che i chunk giusti vengano recuperati per una query
  - usare la pipeline come motore di ricerca semantica
 Input:  chroma_db/<stem> (collection ChromaDB)
 Output: lista chunk con score di similarità
 Uso:
    python retrieve.py --stem <nome>
 Nel loop interattivo:
    Query: <testo>      → chunk più simili con score
    Query: <testo> -f   → testo completo dei chunk
    Query: exit         → uscita
 """
 import argparse
 import json
 import sys
 import urllib.error
 import urllib.request
 from pathlib import Path
 import chromadb
 # ─── Configurazione ───────────────────────────────────────────────────────────
 sys.path.insert(0, str(Path(__file__).parent))
 import config as _cfg
 project_root = Path(__file__).parent
 CHROMA_DIR   = project_root / "chroma_db"
 OLLAMA_URL  = _cfg.OLLAMA_URL
 EMBED_MODEL = _cfg.EMBED_MODEL
 TOP_K       = _cfg.TOP_K
 # ─── Embedding ────────────────────────────────────────────────────────────────
 def embed(text: str) -> list[float]:
    """Genera il vettore della query tramite Ollama."""
    payload = json.dumps({"model": EMBED_MODEL, "prompt": text}).encode()
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())["embedding"]
 # ─── Retrieval ────────────────────────────────────────────────────────────────
 def retrieve(collection: chromadb.Collection, query: str, top_k: int) -> list[dict]:
    """
    Genera l'embedding della query e recupera i top_k chunk più simili.
    Ritorna lista di dict con chiavi: rank, similarity, sezione, titolo, text.
    """
    vector = embed(query)
    results = collection.query(
        query_embeddings=[vector],
        n_results=top_k,
        include=["documents", "metadatas", "distances"],
    )
    chunks = []
    for rank, (text, meta, dist) in enumerate(
        zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
        ),
        start=1,
    ):
        chunks.append({
            "rank":       rank,
            "similarity": round(1 - dist, 4),
            "sezione":    meta.get("sezione", ""),
            "titolo":     meta.get("titolo", ""),
            "text":       text,
        })
    return chunks
 # ─── Output ───────────────────────────────────────────────────────────────────
 def print_results(chunks: list[dict], full: bool = False) -> None:
    print(f"── {len(chunks)} chunk recuperati ─────────────────────────────────\n")
    for c in chunks:
        loc = c["sezione"]
        if c["titolo"]:
            loc += f" > {c['titolo']}"
        print(f"  [{c['rank']}] similarità: {c['similarity']:.4f}  |  {loc}")
        if full:
            print()
            print(c["text"])
        else:
            print(f"      {c['text'][:200].replace(chr(10), ' ')}")
            if len(c["text"]) > 200:
                print(f"      … ({len(c['text'])} caratteri totali)")
        print()
 # ─── Loop interattivo ─────────────────────────────────────────────────────────
 def run_loop(collection: chromadb.Collection, top_k: int) -> None:
    print("── Loop retrieval ──────────────────────── (exit per uscire, -f per testo completo)\n")
    while True:
        try:
            raw = input("Query: ").strip()
        except (EOFError, KeyboardInterrupt):
            print("\nUscita.")
            break
        if not raw:
            continue
        if raw.lower() == "exit":
            break
        full = raw.endswith(" -f")
        query = raw[:-3].strip() if full else raw
        try:
            chunks = retrieve(collection, query, top_k)
        except (urllib.error.URLError, OSError) as e:
            print(f"❌ Errore embedding (Ollama raggiungibile?): {e}\n")
            continue
        print()
        print_results(chunks, full=full)
 # ─── Entry point ──────────────────────────────────────────────────────────────
 def _build_epilog() -> str:
    lines = [
        "Uso:",
        "  python retrieve.py --stem <nome>",
        "",
        "Nel loop interattivo:",
        "  <query>       chunk più simili con score (testo troncato)",
        "  <query> -f    testo completo dei chunk",
        "  exit          termina",
    ]
    if CHROMA_DIR.exists():
        try:
            client = chromadb.PersistentClient(path=str(CHROMA_DIR))
            names = [c.name for c in client.list_collections()]
            if names:
                lines += ["", f"Collection disponibili: {', '.join(names)}"]
            else:
                lines += ["", "Nessuna collection trovata — eseguire prima: python step-8/ingest.py"]
        except Exception:
            pass
    return "\n".join(lines)
 def main() -> int:
    parser = argparse.ArgumentParser(
        description=(
            "Retrieval puro (senza LLM)\n\n"
            "Loop interattivo: inserisci una query e ottieni i chunk più simili\n"
            "tramite embedding semantico, senza generazione LLM."
        ),
        epilog=_build_epilog(),
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    parser.add_argument(
        "--stem",
        required=True,
        help="Nome della collection ChromaDB da interrogare.",
    )
    parser.add_argument(
        "--top-k",
        type=int,
        default=TOP_K,
        metavar="N",
        help=f"Numero di chunk da restituire per query (default: {TOP_K} da config.py).",
    )
    args = parser.parse_args()
    print("─── Retrieval puro ──────────────────────────────────────────\n")
    print(f"  Documento    : {args.stem}")
    print(f"  Embed model  : {EMBED_MODEL}")
    print(f"  Top-K        : {args.top_k}")
    print()
    if not CHROMA_DIR.exists():
        print("❌ chroma_db/ non trovata — esegui prima step-8", file=sys.stderr)
        return 1
    client = chromadb.PersistentClient(path=str(CHROMA_DIR))
    collections = [c.name for c in client.list_collections()]
    if args.stem not in collections:
        print(f"❌ Collection '{args.stem}' non trovata in chroma_db/", file=sys.stderr)
        print(f"   → python step-8/ingest.py --stem {args.stem}", file=sys.stderr)
        return 1
    collection = client.get_collection(args.stem)
    print(f"✅ Collection '{args.stem}' caricata ({collection.count()} chunk)\n")
    run_loop(collection, args.top_k)
    return 0
 if __name__ == "__main__":
    sys.exit(main())
@@ -1,114 +0,0 @@
 # Step 8 — Vettorizzazione
 Legge i chunk prodotti da step-6, genera gli embedding tramite Ollama e li
 salva in ChromaDB (vector store persistente su disco).
 ---
 ## Prerequisiti
 - Step-6 completato (esiste `step-6/<stem>/chunks.json`)
 - Ollama attivo con il modello di embedding scaricato
 - `chromadb` installato (`pip install -r requirements.txt`)
 ---
 ## Configurazione modello
 Il modello di embedding viene letto da **`config.py`**:
 ```python
 # config.py
 EMBED_MODEL = "nomic-embed-text"   # ← cambia qui
 ```
 > Il modello scelto qui deve corrispondere a quello usato in rag.py.
 > Se lo cambi dopo aver già vettorizzato, devi rieseguire step-8 con `--force`.
 ---
 ## Uso
 ```bash
 # Vettorizza un singolo documento
 python step-8/ingest.py --stem <nome>
 # Vettorizza tutti i documenti trovati in step-6/
 python step-8/ingest.py
 # Sovrascrive una collection già esistente
 python step-8/ingest.py --stem <nome> --force
 # Override modello (senza modificare config.py)
 python step-8/ingest.py --stem <nome> --model bge-m3
 ```
 ---
 ## Output
 I vettori vengono salvati in `chroma_db/<stem>/` come collection ChromaDB con
 distanza coseno. La directory è ignorata da git (generata automaticamente).
 ---
 ## Modelli supportati
 Stessi modelli raccomandati nel [README di ollama](../ollama/README.md).
 Il modello deve essere scaricato in Ollama prima di eseguire questo script
 (`ollama pull <modello>`).
 ---
 ## Regole d'oro per parametri ottimali
 ### Modello di embedding
 **Usa un modello multilingue per testi italiani.**
 I modelli English-first (`nomic-embed-text`, `mxbai-embed-large`, `all-minilm`)
 producono vettori di qualità inferiore su italiano, con retrieval meno preciso.
 Prima scelta: `qwen3-embedding:0.6b`.
 **Più dimensioni = retrieval più preciso, ma più spazio su disco.**
 | Dimensioni | Modelli | Quando usarlo |
 |---|---|---|
 | 1024 | `qwen3-embedding:0.6b`, `bge-m3` | documenti tecnici, testi lunghi |
 | 768 | `nomic-embed-text-v2-moe` | buon compromesso |
 | 384 | `all-minilm` | solo per test rapidi |
 **Usa la stessa famiglia LLM + embedding quando possibile.**
 `qwen3-embedding` + `qwen3.5` condividono tokenizer e spazio semantico —
 il retrieval è più coerente rispetto a modelli di famiglie diverse.
 ### Coerenza tra ingest e retrieval
 **`EMBED_MODEL` deve essere identico in `ingest.py` e `rag.py`.**
 ChromaDB memorizza i vettori generati con un certo modello. Se `rag.py` usa un
 modello diverso per la query di ricerca, gli spazi vettoriali non corrispondono
 e il retrieval restituisce risultati casuali — senza alcun errore visibile.
 **Dopo aver cambiato `EMBED_MODEL`, riesegui sempre con `--force`.**
 Senza `--force` lo script salta la collection già esistente — i vecchi vettori
 (generati col modello precedente) restano e continuano a essere usati da `rag.py`.
 ```bash
 # Cambio modello → ricrea sempre la collection
 python step-8/ingest.py --stem <nome> --force
 ```
 ### Quando usare `--force`
 | Situazione | `--force` necessario? |
 |---|---|
 | Prima esecuzione | No |
 | Hai cambiato `EMBED_MODEL` | **Sì** |
 | Hai migliorato i chunk in step-6 | **Sì** |
 | Hai aggiunto nuovi documenti (stem diverso) | No |
 | Vuoi solo verificare che funzioni | No |
 ### Distanza vettoriale
 Lo script usa **distanza coseno** (hardcoded), che è la scelta corretta per
 embedding testuali — misura l'angolo tra vettori indipendentemente dalla loro
 lunghezza. Non cambiare questo parametro.
@@ -1,232 +0,0 @@
 #!/usr/bin/env python3
 """
 Step 8 — Vettorizzazione
 Legge i chunk prodotti da step-6, genera gli embedding tramite Ollama
 e li indicizza in ChromaDB (persistente).
 Il modello di embedding viene letto da config.py (EMBED_MODEL).
 Puoi sovrascriverlo con --model, ma deve corrispondere al modello che
 userai in rag.py — altrimenti riesegui con --force dopo aver cambiato.
 Input:  step-6/<stem>/chunks.json
 Output: chroma_db/<stem> (collection ChromaDB)
 Uso:
    python step-8/ingest.py --stem <nome>            # singolo documento
    python step-8/ingest.py                          # tutti gli stem trovati
    python step-8/ingest.py --stem <nome> --force    # sovrascrive collection
    python step-8/ingest.py --model bge-m3           # override modello
 """
 import argparse
 import json
 import sys
 import time
 import urllib.error
 import urllib.request
 from pathlib import Path
 import chromadb
 # ─── Configurazione ────────────────────────────────────────────────────────────
 project_root = Path(__file__).parent.parent
 CHUNKS_DIR = project_root / "step-6"
 CHROMA_DIR = project_root / "chroma_db"
 sys.path.insert(0, str(project_root))
 from config import EMBED_MODEL, OLLAMA_URL  # noqa: E402
 EMBED_ENDPOINT = f"{OLLAMA_URL}/api/embeddings"
 # ─── Ollama ────────────────────────────────────────────────────────────────────
 def embed(text: str, model: str) -> list[float]:
    """Chiama Ollama /api/embeddings e ritorna il vettore."""
    payload = json.dumps({"model": model, "prompt": text}).encode()
    req = urllib.request.Request(
        EMBED_ENDPOINT,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        data = json.loads(resp.read())
    return data["embedding"]
 def check_ollama(model: str) -> bool:
    """Verifica che Ollama sia attivo e che il modello di embedding sia disponibile."""
    try:
        req = urllib.request.Request(f"{OLLAMA_URL}/api/tags", method="GET")
        with urllib.request.urlopen(req, timeout=10) as resp:
            data = json.loads(resp.read())
        models = [m["name"] for m in data.get("models", [])]
        found = any(
            m == model or m.startswith(model + ":")
            for m in models
        )
        if found:
            print(f"✅ Ollama OK — {model} disponibile")
            return True
        print(f"❌ Modello {model} non trovato in Ollama")
        print(f"   → ollama pull {model}")
        return False
    except (urllib.error.URLError, OSError):
        print("❌ Ollama non raggiungibile — assicurati che sia in esecuzione")
        print("   → ollama serve")
        return False
 # ─── ChromaDB ─────────────────────────────────────────────────────────────────
 def get_client() -> chromadb.PersistentClient:
    CHROMA_DIR.mkdir(parents=True, exist_ok=True)
    return chromadb.PersistentClient(path=str(CHROMA_DIR))
 def collection_exists(client: chromadb.PersistentClient, stem: str) -> bool:
    return any(c.name == stem for c in client.list_collections())
 # ─── Ingestione ───────────────────────────────────────────────────────────────
 def ingest(stem: str, force: bool, model: str = EMBED_MODEL) -> bool:
    """
    Legge step-6/<stem>/chunks.json, genera embedding e popola ChromaDB.
    Ritorna True se completato con successo, False altrimenti.
    """
    chunks_path = CHUNKS_DIR / stem / "chunks.json"
    if not chunks_path.exists():
        print(f"❌ File non trovato: {chunks_path}")
        return False
    with open(chunks_path, encoding="utf-8") as f:
        chunks = json.load(f)
    if not chunks:
        print(f"⚠️  {stem}: chunks.json è vuoto — skip")
        return False
    client = get_client()
    if collection_exists(client, stem):
        if not force:
            print(f"⚠️  Collection '{stem}' già presente in ChromaDB — skip")
            print(f"   → usa --force per sovrascrivere")
            return True  # non è un errore, è uno skip
        client.delete_collection(stem)
        print(f"🗑️  Collection '{stem}' rimossa (--force)")
    collection = client.create_collection(
        name=stem,
        metadata={"hnsw:space": "cosine"},
    )
    total = len(chunks)
    print(f"📦 {total} chunk da ingestire\n")
    ids        = []
    embeddings = []
    documents  = []
    metadatas  = []
    start     = time.monotonic()
    durations: list[float] = []
    for i, chunk in enumerate(chunks, start=1):
        t0 = time.monotonic()
        vector = embed(chunk["text"], model)
        t1 = time.monotonic()
        durations.append(t1 - t0)
        ids.append(chunk["chunk_id"])
        embeddings.append(vector)
        documents.append(chunk["text"])
        metadatas.append({
            "sezione":   chunk.get("sezione", ""),
            "titolo":    chunk.get("titolo", ""),
            "sub_index": chunk.get("sub_index", 0),
        })
        avg  = sum(durations) / len(durations)
        eta  = int(avg * (total - i))
        done = f"[{i:>{len(str(total))}}/{total}]"
        cid  = chunk["chunk_id"][:50]
        line = f"  {done} ✓ {cid:<50}  ETA: {eta}s"
        print(f"{line:<80}", end="\r", flush=True)
        # Upsert in batch da 100 per non sovraccaricare la memoria
        if len(ids) == 100:
            collection.add(
                ids=ids,
                embeddings=embeddings,
                documents=documents,
                metadatas=metadatas,
            )
            ids, embeddings, documents, metadatas = [], [], [], []
    # Upsert dei rimanenti
    if ids:
        collection.add(
            ids=ids,
            embeddings=embeddings,
            documents=documents,
            metadatas=metadatas,
        )
    elapsed = int(time.monotonic() - start)
    print()  # nuova riga dopo il \r
    print(f"\n✅ Ingestione completata in {elapsed}s — {total}/{total} chunk salvati")
    print(f"   Collection '{stem}' in {CHROMA_DIR}/")
    return True
 # ─── Entry point ──────────────────────────────────────────────────────────────
 def find_stems() -> list[str]:
    """Ritorna tutti gli stem che hanno un chunks.json in step-6/."""
    return sorted(
        p.parent.name
        for p in CHUNKS_DIR.glob("*/chunks.json")
    )
 def main() -> int:
    parser = argparse.ArgumentParser(
        description="Step 8 — Vettorizzazione chunk in ChromaDB"
    )
    parser.add_argument("--stem", help="Nome del documento (senza --stem = tutti)")
    parser.add_argument("--force", action="store_true",
                        help="Sovrascrive la collection se già esistente")
    parser.add_argument("--model", default=EMBED_MODEL,
                        help=f"Modello embedding Ollama (default da config.py: {EMBED_MODEL})")
    args = parser.parse_args()
    print("─── Step 8 — Vettorizzazione ─────────────────────────────────────────\n")
    if not check_ollama(args.model):
        return 1
    stems = [args.stem] if args.stem else find_stems()
    if not stems:
        print("❌ Nessun chunks.json trovato in step-6/")
        return 1
    print()
    results = []
    for stem in stems:
        if len(stems) > 1:
            print(f"── {stem} ──")
        results.append(ingest(stem, force=args.force, model=args.model))
        if len(stems) > 1:
            print()
    return 0 if all(results) else 1
 if __name__ == "__main__":
    sys.exit(main())
Author	SHA1	Message	Date
davide	444942dc8f	feat: demota #→## quando il documento usa h1 per sezioni principali Aggiunge _t_demote_h1 in _headers.py: se il documento contiene ≥5 header # con contenuto testuale (lettere iniziali), i # vengono demotati a ## creando la gerarchia ## (parti) → ### (sezioni) invece di # → ###. Utile per manuali strutturati in parti principali (h1) con sezioni (h3) senza livello intermedio h2. La soglia di 5 evita falsi positivi su documenti con un solo titolo h1 o h1 da artefatti di encoding. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-07 16:21:02 +02:00
davide	3f4689e8fd	feat: rileva note bibliografiche e raccolte multi-articolo in pipeline Risolve la conversione errata di note a piè di pagina accademiche in header Markdown nei testi giuridici (es. dirittopubblico: da 424 h2 errati → 27 h2 semanticamente corretti). - _BIB_MARKERS_RE: aggiunge ibid., cfr., op. cit., cit., ivi - _FOOTNOTE_AUTHOR_RE: nuovo pattern per "A. COGNOME" (es. G. GUZZETTA) - _num_repl / _aphorism_repl / _list_section_repl: usano entrambi i guard per non convertire note bibliografiche in sezioni - _t_promote_chapter_headers: usa max-count ≥ 3 per distinguere raccolte multi-articolo (non promuovere) da libri con capitoli sequenziali (promuovere); preserva il comportamento corretto su anatomia - _t_remove_page_markers / _t_remove_page_numbers / _t_remove_separators: nuove transform per page marker PDF, numeri isolati, separatori underscore Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-07 16:12:50 +02:00
davide	2c0b7a462e	feat: migliora pipeline PDF→MD per RAG — frontmatter e page marker - extract.py: aggiunge extract_metadata() — title, author, year, pages via fitz - extract.py: aggiunge markdown_page_separator con <!-- page: N --> tra pagine - extract.py: aggiunge replace_invalid_chars=" " per testo più pulito - runner.py: prepend YAML frontmatter (source/title/author/year/pages) al clean.md - runner.py: mostra title e author rilevati durante validazione Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-07 14:58:09 +02:00
davide	6e755c0b6c	fix(clear.sh): esclude _pipeline/ dal batch e supporta stem singolo - Aggiunge argomento opzionale <stem>: cancella solo quella cartella - Esclude dal batch le dir che iniziano con _ o __ (es. _pipeline/) - Rimuove flag -f non documentato: la modalità singolo stem non chiede conferma Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-07 14:53:17 +02:00
davide	9598209f12	chore: aggiorna .gitignore — esclude __pycache__ e rimuove riferimento a transforms/ Aggiunge esclusione esplicita di _pipeline/__pycache__/ per compensare la regola di negazione !conversione/_pipeline/**. Rimuove dall'indice tutti i file .pyc precedentemente tracciati per errore. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-07 14:44:40 +02:00
davide	64dc403e80	refactor: ottimizza pipeline PDF→Markdown — struttura piatta e verbosità - Unifica deps.py + checker.py + converter.py in extract.py (fronte PDF) - Sposta transforms/ in _pipeline/ (struttura piatta, no sottocartelle) - Aggiunge spinner animato (thread) durante conversione opendataloader-pdf - Aggiunge progresso step-by-step [i/37] per apply_transforms via callback - Mostra punteggio qualità (score/100 grade) a fine elaborazione - Fix: _DOTLEADER_RE spostata in _constants.py (non più definita inline) - Fix: report.py importa regex da _constants invece di ridefinirle - Fix: _t_remove_urls ora conta e ritorna le rimozioni effettive Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-07 14:30:41 +02:00
davide	afbf29514d	Aggiorna CLAUDE.md	2026-05-07 13:51:55 +02:00
davide	ab4036591f	temp	2026-04-30 15:26:52 +02:00
davide	e41fcae248	refactor: modularizza pipeline in conversione/_pipeline/ Sostituisce i file monolitici pipeline.py e validate.py con il package _pipeline/ a responsabilità separate. Entry point unificato in __main__.py (convert + validate dallo stesso comando). Moduli aggiunti: - __main__.py — CLI unificata (--stem, --force, validate, --detail) - _pipeline/__init__.py — re-export pubblico - _pipeline/checker.py — validazione PDF - _pipeline/deps.py — verifica dipendenze Java + opendataloader - _pipeline/structure.py — analyze() + strategia chunking Moduli già committati in precedenza: - _pipeline/converter.py, transforms.py, report.py, runner.py, validator.py Aggiornamenti collaterali: - .gitignore: exception !conversione/_pipeline/** per tracciare il package - CLAUDE.md: documentazione aggiornata alla nuova architettura; fix riferimenti obsoleti a conversione/pipeline.py → conversione/ Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-30 14:59:55 +02:00
davide	faa8acae84	feat(pipeline): ottimizzazione completa PDF→Markdown senza revisione manuale - converter: parametri adattivi (use_struct_tree per PDF taggati, table_method=cluster, content_safety_off) - transforms: +20 PUA bracket TeX U+F8EB-F8FE (290 simboli corretti su analisi1) - transforms: _t_math_header_demotion — demota header ##/### che sono enunciati esercizi o formule - report: metrica formula_headers_residui con esempi - validator: penalità formula_headers (−3/cad, cap −15), colonna fhdr nel report tabellare Risultato su analisi1: voto 92/A, PUA residui 0, formula-hdr residui 0 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-30 14:58:15 +02:00
davide	a158634378	refactor: riduci repo alla sola fase di conversione PDF → Markdown Rimossi chunks/, step-8/, ollama/, chroma_db/, rag.py, retrieve.py, config.py e chromadb da requirements. Aggiornati README e CLAUDE.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-30 12:20:00 +02:00