fix(verify): recognize www. URLs as valid terminators + multi-document docs

- _URL_TAIL now also matches www.* (not only https://), which avoids false blockers on watermarks such as www.docsity.com
- README: documents --collection / --stems for ingestion, retrieve and rag

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
feat(ingestion): multi-document support in a single ChromaDB collection
2026-05-12 11:21:24 +02:00 · 2026-05-12 11:21:17 +02:00 · 2026-05-12 11:09:28 +02:00 · 2026-05-12 10:43:17 +02:00 · 2026-05-12 10:39:27 +02:00 · 2026-05-12 10:37:39 +02:00
13 changed files with 831 additions and 282 deletions
@@ -50,7 +50,7 @@ except Exception as e: print(f'ERRORE lettura report: {e}')

 ```
 ✅ Chunks ready. Proceed with vectorization:
-   python step-8/ingest.py --stem $ARGUMENTS
+   python ingestion/ingest.py --stem $ARGUMENTS
 ```

 If there are only 🟡 items, briefly explain the warnings and ask whether the user wants to resolve them first or proceed.
@@ -105,7 +105,7 @@ Se verdict finale è `ok` o `warnings_only` senza 🔴:
 ```
 ✅ Chunks ready in chunks/$ARGUMENTS/chunks.json
   Proceed with vectorization:
-   python step-8/ingest.py --stem $ARGUMENTS
+   python ingestion/ingest.py --stem $ARGUMENTS
 ```

 If 🔴 items remain after the fix (unsplittable text or anomalous structure in the source):
@@ -1,9 +1,11 @@
-# PDF → Markdown
+# PDF → RAG-ready chunks

-Converts digital PDFs into clean, structured Markdown.
+Converts digital PDFs into semantic chunks ready for RAG vectorization,
+without an LLM or OCR.

-**Stack:** Python · opendataloader-pdf (XY-Cut++) · Java 11+  
-**Compatible with:** Linux · macOS · Windows (WSL2)
+**Pipeline:** PDF → structured Markdown → semantic chunks → ChromaDB embeddings  
+**Stack:** Python · PyMuPDF · pdfplumber  
+**Not supported:** scanned PDFs (image-only), password-protected PDFs.

 ---

@@ -15,54 +17,230 @@ source .venv/bin/activate
 pip install -r requirements.txt
 ```

-**Java 11+** required:
-
-```bash
-sudo apt install default-jdk   # Ubuntu/Debian/WSL
-java -version
-```
-
 ---

-## Usage
+## Full workflow
+
+### 1. Place the PDF
+
+```
+sources/<nome>.pdf
+```
+
+### 2. Convert the PDF to Markdown

 ```bash
-# Single PDF
-python conversione/pipeline.py --stem <nome>
+# Single document
+.venv/bin/python conversione/ --stem <nome>

 # All PDFs in sources/
-python conversione/pipeline.py
+.venv/bin/python conversione/

-# Force re-run
-python conversione/pipeline.py --stem <nome> --force
+# Force re-run (overwrites existing output)
+.venv/bin/python conversione/ --stem <nome> --force
 ```

-`--stem` = PDF file name without extension.  
-Example: `sources/analisi1.pdf` → `--stem analisi1`
-
---
-
-## Output
-
-For each stem in `conversione/<stem>/`:
+Output in `conversione/<nome>/`:

 | File | Description |
 |------|-------------|
 | `raw.md` | Raw Markdown, **do not edit** |
-| `clean.md` | Clean Markdown, working copy |
-| `structure_profile.json` | Detected structure and metrics |
-| `report.json` | Full conversion statistics |
+| `clean.md` | Clean Markdown, input for the chunker |
+| `structure_profile.json` | Detected structure and chunking strategy |
+| `report.json` | Conversion quality metrics |

---
-
-## Batch validation
+### 3. Check the Markdown quality (optional)

 ```bash
-python conversione/validate.py
+.venv/bin/python conversione/ validate <nome> --detail
 ```

-Prints a status table for all converted stems.
+If the score is ≥ 80 and `valid=true`, proceed. Otherwise use `/prepare-md` for
+manual corrections (residual hyphenation, malformed headers, etc.).
+
+### 4. Generate the chunks
+
+```bash
+.venv/bin/python chunks/chunker.py --stem <nome>
+
+# Force re-run
+.venv/bin/python chunks/chunker.py --stem <nome> --force
+```
+
+The chunking strategy (`h3_aware`, `h2_paragraph_split`, `paragraph`,
+`sliding_window`) is selected automatically from `structure_profile.json`.
+
+Output in `chunks/<nome>/`:
+
+| File | Description |
+|------|-------------|
+| `chunks.json` | List of chunks with text, section, title and metadata |
+| `report.json` | Chunking statistics and anomalies |
+
+### 5. Verify the chunks
+
+```bash
+.venv/bin/python chunks/verify_chunks.py --stem <nome>
+```
+
+Possible verdicts:
+
+| Verdict | Meaning | What to do |
+|---------|---------|------------|
+| `ok` | No problems | Proceed to vectorization |
+| `warnings_only` | Minor warnings only | Proceed, or run the fix first |
+| `blocked` | Blocking problems (incomplete chunks) | Run the fix |
+
+### 6. Fix the problems (if needed)
+
+```bash
+# Preview the fixes without applying them
+.venv/bin/python chunks/fix_chunks.py --stem <nome> --dry-run
+
+# Apply the fixes (repeated, up to 3 iterations)
+.venv/bin/python chunks/fix_chunks.py --stem <nome>
+```
+
+The fix automatically handles incomplete chunks (broken sentence), chunks
+that are too short (merged into the next one) and excessively long chunks
+(split on punctuation). Every chunk always ends on a sentence boundary.
+
+### 7. Run the ingestion
+
+First check that Ollama and the models are ready:
+
+```bash
+.venv/bin/python ollama/check_env.py
+```
+
+Then generate the embeddings and store them in ChromaDB:
+
+```bash
+# Single document → collection with the same name
+.venv/bin/python ingestion/ingest.py --stem <nome>
+
+# Multiple documents → one shared collection
+.venv/bin/python ingestion/ingest.py --collection <nome-collection> --stems doc1 doc2 doc3
+
+# All documents in chunks/ → separate collections
+.venv/bin/python ingestion/ingest.py
+
+# Regenerate after changing the model or updating the chunks
+.venv/bin/python ingestion/ingest.py --stem <nome> --force
+```
+
+With `--collection`, chunks from different documents are merged into a single
+collection. The `source` metadata field identifies the document each chunk comes from.
+
+Output in `chroma_db/` (ignored by git).

 ---

-See [`conversione/README.md`](conversione/README.md) for details on the pipeline and the supported document types.
+## Chunking configuration
+
+All parameters live in [`chunks/config.py`](chunks/config.py):
+
+```python
+TARGET_CHARS    = 300   # default target chunk size (strategies may override it)
+CHUNK_TOLERANCE = 0.25  # ±25% → acceptable range [225, 375]
+OVERLAP_SENTENCES = 1   # overlapping sentences between consecutive chunks
+PROTECT_TABLES  = True  # tables are emitted as atomic chunks
+FIX_MAX_ITERATIONS = 3  # maximum iterations of the fix loop
+```
+
+Each strategy can define its own values through `STRATEGY_OVERRIDES`.
+Edit only this file: chunker, verify and fix pick up the changes automatically.
+
+---
+
+## Model configuration
+
+All LLM and embedding parameters live in [`config.py`](config.py):
+
+```python
+OLLAMA_MODEL = "qwen3.5:4b"      # LLM used for generation
+EMBED_MODEL  = "nomic-embed-text" # embedding model (must match the one used at ingestion)
+TEMPERATURE  = 0.2                # 0.0 = deterministic, higher values = more creative
+NO_THINK     = True               # True = direct answer (faster), False = with reasoning
+TOP_K        = 6                  # number of chunks retrieved per question
+OLLAMA_URL   = "http://localhost:11434"
+```
+
+> If you change `EMBED_MODEL` you must re-run the ingestion with `--force`: the embeddings
+> must be produced by the same model used at retrieval time.
+
+---
+
+## Testing the model (without RAG)
+
+Check that the LLM responds correctly before involving the pipeline:
+
+```bash
+.venv/bin/python ollama/test_ollama.py
+```
+
+The model used is the one configured in `config.py` (`OLLAMA_MODEL`).  
+Type `exit` to quit.
+
+---
+
+## Pure retrieval (without generation)
+
+Useful for checking that the right chunks are retrieved before diagnosing
+wrong answers:
+
+```bash
+# Single document
+.venv/bin/python retrieve.py --stem <nome>
+
+# Multi-document collection
+.venv/bin/python retrieve.py --collection <nome-collection>
+
+# Change the number of chunks returned
+.venv/bin/python retrieve.py --stem <nome> --top-k 10
+```
+
+In the interactive loop:
+
+| Command | Effect |
+|---------|--------|
+| `<query>` | Shows the most similar chunks with similarity score (truncated text) |
+| `<query> -f` | Full text of the chunks |
+| `exit` | Quit |
+
+---
+
+## Interactive RAG
+
+Answers natural-language questions using the chunks indexed in ChromaDB:
+
+```bash
+# Single document
+.venv/bin/python rag.py --stem <nome>
+
+# Multi-document collection
+.venv/bin/python rag.py --collection <nome-collection>
+```
+
+In the interactive loop:
+
+| Command | Effect |
+|---------|--------|
+| `<question>` | Answer generated by the LLM with context from the chunks |
+| `<question> -v` | Answer + retrieved chunks with similarity score and source document |
+| `exit` | Quit |
+
+---
+
+## Test
+
+```bash
+.venv/bin/python -m pytest tests/
+```
+
+---
+
+## References
+
+- [`conversione/README.md`](conversione/README.md): details on the PDF→Markdown pipeline and the supported document types
+- [`ingestion/README.md`](ingestion/README.md): embedding configuration, model choice, --force rules
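
The `--collection` mode described in step 7 of the README merges chunks from several stems and tags each one with a `source` field. A minimal sketch of that flattening step follows, assuming chunks are dicts with `chunk_id` and `text` as in `chunks.json`. The helper name `build_collection_records` and the `stem::chunk_id` id format are hypothetical, since `ingest.py` is not part of this diff.

```python
from typing import TypedDict


class Chunk(TypedDict):
    chunk_id: str
    text: str


def build_collection_records(
    docs: dict[str, list[Chunk]],
) -> tuple[list[str], list[str], list[dict]]:
    """Flatten chunks from several stems into parallel id/document/metadata lists."""
    ids: list[str] = []
    documents: list[str] = []
    metadatas: list[dict] = []
    for stem, chunks in docs.items():
        for chunk in chunks:
            # Assumption: prefix the id with the stem, because two documents
            # may produce the same chunk_id.
            ids.append(f"{stem}::{chunk['chunk_id']}")
            documents.append(chunk["text"])
            # `source` records which document the chunk came from.
            metadatas.append({"source": stem})
    return ids, documents, metadatas


docs = {
    "doc1": [{"chunk_id": "intro__s0", "text": "First document."}],
    "doc2": [{"chunk_id": "intro__s0", "text": "Second document."}],
}
ids, documents, metadatas = build_collection_records(docs)
```

The three parallel lists have the shape that ChromaDB's `collection.add(ids=..., documents=..., metadatas=...)` accepts. Prefixing ids with the stem is one way to keep them unique inside a shared collection.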
@@ -20,12 +20,10 @@ import re
 import sys
 from pathlib import Path

-
-# ─── Parameters ───────────────────────────────────────────────────────────────
-
-MIN_CHARS = 200   # below this threshold → merge into the next chunk
-MAX_CHARS = 800   # above this threshold → split on sentences
-OVERLAP_S = 2     # overlap sentences between sub-chunks of the same boundary
+_HERE = Path(__file__).resolve().parent
+if str(_HERE) not in sys.path:
+    sys.path.insert(0, str(_HERE))
+import config as cfg


 # ─── Utilities ────────────────────────────────────────────────────────────────
@@ -44,73 +42,106 @@ def slugify(s: str, max_len: int = 60) -> str:
    return s[:max_len] if s else "section"


-_SENT_BOUNDARY = re.compile(r"[.!?»)\]'\u2019\"\u201c\u201d/:|\u2026]$")
+def _is_table_block(text: str) -> bool:
+    """True if the text is mostly a Markdown table (≥50% of lines start with |)."""
+    lines = [line for line in text.strip().splitlines() if line.strip()]
+    if not lines:
+        return False
+    table_lines = sum(1 for line in lines if line.strip().startswith("|"))
+    return table_lines / len(lines) >= 0.5


-def _flush_chunk(
-    current: list[str],
-    sentences: list[str],
-    i: int,
-    prefix: str,
-    sezione: str,
-    titolo: str,
-    sub_index: int,
-    max_chars: int,
-) -> tuple[dict, list[str], int, int]:
-    """Emits a chunk, extending it up to a sentence boundary (max +20%)."""
-    hard_limit = int(max_chars * 1.2)
-    current_len = sum(len(s) + 1 for s in current)
-    while i < len(sentences) and not _SENT_BOUNDARY.search(" ".join(current)):
-        nxt = sentences[i]
-        if current_len + len(nxt) + 1 > hard_limit:
-            break
-        current.append(nxt)
-        current_len += len(nxt) + 1
-        i += 1
-    chunk_text = prefix + " ".join(current)
-    chunk = {
-        "chunk_id": f"{slugify(sezione)}__{slugify(titolo)}__s{sub_index}",
-        "text": chunk_text,
-        "sezione": sezione,
-        "titolo": titolo,
-        "sub_index": sub_index,
-        "n_chars": len(chunk_text),
-    }
-    return chunk, current, i, sub_index + 1
+def _ov(strategy: str) -> tuple[int, float, int]:
+    """Read (target_chars, tolerance, overlap), applying the strategy overrides."""
+    ov = cfg.STRATEGY_OVERRIDES.get(strategy, {})
+    target    = ov.get("target_chars", cfg.TARGET_CHARS)
+    tolerance = ov.get("tolerance",    cfg.CHUNK_TOLERANCE)
+    overlap   = ov.get("overlap",      cfg.OVERLAP_SENTENCES)
+    return target, tolerance, overlap


+# ─── Core: target-oriented split into sub-chunks ──────────────────────────────
+
 def make_sub_chunks(
    body: str,
    prefix: str,
    sezione: str,
    titolo: str,
-    max_chars: int,
+    target: int,
+    tolerance: float,
    overlap_s: int,
 ) -> list[dict]:
+    """Split body into chunks as close as possible to `target` characters.
+
+    Logic:
+      lower = target × (1 − tolerance)   → minimum threshold for emitting
+      upper = target × (1 + tolerance)   → maximum limit
+
+    Whole sentences are accumulated until the next one would exceed `upper`.
+    At that point the chunk is emitted and a new one starts with the overlap.
+    Every chunk always ends on a sentence boundary and never crosses the
+    boundary of the current header.
+    """
+    if cfg.PROTECT_TABLES and _is_table_block(body):
+        chunk_text = prefix + body
+        return [{
+            "chunk_id": f"{slugify(sezione)}__{slugify(titolo)}__s0",
+            "text": chunk_text,
+            "sezione": sezione,
+            "titolo": titolo,
+            "sub_index": 0,
+            "n_chars": len(chunk_text),
+        }]
+
+    # Threshold computed on the body (final n_chars = prefix_len + body_len).
+    prefix_len = len(prefix)
+    upper_body = max(1, int(target * (1 + tolerance)) - prefix_len)
+
    sentences = split_sentences(body)
    if not sentences:
        return []

-    chunks = []
+    chunks: list[dict] = []
    current: list[str] = []
    current_len = 0
    sub_index = 0

-    i = 0
-    while i < len(sentences):
-        sent = sentences[i]
-        if not current or current_len + len(sent) + 1 <= max_chars:
-            current.append(sent)
-            current_len += len(sent) + (1 if len(current) > 1 else 0)
-            i += 1
-        else:
-            chunk, current, i, sub_index = _flush_chunk(
-                current, sentences, i, prefix, sezione, titolo, sub_index, max_chars
-            )
-            chunks.append(chunk)
+    def _emit() -> None:
+        nonlocal current, current_len, sub_index
+        chunk_text = prefix + " ".join(current)
+        chunks.append({
+            "chunk_id": f"{slugify(sezione)}__{slugify(titolo)}__s{sub_index}",
+            "text": chunk_text,
+            "sezione": sezione,
+            "titolo": titolo,
+            "sub_index": sub_index,
+            "n_chars": len(chunk_text),
+        })
        overlap = current[-overlap_s:] if overlap_s and len(current) > overlap_s else []
        current = overlap[:]
-            current_len = sum(len(s) + 1 for s in current)
+        # Correct overlap length (n-1 spaces between n sentences).
+        current_len = sum(len(s) for s in current) + max(0, len(current) - 1)
+        sub_index += 1
+
+    for sent in sentences:
+        sep     = 1 if current else 0
+        new_len = current_len + sep + len(sent)
+
+        if new_len <= upper_body:
+            # Still within the body limit: append and continue.
+            current.append(sent)
+            current_len = new_len
+        elif current:
+            # The next sentence would exceed the limit: emit the current chunk
+            # (which ends on a complete sentence), then start the new one with this sentence.
+            _emit()
+            current.append(sent)
+            current_len += (1 if current[:-1] else 0) + len(sent)
+        else:
+            # Empty chunk: the single sentence already exceeds the limit, so emit it as is.
+            current.append(sent)
+            current_len = len(sent)
+            _emit()

    if current:
        chunk_text = prefix + " ".join(current)
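
The accumulation rule described in the `make_sub_chunks` docstring can be sketched in isolation. This is a simplified illustration under stated assumptions, not the project's implementation: it ignores the prefix and table protection, and `pack_sentences` is a hypothetical name. The split pattern is the one given in `SENTENCE_SPLIT_PATTERN`.

```python
import re

# Same pattern as SENTENCE_SPLIT_PATTERN in chunks/config.py.
SENTENCE_SPLIT = re.compile(r"(?<=[.!?»])\s+")


def pack_sentences(body: str, upper: int, overlap_s: int = 1) -> list[str]:
    """Greedily pack whole sentences into chunks that target at most `upper` chars."""
    sentences = [s for s in SENTENCE_SPLIT.split(body.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []

    def joined_len(parts: list[str]) -> int:
        # n sentences joined by single spaces need n - 1 separators.
        return sum(len(p) for p in parts) + max(0, len(parts) - 1)

    for sent in sentences:
        if current and joined_len(current + [sent]) > upper:
            chunks.append(" ".join(current))
            # Carry over the last `overlap_s` sentences, unless they are the whole chunk.
            current = current[-overlap_s:] if overlap_s and len(current) > overlap_s else []
        current.append(sent)
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "Alpha one. Beta two. Gamma three. Delta four."
result = pack_sentences(text, upper=24, overlap_s=1)
```

With `upper=24` each chunk holds two sentences and repeats the last sentence of its predecessor. As in the diff, a chunk can still exceed `upper` when the overlap plus a single long sentence is already too large.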
@@ -194,6 +225,9 @@ def parse_h2_sections(text: str) -> list[dict]:
 # ─── Chunking strategies ──────────────────────────────────────────────────────

 def chunk_h3_aware(text: str, stem: str) -> list[dict]:
+    target, tolerance, overlap = _ov("h3_aware")
+    lower = int(target * (1 - tolerance))
+
    sections = parse_h3_sections(text)

    merged: list[dict] = []
@@ -205,7 +239,7 @@ def chunk_h3_aware(text: str, stem: str) -> list[dict]:
            continue

        if (pending["sezione"] == sec["sezione"]
-                and len(pending["body"]) < MIN_CHARS):
+                and len(pending["body"]) < lower):
            sep_title = " / ".join(filter(None, [pending["titolo"], sec["titolo"]]))
            pending = {
                "sezione": pending["sezione"],
@@ -224,15 +258,16 @@ def chunk_h3_aware(text: str, stem: str) -> list[dict]:
        sezione = sec["sezione"] or stem
        titolo  = sec["titolo"] or ""
        body    = sec["body"]
-
        prefix  = f"[{sezione} > {titolo}]\n" if titolo else f"[{sezione}]\n"
-        sub = make_sub_chunks(body, prefix, sezione, titolo, MAX_CHARS, OVERLAP_S)
-        chunks.extend(sub)
+        chunks.extend(make_sub_chunks(body, prefix, sezione, titolo, target, tolerance, overlap))

    return chunks


 def chunk_h2_paragraph_split(text: str, stem: str) -> list[dict]:
+    target, tolerance, overlap = _ov("h2_paragraph_split")
+    lower = int(target * (1 - tolerance))
+
    sections = parse_h2_sections(text)
    chunks = []

@@ -250,7 +285,7 @@ def chunk_h2_paragraph_split(text: str, stem: str) -> list[dict]:
        merged_pars: list[str] = []
        pending = ""
        for par in paragraphs:
-            if pending and len(pending) < MIN_CHARS:
+            if pending and len(pending) < lower:
                pending = pending + "\n\n" + par
            else:
                if pending:
@@ -260,7 +295,7 @@ def chunk_h2_paragraph_split(text: str, stem: str) -> list[dict]:
            merged_pars.append(pending)

        for idx, par in enumerate(merged_pars):
-            sub = make_sub_chunks(par, prefix, sezione, f"par{idx}", MAX_CHARS, OVERLAP_S)
+            sub = make_sub_chunks(par, prefix, sezione, f"par{idx}", target, tolerance, overlap)
            for c in sub:
                c["chunk_id"] = f"{slugify(sezione)}__p{idx}__s{c['sub_index']}"
            chunks.extend(sub)
@@ -269,6 +304,9 @@ def chunk_h2_paragraph_split(text: str, stem: str) -> list[dict]:


 def chunk_paragraph(text: str, stem: str) -> list[dict]:
+    target, tolerance, overlap = _ov("paragraph")
+    lower = int(target * (1 - tolerance))
+
    paragraphs = [
        p.strip()
        for p in re.split(r"\n{2,}", text)
@@ -279,7 +317,7 @@ def chunk_paragraph(text: str, stem: str) -> list[dict]:
    merged: list[str] = []
    pending = ""
    for par in paragraphs:
-        if pending and len(pending) < MIN_CHARS:
+        if pending and len(pending) < lower:
            pending = pending + "\n\n" + par
        else:
            if pending:
@@ -290,7 +328,7 @@ def chunk_paragraph(text: str, stem: str) -> list[dict]:

    chunks = []
    for idx, par in enumerate(merged):
-        sub = make_sub_chunks(par, prefix, stem, f"par{idx}", MAX_CHARS, OVERLAP_S)
+        sub = make_sub_chunks(par, prefix, stem, f"par{idx}", target, tolerance, overlap)
        for c in sub:
            c["chunk_id"] = f"para__{idx}__s{c['sub_index']}"
        chunks.extend(sub)
@@ -299,6 +337,9 @@ def chunk_paragraph(text: str, stem: str) -> list[dict]:


 def chunk_sliding_window(text: str, stem: str) -> list[dict]:
+    target, tolerance, overlap = _ov("sliding_window")
+    upper = int(target * (1 + tolerance))
+
    sentences = split_sentences(text)
    prefix = f"[Documento: {stem}]\n"

@@ -313,10 +354,11 @@ def chunk_sliding_window(text: str, stem: str) -> list[dict]:
        j = i
        while j < len(sentences):
            s = sentences[j]
-            if window and cur_len + len(s) + 1 > MAX_CHARS:
+            sep = 1 if window else 0
+            if window and cur_len + sep + len(s) > upper:
                break
            window.append(s)
-            cur_len += len(s) + (1 if len(window) > 1 else 0)
+            cur_len += sep + len(s)
            j += 1

        if not window:
@@ -333,7 +375,7 @@ def chunk_sliding_window(text: str, stem: str) -> list[dict]:
            "n_chars": len(chunk_text),
        })
        win_idx += 1
-        i += max(1, len(window) - OVERLAP_S)
+        i += max(1, len(window) - overlap)

    return chunks

@@ -393,19 +435,31 @@ def process_stem(stem: str, project_root: Path, force: bool) -> bool:
        json.dumps(chunks, ensure_ascii=False, indent=2), encoding="utf-8"
    )

+    target, tolerance, _ = _ov(strategia)
+    lower = int(target * (1 - tolerance))
+    upper = int(target * (1 + tolerance))
+
+    meta = {"strategy": strategia, "target_chars": target,
+            "min_chars": lower, "max_chars": upper}
+    (out_dir / "meta.json").write_text(
+        json.dumps(meta, ensure_ascii=False), encoding="utf-8"
+    )
+
    lengths = [c["n_chars"] for c in chunks]
    min_c  = min(lengths)
    max_c  = max(lengths)
    avg_c  = int(sum(lengths) / len(lengths))
-    short = sum(1 for l in lengths if l < MIN_CHARS)
-    long_ = sum(1 for l in lengths if l > MAX_CHARS * 1.5)
+    short  = sum(1 for l in lengths if l < lower)
+    long_  = sum(1 for l in lengths if l > upper)

+    print(f"  Target: {target} char  ±{int(tolerance*100)}%  "
+          f"→ range [{lower}, {upper}]")
    print(f"  Chunk totali: {len(chunks)}")
    print(f"  Min: {min_c} char  Max: {max_c} char  Media: {avg_c} char")
    if short:
-        print(f"  ⚠️  {short} chunk sotto MIN_CHARS ({MIN_CHARS})")
+        print(f"  ⚠️  {short} chunk sotto lower ({lower})")
    if long_:
-        print(f"  ⚠️  {long_} chunk sopra MAX_CHARS×1.5 ({int(MAX_CHARS * 1.5)})")
+        print(f"  ⚠️  {long_} chunk sopra upper ({upper})")
    print(f"  ✅ chunks.json salvato in chunks/{stem}/")
    return True
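
The range and the short/long counters that `process_stem` prints can be reproduced with a few lines of arithmetic. This is a sketch that follows the formulas in the diff (`int(target * (1 ± tolerance))`); the function names `chunk_range` and `size_report` are hypothetical.

```python
def chunk_range(target: int, tolerance: float) -> tuple[int, int]:
    """Return (lower, upper), truncated to int as in the chunker."""
    lower = int(target * (1 - tolerance))
    upper = int(target * (1 + tolerance))
    return lower, upper


def size_report(lengths: list[int], target: int, tolerance: float) -> dict[str, int]:
    """Count chunks strictly below lower and strictly above upper."""
    lower, upper = chunk_range(target, tolerance)
    return {
        "lower": lower,
        "upper": upper,
        "short": sum(1 for n in lengths if n < lower),
        "long": sum(1 for n in lengths if n > upper),
    }


report = size_report([120, 450, 600, 750, 900], target=600, tolerance=0.25)
```

Chunks of exactly 450 or 750 characters fall inside the range, because both comparisons are strict.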

@@ -0,0 +1,88 @@
+#!/usr/bin/env python3
+"""
+Configuration parameters for the chunking pipeline.
+
+Edit this file to change the behavior of chunker.py,
+verify_chunks.py and fix_chunks.py without touching application code.
+"""
+
+# ─── Target chunk size ────────────────────────────────────────────────────────
+#
+# TARGET_CHARS is the ideal size the chunker aims for.
+# CHUNK_TOLERANCE is the relative tolerance (e.g. 0.25 = ±25%).
+#
+#   acceptable range = [TARGET × (1 − TOL),  TARGET × (1 + TOL)]
+#
+# With TARGET=300 and TOL=0.25 → chunks aim for 225 to 375 chars,
+# as close to 300 as possible, always ending on a sentence boundary.
+#
+TARGET_CHARS    = 300
+CHUNK_TOLERANCE = 0.25
+
+# ─── Overlap ──────────────────────────────────────────────────────────────────
+
+# Number of sentences repeated at the start of the next chunk to preserve
+# context between adjacent chunks of the same section.
+OVERLAP_SENTENCES = 1
+
+# ─── Validation thresholds ────────────────────────────────────────────────────
+
+# fix_chunks.py splits a "too_long" chunk only if it exceeds upper × this factor.
+# E.g. upper=750, factor=1.5 → split only chunks > 1125 chars.
+# Chunks in [upper, upper×factor] remain non-blocking warnings.
+SPLIT_THRESHOLD_FACTOR = 1.5
+
+MATH_SYMS_MIN = 3   # min. math symbols to downgrade incomplete → incomplete_math
+
+# ─── Patterns and format ──────────────────────────────────────────────────────
+
+SENTENCE_SPLIT_PATTERN = r"(?<=[.!?»])\s+"
+PREFIX_TEMPLATE = "[{sezione} > {titolo}]"
+
+# ─── Protection of special content ────────────────────────────────────────────
+
+# If True, a block that is mostly a Markdown table (≥50% |…| lines)
+# is emitted as an atomic chunk without sentence splitting.
+PROTECT_TABLES = True
+
+# Reserved: unsplittable LaTeX blocks (future implementation).
+PROTECT_MATH = True
+
+# ─── Fix behavior ─────────────────────────────────────────────────────────────
+
+# Maximum number of iterations of the fix → verify → fix loop.
+# With 1 you get the original behavior (single fix without re-verification).
+FIX_MAX_ITERATIONS = 3
+
+# ─── Per-strategy overrides ───────────────────────────────────────────────────
+#
+# These override TARGET_CHARS / CHUNK_TOLERANCE / OVERLAP_SENTENCES
+# for the specific strategy named in structure_profile.json.
+# Recognized keys: "target_chars", "tolerance", "overlap".
+#
+STRATEGY_OVERRIDES: dict[str, dict] = {
+    "h3_aware": {
+        # H2→H3 structured documents: medium chunks, moderate overlap.
+        "target_chars": 600,
+        "tolerance":    0.25,
+        "overlap":      2,
+    },
+    "h2_paragraph_split": {
+        # Flat documents (H2 only): larger chunks, reduced overlap.
+        "target_chars": 800,
+        "tolerance":    0.25,
+        "overlap":      1,
+    },
+    "paragraph": {
+        # Documents without meaningful headers: shorter chunks.
+        "target_chars": 500,
+        "tolerance":    0.30,
+        "overlap":      1,
+    },
+    "sliding_window": {
+        # Linear/narrative text: wide windows, generous overlap.
+        "target_chars": 800,
+        "tolerance":    0.25,
+        "overlap":      3,
+    },
+}
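
The fallback behavior of `STRATEGY_OVERRIDES` (a missing key, or an unknown strategy, falls back to the global value) can be illustrated with a dict merge. This is a sketch with trimmed-down, hypothetical data: `DEFAULTS`, `OVERRIDES` and `resolve` are illustrative names, and the partial overrides are chosen to show the fallback rather than copied from the file.

```python
# Global defaults, mirroring TARGET_CHARS / CHUNK_TOLERANCE / OVERLAP_SENTENCES.
DEFAULTS = {"target_chars": 300, "tolerance": 0.25, "overlap": 1}

# Partial overrides: keys that are absent fall back to DEFAULTS.
OVERRIDES: dict[str, dict] = {
    "h3_aware": {"target_chars": 600, "overlap": 2},
    "paragraph": {"tolerance": 0.30},
}


def resolve(strategy: str) -> tuple[int, float, int]:
    """Return (target_chars, tolerance, overlap) for a strategy."""
    merged = {**DEFAULTS, **OVERRIDES.get(strategy, {})}
    return merged["target_chars"], merged["tolerance"], merged["overlap"]


h3 = resolve("h3_aware")
para = resolve("paragraph")
unknown = resolve("not_a_strategy")
```

The merge is equivalent to the per-key `ov.get(key, default)` lookups in `_ov`; an unknown strategy name silently yields the global defaults.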
@@ -21,12 +21,29 @@ Uso:
 """

 import argparse
+import contextlib
+import io
 import json
 import re
 import sys
 from pathlib import Path

-MAX_CHARS = 800
+_HERE = Path(__file__).resolve().parent
+if str(_HERE) not in sys.path:
+    sys.path.insert(0, str(_HERE))
+import config as cfg
+from verify_chunks import verify_stem as _verify_stem
+
+MAX_CHARS = int(cfg.TARGET_CHARS * (1 + cfg.CHUNK_TOLERANCE))
+
+
+def _load_thresholds(stem_dir: Path) -> int:
+    """Read max_chars from meta.json (written by the chunker) or fall back to the config default."""
+    meta = stem_dir / "meta.json"
+    if meta.exists():
+        # json is already imported at module level; no local alias is needed.
+        return json.loads(meta.read_text(encoding="utf-8"))["max_chars"]
+    return MAX_CHARS
 PUNCT_END = re.compile(r"[.!?»)\]'\u2019\"\u201c\u201d\u2018\u2014\u2013-]$")


@@ -53,7 +70,20 @@ def _rebuild_text(chunk: dict, body: str) -> str:
    return f"{_prefix(chunk)}\n{body}"


+# Strong sentence end: . ! ? » followed by whitespace + an uppercase letter or quote.
+# Do not use weak punctuation (,;:)>>]) to avoid creating incomplete chunks.
+_STRONG_END = re.compile(
+    r'[.!?\xbb]\s+(?=[A-Z\xc0-\xd6\xd8-\xde\xab\x22\x27(])'
+)
+_SECONDARY_END = re.compile(r';\s+')
+
+
 def _split_at_boundary(text: str, max_chars: int) -> list[str]:
+    """Split text into parts ≤ max_chars at strong sentence boundaries (.!?).
+
+    If no strong (or secondary "; ") boundary is found within max_chars, it does
+    NOT split: a too_long chunk (warning) beats an incomplete chunk (blocker).
+    """
    if len(text) <= max_chars:
        return [text]

@@ -62,20 +92,29 @@ def _split_at_boundary(text: str, max_chars: int) -> list[str]:

    while len(remaining) > max_chars:
        candidate = remaining[:max_chars]
-        split_pos = candidate.rfind("\n\n")

-        if split_pos == -1:
-            m = None
-            for m in re.finditer(r"[.!?»]\s+", candidate):
-                pass
-            split_pos = m.end() if m else None
+        last_pos = -1
+        for m in _STRONG_END.finditer(candidate):
+            last_pos = m.start() + 1  # position right after the terminator character

-        if split_pos is None or split_pos == 0:
-            sp = remaining.find(" ", max_chars)
-            split_pos = sp if sp != -1 else len(remaining)
-
-        parts.append(remaining[:split_pos].rstrip())
-        remaining = remaining[split_pos:].lstrip()
+        if last_pos > 0:
+            first = remaining[:last_pos].rstrip()
+            remaining = remaining[last_pos:].lstrip()
+            if first:
+                parts.append(first)
+        else:
+            # Try a secondary boundary: ; + whitespace (legal clauses)
+            sec_pos = -1
+            for m in _SECONDARY_END.finditer(candidate):
+                sec_pos = m.start() + 1
+            if sec_pos > 0:
+                first = remaining[:sec_pos].rstrip()
+                remaining = remaining[sec_pos:].lstrip()
+                if first:
+                    parts.append(first)
+            else:
+                # No boundary: leave the chunk whole (too_long > incomplete)
+                break

    if remaining:
        parts.append(remaining)
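
The "split only at a strong boundary, otherwise leave whole" rule can be exercised on its own. The sketch below is a reduced version of `_split_at_boundary` without the secondary `; ` fallback; `split_at_strong_boundary` is a hypothetical name, and the lookahead accepts uppercase letters and opening quotes only, which is an assumption about the intended behavior.

```python
import re

# Sentence terminator followed by whitespace and an uppercase letter,
# an opening quote or an opening parenthesis.
STRONG_END = re.compile(
    r'[.!?\xbb]\s+(?=[A-Z\xc0-\xd6\xd8-\xde\xab\x22\x27(])'
)


def split_at_strong_boundary(text: str, max_chars: int) -> list[str]:
    """Split at the last strong boundary inside each max_chars window."""
    parts: list[str] = []
    remaining = text
    while len(remaining) > max_chars:
        cut = -1
        for m in STRONG_END.finditer(remaining[:max_chars]):
            cut = m.start() + 1  # keep the terminator with the left part
        if cut <= 0:
            # No safe boundary: keep the rest whole rather than cut mid-sentence.
            break
        parts.append(remaining[:cut].rstrip())
        remaining = remaining[cut:].lstrip()
    if remaining:
        parts.append(remaining)
    return parts


pieces = split_at_strong_boundary("One is here. Two is here. Three is here.", max_chars=30)
unsplittable = split_at_strong_boundary("no boundary in this long run of words", max_chars=10)
```

The second call returns the input unchanged, which corresponds to the `too_long` warning case in the diff: an oversized chunk is preferred over one that ends mid-sentence.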
@@ -173,10 +212,12 @@ def renumber_ids(chunks: list[dict]) -> list[dict]:

 # ─── Core ─────────────────────────────────────────────────────────────────────

-def fix_stem(stem: str, project_root: Path, max_chars: int, dry_run: bool) -> bool:
+def fix_stem(stem: str, project_root: Path, max_chars: int, dry_run: bool,
+             max_iter: int = cfg.FIX_MAX_ITERATIONS) -> bool:
    stem_dir    = project_root / "chunks" / stem
    chunks_path = stem_dir / "chunks.json"
    report_path = stem_dir / "report.json"
+    max_chars   = _load_thresholds(stem_dir)  # meta.json takes precedence over the argument

    if not chunks_path.exists():
        print(f"✗ chunks/{stem}/chunks.json non trovato.")
@@ -195,14 +236,22 @@ def fix_stem(stem: str, project_root: Path, max_chars: int, dry_run: bool) -> bo
    print(f"\nDocumento: {stem}  (verdict: {verdict})")

    if verdict == "ok":
-        print("  ✅ Nessun problema — nulla da correggere.")
+        print("  ✅ Nessun problema - nulla da correggere.")
        return True

    empty_ids      = {e["chunk_id"] for e in report.get("blockers", {}).get("empty", [])}
    no_prefix_ids  = {e["chunk_id"] for e in report.get("blockers", {}).get("no_prefix", [])}
    incomplete_ids = {e["chunk_id"] for e in report.get("blockers", {}).get("incomplete", [])}
    too_short_ids  = {e["chunk_id"] for e in report.get("warnings", {}).get("too_short", [])}
-    too_long_ids   = {e["chunk_id"] for e in report.get("warnings", {}).get("too_long", [])}
+
+    # Split only chunks that exceed upper × SPLIT_THRESHOLD_FACTOR,
+    # not those just above upper (splitting them would produce incomplete chunks).
+    _split_limit = max_chars * cfg.SPLIT_THRESHOLD_FACTOR
+    too_long_ids = {
+        e["chunk_id"]
+        for e in report.get("warnings", {}).get("too_long", [])
+        if e.get("n_chars", 0) > _split_limit
+    }

    ops: list[str] = []
    if empty_ids:
@@ -230,24 +279,77 @@ def fix_stem(stem: str, project_root: Path, max_chars: int, dry_run: bool) -> bo

    n_before = len(chunks)

-    if empty_ids:
-        chunks, n = fix_empty(chunks, empty_ids)
-        print(f"\n  🗑  Rimossi {n} chunk vuoti.")
-
-    if no_prefix_ids:
-        chunks, n = fix_no_prefix(chunks, no_prefix_ids)
+    def _fix_blockers(chunks: list[dict], report: dict) -> list[dict]:
+        """Resolve only the blockers (incomplete, empty, no_prefix) without touching warnings."""
+        empty_ids_      = {e["chunk_id"] for e in report.get("blockers", {}).get("empty", [])}
+        no_prefix_ids_  = {e["chunk_id"] for e in report.get("blockers", {}).get("no_prefix", [])}
+        incomplete_ids_ = {e["chunk_id"] for e in report.get("blockers", {}).get("incomplete", [])}
+        if empty_ids_:
+            chunks, n = fix_empty(chunks, empty_ids_)
+            print(f"  🗑  Rimossi {n} chunk vuoti.")
+        if no_prefix_ids_:
+            chunks, n = fix_no_prefix(chunks, no_prefix_ids_)
            print(f"  🔧 Aggiunto prefisso a {n} chunk.")
+        if incomplete_ids_:
+            chunks, n = fix_incomplete_and_short(chunks, incomplete_ids_)
+            print(f"  🔗 Fusi {n} chunk incompleti.")
+        return renumber_ids(chunks)

-    merge_ids = incomplete_ids | too_short_ids
-    if merge_ids:
-        chunks, n = fix_incomplete_and_short(chunks, merge_ids)
-        print(f"  🔗 Fusi {n} chunk (incompleti + corti).")
-
-    if too_long_ids:
-        chunks, n = fix_too_long(chunks, too_long_ids, max_chars)
+    def _fix_warnings(chunks: list[dict], report: dict) -> list[dict]:
+        """Apply the optional fixes: merge too_short and split too_long."""
+        too_short_ids_ = {e["chunk_id"] for e in report.get("warnings", {}).get("too_short", [])}
+        too_long_ids_  = {
+            e["chunk_id"]
+            for e in report.get("warnings", {}).get("too_long", [])
+            if e.get("n_chars", 0) > max_chars * cfg.SPLIT_THRESHOLD_FACTOR
+        }
+        if too_short_ids_:
+            chunks, n = fix_incomplete_and_short(chunks, too_short_ids_)
+            print(f"  🔗 Fusi {n} chunk troppo corti.")
+        if too_long_ids_:
+            chunks, n = fix_too_long(chunks, too_long_ids_, max_chars)
            print(f"  ✂️  Spezzati {n} chunk lunghi.")
+        return renumber_ids(chunks)

-    chunks = renumber_ids(chunks)
+    # Phase 1: resolve blockers until convergence (empty, no_prefix, incomplete)
+    chunks = _fix_blockers(chunks, report)
+
+    _min = int(cfg.TARGET_CHARS * (1 - cfg.CHUNK_TOLERANCE))
+    _max = int(cfg.TARGET_CHARS * (1 + cfg.CHUNK_TOLERANCE))
+    prev_blockers = sum(len(v) for v in report.get("blockers", {}).values())
+
+    for iteration in range(1, max_iter + 1):
+        chunks_path.write_text(
+            json.dumps(chunks, ensure_ascii=False, indent=2), encoding="utf-8"
+        )
+        with contextlib.redirect_stdout(io.StringIO()):
+            _verify_stem(stem, project_root, _min, _max)
+        report = json.loads(report_path.read_text(encoding="utf-8"))
+        new_verdict = report.get("verdict", "ok")
+        curr_blockers = sum(len(v) for v in report.get("blockers", {}).values())
+
+        if new_verdict in ("ok", "warnings_only") or curr_blockers == 0:
+            break
+        if curr_blockers >= prev_blockers:
+            print(f"\n  ⚠️  Nessun miglioramento ({curr_blockers} blockers) - i restanti richiedono correzione manuale del clean.md.")
+            break
+
+        print(f"\n  Iterazione {iteration + 1} - {curr_blockers} blockers residui:")
+        prev_blockers = curr_blockers
+        chunks = _fix_blockers(chunks, report)
+
+    # Phase 2: fix warnings (too_short merge + too_long split) in a single final pass
+    with contextlib.redirect_stdout(io.StringIO()):
+        _verify_stem(stem, project_root, _min, _max)
+    report = json.loads(report_path.read_text(encoding="utf-8"))
+    n_short = len(report.get("warnings", {}).get("too_short", []))
+    n_long  = sum(
+        1 for e in report.get("warnings", {}).get("too_long", [])
+        if e.get("n_chars", 0) > max_chars * cfg.SPLIT_THRESHOLD_FACTOR
+    )
+    if n_short or n_long:
+        print(f"\n  Fix warnings: {n_short} corti, {n_long} lunghi da spezzare")
+        chunks = _fix_warnings(chunks, report)

    n_after = len(chunks)
    print(f"\n  Totale chunk: {n_before} → {n_after}")
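The two-phase flow in the hunk above starts with a blocker fix that repeats until the blocker count stops decreasing. A minimal sketch of that stopping rule in isolation, assuming hypothetical `count_blockers` and `fix_once` callables that stand in for `_verify_stem` and `_fix_blockers`; it is not the project's implementation:

```python
from typing import Callable

Chunks = list[dict]

def fix_until_stable(
    chunks: Chunks,
    count_blockers: Callable[[Chunks], int],  # hypothetical stand-in for the verify step
    fix_once: Callable[[Chunks], Chunks],     # hypothetical stand-in for _fix_blockers
    max_iter: int = 10,
) -> tuple[Chunks, int]:
    """Repeat fix_once until no blockers remain, progress stalls, or max_iter is reached."""
    prev = count_blockers(chunks)
    chunks = fix_once(chunks)              # first pass, before the loop
    curr = prev
    for _ in range(max_iter):
        curr = count_blockers(chunks)      # re-verify after the latest fix
        if curr == 0 or curr >= prev:      # done, or no improvement
            break
        prev = curr
        chunks = fix_once(chunks)
    return chunks, curr

# Toy data: empty chunks are the only "blockers", and each pass removes one of them.
def count_empty(cs: Chunks) -> int:
    return sum(1 for c in cs if not c["text"])

def drop_one_empty(cs: Chunks) -> Chunks:
    out, dropped = [], False
    for c in cs:
        if not c["text"] and not dropped:
            dropped = True
            continue
        out.append(c)
    return out

sample = [{"text": "a"}, {"text": ""}, {"text": ""}, {"text": "b"}]
fixed, remaining = fix_until_stable(sample, count_empty, drop_one_empty)
```

Stopping on `curr >= prev` means a fix that cannot make progress ends the loop after one re-check instead of running all `max_iter` iterations.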
@@ -256,7 +358,14 @@ def fix_stem(stem: str, project_root: Path, max_chars: int, dry_run: bool) -> bo
        json.dumps(chunks, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    print(f"  ✅ Salvato: chunks/{stem}/chunks.json")
-    print(f"\n  Riesegui la verifica:")
+
+    final_verdict = report.get("verdict", "?")
+    if final_verdict == "ok":
+        print(f"  ✅ Verdict finale: ok - procedi alla vettorizzazione.")
+    elif final_verdict == "warnings_only":
+        print(f"  🟡 Verdict finale: warnings_only - puoi procedere.")
+    else:
+        print(f"  🔴 Verdict finale: {final_verdict} - rilancia la verifica manualmente:")
        print(f"     python chunks/verify_chunks.py --stem {stem}")

    return True
@@ -269,15 +378,20 @@ if __name__ == "__main__":

    parser = argparse.ArgumentParser(description="Fix chunk")
    parser.add_argument("--stem", required=True, help="Nome del documento (sottocartella di chunks/)")
+    _max_def = int(cfg.TARGET_CHARS * (1 + cfg.CHUNK_TOLERANCE))
    parser.add_argument(
-        "--max", type=int, default=MAX_CHARS,
-        help=f"Soglia massima caratteri per lo split (default: {MAX_CHARS})"
+        "--max", type=int, default=_max_def,
+        help=f"Soglia massima caratteri per lo split (default: TARGET×(1+TOL) = {_max_def})"
    )
    parser.add_argument(
        "--dry-run", action="store_true",
        help="Mostra le operazioni pianificate senza applicarle"
    )
+    parser.add_argument(
+        "--max-iter", type=int, default=10, metavar="N",
+        help="Numero massimo di iterazioni automatiche (default: 10)"
+    )
    args = parser.parse_args()

-    ok = fix_stem(args.stem, project_root, args.max, args.dry_run)
+    ok = fix_stem(args.stem, project_root, args.max, args.dry_run, args.max_iter)
    sys.exit(0 if ok else 1)
@@ -20,22 +20,39 @@ import re
 import sys
 from pathlib import Path

+_HERE = Path(__file__).resolve().parent
+if str(_HERE) not in sys.path:
+    sys.path.insert(0, str(_HERE))
+import config as cfg

-# ─── Soglie ───────────────────────────────────────────────────────────────────

-MIN_CHARS = 200
-MAX_CHARS = 800
+# ─── Soglie (derivate dal target, sovrascrivibili da CLI) ────────────────────
+
+MIN_CHARS = int(cfg.TARGET_CHARS * (1 - cfg.CHUNK_TOLERANCE))
+MAX_CHARS = int(cfg.TARGET_CHARS * (1 + cfg.CHUNK_TOLERANCE))
 PUNCT_END = re.compile(
-    r"[.!?»)\]'\u2019\"\u201c\u201d\u2018\u2014\u2013\u2026]$"
+    r"[.!?\xbb)\]'\u2019\"\u201c\u201d\u2018\u2014\u2013\u2026]$"
    r"|/$"    # URL che finisce con /
    r"|\|$"   # riga di tabella Markdown
+    r"|;$"    # fine clausola legale (testo giuridico)
    r"|:$"    # introduzione a lista o formula
 )
 _HEX_END     = re.compile(r"[0-9a-fA-F]{8,}$")
-_URL_TAIL    = re.compile(r"https?://\S+(\s+\S+){0,3}$")  # URL con fino a 3 token extra
+_URL_TAIL    = re.compile(r"(https?://|www\.)\S+(\s+\S+){0,3}$")  # URL con fino a 3 token extra
 _MATH_SYMS   = re.compile(r"[∈∑≤≥≠∀∃∫√∞∂±×÷→←↔⊂⊃⊆⊇∩∪·°]")
+_ROMAN_END   = re.compile(r"\b(I{1,3}|IV|VI{0,3}|IX|XI{0,2}|XIV|XV|XVI{0,2}|XIX|XX{0,2})$")


+
+def _load_thresholds(stem_dir: Path) -> tuple[int, int]:
+    """Read min/max from meta.json (written by the chunker) or fall back to the config defaults."""
+    meta = stem_dir / "meta.json"
+    if meta.exists():
+        import json as _json
+        m = _json.loads(meta.read_text(encoding="utf-8"))
+        # Fall back to the defaults if meta.json lacks either key
+        return m.get("min_chars", MIN_CHARS), m.get("max_chars", MAX_CHARS)
+    return MIN_CHARS, MAX_CHARS
+
 # ─── Checks ───────────────────────────────────────────────────────────────────

 def has_prefix(chunk: dict) -> bool:
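The hunk above replaces the fixed `MIN_CHARS = 200` / `MAX_CHARS = 800` with values derived from a target size and a tolerance. A small sketch of that arithmetic; the numbers used here are illustrative assumptions, since the real `TARGET_CHARS` and `CHUNK_TOLERANCE` live in `config.py` and are not shown in this diff:

```python
def derive_thresholds(target_chars: int, tolerance: float) -> tuple[int, int]:
    """Return (min_chars, max_chars) as target * (1 -/+ tolerance), truncated to int."""
    min_chars = int(target_chars * (1 - tolerance))
    max_chars = int(target_chars * (1 + tolerance))
    return min_chars, max_chars

# Illustrative values only (not taken from config.py).
min_chars, max_chars = derive_thresholds(target_chars=500, tolerance=0.5)
print(min_chars, max_chars)
```

Deriving both bounds from one target keeps the verifier and the fixer aligned when the target changes, instead of editing two constants in several files.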
@@ -51,7 +68,7 @@ def is_too_short(chunk: dict, min_chars: int) -> bool:


 def is_too_long(chunk: dict, max_chars: int) -> bool:
-    return chunk.get("n_chars", 0) > max_chars * 1.5
+    return chunk.get("n_chars", 0) > max_chars


 def ends_incomplete(chunk: dict) -> bool:
@@ -65,6 +82,8 @@ def ends_incomplete(chunk: dict) -> bool:
        return False
    if _HEX_END.search(text_check):   # hash SHA / codice hex
        return False
+    if _ROMAN_END.search(text_check):  # numero romano finale (indice/riferimento PDF)
+        return False
    if _URL_TAIL.search(text_check[-200:]):  # URL (con eventuale path dopo spazio)
        return False
    return True
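The `_URL_TAIL` change is the core of this commit: a chunk that ends in a bare `www.` address is no longer flagged as incomplete. The sketch below compares the old and new patterns copied from the diff; the sample strings are made up for illustration:

```python
import re

# Patterns copied from the diff: before and after the change.
OLD_URL_TAIL = re.compile(r"https?://\S+(\s+\S+){0,3}$")
NEW_URL_TAIL = re.compile(r"(https?://|www\.)\S+(\s+\S+){0,3}$")

def ends_with_url(text: str, pattern: re.Pattern[str]) -> bool:
    """True when the last 200 characters end in a URL followed by at most 3 extra tokens."""
    return pattern.search(text[-200:]) is not None

# Hypothetical chunk endings.
watermark = "Documento scaricato da www.docsity.com"
with_path = "Fonte: https://example.org/path pagina 3"
plain     = "Questa frase termina senza punto"

for text in (watermark, with_path, plain):
    print(ends_with_url(text, OLD_URL_TAIL), ends_with_url(text, NEW_URL_TAIL), text)
```

Only the watermark case changes between the two patterns, which matches the false blockers described in the commit message.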
@@ -72,7 +91,7 @@ def ends_incomplete(chunk: dict) -> bool:

 def is_math_incomplete(chunk: dict) -> bool:
    """Incompleto ma in contesto matematico — degrada a warning invece di blocker."""
-    return ends_incomplete(chunk) and len(_MATH_SYMS.findall(chunk.get("text", ""))) >= 3
+    return ends_incomplete(chunk) and len(_MATH_SYMS.findall(chunk.get("text", ""))) >= cfg.MATH_SYMS_MIN


 # ─── Report ───────────────────────────────────────────────────────────────────
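The `is_math_incomplete` check in the hunk above downgrades an incomplete chunk to a warning when it contains enough mathematical symbols. A sketch of the counting step alone, reusing the character class from `_MATH_SYMS`; the default of 3 is the previous hard-coded value, and the actual `cfg.MATH_SYMS_MIN` is not shown in this diff:

```python
import re

# Same character class as _MATH_SYMS in the diff.
MATH_SYMS = re.compile(r"[∈∑≤≥≠∀∃∫√∞∂±×÷→←↔⊂⊃⊆⊇∩∪·°]")

def count_math_symbols(text: str) -> int:
    """Count every occurrence of a symbol from the class, repeats included."""
    return len(MATH_SYMS.findall(text))

def looks_mathematical(text: str, min_syms: int = 3) -> bool:
    """min_syms=3 mirrors the old literal; the configured value is an assumption here."""
    return count_math_symbols(text) >= min_syms

formula = "∀x ∈ A, x ≤ b"
print(count_math_symbols(formula), looks_mathematical(formula))
```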
@@ -85,7 +104,9 @@ def _fmt_chunk(c: dict) -> str:


 def verify_stem(stem: str, project_root: Path, min_chars: int, max_chars: int) -> bool:
-    chunks_path = project_root / "chunks" / stem / "chunks.json"
+    stem_dir    = project_root / "chunks" / stem
+    chunks_path = stem_dir / "chunks.json"
+    min_chars, max_chars = _load_thresholds(stem_dir)

    print(f"\nDocumento: {stem}")

@@ -170,12 +191,12 @@ def verify_stem(stem: str, project_root: Path, min_chars: int, max_chars: int) -

    if too_long:
        has_errors = True
-        print(f"\n  🟡 {len(too_long)} chunk SOPRA MAX_CHARS×1.5 ({int(max_chars * 1.5)}):")
+        print(f"\n  🟡 {len(too_long)} chunk SOPRA MAX ({max_chars}):")
        for c in too_long[:5]:
            print(_fmt_chunk(c))
        if len(too_long) > 5:
            print(f"  ... e altri {len(too_long) - 5}")
-        print(f"  → Soluzione: alza MAX_CHARS o verifica il testo nel MD")
+        print(f"  → Causa probabile: frasi singole lunghe (liste/paragrafi non suddivisibili)")

    if incomplete:
        has_errors = True
@@ -225,7 +246,12 @@ def verify_stem(stem: str, project_root: Path, min_chars: int, max_chars: int) -
            "max_chars": max_l,
            "avg_chars": avg_l,
        },
-        "thresholds": {"min_chars": min_chars, "max_chars": max_chars},
+        "thresholds": {
+            "min_chars": min_chars,
+            "max_chars": max_chars,
+            "target_chars": cfg.TARGET_CHARS,
+            "chunk_tolerance": cfg.CHUNK_TOLERANCE,
+        },
        "blockers": {
            "empty":      [_chunk_entry(c) for c in empty_chunks],
            "no_prefix":  [_chunk_entry(c) for c in no_prefix],
@@ -253,11 +279,11 @@ def verify_stem(stem: str, project_root: Path, min_chars: int, max_chars: int) -

    if not blockers and not warnings:
        print(f"  ✅ Tutto OK — procedi alla vettorizzazione:")
-        print(f"       python step-8/ingest.py --stem {stem}")
+        print(f"       python ingestion/ingest.py --stem {stem}")

    elif not blockers:
        print(f"  🟡 Solo avvisi minori — puoi procedere alla vettorizzazione:")
-        print(f"       python step-8/ingest.py --stem {stem}")
+        print(f"       python ingestion/ingest.py --stem {stem}")
        print()
        print(f"  Oppure, per ottimizzare prima:")
        if too_short:
@@ -301,13 +327,15 @@ if __name__ == "__main__":

    parser = argparse.ArgumentParser(description="Verifica chunk")
    parser.add_argument("--stem", help="Nome del documento (sottocartella di chunks/)")
+    _min_def = int(cfg.TARGET_CHARS * (1 - cfg.CHUNK_TOLERANCE))
+    _max_def = int(cfg.TARGET_CHARS * (1 + cfg.CHUNK_TOLERANCE))
    parser.add_argument(
-        "--min", type=int, default=MIN_CHARS,
-        help=f"Soglia minima caratteri (default: {MIN_CHARS})"
+        "--min", type=int, default=_min_def,
+        help=f"Soglia minima caratteri (default: TARGET×(1-TOL) = {_min_def})"
    )
    parser.add_argument(
-        "--max", type=int, default=MAX_CHARS,
-        help=f"Soglia massima caratteri (default: {MAX_CHARS})"
+        "--max", type=int, default=_max_def,
+        help=f"Soglia massima caratteri (default: TARGET×(1+TOL) = {_max_def})"
    )
    args = parser.parse_args()

@@ -18,7 +18,7 @@ TOP_K = 6
 # Temperatura del modello LLM.
 # 0.0 = completamente deterministico (stessa risposta ad ogni run)
 # 0.7 = più creativo e vario
-TEMPERATURE = 0.0
+TEMPERATURE = 0.2

 # Disabilita il "thinking" (ragionamento interno) nei modelli Qwen3/Qwen3.5.
 # True  = risposta diretta, più veloce
@@ -38,7 +38,7 @@ EMBED_MODEL = "nomic-embed-text"
 OLLAMA_URL = "http://localhost:11434"

 # Modello LLM. Scegli in base alla RAM disponibile (vedi README).
-OLLAMA_MODEL = "qwen3.5:0.8b"
+OLLAMA_MODEL = "qwen3.5:4b"

 # ── Prompt di sistema ─────────────────────────────────────────────────────────

@@ -33,13 +33,13 @@ Posiziona il PDF in `sources/<nome>.pdf`, poi:

 ```bash
 # Singolo documento
-python conversione/pipeline.py --stem <nome>
+python conversione/ --stem <nome>

 # Tutti i PDF in sources/
-python conversione/pipeline.py
+python conversione/

 # Forza la riesecuzione (sovrascrive output esistente)
-python conversione/pipeline.py --stem <nome> --force
+python conversione/ --stem <nome> --force
 ```

 Il parametro `--stem` è il nome del file PDF senza estensione.  
@@ -49,12 +49,13 @@ Esempio: `sources/analisi1.pdf` → `--stem analisi1`

 ## Output

-For each stem, three files are produced in `conversione/<stem>/`:
+For each stem, four files are produced in `conversione/<stem>/`:

 | File | Description |
 |------|-------------|
 | `raw.md` | Raw Markdown extracted from the PDF; **do not edit** |
 | `clean.md` | Clean, structured Markdown; input for the chunker |
+| `structure_profile.json` | Detected structure and recommended chunking strategy |
 | `report.json` | Full quality metrics for the conversion |

 ### report.json
@@ -110,11 +111,15 @@ anomalie e problemi residui con esempi.

 ## Validazione batch

-After converting one or more documents, run `validate.py` to get
+After converting one or more documents, run `validate` to get
 a status table across all stems:

 ```bash
-python conversione/validate.py
+# All documents
+python conversione/ validate
+
+# Single document with penalty details
+python conversione/ validate <stem> --detail
 ```

 Output di esempio:
@@ -221,11 +226,19 @@ Durante l'esecuzione la pipeline stampa le statistiche di ogni trasformazione:

 ```
  [3/4] Pulizia strutturale...
-  ✅ Immagini rimosse:      0
+  ✅ Simboli PUA corretti:  0
+     Immagini rimosse:      0
+     Note rimosse:          12
     Accenti corretti:      3701
     Dot-leader rimossi:    53
     Header concat fixati:  0
+     Header num. normaliz.: 8
+     Articoli → ###:        0
+     Ambienti matematici:   0
+     Titoli header uniti:   4
     TOC rimosso:           sì
+     Versi poesia riprist.: 0
+     Header verso demotati: 0
     ALL-CAPS → ##:         14
     Sezioni → ###:         279
     Paragrafi uniti:       12998
@@ -1,18 +1,28 @@
-# Step 8 — Vettorizzazione
+# Ingestion — Vettorizzazione

-Reads the chunks produced by step-6, generates the embeddings via Ollama, and
+Reads the chunks produced by the chunker, generates the embeddings via Ollama, and
 saves them to ChromaDB (a persistent on-disk vector store).

 ---

 ## Prerequisites

-- Step-6 completed (`step-6/<stem>/chunks.json` exists)
+- Chunking completed (`chunks/<stem>/chunks.json` exists)
 - Ollama running with the embedding model downloaded
 - `chromadb` installed (`pip install -r requirements.txt`)

 ---

+## Environment check
+
+Before running ingestion, verify that Ollama and the models are available:
+
+```bash
+.venv/bin/python ollama/check_env.py
+```
+
+---
+
 ## Configurazione modello

 Il modello di embedding viene letto da **`config.py`**:
@@ -23,7 +33,7 @@ EMBED_MODEL = "nomic-embed-text"   # ← cambia qui
 ```

 > The model chosen here must match the one used in rag.py.
-> If you change it after vectorizing, you must rerun step-8 with `--force`.
+> If you change it after vectorizing, you must rerun ingestion with `--force`.

 ---

@@ -31,16 +41,16 @@ EMBED_MODEL = "nomic-embed-text"   # ← cambia qui

 ```bash
 # Vectorize a single document
-python step-8/ingest.py --stem <nome>
+.venv/bin/python ingestion/ingest.py --stem <nome>

 # Vectorize every document found in chunks/
-python step-8/ingest.py
+.venv/bin/python ingestion/ingest.py

 # Overwrite an existing collection
-python step-8/ingest.py --stem <nome> --force
+.venv/bin/python ingestion/ingest.py --stem <nome> --force

 # Override the model (without editing config.py)
-python step-8/ingest.py --stem <nome> --model bge-m3
+.venv/bin/python ingestion/ingest.py --stem <nome> --model bge-m3
 ```

 ---
@@ -94,7 +104,7 @@ Senza `--force` lo script salta la collection già esistente — i vecchi vettor

 ```bash
 # Cambio modello → ricrea sempre la collection
-python step-8/ingest.py --stem <nome> --force
+.venv/bin/python ingestion/ingest.py --stem <nome> --force
 ```

 ### Quando usare `--force`
@@ -103,7 +113,7 @@ python step-8/ingest.py --stem <nome> --force
 |---|---|
 | Prima esecuzione | No |
 | Hai cambiato `EMBED_MODEL` | **Sì** |
-| Hai migliorato i chunk in step-6 | **Sì** |
+| Hai migliorato i chunk in chunks/ | **Sì** |
 | Hai aggiunto nuovi documenti (stem diverso) | No |
 | Vuoi solo verificare che funzioni | No |

@@ -2,21 +2,21 @@
 """
 Ingestion: vectorization

-Legge i chunk prodotti da step-6, genera gli embedding tramite Ollama
+Legge i chunk prodotti da chunks/, genera gli embedding tramite Ollama
 e li indicizza in ChromaDB (persistente).

 Il modello di embedding viene letto da config.py (EMBED_MODEL).
 Puoi sovrascriverlo con --model, ma deve corrispondere al modello che
 userai in rag.py — altrimenti riesegui con --force dopo aver cambiato.

-Input:  step-6/<stem>/chunks.json
+Input:  chunks/<stem>/chunks.json
 Output: chroma_db/<stem> (collection ChromaDB)

 Uso:
-    python step-8/ingest.py --stem <nome>            # singolo documento
-    python step-8/ingest.py                          # tutti gli stem trovati
-    python step-8/ingest.py --stem <nome> --force    # sovrascrive collection
-    python step-8/ingest.py --model bge-m3           # override modello
+    python ingestion/ingest.py --stem <nome>            # singolo documento
+    python ingestion/ingest.py                          # tutti gli stem trovati
+    python ingestion/ingest.py --stem <nome> --force    # sovrascrive collection
+    python ingestion/ingest.py --model bge-m3           # override modello
 """

 import argparse
@@ -33,7 +33,7 @@ import chromadb

 project_root = Path(__file__).parent.parent

-CHUNKS_DIR = project_root / "step-6"
+CHUNKS_DIR = project_root / "chunks"
 CHROMA_DIR = project_root / "chroma_db"

 sys.path.insert(0, str(project_root))
@@ -94,40 +94,27 @@ def collection_exists(client: chromadb.PersistentClient, stem: str) -> bool:

 # ─── Ingestione ───────────────────────────────────────────────────────────────

-def ingest(stem: str, force: bool, model: str = EMBED_MODEL) -> bool:
+def _ingest_stem(stem: str, collection: chromadb.Collection,
+                 model: str, offset: int = 0) -> int:
    """
-    Legge step-6/<stem>/chunks.json, genera embedding e popola ChromaDB.
-    Ritorna True se completato con successo, False altrimenti.
+    Aggiunge i chunk di uno stem a una collection esistente.
+    Prefissa chunk_id con stem per evitare collisioni multi-documento.
+    Ritorna il numero di chunk aggiunti.
    """
    chunks_path = CHUNKS_DIR / stem / "chunks.json"
    if not chunks_path.exists():
        print(f"❌ File non trovato: {chunks_path}")
-        return False
+        return 0

    with open(chunks_path, encoding="utf-8") as f:
        chunks = json.load(f)

    if not chunks:
        print(f"⚠️  {stem}: chunks.json è vuoto — skip")
-        return False
-
-    client = get_client()
-
-    if collection_exists(client, stem):
-        if not force:
-            print(f"⚠️  Collection '{stem}' già presente in ChromaDB — skip")
-            print(f"   → usa --force per sovrascrivere")
-            return True  # non è un errore, è uno skip
-        client.delete_collection(stem)
-        print(f"🗑️  Collection '{stem}' rimossa (--force)")
-
-    collection = client.create_collection(
-        name=stem,
-        metadata={"hnsw:space": "cosine"},
-    )
+        return 0

    total = len(chunks)
-    print(f"📦 {total} chunk da ingestire\n")
+    print(f"  📄 {stem}: {total} chunk\n")

    ids        = []
    embeddings = []
@@ -143,10 +130,11 @@ def ingest(stem: str, force: bool, model: str = EMBED_MODEL) -> bool:
        t1 = time.monotonic()
        durations.append(t1 - t0)

-        ids.append(chunk["chunk_id"])
+        ids.append(f"{stem}__{chunk['chunk_id']}")
        embeddings.append(vector)
        documents.append(chunk["text"])
        metadatas.append({
+            "source":    stem,
            "sezione":   chunk.get("sezione", ""),
            "titolo":    chunk.get("titolo", ""),
            "sub_index": chunk.get("sub_index", 0),
@@ -154,41 +142,69 @@ def ingest(stem: str, force: bool, model: str = EMBED_MODEL) -> bool:

        avg  = sum(durations) / len(durations)
        eta  = int(avg * (total - i))
-        done = f"[{i:>{len(str(total))}}/{total}]"
-        cid  = chunk["chunk_id"][:50]
-        line = f"  {done} ✓ {cid:<50}  ETA: {eta}s"
-        print(f"{line:<80}", end="\r", flush=True)
+        done = f"[{offset + i:>6}/{offset + total}]"
+        cid  = chunk["chunk_id"][:40]
+        print(f"  {done} ✓ {stem}/{cid:<40}  ETA: {eta}s", end="\r", flush=True)

-        # Upsert in batch da 100 per non sovraccaricare la memoria
        if len(ids) == 100:
-            collection.add(
-                ids=ids,
-                embeddings=embeddings,
-                documents=documents,
-                metadatas=metadatas,
-            )
+            collection.add(ids=ids, embeddings=embeddings,
+                           documents=documents, metadatas=metadatas)
            ids, embeddings, documents, metadatas = [], [], [], []

-    # Upsert dei rimanenti
    if ids:
-        collection.add(
-            ids=ids,
-            embeddings=embeddings,
-            documents=documents,
-            metadatas=metadatas,
-        )
+        collection.add(ids=ids, embeddings=embeddings,
+                       documents=documents, metadatas=metadatas)

    elapsed = int(time.monotonic() - start)
-    print()  # nuova riga dopo il \r
-    print(f"\n✅ Ingestione completata in {elapsed}s — {total}/{total} chunk salvati")
-    print(f"   Collection '{stem}' in {CHROMA_DIR}/")
+    print()
+    print(f"  ✅ {stem}: {total} chunk in {elapsed}s")
+    return total
+
+
+def ingest(stem: str, force: bool, model: str = EMBED_MODEL) -> bool:
+    """Ingest singolo documento nella sua collection dedicata (retrocompatibile)."""
+    return ingest_multi([stem], collection_name=stem, force=force, model=model)
+
+
+def ingest_multi(stems: list[str], collection_name: str,
+                 force: bool, model: str = EMBED_MODEL) -> bool:
+    """
+    Ingerisce uno o più documenti in una singola collection ChromaDB.
+    I chunk_id sono prefissati con lo stem per evitare collisioni.
+    Il metadato 'source' identifica il documento di provenienza.
+    """
+    client = get_client()
+
+    if collection_exists(client, collection_name):
+        if not force:
+            print(f"⚠️  Collection '{collection_name}' già presente in ChromaDB — skip")
+            print(f"   → usa --force per sovrascrivere")
+            return True
+        client.delete_collection(collection_name)
+        print(f"🗑️  Collection '{collection_name}' rimossa (--force)")
+
+    collection = client.create_collection(
+        name=collection_name,
+        metadata={"hnsw:space": "cosine"},
+    )
+
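+    total_chunks = 0
+    for stem in stems:
+        n = _ingest_stem(stem, collection, model, offset=total_chunks)
+        total_chunks += n
+
+    if total_chunks == 0:
+        # Nothing was ingested: drop the empty collection, otherwise the next run would skip it
+        client.delete_collection(collection_name)
+        return False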
+
+    print(f"\n✅ Collection '{collection_name}': {total_chunks} chunk totali")
+    print(f"   Documenti: {', '.join(stems)}")
+    print(f"   Percorso: {CHROMA_DIR}/")
    return True


 # ─── Entry point ──────────────────────────────────────────────────────────────

 def find_stems() -> list[str]:
-    """Ritorna tutti gli stem che hanno un chunks.json in step-6/."""
+    """Ritorna tutti gli stem che hanno un chunks.json in chunks/."""
    return sorted(
        p.parent.name
        for p in CHUNKS_DIR.glob("*/chunks.json")
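The multi-document ingestion above rests on two ideas: IDs are namespaced as `<stem>__<chunk_id>` so that documents sharing chunk IDs do not collide, and records are written in batches of 100. A sketch of both without ChromaDB; `build_records` and `batches` are hypothetical helper names, not functions from `ingest.py`:

```python
from typing import Iterator

def namespaced_id(stem: str, chunk_id: str) -> str:
    """Same format as the diff: '<stem>__<chunk_id>'."""
    return f"{stem}__{chunk_id}"

def build_records(stem: str, chunks: list[dict]) -> list[dict]:
    """Attach a collision-free id and a 'source' metadata field to each chunk."""
    return [
        {
            "id": namespaced_id(stem, c["chunk_id"]),
            "document": c["text"],
            "metadata": {"source": stem, "sezione": c.get("sezione", "")},
        }
        for c in chunks
    ]

def batches(records: list[dict], size: int = 100) -> Iterator[list[dict]]:
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]

# Two documents that both contain a chunk called "c001".
doc1 = build_records("doc1", [{"chunk_id": "c001", "text": "x"}])
doc2 = build_records("doc2", [{"chunk_id": "c001", "text": "y"}])
print(doc1[0]["id"], doc2[0]["id"])
```

The `source` field is what lets `retrieve` and `rag` print which document a hit came from when several documents share a collection.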
@@ -197,26 +213,53 @@ def find_stems() -> list[str]:

 def main() -> int:
    parser = argparse.ArgumentParser(
-        description="Step 8 — Vettorizzazione chunk in ChromaDB"
+        description="Vettorizzazione chunk in ChromaDB",
+        epilog=(
+            "Esempi:\n"
+            "  python ingestion/ingest.py --stem manuale\n"
+            "  python ingestion/ingest.py --collection archivio --stems doc1 doc2 doc3\n"
+            "  python ingestion/ingest.py --collection archivio --stems doc1 doc2 --force\n"
+            "  python ingestion/ingest.py                          # tutti i documenti, collection separate"
+        ),
+        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
-    parser.add_argument("--stem", help="Nome del documento (senza --stem = tutti)")
+    parser.add_argument("--stem",
+                        help="Singolo documento → collection con lo stesso nome")
+    parser.add_argument("--stems", nargs="+", metavar="STEM",
+                        help="Uno o più documenti da unire in --collection")
+    parser.add_argument("--collection",
+                        help="Nome della collection di destinazione (richiesto con --stems)")
    parser.add_argument("--force", action="store_true",
                        help="Sovrascrive la collection se già esistente")
    parser.add_argument("--model", default=EMBED_MODEL,
-                        help=f"Modello embedding Ollama (default da config.py: {EMBED_MODEL})")
+                        help=f"Modello embedding (default: {EMBED_MODEL})")
    args = parser.parse_args()

-    print("─── Step 8 — Vettorizzazione ─────────────────────────────────────────\n")
+    print("─── Vettorizzazione ──────────────────────────────────────────────────\n")

    if not check_ollama(args.model):
        return 1

+    # ── Modalità multi-documento ─────────────────────────────────────────────
+    if args.stems or args.collection:
+        if not args.stems:
+            print("❌ --collection richiede --stems (es. --stems doc1 doc2 doc3)")
+            return 1
+        if not args.collection:
+            print("❌ --stems richiede --collection (es. --collection archivio)")
+            return 1
+        print(f"  Collection : {args.collection}")
+        print(f"  Documenti  : {', '.join(args.stems)}\n")
+        ok = ingest_multi(args.stems, args.collection,
+                          force=args.force, model=args.model)
+        return 0 if ok else 1
+
+    # ── Modalità singolo / tutti ─────────────────────────────────────────────
    stems = [args.stem] if args.stem else find_stems()
    if not stems:
-        print("❌ Nessun chunks.json trovato in step-6/")
+        print("❌ Nessun chunks.json trovato in chunks/")
        return 1

-    print()
    results = []
    for stem in stems:
        if len(stems) > 1:
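The CLI handling in this hunk requires `--stems` and `--collection` to appear together and otherwise falls back to single-document or all-documents mode. A self-contained sketch of that decision, assuming a hypothetical `resolve_mode` helper that raises instead of printing and returning 1:

```python
import argparse
from typing import Optional

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Sketch of the ingest CLI modes")
    parser.add_argument("--stem")
    parser.add_argument("--stems", nargs="+", metavar="STEM")
    parser.add_argument("--collection")
    return parser

def resolve_mode(argv: list[str]) -> tuple[str, Optional[str], list[str]]:
    """Return (mode, collection_name, stems); mode is 'multi', 'single' or 'all'."""
    args = build_parser().parse_args(argv)
    if args.stems or args.collection:
        if not (args.stems and args.collection):
            raise ValueError("--stems and --collection must be given together")
        return "multi", args.collection, args.stems
    if args.stem:
        # Single document: the collection takes the stem's name.
        return "single", args.stem, [args.stem]
    return "all", None, []

print(resolve_mode(["--collection", "archivio", "--stems", "doc1", "doc2"]))
```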
@@ -57,7 +57,7 @@ Alternative supportate:
 - `bge-m3`
 - `nomic-embed-text`

-Se cambi embedding model rispetto a quello usato in step-8, riesegui ingest con `--force` e aggiorna `EMBED_MODEL` in `config.py`.
+Se cambi embedding model rispetto a quello usato in ingestion, riesegui ingest con `--force` e aggiorna `EMBED_MODEL` in `config.py`.

 ### Modello LLM (consigliato per 8 GB RAM)

@@ -101,7 +101,7 @@ Output atteso (esempio):
 ✅ LLM disponibile: qwen3.5:4b
 ✅ chromadb importabile
 ✅ Ambiente pronto — procedi con la vettorizzazione:
-   python step-8/ingest.py --stem <nome>
+   python ingestion/ingest.py --stem <nome>
 ```

 ---
@@ -109,5 +109,5 @@ Output atteso (esempio):
 ## Prossimo step

 ```bash
-python step-8/ingest.py --stem <nome>
+python ingestion/ingest.py --stem <nome>
 ```
@@ -101,6 +101,7 @@ def retrieve(collection: chromadb.Collection, question: str) -> list[dict]:
    ):
        chunks.append({
            "text":     text,
+            "source":   meta.get("source", ""),
            "sezione":  meta.get("sezione", ""),
            "titolo":   meta.get("titolo", ""),
            "distance": dist,
@@ -143,7 +144,8 @@ def answer(question: str, collection: chromadb.Collection, verbose: bool) -> Non
            if c["titolo"]:
                loc += f" > {c['titolo']}"
            sim = 1 - c["distance"]
-            print(f"  [{i}] {loc}  (similarità: {sim:.3f})")
+            src = f"[{c['source']}] " if c.get("source") else ""
+            print(f"  [{i}] {src}{loc}  (similarità: {sim:.3f})")
            print(f"      {c['text'][:120].replace(chr(10), ' ')}...")
        print("──────────────────────────────────────────────────────────────\n")

@@ -197,7 +199,7 @@ def _build_epilog() -> str:
            if names:
                lines += ["", f"Collection disponibili: {', '.join(names)}"]
            else:
-                lines += ["", "Nessuna collection trovata — eseguire prima: python step-8/ingest.py"]
+                lines += ["", "Nessuna collection trovata — eseguire prima: python ingestion/ingest.py"]
        except Exception:
            pass
    return "\n".join(lines)
@@ -208,41 +210,48 @@ def main() -> int:
        description=(
            "Pipeline RAG interattiva\n\n"
            "Risponde a domande in linguaggio naturale su un documento\n"
-            "indicizzato in ChromaDB da step-8/ingest.py."
+            "indicizzato in ChromaDB da ingestion/ingest.py."
        ),
        epilog=_build_epilog(),
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    parser.add_argument(
        "--stem",
-        required=True,
-        help=(
-            "Nome della collection ChromaDB da interrogare. "
-            "Le collection vengono create da: python step-8/ingest.py --stem <nome>"
-        ),
+        help="Collection di un singolo documento (retrocompatibile)",
+    )
+    parser.add_argument(
+        "--collection",
+        help="Collection multi-documento creata con: ingest.py --collection <nome> --stems ...",
    )
    args = parser.parse_args()

+    collection_name = args.collection or args.stem
+    if not collection_name:
+        parser.error("specifica --stem <nome> oppure --collection <nome>")
+
    print("─── Pipeline RAG ────────────────────────────────────────────\n")
-    print(f"  Documento : {args.stem}")
+    print(f"  Collection : {collection_name}")
    print(f"  Modello    : {LLM_MODEL}")
    print(f"  Top-K      : {TOP_K}")
    print(f"  Thinking   : {'off' if NO_THINK else 'on'}")
    print()

    if not CHROMA_DIR.exists():
-        print("❌ chroma_db/ non trovata — esegui prima step-8")
+        print("❌ chroma_db/ non trovata — esegui prima ingestion")
        return 1

    client = chromadb.PersistentClient(path=str(CHROMA_DIR))
    collections = [c.name for c in client.list_collections()]
-    if args.stem not in collections:
-        print(f"❌ Collection '{args.stem}' non trovata in chroma_db/")
-        print(f"   → python step-8/ingest.py --stem {args.stem}")
+    if collection_name not in collections:
+        print(f"❌ Collection '{collection_name}' non trovata in chroma_db/")
+        if args.stem:
+            print(f"   → python ingestion/ingest.py --stem {collection_name}")
+        else:
+            print(f"   → python ingestion/ingest.py --collection {collection_name} --stems doc1 doc2 ...")
        return 1

-    collection = client.get_collection(args.stem)
-    print(f"✅ Collection '{args.stem}' caricata ({collection.count()} chunk)\n")
+    collection = client.get_collection(collection_name)
+    print(f"✅ Collection '{collection_name}' caricata ({collection.count()} chunk)\n")

    run_loop(collection)
    return 0
@@ -85,6 +85,7 @@ def retrieve(collection: chromadb.Collection, query: str, top_k: int) -> list[di
        chunks.append({
            "rank":       rank,
            "similarity": round(1 - dist, 4),
+            "source":     meta.get("source", ""),
            "sezione":    meta.get("sezione", ""),
            "titolo":     meta.get("titolo", ""),
            "text":       text,
@@ -97,10 +98,11 @@ def retrieve(collection: chromadb.Collection, query: str, top_k: int) -> list[di
 def print_results(chunks: list[dict], full: bool = False) -> None:
    print(f"── {len(chunks)} chunk recuperati ─────────────────────────────────\n")
    for c in chunks:
+        src = f"[{c['source']}] " if c.get("source") else ""
        loc = c["sezione"]
        if c["titolo"]:
            loc += f" > {c['titolo']}"
-        print(f"  [{c['rank']}] similarità: {c['similarity']:.4f}  |  {loc}")
+        print(f"  [{c['rank']}] similarità: {c['similarity']:.4f}  |  {src}{loc}")
        if full:
            print()
            print(c["text"])
@@ -159,7 +161,7 @@ def _build_epilog() -> str:
            if names:
                lines += ["", f"Collection disponibili: {', '.join(names)}"]
            else:
-                lines += ["", "Nessuna collection trovata — eseguire prima: python step-8/ingest.py"]
+                lines += ["", "Nessuna collection trovata — eseguire prima: python ingestion/ingest.py"]
        except Exception:
            pass
    return "\n".join(lines)
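Both `rag.py` and the retrieval script now prefix each hit with its `source` when that metadata is present. The collection is created with `hnsw:space` set to `cosine`, so similarity is reported as `1 - distance`. A sketch of the formatting as a pure function; `format_hit` is a hypothetical name and the label is in English for illustration:

```python
def format_hit(rank: int, distance: float, meta: dict) -> str:
    """Format one retrieval hit; the [source] prefix is omitted for older collections."""
    similarity = round(1 - distance, 4)            # cosine distance -> similarity
    src = f"[{meta['source']}] " if meta.get("source") else ""
    loc = meta.get("sezione", "")
    if meta.get("titolo"):
        loc += f" > {meta['titolo']}"
    return f"[{rank}] similarity: {similarity:.4f}  |  {src}{loc}"

print(format_hit(1, 0.25, {"source": "doc1", "sezione": "Cap. 1", "titolo": "Intro"}))
print(format_hit(2, 0.5, {"sezione": "Cap. 2"}))
```

Using `meta.get("source")` keeps collections ingested before this change readable, since their metadata has no `source` key.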
@@ -177,8 +179,11 @@ def main() -> int:
    )
    parser.add_argument(
        "--stem",
-        required=True,
-        help="Nome della collection ChromaDB da interrogare.",
+        help="Collection di un singolo documento (retrocompatibile)",
+    )
+    parser.add_argument(
+        "--collection",
+        help="Collection multi-documento creata con: ingest.py --collection <nome> --stems ...",
    )
    parser.add_argument(
        "--top-k",
@@ -189,25 +194,32 @@ def main() -> int:
    )
    args = parser.parse_args()

+    collection_name = args.collection or args.stem
+    if not collection_name:
+        parser.error("specifica --stem <nome> oppure --collection <nome>")
+
    print("─── Retrieval puro ──────────────────────────────────────────\n")
-    print(f"  Documento    : {args.stem}")
+    print(f"  Collection   : {collection_name}")
    print(f"  Embed model  : {EMBED_MODEL}")
    print(f"  Top-K        : {args.top_k}")
    print()

    if not CHROMA_DIR.exists():
-        print("❌ chroma_db/ non trovata — esegui prima step-8", file=sys.stderr)
+        print("❌ chroma_db/ non trovata — esegui prima ingestion", file=sys.stderr)
        return 1

    client = chromadb.PersistentClient(path=str(CHROMA_DIR))
    collections = [c.name for c in client.list_collections()]
-    if args.stem not in collections:
-        print(f"❌ Collection '{args.stem}' non trovata in chroma_db/", file=sys.stderr)
-        print(f"   → python step-8/ingest.py --stem {args.stem}", file=sys.stderr)
+    if collection_name not in collections:
+        print(f"❌ Collection '{collection_name}' non trovata in chroma_db/", file=sys.stderr)
+        if args.stem:
+            print(f"   → python ingestion/ingest.py --stem {collection_name}", file=sys.stderr)
+        else:
+            print(f"   → python ingestion/ingest.py --collection {collection_name} --stems doc1 doc2 ...", file=sys.stderr)
        return 1

-    collection = client.get_collection(args.stem)
-    print(f"✅ Collection '{args.stem}' caricata ({collection.count()} chunk)\n")
+    collection = client.get_collection(collection_name)
+    print(f"✅ Collection '{collection_name}' caricata ({collection.count()} chunk)\n")

    run_loop(collection, args.top_k)
    return 0
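
The retrieve.py hunks above make two behavioural changes: the collection name is resolved from `--collection` with a fallback to `--stem`, and each result line gains an optional `[source]` prefix. A minimal standalone sketch of that logic follows. The helper names `resolve_collection` and `format_location` are hypothetical and do not appear in the diff, and the error text is shown in English here.

```python
import argparse


def resolve_collection(argv: list[str]) -> str:
    """Return the collection name, preferring --collection over --stem."""
    parser = argparse.ArgumentParser(prog="retrieve.py")
    parser.add_argument("--stem", help="single-document collection")
    parser.add_argument("--collection", help="multi-document collection")
    args = parser.parse_args(argv)

    # Same fallback as in the diff: --collection wins, --stem is the fallback.
    name = args.collection or args.stem
    if not name:
        # parser.error prints usage to stderr and exits with status 2.
        parser.error("specify --stem <name> or --collection <name>")
    return name


def format_location(chunk: dict) -> str:
    """Build the location label, prefixed with [source] when present."""
    src = f"[{chunk['source']}] " if chunk.get("source") else ""
    loc = chunk["sezione"]
    if chunk["titolo"]:
        loc += f" > {chunk['titolo']}"
    return f"{src}{loc}"
```

Chunks ingested before the `source` metadata existed have no such key, so `meta.get("source", "")` yields an empty string and the label degrades to the old format.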
Author	SHA1	Message	Date
davide	48567fa5e7	fix(verify): recognize www. URLs as valid terminators + multi-document docs - _URL_TAIL now also matches www.* (not only https://), avoiding false blockers on watermarks such as www.docsity.com - README: documents --collection / --stems for ingestion, retrieve and rag Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-12 11:21:24 +02:00
davide	8d972fa7c6	feat(ingestion): multi-document support in a single ChromaDB collection Adds the ability to merge multiple documents into a single ChromaDB collection, with chunk_id values prefixed by stem and a source metadata field for filtering. - ingest.py: --stems doc1 doc2 --collection name (new), --stem (unchanged) - rag.py / retrieve.py: --collection, source in chunks, verbose shows [source] Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-12 11:21:17 +02:00
davide	5b63c423cc	feat(chunks): chunking and post-processing optimization - chunker.py: writes meta.json with the strategy and effective thresholds (target, min_chars, max_chars) for each chunked document - verify_chunks.py: * _load_thresholds(): reads min/max from meta.json instead of the global TARGET_CHARS, eliminating the mismatch between chunker and verify thresholds (h3_aware target=600 -> range 450-750, no longer validated against 225-375) * _ROMAN_END: excludes trailing Roman numerals (XV, XIV...) from the incomplete chunks because they are PDF index artifacts, not broken sentences * PUNCT_END: adds ; as a valid ending (Italian legal clauses) - fix_chunks.py: * _load_thresholds(): uses max_chars from meta.json for consistent splitting * _SECONDARY_END: secondary split on ; for multi-clause legal text * Phase 1 (convergence): resolves only blockers (incomplete, empty, no_prefix) without touching warnings, which eliminates the merge->too_long->split->incomplete->merge cycle * Phase 2 (final): a single pass of too_short merges + too_long splits once the blockers are down to zero Result on dirittopenale: from blocked (265 incomplete) to warnings_only in 2 iterations, with no infinite loops. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-12 11:09:28 +02:00
davide	587238f9f5	docs(conversione): update README: commands, output and execution logs Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-12 10:43:17 +02:00
davide	c381d7da3c	docs(readme): add sections on model configuration, ollama test, retrieval and RAG Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-12 10:39:27 +02:00
davide	b5fb363104	chore(config): RAG tuning: 4b model, temperature 0.2, chunk target 300 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-12 10:37:39 +02:00
davide	602dc87045	fix(ingestion): fix chunks path from step-6/ to chunks/ Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-12 10:37:35 +02:00
davide	b49ef8edf0	docs: update README with the complete ingestion flow - README.md: adds step 7 (ingestion) with environment check, basic commands and --force; updates pipeline header and references - ingestion/README.md: renames from "Step 8" to "Ingestion", updates references from step-6 to chunks/, adds the "Verifica ambiente" (environment check) section, fixes commands with .venv/bin/python Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-11 16:05:23 +02:00
davide	9e1a72a9e6	refactor: rename step-8 → ingestion Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-11 15:58:54 +02:00
davide	70b304e1d4	docs(readme): complete conversion → chunking flow Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-11 15:46:52 +02:00
davide	02c785678d	feat(chunks): target-based chunking with centralized config Introduces chunks/config.py as the single source of truth for all chunking pipeline parameters. TARGET_CHARS + CHUNK_TOLERANCE replace MIN_CHARS/MAX_CHARS: the chunker aims for a target size and gets as close to it as possible while honoring the absolute constraint that every chunk ends on a sentence boundary (period/punctuation). - config.py: TARGET_CHARS, CHUNK_TOLERANCE, SPLIT_THRESHOLD_FACTOR, PROTECT_TABLES, FIX_MAX_ITERATIONS, per-strategy STRATEGY_OVERRIDES - chunker.py: target-based algorithm (emits when the next sentence would exceed upper_body = upper - prefix_len), atomic table protection, MIN/MAX/overlap overrides for each of the 4 strategies - verify_chunks.py: thresholds derived from target*(1±tolerance) - fix_chunks.py: _split_at_boundary always on terminal punctuation, recursive fix→verify loop up to FIX_MAX_ITERATIONS, split only for chunks > upper × SPLIT_THRESHOLD_FACTOR Result on bitcoin: 694 chunks, 0 incomplete, 83% in range [450,750], all ending on punctuation regardless of size. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-11 15:45:24 +02:00
davide	508587c5bf	Merge branch 'chunks' into main Integrates the chunking pipeline (chunks/) with all the changes accumulated on the branch, including the 9-stage PDF→Markdown pipeline.	2026-05-11 14:51:53 +02:00
davide	e1b5298b20	feat: integrate 9-stage PDF→Markdown pipeline and test suite Ports from the marker branch the complete rewrite of conversione/_pipeline/ (9 PyMuPDF stages) and the tests/ suite without modifying the rest of the RAG project (ollama/, step-5/, step-6/, step-8/, rag.py, retrieve.py, config.py). requirements.txt: adds PyMuPDF>=1.24.0 and pytest>=8.0, keeps chromadb, removes opendataloader-pdf and pymupdf4llm. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-11 14:44:16 +02:00
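
The commit messages above describe the verification thresholds as `target*(1±tolerance)` and quote two ranges, 450-750 for a target of 600 and 225-375 for a target of 300. Both ranges are consistent with a tolerance of 0.25. That value is inferred from the quoted numbers rather than stated in the log, and the function name below is hypothetical. A small sketch of the arithmetic:

```python
def derive_thresholds(target: int, tolerance: float) -> tuple[int, int]:
    """Return (min_chars, max_chars) as target * (1 - tol), target * (1 + tol)."""
    lower = round(target * (1 - tolerance))
    upper = round(target * (1 + tolerance))
    return lower, upper


# Assumed tolerance of 0.25, inferred from the ranges in the commit log.
print(derive_thresholds(600, 0.25))  # h3_aware target quoted in 5b63c423cc
print(derive_thresholds(300, 0.25))  # global chunk target set in b5fb363104
```

This also illustrates the mismatch fixed in 5b63c423cc: a document chunked with target 600 but verified against the global target 300 would be checked against 225-375 instead of 450-750.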