fix(verify): recognize www. URLs as valid terminators + multi-document docs

- _URL_TAIL now also matches www.* (not only https://), avoiding false blockers on watermarks such as www.docsity.com
- README: documents --collection / --stems for ingestion, retrieve and rag

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
feat(ingestion): multi-document support in a single ChromaDB collection
2026-05-12 11:21:24 +02:00 · 2026-05-12 11:21:17 +02:00 · 2026-05-12 11:09:28 +02:00 · 2026-05-12 10:43:17 +02:00 · 2026-05-12 10:39:27 +02:00 · 2026-05-12 10:37:39 +02:00
13 changed files with 831 additions and 282 deletions
@@ -50,7 +50,7 @@ except Exception as e: print(f'ERRORE lettura report: {e}')
 ```
 ✅ Chunks ready. Proceed with vectorization:
-   python step-8/ingest.py --stem $ARGUMENTS
+   python ingestion/ingest.py --stem $ARGUMENTS
 ```
 If there are only 🟡 items, briefly explain the warnings and ask whether the user wants to resolve them first or proceed.
@@ -105,7 +105,7 @@ Se verdict finale è `ok` o `warnings_only` senza 🔴:
 ```
 ✅ Chunks ready in chunks/$ARGUMENTS/chunks.json
   Proceed with vectorization:
-   python step-8/ingest.py --stem $ARGUMENTS
+   python ingestion/ingest.py --stem $ARGUMENTS
 ```
 If 🔴 items remain after the fix (text that cannot be split, or an anomalous structure in the source):
@@ -1,9 +1,11 @@
-# PDF → Markdown
+# PDF → RAG-ready chunks
-Converts digital PDFs into clean, structured Markdown.
+Converts digital PDFs into semantic chunks ready for RAG vectorization,
 without LLMs or OCR.
-**Stack:** Python · opendataloader-pdf (XY-Cut++) · Java 11+  
+**Pipeline:** PDF → structured Markdown → semantic chunks → ChromaDB embeddings  
-**Compatible with:** Linux · macOS · Windows (WSL2)
+**Stack:** Python · PyMuPDF · pdfplumber  
 **Not supported:** scanned PDFs (image-only), password-protected PDFs.
 ---
@@ -15,54 +17,230 @@ source .venv/bin/activate
 pip install -r requirements.txt
 ```
 **Java 11+** required:
 ```bash
 sudo apt install default-jdk   # Ubuntu/Debian/WSL
 java -version
 ```
 ---
-## Usage
+## Full workflow
 ### 1. Place the PDF
 ```
 sources/<nome>.pdf
 ```
 ### 2. Convert the PDF to Markdown
 ```bash
-# Single PDF
+# Single document
-python conversione/pipeline.py --stem <nome>
+.venv/bin/python conversione/ --stem <nome>
 # All PDFs in sources/
-python conversione/pipeline.py
+.venv/bin/python conversione/
-# Force re-run
+# Force re-run (overwrites existing output)
-python conversione/pipeline.py --stem <nome> --force
+.venv/bin/python conversione/ --stem <nome> --force
 ```
-`--stem` = PDF file name without the extension.  
+Output in `conversione/<nome>/`:
 Example: `sources/analisi1.pdf` → `--stem analisi1`
 ---
 ## Output
 For each stem, in `conversione/<stem>/`:
 | File | Description |
 |------|-------------|
 | `raw.md` | Raw Markdown (**do not edit**) |
-| `clean.md` | Clean Markdown, working copy |
+| `clean.md` | Clean Markdown, input for the chunker |
-| `structure_profile.json` | Detected structure and metrics |
+| `structure_profile.json` | Detected structure and chunking strategy |
-| `report.json` | Full conversion statistics |
+| `report.json` | Conversion quality metrics |
---
+### 3. Check the Markdown quality (optional)
 ## Batch validation
 ```bash
-python conversione/validate.py
+.venv/bin/python conversione/ validate <nome> --detail
 ```
-Prints a status table for all converted stems.
+If the score is ≥ 80 and `valid=true`, proceed. Otherwise use `/prepare-md` for
 manual fixes (residual hyphenation, malformed headers, etc.).
 ### 4. Generate the chunks
 ```bash
 .venv/bin/python chunks/chunker.py --stem <nome>
 # Force re-run
 .venv/bin/python chunks/chunker.py --stem <nome> --force
 ```
 The chunking strategy (`h3_aware`, `h2_paragraph_split`, `paragraph`,
 `sliding_window`) is selected automatically from `structure_profile.json`.
 Output in `chunks/<nome>/`:
 | File | Description |
 |------|-------------|
 | `chunks.json` | List of chunks with text, section, title and metadata |
 | `report.json` | Chunking statistics and anomalies |
 ### 5. Verify the chunks
 ```bash
 .venv/bin/python chunks/verify_chunks.py --stem <nome>
 ```
 Possible verdicts:
 | Verdict | Meaning | What to do |
 |---------|-------------|-----------|
 | `ok` | No issues | Proceed to vectorization |
 | `warnings_only` | Minor warnings only | You can proceed or run the fix |
 | `blocked` | Blocking issues (incomplete chunks) | Run the fix |
 ### 6. Fix the issues (if needed)
 ```bash
 # Preview the fixes without applying them
 .venv/bin/python chunks/fix_chunks.py --stem <nome> --dry-run
 # Apply the fixes (repeated fix and verify, up to 3 iterations)
 .venv/bin/python chunks/fix_chunks.py --stem <nome>
 ```
 The fix automatically handles incomplete chunks (broken sentence), chunks
 that are too short (merged into the next one) and chunks that are too long
 (split at punctuation). Every chunk always ends on a sentence boundary.
 ### 7. Run the ingestion
 First check that Ollama and the models are ready:
 ```bash
 .venv/bin/python ollama/check_env.py
 ```
 Then generate the embeddings and store them in ChromaDB:
 ```bash
 # Single document → collection with the same name
 .venv/bin/python ingestion/ingest.py --stem <nome>
 # Multiple documents → a single shared collection
 .venv/bin/python ingestion/ingest.py --collection <nome-collection> --stems doc1 doc2 doc3
 # All documents in chunks/ → separate collections
 .venv/bin/python ingestion/ingest.py
 # Regenerate after changing the model or updating the chunks
 .venv/bin/python ingestion/ingest.py --stem <nome> --force
 ```
 With `--collection`, the chunks of different documents are merged into a single
 collection. The `source` metadata field identifies the document each chunk comes from.
 Output in `chroma_db/` (ignored by git).
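
A minimal sketch of how chunks from several stems could be flattened into records for one shared collection, each tagged with its `source`. The function name `collect_records`, the record layout and the stem-prefixed IDs are assumptions made for illustration; the actual `ingestion/ingest.py` may differ.

```python
import json
from pathlib import Path


def collect_records(project_root: Path, stems: list[str]) -> list[dict]:
    """Gather the chunks of several stems into one flat list of records.

    Each record carries a `source` metadata field naming its stem.
    IDs are prefixed with the stem so they stay unique in the shared
    collection (an assumption of this sketch).
    """
    records: list[dict] = []
    for stem in stems:
        chunks_path = project_root / "chunks" / stem / "chunks.json"
        chunks = json.loads(chunks_path.read_text(encoding="utf-8"))
        for chunk in chunks:
            records.append({
                "id": f"{stem}::{chunk['chunk_id']}",
                "document": chunk["text"],
                "metadata": {"source": stem, "sezione": chunk.get("sezione", "")},
            })
    return records
```

With records shaped like this, a retrieval result can be traced back to its document by reading `metadata["source"]`.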
 ---
-See [`conversione/README.md`](conversione/README.md) for details on the pipeline and the supported document types.
+## Chunking configuration
 All parameters are in [`chunks/config.py`](chunks/config.py):
 ```python
 TARGET_CHARS    = 600   # target chunk size
 CHUNK_TOLERANCE = 0.25  # ±25% → acceptable range [450, 750]
 OVERLAP_SENTENCES = 1   # sentences of overlap between consecutive chunks
 PROTECT_TABLES  = True  # tables are emitted as atomic chunks
 FIX_MAX_ITERATIONS = 3  # maximum iterations of the fix loop
 ```
 Each strategy can define different values through `STRATEGY_OVERRIDES`.
 Edit only this file: chunker, verify and fix pick up the changes automatically.
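
The sketch below shows how the acceptable range follows from the target and the tolerance, and how a per-strategy override takes precedence over the global values. The constants are stand-ins copied from the snippet above, the override values are illustrative, and `resolve` is a hypothetical helper name.

```python
# Stand-ins for the values in chunks/config.py (illustrative only).
TARGET_CHARS = 600
CHUNK_TOLERANCE = 0.25
OVERLAP_SENTENCES = 1
STRATEGY_OVERRIDES = {
    "paragraph": {"target_chars": 500, "tolerance": 0.30},
}


def resolve(strategy: str) -> tuple[int, int, int, int]:
    """Return (target, lower, upper, overlap) for a strategy."""
    ov = STRATEGY_OVERRIDES.get(strategy, {})
    target = ov.get("target_chars", TARGET_CHARS)
    tolerance = ov.get("tolerance", CHUNK_TOLERANCE)
    overlap = ov.get("overlap", OVERLAP_SENTENCES)
    lower = int(target * (1 - tolerance))  # below this a chunk is too short
    upper = int(target * (1 + tolerance))  # above this a chunk is too long
    return target, lower, upper, overlap


print(resolve("h3_aware"))   # no override here: global values apply
print(resolve("paragraph"))  # override: target 500, tolerance 0.30
```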
 ---
 ## Model configuration
 All LLM and embedding parameters are in [`config.py`](config.py):
 ```python
 OLLAMA_MODEL = "qwen3.5:4b"      # LLM used for generation
 EMBED_MODEL  = "nomic-embed-text" # embedding model (must match the one used at ingestion)
 TEMPERATURE  = 0.2                # 0.0 = deterministic, higher values = more creative
 NO_THINK     = True               # True = direct answer (faster), False = with reasoning
 TOP_K        = 6                  # number of chunks retrieved for each question
 OLLAMA_URL   = "http://localhost:11434"
 ```
 > If you change `EMBED_MODEL` you must re-run the ingestion with `--force`: the embeddings
 > must be produced by the same model used at retrieval time.
 ---
 ## Testing the model (without RAG)
 Check that the LLM responds correctly before involving the pipeline:
 ```bash
 .venv/bin/python ollama/test_ollama.py
 ```
 The model used is the one configured in `config.py` (`OLLAMA_MODEL`).  
 Type `exit` to quit.
 ---
 ## Retrieval only (without generation)
 Useful for checking that the right chunks are retrieved before diagnosing
 wrong answers:
 ```bash
 # Single document
 .venv/bin/python retrieve.py --stem <nome>
 # Multi-document collection
 .venv/bin/python retrieve.py --collection <nome-collection>
 # Change the number of chunks returned
 .venv/bin/python retrieve.py --stem <nome> --top-k 10
 ```
 In the interactive loop:
 | Command | Effect |
 |---------|---------|
 | `<query>` | Shows the most similar chunks with their similarity score (truncated text) |
 | `<query> -f` | Full text of the chunks |
 | `exit` | Quit |
 ---
 ## Interactive RAG
 Answers natural-language questions using the chunks indexed in ChromaDB:
 ```bash
 # Single document
 .venv/bin/python rag.py --stem <nome>
 # Multi-document collection
 .venv/bin/python rag.py --collection <nome-collection>
 ```
 In the interactive loop:
 | Command | Effect |
 |---------|---------|
 | `<domanda>` | Answer generated by the LLM with context from the chunks |
 | `<domanda> -v` | Answer plus the retrieved chunks with similarity score and source document |
 | `exit` | Quit |
 ---
 ## Test
 ```bash
 .venv/bin/python -m pytest tests/
 ```
 ---
 ## References
 - [`conversione/README.md`](conversione/README.md): details on the PDF→Markdown pipeline and the supported document types
 - [`ingestion/README.md`](ingestion/README.md): embedding configuration, model choice, --force rules
@@ -20,12 +20,10 @@ import re
 import sys
 from pathlib import Path
-
+_HERE = Path(__file__).resolve().parent
-# ─── Parametri ────────────────────────────────────────────────────────────────
+if str(_HERE) not in sys.path:
-
+    sys.path.insert(0, str(_HERE))
-MIN_CHARS = 200   # sotto questa soglia → accorpa al chunk successivo
+import config as cfg
 MAX_CHARS = 800   # sopra questa soglia → spezza su frasi
 OVERLAP_S = 2     # frasi di overlap tra sotto-chunk dello stesso boundary
 # ─── Utilities ────────────────────────────────────────────────────────────────
@@ -44,73 +42,106 @@ def slugify(s: str, max_len: int = 60) -> str:
    return s[:max_len] if s else "section"
-_SENT_BOUNDARY = re.compile(r"[.!?»)\]'\u2019\"\u201c\u201d/:|\u2026]$")
+def _is_table_block(text: str) -> bool:
    """True if the text is predominantly a Markdown table (≥50% of non-empty lines start with |)."""
    lines = [l for l in text.strip().splitlines() if l.strip()]
    if not lines:
        return False
    table_lines = sum(1 for l in lines if l.strip().startswith("|"))
    return table_lines / len(lines) >= 0.5
-def _flush_chunk(
+def _ov(strategy: str) -> tuple[int, float, int]:
-    current: list[str],
+    """Return (target_chars, tolerance, overlap) for a strategy, falling back to the global defaults."""
-    sentences: list[str],
+    ov = cfg.STRATEGY_OVERRIDES.get(strategy, {})
-    i: int,
+    target    = ov.get("target_chars", cfg.TARGET_CHARS)
-    prefix: str,
+    tolerance = ov.get("tolerance",    cfg.CHUNK_TOLERANCE)
-    sezione: str,
+    overlap   = ov.get("overlap",      cfg.OVERLAP_SENTENCES)
-    titolo: str,
+    return target, tolerance, overlap
    sub_index: int,
    max_chars: int,
 ) -> tuple[dict, list[str], int, int]:
    """Emit a chunk, extending it up to a sentence boundary (at most +20%)."""
    hard_limit = int(max_chars * 1.2)
    current_len = sum(len(s) + 1 for s in current)
    while i < len(sentences) and not _SENT_BOUNDARY.search(" ".join(current)):
        nxt = sentences[i]
        if current_len + len(nxt) + 1 > hard_limit:
            break
        current.append(nxt)
        current_len += len(nxt) + 1
        i += 1
    chunk_text = prefix + " ".join(current)
    chunk = {
        "chunk_id": f"{slugify(sezione)}__{slugify(titolo)}__s{sub_index}",
        "text": chunk_text,
        "sezione": sezione,
        "titolo": titolo,
        "sub_index": sub_index,
        "n_chars": len(chunk_text),
    }
    return chunk, current, i, sub_index + 1
 # ─── Core: target-oriented split into sub-chunks ──────────────────────────────
 def make_sub_chunks(
    body: str,
    prefix: str,
    sezione: str,
    titolo: str,
-    max_chars: int,
+    target: int,
    tolerance: float,
    overlap_s: int,
 ) -> list[dict]:
    """Split body into chunks as close as possible to `target` chars.
    Logic:
      lower = target × (1 − tolerance)   → minimum size for emitting
      upper = target × (1 + tolerance)   → maximum size
    Whole sentences are accumulated until the next one would exceed `upper`.
    At that point the chunk is emitted (it is close to the target) and a new one starts with the overlap.
    Every chunk always ends on a sentence boundary and never crosses
    the boundary of the current header.
    """
    if cfg.PROTECT_TABLES and _is_table_block(body):
        chunk_text = prefix + body
        return [{
            "chunk_id": f"{slugify(sezione)}__{slugify(titolo)}__s0",
            "text": chunk_text,
            "sezione": sezione,
            "titolo": titolo,
            "sub_index": 0,
            "n_chars": len(chunk_text),
        }]
    # Threshold computed on the body (final n_chars = prefix_len + body_len).
    prefix_len = len(prefix)
    upper_body = max(1, int(target * (1 + tolerance)) - prefix_len)
    sentences = split_sentences(body)
    if not sentences:
        return []
-    chunks = []
+    chunks: list[dict] = []
    current: list[str] = []
    current_len = 0
    sub_index = 0
-    i = 0
+    def _emit() -> None:
-    while i < len(sentences):
+        nonlocal current, current_len, sub_index
-        sent = sentences[i]
+        chunk_text = prefix + " ".join(current)
-        if not current or current_len + len(sent) + 1 <= max_chars:
+        chunks.append({
            "chunk_id": f"{slugify(sezione)}__{slugify(titolo)}__s{sub_index}",
            "text": chunk_text,
            "sezione": sezione,
            "titolo": titolo,
            "sub_index": sub_index,
            "n_chars": len(chunk_text),
        })
        overlap = current[-overlap_s:] if overlap_s and len(current) > overlap_s else []
        current = overlap[:]
        # Correct overlap length (n-1 spaces between n sentences).
        current_len = sum(len(s) for s in current) + max(0, len(current) - 1)
        sub_index += 1
    for sent in sentences:
        sep     = 1 if current else 0
        new_len = current_len + sep + len(sent)
        if new_len <= upper_body:
            # Still within the body limit: append and continue.
            current.append(sent)
-            current_len += len(sent) + (1 if len(current) > 1 else 0)
+            current_len = new_len
-            i += 1
+        elif current:
            # The next sentence would exceed the limit: emit the current chunk
            # (which ends on a complete sentence), then start the new one with this sentence.
            _emit()
            current.append(sent)
            current_len += (1 if current[:-1] else 0) + len(sent)
        else:
-            chunk, current, i, sub_index = _flush_chunk(
+            # Empty chunk: the single sentence already exceeds the limit, so emit it as is.
-                current, sentences, i, prefix, sezione, titolo, sub_index, max_chars
+            current.append(sent)
-            )
+            current_len = len(sent)
-            chunks.append(chunk)
+            _emit()
            overlap = current[-overlap_s:] if overlap_s and len(current) > overlap_s else []
            current = overlap[:]
            current_len = sum(len(s) + 1 for s in current)
    if current:
        chunk_text = prefix + " ".join(current)
@@ -194,6 +225,9 @@ def parse_h2_sections(text: str) -> list[dict]:
 # ─── Chunking strategies ──────────────────────────────────────────────────────
 def chunk_h3_aware(text: str, stem: str) -> list[dict]:
    target, tolerance, overlap = _ov("h3_aware")
    lower = int(target * (1 - tolerance))
    sections = parse_h3_sections(text)
    merged: list[dict] = []
@@ -205,7 +239,7 @@ def chunk_h3_aware(text: str, stem: str) -> list[dict]:
            continue
        if (pending["sezione"] == sec["sezione"]
-                and len(pending["body"]) < MIN_CHARS):
+                and len(pending["body"]) < lower):
            sep_title = " / ".join(filter(None, [pending["titolo"], sec["titolo"]]))
            pending = {
                "sezione": pending["sezione"],
@@ -222,24 +256,25 @@ def chunk_h3_aware(text: str, stem: str) -> list[dict]:
    chunks = []
    for sec in merged:
        sezione = sec["sezione"] or stem
-        titolo = sec["titolo"] or ""
+        titolo  = sec["titolo"] or ""
-        body = sec["body"]
+        body    = sec["body"]
-
+        prefix  = f"[{sezione} > {titolo}]\n" if titolo else f"[{sezione}]\n"
-        prefix = f"[{sezione} > {titolo}]\n" if titolo else f"[{sezione}]\n"
+        chunks.extend(make_sub_chunks(body, prefix, sezione, titolo, target, tolerance, overlap))
        sub = make_sub_chunks(body, prefix, sezione, titolo, MAX_CHARS, OVERLAP_S)
        chunks.extend(sub)
    return chunks
 def chunk_h2_paragraph_split(text: str, stem: str) -> list[dict]:
    target, tolerance, overlap = _ov("h2_paragraph_split")
    lower = int(target * (1 - tolerance))
    sections = parse_h2_sections(text)
    chunks = []
    for sec in sections:
        sezione = sec["sezione"] or stem
-        body = sec["body"]
+        body    = sec["body"]
-        prefix = f"[{sezione}]\n"
+        prefix  = f"[{sezione}]\n"
        paragraphs = [
            p.strip()
@@ -250,7 +285,7 @@ def chunk_h2_paragraph_split(text: str, stem: str) -> list[dict]:
        merged_pars: list[str] = []
        pending = ""
        for par in paragraphs:
-            if pending and len(pending) < MIN_CHARS:
+            if pending and len(pending) < lower:
                pending = pending + "\n\n" + par
            else:
                if pending:
@@ -260,7 +295,7 @@ def chunk_h2_paragraph_split(text: str, stem: str) -> list[dict]:
            merged_pars.append(pending)
        for idx, par in enumerate(merged_pars):
-            sub = make_sub_chunks(par, prefix, sezione, f"par{idx}", MAX_CHARS, OVERLAP_S)
+            sub = make_sub_chunks(par, prefix, sezione, f"par{idx}", target, tolerance, overlap)
            for c in sub:
                c["chunk_id"] = f"{slugify(sezione)}__p{idx}__s{c['sub_index']}"
            chunks.extend(sub)
@@ -269,6 +304,9 @@ def chunk_h2_paragraph_split(text: str, stem: str) -> list[dict]:
 def chunk_paragraph(text: str, stem: str) -> list[dict]:
    target, tolerance, overlap = _ov("paragraph")
    lower = int(target * (1 - tolerance))
    paragraphs = [
        p.strip()
        for p in re.split(r"\n{2,}", text)
@@ -279,7 +317,7 @@ def chunk_paragraph(text: str, stem: str) -> list[dict]:
    merged: list[str] = []
    pending = ""
    for par in paragraphs:
-        if pending and len(pending) < MIN_CHARS:
+        if pending and len(pending) < lower:
            pending = pending + "\n\n" + par
        else:
            if pending:
@@ -290,7 +328,7 @@ def chunk_paragraph(text: str, stem: str) -> list[dict]:
    chunks = []
    for idx, par in enumerate(merged):
-        sub = make_sub_chunks(par, prefix, stem, f"par{idx}", MAX_CHARS, OVERLAP_S)
+        sub = make_sub_chunks(par, prefix, stem, f"par{idx}", target, tolerance, overlap)
        for c in sub:
            c["chunk_id"] = f"para__{idx}__s{c['sub_index']}"
        chunks.extend(sub)
@@ -299,6 +337,9 @@ def chunk_paragraph(text: str, stem: str) -> list[dict]:
 def chunk_sliding_window(text: str, stem: str) -> list[dict]:
    target, tolerance, overlap = _ov("sliding_window")
    upper = int(target * (1 + tolerance))
    sentences = split_sentences(text)
    prefix = f"[Documento: {stem}]\n"
@@ -313,10 +354,11 @@ def chunk_sliding_window(text: str, stem: str) -> list[dict]:
        j = i
        while j < len(sentences):
            s = sentences[j]
-            if window and cur_len + len(s) + 1 > MAX_CHARS:
+            sep = 1 if window else 0
            if window and cur_len + sep + len(s) > upper:
                break
            window.append(s)
-            cur_len += len(s) + (1 if len(window) > 1 else 0)
+            cur_len += sep + len(s)
            j += 1
        if not window:
@@ -333,7 +375,7 @@ def chunk_sliding_window(text: str, stem: str) -> list[dict]:
            "n_chars": len(chunk_text),
        })
        win_idx += 1
-        i += max(1, len(window) - OVERLAP_S)
+        i += max(1, len(window) - overlap)
    return chunks
@@ -341,28 +383,28 @@ def chunk_sliding_window(text: str, stem: str) -> list[dict]:
 # ─── Dispatcher ───────────────────────────────────────────────────────────────
 _STRATEGIES: dict[str, callable] = {
-    "h3_aware": chunk_h3_aware,
+    "h3_aware":            chunk_h3_aware,
-    "h2_paragraph_split": chunk_h2_paragraph_split,
+    "h2_paragraph_split":  chunk_h2_paragraph_split,
-    "paragraph": chunk_paragraph,
+    "paragraph":           chunk_paragraph,
-    "sliding_window": chunk_sliding_window,
+    "sliding_window":      chunk_sliding_window,
 }
 def chunk_document(clean_md: Path, profile: dict, stem: str) -> list[dict]:
-    text = clean_md.read_text(encoding="utf-8")
+    text      = clean_md.read_text(encoding="utf-8")
    strategia = profile.get("strategia_chunking", "paragraph")
-    fn = _STRATEGIES.get(strategia, chunk_paragraph)
+    fn        = _STRATEGIES.get(strategia, chunk_paragraph)
    return fn(text, stem)
 # ─── Per-document processing ──────────────────────────────────────────────────
 def process_stem(stem: str, project_root: Path, force: bool) -> bool:
-    conv_dir  = project_root / "conversione" / stem
+    conv_dir     = project_root / "conversione" / stem
-    out_dir   = project_root / "chunks" / stem
+    out_dir      = project_root / "chunks" / stem
-    clean_md  = conv_dir / "clean.md"
+    clean_md     = conv_dir / "clean.md"
    profile_path = conv_dir / "structure_profile.json"
-    out_file  = out_dir / "chunks.json"
+    out_file     = out_dir / "chunks.json"
    print(f"\nDocumento: {stem}")
@@ -393,19 +435,31 @@ def process_stem(stem: str, project_root: Path, force: bool) -> bool:
        json.dumps(chunks, ensure_ascii=False, indent=2), encoding="utf-8"
    )
-    lengths = [c["n_chars"] for c in chunks]
+    target, tolerance, _ = _ov(strategia)
-    min_c = min(lengths)
+    lower = int(target * (1 - tolerance))
-    max_c = max(lengths)
+    upper = int(target * (1 + tolerance))
    avg_c = int(sum(lengths) / len(lengths))
    short = sum(1 for l in lengths if l < MIN_CHARS)
    long_ = sum(1 for l in lengths if l > MAX_CHARS * 1.5)
    meta = {"strategy": strategia, "target_chars": target,
            "min_chars": lower, "max_chars": upper}
    (out_dir / "meta.json").write_text(
        json.dumps(meta, ensure_ascii=False), encoding="utf-8"
    )
    lengths = [c["n_chars"] for c in chunks]
    min_c  = min(lengths)
    max_c  = max(lengths)
    avg_c  = int(sum(lengths) / len(lengths))
    short  = sum(1 for l in lengths if l < lower)
    long_  = sum(1 for l in lengths if l > upper)
    print(f"  Target: {target} char  ±{int(tolerance*100)}%  "
          f"→ range [{lower}, {upper}]")
    print(f"  Chunk totali: {len(chunks)}")
    print(f"  Min: {min_c} char  Max: {max_c} char  Media: {avg_c} char")
    if short:
-        print(f"  ⚠️  {short} chunk sotto MIN_CHARS ({MIN_CHARS})")
+        print(f"  ⚠️  {short} chunk sotto lower ({lower})")
    if long_:
-        print(f"  ⚠️  {long_} chunk sopra MAX_CHARS×1.5 ({int(MAX_CHARS * 1.5)})")
+        print(f"  ⚠️  {long_} chunk sopra upper ({upper})")
    print(f"  ✅ chunks.json salvato in chunks/{stem}/")
    return True
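
The accumulation rule described in the `make_sub_chunks` docstring can be sketched in isolation as follows. This is a simplified stand-in, not the code in the diff: `greedy_chunks` is a hypothetical name, it ignores the prefix and table protection, and it drops the overlap when carrying it over would push the next chunk past `upper`.

```python
import re


def greedy_chunks(body: str, upper: int, overlap_s: int = 1) -> list[str]:
    """Pack whole sentences into chunks of at most `upper` characters.

    A single sentence longer than `upper` is emitted on its own rather
    than being cut mid-sentence.
    """
    sentences = [s for s in re.split(r"(?<=[.!?»])\s+", body.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    for sent in sentences:
        candidate = " ".join(current + [sent])
        if current and len(candidate) > upper:
            chunks.append(" ".join(current))
            # Carry the last `overlap_s` sentences into the next chunk,
            # but only if they still leave room for the new sentence.
            tail = current[-overlap_s:] if overlap_s else []
            if tail and len(" ".join(tail + [sent])) > upper:
                tail = []
            current = tail
        current.append(sent)
    if current:
        chunks.append(" ".join(current))
    return chunks
```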
@@ -0,0 +1,88 @@
 #!/usr/bin/env python3
 """
 Configuration parameters for the chunking pipeline.
 Edit this file to change the behavior of chunker.py,
 verify_chunks.py and fix_chunks.py without touching the application code.
 """
 # ─── Target chunk size ────────────────────────────────────────────────────────
 #
 # TARGET_CHARS is the ideal size the chunker aims for.
 # CHUNK_TOLERANCE is the relative tolerance (e.g. 0.25 = ±25%).
 #
 #   acceptable range = [TARGET × (1 − TOL),  TARGET × (1 + TOL)]
 #
 # For example, with TARGET=600 and TOL=0.25 every chunk falls between 450 and
 # 750 chars, as close as possible to 600, always ending on a sentence boundary.
 #
 TARGET_CHARS    = 300
 CHUNK_TOLERANCE = 0.25
 # ─── Overlap ──────────────────────────────────────────────────────────────────
 # Number of sentences repeated at the start of the next chunk to preserve
 # context between adjacent chunks of the same section.
 OVERLAP_SENTENCES = 1
 # ─── Validation thresholds ────────────────────────────────────────────────────
 # fix_chunks.py splits a "too_long" chunk only if it exceeds upper × this factor.
 # E.g. upper=750, factor=1.5 → split only chunks > 1125 chars.
 # Chunks in [upper, upper×factor] remain non-blocking warnings.
 SPLIT_THRESHOLD_FACTOR = 1.5
 MATH_SYMS_MIN = 3   # min. math symbols needed to downgrade incomplete → incomplete_math
 # ─── Patterns and format ──────────────────────────────────────────────────────
 SENTENCE_SPLIT_PATTERN = r"(?<=[.!?»])\s+"
 PREFIX_TEMPLATE = "[{sezione} > {titolo}]"
 # ─── Protection of special content ────────────────────────────────────────────
 # If True, a block that is predominantly a Markdown table (≥50% |…| lines)
 # is emitted as an atomic chunk without sentence splitting.
 PROTECT_TABLES = True
 # Reserved: non-splittable LaTeX blocks (future implementation).
 PROTECT_MATH = True
 # ─── Fix behavior ─────────────────────────────────────────────────────────────
 # Maximum number of iterations of the fix → verify → fix loop.
 # With 1 you get the original behavior (single fix without re-verification).
 FIX_MAX_ITERATIONS = 3
 # ─── Per-strategy overrides ───────────────────────────────────────────────────
 #
 # These override TARGET_CHARS / CHUNK_TOLERANCE / OVERLAP_SENTENCES
 # for the specific strategy named in structure_profile.json.
 # Recognized keys: "target_chars", "tolerance", "overlap".
 #
 STRATEGY_OVERRIDES: dict[str, dict] = {
    "h3_aware": {
        # Structured H2→H3 documents: medium chunks, moderate overlap.
        "target_chars": 600,
        "tolerance":    0.25,
        "overlap":      2,
    },
    "h2_paragraph_split": {
        # Flat documents (H2 only): larger chunks, reduced overlap.
        "target_chars": 800,
        "tolerance":    0.25,
        "overlap":      1,
    },
    "paragraph": {
        # Documents without meaningful headers: shorter chunks.
        "target_chars": 500,
        "tolerance":    0.30,
        "overlap":      1,
    },
    "sliding_window": {
        # Linear/narrative text: wide windows, generous overlap.
        "target_chars": 800,
        "tolerance":    0.25,
        "overlap":      3,
    },
 }
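
To make the role of `SPLIT_THRESHOLD_FACTOR` concrete, here is a small sketch that classifies a chunk length against `lower`, `upper` and `upper × factor`. The function name and the returned labels are hypothetical and chosen only for this illustration; they are not the categories emitted by `verify_chunks.py`.

```python
def classify_length(n_chars: int, lower: int, upper: int, factor: float = 1.5) -> str:
    """Classify a chunk by length using the thresholds described in the comments."""
    if n_chars < lower:
        return "too_short"
    if n_chars <= upper:
        return "ok"
    if n_chars <= upper * factor:
        return "too_long_warning"  # stays a non-blocking warning
    return "too_long_split"        # long enough for the fix to split it


for n in (100, 600, 800, 1125, 1126):
    print(n, classify_length(n, lower=450, upper=750))
```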
@@ -21,12 +21,29 @@ Uso:
 """
 import argparse
 import contextlib
 import io
 import json
 import re
 import sys
 from pathlib import Path
-MAX_CHARS = 800
+_HERE = Path(__file__).resolve().parent
 if str(_HERE) not in sys.path:
    sys.path.insert(0, str(_HERE))
 import config as cfg
 from verify_chunks import verify_stem as _verify_stem
 MAX_CHARS = int(cfg.TARGET_CHARS * (1 + cfg.CHUNK_TOLERANCE))
 def _load_thresholds(stem_dir: Path) -> int:
    """Read max_chars from meta.json (written by the chunker) or fall back to the config default."""
    meta = stem_dir / "meta.json"
    if meta.exists():
        # json is already imported at module level; no local alias needed.
        return json.loads(meta.read_text(encoding="utf-8"))["max_chars"]
    return MAX_CHARS
 PUNCT_END = re.compile(r"[.!?»)\]'\u2019\"\u201c\u201d\u2018\u2014\u2013-]$")
@@ -53,7 +70,20 @@ def _rebuild_text(chunk: dict, body: str) -> str:
    return f"{_prefix(chunk)}\n{body}"
 # Strong sentence end: . ! ? followed by whitespace and an uppercase letter or a quote.
 # Weak punctuation (,;:)>>]) is not used, to avoid creating incomplete chunks.
 _STRONG_END = re.compile(
    r'[.!?\xbb]\s+(?=[A-Z\xc0-\xd6\xd8-\xde\xc0-\xff\xab\x22\x27(])'
 )
 _SECONDARY_END = re.compile(r';\s+')
 def _split_at_boundary(text: str, max_chars: int) -> list[str]:
    """Split text into parts ≤ max_chars at strong sentence boundaries (.!?).
    If no strong boundary (nor a secondary `; ` boundary) is found within
    max_chars, it does NOT split: a too_long chunk (warning) is better than
    an incomplete chunk (blocker).
    """
    if len(text) <= max_chars:
        return [text]
@@ -62,20 +92,29 @@ def _split_at_boundary(text: str, max_chars: int) -> list[str]:
    while len(remaining) > max_chars:
        candidate = remaining[:max_chars]
        split_pos = candidate.rfind("\n\n")
-        if split_pos == -1:
+        last_pos = -1
-            m = None
+        for m in _STRONG_END.finditer(candidate):
-            for m in re.finditer(r"[.!?»]\s+", candidate):
+            last_pos = m.start() + 1  # position right after the terminator character
                pass
            split_pos = m.end() if m else None
-        if split_pos is None or split_pos == 0:
+        if last_pos > 0:
-            sp = remaining.find(" ", max_chars)
+            first = remaining[:last_pos].rstrip()
-            split_pos = sp if sp != -1 else len(remaining)
+            remaining = remaining[last_pos:].lstrip()
-
+            if first:
-        parts.append(remaining[:split_pos].rstrip())
+                parts.append(first)
-        remaining = remaining[split_pos:].lstrip()
+        else:
            # Try a secondary boundary: ; + whitespace (legal clauses)
            sec_pos = -1
            for m in _SECONDARY_END.finditer(candidate):
                sec_pos = m.start() + 1
            if sec_pos > 0:
                first = remaining[:sec_pos].rstrip()
                remaining = remaining[sec_pos:].lstrip()
                if first:
                    parts.append(first)
            else:
                # No boundary: leave the chunk whole (too_long is better than incomplete)
                break
    if remaining:
        parts.append(remaining)
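
A self-contained sketch of the strong-boundary split, reduced to its core idea: cut at the last strong sentence end inside the window, and stop when none exists. It uses a simplified pattern and omits both the `\n\n` and the secondary `;` fallback of the diff; `STRONG_END` and `split_at_strong_boundary` are names introduced here for illustration.

```python
import re

# Strong sentence end: . ! ? or » followed by whitespace and then an
# uppercase letter, an opening quote or a parenthesis (simplified pattern).
STRONG_END = re.compile(r'[.!?»]\s+(?=[A-ZÀ-ÖØ-Þ«"\'(])')


def split_at_strong_boundary(text: str, max_chars: int) -> list[str]:
    """Split text into parts of at most max_chars at strong sentence ends.

    If no boundary exists inside the window, the remainder is left whole.
    """
    parts: list[str] = []
    remaining = text
    while len(remaining) > max_chars:
        last_pos = -1
        for m in STRONG_END.finditer(remaining[:max_chars]):
            last_pos = m.start() + 1  # index just after the terminator
        if last_pos <= 0:
            break  # no safe boundary: keep the rest intact
        parts.append(remaining[:last_pos].rstrip())
        remaining = remaining[last_pos:].lstrip()
    if remaining:
        parts.append(remaining)
    return parts
```

Leaving the remainder whole when no boundary is found mirrors the stated preference for a `too_long` warning over an incomplete chunk.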
@@ -173,10 +212,12 @@ def renumber_ids(chunks: list[dict]) -> list[dict]:
 # ─── Core ─────────────────────────────────────────────────────────────────────
-def fix_stem(stem: str, project_root: Path, max_chars: int, dry_run: bool) -> bool:
+def fix_stem(stem: str, project_root: Path, max_chars: int, dry_run: bool,
             max_iter: int = 10) -> bool:
    stem_dir    = project_root / "chunks" / stem
    chunks_path = stem_dir / "chunks.json"
    report_path = stem_dir / "report.json"
    max_chars   = _load_thresholds(stem_dir)
    if not chunks_path.exists():
        print(f"✗ chunks/{stem}/chunks.json non trovato.")
@@ -195,14 +236,22 @@ def fix_stem(stem: str, project_root: Path, max_chars: int, dry_run: bool) -> bo
    print(f"\nDocumento: {stem}  (verdict: {verdict})")
    if verdict == "ok":
-        print("  ✅ Nessun problema — nulla da correggere.")
+        print("  ✅ Nessun problema - nulla da correggere.")
        return True
    empty_ids      = {e["chunk_id"] for e in report.get("blockers", {}).get("empty", [])}
    no_prefix_ids  = {e["chunk_id"] for e in report.get("blockers", {}).get("no_prefix", [])}
    incomplete_ids = {e["chunk_id"] for e in report.get("blockers", {}).get("incomplete", [])}
    too_short_ids  = {e["chunk_id"] for e in report.get("warnings", {}).get("too_short", [])}
-    too_long_ids   = {e["chunk_id"] for e in report.get("warnings", {}).get("too_long", [])}
+
    # Split only chunks that exceed upper × SPLIT_THRESHOLD_FACTOR,
    # not those just above upper (splitting them would produce incomplete chunks).
    _split_limit = max_chars * cfg.SPLIT_THRESHOLD_FACTOR
    too_long_ids = {
        e["chunk_id"]
        for e in report.get("warnings", {}).get("too_long", [])
        if e.get("n_chars", 0) > _split_limit
    }
    ops: list[str] = []
    if empty_ids:
@@ -230,24 +279,77 @@ def fix_stem(stem: str, project_root: Path, max_chars: int, dry_run: bool) -> bo
    n_before = len(chunks)
-    if empty_ids:
+    def _fix_blockers(chunks: list[dict], report: dict) -> list[dict]:
-        chunks, n = fix_empty(chunks, empty_ids)
+        """Resolve only the blockers (incomplete, empty, no_prefix) without touching warnings."""
-        print(f"\n  🗑  Rimossi {n} chunk vuoti.")
+        empty_ids_      = {e["chunk_id"] for e in report.get("blockers", {}).get("empty", [])}
        no_prefix_ids_  = {e["chunk_id"] for e in report.get("blockers", {}).get("no_prefix", [])}
        incomplete_ids_ = {e["chunk_id"] for e in report.get("blockers", {}).get("incomplete", [])}
        if empty_ids_:
            chunks, n = fix_empty(chunks, empty_ids_)
            print(f"  🗑  Rimossi {n} chunk vuoti.")
        if no_prefix_ids_:
            chunks, n = fix_no_prefix(chunks, no_prefix_ids_)
            print(f"  🔧 Aggiunto prefisso a {n} chunk.")
        if incomplete_ids_:
            chunks, n = fix_incomplete_and_short(chunks, incomplete_ids_)
            print(f"  🔗 Fusi {n} chunk incompleti.")
        return renumber_ids(chunks)
-    if no_prefix_ids:
+    def _fix_warnings(chunks: list[dict], report: dict) -> list[dict]:
-        chunks, n = fix_no_prefix(chunks, no_prefix_ids)
+        """Applica fix opzionali: merge too_short e split too_long."""
-        print(f"  🔧 Aggiunto prefisso a {n} chunk.")
+        too_short_ids_ = {e["chunk_id"] for e in report.get("warnings", {}).get("too_short", [])}
        too_long_ids_  = {
            e["chunk_id"]
            for e in report.get("warnings", {}).get("too_long", [])
            if e.get("n_chars", 0) > max_chars * cfg.SPLIT_THRESHOLD_FACTOR
        }
        if too_short_ids_:
            chunks, n = fix_incomplete_and_short(chunks, too_short_ids_)
            print(f"  🔗 Fusi {n} chunk troppo corti.")
        if too_long_ids_:
            chunks, n = fix_too_long(chunks, too_long_ids_, max_chars)
            print(f"  ✂️  Spezzati {n} chunk lunghi.")
        return renumber_ids(chunks)
-    merge_ids = incomplete_ids | too_short_ids
+    # Fase 1: risolvi blockers a convergenza (solo merge incomplete)
-    if merge_ids:
+    chunks = _fix_blockers(chunks, report)
        chunks, n = fix_incomplete_and_short(chunks, merge_ids)
        print(f"  🔗 Fusi {n} chunk (incompleti + corti).")
-    if too_long_ids:
+    _min = int(cfg.TARGET_CHARS * (1 - cfg.CHUNK_TOLERANCE))
-        chunks, n = fix_too_long(chunks, too_long_ids, max_chars)
+    _max = int(cfg.TARGET_CHARS * (1 + cfg.CHUNK_TOLERANCE))
-        print(f"  ✂️  Spezzati {n} chunk lunghi.")
+    prev_blockers = sum(len(v) for v in report.get("blockers", {}).values())
-    chunks = renumber_ids(chunks)
+    for iteration in range(1, max_iter + 1):
        chunks_path.write_text(
            json.dumps(chunks, ensure_ascii=False, indent=2), encoding="utf-8"
        )
        with contextlib.redirect_stdout(io.StringIO()):
            _verify_stem(stem, project_root, _min, _max)
        report = json.loads(report_path.read_text(encoding="utf-8"))
        new_verdict = report.get("verdict", "ok")
        curr_blockers = sum(len(v) for v in report.get("blockers", {}).values())
        if new_verdict in ("ok", "warnings_only") or curr_blockers == 0:
            break
        if curr_blockers >= prev_blockers:
            print(f"\n  ⚠️  Nessun miglioramento ({curr_blockers} blockers) - i restanti richiedono correzione manuale del clean.md.")
            break
        print(f"\n  Iterazione {iteration + 1} - {curr_blockers} blockers residui:")
        prev_blockers = curr_blockers
        chunks = _fix_blockers(chunks, report)
    # Fase 2: fix warnings (too_short merge + too_long split) - una sola passata finale
    with contextlib.redirect_stdout(io.StringIO()):
        _verify_stem(stem, project_root, _min, _max)
    report = json.loads(report_path.read_text(encoding="utf-8"))
    n_short = len(report.get("warnings", {}).get("too_short", []))
    n_long  = sum(
        1 for e in report.get("warnings", {}).get("too_long", [])
        if e.get("n_chars", 0) > max_chars * cfg.SPLIT_THRESHOLD_FACTOR
    )
    if n_short or n_long:
        print(f"\n  Fix warnings: {n_short} corti, {n_long} lunghi da spezzare")
        chunks = _fix_warnings(chunks, report)
    n_after = len(chunks)
    print(f"\n  Totale chunk: {n_before} → {n_after}")
@@ -256,8 +358,15 @@ def fix_stem(stem: str, project_root: Path, max_chars: int, dry_run: bool) -> bo
        json.dumps(chunks, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    print(f"  ✅ Salvato: chunks/{stem}/chunks.json")
-    print(f"\n  Riesegui la verifica:")
-    print(f"     python chunks/verify_chunks.py --stem {stem}")
+
+    final_verdict = report.get("verdict", "?")
+    if final_verdict == "ok":
+        print(f"  ✅ Verdict finale: ok - procedi alla vettorizzazione.")
+    elif final_verdict == "warnings_only":
+        print(f"  🟡 Verdict finale: warnings_only - puoi procedere.")
+    else:
+        print(f"  🔴 Verdict finale: {final_verdict} - rilancia la verifica manualmente:")
+        print(f"     python chunks/verify_chunks.py --stem {stem}")
    return True
@@ -269,15 +378,20 @@ if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Fix chunk")
    parser.add_argument("--stem", required=True, help="Nome del documento (sottocartella di chunks/)")
    _max_def = int(cfg.TARGET_CHARS * (1 + cfg.CHUNK_TOLERANCE))
    parser.add_argument(
-        "--max", type=int, default=MAX_CHARS,
+        "--max", type=int, default=_max_def,
-        help=f"Soglia massima caratteri per lo split (default: {MAX_CHARS})"
+        help=f"Soglia massima caratteri per lo split (default: TARGET×(1+TOL) = {_max_def})"
    )
    parser.add_argument(
        "--dry-run", action="store_true",
        help="Mostra le operazioni pianificate senza applicarle"
    )
    parser.add_argument(
        "--max-iter", type=int, default=10, metavar="N",
        help="Numero massimo di iterazioni automatiche (default: 10)"
    )
    args = parser.parse_args()
-    ok = fix_stem(args.stem, project_root, args.max, args.dry_run)
+    ok = fix_stem(args.stem, project_root, args.max, args.dry_run, args.max_iter)
    sys.exit(0 if ok else 1)
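The phase 1 loop in `fix_stem` behaves like a bounded fixed-point iteration: it re-verifies after each blocker fix and stops when no blockers remain, when the blocker count stops decreasing, or when `--max-iter` is exhausted. The following is a minimal sketch of that control flow only. The names `fix_until_stable`, `fix_one_blocker` and `count_blockers` are hypothetical stand-ins for `_fix_blockers` and the verify/report round trip, and the "state" is a toy list in which negative numbers play the role of blockers.

```python
from typing import Callable


def fix_until_stable(
    state: list[int],
    fix: Callable[[list[int]], list[int]],
    count_blockers: Callable[[list[int]], int],
    max_iter: int = 10,
) -> tuple[list[int], int, str]:
    """Apply `fix` repeatedly; stop on success, on a stall, or at the iteration cap.

    Returns (final_state, remaining_blockers, stop_reason).
    """
    prev = count_blockers(state)      # blockers reported before the first fix
    state = fix(state)                # first pass
    for _ in range(max_iter):
        curr = count_blockers(state)  # stands in for verify + report reload
        if curr == 0:
            return state, 0, "converged"
        if curr >= prev:
            return state, curr, "stalled"  # no improvement: manual fix needed
        prev = curr
        state = fix(state)
    return state, count_blockers(state), "max_iter"


def fix_one_blocker(state: list[int]) -> list[int]:
    """Toy fixer: repairs at most one blocker (negative value) per pass."""
    out = list(state)
    for i, value in enumerate(out):
        if value < 0:
            out[i] = -value
            break
    return out


def count_negatives(state: list[int]) -> int:
    """Toy verifier: a blocker is any negative value."""
    return sum(1 for value in state if value < 0)


if __name__ == "__main__":
    print(fix_until_stable([-1, 2, -3], fix_one_blocker, count_negatives))
```

Comparing against the previous count, rather than only checking for zero, is what lets the loop give up early on blockers that automatic fixes cannot resolve.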
@@ -20,22 +20,39 @@ import re
 import sys
 from pathlib import Path
 _HERE = Path(__file__).resolve().parent
 if str(_HERE) not in sys.path:
    sys.path.insert(0, str(_HERE))
 import config as cfg
 # ─── Thresholds ───────────────────────────────────────────────────────────────
-MIN_CHARS = 200
+# ─── Thresholds (derived from the target, overridable from the CLI) ──────────
-MAX_CHARS = 800
+
 MIN_CHARS = int(cfg.TARGET_CHARS * (1 - cfg.CHUNK_TOLERANCE))
 MAX_CHARS = int(cfg.TARGET_CHARS * (1 + cfg.CHUNK_TOLERANCE))
 PUNCT_END = re.compile(
-    r"[.!?»)\]'\u2019\"\u201c\u201d\u2018\u2014\u2013\u2026]$"
+    r"[.!?\xbb)\]'\u2019\"\u201c\u201d\u2018\u2014\u2013\u2026]$"
    r"|/$"    # URL ending with /
    r"|\|$"   # Markdown table row
    r"|;$"    # end of a legal clause (legal text)
    r"|:$"    # introduction to a list or formula
 )
 _HEX_END     = re.compile(r"[0-9a-fA-F]{8,}$")
-_URL_TAIL    = re.compile(r"https?://\S+(\s+\S+){0,3}$")  # URL followed by up to 3 extra tokens
+_URL_TAIL    = re.compile(r"(https?://|www\.)\S+(\s+\S+){0,3}$")  # URL followed by up to 3 extra tokens
 _MATH_SYMS   = re.compile(r"[∈∑≤≥≠∀∃∫√∞∂±×÷→←↔⊂⊃⊆⊇∩∪·°]")
 _ROMAN_END   = re.compile(r"\b(I{1,3}|IV|VI{0,3}|IX|XI{0,2}|XIV|XV|XVI{0,2}|XIX|XX{0,2})$")
 def _load_thresholds(stem_dir: "Path") -> "tuple[int, int]":
    """Read min/max from meta.json (written by the chunker), falling back to the config defaults."""
    meta = stem_dir / "meta.json"
    if meta.exists():
        import json as _json
        m = _json.loads(meta.read_text(encoding="utf-8"))
        return m["min_chars"], m["max_chars"]
    return MIN_CHARS, MAX_CHARS
 # ─── Checks ───────────────────────────────────────────────────────────────────
 def has_prefix(chunk: dict) -> bool:
@@ -51,7 +68,7 @@ def is_too_short(chunk: dict, min_chars: int) -> bool:
 def is_too_long(chunk: dict, max_chars: int) -> bool:
-    return chunk.get("n_chars", 0) > max_chars * 1.5
+    return chunk.get("n_chars", 0) > max_chars
 def ends_incomplete(chunk: dict) -> bool:
@@ -65,6 +82,8 @@ def ends_incomplete(chunk: dict) -> bool:
        return False
    if _HEX_END.search(text_check):   # SHA hash / hex code
        return False
    if _ROMAN_END.search(text_check):  # trailing Roman numeral (PDF index/reference)
        return False
    if _URL_TAIL.search(text_check[-200:]):  # URL (possibly with a path after a space)
        return False
    return True
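The effect of the `_URL_TAIL` change can be checked in isolation. The sketch below copies the old and new patterns from the diff and applies them to three sample chunk tails. The sample strings and the helper name `ends_with_url` are made up for illustration; only the two regular expressions come from the commit.

```python
import re

# Patterns copied from the diff: the old one only recognized http(s) URLs.
OLD_URL_TAIL = re.compile(r"https?://\S+(\s+\S+){0,3}$")
NEW_URL_TAIL = re.compile(r"(https?://|www\.)\S+(\s+\S+){0,3}$")


def ends_with_url(text: str, pattern: re.Pattern) -> bool:
    """True if the last 200 characters end with a URL plus at most 3 extra tokens."""
    return pattern.search(text[-200:]) is not None


# Hypothetical chunk tails, for illustration only.
watermark = "Document shared on www.docsity.com"
classic = "See https://example.com/docs for details"
plain = "this sentence stops without punctuation"

if __name__ == "__main__":
    for label, text in [("watermark", watermark), ("classic", classic), ("plain", plain)]:
        print(label, ends_with_url(text, OLD_URL_TAIL), ends_with_url(text, NEW_URL_TAIL))
```

With the old pattern the `www.` watermark is not recognized as a valid terminator, so `ends_incomplete` would flag the chunk; with the new pattern it is accepted, while plain unterminated text is still rejected.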
@@ -72,7 +91,7 @@ def ends_incomplete(chunk: dict) -> bool:
 def is_math_incomplete(chunk: dict) -> bool:
    """Incomplete but in a mathematical context: downgraded to a warning instead of a blocker."""
-    return ends_incomplete(chunk) and len(_MATH_SYMS.findall(chunk.get("text", ""))) >= 3
+    return ends_incomplete(chunk) and len(_MATH_SYMS.findall(chunk.get("text", ""))) >= cfg.MATH_SYMS_MIN
 # ─── Report ───────────────────────────────────────────────────────────────────
@@ -85,7 +104,9 @@ def _fmt_chunk(c: dict) -> str:
 def verify_stem(stem: str, project_root: Path, min_chars: int, max_chars: int) -> bool:
-    chunks_path = project_root / "chunks" / stem / "chunks.json"
+    stem_dir    = project_root / "chunks" / stem
    chunks_path = stem_dir / "chunks.json"
    min_chars, max_chars = _load_thresholds(stem_dir)
    print(f"\nDocumento: {stem}")
@@ -170,12 +191,12 @@ def verify_stem(stem: str, project_root: Path, min_chars: int, max_chars: int) -
    if too_long:
        has_errors = True
-        print(f"\n  🟡 {len(too_long)} chunk SOPRA MAX_CHARS×1.5 ({int(max_chars * 1.5)}):")
+        print(f"\n  🟡 {len(too_long)} chunk SOPRA MAX ({max_chars}):")
        for c in too_long[:5]:
            print(_fmt_chunk(c))
        if len(too_long) > 5:
            print(f"  ... e altri {len(too_long) - 5}")
-        print(f"  → Soluzione: alza MAX_CHARS o verifica il testo nel MD")
+        print(f"  → Causa probabile: frasi singole lunghe (liste/paragrafi non suddivisibili)")
    if incomplete:
        has_errors = True
@@ -225,7 +246,12 @@ def verify_stem(stem: str, project_root: Path, min_chars: int, max_chars: int) -
            "max_chars": max_l,
            "avg_chars": avg_l,
        },
-        "thresholds": {"min_chars": min_chars, "max_chars": max_chars},
+        "thresholds": {
            "min_chars": min_chars,
            "max_chars": max_chars,
            "target_chars": cfg.TARGET_CHARS,
            "chunk_tolerance": cfg.CHUNK_TOLERANCE,
        },
        "blockers": {
            "empty":      [_chunk_entry(c) for c in empty_chunks],
            "no_prefix":  [_chunk_entry(c) for c in no_prefix],
@@ -253,11 +279,11 @@ def verify_stem(stem: str, project_root: Path, min_chars: int, max_chars: int) -
    if not blockers and not warnings:
        print(f"  ✅ Tutto OK — procedi alla vettorizzazione:")
-        print(f"       python step-8/ingest.py --stem {stem}")
+        print(f"       python ingestion/ingest.py --stem {stem}")
    elif not blockers:
        print(f"  🟡 Solo avvisi minori — puoi procedere alla vettorizzazione:")
-        print(f"       python step-8/ingest.py --stem {stem}")
+        print(f"       python ingestion/ingest.py --stem {stem}")
        print()
        print(f"  Oppure, per ottimizzare prima:")
        if too_short:
@@ -301,13 +327,15 @@ if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Verifica chunk")
    parser.add_argument("--stem", help="Nome del documento (sottocartella di chunks/)")
    _min_def = int(cfg.TARGET_CHARS * (1 - cfg.CHUNK_TOLERANCE))
    _max_def = int(cfg.TARGET_CHARS * (1 + cfg.CHUNK_TOLERANCE))
    parser.add_argument(
-        "--min", type=int, default=MIN_CHARS,
+        "--min", type=int, default=_min_def,
-        help=f"Soglia minima caratteri (default: {MIN_CHARS})"
+        help=f"Soglia minima caratteri (default: TARGET×(1-TOL) = {_min_def})"
    )
    parser.add_argument(
-        "--max", type=int, default=MAX_CHARS,
+        "--max", type=int, default=_max_def,
-        help=f"Soglia massima caratteri (default: {MAX_CHARS})"
+        help=f"Soglia massima caratteri (default: TARGET×(1+TOL) = {_max_def})"
    )
    args = parser.parse_args()
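The thresholds are now derived from a target length and a tolerance instead of being hard-coded, with `meta.json` taking precedence when the chunker has written it. Below is a minimal sketch of that derivation. The values `TARGET_CHARS = 600`, `CHUNK_TOLERANCE = 0.5` and `SPLIT_THRESHOLD_FACTOR = 1.5` are assumptions chosen for the example (the real values live in `config.py` and are not shown in this diff), and the function names are simplified stand-ins.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical config values, chosen only for the example.
TARGET_CHARS = 600
CHUNK_TOLERANCE = 0.5
SPLIT_THRESHOLD_FACTOR = 1.5


def default_thresholds(target: int, tolerance: float) -> tuple[int, int]:
    """min = target × (1 - tol), max = target × (1 + tol), truncated to int."""
    return int(target * (1 - tolerance)), int(target * (1 + tolerance))


def load_thresholds(stem_dir: Path) -> tuple[int, int]:
    """Prefer the thresholds recorded in meta.json; otherwise use the defaults."""
    meta = stem_dir / "meta.json"
    if meta.exists():
        m = json.loads(meta.read_text(encoding="utf-8"))
        return m["min_chars"], m["max_chars"]
    return default_thresholds(TARGET_CHARS, CHUNK_TOLERANCE)


def split_limit(max_chars: int) -> float:
    """Only chunks longer than this are split by the fixer."""
    return max_chars * SPLIT_THRESHOLD_FACTOR


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        stem_dir = Path(tmp)
        print(load_thresholds(stem_dir))  # defaults, no meta.json yet
        (stem_dir / "meta.json").write_text(
            json.dumps({"min_chars": 250, "max_chars": 1000}), encoding="utf-8"
        )
        print(load_thresholds(stem_dir))  # values recorded by the chunker
```

Under these assumed values a chunk between 900 and 1350 characters is reported as `too_long` by the verifier but left alone by the fixer, which matches the intent described in the `SPLIT_THRESHOLD_FACTOR` comment.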
@@ -18,7 +18,7 @@ TOP_K = 6
 # LLM model temperature.
 # 0.0 = fully deterministic (same answer on every run)
 # 0.7 = more creative and varied
-TEMPERATURE = 0.0
+TEMPERATURE = 0.2
 # Disables "thinking" (internal reasoning) in Qwen3/Qwen3.5 models.
 # True  = direct answer, faster
@@ -38,7 +38,7 @@ EMBED_MODEL = "nomic-embed-text"
 OLLAMA_URL = "http://localhost:11434"
 # LLM model. Choose based on available RAM (see README).
-OLLAMA_MODEL = "qwen3.5:0.8b"
+OLLAMA_MODEL = "qwen3.5:4b"
 # ── System prompt ─────────────────────────────────────────────────────────────
@@ -33,13 +33,13 @@ Posiziona il PDF in `sources/<nome>.pdf`, poi:
 ```bash
 # Single document
-python conversione/pipeline.py --stem <nome>
+python conversione/ --stem <nome>
 # All PDFs in sources/
-python conversione/pipeline.py
+python conversione/
 # Force re-execution (overwrites existing output)
-python conversione/pipeline.py --stem <nome> --force
+python conversione/ --stem <nome> --force
 ```
 The `--stem` parameter is the PDF file name without its extension.  
@@ -49,12 +49,13 @@ Esempio: `sources/analisi1.pdf` → `--stem analisi1`
 ## Output
-For each stem, three files are produced in `conversione/<stem>/`:
+For each stem, four files are produced in `conversione/<stem>/`:
 | File | Description |
 |------|-------------|
 | `raw.md` | Raw Markdown extracted from the PDF (**do not edit**) |
 | `clean.md` | Clean, structured Markdown; input for the chunker |
 | `structure_profile.json` | Detected structure and recommended chunking strategy |
 | `report.json` | Complete conversion quality metrics |
 ### report.json
@@ -110,11 +111,15 @@ anomalie e problemi residui con esempi.
 ## Batch validation
-After converting one or more documents, run `validate.py` to get
+After converting one or more documents, run `validate` to get
 a status table covering all stems:
 ```bash
-python conversione/validate.py
+# All documents
+python conversione/ validate
+# Single document with penalty detail
+python conversione/ validate <stem> --detail
 ```
 Example output:
@@ -221,11 +226,19 @@ Durante l'esecuzione la pipeline stampa le statistiche di ogni trasformazione:
 ```
  [3/4] Pulizia strutturale...
-  ✅ Immagini rimosse:      0
+  ✅ Simboli PUA corretti:  0
     Immagini rimosse:      0
     Note rimosse:          12
     Accenti corretti:      3701
     Dot-leader rimossi:    53
     Header concat fixati:  0
     Header num. normaliz.: 8
     Articoli → ###:        0
     Ambienti matematici:   0
     Titoli header uniti:   4
     TOC rimosso:           sì
     Versi poesia riprist.: 0
     Header verso demotati: 0
     ALL-CAPS → ##:         14
     Sezioni → ###:         279
     Paragrafi uniti:       12998
@@ -1,18 +1,28 @@
-# Step 8 - Vectorization
+# Ingestion - Vectorization
-Reads the chunks produced by step-6, generates embeddings via Ollama and
+Reads the chunks produced by the chunker, generates embeddings via Ollama and
 saves them in ChromaDB (a persistent on-disk vector store).
 ---
 ## Prerequisites
-- Step-6 completed (`step-6/<stem>/chunks.json` exists)
+- Chunking completed (`chunks/<stem>/chunks.json` exists)
 - Ollama running with the embedding model downloaded
 - `chromadb` installed (`pip install -r requirements.txt`)
 ---
 ## Environment check
 Before running ingestion, check that Ollama and the models are available:
 ```bash
 .venv/bin/python ollama/check_env.py
 ```
 ---
 ## Model configuration
 The embedding model is read from **`config.py`**:
@@ -23,7 +33,7 @@ EMBED_MODEL = "nomic-embed-text"   # ← cambia qui
 ```
 > The model chosen here must match the one used in rag.py.
-> If you change it after vectorizing, you must re-run step-8 with `--force`.
+> If you change it after vectorizing, you must re-run ingestion with `--force`.
 ---
@@ -31,16 +41,16 @@ EMBED_MODEL = "nomic-embed-text"   # ← cambia qui
 ```bash
 # Vectorize a single document
-python step-8/ingest.py --stem <nome>
+.venv/bin/python ingestion/ingest.py --stem <nome>
 # Vectorize all documents found in step-6/
-python step-8/ingest.py
+.venv/bin/python ingestion/ingest.py
 # Overwrite an existing collection
-python step-8/ingest.py --stem <nome> --force
+.venv/bin/python ingestion/ingest.py --stem <nome> --force
 # Override the model (without editing config.py)
-python step-8/ingest.py --stem <nome> --model bge-m3
+.venv/bin/python ingestion/ingest.py --stem <nome> --model bge-m3
 ```
 ---
@@ -94,7 +104,7 @@ Senza `--force` lo script salta la collection già esistente — i vecchi vettor
 ```bash
 # Model change → always recreate the collection
-python step-8/ingest.py --stem <nome> --force
+.venv/bin/python ingestion/ingest.py --stem <nome> --force
 ```
 ### When to use `--force`
@@ -103,7 +113,7 @@ python step-8/ingest.py --stem <nome> --force
 |---|---|
 | First run | No |
 | You changed `EMBED_MODEL` | **Yes** |
-| You improved the chunks in step-6 | **Yes** |
+| You improved the chunks in chunks/ | **Yes** |
 | You added new documents (different stem) | No |
 | You just want to check that it works | No |
@@ -2,21 +2,21 @@
 """
 Step 8 - Vectorization
-Reads the chunks produced by step-6, generates embeddings via Ollama
+Reads the chunks stored in chunks/, generates embeddings via Ollama
 and indexes them in ChromaDB (persistent).
 The embedding model is read from config.py (EMBED_MODEL).
 You can override it with --model, but it must match the model you
 will use in rag.py; otherwise re-run with --force after changing it.
-Input:  step-6/<stem>/chunks.json
+Input:  chunks/<stem>/chunks.json
 Output: chroma_db/<stem> (ChromaDB collection)
 Usage:
-    python step-8/ingest.py --stem <nome>            # single document
+    python ingestion/ingest.py --stem <nome>            # single document
-    python step-8/ingest.py                          # all stems found
+    python ingestion/ingest.py                          # all stems found
-    python step-8/ingest.py --stem <nome> --force    # overwrite the collection
+    python ingestion/ingest.py --stem <nome> --force    # overwrite the collection
-    python step-8/ingest.py --model bge-m3           # model override
+    python ingestion/ingest.py --model bge-m3           # model override
 """
 import argparse
@@ -33,7 +33,7 @@ import chromadb
 project_root = Path(__file__).parent.parent
-CHUNKS_DIR = project_root / "step-6"
+CHUNKS_DIR = project_root / "chunks"
 CHROMA_DIR = project_root / "chroma_db"
 sys.path.insert(0, str(project_root))
@@ -94,40 +94,27 @@ def collection_exists(client: chromadb.PersistentClient, stem: str) -> bool:
 # ─── Ingestione ───────────────────────────────────────────────────────────────
-def ingest(stem: str, force: bool, model: str = EMBED_MODEL) -> bool:
+def _ingest_stem(stem: str, collection: chromadb.Collection,
+                 model: str, offset: int = 0) -> int:
    """
-    Reads step-6/<stem>/chunks.json, generates embeddings and populates ChromaDB.
-    Returns True on success, False otherwise.
+    Adds the chunks of one stem to an existing collection.
+    Prefixes chunk_id with the stem to avoid multi-document collisions.
+    Returns the number of chunks added.
    """
    chunks_path = CHUNKS_DIR / stem / "chunks.json"
    if not chunks_path.exists():
        print(f"❌ File non trovato: {chunks_path}")
-        return False
+        return 0
    with open(chunks_path, encoding="utf-8") as f:
        chunks = json.load(f)
    if not chunks:
        print(f"⚠️  {stem}: chunks.json è vuoto — skip")
-        return False
+        return 0
-    client = get_client()
-    if collection_exists(client, stem):
-        if not force:
-            print(f"⚠️  Collection '{stem}' già presente in ChromaDB — skip")
-            print(f"   → usa --force per sovrascrivere")
-            return True  # not an error, just a skip
-        client.delete_collection(stem)
-        print(f"🗑️  Collection '{stem}' rimossa (--force)")
-    collection = client.create_collection(
-        name=stem,
-        metadata={"hnsw:space": "cosine"},
-    )
    total = len(chunks)
-    print(f"📦 {total} chunk da ingestire\n")
+    print(f"  📄 {stem}: {total} chunk\n")
    ids        = []
    embeddings = []
@@ -143,10 +130,11 @@ def ingest(stem: str, force: bool, model: str = EMBED_MODEL) -> bool:
        t1 = time.monotonic()
        durations.append(t1 - t0)
-        ids.append(chunk["chunk_id"])
+        ids.append(f"{stem}__{chunk['chunk_id']}")
        embeddings.append(vector)
        documents.append(chunk["text"])
        metadatas.append({
            "source":    stem,
            "sezione":   chunk.get("sezione", ""),
            "titolo":    chunk.get("titolo", ""),
            "sub_index": chunk.get("sub_index", 0),
@@ -154,41 +142,69 @@ def ingest(stem: str, force: bool, model: str = EMBED_MODEL) -> bool:
        avg  = sum(durations) / len(durations)
        eta  = int(avg * (total - i))
-        done = f"[{i:>{len(str(total))}}/{total}]"
-        cid  = chunk["chunk_id"][:50]
-        line = f"  {done} ✓ {cid:<50}  ETA: {eta}s"
-        print(f"{line:<80}", end="\r", flush=True)
+        done = f"[{offset + i:>6}/{offset + total}]"
+        cid  = chunk["chunk_id"][:40]
+        print(f"  {done} ✓ {stem}/{cid:<40}  ETA: {eta}s", end="\r", flush=True)
        # Add in batches of 100 to avoid overloading memory
        if len(ids) == 100:
-            collection.add(
-                ids=ids,
-                embeddings=embeddings,
-                documents=documents,
-                metadatas=metadatas,
-            )
+            collection.add(ids=ids, embeddings=embeddings,
+                           documents=documents, metadatas=metadatas)
            ids, embeddings, documents, metadatas = [], [], [], []
    # Add the remaining chunks
    if ids:
-        collection.add(
-            ids=ids,
-            embeddings=embeddings,
-            documents=documents,
-            metadatas=metadatas,
-        )
+        collection.add(ids=ids, embeddings=embeddings,
+                       documents=documents, metadatas=metadatas)
    elapsed = int(time.monotonic() - start)
-    print()  # nuova riga dopo il \r
+    print()
-    print(f"\n✅ Ingestione completata in {elapsed}s — {total}/{total} chunk salvati")
+    print(f"  ✅ {stem}: {total} chunk in {elapsed}s")
-    print(f"   Collection '{stem}' in {CHROMA_DIR}/")
+    return total
 def ingest(stem: str, force: bool, model: str = EMBED_MODEL) -> bool:
    """Ingest a single document into its own dedicated collection (backward compatible)."""
    return ingest_multi([stem], collection_name=stem, force=force, model=model)
 def ingest_multi(stems: list[str], collection_name: str,
                 force: bool, model: str = EMBED_MODEL) -> bool:
    """
    Ingests one or more documents into a single ChromaDB collection.
    The chunk_ids are prefixed with the stem to avoid collisions.
    The 'source' metadata field identifies the originating document.
    """
    client = get_client()
    if collection_exists(client, collection_name):
        if not force:
            print(f"⚠️  Collection '{collection_name}' già presente in ChromaDB — skip")
            print(f"   → usa --force per sovrascrivere")
            return True
        client.delete_collection(collection_name)
        print(f"🗑️  Collection '{collection_name}' rimossa (--force)")
    collection = client.create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"},
    )
    total_chunks = 0
    for stem in stems:
        n = _ingest_stem(stem, collection, model, offset=total_chunks)
        if n == 0 and len(stems) == 1:
            return False
        total_chunks += n
    print(f"\n✅ Collection '{collection_name}': {total_chunks} chunk totali")
    print(f"   Documenti: {', '.join(stems)}")
    print(f"   Percorso: {CHROMA_DIR}/")
    return True
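The reason for the `<stem>__<chunk_id>` prefix is that chunk IDs are only unique within one document, while IDs inside a single collection must be unique across all of them. The sketch below illustrates the collision and the namespacing without touching ChromaDB. The chunk IDs (`chunk_0001`, ...), the stems and the helper names are hypothetical; only the `f"{stem}__{chunk_id}"` format and the `source` metadata key come from the diff.

```python
def namespaced_ids(stem: str, chunks: list[dict]) -> list[str]:
    """Build collection-wide IDs by prefixing each chunk_id with its stem."""
    return [f"{stem}__{c['chunk_id']}" for c in chunks]


def split_id(namespaced: str) -> tuple[str, str]:
    """Recover (stem, chunk_id); splits on the first '__' only."""
    stem, chunk_id = namespaced.split("__", 1)
    return stem, chunk_id


# Two hypothetical documents whose chunkers both start numbering from 1.
doc1 = [{"chunk_id": "chunk_0001"}, {"chunk_id": "chunk_0002"}]
doc2 = [{"chunk_id": "chunk_0001"}]

raw_ids = [c["chunk_id"] for c in doc1 + doc2]
safe_ids = namespaced_ids("doc1", doc1) + namespaced_ids("doc2", doc2)

# The 'source' metadata lets retrieval report which document a hit came from.
metadatas = [{"source": split_id(i)[0]} for i in safe_ids]

if __name__ == "__main__":
    print(len(set(raw_ids)), "unique raw ids out of", len(raw_ids))
    print(len(set(safe_ids)), "unique namespaced ids out of", len(safe_ids))
```

One caveat of this scheme is that a stem that itself contains `__` makes the split ambiguous, so the `source` metadata, not the ID, is the more reliable way to recover the originating document.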
 # ─── Entry point ──────────────────────────────────────────────────────────────
 def find_stems() -> list[str]:
-    """Return every stem that has a chunks.json in step-6/."""
+    """Return every stem that has a chunks.json in chunks/."""
    return sorted(
        p.parent.name
        for p in CHUNKS_DIR.glob("*/chunks.json")
@@ -197,26 +213,53 @@ def find_stems() -> list[str]:
 def main() -> int:
    parser = argparse.ArgumentParser(
-        description="Step 8 — Vettorizzazione chunk in ChromaDB"
+        description="Vettorizzazione chunk in ChromaDB",
        epilog=(
            "Esempi:\n"
            "  python ingestion/ingest.py --stem manuale\n"
            "  python ingestion/ingest.py --collection archivio --stems doc1 doc2 doc3\n"
            "  python ingestion/ingest.py --collection archivio --stems doc1 doc2 --force\n"
            "  python ingestion/ingest.py                          # tutti i documenti, collection separate"
        ),
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
-    parser.add_argument("--stem", help="Nome del documento (senza --stem = tutti)")
+    parser.add_argument("--stem",
                        help="Singolo documento → collection con lo stesso nome")
    parser.add_argument("--stems", nargs="+", metavar="STEM",
                        help="Uno o più documenti da unire in --collection")
    parser.add_argument("--collection",
                        help="Nome della collection di destinazione (richiesto con --stems)")
    parser.add_argument("--force", action="store_true",
                        help="Sovrascrive la collection se già esistente")
    parser.add_argument("--model", default=EMBED_MODEL,
-                        help=f"Modello embedding Ollama (default da config.py: {EMBED_MODEL})")
+                        help=f"Modello embedding (default: {EMBED_MODEL})")
    args = parser.parse_args()
-    print("─── Step 8 — Vettorizzazione ─────────────────────────────────────────\n")
+    print("─── Vettorizzazione ──────────────────────────────────────────────────\n")
    if not check_ollama(args.model):
        return 1
    # ── Multi-document mode ──────────────────────────────────────────────────
    if args.stems or args.collection:
        if not args.stems:
            print("❌ --collection richiede --stems (es. --stems doc1 doc2 doc3)")
            return 1
        if not args.collection:
            print("❌ --stems richiede --collection (es. --collection archivio)")
            return 1
        print(f"  Collection : {args.collection}")
        print(f"  Documenti  : {', '.join(args.stems)}\n")
        ok = ingest_multi(args.stems, args.collection,
                          force=args.force, model=args.model)
        return 0 if ok else 1
    # ── Single / all mode ────────────────────────────────────────────────────
    stems = [args.stem] if args.stem else find_stems()
    if not stems:
-        print("❌ Nessun chunks.json trovato in step-6/")
+        print("❌ Nessun chunks.json trovato in chunks/")
        return 1
    print()
    results = []
    for stem in stems:
        if len(stems) > 1:
@@ -57,7 +57,7 @@ Alternative supportate:
 - `bge-m3`
 - `nomic-embed-text`
-If you switch to a different embedding model than the one used in step-8, re-run ingest with `--force` and update `EMBED_MODEL` in `config.py`.
+If you switch to a different embedding model than the one used in ingestion, re-run ingest with `--force` and update `EMBED_MODEL` in `config.py`.
 ### LLM model (recommended for 8 GB RAM)
@@ -101,7 +101,7 @@ Output atteso (esempio):
 ✅ LLM disponibile: qwen3.5:4b
 ✅ chromadb importabile
 ✅ Ambiente pronto — procedi con la vettorizzazione:
-   python step-8/ingest.py --stem <nome>
+   python ingestion/ingest.py --stem <nome>
 ```
 ---
@@ -109,5 +109,5 @@ Output atteso (esempio):
 ## Next step
 ```bash
-python step-8/ingest.py --stem <nome>
+python ingestion/ingest.py --stem <nome>
 ```
@@ -101,6 +101,7 @@ def retrieve(collection: chromadb.Collection, question: str) -> list[dict]:
    ):
        chunks.append({
            "text":     text,
            "source":   meta.get("source", ""),
            "sezione":  meta.get("sezione", ""),
            "titolo":   meta.get("titolo", ""),
            "distance": dist,
@@ -143,7 +144,8 @@ def answer(question: str, collection: chromadb.Collection, verbose: bool) -> Non
            if c["titolo"]:
                loc += f" > {c['titolo']}"
            sim = 1 - c["distance"]
-            print(f"  [{i}] {loc}  (similarità: {sim:.3f})")
+            src = f"[{c['source']}] " if c.get("source") else ""
+            print(f"  [{i}] {src}{loc}  (similarità: {sim:.3f})")
            print(f"      {c['text'][:120].replace(chr(10), ' ')}...")
        print("──────────────────────────────────────────────────────────────\n")
@@ -197,7 +199,7 @@ def _build_epilog() -> str:
            if names:
                lines += ["", f"Collection disponibili: {', '.join(names)}"]
            else:
-                lines += ["", "Nessuna collection trovata — eseguire prima: python step-8/ingest.py"]
+                lines += ["", "Nessuna collection trovata — eseguire prima: python ingestion/ingest.py"]
        except Exception:
            pass
    return "\n".join(lines)
@@ -208,41 +210,48 @@ def main() -> int:
        description=(
            "Pipeline RAG interattiva\n\n"
            "Risponde a domande in linguaggio naturale su un documento\n"
-            "indicizzato in ChromaDB da step-8/ingest.py."
+            "indicizzato in ChromaDB da ingestion/ingest.py."
        ),
        epilog=_build_epilog(),
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    parser.add_argument(
        "--stem",
-        required=True,
-        help=(
-            "Nome della collection ChromaDB da interrogare. "
-            "Le collection vengono create da: python step-8/ingest.py --stem <nome>"
-        ),
+        help="Collection di un singolo documento (retrocompatibile)",
+    )
+    parser.add_argument(
+        "--collection",
+        help="Collection multi-documento creata con: ingest.py --collection <nome> --stems ...",
    )
    args = parser.parse_args()
    collection_name = args.collection or args.stem
    if not collection_name:
        parser.error("specifica --stem <nome> oppure --collection <nome>")
    print("─── Pipeline RAG ────────────────────────────────────────────\n")
-    print(f"  Documento : {args.stem}")
+    print(f"  Collection : {collection_name}")
-    print(f"  Modello   : {LLM_MODEL}")
+    print(f"  Modello    : {LLM_MODEL}")
-    print(f"  Top-K     : {TOP_K}")
+    print(f"  Top-K      : {TOP_K}")
-    print(f"  Thinking  : {'off' if NO_THINK else 'on'}")
+    print(f"  Thinking   : {'off' if NO_THINK else 'on'}")
    print()
    if not CHROMA_DIR.exists():
-        print("❌ chroma_db/ non trovata — esegui prima step-8")
+        print("❌ chroma_db/ non trovata — esegui prima ingestion")
        return 1
    client = chromadb.PersistentClient(path=str(CHROMA_DIR))
    collections = [c.name for c in client.list_collections()]
-    if args.stem not in collections:
+    if collection_name not in collections:
-        print(f"❌ Collection '{args.stem}' non trovata in chroma_db/")
+        print(f"❌ Collection '{collection_name}' non trovata in chroma_db/")
-        print(f"   → python step-8/ingest.py --stem {args.stem}")
+        if args.stem:
+            print(f"   → python ingestion/ingest.py --stem {collection_name}")
+        else:
+            print(f"   → python ingestion/ingest.py --collection {collection_name} --stems doc1 doc2 ...")
        return 1
-    collection = client.get_collection(args.stem)
+    collection = client.get_collection(collection_name)
-    print(f"✅ Collection '{args.stem}' caricata ({collection.count()} chunk)\n")
+    print(f"✅ Collection '{collection_name}' caricata ({collection.count()} chunk)\n")
    run_loop(collection)
    return 0
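Both `rag.py` and `retrieve.py` now resolve the target collection with `args.collection or args.stem`, so `--collection` wins when both flags are given and the parser fails when neither is. A minimal, self-contained sketch of that resolution follows. The function names `build_parser` and `resolve_collection` are hypothetical; the flag names and the error message are taken from the diff.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser with the two optional, mutually substitutable flags."""
    parser = argparse.ArgumentParser(description="Collection resolution sketch")
    parser.add_argument("--stem", help="single-document collection (backward compatible)")
    parser.add_argument("--collection", help="multi-document collection")
    return parser


def resolve_collection(argv: list[str]) -> str:
    """Return the collection name, preferring --collection over --stem."""
    parser = build_parser()
    args = parser.parse_args(argv)
    name = args.collection or args.stem
    if not name:
        # parser.error prints usage to stderr and exits with status 2
        parser.error("specifica --stem <nome> oppure --collection <nome>")
    return name


if __name__ == "__main__":
    print(resolve_collection(["--stem", "manuale"]))
    print(resolve_collection(["--collection", "archivio", "--stem", "manuale"]))
```

Dropping `required=True` from `--stem` and validating after parsing keeps existing `--stem` invocations working while allowing `--collection` on its own.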
@@ -85,6 +85,7 @@ def retrieve(collection: chromadb.Collection, query: str, top_k: int) -> list[di
        chunks.append({
            "rank":       rank,
            "similarity": round(1 - dist, 4),
+            "source":     meta.get("source", ""),
            "sezione":    meta.get("sezione", ""),
            "titolo":     meta.get("titolo", ""),
            "text":       text,
@@ -97,10 +98,11 @@ def retrieve(collection: chromadb.Collection, query: str, top_k: int) -> list[di
 def print_results(chunks: list[dict], full: bool = False) -> None:
    print(f"── {len(chunks)} chunk recuperati ─────────────────────────────────\n")
    for c in chunks:
+        src = f"[{c['source']}] " if c.get("source") else ""
        loc = c["sezione"]
        if c["titolo"]:
            loc += f" > {c['titolo']}"
-        print(f"  [{c['rank']}] similarità: {c['similarity']:.4f}  |  {loc}")
+        print(f"  [{c['rank']}] similarità: {c['similarity']:.4f}  |  {src}{loc}")
        if full:
            print()
            print(c["text"])
@@ -159,7 +161,7 @@ def _build_epilog() -> str:
            if names:
                lines += ["", f"Collection disponibili: {', '.join(names)}"]
            else:
-                lines += ["", "Nessuna collection trovata — eseguire prima: python step-8/ingest.py"]
+                lines += ["", "Nessuna collection trovata — eseguire prima: python ingestion/ingest.py"]
        except Exception:
            pass
    return "\n".join(lines)
@@ -177,8 +179,11 @@ def main() -> int:
    )
    parser.add_argument(
        "--stem",
-        required=True,
-        help="Nome della collection ChromaDB da interrogare.",
+        help="Collection di un singolo documento (retrocompatibile)",
+    )
+    parser.add_argument(
+        "--collection",
+        help="Collection multi-documento creata con: ingest.py --collection <nome> --stems ...",
    )
    parser.add_argument(
        "--top-k",
@@ -189,25 +194,32 @@ def main() -> int:
    )
    args = parser.parse_args()
+    collection_name = args.collection or args.stem
+    if not collection_name:
+        parser.error("specifica --stem <nome> oppure --collection <nome>")
    print("─── Retrieval puro ──────────────────────────────────────────\n")
-    print(f"  Documento    : {args.stem}")
+    print(f"  Collection   : {collection_name}")
    print(f"  Embed model  : {EMBED_MODEL}")
    print(f"  Top-K        : {args.top_k}")
    print()
    if not CHROMA_DIR.exists():
-        print("❌ chroma_db/ non trovata — esegui prima step-8", file=sys.stderr)
+        print("❌ chroma_db/ non trovata — esegui prima ingestion", file=sys.stderr)
        return 1
    client = chromadb.PersistentClient(path=str(CHROMA_DIR))
    collections = [c.name for c in client.list_collections()]
-    if args.stem not in collections:
+    if collection_name not in collections:
-        print(f"❌ Collection '{args.stem}' non trovata in chroma_db/", file=sys.stderr)
+        print(f"❌ Collection '{collection_name}' non trovata in chroma_db/", file=sys.stderr)
-        print(f"   → python step-8/ingest.py --stem {args.stem}", file=sys.stderr)
+        if args.stem:
+            print(f"   → python ingestion/ingest.py --stem {collection_name}", file=sys.stderr)
+        else:
+            print(f"   → python ingestion/ingest.py --collection {collection_name} --stems doc1 doc2 ...", file=sys.stderr)
        return 1
-    collection = client.get_collection(args.stem)
+    collection = client.get_collection(collection_name)
-    print(f"✅ Collection '{args.stem}' caricata ({collection.count()} chunk)\n")
+    print(f"✅ Collection '{collection_name}' caricata ({collection.count()} chunk)\n")
    run_loop(collection, args.top_k)
    return 0
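The `retrieve.py` hunks above add a `source` field to each hit and prefix it to the printed location. The sketch below reproduces that behaviour without a database. The function names `to_hit` and `format_hit` and the English label are assumptions for illustration; the dictionary keys and the `1 - dist` conversion are taken from the diff. Reading `1 - dist` as a similarity presumes a distance such as cosine, which the diff does not state.

```python
def to_hit(rank: int, dist: float, meta: dict, text: str) -> dict:
    """Build one result entry the way the retrieve() hunk does."""
    return {
        "rank": rank,
        "similarity": round(1 - dist, 4),
        # Empty for collections ingested before the multi-document change.
        "source": meta.get("source", ""),
        "sezione": meta.get("sezione", ""),
        "titolo": meta.get("titolo", ""),
        "text": text,
    }


def format_hit(hit: dict) -> str:
    """Render the one-line summary, with [source] only when it is set."""
    src = f"[{hit['source']}] " if hit.get("source") else ""
    loc = hit["sezione"]
    if hit["titolo"]:
        loc += f" > {hit['titolo']}"
    return f"[{hit['rank']}] similarity: {hit['similarity']:.4f} | {src}{loc}"


if __name__ == "__main__":
    multi = to_hit(1, 0.25, {"source": "doc1", "sezione": "Cap. 1", "titolo": "Intro"}, "...")
    legacy = to_hit(2, 0.5, {"sezione": "Cap. 2"}, "...")
    print(format_hit(multi))
    print(format_hit(legacy))
```

Because `source` defaults to an empty string, single-document collections print exactly as before, which keeps the `--stem` path backward compatible.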
Author	SHA1	Message	Date
davide	48567fa5e7	fix(verify): recognize www. URLs as valid terminators + multi-document docs - _URL_TAIL now also matches www.* (not only https://), avoiding false blockers on watermarks such as www.docsity.com - README: documents --collection / --stems for ingestion, retrieve and rag Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-12 11:21:24 +02:00
davide	8d972fa7c6	feat(ingestion): multi-document support in a single ChromaDB collection Adds the ability to merge multiple documents into a single ChromaDB collection, with chunk_id values prefixed by stem and a source metadata field for filtering. - ingest.py: --stems doc1 doc2 --collection name (new), --stem (unchanged) - rag.py / retrieve.py: --collection, source in chunks, verbose output shows [source] Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-12 11:21:17 +02:00
davide	5b63c423cc	feat(chunks): chunking and post-processing optimization - chunker.py: writes meta.json with the strategy and effective thresholds (target, min_chars, max_chars) for each chunked document - verify_chunks.py: * _load_thresholds(): reads min/max from meta.json instead of the global TARGET_CHARS, eliminating the mismatch between chunker and verify thresholds (h3_aware target=600 -> range 450-750, no longer validated against 225-375) * _ROMAN_END: excludes trailing Roman numerals (XV, XIV...) from the incomplete chunks because they are PDF index artifacts, not broken sentences * PUNCT_END: adds ; as a valid ending (Italian legal clauses) - fix_chunks.py: * _load_thresholds(): uses max_chars from meta.json for a consistent split * _SECONDARY_END: secondary split on ; for multi-clause legal text * Phase 1 (convergence): resolves only blockers (incomplete, empty, no_prefix) without touching warnings, which eliminates the merge->too_long->split->incomplete->merge cycle * Phase 2 (final): a single pass of merge too_short + split too_long once the blockers are cleared Result on dirittopenale: from blocked (265 incomplete) to warnings_only in 2 iterations, with no infinite loops. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-12 11:09:28 +02:00
davide	587238f9f5	docs(conversione): update README: commands, output and execution log Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-12 10:43:17 +02:00
davide	c381d7da3c	docs(readme): add sections on model configuration, ollama test, retrieval and RAG Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-12 10:39:27 +02:00
davide	b5fb363104	chore(config): RAG tuning: 4b model, temperature 0.2, chunk target 300 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-12 10:37:39 +02:00
davide	602dc87045	fix(ingestion): fix chunks path from step-6/ to chunks/ Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-12 10:37:35 +02:00
davide	b49ef8edf0	docs: update README with the complete ingestion flow - README.md: adds step 7 (ingestion) with environment check, basic commands and --force; updates the pipeline header and references - ingestion/README.md: renames from "Step 8" to "Ingestion", updates references from step-6 to chunks/, adds the "Verifica ambiente" (environment check) section, fixes commands to use .venv/bin/python Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-11 16:05:23 +02:00
davide	9e1a72a9e6	refactor: rename step-8 → ingestion Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-11 15:58:54 +02:00
davide	70b304e1d4	docs(readme): complete conversion → chunking flow Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-11 15:46:52 +02:00
davide	02c785678d	feat(chunks): target-based chunking with centralized config Introduces chunks/config.py as the single source of truth for all chunking pipeline parameters. TARGET_CHARS + CHUNK_TOLERANCE replace MIN_CHARS/MAX_CHARS: the chunker aims for a target size and gets as close to it as possible while respecting the absolute constraint that every chunk ends on a sentence boundary (period/punctuation). - config.py: TARGET_CHARS, CHUNK_TOLERANCE, SPLIT_THRESHOLD_FACTOR, PROTECT_TABLES, FIX_MAX_ITERATIONS, STRATEGY_OVERRIDES per strategy - chunker.py: target-based algorithm (emits when the next sentence would exceed upper_body = upper - prefix_len), atomic table protection, MIN/MAX/overlap overrides for each of the 4 strategies - verify_chunks.py: thresholds derived from target*(1±tolerance) - fix_chunks.py: _split_at_boundary always splits on terminal punctuation, recursive fix→verify loop up to FIX_MAX_ITERATIONS, split only for chunks > upper × SPLIT_THRESHOLD_FACTOR Result on bitcoin: 694 chunks, 0 incomplete, 83% in range [450,750], all ending on punctuation regardless of size. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-11 15:45:24 +02:00
davide	508587c5bf	Merge branch 'chunks' into main Integrates the chunking pipeline (chunks/) with all the changes accumulated on the branch, including the 9-stage PDF→Markdown pipeline.	2026-05-11 14:51:53 +02:00
davide	e1b5298b20	feat: integrate 9-stage PDF→Markdown pipeline and test suite Ports from the marker branch the complete rewrite of conversione/_pipeline/ (9 PyMuPDF stages) and the tests/ suite without modifying the rest of the RAG project (ollama/, step-5/, step-6/, step-8/, rag.py, retrieve.py, config.py). requirements.txt: adds PyMuPDF>=1.24.0 and pytest>=8.0, keeps chromadb, removes opendataloader-pdf and pymupdf4llm. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-11 14:44:16 +02:00
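Commit 8d972fa7c6 describes chunk ids prefixed by stem plus a `source` metadata field, so that several documents can share one collection. The sketch below shows one way such records could be assembled. It is not the code in `ingest.py`: the helper name `build_records`, the `::` separator, and the input keys `chunk_id` and `text` are all assumptions made for illustration.

```python
def build_records(stem: str, chunks: list[dict]) -> tuple[list[str], list[str], list[dict]]:
    """Turn one document's chunks into parallel ids / documents / metadatas lists.

    Hypothetical helper; the "::" separator and the input keys are assumed.
    """
    ids: list[str] = []
    documents: list[str] = []
    metadatas: list[dict] = []
    for chunk in chunks:
        # Prefixing with the stem keeps ids unique when two documents
        # both number their chunks from the same starting value.
        ids.append(f"{stem}::{chunk['chunk_id']}")
        documents.append(chunk["text"])
        metadatas.append({
            "source": stem,  # lets a query be restricted to one document
            "sezione": chunk.get("sezione", ""),
            "titolo": chunk.get("titolo", ""),
        })
    return ids, documents, metadatas


if __name__ == "__main__":
    ids_a, docs_a, metas_a = build_records("doc1", [{"chunk_id": "0001", "text": "alpha"}])
    ids_b, docs_b, metas_b = build_records("doc2", [{"chunk_id": "0001", "text": "beta"}])
    # The three lists line up with the ids=, documents= and metadatas=
    # keyword arguments of a ChromaDB collection's add() method.
    print(ids_a + ids_b)
    print([m["source"] for m in metas_a + metas_b])
```

With records shaped this way, a ChromaDB query can be limited to a single document by passing a metadata filter such as `where={"source": "doc1"}` to `collection.query()`.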