Commit Graph

8 Commits

Author SHA1 Message Date
davide d50f7f64a9 step-9: add interactive RAG pipeline
Adds rag.py (interactive retrieval+generation loop), config.py
(all parameters in a single file), test_ollama.py (checks
Ollama without ChromaDB), and a dedicated README.md.
Adds .env.example and updates .gitignore
2026-04-14 15:57:29 +02:00
davide e70a9a41f0 step-6: add fix_chunks.py, make step-6 self-contained
- verify_chunks.py now reads from step-6/<stem>/chunks.json and
  auto-copies from step-5 on first run (input and output both in step-6)
- fix_chunks.py: new script that applies fixes directly on chunks.json
  (merge too-short/incomplete, split too-long, remove empty, add prefix)
  supports --dry-run to preview changes before applying
- step6-fix.md skill updated to use fix_chunks.py workflow:
  dry-run → user approval → apply → re-verify
2026-04-13 23:56:50 +02:00
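The fix workflow described in e70a9a41f0 (remove empty, merge too-short, split too-long, with a dry-run preview) could be sketched roughly as follows. This is not the code of fix_chunks.py: the function names, the thresholds, and the plain list-of-strings chunk format are all assumptions made for illustration, and the prefix step is left out.

```python
# Hypothetical thresholds; the project's real limits are not shown in the log.
MIN_CHARS = 20
MAX_CHARS = 60


def plan_fixes(chunks, min_chars=MIN_CHARS, max_chars=MAX_CHARS):
    """Return (fixed_chunks, log) without modifying the input list."""
    fixed, log = [], []
    for i, text in enumerate(chunks):
        text = text.strip()
        if not text:
            log.append(f"remove empty chunk {i}")
        elif len(text) < min_chars and fixed:
            # Too short: append it to the previous chunk.
            fixed[-1] = fixed[-1] + " " + text
            log.append(f"merge short chunk {i} into previous")
        elif len(text) > max_chars:
            # Too long: split once at the last space before the limit.
            cut = text.rfind(" ", 0, max_chars)
            cut = cut if cut > 0 else max_chars
            fixed.extend([text[:cut].strip(), text[cut:].strip()])
            log.append(f"split long chunk {i} at char {cut}")
        else:
            fixed.append(text)
    return fixed, log


def fix_chunks(chunks, dry_run=True):
    """With dry_run=True, report the planned changes but return the input as-is."""
    fixed, log = plan_fixes(chunks)
    return (chunks, log) if dry_run else (fixed, log)
```

In this sketch a long chunk is split only once, so a very long chunk may still need a second pass; re-running the verification step after applying, as the commit describes, would catch that.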
davide 5126e0d971 step-5: add adaptive chunker
chunker.py splits any revised Markdown (step-4) into RAG-ready chunks.
Supports 4 strategies driven by structure_profile.json: h3_aware,
h2_paragraph_split, paragraph, sliding_window. Respects MIN/MAX_CHARS
and sentence-level overlap. Updates .gitignore and README paths.
2026-04-13 13:48:51 +02:00
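As an illustration of the sliding_window strategy with sentence-level overlap mentioned in 5126e0d971, here is a minimal sketch. It is not taken from chunker.py: the function names, the naive sentence splitter, and the default limits are assumptions, and MIN_CHARS handling is omitted.

```python
import re

# Hypothetical defaults; the project's real values are not shown in the log.
MAX_CHARS = 35
OVERLAP_SENTENCES = 1


def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sliding_window_chunks(text, max_chars=MAX_CHARS, overlap=OVERLAP_SENTENCES):
    """Group sentences into chunks of at most max_chars, repeating the last
    `overlap` sentences of each chunk at the start of the next one."""
    chunks, current = [], []
    for sentence in split_sentences(text):
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            tail = current[-overlap:] if overlap > 0 else []
            # Drop the overlap if it would already overflow with the new sentence.
            if len(" ".join(tail + [sentence])) > max_chars:
                tail = []
            current = tail
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

A single sentence longer than max_chars still becomes its own oversized chunk in this sketch; a real chunker would need a fallback for that case.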
davide 1631dff80d step-4: add revise.py, step4-review skill, README update
- revise.py: automatic pre-processing (ALL-CAPS→##, numbered sections→###,
  TOC removal, broken paragraph merging, whitespace normalization);
  supports N. and Na. numbering patterns; universal heuristics
- .claude/commands/step4-review.md: Claude Code skill for qualitative
  review of clean.md (🔴/🟡/🟢 report + interactive fixes)
- README: document step-4 workflow with revise.py and /step4-review
- .gitignore: exclude step-4/*/ and step-4/revision_log.md
2026-04-13 12:21:30 +02:00
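Two of the heuristics listed in 1631dff80d (ALL-CAPS lines promoted to ## and numbered sections promoted to ###) might look something like the sketch below. It is an assumption-laden illustration, not revise.py itself: the function name, the regular expression, and the reading of "Na." as a number followed by a lowercase letter (for example "2a.") are all guesses.

```python
import re

# Assumed numbering patterns: "3. Title" (N.) and "3a. Title" (Na.).
NUMBERED = re.compile(r"^(\d+[a-z]?)\.\s+(\S.*)$")


def promote_headings(lines):
    """Turn ALL-CAPS lines into ## headings and numbered lines into ### headings."""
    out = []
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("#"):
            out.append(line)  # already a Markdown heading, leave untouched
        elif NUMBERED.match(stripped):
            out.append("### " + stripped)
        elif stripped.isupper() and sum(c.isalpha() for c in stripped) >= 4:
            # Require a few letters so short tokens like "A." are not promoted.
            out.append("## " + stripped)
        else:
            out.append(line)
    return out
```

A rule this simple would also promote ordinary numbered list items, which is one reason a qualitative review pass after the automatic one is useful.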
davide ee25adc0a6 step-3: add detect_structure.py (structure profile, no ML deps)
2026-04-13 10:16:45 +02:00
davide 346e336f1a step-2: add convert_pdf.py (pymupdf4llm, low-memory)
Converts PDFs in sources/ to Markdown using pymupdf4llm (pure C,
~30-50 MB RAM, no ML models). Output: step-2/<stem>/raw.md + clean.md.
2026-04-13 10:01:03 +02:00
davide 3d9ed0141c step-1: add inspect_pdf.py
Automatic page-by-page analysis: 0–100 score, hyphenations,
column layout, anomalous Unicode, repeated headers/footers.
Report saved to step-1/<stem>_step1_report.txt (excluded from git).
2026-04-13 08:51:08 +02:00
davide 42c38c30f7 project setup: gitignore, CLAUDE.md, requirements
Adds the basic project configuration:
- .gitignore: excludes venv, sources, processed, chroma_db, and generated reports
- CLAUDE.md: documents the mandatory use of the venv
- requirements.txt: direct dependencies (pdfplumber for steps 0-1)
2026-04-13 08:02:54 +02:00