Commit Graph

9 Commits

Author SHA1 Message Date
davide e70a9a41f0 step-6: add fix_chunks.py, make step-6 self-contained
- verify_chunks.py now reads from step-6/<stem>/chunks.json and
  auto-copies from step-5 on first run (input and output both in step-6)
- fix_chunks.py: new script that applies fixes directly on chunks.json
  (merge too-short/incomplete, split too-long, remove empty, add prefix)
  supports --dry-run to preview changes before applying
- step6-fix.md skill updated to use fix_chunks.py workflow:
  dry-run → user approval → apply → re-verify
2026-04-13 23:56:50 +02:00
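The dry-run workflow described in this commit (preview, approve, apply) might look roughly like the sketch below. This is not the code of fix_chunks.py: the chunk schema (a list of dicts with a "text" key), the MIN_CHARS value and the function names plan_fixes and fix_chunks are assumptions for illustration, and only two of the listed fixes (remove empty, merge too-short) are shown.

```python
MIN_CHARS = 20  # hypothetical threshold; the real value is not given in the log


def plan_fixes(chunks, min_chars=MIN_CHARS):
    """Return (fixed_chunks, actions) without mutating the input list."""
    fixed, actions = [], []
    for i, chunk in enumerate(chunks):
        text = chunk["text"].strip()
        if not text:
            # Empty chunks are dropped.
            actions.append(f"remove empty chunk {i}")
        elif len(text) < min_chars and fixed:
            # Too-short chunks are appended to the previous kept chunk.
            merged = fixed[-1]["text"] + " " + text
            fixed[-1] = {**fixed[-1], "text": merged}
            actions.append(f"merge short chunk {i} into previous")
        else:
            fixed.append({**chunk, "text": text})
    return fixed, actions


def fix_chunks(chunks, dry_run=True):
    """Print the planned actions; return the fixed list only when not a dry run."""
    fixed, actions = plan_fixes(chunks)
    for action in actions:
        print(("[dry-run] " if dry_run else "") + action)
    return chunks if dry_run else fixed
```

Computing the plan separately from applying it means the preview and the real run share the same logic, so what the user approves is what gets written.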
davide 5126e0d971 step-5: add adaptive chunker
chunker.py splits any revised Markdown (step-4) into RAG-ready chunks.
Supports 4 strategies driven by structure_profile.json: h3_aware,
h2_paragraph_split, paragraph, sliding_window. Respects MIN/MAX_CHARS
and sentence-level overlap. Updates .gitignore and README paths.
2026-04-13 13:48:51 +02:00
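As an illustration of the sliding_window strategy with sentence-level overlap mentioned in this commit, here is a minimal sketch. It is not taken from chunker.py: the function names, the naive sentence splitter and the default limits are assumptions, and MIN_CHARS handling is left out.

```python
import re

MAX_CHARS = 200        # hypothetical limit; the real value lives in chunker.py
OVERLAP_SENTENCES = 1  # hypothetical: sentences repeated between adjacent chunks


def split_sentences(text):
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sliding_window(text, max_chars=MAX_CHARS, overlap=OVERLAP_SENTENCES):
    """Group sentences into chunks of at most max_chars, repeating the
    last `overlap` sentences of each chunk at the start of the next.
    A single sentence longer than max_chars becomes its own oversized chunk."""
    chunks, current = [], []
    for sentence in split_sentences(text):
        if current and len(" ".join(current + [sentence])) > max_chars:
            chunks.append(" ".join(current))
            # Carry the tail of the finished chunk into the next one.
            current = current[-overlap:] if overlap else []
            if len(" ".join(current + [sentence])) > max_chars:
                current = []  # the overlap would overflow; drop it
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

For example, `sliding_window("One. Two. Three. Four.", max_chars=12)` returns `["One. Two.", "Two. Three.", "Three. Four."]`, with each chunk sharing one sentence with its neighbour.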
davide 1631dff80d step-4: add revise.py, step4-review skill, README update
- revise.py: automatic pre-processing (ALL-CAPS→##, numbered sections→###,
  TOC removal, broken paragraph merging, whitespace normalization);
  supports N. and Na. numbering patterns; universal heuristics
- .claude/commands/step4-review.md: Claude Code skill for qualitative
  review of clean.md (🔴/🟡/🟢 report + interactive fixes)
- README: document step-4 workflow with revise.py and /step4-review
- .gitignore: exclude step-4/*/ and step-4/revision_log.md
2026-04-13 12:21:30 +02:00
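Two of the pre-processing rules listed in this commit (ALL-CAPS lines to ##, numbered sections to ###) could be sketched as follows. This is an assumption-laden illustration, not the code of revise.py: the regex reads "N." and "Na." as patterns like "3." and "3a.", the three-letter minimum is arbitrary, and promote_headings is a hypothetical name. A real implementation would also need to tell numbered sections apart from ordinary numbered lists.

```python
import re

# Assumed reading of the "N." and "Na." patterns: "3. Title" and "3a. Title".
NUMBERED = re.compile(r"^(\d+[a-z]?)\.\s+(\S.*)$")


def promote_headings(lines):
    """Turn numbered section lines into ### headings and ALL-CAPS lines
    into ## headings; leave every other line untouched."""
    out = []
    for line in lines:
        stripped = line.strip()
        letters = [c for c in stripped if c.isalpha()]
        if stripped.startswith("#"):
            out.append(line)  # already a Markdown heading
        elif NUMBERED.match(stripped):
            out.append("### " + stripped)
        elif len(letters) >= 3 and all(c.isupper() for c in letters):
            out.append("## " + stripped)
        else:
            out.append(line)
    return out
```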
davide ee25adc0a6 step-3: add detect_structure.py (structure profile, no ML deps) 2026-04-13 10:16:45 +02:00
davide 346e336f1a step-2: add convert_pdf.py (pymupdf4llm, low-memory)
Converts PDFs in sources/ to Markdown using pymupdf4llm (pure C,
~30-50 MB RAM, no ML models). Output: step-2/<stem>/raw.md + clean.md.
2026-04-13 10:01:03 +02:00
davide 3d9ed0141c step-1: add inspect_pdf.py
Automatic page-by-page analysis: 0–100 score, end-of-line hyphenation,
column layouts, anomalous Unicode, repeated headers/footers.
Report saved to step-1/<stem>_step1_report.txt (excluded from git).
2026-04-13 08:51:08 +02:00
davide eda04dc464 step-0: add check_pdf.py
PDF suitability check script for step 0 of the RAG pipeline.
Automatically reads all PDFs in sources/, checks the mandatory and
desirable criteria, and saves the report in step-0/.
2026-04-13 08:03:13 +02:00
davide 42c38c30f7 project setup: gitignore, CLAUDE.md, requirements
Adds the basic project configuration:
- .gitignore: excludes venv, sources, processed, chroma_db and generated reports
- CLAUDE.md: documents the mandatory use of the venv
- requirements.txt: direct dependencies (pdfplumber for steps 0-1)
2026-04-13 08:02:54 +02:00
davide 638ba17629 Initial commit 2026-04-12 23:53:13 +02:00