davide
1631dff80d
step-4: add revise.py, step4-review skill, README update
- revise.py: automatic pre-processing (ALL-CAPS→##, numbered sections→###,
TOC removal, broken paragraph merging, whitespace normalization);
supports N. and Na. numbering patterns; universal heuristics
- .claude/commands/step4-review.md: Claude Code skill for qualitative
review of clean.md (🔴/🟡/🟢 report + interactive fixes)
- README: document step-4 workflow with revise.py and /step4-review
- .gitignore: exclude step-4/*/ and step-4/revision_log.md
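A minimal sketch of the heading heuristics listed above, not the actual revise.py code. It assumes "N." and "Na." mean prefixes such as "3." and "3a."; the function name promote_headings, the regexes, and the 80-character cutoff are hypothetical choices made for illustration. TOC removal and paragraph merging are left out.

```python
import re

# Hypothetical patterns; the real revise.py may use different rules.
ALL_CAPS = re.compile(r"^[A-Z][A-Z0-9 ,:'\-]{3,}$")   # e.g. "INTRODUCTION"
NUMBERED = re.compile(r"^\d+[a-z]?\.\s+\S")           # e.g. "3. Scope", "3a. Scope"


def promote_headings(text: str) -> str:
    """Promote ALL-CAPS lines to '##' and numbered titles to '###'."""
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            # Already a Markdown heading: keep it unchanged.
            out.append(stripped)
        elif (NUMBERED.match(stripped)
              and len(stripped) <= 80
              and not stripped.endswith(".")):
            # Short numbered line without a final period: treat as a section title.
            # Caveat: a short numbered list item would also match this rule.
            out.append("### " + stripped)
        elif ALL_CAPS.match(stripped):
            out.append("## " + stripped)
        else:
            # Plain text: only drop trailing whitespace.
            out.append(line.rstrip())
    # Whitespace normalization: collapse runs of blank lines to a single one.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(out)).strip() + "\n"
```

Numbered titles are checked before ALL-CAPS lines, so a line such as "1. SCOPE" becomes a "###" heading rather than a "##" one.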
2026-04-13 12:21:30 +02:00
davide
ee25adc0a6
step-3: add detect_structure.py (structure profile, no ML deps)
2026-04-13 10:16:45 +02:00
davide
346e336f1a
step-2: add convert_pdf.py (pymupdf4llm, low-memory)
Converts PDFs in sources/ to Markdown using pymupdf4llm (pure C,
~30-50 MB RAM, no ML models). Output: step-2/<stem>/raw.md + clean.md.
2026-04-13 10:01:03 +02:00
davide
3d9ed0141c
step-1: add inspect_pdf.py
Automatic page-by-page analysis: 0–100 score, hyphenations,
column layout, anomalous Unicode, repeated headers/footers.
Report saved to step-1/<stem>_step1_report.txt (excluded from git).
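Two of the checks listed above (end-of-line hyphenations and repeated headers/footers) could be sketched as follows. This is an illustration under stated assumptions, not the code in inspect_pdf.py: the function names, the min_share threshold, and the idea of passing each page as a plain string are all hypothetical, and the 0–100 scoring is not reproduced because the commit does not describe it.

```python
import re
from collections import Counter

# A word character, a hyphen at end of line, then a lowercase letter on the next line.
HYPHEN_BREAK = re.compile(r"\w-\n[a-z]")


def count_hyphenations(page_text: str) -> int:
    """Count words that appear to be split across a line break."""
    return len(HYPHEN_BREAK.findall(page_text))


def repeated_edge_lines(pages: list[str], min_share: float = 0.6) -> set[str]:
    """Return first/last lines that recur on at least min_share of the pages."""
    counts = Counter()
    for page in pages:
        lines = [ln.strip() for ln in page.splitlines() if ln.strip()]
        if lines:
            # Use a set so a one-line page is counted only once.
            counts.update({lines[0], lines[-1]})
    # Require at least two occurrences so a single page never qualifies.
    threshold = max(2, int(min_share * len(pages)))
    return {line for line, n in counts.items() if n >= threshold}
```

Page numbers differ from page to page, so they are not flagged by repeated_edge_lines; catching them would need a separate pattern-based check.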
2026-04-13 08:51:08 +02:00
davide
eda04dc464
step-0: add check_pdf.py
PDF suitability check script for step 0 of the RAG pipeline.
Automatically reads all PDFs in sources/, checks mandatory and
desirable criteria, and saves the report to step-0/.
2026-04-13 08:03:13 +02:00
davide
42c38c30f7
project setup: gitignore, CLAUDE.md, requirements
Adds the base project configuration:
- .gitignore: excludes venv, sources, processed, chroma_db, and generated reports
- CLAUDE.md: documents the mandatory use of the venv
- requirements.txt: direct dependencies (pdfplumber for steps 0-1)
2026-04-13 08:02:54 +02:00
davide
638ba17629
Initial commit
2026-04-12 23:53:13 +02:00