Commit Graph

63 Commits

Author SHA1 Message Date
davide f27ebfa101 docs(step-7): update embedding and LLM model guide
Replaces the embedding table with a full evaluation of the models
available on Ollama, with an explicit recommendation for Italian texts.
Reduces the LLM section to the Qwen3.5 family only, with a compatibility note.
Simplifies the chromadb section
2026-04-14 15:57:40 +02:00
davide d50f7f64a9 step-9: add interactive RAG pipeline
Adds rag.py (interactive retrieval+generation loop), config.py
(all parameters in a single file), test_ollama.py (checks
Ollama without ChromaDB) and a dedicated README.md.
Adds .env.example and updates .gitignore
2026-04-14 15:57:29 +02:00
davide 7d95872a8e step-8: add ingest.py, align README
- ingest.py: embed chunks via Ollama nomic-embed-text, index in
  ChromaDB (cosine space); --stem / --force / batch-100 / ETA display
- README: fix step-8 input path (step-5 → step-6), script path
  (scripts/ → step-8/), add --force explanation and real timings
2026-04-14 10:59:40 +02:00
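The batch-100 / ETA behaviour described in the step-8 commit can be pictured with a minimal sketch. It assumes the embedding and indexing calls are passed in as plain callables; `ingest`, `embed_batch` and `store_batch` are hypothetical names used for illustration and are not taken from ingest.py.

```python
import time


def batched(items, size=100):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def ingest(chunks, embed_batch, store_batch, batch_size=100):
    """Embed and store chunks batch by batch, printing a rough ETA.

    embed_batch and store_batch are hypothetical callables standing in
    for the Ollama embedding call and the ChromaDB insert.
    """
    done = 0
    started = time.monotonic()
    for batch in batched(chunks, batch_size):
        vectors = embed_batch(batch)
        store_batch(batch, vectors)
        done += len(batch)
        elapsed = time.monotonic() - started
        # Linear extrapolation from the average time per chunk so far.
        eta = elapsed / done * (len(chunks) - done)
        print(f"{done}/{len(chunks)} chunks, ETA {eta:.0f}s")
    return done
```

Keeping the two calls injectable means the batching and ETA logic can be exercised without a running Ollama server or a ChromaDB collection.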
davide a5f8b8d119 step-7: add check_env.py, README, update requirements
- check_env.py: checks ollama, embedding model, LLM model, chromadb
- Detects any installed embedding/LLM model (no fixed list)
- step-7/README.md: guide to installing/uninstalling Ollama, models, chromadb
- requirements.txt: adds chromadb for step-8
2026-04-14 07:54:04 +02:00
davide e70a9a41f0 step-6: add fix_chunks.py, make step-6 self-contained
- verify_chunks.py now reads from step-6/<stem>/chunks.json and
  auto-copies from step-5 on first run (input and output both in step-6)
- fix_chunks.py: new script that applies fixes directly on chunks.json
  (merge too-short/incomplete, split too-long, remove empty, add prefix)
  supports --dry-run to preview changes before applying
- step6-fix.md skill updated to use fix_chunks.py workflow:
  dry-run → user approval → apply → re-verify
2026-04-13 23:56:50 +02:00
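The fixes listed for fix_chunks.py (remove empty, merge too-short, split too-long, with a --dry-run preview) could look roughly like the sketch below. It assumes chunks are plain strings and uses made-up thresholds; the real chunks.json layout and the script's actual rules are not shown in this log.

```python
def fix_chunks(chunks, min_chars=200, max_chars=1000, dry_run=False):
    """Return (chunks, log). With dry_run=True the input is returned
    unchanged and only the log of planned fixes is produced.

    Thresholds and the string-only chunk format are assumptions.
    """
    fixed, log = [], []
    for text in chunks:
        text = text.strip()
        if not text:
            log.append("remove empty chunk")
            continue
        if fixed and len(text) < min_chars:
            # Too short to stand alone: append to the previous chunk.
            log.append("merge short chunk into previous")
            fixed[-1] = fixed[-1] + "\n\n" + text
            continue
        while len(text) > max_chars:
            # Cut at the last space before the limit, or hard-cut if none.
            cut = text.rfind(" ", 0, max_chars)
            cut = cut if cut > 0 else max_chars
            log.append("split long chunk")
            fixed.append(text[:cut].strip())
            text = text[cut:].strip()
        fixed.append(text)
    return (list(chunks) if dry_run else fixed), log
```

In this sketch a merge can push the previous chunk past `max_chars`, which is one reason a re-verify pass after applying fixes is useful.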
davide 5126e0d971 step-5: add adaptive chunker
chunker.py splits any revised Markdown (step-4) into RAG-ready chunks.
Supports 4 strategies driven by structure_profile.json: h3_aware,
h2_paragraph_split, paragraph, sliding_window. Respects MIN/MAX_CHARS
and sentence-level overlap. Updates .gitignore and README paths.
2026-04-13 13:48:51 +02:00
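To make the sliding_window strategy with MIN/MAX_CHARS and sentence-level overlap more concrete, here is a minimal sketch. The function name, the default limits and the naive regex sentence splitter are assumptions for illustration, not the code in chunker.py.

```python
import re


def sliding_window_chunks(text, max_chars=1000, min_chars=200, overlap_sentences=1):
    """Pack whole sentences into chunks of roughly max_chars, repeating the
    last `overlap_sentences` sentences at the start of the next chunk.

    Limits and splitting rule are assumed values, not those of chunker.py.
    """
    # Naive splitter: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, carried = [], [], 0
    for sentence in sentences:
        if current and len(" ".join(current + [sentence])) > max_chars:
            chunks.append(" ".join(current))
            # Carry the tail of the flushed chunk over as overlap.
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
            carried = len(current)
        current.append(sentence)
    fresh = current[carried:]  # sentences not yet emitted in any chunk
    if fresh:
        if chunks and len(" ".join(current)) < min_chars:
            # Final chunk too short: fold only the new sentences into the previous one.
            chunks[-1] += " " + " ".join(fresh)
        else:
            chunks.append(" ".join(current))
    return chunks
```

A single sentence longer than `max_chars` still becomes its own oversized chunk here, and folding a short tail into the previous chunk can also exceed the limit; a real chunker would need a rule for both cases.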
davide 1631dff80d step-4: add revise.py, step4-review skill, README update
- revise.py: automatic pre-processing (ALL-CAPS→##, numbered sections→###,
  TOC removal, broken paragraph merging, whitespace normalization);
  supports N. and Na. numbering patterns; universal heuristics
- .claude/commands/step4-review.md: Claude Code skill for qualitative
  review of clean.md (🔴/🟡/🟢 report + interactive fixes)
- README: document step-4 workflow with revise.py and /step4-review
- .gitignore: exclude step-4/*/ and step-4/revision_log.md
2026-04-13 12:21:30 +02:00
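The heading heuristics mentioned for revise.py (ALL-CAPS→##, numbered sections→###) can be sketched as follows. This assumes "N." and "Na." mean lines such as "1. Title" and "2a. Title"; the regex and the function name `promote_headings` are illustrative guesses, not the script's actual rules.

```python
import re

# Assumed reading of the "N." and "Na." patterns: "1. Title", "2a. Title".
NUMBERED = re.compile(r"^\d+[a-z]?\.\s+\S.*$")


def promote_headings(lines):
    """Turn ALL-CAPS lines into '## ' headings and numbered section
    titles into '### ' headings; leave every other line untouched."""
    out = []
    for line in lines:
        stripped = line.strip()
        letters = [c for c in stripped if c.isalpha()]
        if stripped.startswith("#"):
            out.append(line)  # already a Markdown heading
        elif NUMBERED.match(stripped):
            out.append("### " + stripped)
        elif len(letters) >= 3 and stripped == stripped.upper():
            out.append("## " + stripped)
        else:
            out.append(line)
    return out
```

A rule this simple would also promote ordinary numbered list items, so a real pre-processor needs extra context checks before rewriting a line.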
davide ee25adc0a6 step-3: add detect_structure.py (structure profile, no ML deps) 2026-04-13 10:16:45 +02:00
davide 346e336f1a step-2: add convert_pdf.py (pymupdf4llm, low-memory)
Converts PDFs in sources/ to Markdown using pymupdf4llm (pure C,
~30-50 MB RAM, no ML models). Output: step-2/<stem>/raw.md + clean.md.
2026-04-13 10:01:03 +02:00
davide 3d9ed0141c step-1: add inspect_pdf.py
Automatic page-by-page analysis: 0–100 score, end-of-line hyphenation,
column layout, anomalous Unicode, repeated headers/footers.
Report saved to step-1/<stem>_step1_report.txt (excluded from git).
2026-04-13 08:51:08 +02:00
davide eda04dc464 step-0: add check_pdf.py
PDF suitability check script for step 0 of the RAG pipeline.
Automatically reads all PDFs in sources/, checks mandatory and
desirable criteria, and saves the report in step-0/.
2026-04-13 08:03:13 +02:00
davide 42c38c30f7 project setup: gitignore, CLAUDE.md, requirements
Adds the project's base configuration:
- .gitignore: excludes venv, sources, processed, chroma_db and generated reports
- CLAUDE.md: documents the mandatory use of the venv
- requirements.txt: direct dependencies (pdfplumber for steps 0-1)
2026-04-13 08:02:54 +02:00
davide 638ba17629 Initial commit 2026-04-12 23:53:13 +02:00