# PDF → Markdown

Converts digital PDFs into clean, structured Markdown.

**Stack:** Python · opendataloader-pdf (XY-Cut++) · Java 11+

**Compatible with:** Linux · macOS · Windows (WSL2)

---
## Setup

```bash
# Create and activate a virtual environment, then install the dependencies
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

**Java 11+** is required:

```bash
sudo apt install default-jdk   # Ubuntu/Debian/WSL
java -version
```
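
The Java requirement could also be checked from Python before running the pipeline. The sketch below is not part of the project: `parse_java_major` and `java_is_supported` are hypothetical helper names, and it assumes that `java -version` prints a quoted version string such as `"11.0.22"` or the legacy `"1.8.0_392"` form.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_java_major(version_output: str) -> Optional[int]:
    """Extract the major Java version from `java -version` output (hypothetical helper)."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Legacy numbering: "1.8.0_392" means Java 8.
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def java_is_supported(minimum: int = 11) -> bool:
    """Return True if a `java` binary on PATH reports at least `minimum`."""
    if shutil.which("java") is None:
        return False
    # `java -version` writes its banner to stderr, not stdout.
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    major = parse_java_major(result.stderr or result.stdout)
    return major is not None and major >= minimum


if __name__ == "__main__":
    print("Java 11+ available:", java_is_supported())
```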
---

## Usage

```bash
# Single PDF
python conversione/pipeline.py --stem <name>

# All PDFs in sources/
python conversione/pipeline.py

# Force a re-run
python conversione/pipeline.py --stem <name> --force
```

`--stem` is the PDF file name without its extension.

Example: `sources/analisi1.pdf` → `--stem analisi1`
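
The naming rule matches what `pathlib` exposes as `Path.stem`. The sketch below only illustrates that rule. `list_stems` is a hypothetical helper, not the pipeline's actual code, and it assumes that only files ending in `.pdf` directly inside `sources/` are considered.

```python
from pathlib import Path
from typing import List


def list_stems(sources_dir: str = "sources") -> List[str]:
    """Return the sorted stems of the PDF files directly inside `sources_dir`."""
    folder = Path(sources_dir)
    if not folder.is_dir():
        return []
    return sorted(
        p.stem
        for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


if __name__ == "__main__":
    # The stem is the file name without its extension.
    print(Path("sources/analisi1.pdf").stem)  # analisi1
    # Each value listed here could be passed to --stem.
    print(list_stems())
```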

---

## Output

For each stem, in `conversione/<stem>/`:

| File | Description |
|------|-------------|
| `raw.md` | Raw Markdown (**do not edit**) |
| `clean.md` | Cleaned Markdown (working copy) |
| `structure_profile.json` | Detected structure and metrics |
| `report.json` | Full conversion statistics |

---
## Batch validation

```bash
python conversione/validate.py
```

Prints a status table covering every converted stem.
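
To give a rough idea of what such a status check might involve, the sketch below scans `conversione/` and reports which of the four files from the Output table are present for each stem. It is not the code of `validate.py`: `stem_status`, `collect_status`, and `print_status_table` are hypothetical names, and treating every subdirectory of `conversione/` as a stem is an assumption.

```python
from pathlib import Path
from typing import Dict

# File names taken from the Output table.
EXPECTED_FILES = ("raw.md", "clean.md", "structure_profile.json", "report.json")


def stem_status(stem_dir: Path) -> Dict[str, bool]:
    """Map each expected output file to whether it exists in `stem_dir`."""
    return {name: (stem_dir / name).is_file() for name in EXPECTED_FILES}


def collect_status(root: str = "conversione") -> Dict[str, Dict[str, bool]]:
    """Collect the status of every subdirectory of `root` (assumed: one per stem)."""
    base = Path(root)
    if not base.is_dir():
        return {}
    return {d.name: stem_status(d) for d in sorted(base.iterdir()) if d.is_dir()}


def print_status_table(status: Dict[str, Dict[str, bool]]) -> None:
    """Print one row per stem, marking each expected file as ok or missing."""
    print("stem".ljust(20) + "".join(name.ljust(26) for name in EXPECTED_FILES))
    for stem, files in status.items():
        cells = "".join(
            ("ok" if files[name] else "missing").ljust(26) for name in EXPECTED_FILES
        )
        print(stem.ljust(20) + cells)


if __name__ == "__main__":
    print_status_table(collect_status())
```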

---

See [`conversione/README.md`](conversione/README.md) for details on the pipeline and the supported document types.