PDFSystem-MNBVC · Pipeline Demo
A FinePDFs-inspired, petabyte-scale PDF → pretraining-data pipeline, adapted for the Chinese MNBVC corpus. This demo shows the MVP closed loop that is actually implemented in the repo today:
Router (XGBoost, 124 features) → MuPDF fast path → OCR Quality Scorer (ModernBERT)
The router decides whether a PDF is cheap to parse with PyMuPDF alone, or whether it needs to go to the (still-stubbed) OCR / VLM backends. Roughly 90% of a typical PDF corpus takes the green fast-path lane.
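As a rough illustration of the routing decision described above, the sketch below maps the router's `ocr_prob` output to one of the four backend labels used in this demo. The two thresholds, the `RouteDecision` type, the `route` function, and the `vlm_workers_online` flag are all hypothetical; the repo's actual cut-offs and decision logic are not stated here.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration. The real cut-offs
# used by the router are not documented on this page.
OCR_THRESHOLD = 0.5   # below this, assume the embedded text layer is usable
VLM_THRESHOLD = 0.9   # at or above this, assume plain OCR is not enough


@dataclass(frozen=True)
class RouteDecision:
    backend: str      # one of "mupdf", "pipeline", "vlm", "deferred"
    ocr_prob: float   # router probability that the PDF needs OCR


def route(ocr_prob: float, vlm_workers_online: bool = False) -> RouteDecision:
    """Map an OCR probability to a backend label (assumed threshold scheme)."""
    if not 0.0 <= ocr_prob <= 1.0:
        raise ValueError(f"ocr_prob must be in [0, 1], got {ocr_prob}")
    if ocr_prob < OCR_THRESHOLD:
        backend = "mupdf"       # text-ok fast path
    elif ocr_prob < VLM_THRESHOLD:
        backend = "pipeline"    # OCR lane (stub in the demo)
    elif vlm_workers_online:
        backend = "vlm"         # VLM lane (stub in the demo)
    else:
        backend = "deferred"    # held back until VLM workers are online
    return RouteDecision(backend, ocr_prob)
```

Keeping the decision in a pure function like this would make it easy to tune the thresholds against a labelled sample without touching the feature extractor or the XGBoost model.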
~3–5 s on CPU. First run downloads ~800 MB.
Pipeline
            ┌────────────────┐
PDF ───────►│    Stage-A     │  XGBoost · ~10 ms/PDF
            │     Router     │  124 PyMuPDF features
            └────────┬───────┘
                     │ ocr_prob
       ┌─────────────┼─────────────┐
       ▼             ▼             ▼
     MUPDF       PIPELINE   VLM / DEFERRED
   (text-ok)    (OCR, stub)   (VLM, stub)
       │
       ▼
PyMuPDF blocks ─► Markdown + Segments (with bboxes)
       │
       ▼
ModernBERT-large OCR quality regressor ─► score ∈ [0, 3]
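The "PyMuPDF blocks ─► Markdown + Segments (with bboxes)" step can be sketched roughly as follows. This is a minimal illustration, not the repo's implementation: the `Segment` type, the `blocks_to_markdown` function, the `line_sep` parameter, and the simple top-to-bottom reading order are assumptions. The input tuples follow the shape PyMuPDF returns from `page.get_text("blocks")`.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# One entry from PyMuPDF's page.get_text("blocks"):
# (x0, y0, x1, y1, text, block_no, block_type); block_type 0 = text, 1 = image.
Block = Tuple[float, float, float, float, str, int, int]


@dataclass
class Segment:
    page: int
    bbox: Tuple[float, float, float, float]
    text: str


def blocks_to_markdown(
    blocks: Iterable[Block], page: int = 0, line_sep: str = " "
) -> Tuple[str, List[Segment]]:
    """Turn text blocks into paragraph-style Markdown plus bbox-tagged segments.

    line_sep joins the lines inside one block; for Chinese text an empty
    string is probably the better choice (an assumption, not repo behaviour).
    """
    segments: List[Segment] = []
    for x0, y0, x1, y1, text, _block_no, block_type in blocks:
        if block_type != 0:          # skip image blocks
            continue
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        if lines:
            segments.append(Segment(page, (x0, y0, x1, y1), line_sep.join(lines)))
    # Naive reading order: top to bottom, then left to right.
    segments.sort(key=lambda s: (s.bbox[1], s.bbox[0]))
    markdown = "\n\n".join(s.text for s in segments)
    return markdown, segments


# Usage with a real PDF (requires PyMuPDF to be installed):
#   import fitz
#   doc = fitz.open("sample.pdf")
#   md, segs = blocks_to_markdown(doc[0].get_text("blocks"), page=0)
```

Carrying the bounding box alongside each text segment is what lets a page preview draw overlays for the extracted regions.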
Backend color legend on page preview
- 🟢 mupdf: text-ok fast path (implemented)
- 🟠 pipeline: OCR lane (stub, routing only)
- 🟣 vlm: VLM lane (stub, routing only)
- ⚪ deferred: held back until VLM workers online
Upload a PDF and click Run Pipeline.
Repo: pdfsystem_mnbvc · Architecture: FinePDFs · Router weights: FinePDFs upstream (Apache-2.0) · Quality model: HuggingFaceFW/finepdfs_ocr_quality_classifier_eng_Latn