PDFSystem-MNBVC · Pipeline Demo

A FinePDFs-inspired, petabyte-scale pipeline that turns PDFs into pretraining data, adapted for the Chinese MNBVC corpus. This demo shows the minimal end-to-end loop that is implemented in the repo today:

Router (XGBoost, 124 features) → MuPDF fast path → OCR Quality Scorer (ModernBERT)

The router decides whether a PDF is cheap to parse with PyMuPDF alone, or whether it needs to go to the (still-stubbed) OCR / VLM backends. Roughly 90% of a typical PDF corpus takes the green fast-path lane.

Upload a PDF

OCR probability threshold

ocr_prob ≥ threshold ⇒ route off the MuPDF fast path

Range: 0 to 1
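The threshold rule above can be sketched in a few lines. This is a minimal illustration, not the repo's code: the function name `choose_lane`, the label `off_fast_path`, and the default of 0.5 are assumptions. The sketch only separates the fast path from everything else, because the page does not say how the non-MuPDF lanes are chosen among.

```python
def choose_lane(ocr_prob: float, threshold: float = 0.5) -> str:
    """Return "mupdf" for the fast path, "off_fast_path" otherwise.

    Hypothetical helper: the name, the labels and the 0.5 default are
    illustrative assumptions, not taken from the repo.
    """
    if not 0.0 <= ocr_prob <= 1.0:
        raise ValueError("ocr_prob must be a probability in [0, 1]")
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must lie in [0, 1]")
    # Rule from the demo: ocr_prob >= threshold routes off the fast path.
    return "off_fast_path" if ocr_prob >= threshold else "mupdf"
```

Under this rule, raising the threshold keeps more PDFs on the fast path, and a threshold of 0 routes every PDF off it, since any probability satisfies `ocr_prob >= 0`.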

Run ModernBERT quality scorer

~3–5 s on CPU. First run downloads ~800 MB.

Pipeline

               ┌────────────────┐
   PDF ───────►│  Stage-A       │  XGBoost · ~10 ms/PDF
               │  Router        │  124 PyMuPDF features
               └────────┬───────┘
                        │  ocr_prob
          ┌─────────────┼─────────────┐
          ▼             ▼             ▼
       MUPDF         PIPELINE        VLM / DEFERRED
       (text-ok)     (OCR, stub)     (VLM, stub)
          │
          ▼
     PyMuPDF blocks ─► Markdown + Segments (with bboxes)
          │
          ▼
     ModernBERT-large OCR quality regressor ─► score ∈ [0, 3]
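The "PyMuPDF blocks ─► Markdown + Segments" step in the diagram can be illustrated roughly as follows. This is a sketch under stated assumptions, not the repo's implementation: `Segment`, `blocks_to_segments`, and `segments_to_markdown` are hypothetical names, and the reading order is a naive top-to-bottom sort. The input tuples follow the shape that PyMuPDF's `page.get_text("blocks")` returns, so the sketch runs without PyMuPDF installed.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# Same shape as the tuples returned by PyMuPDF's page.get_text("blocks"):
# (x0, y0, x1, y1, text, block_no, block_type); block_type 0 = text, 1 = image.
Block = Tuple[float, float, float, float, str, int, int]


@dataclass
class Segment:
    """One text block plus its bounding box (hypothetical structure)."""
    page: int
    bbox: Tuple[float, float, float, float]
    text: str


def blocks_to_segments(blocks: Iterable[Block], page: int) -> List[Segment]:
    segments: List[Segment] = []
    # Naive reading order: top-to-bottom, then left-to-right.
    for x0, y0, x1, y1, text, _no, btype in sorted(blocks, key=lambda b: (b[1], b[0])):
        if btype != 0:
            continue  # skip image blocks
        # Collapse line breaks into single spaces. This suits Latin text;
        # CJK text would likely want lines joined without a space.
        cleaned = " ".join(text.split())
        if cleaned:
            segments.append(Segment(page, (x0, y0, x1, y1), cleaned))
    return segments


def segments_to_markdown(segments: Iterable[Segment]) -> str:
    # One paragraph per segment, separated by blank lines.
    return "\n\n".join(s.text for s in segments)
```

With PyMuPDF available, the blocks for one page would come from `page.get_text("blocks")`; keeping the bounding box on each segment is what allows the page preview to draw extracted regions.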

Backend color legend on page preview

🟢 mupdf — text-ok fast path (implemented)
🟠 pipeline — OCR lane (stub, routing only)
🟣 vlm — VLM lane (stub, routing only)
⚪ deferred — held back until VLM workers are online
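The legend could be represented as a small lookup table, for example when choosing a box color for the page preview. This is a sketch only: the names `Lane`, `LEGEND`, and `bbox_color` are hypothetical, and the RGB values are approximations of the emoji colors chosen for illustration, not values taken from the repo.

```python
from typing import Dict, NamedTuple, Tuple


class Lane(NamedTuple):
    color: Tuple[int, int, int]  # RGB; approximate, illustrative values
    description: str
    implemented: bool            # True only where extraction actually runs


# Backend names and statuses follow the legend; colors are assumptions.
LEGEND: Dict[str, Lane] = {
    "mupdf": Lane((0, 160, 0), "text-ok fast path", True),
    "pipeline": Lane((255, 140, 0), "OCR lane (stub, routing only)", False),
    "vlm": Lane((128, 0, 128), "VLM lane (stub, routing only)", False),
    "deferred": Lane((200, 200, 200), "held back until VLM workers are online", False),
}


def bbox_color(backend: str) -> Tuple[int, int, int]:
    """Look up the preview color for a backend label."""
    try:
        return LEGEND[backend].color
    except KeyError:
        raise ValueError(f"unknown backend: {backend!r}") from None
```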

Upload a PDF and click Run Pipeline.

Reported fields: Backend · P(OCR) · Pages · Quality · Total ms

First page with extracted bboxes

Repo: pdfsystem_mnbvc · Architecture: FinePDFs · Router weights: FinePDFs upstream (Apache-2.0) · Quality model: HuggingFaceFW/finepdfs_ocr_quality_classifier_eng_Latn