PDFSystem-MNBVC · Pipeline Demo

A FinePDFs-inspired, petabyte-scale PDF → pretraining-data pipeline, adapted for the Chinese MNBVC corpus. This demo shows the MVP closed loop that is implemented in the repo today:

Router (XGBoost, 124 features) · MuPDF fast path · OCR Quality Scorer (ModernBERT)

The router decides whether a PDF is cheap to parse with PyMuPDF alone, or whether it needs to go to the (still-stubbed) OCR / VLM backends. Roughly 90% of a typical PDF corpus takes the green fast-path lane.
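The routing decision can be sketched as a threshold rule on the router's `ocr_prob` output. This is a minimal illustration, not the repo's code: the two cut-offs, the `route` function, and the `vlm_online` flag are hypothetical, and the assumption that higher `ocr_prob` maps first to the OCR lane and then to the VLM lane is not confirmed by the demo.

```python
# Hypothetical thresholds: the demo does not state the real cut-offs.
LOW_OCR_PROB = 0.5
HIGH_OCR_PROB = 0.9


def route(ocr_prob: float, vlm_online: bool = False) -> str:
    """Map a router probability to one of the backend lanes.

    Returns "mupdf", "pipeline", "vlm", or "deferred".
    """
    if not 0.0 <= ocr_prob <= 1.0:
        raise ValueError(f"ocr_prob must be in [0, 1], got {ocr_prob}")
    if ocr_prob < LOW_OCR_PROB:
        # The embedded text layer looks usable: take the PyMuPDF fast path.
        return "mupdf"
    if ocr_prob < HIGH_OCR_PROB:
        # Needs OCR (a stub in the demo, so only the routing label is produced).
        return "pipeline"
    # Hardest documents go to the VLM lane, or wait until workers are available.
    return "vlm" if vlm_online else "deferred"


if __name__ == "__main__":
    for p in (0.1, 0.7, 0.95):
        print(p, route(p))
```

Keeping the rule as a pure function of the probability makes the cut-offs easy to tune against a labelled sample without touching the feature extractor or the model.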

Each run takes ~3–5 s on CPU. The first run also downloads ~800 MB.

Pipeline

               ┌────────────────┐
   PDF ───────►│  Stage-A       │  XGBoost · ~10 ms/PDF
               │  Router        │  124 PyMuPDF features
               └────────┬───────┘
                        │  ocr_prob
          ┌─────────────┼─────────────┐
          ▼             ▼             ▼
       MUPDF         PIPELINE        VLM / DEFERRED
       (text-ok)     (OCR, stub)     (VLM, stub)
          │
          ▼
     PyMuPDF blocks ─► Markdown + Segments (with bboxes)
          │
          ▼
     ModernBERT-large OCR quality regressor ─► score ∈ [0, 3]
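
The fast-path step from PyMuPDF blocks to Markdown plus segments might look roughly like the sketch below. It assumes the input is the list of tuples that PyMuPDF's `page.get_text("blocks")` returns, `(x0, y0, x1, y1, text, block_no, block_type)`, where `block_type` 0 marks a text block. The `Segment` type and `blocks_to_markdown` function are hypothetical names for illustration, and the reading order (top to bottom, then left to right) is a simplification.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    page: int
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coordinates
    text: str


def blocks_to_markdown(blocks, page_no: int = 0):
    """Turn PyMuPDF-style block tuples into (markdown, segments)."""
    segments = []
    # Naive reading order: sort by top edge, then by left edge.
    ordered = sorted(blocks, key=lambda b: (b[1], b[0]))
    for x0, y0, x1, y1, text, _block_no, block_type in ordered:
        if block_type != 0:
            continue  # skip image blocks
        # Join wrapped lines with a space. This suits Latin-script text;
        # CJK text would usually be joined without a separator.
        cleaned = " ".join(text.split())
        if cleaned:
            segments.append(Segment(page_no, (x0, y0, x1, y1), cleaned))
    # One Markdown paragraph per text block.
    markdown = "\n\n".join(seg.text for seg in segments)
    return markdown, segments


if __name__ == "__main__":
    sample = [
        (10, 50, 100, 60, "second\nline", 1, 0),
        (10, 10, 100, 20, "first", 0, 0),
        (0, 0, 5, 5, "<image>", 2, 1),
    ]
    md, segs = blocks_to_markdown(sample)
    print(md)
    print(segs[0].bbox)
```

Keeping the bounding box on each segment is what lets a page preview highlight the region a piece of text came from.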

Backend color legend on page preview

  • 🟢 mupdf — text-ok fast path (implemented)
  • 🟠 pipeline — OCR lane (stub, routing only)
  • 🟣 vlm — VLM lane (stub, routing only)
  • deferred — held back until VLM workers are online

Upload a PDF and click Run Pipeline.


Repo: pdfsystem_mnbvc · Architecture: FinePDFs · Router weights: FinePDFs upstream (Apache-2.0) · Quality model: HuggingFaceFW/finepdfs_ocr_quality_classifier_eng_Latn