Epstein Pipeline
Open-source Python toolkit for downloading, processing, and analyzing Epstein case documents. Powers the data behind epsteinexposed.com.
What It Does
A complete pipeline for turning raw DOJ releases into structured, searchable data
Downloaders
Fetch documents from DOJ EFTA, Kaggle, HuggingFace, and Archive.org with built-in rate limiting and resume support.
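The two downloader features named above, rate limiting and resume, can be sketched with the standard library alone. This is an illustration under stated assumptions, not the pipeline's actual code: `RateLimiter` and `resume_headers` are hypothetical names, and resuming is assumed to rely on HTTP Range requests, which the remote server must support.

```python
import time
from pathlib import Path


class RateLimiter:
    """Enforce a minimum interval between consecutive requests (hypothetical helper)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._last = None        # time of the previous request, None before the first one

    def wait(self):
        """Block until at least min_interval seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def resume_headers(partial_path):
    """Return an HTTP Range header that continues a partial download, or {} to start fresh."""
    path = Path(partial_path)
    size = path.stat().st_size if path.exists() else 0
    return {"Range": f"bytes={size}-"} if size > 0 else {}
```

A caller would invoke `wait()` before each request and pass `resume_headers(path)` along with it, appending the response body to the partial file when the server answers with status 206.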
OCR Processing
Extract text from scanned PDFs using IBM Docling. Handles multi-page documents, redacted sections, and poor scan quality.
Entity Extraction
Identify persons, organizations, and locations in documents using spaCy NLP. Automatic person linking with fuzzy matching.
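Person linking with fuzzy matching can be illustrated roughly as follows. This sketch assumes the person mentions have already been extracted (the spaCy step is not shown) and uses the standard library's `difflib` as a stand-in matcher. The registry, the IDs, the `normalize` and `link_person` helpers, and the 0.85 cutoff are all hypothetical.

```python
import difflib

# Hypothetical registry of canonical names mapped to person IDs.
KNOWN_PERSONS = {
    "jane doe": "p-0001",
    "john smith": "p-0002",
}


def normalize(name):
    """Collapse whitespace, turn 'Last, First' into 'First Last', and lowercase."""
    name = " ".join(name.split())
    if "," in name:
        last, first = name.split(",", 1)
        name = f"{first.strip()} {last.strip()}"
    return name.lower()


def link_person(mention, cutoff=0.85):
    """Return the ID of the closest known person, or None if nothing clears the cutoff."""
    matches = difflib.get_close_matches(
        normalize(mention), list(KNOWN_PERSONS), n=1, cutoff=cutoff
    )
    return KNOWN_PERSONS[matches[0]] if matches else None
```

Returning `None` below the cutoff leaves uncertain mentions unlinked rather than attaching them to the wrong person, which matters when names in the source documents are misspelled or partially redacted.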
Deduplication
Find and merge duplicate documents using rapidfuzz similarity scoring. Configurable thresholds for title and content matching.
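A minimal sketch of threshold-based duplicate detection follows. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for a rapidfuzz scorer; the document fields (`id`, `title`, `content`), the function names, and the default thresholds are assumptions for illustration, not the pipeline's actual configuration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity in [0, 1]; stand-in for a rapidfuzz scorer."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_duplicates(docs, title_threshold=0.9, content_threshold=0.8):
    """Return (id_a, id_b) pairs whose title AND content both clear their thresholds."""
    pairs = []
    for i, a in enumerate(docs):
        for b in docs[i + 1:]:
            if (similarity(a["title"], b["title"]) >= title_threshold
                    and similarity(a["content"], b["content"]) >= content_threshold):
                pairs.append((a["id"], b["id"]))
    return pairs
```

Separate thresholds let near-identical titles with divergent bodies stay distinct. The pairwise loop is quadratic in the number of documents, so a large corpus would likely need a blocking or hashing step first.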
Validation
JSON schema validation and cross-reference integrity checks. Verify person IDs, document references, and data consistency.
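The cross-reference part of validation can be sketched as a check that every person ID a document mentions exists in the person list. The field names (`id`, `personIds`) and the `check_references` function are hypothetical; the real schema may differ.

```python
from collections import Counter


def check_references(persons, documents):
    """Return a list of human-readable integrity errors; an empty list means clean."""
    errors = []
    id_counts = Counter(p["id"] for p in persons)

    # Person IDs must be unique.
    for pid, count in sorted(id_counts.items()):
        if count > 1:
            errors.append(f"duplicate person id: {pid}")

    # Every person ID referenced by a document must exist.
    for doc in documents:
        for pid in doc.get("personIds", []):
            if pid not in id_counts:
                errors.append(f"document {doc['id']} references unknown person {pid}")
    return errors
```

Collecting every error instead of stopping at the first one lets a single run report all broken references in a dataset.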
Export
Export to JSON (compatible with epsteinexposed.com), CSV for spreadsheets, or SQLite with FTS5 full-text search.
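To show what an SQLite export with FTS5 full-text search might look like, here is a small sketch using Python's built-in `sqlite3` module. The table name, column names, and helper functions are assumptions rather than the pipeline's actual schema, and the code requires an SQLite build with the FTS5 extension enabled.

```python
import sqlite3


def export_to_sqlite(docs, db_path=":memory:"):
    """Write documents into an FTS5 virtual table and return the open connection."""
    conn = sqlite3.connect(db_path)
    # doc_id is stored but not indexed for search; title and content are searchable.
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS documents "
        "USING fts5(doc_id UNINDEXED, title, content)"
    )
    conn.executemany(
        "INSERT INTO documents (doc_id, title, content) VALUES (?, ?, ?)",
        [(d["id"], d["title"], d["content"]) for d in docs],
    )
    conn.commit()
    return conn


def search(conn, query):
    """Return matching document IDs, best match first (FTS5 built-in rank)."""
    rows = conn.execute(
        "SELECT doc_id FROM documents WHERE documents MATCH ? ORDER BY rank",
        (query,),
    )
    return [row[0] for row in rows]
```

Queries use FTS5 match syntax, so `search(conn, "flight AND manifest")` would require both terms to appear in a document.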
CLI Reference
Full command-line interface for every step of the pipeline
epstein download doj            Fetch latest DOJ EFTA releases
epstein ocr ./data/pdfs/        OCR all PDFs in a directory
epstein extract-entities        Run NLP entity extraction
epstein dedup                   Find and merge duplicates
epstein validate                Check data integrity
epstein export --format sqlite  Export to SQLite with FTS5
epstein stats                   Show dataset statistics
Contribute to the Pipeline
The pipeline is open source and welcomes contributions. Add new data sources, improve OCR accuracy, or build new exporters.