OPEN SOURCE

Epstein Pipeline

The data engine behind epsteinexposed.com. Downloads, OCRs, extracts entities, cross-references OSINT databases, and exports 2.1M+ documents to Neon Postgres with vector search. MIT licensed. pip installable.


2.1M+ Documents Processed
2.0M+ OCR Texts Extracted
1,700+ Persons Identified
2.4M+ Person-Doc Links
31 CLI Commands
95+ GitHub Stars

Terminal

# Install from PyPI

pip install epstein-pipeline

# Or clone and install with all extras

git clone https://github.com/stonesalltheway1/Epstein-Pipeline.git

cd Epstein-Pipeline

pip install -e ".[all]"

# Optional install groups: ocr, nlp, embeddings, neon, ai, audit, dev

pip install "epstein-pipeline[ocr,nlp,neon]"

10-Stage Pipeline

Raw DOJ releases in, structured searchable database out

1. Download: Fetch raw PDFs from 9 sources
2. OCR: Multi-backend text extraction with fallback
3. Entity Extraction: spaCy + GLiNER NER on full text
4. Person Linking: Fuzzy match names to canonical IDs
5. Classification: Zero-shot BART into 12 categories
6. Deduplication: Hash, MinHash/LSH, and semantic passes
7. Chunking: Paragraph-aware, 450 tokens per chunk
8. Embedding: nomic-embed-text-v2-moe vectors
9. Validation: Schema checks + cross-reference integrity
10. Export: JSON, CSV, SQLite, Neon, or direct site sync

What It Does

31 CLI commands covering ingestion, processing, cross-referencing, and export

9 Data Sources

Pull documents from DOJ EFTA (DS1-DS12), Kaggle, HuggingFace, Archive.org, FBI Vault, CourtListener, House Oversight, DocumentCloud, and Sea_Doughnut research databases.

4 OCR Backends

PyMuPDF for text-layer extraction, Surya for 90+ languages, olmOCR 2 (Allen AI) for VLM-based accuracy, and IBM Docling for table/layout understanding. Automatic fallback chain.
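As an illustration of what an automatic fallback chain can look like, here is a minimal sketch. It is not the pipeline's actual code: the function name `ocr_with_fallback`, the `min_chars` threshold, and the stand-in backends are all hypothetical, and real backends would wrap PyMuPDF, Surya, olmOCR 2, or Docling.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical backend signature: takes a PDF path, returns extracted text.
Backend = Callable[[str], str]


def ocr_with_fallback(
    path: str,
    backends: Sequence[Tuple[str, Backend]],
    min_chars: int = 20,  # assumed cutoff for "usable" output
) -> Tuple[Optional[str], str]:
    """Try each backend in order; return (backend_name, text) for the first usable result."""
    for name, backend in backends:
        try:
            text = backend(path)
        except Exception:
            continue  # this backend failed, so move on to the next one
        if len(text.strip()) >= min_chars:
            return name, text
    return None, ""  # every backend failed or returned too little text


# Stand-in backends for illustration only.
def text_layer(path: str) -> str:
    return ""  # e.g. a scanned PDF with no embedded text layer


def broken_backend(path: str) -> str:
    raise RuntimeError("model not installed")


def vision_ocr(path: str) -> str:
    return "Deposition transcript, page 1 of 240."


chain = [
    ("text_layer", text_layer),
    ("broken", broken_backend),
    ("vision_ocr", vision_ocr),
]
winner, text = ocr_with_fallback("scan.pdf", chain)
print(winner, "->", text)
```

Ordering the chain from cheapest to most expensive keeps the fast text-layer path for born-digital PDFs and reserves the heavier models for scans.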

NLP Entity Extraction

spaCy transformer models and GLiNER zero-shot NER identify persons, organizations, locations, dates, and financial amounts. Fuzzy name matching links entities to canonical person IDs.
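The name-linking step can be sketched roughly as follows. The project lists rapidfuzz in its stack; this example swaps in the standard library's `difflib` so it runs without extra installs. The registry, the person IDs, the `link_person` helper, and the 0.85 threshold are assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher
from typing import Dict, Optional, Tuple


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and sort tokens so word order does not matter."""
    tokens = re.findall(r"[a-z]+", name.lower())
    return " ".join(sorted(tokens))


def link_person(
    mention: str,
    canonical: Dict[str, str],
    threshold: float = 0.85,  # assumed cutoff, not the pipeline's setting
) -> Tuple[Optional[str], float]:
    """Return (person_id, score) for the best match, or (None, score) below the threshold."""
    best_id, best_score = None, 0.0
    target = normalize(mention)
    for person_id, name in canonical.items():
        score = SequenceMatcher(None, target, normalize(name)).ratio()
        if score > best_score:
            best_id, best_score = person_id, score
    if best_score >= threshold:
        return best_id, best_score
    return None, best_score


# Placeholder registry: IDs and names are invented for this example.
registry = {"p-0001": "Jane Q. Example", "p-0002": "John Placeholder"}

print(link_person("EXAMPLE, Jane Q", registry))   # reordered, different case
print(link_person("Jon Placeholder", registry))   # one-letter typo
print(link_person("Unrelated Name", registry))    # no confident match
```

Sorting tokens before comparison makes "Last, First" and "First Last" forms score identically; the threshold then decides how much spelling drift is tolerated before a mention is left unlinked.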

Document Classification

Zero-shot BART classifier sorts documents into 12 legal categories: court filings, depositions, financial records, flight logs, correspondence, law enforcement, and more.

3-Pass Deduplication

First pass: SHA-256 exact hashing. Second pass: MinHash/LSH approximate matching. Third pass: semantic cosine similarity on embeddings. Configurable thresholds at each stage.
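To make the three passes concrete, here is a dependency-free sketch of the signal each one computes. It is an assumption-laden illustration, not the pipeline's implementation: it compares MinHash signatures pairwise and omits the LSH banding step, the shingle size and signature length are arbitrary, and the embedding vectors are toy stand-ins.

```python
import hashlib
import math
from typing import List, Sequence, Set


def exact_hash(text: str) -> str:
    """Pass 1: byte-identical documents share a SHA-256 digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def shingles(text: str, k: int = 3) -> Set[str]:
    """Overlapping k-word windows used as the MinHash input set."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items: Set[str], num_perm: int = 64) -> List[int]:
    """Pass 2: keep the minimum hash under each of num_perm seeded hash functions."""
    signature = []
    for seed in range(num_perm):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{item}".encode("utf-8"), digest_size=8).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def minhash_similarity(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of matching slots, which estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Pass 3: cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


base = ("the witness stated that the flight departed from the island on the "
        "morning of the fourth and returned two days later")
near = base.replace("later", "afterwards")  # near-duplicate: one word differs
other = ("quarterly wire transfers were routed through three separate "
         "holding accounts before settlement")

sig_base = minhash_signature(shingles(base))
print("exact match:", exact_hash(base) == exact_hash(near))
print("minhash near:", minhash_similarity(sig_base, minhash_signature(shingles(near))))
print("minhash other:", minhash_similarity(sig_base, minhash_signature(shingles(other))))
print("cosine (toy vectors):", cosine([1.0, 0.0], [1.0, 0.0]))
```

Running the cheap exact pass first removes identical files before the approximate passes do any work, and each pass would apply its own configurable threshold to the score it produces.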

Vector Embeddings + Search

nomic-embed-text-v2-moe generates 768-dim (or 256-dim Matryoshka) vectors. Paragraph-aware chunking at 450 tokens. Stored in pgvector with cosine ANN indexes for semantic search.
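Paragraph-aware chunking with a 450-token budget might look something like the sketch below. It makes two simplifying assumptions that may not match the pipeline: tokens are approximated by whitespace-separated words rather than the embedding model's tokenizer, and paragraphs are detected by blank lines. The `chunk_text` name is hypothetical.

```python
import re
from typing import List


def chunk_text(text: str, max_tokens: int = 450) -> List[str]:
    """Pack whole paragraphs into chunks of at most max_tokens words."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0

    for para in paragraphs:
        words = para.split()

        if len(words) > max_tokens:
            # Flush what we have, then split the oversized paragraph into windows.
            if current:
                chunks.append("\n\n".join(current))
                current, current_len = [], 0
            for i in range(0, len(words), max_tokens):
                chunks.append(" ".join(words[i:i + max_tokens]))
            continue

        if current and current_len + len(words) > max_tokens:
            # Adding this paragraph would exceed the budget: start a new chunk.
            chunks.append("\n\n".join(current))
            current, current_len = [], 0

        current.append(para)
        current_len += len(words)

    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Four synthetic paragraphs of 200, 200, 200, and 1000 words.
doc = "\n\n".join(" ".join(["word"] * n) for n in (200, 200, 200, 1000))
sizes = [len(chunk.split()) for chunk in chunk_text(doc)]
print(sizes)  # [400, 200, 450, 450, 100]
```

Breaking only at paragraph boundaries where possible keeps each embedded chunk topically coherent; only a paragraph longer than the budget is cut mid-text.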

OSINT Cross-Reference

Match persons against OpenSanctions (OFAC, EU, UN, Interpol), ICIJ Offshore Leaks (Panama/Paradise/Pandora Papers), FEC political donations, and IRS Form 990 nonprofit filings.

Person Integrity Audit

5-phase audit using Claude AI: deduplication check, Wikidata verification, fact-checking against source documents, internal coherence scoring, and confidence grading per person.

Neon Postgres Export

Direct push to Neon with pgvector, pg_trgm, and tsvector/GIN full-text search. Also exports to JSON, CSV, and SQLite with FTS5. Pydantic models use camelCase aliases to match the site's TypeScript types.

CLI Reference

31 commands organized by stage

Data Ingestion

epstein download doj          Fetch latest DOJ EFTA releases (DS1-DS12)
epstein download kaggle       Pull Kaggle Epstein Ranker dataset
epstein download huggingface  Pull HuggingFace structured data
epstein download archive      Pull Archive.org media collections
epstein import sea-doughnut   Import 1.38M Sea_Doughnut docs

Processing

epstein ocr ./data/pdfs/    Multi-backend OCR (auto/pymupdf/surya/olmocr/docling)
epstein extract-entities    Run spaCy + GLiNER entity extraction
epstein classify            Zero-shot BART document classification
epstein dedup               3-pass deduplication (hash/minhash/semantic)
epstein embed               Generate vector embeddings
epstein analyze-redactions  Detect redacted sections + recover text
epstein extract-images      Extract images from PDFs (optional AI description)
epstein transcribe          Audio/video transcription via faster-whisper

OSINT Cross-Reference

epstein check-sanctions   Match against OFAC, EU, UN, Interpol, PEP lists
epstein check-icij        Cross-reference Panama/Paradise/Pandora Papers
epstein check-fec         Search FEC political donation records
epstein check-nonprofits  Search IRS Form 990 nonprofit filings

Export + Database

epstein export json     Export to JSON (site-compatible)
epstein export csv      Export to CSV for spreadsheets
epstein export sqlite   Export to SQLite with FTS5 full-text search
epstein export-neon     Push to Neon Postgres with pgvector
epstein search 'query'  Semantic search against pgvector
epstein migrate         Run idempotent Neon schema migration

Quality + Audit

epstein validate       JSON schema validation + integrity checks
epstein audit-persons  5-phase AI person integrity audit
epstein stats          Show processing statistics
epstein build-graph    Build knowledge graph (JSON + GEXF)

Built With

Core

Python 3.10+, Click CLI, Pydantic v2, httpx, Rich

OCR

PyMuPDF, Surya, olmOCR 2, IBM Docling

NLP

spaCy, GLiNER, BART-large-mnli, rapidfuzz

Embeddings

nomic-embed-text-v2-moe, sentence-transformers, PyTorch

Database

Neon Postgres, pgvector, pg_trgm, SQLite FTS5

AI

OpenAI, Anthropic Claude, Voyage AI, Cohere

Infra

Docker, GitHub Actions, pytest, ruff, mypy

Data Sources

Source                     Contents                     Notes
DOJ EFTA (DS1-DS12)        2.73M documents              ~218 GB
Sea_Doughnut Research DBs  1.38M documents              849K redaction analyses
Kaggle (Epstein Ranker)    ~23,700 documents            AI-analyzed
HuggingFace                Structured emails + filings
Archive.org                Media collections            Photos, videos, audio
FBI Vault                  FBI records
CourtListener              Court filings
House Oversight            Congressional releases
DocumentCloud              Searchable court docs

Contribute

MIT licensed. Add new data sources, improve OCR accuracy, write new exporters, or run the pipeline on your own infrastructure.
