I indexed the academic papers buried in the DOJ Epstein Files

by am-seo·Feb 20, 2026·7 points·0 comments

AI Analysis

●●● Banger · Zero to One · Big Brain · Niche Gem

207 papers buried in government records, now searchable with full context extraction.

Strengths
  • Novel data source: extracts and indexes papers from a massive public records dump nobody else curated
  • Solid pipeline: heuristic filtering + LLM verification + CrossRef/Semantic Scholar enrichment captures metadata even for non-indexed papers
  • Transparent provenance: links original Bates numbers, dataset source, email metadata for each paper
Weaknesses
  • Limited to 207 papers and niche audience; impact depends on researcher interest in Epstein Files corpus
  • Only 87 of 207 papers have DOI matches; full-text search quality hinges on OCR accuracy of government scans
Target Audience

Researchers, academics, journalists, public records investigators

Similar To

Document indexing projects (e.g., OpenCorporates, Internet Archive Scholar) · Academic metadata aggregators (CrossRef, Semantic Scholar) · FOIA document databases (PACER, SEC EDGAR)

Post Description

The DOJ released ~3.5M pages of Epstein documents across 12 datasets. Buried in them are 207 academic papers and 14 books that nobody was really talking about. From what I understand, these papers aren't usually freely accessible, but since they are public documents, now they are.

I don't know, thought it was interesting to see what this dude was reading. You can check it out at jeescholar.com

Pipeline:
1. Downloaded all 12 DOJ datasets + House Oversight Committee release
2. Heuristic pre-filter (abstract detection, DOI regex, citation block patterns, affiliation strings) to cut noise
3. LLM classifier to confirm and extract metadata
4. CrossRef and Semantic Scholar APIs for DOI matching, citation counts, abstracts
5. 87 of 207 papers got DOI matches; the rest are identified but not in major indexes
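For anyone curious what the heuristic pre-filter in step 2 might look like, here is a minimal sketch. It is not the actual code behind the site: the regexes, the `paper_signals` and `looks_like_paper` names, and the "at least two signals" threshold are all assumptions chosen for illustration.

```python
import re

# All patterns below are illustrative guesses, not the project's real rules.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")
ABSTRACT_RE = re.compile(r"^\s*abstract\b", re.IGNORECASE | re.MULTILINE)
CITATION_RE = re.compile(r"\bet al\.|\(\d{4}\)|^\s*\[\d+\]", re.MULTILINE)
AFFILIATION_RE = re.compile(r"\b(?:University|Institute|Department of)\b")


def paper_signals(text):
    """Return which paper-like signals are present in a page of OCR text."""
    return {
        "doi": bool(DOI_RE.search(text)),
        "abstract": bool(ABSTRACT_RE.search(text)),
        # Require several citation-like hits so a single "(2019)" does not count.
        "citations": len(CITATION_RE.findall(text)) >= 3,
        "affiliation": bool(AFFILIATION_RE.search(text)),
    }


def looks_like_paper(text, min_signals=2):
    """Cheap filter: keep pages with at least min_signals signals (hypothetical threshold)."""
    return sum(paper_signals(text).values()) >= min_signals


# Made-up sample inputs, not taken from the released documents.
sample_paper = (
    "Abstract\nWe study X. doi:10.1000/xyz123\n"
    "Department of Physics, Some University"
)
sample_email = "From: someone\nSubject: lunch\nSee you at noon."
```

Pages that pass a filter like this would then go to the LLM classifier in step 3, so the filter only needs to be cheap and high-recall, not precise.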

Stack: FastAPI + SQLite (FTS5 for full-text search) + Cloudflare R2 for PDFs + nginx/Docker on Hetzner.
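To give a feel for the SQLite FTS5 part, here is a small sketch of a full-text index using Python's built-in `sqlite3` module. The table name, column names, sample row, and `search` helper are all hypothetical, and it assumes the local SQLite build has FTS5 compiled in (most do, but not all).

```python
import sqlite3

# In-memory database for illustration; the schema is an assumption, not the site's.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE papers USING fts5(title, body, bates UNINDEXED)"
)

# Made-up row; the Bates value is a placeholder, not a real document number.
rows = [
    ("Example paper on quantum gravity", "Notes on loop quantum gravity.", "EXAMPLE-0001"),
    ("Example paper on econophysics", "Power laws in market returns.", "EXAMPLE-0002"),
]
conn.executemany(
    "INSERT INTO papers (title, body, bates) VALUES (?, ?, ?)", rows
)


def search(conn, query, limit=10):
    """Run an FTS5 MATCH query and return (title, bates) pairs, best match first."""
    cur = conn.execute(
        "SELECT title, bates FROM papers WHERE papers MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return cur.fetchall()


results = search(conn, "quantum")
```

Marking `bates` as `UNINDEXED` keeps it stored alongside each row without making it part of the searchable text, which suits an identifier that is looked up rather than searched.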

The fields represented are genuinely interesting: there's a cluster of child abuse/grooming research, but also quantum gravity, AGI safety, econophysics, and regenerative medicine. Each paper links back to its original government PDF and Bates number.

For sure not an exhaustive list. Would be happy to add more if anyone finds them.

Similar Projects