The Robert F. Kennedy Department of Justice Building in Washington, D.C. — headquarters of the agency that printed, scanned, and OCR'd 3.5 million pages of born-digital evidence before releasing them under the EFTA.
Three Million Pages of Nothing
The DOJ released the Epstein files as required by law — every page printed, scanned at 96 DPI, and run through OCR. A systematic analysis across all 12 datasets confirms: the original metadata, text fidelity, and digital forensic value have been destroyed.
On April 20, 2015, Jeffrey Epstein received an email from the editor-in-chief of Scientific American. "Not a rush, obviously," she wrote, "but was just thinking of Jeffrey and hoping he is well. Offer is still open to visit!" The email was sent from an iPhone, routed through Gmail, and stored on a government server after seizure.[1]
That email, a few lines of text and maybe 2KB in its original digital form, is now a 48-kilobyte PDF. It was printed on paper, placed on a flatbed scanner, photographed at 96 dots per inch, and then run through optical character recognition software that attempted to reconstruct the text the process had just destroyed. The OCR layer reads the original @gmail.com as ®gmail.com. The metadata (sender, recipient, timestamp, routing headers, the digital signature that proved it was authentic) is gone. The PDF's producer and creator fields are blank.[1]
This is not one document. This is every document. All 3.5 million pages released under the Epstein Files Transparency Act.
What Gets Destroyed
When a born-digital document is printed and scanned, the destruction is comprehensive and irreversible:
Metadata: creation date, author, email headers (To, From, CC, BCC, Date), edit history, software version. The DOJ's scanned PDFs have blank producer and creator fields. The original authorship, timestamps, and chain of custody are permanently erased.
Text fidelity: the original document had perfect, machine-generated text. After scanning, the OCR layer introduces character-level errors. In our sampling, we found @ rendered as ®, email addresses garbled into nonsense like mUeevacation©gmail.com, and names corrupted. Every such error is a potential missed search result.[2]
Digital signatures: any cryptographic proof of authenticity or forensic chain of custody is gone.
Resolution: the DOJ scanned at approximately 96 DPI. Modern office scanners default to 300 DPI. The National Archives and Records Administration recommends 300-400 DPI for text documents. At 96 DPI, fine print, handwritten marginal notes, and faded stamps become difficult to read, and far more difficult for OCR to process accurately.
File size: a 1-page typed email is roughly 2-5KB as native digital text. Scanned and OCR'd at 96 DPI, it balloons to 25-50KB. A 100-page document becomes 42 megabytes, on the order of a hundred times larger per page than native text, for strictly less information.[3]
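The size figures above imply an inflation factor that is easy to check by hand. The sketch below does that arithmetic. It assumes decimal units (1 KB = 1,000 bytes) and assumes the 100-page document would also have been about 2-5 KB per page in native form, which the article does not state.

```python
# Back-of-the-envelope check of the file-size figures quoted above.
# Assumptions (not from the source): decimal units, and a native size of
# 2-5 KB per page for the 100-page document as well.
KB = 1_000
MB = 1_000_000


def inflation_range(scanned_bytes, native_bytes):
    """Return (smallest, largest) scanned/native size ratio for two (low, high) ranges."""
    low_scanned, high_scanned = scanned_bytes
    low_native, high_native = native_bytes
    return low_scanned / high_native, high_scanned / low_native


native_page = (2 * KB, 5 * KB)        # native text, per page
scanned_email = (25 * KB, 50 * KB)    # one-page email after print/scan/OCR
per_page_long_doc = 42 * MB / 100     # 42 MB spread over 100 pages

email_ratio = inflation_range(scanned_email, native_page)
long_doc_ratio = inflation_range((per_page_long_doc, per_page_long_doc), native_page)

print("one-page email:", email_ratio)
print("100-page document, per page:", long_doc_ratio)
```

On these inputs the one-page email comes out 5 to 25 times larger, and the 100-page document roughly 84 to 210 times larger per page.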
Three and a Half Million Photographs
To confirm that this pattern holds across the entire release, we randomly sampled 20 Bates numbers from each of the 12 DOJ datasets — 240 documents total, selected with a fixed random seed for reproducibility. Of the 64 documents successfully downloaded, we analyzed all 197 pages using PyMuPDF, checking each for full-page scan images, text layers, embedded fonts, and image resolution.
The result was uniform.
Every dataset. Every document type — emails, court filings, FBI reports, bank records, grand jury transcripts. Documents that were typed on computers, sent over the internet, and stored on government servers were printed onto paper and then photographed.
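For readers who want to repeat the sampling step, a minimal sketch of a seeded, per-dataset draw follows. The seed value, the function name `sample_bates`, and the toy Bates index are all hypothetical; the article does not publish its seed or its index of Bates numbers.

```python
import random


def sample_bates(bates_by_dataset, per_dataset=20, seed=2024):
    """Draw a reproducible random sample of Bates numbers from each dataset.

    The seed is a placeholder; any fixed value gives a repeatable draw.
    """
    rng = random.Random(seed)  # private generator, so global state is untouched
    sample = {}
    for name in sorted(bates_by_dataset):   # fixed iteration order
        pool = sorted(bates_by_dataset[name])  # fixed pool order
        k = min(per_dataset, len(pool))
        sample[name] = rng.sample(pool, k)
    return sample


# Toy index standing in for the real one: 12 datasets of made-up Bates numbers.
toy_index = {
    f"dataset_{i:02d}": [f"EFTA{i:02d}{n:06d}" for n in range(1000)]
    for i in range(1, 13)
}

picked = sample_bates(toy_index)
total = sum(len(v) for v in picked.values())
print(total, "Bates numbers selected")
```

Sorting the dataset names and each pool before drawing keeps the result independent of how the index happened to be loaded, so the same seed always yields the same 240 numbers.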
The OCR text layers range from a single word (the Bates number stamp `EFTA00000205`) to full transcripts with hundreds of words per page.[4] The fonts embedded in the OCR layer are generic substitutes: Helvetica, Courier, ArialMT. The original document fonts are gone. An email composed in Outlook, a memo drafted in Word, a spreadsheet exported from Excel: all now wear the same uniform of machine-generated Helvetica overlaid on a blurry photograph.[5]
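The per-page checks described above (text layer, embedded fonts, image dimensions, producer and creator fields) could be gathered with PyMuPDF roughly as follows. This is a sketch under assumptions, not the project's actual script: the function names and the shape of the returned dictionary are invented for illustration, and PyMuPDF is imported lazily so the pure helper runs without it.

```python
def summarize_page(text, fonts, images):
    """Reduce raw per-page facts to the fields discussed above."""
    return {
        "has_text_layer": bool(text.strip()),
        "word_count": len(text.split()),
        "fonts": sorted({name for name in fonts if name}),
        "image_sizes": list(images),  # (width_px, height_px) pairs
    }


def inspect_pdf(path):
    """Collect document metadata and per-page summaries for one PDF."""
    import fitz  # PyMuPDF; imported here so summarize_page works without it

    pages = []
    with fitz.open(path) as doc:
        meta = doc.metadata or {}
        for page in doc:
            text = page.get_text()
            # Each font entry is a tuple; index 3 is the base font name.
            fonts = [entry[3] for entry in page.get_fonts()]
            # Each image entry is a tuple; indexes 2 and 3 are width and height.
            images = [(img[2], img[3]) for img in page.get_images(full=True)]
            pages.append(summarize_page(text, fonts, images))
    return {
        "producer": meta.get("producer"),
        "creator": meta.get("creator"),
        "pages": pages,
    }


# Example with hand-entered facts for a page whose only text is a Bates stamp.
stamp_only = summarize_page("EFTA00000205", ["Helvetica", "Helvetica"], [(816, 1056)])
print(stamp_only)
```

A page that reports one image, a near-empty text layer, and only generic fonts matches the profile of a scanned page with an OCR overlay.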
The image resolutions tell their own story. Datasets 3 through 9 share a consistent 816×1056 pixel resolution. Dataset 10 shifts to 816×1089. Datasets 11 and 12 use 816×1073. Dataset 1 is entirely different: 769×1152 pixels in a mix of landscape and portrait orientations. Dataset 2 shows wild variation, with widths ranging from 812 to 895 pixels, suggesting documents were hand-fed into a scanner with inconsistent placement.[6]
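The 96 DPI figure can be cross-checked against these pixel dimensions. The sketch below assumes the scanned sheets were US Letter (8.5 × 11 inches), which the article does not state. Under that assumption, 816×1056 pixels works out to exactly 96 pixels per inch on both axes, and the taller variants correspond to slightly longer scan areas at the same resolution.

```python
# Assumption (not from the source): pages are US Letter, 8.5 x 11 inches.
LETTER_INCHES = (8.5, 11.0)


def implied_dpi(width_px, height_px, page_inches=LETTER_INCHES):
    """Pixels per inch on each axis if the image covers the whole page."""
    width_in, height_in = page_inches
    return width_px / width_in, height_px / height_in


def implied_height_inches(height_px, dpi=96.0):
    """Physical scan length implied by a pixel height at a given DPI."""
    return height_px / dpi


profiles = {
    "datasets 3-9": (816, 1056),
    "dataset 10": (816, 1089),
    "datasets 11-12": (816, 1073),
}

for label, (w, h) in profiles.items():
    horizontal_dpi, _ = implied_dpi(w, h)
    print(label, horizontal_dpi, round(implied_height_inches(h), 2))
```

All three profiles share the 816-pixel width, so the horizontal resolution is the same 96 DPI in each. Only the scanned length differs.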
What Congress Intended
The Epstein Files Transparency Act was designed for exactly the kind of analysis this project performs: systematic, searchable, cross-referenced investigation of millions of pages of evidence. The law mandated disclosure. It assumed the documents would be usable.
The DOJ did add OCR text layers, so the documents are not completely unsearchable. But OCR on 96 DPI scans introduces errors that compound across millions of pages. A researcher searching for a specific name won't find it if the OCR misread a letter. A search for an email address fails when @ becomes ® or ©. A date search returns nothing when OCR confuses a 3 with an 8.[1]
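One of these failure modes, the @ sign misread as ® or ©, can be partly worked around at search time. The sketch below is a hedged illustration, not the project's pipeline: it rewrites ® or © to @ only when the symbol is immediately followed by something shaped like a domain name. The function names and the sample address `someone@gmail.com` are hypothetical.

```python
import re

# Match a misread "@" only when a domain-like token (e.g. gmail.com) follows
# directly, so ordinary copyright and trademark marks are left alone.
_MISREAD_AT = re.compile(
    r"[®©](?=[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}\b)"
)


def repair_at_signs(text: str) -> str:
    """Replace OCR-misread at-signs that sit in front of a domain name."""
    return _MISREAD_AT.sub("@", text)


def find_address(needle: str, ocr_text: str) -> bool:
    """Case-insensitive search for an email address in repaired OCR text."""
    return needle.lower() in repair_at_signs(ocr_text).lower()


print(repair_at_signs("mUeevacation©gmail.com"))
print(find_address("someone@gmail.com", "From: Someone®gmail.com"))
```

This repairs only the separator. A garbled local part such as mUeevacation stays garbled, which is why misreads of letters and digits still require re-OCR or fuzzy matching.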
The fundamental question remains: why scan these documents at all? The DOJ had the original digital files. Emails existed as .msg or .eml files on Exchange servers. Typed memos existed as Word documents. Spreadsheets existed as Excel files. Court filings existed as native PDFs with perfect text, bookmarks, and hyperlinks. The decision to print 3.5 million pages and scan them back into inferior digital copies was a deliberate choice — not a technical necessity.
Compare the EFTA release to other major government document disclosures. The Mueller Report was released as a born-digital, fully searchable PDF with bookmarks and hyperlinks. Congressional investigation reports routinely include native-format digital exhibits. When agencies want documents to be usable, they release them in usable formats.
The Cost of Compliance Theater
Consider the resources the DOJ spent to make these documents worse.
Someone authorized the printing of 3.5 million pages — roughly 7,000 reams of paper. Someone fed those pages, in at least three separate scanning operations across what appears to be multiple facilities, through flatbed or production scanners. Someone ran OCR software on the resulting images. Someone uploaded the results to justice.gov. The personnel hours, equipment costs, and paper alone likely ran into hundreds of thousands of dollars.
All to produce inferior copies of files that already existed in perfect digital form.
Our project has processed 1.37 million document records from this release. Every one required re-processing to become useful — re-OCR'd at higher quality, text extracted, entities identified, cross-references built. The work Congress intended the public to do with these documents was made orders of magnitude harder by the format in which they were released.
The only parties who benefit from degraded, error-prone documents are those whose names appear in them. Every OCR error is a missed search result. Every destroyed metadata field is a broken link in the chain of evidence. Every blurred marginal note is a detail that might have mattered.
The Pattern
This is not the first time the institutions responsible for the Epstein case chose the path of minimum transparency.
Jeffrey Epstein's 2007 Non-Prosecution Agreement gave blanket immunity to unnamed co-conspirators — a document designed to prevent accountability before anyone knew to demand it. For years, the agreement itself was sealed, its terms hidden from the very victims it was supposed to protect.
The same institutional instinct that produced the NPA — protect the system, manage the optics, release the minimum — produced these 3.5 million photographs. The EFTA required disclosure. It did not specify format. And so the DOJ chose the format that would make the documents hardest to use: printed, scanned at sub-standard resolution, OCR'd with errors, stripped of every piece of metadata that might have made systematic analysis possible.
The documents contain the truth. The DOJ just made sure you'd have to work for it.
This article is based on documents released under the Epstein Files Transparency Act (EFTA). All claims are sourced to specific EFTA documents identified by Bates number. Entity tier classifications reflect evidence strength, not legal determinations.
Research and initial drafting assisted by Claude AI (Anthropic). All articles are reviewed, fact-checked, and edited by Derek Emsbach.