Three Million Pages of Nothing

On April 20, 2015, Jeffrey Epstein received an email from the editor-in-chief of Scientific American. "Not a rush, obviously," she wrote, "but was just thinking of Jeffrey and hoping he is well. Offer is still open to visit!" The email was sent from an iPhone, routed through Gmail, and stored on a government server after seizure.1

That email — a few lines of text, maybe 2KB in its original digital form — is now a 48-kilobyte PDF. It was printed on paper, placed on a flatbed scanner, photographed at 96 dots per inch, and then run through optical character recognition software that attempted to reconstruct the text it had just destroyed. The OCR layer reads the original @gmail.com as ®gmail.com. The metadata — sender, recipient, timestamp, routing headers, the digital signature that proved it was authentic — is gone. The PDF's producer and creator fields are blank.1

This is not one document. This is every document. All 3.5 million pages released under the Epstein Files Transparency Act.

What Gets Destroyed

[Image: A flatbed scanner, the kind of device used to convert millions of pages of digital evidence into degraded photographs]

When a born-digital document is printed and scanned, the destruction is comprehensive and irreversible:

Metadata — creation date, author, email headers (To, From, CC, BCC, Date), edit history, software version. The DOJ's scanned PDFs have blank producer and creator fields. The original authorship, timestamps, and chain of custody are permanently erased.

Text fidelity — the original document had perfect, machine-generated text. After scanning, the OCR layer introduces character-level errors. In our sampling, we found @ rendered as ®, email addresses garbled into nonsense like mUeevacation©gmail.com, and names corrupted — every error a potential missed search result.2

Digital signatures — any cryptographic proof of authenticity or forensic chain of custody, gone.

Resolution — the DOJ scanned at approximately 96 DPI. Modern office scanners default to 300 DPI. The National Archives and Records Administration recommends 300-400 DPI for text documents. At 96 DPI, fine print, handwritten marginal notes, and faded stamps become difficult to read — and far more difficult for OCR to process accurately.

File size — a 1-page typed email is roughly 2-5KB as native digital text. Scanned and OCR'd at 96 DPI, it balloons to 25-50KB, about ten times larger. A 100-page document becomes 42 megabytes, against a few hundred kilobytes of native text: on the order of a hundred times larger for strictly less information.3
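To make the metadata loss concrete, here is a minimal sketch using Python's standard-library email parser. The message is a made-up stand-in: every address, header value, and the `extract_metadata` helper are hypothetical, not taken from the release. It only illustrates the kind of structured fields a native .eml file carries and a printed-and-scanned page does not.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# Hypothetical stand-in message; all addresses and header values are invented.
RAW_EML = """\
From: Editor <editor@example.com>
To: Recipient <recipient@example.com>
Date: Mon, 20 Apr 2015 09:15:00 -0400
Message-ID: <abc123@mail.example.com>
X-Mailer: iPhone Mail
Subject: Visit

Hello.
"""


def extract_metadata(raw: str) -> dict:
    """Pull out the header fields that a scan of the printed page cannot preserve."""
    msg = message_from_string(raw)
    return {
        "from": msg["From"],
        "to": msg["To"],
        "sent": parsedate_to_datetime(msg["Date"]).isoformat(),
        "message_id": msg["Message-ID"],
        "mailer": msg["X-Mailer"],
    }


meta = extract_metadata(RAW_EML)
for field, value in meta.items():
    print(f"{field}: {value}")
```

In a native file each of these values is an exact, machine-readable string. In a scan, whatever survives has to be guessed back from pixels.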

Key Finding

Scanning a born-digital document is not a neutral act of preservation. It is an act of destruction — the original metadata, text fidelity, and forensic provenance are permanently lost, replaced with a degraded photograph and an error-prone OCR approximation.

Three and a Half Million Photographs

[Image: The Robert F. Kennedy Department of Justice Building in Washington, D.C., headquarters of the agency that processed the EFTA document release]

To test whether this pattern holds across the entire release, we randomly sampled 20 Bates numbers from each of the 12 DOJ datasets: 240 documents total, selected with a fixed random seed for reproducibility. Only 64 of those documents downloaded successfully. We analyzed all 197 of their pages using PyMuPDF, checking each for full-page scan images, text layers, embedded fonts, and image resolution.
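A seeded draw of this kind could be sketched as follows. This is an illustration under stated assumptions, not the exact script: the seed value, the per-dataset Bates ranges, and the helper names are all hypothetical. Only the sample size (20), the dataset count (12), and the `EFTA` plus eight-digit label format come from the text.

```python
import random

SEED = 42          # assumed value; the text only says the seed was fixed
SAMPLE_SIZE = 20   # Bates numbers drawn per dataset, per the text
NUM_DATASETS = 12


def format_bates(number: int) -> str:
    """Render an integer as an EFTA Bates label, e.g. 205 -> EFTA00000205."""
    return f"EFTA{number:08d}"


def sample_dataset(first: int, last: int, rng: random.Random) -> list[str]:
    """Draw SAMPLE_SIZE distinct Bates numbers from an inclusive range."""
    picks = rng.sample(range(first, last + 1), SAMPLE_SIZE)
    return [format_bates(n) for n in sorted(picks)]


def build_sample(ranges: dict[int, tuple[int, int]], seed: int = SEED) -> dict[int, list[str]]:
    """Sample every dataset from one seeded generator so reruns match exactly."""
    rng = random.Random(seed)
    return {ds: sample_dataset(lo, hi, rng) for ds, (lo, hi) in sorted(ranges.items())}


# Hypothetical ranges: 10,000 consecutive Bates numbers per dataset.
ranges = {ds: ((ds - 1) * 10_000 + 1, ds * 10_000) for ds in range(1, NUM_DATASETS + 1)}
sample = build_sample(ranges)
total = sum(len(labels) for labels in sample.values())
print(total)  # 240
```

Because the generator is seeded once and the datasets are visited in sorted order, anyone rerunning the script with the same ranges gets the same 240 labels.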

The result was uniform.

100%

of the 197 pages sampled across all 12 DOJ EFTA datasets are hybrid scans: photographs of printed documents with OCR text overlaid. Zero born-digital pages. Zero image-only scans. Every page we examined had been printed, scanned, and OCR'd.

Every dataset. Every document type — emails, court filings, FBI reports, bank records, grand jury transcripts. Documents that were typed on computers, sent over the internet, and stored on government servers were printed onto paper and then photographed.
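The three categories used here (hybrid scan, image-only scan, born-digital) can be separated with a simple rule: does one image cover the page, and is there a text layer? The sketch below is one way such a check could look, not necessarily the one used for this analysis. The 90% coverage threshold, the `PageFeatures` type, and both helper functions are assumptions. The PyMuPDF helper is imported lazily, so the classification logic runs without the library installed.

```python
from dataclasses import dataclass

FULL_PAGE_COVERAGE = 0.9  # assumed threshold for "the image fills the page"


@dataclass
class PageFeatures:
    page_area: float           # page width * height, in points
    largest_image_area: float  # area of the largest placed image, in points
    text_chars: int            # characters in the extractable text layer


def classify_page(f: PageFeatures) -> str:
    """Label a page by whether it has a full-page image and/or a text layer."""
    has_scan = f.page_area > 0 and f.largest_image_area / f.page_area >= FULL_PAGE_COVERAGE
    has_text = f.text_chars > 0
    if has_scan and has_text:
        return "hybrid scan"
    if has_scan:
        return "image-only scan"
    if has_text:
        return "born-digital"
    return "empty"


def features_from_pdf(path: str) -> list[PageFeatures]:
    """Collect per-page features with PyMuPDF (requires `pip install pymupdf`)."""
    import fitz  # PyMuPDF, imported lazily so classify_page has no dependency

    features = []
    with fitz.open(path) as doc:
        for page in doc:
            areas = []
            for info in page.get_image_info():
                x0, y0, x1, y1 = info["bbox"]
                areas.append((x1 - x0) * (y1 - y0))
            features.append(PageFeatures(
                page_area=page.rect.width * page.rect.height,
                largest_image_area=max(areas, default=0.0),
                text_chars=len(page.get_text().strip()),
            ))
    return features


letter = 612 * 792  # US Letter page area in points
print(classify_page(PageFeatures(letter, letter, 12)))   # hybrid scan
print(classify_page(PageFeatures(letter, 0.0, 1500)))    # born-digital
```

A page whose only text is a Bates stamp still counts as a hybrid scan under this rule, since any non-empty text layer over a full-page image qualifies.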

The OCR text layers range from a single word — the Bates number stamp `EFTA00000205` — to full transcripts with hundreds of words per page.4 The fonts embedded in the OCR layer are generic substitutes: Helvetica, Courier, ArialMT. The original document fonts are gone. An email composed in Outlook, a memo drafted in Word, a spreadsheet exported from Excel — all now wear the same uniform of machine-generated Helvetica overlaid on a blurry photograph.5

The image resolutions tell their own story. Datasets 3 through 9 share a consistent 816×1056 pixel resolution. Dataset 10 shifts to 816×1089. Datasets 11 and 12 use 816×1073. Dataset 1 is entirely different — 769×1152 pixels in a mix of landscape and portrait orientations. Dataset 2 shows wild variation, with widths ranging from 812 to 895 pixels, suggesting documents were hand-fed into a scanner with inconsistent placement.6
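Resolution fingerprinting amounts to grouping datasets by the pixel dimensions of their page images. A minimal sketch, using only the dominant sizes reported above (Dataset 2 is left out because its widths vary); the function name is hypothetical, and a shared size is a hint of a shared scanning setup, not proof of one:

```python
from collections import defaultdict

# Dominant page-image sizes (width, height) in pixels, as reported in the text.
# Dataset 2 is omitted: its widths range from 812 to 895 pixels.
REPORTED_SIZES = {
    1: (769, 1152),
    3: (816, 1056), 4: (816, 1056), 5: (816, 1056), 6: (816, 1056),
    7: (816, 1056), 8: (816, 1056), 9: (816, 1056),
    10: (816, 1089),
    11: (816, 1073), 12: (816, 1073),
}


def group_by_fingerprint(sizes: dict[int, tuple[int, int]]) -> dict[tuple[int, int], list[int]]:
    """Map each distinct image size to the datasets that share it."""
    groups = defaultdict(list)
    for dataset, dims in sorted(sizes.items()):
        groups[dims].append(dataset)
    return dict(groups)


groups = group_by_fingerprint(REPORTED_SIZES)
for dims, datasets in groups.items():
    print(f"{dims[0]}x{dims[1]}: datasets {datasets}")
```

The fixed sizes fall into four groups, and Dataset 2's irregular widths make a fifth profile.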

Key Finding

Resolution fingerprinting points to at least three distinct scanning operations. This was not one careless batch job. It appears to have been a systematic effort, possibly spread across multiple facilities, to convert born-digital evidence into degraded photographs.

What Congress Intended

The Epstein Files Transparency Act was designed for exactly the kind of analysis this project performs: systematic, searchable, cross-referenced investigation of millions of pages of evidence. The law mandated disclosure. It assumed the documents would be usable.

The DOJ did add OCR text layers — so the documents are not completely unsearchable. But OCR on 96 DPI scans introduces errors that compound across millions of pages. A researcher searching for a specific name won't find it if the OCR misread a letter. A search for an email address fails when @ becomes ® or ©. A date search returns nothing when OCR confuses a 3 with an 8.1
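Researchers can partly work around the @ substitutions by normalizing look-alike characters before searching. The sketch below is an illustration, not a complete fix: the helper names, the regular expressions, and the `someone` local part in the sample string are hypothetical, while the ® and © substitutions and the garbled `mUeevacation©gmail.com` address come from our sampling.

```python
import re

# Characters reported as OCR substitutes for "@".
AT_LOOKALIKES = "®©"

# Replace a look-alike only when it sits between an address-like local part
# and something shaped like a domain, so "Copyright © 2015" is left alone.
_AT_PATTERN = re.compile(
    rf"(?<=[\w.+-])[{AT_LOOKALIKES}](?=[\w-]+\.[A-Za-z]{{2,}})"
)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def normalize_at(text: str) -> str:
    """Map OCR look-alikes back to '@' inside address-shaped strings."""
    return _AT_PATTERN.sub("@", text)


def find_emails(ocr_text: str) -> list[str]:
    """Extract email addresses after repairing the '@' character."""
    return EMAIL_RE.findall(normalize_at(ocr_text))


sample = "Contact: someone®gmail.com or mUeevacation©gmail.com"
print(find_emails(sample))
```

This repairs only the @ character. A garbled local part such as `mUeevacation` stays garbled, which is why character-level normalization cannot substitute for the original text.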

The fundamental question remains: why scan these documents at all? The DOJ had the original digital files. Emails existed as .msg or .eml files on Exchange servers. Typed memos existed as Word documents. Spreadsheets existed as Excel files. Court filings existed as native PDFs with perfect text, bookmarks, and hyperlinks. The decision to print 3.5 million pages and scan them back into inferior digital copies was a deliberate choice — not a technical necessity.

Compare the EFTA release to other major government document disclosures. Congressional investigation reports routinely include native-format digital exhibits. Even the Mueller Report, which the DOJ first posted in 2019 as a scanned PDF, drew immediate criticism for that format, and searchable versions followed. When agencies want documents to be usable, they release them in usable formats.

Key Finding

The DOJ delivered technical compliance — 3.5 million pages released as photographs of printouts, stripped of metadata, degraded in resolution, and filtered through error-prone OCR. The original digital files, with their perfect text, authorship metadata, and forensic provenance, were not released.

The Cost of Compliance Theater

Consider the resources the DOJ spent to make these documents worse.

Someone authorized the printing of 3.5 million pages — roughly 7,000 reams of paper. Someone fed those pages, in at least three separate scanning operations across what appears to be multiple facilities, through flatbed or production scanners. Someone ran OCR software on the resulting images. Someone uploaded the results to justice.gov. The personnel hours, equipment costs, and paper alone likely ran into hundreds of thousands of dollars.

All to produce inferior copies of files that already existed in perfect digital form.

96 DPI

The effective scanning resolution of the standard-size scans: 816 pixels across a 612-point (8.5-inch) page width, as in Datasets 3 through 12. The National Archives recommends 300-400 DPI for text documents. The DOJ chose a resolution roughly one-third of the minimum archival standard.
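The 96 DPI figure follows from simple arithmetic: a PDF point is 1/72 of an inch, so a 612-point page is 8.5 inches wide. A short check, assuming a US Letter page height of 792 points (the text gives only the 612-point width) and a hypothetical helper name:

```python
POINTS_PER_INCH = 72  # PDF user-space units per inch


def effective_dpi(pixels: int, points: float) -> float:
    """Pixels per inch for an image stretched across a span given in points."""
    return pixels / (points / POINTS_PER_INCH)


width_dpi = effective_dpi(816, 612)    # 816 px across 8.5 in
height_dpi = effective_dpi(1056, 792)  # 1056 px across 11 in (assumed Letter height)
ratio_to_minimum = width_dpi / 300     # against the 300 DPI lower bound

print(width_dpi, height_dpi, round(ratio_to_minimum, 2))  # 96.0 96.0 0.32
```

The ratio of 0.32 is where "roughly one-third of the minimum" comes from.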

Our project has processed 1.37 million document records from this release. Every one required re-processing to become useful — re-OCR'd at higher quality, text extracted, entities identified, cross-references built. The work Congress intended the public to do with these documents was made orders of magnitude harder by the format in which they were released.

The only parties who benefit from degraded, error-prone documents are those whose names appear in them. Every OCR error is a missed search result. Every destroyed metadata field is a broken link in the chain of evidence. Every blurred marginal note is a detail that might have mattered.

The Pattern

This is not the first time the institutions responsible for the Epstein case chose the path of minimum transparency.

Jeffrey Epstein's 2007 Non-Prosecution Agreement gave blanket immunity to unnamed co-conspirators — a document designed to prevent accountability before anyone knew to demand it. For years, the agreement itself was sealed, its terms hidden from the very victims it was supposed to protect.

The same institutional instinct that produced the NPA — protect the system, manage the optics, release the minimum — produced these 3.5 million photographs. The EFTA required disclosure. It did not specify format. And so the DOJ chose the format that would make the documents hardest to use: printed, scanned at sub-standard resolution, OCR'd with errors, stripped of every piece of metadata that might have made systematic analysis possible.

The documents contain the truth. The DOJ just made sure you'd have to work for it.
