What Is OCR and Why Do Scanned PDFs Need It?
OCR (Optical Character Recognition) converts scanned PDF images into real, searchable text. Without it, screen readers cannot read a single word in your document.
OCR stands for Optical Character Recognition, a technology that converts images of text — such as those found in scanned PDFs — into actual, machine-readable text. Scanned PDFs need OCR because, without it, every page is nothing more than a flat image. Screen readers cannot read the content, users cannot search or select text, and the document is completely inaccessible to anyone relying on assistive technology. For universities working toward WCAG 2.1 compliance, OCR is the essential first step in making scanned documents usable.
If your institution has filing cabinets worth of digitised records, old syllabi, or archived research papers, there is a strong chance many of those PDFs are scanned images masquerading as text documents. They look fine on screen, but underneath the surface, there is nothing for assistive technology to work with.
How OCR Works
At a high level, OCR software analyses an image pixel by pixel, identifies patterns that correspond to letters, numbers, and symbols, and outputs a text layer that maps those characters to their positions on the page.
Modern OCR engines use a multi-stage process:
- Pre-processing. The image is cleaned up — deskewed, denoised, and converted to high contrast. This step dramatically affects accuracy.
- Segmentation. The engine identifies blocks of text, separating them from images, borders, and whitespace. It determines reading order and distinguishes columns from paragraphs.
- Character recognition. Each character shape is compared against trained models. Traditional engines used template matching; modern engines like Tesseract 5 use LSTM (Long Short-Term Memory) neural networks that recognise characters in context, improving accuracy for unusual fonts and degraded text.
- Post-processing. Dictionary lookups and language models correct common misrecognitions. For example, "rn" misread as "m" or "1" confused with "l" are resolved using contextual analysis.
The output is a text layer that sits behind the original image in the PDF, preserving the visual appearance while adding machine-readable content.
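The post-processing step above is the easiest to illustrate. Below is a minimal, hypothetical sketch of dictionary-based correction: it tries swapping common misrecognitions (such as "rn" read as "m", or "l" read as "1") one occurrence at a time and keeps the first variant found in a wordlist. Real engines use statistical language models rather than this simple lookup, so treat the confusion pairs and function names here as illustrative only.

```python
# Hypothetical confusion pairs: OCR output -> likely intended characters.
CONFUSIONS = [("m", "rn"), ("1", "l"), ("0", "o")]

def _variants(word: str, wrong: str, right: str):
    """Yield copies of `word` with a single occurrence of `wrong`
    swapped for `right`, one position at a time."""
    start = 0
    while (i := word.find(wrong, start)) != -1:
        yield word[:i] + right + word[i + len(wrong):]
        start = i + 1

def correct_word(word: str, dictionary: set) -> str:
    """Keep a recognised word as-is; otherwise try common confusion
    swaps and return the first variant found in the dictionary."""
    if word.lower() in dictionary:
        return word
    for wrong, right in CONFUSIONS:
        for candidate in _variants(word, wrong, right):
            if candidate.lower() in dictionary:
                return candidate
    return word  # no confident correction; leave unchanged

# "modern" scanned with "rn" misread as "m" comes out as "modem":
corrected = correct_word("modem", {"modern"})  # -> "modern"
```

A production system would weight candidates by context (surrounding words), which is why modern LSTM-based engines outperform simple template matching on degraded text.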
Why Scanned PDFs Are Just Images
When you scan a paper document — or when someone prints to PDF from a physical copy — the scanner captures a photograph of each page. The resulting PDF contains image data (typically JPEG or CCITT-compressed TIFF), not text data.
You can test this yourself: open a PDF and try to select text with your cursor. If you cannot highlight individual words, you are looking at an image-only PDF. The file might be 50 pages long with thousands of words visible on screen, but as far as any software is concerned, it contains zero text.
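The "try to select text" test can be automated crudely. In the PDF format, real text is drawn by text-showing operators (Tj and TJ) inside content streams, so a file whose streams contain none of them is almost certainly image-only. The sketch below is a rough heuristic, not a substitute for a real PDF library: it only handles uncompressed and Flate-compressed streams, and production tools should parse the file properly.

```python
import re
import zlib

TEXT_OPS = re.compile(rb"\b(Tj|TJ)\b")

def looks_image_only(pdf_bytes: bytes) -> bool:
    """Rough heuristic: a PDF whose content streams contain no
    text-showing operators (Tj/TJ) is likely an image-only scan."""
    for m in re.finditer(rb"stream\r?\n(.*?)endstream", pdf_bytes, re.DOTALL):
        data = m.group(1)
        try:
            data = zlib.decompress(data)  # FlateDecode streams
        except zlib.error:
            pass  # not Flate-compressed; inspect raw bytes
        if TEXT_OPS.search(data):
            return False  # found real text content
    return True

# A stream with text operators vs. one holding raw image data:
text_pdf = b"1 0 obj << >>\nstream\nBT /F1 12 Tf (Hello) Tj ET\nendstream\nendobj"
image_pdf = b"2 0 obj << /Subtype /Image >>\nstream\n\xff\xd8\xff\xe0 raw image bytes\nendstream\nendobj"
```

At scale, the same check (via a proper library) is how audit tools flag image-only PDFs across a document library.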
This distinction matters enormously for accessibility.
What Happens When a Screen Reader Encounters a Scanned PDF
When a screen reader like JAWS, NVDA, or VoiceOver opens a scanned PDF without OCR, the result is silence — or worse, a single unhelpful announcement like "image" or "graphic" repeated for every page. The screen reader has no text to read aloud because none exists in the file.
For a student who is blind or has low vision, this means the document is completely unusable. A 30-page course reading, an exam paper, a faculty handbook — all invisible. This is not a minor inconvenience. It is a total barrier to access and a clear WCAG 2.1 failure under Success Criterion 1.1.1 (Non-text Content) and 1.4.5 (Images of Text).
Under the ADA Title II requirements taking effect in April 2026, public universities must ensure their digital content is accessible. Scanned PDFs without OCR are among the most common and most severe violations.
OCR Accuracy and DPI Recommendations
OCR is not perfect, and accuracy depends heavily on the quality of the source scan. The single most important factor is resolution, measured in DPI (dots per inch).
Recommended: 300 DPI minimum. At 300 DPI, modern OCR engines achieve 98–99% character accuracy on clean, printed text. This is the standard recommendation from both the PDF/UA specification and accessibility professionals.
Below 200 DPI, accuracy drops sharply. Characters become pixelated, and the engine cannot reliably distinguish similar letterforms. At 150 DPI — common in older scans — expect error rates of 5–10% or more, which means multiple errors per paragraph.
Other factors that affect accuracy:
- Font size. Text below 10pt is harder to recognise accurately.
- Print quality. Faded ink, smudges, and photocopy-of-a-photocopy degradation all reduce accuracy.
- Language and script. Latin-alphabet languages with standard fonts yield the best results. Complex scripts, mathematical notation, and mixed-language documents are more challenging.
- Page layout. Multi-column layouts, tables, and text wrapped around images require sophisticated segmentation.
If you are scanning new documents, always use 300 DPI or higher, black and white or greyscale mode, and ensure the original is clean and flat on the scanner bed.
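For existing scans where the DPI was never recorded, you can recover the effective resolution from the image's pixel dimensions, since DPI is simply pixels divided by physical inches. The sketch below assumes the image covers a full US Letter page; adjust the page size for A4 or other formats.

```python
def effective_dpi(pixels_wide: int, pixels_high: int,
                  page_w_in: float = 8.5, page_h_in: float = 11.0) -> float:
    """Effective resolution of a scanned page image, assuming the
    image covers the full page (US Letter by default).
    Report the smaller axis, since OCR accuracy is limited by the
    worst dimension."""
    return min(pixels_wide / page_w_in, pixels_high / page_h_in)

# A 2550 x 3300 px Letter-size scan is exactly 300 DPI;
# a 1275 x 1650 px scan is only 150 DPI and will OCR poorly.
```

This makes it easy to flag legacy scans that fall below the 300 DPI recommendation before wasting OCR processing time on them.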
Tesseract vs Commercial OCR
The two main categories of OCR engines are open-source and commercial.
Tesseract is the most widely used open-source OCR engine, originally developed by HP and now maintained by Google. Tesseract 5 uses LSTM neural networks and supports over 100 languages. It is free, well-documented, and produces strong results on clean, high-resolution scans. Many accessibility platforms — including Aelira — build on Tesseract as a core processing component.
Commercial OCR engines — such as ABBYY FineReader, Adobe Acrobat's built-in OCR, and Google Cloud Vision — often achieve slightly higher accuracy, particularly on degraded documents, complex layouts, and handwritten text. They also tend to handle multi-column academic papers and mixed-content pages more reliably. However, they come with per-page or per-document licensing costs that scale quickly across large document libraries.
For most university accessibility workflows, Tesseract at 300 DPI delivers accuracy that is more than sufficient for the bulk of printed course materials. Commercial engines are worth considering for high-value documents with poor source quality or complex formatting.
OCR Limitations: What It Cannot Do
OCR is powerful, but it has clear boundaries that are important to understand.
Handwriting. Even the best OCR engines struggle with handwritten text. Accuracy on handwriting varies wildly — from 60% on neat print to near zero on cursive. If your scanned documents contain handwritten annotations, expect those sections to need manual transcription.
Poor-quality scans. Documents scanned at low resolution, with heavy shadows from book spines, or with significant skew and distortion will produce unreliable OCR output. The old adage applies: garbage in, garbage out.
Complex mathematical notation. Standard OCR engines are not designed to recognise LaTeX-style equations or mathematical symbols. Specialised tools exist for this, but they are separate from general-purpose OCR. For mathematical documents, structured authoring from the source (for example, in LaTeX) is almost always preferable to scanning and OCR.
Tables and forms. OCR can recognise the text within table cells, but it does not inherently understand the table structure. The relationship between headers and data cells is lost, requiring additional processing to reconstruct.
What to Do After OCR: Tagging Is Still Required
Here is the critical point that many people miss: OCR is necessary but not sufficient for accessibility. Running OCR on a scanned PDF gives you searchable, selectable text — a major improvement — but the document is still not accessible.
After OCR, the PDF still lacks:
- Document structure tags. Headings, paragraphs, lists, and tables need to be tagged so screen readers can navigate the document logically.
- Reading order. The sequence in which content should be read must be defined, especially for multi-column layouts.
- Alternative text. Any images, charts, or diagrams need descriptive alt text.
- Language identification. The document language must be declared for correct screen reader pronunciation.
In other words, OCR extracts the raw text, but tagging and remediation give that text meaning and structure. Without both steps, the document fails accessibility checks.
This is why automated scanning tools that only perform OCR and stop there give institutions a false sense of compliance. A comprehensive approach to PDF accessibility requires OCR followed by structural tagging, alt text generation, and reading order verification.
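The post-OCR requirements above can be expressed as a simple checklist. The sketch below is illustrative: the property names are hypothetical flags that an upstream checker (your accessibility audit tool) would produce, not fields from any real PDF library.

```python
# Hypothetical remediation checklist; keys are flags an upstream
# accessibility checker would set on each document.
REQUIRED_AFTER_OCR = {
    "has_text_layer": "OCR text layer present",
    "has_structure_tags": "headings, paragraphs, lists, and tables tagged",
    "has_reading_order": "logical reading order defined",
    "images_have_alt_text": "alt text on all images, charts, and diagrams",
    "language_declared": "document language set",
}

def remaining_remediation(doc: dict) -> list:
    """Return human-readable items still missing for `doc`,
    a dict of booleans produced by an upstream checker."""
    return [label for key, label in REQUIRED_AFTER_OCR.items()
            if not doc.get(key, False)]

# A document that has only been OCR'd still has four items outstanding:
todo = remaining_remediation({"has_text_layer": True})
```

Running this kind of check after OCR makes the gap concrete: a searchable text layer on its own clears only one of the five items.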
Making OCR Part of Your Accessibility Workflow
For universities managing thousands of documents, the practical question is how to integrate OCR into an efficient workflow:
- Audit your existing library. Identify which PDFs are image-only by checking for selectable text. Automated tools can flag these at scale.
- Set scanning standards. Establish a minimum 300 DPI policy for all new scans. Train staff on proper scanning technique.
- Batch-process with OCR. Use tools that can process documents in bulk rather than one at a time.
- Follow OCR with remediation. Ensure your workflow includes structural tagging after the OCR step.
- Prioritise high-traffic documents. Start with syllabi, course readings, and public-facing materials that affect the most users.
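The audit and prioritisation steps above can be sketched as a small triage function. The priority terms and the pluggable `is_image_only` checker are assumptions for illustration; a real pipeline would plug in its own detection logic and institutional priorities.

```python
from pathlib import Path

# Hypothetical high-traffic filename markers; tune to your institution.
PRIORITY_TERMS = ("syllabus", "reading", "handbook")

def triage(pdf_paths, is_image_only) -> list:
    """Order image-only PDFs so high-traffic documents are OCR'd first.
    `is_image_only` is any callable that flags a path as an
    image-only scan (e.g. your audit tool's checker)."""
    flagged = [p for p in pdf_paths if is_image_only(p)]

    def priority(p):
        name = Path(p).name.lower()
        is_high_traffic = any(t in name for t in PRIORITY_TERMS)
        return (0 if is_high_traffic else 1, name)

    return sorted(flagged, key=priority)

# Documents that already have a text layer are skipped; syllabi jump
# ahead of miscellaneous notes in the OCR queue.
queue = triage(["notes.pdf", "syllabus_2024.pdf", "archive.pdf"],
               lambda p: p != "archive.pdf")
```

From here, each queued file would go through OCR and then the tagging and remediation steps described earlier.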
Aelira automates this entire pipeline — from detecting image-only PDFs through OCR processing to structural tagging and alt text generation — so faculty and accessibility staff can focus on review rather than manual remediation. If your institution is working through a backlog of scanned documents ahead of compliance deadlines, a platform that handles both OCR and tagging in a single pass can save significant time and effort.

Aelira Team
Accessibility Engineers
The Aelira team is building AI-powered accessibility tools for higher education. We're on a mission to help universities meet WCAG 2.1 compliance before the April 2026 deadline.