In [ ]:
# Install required packages
!pip install --upgrade --quiet 'natural-pdf[export,ai]>=0.5.0'
!pip install --upgrade --quiet rapidocr
!pip install --upgrade --quiet surya-ocr

print('✓ Packages installed!')

Slides: slides.pdf

OCR: Recognizing text

Sometimes you can't actually get the text off of the page. It's an image of text instead of being actual text.

In [1]:
from natural_pdf import PDF

pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/needs-ocr.pdf")
page = pdf.pages[0]
page.show(width=800)

Looks the same as the last one, right? But when we try to extract the text...

In [2]:
text = page.extract_text()
print(text)

Nothing! It's time for OCR.

There are a looooot of OCR engines out there, and one of the things that makes Natural PDF nice is that it supports several of them. Figuring out which one is "best" is a lot easier when you can just run them one after another.
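Once you have text from two engines, a plain diff is a quick way to see where they disagree. This is a minimal sketch using only Python's difflib, not anything built into Natural PDF; `differing_lines` is a made-up helper name, and the sample strings are lines of the kind the engines in this notebook return further down.

```python
import difflib

def differing_lines(a: str, b: str) -> list[tuple[str, str]]:
    """Pair up the lines that differ between two OCR outputs."""
    a_lines, b_lines = a.splitlines(), b.splitlines()
    matcher = difflib.SequenceMatcher(a=a_lines, b=b_lines, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            # zip stops at the shorter side, so uneven replacements are truncated
            pairs.extend(zip(a_lines[i1:i2], b_lines[j1:j2]))
    return pairs

engine_a = "Site: Durham's Meatpacking Chicago, II..\nDate: February 3, 1905\nViolation Count: 7."
engine_b = "Site: Durham's Meatpacking Chicago, Ill.\nDate: February 3, 1905\nViolation Count: 7"

for left, right in differing_lines(engine_a, engine_b):
    print(f"{left!r}  vs  {right!r}")
```

Lines that only one engine produced (pure insertions or deletions) are ignored here, which keeps the sketch short but means it is not a full diff.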

The default in this setup is RapidOCR, which usually works fine. But what happens when we try it with this document?

In [3]:
page.apply_ocr()
[INFO] 2026-05-23 17:40:32,046 [RapidOCR] base.py:22: Using engine_name: onnxruntime
[INFO] 2026-05-23 17:40:32,403 [RapidOCR] download_file.py:60: File exists and is valid: /Users/soma/Development/natural-pdf/.venv/lib/python3.11/site-packages/rapidocr/models/ch_PP-OCRv4_det_mobile.onnx
[INFO] 2026-05-23 17:40:32,403 [RapidOCR] main.py:57: Using /Users/soma/Development/natural-pdf/.venv/lib/python3.11/site-packages/rapidocr/models/ch_PP-OCRv4_det_mobile.onnx
[INFO] 2026-05-23 17:40:32,449 [RapidOCR] base.py:22: Using engine_name: onnxruntime
[INFO] 2026-05-23 17:40:32,451 [RapidOCR] download_file.py:60: File exists and is valid: /Users/soma/Development/natural-pdf/.venv/lib/python3.11/site-packages/rapidocr/models/ch_ppocr_mobile_v2.0_cls_mobile.onnx
[INFO] 2026-05-23 17:40:32,451 [RapidOCR] main.py:57: Using /Users/soma/Development/natural-pdf/.venv/lib/python3.11/site-packages/rapidocr/models/ch_ppocr_mobile_v2.0_cls_mobile.onnx
[INFO] 2026-05-23 17:40:32,476 [RapidOCR] base.py:22: Using engine_name: onnxruntime
[INFO] 2026-05-23 17:40:32,483 [RapidOCR] download_file.py:60: File exists and is valid: /Users/soma/Development/natural-pdf/.venv/lib/python3.11/site-packages/rapidocr/models/en_PP-OCRv4_rec_mobile.onnx
[INFO] 2026-05-23 17:40:32,483 [RapidOCR] main.py:57: Using /Users/soma/Development/natural-pdf/.venv/lib/python3.11/site-packages/rapidocr/models/en_PP-OCRv4_rec_mobile.onnx
Out[3]:
<Page number=1 index=0>
In [5]:
text = page.extract_text()
print(text)
Jungle Health and Safety Inspection Service
INS-UP70N51NCL41R
Site: Durham's Meatpacking Chicago, II..
Date: February 3, 1905
Violation Count: 7.
Summary: Worst of any, however, were the fertilizer men, and those who served in the cooking rooms.
These people could not be shown to the visitor - for the odor of a fertilizer man would scare any ordinary
visitor at a hundred yards, and as for the other men, who worked in tank rooms full of steam, and in
some of which there were open vats near the level of the floor, their peculiar trouble was that they fell
into the vats; and when they were fished out, there was never enough of them left to be worth
exhibiting - sometimes they would be overlooked for days, till all but the bones of them had gone out.
to the world as Durham's Pure Leaf Lard!.
Violations
Statute Description Level Repeat?
4.12.7 Unsanitary Working Conditions. Critical
5.8.3 Inadequate Protective Equipment. Serious
6.3.9 Ineffective Injury Prevention. Serious
7.1.5 Failure to Properly Store Hazardous Materials.. Critical
8.9.2 Lack of Adequate Fire Safety Measures. Serious
9.6.4 Inadequate Ventilation Systems. Serious
10.2.7 Insufficient Employee Training for Safe Work Practices.. Serious
Jungle Health and Safety Inspection Service

At first glance it does pretty well! But which parts should we double-check? Coloring the text by OCR confidence shows where the engine was unsure.

In [11]:
(
    page
    .find_all('text')
    .show(group_by='confidence', width=800)
)

The two issues I can see are Durham's Pure Leaf Lardl instead of Durham's Pure Leaf Lard!, and Chicago, II.. instead of Chicago, Ill.
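If you'd rather have a list than a picture, the same triage can be sketched in plain Python: sort by confidence and look at whatever falls under a cutoff. Everything here is an assumption for illustration: `needs_review` is a hypothetical helper, the 0.9 cutoff is arbitrary, and the confidence numbers are invented rather than taken from the run above.

```python
# Hypothetical (text, confidence) pairs; the confidence values are invented.
ocr_results = [
    ("Site: Durham's Meatpacking Chicago, II..", 0.71),
    ("Date: February 3, 1905", 0.98),
    ("Violation Count: 7.", 0.93),
]

def needs_review(results, cutoff=0.9):
    """Return the texts whose confidence is below the cutoff, least confident first."""
    flagged = [(conf, text) for text, conf in results if conf < cutoff]
    return [text for conf, text in sorted(flagged)]

print(needs_review(ocr_results))
print(needs_review(ocr_results, cutoff=0.95))
```

Raising the cutoff gives you more to read but fewer surprises later.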

I don't really need to know why, though, because I can just try another engine! You can also fool around with the options: some of the lowest-hanging fruit is increasing the resolution of the OCR. The default at the moment is 150, and you can try upping it to 300 for (potentially) better results.
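To get a feel for what doubling the resolution means, here is the arithmetic as a small sketch. PDF coordinates are in points, 72 to an inch, so the rendered image is the page size scaled by dpi / 72. The page size is an assumption (US Letter, 612 x 792 points), and `render_size` is a made-up helper, not a Natural PDF function.

```python
POINTS_PER_INCH = 72  # PDF user space units per inch

def render_size(width_pt: float, height_pt: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a page rendered at the given resolution."""
    scale = dpi / POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)

# Assumption: a US Letter page (612 x 792 points).
for dpi in (150, 300):
    print(dpi, render_size(612, 792, dpi))
# 150 (1275, 1650)
# 300 (2550, 3300)
```

Twice the resolution is four times the pixels, so expect OCR to take noticeably longer in exchange for the sharper input.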

To fix this I'll try another OCR engine (we have a bushel of them). There are one million leaderboards with different scores; if you look at this one you'll see GLM OCR is somewhere decently up there.

In [16]:
page.apply_ocr(engine='glm_ocr')
OCR regions:   0%|          | 0/8 [00:00<?, ?region/s]
Out[16]:
<Page number=1 index=0>
[INFO] 2026-03-01 17:43:27,641 [RapidOCR] base.py:22: Using engine_name: onnxruntime
[INFO] 2026-03-01 17:43:27,648 [RapidOCR] download_file.py:60: File exists and is valid: /Users/soma/Development/natural-pdf/.venv/lib/python3.11/site-packages/rapidocr/models/en_PP-OCRv4_rec_infer.onnx
[INFO] 2026-03-01 17:43:27,648 [RapidOCR] main.py:53: Using /Users/soma/Development/natural-pdf/.venv/lib/python3.11/site-packages/rapidocr/models/en_PP-OCRv4_rec_infer.onnx
In [17]:
text = page.extract_text()
print(text)
Jungle Health and Safety Inspection Service INS-UP70N51NCL41R
Site: Durham's Meatpacking Chicago, Ill.
Date: February 3, 1905
Violation Count: 7
Summary: Worst of any, however, were the fertilizer men, and those who served in the cooking rooms. These people could not be shown to the visitor - for the odor of a fertilizer man would scare any ordinary visitor at a hundred yards, and as for the other men, who worked in tank rooms full of steam, and in some of which there were open vats near the level of the floor, their peculiar trouble was that they fell into the vats; and when they were fished out, there was never enough of them left to be worth exhibiting - sometimes they would be overlooked for days, till all but the bones of them had gone out to the world as Durham's Pure Leaf Lard!
Violations
<table border="1"><tr><td>Statute</td><td>Description</td><td>Level</td><td>Repeat?</td></tr><tr><td>4.12.7</td><td>Unsanitary Working Conditions.</td><td>Critical</td><td>$\textcircled{2}$</td></tr><tr><td>5.8.3</td><td>Inadequate Protective Equipment.</td><td>Serious</td><td>$\textcircled{2}$</td></tr><tr><td>6.3.9</td><td>Ineffective Injury Prevention.</td><td>Serious</td><td>$\square$</td></tr><tr><td>7.1.5</td><td>Failure to Properly Store Hazardous Materials.</td><td>Critical</td><td>$\square$</td></tr><tr><td>8.9.2</td><td>Lack of Adequate Fire Safety Measures.</td><td>Serious</td><td>$\square$</td></tr><tr><td>9.6.4</td><td>Inadequate Ventilation Systems.</td><td>Serious</td><td>$\textcircled{2}$</td></tr><tr><td>10.2.7</td><td>Insufficient Employee Training for Safe Work Practices.</td><td>Serious</td><td>$\square$</td></tr></table>
Jungle Health and Safety Inspection Service
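Notice that this engine hands the table back as HTML, with LaTeX-ish symbols in the last column. If you wanted to use that directly, something like the sketch below would turn it into rows using only the standard library. The `TableRows` class and `checkbox` helper are made up for this example, the HTML is three rows copied from the output above, and reading `\textcircled` as a ticked box and `\square` as an empty one is my assumption.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data

def checkbox(cell: str) -> str:
    # Assumption: \textcircled{...} is a ticked box, \square an empty one.
    if r"\textcircled" in cell:
        return "checked"
    if r"\square" in cell:
        return "unchecked"
    return cell

html = (
    r'<table border="1"><tr><td>Statute</td><td>Description</td><td>Level</td><td>Repeat?</td></tr>'
    r'<tr><td>4.12.7</td><td>Unsanitary Working Conditions.</td><td>Critical</td><td>$\textcircled{2}$</td></tr>'
    r'<tr><td>6.3.9</td><td>Ineffective Injury Prevention.</td><td>Serious</td><td>$\square$</td></tr></table>'
)

parser = TableRows()
parser.feed(html)
parser.close()

header, *body = parser.rows
table = [dict(zip(header, (checkbox(c) for c in row))) for row in body]
print(table)
```

The list of dicts can go straight into `pandas.DataFrame(table)` if you want a dataframe.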

Amazing! But what if the document were very, very difficult, and neither RapidOCR nor GLM OCR nor any of the million other OCR tools could get good results?

Basic OCR with an LLM

Sometimes skipping our OCR-only tools and using an LLM is our last resort.

In [1]:
from natural_pdf import PDF
pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/needs-ocr.pdf")
page = pdf.pages[0]

page.extract_text()
Out[1]:
''
In [2]:
import os
from dotenv import load_dotenv

load_dotenv()

GOOGLE_API_KEY = os.environ.get("GOOGLE_API_KEY") or input("Enter GOOGLE_API_KEY: ")
In [21]:
import os
from openai import OpenAI

# Using Google Gemini (default)
client = OpenAI(
    api_key=GOOGLE_API_KEY,
    # api_key="YOUR_API_KEY_HERE",
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

# This gives the funniest response I've ever seen
# page.apply_ocr(engine="vlm", model="gemini-3.5-flash", client=client)
page.apply_ocr(engine="vlm", model="gemini-3.1-flash-lite", client=client)

# Or use OpenAI directly:
# client = OpenAI(
#     api_key=os.environ["OPENAI_API_KEY"],
#     # api_key="YOUR_API_KEY_HERE",
# )
# page.apply_ocr(engine="vlm", model="gpt-5", client=client)

# Or use OpenRouter for open-source models:
# client = OpenAI(
#     api_key=os.environ["OPENROUTER_API_KEY"],
#     # api_key="YOUR_API_KEY_HERE",
#     base_url="https://openrouter.ai/api/v1"
# )
# page.apply_ocr(engine="vlm", model="qwen/qwen3-vl-8b-instruct", client=client)
Out[21]:
<Page number=1 index=0>
In [22]:
print(page.extract_text())
Jungle Health and Safety Inspection Service
INS-UP70N51NCL41R
Site: Durham's Meatpacking Chicago, III.
Date: February 3, 1905
Violation Count: 7
Summary: Worst of any, however, were the fertilizer men, and those who served in the cooking rooms. These people could not be shown to the visitor - for the odor of a fertilizer man would scare any ordinary visitor at a hundred yards, and as for the other men, who worked in tank rooms full of steam, and in some of which there were open vats near the level of the floor, their peculiar trouble was that they fell into the vats; and when they were fished out, there was never enough of them left to be worth exhibiting - sometimes they would be overlooked for days, till all but the bones of them had gone out to the world as Durham's Pure Leaf Lard!
Violations
Violations Table
In [23]:
page.find_all('text').show(width=700)

Advanced OCR with an LLM

Sometimes you need exact bounding boxes, or the LLM is being unpredictable, or you need to zoom in, or you need... all sorts of things. An alternative to throwing everything over the fence is a two-step process:

  1. Use an OCR tool with detect_only=True so it detects the text, but doesn't try to read it.
  2. Then send each box to the LLM to get the text

I find this works better than sending the entire page because there is less room for hallucination.
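The shape of that two-step process can be sketched without any OCR library at all. This is only an illustration of the idea, not how Natural PDF does it internally: `Box`, `read_boxes`, `page_text` and `fake_reader` are all hypothetical names, the coordinates are made up, and the reader is a stub standing in for "crop the box and send the crop to the LLM".

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Box:
    """A detected text region; text stays empty until it has been read."""
    x0: float
    top: float
    x1: float
    bottom: float
    text: str = ""

def read_boxes(boxes: list[Box], reader: Callable[[Box], str]) -> list[Box]:
    """Step 2: call the reader once per box instead of once per page."""
    for box in boxes:
        box.text = reader(box).strip()
    return boxes

def page_text(boxes: list[Box]) -> str:
    """Stitch the boxes back together, top to bottom and then left to right."""
    ordered = sorted(boxes, key=lambda b: (b.top, b.x0))
    return "\n".join(b.text for b in ordered if b.text)

# Step 1 stand-in: pretend a detector returned these regions (made-up coordinates).
detected = [Box(50, 120, 300, 135), Box(50, 100, 400, 115)]

# Hypothetical reader; a real one would send the cropped image to the LLM.
def fake_reader(box: Box) -> str:
    return " Date: February 3, 1905 " if box.top > 110 else "Site: Durham's Meatpacking"

print(page_text(read_boxes(detected, fake_reader)))
```

Because each call only ever sees one small crop, a bad answer damages one box rather than the whole page, and the bounding boxes from step 1 are never touched.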

In [ ]:
from natural_pdf import PDF
pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/needs-ocr.pdf")
page = pdf.pages[0]

page.apply_ocr(detect_only=True)
page.extract_text()
In [ ]:
texts = page.find_all('text')
texts.show(width=800)
In [ ]:
import os
from openai import OpenAI

instructions = """Return only the exact text content visible in the image. 
Preserve original spelling, capitalization, punctuation, and symbols.
Fix misspellings if they are the result of blurry or incorrect OCR.
Do not add any explanatory text, translations, comments, or quotation marks around the result.
If you cannot process the image or do not see any text, return an empty space.
The text is from an inspection report of a slaughterhouse."""
# You could swap the last line for a different hint about the document, e.g.:
# The text is likely from a Greek document, potentially a spreadsheet, containing Modern Greek words or numbers

client = OpenAI(
    api_key=GOOGLE_API_KEY,
    # api_key="YOUR_API_KEY_HERE",
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

(
    page.find_all('text')
        .apply_ocr(
            engine="vlm",
            model="gemini-2.5-flash-lite",
            client=client,
            instructions=instructions
        )
)

What do we have now?

In [ ]:
text = page.extract_text()
print(text)

Finding tables on OCR documents

When we used page.extract_table() last time, it was easy because there were all of these line elements on the page that pdfplumber could detect and say "hey, it's a table!" For the same reason that there's no real text on the page, there are also no real lines on the page. Instead, we're going to do a fun secret trick: we look at the pixels and treat any horizontal or vertical coordinate that is dark along enough of its length as a line, where "enough" is a threshold we set.
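Here is a toy version of that trick so the threshold is less mysterious. It is a sketch of the general idea only; I'm assuming the threshold means "share of dark pixels along the row or column", and `find_lines` plus the tiny 5x5 "scan" are made up for the example.

```python
def find_lines(pixels: list[list[int]], threshold: float, axis: str) -> list[int]:
    """Return row or column indexes where the share of dark pixels meets the threshold.

    pixels is a grid of 0 (light) and 1 (dark); axis is "vertical" or "horizontal".
    """
    if axis == "vertical":
        strips = list(zip(*pixels))  # one tuple per column
    else:
        strips = pixels  # one list per row
    return [i for i, strip in enumerate(strips) if sum(strip) / len(strip) >= threshold]

# A made-up scan: a solid rule along row 0 and a rule down column 2 with a gap in it.
image = [
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]

print(find_lines(image, 0.4, "vertical"))    # [2]
print(find_lines(image, 0.9, "vertical"))    # []
print(find_lines(image, 0.8, "horizontal"))  # [0]
```

Column 2 is dark for 80% of its height, so a threshold of 0.4 finds it and 0.9 misses it. That is why lowering the threshold helps with faint or broken lines. In a real scan a line is several pixels thick, so neighboring indexes would also need to be merged into one line.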

In [24]:
from natural_pdf import PDF

pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/needs-ocr.pdf")
page = pdf.pages[0]

page.apply_ocr()
page.show(width=800)
[INFO] 2026-05-23 17:56:10,067 [RapidOCR] base.py:22: Using engine_name: onnxruntime
[INFO] 2026-05-23 17:56:10,140 [RapidOCR] download_file.py:60: File exists and is valid: /Users/soma/Development/natural-pdf/.venv/lib/python3.11/site-packages/rapidocr/models/ch_PP-OCRv4_det_mobile.onnx
[INFO] 2026-05-23 17:56:10,140 [RapidOCR] main.py:57: Using /Users/soma/Development/natural-pdf/.venv/lib/python3.11/site-packages/rapidocr/models/ch_PP-OCRv4_det_mobile.onnx
[INFO] 2026-05-23 17:56:10,203 [RapidOCR] base.py:22: Using engine_name: onnxruntime
[INFO] 2026-05-23 17:56:10,205 [RapidOCR] download_file.py:60: File exists and is valid: /Users/soma/Development/natural-pdf/.venv/lib/python3.11/site-packages/rapidocr/models/ch_ppocr_mobile_v2.0_cls_mobile.onnx
[INFO] 2026-05-23 17:56:10,206 [RapidOCR] main.py:57: Using /Users/soma/Development/natural-pdf/.venv/lib/python3.11/site-packages/rapidocr/models/ch_ppocr_mobile_v2.0_cls_mobile.onnx
[INFO] 2026-05-23 17:56:10,230 [RapidOCR] base.py:22: Using engine_name: onnxruntime
[INFO] 2026-05-23 17:56:10,236 [RapidOCR] download_file.py:60: File exists and is valid: /Users/soma/Development/natural-pdf/.venv/lib/python3.11/site-packages/rapidocr/models/en_PP-OCRv4_rec_mobile.onnx
[INFO] 2026-05-23 17:56:10,237 [RapidOCR] main.py:57: Using /Users/soma/Development/natural-pdf/.venv/lib/python3.11/site-packages/rapidocr/models/en_PP-OCRv4_rec_mobile.onnx
In [25]:
table_area = (
    page
    .find('text:contains(Violations)')
    .below(
        until='text:contains(Jungle)',
        include_endpoint=False
    )
)
table_area.show(crop=True)

Now we can add the lines and use them to detect the table.
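Why do lines give us a table? Each pair of neighboring vertical guides is a column, each pair of neighboring horizontal guides is a row, and every cell is one of each. A minimal sketch of that bookkeeping, with a hypothetical `cells_from_guides` helper and made-up guide positions:

```python
def cells_from_guides(verticals: list[float], horizontals: list[float]) -> list[list[tuple]]:
    """Turn guide positions into rows of (x0, top, x1, bottom) cell boxes."""
    xs, ys = sorted(verticals), sorted(horizontals)
    return [
        [(x0, top, x1, bottom) for x0, x1 in zip(xs, xs[1:])]
        for top, bottom in zip(ys, ys[1:])
    ]

# Made-up guides: 3 vertical guides make 2 columns, 3 horizontal guides make 2 rows.
grid = cells_from_guides([0, 100, 250], [0, 20, 40])
print(len(grid), len(grid[0]))  # 2 2
print(grid[1][0])               # (0, 20, 100, 40)
```

In Natural PDF the guides object takes care of all this for you, as the next cell shows.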

In [27]:
guides = table_area.guides()
guides.vertical.from_lines(threshold=0.4, detection_method='pixels')
guides.horizontal.from_lines(detection_method='pixels')
guides.show()

Now we just need to find those checkboxes.

In [ ]:
page.detect_checkboxes()

And we're good to go!

In [30]:
guides.extract_table().to_df()
Out[30]:
Statute Description Level Repeat?
0 4.12.7 Unsanitary Working Conditions. Critical [CHECKED]
1 5.8.3 Inadequate Protective Equipment. Serious [CHECKED]
2 6.3.9 Ineffective Injury Prevention. Serious [UNCHECKED]
3 7.1.5 Failure to Properly Store Hazardous Materials.. Critical [UNCHECKED]
4 8.9.2 Lack of Adequate Fire Safety Measures. Serious [UNCHECKED]
5 9.6.4 Inadequate Ventilation Systems. Serious [CHECKED]
6 10.2.7 Insufficient Employee Training for Safe Work P... Serious [UNCHECKED]