← Home

Modern PDF processing with Natural PDF

Download notebook

In [ ]:

# Install required packages
!pip install --upgrade --quiet 'natural-pdf[export,ai]>=0.6.4' rapidocr

print('✓ Packages installed!')

OCR: Recognizing text¶

Sometimes you can't actually get the text off of the page. It's an image of text instead of being actual text.

In [ ]:

from natural_pdf import PDF

pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/needs-ocr.pdf")
page = pdf.pages[0]
page.show(width=800)

Looks the same as the last one, right? But when we try to extract the text...

In [ ]:

text = page.extract_text()
print(text)

Nothing! It's time for OCR.

There are a looooot of OCR engines out there, and one of the things that makes Natural PDF nice is that it supports multiples. Figuring out which one is the "best" isn't as tough when you can just run them all right after each other.

The default is RapidOCR which usually works fine. But what happens when we try it with this document?

In [ ]:

page.apply_ocr()

In [ ]:

text = page.extract_text()
print(text)

At first glance it does pretty well! But maybe we should see what to double-check?

In [ ]:

(
    page
    .find_all('text')
    .show(group_by='confidence', width=800)
)

The two issues I can see are Durham's Pure Leaf Lardl instead of Durham's Pure Leaf Lard!, and Chicago II.. instead of Chicago Ill.

I don't need to know why, though, really, because I can just try some other engine! You can also fool around with the options - some of the the lowest-hanging fruit is increasing the resolution of the OCR. The default at the moment is 150, you can try upping to 300 for (potentially) better results.

To fix this I'll both up the resolution and try another OCR engine (we have a bushel of them). THere are one million leaderboards with different scores, if you look at this one you'll see GLM OCR is somewhere decently up there.

In [ ]:

page.apply_ocr(engine='glm_ocr')

In [ ]:

text = page.extract_text()
print(text)

Amazing! But what if it was very very difficult? And neither RapidOCR nor GLM OCR nor the million of other OCR tools can get the results?

Basic OCR with an LLM¶

Sometimes skipping our OCR-only tools and using an LLM is our last result.

In [ ]:

from natural_pdf import PDF
pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/needs-ocr.pdf")
page = pdf.pages[0]

page.extract_text()

In [ ]:

import os
from dotenv import load_dotenv

load_dotenv()

GOOGLE_API_KEY = os.environ.get("GOOGLE_API_KEY") or input("Enter GOOGLE_API_KEY: ")

In [ ]:

import os
from openai import OpenAI

# Using Google Gemini (default)
client = OpenAI(
    api_key=GOOGLE_API_KEY,
    # api_key="YOUR_API_KEY_HERE",
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

# This gives the funniest response I've ever seen
# page.apply_ocr(engine="vlm", model="gemini-3.5-flash", client=client)
page.apply_ocr(engine="vlm", model="gemini-3.1-flash-lite", client=client)

# Or use OpenAI directly:
# client = OpenAI(
#     api_key=os.environ["OPENAI_API_KEY"],
#     # api_key="YOUR_API_KEY_HERE",
# )
# page.apply_ocr(engine="vlm", model="gpt-5", client=client)

# Or use OpenRouter for open-source models:
# client = OpenAI(
#     api_key=os.environ["OPENROUTER_API_KEY"],
#     # api_key="YOUR_API_KEY_HERE",
#     base_url="https://openrouter.ai/api/v1"
# )
# page.apply_ocr(engine="vlm", model="qwen/qwen3-vl-8b-instruct", client=client)

In [ ]:

print(page.extract_text())

In [ ]:

page.find_all('text').show(width=700)

Advanced OCR with an LLM¶

Sometimes you need exact bounding boxes, or the LLM is being unpredictable, or you need to zoom in, or you need... all sorts of things. And alternative to throwing everything over the fence is a two-stop process:

Use an OCR tool with detect_only=True so it detects the text, but doesn't try to read it.
Then send each box to the LLM to get the text

I find this works better than sending the entire page because there is less likelihood for hallucination.

In [ ]:

from natural_pdf import PDF
pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/needs-ocr.pdf")
page = pdf.pages[0]

page.apply_ocr(detect_only=True)
page.extract_text()

In [ ]:

texts = page.find_all('text')
texts.show(width=800)

In [ ]:

import os
from openai import OpenAI

instructions = """Return only the exact text content visible in the image. 
Preserve original spelling, capitalization, punctuation, and symbols.
Fix misspellings if they are the result of blurry or incorrect OCR.
Do not add any explanatory text, translations, comments, or quotation marks around the result.
If you cannot process the image or do not see any text, return an empty space.
The text is from an inspection report of a slaughterhouse."""
# Could use all other sorts of text, too
# The text is likely from a Greek document, potentially a spreadsheet, containing Modern Greek words or numbers

client = OpenAI(
    api_key=GOOGLE_API_KEY,
    # api_key="YOUR_API_KEY_HERE",
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

(
    page.find_all('text')
        .apply_ocr(
            engine="vlm",
            model="gemini-2.5-flash-lite",
            client=client,
            instructions=instructions
        )
)

What do we have now?

In [ ]:

text = page.extract_text()
print(text)

Finding tables on OCR documents¶

When we used page.extract_table() last time, it was easy because there were all of these line elements on the page that pdfplumber could detect and say "hey, it's a table!" For the same reason that there's no real text on the page, there's also no real lines on the page. Instead, we're going to do a fun secret trick where we look at what horizontal and vertical coordinates seem like they might be lines by setting a threshold.

In [ ]:

from natural_pdf import PDF

pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/needs-ocr.pdf")
page = pdf.pages[0]

page.apply_ocr()
page.show(width=800)

In [ ]:

table_area = (
    page
    .find('text:contains(Violations)')
    .below(
        until='text:contains(Jungle)',
        include_endpoint=False
    )
)
table_area.show(crop=True)

Now we can add the lines and use them to detect the table.

In [ ]:

guides = table_area.guides()
guides.vertical.from_lines(threshold=0.4, detection_method='pixels')
guides.horizontal.from_lines(detection_method='pixels')
guides.show()

Now we just need to find those checkboxes.

In [ ]:

page.detect_checkboxes()

And we're good to go!

In [ ]:

guides.extract_table().to_df()