# Install required packages
!pip install --upgrade --quiet 'natural-pdf[export,ai]>=0.5.0'
!pip install --upgrade --quiet easyocr
!pip install --upgrade --quiet surya-ocr
print('✓ Packages installed!')
Slides: slides.pdf
Sometimes you can't actually get the text off of the page. It's an image of text instead of being actual text.
from natural_pdf import PDF
pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/needs-ocr.pdf")
page = pdf.pages[0]
page.show(width=800)
Looks the same as the last one, right? But when we try to extract the text...
text = page.extract_text()
print(text)
Nothing! It's time for OCR.
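Before reaching for OCR, a quick sanity check is just asking whether extraction came back empty. This little helper is my own sketch, not part of natural-pdf — the `min_chars` cutoff is an arbitrary assumption:

```python
def needs_ocr(extracted_text, min_chars=1):
    """Heuristic: treat a page as scanned if extraction yields (almost) no text."""
    return len((extracted_text or "").strip()) < min_chars

print(needs_ocr(""))                         # True: an image of text extracts as nothing
print(needs_ocr("Durham's Pure Leaf Lard"))  # False: a digital page has real text
```

You could loop this over `pdf.pages` to only OCR the pages that actually need it.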
There are a looooot of OCR engines out there, and one of the things that makes Natural PDF nice is that it supports multiple engines. Figuring out which one is the "best" isn't as tough when you can just run them all right after each other.
The default is EasyOCR which usually works fine. But what happens when we try it with this document?
page.apply_ocr()
text = page.extract_text()
print(text)
It does pretty well! The only issue is it gives me Durham's Pure Leaf Lardl instead of Durham's Pure Leaf Lard! I don't really need to know why, though, because I can just try some other engine! You can also fool around with the options - some of the lowest-hanging fruit is increasing the resolution of the OCR. The default at the moment is 150; you can try upping it to 300 for (potentially) better results.
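Why does resolution matter? It controls how many pixels the OCR engine actually gets to look at. A quick back-of-the-envelope sketch, assuming a US-letter page (the page size is my assumption for illustration):

```python
# Rendering resolution (dpi) controls how many pixels the OCR engine sees.
def render_size(dpi, width_in=8.5, height_in=11):
    """Pixel dimensions of a page rendered at a given dpi."""
    return int(width_in * dpi), int(height_in * dpi)

print(render_size(150))  # (1275, 1650)
print(render_size(300))  # (2550, 3300) - 4x the pixels of 150 dpi
```

Doubling the dpi quadruples the pixel count, which is why it often rescues marginal characters — at the cost of slower OCR.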
To fix this I'll both up the resolution and try another OCR engine (we have a bushel of them).
page.apply_ocr(engine='rapidocr', resolution=300)
#page.apply_ocr(engine='surya')
text = page.extract_text()
print(text)
Amazing! But what if it was very difficult? And neither EasyOCR nor RapidOCR nor the millions of other OCR tools can get the results?
Sometimes skipping our OCR-only tools and using an LLM is our last resort.
from natural_pdf import PDF
pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/needs-ocr.pdf")
page = pdf.pages[0]
page.extract_text()
import os
from openai import OpenAI
# Using Google Gemini (default)
client = OpenAI(
api_key=os.environ["GOOGLE_API_KEY"],
# api_key="YOUR_API_KEY_HERE",
base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)
page.apply_ocr(engine="vlm", model="gemini-2.5-flash", client=client)
# Or use OpenAI directly:
# client = OpenAI(
# api_key=os.environ["OPENAI_API_KEY"],
# # api_key="YOUR_API_KEY_HERE",
# )
# page.apply_ocr(engine="vlm", model="gpt-5", client=client)
# Or use OpenRouter for open-source models:
# client = OpenAI(
# api_key=os.environ["OPENROUTER_API_KEY"],
# # api_key="YOUR_API_KEY_HERE",
# base_url="https://openrouter.ai/api/v1"
# )
# page.apply_ocr(engine="vlm", model="qwen/qwen3-vl-8b-instruct", client=client)
print(page.extract_text())
page.find_all('text').show(width=700)
Sometimes you need exact bounding boxes, or the LLM is being unpredictable, or you need to zoom in, or you need... all sorts of things. An alternative to throwing everything over the fence is a two-step process:
detect_only=True so it detects the text, but doesn't try to read it. I find this works better than sending the entire page because there is less likelihood of hallucination.
from natural_pdf import PDF
pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/needs-ocr.pdf")
page = pdf.pages[0]
page.apply_ocr('surya', detect_only=True)
page.extract_text()
texts = page.find_all('text')[:5]
#texts.show(width=800)
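Once detection hands back boxes without text, something has to put the fragments into reading order when the page text gets assembled. A toy sketch of that idea — the `(x0, top, x1, bottom)` box format and the `line_tol` grouping are my assumptions for illustration, not natural-pdf internals:

```python
def reading_order(boxes, line_tol=5):
    """Sort (x0, top, x1, bottom) boxes top-to-bottom, then left-to-right,
    grouping boxes whose tops are within line_tol points into one line."""
    lines = []
    for box in sorted(boxes, key=lambda b: b[1]):   # scan by top edge
        for line in lines:
            if abs(line[0][1] - box[1]) <= line_tol:
                line.append(box)                    # same visual line
                break
        else:
            lines.append([box])                     # start a new line
    ordered = []
    for line in lines:
        ordered.extend(sorted(line, key=lambda b: b[0]))  # left-to-right
    return ordered

boxes = [(200, 12, 260, 24), (10, 10, 90, 24), (10, 40, 120, 54)]
print(reading_order(boxes))
# [(10, 10, 90, 24), (200, 12, 260, 24), (10, 40, 120, 54)]
```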
import os
from openai import OpenAI
instructions = """Return only the exact text content visible in the image.
Preserve original spelling, capitalization, punctuation, and symbols.
Fix misspellings if they are the result of blurry or incorrect OCR.
Do not add any explanatory text, translations, comments, or quotation marks around the result.
If you cannot process the image or do not see any text, return an empty space.
The text is from an inspection report of a slaughterhouse."""
# Could use all other sorts of text, too
# The text is likely from a Greek document, potentially a spreadsheet, containing Modern Greek words or numbers
client = OpenAI(
api_key=os.environ["GOOGLE_API_KEY"],
# api_key="YOUR_API_KEY_HERE",
base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)
(
page.find_all('text')
.apply_ocr(
engine="vlm",
model="gemini-2.5-flash-lite",
client=client,
instructions=instructions
)
)
What do we have now?
text = page.extract_text()
print(text)
When we used page.extract_table() last time, it was easy because there were all of these line elements on the page that pdfplumber could detect and say "hey, it's a table!" For the same reason that there's no real text on the page, there are also no real lines on the page. Instead, we're going to do a fun secret trick where we look at what horizontal and vertical coordinates seem like they might be lines by setting a threshold.
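The trick is easiest to picture on a toy bitmap: count the dark pixels in each row and column, and call anything above a threshold fraction a "line". This is just a conceptual sketch of the idea — natural-pdf's guides do the real work on the rendered page:

```python
# 0 = white, 1 = dark. A toy "table" with one horizontal and one vertical line.
grid = [
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],   # row 1 is dark all the way across -> horizontal line
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],   # column 2 is dark in every row -> vertical line
]

def line_positions(rows, threshold=0.8):
    """Indices of rows whose dark-pixel fraction meets the threshold."""
    return [i for i, row in enumerate(rows)
            if sum(row) / len(row) >= threshold]

cols = list(zip(*grid))  # transpose so columns can be scanned the same way
print(line_positions(grid))  # [1]
print(line_positions(cols))  # [2]
```

Lowering the threshold (like the 0.4 used below) lets broken or faint rules still count as lines.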
from natural_pdf import PDF
pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/needs-ocr.pdf")
page = pdf.pages[0]
page.apply_ocr()
page.show(width=800)
table_area = (
page
.find('text:contains(Violations)')
.below(
until='text:contains(Jungle)',
include_endpoint=False
)
)
table_area.show(crop=True)
Now we can add the lines and use them to detect the table.
guides = table_area.guides()
guides.vertical.from_lines(threshold=0.4)
guides.horizontal.from_lines()
guides.show()
guides.extract_table().to_df()
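Tables built from OCR'd text almost always need a little cleanup afterwards — stray whitespace, doubled spaces, the occasional misread character. A hedged pandas sketch on a made-up frame (the column names and values are invented for illustration):

```python
import pandas as pd

# A made-up sample of what an OCR'd table might come back looking like
df = pd.DataFrame({
    "Statute": ["4722.1 ", " 5.b"],
    "Description": ["Lardl not rendered ", "Floors  dirty"],
})

# Strip leading/trailing whitespace and collapse runs of spaces in every cell
cleaned = df.apply(lambda col: col.str.strip().str.replace(r"\s+", " ", regex=True))
print(cleaned)
```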
In a tiny preview of the next notebook: what about those checkboxes? Turns out we can use image classification AI to handle those for us!
page.detect_checkboxes()
page.find_all('region').show(crop=20)
guides.extract_table().to_df()