# Install required packages
!pip install --upgrade --quiet 'natural-pdf[ai,export]>=0.5.0'
print('✓ Packages installed!')
Slides: slides.pdf
Time for some AI magic. We're using extractive question answering, which is different from an LLM: it pulls its answer directly from the text on the page. LLMs are generative AI, which take your question and generate brand-new text.
from natural_pdf import PDF
pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")
page = pdf.pages[0]
page.show(width=900)
result = page.ask("What date was the inspection?")
result
Notice it has a confidence score, which makes life great. You can also use .show() to see where it's getting the answer from.
result.show()
By default, it won't return answers it doesn't have much faith in. Let's ask for the Summary.
page.ask("Summary")
page.ask("Summary", min_confidence=0.0)
That does NOT mean it's always accurate, though. Using the exact words that appear on the page makes it a lot easier for the model. How should we ask about the number of violations?
#result = page.ask("How many violations were there?")
result = page.ask("What was the violation count?")
result
We can also ask for multiple things at once.
answers = page.ask(['violation count', 'site', 'location'])
answers
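Since each answer comes with a confidence score, a common follow-up is to keep only the trustworthy ones. A minimal sketch, using plain dicts as stand-ins for the library's actual result objects (the questions and values below are invented):

```python
# Hypothetical shape: in reality page.ask() returns the library's own
# result objects; these plain dicts are stand-ins for illustration.
answers = [
    {"question": "violation count", "answer": "3", "confidence": 0.91},
    {"question": "site", "answer": "Durham Agricultural", "confidence": 0.88},
    {"question": "location", "answer": "unknown", "confidence": 0.12},
]

# Keep only answers above a confidence threshold, keyed by question
trusted = {a["question"]: a["answer"] for a in answers if a["confidence"] >= 0.5}
print(trusted)
```

The 0.5 cutoff is arbitrary; tune it against a few pages you've checked by hand.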
There are better ways to extract structured data, though.
LLMs simply perform better than the .ask magic, though, especially when there's a bit of nuance in your question (or the answer): you want it to write things that aren't on the page, or piece together something complicated. Sometimes that's worth the potential for hallucinations!
Below we're using Google's Gemini, thanks to its OpenAI-compatible endpoint.
import os
from openai import OpenAI
# Initialize your LLM client
# Anything OpenAI-compatible works!
client = OpenAI(
api_key=os.environ["GOOGLE_API_KEY"],
# api_key="YOUR_API_KEY_HERE",
base_url="https://generativelanguage.googleapis.com/v1beta/openai/" # Changes based on what AI you're using
)
fields = ["site", "date", "violation count", "inspection service", "summary", "city", "full name of state"]
results = page.extract(fields, client=client, model="gemini-2.5-flash-lite")
results.to_dict()
We get a few bonus treats, too: confidence scores and citations.
Interestingly enough, requesting confidence scores decreases accuracy. The LLM makes the scores up, and because they make the prompt so much more complicated, accuracy drops when you include them. I recommend skipping them unless you're paying for a more expensive model.
Citations are great, though, especially when you'd like to be accountable and responsible.
import os
from openai import OpenAI
# Initialize your LLM client
# Anything OpenAI-compatible works!
client = OpenAI(
api_key=os.environ["GOOGLE_API_KEY"],
# api_key="YOUR_API_KEY_HERE",
base_url="https://generativelanguage.googleapis.com/v1beta/openai/" # Changes based on what AI you're using
)
fields = ["site", "date", "violation count", "inspection service", "summary", "city", "state"]
results = page.extract(fields,
client=client,
model="gemini-2.5-flash")
results.to_dict()
# trim the output: citations=False drops the citations, confidence=False would drop the scores
results.to_dict(confidence=True, citations=False)
Easily see the citations with .show()
results.show()
Instead of being kind of loose and free with what you want, you can also get MUCH fancier and write a Pydantic model. It sends not only the field names you want, but also little descriptions and type requirements: strings (text), integers, floats and more.
You can find more details here.
import os
from pydantic import BaseModel, Field
from openai import OpenAI
# Initialize your LLM client
# Anything OpenAI-compatible works!
client = OpenAI(
api_key=os.environ["GOOGLE_API_KEY"],
# api_key="YOUR_API_KEY_HERE",
base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)
# Define your schema
class ReportInfo(BaseModel):
inspection_number: str = Field(description="The main report identifier")
inspection_date: str = Field(description="Date of the inspection")
inspection_service: str = Field(description="Name of inspection service")
site: str = Field(description="Name of company inspected")
summary: str = Field(description="Visit summary")
city: str
state: str = Field(description="Full name of state")
violation_count: int
# Extract data
# page.extract(schema=ReportInfo, client=client, model="gemini-2.5-flash-lite")
page.extract(schema=ReportInfo, client=client, model="gemini-2.5-flash")
page.extracted()
dict(page.extracted())
page.extracted('inspection_date')
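One thing the Pydantic route buys you is type validation on the way in. A minimal standalone sketch of that behavior, using a trimmed-down model with made-up values (not from the PDF):

```python
from pydantic import BaseModel, ValidationError

# A trimmed-down stand-in for the ReportInfo model above;
# the values below are invented for illustration.
class MiniReport(BaseModel):
    site: str
    violation_count: int

# Pydantic coerces the string "3" into the integer 3...
report = MiniReport(site="Durham Agricultural Inspection", violation_count="3")
print(report.violation_count)

# ...but rejects values it can't turn into an int
try:
    MiniReport(site="Durham", violation_count="three")
except ValidationError:
    print("rejected non-numeric count")
```

So if the LLM returns "3" as text, you still end up with a real integer, and garbage answers fail loudly instead of sneaking into your data.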
In the example below, we're saying "Using Gemini, provide a violations table - each row should have a statute, a description, a level, and whether the repeat checkbox is checked."
import os
from pydantic import BaseModel, Field
from openai import OpenAI
from typing import List, Literal
client = OpenAI(
api_key=os.environ["GOOGLE_API_KEY"],
# api_key="YOUR_API_KEY_HERE",
base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)
class ViolationsRow(BaseModel):
statute: str
description: str
level: str
repeat_checked: Literal["checked", "unchecked"] = Field(description="Whether the checkbox is checked or not")
class ViolationsTable(BaseModel):
inspection_id: str
violations: List[ViolationsRow]
page.extract(schema=ViolationsTable, client=client, model="gemini-2.5-flash")
Note that when we look below... it didn't do the checked/unchecked correctly!
import pandas as pd
data = page.extracted()
pd.DataFrame(data.model_dump()['violations'])
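To be clear about what went wrong: the Literal constraint guarantees the model can only answer with one of the two allowed strings, but it can't guarantee the model picks the *right* one. A standalone sketch of what Literal does and doesn't enforce (values invented):

```python
from typing import Literal
from pydantic import BaseModel, ValidationError

class CheckboxRow(BaseModel):
    repeat_checked: Literal["checked", "unchecked"]

# Allowed values pass through untouched
row = CheckboxRow(repeat_checked="checked")
print(row.repeat_checked)

# Anything outside the Literal is rejected outright
try:
    CheckboxRow(repeat_checked="yes")
except ValidationError:
    print("rejected out-of-vocabulary value")
```

That's why visually subtle things like checkboxes are worth verifying against the page itself, which is exactly what we do next.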
from natural_pdf import PDF
pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")
page = pdf.pages[0]
page.show(width=500)
We can use .extract_table() no problem to get most of the columns, but we really really want those checkboxes!
import pandas as pd
df = page.extract_table().to_df()
df
This used to be more complicated, but now all we need to do is page.detect_checkboxes() and we're good to go!
page.detect_checkboxes()
page.find_all('region[type=checkbox]').show(crop='wide')
df = page.extract_table().to_df()
df
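Once the table is in a DataFrame, saving it for later is one line of pandas. A sketch with a stand-in DataFrame (the rows below are invented, not from the PDF):

```python
import pandas as pd

# A stand-in DataFrame shaped like the extracted violations table;
# the rows here are invented for illustration.
df = pd.DataFrame({
    "statute": ["3-301.11", "5-202.12"],
    "description": ["Bare-hand contact", "Handwashing sink water temp"],
    "level": ["Priority", "Core"],
})
df.to_csv("violations.csv", index=False)

# Round-trip to confirm nothing was lost
reloaded = pd.read_csv("violations.csv")
print(len(reloaded))
```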
If we wanted things to be exceptionally complicated (or if checkbox detection doesn't work), we could also go from rect to rect, seeing whether there's a line inside.
(
page
.find(text='Violations')
.below()
.find_all('rect')
.apply(lambda box: 'yes' if box.find('line') else 'no')
)
df['repeat'] = (
page
.find(text='Violations')
.below()
.find_all('rect')
.apply(lambda box: 'yes' if box.find('line') else 'no')
)
df
from natural_pdf import PDF
pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")
page = pdf.pages[0]
page.show(width=500)
What can we classify the entire PDF as? Maybe a... slaughterhouse report? A dolphin training manual? Something about basketball or birding?
pdf.classify(['slaughterhouse report', 'dolphin training manual', 'basketball', 'birding'], using='text')
pdf.category
pdf.category_confidence
Let's take a look at a document from the CIA investigating whether you can use pigeons as spies.
from natural_pdf import PDF
pdf = PDF("https://github.com/jsoma/ire25-natural-pdf/raw/refs/heads/main/cia-doc.pdf")
pdf.pages.show(cols=6)
Just like we did above, we can ask what category we think the PDF belongs to.
pdf.classify(['slaughterhouse report', 'dolphin training manual', 'basketball', 'birding'], using='text')
(pdf.category, pdf.category_confidence)
But notice how all of the pages look very very different: we can also categorize each page using vision.
pdf.classify_pages(['diagram', 'text', 'invoice', 'blank'], using='vision')
for page in pdf.pages:
print(f"Page {page.number} is {page.category} - {page.category_confidence:0.3}")
And if we just want to see the pages that are diagrams, we can .filter for them.
(
pdf.pages
.filter(lambda page: page.category == 'diagram')
.show(show_category=True)
)
And if that's all we're interested in? We can save a new PDF of just those pages!
(
pdf.pages
.filter(lambda page: page.category == 'diagram')
.save_pdf("diagrams.pdf", original=True)
)