In [ ]:
# Install required packages
!pip install --upgrade --quiet 'natural-pdf[ai,export]>=0.5.0'

print('✓ Packages installed!')

Slides: slides.pdf

Let's ask questions

Time for some AI magic. We're using extractive question answering, which is different from an LLM because it pulls its answer directly from the page. LLMs are generative AI: they take your question and generate new text.

In [1]:
from natural_pdf import PDF

pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")
page = pdf.pages[0]
page.show(width=900)
Out[1]:
In [2]:
result = page.ask("What date was the inspection?")
result
Out[2]:
StructuredDataResult(answer='February 3, 1905', answer_confidence=0.9985750913619995)

Notice it has a confidence score, which makes life great. You can also use .show() to see where it's getting the answer from.

In [3]:
result.show()

But it also does a bad job a lot of the time! Sadly LLMs are a better option.

Let's see some of those bad approaches.

In [4]:
page.ask("Summary")
# page.ask("Summary", min_confidence=0.0)
Out[4]:
StructuredDataResult(answer='Worst of any, however, were the fertilizer men, and those who served in the cooking rooms.', answer_confidence=0.4899481534957886)
In [5]:
result = page.ask("How many violations were there?")
# result = page.ask("What was the violation count?")
result
Out[5]:
StructuredDataResult(answer='7.1.5', answer_confidence=0.9743162393569946)

We can also ask for multiple things at once, even though they might not be accurate. Maybe don't do this??? Just keep reading!

In [6]:
answers = page.ask(['violation count', 'site', 'location'])
answers
Out[6]:
[StructuredDataResult(answer='7', answer_confidence=0.9875915050506592),
 StructuredDataResult(answer='Durham’s Meatpacking Chicago, Ill.', answer_confidence=0.9918930530548096),
 StructuredDataResult(answer='Durham’s Meatpacking Chicago, Ill.', answer_confidence=0.6807746291160583)]
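
Since every result carries an answer_confidence, one easy pattern is to keep only high-confidence answers. A minimal sketch, using plain (answer, confidence) pairs copied from the output above as stand-ins for the real StructuredDataResult objects:

```python
# (answer, confidence) pairs copied from the output above,
# standing in for the real StructuredDataResult objects
answers = [
    ('7', 0.9876),
    ("Durham's Meatpacking Chicago, Ill.", 0.9919),
    ("Durham's Meatpacking Chicago, Ill.", 0.6808),
]

# Keep only answers above a confidence threshold
threshold = 0.8
confident = [answer for answer, confidence in answers if confidence >= threshold]
print(confident)
```

The same filter works on the real results by reading their answer_confidence attribute.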

There are better ways to extract structured data, though.

Structured data generation

LLMs perform far better than the .ask magic, though, especially when there's a bit of nuance in your question (or the answer), when you want output that isn't verbatim on the page, or when you need to piece together something complicated. It's worth the potential for hallucinations!

Below we're using Google's Gemini, which works here thanks to its OpenAI-compatible endpoint.

In [7]:
import os
from dotenv import load_dotenv

load_dotenv()

GOOGLE_API_KEY = os.environ.get("GOOGLE_API_KEY") or input("Enter GOOGLE_API_KEY: ")
In [8]:
from openai import OpenAI

# Initialize your LLM client
# Anything OpenAI-compatible works!
client = OpenAI(
    api_key=GOOGLE_API_KEY,
    # api_key="YOUR_API_KEY_HERE",
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"  # Changes based on what AI you're using
)

fields = ["site", "date", "violation count", "inspection service", "summary", "city", "full name of state"]
results = page.extract(fields, client=client, model="gemini-2.5-flash-lite")
results.to_dict()
Out[8]:
{'site': "Durham's Meatpacking Chicago, Ill.",
 'date': 'February 3, 1905',
 'violation_count': '7',
 'inspection_service': 'Jungle Health and Safety Inspection Service',
 'summary': "Worst of any, however, were the fertilizer men, and those who served in the cooking rooms. These people could not be shown to the visitor - for the odor of a fertilizer man would scare any ordinary visitor at a hundred yards, and as for the other men, who worked in tank rooms full of steam, and in some of which there were open vats near the level of the floor, their peculiar trouble was that they fell into the vats; and when they were fished out, there was never enough of them left to be worth exhibiting - sometimes they would be overlooked for days, till all but the bones of them had gone out to the world as Durham's Pure Leaf Lard!",
 'city': 'Chicago',
 'full_name_of_state': 'Illinois'}
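
Notice that the field names we passed in came back as snake_case dictionary keys ('violation count' → 'violation_count', 'full name of state' → 'full_name_of_state'). That normalization can be sketched like this (an illustration of the observed behavior, not natural-pdf's actual implementation):

```python
def to_key(field: str) -> str:
    # Lowercase the field name and replace spaces with underscores,
    # e.g. "violation count" -> "violation_count"
    return field.strip().lower().replace(' ', '_')

fields = ["site", "date", "violation count", "inspection service",
          "summary", "city", "full name of state"]
print([to_key(f) for f in fields])
```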

Confidence scores and citations

We get a few bonus treats, too: confidence scores and citations.

Interestingly enough, asking for confidence scores can decrease accuracy. The LLM makes them up, and because they make the prompt more complicated, accuracy tends to drop when you include them. I recommend skipping them unless you're paying for a more capable model.

Citations are great, though, especially when you'd like to be accountable and responsible.

In [13]:
from openai import OpenAI

# Initialize your LLM client
# Anything OpenAI-compatible works!
client = OpenAI(
    api_key=GOOGLE_API_KEY,
    # api_key="YOUR_API_KEY_HERE",
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"  # Changes based on what AI you're using
)

fields = ["site", "date", "violation count", "inspection service", "summary", "city", "state"]
results = page.extract(fields,
                       client=client,
                       model="gemini-2.5-flash",
                       confidence=True,
                       citations=True)
results
Out[13]:
StructuredDataResult(site="Durham's Meatpacking Chicago, Ill.", date='February 3, 1905', violation_count='7', ...)
In [14]:
results.to_dict()
Out[14]:
{'site': "Durham's Meatpacking Chicago, Ill.",
 'site_confidence': 5,
 'date': 'February 3, 1905',
 'date_confidence': 5,
 'violation_count': '7',
 'violation_count_confidence': 5,
 'inspection_service': 'Jungle Health and Safety Inspection Service',
 'inspection_service_confidence': 5,
 'summary': "Worst of any, however, were the fertilizer men, and those who served in the cooking rooms. These people could not be shown to the visitor - for the odor of a fertilizer man would scare any ordinary visitor at a hundred yards, and as for the other men, who worked in tank rooms full of steam, and in some of which there were open vats near the level of the floor, their peculiar trouble was that they fell into the vats; and when they were fished out, there was never enough of them left to be worth exhibiting - sometimes they would be overlooked for days, till all but the bones of them had gone out to the world as Durham's Pure Leaf Lard!",
 'summary_confidence': 5,
 'city': 'Chicago',
 'city_confidence': 5,
 'state': 'Ill.',
 'state_confidence': 5}
In [16]:
# remove the confidences with confidence=False
results.to_dict(confidence=False)
Out[16]:
{'site': "Durham's Meatpacking Chicago, Ill.",
 'date': 'February 3, 1905',
 'violation_count': '7',
 'inspection_service': 'Jungle Health and Safety Inspection Service',
 'summary': "Worst of any, however, were the fertilizer men, and those who served in the cooking rooms. These people could not be shown to the visitor - for the odor of a fertilizer man would scare any ordinary visitor at a hundred yards, and as for the other men, who worked in tank rooms full of steam, and in some of which there were open vats near the level of the floor, their peculiar trouble was that they fell into the vats; and when they were fished out, there was never enough of them left to be worth exhibiting - sometimes they would be overlooked for days, till all but the bones of them had gone out to the world as Durham's Pure Leaf Lard!",
 'city': 'Chicago',
 'state': 'Ill.'}

Easily see citations with .show()

In [17]:
results.show()
Out[17]:

Very intense structured data extraction

Instead of being kind of loose and free with what you want, you can also get MUCH fancier and write a Pydantic model. It sends not only the field names you want, but also little descriptions and type requirements: strings (text), integers, floats and more.

You can find more details here.
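
Before the natural-pdf version, here's what those descriptions and demands look like in plain Pydantic: Field attaches a description and validation constraints, and the model rejects data that doesn't satisfy them. (This is standard Pydantic, independent of natural-pdf; the field names are made up for illustration.)

```python
from pydantic import BaseModel, Field, ValidationError

class Violation(BaseModel):
    statute: str = Field(description="Statute number, e.g. 4.12.7")
    level: str = Field(description="Severity level")
    count: int = Field(ge=0, description="Must be a non-negative integer")

# Valid data parses fine, and the string "7" is coerced to the int 7
v = Violation(statute="4.12.7", level="Critical", count="7")
print(v.count)

# Data that breaks a constraint raises a ValidationError
try:
    Violation(statute="4.12.7", level="Critical", count=-1)
except ValidationError as e:
    print("rejected:", len(e.errors()), "error")
```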

In [18]:
import os
from pydantic import BaseModel, Field
from openai import OpenAI

# Initialize your LLM client
# Anything OpenAI-compatible works!
client = OpenAI(
    api_key=GOOGLE_API_KEY,
    # api_key="YOUR_API_KEY_HERE",
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

# Define your schema
class ReportInfo(BaseModel):
    inspection_number: str = Field(description="The main report identifier")
    inspection_date: str = Field(description="The date of the inspection")
    inspection_service: str = Field(description="Name of inspection service")
    site: str = Field(description="Name of company inspected")
    summary: str = Field(description="Visit summary")
    city: str
    state: str = Field(description="Full name of state")
    violation_count: int

# Extract data
# page.extract(schema=ReportInfo, client=client, model="gemini-2.5-flash-lite") 
result = page.extract(schema=ReportInfo, client=client, model="gemini-2.5-flash") 
In [22]:
result
Out[22]:
StructuredDataResult(inspection_number='INS-UP70N51NCL41R', inspection_date='February 3, 1905', inspection_service='Jungle Health and Safety Inspection Service', ...)

There are a handful of ways to access the results.

In [29]:
result.to_dict()['inspection_date']
Out[29]:
'February 3, 1905'
In [30]:
result['inspection_date'].value
Out[30]:
'February 3, 1905'
In [31]:
result.data.inspection_date
Out[31]:
'February 3, 1905'

Table extraction with LLMs

In the example below, we're saying "Using Gemini, provide a violations table - each row should have a statute, a description, a level, and whether the repeat checkbox is checked."

In [32]:
import os
from pydantic import BaseModel, Field
from openai import OpenAI
from typing import List, Literal

client = OpenAI(
    api_key=GOOGLE_API_KEY,
    # api_key="YOUR_API_KEY_HERE",
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

class ViolationsRow(BaseModel):
    statute: str
    description: str
    level: str
    repeat_checked: Literal["checked", "unchecked"] = Field(description="Whether the checkbox is checked or not")

class ViolationsTable(BaseModel):
    inspection_id: str
    violations: List[ViolationsRow]

result = page.extract(schema=ViolationsTable, client=client, model="gemini-2.5-flash") 
result
Out[32]:
StructuredDataResult(inspection_id='INS-UP70N51NCL41R', violations=[{'statute': '4.12.7', 'description': 'Unsanitary Working Conditions.', 'level': 'Critical', 'repeat_checked': 'unchecked'}, {'statute': '5.8.3', 'description': 'Inadequate Protective Equipment.', 'level': 'Serious', 'repeat_checked': 'unchecked'}, {'statute': '6.3.9', 'description': 'Ineffective Injury Prevention.', 'level': 'Serious', 'repeat_checked': 'unchecked'}, {'statute': '7.1.5', 'description': 'Failure to Properly Store Hazardous Materials.', 'level': 'Critical', 'repeat_checked': 'unchecked'}, {'statute': '8.9.2', 'description': 'Lack of Adequate Fire Safety Measures.', 'level': 'Serious', 'repeat_checked': 'unchecked'}, {'statute': '9.6.4', 'description': 'Inadequate Ventilation Systems.', 'level': 'Serious', 'repeat_checked': 'unchecked'}, {'statute': '10.2.7', 'description': 'Insufficient Employee Training for Safe Work Practices.', 'level': 'Serious', 'repeat_checked': 'unchecked'}])

Note that when we look below... it didn't do the checked/unchecked correctly!

In [33]:
import pandas as pd

violations = result.to_dict()['violations']
pd.DataFrame(violations)
Out[33]:
statute description level repeat_checked
0 4.12.7 Unsanitary Working Conditions. Critical unchecked
1 5.8.3 Inadequate Protective Equipment. Serious unchecked
2 6.3.9 Ineffective Injury Prevention. Serious unchecked
3 7.1.5 Failure to Properly Store Hazardous Materials. Critical unchecked
4 8.9.2 Lack of Adequate Fire Safety Measures. Serious unchecked
5 9.6.4 Inadequate Ventilation Systems. Serious unchecked
6 10.2.7 Insufficient Employee Training for Safe Work P... Serious unchecked

Figuring out how to manage those pesky checkboxes

In [34]:
from natural_pdf import PDF

pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")
page = pdf.pages[0]
page.show(width=500)
Out[34]:

We can use .extract_table() no problem to get most of the columns, but we really really want those checkboxes!

In [35]:
import pandas as pd

df = page.extract_table().to_df()
df
Out[35]:
Statute Description Level Repeat?
0 4.12.7 Unsanitary Working Conditions. Critical <NA>
1 5.8.3 Inadequate Protective Equipment. Serious <NA>
2 6.3.9 Ineffective Injury Prevention. Serious <NA>
3 7.1.5 Failure to Properly Store Hazardous Materials. Critical <NA>
4 8.9.2 Lack of Adequate Fire Safety Measures. Serious <NA>
5 9.6.4 Inadequate Ventilation Systems. Serious <NA>
6 10.2.7 Insufficient Employee Training for Safe Work P... Serious <NA>

This used to be more complicated, but now all we need to do is page.detect_checkboxes() and we're good to go!

In [36]:
page.detect_checkboxes()
page.find_all('region[type=checkbox]').show(crop='wide')
Out[36]:
In [37]:
df = page.extract_table().to_df()
df
Out[37]:
Statute Description Level Repeat?
0 4.12.7 Unsanitary Working Conditions. Critical [CHECKED]
1 5.8.3 Inadequate Protective Equipment. Serious [CHECKED]
2 6.3.9 Ineffective Injury Prevention. Serious [UNCHECKED]
3 7.1.5 Failure to Properly Store Hazardous Materials. Critical [UNCHECKED]
4 8.9.2 Lack of Adequate Fire Safety Measures. Serious [UNCHECKED]
5 9.6.4 Inadequate Ventilation Systems. Serious [CHECKED]
6 10.2.7 Insufficient Employee Training for Safe Work P... Serious [UNCHECKED]

If we wanted things to be exceptionally complicated (or if checkbox detection doesn't work), we could also go from rect to rect, seeing whether there's a line inside.

In [38]:
(
    page
    .find(text='Violations')
    .below()
    .find_all('rect')
    .apply(lambda box: 'yes' if box.find('line') else 'no')
)
Out[38]:
['yes', 'yes', 'no', 'no', 'no', 'yes', 'no']
In [39]:
df['repeat'] = (
    page
    .find(text='Violations')
    .below()
    .find_all('rect')
    .apply(lambda box: 'yes' if box.find('line') else 'no')
)
df
Out[39]:
Statute Description Level Repeat? repeat
0 4.12.7 Unsanitary Working Conditions. Critical [CHECKED] yes
1 5.8.3 Inadequate Protective Equipment. Serious [CHECKED] yes
2 6.3.9 Ineffective Injury Prevention. Serious [UNCHECKED] no
3 7.1.5 Failure to Properly Store Hazardous Materials. Critical [UNCHECKED] no
4 8.9.2 Lack of Adequate Fire Safety Measures. Serious [UNCHECKED] no
5 9.6.4 Inadequate Ventilation Systems. Serious [CHECKED] yes
6 10.2.7 Insufficient Employee Training for Safe Work P... Serious [UNCHECKED] no
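
If you'd rather have real booleans than 'yes'/'no' strings, a standard pandas .map cleans that up (plain pandas, nothing natural-pdf specific; the small DataFrame here stands in for the one built above):

```python
import pandas as pd

# 'yes'/'no' values like the repeat column built above
df = pd.DataFrame({
    'statute': ['4.12.7', '5.8.3', '6.3.9'],
    'repeat': ['yes', 'yes', 'no'],
})

# Map the strings onto booleans
df['repeat'] = df['repeat'].map({'yes': True, 'no': False})
print(df['repeat'].tolist())
```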

Putting things in categories

We used to have to use fancy computer vision to detect whether something was a checked box or not. No longer! But we can still use it for other things, like saying "what kind of PDF am I looking at?"

Categorizing an entire PDF

In [40]:
from natural_pdf import PDF

pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")
page = pdf.pages[0]
page.show(width=500)
Out[40]:

What can we classify the entire PDF as? Maybe a... slaughterhouse report? A dolphin training manual? Something about basketball or birding?

In [41]:
pdf.classify(['slaughterhouse report', 'dolphin training manual', 'basketball', 'birding'], using='text')
pdf.category
Out[41]:
'slaughterhouse report'
In [42]:
pdf.category_confidence
Out[42]:
0.8695738911628723

Classifying pages of a PDF

Let's take a look at a document from the CIA investigating whether you can use pigeons as spies.

In [43]:
from natural_pdf import PDF

pdf = PDF("https://github.com/jsoma/ire25-natural-pdf/raw/refs/heads/main/cia-doc.pdf")
pdf.pages.show(cols=6)
Out[43]:

Just like we did above, we can ask what category we think the PDF belongs to.

In [44]:
pdf.classify(['slaughterhouse report', 'dolphin training manual', 'basketball', 'birding'], using='text')
(pdf.category, pdf.category_confidence)
Out[44]:
('birding', 0.5170512795448303)

But notice how all of the pages look very very different: we can also categorize each page using vision.

In [45]:
pdf.classify_pages(['diagram', 'text', 'invoice', 'blank'], using='vision')

for page in pdf.pages:
    print(f"Page {page.number} is {page.category} - {page.category_confidence:0.3}")
Page 1 is invoice - 0.508
Page 2 is text - 0.968
Page 3 is text - 0.953
Page 4 is diagram - 0.903
Page 5 is diagram - 0.907
Page 6 is invoice - 0.969
Page 7 is text - 0.887
Page 8 is invoice - 0.79
Page 9 is invoice - 0.975
Page 10 is invoice - 0.984
Page 11 is invoice - 0.994
Page 12 is invoice - 0.987
Page 13 is text - 0.88
Page 14 is text - 0.928
Page 15 is diagram - 0.927
Page 16 is text - 0.82
Page 17 is invoice - 0.947
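
To tally how many pages landed in each category, collections.Counter works nicely. The list here is copied from the printout above, standing in for iterating the pages directly:

```python
from collections import Counter

# Page categories copied from the output above
categories = ['invoice', 'text', 'text', 'diagram', 'diagram', 'invoice',
              'text', 'invoice', 'invoice', 'invoice', 'invoice', 'invoice',
              'text', 'text', 'diagram', 'text', 'invoice']

print(Counter(categories))
```

With the real objects you'd write Counter(page.category for page in pdf.pages) instead.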

And if we just want to see the pages that are diagrams, we can .filter for them.

In [46]:
(
    pdf.pages
    .filter(lambda page: page.category == 'diagram')
    .show(show_category=True)
)
Out[46]:

And if that's all we're interested in? We can save a new PDF of just those pages!

In [47]:
(
    pdf.pages
    .filter(lambda page: page.category == 'diagram')
    .save_pdf("diagrams.pdf", original=True)
)