Download notebook

In [ ]:

# Install required packages
!pip install --upgrade --quiet 'natural_pdf>=0.6.4'

print('✓ Packages installed!')

Installing Natural PDF¶

There are a LOT of possible extras (a lot of them AI-flavored) inside of Natural PDF, but we'll start by just installing the basics. You use "natural_pdf[all]" if you want everything.

Opening a PDF¶

We'll start by opening a PDF.

You can use a PDF on your own computer, or you can use one from a URL. I'll start by using one from a URL to make everything a bit easier.

In [1]:

from natural_pdf import PDF

pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")
pdf

Out[1]:

<PDF source='https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf' pages=1>

You can find the pages of the pdf under pdf.pages, let's grab the first one.

In [2]:

page = pdf.pages[0]
page

Out[2]:

<Page number=1 index=0>

Pretty boring so far, eh? Let's take a look at the page itself. I'm setting the width since there's a good chance I'm going to be showing this on a projector.

In [3]:

page.show(width=800)

Out[3]:

No description has been provided for this image

Incredible!!! Congratulations, you've opened your first PDF with Natural PDF.

Grabbing page text¶

Most of the time when we're working with PDFs you're interested in the text on the page.

In [4]:

text = page.extract_text()
text = page.extract_text(layout=True)
text
print(text)

                                                                                    
                                                                                    
                                                                                    
                                                     Jungle Health and Safety Inspection Service
                                                     INS-UP70N51NCL41R              
                                                                                    
       Site: Durham’s Meatpacking Chicago, Ill.                                     
                                                                                    
       Date: February 3, 1905                                                       
                                                                                    
       Violation Count: 7                                                           
       Summary: Worst of any, however, were the fertilizer men, and those who served in the cooking rooms.
       These people could not be shown to the visitor - for the odor of a fertilizer man would scare any ordinary
                                                                                    
       visitor at a hundred yards, and as for the other men, who worked in tank rooms full of steam, and in
       some of which there were open vats near the level of the floor, their peculiar trouble was that they fell
       into the vats; and when they were fished out, there was never enough of them left to be worth
       exhibiting - sometimes they would be overlooked for days, till all but the bones of them had gone out
       to the world as Durham’s Pure Leaf Lard!                                     
                                                                                    
                                                                                    
                                                                                    
                                                                                    
                                                                                    
                                                                                    
                                                                                    
                                                                                    
                                                                                    
                                                                                    
       Violations                                                                   
                                                                                    
        Statute Description                                    Level  Repeat?       
        4.12.7 Unsanitary Working Conditions.                  Critical             
                                                                                    
        5.8.3 Inadequate Protective Equipment.                 Serious              
        6.3.9 Ineffective Injury Prevention.                   Serious              
                                                                                    
        7.1.5 Failure to Properly Store Hazardous Materials.   Critical             
        8.9.2 Lack of Adequate Fire Safety Measures.           Serious              
                                                                                    
        9.6.4 Inadequate Ventilation Systems.                  Serious              
        10.2.7 Insufficient Employee Training for Safe Work Practices. Serious      
                                                                                    
                                                                                    
                                                                                    
                                                                                    
                                                                                    
                                                                                    
                                                                                    
                                                                                    
                                                                                    
                                                                                    
                                                                                    
                                                                                    
                                                                                    
                                                                                    
                                                                                    
                                                                                    
                                                                                    
                                Jungle Health and Safety Inspection Service

layout=True is a useful addition if you want to see a text-only representation of the page, and sometimes it helps with data extraction.

Selecting elements and grabbing specific text¶

You rarely want all of the text, though. How would you describe the INS-UP70N51NCL41R text?

It's in a box
It's the second text on a page
It's red
It starts with "INS"

Selecting objects: "It's in the box"¶

In [5]:

page.find('rect').show(crop='wide')

Out[5]:

In [6]:

page.find(
    'text:bold:contains(Summary)'
).below(
    include_source=True,
    until='line'
).show()

Out[6]:

Selecting multiple objects: "It's the second piece of text"¶

In [7]:

page.find_all('text').show()

Out[7]:

In [8]:

texts = page.find_all('text').extract_each_text()

texts[:5]

Out[8]:

['Jungle Health and Safety Inspection Service',
 'INS-UP70N51NCL41R',
 'Site:',
 'Durham’s Meatpacking',
 'Chicago, Ill.']

In [9]:

texts[1]

Out[9]:

'INS-UP70N51NCL41R'

Finding by attributes: "It's the red text"¶

In [10]:

red_text = page.find('text[color~=red]')
red_text.show(crop=True)

Out[10]:

In [11]:

red_text.extract_text()

Out[11]:

'INS-UP70N51NCL41R'

Searching by text: "It starts with INS-"¶

In [12]:

text = page.find('text:contains("INS-")')
# text = page.find('text:starts-with("INS-")')
text.show(crop=True)

Out[12]:

In [13]:

text.extract_text()

Out[13]:

'INS-UP70N51NCL41R'

What about "Chicago, Ill."? It's grey, so...

In [14]:

page.find("text[color~=grey]")

Out[14]:

<TextElement text='Chicago, I...' font='Helvetica' size=10.0 bbox=(182.26000000000002, 84.07000000000005, 234.50000000000003, 94.07000000000005)>

Learning about the page¶

How do we know what's on the page? page.describe() can help!

In [15]:

page.describe()

Out[15]:

Page 1 Summary

Page Info:

page number: 1
dimensions: 612 x 792 pts

Overview:

total elements: 73
type breakdown: Word: 44, Line: 21, Rect: 8

Word:

typography:
fonts:
- Helvetica: 44
sizes:
- 10.0pt: 40
- 8.0pt: 3
- 12.0pt: 1
styles: 9 bold
colors:
- black: 42
- other: 2

Rect:

size stats:
width range: 8-180
height range: 8-35
avg area: 844 sq pts
styles:
stroke: 8
fill: 8
stroke widths:
- 0.5: 7
colors:
- #000000: 8

Line:

length stats:
min: 11
max: 500
avg: 279
line widths:
0.5: 6
2.0: 1
orientations:
horizontal: 10
vertical: 5
diagonal: 6
colors:
#808080: 14
#000000: 7

In [16]:

page.find_all('text').inspect()

Out[16]:

Collection Inspection (44 elements)

Word Elements

text	x0	top	x1	bottom	font_family	size	styles	source	confidence	color
Jungle Health and Safety Inspection Service	385	36	542	44	Helvetica	8		native	1.00	#000000
INS-UP70N51NCL41R	385	46	466	54	Helvetica	8		native	1.00	#ff0000
Site:	50	84	74	94	Helvetica	10	bold	native	1.00	#000000
Durham’s Meatpacking	74	84	182	94	Helvetica	10		native	1.00	#000000
Chicago, Ill.	182	84	235	94	Helvetica	10		native	1.00	#808080
Date:	50	104	81	114	Helvetica	10	bold	native	1.00	#000000
February 3, 1905	81	104	157	114	Helvetica	10		native	1.00	#000000
Violation Count:	50	124	130	134	Helvetica	10	bold	native	1.00	#000000
7	130	124	136	134	Helvetica	10		native	1.00	#000000
Summary:	50	144	102	154	Helvetica	10	bold	native	1.00	#000000
Worst of any, however, were the fertilizer men, an...	102	144	506	154	Helvetica	10		native	1.00	#000000
These people could not be shown to the visitor - f...	50	160	512	170	Helvetica	10		native	1.00	#000000
visitor at a hundred yards, and as for the other m...	50	176	491	186	Helvetica	10		native	1.00	#000000
some of which there were open vats near the level ...	50	192	496	202	Helvetica	10		native	1.00	#000000
into the vats; and when they were fished out, ther...	50	208	465	218	Helvetica	10		native	1.00	#000000
exhibiting - sometimes they would be overlooked fo...	50	224	492	234	Helvetica	10		native	1.00	#000000
to the world as Durham’s Pure Leaf Lard!	50	240	232	250	Helvetica	10		native	1.00	#000000
Violations	50	372	107	384	Helvetica	12	bold, strike	native	1.00	#000000
Statute	55	398	89	408	Helvetica	10	bold	native	1.00	#000000
Description	105	398	160	408	Helvetica	10	bold	native	1.00	#000000
Level	455	398	481	408	Helvetica	10	bold	native	1.00	#000000
Repeat?	505	398	544	408	Helvetica	10	bold	native	1.00	#000000
4.12.7	55	418	83	428	Helvetica	10		native	1.00	#000000
Unsanitary Working Conditions.	105	418	245	428	Helvetica	10		native	1.00	#000000
Critical	455	418	486	428	Helvetica	10		native	1.00	#000000
5.8.3	55	438	77	448	Helvetica	10		native	1.00	#000000
Inadequate Protective Equipment.	105	438	256	448	Helvetica	10		native	1.00	#000000
Serious	455	438	489	448	Helvetica	10		native	1.00	#000000
6.3.9	55	458	77	468	Helvetica	10		native	1.00	#000000
Ineffective Injury Prevention.	105	458	231	468	Helvetica	10		native	1.00	#000000
Showing 30 of 44 elements (pass a higher limit to see more)

Let's find the largest text that's also Helvetica

In [17]:

page.find_all('text[size=max()][font_family=Helvetica]').show(crop=50)

Out[17]:

What else is on the page that we can extract? How about the date? We want to find Date: and grab everything to the right of it.

In [18]:

# page.find(text="Date").show()
page.find(text="Date").right().extract_text()

Out[18]:

'February 3, 1905'

And the site? We want to grab 'site', then keep going right until we see a piece of text.

In [19]:

site = (
    page
    .find(text="Site")
    .right(until='text')
)
# site.show(crop=True)
site.endpoint.extract_text()

Out[19]:

'Durham’s Meatpacking'

How about Violation Count?

In [20]:

(
    page
    .find(text="Violation Count")
    .right(height='element')
    .extract_text()
)

Out[20]:

'7'

The Summary is a little bit more difficult. How would you describe where it is?

In [21]:

(
    page
    .find(text="Summary")
    .right(height='element')
    .extract_text()
)

Out[21]:

'Worst of any, however, were the fertilizer men, and those who served in the cooking rooms.'

In [22]:

summary = (
    page
    .find(text="Summary")
    .expand(bottom='line', right=True, left=True)
)
summary = (
    page
    .find(text="Summary")
    .below(until='line', include_source=True)
)

# summary.show()
# summary.extract_text(newlines=False)
summary.extract_text(newlines=False)

Out[22]:

'Summary: Worst of any, however, were the fertilizer men, and those who served in the cooking rooms. These people could not be shown to the visitor - for the odor of a fertilizer man would scare any ordinary visitor at a hundred yards, and as for the other men, who worked in tank rooms full of steam, and in some of which there were open vats near the level of the floor, their peculiar trouble was that they fell into the vats; and when they were fished out, there was never enough of them left to be worth exhibiting - sometimes they would be overlooked for days, till all but the bones of them had gone out to the world as Durham’s Pure Leaf Lard!'

Grabbing tables¶

Everyone loves extracting tables from PDFs! You can do that here: just do page.extract_table(). Easy!!!

In [23]:

table = page.extract_table()
table

Out[23]:

TableResult(rows=8…)

In [24]:

table.to_df()

Out[24]:

	Statute	Description	Level	Repeat?
0	4.12.7	Unsanitary Working Conditions.	Critical	<NA>
1	5.8.3	Inadequate Protective Equipment.	Serious	<NA>
2	6.3.9	Ineffective Injury Prevention.	Serious	<NA>
3	7.1.5	Failure to Properly Store Hazardous Materials.	Critical	<NA>
4	8.9.2	Lack of Adequate Fire Safety Measures.	Serious	<NA>
5	9.6.4	Inadequate Ventilation Systems.	Serious	<NA>
6	10.2.7	Insufficient Employee Training for Safe Work P...	Serious	<NA>

What about a page with multiple tables?

In most PDF processing libraries you just say, "give me all of the tables!" and then figure out which one you want. In Natural PDF, the proper way to do it is find the area you know the table is in and extract it alone.

In [25]:

# Start from the bold, big text that says "Violations" and header down to the smallest text
(
    page.find('text[size=max()]:bold:contains("Violations")').below(
        until='text[size=min()]',
        include_endpoint=False
    )
    .trim()
).show(crop=True)

Out[25]:

In [26]:

# Start from the bold, big text that says "Violations" and header down to the smallest text
(
    page.find('text[size=max()]:bold:contains("Violations")').below(
        until='text[size=min()]',
        include_endpoint=False
    )
    .trim()
).extract_table().to_df()

Out[26]:

	Statute	Description	Level	Repeat?
0	4.12.7	Unsanitary Working Conditions.	Critical	<NA>
1	5.8.3	Inadequate Protective Equipment.	Serious	<NA>
2	6.3.9	Ineffective Injury Prevention.	Serious	<NA>
3	7.1.5	Failure to Properly Store Hazardous Materials.	Critical	<NA>
4	8.9.2	Lack of Adequate Fire Safety Measures.	Serious	<NA>
5	9.6.4	Inadequate Ventilation Systems.	Serious	<NA>
6	10.2.7	Insufficient Employee Training for Safe Work P...	Serious	<NA>

And if we want the checkboxes? We can detect them with .detect_checkboxes()! I trained a completely custom YOLO model to do this!

In [28]:

page.detect_checkboxes()
page.find_all('checkbox').show(crop=20)

Out[28]:

And now we'll get nice results when we do the .extract_table().

In [29]:

# Start from the bold, big text that says "Violations" and header down to the smallest text
(
    page.find('text[size=max()]:bold:contains("Violations")').below(
        until='text[size=min()]',
        include_endpoint=False
    )
    .trim()
).extract_table().to_df()

Out[29]:

	Statute	Description	Level	Repeat?
0	4.12.7	Unsanitary Working Conditions.	Critical	[CHECKED]
1	5.8.3	Inadequate Protective Equipment.	Serious	[CHECKED]
2	6.3.9	Ineffective Injury Prevention.	Serious	[UNCHECKED]
3	7.1.5	Failure to Properly Store Hazardous Materials.	Critical	[UNCHECKED]
4	8.9.2	Lack of Adequate Fire Safety Measures.	Serious	[UNCHECKED]
5	9.6.4	Inadequate Ventilation Systems.	Serious	[CHECKED]
6	10.2.7	Insufficient Employee Training for Safe Work P...	Serious	[UNCHECKED]

Ignoring text with exclusion zones¶

What if we have like two hundred of these forms, and they all look the same, and all we want is the top, text-y part?

Instead of writing code about what we want, we can also write code about what we don't want. These are called exclusion zones.

In [30]:

from natural_pdf import PDF

pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")
page = pdf.pages[0]

In [31]:

text = page.extract_text()
print(text)

Jungle Health and Safety Inspection Service
INS-UP70N51NCL41R
Site: Durham’s Meatpacking Chicago, Ill.
Date: February 3, 1905
Violation Count: 7
Summary: Worst of any, however, were the fertilizer men, and those who served in the cooking rooms.
These people could not be shown to the visitor - for the odor of a fertilizer man would scare any ordinary
visitor at a hundred yards, and as for the other men, who worked in tank rooms full of steam, and in
some of which there were open vats near the level of the floor, their peculiar trouble was that they fell
into the vats; and when they were fished out, there was never enough of them left to be worth
exhibiting - sometimes they would be overlooked for days, till all but the bones of them had gone out
to the world as Durham’s Pure Leaf Lard!
Violations
Statute Description Level Repeat?
4.12.7 Unsanitary Working Conditions. Critical
5.8.3 Inadequate Protective Equipment. Serious
6.3.9 Ineffective Injury Prevention. Serious
7.1.5 Failure to Properly Store Hazardous Materials. Critical
8.9.2 Lack of Adequate Fire Safety Measures. Serious
9.6.4 Inadequate Ventilation Systems. Serious
10.2.7 Insufficient Employee Training for Safe Work Practices. Serious
Jungle Health and Safety Inspection Service

In [32]:

top = page.region(top=0, left=0, height=80)
bottom = page.find("line[width>=2]").below()
(top + bottom).show()

Out[32]:

In [33]:

page.add_exclusion(top)
page.add_exclusion(bottom)

page.show(exclusions='red')

Out[33]:

In [34]:

text = page.extract_text()
print(text)

Site: Durham’s Meatpacking Chicago, Ill.
Date: February 3, 1905
Violation Count: 7
Summary: Worst of any, however, were the fertilizer men, and those who served in the cooking rooms.
These people could not be shown to the visitor - for the odor of a fertilizer man would scare any ordinary
visitor at a hundred yards, and as for the other men, who worked in tank rooms full of steam, and in
some of which there were open vats near the level of the floor, their peculiar trouble was that they fell
into the vats; and when they were fished out, there was never enough of them left to be worth
exhibiting - sometimes they would be overlooked for days, till all but the bones of them had gone out
to the world as Durham’s Pure Leaf Lard!

Any time there is recurring text - headers, footers, even stamps on the page you want to ignore, you can just add them as an exclusion.

It's also possible to add exclusions across multiple pages. In the example below, every time you load a new page up it applies the PDF-level exclusion on it. Write it once, be done with it forever!

In [35]:

pdf.add_exclusion(lambda page: page.region(top=0, left=0, height=80))
pdf.add_exclusion(lambda page: page.find("line[width>=2]").below())

Out[35]:

<PDF source='https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf' pages=1>

Next steps¶

What about when the text isn't so easy to access? Time to move on to our next notebook!