# Install required packages
!pip install --upgrade --quiet 'natural_pdf>=0.5.0'
print('✓ Packages installed!')
Slides: slides.pdf
There are a LOT of possible extras (a lot of them AI-flavored) inside of Natural PDF, but we'll start by just installing the basics. Use "natural_pdf[all]" if you want everything.
We'll start by opening a PDF.
You can use a PDF on your own computer, or you can use one from a URL. I'll start by using one from a URL to make everything a bit easier.
from natural_pdf import PDF
pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")
pdf
You can find the pages of the PDF under pdf.pages. Let's grab the first one.
page = pdf.pages[0]
page
Pretty boring so far, eh? Let's take a look at the page itself. I'm setting the width since there's a good chance I'm going to be showing this on a projector.
page.show(width=800)
Incredible!!! Congratulations, you've opened your first PDF with Natural PDF.
Most of the time when you're working with PDFs, you're interested in the text on the page.
# text = page.extract_text()
text = page.extract_text(layout=True)
# text
print(text)
layout=True is a useful addition if you want to see a text-only representation of the page, and sometimes it helps with data extraction.
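If you're curious what layout-preserving extraction means, the idea is roughly this: each word gets dropped onto a character grid based on its page coordinates, so the printed output mirrors the page's spatial arrangement. A minimal pure-Python sketch of the concept (not Natural PDF's actual code, and the word tuples are invented sample data):

```python
# Invented (x, y, text) word tuples standing in for PDF words
words = [
    (72, 100, "Site:"),
    (150, 100, "Durham's"),
    (72, 120, "Date:"),
    (150, 120, "February"),
]

CHAR_W, LINE_H = 6, 20  # assumed average character width / line height

# Group words into rows by vertical position, convert x to a column index
rows = {}
for x, y, text in words:
    rows.setdefault(y // LINE_H, []).append((x // CHAR_W, text))

# Render each row, padding with spaces so columns line up
rendered = []
for _, cells in sorted(rows.items()):
    line = ""
    for col, text in sorted(cells):
        line = line.ljust(col) + text
    rendered.append(line)

print("\n".join(rendered))
```

Words that sit side by side on the page end up side by side in the string, which is why layout mode can make label/value pairs much easier to spot.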
You rarely want all of the text, though. How would you describe the INS-UP70N51NCL41R text?
# page.find('rect')
# page.find('rect').show()
page.find('rect').show(crop=True)
text = page.find('rect').extract_text()
print(text)
page.find_all('text').show()
texts = page.find_all('text').extract_each_text()
texts[:5]
texts[1]
red_text = page.find('text[color~=red]')
red_text.show(crop=True)
red_text.extract_text()
text = page.find('text:contains("INS-")')
# text = page.find('text:starts-with("INS-")')
text.show(crop=True)
text.extract_text()
What about "Chicago, Ill."? It's grey, so...
page.find("text[color~=grey]")
How do we know what's on the page? page.describe() can help!
page.describe()
page.find_all('text').inspect()
Let's find the largest text that's also Helvetica
page.find_all('text[size=max()][font_family=Helvetica]').show(crop=50)
What else is on the page that we can extract? How about the date? We want to find Date: and grab everything to the right of it.
# page.find(text="Date").show()
page.find(text="Date").right().extract_text()
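Conceptually, grabbing "everything to the right" just means keeping the words that start past the anchor's right edge and overlap it vertically. A tiny pure-Python sketch of that idea (not Natural PDF's implementation; the word boxes are invented sample data):

```python
# Invented word boxes: (x0, x1) horizontal extent, (top, bottom) vertical
words = [
    {"x0": 72, "x1": 100, "top": 95, "bottom": 110, "text": "Date:"},
    {"x0": 110, "x1": 180, "top": 95, "bottom": 110, "text": "February"},
    {"x0": 185, "x1": 210, "top": 95, "bottom": 110, "text": "3,"},
    {"x0": 72, "x1": 100, "top": 120, "bottom": 135, "text": "Site:"},
]

anchor = next(w for w in words if w["text"] == "Date:")

def overlaps_vertically(a, b):
    # True when the two boxes share any vertical range
    return a["top"] < b["bottom"] and b["top"] < a["bottom"]

right_of = [
    w["text"]
    for w in words
    if w["x0"] >= anchor["x1"] and overlaps_vertically(anchor, w)
]
print(" ".join(right_of))  # → February 3,
```

"Site:" gets filtered out because it sits on a different line, even though it's further down the word list.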
And the site? We want to grab 'site', then keep going right until we see a piece of text.
site = (
page
.find(text="Site")
.right(until='text')
)
# site.show(crop=True)
site.endpoint.extract_text()
How about Violation Count?
(
page
.find(text="Violation Count")
.right(height='element')
.extract_text()
)
The Summary is a little bit more difficult. How would you describe where it is?
(
page
.find(text="Summary")
.right(height='element')
.extract_text()
)
summary = (
page
.find(text="Summary")
.expand(bottom='line', right=True, left=True)
)
summary = (
page
.find(text="Summary")
.below(until='line', include_source=True)
)
# summary.show()
# summary.extract_text(newlines=False)
summary.extract_text(newlines=False)
Everyone loves extracting tables from PDFs! You can do that here: just do page.extract_table(). Easy!!!
table = page.extract_table()
table
table.to_df()
What about a page with multiple tables?
In most PDF processing libraries you just say, "give me all of the tables!" and then figure out which one you want. In Natural PDF, the proper way to do it is to find the area you know the table is in and extract it alone.
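Once you've isolated the region, turning its words into a table is conceptually simple: cluster words into rows by vertical position, then sort each row left to right. A naive pure-Python sketch of that grouping step (not how Natural PDF actually does it; the word tuples are invented sample data):

```python
# Invented (x, y, text) word tuples from a pretend table region
words = [
    (200, 100, "Count"),
    (72, 100, "Statute"),
    (72, 120, "4.12.7"),
    (200, 120, "1"),
]

# Bucket words into rows by rounded vertical position
rows = {}
for x, y, text in words:
    rows.setdefault(round(y / 10), []).append((x, text))

# Sort rows top-to-bottom and cells left-to-right
table = [
    [text for x, text in sorted(cells)]
    for _, cells in sorted(rows.items())
]
print(table)  # → [['Statute', 'Count'], ['4.12.7', '1']]
```

Starting from a trimmed region means stray text above or below the table never makes it into the clustering, which is exactly why region-first extraction is more reliable.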
# Start from the bold, big text that says "Violations" and head down to the smallest text
(
page.find('text[size=max()]:bold:contains("Violations")').below(
until='text[size=min()]',
include_endpoint=False
)
.trim()
).show(crop=True)
# Start from the bold, big text that says "Violations" and head down to the smallest text
(
page.find('text[size=max()]:bold:contains("Violations")').below(
until='text[size=min()]',
include_endpoint=False
)
.trim()
).extract_table().to_df()
What if we have like two hundred of these forms, and they all look the same, and all we want is the top, text-y part?
Instead of writing code about what we want, we can also write code about what we don't want. These are called exclusion zones.
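The mechanics behind an exclusion zone are straightforward: any element whose position falls inside an excluded rectangle simply gets dropped before extraction. A minimal pure-Python sketch of the idea (not Natural PDF's actual implementation; the rectangles and words are invented sample data):

```python
# One excluded rectangle covering a pretend top banner: (x0, top, x1, bottom)
exclusions = [(0, 0, 612, 80)]

# Invented (x, y, text) word positions
words = [
    (100, 40, "CONFIDENTIAL"),   # sits inside the banner
    (72, 150, "Site:"),
    (150, 150, "Durham's"),
]

def excluded(x, y):
    # True when the point falls inside any exclusion rectangle
    return any(x0 <= x <= x1 and t <= y <= b for x0, t, x1, b in exclusions)

kept = [text for x, y, text in words if not excluded(x, y)]
print(kept)  # → ['Site:', "Durham's"]
```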
from natural_pdf import PDF
pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")
page = pdf.pages[0]
text = page.extract_text()
print(text)
top = page.region(top=0, left=0, height=80)
bottom = page.find("line[width>=2]").below()
(top + bottom).show()
page.add_exclusion(top)
page.add_exclusion(bottom)
page.show(exclusions='red')
text = page.extract_text()
print(text)
Any time there is recurring text - headers, footers, even stamps on the page you want to ignore, you can just add them as an exclusion.
It's also possible to add exclusions across multiple pages. In the example below, every time a new page is loaded, the PDF-level exclusions are applied to it. Write it once, be done with it forever!
pdf.add_exclusion(lambda page: page.region(top=0, left=0, height=80))
pdf.add_exclusion(lambda page: page.find("line[width>=2]").below())
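The lambda pattern above is worth pausing on: instead of computing a region once, you register a callable that gets handed each page and returns that page's region. A toy pure-Python sketch of the pattern (class names and behavior are invented for illustration, not Natural PDF's internals):

```python
class FakePage:
    def __init__(self, number):
        self.number = number
        self.exclusions = []

class FakePDF:
    def __init__(self, n_pages):
        self.pages = [FakePage(i) for i in range(n_pages)]
        self._exclusion_fns = []

    def add_exclusion(self, fn):
        # Store the callable; don't run it yet
        self._exclusion_fns.append(fn)

    def process(self, page):
        # Lazily apply every stored callable to this page
        for fn in self._exclusion_fns:
            page.exclusions.append(fn(page))

pdf = FakePDF(3)
pdf.add_exclusion(lambda page: ("header", page.number))
pdf.add_exclusion(lambda page: ("footer", page.number))

for page in pdf.pages:
    pdf.process(page)

print(pdf.pages[2].exclusions)  # → [('header', 2), ('footer', 2)]
```

Because the callable runs per page, it can adapt to each page's own geometry (say, a thick line that sits at a slightly different height on every page), which a fixed rectangle can't.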
What about when the text isn't so easy to access? Time to move on to our next notebook!