Loading and Basic Text Extraction
In [1]:
#%pip install "natural-pdf[all]"
In this tutorial, we'll learn how to:
- Load a PDF document
- Extract text from pages
- Extract specific elements
Loading a PDF
Let's start by loading a PDF file:
In [2]:
from natural_pdf import PDF
import os
# Load a PDF file
pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")
# Basic info about the document
{
    "Filename": os.path.basename(pdf.path),
    "Pages": len(pdf.pages),
    "Title": pdf.metadata.get("Title", "N/A"),
    "Author": pdf.metadata.get("Author", "N/A"),
}
Out[2]:
{'Filename': '01-practice.pdf', 'Pages': 1, 'Title': 'N/A', 'Author': 'N/A'}
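The info dictionary above can be wrapped in a small helper when several documents need the same summary. The sketch below is only an illustration: `summarize` and `fake_pdf` are hypothetical names, and the stand-in object merely mimics the three attributes used above (`path`, `pages`, `metadata`) so the example runs without downloading anything. It also falls back to "N/A" when a metadata field is present but empty, which a plain `.get("Title", "N/A")` would not do.

```python
import os
from types import SimpleNamespace


def summarize(doc):
    """Build an info dict from any object exposing path, pages and metadata."""
    meta = doc.metadata or {}
    return {
        "Filename": os.path.basename(doc.path),
        "Pages": len(doc.pages),
        # `or` also covers fields that exist but hold an empty string
        "Title": meta.get("Title") or "N/A",
        "Author": meta.get("Author") or "N/A",
    }


# Hypothetical stand-in for a loaded PDF (not a natural-pdf object)
fake_pdf = SimpleNamespace(
    path="pdfs/01-practice.pdf",
    pages=[object()],
    metadata={"Title": ""},
)

info = summarize(fake_pdf)
print(info)
```

With a real document, the same function should work on the `pdf` object loaded above, assuming it keeps exposing those three attributes.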
Extracting Text
Now that we have loaded the PDF, let's extract the text from the first page:
In [3]:
# Get the first page
page = pdf.pages[0]
# Extract text from the page
text = page.extract_text()
# Show the first 200 characters of the text
print(text[:200])
Jungle Health and Safety Inspection Service INS-UP70N51NCL41R Site: Durham’s Meatpacking Chicago, Ill. Date: February 3, 1905 Violation Count: 7 Summary: Worst of any, however, were the fertilizer men
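Once the text is a plain string, ordinary Python string tools apply to it. As a rough sketch, the labelled fields visible in the output (`Site:`, `Date:`, `Violation Count:`) could be pulled into a dictionary with a regular expression. The sample string below is modeled on that output, but its line breaks are an assumption, and `parse_fields` is a hypothetical helper rather than part of natural-pdf.

```python
import re

# Sample modeled on the output above; the line breaks are assumed
sample_text = (
    "Jungle Health and Safety Inspection Service\n"
    "INS-UP70N51NCL41R\n"
    "Site: Durham’s Meatpacking Chicago, Ill.\n"
    "Date: February 3, 1905\n"
    "Violation Count: 7\n"
    "Summary: Worst of any, however, were the fertilizer men\n"
)


def parse_fields(text):
    """Collect 'Label: value' lines into a dict keyed by label."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"^([A-Z][A-Za-z ]+):\s*(.+)$", line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


fields = parse_fields(sample_text)
print(fields["Date"])
print(int(fields["Violation Count"]))
```

Lines without a `Label:` prefix, such as the title and the inspection number, are simply skipped.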
Finding and Extracting Specific Elements
We can find specific elements with selectors. Here we match text elements by their content:
In [4]:
# Find text elements containing specific words
elements = page.find_all('text:contains("Inadequate")')
# Show these elements on the page
elements.show()
Out[4]:
Working with Layout Regions
We can analyze the layout of the page to identify different regions:
In [5]:
# Analyze the page layout
page.analyze_layout(engine='yolo')
# Find and highlight all detected regions
page.find_all('region').show(group_by='type')
image 1/1 /var/folders/25/h3prywj14qb0mlkl2s8bxq5m0000gn/T/tmp2y6ycguv/temp_layout_image.png: 1024x800 1 title, 3 plain texts, 2 abandons, 1 table, 1710.3ms
Speed: 7.8ms preprocess, 1710.3ms inference, 1.7ms postprocess per image at shape (1, 3, 1024, 800)
Out[5]:
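The log line reports one title, three plain-text blocks, two "abandon" regions and one table. A tally like that can be reproduced with `collections.Counter`, sketched here on stand-in objects. The `type` attribute and the singular label names are assumptions inferred from `group_by='type'` and the log text, not confirmed natural-pdf API.

```python
from collections import Counter
from types import SimpleNamespace

# Hypothetical stand-ins for detected regions; labels taken from the log line
detected = [
    "title",
    "plain text", "plain text", "plain text",
    "abandon", "abandon",
    "table",
]
regions = [SimpleNamespace(type=label) for label in detected]

# Count how many regions carry each type label
counts = Counter(region.type for region in regions)
for label, n in counts.most_common():
    print(f"{label}: {n}")
```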
Working with Multiple Pages
You can also work with multiple pages:
In [6]:
# Process all pages
for page in pdf.pages:
    page_text = page.extract_text()
    print(f"Page {page.number}", page_text[:100])  # First 100 chars of each page
Page 1 Jungle Health and Safety Inspection Service INS-UP70N51NCL41R Site: Durham’s Meatpacking Chicago, Il
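For longer documents it may be handier to collect the previews into a dictionary instead of printing them. The following is a minimal sketch using hypothetical stand-in pages that expose only `number` and `extract_text()`, the two members used in the loop above. The `or ""` guard is purely defensive; the tutorial does not say whether `extract_text()` can return `None`.

```python
from types import SimpleNamespace


def collect_previews(pages, preview_chars=100):
    """Return {page number: first preview_chars characters of the page text}."""
    previews = {}
    for page in pages:
        text = page.extract_text() or ""  # treat missing text as empty
        previews[page.number] = text[:preview_chars]
    return previews


# Hypothetical stand-in pages (not natural-pdf objects)
fake_pages = [
    SimpleNamespace(
        number=1,
        extract_text=lambda: "Jungle Health and Safety Inspection Service",
    ),
    SimpleNamespace(number=2, extract_text=lambda: None),
]

previews = collect_previews(fake_pages, preview_chars=6)
print(previews)
```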
This tutorial covered the basics of loading PDFs and extracting text. The following tutorials explore more advanced features such as searching for specific elements, extracting structured content, and working with tables.