Loading and Basic Text Extraction

In [1]:

#%pip install "natural-pdf[all]"

In this tutorial, we'll learn how to:

Load a PDF document
Extract text from pages
Extract specific elements

Loading a PDF

Let's start by loading a PDF file:

In [2]:

from natural_pdf import PDF
import os

# Load a PDF file
pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")

# Basic info about the document
{
    "Filename": os.path.basename(pdf.path),
    "Pages": len(pdf.pages),
    "Title": pdf.metadata.get("Title", "N/A"),
    "Author": pdf.metadata.get("Author", "N/A")
}

Out[2]:

{'Filename': '01-practice.pdf', 'Pages': 1, 'Title': 'N/A', 'Author': 'N/A'}
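The cell above derives the file name by calling os.path.basename on pdf.path. That works here because the URL ends in the file name, but a URL with a query string would not come out cleanly. As a minimal sketch, assuming the source is either a local path or an http(s) URL, a small helper can handle both cases. The name filename_from_source is hypothetical and is not part of natural-pdf:

```python
import os
import posixpath
from urllib.parse import unquote, urlparse


def filename_from_source(source: str) -> str:
    """Return the file name for a local path or an http(s) URL.

    Hypothetical helper for illustration; not part of natural-pdf.
    """
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        # Use only the URL path so query strings and fragments are ignored.
        return posixpath.basename(unquote(parsed.path))
    # Anything else is treated as a local file path.
    return os.path.basename(source)


url = "https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf"
print(filename_from_source(url))
print(filename_from_source(url + "?download=1"))
```

Both calls print 01-practice.pdf. In the notebook, the same helper could be applied to the string passed to PDF(...).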

Extracting Text

Now that we have loaded the PDF, let's extract the text from the first page:

In [3]:

# Get the first page
page = pdf.pages[0]

# Extract text from the page
text = page.extract_text()

# Show the first 200 characters of the text
print(text[:200])

Jungle Health and Safety Inspection Service
INS-UP70N51NCL41R
Site: Durham’s Meatpacking Chicago, Ill.
Date: February 3, 1905
Violation Count: 7
Summary: Worst of any, however, were the fertilizer men
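Because extract_text() returns a plain string, ordinary Python string handling applies to the result. The sketch below is one possible way to turn the "Label: value" lines from the output above into a dictionary. It assumes each field sits on a single line and that the label contains no colon; parse_fields is a hypothetical helper, and the sample string is copied from the output above so that the example runs without the PDF:

```python
def parse_fields(text: str) -> dict:
    """Collect 'Label: value' lines into a dict (illustrative helper only)."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon; lines without one are skipped.
        label, sep, value = line.partition(":")
        if sep and label.strip() and value.strip():
            fields[label.strip()] = value.strip()
    return fields


# Sample copied from the extracted text shown above.
sample = (
    "Jungle Health and Safety Inspection Service\n"
    "INS-UP70N51NCL41R\n"
    "Site: Durham’s Meatpacking Chicago, Ill.\n"
    "Date: February 3, 1905\n"
    "Violation Count: 7\n"
)

fields = parse_fields(sample)
print(fields["Date"])
print(int(fields["Violation Count"]))
```

In the notebook, parse_fields(text) would run on the full page text. A field whose value wraps onto a following line, such as a long summary, would keep only its first line under this assumption.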

Finding and Extracting Specific Elements

We can find specific elements with selectors that match on element type and text content:

In [4]:

# Find text elements containing specific words
elements = page.find_all('text:contains("Inadequate")')

# Show these elements on the page
elements.show()

Out[4]:

[Image: the page with the matching text elements highlighted]

Working with Layout Regions

We can analyze the layout of the page to identify different regions:

In [5]:

# Analyze the page layout
page.analyze_layout(engine='yolo')

# Find and highlight all detected regions
page.find_all('region').show(group_by='type')

image 1/1 /var/folders/25/h3prywj14qb0mlkl2s8bxq5m0000gn/T/tmpucuv2sxu/temp_layout_image.png: 1024x800 1 title, 3 plain texts, 2 abandons, 1 table, 976.1ms

Speed: 5.3ms preprocess, 976.1ms inference, 4.1ms postprocess per image at shape (1, 3, 1024, 800)

Out[5]:

Working with Multiple Pages

To process every page, iterate over pdf.pages. The sample PDF has a single page, so the loop below runs once:

In [6]:

# Process all pages
for page in pdf.pages:
    page_text = page.extract_text()
    print(f"Page {page.number}", page_text[:100])  # First 100 chars of each page

Page 1 Jungle Health and Safety Inspection Service
INS-UP70N51NCL41R
Site: Durham’s Meatpacking Chicago, Il
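As the output shows, slicing the raw text keeps its line breaks, so each page preview spans several lines. One option, sketched here under the assumption that a one-line preview per page is preferred, is to collapse whitespace before truncating. The helper name preview is hypothetical:

```python
def preview(text, limit=100):
    """Return a single-line preview of text, at most `limit` characters."""
    # `text or ""` guards against a page that yields no text at all.
    collapsed = " ".join((text or "").split())
    return collapsed[:limit]


sample = "Jungle Health and Safety Inspection Service\nINS-UP70N51NCL41R"
print(preview(sample, 20))
print(preview(sample))
```

Inside the loop above, this would replace the slice: print(f"Page {page.number}", preview(page_text)).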

This tutorial covered the basics of loading PDFs and extracting text. In the next tutorials, we'll explore more advanced features like searching for specific elements, extracting structured content, and working with tables.