Spatial Navigation¶
Spatial navigation lets you work with PDF content based on the physical layout of elements on the page. It is well suited to finding elements relative to each other and extracting information in context.
In [1]:
#%pip install "natural-pdf[all]"
In [2]:
from natural_pdf import PDF
# Load a PDF
pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")
page = pdf.pages[0]
# Find the title of the document
title = page.find('text:contains("Jungle Health")')
# Visualize our starting point
title.show(color="red", label="Document Title")
# Display the title text
title.text
Out[2]:
'Jungle Health and Safety Inspection Service'
Finding Elements Above and Below¶
In [3]:
# Create a region below the title
region_below = title.below(height=100)
# Visualize the region
region_below.show(color="blue", label="Below Title")
# Extract the text from this region
text_below = region_below.extract_text()
text_below
Out[3]:
'INS-UP70N51NCL41R\nSite: Durham’s Meatpacking Chicago, Ill.\nDate: February 3, 1905\nViolation Count: 7'
Finding Content Between Elements¶
In [4]:
# Find two labels to serve as boundaries
site_label = page.find('text:contains("Site:")')
date_label = page.find('text:contains("Date:")')
# Get the region between these labels
between_region = site_label.below(
    include_element=True,            # Include the starting element
    until='text:contains("Date:")',  # Stop at this element
    include_endpoint=False,          # Don't include the ending element
)
# Visualize the region between labels
between_region.show(color="green", label="Between")
# Extract text from this bounded area
between_region.extract_text()
Out[4]:
'Site: Durham’s Meatpacking Chicago, Ill.'
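To make the geometry concrete, here is a minimal sketch of how a "below, until" region could be computed from bounding boxes. The `Box` class, the `region_below` helper, and all coordinates are hypothetical. This is not natural-pdf's implementation, only an illustration of the idea under the assumption that y grows downward from the top of the page and that the region spans the full page width.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned box; y grows downward from the page top."""
    x0: float
    top: float
    x1: float
    bottom: float


def region_below(start: Box, page: Box, until: Optional[Box] = None,
                 include_start: bool = False,
                 include_endpoint: bool = True) -> Box:
    """Full-width strip under `start`, optionally stopping at `until`."""
    top = start.top if include_start else start.bottom
    if until is None:
        bottom = page.bottom
    else:
        bottom = until.bottom if include_endpoint else until.top
    return Box(page.x0, top, page.x1, bottom)


# Made-up coordinates for two stacked labels on a 612 x 792 point page
page = Box(0, 0, 612, 792)
site = Box(50, 100, 80, 112)
date = Box(50, 120, 80, 132)

# Start at the top of "Site:" and stop just above "Date:"
between = region_below(site, page, until=date,
                       include_start=True, include_endpoint=False)
```

In this toy model the two flags only decide which edge of the start and end boxes bounds the strip, which matches the behavior seen in the output above: the "Site:" line is included and the "Date:" line is not.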
Navigating Left and Right¶
In [5]:
# Find a field label
site_label = page.find('text:contains("Site:")')
# Get the content to the right (the field value)
value_region = site_label.right(width=200)
# Visualize the label and value regions
site_label.show(color="red", label="Label")
value_region.show(color="blue", label="Value")
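# Extract the text in the region (as the output shows, the region spans the full page height, so it captures more than the value)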
value_region.extract_text()
Out[5]:
'Durham’s Meatpacking Chicago, Ill.\nFebruary 3, 1905\ntion Count: 7\nmary: Worst of any, however, were the fertilize\ne people could not be shown to the visitor - for\nr at a hundred yards, and as for the other men\nof which there were open vats near the level\nhe vats; and when they were fished out, there\niting - sometimes they would be overlooked fo\nworld as Durham’s Pure Leaf Lard!\nations\nute Description\n.7 Unsanitary Working Conditions.\n3 Inadequate Protective Equipment.\n9 Ineffective Injury Prevention.\n5 Failure to Properly Store Hazardous M\n2 Lack of Adequate Fire Safety Measure\n4 Inadequate Ventilation Systems.\n.7 Insufficient Employee Training for Safe\nJungle Healt'
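The output above contains fragments of nearly every line on the page, each cut off at a similar horizontal position, which suggests the region is a 200-point-wide vertical strip rather than a box the height of the label. The sketch below reproduces that effect with a toy model: made-up line positions, an assumed fixed character width, and a hypothetical `clip_lines` helper. It is not how natural-pdf extracts text, only an illustration of why a tall strip picks up clipped text from unrelated lines.

```python
from typing import List, Optional, Tuple

CHAR_WIDTH = 6.0  # assumed fixed glyph width in points (toy value)

# (top, left, text) for a few lines; coordinates are made up
LINES: List[Tuple[float, float, str]] = [
    (100.0, 50.0, "Site: Durham's Meatpacking"),
    (120.0, 50.0, "Date: February 3, 1905"),
    (140.0, 50.0, "Violation Count: 7"),
]


def clip_lines(x_min: float, x_max: float,
               y_range: Optional[Tuple[float, float]] = None) -> List[str]:
    """Keep characters whose left edge falls in [x_min, x_max)."""
    out = []
    for top, left, text in LINES:
        # Optionally restrict the strip to a vertical band
        if y_range is not None and not (y_range[0] <= top < y_range[1]):
            continue
        kept = "".join(
            ch for i, ch in enumerate(text)
            if x_min <= left + i * CHAR_WIDTH < x_max
        )
        if kept.strip():
            out.append(kept.strip())
    return out


# The "Site: " label covers 6 characters, so the value starts at x = 86
label_right = 50.0 + 6 * CHAR_WIDTH

full_height = clip_lines(label_right, label_right + 200)
label_height = clip_lines(label_right, label_right + 200, y_range=(100.0, 112.0))
```

With the strip limited to the label's own vertical band, only the value on the same line survives; with the full-height strip, the lines below are cut mid-word in the same way as in the notebook output.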
Finding Adjacent Elements¶
In [6]:
# Start with a label element
label = page.find('text:contains("Site:")')
# Find the next and previous elements in reading order
next_elem = label.next()
prev_elem = label.prev()
# Visualize all three elements
label.show(color="red", label="Current")
if next_elem:
    next_elem.show(color="green", label="Next")
if prev_elem:
    prev_elem.show(color="blue", label="Previous")
# Show the text of adjacent elements
{
    "current": label.text,
    "next": next_elem.text if next_elem else "None",
    "previous": prev_elem.text if prev_elem else "None",
}
Out[6]:
{'current': 'Site: ', 'next': 'i', 'previous': 'S'}
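In the output above, `next()` and `prev()` return single characters, which suggests that the sequence being walked contains character-level elements as well as longer text runs. As a rough mental model only, reading order can be pictured as sorting elements top to bottom and then left to right, with neighbors being adjacent entries in that list. The `Element` tuple, the coordinates, and the `neighbors` helper below are hypothetical and do not reflect natural-pdf's internals.

```python
from typing import List, NamedTuple, Optional, Tuple


class Element(NamedTuple):
    """Hypothetical text element with its top-left position."""
    text: str
    top: float
    x0: float


def reading_order(elements: List[Element]) -> List[Element]:
    """Sort top to bottom, then left to right."""
    return sorted(elements, key=lambda e: (e.top, e.x0))


def neighbors(elements: List[Element], current: Element
              ) -> Tuple[Optional[Element], Optional[Element]]:
    """Return (previous, next) of `current` in reading order."""
    ordered = reading_order(elements)
    i = ordered.index(current)
    prev_elem = ordered[i - 1] if i > 0 else None
    next_elem = ordered[i + 1] if i + 1 < len(ordered) else None
    return prev_elem, next_elem


# Made-up positions, deliberately listed out of order
site = Element("Site: ", 100.0, 50.0)
elements = [
    Element("Date: ", 120.0, 50.0),
    Element("Durham's Meatpacking", 100.0, 86.0),
    site,
    Element("INS-UP70N51NCL41R", 80.0, 50.0),
]

prev_elem, next_elem = neighbors(elements, site)
```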
Combining with Element Selectors¶
In [7]:
# Find a section label
summary = page.find('text:contains("Summary:")')
# Find the next bold text element
next_bold = summary.next('text:bold', limit=20)
# Find the nearest line element
nearest_line = summary.nearest('line')
# Visualize what we found
summary.show(color="red", label="Summary")
if next_bold:
    next_bold.show(color="blue", label="Next Bold")
if nearest_line:
    nearest_line.show(color="green", label="Nearest Line")
# Show the content we found
{
    "summary": summary.text,
    "next_bold": next_bold.text if next_bold else "None found",
    "nearest_line": nearest_line if nearest_line else "None found",
}
Out[7]:
{'summary': 'Summary: ', 'next_bold': 'u', 'nearest_line': <LineElement type=horizontal width=2.0 bbox=(50, 352, 550, 352)>}
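One way to picture `next('text:bold', limit=20)` is as a forward scan through the elements in reading order that stops at the first match or gives up after a fixed number of candidates. The sketch below assumes that reading of `limit`; the `Element` tuple, its `bold` flag, and `next_matching` are hypothetical stand-ins, not natural-pdf's API.

```python
from typing import Callable, List, NamedTuple, Optional


class Element(NamedTuple):
    """Hypothetical text element; `bold` stands in for a selector match."""
    text: str
    bold: bool


def next_matching(ordered: List[Element], start: int,
                  predicate: Callable[[Element], bool],
                  limit: int) -> Optional[Element]:
    """Scan at most `limit` elements after index `start` for a match."""
    for elem in ordered[start + 1:start + 1 + limit]:
        if predicate(elem):
            return elem
    return None


# Made-up elements, already in reading order
ordered = [
    Element("Summary: ", False),
    Element("Worst of any,", False),
    Element("however,", False),
    Element("Violations", True),
]

found = next_matching(ordered, 0, lambda e: e.bold, limit=20)
too_short = next_matching(ordered, 0, lambda e: e.bold, limit=2)
```

Under this model, a small `limit` can miss a match that sits further along in the sequence, which is why the result is checked before use in the cell above.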
Extracting Table Rows with Spatial Navigation¶
In [8]:
# Find a table heading
table_heading = page.find('text:contains("Statute")')
table_heading.show(color="purple", label="Table Header")
# Extract table rows using spatial navigation
rows = []
current = table_heading
# Get the next 4 rows
for _ in range(4):
    # Find the next row below the current one
    next_row = current.below(height=15)
    if next_row:
        rows.append(next_row)
        current = next_row  # Move to the next row
    else:
        break
# Visualize all found rows
page.clear_highlights()
for i, row in enumerate(rows):
    row.highlight(label=f"Row {i+1}")
page.to_image(width=700)
Out[8]:
In [9]:
# Extract text from each row
[row.extract_text() for row in rows]
Out[9]:
['4.12.7 Unsanitary Working Conditions. Critical', '4.12.7 Unsanitary Working Conditions. Critical\n5.8.3 Inadequate Protective Equipment. Serious', '5.8.3 Inadequate Protective Equipment. Serious', '6.3.9 Ineffective Injury Prevention. Serious']
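The row texts above overlap: the first statute appears twice, and one region captured two lines. A small post-processing step can tidy this up. The sketch below splits each region's text into lines, drops repeats while keeping order, and parses each line with a regular expression. The pattern assumes each row is a dotted statute number, a description, and a one-word final column; the field name `level` is a guess, since the column header is not shown in the output.

```python
import re
from typing import Dict, List

# Row texts copied from the output above
row_texts = [
    "4.12.7 Unsanitary Working Conditions. Critical",
    "4.12.7 Unsanitary Working Conditions. Critical\n"
    "5.8.3 Inadequate Protective Equipment. Serious",
    "5.8.3 Inadequate Protective Equipment. Serious",
    "6.3.9 Ineffective Injury Prevention. Serious",
]

# Assumed row shape: statute number, description, one-word level
ROW_PATTERN = re.compile(
    r"^(?P<statute>\d+(?:\.\d+)+)\s+(?P<description>.+?)\s+(?P<level>\w+)$"
)


def parse_rows(texts: List[str]) -> List[Dict[str, str]]:
    """Split into lines, drop repeats (keeping order), and parse each line."""
    seen = set()
    parsed = []
    for text in texts:
        for line in text.splitlines():
            line = line.strip()
            if not line or line in seen:
                continue
            seen.add(line)
            match = ROW_PATTERN.match(line)
            if match:
                parsed.append(match.groupdict())
    return parsed


records = parse_rows(row_texts)
```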
Extracting Key-Value Pairs¶
In [10]:
# Find all potential field labels (text with a colon)
labels = page.find_all('text:contains(":")')
# Visualize the labels
labels.show(color="blue", label="Labels")
# Extract key-value pairs
field_data = {}
for label in labels:
    # Clean up the label text
    key = label.text.strip().rstrip(':')
    # Skip if not a proper label
    if not key:
        continue
    # Get the value to the right
    value = label.right(width=200).extract_text().strip()
    # Add to our collection
    field_data[key] = value
# Show the extracted data
field_data
Out[10]:
{'Site': 'Durham’s Meatpacking Chicago, Ill.\nFebruary 3, 1905\ntion Count: 7\nmary: Worst of any, however, were the fertilize\ne people could not be shown to the visitor - for\nr at a hundred yards, and as for the other men\nof which there were open vats near the level\nhe vats; and when they were fished out, there\niting - sometimes they would be overlooked fo\nworld as Durham’s Pure Leaf Lard!\nations\nute Description\n.7 Unsanitary Working Conditions.\n3 Inadequate Protective Equipment.\n9 Ineffective Injury Prevention.\n5 Failure to Properly Store Hazardous M\n2 Lack of Adequate Fire Safety Measure\n4 Inadequate Ventilation Systems.\n.7 Insufficient Employee Training for Safe\nJungle Healt',
 'Date': 'Durham’s Meatpacking Chicago, Ill.\nFebruary 3, 1905\non Count: 7\nary: Worst of any, however, were the fertilizer\npeople could not be shown to the visitor - for t\nat a hundred yards, and as for the other men,\nof which there were open vats near the level o\ne vats; and when they were fished out, there w\nng - sometimes they would be overlooked for\nworld as Durham’s Pure Leaf Lard!\ntions\nte Description\n7 Unsanitary Working Conditions.\nInadequate Protective Equipment.\nIneffective Injury Prevention.\nFailure to Properly Store Hazardous Ma\nLack of Adequate Fire Safety Measures\nInadequate Ventilation Systems.\n7 Insufficient Employee Training for Safe W\nJungle Health',
 'Violation Count': 'eatpacking Chicago, Ill.\n, 1905\n7\nof any, however, were the fertilizer men, and\nld not be shown to the visitor - for the odor of\nd yards, and as for the other men, who worke\nre were open vats near the level of the floor, t\nwhen they were fished out, there was never e\nmes they would be overlooked for days, till all\nrham’s Pure Leaf Lard!\nription\nnitary Working Conditions.\nquate Protective Equipment.\nctive Injury Prevention.\ne to Properly Store Hazardous Materials.\nof Adequate Fire Safety Measures.\nquate Ventilation Systems.\nicient Employee Training for Safe Work Practi\nJungle Health and Safety Ins',
 'Summary': 'm’s Meatpacking Chicago, Ill.\nuary 3, 1905\nount: 7\nWorst of any, however, were the fertilizer men\nple could not be shown to the visitor - for the o\nhundred yards, and as for the other men, who\nich there were open vats near the level of the\ns; and when they were fished out, there was n\nsometimes they would be overlooked for days\nas Durham’s Pure Leaf Lard!\ns\nDescription\nUnsanitary Working Conditions.\nInadequate Protective Equipment.\nIneffective Injury Prevention.\nFailure to Properly Store Hazardous Material\nLack of Adequate Fire Safety Measures.\nInadequate Ventilation Systems.\nInsufficient Employee Training for Safe Work\nJungle Health and S'}
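Each value above is mixed with clipped text from other lines, apparently because the 200-point region to the right of each label runs the full height of the page. When a label and its value sit on the same line, one alternative is to extract a region's text once and split each line at its first colon. The sketch below applies that idea to the text returned by `title.below(height=100)` earlier on this page. `parse_fields` is a hypothetical helper, and the approach assumes one field per line.

```python
from typing import Dict

# Text copied from the `title.below(height=100)` output earlier on this page
block = (
    "INS-UP70N51NCL41R\n"
    "Site: Durham’s Meatpacking Chicago, Ill.\n"
    "Date: February 3, 1905\n"
    "Violation Count: 7"
)


def parse_fields(text: str) -> Dict[str, str]:
    """Split each 'Label: value' line at its first colon."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        # Lines without a colon (such as the inspection ID) are skipped
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


fields = parse_fields(block)
```

This trades spatial precision for simplicity: it works for single-line fields like these but would not capture a multi-line value such as the summary paragraph.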
Spatial navigation mimics how humans read documents, letting you navigate content based on physical relationships between elements. It's especially useful for extracting structured data from forms, tables, and formatted documents.