Finding Specific Elements
Extracting all the text is useful, but often you need specific pieces of information. natural-pdf
lets you find elements using selectors, similar to CSS.
Let's find the "Site" and "Date" information in our 01-practice.pdf:
#%pip install "natural-pdf[all]"
from natural_pdf import PDF
# Load a PDF
pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")
# Get the first page (index 0)
page = pdf.pages[0]
# Find the text element containing "Site:"
# The ':contains()' pseudo-class looks for text content.
site_label = page.find('text:contains("Site:")')
# Find the text element containing "Date:"
date_label = page.find('text:contains("Date:")')
# Visualize the found elements
site_label.highlight(color="red", label="Site Label")
date_label.highlight(color="blue", label="Date Label")
# Access the text content directly
{
"Site Label": site_label.text,
"Date Label": date_label.text
}
# Display the page image to see the visualized elements
page.to_image()
Finding Elements by Color
You can find elements based on their color:
# Find text elements that are red
red_text = page.find('text[color~=red]')
red_text.highlight(color="red", label="Red Text")
print(f"Found red text: {red_text.text}")
# Find elements with specific RGB colors
blue_text = page.find('text[color=rgb(0,0,255)]')
2025-05-06T16:22:22.726960Z [warning ] Unsupported operator '~=' encountered during filter building for attribute 'color' lineno=445 module=natural_pdf.selectors.parser
[2025-05-06 12:22:22,726] [ WARNING] parser.py:445 - Unsupported operator '~=' encountered during filter building for attribute 'color'
Found red text: Jungle Health and Safety Inspection Service
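The warning in the output above indicates that the parser did not recognize the `~=` operator for `color` in the version that produced this page. If `~=` is meant as an approximate match (an assumption here), the idea can be sketched in plain Python as a distance check between RGB values. The `NAMED_COLORS` table, the `is_approximately` helper, and the tolerance of 0.2 are all hypothetical and are not part of natural-pdf.

```python
# Illustrative sketch only: this is NOT natural-pdf's implementation.
# NAMED_COLORS is a small hypothetical lookup table (RGB components in 0..1).
NAMED_COLORS = {
    "red": (1.0, 0.0, 0.0),
    "green": (0.0, 1.0, 0.0),
    "blue": (0.0, 0.0, 1.0),
}


def color_distance(a, b):
    """Euclidean distance between two RGB triples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def is_approximately(color, name, tolerance=0.2):
    """Return True if `color` lies within `tolerance` of the named color."""
    return color_distance(color, NAMED_COLORS[name]) <= tolerance


slightly_off_red = (0.9, 0.05, 0.05)
print(is_approximately(slightly_off_red, "red"))   # True
print(is_approximately(slightly_off_red, "blue"))  # False
```

An exact comparison such as `color=rgb(0,0,255)` corresponds to a tolerance of zero in this sketch.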
Finding Lines and Shapes
Find lines and rectangles based on their properties:
# Find horizontal lines
horizontal_lines = page.find_all('line[horizontal]')
# Find thick lines (width >= 2)
thick_lines = page.find_all('line[width>=2]')
# Find rectangles
rectangles = page.find_all('rect')
# Visualize what we found
page.clear_highlights()
horizontal_lines.highlight(color="blue", label="Horizontal Lines")
thick_lines.highlight(color="red", label="Thick Lines")
rectangles.highlight(color="green", label="Rectangles")
page.to_image()
Finding Elements by Font Properties
# Find text with specific font properties
bold_text = page.find_all('text:bold')
large_text = page.find_all('text[size>=12]')
# Find text with specific font names
helvetica_text = page.find_all('text[fontname=Helvetica]')
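Attribute filters such as `size>=12` or `fontname=Helvetica` each pair an attribute name with a comparison operator and a value. As a rough mental model, and not a description of how natural-pdf evaluates them, the sketch below applies one such condition to a plain dictionary. The `matches` helper and the dictionary-based elements are hypothetical.

```python
import operator
import re

# Illustrative sketch only: natural-pdf's real parser is not shown here.
OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
    "=": operator.eq,
}
# Two-character operators are listed first so ">=" is not read as ">".
CONDITION = re.compile(r"^(\w+)\s*(>=|<=|>|<|=)\s*(.+)$")


def matches(element, condition):
    """Check one 'name<op>value' condition against a dict of attributes."""
    name, op, raw = CONDITION.match(condition).groups()
    actual = element[name]
    expected = type(actual)(raw)  # coerce the text to the attribute's type
    return OPS[op](actual, expected)


heading = {"size": 14.0, "fontname": "Helvetica"}
footnote = {"size": 9.0, "fontname": "Times"}

print(matches(heading, "size>=12"))            # True
print(matches(footnote, "size>=12"))           # False
print(matches(heading, "fontname=Helvetica"))  # True
```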
Spatial Navigation
You can find elements based on their position relative to other elements:
# Find text above a specific element
above_text = page.find('line[width=2]').above().extract_text()
# Find text below a specific element
below_text = page.find('text:contains("Summary")').below().extract_text()
# Find text to the right of a specific element
nearby_text = page.find('text:contains("Site")').right(width=200).extract_text()
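To build intuition for a call like `.right(width=200)`, it can help to think in bounding boxes. The sketch below is plain geometry under stated assumptions, not natural-pdf's code: coordinates are assumed to start at the top-left of the page with `top` growing downward, and `Box`, `region_right`, and `contains` are hypothetical names.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical bounding box; origin assumed at the page's top-left."""
    x0: float
    top: float
    x1: float
    bottom: float


def region_right(box, width):
    """Region of the given width that starts at the box's right edge."""
    return Box(box.x1, box.top, box.x1 + width, box.bottom)


def contains(region, box):
    """True if `box` lies entirely inside `region`."""
    return (
        box.x0 >= region.x0
        and box.x1 <= region.x1
        and box.top >= region.top
        and box.bottom <= region.bottom
    )


label = Box(50, 100, 90, 112)    # stand-in for a "Site:" label
value = Box(95, 100, 180, 112)   # text on the same line, to the right
other = Box(50, 300, 120, 312)   # text much further down the page

region = region_right(label, width=200)
inside = [b for b in (value, other) if contains(region, b)]
print(inside == [value])  # True
```

Only the box on the same line as the label falls inside the region; the one further down the page is excluded.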
Combining Selectors
You can combine multiple conditions to find exactly what you need:
# Find large, bold text that contains specific words
important_text = page.find_all('text[size>=12]:bold:contains("Critical")')
# Find red text inside a rectangle
highlighted_text = page.find('rect').find_all('text[color~=red]')
2025-05-06T16:22:22.858023Z [warning ] Unsupported operator '~=' encountered during filter building for attribute 'color' lineno=445 module=natural_pdf.selectors.parser
[2025-05-06 12:22:22,857] [ WARNING] parser.py:445 - Unsupported operator '~=' encountered during filter building for attribute 'color'
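A combined selector reads as an element type followed by any number of bracketed attribute filters and colon-prefixed pseudo-classes, all of which must hold. The toy splitter below makes that structure visible. It is a simplified sketch, not natural-pdf's parser: `split_selector` is a hypothetical helper, and it assumes the selector starts with an element type and that quoted arguments contain no square brackets.

```python
import re

# Illustrative sketch only: a toy splitter, not natural-pdf's parser.
ATTR = re.compile(r"\[([^\]]+)\]")
PSEUDO = re.compile(r':([a-z-]+)(?:\("([^"]*)"\))?')


def split_selector(selector):
    """Split a selector into element type, attribute filters, pseudo-classes."""
    element_type = re.match(r"[a-z_]+", selector).group(0)
    rest = selector[len(element_type):]
    attributes = ATTR.findall(rest)
    # Drop the bracketed parts, then read the pseudo-classes that remain.
    pseudos = [(name, arg or None) for name, arg in PSEUDO.findall(ATTR.sub("", rest))]
    return {"type": element_type, "attributes": attributes, "pseudo": pseudos}


parts = split_selector('text[size>=12]:bold:contains("Critical")')
print(parts["type"])        # text
print(parts["attributes"])  # ['size>=12']
print(parts["pseudo"])      # [('bold', None), ('contains', 'Critical')]
```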
Handling Missing Elements
In these examples, we know that certain elements exist in the PDF. In real-world documents, `page.find()` may find no match, in which case it returns `None`. Production code should check for this:
```py
site_label = page.find('text:contains("Site:")')
if site_label is not None:
    # Found it: highlight the element and use its text
    site_label.highlight(color="red", label="Site Label")
    print(site_label.text)
else:
    # No match: handle it appropriately
    print("Warning: 'Site:' label not found.")
```
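When many lookups need the same guard, the check can be wrapped in a small helper. This is a minimal sketch: `text_or_default` is a hypothetical function, and `SimpleNamespace` objects stand in for real elements so the example runs without a PDF.

```python
from types import SimpleNamespace


def text_or_default(element, default=""):
    """Return element.text, or `default` when the lookup returned None."""
    return default if element is None else element.text


# Stand-ins for the result of page.find(): a match and a miss.
found = SimpleNamespace(text="Site: Example")
missing = None

print(text_or_default(found))                 # Site: Example
print(text_or_default(missing, "not found"))  # not found
```

With a real page, the call would look like `text_or_default(page.find('text:contains("Site:")'), "not found")`.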
Visual Debugging
When working with complex selectors, it's helpful to visualize what you're finding:
```py
# Clear any existing highlights
page.clear_highlights()
# Find and highlight elements
elements = page.find_all('text[color~=red]')
elements.highlight(color="red", label="Red Text")
# Display the page to see what was found
page.to_image(width=800)
```