Selectors 101
Selectors are the heart of Natural PDF. They let you find elements in your PDF using a simple, CSS-like syntax.
Quick Reference
'text' # All text elements
'text:bold' # Bold text (pseudo-class)
'text:contains("Invoice")' # Text containing "Invoice"
'text[size>12]' # Text with font size > 12 (attribute)
'text:bold[size>=14]' # Bold text AND size >= 14 (combined)
The Basics
Element Types
Every selector starts with an element type:
| Type | What It Finds | Example |
|---|---|---|
text |
Text content | page.find('text') |
line |
Lines and rules | page.find('line') |
rect |
Rectangles and boxes | page.find('rect') |
image |
Embedded images | page.find('image') |
region |
Layout-detected areas | page.find('region') |
# Find the first text element
first_text = page.find('text')
# Find all lines
all_lines = page.find_all('line')
Pseudo-Classes (:name)
Pseudo-classes filter by state or content. They use a colon (:) prefix.
| Pseudo-Class | Description | Example |
|---|---|---|
:bold |
Bold text | 'text:bold' |
:italic |
Italic text | 'text:italic' |
:contains("X") |
Contains the text "X" | 'text:contains("Invoice")' |
:startswith("X") |
Starts with "X" | 'text:startswith("Total")' |
:endswith("X") |
Ends with "X" | 'text:endswith(":")' |
:regex("pattern") |
Matches a regex pattern | 'text:regex("INV-\\d+")' |
:horizontal |
Horizontal lines | 'line:horizontal' |
:vertical |
Vertical lines | 'line:vertical' |
# Find bold text
bold = page.find('text:bold')
# Find text containing "Total"
total = page.find('text:contains("Total")')
# Case-insensitive search
total = page.find('text:contains("total")', case=False)
Common Mistake
Use colon (:) not dot (.) for pseudo-classes:
- ✅
'text:bold'(correct) - ❌
'text.bold'(wrong - this is CSS class syntax)
Attribute Filters ([attr=value])
Attribute filters match specific properties. They use brackets ([]).
| Operator | Meaning | Example |
|---|---|---|
= |
Equals | 'text[size=12]' |
!= |
Not equals | 'text[size!=12]' |
> |
Greater than | 'text[size>12]' |
>= |
Greater or equal | 'text[size>=12]' |
< |
Less than | 'text[size<12]' |
<= |
Less or equal | 'text[size<=12]' |
*= |
Contains | 'text[fontname*=Arial]' |
^= |
Starts with | 'text[fontname^=Times]' |
$= |
Ends with | 'text[fontname$=Bold]' |
# Find large text (size > 14)
large = page.find_all('text[size>14]')
# Find text in Arial font
arial = page.find_all('text[fontname*=Arial]')
# Find high-confidence OCR results
confident = page.find_all('text[confidence>=0.9]')
Common Attributes
| Attribute | Element Types | Description |
|---|---|---|
size |
text | Font size in points |
fontname |
text | Font family name |
confidence |
text (OCR) | OCR confidence (0-1) |
source |
text | Origin: pdf or ocr |
type |
region | Region type from layout analysis |
width |
line, rect | Element width |
height |
line, rect, image | Element height |
fill |
rect | Has fill color |
stroke |
rect, line | Has stroke/border |
Combining Selectors
You can combine pseudo-classes and attributes for precise matching:
# Bold text larger than 14pt
headers = page.find_all('text:bold[size>14]')
# Text containing "Total" in Arial font
totals = page.find_all('text:contains("Total")[fontname*=Arial]')
# Bold, large text containing "Summary"
summary = page.find('text:bold[size>=16]:contains("Summary")')
Selector Order
Pseudo-classes and attributes can appear in any order after the type. These are all equivalent:
'text:bold[size>14]:contains("Summary")'
'text:contains("Summary"):bold[size>14]'
'text[size>14]:bold:contains("Summary")'
Finding Elements
find() - First Match
Returns the first matching element, or None if not found.
# Find the first bold text
title = page.find('text:bold')
# Always check for None!
if title:
print(title.extract_text())
else:
print("Not found")
find_all() - All Matches
Returns an ElementCollection with all matching elements.
# Find all bold text
all_bold = page.find_all('text:bold')
print(f"Found {len(all_bold)} bold elements")
# ElementCollection has useful methods
first = all_bold.first # First element or None
texts = all_bold.extract_text() # Extract text from all
Advanced Patterns
Regex Matching
There are two ways to use regex. The :regex() pseudo-class matches against the full text of each element:
# Find invoice numbers like "INV-12345"
invoice = page.find('text:regex("INV-\\d+")')
# Find dates in MM/DD/YYYY format
dates = page.find_all('text:regex("\\d{2}/\\d{2}/\\d{4}")')
# Find page numbers like "Page 1 of 10"
page.find('text:regex("Page \\d+ of \\d+")')
You can also use regex=True with :contains() for the same effect:
# These are equivalent
page.find('text:regex("Total|Sum")')
page.find('text:contains("Total|Sum")', regex=True)
Layout Regions
After running layout analysis, you can find detected regions:
# First, detect layout
page.analyze_layout(engine='yolo')
# Find all detected tables
tables = page.find_all('region[type=table]')
# Find specific region types
titles = page.find_all('region[type=title]')
figures = page.find_all('region[type=figure]')
OCR Elements
After applying OCR, filter by source and confidence:
# Apply OCR
page.apply_ocr()
# Find OCR text (not native PDF text)
ocr_text = page.find_all('text[source=ocr]')
# Find high-confidence OCR only
confident = page.find_all('text[source=ocr][confidence>=0.8]')
# Find native PDF text only
native = page.find_all('text[source=pdf]')
Common Patterns Cheat Sheet
| Goal | Selector |
|---|---|
| All text | 'text' |
| Bold text | 'text:bold' |
| Large text (headings) | 'text[size>=14]' |
| Text containing "X" | 'text:contains("X")' |
| Case-insensitive search | page.find('text:contains("x")', case=False) |
| Horizontal lines | 'line:horizontal' |
| Thick lines | 'line[width>=2]' |
| Filled rectangles | 'rect[fill]' |
| Detected tables | 'region[type=table]' |
| High-confidence OCR | 'text[source=ocr][confidence>=0.8]' |
Troubleshooting
"My selector finds nothing"
-
Start broad, then narrow down:
# See what's on the page all_text = page.find_all('text') print(f"Total: {len(all_text)} text elements") all_text.show() # Visualize them -
Check your spelling and case:
# This finds "Invoice" but not "INVOICE" or "invoice" page.find('text:contains("Invoice")') # Use case=False for case-insensitive page.find('text:contains("invoice")', case=False) -
Check the syntax:
# Wrong page.find('text.bold') # Use colon, not dot page.find('text[contains="X"]') # :contains is a pseudo-class page.find('text:contains(Invoice)') # Need quotes around text # Correct page.find('text:bold') page.find('text:contains("Invoice")')
"I get AttributeError: 'NoneType'"
find() returns None when nothing matches. Always check:
# Wrong - crashes if not found
text = page.find('text:contains("Missing")').extract_text()
# Correct - handle None
element = page.find('text:contains("Missing")')
if element:
text = element.extract_text()
else:
text = ""
Next Steps
- Quickstart - See selectors in action
- Finding Elements Tutorial - Deep dive into finding elements
- Spatial Navigation - Use
.below(),.above()after finding