Finding Elements with Selectors¶

Natural PDF uses CSS-like selectors to find elements (text, lines, images, etc.) within a PDF page or document. This guide demonstrates how to use these selectors effectively.

Setup¶

Let's load a sample PDF to work with. We'll use 01-practice.pdf which has various elements.

In [1]:

Copied!

from natural_pdf import PDF

# Load the PDF
pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")

# Select the first page
page = pdf.pages[0]

# Display the page
page.show()
from natural_pdf import PDF

# Load the PDF
pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")

# Select the first page
page = pdf.pages[0]

# Display the page
page.show()

CropBox missing from /Page, defaulting to MediaBox

Out[1]:

No description has been provided for this image

Basic Element Finding¶

The core methods are find() (returns the first match) and find_all() (returns all matches as an ElementCollection).

The basic selector structure is element_type[attribute_filter]:pseudo_class.

Finding Text by Content¶

In [2]:

Copied!

# Find the first text element containing "Summary"
summary_text = page.find('text:contains("Summary")')
summary_text
# Find the first text element containing "Summary"
summary_text = page.find('text:contains("Summary")')
summary_text

Out[2]:

<TextElement text='Summary: ' font='Helvetica' size=10.0, style=['bold'] bbox=(50.0, 144.07000000000005, 101.68, 154.07000000000005)>

In [3]:

Copied!

# Find all text elements containing "Inadequate"
contains_inadequate = page.find_all('text:contains("Inadequate")')
len(contains_inadequate)
# Find all text elements containing "Inadequate"
contains_inadequate = page.find_all('text:contains("Inadequate")')
len(contains_inadequate)

Out[3]:

In [4]:

Copied!

summary_text.highlight(label='summary')
contains_inadequate.highlight(label="inadequate")
page.to_image(width=700)
summary_text.highlight(label='summary')
contains_inadequate.highlight(label="inadequate")
page.to_image(width=700)

Out[4]:

Selecting by Element Type¶

You can select specific types of elements found in PDFs.

In [5]:

Copied!

# Find all text elements
all_text = page.find_all('text')
len(all_text)
# Find all text elements
all_text = page.find_all('text')
len(all_text)

Out[5]:

In [6]:

Copied!

# Find all rectangle elements
all_rects = page.find_all('rect')
len(all_rects)
# Find all rectangle elements
all_rects = page.find_all('rect')
len(all_rects)

Out[6]:

In [7]:

Copied!

# Find all line elements
all_lines = page.find_all('line')
len(all_lines)
# Find all line elements
all_lines = page.find_all('line')
len(all_lines)

Out[7]:

In [8]:

Copied!

page.find_all('line').show()
page.find_all('line').show()

Out[8]:

Filtering by Attributes¶

Use square brackets [] to filter elements by their properties (attributes).

Common Attributes & Operators¶

Attribute	Example Usage	Operators	Notes
`size` (text)	`text[size>=12]`	`>`, `<`, `>=`, `<=`	Font size in points
`fontname`	`text[fontname*=Bold]`	`=`, `*=`	`*=` for contains substring
`color` (text)	`text[color~=red]`	`~=`	Approx. match (name, rgb, hex)
`width` (line)	`line[width>1]`	`>`, `<`, `>=`, `<=`	Line thickness
`source`	`text[source=ocr]`	`=`	`pdf`, `ocr`, `detected`
`type` (region)	`region[type=table]`	`=`	Layout analysis region type

In [9]:

Copied!

# Find large text (size >= 11 points)
page.find_all('text[size>=11]')
# Find large text (size >= 11 points)
page.find_all('text[size>=11]')

Out[9]:

<ElementCollection[TextElement](count=1)>

In [10]:

Copied!

# Find text with 'Helvetica' in the font name
page.find_all('text[fontname*=Helvetica]')
# Find text with 'Helvetica' in the font name
page.find_all('text[fontname*=Helvetica]')

Out[10]:

<ElementCollection[TextElement](count=44)>

In [11]:

Copied!

# Find red text (using approximate color match)
# This PDF has text with color (0.8, 0.0, 0.0)
red_text = page.find_all('text[color~=red]')
# Find red text (using approximate color match)
# This PDF has text with color (0.8, 0.0, 0.0)
red_text = page.find_all('text[color~=red]')

In [12]:

Copied!

# Highlight the red text (ignoring existing highlights)
red_text.show()
# Highlight the red text (ignoring existing highlights)
red_text.show()

Out[12]:

In [13]:

Copied!

# Find thick lines (width >= 2)
page.find_all('line[width>=2]')
# Find thick lines (width >= 2)
page.find_all('line[width>=2]')

Out[13]:

<ElementCollection[LineElement](count=1)>

Using Pseudo-Classes¶

Use colons : for special conditions (pseudo-classes).

Common Pseudo-Classes¶

Pseudo-Class	Example Usage	Notes
`:contains('text')`	`text:contains('Report')`	Finds elements containing specific text
`:bold`	`text:bold`	Finds text heuristically identified as bold
`:italic`	`text:italic`	Finds text heuristically identified as italic
`:below(selector)`	`text:below('line[width>=2]')`	Finds elements physically below the reference element
`:above(selector)`	`text:above('text:contains("Summary")')`	Finds elements physically above the reference element
`:left-of(selector)`	`line:left-of('rect')`	Finds elements physically left of the reference element
`:right-of(selector)`	`text:right-of('rect')`	Finds elements physically right of the reference element
`:near(selector)`	`text:near('image')`	Finds elements physically near the reference element

Note: Spatial pseudo-classes like :below, :above identify elements based on bounding box positions relative to the first element matched by the inner selector.

In [14]:

Copied!

# Find bold text
page.find_all('text:bold').show()
# Find bold text
page.find_all('text:bold').show()

Out[14]:

In [15]:

Copied!

# Combine attribute and pseudo-class: bold text size >= 11
page.find_all('text[size>=11]:bold')
# Combine attribute and pseudo-class: bold text size >= 11
page.find_all('text[size>=11]:bold')

Out[15]:

<ElementCollection[TextElement](count=1)>

Negation Pseudo-class (`:not()`)¶

You can exclude elements that match a certain selector using the :not() pseudo-class. It takes another simple selector as its argument.

In [16]:

Copied!





# Find all text elements that are NOT bold
non_bold_text = page.find_all('text:not(:bold)')

# Find all elements that are NOT regions of type 'table'
not_tables = page.find_all(':not(region[type=table])')

# Find text elements that do not contain "Total" (case-insensitive)
relevant_text = page.find_all('text:not(:contains("Total"))', case=False)

# Find text elements that are not empty
non_empty_text = page.find_all('text:not(:empty)')
# Find all text elements that are NOT bold
non_bold_text = page.find_all('text:not(:bold)')

# Find all elements that are NOT regions of type 'table'
not_tables = page.find_all(':not(region[type=table])')

# Find text elements that do not contain "Total" (case-insensitive)
relevant_text = page.find_all('text:not(:contains("Total"))', case=False)

# Find text elements that are not empty
non_empty_text = page.find_all('text:not(:empty)')

Note: The selector inside :not() follows the same rules as regular selectors but currently does not support combinators (like >, +, ~, or descendant space) within :not(). You can nest basic type, attribute, and other pseudo-class selectors.

Spatial Pseudo-Classes Examples¶

In [17]:

Copied!





# Find the thick horizontal line first
ref_line = page.find('line[width>=2]')

# Find text elements strictly above that line
text_above_line = page.find_all('text:above("line[width>=2]")')
text_above_line
# Find the thick horizontal line first
ref_line = page.find('line[width>=2]')

# Find text elements strictly above that line
text_above_line = page.find_all('text:above("line[width>=2]")')
text_above_line

Out[17]:

<ElementCollection[TextElement](count=17)>

Advanced Text Searching Options¶

Pass options to find() or find_all() for more control over text matching.

In [18]:

Copied!

# Case-insensitive search for "summary"
page.find_all('text:contains("summary")', case=False)
# Case-insensitive search for "summary"
page.find_all('text:contains("summary")', case=False)

Out[18]:

<ElementCollection[TextElement](count=1)>

In [19]:

Copied!

# Regular expression search for the inspection ID (e.g., INS-XXX...)
# The ID is in the red text we found earlier
page.find_all('text:contains("INS-\\w+")', regex=True)
# Regular expression search for the inspection ID (e.g., INS-XXX...)
# The ID is in the red text we found earlier
page.find_all('text:contains("INS-\\w+")', regex=True)

Out[19]:

<ElementCollection[TextElement](count=1)>

In [20]:

Copied!

# Combine regex and case-insensitivity
page.find_all('text:contains("jungle health")', regex=True, case=False)
# Combine regex and case-insensitivity
page.find_all('text:contains("jungle health")', regex=True, case=False)

Out[20]:

<ElementCollection[TextElement](count=2)>

Working with ElementCollections¶

find_all() returns an ElementCollection, which is like a list but with extra PDF-specific methods.

In [21]:

Copied!

# Get all headings (using a selector for large, bold text)
headings = page.find_all('text[size>=11]:bold')
headings
# Get all headings (using a selector for large, bold text)
headings = page.find_all('text[size>=11]:bold')
headings

Out[21]:

<ElementCollection[TextElement](count=1)>

In [22]:

Copied!





# Get the first and last heading in reading order
first = headings.first
last = headings.last
(first, last)
# Get the first and last heading in reading order
first = headings.first
last = headings.last
(first, last)

Out[22]:

(<TextElement text='Violations' font='Helvetica' size=12.0, style=['bold'] bbox=(50.0, 372.484, 107.348, 384.484)>,
 <TextElement text='Violations' font='Helvetica' size=12.0, style=['bold'] bbox=(50.0, 372.484, 107.348, 384.484)>)

In [23]:

Copied!





# Get the physically highest/lowest element in the collection
highest = headings.highest()
lowest = headings.lowest()
(highest, lowest)
# Get the physically highest/lowest element in the collection
highest = headings.highest()
lowest = headings.lowest()
(highest, lowest)

Out[23]:

(<TextElement text='Violations' font='Helvetica' size=12.0, style=['bold'] bbox=(50.0, 372.484, 107.348, 384.484)>,
 <TextElement text='Violations' font='Helvetica' size=12.0, style=['bold'] bbox=(50.0, 372.484, 107.348, 384.484)>)

In [24]:

Copied!

# Filter the collection further: headings containing "Service"
service_headings = headings.filter(lambda heading: 'Service' in heading.extract_text())
# Filter the collection further: headings containing "Service"
service_headings = headings.filter(lambda heading: 'Service' in heading.extract_text())

In [25]:

Copied!

# Extract text from all elements in the collection
headings.extract_text()
# Extract text from all elements in the collection
headings.extract_text()

Out[25]:

'Violations'

Remember: .highest(), .lowest(), .leftmost(), .rightmost() raise errors if the collection spans multiple pages.

Font Variants¶

Sometimes PDFs use font variants (prefixes like AAAAAB+) which can be useful for selection.

In [26]:

Copied!

# Find text elements with a specific font variant prefix (if any exist)
# This example PDF doesn't use variants, but the selector works like this:
page.find_all('text[font-variant=AAAAAB]')
# Find text elements with a specific font variant prefix (if any exist)
# This example PDF doesn't use variants, but the selector works like this:
page.find_all('text[font-variant=AAAAAB]')

Out[26]:

<ElementCollection[Mixed](count=0)>

Containment geometry¶

In [27]:

Copied!

from natural_pdf import PDF

pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/geometry.pdf")
page = pdf.pages[0]

rect = page.find('rect')
rect.show(width=500)
from natural_pdf import PDF

pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/geometry.pdf")
page = pdf.pages[0]

rect = page.find('rect')
rect.show(width=500)

CropBox missing from /Page, defaulting to MediaBox

Out[27]:

By default, being inside of something means being fully inside of the outer object.

In [28]:

Copied!

# rect.find_all('text', contains='all').show()
rect.find_all('text').show()
# rect.find_all('text', contains='all').show()
rect.find_all('text').show()

Out[28]:

If you're interested in any overlap, you can use contains='any'.

In [29]:

Copied!

rect.find_all('text', contains='any').show()
rect.find_all('text', contains='any').show()

Out[29]:

For just the center being part of it, contains='center' will work for you.

In [30]:

Copied!

rect.find_all('text', contains='center').show()
rect.find_all('text', contains='center').show()

Out[30]:

Next Steps¶

Now that you can find elements, explore:

Text Extraction: Get text content from found elements.
Spatial Navigation: Use found elements as anchors to navigate (.above(), .below(), etc.).
Working with Regions: Define areas based on found elements.
Visual Debugging: Techniques for highlighting and visualizing elements.