Visual Debugging¶

When working with PDFs, it can be hard to tell what a selector actually matched. Natural PDF's visual debugging tools let you see exactly which elements and regions you're extracting.

Adding Persistent Highlights¶

Call .highlight() on an Element or ElementCollection to add a persistent highlight to its page. Persistent highlights are stored with the page and show up in later renderings, such as page.to_image(), until you clear them.

from natural_pdf import PDF

pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")
page = pdf.pages[0]

# Find matching elements and add persistent highlights to them
page.find_all('text:contains("Summary")').highlight()
page.find_all('text:contains("Date")').highlight()
page.find_all('line').highlight()
page.to_image(width=700)

Customizing Persistent Highlights¶

Customize the appearance of persistent highlights added with .highlight():

page.clear_highlights()

title = page.find('text:bold[size>=12]')

# Highlight with a specific color (string name, hex, or RGB/RGBA tuple)
# title.highlight(color=(1, 0, 0, 0.3))  # Red with 30% opacity
# title.highlight(color="#FF0000")        # Hex color
title.highlight(color="red")           # Color name

text = page.find('text:contains("Critical")')

# Add a label to the highlight (appears in legend)
text.highlight(label="Critical")

# Combine color and label
rect = page.find('rect')
rect.highlight(color=(0, 0, 1, 0.2), label="Box")

page.to_image(width=700)
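
The three color forms used above (name, hex string, RGB/RGBA tuple) describe the same kind of value. As a rough illustration of how they relate, and not Natural PDF's actual implementation, the sketch below normalizes each form to an RGBA tuple of floats in the 0-1 range. The `to_rgba` helper, the small name table, and the default alpha of 0.4 are all assumptions made for this example.

```python
# Hypothetical helper, not part of Natural PDF: normalize a color spec to RGBA floats.
NAMED_COLORS = {
    "red": (1.0, 0.0, 0.0),
    "green": (0.0, 0.5, 0.0),
    "blue": (0.0, 0.0, 1.0),
}
DEFAULT_ALPHA = 0.4  # assumed default opacity for this sketch


def to_rgba(color, default_alpha=DEFAULT_ALPHA):
    """Return (r, g, b, a) with every channel in the 0-1 range."""
    if isinstance(color, str):
        if color.startswith("#"):
            digits = color.lstrip("#")
            if len(digits) != 6:
                raise ValueError(f"expected #RRGGBB, got {color!r}")
            # Each pair of hex digits is one 0-255 channel
            rgb = tuple(int(digits[i:i + 2], 16) / 255 for i in (0, 2, 4))
        else:
            rgb = NAMED_COLORS[color.lower()]  # KeyError for names outside the table
        return (*rgb, default_alpha)
    if len(color) == 3:
        return (*map(float, color), default_alpha)
    if len(color) == 4:
        return tuple(map(float, color))
    raise ValueError(f"unsupported color: {color!r}")


print(to_rgba("red"))
print(to_rgba("#FF0000"))
print(to_rgba((1, 0, 0, 0.3)))
```

In this sketch only the tuple form carries its own opacity; the name and hex forms fall back to the assumed default.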

Highlighting Multiple Elements¶

Highlighting an ElementCollection applies the highlight to all elements within it. By default, all elements in the collection get the same color and a label based on their type.

# Find and highlight all headings with a single color/label
headings = page.find_all('text[size>=14]:bold')
headings.highlight(color=(0, 0.5, 0, 0.3), label="Headings")

# Find and highlight all tables
tables = page.find_all('region[type=table]')
tables.highlight(color=(0, 0, 1, 0.2), label="Tables")

# View the result
page.viewer()

Highlighting Regions¶

You can highlight regions to see what area you're working with:

# Find a title and create a region below it
title = page.find('text:contains("Violations")')
content = title.below(height=200)

# Highlight the region
content.show()

Or look at just the region by itself:

# Find a title and create a region below it
title = page.find('text:contains("Violations")')
content = title.below(height=200)

# Crop to the region
content.to_image(crop_only=True, include_highlights=False)
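
To make the geometry concrete, here is a minimal sketch of the rectangle that a call like `title.below(height=200)` might cover. It assumes top-left-origin coordinates (y grows downward), a region that spans the full page width, and clipping at the bottom of the page. The `BBox` type, the `region_below` function, and all coordinates are illustrative and are not Natural PDF's API.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    """Illustrative bounding box; y grows downward in this sketch."""
    x0: float
    top: float
    x1: float
    bottom: float


def region_below(element: BBox, page_width: float, page_height: float, height: float) -> BBox:
    """Rectangle starting at the element's bottom edge and extending `height` points down."""
    top = element.bottom
    bottom = min(top + height, page_height)  # clip at the page edge
    return BBox(0.0, top, page_width, bottom)


title = BBox(x0=50, top=100, x1=200, bottom=120)  # made-up coordinates
content = region_below(title, page_width=612, page_height=792, height=200)
print(content)
```

Highlighting or cropping the real region is a quick way to check whether assumptions like these hold for your document.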

Working with Text Styles¶

Visualize text styles to understand the document structure:

# Analyze and highlight text styles
page.clear_highlights()

page.analyze_text_styles()
page.find_all('text').highlight(group_by='style_label')

page.to_image(width=700)
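
One way to picture `group_by`: elements that share the same attribute value form a group, and each group gets its own color and legend entry. The sketch below shows that idea with plain dictionaries standing in for text elements. The `style_label` values, the palette, and the `assign_group_colors` helper are invented for illustration and do not reflect Natural PDF's internals.

```python
from itertools import cycle

PALETTE = ["red", "blue", "green", "orange"]  # arbitrary example palette


def assign_group_colors(elements, group_by):
    """Map each distinct value of `group_by` to a color, in first-seen order."""
    colors = cycle(PALETTE)  # wraps around if there are more groups than colors
    legend = {}
    for element in elements:
        key = element[group_by]
        if key not in legend:
            legend[key] = next(colors)
    return legend


# Made-up stand-ins for text elements after style analysis
elements = [
    {"text": "Inspection Report", "style_label": "heading"},
    {"text": "Date: February 3", "style_label": "body"},
    {"text": "Violations", "style_label": "heading"},
    {"text": "See attached notes", "style_label": "body"},
]

legend = assign_group_colors(elements, group_by="style_label")
for label, color in legend.items():
    print(f"{label}: {color}")
```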

Displaying Attributes¶

You can display element attributes directly on the highlights:

pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/Atlanta_Public_Schools_GA_sample.pdf")
page = pdf.pages[0]

# Label each line highlight with its width and color attributes
lines = page.find_all('line')
lines.highlight(include_attrs=['width', 'color'])

page.to_image(width=700)

Does it get busy? YES.

Clearing Highlights¶

You can clear persistent highlights from a page:

# Clear all highlights on the page
page.clear_highlights()

# Apply new highlights
page.find_all('text:bold').highlight(label="Bold Text")
page.viewer()

Document QA Visualization¶

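You can also visualize document QA results. The response from page.ask() includes the source_elements the answer was drawn from, which you can show on the page: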

pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/0500000US42007.pdf")
page = pdf.pages[0]
page.to_image(width=700)

response = page.ask("How many votes did Kamala Harris get on Election Day?")
response

{'answer': '60',
 'confidence': 0.31857365369796753,
 'start': 31,
 'end': 31,
 'found': True,
 'page_num': 0,
 'source_elements': <ElementCollection[TextElement](count=1)>}

response['source_elements'].show()
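
Because the response is a dictionary with `found` and `confidence` keys, it can be worth checking both before relying on the answer. Below is a minimal sketch using the values from the output above. The 0.5 threshold and the `is_trustworthy` helper are arbitrary choices for illustration, not part of Natural PDF.

```python
# Values copied from the response shown above; 'source_elements' is omitted
# because it holds live element objects.
response = {
    "answer": "60",
    "confidence": 0.31857365369796753,
    "start": 31,
    "end": 31,
    "found": True,
    "page_num": 0,
}

MIN_CONFIDENCE = 0.5  # arbitrary threshold chosen for this sketch


def is_trustworthy(response, min_confidence=MIN_CONFIDENCE):
    """True only when an answer was found and its confidence clears the threshold."""
    return bool(response.get("found")) and response.get("confidence", 0.0) >= min_confidence


if is_trustworthy(response):
    print(f"Answer: {response['answer']}")
else:
    print(f"Low confidence ({response['confidence']:.2f}); inspect the source elements")
```

With a confidence of about 0.32, this answer falls below the example threshold, which is exactly the case where showing the source elements helps you judge the result yourself.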

Next Steps¶

Now that you know how to visualize PDF content, you might want to explore:

OCR capabilities for working with scanned documents
Layout analysis for automatic structure detection
Document QA for asking questions directly to your documents