Text Extraction Guide¶

This guide demonstrates various ways to extract text from PDFs using Natural PDF, from simple page dumps to targeted extraction based on elements, regions, and styles.

Setup¶

First, let's import necessary libraries and load a sample PDF. We'll use example.pdf from the tutorials' pdfs directory. Adjust the path if your setup differs.

In [1]:

Copied!

from natural_pdf import PDF

# Load the PDF
pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")

# Select the first page for initial examples
page = pdf.pages[0]

# Display the first page
page.show(width=700)
from natural_pdf import PDF

# Load the PDF
pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")

# Select the first page for initial examples
page = pdf.pages[0]

# Display the first page
page.show(width=700)

CropBox missing from /Page, defaulting to MediaBox

Out[1]:

No description has been provided for this image

Basic Text Extraction¶

Get all text from a page or the entire document.

In [2]:

Copied!

# Extract all text from the first page
# Displaying first 500 characters
print(page.extract_text()[:500])
# Extract all text from the first page
# Displaying first 500 characters
print(page.extract_text()[:500])

Jungle Health and Safety Inspection Service
INS-UP70N51NCL41R
Site: Durham’s Meatpacking Chicago, Ill.
Date: February 3, 1905
Violation Count: 7
Summary: Worst of any, however, were the fertilizer men, and those who served in the cooking rooms.
These people could not be shown to the visitor - for the odor of a fertilizer man would scare any ordinary
visitor at a hundred yards, and as for the other men, who worked in tank rooms full of steam, and in
some of which there were open vats near the lev

You can also preserve layout with layout=True.

In [3]:

Copied!

# Extract text from the entire document (may take time)
# Uncomment to run:
print(page.extract_text(layout=True)[:2000])
# Extract text from the entire document (may take time)
# Uncomment to run:
print(page.extract_text(layout=True)[:2000])

Jungle Health and Safety Inspection Service
INS-UP70N51NCL41R
Site: Durham’s Meatpacking Chicago, Ill.
Date: February 3, 1905
Violation Count: 7
Summary: Worst of any, however, were the fertilizer men, and those who served in the cooking rooms.
These people could not be shown to the visitor - for the odor of a fertilizer man would scare any ordinary
visitor at a hundred yards, and as for the other men, who worked in tank rooms full of steam, and in
some of which there were open vats near the level of the floor, their peculiar trouble was that they fell
into the vats; and when they were fished out, there was never enough of them left to be worth
exhibiting - sometimes they would be overlooked for days, till all but the bones of them had gone out
to the world as Durham’s Pure Leaf Lard!
Violations
Statute Description Level Repeat?
4.12.7 Unsanitary Working Conditions. Critical
5.8.3 Inadequate Protective Equipment. Serious
6.3.9 Ineffective Injury Prevention. Serious
7.1.5 Failure to Properly Store Hazardous Materials. Critical
8.9.2 Lack of Adequate Fire Safety Measures. Serious
9.6.4 Inadequate Ventilation Systems. Serious
10.2.7 Insufficient Employee Training for Safe Work Practices. Serious
Jungle Health and Safety Inspection Service

Extracting Text from Specific Elements¶

Use selectors with find() or find_all() to target specific elements. Selectors like :contains("Summary") are examples; adapt them to your PDF.

In [4]:

Copied!





# Find a single element, e.g., a title containing "Summary"
# Adjust selector as needed
date_element = page.find('text:contains("Site")')
date_element # Display the found element object
# Find a single element, e.g., a title containing "Summary"
# Adjust selector as needed
date_element = page.find('text:contains("Site")')
date_element # Display the found element object

Out[4]:

<TextElement text='Site: ' font='Helvetica' size=10.0, style=['bold'] bbox=(50.0, 84.07000000000005, 74.45, 94.07000000000005)>

In [5]:

Copied!

date_element.show()
date_element.show()

Out[5]:

In [6]:

Copied!

date_element.text
date_element.text

Out[6]:

'Site: '

In [7]:

Copied!

# Find multiple elements, e.g., bold headings (size >= 8)
heading_elements = page.find_all('text[size>=8]:bold')
heading_elements 
# Find multiple elements, e.g., bold headings (size >= 8)
heading_elements = page.find_all('text[size>=8]:bold')
heading_elements

Out[7]:

<ElementCollection[TextElement](count=9)>

In [8]:

Copied!

page.find_all('text[size>=8]:bold').show()
page.find_all('text[size>=8]:bold').show()

Out[8]:

In [9]:

Copied!

# Pull out all of their text (why? I don't know!)
print(heading_elements.extract_text())
# Pull out all of their text (why? I don't know!)
print(heading_elements.extract_text())

Site: Date:  Violation Count: Summary: ViolationsStatuteDescriptionLevelRepeat?

Advanced text searches¶

In [10]:

Copied!

# Exact phrase (case-sensitive)
page.find('text:contains("Hazardous Materials")').text
# Exact phrase (case-sensitive)
page.find('text:contains("Hazardous Materials")').text

Out[10]:

'Failure to Properly Store Hazardous Materials.'

In [11]:

Copied!

# Exact phrase (case-sensitive)
page.find('text:contains("HAZARDOUS MATERIALS")', case=False).text
# Exact phrase (case-sensitive)
page.find('text:contains("HAZARDOUS MATERIALS")', case=False).text

Out[11]:

'Failure to Properly Store Hazardous Materials.'

In [12]:

Copied!

# Regular expression (e.g., "YYYY Report")
regex = "\d+, \d{4}"
page.find(f'text:contains("{regex}")', regex=True)
# Regular expression (e.g., "YYYY Report")
regex = "\d+, \d{4}"
page.find(f'text:contains("{regex}")', regex=True)

Out[12]:

<TextElement text='February 3...' font='Helvetica' size=10.0 bbox=(80.56, 104.07000000000005, 156.71000000000004, 114.07000000000005)>

In [13]:

Copied!

# Regular expression (e.g., "YYYY Report")
page.find_all('text[fontname="Helvetica"][size=10]')
# Regular expression (e.g., "YYYY Report")
page.find_all('text[fontname="Helvetica"][size=10]')

Out[13]:

<ElementCollection[TextElement](count=32)>

Regions¶

In [14]:

Copied!

# Region below an element (e.g., below "Introduction")
# Adjust selector as needed
page.find('text:contains("Summary")').below(include_source=True).show()
# Region below an element (e.g., below "Introduction")
# Adjust selector as needed
page.find('text:contains("Summary")').below(include_source=True).show()

Out[14]:

In [15]:

Copied!





(
    page
    .find('text:contains("Summary")')
    .below(include_source=True)
    .extract_text()
    [:500]
)
(
    page
    .find('text:contains("Summary")')
    .below(include_source=True)
    .extract_text()
    [:500]
)

Out[15]:

'Summary: Worst of any, however, were the fertilizer men, and those who served in the cooking rooms.\nThese people could not be shown to the visitor - for the odor of a fertilizer man would scare any ordinary\nvisitor at a hundred yards, and as for the other men, who worked in tank rooms full of steam, and in\nsome of which there were open vats near the level of the floor, their peculiar trouble was that they fell\ninto the vats; and when they were fished out, there was never enough of them left to b'

In [16]:

Copied!





(
    page
    .find('text:contains("Summary")')
    .below(include_source=True, until='line:horizontal')
    .show()
)
(
    page
    .find('text:contains("Summary")')
    .below(include_source=True, until='line:horizontal')
    .show()
)

Out[16]:

In [17]:

Copied!

# Manually defined region via coordinates (x0, top, x1, bottom)
manual_region = page.create_region(30, 60, 600, 300)
manual_region.show()
# Manually defined region via coordinates (x0, top, x1, bottom)
manual_region = page.create_region(30, 60, 600, 300)
manual_region.show()

Out[17]:

In [18]:

Copied!

# Extract text from the manual region
manual_region.extract_text()[:500]
# Extract text from the manual region
manual_region.extract_text()[:500]

Out[18]:

'Site: Durham’s Meatpacking Chicago, Ill.\nDate: February 3, 1905\nViolation Count: 7\nSummary: Worst of any, however, were the fertilizer men, and those who served in the cooking rooms.\nThese people could not be shown to the visitor - for the odor of a fertilizer man would scare any ordinary\nvisitor at a hundred yards, and as for the other men, who worked in tank rooms full of steam, and in\nsome of which there were open vats near the level of the floor, their peculiar trouble was that they fell\nint'

Filtering Out Headers and Footers¶

Use Exclusion Zones to remove unwanted content before extraction. Adjust selectors for typical header/footer content.

In [19]:

Copied!





header_content = page.find('rect')
footer_content = page.find_all('line')[-1].below()

header_content.highlight()
footer_content.highlight()
page.show()
header_content = page.find('rect')
footer_content = page.find_all('line')[-1].below()

header_content.highlight()
footer_content.highlight()
page.show()

Out[19]:

In [20]:

Copied!

page.extract_text()[:500]
page.extract_text()[:500]

Out[20]:

'Jungle Health and Safety Inspection Service\nINS-UP70N51NCL41R\nSite: Durham’s Meatpacking Chicago, Ill.\nDate: February 3, 1905\nViolation Count: 7\nSummary: Worst of any, however, were the fertilizer men, and those who served in the cooking rooms.\nThese people could not be shown to the visitor - for the odor of a fertilizer man would scare any ordinary\nvisitor at a hundred yards, and as for the other men, who worked in tank rooms full of steam, and in\nsome of which there were open vats near the lev'

In [21]:

Copied!

page.add_exclusion(header_content)
page.add_exclusion(footer_content)
page.add_exclusion(header_content)
page.add_exclusion(footer_content)

Out[21]:

<Page number=1 index=0>

In [22]:

Copied!

page.extract_text()[:500]
page.extract_text()[:500]

Out[22]:

'Site: Durham’s Meatpacking Chicago, Ill.\nDate: February 3, 1905\nViolation Count: 7\nSummary: Worst of any, however, were the fertilizer men, and those who served in the cooking rooms.\nThese people could not be shown to the visitor - for the odor of a fertilizer man would scare any ordinary\nvisitor at a hundred yards, and as for the other men, who worked in tank rooms full of steam, and in\nsome of which there were open vats near the level of the floor, their peculiar trouble was that they fell\nint'

In [23]:

Copied!

full_text_no_exclusions = page.extract_text(use_exclusions=False)
clean_text = page.extract_text()
f"Original length: {len(full_text_no_exclusions)}, Excluded length: {len(clean_text)}"
full_text_no_exclusions = page.extract_text(use_exclusions=False)
clean_text = page.extract_text()
f"Original length: {len(full_text_no_exclusions)}, Excluded length: {len(clean_text)}"

Out[23]:

'Original length: 1149, Excluded length: 1149'

In [24]:

Copied!

page.clear_exclusions()
page.clear_exclusions()

Out[24]:

<Page number=1 index=0>

Exclusions can also be defined globally at the PDF level using pdf.add_exclusion() with a function.

Controlling Whitespace¶

Manage how spaces and blank lines are handled during extraction using layout.

In [25]:

Copied!

print(page.extract_text())
print(page.extract_text())

Jungle Health and Safety Inspection Service
INS-UP70N51NCL41R
Site: Durham’s Meatpacking Chicago, Ill.
Date: February 3, 1905
Violation Count: 7
Summary: Worst of any, however, were the fertilizer men, and those who served in the cooking rooms.
These people could not be shown to the visitor - for the odor of a fertilizer man would scare any ordinary
visitor at a hundred yards, and as for the other men, who worked in tank rooms full of steam, and in
some of which there were open vats near the level of the floor, their peculiar trouble was that they fell
into the vats; and when they were fished out, there was never enough of them left to be worth
exhibiting - sometimes they would be overlooked for days, till all but the bones of them had gone out
to the world as Durham’s Pure Leaf Lard!
Violations
Statute Description Level Repeat?
4.12.7 Unsanitary Working Conditions. Critical
5.8.3 Inadequate Protective Equipment. Serious
6.3.9 Ineffective Injury Prevention. Serious
7.1.5 Failure to Properly Store Hazardous Materials. Critical
8.9.2 Lack of Adequate Fire Safety Measures. Serious
9.6.4 Inadequate Ventilation Systems. Serious
10.2.7 Insufficient Employee Training for Safe Work Practices. Serious
Jungle Health and Safety Inspection Service

In [26]:

Copied!

print(page.extract_text(use_exclusions=False, layout=True))
print(page.extract_text(use_exclusions=False, layout=True))

Jungle Health and Safety Inspection Service
INS-UP70N51NCL41R
Site: Durham’s Meatpacking Chicago, Ill.
Date: February 3, 1905
Violation Count: 7
Summary: Worst of any, however, were the fertilizer men, and those who served in the cooking rooms.
These people could not be shown to the visitor - for the odor of a fertilizer man would scare any ordinary
visitor at a hundred yards, and as for the other men, who worked in tank rooms full of steam, and in
some of which there were open vats near the level of the floor, their peculiar trouble was that they fell
into the vats; and when they were fished out, there was never enough of them left to be worth
exhibiting - sometimes they would be overlooked for days, till all but the bones of them had gone out
to the world as Durham’s Pure Leaf Lard!
Violations
Statute Description Level Repeat?
4.12.7 Unsanitary Working Conditions. Critical
5.8.3 Inadequate Protective Equipment. Serious
6.3.9 Ineffective Injury Prevention. Serious
7.1.5 Failure to Properly Store Hazardous Materials. Critical
8.9.2 Lack of Adequate Fire Safety Measures. Serious
9.6.4 Inadequate Ventilation Systems. Serious
10.2.7 Insufficient Employee Training for Safe Work Practices. Serious
Jungle Health and Safety Inspection Service

Font Information Access¶

Inspect font details of text elements.

In [27]:

Copied!

# Find the first text element on the page
first_text = page.find_all('text')[1]
first_text # Display basic info
# Find the first text element on the page
first_text = page.find_all('text')[1]
first_text # Display basic info

Out[27]:

<TextElement text='INS-UP70N5...' font='Helvetica' size=8.0 bbox=(385.0, 45.65599999999995, 466.35200000000003, 53.65599999999995)>

In [28]:

Copied!

# Highlight the first text element
first_text.show()
# Highlight the first text element
first_text.show()

Out[28]:

In [29]:

Copied!

# Get detailed font properties dictionary
first_text.font_info()
# Get detailed font properties dictionary
first_text.font_info()

Out[29]:

{'text': 'INS-UP70N51NCL41R',
 'fontname': 'Helvetica',
 'font_family': 'Helvetica',
 'font_variant': '',
 'size': 8.0,
 'bold': False,
 'italic': False,
 'color': (1, 0, 0)}

In [30]:

Copied!

# Check specific style properties directly
f"Is Bold: {first_text.bold}, Is Italic: {first_text.italic}, Font: {first_text.fontname}, Size: {first_text.size}"
# Check specific style properties directly
f"Is Bold: {first_text.bold}, Is Italic: {first_text.italic}, Font: {first_text.fontname}, Size: {first_text.size}"

Out[30]:

'Is Bold: False, Is Italic: False, Font: Helvetica, Size: 8.0'

In [31]:

Copied!





# Find elements by font attributes (adjust selectors)
# Example: Find Arial fonts
arial_text = page.find_all('text[fontname*=Helvetica]')
arial_text # Display list of found elements
# Find elements by font attributes (adjust selectors)
# Example: Find Arial fonts
arial_text = page.find_all('text[fontname*=Helvetica]')
arial_text # Display list of found elements

Out[31]:

<ElementCollection[TextElement](count=44)>

In [32]:

Copied!

# Example: Find large text (e.g., size >= 16)
large_text = page.find_all('text[size>=12]')
large_text
# Example: Find large text (e.g., size >= 16)
large_text = page.find_all('text[size>=12]')
large_text

Out[32]:

<ElementCollection[TextElement](count=1)>

In [33]:

Copied!

# Example: Find large text (e.g., size >= 16)
bold_text = page.find_all('text:bold')
bold_text
# Example: Find large text (e.g., size >= 16)
bold_text = page.find_all('text:bold')
bold_text

Out[33]:

<ElementCollection[TextElement](count=9)>

Working with Font Styles¶

Analyze and group text elements by their computed font style, which combines attributes like font name, size, boldness, etc., into logical groups.

In [34]:

Copied!





# Analyze styles on the page
# This returns a dictionary mapping style names to ElementList objects
page.analyze_text_styles()
page.text_style_labels
# Analyze styles on the page
# This returns a dictionary mapping style names to ElementList objects
page.analyze_text_styles()
page.text_style_labels

Out[34]:

['10.0pt Bold Helvetica (medium)',
 '10.0pt Helvetica (medium)',
 '12.0pt Bold Helvetica (large)',
 '8.0pt Helvetica (small)']

In [35]:

Copied!

page.find_all('text').show(group_by='style_label')
page.find_all('text').show(group_by='style_label')

Out[35]:

In [36]:

Copied!

page.find_all('text[style_label="8.0pt Helvetica (small)"]')
page.find_all('text[style_label="8.0pt Helvetica (small)"]')

Out[36]:

<ElementCollection[TextElement](count=3)>

In [37]:

Copied!

page.find_all('text[fontname="Helvetica"][size=8]')
page.find_all('text[fontname="Helvetica"][size=8]')

Out[37]:

<ElementCollection[TextElement](count=3)>

Font variants (e.g., AAAAAB+FontName) are also accessible via the font-variant attribute selector: page.find_all('text[font-variant="AAAAAB"]').

Reading Order¶

Text extraction respects a pathetic attempt at natural reading order (top-to-bottom, left-to-right by default). page.find_all('text') returns elements already sorted this way.

In [38]:

Copied!

# Get first 5 text elements in reading order
elements_in_order = page.find_all('text')
elements_in_order[:5]
# Get first 5 text elements in reading order
elements_in_order = page.find_all('text')
elements_in_order[:5]

Out[38]:

[<TextElement text='Jungle Hea...' font='Helvetica' size=8.0 bbox=(385.0, 35.65599999999995, 541.9680000000001, 43.65599999999995)>,
 <TextElement text='INS-UP70N5...' font='Helvetica' size=8.0 bbox=(385.0, 45.65599999999995, 466.35200000000003, 53.65599999999995)>,
 <TextElement text='Site: ' font='Helvetica' size=10.0, style=['bold'] bbox=(50.0, 84.07000000000005, 74.45, 94.07000000000005)>,
 <TextElement text='Durham’s M...' font='Helvetica' size=10.0 bbox=(74.45, 84.07000000000005, 182.26000000000002, 94.07000000000005)>,
 <TextElement text='Chicago, I...' font='Helvetica' size=10.0 bbox=(182.26000000000002, 84.07000000000005, 234.50000000000003, 94.07000000000005)>]

In [39]:

Copied!

# Text extracted via page.extract_text() respects this order automatically
# (Result already shown in Basic Text Extraction section)
page.extract_text()[:100]
# Text extracted via page.extract_text() respects this order automatically
# (Result already shown in Basic Text Extraction section)
page.extract_text()[:100]

Out[39]:

'Jungle Health and Safety Inspection Service\nINS-UP70N51NCL41R\nSite: Durham’s Meatpacking Chicago, Il'

Move between elements sequentially based on reading order using .next() and .previous().

In [40]:

Copied!





page.clear_highlights()

start = page.find('text:contains("Date")')
start.highlight(label='Date label')
start.next().highlight(label='Maybe the date', color='green')
start.next('text:contains("\d")', regex=True).highlight(label='Probably the date')

page.show()
page.clear_highlights()

start = page.find('text:contains("Date")')
start.highlight(label='Date label')
start.next().highlight(label='Maybe the date', color='green')
start.next('text:contains("\d")', regex=True).highlight(label='Probably the date')

page.show()

Out[40]:

Next Steps¶

Now that you know how to extract text, you might want to explore:

Working with regions for more precise extraction
OCR capabilities for scanned documents
Document layout analysis for automatic structure detection
Document QA for asking questions directly to your documents