Patterns & Pitfalls
A quick reference to common patterns, and mistakes to avoid, when working with Natural PDF. Each pattern lists its expected return type.
1. Load PDF and Extract Text
Use case: Open a PDF file and extract all text from a page.
from natural_pdf import PDF
pdf = PDF("document.pdf")
page = pdf.pages[0]
text = page.extract_text()
Returns: str - The extracted text content from the page.
2. Find Element Containing Text
Use case: Locate the first element that contains specific text.
from natural_pdf import PDF
pdf = PDF("document.pdf")
page = pdf.pages[0]
element = page.find('text:contains("Invoice")')
Returns: Element | None - The first matching element, or None if not found.
3. Find All Matching Elements
Use case: Get all elements matching a selector.
from natural_pdf import PDF
pdf = PDF("document.pdf")
page = pdf.pages[0]
elements = page.find_all('text:bold')
Returns: ElementCollection - A collection of all matching elements (may be empty).
4. Navigate Below an Element
Use case: Create a region below a found element to extract content.
from natural_pdf import PDF
pdf = PDF("document.pdf")
page = pdf.pages[0]
header = page.find('text:contains("Summary")')
region = header.below(height=200)
content = region.extract_text()
Returns: Region - A rectangular region below the element.
5. Navigate Right of Element
Use case: Find the value next to a label.
from natural_pdf import PDF
pdf = PDF("document.pdf")
page = pdf.pages[0]
label = page.find('text:contains("Total:")')
value_region = label.right(width=100)
value = value_region.extract_text()
Returns: Region - A rectangular region to the right of the element.
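The label-then-value pattern extends naturally to several fields. The sketch below uses a hypothetical helper, `extract_fields`, which is not part of Natural PDF. It assumes only the `find()`, `right(width=...)`, and `extract_text()` calls shown above, and records None for labels that are not on the page.

```python
def extract_fields(page, labels, width=100):
    """Map each label to the text found to its right.

    Hypothetical helper: `page` is assumed to behave like a Natural PDF
    page, i.e. find() returns an element or None.
    """
    fields = {}
    for label in labels:
        element = page.find(f'text:contains("{label}")')
        if element is None:
            # Label not on this page: record None instead of crashing.
            fields[label] = None
            continue
        fields[label] = element.right(width=width).extract_text().strip()
    return fields
```

With a real page this might be called as `extract_fields(pdf.pages[0], ["Total:", "Date:"])`. The right `width` depends on the document layout.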
6. Extract Table from Page
Use case: Extract tabular data from a page.
from natural_pdf import PDF
pdf = PDF("document.pdf")
page = pdf.pages[0]
table = page.extract_table()
Returns: TableResult - A sequence of rows (list of lists) with a .to_df() method for pandas conversion.
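Because the result is described as a sequence of rows, it can also be consumed without pandas. This is a minimal sketch, assuming the first row holds the column headers. `rows_to_records` is a hypothetical helper, and the sample data is made up for illustration.

```python
def rows_to_records(rows):
    """Turn a sequence of rows into a list of dicts keyed by the first row."""
    rows = list(rows)
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


# Illustrative data standing in for the result of page.extract_table()
sample = [["Item", "Qty"], ["Widget", "2"], ["Gadget", "5"]]
records = rows_to_records(sample)
```

Rows shorter than the header simply produce dicts with fewer keys, since `zip` stops at the shorter sequence.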
6a. Extract ALL Tables from a Page
Use case: Find and extract every table on a page using layout analysis.
from natural_pdf import PDF
pdf = PDF("document.pdf")
page = pdf.pages[0]
# Detect tables using layout analysis
page.analyze_layout(engine='tatr')
# Find all detected table regions
table_regions = page.find_all('region[type=table]')
print(f"Found {len(table_regions)} tables")
# Extract each table as a DataFrame
dataframes = []
for i, table_region in enumerate(table_regions):
    table = table_region.extract_table()
    df = table.to_df(header="first")
    dataframes.append(df)
    print(f"Table {i+1}: {len(df)} rows, {len(df.columns)} columns")
Returns: List of pandas.DataFrame objects, one per table.
Note: The tatr (Table Transformer) engine is recommended for table detection. Alternatives include yolo and paddle.
6b. Extract All Tables from a Page (Shortcut)
Use case: Extract every table on a page without manual layout analysis.
from natural_pdf import PDF
pdf = PDF("document.pdf")
page = pdf.pages[0]
tables = page.extract_tables()
for i, table in enumerate(tables):
    df = table.to_df(header="first")
    print(f"Table {i+1}: {len(df)} rows")
Returns: List[TableResult] - All tables found on the page.
6c. Convert Page to Markdown with VLM
Use case: Get a structured markdown representation of a page using a Vision Language Model.
from natural_pdf import PDF, set_default_client
from openai import OpenAI
# Configure a default VLM client
set_default_client(OpenAI(), model="gpt-4o")
pdf = PDF("document.pdf")
page = pdf.pages[0]
md = page.to_markdown()
Returns: str - Markdown representation of the page. Falls back to extract_text() when no model is configured.
6d. Semantic Search Across Pages
Use case: Find the most relevant pages for a query using semantic similarity.
from natural_pdf import PDF
pdf = PDF("document.pdf")
results = pdf.search("payment terms and conditions", top_k=3)
for page in results:
    print(f"Page {page.number}: {page.extract_text()[:100]}...")
Returns: PageCollection - The top-k most relevant pages.
Note: Requires torch and transformers (pip install torch transformers).
7. Apply OCR
Use case: Run OCR on a scanned document to make it searchable.
from natural_pdf import PDF
pdf = PDF("scanned.pdf")
page = pdf.pages[0]
ocr_elements = page.apply_ocr(engine='easyocr', languages=['en'])
text = page.extract_text()
Returns: ElementCollection - The newly created OCR text elements.
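OCR can be slow, so one common approach is to run it only when a page has no text layer. The sketch below assumes that an empty `extract_text()` result means the page is scanned. `text_with_ocr_fallback` is a hypothetical helper that uses only the calls shown above.

```python
def text_with_ocr_fallback(page, engine="easyocr", languages=("en",)):
    """Return page text, running OCR only when no native text is found."""
    text = page.extract_text()
    if text and text.strip():
        return text
    # No native text: assume a scanned page and apply OCR before extracting.
    page.apply_ocr(engine=engine, languages=list(languages))
    return page.extract_text()
```

Pages that mix native text and scanned images will not trigger the fallback, so this check is only a heuristic.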
8. Analyze Layout
Use case: Detect document structure like tables, figures, and headings.
from natural_pdf import PDF
pdf = PDF("document.pdf")
page = pdf.pages[0]
regions = page.analyze_layout(engine='yolo')
tables = page.find_all('region[type=table]')
Returns: ElementCollection - Collection of detected layout regions.
9. Chain Find and Extract
Use case: Find an element and extract text from the region below it in one chain.
from natural_pdf import PDF
pdf = PDF("document.pdf")
page = pdf.pages[0]
content = page.find('text:contains("Description")').below(height=100).extract_text()
Returns: str - The extracted text from the chained operations.
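The one-line chain raises an AttributeError when `find()` returns None, because None has no `below()` method. A defensive variant can be sketched as a hypothetical helper, `text_below`, which is not part of Natural PDF:

```python
def text_below(page, selector, height=100, default=""):
    """Extract text below the first match, or return `default` if no match."""
    element = page.find(selector)
    if element is None:
        return default
    return element.below(height=height).extract_text()
```

For example, `text_below(page, 'text:contains("Description")')` returns an empty string instead of crashing when the heading is absent.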
10. Filter by Attribute
Use case: Find elements with specific attributes.
from natural_pdf import PDF
pdf = PDF("document.pdf")
page = pdf.pages[0]
large_text = page.find_all('text[size>=14]')
bold_headers = page.find_all('text:bold[size>=12]')
Returns: ElementCollection - Elements matching the attribute filter.
10a. Combined Selectors (3-Part)
Use case: Find elements matching multiple criteria at once.
from natural_pdf import PDF
pdf = PDF("document.pdf")
page = pdf.pages[0]
# Combine element type + pseudo-class + attribute + contains
# Conventional order (not required): type → pseudo-classes → attributes → :contains()
summary_headers = page.find_all('text:bold[size>14]:contains("Summary")')
# Other combined examples
important_notes = page.find('text:italic[size>=12]:contains("Note")')
section_titles = page.find_all('text:bold[fontname*=Arial][size>=16]')
Returns: ElementCollection for find_all(), Element | None for find().
Selector ordering: Pseudo-classes and attributes can appear in any order after the type — text:bold[size>14]:contains("X") and text:contains("X"):bold[size>14] are equivalent.
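When selectors are assembled from variables, a small builder keeps the pieces in a consistent order. `build_selector` below is a hypothetical helper, not part of Natural PDF. It only concatenates strings, in the type, pseudo-class, attribute, `:contains()` order used in the examples above.

```python
def build_selector(element_type="text", pseudo=(), attrs=(), contains=None):
    """Compose a selector string such as text:bold[size>14]:contains("X")."""
    parts = [element_type]
    parts += [f":{name}" for name in pseudo]
    parts += [f"[{expr}]" for expr in attrs]
    if contains is not None:
        if '"' in contains:
            # Quote-escaping rules are not covered here, so refuse rather than guess.
            raise ValueError("contains text must not include double quotes")
        parts.append(f':contains("{contains}")')
    return "".join(parts)


selector = build_selector(pseudo=["bold"], attrs=["size>14"], contains="Summary")
```

The resulting string can be passed to `page.find()` or `page.find_all()` like any hand-written selector.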
11. Create a Region from Coordinates
Use case: Define a specific rectangular area on a page.
from natural_pdf import PDF
pdf = PDF("document.pdf")
page = pdf.pages[0]
region = page.create_region(x0=50, top=100, x1=500, bottom=300)
text = region.extract_text()
Returns: Region - A region with the specified coordinates.
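Coordinate regions are handy for multi-column layouts. The sketch below splits a page into equal-width vertical strips using the `create_region()` keywords shown above together with `page.width` and `page.height`. `column_regions` is a hypothetical helper, and equal-width columns are an assumption that will not fit every document.

```python
def column_regions(page, columns=2):
    """Return one full-height region per equal-width column, left to right."""
    if columns < 1:
        raise ValueError("columns must be at least 1")
    col_width = page.width / columns
    return [
        page.create_region(
            x0=i * col_width, top=0, x1=(i + 1) * col_width, bottom=page.height
        )
        for i in range(columns)
    ]
```

Reading a two-column page in order then becomes `[r.extract_text() for r in column_regions(page)]`.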
12. Add Exclusion Zone
Use case: Exclude headers or footers from text extraction.
from natural_pdf import PDF
pdf = PDF("document.pdf")
page = pdf.pages[0]
# Exclude top 50 points as header
header_region = page.create_region(0, 0, page.width, 50)
page.add_exclusion(header_region)
Returns: None - Exclusion is added to the page.
13. Extract with Exclusions
Use case: Extract text while respecting exclusion zones.
from natural_pdf import PDF
pdf = PDF("document.pdf")
page = pdf.pages[0]
page.add_exclusion(page.create_region(0, 0, page.width, 50)) # Exclude header
text = page.extract_text() # Exclusions applied by default
Returns: str - Text with excluded regions omitted.
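Headers and footers are often excluded together. The sketch below wraps the two calls in a hypothetical helper, `exclude_margins`. It assumes the vertical coordinate starts at 0 at the top of the page and grows downward, which matches the header example above where the top 50 points are `(0, 0, page.width, 50)`.

```python
def exclude_margins(page, header=50, footer=50):
    """Exclude a band at the top and at the bottom of the page, in points."""
    if header > 0:
        page.add_exclusion(page.create_region(0, 0, page.width, header))
    if footer > 0:
        # The footer band ends at the bottom edge of the page.
        page.add_exclusion(
            page.create_region(0, page.height - footer, page.width, page.height)
        )
```

After calling `exclude_margins(page)`, a plain `page.extract_text()` skips both bands, since exclusions are applied by default.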
14. Get Page Dimensions
Use case: Access page width and height for calculations.
from natural_pdf import PDF
pdf = PDF("document.pdf")
page = pdf.pages[0]
width = page.width
height = page.height
Returns: float - Page dimension in points (1 point = 1/72 inch).
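Since dimensions are in points, converting to inches is a division by 72. A small worked example follows; the helper names are illustrative, and the 612 x 792 values are the standard US Letter size rather than output from any particular PDF.

```python
POINTS_PER_INCH = 72


def points_to_inches(points):
    """Convert a length in PDF points to inches."""
    return points / POINTS_PER_INCH


def inches_to_points(inches):
    """Convert a length in inches to PDF points."""
    return inches * POINTS_PER_INCH


# US Letter is 612 x 792 points, i.e. 8.5 x 11 inches
letter_inches = (points_to_inches(612), points_to_inches(792))
```

This makes it easy to express margins in familiar units, for example `inches_to_points(0.75)` for a three-quarter-inch header band.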
15. Iterate Over Pages
Use case: Process all pages in a PDF.
from natural_pdf import PDF
pdf = PDF("document.pdf")
for page in pdf.pages:
    text = page.extract_text()
    print(f"Page {page.number}: {len(text)} characters")
Returns: Each iteration yields a Page object.
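Iteration combines well with `find()` to locate the pages that mention a term. The sketch below uses a hypothetical helper, `pages_containing`. It relies only on `pdf.pages`, `page.find()`, and `page.number` as used above.

```python
def pages_containing(pdf, text):
    """Return the page numbers of all pages with an element containing `text`."""
    selector = f'text:contains("{text}")'
    return [page.number for page in pdf.pages if page.find(selector) is not None]
```

The helper does not escape double quotes in `text`, so it is only suitable for plain search terms.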
16. Extract Text with Layout Preservation
Use case: Maintain spatial positioning of text.
from natural_pdf import PDF
pdf = PDF("document.pdf")
page = pdf.pages[0]
text = page.extract_text(layout=True)
Returns: str - Text with whitespace preserving original layout.
17. Find Using Regex
Use case: Search for patterns like invoice numbers or dates.
from natural_pdf import PDF
pdf = PDF("document.pdf")
page = pdf.pages[0]
invoice_num = page.find('text:contains("INV-\\d+")', regex=True)
dates = page.find_all('text:contains("\\d{2}/\\d{2}/\\d{4}")', regex=True)
Returns: Element | None for find(), ElementCollection for find_all().
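Doubling every backslash is easy to get wrong. Python raw strings produce the same selector text with single backslashes, as the comparison below shows. The `regex_selector` helper is hypothetical. Validating the pattern with Python's `re` module assumes the library interprets patterns in the same dialect, which is not confirmed here.

```python
import re

# Both literals produce the identical selector string.
escaped = 'text:contains("INV-\\d+")'
raw = r'text:contains("INV-\d+")'
same = escaped == raw


def regex_selector(pattern):
    """Build a regex :contains() selector, rejecting malformed patterns early."""
    re.compile(pattern)  # raises re.error if the pattern is malformed
    return f'text:contains("{pattern}")'


date_selector = regex_selector(r"\d{2}/\d{2}/\d{4}")
```

The result is used as before, for example `page.find_all(date_selector, regex=True)`.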
18. Case-Insensitive Search
Use case: Find text regardless of case.
from natural_pdf import PDF
pdf = PDF("document.pdf")
page = pdf.pages[0]
element = page.find('text:contains("total")', case=False)
Returns: Element | None - First element containing "total", "TOTAL", "Total", etc.
19. Extract Table as DataFrame
Use case: Get table data directly as a pandas DataFrame.
from natural_pdf import PDF
pdf = PDF("document.pdf")
page = pdf.pages[0]
table = page.extract_table()
df = table.to_df(header="first") # Use first row as column headers
Returns: pandas.DataFrame - Table data with proper column headers.
20. Visualize Elements
Use case: Debug by highlighting found elements on the page.
from natural_pdf import PDF
pdf = PDF("document.pdf")
page = pdf.pages[0]
elements = page.find_all('text:bold')
image = elements.show(color="red", label="Bold Text")
Returns: PIL.Image.Image - Page image with highlighted elements.
Quick Reference Table
| Pattern | Method | Returns |
|---|---|---|
| Load PDF | PDF("file.pdf") | PDF |
| Get page | pdf.pages[0] | Page |
| Extract text | page.extract_text() | str |
| Find one | page.find(selector) | Element \| None |
| Find all | page.find_all(selector) | ElementCollection |
| Navigate below | element.below() | Region |
| Navigate right | element.right() | Region |
| Navigate above | element.above() | Region |
| Navigate left | element.left() | Region |
| Extract table | page.extract_table() | TableResult |
| Extract all tables | page.extract_tables() | List[TableResult] |
| Page to markdown | page.to_markdown() | str |
| Semantic search | pdf.search(query) | PageCollection |
| Apply OCR | page.apply_ocr() | ElementCollection |
| Analyze layout | page.analyze_layout() | ElementCollection |
| Create region | page.create_region(...) | Region |
| Add exclusion | page.add_exclusion(...) | None |
| Show elements | elements.show() | PIL.Image.Image |
| Table to DataFrame | table.to_df() | pandas.DataFrame |
Common Mistakes to Avoid
These are frequent errors when working with Natural PDF.
Wrong Method Names
# WRONG - these methods don't exist
page.get_text() # Use: page.extract_text()
page.search("term") # Use: page.find('text:contains("term")')
PDF.open("file.pdf") # Use: PDF("file.pdf")
page.apply_layout() # Use: page.analyze_layout()
pdf[0] # Use: pdf.pages[0]
Wrong Selector Syntax
# WRONG
page.find('text.bold') # Use colon: 'text:bold'
page.find('text[contains="X"]') # contains is a pseudo-class: 'text:contains("X")'
page.find('text(size>12)') # Use brackets: 'text[size>12]'
page.find('text:contains(Invoice)') # Need quotes: 'text:contains("Invoice")'
Not Handling None
# WRONG - will crash if element not found
text = page.find('text:contains("Missing")').extract_text()
# CORRECT - always check for None
element = page.find('text:contains("Missing")')
if element:
    text = element.extract_text()
Wrong Parameter Names
# WRONG
page.find('text:contains("X")', case_sensitive=False) # Use: case=False
page.apply_ocr(engine="easy_ocr") # Use: engine="easyocr"
page.apply_ocr(engine="paddle_ocr") # Use: engine="paddle"
Not Closing PDFs in Loops
# WRONG - memory leak
for path in pdf_paths:
    pdf = PDF(path)
    # process...
    # PDF never closed!
# CORRECT - use try/finally
for path in pdf_paths:
    pdf = PDF(path)
    try:
        pass  # process...
    finally:
        pdf.close()
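The same guarantee can be written with the standard library's `contextlib.closing`, which calls `close()` on exit even if processing raises. The sketch below takes the constructor as a parameter, hypothetically named `open_pdf`, so it stays independent of the import. In practice you would pass `PDF`. Whether `PDF` also supports a `with` statement directly is not covered here.

```python
from contextlib import closing


def first_page_texts(paths, open_pdf):
    """Extract first-page text from each file, closing every PDF afterwards."""
    texts = {}
    for path in paths:
        # closing() calls pdf.close() when the block exits, error or not.
        with closing(open_pdf(path)) as pdf:
            texts[path] = pdf.pages[0].extract_text()
    return texts
```

A typical call would be `first_page_texts(pdf_paths, PDF)`.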
Treating find_all() as a Plain List
# AVOID - works, but verbose
elements = page.find_all('text:bold')
first = elements[0] if len(elements) > 0 else None
# CORRECT - use ElementCollection methods
elements = page.find_all('text:bold')
first = elements.first # Returns None if empty
Summary of Corrections
| Wrong | Correct | Issue |
|---|---|---|
| page.get_text() | page.extract_text() | Wrong method name |
| page.search("X") | page.find('text:contains("X")') | Wrong method name |
| PDF.open("file") | PDF("file") | Direct instantiation |
| 'text.bold' | 'text:bold' | Colon for pseudo-classes |
| case_sensitive=False | case=False | Wrong parameter name |
| engine="easy_ocr" | engine="easyocr" | No underscores |
| apply_layout() | analyze_layout() | Wrong method name |
| pdf[0] | pdf.pages[0] | Access via .pages |