# Natural PDF

A friendly library for working with PDFs, built on top of pdfplumber.

Natural PDF lets you find and extract content from PDFs using simple code that makes sense.
## Installation

```bash
pip install natural-pdf

# All the extras
pip install "natural-pdf[all]"
```
## Quick Example

```python
from natural_pdf import PDF

pdf = PDF('document.pdf')
page = pdf.pages[0]

# Find the title and get the content below it
title = page.find('text:contains("Summary"):bold')
content = title.below().extract_text()

# Exclude everything above 'CONFIDENTIAL' and below the last line on the page
page.add_exclusion(page.find('text:contains("CONFIDENTIAL")').above())
page.add_exclusion(page.find_all('line')[-1].below())

# Get the clean text, without the header/footer
clean_text = page.extract_text()
```
## Key Features

Here are a few highlights of what you can do:
### Find Elements with Selectors

Use CSS-like selectors to find text, shapes, and more.

```python
# Find bold text containing "Revenue"
page.find('text:contains("Revenue"):bold').extract_text()

# Find all large text
page.find_all('text[size>=12]').extract_text()
```
### Navigate Spatially

Move around the page relative to elements, not just coordinates.

```python
# Extract text below a specific heading
intro_text = page.find('text:contains("Introduction")').below().extract_text()

# Extract text from one heading to the next
methods_text = page.find('text:contains("Methods")').below(
    until='text:contains("Results")'
).extract_text()
```

Explore more navigation methods →
### Extract Clean Text

Easily extract text content, automatically handling common page elements like headers and footers (if exclusions are set).

```python
# Extract all text from the page (respecting exclusions)
page_text = page.extract_text()

# Extract text from a specific region
some_region = page.find(...)
region_text = some_region.extract_text()
```

Learn about text extraction → Learn about exclusion zones →
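Extracted text often carries uneven spacing and blank lines from the PDF layout. A minimal post-processing sketch in plain Python (`tidy_text` is a hypothetical helper, not part of natural-pdf; it works on whatever string `extract_text()` returns):

```python
import re

def tidy_text(raw: str) -> str:
    """Collapse runs of spaces/tabs and drop blank lines from extracted text.
    Hypothetical helper, not part of natural-pdf."""
    # Normalize whitespace on each line, then discard lines that end up empty
    lines = [re.sub(r'[ \t]+', ' ', line).strip() for line in raw.splitlines()]
    return '\n'.join(line for line in lines if line)

# Stand-in for extract_text() output
messy = "  Annual   Report \n\n\n  Revenue grew   12%  \n"
print(tidy_text(messy))  # "Annual Report\nRevenue grew 12%"
```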
### Apply OCR

Extract text from scanned documents using various OCR engines.

```python
# Apply OCR using the default engine
ocr_elements = page.apply_ocr()

# Extract text (will use OCR results if available)
text = page.extract_text()
```
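OCR output is noisy, so filtering by recognition confidence is a common follow-up. A plain-Python sketch, assuming each OCR element carries a text string and a confidence score (the `OcrWord` class and attribute names here are stand-ins, not natural-pdf's actual element type):

```python
from dataclasses import dataclass

@dataclass
class OcrWord:
    # Stand-in for an OCR text element; natural-pdf's real element
    # type and attribute names may differ.
    text: str
    confidence: float

def confident_text(words, min_conf=0.8):
    """Join the text of OCR words at or above min_conf (hypothetical helper)."""
    return ' '.join(w.text for w in words if w.confidence >= min_conf)

words = [OcrWord("Total:", 0.97), OcrWord("$1,250", 0.91), OcrWord("~?", 0.35)]
print(confident_text(words))  # "Total: $1,250"
```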
### Analyze Document Layout

Use AI models to detect document structures like titles, paragraphs, and tables.

```python
# Detect document structure
page.analyze_layout()

# Highlight titles and tables
page.find_all('region[type=title]').highlight(color="purple")
page.find_all('region[type=table]').highlight(color="blue")

# Extract data from the first table
table_data = page.find('region[type=table]').extract_table()
```

Learn about layout models → Working with tables? →
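Once you have table data, you often want it keyed by column name. A sketch in plain Python, assuming `extract_table()` returns a list of row lists with the header row first (that shape is an assumption here, not something this README states):

```python
def rows_as_dicts(table_data):
    """Turn a header-first list of row lists into a list of dicts.
    Assumes the first row is the header (not guaranteed by natural-pdf)."""
    header, *rows = table_data
    return [dict(zip(header, row)) for row in rows]

# Stand-in for extract_table() output
table_data = [
    ["Year", "Revenue"],
    ["2021", "$10M"],
    ["2022", "$12M"],
]
records = rows_as_dicts(table_data)
print(records[1]["Revenue"])  # "$12M"
```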
### Document Question Answering

Ask natural language questions directly to your documents.

```python
# Ask a question
result = pdf.ask("What was the company's revenue in 2022?")
if result.get("found", False):
    print(f"Answer: {result['answer']}")
```
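If you ask several questions, collecting the answers into one dict keeps things tidy. A plain-Python sketch over `ask()`-style result dicts; only the `found` and `answer` keys appear in this README, so the `confidence` key is a guess at a typical extractive-QA payload:

```python
def summarize_answers(results, min_confidence=0.5):
    """Collect question -> answer pairs from ask()-style result dicts.
    The 'confidence' key is assumed, not documented in this README."""
    answers = {}
    for question, result in results.items():
        if result.get("found") and result.get("confidence", 1.0) >= min_confidence:
            answers[question] = result["answer"]
    return answers

# Stand-in for results gathered from repeated pdf.ask(...) calls
results = {
    "What was revenue in 2022?": {"found": True, "answer": "$12M", "confidence": 0.92},
    "Who audited the report?": {"found": False},
}
print(summarize_answers(results))  # {'What was revenue in 2022?': '$12M'}
```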
### Classify Pages and Regions

Categorize pages or specific regions based on their content using text or vision models.

Note: Requires `pip install "natural-pdf[classification]"`

```python
labels = ["invoice", "scientific article", "presentation"]

# Classify a page based on its text
page.classify(labels, using="text")
print(f"Page Category: {page.category} (Confidence: {page.category_confidence:.2f})")

# Classify a page based on what it looks like
page.classify(labels, using="vision")
print(f"Page Category: {page.category} (Confidence: {page.category_confidence:.2f})")
```
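For a whole document, you may want a single label for the PDF rather than one per page. A plain-Python sketch that aggregates `(category, confidence)` pairs you have read off each page via `page.category` and `page.category_confidence`:

```python
from collections import Counter

def dominant_category(page_labels, min_confidence=0.5):
    """Return the most common category among pages classified above a
    confidence floor, or None if nothing qualifies. Plain-Python helper,
    not part of natural-pdf."""
    kept = [cat for cat, conf in page_labels if conf >= min_confidence]
    if not kept:
        return None
    return Counter(kept).most_common(1)[0][0]

# Stand-in for (page.category, page.category_confidence) across pages
pairs = [("invoice", 0.91), ("invoice", 0.84), ("presentation", 0.42)]
print(dominant_category(pairs))  # "invoice"
```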
### Visualize Your Work

Debug and understand your extractions visually.

```python
# Highlight headings
page.find_all('text[size>=14]').highlight(color="red", label="Headings")

# Launch the interactive viewer (Jupyter)
# Requires: pip install "natural-pdf[interactive]"
page.viewer()

# Or save an image
# page.save_image("highlighted.png")
```

See more visualization options →
## Documentation Topics

Choose what you want to learn about:

### Task-based Guides

- **Getting Started**: Install the library and run your first extraction
- **PDF Navigation**: Open PDFs and work with pages
- **Element Selection**: Find text and other elements using selectors
- **Text Extraction**: Extract clean text from documents
- **Regions**: Work with specific areas of a page
- **Visual Debugging**: See what you're extracting
- **OCR**: Extract text from scanned documents
- **Layout Analysis**: Detect document structure
- **Tables**: Extract tabular data
- **Document QA**: Ask questions to your documents

### Reference

- **API Reference**: Complete library reference