# Natural PDF
A friendly library for working with PDFs, built on top of pdfplumber.
Natural PDF lets you find and extract content from PDFs using simple code that makes sense.
## Installation

```bash
pip install natural_pdf

# All the extras
pip install "natural_pdf[all]"
```
## Quick Example

```python
from natural_pdf import PDF

pdf = PDF('https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf')
page = pdf.pages[0]

# Find the title and get content below it
title = page.find('text:contains("Summary"):bold')
content = title.below().extract_text()

# Exclude everything above 'CONFIDENTIAL' and below the last line on the page
page.add_exclusion(page.find('text:contains("CONFIDENTIAL")').above())
page.add_exclusion(page.find_all('line')[-1].below())

# Get the clean text without header/footer
clean_text = page.extract_text()
```
## Getting Started
New to Natural PDF?
- Installation - Get Natural PDF installed and run your first extraction
- Quick Reference - Essential commands and patterns in one place
- Tutorial Series - Step-by-step learning path through all features
## Learning the Basics
Follow the tutorial series to learn Natural PDF systematically:
- Loading and Basic Text Extraction
- Finding Specific Elements
- Extracting Content Blocks
- Table Extraction
- Excluding Unwanted Content
- Document Question Answering
- Layout Analysis
- Spatial Navigation
- Section Extraction
- Form Field Extraction
- Enhanced Table Processing
- OCR Integration
- Semantic Search
- Categorizing Documents
## Solving Specific Problems

### Text Extraction Issues
- Extract Clean Text Without Headers and Footers - Remove repeated content that's cluttering your text extraction
- Getting Text from Scanned Documents - Use OCR to extract text from image-based PDFs
### Table Problems
- Fix Messy Table Extraction - Handle tables with no borders, merged cells, or poor alignment
- Getting Tables Out of PDFs - Basic to advanced table extraction techniques
### Data Extraction
- Extract Data from Forms and Invoices - Pull structured information from standardized documents
- Pulling Structured Data from PDFs - Use AI to extract specific fields from any document
### Document Analysis
- Ask Questions to Your Documents - Use natural language to find information
- Categorizing Pages and Regions - Automatically classify document types and content
### Finding Content
- Finding What You Need in PDFs - Master selectors to locate any element
- PDF Navigation - Move around documents and work with multiple pages
### Layout and Structure
- Document Layout Analysis - Automatically detect titles, tables, and document structure
- Working with Regions - Define and work with specific areas of pages
- Visual Debugging - See what you're extracting and debug selector issues
## Key Features

### Find Elements with Selectors
Use CSS-like selectors to find text, shapes, and more.
```python
# Find bold text containing "Revenue"
page.find('text:contains("Revenue"):bold').extract_text()

# Find all large text
page.find_all('text[size>=12]').extract_text()
```
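Under the hood, a selector like `text:contains("Revenue"):bold` or `text[size>=12]` boils down to filtering elements by their attributes. The sketch below illustrates that idea in plain Python with made-up element dicts; it is not natural_pdf's actual selector engine.

```python
# Illustrative sketch of attribute-based element filtering -- NOT
# natural_pdf's real selector engine, just the idea behind it.
elements = [
    {"text": "Quarterly Revenue", "size": 14, "bold": True},
    {"text": "Revenue grew 12% year over year.", "size": 10, "bold": False},
    {"text": "Appendix", "size": 12, "bold": True},
]

def find_all(elements, contains=None, bold=None, min_size=None):
    """Filter elements the way 'text:contains(...):bold' or 'text[size>=12]' would."""
    results = []
    for el in elements:
        if contains is not None and contains not in el["text"]:
            continue
        if bold is not None and el["bold"] != bold:
            continue
        if min_size is not None and el["size"] < min_size:
            continue
        results.append(el)
    return results

# 'text:contains("Revenue"):bold' -> only the bold heading matches
print([el["text"] for el in find_all(elements, contains="Revenue", bold=True)])

# 'text[size>=12]' -> the heading and "Appendix"
print([el["text"] for el in find_all(elements, min_size=12)])
```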
### Navigate Spatially
Move around the page relative to elements, not just coordinates.
```python
# Extract text below a specific heading
intro_text = page.find('text:contains("Introduction")').below().extract_text()

# Extract text from one heading to the next
methods_text = page.find('text:contains("Methods")').below(
    until='text:contains("Results")'
).extract_text()
```
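Geometrically, "below" is just a comparison on bounding boxes: keep elements whose top edge sits under the anchor's bottom edge, and stop once the `until` match appears. A rough sketch of that geometry, using toy coordinates rather than natural_pdf's implementation:

```python
# Rough sketch of the geometry behind .below(until=...) -- illustrative
# only, not natural_pdf's code. y increases down the page, and each
# element records its top and bottom edge.
elements = [
    {"text": "Methods",           "top": 100, "bottom": 115},
    {"text": "We used a survey.", "top": 120, "bottom": 132},
    {"text": "N = 500.",          "top": 136, "bottom": 148},
    {"text": "Results",           "top": 160, "bottom": 175},
    {"text": "Revenue rose.",     "top": 180, "bottom": 192},
]

def below(anchor, elements, until=None):
    """Elements whose top edge is below the anchor, stopping before `until`."""
    picked = []
    for el in sorted(elements, key=lambda e: e["top"]):
        if el["top"] < anchor["bottom"]:
            continue  # above or overlapping the anchor itself
        if until is not None and until in el["text"]:
            break     # stop at the boundary element
        picked.append(el)
    return picked

methods = elements[0]
section = below(methods, elements, until="Results")
print(" ".join(el["text"] for el in section))
# -> We used a survey. N = 500.
```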
### Extract Clean Text

Extract text content with a single call; if exclusions are set, common page elements like headers and footers are skipped automatically.
```python
# Extract all text from the page (respecting exclusions)
page_text = page.extract_text()

# Extract text from a specific region
some_region = page.find(...)
region_text = some_region.extract_text()
```
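Conceptually, an exclusion is a region whose elements are dropped before the text is assembled. A toy version of the idea, with invented header/footer bands standing in for the regions `add_exclusion` would create:

```python
# Toy illustration of exclusion zones -- not natural_pdf's implementation.
# An exclusion here is a vertical band (top, bottom); any element whose
# box falls fully inside a band is dropped before assembling the text.
elements = [
    {"text": "ACME Corp - CONFIDENTIAL", "top": 10,  "bottom": 24},   # header
    {"text": "Revenue grew 12%.",        "top": 100, "bottom": 114},  # body
    {"text": "Page 3 of 10",             "top": 760, "bottom": 772},  # footer
]
exclusions = [(0, 30), (750, 800)]  # header band, footer band

def extract_text(elements, exclusions):
    """Join the text of every element outside all exclusion bands."""
    kept = [
        el for el in elements
        if not any(band_top <= el["top"] and el["bottom"] <= band_bottom
                   for band_top, band_bottom in exclusions)
    ]
    return "\n".join(el["text"] for el in kept)

print(extract_text(elements, exclusions))
# -> Revenue grew 12%.
```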
### Apply OCR
Extract text from scanned documents using various OCR engines.
```python
# Apply OCR using the default engine
ocr_elements = page.apply_ocr()

# Extract text (will use OCR results if available)
text = page.extract_text()
```
### Analyze Document Layout
Use AI models to detect document structures like titles, paragraphs, and tables.
```python
# Detect document structure
page.analyze_layout()

# Highlight titles and tables
page.find_all('region[type=title]').highlight(color="purple")
page.find_all('region[type=table]').highlight(color="blue")

# Extract data from the first table
table_data = page.find('region[type=table]').extract_table()
```
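For borderless tables, one common approach is to cluster word boxes by their coordinates: words sharing a baseline become a row, and left-to-right position gives the column order. The simplified sketch below shows that general idea with made-up word boxes; it is not natural_pdf's exact algorithm.

```python
# Simplified sketch of coordinate-based table reconstruction -- the
# general idea behind extracting borderless tables, not natural_pdf's
# exact code. Each word has a left edge (x0) and a top coordinate.
words = [
    {"text": "Item",  "x0": 50,  "top": 100},
    {"text": "Price", "x0": 200, "top": 100},
    {"text": "Apple", "x0": 50,  "top": 120},
    {"text": "1.20",  "x0": 200, "top": 121},  # slightly off-baseline
    {"text": "Pear",  "x0": 50,  "top": 140},
    {"text": "0.95",  "x0": 200, "top": 140},
]

def extract_table(words, row_tolerance=3):
    """Group words into rows by top coordinate, then order each row by x0."""
    words = sorted(words, key=lambda w: (w["top"], w["x0"]))
    rows, current, current_top = [], [], None
    for w in words:
        if current and w["top"] - current_top > row_tolerance:
            rows.append(current)  # gap too large: start a new row
            current = []
        if not current:
            current_top = w["top"]
        current.append(w)
    if current:
        rows.append(current)
    return [[w["text"] for w in sorted(row, key=lambda w: w["x0"])] for row in rows]

print(extract_table(words))
# -> [['Item', 'Price'], ['Apple', '1.20'], ['Pear', '0.95']]
```

The `row_tolerance` knob absorbs the small vertical jitter real PDFs have between cells on the same visual line.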
### Document Question Answering
Ask natural language questions directly to your documents.
```python
# Ask a question
result = page.ask("What was the company's revenue in 2022?")
if result.found:
    print(f"Answer: {result.answer}")
    result.show()  # Highlight where the answer was found
```
### Classify Pages and Regions
Categorize pages or specific regions based on their content using text or vision models.
```python
# Classify a page based on text
labels = ["invoice", "scientific article", "presentation"]
page.classify(labels, using="text")
print(f"Page Category: {page.category} (Confidence: {page.category_confidence:.2f})")

# Classify a page based on what it looks like
page.classify(labels, using="vision")
print(f"Page Category: {page.category} (Confidence: {page.category_confidence:.2f})")
```
### Visualize Your Work
Debug and understand your extractions visually.
```python
# Highlight headings
page.find_all('text[size>=14]').show(color="red", label="Headings")

# Launch the interactive viewer (Jupyter)
page.viewer()
```
## Reference Documentation
- Quick Reference - Cheat sheet of essential commands and patterns
- API Reference - Complete library reference
## Understanding Natural PDF
Coming soon: Conceptual guides explaining how Natural PDF thinks about PDFs and when to use different approaches.