Natural PDF
A Python library for PDF extraction built on pdfplumber. Find and extract content using CSS-like selectors and spatial navigation. Simple code that makes sense.
Installation
pip install natural-pdf
# Or install with all optional extras
pip install "natural-pdf[all]"
Quick Example
from natural_pdf import PDF
pdf = PDF('https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf')
page = pdf.pages[0]
# Find the title and get content below it
title = page.find('text:contains("Summary"):bold')
content = title.below().extract_text()
# Exclude everything above 'CONFIDENTIAL' and everything below the last line on the page
page.add_exclusion(page.find('text:contains("CONFIDENTIAL")').above())
page.add_exclusion(page.find_all('line')[-1].below())
# Get the clean text without header/footer
clean_text = page.extract_text()
Getting Started
New to Natural PDF?
- Choose Your Path - Find the best starting point for your background and goals
- Installation - Get Natural PDF installed and run your first extraction
- Quickstart - Jump in with a hands-on introduction
- Selectors 101 - Learn the selector syntax for finding elements
- Concepts - Understand the core ideas behind Natural PDF
Tutorials
Follow the tutorial series to learn Natural PDF systematically:
- Loading PDFs - Load PDFs and extract basic text
- Finding Elements - Use selectors to locate content
- Spatial Navigation - Navigate relative to elements
- Tables - Extract and process tabular data
- Exclusions - Remove headers, footers, and unwanted content
- OCR - Extract text from scanned documents
- Layout Analysis - Detect document structure automatically
- Regions & Flows - Work with document regions and multi-page flows
- Document QA - Ask questions and extract structured data
- Batch Processing - Process multiple PDFs efficiently
Key Features
Find Elements with Selectors
Use CSS-like selectors to find text, shapes, and more.
# Find bold text containing "Revenue"
page.find('text:contains("Revenue"):bold').extract_text()
# Find all large text
page.find_all('text[size>=12]').extract_text()
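To make the selector syntax more concrete, here is a minimal standard-library sketch of how a selector such as `text[size>=12]` or `text:contains("Revenue"):bold` could be evaluated. It is not Natural PDF's implementation: the `elements` dictionaries, the `SELECTOR` pattern, and the `select` helper are all hypothetical, and the sketch supports only one attribute filter with a case-sensitive `:contains`.

```python
import operator
import re

# Hypothetical stand-ins for page elements; the real objects are richer.
elements = [
    {"type": "text", "text": "Revenue 2022", "size": 14.0, "bold": True},
    {"type": "text", "text": "Revenue grew 4%", "size": 10.0, "bold": False},
    {"type": "line", "text": "", "size": 0.0, "bold": False},
]

# Comparison operators allowed inside [attr<op>value].
OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt,
       "<": operator.lt, "=": operator.eq}

# type, optional [attr op number], optional :contains("..."), optional :bold
SELECTOR = re.compile(
    r'^(?P<type>\w+)'
    r'(?:\[(?P<attr>\w+)(?P<op>>=|<=|>|<|=)(?P<value>[\d.]+)\])?'
    r'(?::contains\("(?P<needle>[^"]*)"\))?'
    r'(?P<bold>:bold)?$'
)

def select(selector, items):
    """Return the items that satisfy every part of the selector."""
    m = SELECTOR.match(selector)
    if m is None:
        raise ValueError(f"unsupported selector: {selector!r}")
    matched = []
    for item in items:
        if item["type"] != m["type"]:
            continue
        if m["attr"] and not OPS[m["op"]](item[m["attr"]], float(m["value"])):
            continue
        if m["needle"] is not None and m["needle"] not in item["text"]:
            continue
        if m["bold"] and not item["bold"]:
            continue
        matched.append(item)
    return matched

bold_revenue = select('text:contains("Revenue"):bold', elements)
large_text = select('text[size>=12]', elements)
```

Each part of the selector narrows the candidate set, which is why combining `:contains(...)` with `:bold` returns only elements that pass both tests.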
Navigate Spatially
Move around the page relative to elements, not just coordinates.
# Extract text below a specific heading
intro_text = page.find('text:contains("Introduction")').below().extract_text()
# Extract text from one heading to the next
methods_text = page.find('text:contains("Methods")').below(
until='text:contains("Results")'
).extract_text()
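Conceptually, `below(until=...)` is arithmetic on bounding boxes. The sketch below illustrates the idea with plain data and is not the library's code: the `Box` class, the `below` function, and the default `page_height` are hypothetical, coordinates are assumed to grow downward from the top of the page, and the sketch stops just before the `until` match (the real method may treat the endpoint differently).

```python
from dataclasses import dataclass

@dataclass
class Box:
    text: str
    top: float      # distance from the top of the page
    bottom: float

def below(anchor, boxes, until=None, page_height=792.0):
    """Return boxes that lie fully between the anchor and the stop point."""
    start = anchor.bottom
    end = page_height
    if until is not None:
        # The first box below the anchor whose text contains `until` ends the region.
        stops = [b.top for b in boxes if until in b.text and b.top >= start]
        if stops:
            end = min(stops)
    return [b for b in boxes if b.top >= start and b.bottom <= end]

boxes = [
    Box("Methods", 100, 112),
    Box("We sampled 40 sites.", 120, 132),
    Box("Each site was measured twice.", 140, 152),
    Box("Results", 160, 172),
    Box("Findings were consistent.", 180, 192),
]

methods = boxes[0]
section = below(methods, boxes, until="Results")
section_text = "\n".join(b.text for b in section)
```

Because the region is defined relative to the heading, the same call keeps working if the heading moves up or down the page between documents.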
Extract Clean Text
Extract text content from a page or region. Headers, footers, and other unwanted page elements are left out once you register them as exclusions.
# Extract all text from the page (respecting exclusions)
page_text = page.extract_text()
# Extract text from a specific region
some_region = page.find(...)
region_text = some_region.extract_text()
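One way to picture what "respecting exclusions" means: words whose boxes overlap an excluded zone are dropped before the text is joined. The sketch below uses hypothetical plain data (`words`, `header_zone`, `footer_zone`) and a hypothetical `extract_text` function; it illustrates the filtering idea only and makes no claim about how Natural PDF implements it.

```python
def overlaps(word, zone):
    """True when the word's vertical span intersects the zone's (top, bottom) band."""
    zone_top, zone_bottom = zone
    return word["top"] < zone_bottom and word["bottom"] > zone_top

def extract_text(words, exclusions=()):
    """Join the words that fall outside every exclusion zone."""
    kept = [w for w in words if not any(overlaps(w, z) for z in exclusions)]
    return " ".join(w["text"] for w in kept)

# Hypothetical words with top/bottom measured from the top of the page.
words = [
    {"text": "CONFIDENTIAL", "top": 20, "bottom": 32},
    {"text": "Quarterly", "top": 100, "bottom": 112},
    {"text": "summary", "top": 100, "bottom": 112},
    {"text": "Page 1", "top": 760, "bottom": 772},
]

header_zone = (0, 50)      # everything near the top of the page
footer_zone = (740, 792)   # everything near the bottom of the page

raw_text = extract_text(words)
clean_text = extract_text(words, exclusions=[header_zone, footer_zone])
```

Registering zones once and filtering at extraction time means every later extraction benefits without repeating the cleanup logic.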
Apply OCR
Extract text from scanned documents using various OCR engines.
# Apply OCR using the default engine
ocr_elements = page.apply_ocr()
# Extract text (will use OCR results if available)
text = page.extract_text()
Analyze Document Layout
Use AI models to detect document structures like titles, paragraphs, and tables.
# Detect document structure
page.analyze_layout()
# Highlight titles and tables
page.find_all('region[type=title]').show()
page.find_all('region[type=table]').show()
# Extract data from the first table
table_data = page.find('region[type=table]').extract_table()
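Once a table has been extracted, a common next step is to save it. The sketch below assumes the extracted table behaves like a list of rows, each a list of cell strings with `None` for empty cells; that shape is an assumption, not something stated here, and `sample_table` and `table_to_csv` are hypothetical names.

```python
import csv
import io

# Hypothetical data in the assumed shape: header row first, one list per row.
sample_table = [
    ["Year", "Revenue"],
    ["2021", "10"],
    ["2022", None],
]

def table_to_csv(rows):
    """Serialize rows to CSV text, writing empty cells as empty strings."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()

csv_text = table_to_csv(sample_table)
```

Writing to an in-memory buffer keeps the helper easy to test; swapping `io.StringIO()` for an open file handle writes the same content to disk.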
Document Question Answering
Ask your documents questions in natural language.
# Ask a question
result = page.ask("What was the company's revenue in 2022?")
print(f"Answer: {result.answer}")
Visualize Your Work
Debug and understand your extractions visually.
# Highlight headings
page.find_all('text[size>=14]').show(color="red", label="Headings")
# Launch the interactive viewer (Jupyter)
page.viewer()
Reference
- Quick Reference - Essential commands and patterns in one place
- API Reference - Complete library documentation
- Patterns & Pitfalls - Common patterns and mistakes to avoid
- Troubleshooting - Solutions to common issues
