Natural PDF

A Python library for PDF extraction built on pdfplumber. Find and extract content using CSS-like selectors and spatial navigation. Simple code that makes sense.

Installation

pip install natural-pdf
# All the extras
pip install "natural-pdf[all]"

Quick Example

from natural_pdf import PDF

pdf = PDF('https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf')
page = pdf.pages[0]

# Find the title and get content below it
title = page.find('text:contains("Summary"):bold')
content = title.below().extract_text()

# Exclude everything above 'CONFIDENTIAL' and below the last line on the page
page.add_exclusion(page.find('text:contains("CONFIDENTIAL")').above())
page.add_exclusion(page.find_all('line')[-1].below())

# Get the clean text without header/footer
clean_text = page.extract_text()

Getting Started

New to Natural PDF?

  • Choose Your Path - Find the best starting point for your background and goals
  • Installation - Get Natural PDF installed and run your first extraction
  • Quickstart - Jump in with a hands-on introduction
  • Selectors 101 - Learn the selector syntax for finding elements
  • Concepts - Understand the core ideas behind Natural PDF

Tutorials

Follow the tutorial series to learn Natural PDF systematically:

  1. Loading PDFs - Load PDFs and extract basic text
  2. Finding Elements - Use selectors to locate content
  3. Spatial Navigation - Navigate relative to elements
  4. Tables - Extract and process tabular data
  5. Exclusions - Remove headers, footers, and unwanted content
  6. OCR - Extract text from scanned documents
  7. Layout Analysis - Detect document structure automatically
  8. Regions & Flows - Work with document regions and multi-page flows
  9. Document QA - Ask questions and extract structured data
  10. Batch Processing - Process multiple PDFs efficiently

Key Features

Find Elements with Selectors

Use CSS-like selectors to find text, shapes, and more.

# Find bold text containing "Revenue"
page.find('text:contains("Revenue"):bold').extract_text()

# Find all large text
page.find_all('text[size>=12]').extract_text()
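
To make the selector semantics concrete, here is a plain-Python sketch of what queries such as 'text:contains("Revenue"):bold' and 'text[size>=12]' express. It is only an illustration: the TextElement class and the find/find_all helpers below are hypothetical stand-ins written for this example, not Natural PDF's internals or API.

```python
from dataclasses import dataclass


@dataclass
class TextElement:
    # Hypothetical stand-in for a text element found on a page
    text: str
    size: float
    bold: bool = False


def find_all(elements, contains=None, min_size=None, bold=None):
    """Return every element that satisfies all of the given conditions."""
    matches = []
    for el in elements:
        if contains is not None and contains not in el.text:
            continue
        if min_size is not None and el.size < min_size:
            continue
        if bold is not None and el.bold != bold:
            continue
        matches.append(el)
    return matches


def find(elements, **conditions):
    """Return the first match, or None when nothing matches."""
    matches = find_all(elements, **conditions)
    return matches[0] if matches else None


# Made-up sample elements for illustration
elements = [
    TextElement("Annual Report", size=18, bold=True),
    TextElement("Revenue", size=12, bold=True),
    TextElement("Revenue grew in every region.", size=10),
]

# Roughly what 'text:contains("Revenue"):bold' asks for
bold_revenue = find(elements, contains="Revenue", bold=True)

# Roughly what 'text[size>=12]' asks for
large_text = find_all(elements, min_size=12)
```

Each pseudo-class or attribute test narrows the candidate set, so chaining them behaves like a logical AND.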

Navigate Spatially

Move around the page relative to elements, not just coordinates.

# Extract text below a specific heading
intro_text = page.find('text:contains("Introduction")').below().extract_text()

# Extract text from one heading to the next
methods_text = page.find('text:contains("Methods")').below(
    until='text:contains("Results")'
).extract_text()
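
The idea behind below() with an until condition can be sketched in plain Python using vertical positions. This is a conceptual illustration under stated assumptions: the Line class, the text_below helper, and its include_endpoint flag are hypothetical names for this example, coordinates are measured from the top of the page, and the sketch does not claim to match how Natural PDF treats the boundary element by default.

```python
from dataclasses import dataclass


@dataclass
class Line:
    # Hypothetical stand-in for one line of text on a page
    text: str
    top: float      # distance from the top of the page
    bottom: float


def text_below(lines, anchor_text, until_text=None, include_endpoint=False):
    """Collect text under the first line containing anchor_text.

    Stops at the first following line containing until_text, if given.
    """
    ordered = sorted(lines, key=lambda line: line.top)
    anchor = next((l for l in ordered if anchor_text in l.text), None)
    if anchor is None:
        return ""
    collected = []
    for line in ordered:
        if line.top < anchor.bottom:
            continue  # the anchor itself, or something above it
        if until_text is not None and until_text in line.text:
            if include_endpoint:
                collected.append(line.text)
            break
        collected.append(line.text)
    return "\n".join(collected)


# Made-up page content for illustration
lines = [
    Line("Introduction", 50, 62),
    Line("Background text.", 70, 80),
    Line("Methods", 100, 112),
    Line("We sampled 40 reports.", 120, 130),
    Line("Results", 150, 162),
    Line("Findings here.", 170, 180),
]

methods_text = text_below(lines, "Methods", until_text="Results")
intro_text = text_below(lines, "Introduction")
```

Without an until condition the region runs to the bottom of the page, which is why intro_text here also picks up the later sections.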

Extract Clean Text

Extract text content from a page or region. Headers, footers, and other unwanted page elements are left out once you have registered them as exclusions.

# Extract all text from the page (respecting exclusions)
page_text = page.extract_text()

# Extract text from a specific region
some_region = page.find(...)
region_text = some_region.extract_text()
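
As a rough mental model of how exclusions affect extraction, the sketch below drops any line that falls entirely inside an excluded vertical band. The tuple layout, the extract_text helper, and the sample coordinates are assumptions made for this example and do not describe Natural PDF's implementation.

```python
def extract_text(lines, exclusions=()):
    """Join line texts top to bottom, skipping lines inside an excluded band.

    lines: iterable of (text, top, bottom) tuples
    exclusions: iterable of (band_top, band_bottom) tuples
    """
    kept = []
    for text, top, bottom in sorted(lines, key=lambda line: line[1]):
        excluded = any(
            top >= band_top and bottom <= band_bottom
            for band_top, band_bottom in exclusions
        )
        if not excluded:
            kept.append(text)
    return "\n".join(kept)


# Made-up lines on a page that is 792 points tall
lines = [
    ("Internal header", 20, 30),
    ("Summary", 60, 72),
    ("Body text.", 80, 90),
    ("Page 1 of 3", 760, 770),
]

# Exclude a header band and a footer band
exclusions = [(0, 40), (750, 792)]

raw_text = extract_text(lines)
clean_text = extract_text(lines, exclusions)
```

Registering the bands once and filtering at extraction time means every later extraction on the same page sees the same cleaned content.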

Apply OCR

Extract text from scanned documents using various OCR engines.

# Apply OCR using the default engine
ocr_elements = page.apply_ocr()

# Extract text (will use OCR results if available)
text = page.extract_text()

Analyze Document Layout

Use AI models to detect document structures like titles, paragraphs, and tables.

# Detect document structure
page.analyze_layout()

# Highlight titles and tables
page.find_all('region[type=title]').show()
page.find_all('region[type=table]').show()

# Extract data from the first table
table_data = page.find('region[type=table]').extract_table()
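
Once a table is extracted, a common next step is saving it. The sketch below assumes table_data can be treated as a list of rows, each a list of cell strings with None for empty cells; that shape is an assumption, so check what extract_table() returns in your version. The sample rows and the table_to_csv helper are hypothetical.

```python
import csv
import io

# Hypothetical sample standing in for the result of extract_table()
table_data = [
    ["Region", "Revenue"],
    ["North", "1200"],
    ["South", None],
]


def table_to_csv(rows):
    """Serialize rows to CSV text, writing empty cells as empty strings."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


csv_text = table_to_csv(table_data)
```

To write a file instead of a string, open it with newline="" and pass the file object to csv.writer in place of the StringIO buffer.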

Document Question Answering

Ask natural language questions directly to your documents.

# Ask a question
result = page.ask("What was the company's revenue in 2022?")
print(f"Answer: {result.answer}")

Visualize Your Work

Debug and understand your extractions visually.

# Highlight headings
page.find_all('text[size>=14]').show(color="red", label="Headings")

# Launch the interactive viewer (Jupyter)
page.viewer()

Reference