Natural PDF

A friendly library for working with PDFs, built on top of pdfplumber.

Natural PDF lets you find and extract content from PDFs using simple code that makes sense.

Installation

pip install natural_pdf
# All the extras
pip install "natural_pdf[all]"

Quick Example

from natural_pdf import PDF

pdf = PDF('https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf')
page = pdf.pages[0]

# Find the title and get content below it
title = page.find('text:contains("Summary"):bold')
content = title.below().extract_text()

# Exclude everything above 'CONFIDENTIAL' and below the last line element on the page
page.add_exclusion(page.find('text:contains("CONFIDENTIAL")').above())
page.add_exclusion(page.find_all('line')[-1].below())

# Get the clean text without header/footer
clean_text = page.extract_text()
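
To make the exclusion idea concrete, here is a minimal, library-independent sketch of how header and footer bands could be filtered out before text is joined. It does not show how Natural PDF implements exclusions. The `Word` class, the `text_outside_exclusions` helper, and all coordinates are hypothetical, and the sketch assumes `top` grows downward from the top of the page.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    top: float     # distance from the top of the page (assumed convention)
    bottom: float


def text_outside_exclusions(words, exclusions):
    """Join the words whose vertical span overlaps no (top, bottom) exclusion band."""
    kept = []
    for word in words:
        overlaps = any(
            word.top < band_bottom and word.bottom > band_top
            for band_top, band_bottom in exclusions
        )
        if not overlaps:
            kept.append(word.text)
    return " ".join(kept)


# Hypothetical page content: a header stamp, two body words, a footer
words = [
    Word("CONFIDENTIAL", 20, 32),
    Word("Summary", 100, 114),
    Word("Findings", 130, 142),
    Word("Page 1", 760, 770),
]

# Hypothetical header band (0-40) and footer band (750-792)
clean = text_outside_exclusions(words, [(0, 40), (750, 792)])
print(clean)
```

Only the words that fall entirely outside both bands survive, which is the effect the `add_exclusion` calls above aim for.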

Getting Started

Learning the Basics

Follow the tutorial series to learn Natural PDF systematically:

  1. Loading and Basic Text Extraction
  2. Finding Specific Elements
  3. Extracting Content Blocks
  4. Table Extraction
  5. Excluding Unwanted Content
  6. Document Question Answering
  7. Layout Analysis
  8. Spatial Navigation
  9. Section Extraction
  10. Form Field Extraction
  11. Enhanced Table Processing
  12. OCR Integration
  13. Semantic Search
  14. Categorizing Documents

Solving Specific Problems

Text Extraction Issues

Table Problems

Data Extraction

Document Analysis

Finding Content

Layout and Structure

Key Features

Find Elements with Selectors

Use CSS-like selectors to find text, shapes, and more.

# Find bold text containing "Revenue"
page.find('text:contains("Revenue"):bold').extract_text()

# Find all large text
page.find_all('text[size>=12]').extract_text()
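
For readers curious about what a selector such as `text[size>=12]` expresses, the following sketch matches plain dictionaries against a tiny subset of that syntax. It is an illustration under stated assumptions, not Natural PDF's parser: the `matches` function, the dictionary layout, and the sample elements are all hypothetical.

```python
import operator
import re

# Comparison operators supported by this toy matcher
OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt,
       "<": operator.lt, "=": operator.eq}


def matches(element, selector):
    """Check a dict-based element against a small subset of the selector syntax."""
    # The leading word names the element type, e.g. "text"
    kind = re.match(r"\w+", selector).group(0)
    if element["type"] != kind:
        return False
    # :contains("...") is a substring test on the element's text
    found = re.search(r':contains\("([^"]*)"\)', selector)
    if found and found.group(1) not in element["text"]:
        return False
    # :bold requires a truthy "bold" flag
    if ":bold" in selector and not element.get("bold", False):
        return False
    # [attr>=value] compares a numeric attribute
    for attr, op, value in re.findall(r"\[(\w+)(>=|<=|>|<|=)([\d.]+)\]", selector):
        if not OPS[op](element.get(attr, 0), float(value)):
            return False
    return True


# Hypothetical elements standing in for what a page might contain
elements = [
    {"type": "text", "text": "Revenue 2022", "size": 14.0, "bold": True},
    {"type": "text", "text": "Revenue grew steadily", "size": 10.0, "bold": False},
    {"type": "line", "text": "", "size": 0.0},
]

bold_revenue = [e["text"] for e in elements if matches(e, 'text:contains("Revenue"):bold')]
large_text = [e["text"] for e in elements if matches(e, "text[size>=12]")]
print(bold_revenue, large_text)
```

Each part of the selector narrows the candidate set, so combining `:contains` with `:bold` selects only elements that satisfy both conditions.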

Navigate Spatially

Move around the page relative to elements, not just coordinates.

# Extract text below a specific heading
intro_text = page.find('text:contains("Introduction")').below().extract_text()

# Extract text from one heading to the next
methods_text = page.find('text:contains("Methods")').below(
    until='text:contains("Results")'
).extract_text()
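
The "from one heading to the next" pattern can be pictured as a walk down the page in vertical order. The sketch below is a simplified model, not the library's implementation: `Line` and `text_below` are hypothetical names, the sample lines are made up, and the sketch stops before the `until` line, which may differ from how Natural PDF treats the boundary.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    top: float  # distance from the top of the page (assumed convention)


def text_below(lines, start, until=None):
    """Collect lines under the first line containing `start`, stopping before `until`."""
    collected = []
    started = False
    for line in sorted(lines, key=lambda ln: ln.top):
        if not started:
            # Skip everything up to and including the starting heading
            started = start in line.text
            continue
        if until is not None and until in line.text:
            break
        collected.append(line.text)
    return "\n".join(collected)


# Hypothetical lines, deliberately out of order to show the sort by position
lines = [
    Line("Results", 300),
    Line("Methods", 100),
    Line("We sampled 40 reports.", 120),
    Line("Each was coded twice.", 140),
    Line("Accuracy was high.", 320),
]

methods_text = text_below(lines, "Methods", until="Results")
print(methods_text)
```

Because the walk is ordered by position rather than by reading order in the file, the result depends only on where elements sit on the page.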

Extract Clean Text

Extract text content while skipping common page elements such as headers and footers, provided exclusions have been set.

# Extract all text from the page (respecting exclusions)
page_text = page.extract_text()

# Extract text from a specific region
some_region = page.find(...)
region_text = some_region.extract_text()

Apply OCR

Extract text from scanned documents using various OCR engines.

# Apply OCR using the default engine
ocr_elements = page.apply_ocr()

# Extract text (will use OCR results if available)
text = page.extract_text()

Analyze Document Layout

Use AI models to detect document structures like titles, paragraphs, and tables.

# Detect document structure
page.analyze_layout()

# Highlight titles and tables
page.find_all('region[type=title]').highlight(color="purple")
page.find_all('region[type=table]').highlight(color="blue")

# Extract data from the first table
table_data = page.find('region[type=table]').extract_table()
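
Once a table has been extracted, a common next step is turning it into records keyed by column name. The sketch below assumes `table_data` is a list of rows whose first row is the header, as pdfplumber's table extraction returns; whether Natural PDF returns exactly that shape is an assumption here. The `rows_to_records` helper and the sample rows are hypothetical.

```python
def rows_to_records(table_data):
    """Turn a header row plus data rows into a list of dictionaries."""
    if not table_data:
        return []
    header, *rows = table_data
    return [dict(zip(header, row)) for row in rows]


# Hypothetical table contents, standing in for an extract_table() result
table_data = [
    ["Year", "Revenue"],
    ["2021", "1.2M"],
    ["2022", "1.5M"],
]

records = rows_to_records(table_data)
for record in records:
    print(record["Year"], record["Revenue"])
```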

Document Question Answering

Ask natural-language questions of your documents directly.

# Ask a question
result = page.ask("What was the company's revenue in 2022?")
if result.found:
    print(f"Answer: {result.answer}")
    result.show()  # Highlight where the answer was found

Classify Pages and Regions

Categorize pages or specific regions based on their content using text or vision models.

# Classify a page based on text
labels = ["invoice", "scientific article", "presentation"]
page.classify(labels, using="text")
print(f"Page Category: {page.category} (Confidence: {page.category_confidence:.2f})")

# Classify a page based on what it looks like
page.classify(labels, using="vision")
print(f"Page Category: {page.category} (Confidence: {page.category_confidence:.2f})")
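
Since each classification comes with a confidence score, one possible way to use it is to send low-confidence pages for manual review. This is a sketch of that idea only: the `route_page` helper, the 0.75 threshold, and the sample predictions are hypothetical, and the values stand in for what `page.category` and `page.category_confidence` might hold.

```python
def route_page(category, confidence, threshold=0.75):
    """Return the predicted label, or "needs_review" when confidence is too low."""
    if category is None or confidence < threshold:
        return "needs_review"
    return category


# Hypothetical (category, confidence) pairs for two pages
predictions = [("invoice", 0.93), ("presentation", 0.41)]

routes = [route_page(category, confidence) for category, confidence in predictions]
print(routes)
```

The right threshold depends on the model and the documents, so it is worth checking against a handful of pages you have labeled by hand.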

Visualize Your Work

Debug and understand your extractions visually.

# Highlight headings
page.find_all('text[size>=14]').show(color="red", label="Headings")

# Launch the interactive viewer (Jupyter)
page.viewer()

Reference Documentation

Understanding Natural PDF

Coming soon: Conceptual guides explaining how Natural PDF thinks about PDFs and when to use different approaches.