Skip to content

Natural PDF

A friendly library for working with PDFs, built on top of pdfplumber.

Natural PDF lets you find and extract content from PDFs using simple code that makes sense.

Installation

pip install natural_pdf
# All the extras
pip install "natural_pdf[all]"

Quick Example

from natural_pdf import PDF

pdf = PDF('document.pdf')
page = pdf.pages[0]

# Find the title and get content below it
title = page.find('text:contains("Summary"):bold')
content = title.below().extract_text()

# Exclude everything above 'CONFIDENTIAL' and below last line on page
page.add_exclusion(page.find('text:contains("CONFIDENTIAL")').above())
page.add_exclusion(page.find_all('line')[-1].below())

# Get the clean text without header/footer
clean_text = page.extract_text()

Key Features

Here are a few highlights of what you can do:

Find Elements with Selectors

Use CSS-like selectors to find text, shapes, and more.

# Find bold text containing "Revenue"
page.find('text:contains("Revenue"):bold').extract_text()

# Find all large text
page.find_all('text[size>=12]').extract_text()

Learn more about selectors →

Move around the page relative to elements, not just coordinates.

# Extract text below a specific heading
intro_text = page.find('text:contains("Introduction")').below().extract_text()

# Extract text from one heading to the next
methods_text = page.find('text:contains("Methods")').below(
    until='text:contains("Results")'
).extract_text()

Explore more navigation methods →

Extract Clean Text

Easily extract text content, automatically handling common page elements like headers and footers (if exclusions are set).

# Extract all text from the page (respecting exclusions)
page_text = page.extract_text()

# Extract text from a specific region
some_region = page.find(...)
region_text = some_region.extract_text()

Learn about text extraction → Learn about exclusion zones →

Apply OCR

Extract text from scanned documents using various OCR engines.

# Apply OCR using the default engine
ocr_elements = page.apply_ocr()

# Extract text (will use OCR results if available)
text = page.extract_text()

Explore OCR options →

Analyze Document Layout

Use AI models to detect document structures like titles, paragraphs, and tables.

# Detect document structure
page.analyze_layout()

# Highlight titles and tables
page.find_all('region[type=title]').highlight(color="purple")
page.find_all('region[type=table]').highlight(color="blue")

# Extract data from the first table
table_data = page.find('region[type=table]').extract_table()

Learn about layout models → Working with tables? →

Document Question Answering

Ask natural language questions directly to your documents.

# Ask a question
result = pdf.ask("What was the company's revenue in 2022?")
if result.get("found", False):
    print(f"Answer: {result['answer']}")

Learn about Document QA →

Classify Pages and Regions

Categorize pages or specific regions based on their content using text or vision models.

Note: Requires pip install "natural-pdf[classification]"

# Classify a page based on text
labels = ["invoice", "scientific article", "presentation"]
page.classify(labels, using="text")
print(f"Page Category: {page.category} (Confidence: {page.category_confidence:.2f})")


# Classify a page based on what it looks like
labels = ["invoice", "scientific article", "presentation"]
page.classify(labels, using="vision")
print(f"Page Category: {page.category} (Confidence: {page.category_confidence:.2f})")

Visualize Your Work

Debug and understand your extractions visually.

# Highlight headings
page.find_all('text[size>=14]').highlight(color="red", label="Headings")

# Launch the interactive viewer (Jupyter)
# Requires: pip install natural-pdf[interactive]
page.viewer()

# Or save an image
# page.save_image("highlighted.png")

See more visualization options →

Documentation Topics

Choose what you want to learn about:

Task-based Guides

Reference