Getting Started with Natural PDF
Let's get Natural PDF installed and run your first extraction.
Installation
The base install includes the core library (selectors, extraction, spatial navigation) plus the openai client for LLM-based features like .extract() and .to_markdown().
pip install natural-pdf
Optional dependencies can be installed individually as needed.
# Bundles
pip install "natural-pdf[export]" # PDF export (pikepdf, img2pdf, etc.)
pip install "natural-pdf[paddle]" # PaddleOCR stack (paddlepaddle + paddleocr + paddlex) — includes paddlevl engine
pip install "natural-pdf[all]" # Everything
# Individual packages
pip install easyocr # EasyOCR engine
pip install "surya-ocr<0.15" # Surya OCR engine
pip install doclayout_yolo # YOLO layout detection
pip install torch transformers # QA, classification, semantic search
If you try to use an engine that isn't installed, the library raises an error that includes the pip install command you need.
You can check what's installed at any time:
npdf list
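If you'd rather check from Python, for example inside a notebook, a rough equivalent can be sketched with the standard library alone. This is not part of Natural PDF: the `check_optional` helper is hypothetical, and the import names in `OPTIONAL_MODULES` are assumptions (a pip package name and its import name can differ, as with `surya-ocr`).

```python
from importlib.util import find_spec

# Assumed mapping of pip package name -> import name for the optional
# packages listed above. Adjust it if a package imports under another name.
OPTIONAL_MODULES = {
    "easyocr": "easyocr",
    "surya-ocr": "surya",
    "doclayout_yolo": "doclayout_yolo",
    "torch": "torch",
    "transformers": "transformers",
}


def _is_importable(module_name):
    """Return True if Python can locate the module, without importing it."""
    try:
        return find_spec(module_name) is not None
    except (ImportError, ValueError):
        # find_spec can raise for half-initialised or oddly configured modules
        return False


def check_optional(modules=OPTIONAL_MODULES):
    """Return {pip_name: bool} saying whether each module can be found."""
    return {pip_name: _is_importable(name) for pip_name, name in modules.items()}


if __name__ == "__main__":
    for pip_name, installed in check_optional().items():
        print(f"{pip_name:16} {'installed' if installed else 'missing'}")
```

This only reports whether a module can be located, not whether its version is compatible, so treat `npdf list` as the authoritative check.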
Your First PDF Extraction
Here's a quick example to make sure everything is working:
from natural_pdf import PDF
# Open a PDF
pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")
# Get the first page
page = pdf.pages[0]
# Extract all text
text = page.extract_text()
print(text)
# Find something specific
title = page.find('text:bold')
if title:
    print(f"Found title: {title.extract_text()}")
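To go from one page to the whole document, the same `extract_text()` call can be applied page by page. The sketch below is a minimal illustration, not a library feature: `extract_all_text` is a hypothetical helper, and it assumes `pdf.pages` can be iterated with a `for` loop (the example above only shows indexing with `pdf.pages[0]`).

```python
def extract_all_text(pdf, separator="\n\n"):
    """Join the extracted text of every page in `pdf`.

    Assumes `pdf.pages` is iterable and that each page has an
    extract_text() method, as in the single-page example.
    """
    texts = []
    for page in pdf.pages:
        # Treat a page with no extractable text as an empty string
        texts.append(page.extract_text() or "")
    return separator.join(texts)


def main():
    # Imported here so the helper above can be used without natural_pdf
    from natural_pdf import PDF

    pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")
    print(extract_all_text(pdf))
```

Call `main()` to run it against the practice PDF. Keeping the loop in a small function makes it easy to swap the separator or to skip empty pages later.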
What's Next?
Now that you have Natural PDF installed, you can:
- Follow the Quickstart guide
- Learn to find elements in PDFs
- See how to extract text