Structured Extraction with LLMs

Use .extract() to pull structured data from PDFs using an LLM. Define a Pydantic schema, pass an OpenAI-compatible client, and get back typed results with optional citations and confidence scores.

Basic Usage

import os
from pydantic import BaseModel
from openai import OpenAI
from natural_pdf import PDF

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

class InvoiceData(BaseModel):
    invoice_number: str
    date: str
    total: float
    vendor: str

pdf = PDF("invoice.pdf")
page = pdf.pages[0]

result = page.extract(InvoiceData, client=client)

# Attribute access
print(result.invoice_number)   # "INV-2024-00789"
print(result.total)            # 1250.00

# Item access gives FieldResult objects
print(result["vendor"].value)  # "Acme Corp"

# Convert to dict
print(result.to_dict())        # {"invoice_number": "INV-2024-00789", ...}

# Iterate
for name, field in result:
    print(f"{name}: {field.value}")

pdf.close()

Text vs Vision Mode

By default, .extract() sends the page's text layer to the LLM. For scanned documents, or when the visual layout matters, pass using='vision':

# Text mode (default) — sends extracted text
result = page.extract(MySchema, client=client, using="text")

# Vision mode — sends a rendered image of the page
result = page.extract(MySchema, client=client, using="vision")

Custom Prompts and Instructions

Override the default prompt entirely or append domain-specific guidance:

# Custom prompt
result = page.extract(
    MySchema,
    client=client,
    prompt="Extract inspection details. Dates should be in ISO format.",
)

# Instructions — appended to the default (or custom) prompt
result = page.extract(
    MySchema,
    client=client,
    instructions="Monetary values should be in USD. If a field is ambiguous, prefer null.",
)

Citations

Add citations=True to trace each extracted value back to the specific PDF elements it came from:

result = page.extract(MySchema, client=client, citations=True)

# Each field has a .citations ElementCollection
result["vendor"].citations       # ElementCollection of source TextElements
result["vendor"].citations.show()  # Highlight sources on the page

# Show all citations at once
result.show()

# Access all citations as a dict
result.all_citations  # {"vendor": ElementCollection, "date": ElementCollection, ...}

Citations work by sending line-numbered text to the LLM and asking it to return verbatim quotes. These quotes are then aligned back to PDF elements using pdfplumber's TextMap provenance data.
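
The line-numbering and quote-matching idea can be sketched in plain Python. This is a simplified illustration, not natural-pdf's implementation: the helper names number_lines and locate_quote are hypothetical, the quotes are hand-written stand-ins for an LLM response, and quotes are mapped to line numbers only, whereas the library aligns them to PDF elements.

```python
from typing import Dict, List


def number_lines(text: str) -> str:
    """Prefix each line with a 1-based index, as a prompt might present it."""
    return "\n".join(
        f"{i}: {line}" for i, line in enumerate(text.splitlines(), start=1)
    )


def locate_quote(text: str, quote: str) -> List[int]:
    """Return the 1-based numbers of the lines that contain the verbatim quote."""
    return [
        i for i, line in enumerate(text.splitlines(), start=1) if quote in line
    ]


page_text = "Invoice INV-2024-00789\nVendor: Acme Corp\nTotal due: 1250.00"

# Hypothetical verbatim quotes, standing in for what an LLM might return.
quotes: Dict[str, str] = {"vendor": "Acme Corp", "total": "1250.00"}

numbered = number_lines(page_text)
cited_lines = {
    field: locate_quote(page_text, quote) for field, quote in quotes.items()
}
print(numbered)
print(cited_lines)
```

Asking for verbatim quotes is what makes this matching possible: a paraphrased quote would not appear in the source text, so it would produce no citation.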

Note: Citations require using='text' (the default). They are not supported with using='vision'.

Confidence Scoring

Add confidence=True to get a 0.0–1.0 confidence score for each extracted field:

result = page.extract(MySchema, client=client, confidence=True)

result["vendor"].confidence  # 0.95
result.confidences           # {"vendor": 0.95, "date": 0.85, ...}

These scores are the LLM's own assessments; they are not calibrated or independently verified. The prompt asks the LLM to self-report confidence using these anchors:

| Score | Prompt Anchor |
|-------|---------------|
| 0.0 | Not present or completely uncertain |
| 0.2 | Weakly implied but not stated |
| 0.5 | Partially supported or ambiguous |
| 0.8 | Supported with minor inference |
| 1.0 | Explicitly stated in the text |
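
Because the scores are self-reported, one reasonable use is routing low-scoring fields to a human reviewer. The sketch below assumes a plain dict shaped like the result.confidences example above. The helper fields_needing_review and the 0.8 cut-off are illustrative choices, not part of natural-pdf.

```python
from typing import Dict, List


def fields_needing_review(
    confidences: Dict[str, float], threshold: float = 0.8
) -> List[str]:
    """Return the names of fields whose self-reported score is below the threshold."""
    return sorted(name for name, score in confidences.items() if score < threshold)


# Same shape as result.confidences; the values here are made up for illustration.
confidences = {"vendor": 0.95, "date": 0.85, "total": 0.5}

to_review = fields_needing_review(confidences)
print(to_review)
```

A threshold of 0.8 lines up with the "Supported with minor inference" anchor, so anything flagged was at best partially supported according to the model's own report.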

Categorical Confidence

Instead of numeric scores, you can pass a list of level names. Given names alone, the LLM interprets each level from its label:

result = page.extract(
    MySchema,
    client=client,
    confidence=["low", "medium", "high"],
)
result["vendor"].confidence  # "high"

Or provide explicit descriptions for each level:

result = page.extract(
    MySchema,
    client=client,
    confidence={
        "low": "implied or inferred",
        "medium": "strongly implied",
        "high": "clearly and explicitly stated",
    },
)
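
Categorical labels have no built-in order, so comparing them downstream needs an explicit ranking. A minimal sketch, assuming you keep the same ordered list you passed to confidence= and have collected each field's label into a plain dict. The helper at_or_below and the sample labels are hypothetical.

```python
from typing import Dict, List

# Same order as the list passed to confidence=, from weakest to strongest.
LEVELS: List[str] = ["low", "medium", "high"]


def at_or_below(
    labels: Dict[str, str], ceiling: str, levels: List[str] = LEVELS
) -> List[str]:
    """Return fields whose label ranks at or below `ceiling`.

    Labels that are not in `levels` are flagged as well, since they cannot be ranked.
    """
    rank = {level: i for i, level in enumerate(levels)}
    limit = rank[ceiling]
    return sorted(
        name for name, label in labels.items() if rank.get(label, -1) <= limit
    )


# Illustrative labels, e.g. gathered from result[name].confidence for each field.
labels = {"vendor": "high", "date": "medium", "total": "low"}

weak_fields = at_or_below(labels, "medium")
print(weak_fields)
```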

Annotated PDF Export

Save extraction results as a native PDF with highlight annotations and a sidebar legend:

result = page.extract(MySchema, client=client, citations=True, confidence=True)
result.save_pdf("annotated.pdf")

Install: pip install "natural-pdf[export]" (requires pikepdf).

Each field's citation elements become /Highlight annotations on the corresponding pages. The sidebar shows field names, extracted values, and colors matching the highlights.

You can also visualize inline:

result.show()  # Displays enriched legend labels (field name + value)

Controlling which pages appear

Both .show() and .save_pdf() accept a pages parameter:

# .show() defaults to pages="cited" — only pages with citation elements
result.show()                     # cited pages only
result.show(pages="all")          # every page in the source PDF

# .save_pdf() defaults to pages="all" — the full source PDF with annotations
result.save_pdf("annotated.pdf")                   # all pages
result.save_pdf("annotated.pdf", pages="cited")    # only annotated pages

Extracting from Regions, Pages, and PDFs

.extract() works on pages, regions, and entire PDFs:

# From a specific region
header = page.find('text:contains("Invoice")').below(until='text:contains("Items")')
result = header.extract(MySchema, client=client)

# From an entire PDF (multi-page)
result = pdf.extract(MySchema, client=client, citations=True)

Choosing a Model

Pass model= to select which LLM to use:

result = page.extract(MySchema, client=client, model="gpt-4o-mini")