Label-Value Extraction

Find a label element and extract the value next to it. The fundamental pattern for form data extraction.

When to use this pattern:

Invoices ("Total:" -> "$500")
Forms ("Name:" -> "John Smith")
Any document with label-value pairs

The Problem

You need to find "Invoice Number:" on a page and get "INV-2024-00789" from next to it. The position varies between documents, so you can't use fixed coordinates.

Sample PDF

This tutorial uses pdfs/cookbook/vendor_invoice.pdf - an invoice with header fields and line items.

import natural_pdf as npdf

pdf = npdf.PDF("pdfs/cookbook/vendor_invoice.pdf")
pdf.pages[0].show()

Basic Pattern

import natural_pdf as npdf

pdf = npdf.PDF("pdfs/cookbook/vendor_invoice.pdf")
page = pdf.pages[0]

# Find the label
label = page.find('text:contains("Invoice Number:")')

# Get the value to its right
if label:
    value_region = label.right()
    value = value_region.extract_text().strip()
    print(f"Invoice Number: {value}")

pdf.close()

Output:

Invoice Number: INV-2024-00789

Directional Methods

Natural PDF provides four directional methods. Each creates a region extending from the source element:

Method	Default Behavior	Use When
`.right()`	Same height as element	Value is beside the label
`.left()`	Same height as element	Label is to the right of value
`.below()`	Full page width	Value is under the label
`.above()`	Full page width	Value is above the label

# Value is beside the label (most common)
label.right().extract_text()

# Value is under the label (stacked layout)
label.below().extract_text()

# Combine for complex layouts
# Get value that's below AND within a column
label.below(width='element').extract_text()

Controlling Region Size

By default, .right() and .left() match the element's height, while .below() and .above() span the full page width. Override these defaults:

# Fixed width region (150 pixels)
label.right(width=150).extract_text()

# Match the label's width
label.below(width='element').extract_text()

# Full page width
label.right(height='full').extract_text()

Using `until` to Bound Regions

Stop a region at another element instead of extending to the page edge:

# Get text between "Description:" and the next bold text
desc = page.find('text:contains("Description:")')
content = desc.below(until='text:bold').extract_text()

# Get text until a specific label
start = page.find('text:contains("Summary:")')
end_label = 'text:contains("Total:")'
section = start.below(until=end_label).extract_text()

Extracting Multiple Fields

Loop through a list of labels:

fields_to_extract = [
    'Invoice Number:',
    'Invoice Date:',
    'Due Date:',
    'Vendor:',
    'PO Number:',
]

data = {}
for field in fields_to_extract:
    label = page.find(f'text:contains("{field}")')
    if label:
        data[field] = label.right().extract_text().strip()
    else:
        data[field] = None

print(data)

Output:

{
    'Invoice Number:': 'INV-2024-00789',
    'Invoice Date:': '2024-03-15',
    'Due Date:': '2024-04-15',
    'Vendor:': 'Acme Corporation',
    'PO Number:': 'PO-2024-456'
}

Handling Variations

Case-Insensitive Matching

# Matches "total:", "Total:", "TOTAL:"
label = page.find('text:contains("total")', case=False)

Partial Matches

The contains() selector matches substrings:

# Matches "Invoice Number:", "Invoice No:", "Invoice #:"
label = page.find('text:contains("Invoice")')

Multiple Possible Labels

Try several labels until one matches:

possible_labels = ['Total:', 'Grand Total:', 'Amount Due:', 'Balance:']

total_value = None
for label_text in possible_labels:
    label = page.find(f'text:contains("{label_text}")')
    if label:
        total_value = label.right().extract_text().strip()
        break

print(f"Total: {total_value}")

Working with Tables of Label-Value Pairs

Some forms use two-column tables for metadata:

# Extract the entire metadata table
metadata_table = page.extract_table()
if metadata_table:
    df = metadata_table.to_df()

    # Convert two-column table to dictionary
    if len(df.columns) == 2:
        data = dict(zip(df.iloc[:, 0], df.iloc[:, 1]))
        print(data)

Extracting Monetary Values

import re

label = page.find('text:contains("TOTAL:")')
if label:
    raw = label.right().extract_text().strip()
    # Remove currency symbols and commas
    amount = re.sub(r'[^\d.]', '', raw)
    total = float(amount)
    print(f"Total: ${total:,.2f}")

Complete Invoice Extraction Example

import natural_pdf as npdf
import re

def extract_invoice(pdf_path):
    """Extract key fields from an invoice PDF."""
    pdf = npdf.PDF(pdf_path)
    page = pdf.pages[0]

    data = {}

    # Header fields
    header_fields = {
        'invoice_number': 'Invoice Number:',
        'invoice_date': 'Invoice Date:',
        'due_date': 'Due Date:',
        'vendor': 'Vendor:',
        'po_number': 'PO Number:',
    }

    for key, label_text in header_fields.items():
        label = page.find(f'text:contains("{label_text}")')
        data[key] = label.right().extract_text().strip() if label else None

    # Total (with currency parsing)
    total_label = page.find('text:contains("TOTAL:")')
    if total_label:
        raw = total_label.right().extract_text().strip()
        data['total'] = float(re.sub(r'[^\d.]', '', raw))

    # Line items table
    line_items_header = page.find('text:contains("Line Items")')
    if line_items_header:
        table_region = line_items_header.below(until='text:contains("Subtotal")')
        table = table_region.extract_table()
        if table:
            data['line_items'] = table.to_df().to_dict('records')

    pdf.close()
    return data

# Usage
invoice_data = extract_invoice("pdfs/cookbook/vendor_invoice.pdf")
print(f"Invoice: {invoice_data['invoice_number']}")
print(f"Vendor: {invoice_data['vendor']}")
print(f"Total: ${invoice_data['total']:,.2f}")

Troubleshooting

"Value region is empty"

The label might be part of a larger text element. Try finding the specific text:

# Instead of finding just "Total:"
label = page.find('text:contains("Total:")')

# You might need to find the element that STARTS with "Total:"
labels = page.find_all('text:contains("Total")')
for l in labels:
    if l.extract_text().strip().startswith("Total:"):
        value = l.right().extract_text()
        break

"Getting wrong value (from different row)"

Constrain the region height:

# Use element height to stay on the same line
label.right(height='element').extract_text()

"Label and value are in the same element"

Sometimes PDFs store "Label: Value" as one text element:

element = page.find('text:contains("Invoice Number:")')
full_text = element.extract_text()  # "Invoice Number: INV-2024-00789"

# Parse the value from the text
if ': ' in full_text:
    label, value = full_text.split(': ', 1)

Next Steps

One Page = One Row - Apply this pattern to multi-page forms
Finding Sections - Extract content between section headers
Simple Table Extraction - Extract structured tables