API Reference

This section provides detailed documentation for all the classes and methods in Natural PDF.

Core Classes

PDF Class

The main entry point for working with PDFs.

class PDF:
    """
    The main entry point for working with PDFs.

    Parameters:
        path (str): Path to the PDF file.
        password (str, optional): Password for encrypted PDFs. Default: None
        reading_order (bool, optional): Sort elements in reading order. Default: True
        keep_spaces (bool, optional): Keep spaces in word elements. Default: True
        font_attrs (list, optional): Font attributes to use for text grouping. 
                                    Default: ['fontname', 'size']
        ocr (bool/dict/str, optional): OCR configuration. Default: False
        ocr_engine (str/Engine, optional): OCR engine to use. Default: "easyocr"
    """

Main Methods

| Method | Description | Parameters | Returns |
| --- | --- | --- | --- |
| pages | Access pages in the document | N/A (property) | PageCollection |
| extract_text(keep_blank_chars=True, apply_exclusions=True) | Extract text from all pages | keep_blank_chars: Whether to keep blank characters; apply_exclusions: Whether to apply exclusion zones | str: Extracted text |
| find(selector, case=True, regex=False, apply_exclusions=True) | Find the first element matching selector across all pages | selector: CSS-like selector string; case: Case-sensitive search; regex: Use regex for :contains(); apply_exclusions: Whether to apply exclusion zones | Element or None |
| find_all(selector, case=True, regex=False, apply_exclusions=True) | Find all elements matching selector across all pages | selector: CSS-like selector string; case: Case-sensitive search; regex: Use regex for :contains(); apply_exclusions: Whether to apply exclusion zones | ElementCollection |
| add_exclusion(func, label=None) | Add a document-wide exclusion zone | func: Function taking a page and returning a region; label: Optional label for the exclusion | None |
| get_sections(start_elements, end_elements=None, boundary_inclusion='start') | Get sections across all pages | start_elements: Elements marking section starts; end_elements: Elements marking section ends; boundary_inclusion: How to include boundaries ('start', 'end', 'both', 'none') | list[Region] |
| ask(question, min_confidence=0.0, model=None) | Ask a question about the document content | question: Question to ask; min_confidence: Minimum confidence threshold; model: Optional model name or path | dict: Result with answer and metadata |
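
The methods in the table above can be combined as in the following minimal sketch. It assumes the class is importable as `from natural_pdf import PDF`, which this section does not state. The `pdf_factory` argument is a hypothetical hook, not part of Natural PDF; it only exists so the function can be exercised without a real file.

```python
def first_match_text(path, selector, pdf_factory=None):
    """Return the text of the first element matching `selector`, or None."""
    if pdf_factory is None:
        # Assumption: PDF is exported from the package's top level.
        from natural_pdf import PDF
        pdf_factory = PDF

    pdf = pdf_factory(path)
    element = pdf.find(selector)  # Element or None, per the table above
    if element is None:
        return None
    return element.extract_text()
```

A call might look like `first_match_text("report.pdf", 'text:contains("Summary")')`, where the file name is a placeholder.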

Page Class

Represents a single page in a PDF document.

class Page:
    """
    Represents a single page in a PDF document.

    Properties:
        page_number (int): 1-indexed page number
        page_index (int): 0-indexed page position
        width (float): Page width in points
        height (float): Page height in points
        pdf (PDF): Parent PDF object
    """

Main Methods

| Method | Description | Parameters | Returns |
| --- | --- | --- | --- |
| extract_text(keep_blank_chars=True, apply_exclusions=True, ocr=None) | Extract text from the page | keep_blank_chars: Whether to keep blank characters; apply_exclusions: Whether to apply exclusion zones; ocr: Whether to force OCR | str: Extracted text |
| find(selector, case=True, regex=False, apply_exclusions=True) | Find the first element matching selector | selector: CSS-like selector string; case: Case-sensitive search; regex: Use regex for :contains(); apply_exclusions: Whether to apply exclusion zones | Element or None |
| find_all(selector, case=True, regex=False, apply_exclusions=True) | Find all elements matching selector | selector: CSS-like selector string; case: Case-sensitive search; regex: Use regex for :contains(); apply_exclusions: Whether to apply exclusion zones | ElementCollection |
| create_region(x0, top, x1, bottom) | Create a region at specific coordinates | x0: Left coordinate; top: Top coordinate; x1: Right coordinate; bottom: Bottom coordinate | Region |
| highlight(elements, color=None, label=None) | Highlight elements on the page | elements: Elements to highlight; color: RGBA color tuple; label: Label for the highlight | Page (self) |
| highlight_all(include_types=None, include_text_styles=False, include_layout_regions=False) | Highlight all elements on the page | include_types: Element types to include; include_text_styles: Whether to include text styles; include_layout_regions: Whether to include layout regions | Page (self) |
| save_image(path, resolution=72, labels=True) | Save an image of the page with highlights | path: Path to save image; resolution: Image resolution in DPI; labels: Whether to include labels | None |
| to_image(resolution=72, labels=True) | Get a PIL Image of the page with highlights | resolution: Image resolution in DPI; labels: Whether to include labels | PIL.Image |
| analyze_text_styles() | Group text by visual style properties | None | dict: Mapping of style name to elements |
| analyze_layout(engine="yolo", confidence=0.2, existing="replace") | Detect layout regions using ML models | engine: Layout engine to use ("yolo", "tatr"); confidence: Confidence threshold; existing: How to handle existing regions | ElementCollection: Detected regions |
| add_exclusion(region, label=None) | Add an exclusion zone to the page | region: Region to exclude; label: Optional label for the exclusion | Region: The exclusion region |
| get_sections(start_elements, end_elements=None, boundary_inclusion='start') | Get sections from the page | start_elements: Elements marking section starts; end_elements: Elements marking section ends; boundary_inclusion: How to include boundaries | list[Region] |
| ask(question, min_confidence=0.0, model=None, debug=False) | Ask a question about the page content | question: Question to ask; min_confidence: Minimum confidence threshold; model: Optional model name or path; debug: Whether to save debug files | dict: Result with answer and metadata |
| apply_ocr(languages=None, min_confidence=0.0, **kwargs) | Apply OCR to the page | languages: Languages to use; min_confidence: Minimum confidence threshold; **kwargs: Additional OCR engine parameters | ElementCollection: OCR text elements |
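
The effect of the `boundary_inclusion` option of `get_sections` can be illustrated on a plain list. The sketch below is not the library's implementation: `split_sections` is a hypothetical helper that works on list items rather than page elements and returns lists rather than Region objects. It also assumes that, when no end markers are given, each new start marker closes the previous section and belongs only to the section it opens.

```python
def split_sections(items, is_start, is_end=None, boundary_inclusion="start"):
    """Split `items` into sections delimited by start (and optional end) markers."""
    if boundary_inclusion not in ("start", "end", "both", "none"):
        raise ValueError(f"unknown boundary_inclusion: {boundary_inclusion!r}")
    keep_start = boundary_inclusion in ("start", "both")
    keep_end = boundary_inclusion in ("end", "both")

    sections = []
    current = None  # None means we are currently outside any section
    for item in items:
        if is_start(item):
            if current is not None:
                sections.append(current)  # a new start closes the open section
            current = [item] if keep_start else []
        elif current is not None and is_end is not None and is_end(item):
            if keep_end:
                current.append(item)
            sections.append(current)
            current = None
        elif current is not None:
            current.append(item)
    if current is not None:
        sections.append(current)  # the last section runs to the end
    return sections
```

With `boundary_inclusion="start"` the heading stays in its section; with `"none"` only the content between markers is kept.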

Region Class

Represents a rectangular area on a page.

class Region:
    """
    Represents a rectangular area on a page.

    Properties:
        x0 (float): Left coordinate
        top (float): Top coordinate
        x1 (float): Right coordinate
        bottom (float): Bottom coordinate
        width (float): Width of the region
        height (float): Height of the region
        page (Page): Parent page object
    """

Main Methods

| Method | Description | Parameters | Returns |
| --- | --- | --- | --- |
| extract_text(keep_blank_chars=True, apply_exclusions=True, ocr=None) | Extract text from the region | keep_blank_chars: Whether to keep blank characters; apply_exclusions: Whether to apply exclusion zones; ocr: Whether to force OCR | str: Extracted text |
| find(selector, case=True, regex=False, apply_exclusions=True) | Find the first element matching selector within the region | selector: CSS-like selector string; case: Case-sensitive search; regex: Use regex for :contains(); apply_exclusions: Whether to apply exclusion zones | Element or None |
| find_all(selector, case=True, regex=False, apply_exclusions=True) | Find all elements matching selector within the region | selector: CSS-like selector string; case: Case-sensitive search; regex: Use regex for :contains(); apply_exclusions: Whether to apply exclusion zones | ElementCollection |
| expand(left=0, top=0, right=0, bottom=0, width_factor=1.0, height_factor=1.0) | Expand the region in specified directions | left/top/right/bottom: Points to expand in each direction; width_factor/height_factor: Scale width/height by this factor | Region: Expanded region |
| highlight(color=None, label=None, include_attrs=None) | Highlight the region | color: RGBA color tuple; label: Label for the highlight; include_attrs: Region attributes to display | Region (self) |
| to_image(resolution=72, crop_only=False) | Get a PIL Image of just the region | resolution: Image resolution in DPI; crop_only: Whether to exclude border | PIL.Image |
| save_image(path, resolution=72, crop_only=False) | Save an image of just the region | path: Path to save image; resolution: Image resolution in DPI; crop_only: Whether to exclude border | None |
| get_sections(start_elements, end_elements=None, boundary_inclusion='start') | Get sections within the region | start_elements: Elements marking section starts; end_elements: Elements marking section ends; boundary_inclusion: How to include boundaries | list[Region] |
| ask(question, min_confidence=0.0, model=None, debug=False) | Ask a question about the region content | question: Question to ask; min_confidence: Minimum confidence threshold; model: Optional model name or path; debug: Whether to save debug files | dict: Result with answer and metadata |
| extract_table(method=None, table_settings=None, use_ocr=False) | Extract table data from the region | method: Extraction method ("plumber", "tatr"); table_settings: Custom settings for extraction; use_ocr: Whether to use OCR text | list: Table data as rows and columns |
| intersects(other) | Check if this region intersects with another | other: Another region | bool: True if regions intersect |
| contains(x, y) | Check if a point is within the region | x: X coordinate; y: Y coordinate | bool: True if point is in region |
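
The geometric methods above (`contains`, `intersects`, `expand`) can be pictured with a small stand-in class. This is a sketch under stated assumptions, not the Region implementation: `Box` is a hypothetical name, `top` is assumed to be smaller than `bottom` (y grows downward), edges are treated as inclusive, and the `width_factor`/`height_factor` options are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical stand-in for a rectangular region (not the library class)."""
    x0: float
    top: float
    x1: float
    bottom: float

    @property
    def width(self):
        return self.x1 - self.x0

    @property
    def height(self):
        return self.bottom - self.top

    def contains(self, x, y):
        # Assumption: points on the edge count as inside.
        return self.x0 <= x <= self.x1 and self.top <= y <= self.bottom

    def intersects(self, other):
        # Two boxes are disjoint if one lies entirely to one side of the other.
        return not (
            self.x1 < other.x0 or other.x1 < self.x0
            or self.bottom < other.top or other.bottom < self.top
        )

    def expand(self, left=0, top=0, right=0, bottom=0):
        # Grow each edge outward by the given number of points.
        return Box(self.x0 - left, self.top - top,
                   self.x1 + right, self.bottom + bottom)
```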

Element Types

Element Base Class

The base class for all PDF elements.

class Element:
    """
    Base class for all PDF elements.

    Properties:
        x0 (float): Left coordinate
        top (float): Top coordinate
        x1 (float): Right coordinate
        bottom (float): Bottom coordinate
        width (float): Width of the element
        height (float): Height of the element
        page (Page): Parent page object
    """

Main Methods

| Method | Description | Parameters | Returns |
| --- | --- | --- | --- |
| above(height=None, full_width=True, until=None, include_until=True) | Create a region above the element | height: Height of region; full_width: Whether to span page width; until: Selector for boundary; include_until: Whether to include boundary | Region |
| below(height=None, full_width=True, until=None, include_until=True) | Create a region below the element | height: Height of region; full_width: Whether to span page width; until: Selector for boundary; include_until: Whether to include boundary | Region |
| select_until(selector, include_endpoint=True, full_width=True) | Create a region from this element to another | selector: Selector for endpoint; include_endpoint: Whether to include endpoint; full_width: Whether to span page width | Region |
| highlight(color=None, label=None, include_attrs=None) | Highlight this element | color: RGBA color tuple; label: Label for the highlight; include_attrs: Element attributes to display | Element (self) |
| extract_text(keep_blank_chars=True, apply_exclusions=True) | Extract text from this element | keep_blank_chars: Whether to keep blank characters; apply_exclusions: Whether to apply exclusion zones | str: Extracted text |
| next(selector=None, limit=None, apply_exclusions=True) | Get the next element in reading order | selector: Optional selector to filter; limit: How many elements to search; apply_exclusions: Whether to apply exclusion zones | Element or None |
| prev(selector=None, limit=None, apply_exclusions=True) | Get the previous element in reading order | selector: Optional selector to filter; limit: How many elements to search; apply_exclusions: Whether to apply exclusion zones | Element or None |
| nearest(selector, max_distance=None, apply_exclusions=True) | Get the nearest element matching selector | selector: Selector for elements; max_distance: Maximum distance in points; apply_exclusions: Whether to apply exclusion zones | Element or None |
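
Spatial navigation with `below` can be chained with `find` and `extract_text` as in this minimal sketch. It relies only on the signatures listed in this reference; `text_below_heading` is a hypothetical helper name, and `page` is assumed to be any object exposing a Page-like `find` method.

```python
def text_below_heading(page, heading_selector, stop_selector):
    """Return the text between a heading and the next boundary element.

    Returns None when the heading is not found on the page.
    """
    heading = page.find(heading_selector)
    if heading is None:
        return None
    # Stop at the first element matching stop_selector, without including it.
    region = heading.below(until=stop_selector, include_until=False)
    return region.extract_text()
```

For example, `text_below_heading(page, 'text:contains("Summary")', 'text:bold')` would collect the text under a "Summary" heading up to the next bold element.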

TextElement

Represents text elements in the PDF.

class TextElement(Element):
    """
    Represents text elements in the PDF.

    Additional Properties:
        text (str): The text content
        fontname (str): The font name
        size (float): The font size
        bold (bool): Whether the text is bold
        italic (bool): Whether the text is italic
        color (tuple): The text color as RGB tuple
        confidence (float): OCR confidence (for OCR text)
        source (str): 'pdf' or 'ocr'
    """

Main Properties

| Property | Type | Description |
| --- | --- | --- |
| text | str | The text content |
| fontname | str | The font name |
| size | float | The font size |
| bold | bool | Whether the text is bold |
| italic | bool | Whether the text is italic |
| color | tuple | The text color as RGB tuple |
| confidence | float | OCR confidence (for OCR text) |
| source | str | 'pdf' or 'ocr' |
| font_variant | str | Font variant identifier (e.g., 'AAAAAB+') |

Additional Methods

| Method | Description | Parameters | Returns |
| --- | --- | --- | --- |
| font_info() | Get detailed font information | None | dict: Font properties |

Collections

ElementCollection

A collection of elements with batch operations.

class ElementCollection:
    """
    A collection of elements with batch operations.

    This class provides operations that can be applied to multiple elements at once.
    """

Main Methods

| Method | Description | Parameters | Returns |
| --- | --- | --- | --- |
| extract_text(keep_blank_chars=True, apply_exclusions=True) | Extract text from all elements | keep_blank_chars: Whether to keep blank characters; apply_exclusions: Whether to apply exclusion zones | str: Extracted text |
| filter(selector) | Filter elements by selector | selector: CSS-like selector string | ElementCollection |
| highlight(color=None, label=None, include_attrs=None) | Highlight all elements | color: RGBA color tuple; label: Label for the highlight; include_attrs: Attributes to display | ElementCollection (self) |
| first | Get the first element in the collection | N/A (property) | Element or None |
| last | Get the last element in the collection | N/A (property) | Element or None |
| highest() | Get the highest element on the page | None | Element or None |
| lowest() | Get the lowest element on the page | None | Element or None |
| leftmost() | Get the leftmost element on the page | None | Element or None |
| rightmost() | Get the rightmost element on the page | None | Element or None |
| __len__() | Get the number of elements | None | int |
| __getitem__(index) | Get an element by index | index: Index or slice | Element or ElementCollection |

PageCollection

A collection of pages with cross-page operations.

class PageCollection:
    """
    A collection of pages with cross-page operations.

    This class provides operations that can be applied across multiple pages.
    """

Main Methods

| Method | Description | Parameters | Returns |
| --- | --- | --- | --- |
| extract_text(keep_blank_chars=True, apply_exclusions=True) | Extract text from all pages | keep_blank_chars: Whether to keep blank characters; apply_exclusions: Whether to apply exclusion zones | str: Extracted text |
| find(selector, case=True, regex=False, apply_exclusions=True) | Find the first element matching selector across all pages | selector: CSS-like selector string; case: Case-sensitive search; regex: Use regex for :contains(); apply_exclusions: Whether to apply exclusion zones | Element or None |
| find_all(selector, case=True, regex=False, apply_exclusions=True) | Find all elements matching selector across all pages | selector: CSS-like selector string; case: Case-sensitive search; regex: Use regex for :contains(); apply_exclusions: Whether to apply exclusion zones | ElementCollection |
| get_sections(start_elements, end_elements=None, boundary_inclusion='start', new_section_on_page_break=False) | Get sections spanning multiple pages | start_elements: Elements marking section starts; end_elements: Elements marking section ends; boundary_inclusion: How to include boundaries; new_section_on_page_break: Whether to start new sections at page breaks | list[Region] |
| __len__() | Get the number of pages | None | int |
| __getitem__(index) | Get a page by index | index: Index or slice | Page or PageCollection |

OCR Classes

OCREngine

Base class for OCR engines.

class OCREngine:
    """
    Base class for OCR engines.

    This class provides the interface for OCR engines.
    """

Main Methods

| Method | Description | Parameters | Returns |
| --- | --- | --- | --- |
| process_image(image, languages=None, min_confidence=0.0, **kwargs) | Process an image with OCR | image: PIL Image; languages: Languages to use; min_confidence: Minimum confidence threshold | list: OCR results |

EasyOCREngine

OCR engine using EasyOCR.

class EasyOCREngine(OCREngine):
    """
    OCR engine using EasyOCR.

    Parameters:
        model_dir (str, optional): Directory for models. Default: None
    """

PaddleOCREngine

OCR engine using PaddleOCR.

class PaddleOCREngine(OCREngine):
    """
    OCR engine using PaddleOCR.

    Parameters:
        use_angle_cls (bool, optional): Use text direction classification. Default: False
        lang (str, optional): Language code. Default: "en"
        det (bool, optional): Use text detection. Default: True
        rec (bool, optional): Use text recognition. Default: True
        cls (bool, optional): Use text direction classification. Default: False
        det_model_dir (str, optional): Detection model directory. Default: None
        rec_model_dir (str, optional): Recognition model directory. Default: None
        verbose (bool, optional): Enable verbose output. Default: False
    """

Document QA Classes

DocumentQA

Class for document question answering.

class DocumentQA:
    """
    Class for document question answering.

    Parameters:
        model (str, optional): Model name or path. Default: "microsoft/layoutlmv3-base"
        device (str, optional): Device to use. Default: "cpu"
        verbose (bool, optional): Enable verbose output. Default: False
    """

Main Methods

| Method | Description | Parameters | Returns |
| --- | --- | --- | --- |
| ask(question, image, word_boxes, min_confidence=0.0, max_answer_length=None, language=None) | Ask a question about a document | question: Question to ask; image: Document image; word_boxes: Text positions; min_confidence: Minimum confidence threshold; max_answer_length: Maximum answer length; language: Language code | dict: Result with answer and metadata |

Selector Syntax

Natural PDF uses a CSS-like selector syntax to find elements in PDFs.

Basic Selectors

| Selector | Description | Example |
| --- | --- | --- |
| element_type | Match elements of this type | text, rect, line |
| [attribute=value] | Match elements with this attribute value | [fontname=Arial], [size=12] |
| [attribute>=value] | Match elements with attribute >= value | [size>=12] |
| [attribute<=value] | Match elements with attribute <= value | [size<=10] |
| [attribute~=value] | Match elements with attribute approximately equal | [color~=red], [color~=(1,0,0)] |
| [attribute*=value] | Match elements with attribute containing value | [fontname*=Arial] |

Pseudo-Classes

| Pseudo-Class | Description | Example |
| --- | --- | --- |
| :contains("text") | Match elements containing text | text:contains("Summary") |
| :starts-with("text") | Match elements starting with text | text:starts-with("Summary") |
| :ends-with("text") | Match elements ending with text | text:ends-with("2023") |
| :bold | Match bold text | text:bold |
| :italic | Match italic text | text:italic |

Attribute Names

| Attribute | Element Types | Description |
| --- | --- | --- |
| fontname | text | Font name |
| size | text | Font size |
| color | text, rect, line | Color |
| width | rect, line | Width |
| height | rect | Height |
| confidence | text (OCR) | OCR confidence score |
| source | text | Source ('pdf' or 'ocr') |
| type | region | Region type (e.g., 'table', 'title') |
| model | region | Layout model that detected the region |
| font-variant | text | Font variant identifier |
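
To make the selector grammar above concrete, the following sketch splits a selector string into its element type, attribute filters, and pseudo-classes. It is an illustration only, not Natural PDF's parser: `parse_selector` is a hypothetical function, and it ignores edge cases such as brackets inside a quoted `:contains()` argument.

```python
import re

# [name op value] where op is one of the operators listed in Basic Selectors.
_ATTR = re.compile(r"\[([\w-]+)(>=|<=|~=|\*=|=)([^\]]+)\]")
# :name or :name("argument")
_PSEUDO = re.compile(r':([\w-]+)(?:\("([^"]*)"\))?')


def parse_selector(selector):
    """Break a selector string into type, attribute filters, and pseudo-classes."""
    type_match = re.match(r"[\w-]+", selector)
    element_type = type_match.group(0) if type_match else None

    attributes = [
        (name, op, value.strip()) for name, op, value in _ATTR.findall(selector)
    ]

    # Remove attribute filters first so colons inside them are not misread.
    remainder = _ATTR.sub("", selector)
    pseudo_classes = [(m.group(1), m.group(2)) for m in _PSEUDO.finditer(remainder)]

    return {
        "type": element_type,
        "attributes": attributes,
        "pseudo_classes": pseudo_classes,
    }
```

Under these assumptions, `parse_selector('text[size>=12]:bold')` yields the type `text`, one filter `("size", ">=", "12")`, and one pseudo-class `("bold", None)`.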

Constants and Configuration

Color Names

Natural PDF supports color names in selectors.

| Color Name | RGB Value | Example |
| --- | --- | --- |
| red | (1, 0, 0) | [color~=red] |
| green | (0, 1, 0) | [color~=green] |
| blue | (0, 0, 1) | [color~=blue] |
| black | (0, 0, 0) | [color~=black] |
| white | (1, 1, 1) | [color~=white] |
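
One way to read the approximate operator `~=` for colors is as a distance check against the RGB values in the table above. The sketch below is an assumption about the semantics, not the library's code: the Euclidean metric, the `tolerance` default of 0.1, and the name `color_matches` are all hypothetical.

```python
# RGB values taken from the Color Names table, components in the 0..1 range.
NAMED_COLORS = {
    "red": (1, 0, 0),
    "green": (0, 1, 0),
    "blue": (0, 0, 1),
    "black": (0, 0, 0),
    "white": (1, 1, 1),
}


def color_matches(actual, target, tolerance=0.1):
    """Return True if `actual` is within `tolerance` of `target` in RGB space.

    `target` may be a color name from NAMED_COLORS or an RGB tuple.
    """
    if isinstance(target, str):
        target = NAMED_COLORS[target]
    # Euclidean distance between the two RGB triples (an assumed metric).
    distance = sum((a - t) ** 2 for a, t in zip(actual, target)) ** 0.5
    return distance <= tolerance
```

A slightly off-red such as `(0.98, 0.02, 0.0)` matches `"red"` under this tolerance, while pure blue does not.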

Region Types

Layout analysis models detect the following region types:

| Model | Region Types |
| --- | --- |
| YOLO | title, plain-text, table, figure, figure_caption, table_caption, table_footnote, isolate_formula, formula_caption, abandon |
| TATR | table, table-row, table-column, table-column-header |