
Modern PDF processing with Natural PDF

In [ ]:

# Install required packages
!pip install --upgrade --quiet 'natural-pdf[ai]>=0.5.0'

print('✓ Packages installed!')

Slides: slides.pdf

Multi-page flows

Sometimes you have data that flows over multiple columns, or pages, or just... isn't arranged in a "normal" top-to-bottom way.

In [1]:

from natural_pdf import PDF

pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/multicolumn.pdf")
page = pdf.pages[0]
page.show(width=800)

Out[1]:


Natural PDF deals with these by reflowing pages: you grab specific regions of a page and then paste them back together, either vertically or horizontally.

In this example we're splitting the page into three columns.

In [2]:

left = page.region(left=0, right=page.width/3, top=0, bottom=page.height)
mid = page.region(left=page.width/3, right=page.width/3*2, top=0, bottom=page.height)
right = page.region(left=page.width/3*2, right=page.width, top=0, bottom=page.height)
mid.show()

Out[2]:
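The three regions above are just equal slices of the page width. As a minimal sketch of that arithmetic (the helper name `column_bounds` and the 612-point page width are assumptions for illustration, not part of Natural PDF), the left and right edges for any number of columns can be computed like this:

```python
def column_bounds(page_width, n_columns):
    """Return (left, right) x-coordinates for n equal-width columns."""
    return [
        (i * page_width / n_columns, (i + 1) * page_width / n_columns)
        for i in range(n_columns)
    ]


# 612 points is the width of a US Letter page; used here only as sample input.
bounds = column_bounds(612, 3)
print(bounds)  # [(0.0, 204.0), (204.0, 408.0), (408.0, 612.0)]
```

Each pair could then be passed to page.region(left=..., right=..., top=0, bottom=page.height), the same call used in the cell above.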

Now let's stack them on top of each other.

In [3]:

from natural_pdf.flows import Flow

stacked = [left, mid, right]
flow = Flow(segments=stacked, arrangement="vertical")
flow.show()

Out[3]:

In [4]:

flow.show(in_context=False)

Out[4]:

Now any time we want to use spatial comparisons, like "find something below this," it just works.

In [5]:

region = (
    flow
    .find('text:contains("Table one")')
    .below(
        until='text:contains("Table two")',
        include_endpoint=False
    )
)
region.show()

Out[5]:
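To see why "below" can cross a column break, here is a toy model of vertical stacking. It is not Natural PDF's implementation: the `Line` class, `stack_vertically`, `below`, and the sample data are all hypothetical. Each column's y-coordinates are shifted by the height of the columns stacked before it, so the top of column two ends up "below" the bottom of column one.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    top: float  # y-position inside its own column


def stack_vertically(columns, column_height):
    """Flatten columns into one reading order, offsetting each column's y."""
    stacked = []
    for index, column in enumerate(columns):
        offset = index * column_height
        for line in sorted(column, key=lambda item: item.top):
            stacked.append(Line(line.text, line.top + offset))
    return stacked


def below(stacked, start, until):
    """Lines after `start` and before `until` (endpoint excluded)."""
    texts = [line.text for line in stacked]
    begin = texts.index(start) + 1
    end = texts.index(until, begin)
    return stacked[begin:end]


# Made-up content: the table starts at the bottom of column one
# and continues at the top of column two.
columns = [
    [Line("Intro", 10), Line("Table one", 700)],
    [Line("row A", 10), Line("row B", 30), Line("Table two", 60)],
]
stacked = stack_vertically(columns, column_height=792)
result = [line.text for line in below(stacked, "Table one", "Table two")]
print(result)  # ['row A', 'row B']
```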

It works for text, it works for tables, it works for anything. Let's see how we can get both tables on the page.

First we find the bold headers – we need to say width > 10 because otherwise it pulls some weird tiny empty boxes.

In [6]:

(
    flow
    .find_all('text[width>10]:bold')
    .show()
)

Out[6]:

Then we take each of those headers, and go down down down until we either hit another bold header or the "Here is a bit more text" text.

In [7]:

regions = (
    flow
    .find_all('text[width>10]:bold')
    .below(
        until='text[width>10]:bold|text:contains("Here is a bit")',
        include_endpoint=False
    )
)
regions.show()

Out[7]:

Now we can use .extract_table() on each individual region to give us a bunch of tables.

In [8]:

regions[1].extract_table().to_df()

Out[8]:

	index	number
0	XXX1	123
1	XXX2	456
2	XXX3	789
3	XXX4	1122
4	XXX5	1455
5	XXX6	1788
6	XXX7	2121
7	XXX8	2454
8	XXX9	2787
9	XXX10	3120
10	XXX11	3453
11	XXX12	3786
12	XXX13	4119
13	XXX14	4452
14	XXX15	4785
15	XXX16	5118
16	XXX17	5451
17	XXX18	5784
18	XXX19	6117
19	XXX20	6450
20	XXX22	7116
21	XXX23	7449
22	XXX24	7782
23	XXX25	8115
24	XXX26	8448
25	XXX27	8781
26	XXX28	9114
27	XXX29	9447
28	XXX30	9780
29	XXX31	10113
30	XXX32	10446
31	XXX33	10779
32	XXX34	11112
33	XXX35	11445
34	XXX36	11778
35	XXX37	12111
36	XXX38	12444
37	XXX39	12777
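One cheap sanity check after extraction is to look for gaps in a column that should be sequential. In the output above, the index column goes from XXX20 straight to XXX22. The sketch below flags that kind of gap. The helper `missing_numbers` is hypothetical, and the `labels` list is rebuilt by hand to mirror the output above rather than read from the PDF.

```python
import re


def missing_numbers(labels):
    """Return the integers missing from the trailing numbers of the labels."""
    numbers = [int(re.search(r"\d+$", label).group()) for label in labels]
    present = set(numbers)
    return [n for n in range(min(numbers), max(numbers) + 1) if n not in present]


# Hand-built stand-in for the 'index' column shown above (XXX1..XXX39, no XXX21).
labels = [f"XXX{n}" for n in range(1, 40) if n != 21]
gaps = missing_numbers(labels)
print(gaps)  # [21]
```

With a real DataFrame, the labels would come from the index column of the extracted table instead of the hand-built list.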

Layout analysis and magic table extraction

Similar to how we have feelings about what things are on a page (headers, tables, graphics), computers also have opinions! Just like some AI models have been trained to identify pictures of cats and dogs or to spell check, others are capable of layout analysis: YOLO, surya, and many more. TATR is one of the more useful ones for us, although it only does table detection.

But honestly: they're mostly trained on academic papers, so they aren't very good at the kinds of awful documents that journalists have to deal with. And with Natural PDF, you're probably selecting text[size>12]:bold in order to find headlines, anyway. But if your page has no readable text, layout models might be able to provide some useful information.

Let's start with YOLO, the default.

In [1]:

from natural_pdf import PDF

pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/needs-ocr.pdf")
page = pdf.pages[0]

In [10]:

page.analyze_layout('yolo')
(
    page
    .find_all('region')
    .show(group_by='type', width=800)
)

image 1/1 /var/folders/25/h3prywj14qb0mlkl2s8bxq5m0000gn/T/tmppwh3406v/temp_layout_image.png: 1024x800 2 titles, 3 plain texts, 2 abandons, 1 table, 759.1ms
Speed: 12.9ms preprocess, 759.1ms inference, 1.1ms postprocess per image at shape (1, 3, 1024, 800)

Out[10]:

In [11]:

page.find('table').apply_ocr()
text = page.extract_text()
print(text)

[INFO] 2026-05-23 18:05:49,096 [RapidOCR] base.py:22: Using engine_name: onnxruntime
[INFO] 2026-05-23 18:05:49,129 [RapidOCR] download_file.py:60: File exists and is valid: /Users/soma/Development/natural-pdf/.venv/lib/python3.11/site-packages/rapidocr/models/ch_PP-OCRv4_det_mobile.onnx
[INFO] 2026-05-23 18:05:49,130 [RapidOCR] main.py:57: Using /Users/soma/Development/natural-pdf/.venv/lib/python3.11/site-packages/rapidocr/models/ch_PP-OCRv4_det_mobile.onnx
[INFO] 2026-05-23 18:05:49,169 [RapidOCR] base.py:22: Using engine_name: onnxruntime
[INFO] 2026-05-23 18:05:49,171 [RapidOCR] download_file.py:60: File exists and is valid: /Users/soma/Development/natural-pdf/.venv/lib/python3.11/site-packages/rapidocr/models/ch_ppocr_mobile_v2.0_cls_mobile.onnx
[INFO] 2026-05-23 18:05:49,171 [RapidOCR] main.py:57: Using /Users/soma/Development/natural-pdf/.venv/lib/python3.11/site-packages/rapidocr/models/ch_ppocr_mobile_v2.0_cls_mobile.onnx
[INFO] 2026-05-23 18:05:49,196 [RapidOCR] base.py:22: Using engine_name: onnxruntime
[INFO] 2026-05-23 18:05:49,203 [RapidOCR] download_file.py:60: File exists and is valid: /Users/soma/Development/natural-pdf/.venv/lib/python3.11/site-packages/rapidocr/models/en_PP-OCRv4_rec_mobile.onnx
[INFO] 2026-05-23 18:05:49,203 [RapidOCR] main.py:57: Using /Users/soma/Development/natural-pdf/.venv/lib/python3.11/site-packages/rapidocr/models/en_PP-OCRv4_rec_mobile.onnx

Statute Description Level Repeat?
4.12.7 Unsanitary Working Conditions. Critical
5.8.3 Inadequate Protective Equipment. Serious
6.3.9 Ineffective Injury Prevention. Serious
7.1.5 Failure to Properly Store Hazardous Materials.. Critical
8.9.2 Lack of Adequate Fire Safety Measures. Serious
9.6.4 Inadequate Ventilation Systems. Serious X
10.2.7 Insufficient Employee Training for Safe Work Practices. Serious

Better layout analysis with tables

There are other options in Natural PDF, like TATR (Microsoft's Table Transformer) and VLM-based approaches, but in the end, guides are the answer. You just point them at text, lines, whitespace... and they find the table for you.

In [13]:

from natural_pdf import PDF

pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/needs-ocr.pdf")
page = pdf.pages[0]

page.apply_ocr()

table_area = (
    page
        .find("text:contains(Violations)")
        .below(until='text:contains(Jungle Health)', include_endpoint=False)
        .trim()
)
table_area.show(crop=True, width=700)

Out[13]:

Let's tell Natural PDF to split columns based on the headers - we OCR'd them but they seem reliable.

In [14]:

guide = table_area.guides()
guide.vertical.from_headers(["Statute", "Description", "Level", "Repeat?"])
guide.show(width=700)

Out[14]:

The big problem is that in trying to evenly divide up the columns, it went right through our text! We can move each line to the nearest empty area by snapping to whitespace.

In [15]:

guide.snap_to_whitespace()
guide.show(width=700)

Out[15]:
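The idea behind snapping can be sketched in a few lines. This is a toy model under stated assumptions, not the library's actual algorithm: `find_gaps`, `snap_to_gap`, and the box coordinates are all made up for illustration. It finds the horizontal gaps that no text box covers, then moves any guide that cuts through text to the centre of the nearest gap.

```python
def find_gaps(boxes, left, right):
    """Return (start, end) x-ranges between left and right not covered by a box."""
    gaps = []
    cursor = left
    for x0, x1 in sorted(boxes):
        if x0 > cursor:
            gaps.append((cursor, x0))
        cursor = max(cursor, x1)
    if cursor < right:
        gaps.append((cursor, right))
    return gaps


def snap_to_gap(guide, gaps):
    """Leave a guide alone if it is already in a gap, else move it to the nearest gap centre."""
    for start, end in gaps:
        if start <= guide <= end:
            return guide
    centres = [(start + end) / 2 for start, end in gaps]
    return min(centres, key=lambda centre: abs(centre - guide))


# Made-up text boxes as (x0, x1) spans across a 460-point-wide table area.
boxes = [(10, 60), (80, 200), (220, 330), (350, 450)]
gaps = find_gaps(boxes, left=0, right=460)

# Evenly spaced guides, two of which cut through text.
guides = [115, 230, 345]
snapped = [snap_to_gap(guide, gaps) for guide in guides]
print(snapped)  # [70.0, 210.0, 345]
```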

Sometimes you don't have lines to split up rows, but once you've split the columns you can pick one that functions as an anchor. We'll take the first one.

In [18]:

rows = guide.column(0).find_all('text')
rows.show(crop=True, width=50)

Out[18]:

We can tell Natural PDF to use those as row dividers. The top of each will count as a new row.

In [19]:

guide.horizontal.from_content(rows)
guide.show()

Out[19]:
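As a rough sketch of "the top of each will count as a new row": take the top edge of every text box in the anchor column, merge tops that are nearly identical, and close the last row at the bottom of the table area. The function `row_guides`, its `tolerance` parameter, and the closing guide at the bottom are assumptions for illustration, not confirmed Natural PDF behaviour.

```python
def row_guides(text_boxes, bottom, tolerance=2.0):
    """Build horizontal guides from the top edge of each (top, bottom) text box."""
    guides = []
    for top in sorted(top for top, _ in text_boxes):
        # Skip tops that sit within `tolerance` of the previous guide.
        if not guides or top - guides[-1] > tolerance:
            guides.append(top)
    guides.append(bottom)  # assumed: one extra guide closes the final row
    return guides


# Made-up boxes; the third is a near-duplicate of the second.
boxes = [(100, 112), (130, 142), (130.5, 142), (160, 172)]
guides = row_guides(boxes, bottom=190)
print(guides)  # [100, 130, 160, 190]
```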

Now we just need to detect those checkboxes...

In [40]:

page.detect_checkboxes()

Out[40]:

<ElementCollection[Region](count=7)>

And we can ask the guide for the table!

In [22]:

guide.extract_table().to_df()

Out[22]:

	Statute	Description	Level	Repeat?
0	4.12.7	Unsanitary Working Conditions.	Critical	None
1	5.8.3	Inadequate Protective Equipment.	Serious	None
2	6.3.9	Ineffective Injury Prevention.	Serious	None
3	7.1.5	Failure to Properly Store Hazardous Materials..	Critical	None
4	8.9.2	Lack of Adequate Fire Safety Measures.	Serious	None
5	9.6.4	Inadequate Ventilation Systems.	Serious	None
6	10.2.7	Insufficient Employee Training for Safe Work P...	Serious	None
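Once vertical and horizontal guides exist, turning them into a table is mostly bookkeeping: each word lands in the cell whose guides surround it. The sketch below is one simple way to do that with `bisect`. It is not the library's implementation; `build_table` and the coordinates are hypothetical. Empty cells come back as None, which matches the Repeat? column above.

```python
from bisect import bisect_right


def build_table(words, vertical, horizontal):
    """Assign (x, y, text) words to grid cells defined by sorted guide positions."""
    n_rows = len(horizontal) - 1
    n_cols = len(vertical) - 1
    cells = [[[] for _ in range(n_cols)] for _ in range(n_rows)]
    # Sort by y, then x, so words inside a cell keep reading order.
    for x, y, text in sorted(words, key=lambda word: (word[1], word[0])):
        col = bisect_right(vertical, x) - 1
        row = bisect_right(horizontal, y) - 1
        if 0 <= row < n_rows and 0 <= col < n_cols:
            cells[row][col].append(text)
    return [[" ".join(cell) or None for cell in row] for row in cells]


# Made-up guide positions and word centres, loosely modelled on the table above.
vertical = [0, 70, 300, 380, 460]
horizontal = [0, 20, 40]
words = [
    (30, 10, "Statute"), (120, 10, "Description"),
    (340, 10, "Level"), (420, 10, "Repeat?"),
    (30, 30, "4.12.7"), (100, 30, "Unsanitary"), (160, 30, "Working"),
    (230, 30, "Conditions."), (340, 30, "Critical"),
]
table = build_table(words, vertical, horizontal)
for row in table:
    print(row)
```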

That was overly complex for pedagogical reasons: we could have also just used the lines. Natural PDF counts up the pixels and sees what's likely to be a dark vertical or horizontal line. You can pass extra parameters to customize the detection, like requiring a line to be more than 40% filled (threshold=0.4) or asking for only 5 guidelines (n=5).

In [23]:

from natural_pdf.analyzers.guides import Guides

guide = table_area.guides()
guide.horizontal.from_lines(detection_method='pixels')
guide.vertical.from_lines(detection_method='pixels')
guide.show()

Out[23]:
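The pixel-counting idea can be illustrated without any imaging library. The sketch below is an assumption-laden toy, not Natural PDF's code: `dark_fraction`, `detect_lines`, the darkness cutoff of 128, and the rule that `n` overrides `threshold` are all made up here. It scores every row and column by the fraction of dark pixels, then keeps either the ones above the threshold or the `n` darkest.

```python
def dark_fraction(values, dark_below=128):
    """Fraction of grayscale values (0-255) that count as dark."""
    return sum(value < dark_below for value in values) / len(values)


def detect_lines(image, threshold=0.4, n=None):
    """Return (row_indices, column_indices) that look like ruled lines."""
    row_scores = [(dark_fraction(row), y) for y, row in enumerate(image)]
    col_scores = [(dark_fraction(col), x) for x, col in enumerate(zip(*image))]

    def pick(scores):
        if n is not None:
            best = sorted(scores, reverse=True)[:n]  # the n darkest candidates
        else:
            best = [score for score in scores if score[0] > threshold]
        return sorted(position for _, position in best)

    return pick(row_scores), pick(col_scores)


# A tiny 6x8 white image with one dark row (y=2) and one dark column (x=5).
image = [[255] * 8 for _ in range(6)]
for x in range(8):
    image[2][x] = 0
for y in range(6):
    image[y][5] = 0

rows, cols = detect_lines(image, threshold=0.4)
print(rows, cols)  # [2] [5]
```

Rows and columns that are only crossed by a perpendicular line score well under 40% here, which is why the threshold separates real lines from incidental dark pixels.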

In [26]:

page.detect_checkboxes()

guide.extract_table().to_df()

Out[26]:

	Statute	Description	Level	Repeat?
0	4.12.7	Unsanitary Working Conditions.	Critical	[CHECKED]
1	5.8.3	Inadequate Protective Equipment.	Serious	[CHECKED]
2	6.3.9	Ineffective Injury Prevention.	Serious	[UNCHECKED]
3	7.1.5	Failure to Properly Store Hazardous Materials..	Critical	[UNCHECKED]
4	8.9.2	Lack of Adequate Fire Safety Measures.	Serious	[UNCHECKED]
5	9.6.4	Inadequate Ventilation Systems.	Serious	[CHECKED]
6	10.2.7	Insufficient Employee Training for Safe Work P...	Serious	[UNCHECKED]