In [ ]:
# Install required packages
!pip install --upgrade --quiet 'natural-pdf[ai]>=0.5.0'

print('✓ Packages installed!')

Slides: slides.pdf

Multi-page flows

Sometimes you have data that flows over multiple columns, or pages, or just... isn't arranged in a "normal" top-to-bottom way.

In [1]:
from natural_pdf import PDF

pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/multicolumn.pdf")
page = pdf.pages[0]
page.show(width=800)

Natural PDF handles these by reflowing pages: you grab specific regions of a page, then paste them back together either vertically or horizontally.

In this example we're splitting the page into three columns.

In [2]:
left = page.region(left=0, right=page.width/3, top=0, bottom=page.height)
mid = page.region(left=page.width/3, right=page.width/3*2, top=0, bottom=page.height)
right = page.region(left=page.width/3*2, right=page.width, top=0, bottom=page.height)
mid.show()
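The three hand-written calls above follow one pattern, so the arithmetic can be pulled into a small helper. This is a minimal sketch in plain Python: `column_bounds` is a hypothetical name, not part of Natural PDF, and it only computes the numbers you would pass along to `page.region(...)`.

```python
def column_bounds(width, height, n_columns):
    """Split a page of the given size into n equal-width column boxes.

    Returns a list of (left, top, right, bottom) tuples, left to right.
    Hypothetical helper: it does the arithmetic only, not the cropping.
    """
    if n_columns < 1:
        raise ValueError("n_columns must be at least 1")
    # Column edges, including the far left (0) and far right (width)
    edges = [width * i / n_columns for i in range(n_columns + 1)]
    return [(edges[i], 0, edges[i + 1], height) for i in range(n_columns)]


# Example: a 612 x 792 point page split into three columns
for left, top, right, bottom in column_bounds(612, 792, 3):
    print(left, top, right, bottom)
```

Each tuple lines up with the `left`, `top`, `right` and `bottom` keyword arguments used in the cell above.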

Now let's stack them on top of each other.

In [3]:
from natural_pdf.flows import Flow

stacked = [left, mid, right]
flow = Flow(segments=stacked, arrangement="vertical")
flow.show()
In [4]:
flow.show(in_context=False)
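Conceptually, a vertical flow treats the segments as if they were glued end to end, so a position in the second column sits "below" everything in the first. The sketch below shows that idea with plain numbers. It illustrates the concept only and is not Natural PDF's actual implementation; `flow_offsets` and `to_flow_y` are hypothetical names.

```python
def flow_offsets(segment_heights):
    """Return the starting y-offset of each segment in a vertical stack."""
    offsets = []
    running_total = 0
    for height in segment_heights:
        offsets.append(running_total)
        running_total += height
    return offsets


def to_flow_y(segment_index, y_in_segment, segment_heights):
    """Map a y-position inside one segment to a y-position in the stack."""
    return flow_offsets(segment_heights)[segment_index] + y_in_segment


# Three columns, each 792 points tall, stacked left -> mid -> right
heights = [792, 792, 792]
print(flow_offsets(heights))       # [0, 792, 1584]
print(to_flow_y(1, 100, heights))  # 892
```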

Now any time we want to use spatial comparisons, like "find something below this," it just works.

In [5]:
region = (
    flow
    .find('text:contains("Table one")')
    .below(
        until='text:contains("Table two")',
        include_endpoint=False
    )
)
region.show()
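One way to picture "below, until" once the columns are stacked: put every piece of text in reading order, start after the anchor, and stop at the endpoint. The following is a simplified sketch on a plain list of strings, assuming the reading order is already known. `below_until` is a hypothetical helper, and the sample strings are invented for illustration.

```python
def below_until(items, start_text, until_text, include_endpoint=False):
    """Return the items after the first one containing start_text,
    stopping at the first later item containing until_text."""
    start = next((i for i, item in enumerate(items) if start_text in item), None)
    if start is None:
        raise ValueError(f"no item contains {start_text!r}")
    selected = []
    for item in items[start + 1:]:
        if until_text in item:
            if include_endpoint:
                selected.append(item)
            break
        selected.append(item)
    return selected


# Reading order across the stacked columns (invented sample content)
items = ["Intro", "Table one", "row a", "row b", "Table two", "row c"]
print(below_until(items, "Table one", "Table two"))
# ['row a', 'row b']
```

Setting `include_endpoint=True` in this sketch keeps the "Table two" item as the last entry, which mirrors the flag used in the cell above.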

It works for text, it works for tables, it works for anything. Let's see how we can get both tables on the page.

First we find the bold headers. We add width>10 to the selector because otherwise it also pulls in some weird tiny empty boxes.

In [6]:
(
    flow
    .find_all('text[width>10]:bold')
    .show()
)
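The selector above combines two conditions: the text has to be bold, and it has to be wider than 10 points. As a rough sketch of the same filter on plain dictionaries (the `bold_and_wide` helper and the sample elements are made up for illustration, and this is not how Natural PDF evaluates selectors):

```python
def bold_and_wide(elements, min_width=10):
    """Keep elements that are bold and wider than min_width points."""
    return [el for el in elements if el["bold"] and el["width"] > min_width]


# Invented sample elements: one real header and one tiny empty box
elements = [
    {"text": "Table one", "bold": True, "width": 58.0},
    {"text": "", "bold": True, "width": 2.5},
    {"text": "some body text", "bold": False, "width": 80.0},
]
print([el["text"] for el in bold_and_wide(elements)])  # ['Table one']
```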

Then we take each of those headers, and go down down down until we either hit another bold header or the "Here is a bit more text" text.

In [7]:
regions = (
    flow
    .find_all('text[width>10]:bold')
    .below(
        until='text[width>10]:bold|text:contains("Here is a bit")',
        include_endpoint=False
    )
)
regions.show()

Now we can use .extract_table() on each individual region to turn it into a table. Here's the second one.

In [8]:
regions[1].extract_table().to_df()
Out[8]:
    index  number
0    XXX1     123
1    XXX2     456
2    XXX3     789
3    XXX4    1122
4    XXX5    1455
5    XXX6    1788
6    XXX7    2121
7    XXX8    2454
8    XXX9    2787
9   XXX10    3120
10  XXX11    3453
11  XXX12    3786
12  XXX13    4119
13  XXX14    4452
14  XXX15    4785
15  XXX16    5118
16  XXX17    5451
17  XXX18    5784
18  XXX19    6117
19  XXX20    6450
20  XXX22    7116
21  XXX23    7449
22  XXX24    7782
23  XXX25    8115
24  XXX26    8448
25  XXX27    8781
26  XXX28    9114
27  XXX29    9447
28  XXX30    9780
29  XXX31   10113
30  XXX32   10446
31  XXX33   10779
32  XXX34   11112
33  XXX35   11445
34  XXX36   11778
35  XXX37   12111
36  XXX38   12444
37  XXX39   12777
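The cell above converts only `regions[1]`. To get every table at once, the same two calls can be applied in a loop. A minimal sketch: `tables_from_regions` is a hypothetical helper name, and it assumes the collection can be iterated (indexing suggests so, but it isn't shown above) and that each item offers `.extract_table()` with a result that has `.to_df()`. The stand-in classes exist only so the sketch runs without a PDF.

```python
def tables_from_regions(regions):
    """Call .extract_table().to_df() on each region and collect the results."""
    return [region.extract_table().to_df() for region in regions]


# --- Stand-ins so the sketch runs without a PDF (not Natural PDF classes) ---
class FakeTable:
    def __init__(self, rows):
        self.rows = rows

    def to_df(self):
        # A real table result would hand back a pandas DataFrame here
        return self.rows


class FakeRegion:
    def __init__(self, rows):
        self.rows = rows

    def extract_table(self):
        return FakeTable(self.rows)


fake_regions = [FakeRegion([["a", 1]]), FakeRegion([["b", 2], ["c", 3]])]
tables = tables_from_regions(fake_regions)
print(len(tables))  # 2
```

With the notebook's own objects, the call would be `tables = tables_from_regions(regions)`.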

Layout analysis and magic table extraction

Just like we have feelings about what things are on a page (headers, tables, graphics), computers have opinions too! Some AI models have been trained to identify pictures of cats and dogs or to spell check, and others are capable of layout analysis: YOLO, Surya, etc etc etc. There are a million! TATR is one of the useful ones for us because it's built just for table detection.

But honestly: they're mostly trained on academic papers, so they aren't very good at the kinds of awful documents that journalists have to deal with. And with Natural PDF, you're probably selecting text[size>12]:bold in order to find headlines, anyway. But if your page has no readable text, they might be able to provide some useful information.

Let's start with YOLO, the default.

In [9]:
from natural_pdf import PDF

pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/needs-ocr.pdf")
page = pdf.pages[0]
In [10]:
page.analyze_layout('yolo')
(
    page
    .find_all('region')
    .show(group_by='type', width=800)
)

image 1/1 /var/folders/25/h3prywj14qb0mlkl2s8bxq5m0000gn/T/tmp_ku_yary/temp_layout_image.png: 1024x800 2 titles, 3 plain texts, 2 abandons, 1 table, 1445.2ms
Speed: 5.8ms preprocess, 1445.2ms inference, 1.4ms postprocess per image at shape (1, 3, 1024, 800)
In [11]:
page.find('table').apply_ocr()
text = page.extract_text()
print(text)
Using CPU. Note: This module is much faster with a GPU.
Statute Description Level Repeat?
4.12.7 Unsanitary Working Conditions Critical
5.8.3 Inadequate Protective Equipment: Serious
6.3.9 Ineffective Injury Prevention: Serious
7.1.5 Failure to Properly Store Hazardous Materials: Critical
8.9.2 Lack of Adequate Fire Safety Measures. Serious
9.6.4 Inadequate Ventilation Systems. Serious
10.2.7 Insufficient Employee Training for Safe Work Practices. Serious

Better layout analysis with tables

Let's see what TATR (Microsoft's Table Transformer) finds for us.

In [12]:
page.analyze_layout('tatr')
page.find_all('region').show(group_by='type', width=800)

TATR finds so much stuff that it's all overlapping.

Instead, we can look at one piece at a time.

In [13]:
# Region types to try: table-cell, table-row, table-column
# expand(-2) shrinks each region a little before showing it
page.find_all('region[type=table-row]').expand(-2).show(crop=True, width=800)
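Neighboring table rows share edges, so their outlines draw on top of each other. Assuming a negative amount pulls every side inward by that many points (which is how the cell above uses `.expand(-2)`), the effect on a bounding box can be sketched like this. `shrink_bbox` is a hypothetical helper with invented numbers, not the library's code.

```python
def shrink_bbox(bbox, amount):
    """Pull each side of (x0, top, x1, bottom) inward by `amount` points."""
    x0, top, x1, bottom = bbox
    if x1 - x0 <= 2 * amount or bottom - top <= 2 * amount:
        raise ValueError("box is too small to shrink by that amount")
    return (x0 + amount, top + amount, x1 - amount, bottom - amount)


# Two stacked rows that share an edge at y=120 (invented numbers)
row_a = (100, 100, 500, 120)
row_b = (100, 120, 500, 140)
print(shrink_bbox(row_a, 2))  # (102, 102, 498, 118)
print(shrink_bbox(row_b, 2))  # (102, 122, 498, 138)
```

After shrinking, the two rows no longer touch: there is a 4-point gap between them, which makes each outline easy to tell apart.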
In [14]:
# Grab all of the columns
cols = page.find_all('region[type=table-column]')

# Take one of the columns and apply OCR to it
cols[2].apply_ocr()
text = cols[2].extract_text()
print(text)
Level
Critical
Serious
Serious
Critical
Serious
Serious
Serious
In [15]:
len(cols[2].find_all('text[source=ocr]'))
Out[15]:
8
In [16]:
page.find('table').show()
In [17]:
data = page.find('table').extract_table()
data
Out[17]:
TableResult(rows=9…)

YOLO

In [18]:
page.analyze_layout()
page.find_all('region').show(group_by="type")

image 1/1 /var/folders/25/h3prywj14qb0mlkl2s8bxq5m0000gn/T/tmpm2fdyzg6/temp_layout_image.png: 1024x800 2 titles, 3 plain texts, 2 abandons, 1 table, 775.6ms
Speed: 4.5ms preprocess, 775.6ms inference, 1.2ms postprocess per image at shape (1, 3, 1024, 800)
In [19]:
page.find("region[type=table]").apply_ocr()
Out[19]:
<Region type='table' source='detected' bbox=(99.85887908935547, 815.1997680664062, 1146.6968994140625, 1153.8369140625)>
In [20]:
text = page.extract_text()
print(text)
Statute Description Level Repeat?
4.12.7 Unsanitary Working Conditions Critical
5.8.3 Inadequate Protective Equipment: Serious
6.3.9 Ineffective Injury Prevention: Serious
7.1.5 Failure to Properly Store Hazardous Materials: Critical
8.9.2 Lack of Adequate Fire Safety Measures. Serious
9.6.4 Inadequate Ventilation Systems. Serious
10.2.7 Insufficient Employee Training for Safe Work Practices. Serious
In [21]:
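from natural_pdf.analyzers.guides import Guides

# Build guides inside the table region that layout analysis detected
table_area = page.find("region[type=table]")
guides = Guides(table_area)

# Row boundaries: taken from lines found in the table
guides.horizontal.from_lines()
# Column boundaries: placed at these header words
guides.vertical.from_content(["Description", "Level", "Repeat"])
# Nudge the column guides into nearby gaps between text
guides.vertical.snap_to_whitespace()
guides.show()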
In [22]:
guides.extract_table().to_df()
Out[22]:
   Statute  Description                                        Level     Repeat?
0  4.12.7   Unsanitary Working Conditions                      Critical  None
1  5.8.3    Inadequate Protective Equipment:                   Serious   None
2  6.3.9    Ineffective Injury Prevention:                     Serious   None
3  7.1.5    Failure to Properly Store Hazardous Materials:     Critical  None
4  8.9.2    Lack of Adequate Fire Safety Measures.             Serious   None
5  9.6.4    Inadequate Ventilation Systems.                    Serious   None
6  10.2.7   Insufficient Employee Training for Safe Work P...  Serious   None
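
Guides are easier to reason about as two lists of positions: x-coordinates for column boundaries and y-coordinates for row boundaries. Every pair of neighboring guides in each direction bounds one cell. The sketch below shows that idea only. It is not how `guides.extract_table()` is implemented, `cells_from_guides` is a hypothetical name, and the coordinates are invented.

```python
def cells_from_guides(vertical, horizontal):
    """Build a grid of (x0, top, x1, bottom) cells from guide positions.

    vertical:   x-positions of column boundaries
    horizontal: y-positions of row boundaries
    Returns a list of rows, each row a list of cell boxes.
    """
    xs = sorted(vertical)
    ys = sorted(horizontal)
    return [
        [(x0, top, x1, bottom) for x0, x1 in zip(xs, xs[1:])]
        for top, bottom in zip(ys, ys[1:])
    ]


# 3 vertical guides -> 2 columns, 3 horizontal guides -> 2 rows
grid = cells_from_guides([100, 300, 500], [800, 850, 900])
print(len(grid), len(grid[0]))  # 2 2
print(grid[0][1])               # (300, 800, 500, 850)
```

This is also why the header words matter in the cell above: one guide per column start, plus the outer edges, is enough to define every cell.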