Restructuring page content¶

Flows are a way to restructure pages whose content is not in normal one-page reading order. This might be columnar data, tables that span pages, etc.

A multi-column PDF¶

Here is a multi-column PDF.

In [1]:

from natural_pdf import PDF
from natural_pdf.flows import Flow

# Load the sample multi-column PDF and preview its first page.
pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/multicolumn.pdf")
page = pdf.pages[0]
page.show(width=500)

We can grab individual columns from it.

In [2]:

# Split the page into three equal-width vertical strips, one per column.
left = page.region(right=page.width/3)
mid = page.region(left=page.width/3, right=page.width/3*2)
right = page.region(left=page.width/3*2)

mid.show(width=500)
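
The thirds above are computed by hand. For a page with a different number of equal-width columns, the same arithmetic can be generalized. The following is a minimal sketch in plain Python; `column_bounds` is a hypothetical helper written for this illustration, not part of natural_pdf, and the 612-point page width is an assumed example value.

```python
def column_bounds(page_width: float, n_columns: int) -> list[tuple[float, float]]:
    """Return (left, right) x-coordinates for n equal-width columns."""
    if n_columns < 1:
        raise ValueError("n_columns must be at least 1")
    return [
        (page_width * i / n_columns, page_width * (i + 1) / n_columns)
        for i in range(n_columns)
    ]


# Assumed example: a 612-point-wide page split into three columns.
bounds = column_bounds(612, 3)
print(bounds)  # [(0.0, 204.0), (204.0, 408.0), (408.0, 612.0)]
```

Each `(left, right)` pair could then be passed to `page.region(left=..., right=...)` in the same way as the three calls above.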

Restructuring¶

We can use Flows to stack the three columns on top of each other.

In [3]:

# Reading order: left column first, then the middle, then the right.
stacked = [left, mid, right]
flow = Flow(segments=stacked, arrangement="vertical")
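
To build intuition for what a vertical arrangement means, it can help to treat the stacked columns as one tall virtual page. The sketch below is a toy model in plain Python and makes no claim about how natural_pdf implements `Flow` internally; `locate` and the 700-point column heights are assumptions made for illustration.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(segment_heights, flow_y):
    """Map a y-offset in the stacked flow to (segment index, y within that segment)."""
    total = sum(segment_heights)
    if not 0 <= flow_y < total:
        raise ValueError("flow_y is outside the stacked flow")
    # starts[i] is the offset in the stacked flow where segment i begins.
    starts = [0, *accumulate(segment_heights)][:-1]
    index = bisect_right(starts, flow_y) - 1
    return index, flow_y - starts[index]


# Three columns, each assumed to be 700 points tall.
heights = [700, 700, 700]
print(locate(heights, 50))    # (0, 50)
print(locate(heights, 700))   # (1, 0)
print(locate(heights, 1500))  # (2, 100)
```

In this model, moving "down" past the bottom of one column simply continues at the top of the next one.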

As a result, we can find text in the first column and ask for everything "below" it, continuing past the bottom of that column until it hits content in the second column.

In [4]:

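region = (
    flow
    # Start at the "Table one" heading...
    .find('text:contains("Table one")')
    # ...and take everything below it, stopping at "Table two".
    .below(
        until='text:contains("Table two")',
        include_endpoint=False  # leave the "Table two" element out of the region
    )
)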
region.show()
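
The idea of "below, until" across column breaks can be sketched without a PDF at all. The following toy example in plain Python treats each column as a list of text lines; `below_until` and the sample lines are invented for illustration and only mirror the names of the `until` and `include_endpoint` arguments used above.

```python
def below_until(columns, start, until, include_endpoint=True):
    """Collect the lines after `start`, continuing across columns until `until`."""
    # Stacking columns vertically is the same as reading them one after another.
    lines = [line for column in columns for line in column]
    begin = next((i for i, line in enumerate(lines) if start in line), None)
    if begin is None:
        raise ValueError(f"no line contains {start!r}")
    collected = []
    for line in lines[begin + 1:]:
        if until in line:
            if include_endpoint:
                collected.append(line)
            break
        collected.append(line)
    return collected


# Invented sample data: the first table starts in column 1 and ends in column 2.
columns = [
    ["Intro text", "Table one", "1 123", "2 456"],
    ["3 789", "Table two", "1 abc"],
    ["2 def"],
]
rows = below_until(columns, "Table one", "Table two", include_endpoint=False)
print(rows)  # ['1 123', '2 456', '3 789']
```

With `include_endpoint=True`, the matching "Table two" line would be appended to the result as well.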

While you can't easily extract tables yet, you can at least extract text!

In [5]:

print(region.extract_text())

index number
1 123
2 456
3 789
4 1122
5 1455
6 1788
7 2121
8 2454
9 2787
10 3120
11 3453
12 3786
13 4119
14 4452
15 4785
16 5118
17 5451
18 5784
19 6117
20 6450
21 6783
22 7116
23 7449
24 7782
25 8115
26 8448
27 8781
28 9114
29 9447
30 9780
31 10113
32 10446
33 10779
34 11112
35 11445
36 11778
37 12111
38 12444
39 12777
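
Since the extracted text is one whitespace-separated row per line, it can be turned into structured rows with a few lines of plain Python. This is a minimal sketch that assumes every value after the header row is an integer, as in the output above; `parse_rows` is a hypothetical helper, and `sample` is a shortened copy of that output.

```python
def parse_rows(text):
    """Turn whitespace-separated text with a header line into a list of dicts."""
    header, *body = [line.split() for line in text.strip().splitlines()]
    # Assumes every cell below the header is an integer.
    return [dict(zip(header, map(int, row))) for row in body]


# The first few lines of the extracted text shown above.
sample = "index number\n1 123\n2 456\n3 789"
rows = parse_rows(sample)
print(rows[0])  # {'index': 1, 'number': 123}
```

In the notebook, the same function could be called as `parse_rows(region.extract_text())`.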

find_all and reflows¶

Let's say we have a few headers...

In [6]:

(
    flow
    # Highlight every bold text element wider than 10 units.
    .find_all('text[width>10]:bold')
    .show()
)

...it's easy to extract each table that's between them.

In [7]:

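regions = (
    flow
    # Start from every bold header...
    .find_all('text[width>10]:bold')
    # ...and take what is below each one, stopping at the next bold header
    # or at the "Here is a bit" text, whichever comes first.
    .below(
        until='text[width>10]:bold|text:contains("Here is a bit")',
        include_endpoint=False
    )
)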
regions.show()
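
The same pattern, with several starting points and more than one stopping condition, can be modeled on plain lists of text. The sketch below is a toy illustration in plain Python, not natural_pdf behavior; `split_sections`, the two predicates, and the sample lines are all invented for this example.

```python
def split_sections(lines, is_header, is_stop=lambda line: False):
    """Group the lines that follow each header, ending a group at a header or stop line."""
    sections = {}
    current = None
    for line in lines:
        if is_header(line):
            # A new header closes the previous group and opens a new one.
            current = sections.setdefault(line, [])
        elif is_stop(line):
            # A stop line closes the current group without opening another.
            current = None
        elif current is not None:
            current.append(line)
    return sections


# Invented sample: three headers with a paragraph interrupting the tables.
lines = [
    "Table one", "1 123", "2 456",
    "Table two", "1 abc",
    "Here is a bit of text", "not part of any table",
    "Table three", "1 xyz",
]
sections = split_sections(
    lines,
    is_header=lambda line: line.startswith("Table"),
    is_stop=lambda line: line.startswith("Here is a bit"),
)
print(sections["Table two"])  # ['1 abc']
```

Lines that appear after a stop line and before the next header are dropped, which is the role the second `until` alternative plays in the selector above.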

Merging tables that span pages¶

TK