In [1]:
from natural_pdf import PDF
from natural_pdf.flows import Flow
pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/multicolumn.pdf")
page = pdf.pages[0]
page.to_image(width=500)
CropBox missing from /Page, defaulting to MediaBox
Out[1]:
The page has three columns. We can grab each one as its own region by splitting the page width into thirds.
In [2]:
left = page.region(right=page.width/3)
mid = page.region(left=page.width/3, right=page.width/3*2)
right = page.region(left=page.width/3*2)
mid.show(width=500)
Out[2]:
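The thirds above are plain arithmetic on the page width, so the same split generalizes to any number of equal columns. The sketch below is a minimal illustration in plain Python; `column_bounds` is a hypothetical helper (not part of natural_pdf), and the width of 612 points is only an example value.

```python
def column_bounds(page_width, n_columns):
    """Return (left, right) x-coordinates for n equal-width columns."""
    step = page_width / n_columns
    return [(i * step, (i + 1) * step) for i in range(n_columns)]


# 612 pt is the width of a US Letter page; it is used here only as an example.
bounds = column_bounds(612, 3)
for left_x, right_x in bounds:
    print(f"{left_x:.1f} -> {right_x:.1f}")
```

Each pair could then be passed along as `page.region(left=left_x, right=right_x)`, mirroring the calls in the cell above.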
Restructuring
We can use Flows to stack the three columns on top of each other.
In [3]:
stacked = [left, mid, right]
flow = Flow(segments=stacked, arrangement="vertical")
Because the flow treats the three columns as one continuous strip, we can find text in the first column and grab everything "below" it, continuing until we hit matching content in the second column.
In [4]:
region = (
    flow
    .find('text:contains("Table one")')
    .below(
        until='text:contains("Table two")',
        include_endpoint=False
    )
)
region.show()
Out[4]:
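To see why stacking makes this work, it can help to model the idea without any PDF at all. The sketch below is a conceptual illustration only, not how natural_pdf implements flows: the column contents are made-up data and `lines_between` is a hypothetical helper. Once the columns are joined end to end, "below X until Y" becomes a simple slice, even when X and Y sit in different columns.

```python
# Hypothetical data: each inner list holds one column's lines, top to bottom.
columns = [
    ["Intro text", "Table one", "1 123", "2 456"],
    ["3 789", "4 1122", "Table two", "10 20"],
    ["More text"],
]


def lines_between(columns, start, stop):
    """Collect lines after `start` and before `stop`, reading columns in order."""
    # Stack the columns into one continuous sequence, left column first.
    stacked = [line for column in columns for line in column]
    begin = stacked.index(start) + 1   # skip the start marker itself
    end = stacked.index(stop, begin)   # exclude the endpoint
    return stacked[begin:end]


print(lines_between(columns, "Table one", "Table two"))
```

Excluding the stop marker from the slice plays the same role as `include_endpoint=False` in the cell above.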
While you can't easily extract tables yet, you can at least extract text!
In [5]:
print(region.extract_text())
index number
1 123
2 456
3 789
4 1122
5 1455
6 1788
7 2121
8 2454
9 2787
10 3120
11 3453
12 3786
13 4119
14 4452
15 4785
16 5118
17 5451
18 5784
19 6117
20 6450
21 6783
22 7116
23 7449
24 7782
25 8115
26 8448
27 8781
28 9114
29 9447
30 9780
31 10113
32 10446
33 10779
34 11112
35 11445
36 11778
37 12111
38 12444
39 12777
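Since the extracted text is regular, it can be turned back into rows with a little string handling. The following is a minimal sketch, assuming a two-word header followed by pairs of integers, as in the output above; `parse_pairs` is a hypothetical helper and the sample string is a shortened stand-in for `region.extract_text()`.

```python
def parse_pairs(text):
    """Split whitespace-separated text into a 2-column header and integer rows."""
    tokens = text.split()
    # Assumption: the first two tokens are the column names.
    header, values = tokens[:2], tokens[2:]
    rows = [
        (int(values[i]), int(values[i + 1]))
        for i in range(0, len(values) - 1, 2)
    ]
    return header, rows


# Shortened stand-in for the text extracted from the region.
sample = "index number 1 123 2 456 3 789"
header, rows = parse_pairs(sample)
print(header)
print(rows)
```

This only works because every cell is a single token. A table with multi-word cells or missing values would need a more careful approach.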
find_all and reflows
Let's say we have a few headers...
In [6]:
(
    flow
    .find_all('text[size=12][width>10]:bold')
    .show()
)
Out[6]:
...it's easy to extract each table that's between them.
In [7]:
regions = (
    flow
    .find_all('text[size=12][width>10]:bold')
    .below(
        until='text[size=12][width>10]:bold|text:contains("Here is a bit")',
        include_endpoint=False
    )
)
regions.show()
Out[7]:
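The pattern here, where each header starts a section that runs until the next header or some other stopping text, can be sketched in plain Python. This is a conceptual illustration with made-up lines, not natural_pdf's implementation; `split_by_headers` and both predicates are hypothetical.

```python
def split_by_headers(lines, is_header, is_stop=lambda line: False):
    """Group the lines following each header until the next header or a stop line."""
    sections = {}
    current = None
    for line in lines:
        if is_stop(line):
            current = None            # close the open section, drop the stop line
        elif is_header(line):
            current = line            # a header closes the previous section
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections


# Hypothetical stacked lines, standing in for the contents of the flow.
lines = [
    "Table one", "1 123", "2 456",
    "Table two", "10 20",
    "Here is a bit of text", "ignored",
    "Table three", "7 8",
]
sections = split_by_headers(
    lines,
    is_header=lambda line: line.startswith("Table"),
    is_stop=lambda line: line.startswith("Here is a bit"),
)
for name, body in sections.items():
    print(name, body)
```

The two predicates correspond to the two alternatives joined by `|` in the `until` selector above: either another header or the "Here is a bit" text ends the current section.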