Multi-Column Layouts with Flow
Academic papers, newspapers, and many reports use multi-column layouts. The Flow class lets you read their content in the correct order by treating the columns as if they were stacked vertically, one on top of the next.
#%pip install natural-pdf
The Problem
When a document has multiple columns, extract_text() reads across the page width, mixing content from different columns:
from natural_pdf import PDF
pdf = PDF("academic_paper.pdf")
page = pdf.pages[0]
# This reads left-to-right, mixing columns!
text = page.extract_text()
print(text[:500]) # Jumbled content
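To see why the order gets jumbled, here is a minimal sketch that does not use natural-pdf at all. It assumes words are plain dicts with hypothetical `text`, `x0`, and `top` keys (made-up values), and compares row-by-row ordering with column-by-column ordering:

```python
# Toy word boxes for a two-column page (all values are invented for illustration).
words = [
    {"text": "Left-1", "x0": 50, "top": 100},
    {"text": "Right-1", "x0": 350, "top": 100},
    {"text": "Left-2", "x0": 50, "top": 120},
    {"text": "Right-2", "x0": 350, "top": 120},
]


def row_order(words):
    """Read across the full page width: top to bottom, then left to right."""
    ordered = sorted(words, key=lambda w: (w["top"], w["x0"]))
    return [w["text"] for w in ordered]


def column_order(words, split_x):
    """Read everything left of split_x first, then everything right of it."""
    ordered = sorted(words, key=lambda w: (w["x0"] >= split_x, w["top"], w["x0"]))
    return [w["text"] for w in ordered]


print(row_order(words))          # interleaves the two columns
print(column_order(words, 300))  # finishes the left column before the right
```

Row order alternates between the columns on every line, which is the mixing described above. Column order is what a vertically stacked flow aims to reproduce.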
Solution: Define Columns and Stack Them
Split the page into column regions and combine them with Flow:
from natural_pdf import PDF
from natural_pdf.flows import Flow
pdf = PDF("academic_paper.pdf")
page = pdf.pages[0]
# Define the three columns
left = page.region(left=0, right=page.width/3, top=0, bottom=page.height)
mid = page.region(left=page.width/3, right=page.width/3*2, top=0, bottom=page.height)
right = page.region(left=page.width/3*2, right=page.width, top=0, bottom=page.height)
# Preview the column divisions
page.highlight(left, mid, right)
# Stack columns into a vertical flow
stacked = [left, mid, right]
flow = Flow(segments=stacked, arrangement="vertical")
# Preview the flow; text extracted from it now reads column by column
flow.show()
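The three regions above repeat the same arithmetic with different thirds of the page width. One way to generalize it is a small helper that computes the boundaries for any number of equal-width columns. `column_bounds` is a hypothetical name, not part of natural-pdf, and the sketch assumes columns of equal width with no margins:

```python
def column_bounds(page_width, n_columns):
    """Return (left, right) x-coordinates for n equal-width columns."""
    if n_columns < 1:
        raise ValueError("n_columns must be at least 1")
    col_width = page_width / n_columns
    return [(i * col_width, (i + 1) * col_width) for i in range(n_columns)]


# 612 points is the width of a US Letter page.
bounds = column_bounds(612, 3)
print(bounds)
```

Each `(left, right)` pair can then be passed to `page.region(left=..., right=..., top=0, bottom=page.height)` exactly as in the example above, and the resulting list of regions handed to `Flow`.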
Finding Content Within Flows
Use .find() and .find_all() on flows just like pages:
# Find a section header, then capture everything below it up to the next header
region = (
    flow
    .find('text:contains("Table one")')
    .below(
        until='text:contains("Table two")',
        include_endpoint=False
    )
)
region.show()
Extracting Multiple Tables
To collect every table in a multi-column document, find each bold header and capture the content below it:
# Find bold headers and get content below each
regions = (
    flow
    .find_all('text[width>10]:bold')
    .below(
        until='text[width>10]:bold|text:contains("Here is a bit")',
        include_endpoint=False
    )
)
regions.show()
# Extract the first table
regions[0].extract_table().to_df()
Combining Data from Multiple Regions
Use .apply() to process each region and combine results:
import pandas as pd
# Apply a function to each element in a collection
elements = flow.find_all('text:bold')
texts = elements.apply(lambda el: el.extract_text())
# Extract a table from each region (found in the previous step) as a DataFrame
dfs = regions.apply(lambda r: r.extract_table().to_df())
# Merge all tables into one DataFrame
merged = pd.concat(dfs, ignore_index=True)
merged
Two-Column Layouts
For simpler two-column documents:
from natural_pdf import PDF
from natural_pdf.flows import Flow
pdf = PDF("newsletter.pdf")
page = pdf.pages[0]
# Split into left and right columns
left = page.region(left=0, right=page.width/2)
right = page.region(left=page.width/2, right=page.width)
# Create flow
flow = Flow(segments=[left, right], arrangement="vertical")
# Extract text in reading order
text = flow.extract_text()
print(text)
Detecting Columns Automatically
For documents where column boundaries aren't fixed, detect them from the page itself, for example from vertical ruling lines between the columns:
# Find vertical lines that might mark column boundaries
lines = page.find_all('line:vertical')
if lines:
    # Use the x-position of each detected line as a column divider
    boundaries = [line.x0 for line in lines]
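Many layouts separate columns with whitespace alone, with no drawn lines to detect. The following is a minimal sketch of gap detection in plain Python, not a natural-pdf feature: `find_column_gaps` is a hypothetical helper, and it assumes you have already collected the horizontal extent `(x0, x1)` of each text line (for instance from element attributes such as `x0` and `x1`, which is an assumption here).

```python
def find_column_gaps(boxes, min_gap=20):
    """Return (start, end) x-ranges not covered by any box and at least min_gap wide.

    boxes is a list of (x0, x1) horizontal extents, for example one per text line.
    """
    gaps = []
    covered_until = None  # right edge of the area covered so far
    for x0, x1 in sorted(boxes):
        if covered_until is not None and x0 - covered_until >= min_gap:
            gaps.append((covered_until, x0))
        covered_until = x1 if covered_until is None else max(covered_until, x1)
    return gaps


# Invented extents for text lines on a two-column page.
boxes = [(50, 280), (60, 270), (330, 560), (335, 550)]
gaps = find_column_gaps(boxes)

# Use the midpoint of each gap as a column boundary.
boundaries = [(start + end) / 2 for start, end in gaps]
print(gaps, boundaries)
```

Note that a single full-width element, such as a title spanning both columns, covers the gap and hides it, so it may help to exclude such elements before running the detection.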
Related Tutorials
- Multipage Content - Handle content spanning pages
- Spatial Navigation - Navigate within regions
- Table Extraction - Extract tables from flow regions