Getting Tables Out of PDFs
Tables in PDFs can be a real pain. Sometimes they're perfectly formatted with nice lines, other times they're just text floating around that vaguely looks like a table. Natural PDF gives you several different approaches to tackle whatever table nightmare you're dealing with.
Setup
Let's start with a PDF that has some tables to work with.
from natural_pdf import PDF
# Load the PDF
pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")
# Select the first page
page = pdf.pages[0]
# Display the page
page.show()
CropBox missing from /Page, defaulting to MediaBox
The Quick and Dirty Approach
If you know there's a table somewhere and just want to try extracting it, start simple:
# Try to extract the first table found on the page
# This uses pdfplumber behind the scenes
table_data = page.extract_table() # Returns a list of lists
table_data
[['Statute', 'Description', 'Level', 'Repeat?'], ['4.12.7', 'Unsanitary Working Conditions.', 'Critical', ''], ['5.8.3', 'Inadequate Protective Equipment.', 'Serious', ''], ['6.3.9', 'Ineffective Injury Prevention.', 'Serious', ''], ['7.1.5', 'Failure to Properly Store Hazardous Materials.', 'Critical', ''], ['8.9.2', 'Lack of Adequate Fire Safety Measures.', 'Serious', ''], ['9.6.4', 'Inadequate Ventilation Systems.', 'Serious', ''], ['10.2.7', 'Insufficient Employee Training for Safe Work Practices.', 'Serious', '']]
This might work great, or it might give you garbage. Tables are tricky.
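One rough way to tell "great" from "garbage" is to look at the shape of what came back. The sketch below is only an illustration: `table_report` is a hypothetical helper, not part of Natural PDF, and it assumes the result is a list of rows whose cells are strings or `None` (or that the whole result may be `None` when nothing was found).

```python
from collections import Counter

def table_report(rows):
    """Summarize a list-of-lists table (hypothetical helper, not a Natural PDF API)."""
    rows = rows or []  # assumption: treat a None result as "no table found"
    widths = Counter(len(row) for row in rows)
    cells = [cell for row in rows for cell in row]
    filled = sum(1 for cell in cells if cell not in (None, ""))
    return {
        "rows": len(rows),
        "widths": dict(widths),  # more than one width suggests a ragged table
        "filled_ratio": filled / len(cells) if cells else 0.0,
    }

# A few rows copied from the output above
sample = [
    ["Statute", "Description", "Level", "Repeat?"],
    ["4.12.7", "Unsanitary Working Conditions.", "Critical", ""],
    ["5.8.3", "Inadequate Protective Equipment.", "Serious", ""],
]
report = table_report(sample)
print(report)
```

A single row width and a reasonably high share of filled cells is a decent sign; several different widths or mostly empty cells usually means it is worth trying one of the approaches below.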
The Smart Way: Detect First, Then Extract
A better approach is to first find where the tables actually are, then extract them properly.
Finding Tables with YOLO (Fast and Pretty Good)
The YOLO model is good at spotting table-shaped areas on a page.
# Use YOLO to find table regions
page.analyze_layout(engine='yolo')
# Find what it thinks are tables
table_regions_yolo = page.find_all('region[type=table][model=yolo]')
table_regions_yolo.show()
image 1/1 /var/folders/25/h3prywj14qb0mlkl2s8bxq5m0000gn/T/tmpy_y9uk6k/temp_layout_image.png: 1024x800 1 title, 3 plain texts, 2 abandons, 1 table, 1619.9ms
Speed: 8.5ms preprocess, 1619.9ms inference, 3.8ms postprocess per image at shape (1, 3, 1024, 800)
# Extract data from the detected table
table_regions_yolo[0].extract_table()
[['Statute', 'Description', 'Level', 'Repeat?'], ['4.12.7', 'Unsanitary Working Conditions.', 'Critical', ''], ['5.8.3', 'Inadequate Protective Equipment.', 'Serious', ''], ['6.3.9', 'Ineffective Injury Prevention.', 'Serious', ''], ['7.1.5', 'Failure to Properly Store Hazardous Materials.', 'Critical', ''], ['8.9.2', 'Lack of Adequate Fire Safety Measures.', 'Serious', ''], ['9.6.4', 'Inadequate Ventilation Systems.', 'Serious', ''], ['10.2.7', 'Insufficient Employee Training for Safe Work Practices.', 'Serious', '']]
Finding Tables with TATR (Slow but Very Smart)
The TATR model actually understands table structure: it can tell you where rows, columns, and headers are.
# Clear previous results and try TATR
page.clear_detected_layout_regions()
page.analyze_layout(engine='tatr')
<ElementCollection[Region](count=15)>
# Find the table that TATR detected
tatr_table = page.find('region[type=table][model=tatr]')
tatr_table.show()
# TATR finds the internal structure too
rows = page.find_all('region[type=table-row][model=tatr]')
cols = page.find_all('region[type=table-column][model=tatr]')
hdrs = page.find_all('region[type=table-column-header][model=tatr]')
f"TATR found: {len(rows)} rows, {len(cols)} columns, {len(hdrs)} headers"
'TATR found: 8 rows, 4 columns, 1 headers'
Choosing Your Extraction Method
When you call extract_table() on a detected region, Natural PDF picks the extraction method automatically:
- YOLO-detected regions → uses pdfplumber (looks for lines and text alignment)
- TATR-detected regions → uses the smart tatr method (uses the detected structure)
You can override this if needed:
tatr_table = page.find('region[type=table][model=tatr]')
# Use TATR's smart extraction
tatr_table.extract_table(method='tatr')
[['Statute Description Level Repeat?'], ['Statute', 'Description', 'Level', 'Repeat?'], ['4.12.7', 'Unsanitary Working Conditions.', 'Critical', ''], ['5.8.3', 'Inadequate Protective Equipment.', 'Serious', ''], ['6.3.9', 'Ineffective Injury Prevention.', 'Serious', ''], ['7.1.5', 'Failure to Properly Store Hazardous Materials.', 'Critical', ''], ['8.9.2', 'Lack of Adequate Fire Safety Measures.', 'Serious', ''], ['9.6.4', 'Inadequate Ventilation Systems.', 'Serious', ''], ['10.2.7', 'Insufficient Employee Training for Safe Work Practices.', 'Serious', '']]
# Or force it to use pdfplumber instead (maybe for comparison)
tatr_table.extract_table(method='pdfplumber')
[['Unsanitary Working Conditions.', 'Critical'], ['Inadequate Protective Equipment.', 'Serious'], ['Ineffective Injury Prevention.', 'Serious'], ['Failure to Properly Store Hazardous Materials.', 'Critical'], ['Lack of Adequate Fire Safety Measures.', 'Serious'], ['Inadequate Ventilation Systems.', 'Serious']]
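In the outputs above, the tatr result starts with an extra one-cell row where the whole header was merged into a single string. If that happens, one possible cleanup is to keep only the rows that share the most common width. This is a sketch under that assumption; `keep_common_width` is a hypothetical helper and not something Natural PDF provides.

```python
from collections import Counter

def keep_common_width(rows):
    """Keep only rows whose cell count matches the most common row width."""
    if not rows:
        return []
    common_width, _ = Counter(len(row) for row in rows).most_common(1)[0]
    return [row for row in rows if len(row) == common_width]

# First rows of the tatr output shown above
tatr_rows = [
    ["Statute Description Level Repeat?"],
    ["Statute", "Description", "Level", "Repeat?"],
    ["4.12.7", "Unsanitary Working Conditions.", "Critical", ""],
]
cleaned = keep_common_width(tatr_rows)
print(cleaned)
```

This is a blunt tool: it will also discard legitimately merged rows, so check what gets dropped before relying on it.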
When to Use Which?
- pdfplumber: Great for clean tables with visible grid lines. Fast and reliable.
- tatr: Better for messy tables, tables without lines, or tables with merged cells. Slower but smarter.
When Tables Don't Cooperate
Sometimes the automatic detection doesn't work well. You can tweak pdfplumber's settings:
# Custom settings for tricky tables
table_settings = {
"vertical_strategy": "text", # Use text alignment instead of lines
"horizontal_strategy": "lines", # Still use lines for rows
"intersection_x_tolerance": 5, # Be more forgiving about line intersections
}
results = page.extract_table(table_settings=table_settings)
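If you are not sure which settings suit a table, you can try a few combinations and keep the one that looks best. The sketch below is one way to do that, not a Natural PDF feature: `best_extraction` and `filled_cells` are hypothetical helpers, "most non-empty cells" is a crude scoring assumption, and `fake_extract` is a stand-in so the example runs without a PDF.

```python
def filled_cells(rows):
    """Count non-empty cells in a list-of-lists table."""
    return sum(1 for row in (rows or []) for cell in row if cell not in (None, ""))

def best_extraction(extract, candidates):
    """Call extract(settings) for each candidate; return the (settings, rows) pair
    with the most non-empty cells. Ties keep the earlier candidate."""
    best = (None, [])
    for settings in candidates:
        rows = extract(settings) or []
        if filled_cells(rows) > filled_cells(best[1]):
            best = (settings, rows)
    return best

# Strategy combinations to try (pdfplumber strategy names)
candidates = [
    {"vertical_strategy": "lines", "horizontal_strategy": "lines"},
    {"vertical_strategy": "text", "horizontal_strategy": "lines"},
    {"vertical_strategy": "text", "horizontal_strategy": "text"},
]

def fake_extract(settings):
    """Stand-in extractor: pretends text-based columns split the cells correctly."""
    if settings["vertical_strategy"] == "text":
        return [["Statute", "Level"], ["4.12.7", "Critical"]]
    return [["Statute Level"], ["4.12.7 Critical"]]

settings, rows = best_extraction(fake_extract, candidates)
print(settings, rows)
```

With a real page, the extractor could be `lambda s: page.extract_table(table_settings=s)`, using the same call shown above. Treat the winner as a suggestion and eyeball the rows before trusting it.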
Saving Your Results
Once you've got your table data, you'll probably want to do something useful with it:
import pandas as pd
# Convert to a pandas DataFrame for easy manipulation
df = pd.DataFrame(page.extract_table())
df
| | 0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | Statute | Description | Level | Repeat? |
1 | 4.12.7 | Unsanitary Working Conditions. | Critical | |
2 | 5.8.3 | Inadequate Protective Equipment. | Serious | |
3 | 6.3.9 | Ineffective Injury Prevention. | Serious | |
4 | 7.1.5 | Failure to Properly Store Hazardous Materials. | Critical | |
5 | 8.9.2 | Lack of Adequate Fire Safety Measures. | Serious | |
6 | 9.6.4 | Inadequate Ventilation Systems. | Serious | |
7 | 10.2.7 | Insufficient Employee Training for Safe Work P... | Serious | |
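Notice that the header ended up as row 0 of the data. If you want real column names and a file on disk, one option is to split off the first row yourself. The sketch below uses only the standard library and writes to an in-memory buffer; the `rows` sample is copied from the output above, and in practice you would swap the buffer for an open file. In pandas, the equivalent of the header split is `pd.DataFrame(rows[1:], columns=rows[0])`.

```python
import csv
import io

# A few rows copied from the extracted table above
rows = [
    ["Statute", "Description", "Level", "Repeat?"],
    ["4.12.7", "Unsanitary Working Conditions.", "Critical", ""],
    ["5.8.3", "Inadequate Protective Equipment.", "Serious", ""],
]

# Treat the first row as the header and the rest as data
header, body = rows[0], rows[1:]
records = [dict(zip(header, row)) for row in body]

# Write CSV to an in-memory buffer (use open("table.csv", "w", newline="") for a real file)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=header, lineterminator="\n")
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()
print(csv_text)
```

This assumes the first row really is the header and that every row has the same number of cells, which is worth checking first for messier tables.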
Working with TATR Cell Structure
TATR is smart enough to create individual cell regions, but accessing them directly is still a work in progress:
# This should work but doesn't quite yet - we're working on it!
# tatr_table.cells
Next Steps
Tables are just one part of document structure. Once you've got table extraction working:
- Layout Analysis: See how table detection fits into understanding the whole document
- Working with Regions: Manually define table areas when automatic detection fails