Basic Table Extraction
PDFs often contain tables, and natural-pdf provides methods to extract their data, building on pdfplumber's capabilities.
Let's extract the "Violations" table from our practice PDF.
#%pip install "natural-pdf[all]"
pdfplumber-based extraction
from natural_pdf import PDF
pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")
page = pdf.pages[0]
tables = page.extract_tables()
tables[0]
Out[2]:
[['Statute', 'Description', 'Level', 'Repeat?'], ['4.12.7', 'Unsanitary Working Conditions.', 'Critical', ''], ['5.8.3', 'Inadequate Protective Equipment.', 'Serious', ''], ['6.3.9', 'Ineffective Injury Prevention.', 'Serious', ''], ['7.1.5', 'Failure to Properly Store Hazardous Materials.', 'Critical', ''], ['8.9.2', 'Lack of Adequate Fire Safety Measures.', 'Serious', ''], ['9.6.4', 'Inadequate Ventilation Systems.', 'Serious', ''], ['10.2.7', 'Insufficient Employee Training for Safe Work Practices.', 'Serious', '']]
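The extracted table is a plain list of rows, with the header as the first row. As a minimal sketch of working with that shape, the snippet below pairs each data row with the header so columns can be accessed by name. The `rows_to_records` helper is hypothetical (it is not part of natural-pdf), and the sample rows are copied from the output above, shortened to three data rows.

```python
# Rows in the same shape as the extraction output above (shortened).
rows = [
    ['Statute', 'Description', 'Level', 'Repeat?'],
    ['4.12.7', 'Unsanitary Working Conditions.', 'Critical', ''],
    ['5.8.3', 'Inadequate Protective Equipment.', 'Serious', ''],
    ['7.1.5', 'Failure to Properly Store Hazardous Materials.', 'Critical', ''],
]


def rows_to_records(rows):
    """Hypothetical helper: turn [header, row, row, ...] into a list of dicts."""
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


records = rows_to_records(rows)

# Look up values by column name, e.g. the statutes marked Critical.
critical = [record['Statute'] for record in records if record['Level'] == 'Critical']
print(critical)
```

This assumes the first row really is the header and every row has the same number of cells, which holds for the output shown here but is worth checking on other PDFs.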
TATR-based extraction
TATR (Table Transformer) layout analysis uses a table-detection model to locate a table's borders and cell boundaries. A region found by TATR automatically uses the tatr extraction method.
from natural_pdf import PDF
pdf = PDF("https://github.com/jsoma/natural-pdf/raw/refs/heads/main/pdfs/01-practice.pdf")
page = pdf.pages[0]
page.analyze_layout('tatr')
page.find('table').show()
page.find('table').extract_table()
Out[4]:
[['Statute Description Level Repeat?'], ['Statute', 'Description', 'Level', 'Repeat?'], ['4.12.7', 'Unsanitary Working Conditions.', 'Critical', ''], ['5.8.3', 'Inadequate Protective Equipment.', 'Serious', ''], ['6.3.9', 'Ineffective Injury Prevention.', 'Serious', ''], ['7.1.5', 'Failure to Properly Store Hazardous Materials.', 'Critical', ''], ['8.9.2', 'Lack of Adequate Fire Safety Measures.', 'Serious', ''], ['9.6.4', 'Inadequate Ventilation Systems.', 'Serious', ''], ['10.2.7', 'Insufficient Employee Training for Safe Work Practices.', 'Serious', '']]
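Note that the TATR output above begins with an extra one-cell row, `['Statute Description Level Repeat?']`, ahead of the real four-column header. One possible cleanup, sketched here under the assumption that stray rows differ in width from the real ones, is to keep only rows that match the most common column count. The `drop_ragged_rows` helper is hypothetical and not part of natural-pdf; the sample rows are copied from the output above and shortened.

```python
from collections import Counter


def drop_ragged_rows(rows):
    """Hypothetical helper: keep only rows with the most common column count."""
    if not rows:
        return []
    # The most frequent row length is taken as the table's true width.
    width = Counter(len(row) for row in rows).most_common(1)[0][0]
    return [row for row in rows if len(row) == width]


# Rows in the same shape as the TATR output above (shortened).
tatr_rows = [
    ['Statute Description Level Repeat?'],
    ['Statute', 'Description', 'Level', 'Repeat?'],
    ['4.12.7', 'Unsanitary Working Conditions.', 'Critical', ''],
    ['5.8.3', 'Inadequate Protective Equipment.', 'Serious', ''],
]

cleaned = drop_ragged_rows(tatr_rows)
print(cleaned[0])
```

A width-based filter is a heuristic: it would also discard legitimate rows that happen to have a different number of cells, so inspect what gets dropped before relying on it.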