Modern PDF processing with Natural PDF

Jonathan Soma, Columbia University

Contact: js4571@columbia.edu, @dangerscarf

Sites: Lede Program, jonathansoma.com, Practical AI for Investigative Journalism, Bad PDFs

Learn to extract data from PDFs with the ultimate spatial magic of Natural PDF

Natural PDF

📥 Download PDF

Natural PDF basics with text and tables

Natural PDF is a spatially-aware PDF processing library that makes accessing PDF data a breeze.
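
To make "spatially aware" concrete, here is a minimal plain-Python sketch of the underlying idea: treat each word as a box with coordinates, find a label, and read what sits to its right. It does not use Natural PDF or reproduce its API. The `Word` class, the helper names, and the coordinates are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """A word plus its bounding box (hypothetical structure, not Natural PDF's)."""
    text: str
    x0: float      # left edge
    top: float     # distance from the top of the page
    x1: float      # right edge
    bottom: float  # distance from the top of the page to the lower edge


def find_word(words, text):
    """Return the first word whose text matches exactly, or None."""
    return next((w for w in words if w.text == text), None)


def words_right_of(words, anchor, tolerance=2.0):
    """Words on the same line as `anchor` and to its right, left to right."""
    same_line = [
        w for w in words
        if abs(w.top - anchor.top) <= tolerance and w.x0 >= anchor.x1
    ]
    return sorted(same_line, key=lambda w: w.x0)


# Made-up page content: two label/value lines.
page_words = [
    Word("Date:", 50, 100, 80, 112),
    Word("2024-01-15", 90, 100, 150, 112),
    Word("Inspector:", 50, 120, 105, 132),
    Word("Jane", 115, 120, 140, 132),
    Word("Doe", 145, 120, 165, 132),
]

label = find_word(page_words, "Inspector:")
value = " ".join(w.text for w in words_right_of(page_words, label))
print(value)  # Jane Doe
```

The point of the sketch is that the value is located by position relative to a label, not by its index in a stream of extracted text, which is what keeps this style of extraction stable when the surrounding text changes.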

🚀 Live coding worksheet ✓ Completed version
📑 Slides: slides.pdf

Recognizing text with OCR engines using Natural PDF

Some PDFs are just images of text instead of being actual text. This is when you need OCR (optical character recognition).

🚀 Live coding worksheet ✓ Completed version
📑 Slides: slides.pdf

AI and data extraction

AI is a great (albeit flawed) method for extracting specific data from your PDFs.

🚀 Live coding worksheet ✓ Completed version
📑 Slides: slides.pdf

Columns, multi-page flows and other page structures

A one-page PDF with a single block of text is easy mode. Things get more complicated when you have actual layouts.
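
As a rough illustration of why layouts complicate things, the plain-Python sketch below compares a naive top-to-bottom reading of a two-column page with a column-aware one. It is a conceptual example only, not Natural PDF's API: the function names, the tuple layout, and the assumption that the columns split at the page midpoint are all hypothetical.

```python
def naive_order(words):
    """Read strictly top to bottom, then left to right, ignoring columns."""
    return [text for text, x0, top in sorted(words, key=lambda w: (w[2], w[1]))]


def column_order(words, page_width):
    """Read the left column fully, then the right column.

    Assumes exactly two columns divided at the page midpoint (a simplification).
    """
    midpoint = page_width / 2
    left = [w for w in words if w[1] < midpoint]
    right = [w for w in words if w[1] >= midpoint]
    ordered = sorted(left, key=lambda w: (w[2], w[1])) + sorted(
        right, key=lambda w: (w[2], w[1])
    )
    return [text for text, x0, top in ordered]


# Made-up words as (text, x0, top) on a page 600 units wide.
words = [
    ("Alpha", 50, 100),
    ("Gamma", 320, 100),
    ("Beta", 50, 120),
    ("Delta", 320, 120),
]

print(naive_order(words))        # ['Alpha', 'Gamma', 'Beta', 'Delta']
print(column_order(words, 600))  # ['Alpha', 'Beta', 'Gamma', 'Delta']
```

The naive reading interleaves the two columns line by line, while the column-aware reading keeps each column's text together. Real pages rarely split so cleanly, which is why dedicated layout handling is worth a session of its own.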

🚀 Live coding worksheet ✓ Completed version
📑 Slides: slides.pdf

Putting it all together

Let's see what it looks like to put this all together in a real-life scenario.

🚀 Live coding worksheet ✓ Completed version
📑 Slides: slides.pdf

Created by Jonathan Soma