NICAR 2026

Modern PDF processing with Natural PDF

Jonathan Soma, Columbia University

Learn to extract data from PDFs with the spatial magic of Natural PDF. From basic text extraction to OCR, AI, and complex layouts, these sessions cover everything you need to get structured data out of any PDF.

js4571@columbia.edu • @dangerscarf • Lede Program • jonathansoma.com • Bad PDFs

Natural PDF basics with text and tables

Natural PDF is a spatially aware PDF processing library that makes accessing PDF data a breeze.

Ref: Natural PDF documentation

Recognizing text with OCR engines using Natural PDF

Some PDFs are just images of text instead of being actual text. This is when you need OCR (optical character recognition).

Ref: Surya OCR • EasyOCR • PaddleOCR

AI and data extraction

AI is a great (albeit flawed) method for extracting specific data from your PDFs.

Ref: impira docquery • OpenAI structured outputs • Pydantic models

Columns, multi-page flows and other page structures

A one-page PDF with a single block of text is easy mode. Things get more complicated when you have actual layouts.

Ref: Microsoft's Table Transformer (TATR) • YOLO document layout • LayoutLMv3 • merveenoyan/smol-vision

Putting it all together

Let's see what it looks like to put this all together in a real-life scenario.
