NICAR 2026

Modern PDF processing with Natural PDF

Jonathan Soma, Columbia University

Learn to extract data from PDFs with the spatial magic of Natural PDF. We go from basic text extraction to OCR, AI, and complex layouts, covering what you need to get structured data out of almost any PDF.


01

Natural PDF basics with text and tables

Natural PDF is a spatially-aware PDF processing library that lets you pull out text and tables based on where they sit on the page.

02

Recognizing text with OCR engines using Natural PDF

Some PDFs contain only images of text rather than actual text. That is when you need OCR (optical character recognition).

03

AI and data extraction

AI is a great (albeit flawed) method for extracting specific data from your PDFs.

04

Columns, multi-page flows and other page structures

A one-page PDF with a single block of text is easy mode. Things get more complicated when you have actual layouts.

05

Putting it all together

Let's see what it looks like to put this all together in a real-life scenario.