Jonathan Soma, Columbia University
Contact: js4571@columbia.edu • @dangerscarf
Sites: Lede Program • jonathansoma.com • Practical AI for Investigative Journalism • Bad PDFs
Learn to extract data from PDFs with the ultimate spatial magic of Natural PDF
Natural PDF is a spatially-aware PDF processing library that makes accessing PDF data a breeze.
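To make "spatially-aware" concrete, here is a minimal plain-Python sketch of the underlying idea: treat each piece of text as a box with coordinates, then select the boxes that sit below a label. It does not use Natural PDF's actual API. The `TextBox` class, the `below` helper, and the sample coordinates are all hypothetical, invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class TextBox:
    """Hypothetical stand-in for a piece of text with a position on the page."""
    text: str
    x0: float      # left edge
    top: float     # distance from the top of the page
    x1: float      # right edge
    bottom: float  # distance from the top of the page to the lower edge


def below(anchor, boxes):
    """Return boxes that start under the anchor and overlap it horizontally."""
    hits = [
        b for b in boxes
        if b.top >= anchor.bottom and b.x0 < anchor.x1 and b.x1 > anchor.x0
    ]
    # Read top to bottom, the way a person would scan the column
    return sorted(hits, key=lambda b: b.top)


# Made-up page contents (assumption: coordinates grow downward and rightward)
boxes = [
    TextBox("Violations", 50, 100, 150, 115),
    TextBox("Inspector", 300, 100, 380, 115),
    TextBox("Unsanitary conditions", 50, 130, 220, 145),
    TextBox("Jane Doe", 300, 130, 370, 145),
    TextBox("Blocked exit", 50, 150, 140, 165),
]

label = next(b for b in boxes if b.text == "Violations")
under_label = [b.text for b in below(label, boxes)]
print(under_label)
```

The point of the sketch is that the selection is driven by where text sits on the page rather than by the order in which it appears in the file.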
Some PDFs are just images of text instead of being actual text. This is when you need OCR (optical character recognition).
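One rough way to tell the two kinds of PDF apart is to check how much text comes back when you try to extract it. The sketch below assumes you already have each page's extracted text as a string. The `needs_ocr` helper and its `min_chars` threshold are hypothetical choices for illustration, not part of Natural PDF or any OCR tool.

```python
def needs_ocr(extracted_text, min_chars=20):
    """Guess whether a page is an image of text.

    A page with a real text layer usually yields plenty of characters;
    a scanned page yields nothing, or only stray whitespace.
    """
    meaningful = "".join(extracted_text.split())  # drop all whitespace
    return len(meaningful) < min_chars


# Made-up extraction results, keyed by page number
pages = {
    1: "Annual inspection report for the facility at 123 Main Street.",
    2: "",          # scanned page: no text layer at all
    3: " \n  \n",   # scanned page: only whitespace came back
}

to_ocr = [number for number, text in pages.items() if needs_ocr(text)]
print(to_ocr)
```

A threshold like this is only a heuristic: a nearly blank page with a real text layer would also be flagged, so it is worth spot-checking the pages it selects.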
AI is a great (albeit flawed) method for extracting specific data from your PDFs.
A one-page PDF with a single block of text is easy mode. Things get more complicated when you have actual layouts.
Let's see what it looks like to put this all together in a real-life scenario.