Dataharvest 2026

Newsroom Infrastructure for AI Experimentation

Jonathan Soma

Small local demos for OCR, text-to-speech, semantic PDF search, and Streamlit data browsing.


01. OCR

Journalists often deal with scanned texts, but the best tools are locked behind code. How can we help non-technical users try out our favorite libraries without making them install Python and run notebooks?

01

OCR with Gradio: simple

A tiny PDF OCR interface using RapidOCR.

02

OCR with Gradio: fancy

A richer OCR interface comparing RapidOCR with GLM-OCR.


02. Text-to-speech

Newsroom C-suites have fallen for auto-generated podcasts recently, but the rest of the publication is often split. How can we let everyone try the product for themselves and see its strengths and weaknesses?

01

Text-to-speech with Gradio: simple

A small Kokoro ONNX text-to-speech demo.

02

Text-to-speech with Gradio: fancy

A larger TTS demo with Kokoro ONNX and MMS options.


03. PDF search

Semantic search is a useful tool for investigative work, but you don't always want to upload all of your docs into a Google product. Can a home-grown version work just as well?

01

PDF Search: notebook version

A notebook walkthrough that reads local PDFs, embeds each page, and ranks semantic search results. This only works on Codespaces!

02

PDF Search: Streamlit app

Semantic search over local PDFs using sentence-transformer embeddings.
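
The embed-then-rank idea behind both versions can be sketched in a few lines. This is a minimal illustration, not the code from the notebook or the app: a toy word-count vector stands in for the sentence-transformer embedding (so it only matches shared words, not meaning), and the page texts are hypothetical, as if they had already been extracted from local PDFs.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_pages(query: str, pages: dict) -> list:
    """Return (page_id, score) pairs, best match first."""
    query_vec = embed(query)
    scored = [
        (page_id, cosine(query_vec, embed(text)))
        for page_id, text in pages.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical page texts keyed by (filename, page number).
pages = {
    ("budget.pdf", 1): "The city council approved the annual budget for road repairs.",
    ("budget.pdf", 2): "Police overtime spending rose sharply during the summer.",
    ("memo.pdf", 1): "Staff are reminded to submit expense reports on time.",
}

results = rank_pages("police overtime spending", pages)
for page_id, score in results:
    print(page_id, round(score, 3))
```

Swapping `embed` for a real embedding model changes the vectors but not the ranking step: each page still gets one vector, and pages are sorted by similarity to the query vector.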


04. Transfer data

Tired of doing data analysis for all of your coworkers? Give them the tools to browse the data themselves!

01

Analyzing real estate transfers the normal way

A notebook that opens up a CSV and does a little light analysis.

02

Real Estate Transfers Browser: simple version

A simple Streamlit browser for local property transfer data.

03

Real Estate Transfers Browser: fancy version

A fuller Streamlit transfer-data app with charts, filters, and summaries.
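
As a rough picture of the "light analysis" that sits underneath all three versions, here is a minimal sketch using only the standard library. The sample rows and the column names (`sale_date`, `neighborhood`, `price`) are hypothetical; the real CSV may be shaped differently, and the notebook and apps may well use other tools.

```python
import csv
import io
from collections import defaultdict
from statistics import median

# Hypothetical sample rows; the real CSV's column names may differ.
SAMPLE_CSV = """sale_date,neighborhood,price
2025-01-14,Northside,250000
2025-02-03,Northside,310000
2025-02-20,Riverside,480000
2025-03-08,Riverside,520000
2025-03-30,Riverside,450000
"""


def load_transfers(handle):
    """Read rows from an open CSV file and convert prices to integers."""
    rows = []
    for row in csv.DictReader(handle):
        row["price"] = int(row["price"])
        rows.append(row)
    return rows


def summarize(rows, min_price=0):
    """Count sales and compute the median price per neighborhood."""
    by_area = defaultdict(list)
    for row in rows:
        if row["price"] >= min_price:  # the kind of filter a browser app exposes
            by_area[row["neighborhood"]].append(row["price"])
    return {
        area: {"count": len(prices), "median_price": median(prices)}
        for area, prices in by_area.items()
    }


transfers = load_transfers(io.StringIO(SAMPLE_CSV))
summary = summarize(transfers)
filtered = summarize(transfers, min_price=300000)
print(summary)
print(filtered)
```

A browser app is essentially this with the filter value coming from a widget instead of a function argument, which is what lets coworkers answer their own questions.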


05. Evaluations

Use Braintrust for evaluation workflows, then use the CSV files here as small local datasets to test with.

Download the CSVs from the section's Download materials link.


06. Structured outputs

01

OpenRouter + Pydantic AI

Use Pydantic AI with OpenRouter to ask questions and request structured outputs.
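
The idea behind structured outputs is that the model's reply must fit a schema you declare up front, and anything that does not fit is rejected rather than passed along. The sketch below shows only that validation step with the standard library: a plain dataclass stands in for a Pydantic model, the `Transfer` fields are hypothetical, and the replies are hard-coded strings rather than real model output. Pydantic AI handles the model call and the validation for you.

```python
import json
from dataclasses import dataclass


@dataclass
class Transfer:
    """Hypothetical record we want the model to fill in."""
    buyer: str
    seller: str
    price: int


# Field name -> required type, mirroring the dataclass above.
SCHEMA = {"buyer": str, "seller": str, "price": int}


def parse_reply(raw: str) -> Transfer:
    """Turn a JSON reply into a Transfer, or raise ValueError if it does not fit."""
    data = json.loads(raw)  # invalid JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    missing = SCHEMA.keys() - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    for name, expected_type in SCHEMA.items():
        if not isinstance(data[name], expected_type):
            raise ValueError(f"{name} should be {expected_type.__name__}")
    return Transfer(**{name: data[name] for name in SCHEMA})


# Hard-coded stand-ins for model replies.
good_reply = '{"buyer": "A. Example", "seller": "B. Sample", "price": 325000}'
bad_reply = '{"buyer": "A. Example", "seller": "B. Sample", "price": "a lot"}'

record = parse_reply(good_reply)
print(record)

try:
    parse_reply(bad_reply)
except ValueError as error:
    print("rejected:", error)
```

The payoff is downstream: code that receives a `Transfer` can rely on `price` being a number instead of re-checking free text.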