Audio
Images and documents gave you structured data. Audio is another way to get text — and once it's text, you already know what to do with it. This is how you can do story research like Misinformation on TikTok: How 'Documented' Examined Hundreds of Videos in Different Languages or The Second Trump Presidency, Brought to you by YouTubers.
If you don't want to code... just use NotebookLM. It will do everything for you. Just go to sleep right now.
Transcription with Whisper
Once upon a time OpenAI released an open model named Whisper. It's great! Very very popular.
There are newer models out there — parakeet-mlx is blazing fast on Macs — but Whisper is very easy to use so everyone (and I mean everyone) uses it.
When you use Whisper, you have to make some decisions:
- Which packaging of Whisper: Whisper is free to distribute, so a zillion tools are built on top of it. Below we're using WhisperX which is a feature-packed tool built on top of Whisper.
- Which version of the model: Like tiny, base, large... bigger is better, but slower! Turbo is the best combination of speed and accuracy.
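To make that tradeoff concrete, here's a rough sketch of the size ladder (parameter counts are from the Whisper README) with a tiny model-picking helper. The helper and its budget thresholds are purely illustrative — they're not part of Whisper or WhisperX.

```python
# Whisper model sizes, smallest to largest (parameter counts from the Whisper README)
WHISPER_MODELS = {
    "tiny": 39_000_000,
    "base": 74_000_000,
    "small": 244_000_000,
    "medium": 769_000_000,
    "turbo": 809_000_000,   # large-v3, optimized for speed
    "large": 1_550_000_000,
}

def pick_model(max_params):
    """Pick the biggest model that fits under a parameter budget (illustrative helper)."""
    fitting = [name for name, size in WHISPER_MODELS.items() if size <= max_params]
    return fitting[-1] if fitting else "tiny"

print(pick_model(100_000_000))    # a small budget gets you "base"
print(pick_model(1_000_000_000))  # room for "turbo"
```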
Let's practice on this Trump/Biden debate clip.
I saved it as an mp3 to make life a little easier.
audio/transcribe-whisperx.py — WhisperX segment-level aligned transcription
from pathlib import Path
import os
os.environ["TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD"] = "1"
import torch
import whisperx
DATA = Path("data")
AUDIO = DATA / "rDXubdQdJYs.mp3"
MODEL = "turbo"
LANGUAGE = "en"
# Decides whether your computer is fancy and powerful
device = "cuda" if torch.cuda.is_available() else "cpu"
compute_type = "float16" if device == "cuda" else "int8"
model = whisperx.load_model(MODEL, device, compute_type=compute_type)
audio = whisperx.load_audio(str(AUDIO))
result = model.transcribe(audio, batch_size=16)
# Print out the entire transcript
text = " ".join(seg["text"].strip() for seg in result["segments"])
print(text)
relative to what we're going to do with more border patrol and more asylum officers. President Trump? I really don't know what he said at the end of that sentence. I don't think he knows what he said either. The only person on this stage is a convicted felon is the man I'm looking at right now. But when he talks about a convicted felon, his son is a convicted felon. What are you talking about? You have the morals of an alley cat. My son was not a loser, was not a sucker. You're the sucker, you're the loser. He's blaming inflation, and he's right, it's been very bad. He caused the inflation. Excuse me, with dealing with everything we have to do with, look, if we finally beat Medicare. Thank you, President Biden. President Trump. Well, I took two tests, cognitive tests. I aced him. Go through the first five questions, he couldn't do it. This guy's three years younger and a lot less competent. I think that just look at the record, look at what I've done.
Getting exact timestamps
The basic transcription gives you segment-level timestamps. Alignment refines those to word-level timestamps — useful for speaker identification, precise clip cutting, or word-by-word subtitles.
model_a, metadata = whisperx.load_align_model(language_code=LANGUAGE, device=device)
result = whisperx.align(
result["segments"], model_a, metadata, audio, device,
return_char_alignments=False,
)
import pandas as pd
df = pd.DataFrame(result["segments"])
df
| | start | end | text | words |
|---|---|---|---|---|
| 0 | 2.461 | 7.990 | relative to what we're going to do with more ... | [{'word': 'relative', 'start': 2.461, 'end': 2... |
| 1 | 8.010 | 9.091 | President Trump? | [{'word': 'President', 'start': 8.01, 'end': 8... |
| 2 | 9.111 | 11.215 | I really don't know what he said at the end of... | [{'word': 'I', 'start': 9.111, 'end': 9.151, '... |
| 3 | 11.555 | 12.897 | I don't think he knows what he said either. | [{'word': 'I', 'start': 11.555, 'end': 11.575,... |
| 4 | 12.937 | 16.904 | The only person on this stage is a convicted f... | [{'word': 'The', 'start': 12.937, 'end': 13.23... |
| 5 | 17.324 | 20.629 | But when he talks about a convicted felon, his... | [{'word': 'But', 'start': 17.324, 'end': 17.42... |
| 6 | 20.750 | 22.292 | What are you talking about? | [{'word': 'What', 'start': 20.75, 'end': 20.87... |
| 7 | 22.913 | 25.197 | You have the morals of an alley cat. | [{'word': 'You', 'start': 22.913, 'end': 23.35... |
| 8 | 25.217 | 27.200 | My son was not a loser, was not a sucker. | [{'word': 'My', 'start': 25.217, 'end': 25.357... |
| 9 | 27.481 | 29.023 | You're the sucker, you're the loser. | [{'word': 'You're', 'start': 27.481, 'end': 27... |
| 10 | 29.704 | 32.950 | He's blaming inflation, and he's right, it's b... | [{'word': 'He's', 'start': 29.704, 'end': 29.9... |
| 11 | 33.611 | 34.753 | He caused the inflation. | [{'word': 'He', 'start': 33.611, 'end': 33.771... |
| 12 | 35.073 | 46.031 | Excuse me, with dealing with everything we hav... | [{'word': 'Excuse', 'start': 35.073, 'end': 35... |
| 13 | 46.532 | 47.694 | Thank you, President Biden. | [{'word': 'Thank', 'start': 46.532, 'end': 46.... |
| 14 | 47.714 | 48.415 | President Trump. | [{'word': 'President', 'start': 47.714, 'end':... |
| 15 | 48.580 | 50.572 | Well, I took two tests, cognitive tests. | [{'word': 'Well,', 'start': 48.58, 'end': 48.8... |
| 16 | 50.652 | 51.256 | I aced him. | [{'word': 'I', 'start': 50.652, 'end': 50.713,... |
| 17 | 51.276 | 53.248 | Go through the first five questions, he couldn... | [{'word': 'Go', 'start': 51.276, 'end': 51.397... |
| 18 | 53.610 | 56.649 | This guy's three years younger and a lot less ... | [{'word': 'This', 'start': 53.61, 'end': 53.79... |
| 19 | 56.669 | 59.063 | I think that just look at the record, look at ... | [{'word': 'I', 'start': 56.669, 'end': 56.749,... |
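Those word-level timestamps are what make precise clip cutting possible. Here's a sketch of scanning aligned segments for a word and getting back the exact moment it was spoken — the sample data below is hand-made to mimic the shape of WhisperX's aligned output, with only a few words filled in.

```python
# A hand-made sample matching WhisperX's aligned output shape (abbreviated)
segments = [
    {"text": "You have the morals of an alley cat.",
     "words": [
         {"word": "You", "start": 22.913, "end": 23.35},
         {"word": "morals", "start": 23.8, "end": 24.1},
         {"word": "alley", "start": 24.6, "end": 24.9},
         {"word": "cat.", "start": 24.95, "end": 25.197},
     ]},
]

def find_word(segments, target):
    """Return (start, end) for every spoken occurrence of a word."""
    hits = []
    for seg in segments:
        for w in seg.get("words", []):
            if w["word"].strip(".,?!").lower() == target.lower():
                hits.append((w["start"], w["end"]))
    return hits

print(find_word(segments, "cat"))  # [(24.95, 25.197)]
```

From there you could hand the timestamps to ffmpeg to cut the exact clip, or line them up with video frames for subtitles.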
A faster option: Parakeet
NVIDIA's Parakeet is a newer speech model that's significantly faster than Whisper — especially on Macs via parakeet-mlx. It also gives you cleaner sentence-level output without needing a separate alignment step.
Here we combine Parakeet for transcription with pyannote for speaker diarization — "who said what?"
Setup: Diarization requires a free Hugging Face token (HF_TOKEN) and accepting the model licenses at pyannote/segmentation-3.0 and pyannote/speaker-diarization-community-1. It's kind of a pain to jump through the hoops but if you're vaguely technical it's definitely worth it.
In the example below, we try to use Parakeet MLX (it's fast on my mac!) but if that fails we go for onnx-asr, a flexible, portable tool that allows you to use different ASR (automatic speech recognition) models.
audio/transcribe-parakeet.py — Transcribe + diarize with Parakeet and pyannote
from pathlib import Path
from pyannote.audio import Pipeline
DATA = Path("data")
AUDIO = DATA / "rDXubdQdJYs.mp3"
try:
from parakeet_mlx import from_pretrained
print("Using parakeet-mlx...")
model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")
result = model.transcribe(str(AUDIO), chunk_duration=600, overlap_duration=15)
sentences = [{"start": s.start, "end": s.end, "text": s.text} for s in result.sentences]
except ImportError:
import onnx_asr
import ffmpeg
print("Using onnx-asr...")
WAV = AUDIO.with_suffix(".wav")
if not WAV.exists():
ffmpeg.input(str(AUDIO)).output(str(WAV), ar=16000, ac=1).run(quiet=True)
vad = onnx_asr.load_vad("silero")
model = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3").with_vad(vad)
result = model.recognize(str(WAV))
sentences = [{"start": s.start, "end": s.end, "text": s.text} for s in result]
print(f"Transcribed {len(sentences)} sentences")
Using parakeet-mlx...
Transcribed 17 sentences
Diarize with pyannote
Parakeet gives us what was said. Pyannote tells us who said it.
import torchaudio
import os
HF_TOKEN = os.environ["HF_TOKEN"]
waveform, sample_rate = torchaudio.load(str(AUDIO))
pipeline = Pipeline.from_pretrained(
"pyannote/speaker-diarization-community-1",
token=HF_TOKEN
)
diarization = pipeline({"waveform": waveform, "sample_rate": sample_rate})
Combine: match speakers to sentences
For each sentence, find which speaker was talking at its midpoint.
for s in sentences:
mid = (s["start"] + s["end"]) / 2
speaker = "UNKNOWN"
for turn, label in diarization.speaker_diarization:
if turn.start <= mid <= turn.end:
speaker = label
break
s["speaker"] = speaker
sentences[:5]
[{'start': 2.24,
'end': 7.840000000000001,
'text': " Relative to what we're gonna do with more border patrol and more uh asylum officers.",
'speaker': 'SPEAKER_01'},
{'start': 7.84,
'end': 8.72,
'text': ' President Trump.',
'speaker': 'SPEAKER_00'},
{'start': 8.72,
'end': 13.120000000000001,
'text': " I really don't know what he said at the end of that sentence, but I don't think he knows what he said either.",
'speaker': 'SPEAKER_00'},
{'start': 13.120000000000001,
'end': 17.2,
'text': " The only person on this stage is a convicted felon is the man I'm looking at right now.",
'speaker': 'SPEAKER_01'},
{'start': 17.2,
'end': 20.56,
'text': ' But when he talks about a convicted felon, his son is a convicted felon.',
'speaker': 'SPEAKER_00'}]
Display results
import pandas as pd
df = pd.DataFrame(sentences[:15])
df
| | start | end | text | speaker |
|---|---|---|---|---|
| 0 | 2.24 | 7.84 | Relative to what we're gonna do with more bor... | SPEAKER_01 |
| 1 | 7.84 | 8.72 | President Trump. | SPEAKER_00 |
| 2 | 8.72 | 13.12 | I really don't know what he said at the end o... | SPEAKER_00 |
| 3 | 13.12 | 17.20 | The only person on this stage is a convicted ... | SPEAKER_01 |
| 4 | 17.20 | 20.56 | But when he talks about a convicted felon, hi... | SPEAKER_00 |
| 5 | 20.80 | 22.72 | What what are you talking about? | SPEAKER_01 |
| 6 | 22.72 | 25.12 | You you you have the morals of an alley cat. | SPEAKER_01 |
| 7 | 25.12 | 27.44 | My son was not a loser, was not a sucker. | SPEAKER_01 |
| 8 | 27.44 | 29.52 | You're the sucker, you're the loser. | SPEAKER_01 |
| 9 | 29.52 | 33.36 | He's blaming inflation, and he's right, it's ... | UNKNOWN |
| 10 | 33.36 | 34.64 | He caused the inflation. | SPEAKER_00 |
| 11 | 35.04 | 46.32 | Excuse me, with um dealing with everything we... | UNKNOWN |
| 12 | 46.40 | 48.56 | Thank you, President uh Biden, President Trump. | SPEAKER_01 |
| 13 | 48.56 | 51.28 | Well, I took two tests, cognitive tests, I ac... | SPEAKER_00 |
| 14 | 51.28 | 53.52 | Go through the first five questions, he could... | SPEAKER_00 |
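One handy next step: collapse consecutive sentences from the same speaker into turns, so the transcript reads like a script instead of a pile of fragments. A minimal sketch, using the same `{start, end, text, speaker}` dicts we built above (the sample data is abbreviated from the output):

```python
def merge_turns(sentences):
    """Collapse consecutive same-speaker sentences into single turns."""
    turns = []
    for s in sentences:
        if turns and turns[-1]["speaker"] == s["speaker"]:
            # Same speaker is still talking: extend the current turn
            turns[-1]["end"] = s["end"]
            turns[-1]["text"] += " " + s["text"].strip()
        else:
            turns.append({"speaker": s["speaker"], "start": s["start"],
                          "end": s["end"], "text": s["text"].strip()})
    return turns

sample = [
    {"start": 7.84, "end": 8.72, "text": "President Trump.", "speaker": "SPEAKER_00"},
    {"start": 8.72, "end": 13.12, "text": "I really don't know what he said.", "speaker": "SPEAKER_00"},
    {"start": 13.12, "end": 17.2, "text": "The only person on this stage...", "speaker": "SPEAKER_01"},
]
for t in merge_turns(sample):
    print(f"{t['speaker']}: {t['text']}")
```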
Transcription and speaker identification (WhisperX)
WhisperX can do the same thing — transcribe and separate speakers. It's slower than Parakeet but widely used and runs everywhere.
audio/whisperx-diarize.py — Full WhisperX pipeline: transcribe, align, diarize (who said what)
from pathlib import Path
import os
os.environ["TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD"] = "1"
import torch
import whisperx
from whisperx.diarize import DiarizationPipeline
DATA = Path("data")
AUDIO = DATA / "rDXubdQdJYs.mp3"
MODEL, LANGUAGE = "large-v3", "en"
HF_TOKEN = os.environ["HF_TOKEN"]
device = "cuda" if torch.cuda.is_available() else "cpu"
compute_type = "float16" if device == "cuda" else "int8"
# Step 1: Transcribe
model = whisperx.load_model(MODEL, device, compute_type=compute_type)
audio = whisperx.load_audio(str(AUDIO))
result = model.transcribe(audio, batch_size=16)
# Step 2: Align
model_a, metadata = whisperx.load_align_model(language_code=LANGUAGE, device=device)
result = whisperx.align(
result["segments"], model_a, metadata, audio, device,
return_char_alignments=False,
)
# Step 3: Diarize (split speakers)
diarize_model = DiarizationPipeline(token=HF_TOKEN, device=device)
result = whisperx.assign_word_speakers(diarize_model(audio), result)
# Print speaker-labeled segments
current_speaker = None
for seg in result["segments"]:
speaker = seg.get("speaker", "UNKNOWN")
if speaker != current_speaker:
print(f"\n--- {speaker} ---")
current_speaker = speaker
start = f"{int(seg['start']//60):02d}:{seg['start']%60:05.2f}"
end = f"{int(seg['end']//60):02d}:{seg['end']%60:05.2f}"
print(f" [{start} - {end}] {seg['text'].strip()}")
--- SPEAKER_01 ---
  [00:02.46 - 00:07.99] relative to what we're going to do with more border patrol and more asylum officers.

--- SPEAKER_00 ---
  [00:08.01 - 00:09.09] President Trump?
  [00:09.11 - 00:11.21] I really don't know what he said at the end of that sentence.
  [00:11.55 - 00:12.90] I don't think he knows what he said either.

--- SPEAKER_01 ---
  [00:12.94 - 00:16.90] The only person in this stage is a convicted felon is the man I'm looking at right now.

--- SPEAKER_00 ---
  [00:17.32 - 00:20.63] But when he talks about a convicted felon, his son is a convicted felon.

--- SPEAKER_01 ---
  [00:20.75 - 00:22.29] What are you talking about?
  [00:22.91 - 00:25.20] You have the morals of an alley cat.
  [00:25.22 - 00:27.20] My son was not a loser, was not a sucker.
  [00:27.48 - 00:28.18] You're the sucker.
  [00:28.34 - 00:29.02] You're the loser.

--- SPEAKER_00 ---
  [00:29.70 - 00:32.07] He's blaming inflation, and he's right.
  [00:32.09 - 00:32.95] It's been very bad.
  [00:33.61 - 00:34.75] He caused the inflation.

--- SPEAKER_01 ---
  [00:35.07 - 00:46.51] Excuse me, with dealing with everything we have to do with... Look, if we finally beat Medicare...
  [00:46.53 - 00:47.69] Thank you, President Biden.
  [00:47.71 - 00:48.41] President Trump?

--- SPEAKER_00 ---
  [00:48.58 - 00:50.57] Well, I took two tests, cognitive tests.
  [00:50.65 - 00:51.26] I aced them.
  [00:51.28 - 00:52.62] Go through the first five questions.
  [00:52.64 - 00:53.25] He couldn't do it.
  [00:53.61 - 00:56.65] This guy's three years younger and a lot less competent.

--- SPEAKER_01 ---
  [00:56.67 - 00:58.06] I think that just look at the record.
  [00:58.18 - 00:59.06] Look at what I've done.
It isn't perfect, as you can tell! Let's drop the segments into a dataframe so they're easier to scan.
import pandas as pd
df = pd.DataFrame(result["segments"])
df
| | start | end | text | words | speaker |
|---|---|---|---|---|---|
| 0 | 2.461 | 7.990 | relative to what we're going to do with more ... | [{'word': 'relative', 'start': 2.461, 'end': 2... | SPEAKER_01 |
| 1 | 8.010 | 9.091 | President Trump? | [{'word': 'President', 'start': 8.01, 'end': 8... | SPEAKER_00 |
| 2 | 9.111 | 11.215 | I really don't know what he said at the end of... | [{'word': 'I', 'start': 9.111, 'end': 9.151, '... | SPEAKER_00 |
| 3 | 11.555 | 12.897 | I don't think he knows what he said either. | [{'word': 'I', 'start': 11.555, 'end': 11.575,... | SPEAKER_00 |
| 4 | 12.937 | 16.904 | The only person in this stage is a convicted f... | [{'word': 'The', 'start': 12.937, 'end': 13.23... | SPEAKER_01 |
| 5 | 17.324 | 20.629 | But when he talks about a convicted felon, his... | [{'word': 'But', 'start': 17.324, 'end': 17.42... | SPEAKER_00 |
| 6 | 20.750 | 22.292 | What are you talking about? | [{'word': 'What', 'start': 20.75, 'end': 20.87... | SPEAKER_01 |
| 7 | 22.913 | 25.197 | You have the morals of an alley cat. | [{'word': 'You', 'start': 22.913, 'end': 23.35... | SPEAKER_01 |
| 8 | 25.217 | 27.200 | My son was not a loser, was not a sucker. | [{'word': 'My', 'start': 25.217, 'end': 25.357... | SPEAKER_01 |
| 9 | 27.481 | 28.182 | You're the sucker. | [{'word': 'You're', 'start': 27.481, 'end': 27... | SPEAKER_01 |
| 10 | 28.342 | 29.023 | You're the loser. | [{'word': 'You're', 'start': 28.342, 'end': 28... | SPEAKER_01 |
| 11 | 29.704 | 32.068 | He's blaming inflation, and he's right. | [{'word': 'He's', 'start': 29.704, 'end': 29.9... | SPEAKER_00 |
| 12 | 32.088 | 32.950 | It's been very bad. | [{'word': 'It's', 'start': 32.088, 'end': 32.1... | SPEAKER_00 |
| 13 | 33.611 | 34.753 | He caused the inflation. | [{'word': 'He', 'start': 33.611, 'end': 33.771... | SPEAKER_00 |
| 14 | 35.073 | 46.512 | Excuse me, with dealing with everything we hav... | [{'word': 'Excuse', 'start': 35.073, 'end': 35... | SPEAKER_01 |
| 15 | 46.532 | 47.694 | Thank you, President Biden. | [{'word': 'Thank', 'start': 46.532, 'end': 46.... | SPEAKER_01 |
| 16 | 47.714 | 48.415 | President Trump? | [{'word': 'President', 'start': 47.714, 'end':... | SPEAKER_01 |
| 17 | 48.580 | 50.572 | Well, I took two tests, cognitive tests. | [{'word': 'Well,', 'start': 48.58, 'end': 48.8... | SPEAKER_00 |
| 18 | 50.652 | 51.256 | I aced them. | [{'word': 'I', 'start': 50.652, 'end': 50.713,... | SPEAKER_00 |
| 19 | 51.276 | 52.624 | Go through the first five questions. | [{'word': 'Go', 'start': 51.276, 'end': 51.397... | SPEAKER_00 |
| 20 | 52.644 | 53.248 | He couldn't do it. | [{'word': 'He', 'start': 52.644, 'end': 52.725... | SPEAKER_00 |
| 21 | 53.610 | 56.649 | This guy's three years younger and a lot less ... | [{'word': 'This', 'start': 53.61, 'end': 53.79... | SPEAKER_00 |
| 22 | 56.669 | 58.057 | I think that just look at the record. | [{'word': 'I', 'start': 56.669, 'end': 56.749,... | SPEAKER_01 |
| 23 | 58.178 | 59.063 | Look at what I've done. | [{'word': 'Look', 'start': 58.178, 'end': 58.2... | SPEAKER_01 |
"Speaker 1 said X at 0:42, Speaker 2 said Y at 1:15." Now you have a searchable, speaker-labeled transcript!
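Searching it really is just a loop over the segments. Here's a sketch of a keyword search that prints hits in that "who said what, when" style — the sample segments are made up to match the output shape above, and `mmss` and `search` are hypothetical helpers, not WhisperX functions.

```python
def mmss(seconds):
    """Format seconds as M:SS."""
    return f"{int(seconds // 60)}:{int(seconds % 60):02d}"

def search(segments, keyword):
    """Find every segment mentioning a keyword, with speaker and timestamp."""
    return [
        f"{seg.get('speaker', 'UNKNOWN')} said {seg['text'].strip()!r} at {mmss(seg['start'])}"
        for seg in segments
        if keyword.lower() in seg["text"].lower()
    ]

# Made-up segments matching the speaker-labeled output shape
segments = [
    {"start": 29.704, "end": 32.068, "text": "He's blaming inflation, and he's right.", "speaker": "SPEAKER_00"},
    {"start": 33.611, "end": 34.753, "text": "He caused the inflation.", "speaker": "SPEAKER_00"},
    {"start": 22.913, "end": 25.197, "text": "You have the morals of an alley cat.", "speaker": "SPEAKER_01"},
]
for hit in search(segments, "inflation"):
    print(hit)
```

Swap in the real `result["segments"]` and you have a one-liner way to answer "when did anyone mention Medicare?"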
If you'd prefer not to write code (lol), try MacWhisper, Handy, VoiceInk, or Buzz.
Using the cloud
Sometimes your computer is slow, or you don't care about privacy or cost, and you just want the job done. Here's the same audio, but using Gemini (Google's LLM). We make it send back structured data — each utterance gets a speaker, timestamps, text, and a little sentiment to mix things up. Same Pydantic AI pattern as the image notebooks!
audio/gemini-diarize.py — Structured transcription with speaker labels via Gemini (cloud alternative to WhisperX)
from pathlib import Path
from pydantic import BaseModel, Field
from pydantic_ai import Agent, BinaryContent
from typing import Literal
DATA = Path("data")
AUDIO = DATA / "rDXubdQdJYs.mp3"
MODEL = "google-gla:gemini-2.5-flash"
class Utterance(BaseModel):
speaker: str = Field(description="Speaker label (e.g., Speaker 1, Speaker 2)")
start: str = Field(description="Start timestamp MM:SS")
end: str = Field(description="End timestamp MM:SS")
text: str = Field(description="What was said")
sentiment: Literal[
"positive", "negative", "neutral"
] = Field(description="Sentiment of the utterance")
agent = Agent(MODEL, output_type=list[Utterance])
result = agent.run_sync([
"Transcribe this audio with speaker labels and timestamps for each utterance.",
BinaryContent(data=AUDIO.read_bytes(), media_type="audio/mpeg"),
])
# Each utterance is a typed object: speaker, timestamps, text. Same Pydantic pattern as images.
for u in result.output:
print(f"[{u.start} - {u.end}] {u.speaker}: {u.text}")
[00:02 - 00:07] Speaker A: relative to what we're going to do with more border patrol and more asylum officers.
[00:08 - 00:08] Speaker B: President Trump.
[00:08 - 00:12] Speaker C: I really don't know what he said at the end of that sentence. I don't think he knows what he said either.
[00:13 - 00:16] Speaker D: The only person at this stage is a convicted felon, this man I'm looking at right now.
[00:17 - 00:20] Speaker C: But when he talks about a convicted felon, his son is a convicted felon.
[00:20 - 00:22] Speaker D: What? What are you talking about?
[00:22 - 00:24] Speaker D: You You have the morals of an alley cat.
[00:25 - 00:29] Speaker D: My son was not a loser, was not a sucker. You're the sucker. You're the loser.
[00:29 - 00:32] Speaker C: He's blaming inflation and he's right, it's been very bad.
[00:33 - 00:34] Speaker C: He caused the inflation.
[00:35 - 00:45] Speaker A: Excuse me, with um dealing with everything we have to do with uh look if we finally beat Medicare.
[00:46 - 00:48] Speaker B: Thank you, President Biden, President Trump.
[00:48 - 00:53] Speaker C: Well, I took two tests, cognitive test, I aced them. Go through the first five questions, he couldn't do it.
[00:53 - 00:58] Speaker D: This guy's three years younger and a lot less competent. I think that this look at the record, look at what I've done.
Cloud tradeoff: faster, structured output built-in, but your audio goes to Google. Most newsrooms will use both local and cloud depending on the sensitivity of the material.
This is how real investigations work:
- Documented examined hundreds of TikTok videos by extracting audio and transcribing with Whisper.
- Público did the same with 7,616 health TikToks, then used an LLM to pull verifiable claims from the transcripts.
- Hearst built Assembly to transcribe 13,000+ hours of government meetings with Whisper and surface keywords via alerts.
- Chalkbeat uses LocalLens to monitor 80 school districts across 30 states the same way.
Up next: Video is just images + audio + time. Decompose it, then use the tools you already have.