Audio
Images and documents gave you structured data. Audio is another way to get text — and once it's text, you already know what to do with it. This is how you can do story research like Misinformation on TikTok: How 'Documented' Examined Hundreds of Videos in Different Languages or The Second Trump Presidency, Brought to you by YouTubers.
If you don't want to code... just use NotebookLM. It will do everything for you. Just go to sleep right now.
Transcription with Whisper
Once upon a time OpenAI released an open model named Whisper. It's great! Very very popular.
There are newer models out there — parakeet-mlx is blazing fast on Macs — but Whisper is very easy to use so everyone (and I mean everyone) uses it.
When you use Whisper, you have to make some decisions:
- Which packaging of Whisper: Whisper is free to distribute, so a zillion tools are built on top of it. Below we're using WhisperX which is a feature-packed tool built on top of Whisper.
- Which version of the model: Like tiny, base, large... bigger is better, but slower! Turbo is the best combination of speed and accuracy.
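To make that tradeoff concrete, here's a rough sketch of the size ladder (parameter counts are from the Whisper README) with a tiny model-picking helper. The helper and its budget thresholds are purely illustrative — they're not part of Whisper or WhisperX.

```python
# Whisper model sizes, smallest to largest (parameter counts from the Whisper README)
WHISPER_MODELS = {
    "tiny": 39_000_000,
    "base": 74_000_000,
    "small": 244_000_000,
    "medium": 769_000_000,
    "turbo": 809_000_000,   # large-v3, optimized for speed
    "large": 1_550_000_000,
}

def pick_model(max_params):
    """Pick the biggest model that fits under a parameter budget (illustrative helper)."""
    fitting = [name for name, size in WHISPER_MODELS.items() if size <= max_params]
    return fitting[-1] if fitting else "tiny"

print(pick_model(100_000_000))    # a small budget gets you "base"
print(pick_model(1_000_000_000))  # room for "turbo"
```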
Let's practice on this Trump/Biden debate clip.
I saved it as an mp3 to make life a little easier.
audio/transcribe-whisperx.py — WhisperX segment-level aligned transcription
from pathlib import Path
import os
os.environ["TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD"] = "1"
import torch
import whisperx
DATA = Path("data")
AUDIO = DATA / "rDXubdQdJYs.mp3"
MODEL = "turbo"
LANGUAGE = "en"
# Decides whether your computer is fancy and powerful
device = "cuda" if torch.cuda.is_available() else "cpu"
compute_type = "float16" if device == "cuda" else "int8"
model = whisperx.load_model(MODEL, device, compute_type=compute_type)
audio = whisperx.load_audio(str(AUDIO))
result = model.transcribe(audio, batch_size=16)
# Print out the entire transcript
text = " ".join(seg["text"].strip() for seg in result["segments"])
print(text)
relative to what we're going to do with more border patrol and more asylum officers. President Trump? I really don't know what he said at the end of that sentence. I don't think he knows what he said either. The only person on this stage is a convicted felon is the man I'm looking at right now. But when he talks about a convicted felon, his son is a convicted felon. What are you talking about? You have the morals of an alley cat. My son was not a loser, was not a sucker. You're the sucker, you're the loser. He's blaming inflation, and he's right, it's been very bad. He caused the inflation. Excuse me, with dealing with everything we have to do with, look, if we finally beat Medicare. Thank you, President Biden. President Trump. Well, I took two tests, cognitive tests. I aced him. Go through the first five questions, he couldn't do it. This guy's three years younger and a lot less competent. I think that just look at the record, look at what I've done.
Getting exact timestamps
The basic transcription gives you segment-level timestamps. Alignment refines those to word-level timestamps — useful for speaker identification, precise clip cutting, or word-by-word subtitles.
model_a, metadata = whisperx.load_align_model(language_code=LANGUAGE, device=device)
result = whisperx.align(
result["segments"], model_a, metadata, audio, device,
return_char_alignments=False,
)
import pandas as pd
df = pd.DataFrame(result["segments"])
df
| | start | end | text | words |
|---|---|---|---|---|
| 0 | 2.461 | 7.990 | relative to what we're going to do with more ... | [{'word': 'relative', 'start': 2.461, 'end': 2... |
| 1 | 8.010 | 9.091 | President Trump? | [{'word': 'President', 'start': 8.01, 'end': 8... |
| 2 | 9.111 | 11.215 | I really don't know what he said at the end of... | [{'word': 'I', 'start': 9.111, 'end': 9.151, '... |
| 3 | 11.555 | 12.897 | I don't think he knows what he said either. | [{'word': 'I', 'start': 11.555, 'end': 11.575,... |
| 4 | 12.937 | 16.904 | The only person on this stage is a convicted f... | [{'word': 'The', 'start': 12.937, 'end': 13.23... |
| 5 | 17.324 | 20.629 | But when he talks about a convicted felon, his... | [{'word': 'But', 'start': 17.324, 'end': 17.42... |
| 6 | 20.750 | 22.292 | What are you talking about? | [{'word': 'What', 'start': 20.75, 'end': 20.87... |
| 7 | 22.913 | 25.197 | You have the morals of an alley cat. | [{'word': 'You', 'start': 22.913, 'end': 23.35... |
| 8 | 25.217 | 27.200 | My son was not a loser, was not a sucker. | [{'word': 'My', 'start': 25.217, 'end': 25.357... |
| 9 | 27.481 | 29.023 | You're the sucker, you're the loser. | [{'word': 'You're', 'start': 27.481, 'end': 27... |
| 10 | 29.704 | 32.950 | He's blaming inflation, and he's right, it's b... | [{'word': 'He's', 'start': 29.704, 'end': 29.9... |
| 11 | 33.611 | 34.753 | He caused the inflation. | [{'word': 'He', 'start': 33.611, 'end': 33.771... |
| 12 | 35.073 | 46.031 | Excuse me, with dealing with everything we hav... | [{'word': 'Excuse', 'start': 35.073, 'end': 35... |
| 13 | 46.532 | 47.694 | Thank you, President Biden. | [{'word': 'Thank', 'start': 46.532, 'end': 46.... |
| 14 | 47.714 | 48.415 | President Trump. | [{'word': 'President', 'start': 47.714, 'end':... |
| 15 | 48.580 | 50.572 | Well, I took two tests, cognitive tests. | [{'word': 'Well,', 'start': 48.58, 'end': 48.8... |
| 16 | 50.652 | 51.256 | I aced him. | [{'word': 'I', 'start': 50.652, 'end': 50.713,... |
| 17 | 51.276 | 53.248 | Go through the first five questions, he couldn... | [{'word': 'Go', 'start': 51.276, 'end': 51.397... |
| 18 | 53.610 | 56.649 | This guy's three years younger and a lot less ... | [{'word': 'This', 'start': 53.61, 'end': 53.79... |
| 19 | 56.669 | 59.063 | I think that just look at the record, look at ... | [{'word': 'I', 'start': 56.669, 'end': 56.749,... |
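Those word-level timestamps are what make precise clip cutting possible. Here's a sketch of scanning aligned segments for a word and getting back the exact moment it was spoken — the sample data below is hand-made to mimic the shape of WhisperX's aligned output, with only a few words filled in.

```python
# A hand-made sample matching WhisperX's aligned output shape (abbreviated)
segments = [
    {"text": "You have the morals of an alley cat.",
     "words": [
         {"word": "You", "start": 22.913, "end": 23.35},
         {"word": "morals", "start": 23.8, "end": 24.1},
         {"word": "alley", "start": 24.6, "end": 24.9},
         {"word": "cat.", "start": 24.95, "end": 25.197},
     ]},
]

def find_word(segments, target):
    """Return (start, end) for every spoken occurrence of a word."""
    hits = []
    for seg in segments:
        for w in seg.get("words", []):
            if w["word"].strip(".,?!").lower() == target.lower():
                hits.append((w["start"], w["end"]))
    return hits

print(find_word(segments, "cat"))  # [(24.95, 25.197)]
```

From there you could hand the timestamps to ffmpeg to cut the exact clip, or line them up with video frames for subtitles.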
A faster option: Parakeet
NVIDIA's Parakeet is a newer speech model that's significantly faster than Whisper — especially on Macs via parakeet-mlx. It also gives you cleaner sentence-level output without needing a separate alignment step.
Here we combine Parakeet for transcription with pyannote for speaker diarization — "who said what?"
Setup: Diarization requires a free Hugging Face token (HF_TOKEN) and accepting the model licenses at pyannote/segmentation-3.0 and pyannote/speaker-diarization-community-1. It's kind of a pain to jump through the hoops but if you're vaguely technical it's definitely worth it.
In the example below, we try to use Parakeet MLX (it's fast on my mac!) but if that fails we go for onnx-asr, a flexible, portable tool that allows you to use different ASR (automatic speech recognition) models.
audio/transcribe-parakeet.py — Transcribe + diarize with Parakeet and pyannote
from pathlib import Path
from pyannote.audio import Pipeline
DATA = Path("data")
AUDIO = DATA / "rDXubdQdJYs.mp3"
try:
from parakeet_mlx import from_pretrained
print("Using parakeet-mlx...")
model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")
result = model.transcribe(str(AUDIO), chunk_duration=600, overlap_duration=15)
sentences = [{"start": s.start, "end": s.end, "text": s.text} for s in result.sentences]
except ImportError:
import onnx_asr
import ffmpeg
print("Using onnx-asr...")
WAV = AUDIO.with_suffix(".wav")
if not WAV.exists():
ffmpeg.input(str(AUDIO)).output(str(WAV), ar=16000, ac=1).run(quiet=True)
vad = onnx_asr.load_vad("silero")
model = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3").with_vad(vad)
result = model.recognize(str(WAV))
sentences = [{"start": s.start, "end": s.end, "text": s.text} for s in result]
print(f"Transcribed {len(sentences)} sentences")
Using parakeet-mlx...
Transcribed 17 sentences
Diarize with pyannote
Parakeet gives us what was said. Pyannote tells us who said it.
import torchaudio
import os
HF_TOKEN = os.environ["HF_TOKEN"]
waveform, sample_rate = torchaudio.load(str(AUDIO))
pipeline = Pipeline.from_pretrained(
"pyannote/speaker-diarization-community-1",
token=HF_TOKEN
)
diarization = pipeline({"waveform": waveform, "sample_rate": sample_rate})
Combine: match speakers to sentences
For each sentence, find which speaker was talking at its midpoint.
for s in sentences:
mid = (s["start"] + s["end"]) / 2
speaker = "UNKNOWN"
for turn, label in diarization.speaker_diarization:
if turn.start <= mid <= turn.end:
speaker = label
break
s["speaker"] = speaker
sentences[:5]
[{'start': 2.24,
'end': 7.840000000000001,
'text': " Relative to what we're gonna do with more border patrol and more uh asylum officers.",
'speaker': 'SPEAKER_01'},
{'start': 7.84,
'end': 8.72,
'text': ' President Trump.',
'speaker': 'SPEAKER_00'},
{'start': 8.72,
'end': 13.120000000000001,
'text': " I really don't know what he said at the end of that sentence, but I don't think he knows what he said either.",
'speaker': 'SPEAKER_00'},
{'start': 13.120000000000001,
'end': 17.2,
'text': " The only person on this stage is a convicted felon is the man I'm looking at right now.",
'speaker': 'SPEAKER_01'},
{'start': 17.2,
'end': 20.56,
'text': ' But when he talks about a convicted felon, his son is a convicted felon.',
'speaker': 'SPEAKER_00'}]
Display results
import pandas as pd
df = pd.DataFrame(sentences[:15])
df
| | start | end | text | speaker |
|---|---|---|---|---|
| 0 | 2.24 | 7.84 | Relative to what we're gonna do with more bor... | SPEAKER_01 |
| 1 | 7.84 | 8.72 | President Trump. | SPEAKER_00 |
| 2 | 8.72 | 13.12 | I really don't know what he said at the end o... | SPEAKER_00 |
| 3 | 13.12 | 17.20 | The only person on this stage is a convicted ... | SPEAKER_01 |
| 4 | 17.20 | 20.56 | But when he talks about a convicted felon, hi... | SPEAKER_00 |
| 5 | 20.80 | 22.72 | What what are you talking about? | SPEAKER_01 |
| 6 | 22.72 | 25.12 | You you you have the morals of an alley cat. | SPEAKER_01 |
| 7 | 25.12 | 27.44 | My son was not a loser, was not a sucker. | SPEAKER_01 |
| 8 | 27.44 | 29.52 | You're the sucker, you're the loser. | SPEAKER_01 |
| 9 | 29.52 | 33.36 | He's blaming inflation, and he's right, it's ... | UNKNOWN |
| 10 | 33.36 | 34.64 | He caused the inflation. | SPEAKER_00 |
| 11 | 35.04 | 46.32 | Excuse me, with um dealing with everything we... | UNKNOWN |
| 12 | 46.40 | 48.56 | Thank you, President uh Biden, President Trump. | SPEAKER_01 |
| 13 | 48.56 | 51.28 | Well, I took two tests, cognitive tests, I ac... | SPEAKER_00 |
| 14 | 51.28 | 53.52 | Go through the first five questions, he could... | SPEAKER_00 |
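One handy next step: collapse consecutive sentences from the same speaker into turns, so the transcript reads like a script instead of a pile of fragments. A minimal sketch, using the same `{start, end, text, speaker}` dicts we built above (the sample data is abbreviated from the output):

```python
def merge_turns(sentences):
    """Collapse consecutive same-speaker sentences into single turns."""
    turns = []
    for s in sentences:
        if turns and turns[-1]["speaker"] == s["speaker"]:
            # Same speaker is still talking: extend the current turn
            turns[-1]["end"] = s["end"]
            turns[-1]["text"] += " " + s["text"].strip()
        else:
            turns.append({"speaker": s["speaker"], "start": s["start"],
                          "end": s["end"], "text": s["text"].strip()})
    return turns

sample = [
    {"start": 7.84, "end": 8.72, "text": "President Trump.", "speaker": "SPEAKER_00"},
    {"start": 8.72, "end": 13.12, "text": "I really don't know what he said.", "speaker": "SPEAKER_00"},
    {"start": 13.12, "end": 17.2, "text": "The only person on this stage...", "speaker": "SPEAKER_01"},
]
for t in merge_turns(sample):
    print(f"{t['speaker']}: {t['text']}")
```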
Transcription and speaker identification (WhisperX)
WhisperX can do the same thing — transcribe and separate speakers. It's slower than Parakeet but widely used and runs everywhere.
audio/whisperx-diarize.py — Full WhisperX pipeline: transcribe, align, diarize (who said what)
from pathlib import Path
import os
os.environ["TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD"] = "1"
import torch
import whisperx
from whisperx.diarize import DiarizationPipeline
DATA = Path("data")
AUDIO = DATA / "rDXubdQdJYs.mp3"
MODEL, LANGUAGE = "large-v3", "en"
HF_TOKEN = os.environ["HF_TOKEN"]
device = "cuda" if torch.cuda.is_available() else "cpu"
compute_type = "float16" if device == "cuda" else "int8"
# Step 1: Transcribe
model = whisperx.load_model(MODEL, device, compute_type=compute_type)
audio = whisperx.load_audio(str(AUDIO))
result = model.transcribe(audio, batch_size=16)
# Step 2: Align
model_a, metadata = whisperx.load_align_model(language_code=LANGUAGE, device=device)
result = whisperx.align(
result["segments"], model_a, metadata, audio, device,
return_char_alignments=False,
)
# Step 3: Diarize (split speakers)
diarize_model = DiarizationPipeline(token=HF_TOKEN, device=device)
result = whisperx.assign_word_speakers(diarize_model(audio), result)
# Print speaker-labeled segments
current_speaker = None
for seg in result["segments"]:
speaker = seg.get("speaker", "UNKNOWN")
if speaker != current_speaker:
print(f"\n--- {speaker} ---")
current_speaker = speaker
start = f"{int(seg['start']//60):02d}:{seg['start']%60:05.2f}"
end = f"{int(seg['end']//60):02d}:{seg['end']%60:05.2f}"
print(f" [{start} - {end}] {seg['text'].strip()}")
--- SPEAKER_01 ---
  [00:02.46 - 00:07.99] relative to what we're going to do with more border patrol and more asylum officers.

--- SPEAKER_00 ---
  [00:08.01 - 00:09.09] President Trump?
  [00:09.11 - 00:11.21] I really don't know what he said at the end of that sentence.
  [00:11.55 - 00:12.90] I don't think he knows what he said either.

--- SPEAKER_01 ---
  [00:12.94 - 00:16.90] The only person in this stage is a convicted felon is the man I'm looking at right now.

--- SPEAKER_00 ---
  [00:17.32 - 00:20.63] But when he talks about a convicted felon, his son is a convicted felon.

--- SPEAKER_01 ---
  [00:20.75 - 00:22.29] What are you talking about?
  [00:22.91 - 00:25.20] You have the morals of an alley cat.
  [00:25.22 - 00:27.20] My son was not a loser, was not a sucker.
  [00:27.48 - 00:28.18] You're the sucker.
  [00:28.34 - 00:29.02] You're the loser.

--- SPEAKER_00 ---
  [00:29.70 - 00:32.07] He's blaming inflation, and he's right.
  [00:32.09 - 00:32.95] It's been very bad.
  [00:33.61 - 00:34.75] He caused the inflation.

--- SPEAKER_01 ---
  [00:35.07 - 00:46.51] Excuse me, with dealing with everything we have to do with... Look, if we finally beat Medicare...
  [00:46.53 - 00:47.69] Thank you, President Biden.
  [00:47.71 - 00:48.41] President Trump?

--- SPEAKER_00 ---
  [00:48.58 - 00:50.57] Well, I took two tests, cognitive tests.
  [00:50.65 - 00:51.26] I aced them.
  [00:51.28 - 00:52.62] Go through the first five questions.
  [00:52.64 - 00:53.25] He couldn't do it.
  [00:53.61 - 00:56.65] This guy's three years younger and a lot less competent.

--- SPEAKER_01 ---
  [00:56.67 - 00:58.06] I think that just look at the record.
  [00:58.18 - 00:59.06] Look at what I've done.
It isn't perfect, as you can tell! Let's drop the segments into a dataframe so they're easier to scan.
import pandas as pd
df = pd.DataFrame(result["segments"])
df
| | start | end | text | words | speaker |
|---|---|---|---|---|---|
| 0 | 2.461 | 7.990 | relative to what we're going to do with more ... | [{'word': 'relative', 'start': 2.461, 'end': 2... | SPEAKER_01 |
| 1 | 8.010 | 9.091 | President Trump? | [{'word': 'President', 'start': 8.01, 'end': 8... | SPEAKER_00 |
| 2 | 9.111 | 11.215 | I really don't know what he said at the end of... | [{'word': 'I', 'start': 9.111, 'end': 9.151, '... | SPEAKER_00 |
| 3 | 11.555 | 12.897 | I don't think he knows what he said either. | [{'word': 'I', 'start': 11.555, 'end': 11.575,... | SPEAKER_00 |
| 4 | 12.937 | 16.904 | The only person in this stage is a convicted f... | [{'word': 'The', 'start': 12.937, 'end': 13.23... | SPEAKER_01 |
| 5 | 17.324 | 20.629 | But when he talks about a convicted felon, his... | [{'word': 'But', 'start': 17.324, 'end': 17.42... | SPEAKER_00 |
| 6 | 20.750 | 22.292 | What are you talking about? | [{'word': 'What', 'start': 20.75, 'end': 20.87... | SPEAKER_01 |
| 7 | 22.913 | 25.197 | You have the morals of an alley cat. | [{'word': 'You', 'start': 22.913, 'end': 23.35... | SPEAKER_01 |
| 8 | 25.217 | 27.200 | My son was not a loser, was not a sucker. | [{'word': 'My', 'start': 25.217, 'end': 25.357... | SPEAKER_01 |
| 9 | 27.481 | 28.182 | You're the sucker. | [{'word': 'You're', 'start': 27.481, 'end': 27... | SPEAKER_01 |
| 10 | 28.342 | 29.023 | You're the loser. | [{'word': 'You're', 'start': 28.342, 'end': 28... | SPEAKER_01 |
| 11 | 29.704 | 32.068 | He's blaming inflation, and he's right. | [{'word': 'He's', 'start': 29.704, 'end': 29.9... | SPEAKER_00 |
| 12 | 32.088 | 32.950 | It's been very bad. | [{'word': 'It's', 'start': 32.088, 'end': 32.1... | SPEAKER_00 |
| 13 | 33.611 | 34.753 | He caused the inflation. | [{'word': 'He', 'start': 33.611, 'end': 33.771... | SPEAKER_00 |
| 14 | 35.073 | 46.512 | Excuse me, with dealing with everything we hav... | [{'word': 'Excuse', 'start': 35.073, 'end': 35... | SPEAKER_01 |
| 15 | 46.532 | 47.694 | Thank you, President Biden. | [{'word': 'Thank', 'start': 46.532, 'end': 46.... | SPEAKER_01 |
| 16 | 47.714 | 48.415 | President Trump? | [{'word': 'President', 'start': 47.714, 'end':... | SPEAKER_01 |
| 17 | 48.580 | 50.572 | Well, I took two tests, cognitive tests. | [{'word': 'Well,', 'start': 48.58, 'end': 48.8... | SPEAKER_00 |
| 18 | 50.652 | 51.256 | I aced them. | [{'word': 'I', 'start': 50.652, 'end': 50.713,... | SPEAKER_00 |
| 19 | 51.276 | 52.624 | Go through the first five questions. | [{'word': 'Go', 'start': 51.276, 'end': 51.397... | SPEAKER_00 |
| 20 | 52.644 | 53.248 | He couldn't do it. | [{'word': 'He', 'start': 52.644, 'end': 52.725... | SPEAKER_00 |
| 21 | 53.610 | 56.649 | This guy's three years younger and a lot less ... | [{'word': 'This', 'start': 53.61, 'end': 53.79... | SPEAKER_00 |
| 22 | 56.669 | 58.057 | I think that just look at the record. | [{'word': 'I', 'start': 56.669, 'end': 56.749,... | SPEAKER_01 |
| 23 | 58.178 | 59.063 | Look at what I've done. | [{'word': 'Look', 'start': 58.178, 'end': 58.2... | SPEAKER_01 |
"Speaker 1 said X at 0:42, Speaker 2 said Y at 1:15." Now you have a searchable, speaker-labeled transcript!
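Searching it really is just a loop over the segments. Here's a sketch of a keyword search that prints hits in that "who said what, when" style — the sample segments are made up to match the output shape above, and `mmss` and `search` are hypothetical helpers, not WhisperX functions.

```python
def mmss(seconds):
    """Format seconds as M:SS."""
    return f"{int(seconds // 60)}:{int(seconds % 60):02d}"

def search(segments, keyword):
    """Find every segment mentioning a keyword, with speaker and timestamp."""
    return [
        f"{seg.get('speaker', 'UNKNOWN')} said {seg['text'].strip()!r} at {mmss(seg['start'])}"
        for seg in segments
        if keyword.lower() in seg["text"].lower()
    ]

# Made-up segments matching the speaker-labeled output shape
segments = [
    {"start": 29.704, "end": 32.068, "text": "He's blaming inflation, and he's right.", "speaker": "SPEAKER_00"},
    {"start": 33.611, "end": 34.753, "text": "He caused the inflation.", "speaker": "SPEAKER_00"},
    {"start": 22.913, "end": 25.197, "text": "You have the morals of an alley cat.", "speaker": "SPEAKER_01"},
]
for hit in search(segments, "inflation"):
    print(hit)
```

Swap in the real `result["segments"]` and you have a one-liner way to answer "when did anyone mention Medicare?"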
If you'd prefer not to write code (lol), try MacWhisper, Handy, VoiceInk, or Buzz.
Using the cloud
Sometimes your computer is slow, or you don't care about privacy or cost, and you just want the job done. Here's the same audio, but using Gemini (Google's LLM). We make it send back structured data — each utterance gets a speaker, timestamps, text, and a little sentiment to mix things up. Same Pydantic AI pattern as the image notebooks!
audio/gemini-diarize.py — Structured transcription with speaker labels via Gemini (cloud alternative to WhisperX)
from pathlib import Path
from pydantic import BaseModel, Field
from pydantic_ai import Agent, BinaryContent
from typing import Literal
DATA = Path("data")
AUDIO = DATA / "rDXubdQdJYs.mp3"
MODEL = "google-gla:gemini-2.5-flash"
class Utterance(BaseModel):
speaker: str = Field(description="Speaker label (e.g., Speaker 1, Speaker 2)")
start: str = Field(description="Start timestamp MM:SS")
end: str = Field(description="End timestamp MM:SS")
text: str = Field(description="What was said")
sentiment: Literal[
"positive", "negative", "neutral"
] = Field(description="Sentiment of the utterance")
agent = Agent(MODEL, output_type=list[Utterance])
result = agent.run_sync([
"Transcribe this audio with speaker labels and timestamps for each utterance.",
BinaryContent(data=AUDIO.read_bytes(), media_type="audio/mpeg"),
])
# Each utterance is a typed object: speaker, timestamps, text. Same Pydantic pattern as images.
for u in result.output:
print(f"[{u.start} - {u.end}] {u.speaker}: {u.text}")
[00:02 - 00:07] Speaker A: relative to what we're going to do with more border patrol and more asylum officers.
[00:08 - 00:08] Speaker B: President Trump.
[00:08 - 00:12] Speaker C: I really don't know what he said at the end of that sentence. I don't think he knows what he said either.
[00:13 - 00:16] Speaker D: The only person at this stage is a convicted felon, this man I'm looking at right now.
[00:17 - 00:20] Speaker C: But when he talks about a convicted felon, his son is a convicted felon.
[00:20 - 00:22] Speaker D: What? What are you talking about?
[00:22 - 00:24] Speaker D: You You have the morals of an alley cat.
[00:25 - 00:29] Speaker D: My son was not a loser, was not a sucker. You're the sucker. You're the loser.
[00:29 - 00:32] Speaker C: He's blaming inflation and he's right, it's been very bad.
[00:33 - 00:34] Speaker C: He caused the inflation.
[00:35 - 00:45] Speaker A: Excuse me, with um dealing with everything we have to do with uh look if we finally beat Medicare.
[00:46 - 00:48] Speaker B: Thank you, President Biden, President Trump.
[00:48 - 00:53] Speaker C: Well, I took two tests, cognitive test, I aced them. Go through the first five questions, he couldn't do it.
[00:53 - 00:58] Speaker D: This guy's three years younger and a lot less competent. I think that this look at the record, look at what I've done.
Cloud tradeoff: faster, structured output built-in, but your audio goes to Google. Most newsrooms will use both local and cloud depending on the sensitivity of the material.
This is how real investigations work:
- Documented examined hundreds of TikTok videos by extracting audio and transcribing with Whisper.
- Público did the same with 7,616 health TikToks, then used an LLM to pull verifiable claims from the transcripts.
- Hearst built Assembly to transcribe 13,000+ hours of government meetings with Whisper and surface keywords via alerts.
- Chalkbeat uses LocalLens to monitor 80 school districts across 30 states the same way.
Up next: Video is just images + audio + time. Decompose it, then use the tools you already have.