Audio

Images and documents gave you structured data. Audio is another way to get text, and once it's text, you already know what to do with it. This is how you can do story research like "Misinformation on TikTok: How 'Documented' Examined Hundreds of Videos in Different Languages" or "The Second Trump Presidency, Brought to you by YouTubers."

If you don't want to code... just use NotebookLM. It will do everything for you. Just go to sleep right now.

Transcription with Whisper

Once upon a time OpenAI released an open model named Whisper. It's great! Very very popular.

There are newer models out there — parakeet-mlx is blazing fast on Macs — but Whisper is very easy to use so everyone (and I mean everyone) uses it.

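When you use Whisper, you have to make some decisions. The script below makes them explicit: which model to load (MODEL), what language the audio is in (LANGUAGE), and whether to run on a GPU or a CPU (device and compute_type).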

Let's practice on this Trump/Biden debate clip.

I saved it as an mp3 to make life a little easier.

rDXubdQdJYs.mp3

audio/transcribe-whisperx.py — WhisperX segment-level aligned transcription

from pathlib import Path
import os
os.environ["TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD"] = "1"
import torch
import whisperx

DATA = Path("data")
AUDIO = DATA / "rDXubdQdJYs.mp3"
MODEL = "turbo"
LANGUAGE = "en"

# Decides whether your computer is fancy and powerful
device = "cuda" if torch.cuda.is_available() else "cpu"
compute_type = "float16" if device == "cuda" else "int8"

model = whisperx.load_model(MODEL, device, compute_type=compute_type)
audio = whisperx.load_audio(str(AUDIO))
result = model.transcribe(audio, batch_size=16)

# Print out the entire transcript
text = " ".join(seg["text"].strip() for seg in result["segments"])
print(text)
relative to what we're going to do with more border patrol and more asylum officers. President Trump? I really don't know what he said at the end of that sentence. I don't think he knows what he said either. The only person on this stage is a convicted felon is the man I'm looking at right now. But when he talks about a convicted felon, his son is a convicted felon. What are you talking about? You have the morals of an alley cat. My son was not a loser, was not a sucker. You're the sucker, you're the loser. He's blaming inflation, and he's right, it's been very bad. He caused the inflation. Excuse me, with dealing with everything we have to do with, look, if we finally beat Medicare. Thank you, President Biden. President Trump. Well, I took two tests, cognitive tests. I aced him. Go through the first five questions, he couldn't do it. This guy's three years younger and a lot less competent. I think that just look at the record, look at what I've done.

Getting exact timestamps

The basic transcription gives you segment-level timestamps. Alignment refines those to word-level timestamps — useful for speaker identification, precise clip cutting, or word-by-word subtitles.

model_a, metadata = whisperx.load_align_model(language_code=LANGUAGE, device=device)
result = whisperx.align(
    result["segments"], model_a, metadata, audio, device,
    return_char_alignments=False,
)

import pandas as pd

df = pd.DataFrame(result["segments"])
df
start end text words
0 2.461 7.990 relative to what we're going to do with more ... [{'word': 'relative', 'start': 2.461, 'end': 2...
1 8.010 9.091 President Trump? [{'word': 'President', 'start': 8.01, 'end': 8...
2 9.111 11.215 I really don't know what he said at the end of... [{'word': 'I', 'start': 9.111, 'end': 9.151, '...
3 11.555 12.897 I don't think he knows what he said either. [{'word': 'I', 'start': 11.555, 'end': 11.575,...
4 12.937 16.904 The only person on this stage is a convicted f... [{'word': 'The', 'start': 12.937, 'end': 13.23...
5 17.324 20.629 But when he talks about a convicted felon, his... [{'word': 'But', 'start': 17.324, 'end': 17.42...
6 20.750 22.292 What are you talking about? [{'word': 'What', 'start': 20.75, 'end': 20.87...
7 22.913 25.197 You have the morals of an alley cat. [{'word': 'You', 'start': 22.913, 'end': 23.35...
8 25.217 27.200 My son was not a loser, was not a sucker. [{'word': 'My', 'start': 25.217, 'end': 25.357...
9 27.481 29.023 You're the sucker, you're the loser. [{'word': 'You're', 'start': 27.481, 'end': 27...
10 29.704 32.950 He's blaming inflation, and he's right, it's b... [{'word': 'He's', 'start': 29.704, 'end': 29.9...
11 33.611 34.753 He caused the inflation. [{'word': 'He', 'start': 33.611, 'end': 33.771...
12 35.073 46.031 Excuse me, with dealing with everything we hav... [{'word': 'Excuse', 'start': 35.073, 'end': 35...
13 46.532 47.694 Thank you, President Biden. [{'word': 'Thank', 'start': 46.532, 'end': 46....
14 47.714 48.415 President Trump. [{'word': 'President', 'start': 47.714, 'end':...
15 48.580 50.572 Well, I took two tests, cognitive tests. [{'word': 'Well,', 'start': 48.58, 'end': 48.8...
16 50.652 51.256 I aced him. [{'word': 'I', 'start': 50.652, 'end': 50.713,...
17 51.276 53.248 Go through the first five questions, he couldn... [{'word': 'Go', 'start': 51.276, 'end': 51.397...
18 53.610 56.649 This guy's three years younger and a lot less ... [{'word': 'This', 'start': 53.61, 'end': 53.79...
19 56.669 59.063 I think that just look at the record, look at ... [{'word': 'I', 'start': 56.669, 'end': 56.749,...
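
Word-level timestamps make it easy to find the exact moment a word was spoken. Here is a minimal sketch, assuming the aligned segments keep the shape shown in the table above (a list of dicts, each with a `words` list). The sample data and the `find_word` helper are made up for illustration and are not part of WhisperX. Words that the aligner could not time may come back without `start`/`end` keys, so the sketch skips those.

```python
# Illustrative stand-in for result["segments"] after whisperx.align().
# The shape mirrors the table above, but the numbers are made up.
segments = [
    {
        "start": 8.0, "end": 9.1, "text": "President Trump?",
        "words": [
            {"word": "President", "start": 8.0, "end": 8.5},
            {"word": "Trump?", "start": 8.6, "end": 9.1},
        ],
    },
    {
        "start": 33.6, "end": 34.8, "text": "He caused the inflation.",
        "words": [
            {"word": "He", "start": 33.6, "end": 33.8},
            {"word": "caused", "start": 33.9, "end": 34.2},
            {"word": "the"},  # a word left without timestamps
            {"word": "inflation.", "start": 34.4, "end": 34.8},
        ],
    },
]


def find_word(segments, target):
    """Return (start, end) pairs for every timed occurrence of target."""
    target = target.lower()
    hits = []
    for seg in segments:
        for w in seg.get("words", []):
            # Strip punctuation so "inflation." matches "inflation"
            cleaned = w["word"].strip(".,?!\"'").lower()
            if cleaned == target and "start" in w and "end" in w:
                hits.append((w["start"], w["end"]))
    return hits


print(find_word(segments, "inflation"))  # [(34.4, 34.8)]
```

With the real transcript you would pass `result["segments"]` instead of the sample list, then use the returned times to cut a clip or jump to that spot in the audio.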

A faster option: Parakeet

NVIDIA's Parakeet is a newer speech model that's significantly faster than Whisper — especially on Macs via parakeet-mlx. It also gives you cleaner sentence-level output without needing a separate alignment step.

Here we combine Parakeet for transcription with pyannote for speaker diarization — "who said what?"

Setup: Diarization requires a free Hugging Face token (HF_TOKEN) and accepting the model licenses at pyannote/segmentation-3.0 and pyannote/speaker-diarization-community-1. It's kind of a pain to jump through the hoops but if you're vaguely technical it's definitely worth it.
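
Since the scripts below read the token from the `HF_TOKEN` environment variable, it can help to fail early with a readable message instead of a bare `KeyError`. This is a small sketch, not part of pyannote or Hugging Face; `get_hf_token` is a hypothetical helper name and the token value in the demo line is fake.

```python
import os


def get_hf_token(env=os.environ):
    """Return the Hugging Face token, or raise with a hint if it is missing."""
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set. Create a token on Hugging Face, accept the "
            "pyannote model licenses, then export HF_TOKEN before running."
        )
    return token


# Demo with a fake token; in a real script you would call get_hf_token()
# with no arguments so it reads your actual environment.
print(get_hf_token({"HF_TOKEN": "hf_example"}))
```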

In the example below, we try to use Parakeet MLX (it's fast on my Mac!), but if that fails we fall back to onnx-asr, a flexible, portable tool that lets you run different ASR (automatic speech recognition) models.

audio/transcribe-parakeet.py — Transcribe + diarize with Parakeet and pyannote

from pathlib import Path
from pyannote.audio import Pipeline

DATA = Path("data")

AUDIO = DATA / "rDXubdQdJYs.mp3"

try:
    from parakeet_mlx import from_pretrained
    print("Using parakeet-mlx...")
    model = from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")
    result = model.transcribe(str(AUDIO), chunk_duration=600, overlap_duration=15)
    sentences = [{"start": s.start, "end": s.end, "text": s.text} for s in result.sentences]
except ImportError:
    import onnx_asr
    import ffmpeg
    print("Using onnx-asr...")
    WAV = AUDIO.with_suffix(".wav")
    if not WAV.exists():
        ffmpeg.input(str(AUDIO)).output(str(WAV), ar=16000, ac=1).run(quiet=True)
    vad = onnx_asr.load_vad("silero")
    model = onnx_asr.load_model("nemo-parakeet-tdt-0.6b-v3").with_vad(vad)
    result = model.recognize(str(WAV))
    sentences = [{"start": s.start, "end": s.end, "text": s.text} for s in result]
print(f"Transcribed {len(sentences)} sentences")
Using parakeet-mlx...
Transcribed 17 sentences

Diarize with pyannote

Parakeet gives us what was said. Pyannote tells us who said it.

import torchaudio
import os

HF_TOKEN = os.environ["HF_TOKEN"]
waveform, sample_rate = torchaudio.load(str(AUDIO))
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-community-1",
    token=HF_TOKEN
)
diarization = pipeline({"waveform": waveform, "sample_rate": sample_rate})

Combine: match speakers to sentences

For each sentence, find which speaker was talking at its midpoint.

for s in sentences:
    mid = (s["start"] + s["end"]) / 2
    speaker = "UNKNOWN"
    for turn, label in diarization.speaker_diarization:
        if turn.start <= mid <= turn.end:
            speaker = label
            break
    s["speaker"] = speaker
sentences[:5]
[{'start': 2.24,
  'end': 7.840000000000001,
  'text': " Relative to what we're gonna do with more border patrol and more uh asylum officers.",
  'speaker': 'SPEAKER_01'},
 {'start': 7.84,
  'end': 8.72,
  'text': ' President Trump.',
  'speaker': 'SPEAKER_00'},
 {'start': 8.72,
  'end': 13.120000000000001,
  'text': " I really don't know what he said at the end of that sentence, but I don't think he knows what he said either.",
  'speaker': 'SPEAKER_00'},
 {'start': 13.120000000000001,
  'end': 17.2,
  'text': " The only person on this stage is a convicted felon is the man I'm looking at right now.",
  'speaker': 'SPEAKER_01'},
 {'start': 17.2,
  'end': 20.56,
  'text': ' But when he talks about a convicted felon, his son is a convicted felon.',
  'speaker': 'SPEAKER_00'}]

Display results

import pandas as pd
df = pd.DataFrame(sentences[:15])
df
start end text speaker
0 2.24 7.84 Relative to what we're gonna do with more bor... SPEAKER_01
1 7.84 8.72 President Trump. SPEAKER_00
2 8.72 13.12 I really don't know what he said at the end o... SPEAKER_00
3 13.12 17.20 The only person on this stage is a convicted ... SPEAKER_01
4 17.20 20.56 But when he talks about a convicted felon, hi... SPEAKER_00
5 20.80 22.72 What what are you talking about? SPEAKER_01
6 22.72 25.12 You you you have the morals of an alley cat. SPEAKER_01
7 25.12 27.44 My son was not a loser, was not a sucker. SPEAKER_01
8 27.44 29.52 You're the sucker, you're the loser. SPEAKER_01
9 29.52 33.36 He's blaming inflation, and he's right, it's ... UNKNOWN
10 33.36 34.64 He caused the inflation. SPEAKER_00
11 35.04 46.32 Excuse me, with um dealing with everything we... UNKNOWN
12 46.40 48.56 Thank you, President uh Biden, President Trump. SPEAKER_01
13 48.56 51.28 Well, I took two tests, cognitive tests, I ac... SPEAKER_00
14 51.28 53.52 Go through the first five questions, he could... SPEAKER_00
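
A couple of rows come back as UNKNOWN. One plausible cause is that the sentence's midpoint lands in a gap between diarization turns, so no turn contains it. A common alternative is to pick the speaker whose turns overlap the sentence the most. The sketch below is self-contained and uses made-up turns; `overlap` and `assign_speaker` are hypothetical helpers, and the `(start, end, label)` tuples are an assumed format you would build yourself from the pyannote loop shown earlier.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speaker(sentence, turns):
    """Pick the speaker whose turns overlap the sentence the most."""
    totals = {}
    for start, end, label in turns:
        totals[label] = totals.get(label, 0.0) + overlap(
            sentence["start"], sentence["end"], start, end
        )
    best = max(totals, key=totals.get, default=None)
    if best is None or totals[best] == 0.0:
        return "UNKNOWN"
    return best


# Hypothetical diarization turns as (start, end, label).
turns = [
    (29.6, 31.0, "SPEAKER_00"),
    (32.0, 33.3, "SPEAKER_00"),
    (33.4, 34.6, "SPEAKER_00"),
]
sentence = {"start": 29.52, "end": 33.36, "text": "He's blaming inflation..."}

# The midpoint (31.44) falls in the gap between the first two turns,
# so midpoint matching would say UNKNOWN. Overlap matching does not.
print(assign_speaker(sentence, turns))  # SPEAKER_00
```

Overlap matching still returns UNKNOWN when nothing overlaps at all, and it can pick the wrong person when two speakers talk over each other, so spot-check the results either way.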

Transcription and speaker identification (WhisperX)

WhisperX can do the same thing — transcribe and separate speakers. It's slower than Parakeet but widely used and runs everywhere.

audio/whisperx-diarize.py — Full WhisperX pipeline: transcribe, align, diarize (who said what)

from pathlib import Path
import os
os.environ["TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD"] = "1"
import torch
import whisperx
from whisperx.diarize import DiarizationPipeline

DATA = Path("data")
AUDIO = DATA / "rDXubdQdJYs.mp3"
MODEL, LANGUAGE = "large-v3", "en"
HF_TOKEN = os.environ["HF_TOKEN"]
device = "cuda" if torch.cuda.is_available() else "cpu"
compute_type = "float16" if device == "cuda" else "int8"

# Step 1: Transcribe
model = whisperx.load_model(MODEL, device, compute_type=compute_type)
audio = whisperx.load_audio(str(AUDIO))
result = model.transcribe(audio, batch_size=16)

# Step 2: Align
model_a, metadata = whisperx.load_align_model(language_code=LANGUAGE, device=device)
result = whisperx.align(
    result["segments"], model_a, metadata, audio, device, 
    return_char_alignments=False,
)

# Step 3: Diarize (split speakers)
diarize_model = DiarizationPipeline(token=HF_TOKEN, device=device)
result = whisperx.assign_word_speakers(diarize_model(audio), result)

# Print speaker-labeled segments
current_speaker = None
for seg in result["segments"]:
    speaker = seg.get("speaker", "UNKNOWN")
    if speaker != current_speaker:
        print(f"\n--- {speaker} ---")
        current_speaker = speaker
    start = f"{int(seg['start']//60):02d}:{seg['start']%60:05.2f}"
    end = f"{int(seg['end']//60):02d}:{seg['end']%60:05.2f}"
    print(f"  [{start} - {end}] {seg['text'].strip()}")
--- SPEAKER_01 ---
  [00:02.46 - 00:07.99] relative to what we're going to do with more border patrol and more asylum officers.

--- SPEAKER_00 ---
  [00:08.01 - 00:09.09] President Trump?
  [00:09.11 - 00:11.21] I really don't know what he said at the end of that sentence.
  [00:11.55 - 00:12.90] I don't think he knows what he said either.

--- SPEAKER_01 ---
  [00:12.94 - 00:16.90] The only person in this stage is a convicted felon is the man I'm looking at right now.

--- SPEAKER_00 ---
  [00:17.32 - 00:20.63] But when he talks about a convicted felon, his son is a convicted felon.

--- SPEAKER_01 ---
  [00:20.75 - 00:22.29] What are you talking about?
  [00:22.91 - 00:25.20] You have the morals of an alley cat.
  [00:25.22 - 00:27.20] My son was not a loser, was not a sucker.
  [00:27.48 - 00:28.18] You're the sucker.
  [00:28.34 - 00:29.02] You're the loser.

--- SPEAKER_00 ---
  [00:29.70 - 00:32.07] He's blaming inflation, and he's right.
  [00:32.09 - 00:32.95] It's been very bad.
  [00:33.61 - 00:34.75] He caused the inflation.

--- SPEAKER_01 ---
  [00:35.07 - 00:46.51] Excuse me, with dealing with everything we have to do with... Look, if we finally beat Medicare...
  [00:46.53 - 00:47.69] Thank you, President Biden.
  [00:47.71 - 00:48.41] President Trump?

--- SPEAKER_00 ---
  [00:48.58 - 00:50.57] Well, I took two tests, cognitive tests.
  [00:50.65 - 00:51.26] I aced them.
  [00:51.28 - 00:52.62] Go through the first five questions.
  [00:52.64 - 00:53.25] He couldn't do it.
  [00:53.61 - 00:56.65] This guy's three years younger and a lot less competent.

--- SPEAKER_01 ---
  [00:56.67 - 00:58.06] I think that just look at the record.
  [00:58.18 - 00:59.06] Look at what I've done.

It isn't perfect, as you can tell! We probably want the segments split up beautifully into a table though, right?

import pandas as pd
df = pd.DataFrame(result["segments"])
df
start end text words speaker
0 2.461 7.990 relative to what we're going to do with more ... [{'word': 'relative', 'start': 2.461, 'end': 2... SPEAKER_01
1 8.010 9.091 President Trump? [{'word': 'President', 'start': 8.01, 'end': 8... SPEAKER_00
2 9.111 11.215 I really don't know what he said at the end of... [{'word': 'I', 'start': 9.111, 'end': 9.151, '... SPEAKER_00
3 11.555 12.897 I don't think he knows what he said either. [{'word': 'I', 'start': 11.555, 'end': 11.575,... SPEAKER_00
4 12.937 16.904 The only person in this stage is a convicted f... [{'word': 'The', 'start': 12.937, 'end': 13.23... SPEAKER_01
5 17.324 20.629 But when he talks about a convicted felon, his... [{'word': 'But', 'start': 17.324, 'end': 17.42... SPEAKER_00
6 20.750 22.292 What are you talking about? [{'word': 'What', 'start': 20.75, 'end': 20.87... SPEAKER_01
7 22.913 25.197 You have the morals of an alley cat. [{'word': 'You', 'start': 22.913, 'end': 23.35... SPEAKER_01
8 25.217 27.200 My son was not a loser, was not a sucker. [{'word': 'My', 'start': 25.217, 'end': 25.357... SPEAKER_01
9 27.481 28.182 You're the sucker. [{'word': 'You're', 'start': 27.481, 'end': 27... SPEAKER_01
10 28.342 29.023 You're the loser. [{'word': 'You're', 'start': 28.342, 'end': 28... SPEAKER_01
11 29.704 32.068 He's blaming inflation, and he's right. [{'word': 'He's', 'start': 29.704, 'end': 29.9... SPEAKER_00
12 32.088 32.950 It's been very bad. [{'word': 'It's', 'start': 32.088, 'end': 32.1... SPEAKER_00
13 33.611 34.753 He caused the inflation. [{'word': 'He', 'start': 33.611, 'end': 33.771... SPEAKER_00
14 35.073 46.512 Excuse me, with dealing with everything we hav... [{'word': 'Excuse', 'start': 35.073, 'end': 35... SPEAKER_01
15 46.532 47.694 Thank you, President Biden. [{'word': 'Thank', 'start': 46.532, 'end': 46.... SPEAKER_01
16 47.714 48.415 President Trump? [{'word': 'President', 'start': 47.714, 'end':... SPEAKER_01
17 48.580 50.572 Well, I took two tests, cognitive tests. [{'word': 'Well,', 'start': 48.58, 'end': 48.8... SPEAKER_00
18 50.652 51.256 I aced them. [{'word': 'I', 'start': 50.652, 'end': 50.713,... SPEAKER_00
19 51.276 52.624 Go through the first five questions. [{'word': 'Go', 'start': 51.276, 'end': 51.397... SPEAKER_00
20 52.644 53.248 He couldn't do it. [{'word': 'He', 'start': 52.644, 'end': 52.725... SPEAKER_00
21 53.610 56.649 This guy's three years younger and a lot less ... [{'word': 'This', 'start': 53.61, 'end': 53.79... SPEAKER_00
22 56.669 58.057 I think that just look at the record. [{'word': 'I', 'start': 56.669, 'end': 56.749,... SPEAKER_01
23 58.178 59.063 Look at what I've done. [{'word': 'Look', 'start': 58.178, 'end': 58.2... SPEAKER_01
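
With a speaker on every row, the table can be collapsed into speaker turns and searched. Here is a minimal sketch using plain Python. The four sample rows are copied (rounded) from the output above, while `merge_turns` and `search` are hypothetical helpers written for this example.

```python
# A few rows in the same shape as result["segments"], values from the table above.
segments = [
    {"start": 27.48, "end": 28.18, "text": "You're the sucker.", "speaker": "SPEAKER_01"},
    {"start": 28.34, "end": 29.02, "text": "You're the loser.", "speaker": "SPEAKER_01"},
    {"start": 29.70, "end": 32.07, "text": "He's blaming inflation, and he's right.", "speaker": "SPEAKER_00"},
    {"start": 33.61, "end": 34.75, "text": "He caused the inflation.", "speaker": "SPEAKER_00"},
]


def merge_turns(segments):
    """Collapse consecutive segments from the same speaker into one turn."""
    turns = []
    for seg in segments:
        speaker = seg.get("speaker", "UNKNOWN")
        text = seg["text"].strip()
        if turns and turns[-1]["speaker"] == speaker:
            # Same speaker keeps talking: extend the current turn
            turns[-1]["end"] = seg["end"]
            turns[-1]["text"] += " " + text
        else:
            turns.append({"speaker": speaker, "start": seg["start"],
                          "end": seg["end"], "text": text})
    return turns


def search(turns, keyword):
    """Return the turns whose text contains keyword (case-insensitive)."""
    return [t for t in turns if keyword.lower() in t["text"].lower()]


turns = merge_turns(segments)
for t in search(turns, "inflation"):
    print(f"{t['speaker']} at {t['start']:.2f}s: {t['text']}")
```

On the real data you would call `merge_turns(result["segments"])`. The speaker labels are only as good as the diarization, so a wrong label on one segment will split or merge turns incorrectly.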

"Speaker 1 said X at 0:42, Speaker 2 said Y at 1:15." Now you have a searchable, speaker-labeled transcript!

If you'd prefer to not write code (lol), try MacWhisper, Handy, VoiceInk, or Buzz.

Using the cloud

Sometimes your computer is slow, or you don't care about privacy or cost, and you just want something to get done. Here's the same audio, but using Gemini (Google's LLM). We make it send back structured data — each utterance gets a speaker, timestamps, text, and some sentiment to mix it up a bit. Same Pydantic AI pattern as the image notebooks!

audio/gemini-diarize.py — Structured transcription with speaker labels via Gemini (cloud alternative to WhisperX)

from pathlib import Path

from pydantic import BaseModel, Field
from pydantic_ai import Agent, BinaryContent
from typing import Literal

DATA = Path("data")
AUDIO = DATA / "rDXubdQdJYs.mp3"
MODEL = "google-gla:gemini-2.5-flash"

class Utterance(BaseModel):
    speaker: str = Field(description="Speaker label (e.g., Speaker 1, Speaker 2)")
    start: str = Field(description="Start timestamp MM:SS")
    end: str = Field(description="End timestamp MM:SS")
    text: str = Field(description="What was said")
    sentiment: Literal[
        "positive", "negative", "neutral"
    ] = Field(description="Sentiment of the utterance")

agent = Agent(MODEL, output_type=list[Utterance])
result = agent.run_sync([
    "Transcribe this audio with speaker labels and timestamps for each utterance.",
    BinaryContent(data=AUDIO.read_bytes(), media_type="audio/mpeg"),
])

# Each utterance is a typed object: speaker, timestamps, text. Same Pydantic pattern as images.
for u in result.output:
    print(f"[{u.start} - {u.end}] {u.speaker}: {u.text}")
[00:02 - 00:07] Speaker A: relative to what we're going to do with more border patrol and more asylum officers.
[00:08 - 00:08] Speaker B: President Trump.
[00:08 - 00:12] Speaker C: I really don't know what he said at the end of that sentence. I don't think he knows what he said either.
[00:13 - 00:16] Speaker D: The only person at this stage is a convicted felon, this man I'm looking at right now.
[00:17 - 00:20] Speaker C: But when he talks about a convicted felon, his son is a convicted felon.
[00:20 - 00:22] Speaker D: What? What are you talking about?
[00:22 - 00:24] Speaker D: You You have the morals of an alley cat.
[00:25 - 00:29] Speaker D: My son was not a loser, was not a sucker. You're the sucker. You're the loser.
[00:29 - 00:32] Speaker C: He's blaming inflation and he's right, it's been very bad.
[00:33 - 00:34] Speaker C: He caused the inflation.
[00:35 - 00:45] Speaker A: Excuse me, with um dealing with everything we have to do with uh look if we finally beat Medicare.
[00:46 - 00:48] Speaker B: Thank you, President Biden, President Trump.
[00:48 - 00:53] Speaker C: Well, I took two tests, cognitive test, I aced them. Go through the first five questions, he couldn't do it.
[00:53 - 00:58] Speaker D: This guy's three years younger and a lot less competent. I think that this look at the record, look at what I've done.
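
Gemini returns timestamps as MM:SS strings, while the local tools above report seconds as floats. To compare the two, or to compute how long someone spoke, the strings need converting. Here is a small sketch; `mmss_to_seconds` is a hypothetical helper, and it assumes the model stuck to whole-number MM:SS (or HH:MM:SS) as the schema asked. An LLM may not always comply, in which case `int()` raises a `ValueError`.

```python
def mmss_to_seconds(stamp):
    """Convert "MM:SS" (or "HH:MM:SS") into a whole number of seconds."""
    seconds = 0
    for part in stamp.strip().split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


# Two rows copied from the Gemini output above: (start, end, speaker)
rows = [
    ("00:02", "00:07", "Speaker A"),
    ("00:35", "00:45", "Speaker A"),
]
for start, end, speaker in rows:
    duration = mmss_to_seconds(end) - mmss_to_seconds(start)
    print(f"{speaker}: starts at {mmss_to_seconds(start)}s, lasts {duration}s")
```

Because the timestamps are rounded to the second, treat them as approximate and prefer the aligned local output when you need precise cuts.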

Cloud tradeoff: faster, structured output built-in, but your audio goes to Google. Most newsrooms will use both local and cloud depending on the sensitivity of the material.

This is how real investigations work:

Up next: Video is just images + audio + time. Decompose it, then use the tools you already have.