Open this notebook in Colab

Bonus: Video Deep Dive

The main video notebook showed you how to split video into frames and audio. This one goes deeper: automatic scene detection, and using Gemini to understand video directly — with timestamps, structured data, and zero decomposition.

Gemini's Video Understanding page has some great suggestions on what you can do.

Scene detection with PySceneDetect

PySceneDetect finds cuts in video automatically by flagging big visual changes. It's ancient technology, but super quick and (mostly) effective. Great for splitting long videos into meaningful chunks.

video/scenes.py — Detect scene boundaries and extract mid-scene frames with PySceneDetect

import cv2
import pandas as pd
from pathlib import Path
from scenedetect import open_video, SceneManager, ContentDetector

DATA = Path("data")
VIDEO = DATA / "rDXubdQdJYs.mp4"
OUTPUT = Path("outputs") / "scenes"
OUTPUT.mkdir(parents=True, exist_ok=True)

video = open_video(str(VIDEO))
scene_manager = SceneManager()
scene_manager.add_detector(ContentDetector(threshold=27.0))
scene_manager.detect_scenes(video)
scene_list = scene_manager.get_scene_list()

rows = []
for i, (start, end) in enumerate(scene_list, 1):
    rows.append({"scene": i, "start_time": start.get_timecode(), "end_time": end.get_timecode(),
                 "duration_sec": round((end - start).get_seconds(), 2)})
df = pd.DataFrame(rows)
df.to_csv(OUTPUT / "scene_list.csv", index=False)

cap = cv2.VideoCapture(str(VIDEO))
fps = cap.get(cv2.CAP_PROP_FPS)
for i, (start, end) in enumerate(scene_list, 1):
    mid_frame = int((start.get_seconds() + end.get_seconds()) / 2 * fps)
    cap.set(cv2.CAP_PROP_POS_FRAMES, mid_frame)
    ret, frame = cap.read()
    if ret:
        cv2.imwrite(str(OUTPUT / f"scene_{i:03d}.jpg"), frame, [cv2.IMWRITE_JPEG_QUALITY, 95])
cap.release()

print(f"Found {len(scene_list)} scenes, frames saved to {OUTPUT}")
Found 16 scenes, frames saved to outputs/scenes

Each scene gets a start time, end time, and a representative frame saved to disk. Now you can analyze scenes individually instead of processing the whole video.
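One way to bridge the two halves of this notebook is to turn each detected scene into a prompt for the timestamp-style questions Gemini supports. A small stdlib sketch, with sample timecodes standing in for the `scene_list.csv` output above (the helper name `to_mmss` is my own):

```python
def to_mmss(timecode: str) -> str:
    """Convert a PySceneDetect HH:MM:SS.mmm timecode to MM:SS for a prompt."""
    h, m, s = timecode.split(":")
    total = int(h) * 3600 + int(m) * 60 + int(float(s))
    return f"{total // 60:02d}:{total % 60:02d}"

# Sample boundaries in the same shape as scene_list.csv.
scenes = [
    ("00:00:00.000", "00:00:02.100"),
    ("00:00:02.100", "00:00:13.400"),
]
prompts = [
    f"What happens between {to_mmss(start)} and {to_mmss(end)} in this video?"
    for start, end in scenes
]
for p in prompts:
    print(p)
```

Each prompt can then be sent alongside the uploaded video, exactly like the timestamp example later in this notebook.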

Send a video file to Gemini

Gemini can watch video. Upload a file, ask a question, get an answer. It's slower and more expensive than working with extracted frames, but it's very easy to do.

video/gemini-upload.py — Upload a video file to Gemini and ask a question about it

import time
from pathlib import Path

from pydantic_ai import Agent, VideoUrl
from pydantic_ai.providers.google import GoogleProvider

DATA = Path("data")
VIDEO = DATA / "rDXubdQdJYs.mp4"
PROMPT = "Describe what happens in this video."
MODEL = "google-gla:gemini-2.5-flash"

provider = GoogleProvider()
video_file = provider.client.files.upload(file=str(VIDEO))

# Wait for Google to finish processing the upload before prompting
while video_file.state.name == "PROCESSING":
    time.sleep(5)
    video_file = provider.client.files.get(name=video_file.name)

agent = Agent(MODEL)
result = agent.run_sync([PROMPT, VideoUrl(url=video_file.uri, media_type=video_file.mime_type)])
print(result.output)
This video shows excerpts from the June 27, 2024, CNN Presidential Debate in Atlanta, Georgia, featuring President Joe Biden and former President Donald Trump.

The video begins with a split screen:
*   **Top half:** President Joe Biden, dressed in a dark suit and blue tie, walks onto the debate stage, smiling slightly. The stage background is blue with "CNN" logos and an American eagle crest.
*   **Bottom half:** Former President Donald Trump, in a dark suit and red tie, also walks onto the stage with a serious expression. His background is similar, with red and white star motifs.

The video then transitions to close-up shots of each candidate as they speak:

1.  **Biden** is heard saying "...relative to what we're going to do with more border patrol and more asylum officers." He pauses and appears to struggle slightly with the end of the sentence.
2.  **Trump** responds, "I really don't know what he said at the end of that sentence. I don't think he knows what he said either," looking directly at Biden with a dismissive expression and gesturing with his hand.
3.  **Biden** then delivers a sharp personal attack: "The only person on this stage that is a convicted felon is the man I'm looking at right now."
4.  **Trump** retaliates quickly: "...but when he talks about a convicted felon, his son is a convicted felon." He points a finger, looking sternly.
5.  **Biden** becomes visibly agitated, his face contorted: "What are you talking about? You have the morals of an alley cat. My son was not a loser. He was not a sucker. You're the sucker. You're the loser."
6.  The conversation shifts to inflation. **Trump** states, "He's blaming inflation and he's right, it's been very bad. He caused the inflation."
7.  **Biden** is then shown with his eyes closed, appearing to struggle to articulate his thoughts: "Excuse me, with, dealing with everything we have to do with... Look, if... We finally beat Medicare."
8.  **Trump** observes Biden, at one point appearing to smirk and roll his eyes slightly.
9.  A moderator's voice is heard, "Thank you, President Biden, President Trump?"
10. **Trump** then talks about cognitive tests: "Well, I took two tests. Cognitive tests. I aced them. Go through the first five questions. He couldn't do it."
11. **Biden** concludes the segment with a wry smile, asserting, "This guy is three years younger and a lot less competent. I think that just look at the record. Look at what I've done."

The video highlights a contentious exchange, characterized by personal attacks, challenges to cognitive ability, and difficulties in verbal delivery from both candidates, particularly Biden.

Or just give it a YouTube URL

If the video is already on YouTube, it's even easier: Gemini accepts YouTube URLs directly.

video/gemini-youtube.py — Send a YouTube URL directly to Gemini for analysis (no download needed)

from pydantic_ai import Agent, VideoUrl

URL = "https://www.youtube.com/watch?v=rDXubdQdJYs"
PROMPT = "What topics are discussed in this video?"
MODEL = "google-gla:gemini-2.5-flash"

agent = Agent(MODEL)
result = agent.run_sync([
    PROMPT,
    VideoUrl(url=URL),
])

print(result.output)
The video features segments from the June 27, 2024, CNN Presidential Debate between Joe Biden and Donald Trump. The topics discussed include:

1.  **Immigration and Border Security:** Biden mentions "more border patrol and more asylum officers."
2.  **Candidate Competence/Clarity:** Trump questions Biden's ability to articulate his thoughts, while Biden later critiques Trump's competence and age. Both reference cognitive abilities.
3.  **Legal Issues/Felony Convictions:** Biden directly refers to Trump as a "convicted felon," and Trump retorts by mentioning Hunter Biden's felony conviction.
4.  **Personal Attacks and Character:** Both candidates exchange insults, with Biden calling Trump's morals into question and labeling him a "sucker" and "loser" in defense of his son.
5.  **Economy/Inflation:** Trump blames Biden for inflation.
6.  **Healthcare/Medicare:** Biden briefly mentions "beating Medicare" (though the phrasing suggests a stumble, the topic is clear).
7.  **Political Records:** Biden alludes to looking at his record and what he's done.

Ask about a specific moment

"What's happening at 1:30?" Gemini can jump to timestamps. Useful when you already know where to look but need the AI to describe what it sees.

video/gemini-timestamp.py — Ask Gemini about a specific moment in a video using timestamps

import time
from pathlib import Path

from pydantic_ai import Agent, VideoUrl
from pydantic_ai.providers.google import GoogleProvider

DATA = Path("data")
VIDEO = DATA / "rDXubdQdJYs.mp4"
PROMPT = "What is happening at 01:30 in this video? Describe the scene and any text on screen."
MODEL = "google-gla:gemini-2.5-flash"

provider = GoogleProvider()
video_file = provider.client.files.upload(file=str(VIDEO))

while video_file.state.name == "PROCESSING":
    time.sleep(5)
    video_file = provider.client.files.get(name=video_file.name)

agent = Agent(MODEL)
result = agent.run_sync([PROMPT, VideoUrl(url=video_file.uri, media_type=video_file.mime_type)])
print(result.output)
At 01:30, the video shows a close-up shot of **Donald Trump**.

**Describe the scene:**
He is wearing a dark blue suit, a white collared shirt, and a bright red tie. A black microphone is positioned directly in front of him. He has a serious expression, with his mouth slightly open as if he is speaking, and is looking forward. There is a small American flag lapel pin on his left lapel. The background is a solid blue with faint, out-of-focus white "CNN" logos visible.

**Text on screen:**
*   **Top left:** A white "wp" logo (likely indicating The Washington Post).
*   **Top right:** "Courtesy of CNN" in white text. Below that, "Atlanta, Ga. | June 27, 2024".
*   **Bottom middle (subtitles):** "Look, if..."

Structured scene-by-scene breakdown

If you pair Gemini with Pydantic, you can ask for exactly what you want: timestamps, descriptions, people visible, text on screen. It's the same structured output pattern from the image notebooks, just applied to video.

video/gemini-structured.py — Get a structured scene-by-scene breakdown of a video

import time
from pathlib import Path

from pydantic import BaseModel, Field
from pydantic_ai import Agent, VideoUrl
from pydantic_ai.providers.google import GoogleProvider

DATA = Path("data")
VIDEO = DATA / "rDXubdQdJYs.mp4"
MODEL = "google-gla:gemini-2.5-flash"

class Scene(BaseModel):
    start: str = Field(description="Start timestamp MM:SS")
    end: str = Field(description="End timestamp MM:SS")
    description: str = Field(description="What happens in this scene")
    people_visible: list[str] = Field(description="People visible")
    text_on_screen: str = Field(description="Any chyrons, captions, or on-screen text", default="")

provider = GoogleProvider()
video_file = provider.client.files.upload(file=str(VIDEO))

while video_file.state.name == "PROCESSING":
    time.sleep(5)
    video_file = provider.client.files.get(name=video_file.name)

agent = Agent(MODEL, output_type=list[Scene])
result = agent.run_sync([
    "Break this video into scenes. For each scene identify timestamps, "
    "what happens, who is visible, and any text on screen.",
    VideoUrl(url=video_file.uri, media_type=video_file.mime_type),
])
for s in result.output:
    print(f"[{s.start} - {s.end}] {s.description}")
    if s.people_visible:
        print(f"  People: {', '.join(s.people_visible)}")
    if s.text_on_screen:
        print(f"  Text: {s.text_on_screen}")
[00:00 - 00:02] Split screen showing Joe Biden and Donald Trump walking onto a stage.
  People: Joe Biden, Donald Trump
  Text: Courtesy of CNN, Atlanta, Ga. | June 27, 2024, wp
[00:02 - 00:04] Close-up of Joe Biden speaking.
  People: Joe Biden
  Text: ...relative to what we're going to do with, Courtesy of CNN, Atlanta, Ga. | June 27, 2024, wp
[00:04 - 00:07] Close-up of Donald Trump listening, then Joe Biden speaking.
  People: Donald Trump, Joe Biden
  Text: more border patrol and more asylum officers, Courtesy of CNN, Atlanta, Ga. | June 27, 2024, wp
[00:07 - 00:13] Close-up of Donald Trump speaking, stating he doesn't know what Biden said.
  People: Donald Trump
  Text: I really don't know what he said at the end of that sentence. don't think he knows what he said either., Donald Trump FORMER PRESIDENT, Courtesy of CNN, Atlanta, Ga. | June 27, 2024, wp
[00:13 - 00:17] Close-up of Joe Biden speaking, accusing Trump of being a convicted felon.
  People: Joe Biden
  Text: The only person on this stage that is a convicted felon is the man I'm looking at right now., Courtesy of CNN, Atlanta, Ga. | June 27, 2024, wp
[00:17 - 00:20] Close-up of Donald Trump speaking, retorting that Biden's son is a convicted felon.
  People: Donald Trump
  Text: ...but when he talks about a convicted felon, his son is a convicted felon., Courtesy of CNN, Atlanta, Ga. | June 27, 2024, wp
[00:20 - 00:29] Close-up of Joe Biden speaking, expressing anger and calling Trump names.
  People: Joe Biden
  Text: What are you talking about?, You have the morals of an alley cat., My son was not a loser. He was not a sucker. You're the sucker. You're the loser., Courtesy of CNN, Atlanta, Ga. | June 27, 2024, wp
[00:29 - 00:34] Close-up of Donald Trump speaking, blaming Biden for inflation.
  People: Donald Trump
  Text: He's blaming inflation and he's right it's been very bad. He caused the inflation., Courtesy of CNN, Atlanta, Ga. | June 27, 2024, wp
[00:34 - 00:39] Close-up of Joe Biden speaking, seemingly struggling with his words.
  People: Joe Biden
  Text: Excuse me, with, dealing with everything we have to do with., Courtesy of CNN, Atlanta, Ga. | June 27, 2024, wp
[00:39 - 00:41] Close-up of Donald Trump listening.
  People: Donald Trump
  Text: Courtesy of CNN, Atlanta, Ga. | June 27, 2024, wp
[00:41 - 00:47] Close-up of Joe Biden speaking, then the moderator interrupting. Biden seems to be searching for words.
  People: Joe Biden
  Text: Look, if..., We finally beat Medicare., Thank you, President Biden, president Trump?, Courtesy of CNN, Atlanta, Ga. | June 27, 2024, wp
[00:47 - 00:52] Close-up of Donald Trump speaking, claiming he aced cognitive tests.
  People: Donald Trump
  Text: Well, I took two tests. Cognitive tests. I aced them. Go through the first five questions. He couldn't do it., Courtesy of CNN, Atlanta, Ga. | June 27, 2024, wp
[00:52 - 00:59] Close-up of Joe Biden speaking, stating Trump is younger but less competent.
  People: Joe Biden
  Text: This guy is three years younger and a lot less competent. I think that just look at the record. Look at what I've done., Courtesy of CNN, Atlanta, Ga. | June 27, 2024, wp
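The structured output drops straight into the same CSV pattern used for `scene_list.csv`. A sketch, with one sample row standing in for `result.output` (in practice you'd build the rows with `s.model_dump()`, joining the `people_visible` list into a single cell; the filename `gemini_scenes.csv` is my own):

```python
import csv
from pathlib import Path

OUTPUT = Path("outputs") / "scenes"
OUTPUT.mkdir(parents=True, exist_ok=True)

# In practice: rows built from [s.model_dump() for s in result.output],
# with people_visible joined into one string per row.
rows = [
    {"start": "00:00", "end": "00:02",
     "description": "Split screen showing both candidates walking on stage.",
     "people_visible": "Joe Biden; Donald Trump",
     "text_on_screen": "Courtesy of CNN"},
]

with open(OUTPUT / "gemini_scenes.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)

print(f"Wrote {len(rows)} scenes to {OUTPUT / 'gemini_scenes.csv'}")
```

From there the scene table can go anywhere a DataFrame can: a spreadsheet, a fact-check worksheet, or a join against the PySceneDetect boundaries from the top of this notebook.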