Bonus: Tracking & Counting
Want to reproduce this Bloomberg piece about congestion pricing? We can get about 60% of the way there!
The library doing the heavy lifting here is supervision, which handles tracking, drawing, and counting on top of any object detection model. It's fantastic.
Step 1: Detect
First, we just detect objects in a single frame. YOLO finds the cars, supervision draws the boxes.
tracking/detect.py — Detect objects in a single video frame with YOLO
from pathlib import Path
from PIL import Image
import cv2
import supervision as sv
from ultralytics import YOLO
DATA = Path("data")
VIDEO = DATA / "istockphoto-534232220-640_adpp_is.mp4"
model = YOLO("yolo26n")
cap = cv2.VideoCapture(str(VIDEO))
cap.set(cv2.CAP_PROP_POS_FRAMES, 50)
ret, frame = cap.read()
cap.release()
detections = sv.Detections.from_ultralytics(model(frame, verbose=False)[0])
labels = [f"{model.names[int(c)]} {conf:.2f}" for c, conf in zip(detections.class_id, detections.confidence)]
annotated = sv.BoxAnnotator().annotate(frame.copy(), detections)
annotated = sv.LabelAnnotator().annotate(annotated, detections, labels=labels)
print(f"Found {len(detections)} objects")
Image.fromarray(cv2.cvtColor(annotated, cv2.COLOR_BGR2RGB))
Found 6 objects
Step 2: Track
Now we add tracking. ByteTrack links detections across frames — "that's car #7, same one from 3 seconds ago." Each object gets a unique ID and a motion trail.
tracking/track.py — Track objects across video frames with YOLO + ByteTrack
from pathlib import Path
import cv2
import supervision as sv
import ipywidgets as widgets
from IPython.display import display
from ultralytics import YOLO
DATA = Path("data")
VIDEO = DATA / "istockphoto-534232220-640_adpp_is.mp4"
MAX_FRAMES = 100
model = YOLO("yolo26n")
tracker = sv.ByteTrack()
smoother = sv.DetectionsSmoother()
box_ann = sv.BoxAnnotator()
label_ann = sv.LabelAnnotator()
trace_ann = sv.TraceAnnotator()
image_widget = widgets.Image(format='jpeg')
display(image_widget)
cap = cv2.VideoCapture(str(VIDEO))
frame_count = 0
while cap.isOpened() and frame_count < MAX_FRAMES:
    ret, frame = cap.read()
    if not ret:
        break
    detections = sv.Detections.from_ultralytics(model(frame, verbose=False)[0])
    detections = tracker.update_with_detections(detections)
    detections = smoother.update_with_detections(detections)
    labels = [f"#{tid} {model.names[int(c)]}" for tid, c in zip(detections.tracker_id, detections.class_id)] if detections.tracker_id is not None else []
    annotated = box_ann.annotate(frame.copy(), detections)
    annotated = trace_ann.annotate(annotated, detections)
    annotated = label_ann.annotate(annotated, detections, labels=labels)
    _, buf = cv2.imencode('.jpg', annotated)
    image_widget.value = buf.tobytes()
    frame_count += 1
cap.release()
print(f"Tracked {frame_count} frames")
Tracked 100 frames
Step 3: Count
Draw a virtual line across the road. Every time a tracked object crosses it, the counter ticks up. Think: counting cars at an intersection, people entering a building, boats passing under a bridge.
tracking/count.py — Count objects crossing a line with YOLO + ByteTrack + LineZone
from pathlib import Path
import cv2
import supervision as sv
import ipywidgets as widgets
from IPython.display import display
from ultralytics import YOLO
DATA = Path("data")
VIDEO = DATA / "istockphoto-534232220-640_adpp_is.mp4"
model = YOLO("yolo26n")
tracker = sv.ByteTrack()
smoother = sv.DetectionsSmoother()
box_ann = sv.BoxAnnotator()
label_ann = sv.LabelAnnotator()
trace_ann = sv.TraceAnnotator()
line_zone = sv.LineZone(start=sv.Point(200, 175), end=sv.Point(700, 175))
line_ann = sv.LineZoneAnnotator(text_thickness=1)
image_widget = widgets.Image(format='jpeg')
display(image_widget)
cap = cv2.VideoCapture(str(VIDEO))
frame_count = 0
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    detections = sv.Detections.from_ultralytics(model(frame, verbose=False)[0])
    detections = tracker.update_with_detections(detections)
    detections = smoother.update_with_detections(detections)
    line_zone.trigger(detections)
    labels = [f"#{tid} {model.names[int(c)]}" for tid, c in zip(detections.tracker_id, detections.class_id)] if detections.tracker_id is not None else []
    annotated = box_ann.annotate(frame.copy(), detections)
    annotated = trace_ann.annotate(annotated, detections)
    annotated = label_ann.annotate(annotated, detections, labels=labels)
    annotated = line_ann.annotate(annotated, line_counter=line_zone)
    _, buf = cv2.imencode('.jpg', annotated)
    image_widget.value = buf.tobytes()
    frame_count += 1
cap.release()
print(f"Processed {frame_count} frames")
print(f"Crossed IN: {line_zone.in_count} | OUT: {line_zone.out_count}")
Processed 625 frames
Crossed IN: 20 | OUT: 20
The actual count is around 20 in and 20 out. How close did we get?
...but can Gemini just do this?
We've been using Gemini for all sorts of stuff. Let's see if it can count cars in the video. We upload the same clip and ask it to count.
tracking/gemini-count.py — Ask Gemini to count cars in a video — spoiler: it struggles
import time
from pathlib import Path
from pydantic_ai import Agent, VideoUrl
from pydantic_ai.providers.google import GoogleProvider
DATA = Path("data")
VIDEO = DATA / "istockphoto-534232220-640_adpp_is.mp4"
MODEL = "google-gla:gemini-2.5-flash"
provider = GoogleProvider()
video_file = provider.client.files.upload(file=str(VIDEO))
while video_file.state.name == "PROCESSING":
    time.sleep(5)
    video_file = provider.client.files.get(name=video_file.name)
agent = Agent(MODEL)
result = agent.run_sync([
    "Watch this traffic video carefully. Count the exact number of cars "
    "that enter the left tunnel and the number that exit the right tunnel.",
    VideoUrl(url=video_file.uri, media_type=video_file.mime_type),
])
print(result.output)
Let's count them carefully: **Cars entering the left tunnel:** 1. Dark blue car (0:00) 2. Green van (0:02) 3. Black car (0:04) 4. White convertible (0:06) 5. Police car (0:07) 6. White hatchback (0:08) 7. Silver sedan (0:09) 8. Dark grey hatchback (0:10) 9. Black cab (0:10) 10. White car (behind black cab) (0:11) 11. White flatbed truck (0:11) 12. Silver car (0:12) 13. Black sedan (0:13) 14. White van (0:14) 15. Blue car (0:14) 16. Silver hatchback (0:15) 17. White hatchback (0:17) 18. Dark grey sedan (0:17) 19. Silver hatchback (0:18) 20. Silver hatchback (0:19) 21. White flatbed truck (0:20) 22. White hatchback (0:20) 23. Silver taxi (0:21) 24. Silver hatchback (0:22) 25. Black hatchback (0:23) 26. White car (0:23) **Total entering left tunnel: 26 cars** --- **Cars exiting the right tunnel:** 1. White sedan (0:00) 2. Dark blue sedan (0:01) 3. Red car (0:02) 4. Black sedan (0:03) 5. Silver hatchback (0:03) 6. Red car (0:04) 7. Silver car (0:05) 8. White car (0:06) 9. Dark blue sedan (0:07) 10. Silver sedan (0:08) 11. White van (0:09) 12. Red car (on far right, 0:09) 13. Silver car (0:10) 14. White van (0:10) 15. Black sedan (0:11) 16. Silver sedan (0:12) 17. Blue sedan (0:13) 18. Black sedan (0:13) 19. Green flatbed truck (on far right, 0:14) 20. Black sedan (0:14) 21. Blue sedan (0:15) 22. Dark blue van (0:16) 23. Dark blue sedan (0:16) 24. Black sedan (0:17) 25. Blue sedan (0:18) 26. Blue sedan (0:19) 27. Silver van (0:19) 28. Black taxi (0:20) 29. Blue sedan (0:20) 30. White van (0:21) 31. Blue sedan (0:22) 32. Black taxi (0:22) 33. Black sedan (0:23) **Total exiting right tunnel: 33 cars**
The answer is: huh, this is better than last year, but still wrong. LLMs are bad at counting things in video – they're great at vibes, great at yes/no questions, but precise measurements are where things get iffy (at least at the moment). This is exactly the kind of task where boring traditional computer vision (detection + tracking + counting) beats the "just ask AI" approach.
Going further
What if you don't just want cars, though? To really be Bloomberg, you'd want to spot taxis, box trucks, and commercial vehicles separately. While you might find a pretrained model that does that, you can also train your own with a handful of labeled images.
If you're interested in going to the next level, check out the Roboflow docs — you can train a custom YOLO model, swap it in, and everything else stays the same.