Passive scraping for social media

NICAR 2025, Minneapolis

Jonathan Soma, Columbia University

jonathan.soma@gmail.com 🦃 Twitter 🦃 Bluesky 🦃 jonathansoma.com

Slides

Useful tools

Instaloader, for downloading Instagram posts, comments and profile content.
yt-dlp, which will download any video from any website ever (mostly).
Playwright, my favorite scraping framework (browser automation!).
HAR data extractor
WARC data extractor

Tutorials

Finding Undocumented APIs

Pack-ratting

HAR files

This is for Chrome, but is vaguely the same for other browsers.

Right-click the page and select Inspect to open up the developer tools.

Right-click, inspect

Find the Network tab – you might need to resize the window a little bit, or click the >> arrow to find it.

Network tab

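Refresh the page so the Network tab records the requests from the start, scroll around to trigger more of them, then click the download icon in the Network toolbar to save everything as a HAR file.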

Download HAR

You’re done! Now you can head to the HAR Data Extractor.
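If you'd rather poke at the file yourself, a HAR file is just JSON: each captured request lives under log.entries, with the response body (when the browser kept it) in response.content.text. Here is a minimal sketch of pulling out the JSON responses, which is usually where the good stuff from an undocumented API hides. The function name, the sample data, and the example.har filename are all made up for illustration, and this is not how the HAR Data Extractor works internally.

```python
import base64
import json


def json_responses(har):
    """Yield (url, parsed_body) for each JSON response in a loaded HAR dict."""
    for entry in har["log"]["entries"]:
        content = entry["response"].get("content", {})
        text = content.get("text")
        # Bodies can be missing entirely, so skip anything without text.
        if not text or "json" not in content.get("mimeType", ""):
            continue
        # Some bodies are stored base64-encoded and flagged as such.
        if content.get("encoding") == "base64":
            text = base64.b64decode(text).decode("utf-8", errors="replace")
        try:
            body = json.loads(text)
        except json.JSONDecodeError:
            continue  # mislabeled or truncated body, skip it
        yield entry["request"]["url"], body


# A tiny hand-written stand-in for a real export, just to show the shape.
sample_har = {
    "log": {
        "entries": [
            {
                "request": {"url": "https://example.com/api/posts"},
                "response": {
                    "content": {
                        "mimeType": "application/json",
                        "text": '{"posts": [{"id": 1}]}',
                    }
                },
            },
            {
                "request": {"url": "https://example.com/logo.png"},
                "response": {"content": {"mimeType": "image/png"}},
            },
        ]
    }
}

for url, body in json_responses(sample_har):
    print(url, body)

# With a real export (the filename is a placeholder):
# with open("example.har", encoding="utf-8") as f:
#     har = json.load(f)
```

Filtering on the MIME type is a rough heuristic: it skips images, scripts, and stylesheets, but it will also miss JSON that a site serves with the wrong content type.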

WACZ files

WACZ/WARC file support is not built into Chrome, so you’ll need to download an extension for it. I recommend Webrecorder ArchiveWeb.page.

Click the ArchiveWeb.page (WACZ) icon in your browser’s toolbar, create a new archiving session under “Save to…” if you feel like it, then click Start archiving.

Open up WACZ

Reload the page to make it start. Scroll around, do whatever. When you’re done, click the Stop archiving button.

Supposedly Autopilot will scroll down for you forever, but it doesn’t always work.

Stop archiving with WACZ

To download your WACZ file, click View Archived Pages, then Download on the bottom left. If you’re offered a choice of formats, the WACZ file is fine!

Download your WACZ file

You’re done! Now you can head to the WARC Data Extractor.
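If you're curious what you just downloaded, a WACZ file is a ZIP archive: the captured traffic sits in WARC files under archive/, and the list of pages you visited is in pages/pages.jsonl. The sketch below uses only Python's standard library to list both. The function name and the fake in-memory archive are invented for the demo, and it only peeks at the packaging. Actually reading the WARC records usually takes a dedicated library such as warcio, which isn't shown here.

```python
import io
import json
import zipfile


def summarize_wacz(wacz_file):
    """Return (warc_names, page_urls) for a WACZ path or file-like object."""
    with zipfile.ZipFile(wacz_file) as zf:
        names = zf.namelist()
        # The raw captured traffic lives in WARC files under archive/.
        warcs = [
            n for n in names
            if n.startswith("archive/") and n.endswith((".warc", ".warc.gz"))
        ]
        pages = []
        if "pages/pages.jsonl" in names:
            with zf.open("pages/pages.jsonl") as f:
                for line in io.TextIOWrapper(f, encoding="utf-8"):
                    if not line.strip():
                        continue
                    record = json.loads(line)
                    # The first line is a header without a "url" key.
                    if "url" in record:
                        pages.append(record["url"])
    return warcs, pages


# Build a fake, empty-ish WACZ in memory so the sketch runs on its own.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr("archive/data.warc.gz", b"")
    zf.writestr(
        "pages/pages.jsonl",
        '{"format": "json-pages-1.0", "id": "pages"}\n'
        '{"url": "https://example.com/", "title": "Example"}\n',
    )
demo.seek(0)

warcs, pages = summarize_wacz(demo)
print(warcs, pages)

# With a real download (the filename is a placeholder):
# warcs, pages = summarize_wacz("my-archive.wacz")
```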
