Passive scraping for social media
NICAR 2025, Minneapolis
Jonathan Soma, Columbia University
jonathan.soma@gmail.com 🦃 Twitter 🦃 Bluesky 🦃 jonathansoma.com
Useful tools
- Instaloader, for downloading Instagram posts, comments and profile content.
- yt-dlp, which will download any video from any website ever (mostly).
- Playwright, my favorite scraping framework (browser automation!)
- HAR Data Extractor
- WARC Data Extractor
Tutorials
Pack-ratting
HAR files
These steps are for Chrome, but the process is roughly the same in other browsers.
Right-click the page and select Inspect to open up the developer tools.
Find the Network tab – you might need to resize the window a little bit, or click the >> arrow to find it.
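Refresh the page and scroll around so the requests you care about get recorded, then click the download icon (Export HAR) in the Network tab's toolbar to save the file.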
You’re done! Now you can head to the HAR Data Extractor.
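If you'd rather poke at the file yourself, a HAR file is plain JSON: the recorded requests sit under log.entries, and each one carries the request URL plus (when the browser saved it) the response body. Here is a minimal sketch of pulling the JSON responses out with nothing but the standard library. The function names and the url_contains filter are made up for illustration, and the sketch assumes response bodies were actually included in your export, which is not always the case.

```python
import base64
import json


def load_har_entries(har_path):
    """Return the list of request/response entries stored in a HAR file."""
    # utf-8-sig reads plain UTF-8 and also tolerates a leading byte-order mark
    with open(har_path, encoding="utf-8-sig") as f:
        har = json.load(f)
    return har["log"]["entries"]


def json_responses(entries, url_contains=""):
    """Yield (url, parsed_body) for every response whose body parses as JSON.

    url_contains is a hypothetical convenience filter: only URLs containing
    that substring are considered. An empty string matches everything.
    """
    for entry in entries:
        url = entry["request"]["url"]
        if url_contains not in url:
            continue
        content = entry["response"].get("content", {})
        text = content.get("text")
        if not text:
            continue  # the body was not saved for this request
        if content.get("encoding") == "base64":
            # binary-ish bodies are stored base64-encoded
            text = base64.b64decode(text).decode("utf-8", errors="replace")
        try:
            yield url, json.loads(text)
        except json.JSONDecodeError:
            continue  # HTML, images, scripts and so on
```

Calling json_responses(load_har_entries("session.har"), url_contains="api") would then give you each matching URL alongside its parsed data. Both the filename and the filter string are placeholders, so swap in whatever matches the site you recorded.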
WACZ files
WACZ/WARC file support is not built into Chrome, so you’ll need to download an extension for it. I recommend Webrecorder ArchiveWeb.page.
Click the WACZ icon on your menu bar, create a new archiving session under “Save to…” if you feel like it, then click Start archiving.
Reload the page to make it start. Scroll around, do whatever. When you’re done, click the Stop archiving button.
Supposedly Autopilot will scroll down for you forever, but it doesn’t always work.
To download your WACZ file, click View Archived Pages, then Download on the bottom left. If you're offered a choice of formats, WACZ is fine!
You’re done! Now you can head to the WARC Data Extractor.
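If you're curious what you just downloaded, a WACZ file is a ZIP archive wrapped around the WARC data plus some indexes. The sketch below lists what's inside and reads the page list using only the standard library. It assumes your file follows the usual WACZ layout, with the page list at pages/pages.jsonl, and the function names are made up for illustration. It does not parse the WARC records themselves.

```python
import json
import zipfile


def list_wacz_contents(wacz_path):
    """Return the names of every file packed inside the WACZ (a ZIP archive)."""
    with zipfile.ZipFile(wacz_path) as zf:
        return zf.namelist()


def read_wacz_pages(wacz_path, pages_file="pages/pages.jsonl"):
    """Return the page records listed in the WACZ page index.

    Assumes the conventional location pages/pages.jsonl; returns an empty
    list if the archive has no such file.
    """
    pages = []
    with zipfile.ZipFile(wacz_path) as zf:
        if pages_file not in zf.namelist():
            return pages
        with zf.open(pages_file) as f:
            for raw_line in f:
                line = raw_line.decode("utf-8").strip()
                if not line:
                    continue  # skip blank lines
                record = json.loads(line)
                # The first line is a header describing the file, not a page,
                # so keep only records that actually have a URL.
                if "url" in record:
                    pages.append(record)
    return pages
```

Running list_wacz_contents on your download should show the WARC files under archive/ next to the indexes and page list, and read_wacz_pages gives you one record per page you visited while archiving.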