These are the notes for the Advanced Web Scraping - with AI flavor! session, which was a sample class and info session hosted by Professor Jonathan Soma for the Lede Program, a summer data journalism intensive at Columbia Journalism School.
In this session we'll learn to use Playwright along with a particular AI prompt to write a scraper.
Requests and BeautifulSoup intro
The traditional entry point for learning to scrape in Python is using requests and BeautifulSoup. It's usually great!
In the case below, we're using it to scrape headlines from Le Monde's English-language website.

import requests
from bs4 import BeautifulSoup

response = requests.get("https://www.lemonde.fr/en/")
doc = BeautifulSoup(response.text)
Sometimes you’ll get lucky and be able to scrape by just specifying a tag name…
headlines = doc.find_all('h3')
for headline in headlines:
    print(headline.text)
…but more often than not a class is going to be more effective.
headlines = doc.find_all(class_='article__title')
for headline in headlines:
    print(headline.text)
Where requests + BeautifulSoup fails
Some websites you’ll be able to download fine with requests, but when you start trying to use BeautifulSoup nothing shows up. For example, if we try to access OpenSyllabus listing pages we won’t see any books show up in BeautifulSoup.
response = requests.get("https://analytics.opensyllabus.org/record/works")
doc = BeautifulSoup(response.text)

doc.find_all(class_='fOVKMS')
This is because the page retrieved by requests doesn't actually have all those books on it.
response.text
This is because visiting this site is a two-step process: first the browser loads a bare-bones skeleton page, then it goes out and gets the actual information. Requests doesn't do that second step, so we need to try another tool!
Enter Playwright
Instead of pulling the raw HTML contents of the page, Playwright actually controls your browser for you! It can load pages up, you can click things, fill out forms, all sorts of things. To begin we’ll just access the same OpenSyllabus page as before and see the actual contents.
from playwright.async_api import async_playwright
# "Hey, open up a browser"
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)

# Create a new browser window
page = await browser.new_page()
# Tell it to go to this page
await page.goto("https://analytics.opensyllabus.org/record/works")
Some people will actually scrape the page with Playwright, grabbing titles and all of that, but I find it's easiest to take the HTML (the full HTML, after the skeleton has been filled in) and feed it to BeautifulSoup, just like we're used to.
html_content = await page.content()

doc = BeautifulSoup(html_content)
Now that we know how to access the page, we can grab the content just like we’d do with a “normal” requests/BeautifulSoup page.
doc.find_all(class_='fOVKMS')
And then we can do all anyone ever wants to do, which is convert it into a spreadsheet! But we can't get too excited yet…
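As a sketch of what that spreadsheet step might look like, here's the same pattern run against a stand-in snippet of HTML with pandas (the `fOVKMS` class name comes from the page above, but it's auto-generated and could change at any time):

```python
from bs4 import BeautifulSoup
import pandas as pd

# Stand-in for the HTML Playwright handed us; on the real page the
# class name ('fOVKMS') is auto-generated and may change.
html_content = """
<div>
  <a class="fOVKMS">The Elements of Style</a>
  <a class="fOVKMS">A Writer's Reference</a>
</div>
"""

doc = BeautifulSoup(html_content, "html.parser")
titles = [tag.text for tag in doc.find_all(class_="fOVKMS")]

# One column of book titles, saved out as a CSV
df = pd.DataFrame({"title": titles})
df.to_csv("books.csv", index=False)
```

The same `df` can just as easily go to Excel with `df.to_excel(...)` if that's what your newsroom prefers.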
Interacting with the page
If we scroll down a bit, we see that the page only lists the top 50 books. We want more than that! And we get that by clicking the “Show More” button.
Playwright makes it easy with page.get_by_text and .click() – but instead of writing the code ourselves, we're just going to get ChatGPT (or Claude, or Deepseek…) to write the code for us.
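If you're curious what the AI tends to hand back, it's usually some variation on a click-until-it's-gone loop. Here's a hedged sketch – the button text "Show More", the click limit, and the one-second pause are all assumptions about this particular page, not anything Playwright requires:

```python
# A sketch of clicking "Show More" until it disappears. `page` is an
# already-open Playwright async page; the button text and the pause
# are assumptions about this particular site.
async def click_show_more(page, max_clicks=10):
    for _ in range(max_clicks):
        button = page.get_by_text("Show More")
        if await button.count() == 0:
            break  # button is gone, everything has loaded
        await button.first.click()
        await page.wait_for_timeout(1000)  # give the new rows time to load
```

After the loop finishes you'd grab `await page.content()` and feed it to BeautifulSoup exactly as before.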
We’ll use this magical prompt to make things happen.
Filling out forms
Let’s try another page where we need to fill out some forms. The North Dakota well search page is a good one!
Selecting from dropdowns is easy! But again, we don’t need to know how to do it: we’ll just use the prompt and be guided by the tool.
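For a sense of the shape of the code the prompt produces for a form like this, here's a rough sketch – every selector below is a made-up placeholder, not the real markup of the North Dakota well search page:

```python
# Sketch of filling out a search form with Playwright. `page` is an
# already-open async page; every selector here is a hypothetical
# placeholder, not the real markup of the well search page.
async def search_wells(page, county="McKenzie"):
    await page.select_option("select#county", label=county)  # pick from the dropdown
    await page.fill("input#well-name", "")                   # leave the name blank
    await page.click("button[type='submit']")                # run the search
    await page.wait_for_load_state("networkidle")            # wait for the results
```

The point of the session isn't memorizing these method names, though: it's that the prompt gets the AI to find the real selectors and write this for you.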