In [ ]:

# Install required packages
!pip install --upgrade --quiet playwright
!pip install --upgrade --quiet beautifulsoup4
!pip install --upgrade --quiet lxml
!pip install --upgrade --quiet html5lib
!pip install --upgrade --quiet pandas
!pip install --upgrade --quiet nest_asyncio

print('✓ Packages installed!')

Slides: browser-automation.pdf

In this example we are going to scrape the Texas Department of Licensing and Regulation for tow truck licenses.

Traditionally, Python programmers use BeautifulSoup to scrape content from the internet. Instead of being traditional, we're going to use Playwright, a browser automation tool! This means you actually control the browser! Filling out forms, clicking buttons, downloading documents... it's magic!!!✨✨✨

Texas Tow Truck Licenses¶

What we'll learn/use¶

Selectors
Dropdowns
Single page of results
Creating and saving a dataframe

Installation¶

We need to install a few tools first! Remove the # and run the cell to install the Python packages and browsers that we'll need for our scraping adventure.

In [1]:

# %pip install --quiet lxml html5lib beautifulsoup4 pandas
# %pip install --quiet playwright
# !playwright install-deps
# !playwright install chromium firefox

And we'll set things up so Playwright is sure to work on Windows, in Google Colab, and inside a notebook where an event loop is already running.

In [1]:

# Detect if we're running in Google Colab
import os
IN_COLAB = 'COLAB_GPU' in os.environ or 'COLAB_RELEASE_TAG' in os.environ

import platform
import asyncio
import nest_asyncio

if platform.system() == "Windows":
    asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())

try:
    asyncio.get_running_loop()
    nest_asyncio.apply()
except RuntimeError:
    pass

Opening up the browser and visiting our destination¶

In [2]:

from playwright.async_api import async_playwright

# "Hey, open up a browser"
playwright = await async_playwright().start()

# Colab can't open a visible browser, so we run headless there
# and show the browser window everywhere else
use_headless = IN_COLAB

browser = await playwright.chromium.launch(headless=use_headless)

# Create a new browser window
page = await browser.new_page()
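
A browser started this way keeps running until something closes it. Below is a minimal sketch of a teardown step for the very end of the session. shut_down is a hypothetical helper name (it is not part of Playwright); it only relies on the close() and stop() calls of the browser and playwright objects created above.

```python
async def shut_down(browser, playwright):
    """Close the browser first, then stop the Playwright driver."""
    # Closing the browser also closes every page that was opened from it
    await browser.close()
    # Stopping the driver ends the background process created by .start()
    await playwright.stop()

# In the notebook, once you have the data you need:
# await shut_down(browser, playwright)
```

Don't run it yet, though! We still need the browser for everything that follows.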

In [3]:

await page.goto("https://www.tdlr.texas.gov/cimsfo/")

Out[3]:

<Response url='https://www.tdlr.texas.gov/cimsfo/' request=<Request url='https://www.tdlr.texas.gov/cimsfo/' method='GET'>>

In [4]:

from IPython.display import Image

Image(await page.screenshot())

Out[4]:

[Screenshot of the TDLR search page]

For dropdowns, you always start with await page.locator("select").select_option("whatever option you want"). You'll probably get an error, because there are multiple dropdowns on the page and Playwright doesn't know which one you want to use! Just read the error and figure out the right one.

In [5]:

# await page.locator("select").select_option("Tow Truck Companies")
await page.get_by_label("Search by License Program Type").select_option("Tow Truck Companies")

Out[5]:

['TOW']
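
That ['TOW'] is the hidden value behind the option we picked by its visible text. As a rough sketch, select_option can also be told explicitly whether you mean the value or the label. choose_program is a hypothetical helper name, and using value="TOW" assumes the site keeps that value for tow truck companies.

```python
async def choose_program(page, *, value=None, label=None):
    """Pick a license program by its hidden value or by its visible label."""
    dropdown = page.get_by_label("Search by License Program Type")
    if value is not None:
        # Match on the option's value attribute, e.g. "TOW"
        return await dropdown.select_option(value=value)
    # Otherwise match on the text a visitor actually sees in the dropdown
    return await dropdown.select_option(label=label)

# In the notebook, either of these should select the same option:
# await choose_program(page, value="TOW")
# await choose_program(page, label="Tow Truck Companies")
```

Labels are easier to read, but values tend to change less often when a site rewords its menus.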

Clicking the search button¶

Same as with dropdowns, for buttons you always start with await page.get_by_text("search or submit or whatever").click(). You usually get an error, and then you read the error to find out the right thing to click.

In this case the selector that works is an absolutely nightmarish XPath. Can you imagine figuring that one out manually?

In [6]:

# await page.get_by_text("Search").click()
await page.locator('xpath=//*[@id="dat-menu"]/div/div[2]/div/div/section/div/div/table/tbody/tr/td/form/table/tbody/tr[18]/td/input[1]').click()
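
If the error message is hard to read, you can also ask Playwright what a locator matched. This is a small sketch, not the only way to do it: describe_matches is a hypothetical helper name, and the 120-character cutoff is an arbitrary choice to keep the output short.

```python
async def describe_matches(locator, limit=5):
    """Return a short HTML snippet for each element the locator matches."""
    total = await locator.count()
    snippets = []
    for i in range(min(total, limit)):
        # outerHTML is the element's own tag plus everything inside it
        html = await locator.nth(i).evaluate("el => el.outerHTML")
        snippets.append(html[:120])
    return snippets

# In the notebook:
# await describe_matches(page.get_by_text("Search"))
```

Each snippet shows the tag and attributes of one match, which is usually enough to write a more specific selector.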

Grab the tables from the page¶

Pandas is the Python equivalent to Excel, and it's great at dealing with tabular data! Often the data on a web page that looks like a spreadsheet can be read with pd.read_html.

You use await page.content() to save the contents of the page, then feed it to read_html to find the tables. len(tables) checks the number of tables you have, then you manually poke around to see which one is the one you're interested in. tables[0] is the first one, tables[1] is the second one, and so on...

In [7]:

import pandas as pd
from io import StringIO

html = await page.content()
tables = pd.read_html(StringIO(html))
len(tables)

Out[7]:

1

In this case there's only one table, so we'll look at the first one. We're saving it as df because... that's what Python/pandas people do!

In [8]:

df = tables[0]
df.head()

Out[8]:

	Name and Location	Order	Basis for Order
0	OLIVAREZ, BENITO OLIVAREZ, JUANITA Company: BE...	Date: 2/24/2026 Respondents Benito Olivarez an...	Respondents performed an unauthorized private ...
1	Company: EL PASO TOWING City: EL PASO County...	Date: 2/23/2026 Respondent is assessed an admi...	Respondent failed to accept cash, credit cards...
2	Company: AP TOWING City: FORT WORTH County: ...	Date: 2/3/2026 Respondent is assessed an admin...	Respondent performed an unauthorized private p...
3	GARCIA, JOSE A Company: ANGEL WRECKER SERVICE...	Date: 2/3/2026 Respondent is assessed an admin...	Respondent performed an illegal tow.
4	Company: INNOVATIVE PARKING MANAGEMENT, INC C...	Date: 1/29/2026 Respondent is assessed an admi...	Respondent performed an unauthorized private p...
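
This page only had one table, but plenty of pages have several. Instead of poking through tables[0], tables[1] and so on, you can ask read_html for only the tables that contain certain text by using its match parameter. The sketch below uses a made-up HTML snippet (the company names are invented for illustration), not the real TDLR page.

```python
from io import StringIO

import pandas as pd

# Made-up page with two tables: a menu we don't want and the data we do
sample_html = """
<table>
  <tr><th>Menu</th></tr>
  <tr><td>Home</td></tr>
</table>
<table>
  <tr><th>Company</th><th>City</th></tr>
  <tr><td>EXAMPLE TOWING</td><td>AUSTIN</td></tr>
  <tr><td>SAMPLE WRECKER</td><td>EL PASO</td></tr>
</table>
"""

# Without match we get every table on the page
all_tables = pd.read_html(StringIO(sample_html))

# With match we only get tables whose text contains "Company"
matching = pd.read_html(StringIO(sample_html), match="Company")
companies = matching[0]

print(len(all_tables), len(matching))
print(companies)
```

If nothing on the page matches, read_html raises a ValueError instead of returning an empty list.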

Saving the results¶

Now we'll save it to a CSV file! Easy peasy.

In [9]:

df.to_csv("output.csv", index=False)
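
Why index=False? Without it, pandas also writes the row numbers (0, 1, 2...) as an extra first column. Here is a tiny demonstration with a made-up dataframe saved to a temporary folder; the file names and company names are just placeholders.

```python
import os
import tempfile

import pandas as pd

# Made-up data standing in for the scraped table
sample = pd.DataFrame({
    "Company": ["EXAMPLE TOWING", "SAMPLE WRECKER"],
    "City": ["AUSTIN", "EL PASO"],
})

folder = tempfile.mkdtemp()
with_index = os.path.join(folder, "with_index.csv")
without_index = os.path.join(folder, "without_index.csv")

sample.to_csv(with_index)                   # row numbers are written too
sample.to_csv(without_index, index=False)   # only the real columns

# Read both files back in to compare their columns
cols_with = list(pd.read_csv(with_index).columns)
cols_without = list(pd.read_csv(without_index).columns)

print(cols_with)
print(cols_without)
```

The file saved with the index comes back with an extra "Unnamed: 0" column, which is the clutter that index=False avoids.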