# Install required packages
!pip install --upgrade --quiet playwright
!pip install --upgrade --quiet beautifulsoup4
!pip install --upgrade --quiet lxml
!pip install --upgrade --quiet html5lib
!pip install --upgrade --quiet pandas
!pip install --upgrade --quiet nest_asyncio
print('✓ Packages installed!')
Slides: browser-automation.pdf
In this example we're going to scrape locksmiths from Maryland's occupational licensing site.
Traditionally, Python programmers use BeautifulSoup to scrape content from the internet. Instead of being traditional, we're going to use Playwright, a browser automation tool! This means you actually control the browser! Filling out forms, clicking buttons, downloading documents... it's magic!!!✨✨✨
We need to install a few tools first! Remove the #s and run the cells below to install the Python packages and browsers that we'll need for our scraping adventure.
# %pip install --quiet lxml html5lib beautifulsoup4 pandas
# %pip install --quiet playwright
# !playwright install-deps
# !playwright install chromium firefox
And we'll do a little setup so Playwright plays nicely with notebooks (and with Windows).
# Detect if we're running in Google Colab
import os
IN_COLAB = 'COLAB_GPU' in os.environ or 'COLAB_RELEASE_TAG' in os.environ

import platform
import asyncio
import nest_asyncio

# Windows needs a different event loop policy for Playwright to launch browsers
if platform.system() == "Windows":
    asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())

# Notebooks already run an event loop, so let asyncio calls nest inside it
try:
    asyncio.get_running_loop()
    nest_asyncio.apply()
except RuntimeError:
    pass
from playwright.async_api import async_playwright
# "Hey, open up a browser"
playwright = await async_playwright().start()
# Colab can't open a visible browser, so we run headless there
if IN_COLAB:
    use_headless = True
else:
    use_headless = False

browser = await playwright.chromium.launch(headless=use_headless)
# Create a new page (a tab) in the browser
page = await browser.new_page()
await page.goto("https://www.dllr.state.md.us/cgi-bin/ElectronicLicensing/OP_Search/OP_search.cgi?calling_app=LOCKSMITH::LOCKSMITH_personal_location")
from IPython.display import Image
Image(await page.screenshot())
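If you'd rather save the screenshot to a file than display it in the notebook, screenshot takes a path argument (the filename here is just an example):
# Save the screenshot to disk instead of displaying it inline
await page.screenshot(path="search-page.png")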
You always start with await page.locator("input").fill("whatever you want"). You'll probably get an error because there are multiple inputs on the page and Playwright doesn't know which one you want! Just read the error and figure out the right one (a couple of other ways to narrow things down are sketched after the code below).
# A few zip codes to try:
# 20601
# 20602
# 20603
# 20606
# 20607
# 20608
# 20609
# await page.locator("input").fill("20602")
await page.locator("[name='zip']").fill("20602")
# await page.get_by_text("Search").click()
await page.get_by_role("button", name="Search").click()
Pandas is the Python equivalent of Excel, and it's great at dealing with tabular data! Often the data on a web page that looks like a spreadsheet can be read with pd.read_html.
You use await page.content() to grab the HTML of the page, then feed it to read_html to find the tables. len(tables) checks the number of tables you have, then you manually poke around to see which one is the one you're interested in. tables[0] is the first one, tables[1] is the second one, and so on...
import pandas as pd
from io import StringIO
# Wait for the results table to show up, *then* grab the page's HTML
await page.wait_for_selector("table", timeout=5000)
html = await page.content()
tables = pd.read_html(StringIO(html))
len(tables)
tables[0]
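If a page has lots of tables and poking through them one by one gets tedious, read_html can also filter by text with its match argument. Just for reference (the "License" text is an assumption, adjust it to whatever actually appears in your table):
# Only keep tables containing the text "License" (assumed text, check your page!)
# tables = pd.read_html(StringIO(html), match="License")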
await page.go_back()
I found a list of zipcodes on the internet! I pasted them below, then used .split() to make them into something we could use in Python.
zipcodes = """20906
21234
20878
21740
20874
21122
21222
21117
20904
20744
21061
21215
20902
20772
21207
20850
21206
20774
20783
21228
20854
20852
21043
21702
21218
21044
21921
20910
21224
21229""".split("\n")
print(zipcodes)
Now we fill out the form for each and every zip code, one by one, pulling out the tables and adding them to one combined dataframe, saving as we go.
import pandas as pd
from io import StringIO

all_data = pd.DataFrame()

# Go to the front page
await page.goto("https://www.dllr.state.md.us/cgi-bin/ElectronicLicensing/OP_Search/OP_search.cgi?calling_app=LOCKSMITH::LOCKSMITH_personal_location")

# Search for each zipcode
for zipcode in zipcodes:
    print("Searching for", zipcode)

    # Fill out the form and search
    await page.locator("[name='zip']").fill(zipcode)
    await page.get_by_role("button", name="Search").click()

    # Get all of the tables on the page (an empty list if none show up in time)
    try:
        await page.wait_for_selector("table", timeout=5000)
        html = await page.content()
        tables = pd.read_html(StringIO(html))
    except Exception:
        tables = []

    # Add the table on this page to our combined dataframe
    if len(tables) > 0:
        df = tables[0]
        print("Found", len(df))
        all_data = pd.concat([all_data, df], ignore_index=True)
    else:
        print("Nothing found")

    # Save after each zip code in case something breaks
    all_data.to_csv("output.csv", index=False)

    # Go back and start again
    await page.go_back()
len(all_data)
all_data.head()
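Depending on the site, the same locksmith can show up under more than one search, so you might end up with duplicate rows. A quick cleanup sketch (whether dropping exact duplicates is right depends on your data):
# Drop rows that are exact duplicates across every column
all_data = all_data.drop_duplicates(ignore_index=True)
len(all_data)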
Now we'll save it to a CSV file! Easy peasy.
all_data.to_csv("output.csv", index=False)