In [ ]:
# Install required packages
!pip install --upgrade --quiet playwright
!pip install --upgrade --quiet beautifulsoup4
!pip install --upgrade --quiet lxml
!pip install --upgrade --quiet html5lib
!pip install --upgrade --quiet pandas
!pip install --upgrade --quiet nest_asyncio

print('✓ Packages installed!')

In this example we are going to scrape board actions from the North Carolina State Bar Discipline Orders. Unlike last time, we're going to be downloading a bunch of PDFs with details!

Traditionally Python programmers use BeautifulSoup to scrape content from the internet. Instead of being traditional, we're going to use Playwright, a browser automation tool! This means you actually control the browser! Filling out forms, clicking buttons, downloading documents... it's magic!!!✨✨✨

North Carolina State Bar Discipline Orders

  • Filling out text inputs
  • Inspecting the page
  • Looping through inputs
  • Pagination
  • Downloading PDFs (using Firefox)

Installation

We need to install a few tools first! Remove the # and run the cell to install the Python packages and browsers that we'll need for our scraping adventure.

In [1]:
# %pip install --quiet lxml html5lib beautifulsoup4 pandas
# %pip install --quiet playwright
# !playwright install-deps
# !playwright install chromium firefox

And we'll do a little setup to be sure Playwright's async API works on Windows and plays nicely with the event loop that notebooks already run.

In [43]:
# Detect if we're running in Google Colab
import os
IN_COLAB = 'COLAB_GPU' in os.environ or 'COLAB_RELEASE_TAG' in os.environ

import platform
import asyncio
import nest_asyncio

if platform.system() == "Windows":
    asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())

try:
    asyncio.get_running_loop()
    nest_asyncio.apply()
except RuntimeError:
    pass
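Notebooks already run an event loop, which is why we can await things at the top level of a cell (and why nest_asyncio helps when something tries to start a second loop). In a plain Python script there's no loop running yet, so you'd wrap the same calls in a coroutine and hand it to asyncio.run. A minimal sketch of that shape, with a stand-in coroutine instead of real Playwright calls:

```python
import asyncio

async def main():
    # In a script, Playwright calls like async_playwright().start()
    # would live inside this coroutine.
    await asyncio.sleep(0)  # stand-in for the real async work
    return "done"

# asyncio.run starts a fresh event loop, runs main(), and shuts it down
result = asyncio.run(main())
print(result)
```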

Opening up the browser and visiting our destination

We've been using Chromium (basically Chrome) for most of our exercises, but in this case we're using Firefox! Chromium for some reason sometimes gets blocked, while Firefox doesn't. Not sure why!

In [81]:
from playwright.async_api import async_playwright

# "Hey, open up a browser"
playwright = await async_playwright().start()

# Colab can't open a visible browser, so we run headless there
if IN_COLAB:
    use_headless = True
else:
    use_headless = False

browser = await playwright.firefox.launch(headless=use_headless)

# Create a new browser window
page = await browser.new_page()
In [82]:
await page.goto("https://www.ncbar.gov/lawyer-discipline/search-past-orders/orders-in-discipline-and-disability-cases/")
Out[82]:
<Response url='https://www.ncbar.gov/lawyer-discipline/search-past-orders/orders-in-discipline-and-disability-cases/' request=<Request url='https://www.ncbar.gov/lawyer-discipline/search-past-orders/orders-in-discipline-and-disability-cases/' method='GET'>>
In [67]:
from IPython.display import Image

Image(await page.screenshot())
Out[67]:
[Screenshot of the NC State Bar discipline orders search page]

Filling in a single letter last name and search

Filling in text fields, waiting for buttons to show up, clicking them. It gets a little wild because the number in the last name field's id changes between page loads, sometimes 21 and sometimes 22, so we match on a substring of the id instead of the full thing.

In [59]:
# await page.locator("input").fill("A")
# await page.locator("#ContentPlaceHolderDefault_MainContent_MainContent_Item1_DisciplinaryOrdersSearch_21_txtAttorneyLastName").fill("A")
await page.locator("input[id*='txtAttorneyLastName']").fill("A")
In [60]:
# await page.get_by_text("Search").click()
await page.get_by_role("button", name="Search").click()
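The [id*='...'] selector matches any element whose id contains that substring, which is what saves us from the shifting 21/22 number. The same idea in plain Python, with made-up ids for illustration:

```python
# Hypothetical ids: the numeric chunk changes between page loads
ids = [
    "DisciplinaryOrdersSearch_21_txtAttorneyLastName",
    "DisciplinaryOrdersSearch_22_txtAttorneyLastName",
]

# Substring matching finds the field no matter which number shows up
matches = [i for i in ids if "txtAttorneyLastName" in i]
print(len(matches))
```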

Downloading one of the PDFs

Sometimes PDFs open up in Chrome/Chromium's built-in viewer instead of downloading. We use Firefox to make sure they actually download as files.

In [27]:
from pathlib import Path

Path("downloads").mkdir(exist_ok=True)

# "Link in a table"
links = page.locator("table a")

# More specific
# links = page.locator("table.gv_search a[href*='DisciplinaryOrderHandler']")

async with page.expect_download(timeout=10000) as download_info:
    link = links.nth(1)
    name = await link.text_content()
    print("Clicking", name)
    await link.click()
    download = await download_info.value

# They'll all have the same name!
#filename = Path("downloads") / download.suggested_filename
filename = Path("downloads") / f"{name}.pdf"
print("Saving as", filename)

# Wait for the download process to complete and save the downloaded file to "Downloads"
await download.save_as(filename)
Clicking Ackerman, Nicholas S. - 14G0313 - Censure
Saving as downloads/Ackerman, Nicholas S. - 14G0313 - Censure.pdf
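Because every download gets the same suggested filename, we name the files after the link text instead. Link text can contain characters that are awkward in filenames, though, so a small (hypothetical) sanitizer helps:

```python
import re
from pathlib import Path

def safe_filename(text, ext=".pdf"):
    # Replace anything that isn't a letter, digit, space, comma,
    # period, or dash with an underscore
    cleaned = re.sub(r"[^\w\s,.\-]", "_", text).strip()
    return Path("downloads") / f"{cleaned}{ext}"

print(safe_filename("Ackerman, Nicholas S. - 14G0313 - Censure"))
```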

Dealing with the pagination

The pagination on this one is crazy.

I ended up pasting the entire contents of the weird pager link and saying "hack the pagination directly instead of clicking," then went back and forth for about half an hour until it worked!

In [100]:
import time

async def next_page(page):
    PAGER_ROW = "table.gv_search tr:last-child"
    current = await page.query_selector(f"{PAGER_ROW} span")
    if not current:
        return False

    current_num = int((await current.inner_text()).strip())
    next_num = current_num + 1

    any_link = await page.query_selector(f"{PAGER_ROW} a[href*='doPostBack']")
    if not any_link:
        return False

    href = await any_link.get_attribute("href")
    control = href.split("'")[1]

    await page.evaluate(f"__doPostBack('{control}','Page${next_num}')")
    await page.wait_for_load_state("networkidle")
    try:
        await page.wait_for_function(
            f"document.querySelector('table.gv_search tr:last-child span')?.textContent?.trim() !== '{current_num}'"
        , timeout=10000)
    except Exception:
        return False
    
    # Check if it actually moved
    new_current = await page.query_selector(f"{PAGER_ROW} span")
    if not new_current:
        return False
    new_num = int((await new_current.inner_text()).strip())
    return new_num == next_num

Any time we want to go to the next page, we just say await next_page(page). If we get back True it went to the next page, False means we're on the last page.

In [103]:
await next_page(page)
Out[103]:
True
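The sneaky part of next_page is pulling the control name out of the pager link's href. Those ASP.NET links look roughly like javascript:__doPostBack('control','argument'), so splitting on single quotes grabs the control. In isolation, with a made-up href:

```python
# Hypothetical href in the shape the ASP.NET pager uses
href = "javascript:__doPostBack('ctl00$MainContent$gvOrders','Page$2')"

# The control name sits between the first pair of single quotes
control = href.split("'")[1]
print(control)
```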

Putting it all together

In theory we'd loop through every letter of the alphabet, and every page of each letter. That would take two hundred years, though, so instead we'll add a max_pages and just do 3 letters.
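If you ever did want all twenty-six letters instead of just three, string.ascii_lowercase saves typing the list out by hand:

```python
import string

# Every lowercase letter, ready to feed into the search loop
letters = list(string.ascii_lowercase)
print(letters[:3], len(letters))
```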

In [87]:
import time

letters = ['a', 'b', 'c']

output_dir = Path("downloads")
output_dir.mkdir(exist_ok=True)
max_pages = 3

for letter in letters:
    print("Searching letter", letter)
    await page.goto("https://www.ncbar.gov/lawyer-discipline/search-past-orders/orders-in-discipline-and-disability-cases/")    
    await page.wait_for_load_state("networkidle")

    # Fill out letter
    await page.locator("input[id*='txtAttorneyLastName']").fill(letter)

    # Click search button
    await page.get_by_role("button", name="Search").click()
    
    current_page = 1
    while current_page <= max_pages:
        print("Scraping page", current_page)
        links = page.locator("a[href*='DisciplinaryOrderHandler']")
        await links.first.wait_for()
        count = await links.count()
        for i in range(count):
            async with page.expect_download(timeout=10000) as download_info:
                link = links.nth(i)
                name = await link.text_content()
                filename = Path("downloads") / f"{name}_{current_page}_{i}.pdf"
                print("Downloading", name)
                await link.click()
                download = await download_info.value
            
            # Wait for the download process to complete and save the downloaded file to "Downloads"
            await download.save_as(filename)

        if not await next_page(page):
            print("Breaking")
            break
        current_page = current_page + 1
Searching letter a
Scraping page 1
Downloading Ackerman, Nicholas S. - 10G0268 - Reprimand
Downloading Ackerman, Nicholas S. - 14G0313 - Censure
Downloading Ackerman, Nicholas S. - 16DHC33 - Suspension with All Stayed
Downloading Ackerman, Nicholas S. - 16DHC33 - Suspension with All Stayed
Downloading Ackerman, Nicholas S. - 16DHC33SC - Suspension Possible Stay
Downloading Adams, Robert W. - 00DHC1 - Suspension Possible Stay
Downloading Adams, Robert W. - 01BSR1 - Reinstatement
Downloading Adams, Robert W. - 13DHC17 - Suspension Possible Stay
Downloading Adams, Robert W. - 96G0320 - Reprimand
Downloading Adams, Robert W. - 96G1329 - Censure
Breaking
Searching letter b
Scraping page 1
Downloading Asbill, Craig Owen - 16DHC40 - Suspension Possible Stay
Downloading Asbill, Craig Owen - 16G0748 - Reprimand
Downloading Asbill, Craig Owen - 18G0161 - Censure
Downloading Badgett, Richard G. - 06DHC1 - Censure
Downloading Badgett, Mark H. - 09DHC6 - Disbarred
Downloading Bagwell, O. Kenneth - 83J1 - Suspension
Downloading Bagwell, O. Kenneth - 84J1 - Suspension with All Stayed
Downloading Bailey, Edward Grey - 94DHC14 - Suspension with All Stayed
Downloading Bailey, Michael A. - 93G0980 - Reprimand
Downloading Bailey, Earnest Nowlin - 23DHC1N - Suspension
Breaking
Searching letter c
Scraping page 1
Downloading Ackerman, Nicholas S. - 10G0268 - Reprimand
Downloading Ackerman, Nicholas S. - 14G0313 - Censure
Downloading Ackerman, Nicholas S. - 16DHC33 - Suspension with All Stayed
Downloading Ackerman, Nicholas S. - 16DHC33 - Suspension with All Stayed
Downloading Ackerman, Nicholas S. - 16DHC33SC - Suspension Possible Stay
Downloading Akachukwu, Mildred A. - 09DHC32 - Disbarred
Downloading Arceneaux, Wayne T. - 91G0163 - Suspension with All Stayed
Downloading Ballance, Frank W. - 04DHC50 - Dismissed with Prejudice
Downloading Ballance, Frank W. - 04DHC50 - Dismissed with Prejudice
Downloading Ballance, Frank W. - 05BCS2 - Disbarred
Breaking
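Once the loop finishes, it's worth sanity-checking that what landed in downloads/ is really PDF data: real PDFs start with the magic bytes %PDF. A quick check, demonstrated here on a throwaway file rather than a real download:

```python
from pathlib import Path

def is_pdf(path):
    # Real PDF files begin with the magic bytes %PDF
    with open(path, "rb") as f:
        return f.read(4) == b"%PDF"

# Demonstrate on a throwaway file rather than a real download
sample = Path("sample.pdf")
sample.write_bytes(b"%PDF-1.4 fake contents")
print(is_pdf(sample))
```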