# Install required packages
!pip install --upgrade --quiet playwright
!pip install --upgrade --quiet beautifulsoup4
!pip install --upgrade --quiet lxml
!pip install --upgrade --quiet html5lib
!pip install --upgrade --quiet pandas
!pip install --upgrade --quiet nest_asyncio
print('✓ Packages installed!')
Slides: browser-automation.pdf
In this example we are going to scrape North Dakota's Oil and Gas wells search.
Traditionally Python programmers use BeautifulSoup to scrape content from the interent. Instead of being traditional, we're going to use Playwright, a browser automation tool! This means you actually control the browser! Filling out forms, clicking buttons, downloading documents... it's magic!!!✨✨✨
We need to install a few tools first! Remove the # and run the cell to install the Python packages and browsers that we'll need for our scraping adventure.
# %pip install --quiet lxml html5lib beautifulsoup4 pandas
# %pip install --quiet playwright
# !playwright install-deps
# !playwright install chromium firefox
And we'll set it up so Playwright will be sure to work on Windows.
# Detect if we're running in Google Colab
import os
IN_COLAB = 'COLAB_GPU' in os.environ or 'COLAB_RELEASE_TAG' in os.environ
import platform
import asyncio
import nest_asyncio
if platform.system() == "Windows":
asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())
try:
asyncio.get_running_loop()
nest_asyncio.apply()
except RuntimeError:
pass
from playwright.async_api import async_playwright
# "Hey, open up a browser"
playwright = await async_playwright().start()
# Colab can't open a visible browser, so we run headless there
if IN_COLAB:
use_headless = True
else:
use_headless = False
browser = await playwright.chromium.launch(headless=use_headless)
# Create a new browser window
page = await browser.new_page()
await page.goto("https://www.dmr.nd.gov/oilgas/findwellsvw.asp")
from IPython.display import Image
Image(await page.screenshot())
You always start with await page.locator("select").select_option("whatever option you want"). You'll probably get an error because there are multiple dropdowns on the page, but Playwright doesn't know which one you want to use! Just read the error and figure out the right one.
# await page.locator("select").select_option("129")
await page.get_by_label("Township:").select_option("129")
#await page.get_by_text("Submit").click()
await page.get_by_role("button", name="Submit").click()
Pandas is the Python equivalent to Excel, and it's great at dealing with tabular data! Often the data on a web page that looks like a spreadsheet can be read with pd.read_html.
You use await page.content() to save the contents of the page, then feed it to read_html to find the tables. len(tables) checks the number of tables you have, then you manually poke around to see which one is the one you're interested in. tables[0] is the first one, tables[1] is the second one, and so on...
import pandas as pd
from io import StringIO
html = await page.content()
# Wait for the table to show up
await page.wait_for_selector("table", timeout=10000)
tables = pd.read_html(StringIO(html))
len(tables)
tables[2]
We want data for EVERY township! We'll start by finding all of the options for the Township dropdown.
If you use this on another page, it's the same content for filling out the dropdown, except you use .locator("option").all_text_contents() instead of .select_option.
options = await page.get_by_label("Township:").locator("option").all_text_contents()
options[:10]
We're going to remove "Select a Township" because... that isn't a township!
options.remove("Select a Township")
Now we'll go through every single one of the township options and combine the results.
Well... the first three! If you want all of the options, just remove the
[:3]from it.
all_data = pd.DataFrame()
# Try it for three pages
# for option in options:
for option in options[:3]:
# Click and wait for load
print("Querying township", option)
await page.get_by_label("Township:").select_option(option)
await page.get_by_text("Submit").click()
# This is another way of waiting for the page to load,
# by waiting for a specific element to show up
await page.get_by_text("CTB No").wait_for()
# Get all of the tables on the page
html = await page.content()
tables = pd.read_html(StringIO(html))
df = tables[2]
df['township'] = option
# Add the tables on this page to
all_data = pd.concat([all_data, df], ignore_index = True)
# Save after each township in case something breaks
all_data.to_csv("output.csv", index=False)
all_data.head()
Now we'll save it to a CSV file! Easy peasy.
all_data.to_csv("output.csv", index=False)