Run in Colab Download notebook

In [ ]:

# Install required packages
!pip install --upgrade --quiet playwright
!pip install --upgrade --quiet beautifulsoup4
!pip install --upgrade --quiet lxml
!pip install --upgrade --quiet html5lib
!pip install --upgrade --quiet pandas
!pip install --upgrade --quiet nest_asyncio

print('✓ Packages installed!')

Slides: browser-automation.pdf

In this example we are going to scrape North Dakota's Oil and Gas wells search.

Traditionally Python programmers use BeautifulSoup to scrape content from the interent. Instead of being traditional, we're going to use Playwright, a browser automation tool! This means you actually control the browser! Filling out forms, clicking buttons, downloading documents... it's magic!!!✨✨✨

North Dakota Oil Wells¶

Filling out dropdowns
Using every option in the dropdown
Combining dataframes

Installation¶

We need to install a few tools first! Remove the # and run the cell to install the Python packages and browsers that we'll need for our scraping adventure.

In [1]:

# %pip install --quiet lxml html5lib beautifulsoup4 pandas
# %pip install --quiet playwright
# !playwright install-deps
# !playwright install chromium firefox

And we'll set it up so Playwright will be sure to work on Windows.

In [2]:

# Detect if we're running in Google Colab
import os
IN_COLAB = 'COLAB_GPU' in os.environ or 'COLAB_RELEASE_TAG' in os.environ

import platform
import asyncio
import nest_asyncio

if platform.system() == "Windows":
    asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())

try:
    asyncio.get_running_loop()
    nest_asyncio.apply()
except RuntimeError:
    pass

Opening up the browser and visiting our destination¶

In [3]:

from playwright.async_api import async_playwright

# "Hey, open up a browser"
playwright = await async_playwright().start()

# Colab can't open a visible browser, so we run headless there
if IN_COLAB:
    use_headless = True
else:
    use_headless = False

browser = await playwright.chromium.launch(headless=use_headless)

# Create a new browser window
page = await browser.new_page()

In [4]:

await page.goto("https://www.dmr.nd.gov/oilgas/findwellsvw.asp")

Out[4]:

<Response url='https://www.dmr.nd.gov/oilgas/findwellsvw.asp' request=<Request url='https://www.dmr.nd.gov/oilgas/findwellsvw.asp' method='GET'>>

In [5]:

from IPython.display import Image

Image(await page.screenshot())

Out[5]:

No description has been provided for this image

You always start with await page.locator("select").select_option("whatever option you want"). You'll probably get an error because there are multiple dropdowns on the page, but Playwright doesn't know which one you want to use! Just read the error and figure out the right one.

In [6]:

# await page.locator("select").select_option("129")
await page.get_by_label("Township:").select_option("129")

Out[6]:

['129']

In [7]:

#await page.get_by_text("Submit").click()
await page.get_by_role("button", name="Submit").click()

Grab the tables from the page¶

Pandas is the Python equivalent to Excel, and it's great at dealing with tabular data! Often the data on a web page that looks like a spreadsheet can be read with pd.read_html.

You use await page.content() to save the contents of the page, then feed it to read_html to find the tables. len(tables) checks the number of tables you have, then you manually poke around to see which one is the one you're interested in. tables[0] is the first one, tables[1] is the second one, and so on...

In [8]:

import pandas as pd
from io import StringIO

html = await page.content()
# Wait for the table to show up
await page.wait_for_selector("table", timeout=10000)
tables = pd.read_html(StringIO(html))
len(tables)

Out[8]:

In [9]:

tables[2]

Out[9]:

	File No	CTB No	API No	Well Type	Well Status	Status Date	DTD	Location	Operator	Well Name	Field
0	3065	NaN	3308100001	OG	DRY	1/1/1962	1214.0	NWNW 9-129-58	SARGENT MINERAL CORP.	LAMPORT 1	WILDCAT
1	99025	NaN	3302199025	ST	DRY	9/30/1963	1600.0	NENE 10-129-63	JACK ROUSE OIL PROPERTIES	HAROLD BILLIE 1	WILDCAT
2	515	NaN	3302100001	OG	DRY	11/17/1954	1882.0	SENW 11-129-63	H. HANSON OIL SYNDICATE	HAROLD BILLEY 1	WILDCAT
3	572	NaN	3302100002	OG	DRY	6/17/1954	1860.0	NENW 14-129-63	H. HANSON OIL SYNDICATE	JOHN BELL 1	WILDCAT
4	1394	NaN	3302100005	OG	DRY	3/12/1957	3166.0	NWNW 22-129-66	CALVERT DRILLING, INC.	MARVIN KAMM 1	WILDCAT
...	...	...	...	...	...	...	...	...	...	...	...
373	7038	NaN	3301100279	GASD	PNC	3/2/1982	NaN	SESE 29-129-106	JOSEPH J. C. PAINE & ASSOCIATES	NYQUIST 4-29	LITTLE MISSOURI
374	7039	NaN	3301100280	GASD	PNC	9/15/1982	NaN	SWNE 29-129-106	JOSEPH J. C. PAINE & ASSOCIATES	NYQUIST 1-29	LITTLE MISSOURI
375	5567	NaN	3301100194	OG	DRY	3/23/1975	8650.0	SWSE 36-129-106	AMAX PETROLEUM CORP.	STATE OF NORTH DAKOTA 1	WILDCAT
376	9500	NaN	3301100378	GASD	DRY	7/17/1984	1800.0	NWNW 36-129-106	UNIVERSAL RESOURCES CORP.	STATE 1-36	LITTLE MISSOURI
377	9636	NaN	3301100381	GASD	DRY	6/23/1983	2100.0	NENW 26-129-107	WOODS PETROLEUM CORP.	COX 26-1A	WILDCAT

378 rows × 11 columns

We want data for EVERY township! We'll start by finding all of the options for the Township dropdown.

If you use this on another page, it's the same content for filling out the dropdown, except you use .locator("option").all_text_contents() instead of .select_option.

In [10]:

options = await page.get_by_label("Township:").locator("option").all_text_contents()
options[:10]

Out[10]:

['Select a Township',
 '129',
 '130',
 '131',
 '132',
 '133',
 '134',
 '135',
 '136',
 '137']

We're going to remove "Select a Township" because... that isn't a township!

In [11]:

options.remove("Select a Township")

Now we'll go through every single one of the township options and combine the results.

Well... the first three! If you want all of the options, just remove the [:3] from it.

In [12]:

all_data = pd.DataFrame()

# Try it for three pages
# for option in options:
for option in options[:3]:
    # Click and wait for load
    print("Querying township", option)
    await page.get_by_label("Township:").select_option(option)
    await page.get_by_text("Submit").click()

    # This is another way of waiting for the page to load,
    # by waiting for a specific element to show up
    await page.get_by_text("CTB No").wait_for()

    # Get all of the tables on the page
    html = await page.content()
    tables = pd.read_html(StringIO(html))

    df = tables[2]
    df['township'] = option

    # Add the tables on this page to 
    all_data = pd.concat([all_data, df], ignore_index = True)

    # Save after each township in case something breaks
    all_data.to_csv("output.csv", index=False)

Querying township 129

Querying township 130

Querying township 131

In [13]:

all_data.head()

Out[13]:

	File No	CTB No	API No	Well Type	Well Status	Status Date	DTD	Location	Operator	Well Name	Field	township
0	3065	NaN	3.308100e+09	OG	DRY	1/1/1962	1214.0	NWNW 9-129-58	SARGENT MINERAL CORP.	LAMPORT 1	WILDCAT	129
1	99025	NaN	3.302199e+09	ST	DRY	9/30/1963	1600.0	NENE 10-129-63	JACK ROUSE OIL PROPERTIES	HAROLD BILLIE 1	WILDCAT	129
2	515	NaN	3.302100e+09	OG	DRY	11/17/1954	1882.0	SENW 11-129-63	H. HANSON OIL SYNDICATE	HAROLD BILLEY 1	WILDCAT	129
3	572	NaN	3.302100e+09	OG	DRY	6/17/1954	1860.0	NENW 14-129-63	H. HANSON OIL SYNDICATE	JOHN BELL 1	WILDCAT	129
4	1394	NaN	3.302100e+09	OG	DRY	3/12/1957	3166.0	NWNW 22-129-66	CALVERT DRILLING, INC.	MARVIN KAMM 1	WILDCAT	129

Saving the results¶

Now we'll save it to a CSV file! Easy peasy.

In [14]:

all_data.to_csv("output.csv", index=False)

North Dakota Oil Wells¶

Installation¶

Opening up the browser and visiting our destination¶

Selecting an option from a dropdown¶

Grab the tables from the page¶

Try every option in the "Township" dropdown¶

Saving the results¶