In [ ]:
# Install required packages
!pip install --upgrade --quiet playwright
!pip install --upgrade --quiet beautifulsoup4
!pip install --upgrade --quiet lxml
!pip install --upgrade --quiet html5lib
!pip install --upgrade --quiet pandas
!pip install --upgrade --quiet nest_asyncio

print('✓ Packages installed!')

In this example we are going to scrape North Dakota's Oil and Gas wells search.

Traditionally, Python programmers use BeautifulSoup to scrape content from the internet. Instead of being traditional, we're going to use Playwright, a browser automation tool! That means you actually control the browser: filling out forms, clicking buttons, downloading documents... it's magic!!!✨✨✨

North Dakota Oil Wells

  • Filling out dropdowns
  • Using every option in the dropdown
  • Combining dataframes

Installation

We need to install a few tools first! Remove the # and run the cell to install the Python packages and browsers that we'll need for our scraping adventure.

In [1]:
# %pip install --quiet lxml html5lib beautifulsoup4 pandas
# %pip install --quiet playwright
# !playwright install-deps
# !playwright install chromium firefox

And we'll do a little setup so Playwright's async API works on Windows and inside notebook environments like Colab.

In [2]:
# Detect if we're running in Google Colab
import os
IN_COLAB = 'COLAB_GPU' in os.environ or 'COLAB_RELEASE_TAG' in os.environ

import platform
import asyncio
import nest_asyncio

if platform.system() == "Windows":
    asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())

try:
    asyncio.get_running_loop()
    nest_asyncio.apply()
except RuntimeError:
    pass

Opening up the browser and visiting our destination

In [3]:
from playwright.async_api import async_playwright

# "Hey, open up a browser"
playwright = await async_playwright().start()

# Colab can't open a visible browser, so we run headless there
if IN_COLAB:
    use_headless = True
else:
    use_headless = False

browser = await playwright.chromium.launch(headless=use_headless)

# Create a new browser window
page = await browser.new_page()
In [4]:
await page.goto("https://www.dmr.nd.gov/oilgas/findwellsvw.asp")
Out[4]:
<Response url='https://www.dmr.nd.gov/oilgas/findwellsvw.asp' request=<Request url='https://www.dmr.nd.gov/oilgas/findwellsvw.asp' method='GET'>>
In [5]:
from IPython.display import Image

Image(await page.screenshot())
Out[5]:
(screenshot of the North Dakota well search page)

Selecting an option from a dropdown

You always start with await page.locator("select").select_option("whatever option you want"). You'll probably get an error, because there are multiple dropdowns on the page and Playwright doesn't know which one you want! Just read the error and figure out the right one.
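If reading the error doesn't make it obvious, you can also list the dropdowns yourself. Here's a sketch that runs BeautifulSoup over saved page HTML to show each select's id and name (the HTML below is a made-up stand-in for what await page.content() would return, not the real page):

```python
from bs4 import BeautifulSoup

# Stand-in for `html = await page.content()` -- the real page differs
html = """
<form>
  <label for="twp">Township:</label><select id="twp" name="twp"></select>
  <label for="rng">Range:</label><select id="rng" name="rng"></select>
</form>
"""

soup = BeautifulSoup(html, "html.parser")
for select in soup.find_all("select"):
    # The id or name tells you which dropdown to target in Playwright
    print(select.get("id"), select.get("name"))
```

The labels you see here are what page.get_by_label(...) matches against.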

In [6]:
# await page.locator("select").select_option("129")
await page.get_by_label("Township:").select_option("129")
Out[6]:
['129']
In [7]:
#await page.get_by_text("Submit").click()
await page.get_by_role("button", name="Submit").click()

Grab the tables from the page

Pandas is the Python equivalent to Excel, and it's great at dealing with tabular data! Often the data on a web page that looks like a spreadsheet can be read with pd.read_html.

You use await page.content() to save the contents of the page, then feed it to read_html to find the tables. len(tables) checks the number of tables you have, then you manually poke around to see which one is the one you're interested in. tables[0] is the first one, tables[1] is the second one, and so on...
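The "poking around" step can be as simple as printing each table's shape and columns. A small sketch with stand-in HTML (the real page has three tables, and the column names below are just examples):

```python
import pandas as pd
from io import StringIO

# Stand-in for the saved page HTML -- the real page has more tables
html = """
<table><tr><th>nav</th></tr><tr><td>menu</td></tr></table>
<table><tr><th>File No</th><th>Operator</th></tr>
<tr><td>3065</td><td>SARGENT MINERAL CORP.</td></tr></table>
"""

tables = pd.read_html(StringIO(html))

# Print each table's index, shape, and columns to spot the one you want
for i, df in enumerate(tables):
    print(i, df.shape, list(df.columns))
```

Whichever index has the columns you recognize is the table to keep.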

In [8]:
import pandas as pd
from io import StringIO

# Wait for the table to show up before grabbing the page contents
await page.wait_for_selector("table", timeout=10000)
html = await page.content()
tables = pd.read_html(StringIO(html))
len(tables)
Out[8]:
3
In [9]:
tables[2]
Out[9]:
File No CTB No API No Well Type Well Status Status Date DTD Location Operator Well Name Field
0 3065 NaN 3308100001 OG DRY 1/1/1962 1214.0 NWNW 9-129-58 SARGENT MINERAL CORP. LAMPORT 1 WILDCAT
1 99025 NaN 3302199025 ST DRY 9/30/1963 1600.0 NENE 10-129-63 JACK ROUSE OIL PROPERTIES HAROLD BILLIE 1 WILDCAT
2 515 NaN 3302100001 OG DRY 11/17/1954 1882.0 SENW 11-129-63 H. HANSON OIL SYNDICATE HAROLD BILLEY 1 WILDCAT
3 572 NaN 3302100002 OG DRY 6/17/1954 1860.0 NENW 14-129-63 H. HANSON OIL SYNDICATE JOHN BELL 1 WILDCAT
4 1394 NaN 3302100005 OG DRY 3/12/1957 3166.0 NWNW 22-129-66 CALVERT DRILLING, INC. MARVIN KAMM 1 WILDCAT
... ... ... ... ... ... ... ... ... ... ... ...
373 7038 NaN 3301100279 GASD PNC 3/2/1982 NaN SESE 29-129-106 JOSEPH J. C. PAINE & ASSOCIATES NYQUIST 4-29 LITTLE MISSOURI
374 7039 NaN 3301100280 GASD PNC 9/15/1982 NaN SWNE 29-129-106 JOSEPH J. C. PAINE & ASSOCIATES NYQUIST 1-29 LITTLE MISSOURI
375 5567 NaN 3301100194 OG DRY 3/23/1975 8650.0 SWSE 36-129-106 AMAX PETROLEUM CORP. STATE OF NORTH DAKOTA 1 WILDCAT
376 9500 NaN 3301100378 GASD DRY 7/17/1984 1800.0 NWNW 36-129-106 UNIVERSAL RESOURCES CORP. STATE 1-36 LITTLE MISSOURI
377 9636 NaN 3301100381 GASD DRY 6/23/1983 2100.0 NENW 26-129-107 WOODS PETROLEUM CORP. COX 26-1A WILDCAT

378 rows × 11 columns

Try every option in the "Township" dropdown

We want data for EVERY township! We'll start by finding all of the options for the Township dropdown.

If you use this on another page, it's the same locator you used to fill out the dropdown, except you call .locator("option").all_text_contents() on it instead of .select_option.

In [10]:
options = await page.get_by_label("Township:").locator("option").all_text_contents()
options[:10]
Out[10]:
['Select a Township',
 '129',
 '130',
 '131',
 '132',
 '133',
 '134',
 '135',
 '136',
 '137']

We're going to remove "Select a Township" because... that isn't a township!

In [11]:
options.remove("Select a Township")

Now we'll go through every single one of the township options and combine the results.

Well... the first three! If you want all of the options, just remove the [:3] from it.

In [12]:
all_data = pd.DataFrame()

# Try it for three pages
# for option in options:
for option in options[:3]:
    # Click and wait for load
    print("Querying township", option)
    await page.get_by_label("Township:").select_option(option)
    await page.get_by_role("button", name="Submit").click()

    # This is another way of waiting for the page to load,
    # by waiting for a specific element to show up
    await page.get_by_text("CTB No").wait_for()

    # Get all of the tables on the page
    html = await page.content()
    tables = pd.read_html(StringIO(html))

    df = tables[2]
    df['township'] = option

    # Add the table from this page to our combined dataframe
    all_data = pd.concat([all_data, df], ignore_index=True)

    # Save after each township in case something breaks
    all_data.to_csv("output.csv", index=False)
Querying township 129
Querying township 130
Querying township 131
In [13]:
all_data.head()
Out[13]:
File No CTB No API No Well Type Well Status Status Date DTD Location Operator Well Name Field township
0 3065 NaN 3.308100e+09 OG DRY 1/1/1962 1214.0 NWNW 9-129-58 SARGENT MINERAL CORP. LAMPORT 1 WILDCAT 129
1 99025 NaN 3.302199e+09 ST DRY 9/30/1963 1600.0 NENE 10-129-63 JACK ROUSE OIL PROPERTIES HAROLD BILLIE 1 WILDCAT 129
2 515 NaN 3.302100e+09 OG DRY 11/17/1954 1882.0 SENW 11-129-63 H. HANSON OIL SYNDICATE HAROLD BILLEY 1 WILDCAT 129
3 572 NaN 3.302100e+09 OG DRY 6/17/1954 1860.0 NENW 14-129-63 H. HANSON OIL SYNDICATE JOHN BELL 1 WILDCAT 129
4 1394 NaN 3.302100e+09 OG DRY 3/12/1957 3166.0 NWNW 22-129-66 CALVERT DRILLING, INC. MARVIN KAMM 1 WILDCAT 129

Saving the results

Now we'll save it to a CSV file! Easy peasy.

In [14]:
all_data.to_csv("output.csv", index=False)
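One caveat: in the output above, read_html guessed that "API No" was numeric and turned the long IDs into floats like 3.308100e+09, which can lose digits. If the exact IDs matter, read_html accepts a converters argument to keep a column as strings. A sketch with a tiny stand-in table (the column name matches the one above; the HTML is made up):

```python
import pandas as pd
from io import StringIO

# Stand-in for the real page HTML
html = """<table>
<tr><th>File No</th><th>API No</th></tr>
<tr><td>3065</td><td>3308100001</td></tr>
</table>"""

# converters keeps "API No" as a string instead of a lossy float
tables = pd.read_html(StringIO(html), converters={"API No": str})
print(tables[0]["API No"].tolist())
```

The same idea works when reading the CSV back: pd.read_csv(..., dtype={"API No": str}).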