Scraping interactive pages with Playwright
These are the notes for the Advanced scraping with Playwright session, which was a sample class and info session hosted by Professor Jonathan Soma for the Lede Program, a summer data journalism intensive at Columbia Journalism School.
Requests and BeautifulSoup intro
The traditional entry point for learning to scrape in Python is requests and BeautifulSoup. It’s usually great!
In the case below, we’re using them to scrape headlines from Le Monde’s English website.
import requests
from bs4 import BeautifulSoup

# Download the page, then parse the HTML
response = requests.get("https://www.lemonde.fr/en/")
doc = BeautifulSoup(response.text)
Sometimes you’ll get lucky and be able to scrape by just specifying a tag name…
headlines = doc.find_all('h3')
for headline in headlines:
    print(headline.text)
Gazprombank executives quietly sold their French villas after the Ukraine invasion
Our die-hard obsession with virginity
In Egypt, Gazans are 'caught between war and a country that doesn't want' them
Two killed in Pakistan election protest as Khan allies lead in vote count
Jacques Doillon, accused of rape by several actors, denounces the 'lies'
France's new Education Minister Nicole Belloubet appointed in the midst of a school crisis
French prosecutors seek trial for Lafarge cement group over terror financing
Iran's Ayatollah Ali Khamenei removed from Instagram and Facebook
Prince Harry settles phone hacking lawsuit against UK tabloid
Vladimir Putin tells Tucker Carlson that Russia cannot be defeated in Ukraine
European countries are turning to 'selective immigration' to mend labor shortages
Three actresses accuse director Jacques Doillon of rape, sexual assault and harassment
Sections
French Focus
Opinion
Informations légales le Monde
…but more often than not a class is going to be more effective.
headlines = doc.find_all(class_='article__title')
for headline in headlines:
    print(headline.text)
Putin tells the West that Russia cannot be defeated in Ukraine
In Egypt, Gazans are 'caught between war and a country that doesn't want' them
Ukraine's Hungarian minority, caught between defending its identity and fear of Orban's policies
In the Khan Yunis tunnels, Israeli army has little hope of freeing hostages by force
Paris votes on SUVs: The end of the road for big cars?
Why did astronauts leave poop on the moon, and what can we learn from it?
What's the role of France's prime minister?
Footage of Cyclone Belal hitting France's Réunion Island
Pesticides: 'We, researchers, condemn the way scientific knowledge is being shelved'
TotalEnergies makes its biggest profit ever
Coca-Cola: Sponsor of the Paris 2024 Olympics and leading global plastic polluter
UNESCO World Heritage site is engulfed by flames in Patagonia, Argentina
With the Paris airport border police: 'The risk, for us, is that we won't be able to send them back'
Three actresses accuse director Jacques Doillon of rape, sexual assault and harassment
Under the guise of cinema, director Benoît Jacquot set up a predatory system
ECHR faults France for police 'kettling' tactic
Pesticides: 'We, researchers, condemn the way scientific knowledge is being shelved'
'In many countries embracing democracy, it is regressing; in existing democracies, it is declining in quality'
Pakistan's hobbled democracy
Elfriede Jelinek, Nobel literature laureate, on the rise of the far right: 'I can hear a monster breathing'
Paris pop-up 'rave' boutique opens its doors
2024's must-see exhibitions in France: Henri Matisse, Jan Van Eyck, Vera Molnar, and more
Stepping inside Paris's historic taxidermy paradise
The French-Japanese designer putting a fashionable twist on the culinary arts
Biden faces serious doubts about his health nine months before presidential election
Ukraine's Hungarian minority, caught between defending its identity and fear of Orban's policies
Pesticides: 'We, researchers, condemn the way scientific knowledge is being shelved'
Paris-based Chinese film festival shows banned films
Paris 2024: Thierry Henry faces the conundrum of assembling his Olympic football team roster
In the Khan Yunis tunnels, Israeli army has little hope of freeing hostages by force
Gazprombank executives quietly sold their French villas after the Ukraine invasion
Our die-hard obsession with virginity
In Egypt, Gazans are 'caught between war and a country that doesn't want' them
Two killed in Pakistan election protest as Khan allies lead in vote count
Jacques Doillon, accused of rape by several actors, denounces the 'lies'
France's new Education Minister Nicole Belloubet appointed in the midst of a school crisis
French prosecutors seek trial for Lafarge cement group over terror financing
Iran's Ayatollah Ali Khamenei removed from Instagram and Facebook
Prince Harry settles phone hacking lawsuit against UK tabloid
Vladimir Putin tells Tucker Carlson that Russia cannot be defeated in Ukraine
European countries are turning to 'selective immigration' to mend labor shortages
Three actresses accuse director Jacques Doillon of rape, sexual assault and harassment
'It's the story of a kidnapped child': Actress Judith Godrèche accuses director Benoît Jacquot of rape
Under the guise of cinema, director Benoît Jacquot set up a predatory system
Atos: The hubris and downfall of a French IT giant
Pakistan's hobbled democracy
'In many countries embracing democracy, it is regressing; in existing democracies, it is declining in quality'
Donald Trump must face justice
Swiss train hostage crisis ends with suspect killed and hostages freed
Robert Badinter, who abolished the death penalty in France, has died
Biden faces serious doubts about his health nine months before presidential election
Putin tells the West that Russia cannot be defeated in Ukraine
Vladimir Putin tells Tucker Carlson that Russia cannot be defeated in Ukraine
European countries are turning to 'selective immigration' to mend labor shortages
In the Khan Yunis tunnels, Israeli army has little hope of freeing hostages by force
France's new Education Minister Nicole Belloubet appointed in the midst of a school crisis
French government reshuffle: A long wait for a non-event
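The class-based list above repeats several headlines, since the same story can show up in more than one block of the homepage. Here is a minimal sketch of one way to drop the repeats while keeping the original order. The short `headline_texts` list is a hand-copied sample from the output above, standing in for the text pulled out of `headlines`.

```python
# Sample copied from the output above; in the notebook this would be
# [headline.text.strip() for headline in headlines]
headline_texts = [
    "Putin tells the West that Russia cannot be defeated in Ukraine",
    "In Egypt, Gazans are 'caught between war and a country that doesn't want' them",
    "Pakistan's hobbled democracy",
    "In Egypt, Gazans are 'caught between war and a country that doesn't want' them",
    "Pakistan's hobbled democracy",
]

# dict keys are unique and keep insertion order, so this removes
# later repeats without reshuffling the list
unique_headlines = list(dict.fromkeys(headline_texts))

for text in unique_headlines:
    print(text)
```

A `set` would also remove duplicates, but it would not preserve the order the headlines appeared in on the page.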
Where requests + BeautifulSoup fails
Some websites you’ll be able to download fine with requests, but when you start trying to use BeautifulSoup nothing shows up. For example, if we try to access OpenSyllabus listing pages we won’t see any books show up in BeautifulSoup.
response = requests.get("https://explorer.opensyllabus.org/results-list/titles?size=50")
doc = BeautifulSoup(response.text)
doc.find_all(class_='name-div')
[]
This is because the page retrieved by requests doesn’t actually have all those books on it.
response.text
'<!doctype html><html lang="en"><head><meta charset="utf-8"><meta name="viewport" content="width=device-width,initial-scale=1"><script async src="https://www.googletagmanager.com/gtag/js?id=UA-72367808-1"></script><script>function gtag(){dataLayer.push(arguments)}window.dataLayer=window.dataLayer||[],gtag("js",new Date),gtag("config","UA-72367808-1")</script><link rel="shortcut icon" href="/favicon.ico"><link rel="stylesheet" href="https://unpkg.com/leaflet@1.0.3/dist/leaflet.css"/><meta property="og:site_name" content="Open Syllabus"/><meta property="og:title" content="Open Syllabus: Explorer"/><meta property="og:description" content="Mapping the college curriculum across 7,292,573 syllabi."/><meta property="og:image" content="https://opensyllabus.org/og-image.jpg"/><meta name="twitter:card" content="summary_large_image"/><meta name="twitter:site" content="@opensyllabus"/><title>Open Syllabus</title><link href="/static/css/main.a318ad5b.css" rel="stylesheet"></head><body><div id="root"></div><script type="text/javascript" src="/static/js/main.d45e95ea.js"></script></body></html>'
Visiting this site is a two-step process: first you load up this bare-bones skeleton page, then the browser runs the page’s JavaScript to go and get the actual information. Requests doesn’t do that second step, so we need to try another tool!
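One quick way to spot this kind of page is to check how much visible text the downloaded HTML actually has in its body. The sketch below is a rough heuristic and not something from the session itself: `BodyTextCollector` and `visible_body_text` are hypothetical helper names, it uses the standard library's `html.parser` so it runs without extra installs, and the two HTML strings are shortened samples based on the output shown in these notes.

```python
from html.parser import HTMLParser


class BodyTextCollector(HTMLParser):
    """Collect text that appears inside <body>, ignoring <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0  # greater than 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that a reader would actually see
        if self.in_body and self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_body_text(html):
    """Return the visible body text of an HTML string, joined by spaces."""
    parser = BodyTextCollector()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Shortened version of the skeleton page that requests received
skeleton = (
    '<!doctype html><html lang="en"><head><title>Open Syllabus</title></head>'
    '<body><div id="root"></div>'
    '<script type="text/javascript" src="/static/js/main.d45e95ea.js"></script>'
    "</body></html>"
)

# Shortened version of what the page looks like once the browser fills it in
rendered = (
    '<html><body><div class="name-div"><p>'
    '<a href="/result/title?id=9199819950029">The Elements of Style</a>'
    "</p></div></body></html>"
)

print(repr(visible_body_text(skeleton)))
print(repr(visible_body_text(rendered)))
```

An empty result for a page that clearly shows content in your browser is a strong hint that the content is loaded by JavaScript after the initial download.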
Enter Playwright
Instead of pulling the raw HTML contents of the page, Playwright actually controls your browser for you! It can load pages, click things, fill out forms, all sorts of things.
Installing Playwright
To install Playwright, you just need two commands: one to install the library, the other to install the necessary browsers. You can run them both from the command line, or put ! in front of them if you’re running them from a Jupyter Notebook.
pip install playwright
playwright install
If you come from a Selenium background, it’s a lot easier than tracking down webdrivers, eh?
Using Playwright
To begin we’ll just access the same OpenSyllabus page as before and see the actual contents.
from playwright.async_api import async_playwright

# "Hey, open up a browser"
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)

# Create a new browser window
page = await browser.new_page()

# Tell it to go to this page
await page.goto("https://explorer.opensyllabus.org/results-list/titles?size=50")
<Response url='https://explorer.opensyllabus.org/results-list/titles?size=50' request=<Request url='https://explorer.opensyllabus.org/results-list/titles?size=50' method='GET'>>
Some people will actually scrape the page with Playwright, grabbing titles and all of that, but I find it’s easiest to take the HTML – the full HTML, after the skeleton has been filled in – and feed it to BeautifulSoup, just like we’re used to.
html_content = await page.content()

doc = BeautifulSoup(html_content)
doc.find_all(class_='name-div')
[<div class="name-div"><p><a href="/result/title?id=9199819950029">The Elements of Style</a></p><span class="name"><div><a href="/result/author?id=William+Strunk">William Strunk</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div>,
<div class="name-div"><p><a href="/result/title?id=33749853015144">A Writer's Reference</a></p><span class="name"><div><a href="/result/author?id=Diana+Hacker">Diana Hacker</a></div></span><span class="publisher"><div><a href="/result/publisher?id=St.+Martin%27s+%2F+Bedford+Books"><div class="div-link">St. Martin's / Bedford Books</div>,</a><div class="div1-no-link">1989</div></div></span></div>,
<div class="name-div"><p><a href="/result/title?id=7988639699494">A Manual for Writers of Term Papers, Theses, and Dissertations</a></p><span class="name"><div><a href="/result/author?id=Kate+L.+Turabian">Kate L. Turabian</a></div></span><span class="publisher"><div><a href="/result/publisher?id=University+of+Chicago+Press"><div class="div-link">University of Chicago Press</div>,</a><div class="div1-no-link">1955</div></div></span></div>,
<div class="name-div"><p><a href="/result/title?id=8306467210251">The Communist Manifesto</a></p><span class="name"><div><a href="/result/author?id=Karl+Marx">Karl Marx</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div>,
<div class="name-div"><p><a href="/result/title?id=7636452301400">The Republic</a></p><span class="name"><div><a href="/result/author?id=Plato">Plato</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div>,
<div class="name-div"><p><a href="/result/title?id=33457795238412">Calculus</a></p><span class="name"><div><a href="/result/author?id=James+Stewart">James Stewart</a></div></span><span class="publisher"><div><a href="/result/publisher?id=Brooks+%2F+Cole"><div class="div-link">Brooks / Cole</div>,</a><div class="div1-no-link">1987</div></div></span></div>,
<div class="name-div"><p><a href="/result/title?id=7507602833850">Frankenstein</a></p><span class="name"><div><a href="/result/author?id=Mary+Wollstonecraft+Shelley">Mary Wollstonecraft Shelley</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div>,
<div class="name-div"><p><a href="/result/title?id=7507602833551">The Canterbury Tales</a></p><span class="name"><div><a href="/result/author?id=Geoffrey+Chaucer">Geoffrey Chaucer</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div>,
<div class="name-div"><p><a href="/result/title?id=7507602833458">Nicomachean Ethics</a></p><span class="name"><div><a href="/result/author?id=Aristotle">Aristotle</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div>,
<div class="name-div"><p><a href="/result/title?id=32804960208506">Human Anatomy and Physiology</a></p><span class="name"><div><a href="/result/author?id=Elaine+Nicpon+Marieb">Elaine Nicpon Marieb</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div>,
<div class="name-div"><p><a href="/result/title?id=42305427866233">Doing Your Research Project: A Guide for First-Time Researchers in Education and Social Science</a></p><span class="name"><div><a href="/result/author?id=Judith+Bell">Judith Bell</a></div></span><span class="publisher"><div><a href="/result/publisher?id=Open+University+Press"><div class="div-link">Open University Press</div>,</a><div class="div1-no-link">1986</div></div></span></div>,
<div class="name-div"><p><a href="/result/title?id=40166534153645">Imagined Communities: Reflections on the Origin and Spread of Nationalism</a></p><span class="name"><div><a href="/result/author?id=Benedict+R.+O.%27G.+Anderson">Benedict R. O.'G. Anderson</a></div></span><span class="publisher"><div><a href="/result/publisher?id=Verso+Books"><div class="div-link">Verso Books</div>,</a><div class="div1-no-link">1983</div></div></span></div>,
<div class="name-div"><p><a href="/result/title?id=32521492368242">They Say/I Say: The Moves That Matter in Academic Writing</a></p><span class="name"><div><a href="/result/author?id=Gerald+Graff">Gerald Graff</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div>,
<div class="name-div"><p><a href="/result/title?id=7507602833582">Leviathan</a></p><span class="name"><div><a href="/result/author?id=Thomas+Hobbes">Thomas Hobbes</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div>,
<div class="name-div"><p><a href="/result/title?id=33028298506521">A Pocket Style Manual</a></p><span class="name"><div><a href="/result/author?id=Diana+Hacker">Diana Hacker</a></div></span><span class="publisher"><div><a href="/result/publisher?id=St.+Martin%27s+%2F+Bedford+Books"><div class="div-link">St. Martin's / Bedford Books</div>,</a><div class="div1-no-link">1993</div></div></span></div>,
<div class="name-div"><p><a href="/result/title?id=16243566318681">Discipline and Punish: The Birth of the Prison</a></p><span class="name"><div><a href="/result/author?id=Michel+Foucault">Michel Foucault</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div>,
<div class="name-div"><p><a href="/result/title?id=7868380119082">The Study Skills Handbook</a></p><span class="name"><div><a href="/result/author?id=Stella+Cottrell">Stella Cottrell</a></div></span><span class="publisher"><div><a href="/result/publisher?id=Palgrave+Macmillan"><div class="div-link">Palgrave Macmillan</div>,</a><div class="div1-no-link">1999</div></div></span></div>,
<div class="name-div"><p><a href="/result/title?id=8108898763586">Orientalism</a></p><span class="name"><div><a href="/result/author?id=Edward+W.+Said">Edward W. Said</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div>,
<div class="name-div"><p><a href="/result/title?id=33938831573696">Social Research Methods</a></p><span class="name"><div><a href="/result/author?id=Alan+Bryman">Alan Bryman</a></div></span><span class="publisher"><div><a href="/result/publisher?id=Oxford+University+Press"><div class="div-link">Oxford University Press</div>,</a><div class="div1-no-link">2001</div></div></span></div>,
<div class="name-div"><p><a href="/result/title?id=33466385171310">Research Design: Qualitative, Quantitative, and Mixed Methods Approaches</a></p><span class="name"><div><a href="/result/author?id=John+W.+Creswell">John W. Creswell</a></div></span><span class="publisher"><div><a href="/result/publisher?id=SAGE"><div class="div-link">SAGE</div>,</a><div class="div1-no-link">1994</div></div></span></div>,
<div class="name-div"><p><a href="/result/title?id=7507602833603">Paradise Lost</a></p><span class="name"><div><a href="/result/author?id=John+Milton">John Milton</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div>,
<div class="name-div"><p><a href="/result/title?id=32581621910019">Marketing Management</a></p><span class="name"><div><a href="/result/author?id=Philip+Kotler">Philip Kotler</a></div></span><span class="publisher"><div><a href="/result/publisher?id=Pearson+%2F+Prentice+Hall"><div class="div-link">Pearson / Prentice Hall</div>,</a><div class="div1-no-link">1967</div></div></span></div>,
<div class="name-div"><p><a href="/result/title?id=7799660976136">Molecular Biology of the Cell</a></p><span class="name"><div><a href="/result/author?id=Bruce+Alberts">Bruce Alberts</a></div></span><span class="publisher"><div><a href="/result/publisher?id=Garland+Publishing"><div class="div-link">Garland Publishing</div>,</a><div class="div1-no-link">1983</div></div></span></div>,
<div class="name-div"><p><a href="/result/title?id=7584912277506">Heart of Darkness</a></p><span class="name"><div><a href="/result/author?id=Joseph+Conrad">Joseph Conrad</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div>,
<div class="name-div"><p><a href="/result/title?id=62543313842187">Communication Skills Handbook</a></p><span class="name"><div><a href="/result/author?id=Jane+Summers">Jane Summers</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div>,
<div class="name-div"><p><a href="/result/title?id=40372692584452">Advanced Engineering Mathematics</a></p><span class="name"><div><a href="/result/author?id=Erwin+Kreyszig">Erwin Kreyszig</a></div></span><span class="publisher"><div><a href="/result/publisher?id=Wiley"><div class="div-link">Wiley</div>,</a><div class="div1-no-link">1962</div></div></span></div>,
<div class="name-div"><p><a href="/result/title?id=8564165296084">Letter From the Birmingham Jail</a></p><span class="name"><div><a href="/result/author?id=Martin+Luther+King">Martin Luther King</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div>,
<div class="name-div"><p><a href="/result/title?id=7507602833514">The Prince</a></p><span class="name"><div><a href="/result/author?id=Niccolo+Machiavelli">Niccolo Machiavelli</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div>,
<div class="name-div"><p><a href="/result/title?id=7773890973060">Meditations on First Philosophy</a></p><span class="name"><div><a href="/result/author?id=Rene+Descartes">Rene Descartes</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div>,
<div class="name-div"><p><a href="/result/title?id=34600256537883">Introduction to Algorithms</a></p><span class="name"><div><a href="/result/author?id=Thomas+H.+Cormen">Thomas H. Cormen</a></div></span><span class="publisher"><div><a href="/result/publisher?id=MIT+Press"><div class="div-link">MIT Press</div>,</a><div class="div1-no-link">1990</div></div></span></div>,
<div class="name-div"><p><a href="/result/title?id=8065949090762">The Clash of Civilizations?</a></p><span class="name"><div><a href="/result/author?id=Samuel+P.+Huntington">Samuel P. Huntington</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div>,
<div class="name-div"><p><a href="/result/title?id=7980049403911">Second Treatise of Government</a></p><span class="name"><div><a href="/result/author?id=John+Locke">John Locke</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div>,
<div class="name-div"><p><a href="/result/title?id=7507602833427">Oedipus the King</a></p><span class="name"><div><a href="/result/author?id=Sophocles">Sophocles</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div>,
<div class="name-div"><p><a href="/result/title?id=7808251072460">The Craft of Research</a></p><span class="name"><div><a href="/result/author?id=Wayne+C.+Booth">Wayne C. Booth</a></div></span><span class="publisher"><div><a href="/result/publisher?id=University+of+Chicago+Press"><div class="div-link">University of Chicago Press</div>,</a><div class="div1-no-link">1995</div></div></span></div>,
<div class="name-div"><p><a href="/result/title?id=32452772891414">Intermediate Algebra</a></p><span class="name"><div><a href="/result/author?id=K.+Elayn+Martin-Gay">K. Elayn Martin-Gay</a></div></span><span class="publisher"><div><a href="/result/publisher?id=Pearson+%2F+Prentice+Hall"><div class="div-link">Pearson / Prentice Hall</div>,</a><div class="div1-no-link">1993</div></div></span></div>,
<div class="name-div"><p><a href="/result/title?id=16243566315256">The Yellow Wallpaper</a></p><span class="name"><div><a href="/result/author?id=Charlotte+Perkins+Gilman">Charlotte Perkins Gilman</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div>,
<div class="name-div"><p><a href="/result/title?id=7782481269058">The Structure of Scientific Revolutions</a></p><span class="name"><div><a href="/result/author?id=Thomas+S.+Kuhn">Thomas S. Kuhn</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div>,
<div class="name-div"><p><a href="/result/title?id=33543694584301">MLA Handbook for Writers of Research Papers</a></p><span class="name"><div><a href="/result/author?id=Joseph+Gibaldi">Joseph Gibaldi</a></div></span><span class="publisher"><div><a href="/result/publisher?id=The+Modern+Language+Association+of+America"><div class="div-link">The Modern Language Association of America</div>,</a><div class="div1-no-link">1977</div></div></span></div>,
<div class="name-div"><p><a href="/result/title?id=8177618192651">The Souls of Black Folk</a></p><span class="name"><div><a href="/result/author?id=W.+E.+B.+Du+Bois">W. E. B. Du Bois</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div>,
<div class="name-div"><p><a href="/result/title?id=9225590260735">Pedagogy of the Oppressed</a></p><span class="name"><div><a href="/result/author?id=Paulo+Freire">Paulo Freire</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div>,
<div class="name-div"><p><a href="/result/title?id=32899449487478">Biology</a></p><span class="name"><div><a href="/result/author?id=Neil+A.+Campbell">Neil A. Campbell</a></div></span><span class="publisher"><div><a href="/result/publisher?id=Pearson+%2F+Benjamin+Cummings"><div class="div-link">Pearson / Benjamin Cummings</div>,</a><div class="div1-no-link">1973</div></div></span></div>,
<div class="name-div"><p><a href="/result/title?id=32495722562184">The Art of Public Speaking</a></p><span class="name"><div><a href="/result/author?id=Stephen+Lucas">Stephen Lucas</a></div></span><span class="publisher"><div><a href="/result/publisher?id=McGraw-Hill"><div class="div-link">McGraw-Hill</div>,</a><div class="div1-no-link">1983</div></div></span></div>,
<div class="name-div"><p><a href="/result/title?id=7507602833457">Poetics</a></p><span class="name"><div><a href="/result/author?id=Aristotle">Aristotle</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div>,
<div class="name-div"><p><a href="/result/title?id=7550552512725">An Inquiry Into the Nature and Causes of the Wealth of Nations</a></p><span class="name"><div><a href="/result/author?id=Adam+Smith">Adam Smith</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div>,
<div class="name-div"><p><a href="/result/title?id=7507602833417">The Odyssey</a></p><span class="name"><div><a href="/result/author?id=Homer">Homer</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div>,
<div class="name-div"><p><a href="/result/title?id=34995393528856">Principles of Anatomy and Physiology</a></p><span class="name"><div><a href="/result/author?id=Gerard+J.+Tortora">Gerard J. Tortora</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div>,
<div class="name-div"><p><a href="/result/title?id=33629593929377">Handbook of Qualitative Research</a></p><span class="name"><div><a href="/result/author?id=Norman+K.+Denzin">Norman K. Denzin</a>, </div><div><a href="/result/author?id=Yvonna+S.+Lincoln">Yvonna S. Lincoln</a></div></span><span class="publisher"><div><a href="/result/publisher?id=SAGE"><div class="div-link">SAGE</div>,</a><div class="div1-no-link">1994</div></div></span></div>,
<div class="name-div"><p><a href="/result/title?id=54984171325947">The History of Sexuality</a></p><span class="name"><div><a href="/result/author?id=Michel+Foucault">Michel Foucault</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div>,
<div class="name-div"><p><a href="/result/title?id=7627862379468">A Theory of Justice</a></p><span class="name"><div><a href="/result/author?id=John+Rawls">John Rawls</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div>,
<div class="name-div"><p><a href="/result/title?id=33122787788660">Computer Networks</a></p><span class="name"><div><a href="/result/author?id=Andrew+S.+Tanenbaum">Andrew S. Tanenbaum</a></div></span><span class="publisher"><div><a href="/result/publisher?id=Pearson+%2F+Prentice+Hall"><div class="div-link">Pearson / Prentice Hall</div>,</a><div class="div1-no-link">1981</div></div></span></div>]
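The cells above rely on top-level `await`, which works in a Jupyter notebook but not in a plain Python script. They also grab the HTML right after `goto`, which could in principle happen before the books have been drawn. A minimal sketch that addresses both, assuming the `.name-div` elements are a reasonable signal that the list has loaded: `fetch_rendered_html` and `main` are hypothetical names, and the browser runs headless here rather than with `headless=False`.

```python
import asyncio


async def fetch_rendered_html(url, wait_for):
    """Load url in a real browser and return the HTML once wait_for shows up."""
    # Imported here so the function can be defined even before
    # Playwright is installed
    from playwright.async_api import async_playwright

    async with async_playwright() as playwright:
        browser = await playwright.chromium.launch(headless=True)
        try:
            page = await browser.new_page()
            await page.goto(url)
            # Wait until at least one matching element is visible on the page
            await page.wait_for_selector(wait_for)
            return await page.content()
        finally:
            # Close the browser even if something above raised an error
            await browser.close()


def main():
    # asyncio.run is for plain scripts; in Jupyter, await the function instead
    url = "https://explorer.opensyllabus.org/results-list/titles?size=50"
    html_content = asyncio.run(fetch_rendered_html(url, ".name-div"))
    print(len(html_content))
```

Calling `main()` from a script should print the length of the filled-in HTML, which can then go to BeautifulSoup as before. If the selector never appears, `wait_for_selector` raises a timeout error rather than returning an incomplete page.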
Now that we know how to access the page, we can grab the content just like we’d do with a “normal” requests/BeautifulSoup page.
books = doc.find_all('div', class_='title-item')

for book in books:
    print("----")
    print(book)
    print(book.find('a').text)
    print(book.find('span', class_='name').text)
    print(book.find(class_='appearances').text)
    print(book.find(class_='score').text)
----
<div class="title-item"><div class="rank">1</div><div class="title"><div class="name-div"><p><a href="/result/title?id=9199819950029">The Elements of Style</a></p><span class="name"><div><a href="/result/author?id=William+Strunk">William Strunk</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div></div><div class="appearances">15,533</div><div class="score">100</div></div>
The Elements of Style
William Strunk
15,533
100
----
<div class="title-item"><div class="rank">2</div><div class="title"><div class="name-div"><p><a href="/result/title?id=33749853015144">A Writer's Reference</a></p><span class="name"><div><a href="/result/author?id=Diana+Hacker">Diana Hacker</a></div></span><span class="publisher"><div><a href="/result/publisher?id=St.+Martin%27s+%2F+Bedford+Books"><div class="div-link">St. Martin's / Bedford Books</div>,</a><div class="div1-no-link">1989</div></div></span></div></div><div class="appearances">14,931</div><div class="score">100</div></div>
A Writer's Reference
Diana Hacker
14,931
100
----
<div class="title-item"><div class="rank">3</div><div class="title"><div class="name-div"><p><a href="/result/title?id=7988639699494">A Manual for Writers of Term Papers, Theses, and Dissertations</a></p><span class="name"><div><a href="/result/author?id=Kate+L.+Turabian">Kate L. Turabian</a></div></span><span class="publisher"><div><a href="/result/publisher?id=University+of+Chicago+Press"><div class="div-link">University of Chicago Press</div>,</a><div class="div1-no-link">1955</div></div></span></div></div><div class="appearances">13,426</div><div class="score">100</div></div>
A Manual for Writers of Term Papers, Theses, and Dissertations
Kate L. Turabian
13,426
100
----
<div class="title-item"><div class="rank">4</div><div class="title"><div class="name-div"><p><a href="/result/title?id=8306467210251">The Communist Manifesto</a></p><span class="name"><div><a href="/result/author?id=Karl+Marx">Karl Marx</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div></div><div class="appearances">11,234</div><div class="score">100</div></div>
The Communist Manifesto
Karl Marx
11,234
100
----
<div class="title-item"><div class="rank">5</div><div class="title"><div class="name-div"><p><a href="/result/title?id=7636452301400">The Republic</a></p><span class="name"><div><a href="/result/author?id=Plato">Plato</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div></div><div class="appearances">9,883</div><div class="score">100</div></div>
The Republic
Plato
9,883
100
----
<div class="title-item"><div class="rank">6</div><div class="title"><div class="name-div"><p><a href="/result/title?id=33457795238412">Calculus</a></p><span class="name"><div><a href="/result/author?id=James+Stewart">James Stewart</a></div></span><span class="publisher"><div><a href="/result/publisher?id=Brooks+%2F+Cole"><div class="div-link">Brooks / Cole</div>,</a><div class="div1-no-link">1987</div></div></span></div></div><div class="appearances">9,682</div><div class="score">100</div></div>
Calculus
James Stewart
9,682
100
----
<div class="title-item"><div class="rank">7</div><div class="title"><div class="name-div"><p><a href="/result/title?id=7507602833850">Frankenstein</a></p><span class="name"><div><a href="/result/author?id=Mary+Wollstonecraft+Shelley">Mary Wollstonecraft Shelley</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div></div><div class="appearances">9,320</div><div class="score">100</div></div>
Frankenstein
Mary Wollstonecraft Shelley
9,320
100
----
<div class="title-item"><div class="rank">8</div><div class="title"><div class="name-div"><p><a href="/result/title?id=7507602833551">The Canterbury Tales</a></p><span class="name"><div><a href="/result/author?id=Geoffrey+Chaucer">Geoffrey Chaucer</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div></div><div class="appearances">9,172</div><div class="score">100</div></div>
The Canterbury Tales
Geoffrey Chaucer
9,172
100
----
<div class="title-item"><div class="rank">9</div><div class="title"><div class="name-div"><p><a href="/result/title?id=7507602833458">Nicomachean Ethics</a></p><span class="name"><div><a href="/result/author?id=Aristotle">Aristotle</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div></div><div class="appearances">9,104</div><div class="score">100</div></div>
Nicomachean Ethics
Aristotle
9,104
100
----
<div class="title-item"><div class="rank">10</div><div class="title"><div class="name-div"><p><a href="/result/title?id=32804960208506">Human Anatomy and Physiology</a></p><span class="name"><div><a href="/result/author?id=Elaine+Nicpon+Marieb">Elaine Nicpon Marieb</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div></div><div class="appearances">8,507</div><div class="score">100</div></div>
Human Anatomy and Physiology
Elaine Nicpon Marieb
8,507
100
----
<div class="title-item"><div class="rank">11</div><div class="title"><div class="name-div"><p><a href="/result/title?id=42305427866233">Doing Your Research Project: A Guide for First-Time Researchers in Education and Social Science</a></p><span class="name"><div><a href="/result/author?id=Judith+Bell">Judith Bell</a></div></span><span class="publisher"><div><a href="/result/publisher?id=Open+University+Press"><div class="div-link">Open University Press</div>,</a><div class="div1-no-link">1986</div></div></span></div></div><div class="appearances">8,234</div><div class="score">100</div></div>
Doing Your Research Project: A Guide for First-Time Researchers in Education and Social Science
Judith Bell
8,234
100
----
<div class="title-item"><div class="rank">12</div><div class="title"><div class="name-div"><p><a href="/result/title?id=40166534153645">Imagined Communities: Reflections on the Origin and Spread of Nationalism</a></p><span class="name"><div><a href="/result/author?id=Benedict+R.+O.%27G.+Anderson">Benedict R. O.'G. Anderson</a></div></span><span class="publisher"><div><a href="/result/publisher?id=Verso+Books"><div class="div-link">Verso Books</div>,</a><div class="div1-no-link">1983</div></div></span></div></div><div class="appearances">8,082</div><div class="score">100</div></div>
Imagined Communities: Reflections on the Origin and Spread of Nationalism
Benedict R. O.'G. Anderson
8,082
100
----
<div class="title-item"><div class="rank">13</div><div class="title"><div class="name-div"><p><a href="/result/title?id=32521492368242">They Say/I Say: The Moves That Matter in Academic Writing</a></p><span class="name"><div><a href="/result/author?id=Gerald+Graff">Gerald Graff</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div></div><div class="appearances">8,007</div><div class="score">100</div></div>
They Say/I Say: The Moves That Matter in Academic Writing
Gerald Graff
8,007
100
----
<div class="title-item"><div class="rank">14</div><div class="title"><div class="name-div"><p><a href="/result/title?id=7507602833582">Leviathan</a></p><span class="name"><div><a href="/result/author?id=Thomas+Hobbes">Thomas Hobbes</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div></div><div class="appearances">7,974</div><div class="score">100</div></div>
Leviathan
Thomas Hobbes
7,974
100
----
<div class="title-item"><div class="rank">15</div><div class="title"><div class="name-div"><p><a href="/result/title?id=33028298506521">A Pocket Style Manual</a></p><span class="name"><div><a href="/result/author?id=Diana+Hacker">Diana Hacker</a></div></span><span class="publisher"><div><a href="/result/publisher?id=St.+Martin%27s+%2F+Bedford+Books"><div class="div-link">St. Martin's / Bedford Books</div>,</a><div class="div1-no-link">1993</div></div></span></div></div><div class="appearances">7,706</div><div class="score">100</div></div>
A Pocket Style Manual
Diana Hacker
7,706
100
----
<div class="title-item"><div class="rank">16</div><div class="title"><div class="name-div"><p><a href="/result/title?id=16243566318681">Discipline and Punish: The Birth of the Prison</a></p><span class="name"><div><a href="/result/author?id=Michel+Foucault">Michel Foucault</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div></div><div class="appearances">7,583</div><div class="score">100</div></div>
Discipline and Punish: The Birth of the Prison
Michel Foucault
7,583
100
----
<div class="title-item"><div class="rank">17</div><div class="title"><div class="name-div"><p><a href="/result/title?id=7868380119082">The Study Skills Handbook</a></p><span class="name"><div><a href="/result/author?id=Stella+Cottrell">Stella Cottrell</a></div></span><span class="publisher"><div><a href="/result/publisher?id=Palgrave+Macmillan"><div class="div-link">Palgrave Macmillan</div>,</a><div class="div1-no-link">1999</div></div></span></div></div><div class="appearances">7,541</div><div class="score">100</div></div>
The Study Skills Handbook
Stella Cottrell
7,541
100
----
<div class="title-item"><div class="rank">18</div><div class="title"><div class="name-div"><p><a href="/result/title?id=8108898763586">Orientalism</a></p><span class="name"><div><a href="/result/author?id=Edward+W.+Said">Edward W. Said</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div></div><div class="appearances">7,369</div><div class="score">100</div></div>
Orientalism
Edward W. Said
7,369
100
----
<div class="title-item"><div class="rank">19</div><div class="title"><div class="name-div"><p><a href="/result/title?id=33938831573696">Social Research Methods</a></p><span class="name"><div><a href="/result/author?id=Alan+Bryman">Alan Bryman</a></div></span><span class="publisher"><div><a href="/result/publisher?id=Oxford+University+Press"><div class="div-link">Oxford University Press</div>,</a><div class="div1-no-link">2001</div></div></span></div></div><div class="appearances">7,351</div><div class="score">100</div></div>
Social Research Methods
Alan Bryman
7,351
100
----
<div class="title-item"><div class="rank">20</div><div class="title"><div class="name-div"><p><a href="/result/title?id=33466385171310">Research Design: Qualitative, Quantitative, and Mixed Methods Approaches</a></p><span class="name"><div><a href="/result/author?id=John+W.+Creswell">John W. Creswell</a></div></span><span class="publisher"><div><a href="/result/publisher?id=SAGE"><div class="div-link">SAGE</div>,</a><div class="div1-no-link">1994</div></div></span></div></div><div class="appearances">7,143</div><div class="score">100</div></div>
Research Design: Qualitative, Quantitative, and Mixed Methods Approaches
John W. Creswell
7,143
100
----
<div class="title-item"><div class="rank">21</div><div class="title"><div class="name-div"><p><a href="/result/title?id=7507602833603">Paradise Lost</a></p><span class="name"><div><a href="/result/author?id=John+Milton">John Milton</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div></div><div class="appearances">7,057</div><div class="score">100</div></div>
Paradise Lost
John Milton
7,057
100
----
<div class="title-item"><div class="rank">22</div><div class="title"><div class="name-div"><p><a href="/result/title?id=32581621910019">Marketing Management</a></p><span class="name"><div><a href="/result/author?id=Philip+Kotler">Philip Kotler</a></div></span><span class="publisher"><div><a href="/result/publisher?id=Pearson+%2F+Prentice+Hall"><div class="div-link">Pearson / Prentice Hall</div>,</a><div class="div1-no-link">1967</div></div></span></div></div><div class="appearances">7,003</div><div class="score">99</div></div>
Marketing Management
Philip Kotler
7,003
99
----
<div class="title-item"><div class="rank">23</div><div class="title"><div class="name-div"><p><a href="/result/title?id=7799660976136">Molecular Biology of the Cell</a></p><span class="name"><div><a href="/result/author?id=Bruce+Alberts">Bruce Alberts</a></div></span><span class="publisher"><div><a href="/result/publisher?id=Garland+Publishing"><div class="div-link">Garland Publishing</div>,</a><div class="div1-no-link">1983</div></div></span></div></div><div class="appearances">6,981</div><div class="score">99</div></div>
Molecular Biology of the Cell
Bruce Alberts
6,981
99
----
<div class="title-item"><div class="rank">24</div><div class="title"><div class="name-div"><p><a href="/result/title?id=7584912277506">Heart of Darkness</a></p><span class="name"><div><a href="/result/author?id=Joseph+Conrad">Joseph Conrad</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div></div><div class="appearances">6,877</div><div class="score">99</div></div>
Heart of Darkness
Joseph Conrad
6,877
99
----
<div class="title-item"><div class="rank">25</div><div class="title"><div class="name-div"><p><a href="/result/title?id=62543313842187">Communication Skills Handbook</a></p><span class="name"><div><a href="/result/author?id=Jane+Summers">Jane Summers</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div></div><div class="appearances">6,793</div><div class="score">99</div></div>
Communication Skills Handbook
Jane Summers
6,793
99
----
<div class="title-item"><div class="rank">26</div><div class="title"><div class="name-div"><p><a href="/result/title?id=40372692584452">Advanced Engineering Mathematics</a></p><span class="name"><div><a href="/result/author?id=Erwin+Kreyszig">Erwin Kreyszig</a></div></span><span class="publisher"><div><a href="/result/publisher?id=Wiley"><div class="div-link">Wiley</div>,</a><div class="div1-no-link">1962</div></div></span></div></div><div class="appearances">6,780</div><div class="score">99</div></div>
Advanced Engineering Mathematics
Erwin Kreyszig
6,780
99
----
<div class="title-item"><div class="rank">27</div><div class="title"><div class="name-div"><p><a href="/result/title?id=8564165296084">Letter From the Birmingham Jail</a></p><span class="name"><div><a href="/result/author?id=Martin+Luther+King">Martin Luther King</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div></div><div class="appearances">6,780</div><div class="score">99</div></div>
Letter From the Birmingham Jail
Martin Luther King
6,780
99
----
<div class="title-item"><div class="rank">28</div><div class="title"><div class="name-div"><p><a href="/result/title?id=7507602833514">The Prince</a></p><span class="name"><div><a href="/result/author?id=Niccolo+Machiavelli">Niccolo Machiavelli</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div></div><div class="appearances">6,627</div><div class="score">99</div></div>
The Prince
Niccolo Machiavelli
6,627
99
----
<div class="title-item"><div class="rank">29</div><div class="title"><div class="name-div"><p><a href="/result/title?id=7773890973060">Meditations on First Philosophy</a></p><span class="name"><div><a href="/result/author?id=Rene+Descartes">Rene Descartes</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div></div><div class="appearances">6,285</div><div class="score">99</div></div>
Meditations on First Philosophy
Rene Descartes
6,285
99
----
<div class="title-item"><div class="rank">30</div><div class="title"><div class="name-div"><p><a href="/result/title?id=34600256537883">Introduction to Algorithms</a></p><span class="name"><div><a href="/result/author?id=Thomas+H.+Cormen">Thomas H. Cormen</a></div></span><span class="publisher"><div><a href="/result/publisher?id=MIT+Press"><div class="div-link">MIT Press</div>,</a><div class="div1-no-link">1990</div></div></span></div></div><div class="appearances">6,281</div><div class="score">99</div></div>
Introduction to Algorithms
Thomas H. Cormen
6,281
99
----
<div class="title-item"><div class="rank">31</div><div class="title"><div class="name-div"><p><a href="/result/title?id=8065949090762">The Clash of Civilizations?</a></p><span class="name"><div><a href="/result/author?id=Samuel+P.+Huntington">Samuel P. Huntington</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div></div><div class="appearances">6,273</div><div class="score">99</div></div>
The Clash of Civilizations?
Samuel P. Huntington
6,273
99
----
<div class="title-item"><div class="rank">32</div><div class="title"><div class="name-div"><p><a href="/result/title?id=7980049403911">Second Treatise of Government</a></p><span class="name"><div><a href="/result/author?id=John+Locke">John Locke</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div></div><div class="appearances">6,141</div><div class="score">99</div></div>
Second Treatise of Government
John Locke
6,141
99
----
<div class="title-item"><div class="rank">33</div><div class="title"><div class="name-div"><p><a href="/result/title?id=7507602833427">Oedipus the King</a></p><span class="name"><div><a href="/result/author?id=Sophocles">Sophocles</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div></div><div class="appearances">5,963</div><div class="score">99</div></div>
Oedipus the King
Sophocles
5,963
99
----
<div class="title-item"><div class="rank">34</div><div class="title"><div class="name-div"><p><a href="/result/title?id=7808251072460">The Craft of Research</a></p><span class="name"><div><a href="/result/author?id=Wayne+C.+Booth">Wayne C. Booth</a></div></span><span class="publisher"><div><a href="/result/publisher?id=University+of+Chicago+Press"><div class="div-link">University of Chicago Press</div>,</a><div class="div1-no-link">1995</div></div></span></div></div><div class="appearances">5,917</div><div class="score">99</div></div>
The Craft of Research
Wayne C. Booth
5,917
99
----
<div class="title-item"><div class="rank">35</div><div class="title"><div class="name-div"><p><a href="/result/title?id=32452772891414">Intermediate Algebra</a></p><span class="name"><div><a href="/result/author?id=K.+Elayn+Martin-Gay">K. Elayn Martin-Gay</a></div></span><span class="publisher"><div><a href="/result/publisher?id=Pearson+%2F+Prentice+Hall"><div class="div-link">Pearson / Prentice Hall</div>,</a><div class="div1-no-link">1993</div></div></span></div></div><div class="appearances">5,863</div><div class="score">99</div></div>
Intermediate Algebra
K. Elayn Martin-Gay
5,863
99
----
<div class="title-item"><div class="rank">36</div><div class="title"><div class="name-div"><p><a href="/result/title?id=16243566315256">The Yellow Wallpaper</a></p><span class="name"><div><a href="/result/author?id=Charlotte+Perkins+Gilman">Charlotte Perkins Gilman</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div></div><div class="appearances">5,818</div><div class="score">99</div></div>
The Yellow Wallpaper
Charlotte Perkins Gilman
5,818
99
----
<div class="title-item"><div class="rank">37</div><div class="title"><div class="name-div"><p><a href="/result/title?id=7782481269058">The Structure of Scientific Revolutions</a></p><span class="name"><div><a href="/result/author?id=Thomas+S.+Kuhn">Thomas S. Kuhn</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div></div><div class="appearances">5,801</div><div class="score">99</div></div>
The Structure of Scientific Revolutions
Thomas S. Kuhn
5,801
99
----
<div class="title-item"><div class="rank">38</div><div class="title"><div class="name-div"><p><a href="/result/title?id=33543694584301">MLA Handbook for Writers of Research Papers</a></p><span class="name"><div><a href="/result/author?id=Joseph+Gibaldi">Joseph Gibaldi</a></div></span><span class="publisher"><div><a href="/result/publisher?id=The+Modern+Language+Association+of+America"><div class="div-link">The Modern Language Association of America</div>,</a><div class="div1-no-link">1977</div></div></span></div></div><div class="appearances">5,791</div><div class="score">99</div></div>
MLA Handbook for Writers of Research Papers
Joseph Gibaldi
5,791
99
----
<div class="title-item"><div class="rank">39</div><div class="title"><div class="name-div"><p><a href="/result/title?id=8177618192651">The Souls of Black Folk</a></p><span class="name"><div><a href="/result/author?id=W.+E.+B.+Du+Bois">W. E. B. Du Bois</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div></div><div class="appearances">5,790</div><div class="score">99</div></div>
The Souls of Black Folk
W. E. B. Du Bois
5,790
99
----
<div class="title-item"><div class="rank">40</div><div class="title"><div class="name-div"><p><a href="/result/title?id=9225590260735">Pedagogy of the Oppressed</a></p><span class="name"><div><a href="/result/author?id=Paulo+Freire">Paulo Freire</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div></div><div class="appearances">5,784</div><div class="score">99</div></div>
Pedagogy of the Oppressed
Paulo Freire
5,784
99
----
<div class="title-item"><div class="rank">41</div><div class="title"><div class="name-div"><p><a href="/result/title?id=32899449487478">Biology</a></p><span class="name"><div><a href="/result/author?id=Neil+A.+Campbell">Neil A. Campbell</a></div></span><span class="publisher"><div><a href="/result/publisher?id=Pearson+%2F+Benjamin+Cummings"><div class="div-link">Pearson / Benjamin Cummings</div>,</a><div class="div1-no-link">1973</div></div></span></div></div><div class="appearances">5,766</div><div class="score">99</div></div>
Biology
Neil A. Campbell
5,766
99
----
<div class="title-item"><div class="rank">42</div><div class="title"><div class="name-div"><p><a href="/result/title?id=32495722562184">The Art of Public Speaking</a></p><span class="name"><div><a href="/result/author?id=Stephen+Lucas">Stephen Lucas</a></div></span><span class="publisher"><div><a href="/result/publisher?id=McGraw-Hill"><div class="div-link">McGraw-Hill</div>,</a><div class="div1-no-link">1983</div></div></span></div></div><div class="appearances">5,567</div><div class="score">99</div></div>
The Art of Public Speaking
Stephen Lucas
5,567
99
----
<div class="title-item"><div class="rank">43</div><div class="title"><div class="name-div"><p><a href="/result/title?id=7507602833457">Poetics</a></p><span class="name"><div><a href="/result/author?id=Aristotle">Aristotle</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div></div><div class="appearances">5,563</div><div class="score">99</div></div>
Poetics
Aristotle
5,563
99
----
<div class="title-item"><div class="rank">44</div><div class="title"><div class="name-div"><p><a href="/result/title?id=7550552512725">An Inquiry Into the Nature and Causes of the Wealth of Nations</a></p><span class="name"><div><a href="/result/author?id=Adam+Smith">Adam Smith</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div></div><div class="appearances">5,546</div><div class="score">98</div></div>
An Inquiry Into the Nature and Causes of the Wealth of Nations
Adam Smith
5,546
98
----
<div class="title-item"><div class="rank">45</div><div class="title"><div class="name-div"><p><a href="/result/title?id=7507602833417">The Odyssey</a></p><span class="name"><div><a href="/result/author?id=Homer">Homer</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div></div><div class="appearances">5,391</div><div class="score">98</div></div>
The Odyssey
Homer
5,391
98
----
<div class="title-item"><div class="rank">46</div><div class="title"><div class="name-div"><p><a href="/result/title?id=34995393528856">Principles of Anatomy and Physiology</a></p><span class="name"><div><a href="/result/author?id=Gerard+J.+Tortora">Gerard J. Tortora</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div></div><div class="appearances">5,375</div><div class="score">98</div></div>
Principles of Anatomy and Physiology
Gerard J. Tortora
5,375
98
----
<div class="title-item"><div class="rank">47</div><div class="title"><div class="name-div"><p><a href="/result/title?id=33629593929377">Handbook of Qualitative Research</a></p><span class="name"><div><a href="/result/author?id=Norman+K.+Denzin">Norman K. Denzin</a>, </div><div><a href="/result/author?id=Yvonna+S.+Lincoln">Yvonna S. Lincoln</a></div></span><span class="publisher"><div><a href="/result/publisher?id=SAGE"><div class="div-link">SAGE</div>,</a><div class="div1-no-link">1994</div></div></span></div></div><div class="appearances">5,273</div><div class="score">98</div></div>
Handbook of Qualitative Research
Norman K. Denzin, Yvonna S. Lincoln
5,273
98
----
<div class="title-item"><div class="rank">48</div><div class="title"><div class="name-div"><p><a href="/result/title?id=54984171325947">The History of Sexuality</a></p><span class="name"><div><a href="/result/author?id=Michel+Foucault">Michel Foucault</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div></div><div class="appearances">5,268</div><div class="score">98</div></div>
The History of Sexuality
Michel Foucault
5,268
98
----
<div class="title-item"><div class="rank">49</div><div class="title"><div class="name-div"><p><a href="/result/title?id=7627862379468">A Theory of Justice</a></p><span class="name"><div><a href="/result/author?id=John+Rawls">John Rawls</a></div></span><span class="publisher"><div class="div-no-link">Multiple Editions</div></span></div></div><div class="appearances">5,228</div><div class="score">98</div></div>
A Theory of Justice
John Rawls
5,228
98
----
<div class="title-item"><div class="rank">50</div><div class="title"><div class="name-div"><p><a href="/result/title?id=33122787788660">Computer Networks</a></p><span class="name"><div><a href="/result/author?id=Andrew+S.+Tanenbaum">Andrew S. Tanenbaum</a></div></span><span class="publisher"><div><a href="/result/publisher?id=Pearson+%2F+Prentice+Hall"><div class="div-link">Pearson / Prentice Hall</div>,</a><div class="div1-no-link">1981</div></div></span></div></div><div class="appearances">5,197</div><div class="score">98</div></div>
Computer Networks
Andrew S. Tanenbaum
5,197
98
books = doc.find_all('div', class_='title-item')

all_data = []
for book in books:
    data = {
        'name': book.find('a').text,
        'author': book.find('span', class_='name').text,
        'appearances': book.find(class_='appearances').text,
        'score': book.find(class_='score').text
    }
    all_data.append(data)
all_data
[{'name': 'The Elements of Style',
'author': 'William Strunk',
'appearances': '15,533',
'score': '100'},
{'name': "A Writer's Reference",
'author': 'Diana Hacker',
'appearances': '14,931',
'score': '100'},
{'name': 'A Manual for Writers of Term Papers, Theses, and Dissertations',
'author': 'Kate L. Turabian',
'appearances': '13,426',
'score': '100'},
{'name': 'The Communist Manifesto',
'author': 'Karl Marx',
'appearances': '11,234',
'score': '100'},
{'name': 'The Republic',
'author': 'Plato',
'appearances': '9,883',
'score': '100'},
{'name': 'Calculus',
'author': 'James Stewart',
'appearances': '9,682',
'score': '100'},
{'name': 'Frankenstein',
'author': 'Mary Wollstonecraft Shelley',
'appearances': '9,320',
'score': '100'},
{'name': 'The Canterbury Tales',
'author': 'Geoffrey Chaucer',
'appearances': '9,172',
'score': '100'},
{'name': 'Nicomachean Ethics',
'author': 'Aristotle',
'appearances': '9,104',
'score': '100'},
{'name': 'Human Anatomy and Physiology',
'author': 'Elaine Nicpon Marieb',
'appearances': '8,507',
'score': '100'},
{'name': 'Doing Your Research Project: A Guide for First-Time Researchers in Education and Social Science',
'author': 'Judith Bell',
'appearances': '8,234',
'score': '100'},
{'name': 'Imagined Communities: Reflections on the Origin and Spread of Nationalism',
'author': "Benedict R. O.'G. Anderson",
'appearances': '8,082',
'score': '100'},
{'name': 'They Say/I Say: The Moves That Matter in Academic Writing',
'author': 'Gerald Graff',
'appearances': '8,007',
'score': '100'},
{'name': 'Leviathan',
'author': 'Thomas Hobbes',
'appearances': '7,974',
'score': '100'},
{'name': 'A Pocket Style Manual',
'author': 'Diana Hacker',
'appearances': '7,706',
'score': '100'},
{'name': 'Discipline and Punish: The Birth of the Prison',
'author': 'Michel Foucault',
'appearances': '7,583',
'score': '100'},
{'name': 'The Study Skills Handbook',
'author': 'Stella Cottrell',
'appearances': '7,541',
'score': '100'},
{'name': 'Orientalism',
'author': 'Edward W. Said',
'appearances': '7,369',
'score': '100'},
{'name': 'Social Research Methods',
'author': 'Alan Bryman',
'appearances': '7,351',
'score': '100'},
{'name': 'Research Design: Qualitative, Quantitative, and Mixed Methods Approaches',
'author': 'John W. Creswell',
'appearances': '7,143',
'score': '100'},
{'name': 'Paradise Lost',
'author': 'John Milton',
'appearances': '7,057',
'score': '100'},
{'name': 'Marketing Management',
'author': 'Philip Kotler',
'appearances': '7,003',
'score': '99'},
{'name': 'Molecular Biology of the Cell',
'author': 'Bruce Alberts',
'appearances': '6,981',
'score': '99'},
{'name': 'Heart of Darkness',
'author': 'Joseph Conrad',
'appearances': '6,877',
'score': '99'},
{'name': 'Communication Skills Handbook',
'author': 'Jane Summers',
'appearances': '6,793',
'score': '99'},
{'name': 'Advanced Engineering Mathematics',
'author': 'Erwin Kreyszig',
'appearances': '6,780',
'score': '99'},
{'name': 'Letter From the Birmingham Jail',
'author': 'Martin Luther King',
'appearances': '6,780',
'score': '99'},
{'name': 'The Prince',
'author': 'Niccolo Machiavelli',
'appearances': '6,627',
'score': '99'},
{'name': 'Meditations on First Philosophy',
'author': 'Rene Descartes',
'appearances': '6,285',
'score': '99'},
{'name': 'Introduction to Algorithms',
'author': 'Thomas H. Cormen',
'appearances': '6,281',
'score': '99'},
{'name': 'The Clash of Civilizations?',
'author': 'Samuel P. Huntington',
'appearances': '6,273',
'score': '99'},
{'name': 'Second Treatise of Government',
'author': 'John Locke',
'appearances': '6,141',
'score': '99'},
{'name': 'Oedipus the King',
'author': 'Sophocles',
'appearances': '5,963',
'score': '99'},
{'name': 'The Craft of Research',
'author': 'Wayne C. Booth',
'appearances': '5,917',
'score': '99'},
{'name': 'Intermediate Algebra',
'author': 'K. Elayn Martin-Gay',
'appearances': '5,863',
'score': '99'},
{'name': 'The Yellow Wallpaper',
'author': 'Charlotte Perkins Gilman',
'appearances': '5,818',
'score': '99'},
{'name': 'The Structure of Scientific Revolutions',
'author': 'Thomas S. Kuhn',
'appearances': '5,801',
'score': '99'},
{'name': 'MLA Handbook for Writers of Research Papers',
'author': 'Joseph Gibaldi',
'appearances': '5,791',
'score': '99'},
{'name': 'The Souls of Black Folk',
'author': 'W. E. B. Du Bois',
'appearances': '5,790',
'score': '99'},
{'name': 'Pedagogy of the Oppressed',
'author': 'Paulo Freire',
'appearances': '5,784',
'score': '99'},
{'name': 'Biology',
'author': 'Neil A. Campbell',
'appearances': '5,766',
'score': '99'},
{'name': 'The Art of Public Speaking',
'author': 'Stephen Lucas',
'appearances': '5,567',
'score': '99'},
{'name': 'Poetics',
'author': 'Aristotle',
'appearances': '5,563',
'score': '99'},
{'name': 'An Inquiry Into the Nature and Causes of the Wealth of Nations',
'author': 'Adam Smith',
'appearances': '5,546',
'score': '98'},
{'name': 'The Odyssey',
'author': 'Homer',
'appearances': '5,391',
'score': '98'},
{'name': 'Principles of Anatomy and Physiology',
'author': 'Gerard J. Tortora',
'appearances': '5,375',
'score': '98'},
{'name': 'Handbook of Qualitative Research',
'author': 'Norman K. Denzin,\xa0Yvonna S. Lincoln',
'appearances': '5,273',
'score': '98'},
{'name': 'The History of Sexuality',
'author': 'Michel Foucault',
'appearances': '5,268',
'score': '98'},
{'name': 'A Theory of Justice',
'author': 'John Rawls',
'appearances': '5,228',
'score': '98'},
{'name': 'Computer Networks',
'author': 'Andrew S. Tanenbaum',
'appearances': '5,197',
'score': '98'}]
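Everything in all_data is still text: the counts carry thousands separators, and one author field contains a non-breaking space (\xa0) between two names. Here is a minimal sketch of one way to tidy that up before doing any maths on it. The clean_book helper is a hypothetical name, not part of the original notebook, and the two sample rows are copied from the output above.

```python
def clean_book(row):
    """Return a copy of one scraped row with tidy text and real integers."""
    return {
        "name": row["name"].strip(),
        # \xa0 is a non-breaking space left over from the page's HTML
        "author": row["author"].replace("\xa0", " ").strip(),
        # "15,533" -> 15533
        "appearances": int(row["appearances"].replace(",", "")),
        "score": int(row["score"]),
    }


# Two rows copied from the scraped output (hypothetical stand-in for all_data)
sample = [
    {"name": "The Elements of Style", "author": "William Strunk",
     "appearances": "15,533", "score": "100"},
    {"name": "Handbook of Qualitative Research",
     "author": "Norman K. Denzin,\xa0Yvonna S. Lincoln",
     "appearances": "5,273", "score": "98"},
]

cleaned = [clean_book(row) for row in sample]
print(cleaned)
```

In the notebook you would run the same list comprehension over all_data instead of sample.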
And then we can do all anyone ever wants to do, which is convert it into a spreadsheet!
import pandas as pd

df = pd.DataFrame(all_data)
df.head()
| | name | author | appearances | score |
|---|---|---|---|---|
| 0 | The Elements of Style | William Strunk | 15,533 | 100 |
| 1 | A Writer's Reference | Diana Hacker | 14,931 | 100 |
| 2 | A Manual for Writers of Term Papers, Theses, a... | Kate L. Turabian | 13,426 | 100 |
| 3 | The Communist Manifesto | Karl Marx | 11,234 | 100 |
| 4 | The Republic | Plato | 9,883 | 100 |
Interacting with the page
If we scroll down a bit, we see that the page only lists the top 50 books. We want more than that! And we get that by clicking the “Show More” button.
Playwright makes it easy with page.get_by_text and .click()
await page.get_by_text("Show more").click()
Notice that we didn’t have to scroll down! If you’re used to Selenium, you’ll remember it losing its mind whenever you tried to click something that wasn’t scrolled into view. Playwright doesn’t care: it finds the button and clicks it!
If we want to click three times? Ten times? Just write a loop!
for _ in range(3):
    await page.get_by_text("Show more").click()
Notice that you didn’t have to wait for the content to load. Selenium loves to throw errors if you try to click content that isn’t on the page – Playwright doesn’t care, it just waits for it to show up! This can be a pain if you had a typo and have to wait 30-60 seconds for Playwright to say “maybe you made a mistake,” but I think we can live with that.
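If the long wait after a typo bothers you, Playwright's click accepts a timeout in milliseconds (the Python default is 30 seconds). Below is a rough sketch of a helper that clicks a few times and gives up sooner. The name click_repeatedly and the 5-second default are my own choices for illustration, and page is assumed to be the Playwright page opened earlier.

```python
async def click_repeatedly(page, text, times, timeout_ms=5_000):
    """Click the element containing `text` a fixed number of times.

    `page` is assumed to be a Playwright async Page. `timeout_ms` is passed
    straight to Locator.click, so a mistyped label fails after a few seconds
    instead of the default 30.
    """
    for _ in range(times):
        await page.get_by_text(text).click(timeout=timeout_ms)
```

In the notebook that would be called as await click_repeatedly(page, "Show more", 3).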
Filling out forms
Let’s try another page where we need to fill out some forms. The North Dakota well search page is a good one!
# Imports
from playwright.async_api import async_playwright

playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)
page = await browser.new_page()

await page.goto("https://www.dmr.nd.gov/oilgas/findwellsvw.asp")
Selecting from dropdowns is easy! Instead of importing a thousand additional tools (cough Selenium cough) we can just use .select_option.
What element on the page do we select? I don’t know, and I don’t care! An easy shortcut is to guess incorrectly – Playwright will automatically provide you with some options. If I try to just find all of the select fields…
await page.locator("select").select_option('135')
…it gives me a few ideas for what I should have done.
1) <select size="1" id="ddmOperator" name="ddmOperator">…</select> aka get_by_label("Operator:")
2) <select size="1" id="ddmField" name="ddmField">…</select> aka get_by_label("Field:")
3) <select size="1" id="ddmSection" name="ddmSection">…</select> aka get_by_label("Section:")
4) <select size="1" id="ddmTownship" name="ddmTownship">…</select> aka get_by_label("Township:")
5) <select size="1" id="ddmRange" name="ddmRange">…</select> aka get_by_label("Range:")
I think get_by_label("Township:") seems good!
await page.get_by_label("Township:").select_option('135')
['135']
# await page.get_by_text("Submit").click()
await page.get_by_role("button", name="Submit").click()
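The ['135'] printed above is what select_option hands back: the list of option values that ended up selected. As a small sketch, the two steps can be wrapped in one helper so the search is easy to repeat. The name search_township is hypothetical, and page is assumed to be a Playwright page already sitting on the well search form.

```python
async def search_township(page, township):
    """Pick a township in the dropdown, submit the form, return the selection.

    `page` is assumed to be a Playwright async Page on the well search form.
    """
    # select_option returns the list of option values that are now selected
    selected = await page.get_by_label("Township:").select_option(township)
    await page.get_by_role("button", name="Submit").click()
    return selected
```

In the notebook, await search_township(page, "135") would select the township and click Submit in one go.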
Since the data is a table, we can actually feed the HTML directly into pandas.
html_content = await page.content()
tables = pd.read_html(html_content)
len(tables)
/var/folders/_m/b8tjbm6n4zs1q2mvjvg25x1m0000gn/T/ipykernel_48736/539925845.py:2: FutureWarning: Passing literal html to 'read_html' is deprecated and will be removed in a future version. To read from a literal string, wrap it in a 'StringIO' object.
tables = pd.read_html(html_content)
3
With some experimentation we can figure out it’s the third table.
df = tables[2]
df.head()
| | File No | CTB No | API No | Well Type | Well Status | Status Date | DTD | Location | Operator | Well Name | Field |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1355 | NaN | 3304700007 | OG | DRY | 2/1/1957 | 3200.0 | NWNW 11-135-72 | CALVERT DRILLING, INC. | ARNOLD GERBER 1 | WILDCAT |
| 1 | 5523 | NaN | 3304700020 | OG | DRY | 11/9/1974 | 5320.0 | NWNW 29-135-73 | WISE OIL COMPANY NO. 2 ET AL | BALTZER A. WEIGEL 1 | WILDCAT |
| 2 | 10369 | NaN | 3302900028 | OG | DRY | 10/11/1983 | 4000.0 | NWNE 22-135-74 | ARKLA EXPLORATION CO. | ELLEFSON 1 | WILDCAT |
| 3 | 10173 | NaN | 3302900027 | OG | DRY | 6/24/1983 | 5865.0 | SWSW 14-135-76 | SOUTHWESTERN ENERGY PRODUCTION CO. | BEASTROM 1-14 | WILDCAT |
| 4 | 16476 | NaN | 3302900032 | GASD | DRY | 11/13/2009 | 1691.0 | SESE 15-135-76 | STAGHORN ENERGY, LLC | WEISER 1-15 | WILDCAT |
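The FutureWarning above is pandas asking for the HTML to be wrapped in a StringIO object. read_html also takes a match argument that keeps only tables containing a given piece of text, which can save some of the guessing about which table is the right one. Here is a sketch under assumptions: the HTML below is a made-up stand-in for what page.content() returns, not the real page, and on the real page match may still return more than one table, so check len(tables) before trusting index 0.

```python
from io import StringIO

import pandas as pd

# Made-up stand-in for `await page.content()`: one layout table, one results table
html_content = """
<table><tr><th>Search</th></tr><tr><td>form</td></tr></table>
<table>
  <tr><th>File No</th><th>CTB No</th><th>Well Name</th></tr>
  <tr><td>1355</td><td></td><td>ARNOLD GERBER 1</td></tr>
</table>
"""

# StringIO avoids the deprecation warning; match keeps tables containing "CTB No"
tables = pd.read_html(StringIO(html_content), match="CTB No")
df = tables[0]
print(len(tables))
print(df)
```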
It was township 135, so we can even save it as 135.csv.
df.to_csv("135.csv", index=False)
Waiting for elements to appear
If we want to search for multiple townships, we could write a loop to go through it.
If we do it “normally,” though, we run into an error. Even though we can see the table on the page, it doesn’t make it into pandas.
---> 16 df = tables[2]
17 df.to_csv(filename, index=False)
IndexError: list index out of range
This is because the table doesn’t load until a tiny bit after the page loads. Playwright doesn’t know we are waiting for the table, though, so it hands the incomplete page to pandas. If you were working with Selenium you might use time.sleep or the awful, horrible version of waiting they support, but with Playwright it’s easy!
We’re just going to wait for the “CTB No” field to show up:
await page.get_by_text('CTB No').wait_for()
The code underneath that line won’t continue until “CTB No” appears on the page.
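By default Playwright waits up to 30 seconds and then raises its own `TimeoutError`. If you would rather fail faster, `wait_for` accepts a `timeout` in milliseconds. A minimal sketch, where `wait_for_results` is a hypothetical helper name and `page` is the Playwright page we already have open:

```python
async def wait_for_results(page, marker="CTB No", timeout_ms=10_000):
    """Pause until an element containing `marker` is visible on the page."""
    # state="visible" is Playwright's default; it is spelled out here for clarity
    await page.get_by_text(marker).wait_for(state="visible", timeout=timeout_ms)
```

In the notebook you would call it as `await wait_for_results(page)`.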
township_numbers = ['129', '130', '135']

for num in township_numbers:
    # Fill it in
    print("Searching for page", num)
    await page.locator("#ddmTownship").select_option(num)
    await page.get_by_text("Submit", exact=True).click()
    # Wait for the table to show up
    await page.get_by_text('CTB No').wait_for()
    # Grab the table from the page
    html = await page.content()
    tables = pd.read_html(html)
    df = tables[2]

    # Build filename and save it
    filename = f"{num}.csv"
    print("Got it - saving as", filename)
    df.to_csv(filename, index=False)
Searching for page 129
Got it - saving as 129.csv
Searching for page 130
Got it - saving as 130.csv
Searching for page 135
Got it - saving as 135.csv
/var/folders/_m/b8tjbm6n4zs1q2mvjvg25x1m0000gn/T/ipykernel_48736/3519667351.py:14: FutureWarning: Passing literal html to 'read_html' is deprecated and will be removed in a future version. To read from a literal string, wrap it in a 'StringIO' object.
tables = pd.read_html(html)
More forms
Let’s try to scrape the South Dakota Board of Technical Professions this time. We’re going to end up downloading some content, so I’m adding downloads_path=".", which makes Playwright download files into the same folder as this notebook.
from playwright.async_api import async_playwright

# "Hey, open up a browser"
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(
    headless=False,
    downloads_path="."
)

# Create a new browser window
page = await browser.new_page()

# Tell it to go to this page
await page.goto("https://apps.sd.gov/ld17btp/licenseelist.aspx")
<Response url='https://apps.sd.gov/ld17btp/licenseelist.aspx' request=<Request url='https://apps.sd.gov/ld17btp/licenseelist.aspx' method='GET'>>
If we want to start writing inside of the “Name” field, we can take the same shortcut as above and look for page.locator("input") instead of anything more specific. Playwright complains that the locator matches 20 elements and shows us the first 10.
Error: Error: strict mode violation: locator("input") resolved to 20 elements:
1) <input type="hidden" id="ctl00_RadScriptManager1_TSM" n…/> aka locator("#ctl00_RadScriptManager1_TSM")
2) <input value="" type="hidden" id="__EVENTTARGET" name="…/> aka locator("[id=\"__EVENTTARGET\"]")
3) <input value="" type="hidden" id="__EVENTARGUMENT" name…/> aka locator("[id=\"__EVENTARGUMENT\"]")
4) <input type="hidden" id="__VIEWSTATE" name="__VIEWSTATE…/> aka locator("[id=\"__VIEWSTATE\"]")
5) <input type="hidden" value="C7B208E6" id="__VIEWSTATEGE…/> aka locator("[id=\"__VIEWSTATEGENERATOR\"]")
6) <input value="" type="hidden" id="__VIEWSTATEENCRYPTED"…/> aka locator("[id=\"__VIEWSTATEENCRYPTED\"]")
7) <input type="hidden" id="__EVENTVALIDATION" name="__EVE…/> aka locator("[id=\"__EVENTVALIDATION\"]")
8) <input type="text" maxlength="50" id="ctl00_ContentPlac…/> aka locator("#ctl00_ContentPlaceHolder1_txtName")
9) <input type="text" value="All" autocomplete="off" class…/> aka locator("#ctl00_ContentPlaceHolder1_ddlPEDisc_Input")
10) <input type="hidden" autocomplete="off" id="ctl00_Conte…/> aka locator("#ctl00_ContentPlaceHolder1_ddlPEDisc_ClientState")
...
Luckily it seems like locator("#ctl00_ContentPlaceHolder1_txtName") is probably what we’re looking for!
await page.locator("#ctl00_ContentPlaceHolder1_txtName").fill('SMITH')
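Most of the matches in that error are hidden inputs we will never type into. One way to shorten the list is to narrow the selector to text boxes and print their ids. This is a rough sketch: `list_input_ids` is a hypothetical helper, and `input[type='text']` is an ordinary CSS attribute selector.

```python
async def list_input_ids(page, selector="input[type='text']"):
    """Return the id attribute of every element matching `selector`."""
    matches = page.locator(selector)
    ids = []
    # count() and nth() let us walk through the matches one at a time
    for i in range(await matches.count()):
        ids.append(await matches.nth(i).get_attribute("id"))
    return ids
```

Running `await list_input_ids(page)` in the notebook should list only the visible-style text inputs, which makes the Name field easier to spot.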
Now we want to select something from the Profession dropdown. Because it isn’t a normal select box we have to actually click the dropdown arrow, then click the profession we’re interested in.
await page.locator("#ctl00_ContentPlaceHolder1_ddlProfession_Arrow").click()
await page.get_by_text("Professional Engineer").click()
Now let’s click that Search button! If we try page.get_by_text("Search") we get a few matches, so we pick the most likely one.
await page.get_by_role("button", name="Search").click()
Now that the page has loaded we could use page.content() to feed the table into pandas… or we could just click the “Download CSV” button!
await page.locator("#ctl00_ContentPlaceHolder1_rgLicensee_ctl00_ctl02_ctl00_ExportToCsvButton").click()
It downloads into the current folder, with an awful filename of 9b96da64-757f-4166-a84c-63fa1830e77c. Thank you, downloads_path=".", which we set up when we launched the browser!
If we want to be a little more in control of the filename, we can skip downloads_path and write slightly more complicated code. In the code below, we’re listening for the download to happen and redirecting it to a “good” filename when it starts.
async with page.expect_download() as download_info:
    # Perform the action that initiates download
    await page.locator("#ctl00_ContentPlaceHolder1_rgLicensee_ctl00_ctl02_ctl00_ExportToCsvButton").click()
download = await download_info.value

# Wait for the download process to complete and save the downloaded file somewhere
await download.save_as("content.csv")
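If you would rather not hard-code a name like content.csv, the download object also carries the filename the site suggested. A small sketch, assuming `download` is the object we got from `download_info.value`; `save_download` is a hypothetical helper name.

```python
from pathlib import Path


async def save_download(download, folder="."):
    """Save a finished Playwright download under the filename the site suggested."""
    # suggested_filename is the name the browser would have used for the file
    target = Path(folder) / download.suggested_filename
    await download.save_as(target)
    return target
```

In the notebook, `path = await save_download(download)` would take the place of the `save_as("content.csv")` line.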
Did it work???
df = pd.read_csv("content.csv")
df.head()
 | Profession | Name | Address | City | State | Zip | Phone | Registration<br/>Number | PE<br/>Disc. | Expiration<br/>Date | Status |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | PE | Lane Lee Goldsmith | PO Box 123 | Mobridge | SD | 57601 | (605) 350-4625 | 8908 | CE | 6/30/2024 | Active |
1 | PE | Todd L. Goldsmith | 5212 Basswood St | Rapid City | SD | 57703 | (605) 848-0040 | 5163 | CE | 12/31/2024 | Active |
2 | PE | Willie Morgan NeSmith, Jr. | 7300 Marks Lane | Austell | GA | 30168 | (770) 941-5100 | 10305 | CE | 7/31/2024 | Active |
3 | PE | Aaron D. Smith | 1801 W 32nd St, Bldg B Suite 104 | Joplin | MO | 64804 | (417) 624-0444 | 8630 | CE | 7/31/2025 | Active |
4 | PE | Alexander Smith | 8309 W 42nd Street | Sioux Falls | SD | 57106 | 6052202447 | 15570 | CE | 12/31/2025 | Active |
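It worked, although a few column names still carry a stray `<br/>` from the website. A minimal sketch for tidying them up, assuming the tag shows up in the CSV header exactly as it does in the output above; `clean_header` is a hypothetical helper name.

```python
import re


def clean_header(name):
    """Replace HTML line breaks in a column name with a space and trim it."""
    without_breaks = re.sub(r"<br\s*/?>", " ", name, flags=re.IGNORECASE)
    # Collapse any doubled-up whitespace left behind
    return " ".join(without_breaks.split())
```

pandas can apply a function to every column name, so `df = df.rename(columns=clean_header)` should turn `Registration<br/>Number` into `Registration Number`.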
Taking screenshots
If you’ve seen homepages.news you probably stay awake at night thinking, “how does Ben Welsh have the time to screenshot all those homepages all the time??” And the answer is: Playwright!
(Technically it’s actually shot-scraper but it’s built on Playwright, so close enough)
Let’s take a screenshot of a full page. We’ll start by loading in Playwright and visiting a page.
# Imports
from playwright.async_api import async_playwright

playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)
page = await browser.new_page()

await page.goto("https://www.100daysinappalachia.com/")
<Response url='https://www.100daysinappalachia.com/' request=<Request url='https://www.100daysinappalachia.com/' method='GET'>>
await page.screenshot(path="screenshot.png", full_page=True)
print("Done")
Done
That takes a screenshot of the whole page. Other options let you screenshot just one part of the page, or capture to a buffer instead of saving to a file.
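Both of those can be sketched in a few lines. Locators have their own `screenshot` method, and leaving out `path` makes Playwright hand back the image as bytes. `capture` is a hypothetical helper name, and the selector you pass has to match exactly one element or Playwright will raise the same strict mode error we saw earlier.

```python
async def capture(page, selector=None, path=None):
    """Screenshot one element (if `selector` is given) or the full page.

    Returns the image as bytes; also writes a file when `path` is set.
    """
    if selector is not None:
        # Only the matched element ends up in the image
        return await page.locator(selector).screenshot(path=path)
    return await page.screenshot(path=path, full_page=True)
```

For example, `png_bytes = await capture(page, selector="header")` would grab just the page header as a buffer, assuming the page has a single `header` element.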
Solving CAPTCHAs
When you’re looking to scrape a site with a browser automation tool, you’ll often run up against CAPTCHAs that demand to know that you are not a robot. Since you are a robot, it can be problematic.
Luckily, a lot of services exist to help you out with that! The one that’s easiest to use and requires the least technical skill is probably NopeCHA. I wrote a writeup of how to use NopeCHA to solve CAPTCHAs with Playwright here. It involves downloading an extension that you then embed an API key into, then visiting the CAPTCHA-y page with the Playwright browser.
NopeCHA works pretty well in my experience, but it’s always a cat-and-mouse game! You might have luck using 2captcha or anti-captcha if NopeCHA comes up empty.