Scraping Multiple Pages in Python WORKSHOP 3 | CREATOR: CHARLOTTE LLOYD
Scraping Multiple Pages in Python - Charlotte J. Lloyd

Sep 25, 2020

Outline

I. Recap
II. Workshop Example
III. Verify Data
IV. Celebration, Back-slapping

PART I: Recap

Three Major Ways to Use Python

1. Command Line

2. “IDE”

3. Notebook

Scraping Process // Battle Plan

1. Surveillance
  • Evaluate the page, learn the terrain.

2. Plan of Attack
  • Brainstorm ways to approach the enemy.

3. Write code
  • Be willing to change your strategy if you encounter obstacles or see another “weakness” to exploit.

4. Emerge bloodied, yet victorious.
  • Verify the data before all that syntax evaporates from your short-term memory.

PART II: Workshop Example

GOAL

• http://www.bfi.org.uk/films-tv-people/sightandsoundpoll2012/voters
• Scrape all information about all voters
• Scrape “film details” (except “featuring”) for all films chosen by voters in their “top ten”
• Save data as 2 different csv files
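As a rough illustration of the last goal, here is a minimal sketch of writing two separate CSV files with Python's standard `csv` module. The field names, the placeholder values, and the file names `voters.csv` and `films.csv` are assumptions made for this example; the workshop code may organize its output differently.

```python
import csv
import os
import tempfile

# Hypothetical sample rows. The IDs come from the example URLs in this deck;
# every other value is a placeholder, not real scraped data.
voters = [
    {"voter_id": "94", "name": "PLACEHOLDER", "type": "PLACEHOLDER", "country": "PLACEHOLDER"},
]
films = [
    {"film_id": "4ce2b6a7a801b", "director": "PLACEHOLDER", "year": "PLACEHOLDER"},
]


def write_csv(path, rows):
    """Write a list of dicts to a CSV file, using the first row's keys as the header."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


# Write into a temporary directory so the sketch does not clutter the working folder.
out_dir = tempfile.mkdtemp()
voters_path = os.path.join(out_dir, "voters.csv")
films_path = os.path.join(out_dir, "films.csv")
write_csv(voters_path, voters)
write_csv(films_path, films)
```

Keeping voters and films in separate files means each film is described once, and the voter file can refer to films by ID.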

1. Surveillance

• Voter: http://www.bfi.org.uk/films-tv-people/sightandsoundpoll2012/voter/94
  • special case: http://www.bfi.org.uk/films-tv-people/sightandsoundpoll2012/voter/6

• Film: http://www.bfi.org.uk/films-tv-people/4ce2b6a7a801b
  • special case: http://www.bfi.org.uk/films-tv-people/4ce2b8bb6b693
  • special case: http://www.bfi.org.uk/films-tv-people/4ce2b7d2993a2

2. Plan of Attack: Voters

• What is our strategy to get the judge URLs?
  • exploit the “class=sas-poll” feature to scrape URLs from each of 25 tables

• What is our strategy to get the data for each judge?
  • scrape the name, type, info and country from the main page
  • scrape the 10 films and comment from the judge’s individual page

• How can we handle the special cases?
  • manually create filmIDs for films without webpages
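The voter strategy above can be sketched roughly as follows. This is not the workshop's own code: it uses only the standard library's `html.parser` so it runs without extra installs, and it works on a small inline HTML sample rather than the live page. The sample markup, the class name `PollLinkParser`, the helper `assign_film_id`, and the `manual-N` ID scheme are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "http://www.bfi.org.uk/films-tv-people/sightandsoundpoll2012/voters"


class PollLinkParser(HTMLParser):
    """Collect href values of <a> tags that sit inside <table class="sas-poll">."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # greater than 0 while inside a sas-poll table
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table":
            classes = (attrs.get("class") or "").split()
            # Count nested tables too, so the matching </table> is tracked correctly.
            if self.depth or "sas-poll" in classes:
                self.depth += 1
        elif tag == "a" and self.depth and attrs.get("href"):
            # Turn relative links into absolute URLs.
            self.links.append(urljoin(BASE_URL, attrs["href"]))

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1


def assign_film_id(href, manual_ids, title):
    """Use the last URL segment as the ID; make up a stable ID when a film has no page."""
    if href:
        return href.rstrip("/").rsplit("/", 1)[-1]
    if title not in manual_ids:
        manual_ids[title] = "manual-{}".format(len(manual_ids) + 1)
    return manual_ids[title]


# Hypothetical stand-in for the voters page (the real markup may differ).
SAMPLE_HTML = """
<table class="sas-poll">
  <tr><td><a href="/films-tv-people/sightandsoundpoll2012/voter/94">Voter A</a></td></tr>
  <tr><td><a href="/films-tv-people/sightandsoundpoll2012/voter/6">Voter B</a></td></tr>
</table>
<table class="other"><tr><td><a href="/elsewhere">Ignore me</a></td></tr></table>
"""

parser = PollLinkParser()
parser.feed(SAMPLE_HTML)
voter_urls = parser.links

manual_ids = {}
linked_id = assign_film_id("http://www.bfi.org.uk/films-tv-people/4ce2b6a7a801b", manual_ids, "Linked film")
unlinked_id = assign_film_id(None, manual_ids, "Film without a page")
```

Keeping the made-up IDs in a dictionary keyed by title means the same unlinked film gets the same ID each time a different voter picks it.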

2. Plan of Attack: Films

• What is our strategy for getting the film URLs?
  • save them to a list while we’re scraping the judges

• What is our strategy to get the data for each film? Why do we have to incorporate the special cases directly into the strategy?
  • we need to separately search for cells containing the director, country, year, genre, type, and category info
  • the number of cells in the table varies, so we have to know what they are based on their content and not their position
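To make the "content, not position" idea concrete, here is a small sketch that reads every table row and keeps a value only when the row's first cell carries one of the wanted labels. It assumes each detail sits in a row with a label cell followed by a value cell; the real film pages may be laid out differently, and the names `RowParser` and `film_details` and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser

WANTED = ("director", "country", "year", "genre", "type", "category")


class RowParser(HTMLParser):
    """Collect the text of each table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data


def film_details(html):
    """Return a dict with one entry per wanted label, empty when the page lacks it."""
    parser = RowParser()
    parser.feed(html)
    details = dict.fromkeys(WANTED, "")
    for cells in parser.rows:
        if len(cells) < 2:
            continue
        # Identify the row by what its label says, never by where it sits.
        label = cells[0].strip().rstrip(":").lower()
        if label in details:
            details[label] = cells[1].strip()
    return details


# Hypothetical page with rows missing and in an unusual order.
SHORT_PAGE = (
    "<table>"
    "<tr><td>Year</td><td>2000</td></tr>"
    "<tr><td>Featuring</td><td>Someone</td></tr>"
    "<tr><td>Director:</td><td>Example Director</td></tr>"
    "</table>"
)
details = film_details(SHORT_PAGE)
```

Because every wanted label starts out as an empty string, films with fewer cells still produce a row with the same columns, which keeps the eventual CSV rectangular.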

3. Let’s look at the code together

• available at: https://github.com/charlloyd/film-gaze
• First let’s run it in Spyder.
• Then let’s download the Jupyter notebook.
