TCPD Summer School #2: Web Scraping

Jérémy Richard - [email protected]

Alexandre Chevallier - [email protected]

July 2017

Outline

Web scraping

Web page

BeautifulSoup Library

Practical Works

A. Chevallier & J. Richard (CDSP), TCPD Summer School 2017 - 11/07/2017

Web Scraping - What is it?

Data scraping?

● Automated process
● Explore and download raw data
● Grab content
● Convert data into a usable format for analysis
● Store data in a database or text file

Web Scraping = Data Scraping of web pages

Web Scraping - What is a web page?

Components of a web page

● HTML - Organizes and contains the main content of a web page
● CSS - Adds styling to make the page look nicer
● JS - JavaScript files add interactivity to web pages
● Media files - Images, sounds, videos, etc.

Interesting content for web scraping = HTML
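
To make the split between components concrete, here is a rough sketch that counts the tags of a small page by the component they belong to. The sample page, the tag-to-component mapping and all names are assumptions made for illustration (for instance, a <link> tag does not always point to a stylesheet). It uses the standard library's html.parser rather than BeautifulSoup so that it runs without any installation.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical page used only for illustration.
SAMPLE_PAGE = """
<html>
  <head>
    <link rel="stylesheet" href="style.css">
    <script src="app.js"></script>
  </head>
  <body>
    <h1>Election results</h1>
    <p>Some text.</p>
    <img src="map.png">
  </body>
</html>
"""

# Simplified mapping from tag name to component (an assumption of this sketch).
COMPONENT_BY_TAG = {
    "link": "css", "style": "css",
    "script": "js",
    "img": "media", "audio": "media", "video": "media",
}


class ComponentCounter(HTMLParser):
    """Counts opening tags, grouped by the component they bring in."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Any tag not listed above is treated as plain HTML content.
        self.counts[COMPONENT_BY_TAG.get(tag, "html")] += 1


def count_components(page):
    counter = ComponentCounter()
    counter.feed(page)
    return dict(counter.counts)


component_counts = count_components(SAMPLE_PAGE)
print(component_counts)
```

In this sample, most tags fall in the "html" group, which is the part a scraper reads.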

Web Scraping - HTML

HTML is used to create documents on the Web

Very simple and logical

NOT a programming language but a markup language that uses <tags> like this

The websites you view are basically HTML files rendered by web browsers

Web Scraping - HTML

HTML is organized like a hierarchical tree

Source: Frances Zlotnick
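
One way to see the tree is to print each opening tag indented by its depth. The sketch below does this with the standard library's html.parser. The sample page and the names TreePrinter and outline are hypothetical, and the depth counter assumes well-formed HTML in which every tag is closed.

```python
from html.parser import HTMLParser

# Hypothetical sample page, not taken from the slides.
SAMPLE_HTML = """
<html>
  <head><title>Demo</title></head>
  <body>
    <h1>Results</h1>
    <table>
      <tr><th>Party</th><th>Seats</th></tr>
      <tr><td>A</td><td>10</td></tr>
    </table>
  </body>
</html>
"""


class TreePrinter(HTMLParser):
    """Collects one indented line per opening tag."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1  # children of this tag sit one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1  # back to the parent's level


def outline(html):
    printer = TreePrinter()
    printer.feed(html)
    return printer.lines


tree_lines = outline(SAMPLE_HTML)
print("\n".join(tree_lines))
```

Each level of indentation corresponds to one step down the tree: the table sits inside body, the rows inside the table, and the cells inside the rows.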

Web Scraping - Inspect the source

Inspect element

Find HTML node

<table> defines a table
<tr> defines a row in a table
<th> defines a table header cell
<td> defines a cell in a table

Use BeautifulSoup to grab it

Web Scraping - BeautifulSoup

Python library

Pulls data out of HTML/XML files

Designed for quick turnaround projects

Packed with some superb methods

Open-source, free & well documented

Web Scraping - Jump into the code

Grab a node with BeautifulSoup

# Import libraries (this code targets Python 2 and BeautifulSoup 3)
from BeautifulSoup import BeautifulSoup
import urllib

# Download data
raw_html = urllib.urlopen('http://www.elections.in/delhi/mcd-elections/').read()

# Instantiate BeautifulSoup object
soup = BeautifulSoup(raw_html)

# Access the data: first node with class 'tableizer-table', then its rows
attrs = { 'class':'tableizer-table' }
tables = soup.findAll(attrs=attrs)
table = tables[0]
rows = table.findAll('tr')
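
The slide's code uses the Python 2 urllib module and the BeautifulSoup 3 import. A possible Python 3 equivalent, assuming the beautifulsoup4 package (imported as bs4) is installed, is sketched below. To avoid depending on the live site, it parses a small inline table: only the class name 'tableizer-table' comes from the slide, while the sample HTML and the helper names download and table_rows are hypothetical.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Hypothetical stand-in for the downloaded page; only the class name
# 'tableizer-table' comes from the slide.
SAMPLE_HTML = """
<table class="tableizer-table">
  <tr><th>Party</th><th>Seats</th></tr>
  <tr><td>Party A</td><td>10</td></tr>
  <tr><td>Party B</td><td>7</td></tr>
</table>
"""


def download(url):
    """Return the raw bytes of a page (Python 3 counterpart of urllib.urlopen)."""
    with urlopen(url) as response:
        return response.read()


def table_rows(raw_html, css_class="tableizer-table"):
    """Return the <tr> nodes of the first node carrying css_class."""
    soup = BeautifulSoup(raw_html, "html.parser")
    tables = soup.find_all(attrs={"class": css_class})
    if not tables:
        return []
    return tables[0].find_all("tr")


rows = table_rows(SAMPLE_HTML)
# Text of every header or data cell, row by row.
cells = [
    [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    for row in rows
]
print(cells)
```

To run it against a real page, pass the result of download(url) to table_rows instead of SAMPLE_HTML.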

Web Scraping - Jump into the code

Use grabbed data to write a CSV file

# Import the CSV library
import csv

# Open a file with write permissions and handle it with the CSV library's methods
# ('wb' is the file mode the csv module expects in Python 2)
with open('export.csv', 'wb') as f:
    writer = csv.writer(f, delimiter=';')
    # Loop over the rows and collect the text of each table cell in a Python list
    for row in rows:
        csv_row = []
        headers = row.findAll('th')
        for header in headers:
            csv_row.append(header.text)
        cells = row.findAll('td')
        for cell in cells:
            csv_row.append(cell.text)
        # Write the list to the CSV file
        writer.writerow(csv_row)
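
In Python 3 the csv module expects a file opened in text mode with newline='' rather than 'wb'. The sketch below shows that variant and reads the file back to check the result. The rows are hypothetical stand-ins for the text of the scraped cells, the helper names write_rows and read_rows are assumptions, and a temporary directory is used so that nothing is left on disk.

```python
import csv
import os
import tempfile

# Hypothetical rows, standing in for the text of the scraped cells.
table = [
    ["Party", "Seats"],
    ["Party A", "10"],
    ["Party B", "7"],
]


def write_rows(rows, path):
    """Write rows to a semicolon-delimited CSV file (Python 3 file mode)."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter=";")
        writer.writerows(rows)


def read_rows(path):
    """Read the file back as a list of lists of strings."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.reader(f, delimiter=";"))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "export.csv")
    write_rows(table, path)
    round_trip = read_rows(path)

print(round_trip)
```

Reading the file back with the same delimiter is a quick way to confirm that every row was written as intended.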

Web Scraping - Jump into the code

Extraction result

Let’s play!

https://ashoka.cdsp.sciences-po.fr
