Scraping in Python WORKSHOP 2 | CREATOR: CHARLOTTE LLOYD

Scraping in Python - Charlotte J. Lloyd · Scraping Process // Battle Plan u 1. Surveillance u Evaluate the page, learn the terrain. u 2. Plan of Attack u Brainstorm ways to approach

Jul 16, 2020




Outline

I. Introduction to Python
II. Python Three Ways
III. Scraping Recap
IV. Workshop Example
V. Verify Data
VI. Celebration, Back-slapping


PART I: Introduction to Python


What is Python?

• general purpose
• high-level
• interpreted (not compiled)
• named after Monty Python


Very Popular Language

Check out the full infographic: http://blog.datacamp.com/wp-content/uploads/2015/05/R-vs-Python-216-2.png


Less Popular in Data Analysis

Check out the full infographic: http://blog.datacamp.com/wp-content/uploads/2015/05/R-vs-Python-216-2.png


Great Beginner Language

Check out the full infographic: http://blog.datacamp.com/wp-content/uploads/2015/05/R-vs-Python-216-2.png


Packages for Python

• Packages are bits of code that other people have built to extend Python functionality.
• If you install a package you will be able to use the additional commands that package has defined.
• Over 100,000 publicly listed packages, famously including:
  • numpy
  • scikit-learn
  • pandas
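
To make the idea of installed packages concrete, here is a minimal sketch that checks whether Python can find a package before you try to use it. The helper name `is_installed` is made up for this illustration; `importlib.util.find_spec` comes from the standard library.

```python
import importlib.util


def is_installed(package_name):
    """Return True if Python can find a top-level package with this name."""
    return importlib.util.find_spec(package_name) is not None


# Report on the three packages named above.
# Note: scikit-learn is imported under the name "sklearn".
for name in ["numpy", "sklearn", "pandas"]:
    status = "installed" if is_installed(name) else "not installed"
    print(f"{name}: {status}")
```

Once a package is installed, `import numpy` (for example) makes its commands available in your session.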


PART II: Python Three Ways


What is Anaconda?

• Anaconda is an “installation” of Python that includes:
  • package management
  • environment management
  • Python distribution
• Anaconda pre-installs over 100 packages


Three Major Ways to Use Python

1. Command Line

2. “IDE”

3. Notebook


1. “Command Line” Python

A. Run an interactive session in a shell
  1. In Terminal (Mac) or PowerShell (PC):
    1. type “python”
    2. type “2+2”
  2. In qtconsole (Anaconda Navigator): [no setup needed, Python is already running]
    • try typing “2+2”

B. Run a script (file)
  1. In Terminal (Mac) or PowerShell (PC): type “python file.py”
  2. In qtconsole (Anaconda Navigator): type “%run file.py” (“%load file.py” only pulls the file's contents into the input area; press Enter to execute them)


2. Python in IDEs

• IDE (“integrated development environment”)
  • Spyder (provided in Anaconda)
  • PyCharm
  • Xcode (Macs)
• Write code (esp. multiple files) and easily execute within the IDE.
• Activity: Write a “hello world” program in Spyder. Execute in both Spyder and Terminal/PowerShell.
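
One possible version of the activity's program is sketched below. The file name `helloworld.py` and the function name `greeting` are placeholders chosen for this example, not names from the workshop materials.

```python
# helloworld.py (hypothetical file name)


def greeting(name="world"):
    """Build the message so it can be reused or tested."""
    return f"Hello, {name}!"


if __name__ == "__main__":
    # This branch runs when the file is executed as a script.
    print(greeting())
```

In Spyder, press the Run button; in Terminal or PowerShell, change to the folder that holds the file and type `python helloworld.py`.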


3. Python Notebooks

• web-based “interactive computational environment”
• very visual, very cool
• segmented into small cells of executable code


Hands-on Demo

• Open Anaconda Navigator. Open the Jupyter Notebook.
• Navigate to “handypy.ipynb” and open.
• Topics to be covered:
  • integers, floats, and strings
  • lists
  • for and while loops
  • conditionals
  • functions
  • reading and writing csv files
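
The notebook itself is not reproduced in this transcript, so the following is only a rough tour of the listed topics. All variable names and data values are invented for illustration.

```python
import csv
import os
import tempfile

# Integers, floats, and strings
count = 3              # int
price = 2.5            # float
label = "widgets"      # str
total = count * price  # int * float gives a float

# Lists hold ordered items, indexed from 0
names = ["ada", "grace", "alan"]


def describe(name):
    """Function + conditional: classify a name by its length."""
    if len(name) > 4:
        return "long"
    return "short"


# for loop: build one row per name
rows = []
for name in names:
    rows.append([name, len(name), describe(name)])

# while loop: count down until the condition is false
remaining = count
while remaining > 0:
    remaining -= 1

# Write a csv file (in a temporary folder), then read it back
path = os.path.join(tempfile.mkdtemp(), "names.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "length", "kind"])
    writer.writerows(rows)

with open(path, newline="") as f:
    loaded = list(csv.reader(f))

print(total, label)
print(loaded)
```

Note that the csv module reads every value back as a string, so numbers need converting (for example with `int()`) after loading.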


PART III: Scraping Recap


Programming Philosophy

• Concepts are key.
• Syntax is secondary.
• Stack Overflow is your friend.


What is scraping?


Scraping Process // Battle Plan

1. Surveillance
  • Evaluate the page, learn the terrain.
2. Plan of Attack
  • Brainstorm ways to approach the enemy.
3. Write code
  • Be willing to change your strategy if you encounter obstacles or see another “weakness” to exploit.
4. Emerge bloodied, yet victorious.
  • Verify the data before all that syntax evaporates from your short-term memory.
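
The slides do not say how to verify the data, so the sketch below shows just one plausible check: confirming that every row of the scraped csv has as many columns as the header. The function name `find_ragged_rows` and the sample data are invented for this example.

```python
import csv
import io


def find_ragged_rows(csv_text):
    """Return (row_number, column_count) for rows whose width differs from the header."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    expected = len(rows[0])
    # Row 1 is the header, so data rows are numbered from 2.
    return [(number, len(row))
            for number, row in enumerate(rows[1:], start=2)
            if len(row) != expected]


# Made-up sample: the last row is missing its third column.
sample = "date,title,link\n2016-01-05,Intro,a.pdf\n2016-01-12,Part 2\n"
print(find_ragged_rows(sample))  # prints [(3, 2)]
```

An empty result means the table shape survived the scrape; spot-checking a few cells against the live page is still worthwhile.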


PART IV: Workshop Example


GOAL

• Scrape all text in the table as well as URLs to download files.
• Save data as a csv file that preserves the table format.
• Save URLs on separate lines in a txt file.


Package: BeautifulSoup
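
The workshop notebook is not included in this transcript, so the following is only a sketch of how BeautifulSoup could meet the goal. It parses a small made-up HTML string rather than the real page, and the output file names `table.csv` and `urls.txt` are assumptions. It requires the third-party `beautifulsoup4` package, which Anaconda normally includes.

```python
import csv
import os
import tempfile

from bs4 import BeautifulSoup  # third-party: the beautifulsoup4 package

# Made-up HTML standing in for the real page
html = """
<table>
  <tr><th>Date</th><th>Topic</th><th>File</th></tr>
  <tr><td>Jan 5</td><td>Overview</td><td><a href="http://example.com/a.pdf">PDF</a></td></tr>
  <tr><td>Jan 12</td><td>Products</td><td><a href="http://example.com/b.pdf">PDF</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# One list of cell texts per table row preserves the table shape
rows = []
for tr in table.find_all("tr"):
    cells = tr.find_all(["th", "td"])
    rows.append([cell.get_text(strip=True) for cell in cells])

# Collect every link target inside the table
urls = [a["href"] for a in table.find_all("a", href=True)]

# Save the table as csv and the links one per line
out_dir = tempfile.mkdtemp()
with open(os.path.join(out_dir, "table.csv"), "w", newline="") as f:
    csv.writer(f).writerows(rows)
with open(os.path.join(out_dir, "urls.txt"), "w") as f:
    f.write("\n".join(urls) + "\n")

print(rows)
print(urls)
```

On a live page the links may be relative; `urllib.parse.urljoin(page_url, href)` from the standard library turns them into full URLs before saving.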


Hands-on Demo

• Open Anaconda Navigator. Open the Jupyter Notebook.
• Navigate to “workshop2.ipynb” and open.
• http://www.goes-r.gov/users/2016-OCONUS.html


Downloading Files from urls.txt

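• Terminal
  • for i in $(cat urls.txt); do curl -O "$i"; done
• PowerShell (courtesy of Ryann & Keith!)
  • foreach ($file in Get-Content urls.txt) {echo "downloading $file"; curl.exe -O $file}
  • (curl.exe is spelled out because Windows PowerShell treats plain “curl” as an alias for Invoke-WebRequest)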
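
If you would rather stay in Python, the same loop can be approximated with the standard library. This is a sketch, not the workshop's method; the function names `filename_from_url` and `download_all` are made up for the example.

```python
import os
import shutil
import urllib.parse
import urllib.request


def filename_from_url(url):
    """Use the last path segment of the URL as the local file name."""
    path = urllib.parse.urlparse(url).path
    return os.path.basename(path) or "download"


def download_all(list_path, dest_dir="."):
    """Download every non-empty line of list_path into dest_dir."""
    with open(list_path) as f:
        urls = [line.strip() for line in f if line.strip()]
    saved = []
    for url in urls:
        target = os.path.join(dest_dir, filename_from_url(url))
        print(f"downloading {url}")
        # Stream the response to disk instead of holding it all in memory
        with urllib.request.urlopen(url) as response, open(target, "wb") as out:
            shutil.copyfileobj(response, out)
        saved.append(target)
    return saved
```

Calling `download_all("urls.txt")` from the folder that holds the list saves each file next to it, much like `curl -O` does.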


PART V: Verify Data


Make notes, be organized

• This kind of piecemeal code is hard to come back to later, so if you’ll need it again, organize it and write yourself notes.
• SYNTAX WILL LEAK OUT OF YOUR BRAIN FASTER THAN ALL THE OTHER IMPORTANT THINGS THAT YOU HAVE ALREADY FORGOTTEN YOU WERE SUPPOSED TO REMEMBER. PLEASE BELIEVE ME THAT JUST A FEW SHORT WEEKS FROM NOW YOU WILL NOT KNOW WHY YOU DID THAT THING YOU DID OR WHAT THAT VARIABLE MEANS OR WHAT THAT FUNCTION DOES AND WHY YOU APPARENTLY PUT THAT THERE. AND WHY ISN’T THIS PACKAGE UPDATE WORKING WITH MY OLD CODE AND DO I EVEN HAVE THE CORRECT THINGS INSTALLED TO MAKE THIS ALL WORK AGAIN?
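
One way to answer the “do I even have the correct things installed” question later is to record versions alongside your notes. The sketch below is a suggestion rather than part of the workshop; the function name `environment_report` and the package list are assumptions, and `importlib.metadata` needs Python 3.8 or newer.

```python
import sys
from importlib import metadata


def environment_report(packages):
    """Map 'python' and each package name to its installed version (or a marker)."""
    report = {"python": sys.version.split()[0]}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


# The package list is an assumption; list whatever your scraper imports.
# Use the names you would install (e.g. "beautifulsoup4", not "bs4").
for name, version in environment_report(["beautifulsoup4", "pandas"]).items():
    print(f"{name}=={version}")
```

Pasting the printed lines into a comment or a notes file next to the script gives your future self something to compare against.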
