Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp

Dec 15, 2015


Maegan Cornish
Transcript
Page 1: Data Acquisition: Companies & Wharton Data Basic web scraping Using APIs Session 3 Wharton Summer Tech Camp.

Data Acquisition: Companies & Wharton Data

Basic web scraping
Using APIs

Session 3
Wharton Summer Tech Camp

Page 2

Set up problems

• Mac
  – Mostly no problems, thanks to its Linux-like environment and great support
• Windows on MobaXterm
  – You can use apt-cyg to install everything
  – apt-cyg install python
  – apt-cyg install idle
  – apt-cyg install idlex

Page 3

REGEX CHALLENGE!

• 3 regex challenges
• 1 is from a well-known t-shirt joke (if you know it, don’t say anything)
• 2 are song lyrics (I tried to find well-known songs)
• Raise your hand to say the answer

Page 4

A t-shirt people wear

r"(bb|[^b]{2})"

Difficulty: *
Hint: Phrase

Page 5

A t-shirt people wear

r"(bb|[^b]{2})"

“To be or not to be”

Difficulty: *
Hint: Phrase
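The t-shirt pattern reads as "two b's, or two characters that are not b". A minimal sketch for checking that reading with Python's re module; the helper name matches_whole and the candidate strings are made up for illustration:

```python
import re

# The challenge pattern: either "bb" or any two characters that are not "b".
PATTERN = re.compile(r"(bb|[^b]{2})")

def matches_whole(text):
    """Return True if the entire string matches the challenge pattern."""
    return PATTERN.fullmatch(text) is not None

# Illustrative candidates (not from the slides).
for candidate in ["bb", "xy", "ab", "b"]:
    print(candidate, matches_whole(candidate))
```

fullmatch is used rather than search so that a longer string containing a matching pair does not count as a hit.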

Page 6

Challenge 2

Difficulty: *****
Hint: This is literally the entire lyric for the song

r"(\w+ [a-z]{3} w..ld ){144}"

Page 7

Challenge 2

Difficulty: ****
Hint: This is literally the entire lyric for the song
Hint 2: It’s a song by the music duo who created the latest Record of the Year

r"(ar\w{4} [a-z]{3} w..ld ){144}"

Page 8

Challenge 2

Difficulty: *****
Hint: This is literally the entire lyric for the song
Hint 2: It’s a song by the music duo who created the latest Record of the Year

r"(\w+ [a-z]{3} w..ld ){144}"

Around the World – by Daft Punk

Page 9

Challenge 3

Difficulty: **
Hint: Lyric of an old song

r"ah, ((ba ){4} (bar){2}a an{2} \s)+"

Page 10

Challenge 3

Difficulty: **

r"ah, ((ba ){4} (bar){2}a an{2} \s)+"

Ah, Ba ba ba ba Barbara Ann~ Ah, Ba ba ba ba Barbara Ann~

Page 11

Song Phrases

Ever since I learned regex, I have thought that many Daft Punk songs are optimized for regex.

Lyrics for a song in its entirety with one simple regex:
• r"(Around the world ){144}" – Around the World
• r"((buy|use|break|fix|trash|change) it )+ now upgrade it" – Technologic
• r"(((work|make|do|makes|more) (it|us|than) (harder|better|faster|stronger|ever))+ hour after hour work is never over. \s)+" – Harder, Better, Faster, Stronger
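As a quick check of the first pattern, the sketch below builds the phrase repeated 144 times and confirms that only the exact repetition count matches. The variable names are illustrative, and the lyric string is generated here rather than taken from a real lyric sheet:

```python
import re

# First pattern from the slide: the phrase repeated exactly 144 times.
AROUND_THE_WORLD = re.compile(r"(Around the world ){144}")

# Generated text standing in for the full lyric.
full_lyric = "Around the world " * 144
one_short = "Around the world " * 143

is_full_match = AROUND_THE_WORLD.fullmatch(full_lyric) is not None
is_short_match = AROUND_THE_WORLD.fullmatch(one_short) is not None
print(is_full_match, is_short_match)
```

The {144} quantifier is exact, so 143 repetitions fail while 144 succeed.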

Page 12

THE BIGGEST concern for doctoral students doing empirical work (years 2-4):
“WHERE AND HOW DO I GET THE DATA?!”

Mr. Data: “I believe what you are experiencing is frustration”

Page 13

Data sources
1. Companies
2. Wharton Organizations
3. Scraping Web
4. APIs: application programming interface

Page 14

DATA SOURCES

1. Companies – HARD, UNIQUE
   – Hardest, but once you get a good company, you are set for a paper or two or more…
2. Wharton Organizations
   – (WRDS) EASY, COMMON: great for auxiliary data. Other people can also easily access this data, so it has probably been used already.
   – (WCAI) EASY, UNIQUE: the data is actually pretty great, and only a few select teams get it after a proposal review process.
3. Scraping Web (WGET/REGEX/tools) – MEDIUM, MEDIUM
   – Relatively easy, but painful for big projects and sometimes not allowed by the website.
4. APIs: application programming interface – EASY, COMMON
   – Easy, but restricted to what the company made available.

Page 15

Resources for Public Data

• There are many lists of lists for public data
• Find a link to a list of lists for data on the course website under “resources for learning”
• If you have a good source, please email me so I can link it on the web

Page 16

Companies

Page 17

Quick tips

• Don’t be afraid to contact random companies
• Attend conferences and network like an MBA; think of it like a game
• Send a short 2-3 page proposal suggesting a research collaboration
• Read about the company you are contacting and make sure to offer something that interests the company
• Low success probability. Among the many proposals I’ve sent (30+ if you count emails):
  – Mostly no response.
  – 1 company I had been working with for 10 months decided to drop the ball after its CTO changed twice.
  – 4 gave data very easily, but it was not useful or suitable for research.
  – 2 gave very useful data that I am currently working with.
  – 1 company is disputing the NDA.
• NDAs: you can request help from the UPenn legal team here
  – https://medley05.isc-seo.upenn.edu/researchInventory/jsp/fast2.do?bhcp=1

Page 18

NDAs are super important

• A horror story I heard
  – A student worked with a company for 1+ year, and then the company decided that the result was too good to publish. It wanted the result to be a trade secret/IP.
  – The NDA the student had signed was bad.
  – No publication.
  – Most NDAs are OK but some are not. If yours is bad, get help from the link on the previous slide and negotiate.
  – Look out for “work for hire” type NDAs

Page 19

Wharton Specific

Page 20

Wharton Specific

You probably heard about these organizations at the Wharton doctoral orientation.
• WRDS: Wharton Research Data Services
  – https://wrds-web.wharton.upenn.edu/wrds/
• WCAI: Wharton Customer Analytics Initiative
  – http://www.wharton.upenn.edu/wcai/
• Other organizations exist, but mostly for conferences and not for data.
  – http://www.wharton.upenn.edu/faculty/research-centers-and-initiativ.cfm

Page 21

Basic Web Scraping

Page 22

Caveats

• I spent time writing and testing scraping code for this course: you input a list of music artists in CSV format, and the script queries allmusic.com to obtain information such as the genres associated with each artist.
• Written in March 2013.
• In July, it broke because allmusic.com had updated their website…
• This is one problem with scraping: you never know when it will stop working, and then you have to rewrite it.

Page 23

Outline of basic scraping

1. CRAWLING: Instead of using a web browser, use scripts to access HTML (XML, etc.). Or crawl through a website recursively and download all the HTML or text files or whatever you need. (WGET, Python, or any other language such as PHP)

2. PATTERN SEARCHING: The researcher looks at the raw HTTP output, finds where the required data is, and figures out what the pattern is. (Firefox developer toolbox)

3. EXTRACTION: Use a text-extraction tool to extract the information and store it! (If it’s in a structured format such as XML, use the appropriate tools for that format.) (REGEX, Apache Lucene, SED, AWK, etc.)

4. Go publish papers with the data
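Steps 2 and 3 of the outline can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the HTML snippet, its class names, and the helper extract_records are all hypothetical stand-ins for whatever step 1 would actually download, not the course's own script.

```python
import re

# Hypothetical page content standing in for what step 1 (crawling) would download.
RAW_HTML = """
<ul class="artists">
  <li><span class="name">Daft Punk</span> <span class="genre">Electronic</span></li>
  <li><span class="name">The Beach Boys</span> <span class="genre">Rock</span></li>
</ul>
"""

# Step 2 (pattern searching): by inspection, each record is a "name" span
# followed by a "genre" span, so that is the pattern we encode.
RECORD = re.compile(
    r'<span class="name">(?P<name>[^<]+)</span>\s*'
    r'<span class="genre">(?P<genre>[^<]+)</span>'
)

def extract_records(html):
    """Step 3 (extraction): return a list of (name, genre) tuples."""
    return [(m.group("name"), m.group("genre")) for m in RECORD.finditer(html)]

records = extract_records(RAW_HTML)
for name, genre in records:
    print(name, "-", genre)
```

The pattern is tied to the exact markup, which is why a site redesign can silently break a scraper like this.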

Page 24

Alternatives

• Want something easier or with a GUI?
  – MOZENDA: Wharton has a license and it’s cheap
• More advanced scraping
  – We will cover this next week with Scrapy
• There are many other tools and packages for this.
  – http://en.wikipedia.org/wiki/Web_crawler
  – http://stackoverflow.com/questions/419235/anyone-know-of-a-good-python-based-web-crawler-that-i-could-use

Page 25

Tools used in our examples

• WGET + Python
• REGEX
• HTML/DOM inspector
  – Firefox has the Web Developer's Toolbox, an add-on you can download.
  – This is useful for finding the pattern of the data you want to extract

Page 26

Scraping Example 1

• Facebook SEC filing exploration
  – Purpose: exploration before research
  – What this toy example does: gets the SEC filings for Facebook and extracts certain parts
  – I am interested in reading a few words before and after each mention of “shares”
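One way to pull a few words of context around each mention of "shares" is a regex with bounded word groups on either side. The sketch below is an assumption about how this could be done, not the course's extractPhrase.py; the function name words_around and the sample sentence are made up for illustration.

```python
import re

def words_around(text, keyword, n_words=3):
    """Return each keyword hit with up to n_words words of context on both sides."""
    pattern = re.compile(
        r"(?:\S+\s+){0,%d}\b%s\b(?:\s+\S+){0,%d}"
        % (n_words, re.escape(keyword), n_words),
        re.IGNORECASE,
    )
    return [m.group(0) for m in pattern.finditer(text)]

# Made-up sentence standing in for text from a downloaded filing.
sample = "The company issued 1,000 shares of Class A common stock to employees."
snippets = words_around(sample, "shares", n_words=2)
print(snippets)
```

Because matches do not overlap, two mentions that sit closer together than the window size share context rather than producing duplicated words.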

Page 27

DOWNLOAD HTMLS/TXT/JPG/ETC

• WGET
  “GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP, the most widely-used Internet protocols. It is a non-interactive commandline tool, so it may easily be called from scripts, cron jobs, terminals without X-Windows support, etc.”

Fire up edgarFBarchive.sh and extractPhrase.py

Page 28

WGET FB’s SEC filings

wget -r -l1 -H -t1 -nd -N -np -A.txt -e robots=off http://www.sec.gov/Archives/edgar/data/1326801/

• -r -H -l1 -np: these options tell wget to download recursively (-r), follow links to other hosts (-H), go only one level deep (-l1), and never ascend to the parent directory (-np).
• -t1: try each download only once.
• -nd: no directories. Keep the downloaded files in one folder.
• -N: timestamping. Skip files that are not newer than the local copy.
• -A.txt: only download .txt files.
• -e robots=off: ignore robots.txt. (Avoid this option if wget works without it. If you do use it, also use the --wait option, or your IP may get banned.)
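Since the examples pair WGET with Python, here is a minimal sketch of assembling the same command as an argument list from Python, with a --wait delay added and the robots override left out, following the advice above. The function name build_wget_command and the 2-second default are assumptions for illustration; the sketch only prints the command and does not download anything.

```python
import shlex

def build_wget_command(url, wait_seconds=2, accept=".txt"):
    """Return the wget argument list from the slide plus a polite --wait delay."""
    return [
        "wget", "-r", "-l1", "-H", "-t1", "-nd", "-N", "-np",
        "-A" + accept,                 # only keep files with this suffix
        "--wait=%d" % wait_seconds,    # pause between requests
        url,
    ]

command = build_wget_command("http://www.sec.gov/Archives/edgar/data/1326801/")
print(" ".join(shlex.quote(arg) for arg in command))
# To actually download, pass the list to subprocess.run(command, check=True).
```

Passing a list instead of a single shell string avoids quoting problems when URLs contain characters such as & or ?.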

Page 29

Caveats

• WGET only works well for certain websites. You can use it to download all photos, etc. But if your script makes too many requests, the site may ban your IP. You can specify delayed requests.
• Once a website gets fancy, you have to use other tools such as PHP or Python packages
  – ASP
  – POST (as opposed to the GET method in HTTP)
  – JavaScript-generated sites
  – AJAX sites
• This is a toy example for learning. You can still use this method for simple scraping, but consider learning pro tools (we’ll cover the basics of one such tool next week)

Page 30

Scraping Example 2

• Jambase.com concert venues
  – This example takes a list of artists and queries jambase.com to get concert venue information.
  – Another toy example

Page 31
Page 32

Fire up getConcertVenue.py

Page 33

API (Application Programming Interface)

Page 34

Programmable Web

• programmableweb.com
  – Search engine for freely available APIs online
  – http://blog.programmableweb.com/2012/02/15/40-real-estate-apis-zillow-trulia-walk-score/
  – Usage examples
• Usually, you have to apply for API keys from the website or the company offering the data
• Mostly free (limited queries)

Page 35

Idea behind API

1. You obtain a key from the company offering the data
2. Make requests for data
   – Many different ways, depending on the API
3. Company server grants you the data
4. Data analysis

Page 36

Commonly Used Protocol in API

• REST (REpresentational State Transfer)
  – Guidelines for client-server interaction for exchanging data, as opposed to the alternative, SOAP
• I recommend this funny explanation of REST vs SOAP (diagram involving Martin Lawrence)
  – http://stackoverflow.com/questions/209905/representational-state-transfer-rest-and-simple-object-access-protocol-soap
• Based on HTTP
• You request data via the HTTP GET method (http://www.w3schools.com/tags/ref_httpmethods.asp) and the server gives you the data
  – HTTP-URL?QueryStrings
  – QueryStrings: Field=Value pairs separated by &
  – E.g. http://www.youtube.com/watch?v=5pidokakU4I&t=0m38s
  – v: stands for video = some value
  – t: stands for start time = some value
• Usual data formats
  – XML: eXtensible Markup Language http://www.w3schools.com/xml/
  – JSON: JavaScript Object Notation http://www.w3schools.com/json/
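The Field=Value structure of a query string can be seen by taking the YouTube URL from this slide apart and putting it back together with Python's standard urllib.parse module. A minimal sketch; the variable names are illustrative:

```python
from urllib.parse import urlsplit, parse_qs, urlencode

url = "http://www.youtube.com/watch?v=5pidokakU4I&t=0m38s"

# Split the URL and decode the query string into a dict of field -> list of values.
parts = urlsplit(url)
fields = parse_qs(parts.query)
print(parts.path)
print(fields)

# Going the other way: encode the fields and append them after "?".
query = urlencode({"v": "5pidokakU4I", "t": "0m38s"})
rebuilt = "http://www.youtube.com/watch?" + query
print(rebuilt == url)
```

parse_qs returns a list per field because the same field may legally appear more than once in a query string.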

Page 37

XML Example

<CATALOG>
  <PLANT>
    <COMMON>Bloodroot</COMMON>
    <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
    <ZONE>4</ZONE>
    <LIGHT>Mostly Shady</LIGHT>
    <PRICE>$2.44</PRICE>
    <AVAILABILITY>031599</AVAILABILITY>
  </PLANT>
  <PLANT>
    <COMMON>Columbine</COMMON>
    <BOTANICAL>Aquilegia canadensis</BOTANICAL>
    <ZONE>3</ZONE>
    <LIGHT>Mostly Shady</LIGHT>
    <PRICE>$9.37</PRICE>
    <AVAILABILITY>030699</AVAILABILITY>
  </PLANT>
</CATALOG>

Many XML-related packages: http://wiki.python.org/moin/PythonXml
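As one possible way to read such a catalog, the sketch below parses the same XML with xml.etree.ElementTree from the Python standard library. ElementTree is just one of many XML options and is chosen here as an assumption; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

CATALOG_XML = """<CATALOG>
  <PLANT>
    <COMMON>Bloodroot</COMMON>
    <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
    <ZONE>4</ZONE>
    <LIGHT>Mostly Shady</LIGHT>
    <PRICE>$2.44</PRICE>
    <AVAILABILITY>031599</AVAILABILITY>
  </PLANT>
  <PLANT>
    <COMMON>Columbine</COMMON>
    <BOTANICAL>Aquilegia canadensis</BOTANICAL>
    <ZONE>3</ZONE>
    <LIGHT>Mostly Shady</LIGHT>
    <PRICE>$9.37</PRICE>
    <AVAILABILITY>030699</AVAILABILITY>
  </PLANT>
</CATALOG>"""

root = ET.fromstring(CATALOG_XML)

# Collect (common name, price as a number) for every PLANT element.
plants = []
for plant in root.findall("PLANT"):
    name = plant.findtext("COMMON")
    price = float(plant.findtext("PRICE").lstrip("$"))
    plants.append((name, price))

print(plants)
```

Because the format is structured, no regex is needed: elements are addressed by tag name.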

Page 38

JSON Example (just like python)

newObject = {
  "first": "Ted",
  "last": "Logan",
  "age": 17,
  "sex": "M",
  "salary": 0,
  "registered": false,
  "interests": ["Van Halen", "Being Excellent", "Partying"]
}

Main Python module: import json
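A short sketch of loading the object above with the json module. It shows where JSON and Python differ slightly: JSON's false becomes Python's False. The variable names are illustrative.

```python
import json

text = (
    '{"first": "Ted", "last": "Logan", "age": 17, "sex": "M", "salary": 0, '
    '"registered": false, "interests": ["Van Halen", "Being Excellent", "Partying"]}'
)

# json.loads turns the JSON text into ordinary Python dicts, lists, and scalars.
person = json.loads(text)
print(person["first"], person["registered"], person["interests"][0])

# json.dumps goes the other way, back to a JSON string.
round_trip = json.loads(json.dumps(person))
print(round_trip == person)
```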

Page 39

Yahoo Finance Data Example

Page 40

Python Package Wrapper

• Yahoo provides a simple web interface for anyone to download stock information via URL
  – http://finance.yahoo.com/d/quotes.csv?s=%s&f=%s
  – s: symbol, e.g. “GOOG”
  – f: stat (e.g. l1 means last trade price)
• http://finance.yahoo.com/d/quotes.csv?s=GOOG&f=l1
• More info here
  – http://www.gummy-stuff.org/Yahoo-data.htm (ordered to be taken down)
  – http://web.archive.org/web/20140325063520/http://www.gummy-stuff.org/Yahoo-data.htm
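Filling in the s and f fields of the URL template is all a wrapper needs to do before making the request. The sketch below only builds the URL and makes no network call, since the endpoint may no longer be served; the function name quote_url is hypothetical.

```python
from urllib.parse import urlencode

# Base address taken from the slide; whether it still responds is not assumed.
BASE_URL = "http://finance.yahoo.com/d/quotes.csv"

def quote_url(symbol, stat="l1"):
    """Build the quotes.csv URL for one symbol and one stat code."""
    return BASE_URL + "?" + urlencode({"s": symbol, "f": stat})

goog_url = quote_url("GOOG")
print(goog_url)
```

Using urlencode instead of plain string formatting means symbols containing characters such as ^ or & are escaped correctly.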

Page 41

This Wrapper Package does it for you

• ystockquote
  – https://pypi.python.org/pypi/ystockquote/0.2.3
  – https://github.com/cgoldberg/ystockquote
• See the simple source code to learn
• Open up ystock.py

Page 42

Example: YQL

• http://developer.yahoo.com/yql/
• APIs are written by individual companies and support different I/O and usually different languages.
• Yahoo Query Language is a simple interface that Yahoo has made available to developers, combining several APIs
• “Yahoo! Query Language (YQL) enables you to access Internet data with SQL-like commands.”
• Apply for your API key
  – http://developer.yahoo.com/yql/

Page 43

Our example: BBYOPEN

• https://bbyopen.com/bbyopen-apis-overview
• Retail information
  – Archive query: returns a single file containing all attributes for all items exposed by the given API
  – Basic query: returns information about a single item
  – Advanced query: returns information about one or more items according to your specifications
  – Store availability query: returns information about products available at specific stores
• Best Buy provides this API
• API overview
  – https://developer.bestbuy.com/get-started

Page 44

Basic Query

Basic query structure
http://api.remix.bestbuy.com/API/Item.Format?show=&apiKey=Key

API - One of {products, stores, reviews, categories}
Item - The value of the fundamental attribute for the selected API:
  o products - sku
  o stores - storeId
  o reviews - id
  o categories - id
Format - One of {xml, json}
show= - (optional) The item attributes you want displayed
Key - Your API key

Note: show= and Key can be specified in either order.
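The query structure above can be turned into a small URL builder. This is a sketch based only on the template on this slide, not the course's bestbuyAPI.py: the function name basic_query_url, the SKU 1234567, the attribute name, and the key YOUR_API_KEY are all placeholders, and nothing is sent over the network.

```python
from urllib.parse import urlencode

BASE = "http://api.remix.bestbuy.com"
APIS = {"products", "stores", "reviews", "categories"}
FORMATS = {"xml", "json"}

def basic_query_url(api, item, fmt, api_key, show=None):
    """Fill in the API/Item.Format?show=&apiKey=Key template from the slide."""
    if api not in APIS:
        raise ValueError("api must be one of %s" % sorted(APIS))
    if fmt not in FORMATS:
        raise ValueError("fmt must be one of %s" % sorted(FORMATS))
    params = {}
    if show:  # show= is optional
        params["show"] = show
    params["apiKey"] = api_key
    return "%s/%s/%s.%s?%s" % (BASE, api, item, fmt, urlencode(params))

# Placeholder SKU, attribute, and key for illustration only.
example_url = basic_query_url("products", "1234567", "json", "YOUR_API_KEY", show="name")
print(example_url)
```

Validating api and fmt up front gives a clear error before any request is made, which helps when each request counts against a query limit.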

Page 45

Basic Query Examples

Page 46

API example

• Open up bestbuyAPI.py

Page 47

Lab session

• For the next 10-15 minutes, choose your favorite website and try to scrape a few items
• We’ll do this again with Scrapy

Page 48

Data isn’t impossibly hard to get after all. There are many routes, but it could take a LONG time (especially if you are going the company route). START EARLY and you’ll get that data.

DATA!

Page 49

Next Session

• Hugh will be speaking about HPCC
• After that, we will learn the basics of Scrapy
• Brush up on your HTML and look into XPath
  – W3Schools.com is the best
• Intro to Big Data and Empirical Business Research