Jul 04, 2015
Web scraping for non-programmers
ITNIG | 25th September 2014
@algonpaje - www.quadrigram.com
Goal: Introduce non-programmers to APIs and scraping concepts, in a simple way
How? Using a few modules of a visual programming language called “Quadrigram”
> Quadrigram is software designed to make the practice of data analysis and data visualization more universal
> It is designed to gather, shape, and share data
> It lets you prototype and share ideas rapidly, and produce compelling solutions with data in the form of interactive visualizations, animations, or dashboards
> The Quadrigram approach to data analysis and visualization is based on a visual programming language composed of around 500 modules
Example 1: Getting financial information in real time
> Data source: http://finance.yahoo.com/
Stock Ticker Input Box
> Base URL: http://finance.yahoo.com/q?s=TEF.MC&ql=1/
1.- http://finance.yahoo.com/q?s=
2.- ticker (TEF.MC)
3.- &ql=1/
1 + 2 + 3 = Base URL
1.- Building base URL using Quadrigram
1.1.- Module “Text” (String): “http://finance.yahoo.com/q?s=”
1.2.- Module “Text Entry Box”: Input the stock ticker (eg: TEF.MC)
1.3.- Module “Text” (String): “&ql=1/”
1.4.- Module “Addition of 5 objects” concatenating 1, 2 and 3
…. result = “http://finance.yahoo.com/q?s=TEF.MC&ql=1/”
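For readers curious how this step looks in a text-based language, here is a minimal Python sketch of the same concatenation. The function name build_quote_url is made up for this illustration; the three URL pieces are the ones listed above.

```python
# The fixed pieces of the URL, as listed in steps 1.1 and 1.3.
PREFIX = "http://finance.yahoo.com/q?s="
SUFFIX = "&ql=1/"


def build_quote_url(ticker: str) -> str:
    """Join prefix + ticker + suffix (hypothetical helper, not a Quadrigram API)."""
    return PREFIX + ticker + SUFFIX


# The ticker plays the role of the "Text Entry Box" input.
print(build_quote_url("TEF.MC"))
```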
2.- Querying data
2.1.- Connect the output of “Addition of 5 Objects” (“http://finance.yahoo.com/q?s=TEF.MC&ql=1/”) to module “Query HTTP GET”
2.2.- Connect a “Periodic Pulse” module to “Query HTTP GET” to query data every “X” seconds
…. and so we get our HTML code ready to be scraped
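As a rough Python equivalent of "Query HTTP GET" driven by a "Periodic Pulse", the sketch below fetches a URL a fixed number of times with a pause between requests. The names http_get and poll are hypothetical, and the fetch function is passed in as a parameter so the loop can be tried without touching the network. The page at this URL may have changed since the talk, so treat it only as an illustration of the idea.

```python
import time
import urllib.request
from typing import Callable, List


def http_get(url: str) -> str:
    """Download a URL and return its body as text (the 'Query HTTP GET' step)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def poll(url: str, every_seconds: float, times: int,
         fetch: Callable[[str], str] = http_get) -> List[str]:
    """Call fetch(url) `times` times, sleeping `every_seconds` between calls."""
    pages = []
    for i in range(times):
        pages.append(fetch(url))
        if i < times - 1:  # no need to wait after the last request
            time.sleep(every_seconds)
    return pages


# Example call (performs real HTTP requests, so it is left commented out):
# pages = poll("http://finance.yahoo.com/q?s=TEF.MC&ql=1/", every_seconds=30, times=3)
```

Passing a different fetch function is a design choice for testing only; in normal use the default http_get does the download.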
3.- Scraping data
3.1.- Analyse the code and look for a “left - content - right” pattern.
In this case, the pattern we are looking for is:
left = <span id="yfs_l84_tef.mc">
content = stock price (real time while the market is open)
right = </span>
3.2.- Use “Scrape Text” module to extract data
“Scrape Text” inlets:
source text = HTML code (output of Query HTTP GET)
start sequence = <span id="yfs_l84_tef.mc">
end sequence = </span>
3.3.- Extract the stock price using “Extract Object from List” module
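The "left - content - right" idea can be sketched in Python as follows. This is a guess at the behaviour of a module like "Scrape Text", not its actual implementation: the function scrape_text is hypothetical, and the HTML fragment and the price 12.34 are made up for the example.

```python
from typing import List


def scrape_text(source: str, start: str, end: str) -> List[str]:
    """Return every piece of `source` found between `start` and the next `end`."""
    results = []
    pos = 0
    while True:
        begin = source.find(start, pos)
        if begin == -1:
            break  # no more start sequences
        begin += len(start)
        stop = source.find(end, begin)
        if stop == -1:
            break  # start without a matching end
        results.append(source[begin:stop])
        pos = stop + len(end)
    return results


# Hypothetical HTML fragment standing in for the downloaded page.
html = '<div><span id="yfs_l84_tef.mc">12.34</span></div>'
matches = scrape_text(html, '<span id="yfs_l84_tef.mc">', '</span>')
price = matches[0]  # the "Extract Object from List" step: take the first match
print(price)
```

The function returns a list because the pattern may match several times in one page, which is why a separate step is needed to pick a single value.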
Example 2: Build a network of similarities using “The Echonest” API
> Data source: http://developer.echonest.com/raw_tutorials/artist_api/raw_artist_02.html
> Base URL:
http://developer.echonest.com/api/v4/artist/similar?api_key=J1OPQ9MJ8G8FC19FH&name=stones
1.- http://developer.echonest.com/api/v4/artist/similar?api_key=J1OPQ9MJ8G8FC19FH&name=
2.- artist's name (“strokes”)
1 + 2 = Base URL
1.- Building base URL using Quadrigram
1.1.- Module “Text” (String): “http://developer.echonest.com/api/v4/artist/similar?api_key=J1OPQ9MJ8G8FC19FH&name=”
1.2.- Module “Text Entry Box”: Input the artist's name (eg: strokes)
1.3.- Module “Addition of 5 objects” concatenating 1 and 2
…. result = “http://developer.echonest.com/api/v4/artist/similar?api_key=J1OPQ9MJ8G8FC19FH&name=strokes”
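Plain concatenation works for a one-word name such as "strokes", but a name with spaces needs to be encoded before it goes into a URL. The Python sketch below shows one way to handle that with the standard library; build_similar_url is a hypothetical helper name, and the key is the one shown above.

```python
from urllib.parse import urlencode

BASE = "http://developer.echonest.com/api/v4/artist/similar"


def build_similar_url(api_key: str, name: str) -> str:
    """Build the query URL, encoding the parameters (hypothetical helper)."""
    return BASE + "?" + urlencode({"api_key": api_key, "name": name})


print(build_similar_url("J1OPQ9MJ8G8FC19FH", "strokes"))
# A name with a space is encoded as "the+strokes" in the query string.
print(build_similar_url("J1OPQ9MJ8G8FC19FH", "the strokes"))
```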
2.- Querying data
2.1.- Connect the output of “Addition of 5 Objects” (“http://developer.echonest.com/api/v4/artist/similar?api_key=J1OPQ9MJ8G8FC19FH&name=strokes”) to module “Query HTTP GET”
…. and so we get the response text (JSON, in this case) ready to be scraped
3.- Scraping data
3.1.- Use “Scrape Text” module to extract data
“Scrape Text” inlets:
source text = response text (output of Query HTTP GET)
start sequence = "name": "
end sequence = "},
… and we obtain the list of artists similar to the one we queried
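To see what this start/end pattern picks up, the Python sketch below applies it to a small made-up response and compares the result with parsing the same text as JSON. The sample text, its layout, and the artist names are all assumptions for illustration; a real response may be laid out differently.

```python
import json
import re

# Made-up response text; the real API output may differ in layout and fields.
sample = (
    '{"response": {"status": {"code": 0}, "artists": ['
    '{"id": "A1", "name": "Artist One"}, '
    '{"id": "A2", "name": "Artist Two"}, '
    '{"id": "A3", "name": "Artist Three"}]}}'
)

# Left/right scraping: everything between '"name": "' and the next '"},'.
pattern = re.escape('"name": "') + r"(.*?)" + re.escape('"},')
scraped = re.findall(pattern, sample)

# Alternative: parse the text as JSON and read the "name" field of each artist.
parsed = [artist["name"] for artist in json.loads(sample)["response"]["artists"]]

print(scraped)
print(parsed)
```

In this made-up sample the last entry is followed by `"}]` rather than `"},`, so the left/right pattern skips it while the JSON parser does not. If the real response has the same shape, it is worth checking that the scraped list is complete.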
4.- Build a Network of similarities
4.1.- Use “Length of List” module to count how many similar artists there are
4.2.- Use “Create List with repeated Object” module to repeat “strokes” once for each similar artist
4.3.- Create a Pair Table using “Create Custom Data Structure” module
4.4.- Convert the Pair Table to a Network using “Convert PairTable to Network” module
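The four steps above can be sketched in Python as follows. This is only an illustration of the idea, assuming the network is stored as a simple adjacency list; build_network is a hypothetical name and the artist names are placeholders.

```python
from typing import Dict, List, Tuple


def build_network(source: str, similar: List[str]) -> Tuple[List[Tuple[str, str]], Dict[str, List[str]]]:
    """Turn a query artist and its similar artists into pairs and a network."""
    count = len(similar)                   # 4.1 count the similar artists
    repeated = [source] * count            # 4.2 repeat the query name
    pairs = list(zip(repeated, similar))   # 4.3 pair table: (query, similar)
    network: Dict[str, List[str]] = {}     # 4.4 adjacency list, one entry per node
    for a, b in pairs:
        network.setdefault(a, []).append(b)
        network.setdefault(b, []).append(a)
    return pairs, network


pairs, network = build_network("strokes", ["Artist One", "Artist Two"])
print(pairs)
print(network)
```

Each pair becomes one link in the network, so the query artist ends up connected to every similar artist.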
More information: www.quadrigram.com
Thank you!!!