javier ramirez @supercoco9
Big Data Analytics with Google BigQuery
javier ramirez @supercoco9 https://teowaki.com nosqlmatters 2013
REST API +
AngularJS web as an API client
data that’s an order of magnitude greater than data you’re accustomed to
Doug Laney VP Research, Business Analytics and Performance Management at Gartner
data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the structures of your database architectures.
Edd Dumbill, program chair for the O’Reilly Strata Conference
bigdata is doing a full scan over 330MM rows, matching them against a regexp, and getting the result (223MM rows) in just 5 seconds
Javier Ramirez, impressionable teowaki founder
bigdata is cool but...
expensive cluster
hard to set up and monitor
not interactive enough
Our choice:
Google BigQuery
Data analysis as a service
http://developers.google.com/bigquery
Based on Dremel
Specifically designed for interactive queries over petabytes of real-time data
What Dremel is used for in Google
• Analysis of crawled web documents.
• Tracking install data for applications on Android Market.
• Crash reporting for Google products.
• OCR results from Google Books.
• Spam analysis.
• Debugging of map tiles on Google Maps.
• Tablet migrations in managed Bigtable instances.
• Results of tests run on Google’s distributed build system.
• Disk I/O statistics for hundreds of thousands of disks.
• Resource monitoring for jobs run in Google’s data centers.
• Symbols and dependencies in Google’s codebase.
Columnar storage
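A toy sketch of why columnar storage matters for analytical queries: a query that needs one field only touches that column, while a row-oriented layout reads whole records. The records, field names and helper functions below are made up for illustration and are not how BigQuery stores data internally.

```python
# Toy comparison of row-oriented and column-oriented layouts.
# The sample records and field names are hypothetical.
rows = [
    {"uri": "/users/javier/shouts", "method": "POST", "status": 201},
    {"uri": "/users/rgo/shouts", "method": "GET", "status": 200},
    {"uri": "/teams/javier-community/links", "method": "POST", "status": 201},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per field."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


def values_touched_row_store(records):
    """A row store reads every field of every record, whatever the query needs."""
    return sum(len(record) for record in records)


def values_touched_column_store(columns, field):
    """A column store reads only the values of the requested column."""
    return len(columns[field])


columns = to_columns(rows)
# Counting POST requests only needs the "method" column.
post_count = sum(1 for method in columns["method"] if method == "POST")
```

With three fields per record, the column layout touches a third of the values for this query, which is the same effect the pricing slide shows when a full scan over one column processes far less data than a full scan of the table.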
highly distributed execution using a tree
javier ramirez @supercoco9 https://teowaki.com rubyc kiev 14
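A minimal sketch of the idea behind tree-based execution, assuming a simple COUNT with a filter: leaves compute partial counts over their own shard, and each level of the tree only adds up what its children return. The function names and the `fanout` parameter are hypothetical; the real serving tree is far more elaborate.

```python
def leaf_count(shard, predicate):
    """Work done at a leaf: scan one shard and count the matching rows."""
    return sum(1 for row in shard if predicate(row))


def tree_count(shards, predicate, fanout=2):
    """Combine partial counts level by level until a single total remains."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    partials = [leaf_count(shard, predicate) for shard in shards]
    while len(partials) > 1:
        # Each intermediate node sums the results of up to `fanout` children.
        partials = [
            sum(partials[i:i + fanout])
            for i in range(0, len(partials), fanout)
        ]
    return partials[0] if partials else 0


shards = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10]]
even_total = tree_count(shards, lambda value: value % 2 == 0)
```

No node ever sees more than a handful of small partial results, which is what lets the scan itself be spread over many machines.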
loading data
You can feed flat CSV-like files or nested JSON objects
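A short sketch of preparing both input shapes in Python, assuming tab-delimited text for the flat case and newline-delimited JSON (one object per line) for the nested case. The record contents and function names are hypothetical.

```python
import csv
import io
import json


def to_tsv(records, fields):
    """Serialize flat records as tab-delimited lines, one record per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for record in records:
        writer.writerow([record[field] for field in fields])
    return buffer.getvalue()


def to_ndjson(records):
    """Serialize (possibly nested) records as newline-delimited JSON."""
    return "".join(json.dumps(record, sort_keys=True) + "\n" for record in records)


flat = [{"uri": "/users/javier/shouts", "method": "POST"}]
nested = [{"user": {"name": "javier"}, "hits": 1}]

tsv_text = to_tsv(flat, ["uri", "method"])
ndjson_text = to_ndjson(nested)
```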
bq cli
bq load --nosynchronous_mode --encoding UTF-8 --field_delimiter 'tab' --max_bad_records 100 --source_format CSV api.stats 20131014T11-42-05Z.gz
web console screenshot
analytical SQL functions.
correlations.
window functions.
views.
JSON fields.
timestamped tables.
Things you always wanted to try but were too scared to
select count(*) from publicdata:samples.wikipedia
where REGEXP_MATCH(title, "[0-9]*") AND wp_namespace = 0;
223,163,387
Query complete (5.6s elapsed, 9.13 GB processed, Cost: 32¢)
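A side note on the pattern, sketched with Python's `re` module as a stand-in for BigQuery's regexp engine and assuming REGEXP_MATCH performs an unanchored search: `[0-9]*` also matches the empty string, so it accepts every non-null title. If the intent was "titles containing a digit" (a guess, not something the slide states), `[0-9]+` would be the stricter pattern. The sample titles are made up.

```python
import re

# Hypothetical sample titles; the last one is deliberately empty.
titles = ["Madrid", "1984", "Apollo 11", ""]


def count_matches(pattern, values):
    """Count values where the pattern is found anywhere (unanchored search)."""
    return sum(1 for value in values if re.search(pattern, value) is not None)


star_matches = count_matches(r"[0-9]*", titles)  # zero or more digits
plus_matches = count_matches(r"[0-9]+", titles)  # at least one digit
```

Either way the query still forces a scan of the whole `title` column, which is the point the slide is making.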
SELECT repository_name, repository_language, repository_description,
  COUNT(repository_name) as cnt, repository_url
FROM github.timeline
WHERE type="WatchEvent"
AND PARSE_UTC_USEC(created_at) >= PARSE_UTC_USEC("#{yesterday} 20:00:00")
AND repository_url IN (
  SELECT repository_url
  FROM github.timeline
  WHERE type="CreateEvent"
  AND PARSE_UTC_USEC(repository_created_at) >= PARSE_UTC_USEC('#{yesterday} 20:00:00')
  AND repository_fork = "false"
  AND payload_ref_type = "repository"
  GROUP BY repository_url
)
GROUP BY repository_name, repository_language, repository_description, repository_url
HAVING cnt >= 5
ORDER BY cnt DESC
LIMIT 25
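The `#{yesterday}` placeholder in the query is string interpolation done by the calling program before the SQL is sent. A minimal Python sketch of that step, assuming the placeholder holds a `YYYY-MM-DD` date, plus a local stand-in for what PARSE_UTC_USEC computes (microseconds since the Unix epoch, in UTC). Both function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def yesterday_at(hour, now):
    """Build the 'YYYY-MM-DD HH:00:00' string for the day before `now`."""
    day = (now - timedelta(days=1)).date()
    return f"{day.isoformat()} {hour:02d}:00:00"


def parse_utc_usec(text):
    """Local stand-in for PARSE_UTC_USEC: UTC timestamp text to microseconds."""
    parsed = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    parsed = parsed.replace(tzinfo=timezone.utc)
    return int(parsed.timestamp()) * 1_000_000


now = datetime(2013, 10, 15, 9, 30, tzinfo=timezone.utc)
cutoff_text = yesterday_at(20, now)
cutoff_usec = parse_utc_usec(cutoff_text)
```

Comparing both sides as microsecond integers, as the query does, avoids relying on string ordering of timestamps.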
Global Database of Events, Language and Tone
quarter billion rows
30 years
updated daily
http://gdeltproject.org/data.html#googlebigquery
SELECT Year, Actor1Name, Actor2Name, Count FROM (
  SELECT Actor1Name, Actor2Name, Year, COUNT(*) Count,
    RANK() OVER(PARTITION BY YEAR ORDER BY Count DESC) rank
  FROM
    (SELECT Actor1Name, Actor2Name, Year
     FROM [gdelt-bq:full.events]
     WHERE Actor1Name < Actor2Name
     and Actor1CountryCode != '' and Actor2CountryCode != ''
     and Actor1CountryCode!=Actor2CountryCode),
    (SELECT Actor2Name Actor1Name, Actor1Name Actor2Name, Year
     FROM [gdelt-bq:full.events]
     WHERE Actor1Name > Actor2Name
     and Actor1CountryCode != '' and Actor2CountryCode != ''
     and Actor1CountryCode!=Actor2CountryCode),
  WHERE Actor1Name IS NOT null
  AND Actor2Name IS NOT null
  GROUP EACH BY 1, 2, 3
  HAVING Count > 100
)
WHERE rank=1
ORDER BY Year
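What the query above computes can be sketched in plain Python: put each pair of actors in a canonical order so that (A, B) and (B, A) count as the same pair, count pairs per year, and keep the top pair for each year. This is a simplified illustration over made-up tuples; it skips the country-code filters and, unlike RANK(), keeps only one pair when several tie for first place.

```python
from collections import Counter


def top_pair_per_year(events, min_count=0):
    """Return (year, actor_a, actor_b, count) for the most frequent pair per year.

    `events` is an iterable of (year, actor1, actor2) tuples.
    Pairs with count <= min_count are dropped, mirroring HAVING Count > N.
    """
    counts = Counter()
    for year, actor1, actor2 in events:
        if not actor1 or not actor2 or actor1 == actor2:
            continue  # skip missing names and self-pairs
        pair = tuple(sorted((actor1, actor2)))  # canonical order
        counts[(year, pair)] += 1

    best = {}
    for (year, pair), count in counts.items():
        if count <= min_count:
            continue
        if year not in best or count > best[year][1]:
            best[year] = (pair, count)

    return [
        (year, pair[0], pair[1], count)
        for year, (pair, count) in sorted(best.items())
    ]


# Hypothetical sample events.
events = [
    (1979, "USA", "IRAN"),
    (1979, "IRAN", "USA"),
    (1979, "USA", "CHINA"),
    (1980, "IRAQ", "IRAN"),
    (1980, "IRAN", "IRAQ"),
    (1980, "IRAN", "IRAQ"),
]
top_pairs = top_pair_per_year(events)
```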
our most active user
10 requests we should be caching
5 most created resources
select uri, count(*) total from stats where method = 'POST' group by uri order by total desc limit 5;
...but
/users/javier/shouts
/users/rgo/shouts
/teams/javier-community/links
/teams/nosqlmatters-cgn/links
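The catch is that every user and team has its own URI, so grouping by the raw URI counts per owner rather than per kind of resource. One possible workaround, sketched here in Python, is to collapse the identifier segment before counting. It assumes URIs follow a `/<collection>/<id>/<resource>` shape, which is inferred from the four examples above; the `:id` placeholder and function name are hypothetical.

```python
from collections import Counter


def resource_type(uri):
    """Replace the identifier segment so URIs group by resource type."""
    parts = uri.strip("/").split("/")
    if len(parts) == 3:
        collection, _identifier, resource = parts
        return f"/{collection}/:id/{resource}"
    return uri  # leave URIs with a different shape untouched


uris = [
    "/users/javier/shouts",
    "/users/rgo/shouts",
    "/teams/javier-community/links",
    "/teams/nosqlmatters-cgn/links",
]
created_by_type = Counter(resource_type(uri) for uri in uris)
```

The same normalization could be applied before loading the logs, or inside the query with a regexp function, so that the group-by works on resource types.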
Automation with Apps Script
Read from bigquery
Create a spreadsheet on Drive
E-mail it every day as a PDF
bigquery pricing
$80 per stored TB
300000 rows => $0.007629392 / month
$35 per processed TB
1 full scan = 84 MB
1 count = 0 MB
1 full scan over 1 column = 5.4 MB
10 GB => $0.35 / month
*the 1st TB every month is free of charge
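The arithmetic behind these figures can be reproduced with a few lines of Python. Two things here are inferences, not statements from the slide: the storage figure works out if the 300000 rows take about 100 MB and a TB is counted as 1024 × 1024 MB, and the query figures work out if a TB is counted as 1000 GB. The prices are the ones quoted on the slide and the function names are hypothetical.

```python
STORAGE_USD_PER_TB_MONTH = 80.0  # price quoted on the slide
QUERY_USD_PER_TB = 35.0          # price quoted on the slide


def storage_cost(megabytes, mb_per_tb=1024 * 1024):
    """Monthly storage cost in USD (binary TB assumed)."""
    return megabytes / mb_per_tb * STORAGE_USD_PER_TB_MONTH


def query_cost(gigabytes, gb_per_tb=1000):
    """Cost in USD of processing the given volume (decimal TB assumed)."""
    return gigabytes / gb_per_tb * QUERY_USD_PER_TB


monthly_storage = storage_cost(100)     # assumed ~100 MB for 300000 rows
monthly_queries = query_cost(10)        # 10 GB processed per month
wikipedia_regexp_query = query_cost(9.13)  # the 9.13 GB regexp query
```

This ignores the free first TB of processing mentioned on the slide, under which a workload of 10 GB per month would not be billed at all.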
Find related links at
https://teowaki.com/teams/javier-community/link-categories/bigquery-talk
Thanks
Javier Ramírez @supercoco9