Hadoop BigQuery Connector Simon Su & Sunny Hu @ MiCloud
Nov 21, 2014
I am Simon Su
var simon = {};
simon.aboutme = 'http://about.me/peihsinsu';
simon.nodejs = 'http://opennodes.arecord.us';
simon.googleshare = 'http://gappsnews.blogspot.tw';
simon.nodejsblog = 'http://nodejs-in-example.blogspot.tw';
simon.blog = 'http://peihsinsu.blogspot.com';
simon.slideshare = 'http://slideshare.net/peihsinsu/';
simon.email = '[email protected]';
simon.say('Good luck to everybody!');
I am Sunny Hu
var sunny = {};
sunny.aboutme = 'https://plus.google.com/u/0/+sunnyHU/posts';
sunny.email = '[email protected]';
sunny.language = ['Java', '.NET', 'NodeJS', 'SQL'];
sunny.skill = ['Project management', 'System Analysis',
               'System design', 'Car ho lan'];
sunny.say('Writing code is gloomy; keep your mood sunny');
● We are the Su & Hu duo (蘇胡二人組) ...
● 2011/11 MiCloud Launch
● 2013/2 Google Apps Partner
● 2013/9 Google Cloud Partner
● 2014/4 Google Cloud Launch
We are MiCloud
Background
● Dremel (BigQuery) delivers large-scale, stable service
● 2013, average daily volume served: 5,922,000,000 visits
● 2012, average daily volume served: 5,134,000,000 visits
● 2011, average daily volume served: 4,717,000,000 visits
● 2010, average daily volume served: 3,627,000,000 visits
● 2009, average daily volume served: 2,610,000,000 visits
● 2008, average daily volume served: 1,745,000,000 visits
What are the components of Hadoop...
HDFS
MapReduce
Strategy
Persistent storage for parallel access, ideally with good performance...
Massive computing power to load and process the workload in parallel
Your idea for filtering information from the given datasets
You have a better choice in the Cloud...
HDFS
MapReduce
Strategy
Object storage services, like: Google Cloud Storage, AWS S3...
Cloud machines with virtually unlimited resources, ideally with low, scalable pricing...
Nothing can replace a good idea… but speed helps...
● The fast way to run Hadoop: Docker
Google-Provided Resources
● GCE Hadoop Utility
● GCE Cluster Tool - bdutil
Before Demo… Prepare
1. Install google_cloud_sdk
2. Install bdutil

Google Cloud SDK:
curl https://sdk.cloud.google.com | bash
● Auth the gcloud utility
● Setup default project
● Test configuration….
Using bdutil...
https://developers.google.com/hadoop/setting-up-a-hadoop-cluster
bdutil scopes
● Designed for fast creation of a Hadoop cluster
● Quickly run a Hadoop task
● Quickly integrate Google's resources
● Quickly clean up finished resources
Let's start with a demo….
● Config your bdutil env.
● bdutil deploy -e bigquery_env.sh
● Checking the result...
● The Administration console
TeraSort
https://www.mapr.com/fr/company/press/mapr-and-google-compute-engine-set-new-world-record-hadoop-terasort
You can win the game, too...
…. (skip)
BigQuery Connector
https://developers.google.com/hadoop/running-with-bigquery-connector
hadoop-m / hadoop-w-0 / hadoop-w-1
Let's start with a demo….
Run a BigQuery Connector job...
Workflow...
1. Dump sample data from [publicdata:samples.shakespeare]
2. MapReduce to count each word's occurrences
3. Upload the result to a specific BigQuery table
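The counting step above can be sketched in plain Python. This is a simulation of the Mapper/Reducer logic, not the actual Hadoop job, and the sample records are made up stand-ins for rows dumped from [publicdata:samples.shakespeare]:

```python
from collections import Counter

# Hypothetical sample rows; each record carries a 'word' field,
# like the rows of the shakespeare sample table.
records = [
    {"word": "brave"}, {"word": "new"}, {"word": "world"},
    {"word": "brave"}, {"word": "world"}, {"word": "world"},
]

# Map phase: emit a (word, 1) pair per record.
pairs = [(r["word"], 1) for r in records]

# Reduce phase: sum the counts per word.
counts = Counter()
for word, n in pairs:
    counts[word] += n

# Shape the result like the output table schema (Name, Number).
rows = [{"Name": w, "Number": c} for w, c in sorted(counts.items())]
print(rows)
```

The row dictionaries mirror the output schema the connector job writes back to BigQuery.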
Look into source code...
● BigQueryInputFormat class
● Input parameters
● Mapper
● BigQueryOutputFormat class
● Output parameters
● Reducer
BigQueryInputFormat
● Using a user-specified query to select the appropriate BigQuery objects.
● Splitting the results of the query evenly among the Hadoop nodes.
● Parsing the splits into Java objects to pass to the mapper.
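The "split evenly" idea can be illustrated with a small Python sketch. This is illustrative only; the real connector partitions a BigQuery export, and `split_evenly` is a hypothetical helper, not connector API:

```python
def split_evenly(records, num_nodes):
    """Partition records into num_nodes roughly equal splits,
    mimicking how an InputFormat hands splits to Hadoop nodes."""
    splits = [[] for _ in range(num_nodes)]
    for i, rec in enumerate(records):
        # Round-robin assignment keeps split sizes within one record.
        splits[i % num_nodes].append(rec)
    return splits

# 10 records across 3 nodes -> splits of sizes 4, 3, 3
print([len(s) for s in split_evenly(list(range(10)), 3)])
```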
Input parameters
● Project Id: GCP project id, e.g. hadoop-conf-2014
● Input Table Id: [optional projectId]:[datasetId].[tableId]
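The table-id format above can be unpacked with a short sketch. `parse_table_id` is a hypothetical helper written to illustrate the `[optional projectId]:[datasetId].[tableId]` convention, not part of the connector:

```python
def parse_table_id(table_id, default_project):
    """Split '[projectId]:[datasetId].[tableId]' into its parts.
    The projectId prefix is optional; fall back to default_project."""
    if ":" in table_id:
        project, rest = table_id.split(":", 1)
    else:
        project, rest = default_project, table_id
    dataset, table = rest.split(".", 1)
    return project, dataset, table

# The sample input table from the demo workflow:
print(parse_table_id("publicdata:samples.shakespeare", "hadoop-conf-2014"))
```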
BigQueryOutputFormat class
● Provides Hadoop with the ability to write JsonObject values directly into a BigQuery table
● An extension of the Hadoop OutputFormat class
Output parameters
● Project Id: GCP project id, e.g. hadoop-conf-2014
● Output Table Id: [optional projectId]:[datasetId].[tableId]
● Output Table Schema:
  [{'name': 'Name', 'type': 'STRING'},
   {'name': 'Number', 'type': 'INTEGER'}]
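The output schema above is a list of name/type field descriptors. A minimal Python sketch of building and serializing it (the variable names are illustrative, not connector API):

```python
import json

# The schema from the slide: one STRING field and one INTEGER field.
schema = [
    {"name": "Name", "type": "STRING"},
    {"name": "Number", "type": "INTEGER"},
]

# Serialize to JSON, the shape typically passed as a job parameter.
schema_json = json.dumps(schema)
print(schema_json)
```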
bdutil housekeeping...
https://developers.google.com/hadoop/setting-up-a-hadoop-cluster
● Game over: delete the Hadoop cluster
● Check project….
Your cost in this lab...

VM (n1-standard-1)  machines  hours
$0.070 USD/hour     24        1
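Reading the table as rate × machines × hours (an assumption about the original slide layout), the lab cost works out as:

```python
rate_usd_per_hour = 0.070  # n1-standard-1 price per machine-hour, from the slide
machines = 24              # assumed column mapping from the slide
hours = 1                  # assumed column mapping from the slide

total = rate_usd_per_hour * machines * hours
print(f"${total:.2f} USD")  # prints $1.68 USD
```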
Today’s Demo
Using Docker...
● Using google optimized docker container
localhost:~$ gcloud compute instances create simon-docker \
> --image https://www.googleapis.com/compute/v1/projects/google-containers/global/images/container-vm-v20140522 \
> --zone asia-east1-a \
> --machine-type f1-micro
localhost:~$ gcloud compute ssh simon-docker
simonsu@simon-docker:~$ sudo docker search bdutil
simonsu@simon-docker:~$ sudo docker run -it peihsinsu/bdutil bash
Other connectors
BigQuery connector for Hadoop
$ ./bdutil deploy -e bigquery_env.sh
Datastore connector for Hadoop
$ ./bdutil deploy -e datastore_env.sh
To use both BQ & Datastore
$ ./bdutil deploy -e datastore_env.sh,bigquery_env.sh
http://goo.gl/PbHdDc
http://micloud.tw
http://jsdc-tw.kktix.cc/events/jsdc2014