My life as a beekeeper @89clouds
Who am I?
Pedro Figueiredo ([email protected])
Hadoop et al
Social/Facebook games, media (TV, publishing)
Elastic MapReduce, Cloudera
NoSQL, as in “Not a SQL guy”
The problem with Hive
It looks like SQL
No, seriously:

SELECT CONCAT(vishi, vislo),
       SUM(CASE WHEN searchengine = 'google' THEN 1 ELSE 0 END) AS google_searches
FROM omniture
WHERE year(hittime) = 2011
  AND month(hittime) = 8
  AND is_search = 'Y'
GROUP BY CONCAT(vishi, vislo);
“It’s just like Oracle!”
Analysts will be very happy
At least until they join with that 30 billion-record table
Pro tip: explain MapReduce and then MAPJOIN
set hive.mapjoin.smalltable.filesize=xxx;
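A MAPJOIN loads the small table into memory on every mapper, so the join happens map-side with no shuffle. A minimal sketch, with hypothetical table names (facts and dim):

```sql
-- Hint Hive to hold the small dimension table in memory on each mapper;
-- only worthwhile if dim fits under hive.mapjoin.smalltable.filesize
SELECT /*+ MAPJOIN(d) */ f.vishi, f.vislo, d.label
FROM facts f
JOIN dim d ON (f.key = d.key);
```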
Your first interview question
“Explain the difference between CREATE TABLE and CREATE EXTERNAL TABLE”
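The answer, in two statements (paths and names are illustrative):

```sql
-- Managed table: Hive owns the data; DROP TABLE deletes it from the warehouse
CREATE TABLE managed_logs (line STRING);

-- External table: Hive only tracks metadata; DROP TABLE leaves the files alone
CREATE EXTERNAL TABLE ext_logs (line STRING)
LOCATION '/data/logs';
```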
Dynamic partitions
Partitions are the poor person’s indexes
Unstructured data is full of surprises:

set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.dynamic.partitions.pernode=100000;
Plan your partitions ahead
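With those settings in place, Hive creates partitions from the trailing column(s) of the SELECT. A sketch with hypothetical tables (raw_hits and hits, partitioned by ds):

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- ds must be the last column in the SELECT; one partition per distinct value
INSERT OVERWRITE TABLE hits PARTITION (ds)
SELECT vishi, vislo, ds
FROM raw_hits;
```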
Multi-vitamins
You can minimise input scans by using multi-table INSERTs:
FROM input
INSERT INTO TABLE output1 SELECT foo
INSERT INTO TABLE output2 SELECT bar;
Persistence, do you speak it?
External Hive metastore
Avoid the pain of cluster set up
Use RDS for the metastore if on AWS, another RDBMS otherwise
10 GB will get you a long way; the metastore is tiny
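Pointing Hive at an external metastore is a few JDBC properties in hive-site.xml; the host and credentials below are placeholders:

```xml
<!-- hive-site.xml: external MySQL/RDS metastore (hypothetical endpoint) -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-host:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>secret</value>
</property>
```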
Now you have 2 problems
Regular expressions are great, if you’re using a real programming language.
WHERE foo RLIKE '(a|b|c)' will hurt
WHERE foo='a' OR foo='b' OR foo='c'
Generate these statements if need be; it will pay off.
Avro
Serialisation framework (think Thrift/Protocol Buffers).
Avro container files are SequenceFile-like, splittable.
Support for snappy built-in.
If using the LinkedIn SerDe, the table creation syntax changes.
CREATE EXTERNAL TABLE IF NOT EXISTS mytable
PARTITIONED BY (ds STRING)
ROW FORMAT SERDE 'com.linkedin.haivvreo.AvroSerDe'
WITH SERDEPROPERTIES ('schema.url'='hdfs:///user/hadoop/avro/myschema.avsc')
STORED AS
  INPUTFORMAT 'com.linkedin.haivvreo.AvroContainerInputFormat'
  OUTPUTFORMAT 'com.linkedin.haivvreo.AvroContainerOutputFormat'
LOCATION '/data/mytable';
MAKE! MONEY! FAST!
Use spot instances in EMR
They usually stick around until America wakes up
Brilliant for worker nodes
Bag of tricks

set hive.optimize.s3.query=true;
set hive.cli.print.header=true;
set hive.exec.max.created.files=xxx;
set mapred.reduce.tasks=xxx;
set hive.exec.compress.intermediate=true;
set hive.exec.parallel=true;
To be or not to be
“Consider a traditional RDBMS”
At what size should we do this?
Hive is not an end, it’s the means
Data on HDFS/S3 is simply available, not “available to Hive”
Hive isn’t suitable for near real time
Hive != MapReduce
Don’t use Hive instead of Native/Streaming
“I know, I’ll just stream this bit through a shell script!”
IMO, Hive excels at analysis and aggregation, so use it for that
Thank you
Fred Easey (@poppa_f)
Peter Hanlon