My life as a beekeeper @89clouds
Who am I?
Pedro Figueiredo ([email protected])
Hadoop et al
Social/Facebook games, media (TV, publishing)
Elastic MapReduce, Cloudera
NoSQL, as in “Not a SQL guy”
The problem with Hive
It looks like SQL
No, seriously:

SELECT CONCAT(vishi, vislo),
       SUM(CASE WHEN searchengine = 'google' THEN 1 ELSE 0 END) AS google_searches
FROM omniture
WHERE year(hittime) = 2011
  AND month(hittime) = 8
  AND is_search = 'Y'
GROUP BY CONCAT(vishi, vislo);
“It’s just like Oracle!”
Analysts will be very happy
At least until they join with that 30 billion-record table
Pro tip: explain MapReduce and then MAPJOIN
set hive.mapjoin.smalltable.filesize=xxx;
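A MAPJOIN loads the small table into memory on every mapper, so the join happens map-side with no shuffle. A minimal sketch, with hypothetical table names (facts and dim):

```sql
-- Hint Hive to hold the small dimension table in memory on each mapper;
-- only worthwhile if dim fits under hive.mapjoin.smalltable.filesize
SELECT /*+ MAPJOIN(d) */ f.vishi, f.vislo, d.label
FROM facts f
JOIN dim d ON (f.key = d.key);
```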
Your first interview question
“Explain the difference between CREATE TABLE and CREATE EXTERNAL TABLE”
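The answer, in two statements (paths and names are illustrative):

```sql
-- Managed table: Hive owns the data; DROP TABLE deletes it from the warehouse
CREATE TABLE managed_logs (line STRING);

-- External table: Hive only tracks metadata; DROP TABLE leaves the files alone
CREATE EXTERNAL TABLE ext_logs (line STRING)
LOCATION '/data/logs';
```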
Dynamic partitions
Partitions are the poor person’s indexes
Unstructured data is full of surprises:

set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions=100000;
set hive.exec.max.dynamic.partitions.pernode=100000;
Plan your partitions ahead
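With those settings in place, Hive creates partitions from the trailing column(s) of the SELECT. A sketch with hypothetical tables (raw_hits and hits, partitioned by ds):

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- ds must be the last column in the SELECT; one partition per distinct value
INSERT OVERWRITE TABLE hits PARTITION (ds)
SELECT vishi, vislo, ds
FROM raw_hits;
```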
Multi-vitamins
You can minimise input scans by using multi-table INSERTs:
FROM input
INSERT INTO TABLE output1 SELECT foo
INSERT INTO TABLE output2 SELECT bar;
Persistence, do you speak it?
External Hive metastore
Avoid the pain of cluster set up
Use RDS for the metastore if on AWS, another RDBMS otherwise
10 GB will get you a long way; the metastore is tiny
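Pointing Hive at an external metastore is a few JDBC properties in hive-site.xml; the host and credentials below are placeholders:

```xml
<!-- hive-site.xml: external MySQL/RDS metastore (hypothetical endpoint) -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-host:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>secret</value>
</property>
```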
Now you have 2 problems
Regular expressions are great, if you’re using a real programming language.
WHERE foo RLIKE '(a|b|c)' will hurt
WHERE foo='a' OR foo='b' OR foo='c'
Generate these statements if need be; it will pay off.
Avro
Serialisation framework (think Thrift/Protocol Buffers).
Avro container files are SequenceFile-like, splittable.
Support for snappy built-in.
If using the LinkedIn SerDe, the table creation syntax changes.
CREATE EXTERNAL TABLE IF NOT EXISTS mytable
PARTITIONED BY (ds STRING)
ROW FORMAT SERDE 'com.linkedin.haivvreo.AvroSerDe'
WITH SERDEPROPERTIES ('schema.url'='hdfs:///user/hadoop/avro/myschema.avsc')
STORED AS
  INPUTFORMAT 'com.linkedin.haivvreo.AvroContainerInputFormat'
  OUTPUTFORMAT 'com.linkedin.haivvreo.AvroContainerOutputFormat'
LOCATION '/data/mytable';
MAKE! MONEY! FAST!
Use spot instances in EMR
They usually stick around until America wakes up
Brilliant for worker nodes
Bag of tricks

set hive.optimize.s3.query=true;
set hive.cli.print.header=true;
set hive.exec.max.created.files=xxx;
set mapred.reduce.tasks=xxx;
set hive.exec.compress.intermediate=true;
set hive.exec.parallel=true;
To be or not to be
“Consider a traditional RDBMS”
At what size should we do this?
Hive is not an end, it’s the means
Data on HDFS/S3 is simply available, not “available to Hive”
Hive isn’t suitable for near real time
Hive != MapReduce
Don’t use Hive instead of Native/Streaming
“I know, I’ll just stream this bit through a shell script!”
IMO, Hive excels at analysis and aggregation, so use it for that
Thank you
Fred Easey (@poppa_f)
Peter Hanlon