Hadoop Operations: Starting Out Small / So Your Cluster Isn't Yahoo-sized (yet)

APOLLO GROUP


Michael Arnold
Principal Systems Engineer
14 June 2012

Agenda

Who
What (Definitions)
Decisions for Now
Decisions for Later
Lessons Learned

© 2012 Apollo Group


Who



Who is Apollo?


Apollo Group is a leading provider of higher education programs for working adults.

Who is Michael Arnold?

Systems Administrator
Automation geek
13 years in IT
I deal with:
–Server hardware specification/configuration
–Server firmware
–Server operating system
–Hadoop application health
–Monitoring all the above



What

Definitions


Definitions

Q: What is a tiny/small/medium/large cluster?
A (by number of nodes):
–Tiny: 1-9
–Small: 10-99
–Medium: 100-999
–Large: 1000+
–Yahoo-sized: 4000
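The size bands above can be written down as a small lookup. This is only an illustrative sketch: it assumes the numbers are node counts, and the function name `cluster_size_class` is made up for this example.

```python
def cluster_size_class(node_count: int) -> str:
    """Map a node count to the size labels used in the definitions above.

    Assumption: the ranges on the slide are counts of nodes.
    A "Yahoo-sized" cluster (about 4000 nodes) falls in the "large" band.
    """
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")
    if node_count <= 9:
        return "tiny"
    if node_count <= 99:
        return "small"
    if node_count <= 999:
        return "medium"
    return "large"
```

For example, the twelve-machine starting layout described later (two headnodes plus ten datanodes) would count as "small".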


Definitions

Q: What is a “headnode”?
A: A server that runs one or more of the following Hadoop processes:
–NameNode
–JobTracker
–Secondary NameNode
–ZooKeeper
–HBase Master


Decisions for Now

What decisions should you make now and which can you postpone for later?


Which Hadoop distribution?

Amazon
Apache
Cloudera
Greenplum
Hortonworks
IBM
MapR
Platform Computing


Should you virtualize?

Can be OK for small clusters, BUT:
–virtualization adds overhead
–can cause performance degradation
–cannot take advantage of Hadoop rack locality
Virtualization can be good for:
–functional testing of M/R job or workflow changes
–evaluation of Hadoop upgrades


What sort of hardware should you be considering?

Inexpensive
Not “enterprisey” hardware
–No RAID*
–No redundant power*
Low power consumption
No optical drives
–get systems that can boot off the network
* except in headnodes


Plan for capacity expansion

Start at the bottom and work your way up
Leave room in your cabinets for more machines


Plan for capacity expansion (cont.)

Deploy your initial cluster in two cabinets
–One headnode, one switch, and several (five) datanodes per cabinet


Plan for capacity expansion (cont.)

Install a second cluster in the empty space in the upper half of the cabinet


Decisions for Later

What decisions should you make now and which can you postpone for later?


What size cluster?

Depends upon your:
–Budget
–Data size
–Workload characteristics
–SLA


What size cluster? (cont.)

Are your MapReduce jobs:
–compute-intensive?
–reading lots of data?
http://www.cloudera.com/blog/2010/08/hadoophbase-capacity-planning/
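For the data-size factor alone, a back-of-the-envelope estimate can be sketched as below. This is not a sizing method from the talk: the function name, the 25% headroom figure and the per-node capacity are hypothetical inputs, and only the replication factor of 3 reflects the usual HDFS default. It ignores budget, workload and SLA entirely.

```python
import math

def datanodes_for_storage(data_tb, usable_tb_per_node, replication=3, headroom=0.25):
    """Estimate how many datanodes are needed just to hold the data.

    data_tb            -- logical data size in TB (before replication)
    usable_tb_per_node -- disk space per datanode available to HDFS (hypothetical input)
    replication        -- HDFS replication factor (3 is the usual default)
    headroom           -- fraction of raw space kept free for temp data and growth
                          (an assumed figure, tune for your workload)
    """
    raw_tb = data_tb * replication / (1 - headroom)
    return math.ceil(raw_tb / usable_tb_per_node)
```

With 10 TB of data, 8 TB usable per node and the defaults above, the raw requirement is 40 TB, which works out to five datanodes. A compute-intensive workload may need more nodes than the storage estimate suggests.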


Should you implement rack awareness?

If there is more than one switch in the cluster: YES
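One common way to implement rack awareness is a topology script: Hadoop is pointed at an executable (through the `topology.script.file.name` property in Hadoop releases of this era, as far as I know), passes it one or more datanode addresses as arguments, and reads back one rack path per line. A minimal sketch, assuming a hypothetical layout in which each cabinet's switch serves its own subnet; the subnets and rack names are invented for illustration.

```python
#!/usr/bin/env python
import sys

# Hypothetical mapping: one subnet per cabinet/switch. Replace with your own layout.
RACKS = {
    "10.1.1.": "/cabinet1",
    "10.1.2.": "/cabinet2",
}

# Rack returned for any address that is not in the table.
DEFAULT_RACK = "/default-rack"

def rack_for(address):
    """Return the rack path for one IP address, based on its subnet prefix."""
    for prefix, rack in RACKS.items():
        if address.startswith(prefix):
            return rack
    return DEFAULT_RACK

def main(argv):
    # Print one rack path per argument, in the same order as the arguments.
    print("\n".join(rack_for(address) for address in argv))

if __name__ == "__main__":
    main(sys.argv[1:])
```

Keeping the mapping in a small table like this makes it easy to generate from the same inventory that drives your provisioning tools.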


Should you use automation?

If not in the beginning, then as soon as possible.
Boot disks will fail.
Automated OS and application installs:
–Save time
–Reduce errors
•Cobbler/Spacewalk/Foreman/xCAT/etc.
•Puppet/Chef/Cfengine/shell scripts/etc.



Lessons Learned


Keep It Simple

Don't add redundancy and features (server/network) that will make things more complicated and expensive.
Hadoop has built-in redundancies.
Don't overlook them.


Automate the Hardware

Twelve hours of manual work in the datacenter is not fun.
Make sure all server firmware is configured identically.
–HP SmartStart Scripting Toolkit
–Dell OpenManage Deployment Toolkit
–IBM ServerGuide Scripting Toolkit


Rolling upgrades are possible

(Just not of the Hadoop software.)
Datanodes can be decommissioned, patched, and added back into the cluster without service downtime.


The smallest thing can have a big impact on the cluster

A bad NIC or switch port can cause cluster slowness.
Slow disks can cause intermittent job slowdowns.


HDFS blocks are weird

On ext3/ext4:
–Small blocks are not padded to the HDFS block size; they occupy only the actual size of the data.
–Each HDFS block is actually two files on the datanode's filesystem:
•The actual data and
•A metadata/checksum file

# ls -l blk_1058778885645824207*
-rw-r--r-- 1 hdfs hdfs 35094 May 14 01:26 blk_1058778885645824207
-rw-r--r-- 1 hdfs hdfs 283 May 14 01:26 blk_1058778885645824207_19155994.meta
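The two sizes in the listing are consistent with a simple model of the checksum file: a small header plus one CRC for every fixed-size chunk of block data. The sketch below assumes a 7-byte header, 512 bytes of data per checksum and 4-byte CRC32 values. Those figures are assumptions based on common HDFS defaults, not something stated on the slide.

```python
import math

def expected_meta_size(data_bytes, bytes_per_checksum=512, checksum_bytes=4, header_bytes=7):
    """Estimate the size of a block's .meta file from the size of its data file.

    All three defaults are assumed values (typical HDFS settings), not
    numbers taken from the presentation.
    """
    chunks = math.ceil(data_bytes / bytes_per_checksum)
    return header_bytes + chunks * checksum_bytes

# The 35094-byte block in the listing needs 69 chunks: 7 + 69 * 4 = 283 bytes.
print(expected_meta_size(35094))  # 283
```

The point for filesystem tuning is that every block, however small, costs two files and therefore two inodes on the datanode.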

Do not prematurely optimize

Be careful tuning your datanode filesystems.
• mkfs -t ext4 -T largefile4 ... (probably bad)
• mkfs -t ext4 -i 131072 -m 0 ... (better)

/etc/mke2fs.conf

[fs_types]
hadoop = {
    features = has_journal,extent,huge_file,flex_bg,uninit_bg,dir_nlink,extra_isize
    inode_ratio = 131072
    blocksize = -1
    reserved_ratio = 0
    default_mntopts = acl,user_xattr
}
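A rough way to see why the inode ratio matters: each HDFS block needs two inodes (the data file and its .meta file), so the inode count caps the number of blocks a disk can hold no matter how much space is free. The sketch below compares the two mkfs choices. The 2 TB disk is a hypothetical size, and 4194304 is, to my knowledge, the inode_ratio that a stock mke2fs.conf assigns to `largefile4`. The calculation ignores inodes used by directories and anything else on the disk.

```python
def inode_count(fs_bytes, inode_ratio):
    """Approximate number of inodes mke2fs creates: one per inode_ratio bytes."""
    return fs_bytes // inode_ratio

def max_hdfs_blocks(fs_bytes, inode_ratio):
    """Upper bound on HDFS blocks per filesystem: each block uses two inodes."""
    return inode_count(fs_bytes, inode_ratio) // 2

DISK_BYTES = 2 * 10**12          # hypothetical 2 TB data disk
LARGEFILE4_RATIO = 4 * 1024**2   # assumed ratio for -T largefile4 (4 MiB per inode)
HADOOP_RATIO = 131072            # -i 131072, as in the "better" command above

for label, ratio in (("largefile4", LARGEFILE4_RATIO), ("-i 131072", HADOOP_RATIO)):
    print(label, inode_count(DISK_BYTES, ratio), max_hdfs_blocks(DISK_BYTES, ratio))
```

With many small blocks, the largefile4 layout can run out of inodes long before it runs out of space, which matches the reason given later for Apollo's fourth rebuild.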

Use DNS-friendly names for services

hdfs://hdfs.delta.hadoop.apollogrp.edu:8020/
mapred.delta.hadoop.apollogrp.edu:8021
http://oozie.delta.hadoop.apollogrp.edu:11000/
hiveserver.delta.hadoop.apollogrp.edu:10000
Yes, the names are long, but I bet you can figure out how to connect to Bravo Cluster.
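The naming pattern is regular enough to generate. A small sketch, assuming every cluster follows the same service.cluster.domain layout as the Delta examples; the `endpoint` helper is hypothetical, and the Bravo address it produces is inferred from the pattern rather than taken from the slides.

```python
# Endpoint templates copied from the Delta cluster examples above.
ENDPOINT_FORMATS = {
    "hdfs": "hdfs://hdfs.{cluster}.{domain}:8020/",
    "mapred": "mapred.{cluster}.{domain}:8021",
    "oozie": "http://oozie.{cluster}.{domain}:11000/",
    "hiveserver": "hiveserver.{cluster}.{domain}:10000",
}

def endpoint(service, cluster, domain="hadoop.apollogrp.edu"):
    """Build the address of a service on a named cluster (hypothetical helper)."""
    return ENDPOINT_FORMATS[service].format(cluster=cluster, domain=domain)

print(endpoint("hdfs", "delta"))   # hdfs://hdfs.delta.hadoop.apollogrp.edu:8020/
print(endpoint("oozie", "bravo"))  # http://oozie.bravo.hadoop.apollogrp.edu:11000/
```

Because the names are DNS aliases for services rather than hostnames, a service can move to another headnode without clients changing their configuration.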


Use a parallel, remote execution tool

pdsh/Cluster SSH/mussh/etc.
SSH in a for loop is so 2010
FUNC/MCollective
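To illustrate what these tools do differently from a serial for loop, here is a minimal sketch that fans one command out to many hosts at once using a thread pool. It is not a substitute for pdsh or MCollective; the function names are made up for this example, and it assumes key-based SSH login is already set up.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def ssh_runner(host, command):
    """Run one command on one host over ssh; return (exit code, stdout)."""
    proc = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host, command],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout

def run_everywhere(hosts, command, runner=ssh_runner, max_workers=32):
    """Run the command on all hosts concurrently; return {host: result}."""
    hosts = list(hosts)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda host: runner(host, command), hosts)
        return dict(zip(hosts, results))
```

A call such as `run_everywhere(["dn01", "dn02"], "uptime")` (hypothetical hostnames) returns once the slowest host has answered, instead of waiting for each host in turn.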


Make your log directories as large as you can.

20-100 GB for /var/log
–Implement log purging cronjobs or your log directories will fill up.
Beware: M/R jobs can fill up /tmp as well.
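A log purging job can be very small. The sketch below deletes regular files older than a given number of days under one directory; the function name and the seven-day retention in the usage note are assumptions for illustration, and the script would be run from cron by whatever mechanism you already use.

```python
import os
import time

def purge_old_logs(log_dir, max_age_days, now=None):
    """Delete regular files under log_dir last modified more than max_age_days ago.

    Returns the list of removed paths. Symbolic links are left alone.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for root, _dirs, files in os.walk(log_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            if os.path.getmtime(path) < cutoff:
                os.remove(path)
                removed.append(path)
    return removed
```

For example, `purge_old_logs("/var/log/hadoop", 7)` would keep one week of logs. Test it against a scratch directory first, since it deletes files without asking.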


Insist on IPMI 2.0 for out-of-band management of server hardware.

Serial Over LAN is awesome when booting a system.
Standardized hardware/temperature monitoring.
Simple remote power control.


Spanning-tree is the devil

Enable portfast on your server switch ports or the BMCs may never get a DHCP lease.


Apollo has rebuilt its cluster four times.

You may end up doing so as well.


Apollo Timeline

First build
Cloudera Professional Services helped install CDH
Four nodes
OS built manually from a USB CD-ROM drive.
CDH2


Apollo Timeline

Second build
Cobbler
All software deployment is via kickstart. Very little is in Puppet. Config files are deployed via wget.
CDH2


Apollo Timeline

Third build
OS filesystem partitioning needed to change.
Most software deployment still via kickstart.
CDH3b2


Apollo Timeline

Fourth build
The number of inodes on the HDFS data filesystems needed to be increased.
Full Puppet automation.
Added redundant/hotswap enterprise hardware for headnodes.
CDH3u1


Cluster failures at Apollo

Hardware
–disk failures (40+)
–disk cabling (6)
–RAM (2)
–switch port (1)
Software
–Cluster
•NFS (NN -> 2NN metadata)
–Job
•TT Java heap
•Running out of /tmp or /var/log/hadoop
•Running out of HDFS space


Know your workload

You can spend all the time in the world trying to get the best CPU/RAM/HDD/switch/cabinet configuration, but you are running on pure luck until you understand your cluster's workload.



Questions?

