APOLLO GROUP
So Your Cluster Isn't Yahoo-sized (yet)
Hadoop Operations: Starting Out Small
Michael Arnold
Principal Systems Engineer
14 June 2012

Agenda
Who
What (Definitions)
Decisions for Now
Decisions for Later
Lessons Learned

© 2012 Apollo Group
Who
Who is Apollo?
Apollo Group is a leading provider of higher education programs for working adults.
Who is Michael Arnold?
Systems Administrator
Automation geek
13 years in IT
I deal with:
–Server hardware specification/configuration
–Server firmware
–Server operating system
–Hadoop application health
–Monitoring all the above

What
Definitions
Definitions
Q: What is a tiny/small/medium/large cluster?
A: By number of nodes:
–Tiny: 1-9
–Small: 10-99
–Medium: 100-999
–Large: 1000+
–Yahoo-sized: 4000

Definitions
Q: What is a “headnode”?
A: A server that runs one or more of the following Hadoop processes:
–NameNode
–JobTracker
–Secondary NameNode
–ZooKeeper
–HBase Master

Decisions for Now
What decisions should you make now and which can you postpone for later?

Which Hadoop distribution?
Amazon
Apache
Cloudera
Greenplum
Hortonworks
IBM
MapR
Platform Computing

Should you virtualize?
Can be OK for small clusters BUT
–virtualization adds overhead
–can cause performance degradation
–cannot take advantage of Hadoop rack locality
Virtualization can be good for:
–functional testing of M/R job or workflow changes
–evaluation of Hadoop upgrades

What sort of hardware should you be considering?
Inexpensive
Not “enterprisey” hardware
–No RAID*
–No Redundant power*
Low power consumption
No optical drives
–get systems that can boot off the network
* except in headnodes

Plan for capacity expansion
Start at the bottom of the cabinet and work your way up
Leave room in your cabinets for more machines

Plan for capacity expansion (cont.)
Deploy your initial cluster in two cabinets
–One headnode, one switch, and several (five) datanodes per cabinet

Plan for capacity expansion (cont.)
Install a second cluster in the empty space in the upper half of the cabinet

Decisions for Later
What decisions should you make now and which can you postpone for later?

What size cluster?
Depends upon your:
–Budget
–Data size
–Workload characteristics
–SLA

What size cluster? (cont.)
Are your MapReduce jobs:
–compute-intensive?
–reading lots of data?
http://www.cloudera.com/blog/2010/08/hadoophbase-capacity-planning/

Should you implement rack awareness?
If more than one switch in the cluster: YES

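Rack awareness is usually implemented with a topology script: Hadoop calls the script with one or more datanode IP addresses or hostnames and expects one rack path per argument on standard output. The sketch below assumes each cabinet sits on its own /24 subnet; the subnets and rack names are hypothetical and not from the talk. In Hadoop 1.x/CDH3 the script is referenced by the topology.script.file.name property in core-site.xml.

```shell
#!/bin/sh
# Hypothetical mapping: one /24 subnet per cabinet (an assumption for
# illustration, not the layout described in the talk).
rack_for() {
    case "$1" in
        10.1.1.*) echo "/cabinet1" ;;
        10.1.2.*) echo "/cabinet2" ;;
        *)        echo "/default-rack" ;;   # Hadoop's name for an unknown rack
    esac
}

# Hadoop may pass several addresses in one call; print one rack per argument.
for node in "$@"; do
    rack_for "$node"
done
```

A lookup table keyed on hostname would work just as well; the only contract is one rack path printed per argument.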
Should you use automation?
If not in the beginning, then as soon as possible.
Boot disks will fail.
Automated OS and application installs:
–Save time
–Reduce errors
•Cobbler/Spacewalk/Foreman/xCat/etc
•Puppet/Chef/Cfengine/shell scripts/etc

Lessons Learned
Keep It Simple
Don't add redundancy and features (server/network) that will make things more complicated and expensive.
Hadoop has built-in redundancies.
Don't overlook them.

Automate the Hardware
Twelve hours of manual work in the datacenter is not fun.
Make sure all server firmware is configured identically.
–HP SmartStart Scripting Toolkit
–Dell OpenManage Deployment Toolkit
–IBM ServerGuide Scripting Toolkit

Rolling upgrades are possible
(Just not of the Hadoop software.)
Datanodes can be decommissioned, patched, and added back into the cluster without service downtime.

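A minimal sketch of the decommission step, assuming the NameNode's dfs.hosts.exclude property points at the exclude file used here. The file path and hostnames are hypothetical. The helper only edits the file; the hadoop commands are left as comments so the sketch has no side effects.

```shell
#!/bin/sh
# EXCLUDE_FILE is a hypothetical path; use whatever dfs.hosts.exclude names.
EXCLUDE_FILE="${EXCLUDE_FILE:-/etc/hadoop/conf/dfs.exclude}"

# Add a datanode to the exclude file, but only if it is not already listed.
exclude_node() {
    touch "$EXCLUDE_FILE"
    # -q quiet, -x whole-line match, -F fixed string (no regex)
    grep -qxF "$1" "$EXCLUDE_FILE" || echo "$1" >> "$EXCLUDE_FILE"
}

# Typical sequence on the NameNode (commented out on purpose):
#   exclude_node datanode07.example.com
#   hadoop dfsadmin -refreshNodes     # NameNode re-reads the exclude file
#   hadoop dfsadmin -report           # wait until the node shows as decommissioned
#   ...patch the node, remove its line from the file, run -refreshNodes again
```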
The smallest thing can have a big impact on the cluster
A bad NIC or switch port can cause cluster slowness.
Slow disks can cause intermittent job slowdowns.

HDFS blocks are weird
On ext3/ext4:
–Small blocks are not padded to the HDFS block size; they take up only the actual size of the data.
–Each HDFS block is actually two files on the datanode's filesystem:
•The actual data and
•A metadata/checksum file
# ls -l blk_1058778885645824207*
-rw-r--r-- 1 hdfs hdfs 35094 May 14 01:26 blk_1058778885645824207
-rw-r--r-- 1 hdfs hdfs 283 May 14 01:26 blk_1058778885645824207_19155994.meta
Do not prematurely optimize
Be careful tuning your datanode filesystems.
• mkfs -t ext4 -T largefile4 ... (probably bad)
• mkfs -t ext4 -i 131072 -m 0 ... (better)
/etc/mke2fs.conf
[fs_types]
    hadoop = {
        features = has_journal,extent,huge_file,flex_bg,uninit_bg,dir_nlink,extra_isize
        inode_ratio = 131072
        blocksize = -1
        reserved_ratio = 0
        default_mntopts = acl,user_xattr
    }

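The difference between the two mkfs invocations is easiest to see as arithmetic. mke2fs creates roughly one inode per inode_ratio bytes of disk: the stock largefile4 type uses a ratio of 4 MiB (4194304 bytes), while -i 131072 uses 128 KiB. Because each HDFS block needs two inodes, a disk full of small blocks can run out of inodes before it runs out of space. The 1 TiB disk size below is only an example.

```shell
#!/bin/sh
# Approximate number of inodes mke2fs creates: one per <ratio> bytes.
inode_count() {
    # $1 = filesystem size in bytes, $2 = bytes-per-inode ratio
    echo $(( $1 / $2 ))
}

disk_bytes=1099511627776    # 1 TiB, an example size

echo "largefile4 (ratio 4194304): $(inode_count "$disk_bytes" 4194304) inodes"
echo "hadoop     (ratio 131072):  $(inode_count "$disk_bytes" 131072) inodes"
```

On that example disk the largefile4 layout has room for about 131072 HDFS blocks (262144 inodes, two per block), against roughly 4 million with the 131072 ratio. With the hadoop stanza saved in /etc/mke2fs.conf, the type can be selected by passing -T hadoop to mkfs -t ext4.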
Use DNS-friendly names for services
hdfs://hdfs.delta.hadoop.apollogrp.edu:8020/
mapred.delta.hadoop.apollogrp.edu:8021
http://oozie.delta.hadoop.apollogrp.edu:11000/
hiveserver.delta.hadoop.apollogrp.edu:10000
Yes, the names are long, but I bet you can figure out how to connect to Bravo Cluster.

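The payoff of a regular naming scheme is that endpoints can be derived rather than looked up. The sketch below builds them from a cluster name using the pattern and ports of the Delta examples. The Bravo names it prints are inferred from that pattern, not confirmed by the talk, and the function names are made up for illustration.

```shell
#!/bin/sh
# Domain and ports are copied from the Delta examples on the slide.
DOMAIN="hadoop.apollogrp.edu"

# Each helper takes a cluster name (e.g. delta, bravo) and prints an endpoint.
hdfs_uri()   { echo "hdfs://hdfs.$1.$DOMAIN:8020/"; }
jobtracker() { echo "mapred.$1.$DOMAIN:8021"; }
oozie_url()  { echo "http://oozie.$1.$DOMAIN:11000/"; }
hiveserver() { echo "hiveserver.$1.$DOMAIN:10000"; }

# Guessing the Bravo endpoints from the pattern:
hdfs_uri bravo
oozie_url bravo
```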
Use a parallel, remote execution tool
SSH in a for loop is so 2010
pdsh/Cluster SSH/mussh/etc
FUNC/MCollective

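As one illustration, here is a small wrapper around pdsh, the first tool named above. It assumes pdsh was built with its ssh module and that the datanodes follow a numbered naming scheme; the host range and the function name are hypothetical.

```shell
#!/bin/sh
# Hypothetical host range; pdsh expands "datanode[01-10]" into ten hostnames.
HOSTS="${HOSTS:-datanode[01-10]}"

# Run one command on every host in parallel.
on_cluster() {
    # -R ssh selects the ssh transport; dshbak -c folds identical output
    # from different hosts into a single block.
    pdsh -R ssh -w "$HOSTS" "$@" | dshbak -c
}

# Example (commented out; needs pdsh and SSH access to the hosts):
#   on_cluster uptime
```

Compared with a for loop, the commands run concurrently, and hosts that return the same output are grouped, which makes the odd one out easy to spot.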
Make your log directories as large as you can.
20-100GB /var/log
–Implement log purging cronjobs or your log directories will fill up.
Beware: M/R jobs can fill up /tmp as well.

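A log purging job can be as small as one find command. The sketch below is one possible shape, not the job used at Apollo; the script path, schedule, and retention period in the crontab comment are hypothetical.

```shell
#!/bin/sh
# Delete regular files under a directory that are older than N days
# (find's -mtime +N, which counts whole 24-hour periods).
purge_old_logs() {
    # $1 = log directory, $2 = retention in days
    find "$1" -type f -mtime +"$2" -exec rm -f {} +
}

# Example crontab entry (hypothetical path and retention):
#   15 3 * * * /usr/local/sbin/purge_hadoop_logs.sh /var/log/hadoop 14
if [ "$#" -eq 2 ]; then
    purge_old_logs "$1" "$2"
fi
```

The same function could be pointed at the M/R scratch space under /tmp, with a shorter retention.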
Insist on IPMI 2.0 for out of band management of server hardware.
Serial Over LAN is awesome when booting a system.
Standardized hardware/temperature monitoring.
Simple remote power control.

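The three uses above map onto three ipmitool subcommands. The wrapper below is a sketch assuming ipmitool is installed and the BMCs are reachable over the network; the BMC hostname, the default user name, and the IPMI_USER variable are hypothetical. The -E flag makes ipmitool read the password from the IPMI_PASSWORD environment variable instead of the command line.

```shell
#!/bin/sh
# Thin wrapper: first argument is the BMC hostname, the rest is passed
# through to ipmitool. "lanplus" is the IPMI 2.0 network interface.
bmc() {
    host="$1"; shift
    ipmitool -I lanplus -H "$host" -U "${IPMI_USER:-admin}" -E "$@"
}

# Examples (commented out; each needs a reachable IPMI 2.0 BMC):
#   bmc node07-bmc sol activate            # Serial Over LAN console
#   bmc node07-bmc sdr type Temperature    # temperature sensors
#   bmc node07-bmc chassis power cycle     # remote power control
```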
Spanning-tree is the devil
Enable portfast on your server switch ports or the BMCs may never get a DHCP lease.

Apollo has rebuilt its cluster four times.
You may end up doing so as well.

Apollo Timeline
First build
Cloudera Professional Services helped install CDH
Four nodes
OS built manually from a USB CD-ROM drive.
CDH2

Apollo Timeline
Second build
Cobbler
All software deployment is via kickstart. Very little is in Puppet. Config files are deployed via wget.
CDH2

Apollo Timeline
Third build
OS filesystem partitioning needed to change.
Most software deployment still via kickstart.
CDH3b2

Apollo Timeline
Fourth build
HDFS filesystem inodes needed to be increased.
Full Puppet automation.
Added redundant/hotswap enterprise hardware for headnodes.
CDH3u1

Cluster failures at Apollo
Hardware
–disk failures (40+)
–disk cabling (6)
–RAM (2)
–switch port (1)
Software
–Cluster
•NFS (NameNode -> Secondary NameNode metadata)
–Job
•TaskTracker Java heap
•Running out of /tmp or /var/log/hadoop
•Running out of HDFS space

Know your workload
You can spend all the time in the world trying to get the best CPU/RAM/HDD/switch/cabinet configuration, but you are running on pure luck until you understand your cluster's workload.

Questions?