Meltdown Mysteries Sean Suchter
Aug 11, 2014
Disks are thrashing!
Solution
• Make job author aware of surprising behavior.
• Modify job code & settings to be nicer to disks.
Nodes are dying!
Initial diagnosis…
• Nodes abruptly started swapping and became non-responsive. (Required physical power cycling.)
• Job submitters report “I didn’t change anything”
• Question: What’s doing this to the cluster?
Cause & solution
• While the job didn’t change, its input data did.
• Stop that user’s jobs immediately.
• Better use of capacity scheduler virtual memory controls.
• Use Pepperdata protection to limit physical memory as well.
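The value of limiting physical memory in addition to virtual memory can be illustrated with a minimal sketch. It is not Hadoop's or Pepperdata's actual enforcement logic; the `TaskSample` type, the `tasks_over_limit` helper, the task names, and all numbers are hypothetical and chosen only to show a task that passes a virtual memory check yet still exceeds a physical memory limit.

```python
from dataclasses import dataclass


@dataclass
class TaskSample:
    """One memory reading for a task (hypothetical structure)."""
    task_id: str
    virtual_mb: int
    physical_mb: int


def tasks_over_limit(samples, virtual_limit_mb, physical_limit_mb):
    """Return the IDs of tasks that exceed either memory limit."""
    return [
        s.task_id
        for s in samples
        if s.virtual_mb > virtual_limit_mb or s.physical_mb > physical_limit_mb
    ]


# Made-up readings for three tasks on one node.
samples = [
    TaskSample("task_a", virtual_mb=2048, physical_mb=900),
    TaskSample("task_b", virtual_mb=9000, physical_mb=7000),
    TaskSample("task_c", virtual_mb=3000, physical_mb=2500),
]

# task_c stays under the virtual limit but is caught by the physical limit.
offenders = tasks_over_limit(samples, virtual_limit_mb=4096, physical_limit_mb=2048)
print(offenders)
```

With only a virtual memory limit in place, `task_c` would go unnoticed even though it holds more physical memory than intended.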
Take-away
• You see problems at the node level.
• You see the root causes at the task level.
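One way to picture the step from node-level symptom to task-level cause is to group per-task measurements by job on the affected node. The sketch below assumes task-level memory samples are already available as dictionaries; the field names, node and job names, and values are hypothetical and are not taken from any real monitoring tool.

```python
from collections import defaultdict


def memory_by_job(task_samples, node):
    """Sum physical memory per job on one node, largest consumer first."""
    totals = defaultdict(int)
    for sample in task_samples:
        if sample["node"] == node:
            totals[sample["job"]] += sample["physical_mb"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Made-up task-level samples; a node-level view would only show node1 under pressure.
task_samples = [
    {"node": "node1", "job": "job_x", "physical_mb": 6000},
    {"node": "node1", "job": "job_y", "physical_mb": 800},
    {"node": "node1", "job": "job_x", "physical_mb": 5500},
    {"node": "node2", "job": "job_y", "physical_mb": 700},
]

ranking = memory_by_job(task_samples, "node1")
print(ranking)
```

The node-level total says only that `node1` is short on memory; the per-job breakdown points at the job whose tasks are responsible.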
Pepperdata meetup tomorrow!
• War Stories from the Hadoop Trenches
• Allen Wittenauer (Apache Hadoop committer and former LinkedIn)
• Eric Baldeschwieler (former Hortonworks CEO / CTO)
• Todd Nemet (Looker; former Altiscale, ClearStory Data, Cloudera)
• 6pm Wed 6/25
• Firehouse Brewery, 111 S Murphy, Sunnyvale
• http://www.meetup.com/pepperdata/