#TDPARTNERS16 Sept 11, 2016 GEORGIA WORLD CONGRESS CENTER
The Last Mile: Why Hadoop Management Is Critical to Success
Ron Bodkin and Scott Fleming
Think Big, a Teradata company
The Last Mile
• The open source ecosystem for analytics is complicated
• It’s easy to get started
• Maintaining an optimal, performant environment is not
• Success depends on careful planning and management
Data Lake Design Principles
• Automated and reliable data ingest
• Capture and manage relevant metadata
• Preserve original source data where possible
• Provide cleansing, aggregation, and integration matched to each use
• Balance governance and agility
• Implement security at the right time
• Easily search, access, and consume data
• Make the data ready for analysis
New Data Sources
• It all starts here
• Capture the rawest form
• Determine how it will be used and who will be using it
• Cleanse it, validate it, and profile it
• Make it discoverable (and useful)
• Bottom line: Be consistent and consider tools
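As an illustrative sketch (not from the deck), the "cleanse it, validate it and profile it" step for one ingested batch might report totals, per-field missing counts, and the number of fully valid rows. The records and required-field names here are hypothetical:

```python
from collections import Counter

def profile_records(records, required_fields):
    """Profile one ingested batch: totals, per-field missing counts, valid rows."""
    stats = {"total": len(records), "missing": Counter()}
    for rec in records:
        for field in required_fields:
            if rec.get(field) in (None, ""):
                stats["missing"][field] += 1
    # A record is valid only if every required field is populated.
    stats["valid"] = sum(
        1 for rec in records
        if all(rec.get(f) not in (None, "") for f in required_fields)
    )
    return stats

# Hypothetical batch with two bad records:
batch = [
    {"id": "1", "source": "web", "ts": "2016-09-01"},
    {"id": "2", "source": "", "ts": "2016-09-01"},
    {"id": "3", "source": "app", "ts": None},
]
stats = profile_records(batch, ["id", "source", "ts"])
```

Emitting these stats alongside each ingest run makes data quality drift visible early, before downstream users hit it.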
Typical Data Ingestion
Governance
• Clear distinction of roles and responsibilities for curating data
• Common vocabulary for data sets / types
• Implement required security – not too much, not too little
• Ongoing data quality policies
• Data retention / archival policies
Security Challenges
• Residual files following failed jobs
• Compatibility of security tools with major Hadoop distributions
• Multiple types of discoverable data in the environment
• BI and analytics user access
• Lack of mature security tools
• Uncontrolled replication of data
• User authentication and authorization are complex
Without comprehensive security measures, your valuable data can be easily compromised and you may be subject to a security breach.
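The "residual files following failed jobs" challenge above lends itself to a periodic sweep. This hedged sketch works from a directory-listing snapshot (path plus age in hours, e.g. parsed from an `hdfs dfs -ls -R` run) rather than a live HDFS connection, and the staging-file markers are assumptions:

```python
def find_residual_files(listing, max_age_hours=24):
    """Flag leftover temp/staging files older than the threshold.
    `listing` is an iterable of (path, age_in_hours) pairs."""
    stale_markers = ("_temporary", ".staging", ".tmp")
    return [
        path for path, age in listing
        if age > max_age_hours and any(m in path for m in stale_markers)
    ]

# Hypothetical snapshot: only the 72-hour-old _temporary file is residue.
snapshot = [
    ("/data/sales/_temporary/attempt_001/part-0000", 72),
    ("/data/sales/part-0000", 72),
    ("/user/etl/.staging/job_123/job.xml", 8),
]
stale = find_residual_files(snapshot)
```

Running a sweep like this on a schedule (and alerting rather than auto-deleting) keeps failed-job debris from accumulating and being mistaken for real data.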
Security Layers
Ingestion Jobs and Monitoring
• Baseline job performance and resource requirements
• Ensure error handling is robust
• Build alerting into the processes that submit jobs
• Develop and monitor SLAs for job performance. Look for drift.
• Leverage tools where possible
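The SLA-drift bullet above can be sketched as a simple statistical check: compare the latest job duration against a rolling baseline of recent runs. This is one illustrative approach, not a prescribed one; the window size and sigma threshold are assumed values to tune:

```python
from statistics import mean, stdev

def sla_drift(durations, window=10, threshold=2.0):
    """Return True if the latest run drifts more than `threshold` standard
    deviations from the rolling baseline of the previous `window` runs."""
    if len(durations) <= window:
        return False  # not enough history to baseline
    baseline = durations[-window - 1:-1]
    mu, sigma = mean(baseline), stdev(baseline)
    return sigma > 0 and abs(durations[-1] - mu) > threshold * sigma

# Hypothetical nightly-job durations in minutes; the last run blew the SLA.
history = [30, 31, 29, 30, 32, 30, 29, 31, 30, 30, 55]
```

Wiring a check like this into the process that submits the job gives you the "build alerting into the processes" bullet almost for free.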
Resource Contention
• SLAs and sandboxes – often in the same environment
• Leverage the capacity scheduler and hierarchical queues
• Don’t be afraid to get granular
• Use YARN containers – be prudent about the resources requested
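As a rough illustration of being prudent about requested resources: the number of containers a node can run is bounded by both memory (after an OS/daemon reserve) and vcores, and the tighter bound wins. The 8 GB reserve below is an assumed figure; tune it for your distribution and daemons:

```python
def containers_per_node(node_mem_gb, node_vcores,
                        container_mem_gb, container_vcores,
                        os_reserve_gb=8):
    """Rough YARN sizing: how many containers of a given shape fit on a node,
    bounded by both memory (after the reserve) and vcores."""
    by_mem = (node_mem_gb - os_reserve_gb) // container_mem_gb
    by_cpu = node_vcores // container_vcores
    return int(min(by_mem, by_cpu))

# A 128 GB / 32-vcore worker with 4 GB / 1-vcore containers is memory-bound:
fit = containers_per_node(128, 32, 4, 1)  # (128 - 8) // 4 = 30
```

Oversized container requests shrink this number quickly, which is how one greedy tenant starves the queues for everyone else.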
Capacity Planning
• Capacity planning is an ongoing effort, not one-and-done
• Includes storage, compute, network, memory and real estate
• Review resource and storage utilization at least monthly
• Implement retention and archiving processes where appropriate
• Be thoughtful and plan when expanding – just adding nodes can have unexpected results
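A back-of-the-envelope projection shows why storage planning has to be ongoing: raw HDFS capacity is the logical data volume times the replication factor, plus headroom for temp and shuffle space. The 3x replication and 25% overhead below are common defaults, assumed here for illustration:

```python
def raw_storage_needed(daily_ingest_tb, days, replication=3, overhead=0.25):
    """Project raw HDFS capacity in TB: logical data x replication,
    plus `overhead` (a fraction) of headroom for temp/shuffle space."""
    logical = daily_ingest_tb * days
    return logical * replication * (1 + overhead)

# Hypothetical: 0.5 TB/day of new data, projected over one year.
yearly = raw_storage_needed(0.5, 365)  # ~684 TB raw
```

Reviewing this projection against actual utilization monthly, per the bullets above, is what turns "just add nodes" into a plan.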
Hive Operations
• Bring your own data – user education
• Sub-optimal storage formats
• Table proliferation
• Over-partitioning
• ODBC / JDBC connectivity
• Canary processes for HiveServer2
• Impala – compute stats
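A canary process for HiveServer2 can be as simple as timing a trivial probe query on a schedule. The `run_query` callable below is a hypothetical stand-in for a real JDBC/ODBC call; in practice it would execute the query against HiveServer2 and raise on connection failure:

```python
import time

def canary(run_query, timeout_s=30):
    """Probe HiveServer2 availability and latency with a trivial query.
    `run_query` is a callable that executes SQL and raises on failure."""
    start = time.monotonic()
    try:
        run_query("SELECT 1")
        elapsed = time.monotonic() - start
        return {"ok": elapsed <= timeout_s, "latency_s": elapsed}
    except Exception as exc:
        return {"ok": False, "error": str(exc)}

# Simulated probe for illustration (a no-op stands in for the real client):
result = canary(lambda q: None)
```

Trending the recorded latency, not just the up/down flag, catches the slow degradations that precede outright failures.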
General Hadoop Operations
• Develop a RACI for operations
• ITIL processes – minimally Release Management and Change Management
• Stay aligned with the distro versions
• Use configuration management tools like Puppet and Ansible
• Staff appropriately
Hadoop Operations Top 10
1. Continuous capacity planning
2. Isolate the LAN
3. Implement proactive monitoring and alerting
4. Establish data balancer schedule and use
5. Periodic review of Hive tables, schemas, and data storage
6. Monitor for small files
7. End-user education
8. Periodic review of the capacity scheduler and resource management
9. Monitor SLAs for drift
10. Runbook, Runbook, Runbook
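Item 6, monitoring for small files, can be sketched as a per-directory report over a filesystem listing: count files smaller than one HDFS block. The (path, size) snapshot format and the 128 MB block size are assumptions; the listing would typically be parsed from `hdfs dfs -ls -R` output:

```python
from collections import defaultdict
import posixpath

def small_file_report(listing, block_size=128 * 1024 * 1024):
    """Count files smaller than one HDFS block, grouped by directory.
    `listing` is an iterable of (path, size_in_bytes) pairs."""
    report = defaultdict(int)
    for path, size in listing:
        if size < block_size:
            report[posixpath.dirname(path)] += 1
    return dict(report)

# Hypothetical snapshot: one healthy file, two small ones.
snapshot = [
    ("/data/events/part-0000", 200 * 1024 * 1024),
    ("/data/events/part-0001", 4 * 1024),
    ("/data/logs/2016-09-11/app.log", 1024),
]
report = small_file_report(snapshot)
```

Directories that top this report are the candidates for compaction or for fixing the upstream job that writes them.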
Monitoring
• Ambari / Cloudera Manager – basic blocking and tackling
• Nagios – where there are gaps
• PCNG – for application monitoring
• Dr. Elephant – for application heuristics
Engineering and Operations
• Weekly reviews for alignment and planning
• Include operations in engineering design
• New technology preparation, planning, and training
• Continuous updates to the runbooks
• DevOps and Agile – rules of the road to be able to fail fast while maintaining a stable environment
Monitoring Adoption
• Knowing who is doing what in the environment is essential to maintenance and planning.
• Determine who the power users are and make them champions
• Helps inform resource planning and allocation
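Finding the power users can start from job-submission logs. This hedged sketch assumes a simple (user, application id) log format, e.g. scraped from the YARN ResourceManager; real adoption tracking would also weigh resource usage, not just submission counts:

```python
from collections import Counter

def top_users(job_log, n=3):
    """Rank users by number of job submissions to spot power users.
    `job_log` is a sequence of (user, application_id) pairs."""
    return Counter(user for user, _ in job_log).most_common(n)

# Hypothetical submission log:
log = [("alice", "app_1"), ("bob", "app_2"), ("alice", "app_3"),
       ("carol", "app_4"), ("alice", "app_5"), ("bob", "app_6")]
top_users(log)  # [('alice', 3), ('bob', 2), ('carol', 1)]
```

The names at the top of this list are your candidate champions, and their workloads are the ones to baseline first.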
Summary
• Getting started is easy
• Getting started to ensure long-term success takes some planning
• There is a lot to stay on top of to ensure successful operations
• The platform components and tools vary in every environment
• Capable operations people are hard to find
• Proactive management and monitoring is key to happy users
Thank You
Questions/Comments – Email: ron.bodkin@thinkbiganalytics.com
Follow Me – Twitter: @Ronbodkin and @scottbfleming
Rate This Session with the PARTNERS Mobile App
Remember To Share Your Virtual Passes