CUSTOMER APPLICATIONS OF HADOOP ON RED HAT STORAGE
Jacob Shucart, Sr. Solution Architect, Red Hat
Erin Boyd, Principal Software Engineer, Red Hat
Diane Feddema, Principal Software Engineer, Red Hat

Mar 16, 2018

Page 1: CUSTOMER APPLICATIONS OF HADOOP ON RED HAT · PDF file•Hadoop services are loaded into Ambari via the concept of a ... •Jacob Shucart – jshucart@redhat.com •Erin Boyd – eboyd@redhat.com

CUSTOMER APPLICATIONS OF HADOOP ON RED HAT STORAGE
Jacob Shucart, Sr. Solution Architect, Red Hat
Erin Boyd, Principal Software Engineer, Red Hat
Diane Feddema, Principal Software Engineer, Red Hat


CUSTOMER APPLICATIONS OF HADOOP ON RED HAT STORAGE
A real-world scenario


AGENDA

What is Red Hat Storage Server?

•Overview of technology

How does it integrate with Hadoop?

•Summary of integration

Customer use case

•DreamWorks Animation

Demonstration

Q&A


OPEN SOFTWARE-DEFINED STORAGE

TRADITIONAL STORAGE

•HARDWARE CENTRIC

•LOCKED DOWN

•PROPRIETARY

•VENDOR CONTROLLED

THE RED HAT WAY

•SOFTWARE CENTRIC

•FLEXIBLE

•OPEN

•INNOVATIVE


ARCHITECTURAL HIGHLIGHTS

DESIGNED FOR THE NEW DATA LANDSCAPE – PETABYTE SCALE

LINEAR SCALABILITY – PERFORMANCE AND CAPACITY

HIGHLY AVAILABLE – WITHOUT BREAKING THE BANK

ACCESSIBLE – FROM ANY APPLICATION, THROUGH ANY DEVICE

OPEN AND INTEROPERABLE – STANDARDS AND OPEN TECHNOLOGIES

SELF-HEALING, SELF-MANAGING – REDUCING OPERATIONAL OVERHEAD

HYBRID DATACENTER FOUNDATION – PRIVATE, PUBLIC, AND HYBRID CLOUDS

EXTENSIBLE – INNOVATE TO MEET YOUR UNIQUE BUSINESS REQUIREMENTS

DESIGNED FOR TODAY’S IT ECONOMICS – DO MORE WITH WHAT YOU HAVE


SINGLE PLATFORM FOR ALL YOUR DATA

[Diagram: OPEN, SOFTWARE-DEFINED STORAGE PLATFORM on a SCALE-OUT STORAGE ARCHITECTURE. Labels: FILE SERVICES; OPEN OBJECT APIs; CONVERGED COMPUTE AND STORAGE; PERSISTENT DATA STORES; APPLICATION; BIG DATA; FILE; CLOUD; MOBILE; PHYSICAL; STANDARD X86; IDLE OR LEGACY; VIRTUAL; CLOUD; EBS]


ELASTIC SCALE-OUT ARCHITECTURE

APPS

GROW AS YOU GO

EXPAND TRANSPARENTLY

NO METADATA COMPLEXITY

CLIENTS

SCALE OUT NEW DATA CAPACITY

STORAGE

SINGLE, SHARED NAMESPACE

/mnt/point


HIGHLY AVAILABLE HERE OR THERE

SITE HERE SITE THERE

REMOTE SITE / DR


BRING APPLICATIONS CLOSER TO THE DATA

DATA LOCALITY

REDUCE FOOTPRINT

ISOLATE WORKLOAD

REMAIN SUPPORTED

STORAGE RESIDENT APPLICATIONS


FOUNDATION FOR BIG DATA

DIRECT DATA ACCESS

DON'T COPY YOUR DATA

FULL API COMPATIBILITY

SCALES AS NEEDED

BRING HADOOP TO DATA


HADOOP ON RHS ARCHITECTURE DIAGRAM


ENABLING HADOOP ON RHS VIA AMBARI

Ambari provides ease of installation and configuration of Hadoop

•The High Touch Beta (HTB) made it possible to install and configure Hadoop on GlusterFS

•Initial goals:

• Remove the tight coupling to HDFS, which created crippling dependencies between services

• Enable GlusterFS presence in the UI

• Enable minimal monitoring of GlusterFS


HDP STACKS

•Hadoop services are loaded into Ambari via the concept of a stack

•A stack consists of service definitions, RPMs, and configurations for Hadoop services

•Development enabled GlusterFS as a client service in a new HDP/RHS stack

• Client services have no control enablement (stop/start function) from the management console

• Client services cannot be monitored via Ganglia

• The GlusterFS client was deployed via standard Ambari agents and validated to be running properly for Hadoop installation


STACK SELECTION


SERVICES IN THE STACK

Stack 2.0.6.GlusterFS

•Contains all the core services of 2.0.6 with the addition of GlusterFS:

•GANGLIA, NAGIOS, GLUSTERFS •HBASE •HDFS •MAPREDUCE2 •OOZIE •SQOOP •YARN •HCATALOG •HIVE •PIG •WEBHCAT •ZOOKEEPER


EXPOSE GLUSTERFS OPTIONS FOR HADOOP

•Enabling the raw local filesystem

•Security enablement for multiple users

•Allow users in the Hadoop group to run on Hadoop without compromising others' data


DREAMWORKS ANIMATION WITH HADOOP ON RED HAT STORAGE

Diane Feddema

Principal Software Engineer

Red Hat


OVERVIEW

Introduction
● What does DreamWorks Animation want to know?

Using RHS to do big data analysis in place
● Cost-effective and uses existing infrastructure

What we found
● Some surprising findings

What we learned
● How to find the right questions and produce the appropriate log data for those questions


DREAMWORKS ANIMATION WEB SERVICE INFRASTRUCTURE

Render farm 2 with traffic servers

Render farm 3 with traffic servers

Render farm 1 with traffic servers


THE NATURE OF DREAMWORKS ANIMATION WEB SERVICE INFRASTRUCTURE DATA

Data volume:
• 1 ms granularity
• ~3.5 TB per week
• From 8 traffic servers
• Currently much of the data is discarded

Data analysis to date:
• Hourly analysis using Splunk, which is expensive and which DreamWorks Animation feels provides generic, limited answers to their questions
• There remain a number of outstanding questions about workflow and efficiencies


THE NATURE OF DREAMWORKS ANIMATION WEB SERVICE INFRASTRUCTURE DATA

The present log files provide:
• One record for each service request (and sub-service request)

Each record provides:
– Unix epoch timestamp and duration (ms) of a process
– Type of data being transferred (e.g. JSON, JPEG)
– Source, traffic, and destination IP information, etc.
– Code for movie name
– Payload size and payload headers, etc.
– Cache result: HIT or MISS
– Correlation ID: unique job/sub-job identifiers
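The record fields listed on this slide can be modeled as a small typed structure. The following is only a sketch: the slides do not show the actual log layout, so the tab-delimited format, the field order, the millisecond timestamp unit, and every name below (ServiceRecord, parse_record, and the field names) are assumptions made for illustration. Payload headers are left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class ServiceRecord:
    """One service (or sub-service) request. Layout is hypothetical."""
    timestamp_ms: int        # Unix epoch time, assumed to be in milliseconds
    duration_ms: int         # duration of the process in milliseconds
    content_type: str        # type of data transferred, e.g. "JSON" or "jpeg"
    source_ip: str
    traffic_server_ip: str
    destination_ip: str
    movie_code: str          # code for the movie name
    payload_bytes: int
    cache_result: str        # "TCP_HIT" or "TCP_MISS"
    correlation_id: str      # job/sub-job identifier


def parse_record(line: str) -> ServiceRecord:
    """Parse one tab-delimited line (an assumed format) into a record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 10:
        raise ValueError(f"expected 10 fields, got {len(fields)}")
    return ServiceRecord(
        timestamp_ms=int(fields[0]),
        duration_ms=int(fields[1]),
        content_type=fields[2],
        source_ip=fields[3],
        traffic_server_ip=fields[4],
        destination_ip=fields[5],
        movie_code=fields[6],
        payload_bytes=int(fields[7]),
        cache_result=fields[8],
        correlation_id=fields[9],
    )
```

Rejecting lines with an unexpected field count keeps malformed records from silently skewing later counts.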


THE DREAMWORKS ANIMATION QUESTIONS

DreamWorks Animation wanted to answer six questions about their web service infrastructure:

1. What is the service infrastructure HIT and MISS rate over time?
2. What does the dominant service call look like?
– What chain of sub-calls is most common?
3. How often are Correlation IDs reused?
4. How do production and non-production jobs compare on these questions?
– Some jobs run only to test the health of the system and are NOT production runs. Production and non-production runs are mixed together on these traffic servers at all times.
5. What is the average service call duration per traffic server?
6. How many non-zero-duration events are there for traffic server cache hits and misses?


QUESTION 1 What is the service infrastructure HIT and MISS rate over time?


TCP_HIT/TCP_MISS time series over 5 minutes


TCP_HIT/TCP_MISS time series over a day
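A HIT/MISS time series of this kind can be built by grouping events into fixed-width time buckets. The sketch below is plain Python, not the Hadoop job used for the talk; the function name, the (timestamp, result) input shape, the millisecond timestamps, and the 5-minute default bucket are assumptions.

```python
from collections import defaultdict


def hit_miss_series(events, bucket_ms=300_000):
    """Count TCP_HIT and TCP_MISS events per time bucket.

    events: iterable of (timestamp_ms, cache_result) pairs (assumed shape).
    Returns {bucket_start_ms: (hits, misses, hit_rate)}, ordered by time.
    """
    counts = defaultdict(lambda: [0, 0])  # bucket -> [hits, misses]
    for timestamp_ms, result in events:
        bucket = timestamp_ms - timestamp_ms % bucket_ms
        if result == "TCP_HIT":
            counts[bucket][0] += 1
        elif result == "TCP_MISS":
            counts[bucket][1] += 1
        # any other result code is ignored
    return {
        bucket: (hits, misses, hits / (hits + misses))
        for bucket, (hits, misses) in sorted(counts.items())
    }
```

Passing bucket_ms=3_600_000 gives hourly buckets, which suits a day-long series better than the 5-minute default.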


QUESTION 3 How often are Correlation IDs reused?


TCP_HIT correlation ID distribution

21,472,938 correlation IDs occur only 2 times each


TCP_MISS correlation ID distribution
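A distribution like the ones on these slides answers "how many correlation IDs occur exactly k times?". A minimal sketch of that two-step count follows; the function name is hypothetical and this is plain Python rather than the Hadoop job behind the charts.

```python
from collections import Counter


def reuse_distribution(correlation_ids):
    """Map each occurrence count k to the number of IDs seen exactly k times."""
    occurrences_per_id = Counter(correlation_ids)          # ID -> times seen
    ids_per_count = Counter(occurrences_per_id.values())   # k -> number of IDs
    return dict(sorted(ids_per_count.items()))
```

Running it once on the correlation IDs of TCP_HIT records and once on those of TCP_MISS records would give the two distributions separately.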


QUESTION 4 How do production and non-production jobs compare on these questions?


PRODUCTION run Correlation ID distributions


NON-PRODUCTION run Correlation ID distributions


Comparison of first 5 bins for Production/Non-production and HIT/MISS


QUESTION 5 What is the average service call duration per traffic server?

QUESTION 6 How many non-zero-duration events are there for traffic server cache hits and misses?
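Both questions reduce to a single pass over the records, grouping by traffic server. The sketch below assumes each event is a (traffic_server, cache_result, duration_ms) tuple; that shape and the function name are hypothetical, and the code is plain Python rather than the Hadoop job used for the talk.

```python
from collections import defaultdict


def duration_stats(events):
    """Summarize service call durations per traffic server.

    events: iterable of (traffic_server, cache_result, duration_ms) tuples.
    Returns (mean_duration_by_server, nonzero_count_by_server_and_result).
    """
    total = defaultdict(int)    # server -> summed duration in ms
    count = defaultdict(int)    # server -> number of events
    nonzero = defaultdict(int)  # (server, cache_result) -> non-zero events
    for server, result, duration_ms in events:
        total[server] += duration_ms
        count[server] += 1
        if duration_ms > 0:
            nonzero[(server, result)] += 1
    mean = {server: total[server] / count[server] for server in count}
    return mean, dict(nonzero)
```

Zero-duration events still count toward the mean here; whether they should be excluded depends on what a zero duration means in the logs, which the slides do not say.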


EVENTS BY TRAFFIC SERVER


PERCENT EVENTS BY TRAFFIC SERVER


Zero/Non-zero percentages by traffic server


TCP_HIT/TCP_MISS events per traffic server


WHAT WE LEARNED


FINDINGS

• Demonstrated the usefulness of Hadoop on RHS for analyzing log data in place.

• Found conditions that they did not know could happen
– Found “impossible” missing ID cases
– Single and double correlation ID events dominate, when they expected larger numbers of service call chains to dominate

• Need to make sure you have the right data for the questions asked, for example:
– Correlation ID data does not allow tracking how jobs are split


CONCLUSIONS AND FUTURE COLLABORATION

• Look at more time series questions, e.g. show farmid jobs and how they overlap in time.

• Develop a Hadoop-based dashboard to monitor the “health” of the system and create automated alerts.

• Rethink the questions that can be answered from existing data

• Need to be able to examine individual cases to determine whether some applications of the system are not using the API properly

• Consider adding/substituting information in the log file to provide more useful information


THANK YOU AND Q&A

Questions and Answers

Contact Information

•Jacob Shucart – jshucart@redhat.com

•Erin Boyd – eboyd@redhat.com

•Diane Feddema – [email protected]

Thank you!


Check Out Other Red Hat Storage Activities at The Summit

•Enter the raffle to win tickets for a $500 gift card or trip to LegoLand!

• Entry cards available in all storage sessions - the more you attend, the more chances you have to win!

•Talk to Storage Experts:

• Red Hat Booth (# 211)

• Infrastructure

• Infrastructure-as-a-Service

•Storage Partner Solutions Booth (# 605)

•Upstream Gluster projects

• Developer Lounge

Follow us on Twitter, Facebook: @RedHatStorage