Scalable Hadoop in the Cloud Johan Gustavsson
Jan 09, 2017

Page 1: Scalable Hadoop in the cloud

Scalable Hadoop in the Cloud

Johan Gustavsson

Page 2: Scalable Hadoop in the cloud

WHO AM I?

• Johan Gustavsson (ヨハン)

• Some contributions in Hadoop, Hive…

• Software Engineer at Treasure Data, Inc.

• Hadoop Team

Page 3: Scalable Hadoop in the cloud

HIGH LEVEL OVERVIEW

Page 4: Scalable Hadoop in the cloud

CONTENT

• Basic architecture:

• Storage

• Replacing a Hadoop Cluster

• Basic job flow with Plazma

• Overview Basic job execution

• Isolation (JobClient)

• Isolation (In cluster)

• Architecture changes (PTD):

• What is PTD?

• Multiple Hadoop Versions

• Multiple Version Job submission

Page 5: Scalable Hadoop in the cloud

BASIC ARCHITECTURE

Page 6: Scalable Hadoop in the cloud

STORAGE (PLAZMA)

• Time indexed database (hourly partitioned)
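
As a minimal sketch of what hourly partitioning means, assuming each record carries a Unix timestamp and a partition is identified by the start of its hour (both assumptions, the slides do not describe Plazma's actual key layout):

```python
from typing import List

SECONDS_PER_HOUR = 3600


def hourly_partition(unix_time: int) -> int:
    """Return the start (Unix seconds) of the hour that contains unix_time."""
    return unix_time - (unix_time % SECONDS_PER_HOUR)


def partitions_for_range(start: int, end: int) -> List[int]:
    """List the hourly partitions overlapping the half-open range [start, end)."""
    if end <= start:
        return []
    first = hourly_partition(start)
    last = hourly_partition(end - 1)
    return list(range(first, last + SECONDS_PER_HOUR, SECONDS_PER_HOUR))
```

A time-range query then only needs to touch the partitions returned by `partitions_for_range`, which is the usual benefit of a time index.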

Page 7: Scalable Hadoop in the cloud

STORAGE (PLAZMA)

• Metadata in Postgres

• Data in mpc1 files on S3 (columnar format file with schema on read)

Page 8: Scalable Hadoop in the cloud

STORAGE (PLAZMA)

• A write will create the files and write them to S3

• Then commit by writing metadata to Postgres
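
The write-then-commit order above can be sketched as follows. `FakeObjectStore` and `FakeMetadataDB` are hypothetical in-memory stand-ins for S3 and Postgres, and the key layout is invented for illustration; this is not Plazma's code.

```python
import uuid


class FakeObjectStore:
    """In-memory stand-in for S3 (hypothetical, illustration only)."""

    def __init__(self):
        self.objects = {}

    def put(self, key, data):
        self.objects[key] = data


class FakeMetadataDB:
    """In-memory stand-in for the Postgres metadata (hypothetical)."""

    def __init__(self):
        self.rows = []

    def commit(self, table, hour, keys):
        # In a real database this would be a single transaction.
        self.rows.extend((table, hour, key) for key in keys)

    def visible_keys(self, table):
        return [key for (t, _hour, key) in self.rows if t == table]


def write_partition(store, meta, table, hour, chunks):
    """Upload every file first, then make them visible with one metadata commit."""
    keys = []
    for chunk in chunks:
        key = f"{table}/{hour}/{uuid.uuid4().hex}"  # assumed key layout
        store.put(key, chunk)
        keys.append(key)
    meta.commit(table, hour, keys)
    return keys
```

Under this ordering, a writer that fails before the commit leaves only unreferenced files in the object store; readers that go through the metadata never see a partial write.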

Page 9: Scalable Hadoop in the cloud

REPLACING A HADOOP CLUSTER

Page 12: Scalable Hadoop in the cloud

BASIC JOB FLOW WITH PLAZMA

• The first job reads its input from Plazma

• Shuffle uses local disk, as usual

Page 13: Scalable Hadoop in the cloud

BASIC JOB FLOW WITH PLAZMA

• Output of the first job is written to HDFS

Page 14: Scalable Hadoop in the cloud

BASIC JOB FLOW WITH PLAZMA

• Second job reads from HDFS

Page 15: Scalable Hadoop in the cloud

BASIC JOB FLOW WITH PLAZMA

• The final job in the DAG writes to HDFS, then the data is downloaded to a result bucket on S3

Page 16: Scalable Hadoop in the cloud

BASIC JOB FLOW WITH PLAZMA

• In the case of INSERT, data is written directly to a table in Plazma

Page 17: Scalable Hadoop in the cloud

OVERVIEW BASIC JOB EXECUTION

Page 22: Scalable Hadoop in the cloud

ISOLATION (JOBCLIENT)

• Worker builds command-line options from Java properties

• Runs QueryRunner as a subprocess
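
A minimal sketch of this step, assuming the properties are passed as standard JVM `-Dkey=value` options. The property names, the `java` executable path, and the way `QueryRunner` is located on the classpath are placeholders, not details from the talk.

```python
import subprocess


def build_query_runner_command(properties, main_class="QueryRunner", java="java"):
    """Turn a dict of Java properties into a JVM command line."""
    cmd = [java]
    for key in sorted(properties):  # sorted for a stable, comparable command
        cmd.append(f"-D{key}={properties[key]}")
    cmd.append(main_class)
    return cmd


def run_query(properties):
    """Run the job client in its own process so a failure stays contained."""
    cmd = build_query_runner_command(properties)
    return subprocess.run(cmd, check=False).returncode
```

Passing the command as a list (rather than a shell string) avoids quoting problems when a property value contains spaces.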

Page 23: Scalable Hadoop in the cloud

ISOLATION (JOBCLIENT)

• UDFs used in the query are enabled

• By executing CREATE TEMPORARY FUNCTION

• Databases/tables from PlazmaDB are added to the Metastore

• By executing CREATE DATABASE/TABLE
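
The per-job setup above could look roughly like the sketch below, which only generates the Hive statements. The function names and the input shapes are assumptions, and any storage-related clauses a real table definition would need are left out.

```python
def create_function_stmt(name, class_name):
    """Hive statement registering a UDF for the current session only."""
    return f"CREATE TEMPORARY FUNCTION {name} AS '{class_name}'"


def create_database_stmt(db):
    return f"CREATE DATABASE IF NOT EXISTS `{db}`"


def create_table_stmt(db, table, columns):
    """columns is a list of (name, hive_type) pairs."""
    cols = ", ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    return f"CREATE TABLE IF NOT EXISTS `{db}`.`{table}` ({cols})"


def setup_statements(udfs, schema):
    """Build all setup statements for one job.

    udfs:   {function_name: java_class_name}
    schema: {database: {table: [(column, hive_type), ...]}}
    """
    stmts = [create_function_stmt(n, c) for n, c in sorted(udfs.items())]
    for db, tables in sorted(schema.items()):
        stmts.append(create_database_stmt(db))
        for table, columns in sorted(tables.items()):
            stmts.append(create_table_stmt(db, table, columns))
    return stmts
```

Running these statements for every job is one plausible reason the setup cost listed under "The bad" is noticeable.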

Page 25: Scalable Hadoop in the cloud

ISOLATION (JOBCLIENT)

• The good:

• High level of isolation

• An OOM in one job's subprocess does not affect other jobs

• The bad:

• Job setup costs are a bit high

Page 26: Scalable Hadoop in the cloud

ISOLATION (IN CLUSTER)

• Using Hadoop resource pools:

• One resource pool per account (not counting sub-pools)

• Max and min running containers are set based on the price plan

• Currently 6711 pools in production

Page 27: Scalable Hadoop in the cloud

ISOLATION (IN CLUSTER)

• The good part:

• Relatively low cost to guarantee minimum resources

• Jobs can still burst to max if resources are free in the cluster
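
The combination of a guaranteed minimum and bursting up to a maximum can be illustrated with a simplified model. This is not the Hadoop scheduler's algorithm; the plan names, the limits, and the assumption that the minimum is always available to the pool are all invented for the example.

```python
from dataclasses import dataclass

# Hypothetical price plans mapped to (min, max) running containers.
PLAN_LIMITS = {"small": (2, 10), "large": (8, 40)}


@dataclass
class Pool:
    min_containers: int
    max_containers: int
    running: int = 0


def pool_for_plan(plan):
    low, high = PLAN_LIMITS[plan]
    return Pool(min_containers=low, max_containers=high)


def grantable(pool, demand, free_in_cluster):
    """How many more containers the pool may start right now.

    Up to the minimum is treated as guaranteed; anything beyond that comes
    only from free cluster capacity and never exceeds the maximum.
    """
    guaranteed = max(0, min(demand, pool.min_containers - pool.running))
    ceiling = max(0, min(demand, pool.max_containers - pool.running))
    burst = min(ceiling - guaranteed, free_in_cluster)
    return guaranteed + burst
```

With an idle cluster a "small" pool asking for 20 containers gets 10; with a full cluster it still gets its 2.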

Page 28: Scalable Hadoop in the cloud

ISOLATION (IN CLUSTER)

• The bad parts:

• Too many pools means cluster separation is needed

• The ResourceManager tends to get slow with too many pools

• Some unsafe UDFs need to be disabled

• java_method()

• reflect()
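
One way to picture disabling these UDFs is a check that rejects queries calling them. The sketch below is a naive regular-expression scan, assumed for illustration; the slides do not say how the functions are actually disabled, and a text scan like this would also flag matches inside string literals or comments.

```python
import re

# The two functions named on the slide.
DISABLED_UDFS = ("java_method", "reflect")

_CALL_PATTERN = re.compile(
    r"\b(" + "|".join(DISABLED_UDFS) + r")\s*\(",
    re.IGNORECASE,
)


def disabled_udfs_in(query):
    """Return the sorted names of disabled UDFs that the query appears to call."""
    return sorted({m.group(1).lower() for m in _CALL_PATTERN.finditer(query)})


def check_query(query):
    """Raise ValueError if the query uses a disabled UDF."""
    found = disabled_udfs_in(query)
    if found:
        raise ValueError("disabled UDFs used: " + ", ".join(found))
```

Both functions let a query invoke arbitrary Java methods inside a shared cluster, which is why they are unsafe in a multi-tenant setup.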

Page 29: Scalable Hadoop in the cloud

ARCHITECTURE CHANGES (PTD)

Page 36: Scalable Hadoop in the cloud

WHAT IS PTD?

• Patchset Treasure Data

• Name first coined by @frsyuki and @tagomoris

• Still in development

• Original plan

• Base all internal Hadoop components on the latest community edition

• Simplify releases to keep the version as current as possible

• What it’s turning into

• A complete overhaul of most things related to Hadoop

Page 37: Scalable Hadoop in the cloud

MULTIPLE HADOOP VERSIONS

Page 40: Scalable Hadoop in the cloud

MULTIPLE HADOOP VERSIONS

• The default version is changed by changing settings in a data bag
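
A data bag item is a JSON document, so the idea can be sketched as below. The keys `installed_versions` and `default_version` and the version numbers are hypothetical; only the `id` field is something every data bag item has.

```python
import json

# Hypothetical data bag item; keys and versions are illustrative only.
DATA_BAG_ITEM = json.loads("""
{
  "id": "hadoop",
  "installed_versions": ["2.6.0", "2.7.3"],
  "default_version": "2.6.0"
}
""")


def resolve_version(item, requested=None):
    """Use the requested version if it is installed, else the data bag default."""
    if requested is None:
        return item["default_version"]
    if requested not in item["installed_versions"]:
        raise ValueError(f"version {requested} is not installed")
    return requested
```

Switching the default for every job that does not ask for a specific version is then a one-field change in the data bag.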

Page 41: Scalable Hadoop in the cloud

MULTIPLE VERSION JOB SUBMISSION

Page 43: Scalable Hadoop in the cloud

ELEPHANT SERVER

• Provides a REST API for job submission and monitoring

• All Hive/Pig related code separated from the generic worker

• Distributed in-memory queue for managing job progress

Page 44: Scalable Hadoop in the cloud

ELEPHANT SERVER

• Built to support multiple versions of Hadoop/Hive/Pig…

Page 45: Scalable Hadoop in the cloud

ELEPHANT SERVER

• This could lead to the following long-term solution

Page 46: Scalable Hadoop in the cloud

JOB PRESERVING RESTARTS

• The worker polls job status from the local server

Page 47: Scalable Hadoop in the cloud

JOB PRESERVING RESTARTS

• A new instance of the server is started; it joins the Hazelcast cluster and repeatedly tries to start its REST server

• The old instance goes into shutdown mode (it starts no new jobs but keeps current ones running)

Page 48: Scalable Hadoop in the cloud

JOB PRESERVING RESTARTS

• Newly submitted jobs are popped and managed by the new instance

Page 49: Scalable Hadoop in the cloud

JOB PRESERVING RESTARTS

• Once all jobs running on the old instance have finished, one way or another, it shuts down
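
The old instance's shutdown mode can be sketched as a small state machine. `JobServer` and its method names are hypothetical; the real server coordinates through the Hazelcast cluster, which is not modeled here.

```python
class JobServer:
    """Toy model of an instance that drains its jobs before stopping."""

    def __init__(self):
        self.accepting = True
        self.running = set()
        self.stopped = False

    def submit(self, job_id):
        """Accept a job unless the instance is in shutdown mode."""
        if not self.accepting:
            return False
        self.running.add(job_id)
        return True

    def begin_shutdown(self):
        """Stop taking new jobs; jobs already running are kept."""
        self.accepting = False
        self._stop_if_drained()

    def finish(self, job_id):
        """Record that a job ended, whether it succeeded or failed."""
        self.running.discard(job_id)
        self._stop_if_drained()

    def _stop_if_drained(self):
        if not self.accepting and not self.running:
            self.stopped = True
```

The instance only stops once it is both in shutdown mode and has no running jobs, so a restart never kills a job in flight.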

Page 50: Scalable Hadoop in the cloud

JOB PRESERVING RESTARTS

• Once the port is freed, the new instance starts its REST API
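
The port handoff relies on the new instance retrying until the old one releases the port. A minimal sketch with plain TCP sockets, assuming both instances run on the same host and bind without address-reuse options; the retry count and delay are arbitrary.

```python
import socket
import time


def try_bind(port, host="127.0.0.1"):
    """Return a listening socket on port, or None if the port is taken."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen()
    except OSError:
        sock.close()
        return None
    return sock


def wait_for_port(port, attempts=50, delay=0.1):
    """Keep retrying until the old instance has released the port."""
    for _ in range(attempts):
        sock = try_bind(port)
        if sock is not None:
            return sock
        time.sleep(delay)
    raise TimeoutError(f"port {port} was not released in time")
```

Because the worker polls a fixed local port, it needs no reconfiguration: it simply starts talking to whichever instance currently owns that port.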

Page 51: Scalable Hadoop in the cloud

https://www.treasuredata.com/