Scalable Hadoop in the cloud

Johan Gustavsson

WHO AM I?

• Johan Gustavsson（ヨハン）

• Some contributions in Hadoop, Hive…

• Software Engineer at Treasure Data, Inc.

• Hadoop Team

HIGH LEVEL OVERVIEW

CONTENT

• Basic architecture:

• Storage

• Replacing a Hadoop Cluster

• Basic job flow with Plazma

• Overview Basic job execution

• Isolation (JobClient)

• Isolation (In cluster)

• Architecture changes (PTD):

• What is PTD?

• Multiple Hadoop Versions

• Multiple Version Job submission

BASIC ARCHITECTURE

STORAGE (PLAZMA)

• Time indexed database (hourly partitioned)

• Metadata in Postgres

• Data in mpc1 files on S3 (columnar format file with schema on read)

• A write will create the files and write them to S3

• Then commit by writing metadata to Postgres
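The write-then-commit order above can be sketched as follows. This is only an illustration under stated assumptions: `FakeObjectStore` and `FakeMetadataStore` are hypothetical in-memory stand-ins for S3 and Postgres, and the key layout is invented for the example. It is not Plazma's actual code. The point it shows is that files become visible to readers only after the metadata commit.

```python
import uuid


class FakeObjectStore:
    """Hypothetical in-memory stand-in for S3."""

    def __init__(self):
        self.objects = {}

    def put(self, key, data):
        self.objects[key] = data


class FakeMetadataStore:
    """Hypothetical in-memory stand-in for the Postgres metadata."""

    def __init__(self):
        self.committed = []  # list of (table, hour, key)

    def commit(self, rows):
        # A real store would insert all rows in a single transaction.
        self.committed.extend(rows)

    def visible_keys(self, table, hour):
        return [k for (t, h, k) in self.committed if t == table and h == hour]


def write_partition(store, meta, table, hour, chunks):
    """Upload every file first, then commit the metadata in one step."""
    keys = []
    for chunk in chunks:
        key = f"{table}/{hour}/{uuid.uuid4().hex}"  # assumed key layout
        store.put(key, chunk)
        keys.append(key)
    # Readers look up files through the metadata, so nothing is visible
    # until this commit succeeds.
    meta.commit([(table, hour, k) for k in keys])
    return keys
```

If an upload fails partway through, the commit is never reached, so readers never see a half-written partition; the orphaned files can be cleaned up later.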

REPLACING A HADOOP CLUSTER



BASIC JOB FLOW WITH PLAZMA

• Job one runs reading from Plazma

• Shuffle uses local disk, as usual


• Output of the first job is written to HDFS


• Second job reads from HDFS


• Final job in the DAG writes to HDFS, then the data is downloaded to a result bucket on S3


• In case of INSERT, data is written directly to a table in Plazma

OVERVIEW BASIC JOB EXECUTION





ISOLATION (JOBCLIENT)

• Worker builds command line options with Java properties

• Runs QueryRunner as a subprocess


• UDFs used in the query are enabled

• By executing CREATE TEMPORARY FUNCTION

• Add databases/tables from PlazmaDB to Metastore

• By executing CREATE DATABASE/TABLE



• The good:

• High level of isolation

• An OOM in one job does not affect other jobs

• The bad:

• Job setup costs are a bit high
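A rough sketch of the JobClient isolation described above: the worker turns job settings into `-D` Java properties, prepares the setup statements, and runs the query in its own JVM. The property names, the classpath, and the exact statements are assumptions made for illustration; only the `QueryRunner` name and the `CREATE TEMPORARY FUNCTION` / `CREATE DATABASE` statements come from the slides.

```python
import subprocess


def build_command(main_class, properties, classpath):
    """Build a java command line that passes settings as -D properties."""
    cmd = ["java", "-cp", classpath]
    for key, value in sorted(properties.items()):
        cmd.append(f"-D{key}={value}")
    cmd.append(main_class)
    return cmd


def setup_statements(udfs, databases):
    """Statements run before the query: enable UDFs, register databases."""
    stmts = [
        f"CREATE TEMPORARY FUNCTION {name} AS '{cls}'"
        for name, cls in sorted(udfs.items())
    ]
    stmts += [f"CREATE DATABASE IF NOT EXISTS {db}" for db in sorted(databases)]
    return stmts


def run_query(cmd):
    """Run the job client as a subprocess and return its exit code.

    A crash or OOM in the child JVM ends only this process, not the worker.
    """
    return subprocess.run(cmd, check=False).returncode
```

Starting a JVM and replaying the setup statements for every job is where the setup cost mentioned above comes from.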

ISOLATION (IN CLUSTER)

• Using Hadoop resource pools:

• 1 account 1 resource pool (not counting sub-pools)

• Max and min running containers are set based on the price plan

• Currently 6711 pools in production

• The good part:

• Relatively low cost to guarantee minimum resources

• Jobs can still burst to max if resources are free in the cluster
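The minimum guarantee and burst behaviour can be illustrated with a simplified model. This is not the Hadoop scheduler's actual algorithm; the function, its inputs, and the round-robin hand-out are assumptions chosen to keep the sketch short. It also assumes the sum of all minimums fits within the cluster capacity.

```python
def allocate(pools, demand, capacity):
    """Toy container allocation.

    pools:    name -> (min_containers, max_containers)
    demand:   name -> containers the pool currently wants
    capacity: total containers in the cluster
    """
    alloc = {}
    # Pass 1: every pool gets its guaranteed minimum (or less if idle).
    for name, (lo, _hi) in pools.items():
        alloc[name] = min(lo, demand.get(name, 0))
    free = capacity - sum(alloc.values())

    # Pass 2: hand out free containers one at a time, up to each pool's max.
    progress = True
    while free > 0 and progress:
        progress = False
        for name, (_lo, hi) in pools.items():
            if free == 0:
                break
            want = min(hi, demand.get(name, 0))
            if alloc[name] < want:
                alloc[name] += 1
                free -= 1
                progress = True
    return alloc
```

For example, with pools `{"a": (2, 10), "b": (2, 4)}` and capacity 8, pool `a` can use all 8 containers while `b` is idle, but is held to 5 once `b` asks for 3.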

• The bad parts:

• Too many pools means cluster separation is needed

• The ResourceManager tends to get slow with too many pools

• Some unsafe UDFs need to be disabled

• java_method()

• reflect()
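One simple way to picture disabling these functions is a deny-list check before a query is submitted. The slides do not say how the UDFs are actually disabled, so the sketch below is purely an assumption, and the regex match is naive (it would also flag a function name that appears inside a string literal).

```python
import re

# Function names taken from the list above.
DENIED = {"java_method", "reflect"}


def denied_functions(query):
    """Return the denied function names that appear as calls in the query."""
    calls = re.findall(r"\b([A-Za-z_][A-Za-z0-9_]*)\s*\(", query)
    return sorted({c.lower() for c in calls} & DENIED)
```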

ARCHITECTURE CHANGES (PTD)

WHAT IS PTD?

• Patchset Treasure Data

• Name first coined by these two: @frsyuki @tagomoris

• Still in development

• Original plan

• Base all internal Hadoop components on latest community edition

• Simplify releases to stay on as current a version as possible

• What it’s turning into

• A complete overhaul of most things related to Hadoop

MULTIPLE HADOOP VERSIONS




• The default version is changed by changing settings in a data bag

MULTIPLE VERSION JOB SUBMISSION

ELEPHANT SERVER

• Provides REST API for job submission and monitoring

• All Hive/Pig related code separated from the generic worker

• Distributed in-memory queue managing job progress

• Built to support multiple versions of Hadoop/Hive/Pig…

• This could lead to the following long-term solution

JOB PRESERVING RESTARTS

• The worker polls job status from the local server


• A new instance of the server is started, joins the Hazelcast cluster, and repeatedly tries to start the REST server

• The old instance goes into shutdown mode (not starting new jobs but keeping current ones running)


• Newly submitted jobs are popped and managed by the new instance


• Once all jobs running on the old instance have finished one way or another, it shuts down


• Once the port is free, the new instance starts the REST API
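The port handover in the last steps can be sketched as a retry loop: the new instance keeps trying to bind the REST port and succeeds only after the old instance has shut down and released it. The function names, the retry interval, and the timeout are assumptions for illustration; the Hazelcast cluster and the job queue are left out.

```python
import socket
import time


def try_bind(port, host="127.0.0.1"):
    """Try to listen on the port; return the socket, or None if it is taken."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen()
        return sock
    except OSError:
        sock.close()
        return None


def wait_for_port(port, interval=0.05, timeout=5.0):
    """Retry until the old instance releases the port, then return the socket."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        sock = try_bind(port)
        if sock is not None:
            return sock
        time.sleep(interval)
    raise TimeoutError(f"port {port} was not released within {timeout}s")
```

During the overlap the worker keeps polling the old instance, which still owns the port, so already running jobs are tracked until they finish.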

https://www.treasuredata.com/
