
Introduction to Apache Kudu

Apr 16, 2017


Jeff Holoman
Transcript
Page 1: Introduction to Apache Kudu

© Cloudera, Inc. All rights reserved.

Introduction to Kudu: Hadoop Storage for Fast Analytics on Fast Data
Jeff Holoman
Sr. Systems Engineer

Page 2: Introduction to Apache Kudu

Agenda

What is Kudu? (Motivations & Design Goals)
Use Cases
Overview of Design & Internals
Simple Benchmark
Status & Getting Started

Page 3: Introduction to Apache Kudu


What is Kudu?

Page 4: Introduction to Apache Kudu


But First….

Page 5: Introduction to Apache Kudu

Excels at…
• Efficiently scanning large amounts of data
• Accumulating data with high throughput
• Multiple SQL options
• All processing engines

However…
• Single-row access is problematic
• Mutation is problematic
• “Fast Data” access is problematic

GFS paper published in 2003!

Page 6: Introduction to Apache Kudu

Excels at…
• Efficiently finding and writing individual rows
• Accumulating data with high throughput

However…
• Scans are problematic
• High-cardinality access is problematic
• SQL support is so-so due to the above

Bigtable paper published in 2006!

Page 7: Introduction to Apache Kudu


Page 8: Introduction to Apache Kudu


In 2006…

DID NOT EXIST!

Page 9: Introduction to Apache Kudu

[Chart: RAM / GB; y-axis from $0 to $200]

Page 10: Introduction to Apache Kudu


Page 11: Introduction to Apache Kudu

Today…
Changing Hardware Landscape

• Spinning disk -> solid state storage
  • NAND flash: up to 450k read and 250k write IOPS, about 2GB/sec read and 1.5GB/sec write throughput, at a price of less than $3/GB and dropping
  • 3D XPoint memory (1000x faster than NAND, cheaper than RAM)
• RAM is cheaper and more abundant
  • 64->128->256GB over last few years

Takeaway 1: The next bottleneck is CPU, and current storage systems weren’t designed with CPU efficiency in mind.

Takeaway 2: Column stores are feasible for random access.

Page 12: Introduction to Apache Kudu


Current Storage Landscape in Hadoop

Gaps exist when these properties are needed simultaneously

The Hadoop Storage “Gap”

Page 13: Introduction to Apache Kudu

The Kudu Elevator Pitch
Storage for Fast Analytics on Fast Data

• New updating column store for Hadoop
• Simplifies the architecture for building analytic applications on changing data
• Designed for fast analytic performance
• Natively integrated with Hadoop
• Apache-licensed open source (with pending ASF Incubator proposal)
• Beta now available

[Diagram: platform stack]
Data integration & storage: Filesystem (HDFS), Relational (Kudu), NoSQL (HBase)
Ingest: Sqoop, Flume, Kafka
Security: Sentry
Resource management: YARN
Unified data services:
• Data engineering: Batch (Spark, Hive, Pig), Stream (Spark)
• Data discovery & analytics: SQL (Impala), Search (Solr), Model (Spark)
• Data apps: Online (HBase)

Page 14: Introduction to Apache Kudu

Kudu Design Goals

• High throughput for big scans
• Low latency for random accesses
• High CPU performance to better take advantage of RAM and Flash
  • Single-column scan rate 10-100x faster than HBase
• High IO efficiency
  • True column store with type-specific encodings
  • Efficient analytics when only certain columns are accessed
• Expressive and evolvable data model
• Architecture that supports multi-data center operation

Page 15: Introduction to Apache Kudu


Page 16: Introduction to Apache Kudu

Using Kudu

• Table has a SQL-like schema
  • Finite number of columns (unlike HBase/Cassandra)
  • Types: BOOL, INT8, INT16, INT32, INT64, FLOAT, DOUBLE, STRING, BINARY, TIMESTAMP
  • Some subset of columns makes up a possibly-composite primary key
  • Fast ALTER TABLE
• Java and C++ “NoSQL” style APIs
  • Insert(), Update(), Delete(), Scan()
• Integrations with MapReduce, Spark, and Impala
  • more to come!

Page 17: Introduction to Apache Kudu

What Kudu is *NOT*

• Not a SQL interface itself
  • It’s just the storage layer – “Bring Your Own SQL” (e.g. Impala or Spark)
• Not an application that runs on HDFS
  • It’s an alternative, native Hadoop storage engine
  • Colocation with HDFS is expected
• Not a replacement for HDFS or HBase
  • Select the right storage for the right use case
  • Cloudera will support and invest in all three

Page 18: Introduction to Apache Kudu


Use Cases for Kudu

Page 19: Introduction to Apache Kudu

Kudu Use Cases

Kudu is best for use cases requiring a simultaneous combination of sequential and random reads and writes, e.g.:

● Time Series
  ○ Examples: Stream market data; fraud detection & prevention; risk monitoring
  ○ Workload: Inserts, updates, scans, lookups
● Machine Data Analytics
  ○ Examples: Network threat detection
  ○ Workload: Inserts, scans, lookups
● Online Reporting
  ○ Examples: ODS
  ○ Workload: Inserts, updates, scans, lookups

Page 20: Introduction to Apache Kudu

Industry Examples

Financial Services
• Streaming market data
• Real-time fraud detection & prevention
• Risk monitoring

Retail
• Real-time offers
• Location-based targeting

Public Sector
• Geospatial monitoring
• Risk and threat detection (real time)

Page 21: Introduction to Apache Kudu

Real-Time Analytics in Hadoop Today
Fraud Detection in the Real World = Storage Complexity

Considerations:
● How do I handle failure during this process?
● How often do I reorganize incoming streaming data into a format appropriate for reporting?
● When reporting, how do I see data that has not yet been reorganized?
● How do I ensure that important jobs aren’t interrupted by maintenance?

[Diagram: hybrid HBase + Parquet pipeline]
• Incoming Data (Messaging System) is written to HBase (New Partition)
• Have we accumulated enough data?
• Reorganize HBase file into Parquet
  • Wait for running operations to complete
  • Define new Impala partition referencing the newly written Parquet file
• Parquet File holds the Most Recent Partition and Historic Data
• Reporting Request is served by Impala on HDFS

Page 22: Introduction to Apache Kudu

Real-Time Analytics in Hadoop with Kudu
Simpler Architecture, Superior Performance over Hybrid Approaches

[Diagram: Incoming Data (Messaging System); Impala on Kudu; Reporting Request]

Page 23: Introduction to Apache Kudu


Design & Internals

Page 24: Introduction to Apache Kudu

Kudu Basic Design

• Typed storage
• Basic Construct: Tables
  • Tables broken down into Tablets (roughly equivalent to partitions)
• Maintains consistency through a Paxos-like quorum model (Raft)
• Architecture supports geographically disparate, active/active systems

Page 25: Introduction to Apache Kudu

Columnar Data Store

A   B   C
A1  B1  C1
A2  B2  C2
A3  B3  C3

Row-Based Storage: A1 B1 C1 A2 B2 C2 A3 B3 C3

Columnar Storage: A1 A2 A3 B1 B2 B3 C1 C2 C3
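
The two orderings on this slide can be reproduced with a short Python sketch. It only illustrates the cell ordering, not Kudu's on-disk format, and the function names are made up for the example.

```python
# Illustrative only: flatten a small table row-wise and column-wise.
rows = [
    ("A1", "B1", "C1"),
    ("A2", "B2", "C2"),
    ("A3", "B3", "C3"),
]

def row_layout(table):
    """Store the cells of each row next to each other."""
    return [cell for row in table for cell in row]

def column_layout(table):
    """Store the cells of each column next to each other."""
    return [cell for column in zip(*table) for cell in column]

def scan_column(flat_columns, n_rows, col_index):
    """With a columnar layout, one column is a single contiguous slice."""
    start = col_index * n_rows
    return flat_columns[start:start + n_rows]
```

Reading column B from the columnar layout touches three adjacent cells, while the row layout would need to skip over A and C cells for every row.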

Page 26: Introduction to Apache Kudu

Tables and Tablets

• Table is horizontally partitioned into tablets
  • Range or hash partitioning
  • PRIMARY KEY (host, metric, timestamp) DISTRIBUTE BY HASH(timestamp) INTO 100 BUCKETS
• Each tablet has N replicas (3 or 5), with Raft consensus
  • Allow read from any replica, plus leader-driven writes with low MTTR
• Tablet servers host tablets
  • Store data on local disks (no HDFS)
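
As a rough illustration of hash partitioning into 100 buckets, the sketch below maps a timestamp to a bucket index. The slides do not say which hash function Kudu uses, so `zlib.crc32` is a stand-in, and `bucket_for` and `spread` are hypothetical helper names.

```python
import zlib

NUM_BUCKETS = 100  # from "INTO 100 BUCKETS" on the slide

def bucket_for(timestamp: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Map the hashed column value to a bucket (tablet) index.

    crc32 is an assumption; the deck does not name Kudu's hash function.
    """
    data = timestamp.to_bytes(8, "big", signed=True)
    return zlib.crc32(data) % num_buckets

def spread(timestamps, num_buckets: int = NUM_BUCKETS):
    """Count how many rows land in each bucket."""
    counts = [0] * num_buckets
    for ts in timestamps:
        counts[bucket_for(ts, num_buckets)] += 1
    return counts
```

Because consecutive timestamps hash to unrelated buckets, a burst of recent writes is spread across tablets instead of landing on a single one.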


Page 27: Introduction to Apache Kudu

Tables and Tablets (2)

CREATE TABLE customers (
  first_name STRING NOT NULL,
  last_name STRING NOT NULL,
  order_count INT32,
  PRIMARY KEY (last_name, first_name),
)

Specifying the split rows as (("b", ""), ("c", ""), ("d", ""), .., ("z", "")) (25 split rows total) will result in the creation of 26 tablets, with each tablet containing a range of customer surnames all beginning with a given letter. This is an effective partition schema for a workload where customers are inserted and updated uniformly by last name, and scans are typically performed over a range of surnames.
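
A minimal sketch of how 25 split rows produce 26 key ranges, assuming lowercase surnames and plain tuple ordering on (last_name, first_name). `tablet_index` is a hypothetical helper, not part of any Kudu API.

```python
import bisect
import string

# 25 split rows: ("b", ""), ("c", ""), ..., ("z", "")
SPLIT_ROWS = [(letter, "") for letter in string.ascii_lowercase[1:]]

def tablet_index(last_name: str, first_name: str) -> int:
    """Return the index (0..25) of the range that contains the row.

    Range i covers keys >= SPLIT_ROWS[i - 1] and < SPLIT_ROWS[i];
    range 0 has no lower bound and range 25 has no upper bound.
    """
    return bisect.bisect_right(SPLIT_ROWS, (last_name, first_name))
```

A scan over surnames from "baker" to "brown" stays inside range 1, which is why this schema suits range scans by surname.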

Page 28: Introduction to Apache Kudu


Client

Meta Cache

Page 29: Introduction to Apache Kudu

Client

Hey Master! Where is the row for ‘[email protected]’ in table “T”?

Meta Cache

Page 30: Introduction to Apache Kudu


Client

Hey Master! Where is the row for ‘[email protected]’ in table “T”?

It’s part of tablet 2, which is on servers {Z,Y,X}. BTW, here’s info on other tablets you might care about: T1, T2, T3, …

Meta Cache

Page 31: Introduction to Apache Kudu


Client

Hey Master! Where is the row for ‘[email protected]’ in table “T”?

It’s part of tablet 2, which is on servers {Z,Y,X}. BTW, here’s info on other tablets you might care about: T1, T2, T3, …

Meta Cache
T1: …
T2: …
T3: …

Page 32: Introduction to Apache Kudu


Client

Hey Master! Where is the row for ‘[email protected]’ in table “T”?

It’s part of tablet 2, which is on servers {Z,Y,X}. BTW, here’s info on other tablets you might care about: T1, T2, T3, …

UPDATE [email protected] SET …

Meta Cache
T1: …
T2: …
T3: …
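
The exchange on these slides can be sketched as a client that asks the master once and then answers later lookups from its meta cache. `FakeMaster`, `Client`, the method names, and the start-key layout are all hypothetical and chosen for the example; the real client API is not shown in the deck.

```python
import bisect

class FakeMaster:
    """Hypothetical stand-in for the master's tablet directory."""

    def __init__(self, tablets):
        # tablets: list of (start_key, tablet_id, servers), sorted by start_key
        self.tablets = tablets
        self.calls = 0

    def get_table_locations(self, table):
        self.calls += 1
        return list(self.tablets)


class Client:
    def __init__(self, master):
        self.master = master
        self.meta_cache = {}  # table name -> cached tablet list

    def locate(self, table, key):
        """Return (tablet_id, servers); contact the master only on a cache miss."""
        if table not in self.meta_cache:
            self.meta_cache[table] = self.master.get_table_locations(table)
        tablets = self.meta_cache[table]
        starts = [start for start, _, _ in tablets]
        index = bisect.bisect_right(starts, key) - 1
        _, tablet_id, servers = tablets[index]
        return tablet_id, servers
```

After the first lookup, the UPDATE can be sent straight to the tablet's servers without another round trip to the master.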

Page 33: Introduction to Apache Kudu

Tablet Design

• Inserts buffered in an in-memory store (like HBase’s memstore)
  • Flushed to disk
  • Columnar layout, similar to Apache Parquet
• Updates use MVCC (updates tagged with timestamp, not in-place)
  • Allow “SELECT AS OF <timestamp>” queries and consistent cross-tablet scans
• Near-optimal read path for “current time” scans
  • No per-row branches, fast vectorized decoding and predicate evaluation
• Performance worsens based on number of recent updates
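
A minimal sketch of the MVCC idea, assuming integer timestamps and a single cell. It is not how Kudu lays out its data; `MvccCell` is a made-up class that only shows updates being kept as timestamped versions and read back "as of" a point in time.

```python
class MvccCell:
    """One cell that keeps every version, tagged with a write timestamp."""

    def __init__(self):
        self.versions = []  # list of (timestamp, value), sorted by timestamp

    def write(self, timestamp, value):
        # Updates are recorded as new versions rather than applied in place.
        self.versions.append((timestamp, value))
        self.versions.sort(key=lambda version: version[0])

    def read_as_of(self, timestamp):
        """Return the newest value written at or before `timestamp`, or None."""
        result = None
        for written_at, value in self.versions:
            if written_at > timestamp:
                break
            result = value
        return result
```

In this toy version a read walks the version list, so a cell with many recent updates costs more to read, which mirrors the last bullet on the slide.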


Page 34: Introduction to Apache Kudu

Metadata

• Replicated master*
  • Acts as a tablet directory (“META” table)
  • Acts as a catalog (table schemas, etc)
  • Acts as a load balancer (tracks TS liveness, re-replicates under-replicated tablets)
• Caches all metadata in RAM for high performance
  • 80-node load test, GetTableLocations RPC perf:
    • 99th percentile: 68us, 99.99th percentile: 657us
    • <2% peak CPU usage
• Client configured with master addresses
  • Asks master for tablet locations as needed and caches them

Page 35: Introduction to Apache Kudu

Kudu Trade-Offs

• Random updates will be slower
  • HBase model allows random updates without incurring a disk seek
  • Kudu requires a key lookup before update, Bloom lookup before insert
• Single-row reads may be slower
  • Columnar design is optimized for scans
  • Future: may introduce “column groups” for applications where single-row access is more important
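
The "Bloom lookup before insert" step can be sketched as follows. This is a generic Bloom filter with assumed parameters (1024 bits, 3 hashes, SHA-256 as the hash), not Kudu's implementation, and `insert` is a hypothetical helper over a plain dict.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key: str):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(key))


def insert(store: dict, bloom: BloomFilter, key: str, row):
    """Insert a row; do the full key lookup only if the filter says 'maybe'."""
    if bloom.might_contain(key) and key in store:
        raise KeyError(f"duplicate primary key: {key}")
    store[key] = row
    bloom.add(key)
```

A Bloom filter never reports a stored key as absent, so a "no" answer lets the insert skip the more expensive key lookup entirely.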

Page 36: Introduction to Apache Kudu

Fault tolerance

• Transient FOLLOWER failure:
  • Leader can still achieve majority
  • Restart follower TS within 5 min and it will rejoin transparently
• Transient LEADER failure:
  • Followers expect to hear a heartbeat from their leader every 1.5 seconds
  • 3 missed heartbeats: leader election!
  • New LEADER is elected from remaining nodes within a few seconds
  • Restart within 5 min and it rejoins as a FOLLOWER
• N replicas handle (N-1)/2 failures
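
The arithmetic behind these bullets, as a small sketch. The constants come from the slide; the function names are made up, and (N-1)/2 is read as integer division.

```python
HEARTBEAT_INTERVAL_S = 1.5  # from the slide
MISSED_HEARTBEATS = 3       # from the slide

def tolerated_failures(num_replicas: int) -> int:
    """Replicas that can fail while a majority remains: (N-1)/2, rounded down."""
    return (num_replicas - 1) // 2

def majority(num_replicas: int) -> int:
    """Smallest number of replicas that forms a quorum."""
    return num_replicas // 2 + 1

def election_trigger_delay_s() -> float:
    """Silence from the leader that followers tolerate before an election."""
    return HEARTBEAT_INTERVAL_S * MISSED_HEARTBEATS
```

With 3 replicas one failure is tolerated, with 5 replicas two, and followers wait about 4.5 seconds of missed heartbeats before starting an election.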


Page 37: Introduction to Apache Kudu

Fault tolerance (2)

• Permanent failure:
  • Leader notices that a follower has been dead for 5 minutes
  • Evicts that follower
  • Master selects a new replica
  • Leader copies the data over to the new one, which joins as a new FOLLOWER

Page 38: Introduction to Apache Kudu


Benchmarks

Page 39: Introduction to Apache Kudu

TPC-H (Analytics benchmark)

• 75 TS + 1 master cluster
  • 12 (spinning) disks each, enough RAM to fit dataset
  • Using Kudu 0.5.0, Impala 2.2 with Kudu support, CDH 5.4
  • TPC-H Scale Factor 100 (100GB)
• Example query:
  SELECT n_name, sum(l_extendedprice * (1 - l_discount)) as revenue
  FROM customer, orders, lineitem, supplier, nation, region
  WHERE c_custkey = o_custkey
    AND l_orderkey = o_orderkey
    AND l_suppkey = s_suppkey
    AND c_nationkey = s_nationkey
    AND s_nationkey = n_nationkey
    AND n_regionkey = r_regionkey
    AND r_name = 'ASIA'
    AND o_orderdate >= date '1994-01-01'
    AND o_orderdate < '1995-01-01'
  GROUP BY n_name
  ORDER BY revenue desc;

Page 40: Introduction to Apache Kudu

- Kudu outperforms Parquet by 31% (geometric mean) for RAM-resident data
- Parquet likely to outperform Kudu for HDD-resident (larger IO requests)

Page 41: Introduction to Apache Kudu

What about Apache Phoenix?

• 10 node cluster (9 worker, 1 master)
• HBase 1.0, Phoenix 4.3
• TPC-H LINEITEM table only (6B rows)

[Chart: Time (sec), log scale from 0.01 to 10000]

          Load   TPCH Q1   COUNT(*)   COUNT(*) WHERE…   single-row lookup
Phoenix   2152   219       76         131               0.04
Kudu      1918   13.2      1.7        0.7               0.15
Parquet   155    9.3       1.4        1.5               1.37

Page 42: Introduction to Apache Kudu

What about NoSQL-style random access? (YCSB)

• YCSB 0.5.0-snapshot
• 10 node cluster (9 worker, 1 master)
• HBase 1.0
• 100M rows, 10M ops

Page 43: Introduction to Apache Kudu


Xiaomi Use Case

• Gather important RPC tracing events from mobile app and backend service

• Service monitoring & troubleshooting tool

• High write throughput

• >5 Billion records/day and growing

• Query latest data and quick response

• Identify and resolve issues quickly

• Can search for individual records

• Easy for troubleshooting

Page 44: Introduction to Apache Kudu

Big Data Analytics Pipeline
Before Kudu

• Long pipeline: high latency (1 hour ~ 1 day), data conversion pains
• No ordering: log arrival (storage) order is not exactly the logical order, e.g. read 2-3 days of logs to get the data for 1 day

Page 45: Introduction to Apache Kudu

Big Data Analysis Pipeline
Simplified With Kudu

• ETL Pipeline (0~10s latency): apps that need to prevent backpressure or require ETL
• Direct Pipeline (no latency): apps that don’t require ETL and have no backpressure issues

OLAP scan
Side table lookup
Result store

Page 46: Introduction to Apache Kudu

Use Case 1: Benchmark

Environment

• 71 node cluster
• Hardware
  • CPU: E5-2620 2.1GHz * 24 core
  • Memory: 64GB
  • Network: 1Gb
  • Disk: 12 HDD
• Software
  • Hadoop 2.6 / Impala 2.1 / Kudu

Data

• 1 day of server side tracing data
  • ~2.6 Billion rows
  • ~270 bytes/row
  • 17 columns, 5 key columns

Page 47: Introduction to Apache Kudu

Use Case 1: Benchmark Results

Bulk load using Impala (INSERT INTO):

          Total Time(s)   Throughput(Total)   Throughput(per node)
Kudu      961.1           2.8M records/s      39.5k records/s
Parquet   114.6           23.5M records/s     331k records/s

Query latency:

          Q1    Q2    Q3    Q4    Q5    Q6
Kudu      1.4   2.0   2.3   3.1   1.3   0.9
Parquet   1.3   2.8   4.0   5.7   7.5   16.7

* HDFS Parquet file replication = 3, Kudu table replication = 3
* Each query was run 5 times and the results averaged
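
As a cross-check of the bulk-load figures, the sketch below recomputes row counts and per-node throughput from the slide's totals and the 71-node cluster size. The helper names are made up; only the input numbers come from the slides.

```python
NODES = 71  # cluster size from the environment slide

def implied_rows(total_seconds: float, records_per_second: float) -> float:
    """Rows loaded = load time x total throughput."""
    return total_seconds * records_per_second

def per_node(records_per_second: float, nodes: int = NODES) -> float:
    """Total throughput divided evenly across the cluster."""
    return records_per_second / nodes

kudu_rows = implied_rows(961.1, 2.8e6)       # about 2.69e9 rows
parquet_rows = implied_rows(114.6, 23.5e6)   # about 2.69e9 rows
load_time_ratio = 961.1 / 114.6              # Kudu load took roughly 8.4x longer
```

Both loads imply roughly 2.7 billion rows, in line with the "~2.6 Billion rows" data set, and dividing by 71 nodes gives values close to the per-node column.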

Page 48: Introduction to Apache Kudu


Status & Getting Started

Page 49: Introduction to Apache Kudu

Basic Kudu value proposition

With Kudu:
• Ingest and serve data simultaneously
• Support analytic and real-time operations on the same data set
• Make existing storage architectures simpler, and enable new architectures that previously weren’t possible

Basic Features:
• High availability, no single point of failure
• Consistency by consensus, options for “tunable consistency”
• Horizontally scalable
• Efficient use of modern storage and processors

Page 50: Introduction to Apache Kudu


Current Status

✔ Completed all components core to the architecture

✔ Java and C++ API

✔ Impala, MapReduce, and Spark integration

✔ Support for SSDs and spinning disk

✔ Fault recovery

✔ Public beta available

Page 51: Introduction to Apache Kudu

Getting Started

Users:

Install the Beta or try a VM: getkudu.io

Get help: [email protected]

Read the white paper: getkudu.io/kudu.pdf

Developers:

Contribute: github.com/cloudera/kudu (commits), gerrit.cloudera.org (reviews), issues.cloudera.org (JIRAs going back to 2013)

Join the Dev list: [email protected]

Contributions/participation are welcome and encouraged!

Page 52: Introduction to Apache Kudu


Questions?

Page 53: Introduction to Apache Kudu


Appendix

Page 56: Introduction to Apache Kudu

LSM vs Kudu

• LSM – Log Structured Merge (Cassandra, HBase, etc)
  • Inserts and updates all go to an in-memory map (MemStore) and later flush to on-disk files (HFile/SSTable)
  • Reads perform an on-the-fly merge of all on-disk HFiles
• Kudu
  • Shares some traits (memstores, compactions)
  • More complex.
  • Slower writes in exchange for faster reads (especially scans)
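
The LSM bullets can be illustrated with a toy store. This is a deliberately simplified sketch with in-memory dicts standing in for on-disk files; `LsmStore` is hypothetical and does not reflect HBase or Cassandra internals beyond what the slide states.

```python
class LsmStore:
    def __init__(self):
        self.memstore = {}  # in-memory map that absorbs inserts and updates
        self.files = []     # flushed "files", oldest first; each is a dict

    def put(self, key, value):
        self.memstore[key] = value

    def flush(self):
        """Write the memstore out as a new immutable file."""
        if self.memstore:
            self.files.append(dict(self.memstore))
            self.memstore = {}

    def get(self, key):
        """Check the memstore, then every file from newest to oldest."""
        if key in self.memstore:
            return self.memstore[key]
        for file in reversed(self.files):
            if key in file:
                return file[key]
        return None

    def scan(self):
        """Merge all files and the memstore on the fly; newer values win."""
        merged = {}
        for file in self.files:
            merged.update(file)
        merged.update(self.memstore)
        return sorted(merged.items())
```

Writes only touch the memstore, which keeps them cheap, while every read or scan pays for merging across all flushed files.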


Page 57: Introduction to Apache Kudu

Kudu storage – Compaction policy

• Solves an optimization problem (knapsack problem)
  • Minimize “height” of rowsets for the average key lookup
  • Bound on number of seeks for write or random-read
• Restrict total IO of any compaction to a budget (128MB)
  • No long compactions, ever
  • No “minor” vs “major” distinction
  • Always be compacting or flushing
  • Low IO priority maintenance threads
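
To make the knapsack framing concrete, here is a textbook 0/1 knapsack over candidate rowsets with a 128MB budget. The slide does not describe how Kudu scores rowsets, so the integer sizes and the "benefit" values are assumptions, and `pick_rowsets` is a hypothetical function rather than Kudu's actual policy.

```python
BUDGET_MB = 128  # per-compaction IO budget from the slide

def pick_rowsets(candidates, budget_mb=BUDGET_MB):
    """0/1 knapsack: choose rowsets maximizing total benefit within the budget.

    candidates: list of (name, size_mb, benefit) with integer size_mb.
    Returns (total_benefit, chosen_names).
    """
    # best[b] = (benefit, names) achievable with at most b MB of IO
    best = [(0, [])] * (budget_mb + 1)
    for name, size_mb, benefit in candidates:
        # Walk budgets downward so each rowset is used at most once.
        for b in range(budget_mb, size_mb - 1, -1):
            with_item = best[b - size_mb][0] + benefit
            if with_item > best[b][0]:
                best[b] = (with_item, best[b - size_mb][1] + [name])
    return best[budget_mb]
```

Because every selection fits inside the fixed budget, no single compaction can grow into a long-running job, which is the point of the "no long compactions, ever" bullet.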
