Transcript
Page 1:

HOW CLOUDERA IMPALA HAS PUSHED HDFS IN NEW WAYS

How HDFS is evolving to meet new needs

Page 2:

✛ Aaron T. Myers
 > Email: [email protected], [email protected]
 > Twitter: @atm
✛ Hadoop PMC Member / Committer at ASF
✛ Software Engineer at Cloudera
✛ Primarily work on HDFS and Hadoop Security


Page 3:

✛ HDFS introduction/architecture
✛ Impala introduction/architecture
✛ New requirements for HDFS
 > Block replica / disk placement info
 > Correlated file/block replica placement
 > In-memory caching for hot files
 > Short-circuit reads, reduced copy overhead


Page 4:

HDFS INTRODUCTION

Page 5:

✛ HDFS is the Hadoop Distributed File System
✛ Append-only distributed file system
✛ Intended to store many very large files
 > Block sizes usually 64MB – 512MB
 > Files composed of several blocks
✛ Write a file once during ingest
✛ Read a file many times for analysis
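
Not from the slides: a minimal Java sketch of this write-once/read-many pattern using the standard Hadoop FileSystem client. The path, block-size value, and class name are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteOnceReadMany {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical block size: 128 MB, within the 64MB-512MB range mentioned above.
            conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/data/ingest/events.dat");  // hypothetical path

            // Write once during ingest (HDFS files are append-only once written).
            try (FSDataOutputStream out = fs.create(file)) {
                out.writeBytes("example record\n");
            }

            // Read many times for analysis.
            try (FSDataInputStream in = fs.open(file)) {
                byte[] buf = new byte[4096];
                int n = in.read(buf);
                System.out.println("read " + n + " bytes");
            }
        }
    }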


Page 6:

✛ HDFS originally designed specifically for Map/Reduce
 > Each MR task typically operates on one HDFS block
 > MR tasks run co-located on HDFS nodes
 > Data locality: move the code to the data
✛ Each block of each file is replicated 3 times
 > For reliability in the face of machine, drive failures
 > Provides a few options for data locality during processing


Page 7:

HDFS ARCHITECTURE

Page 8:

✛ Each cluster has…
 > A single Name Node
   ∗ Stores file system metadata
   ∗ Stores “Block ID” -> Data Node mapping
 > Many Data Nodes
   ∗ Store actual file data
✛ Clients of HDFS…
 > Communicate with Name Node to browse file system, get block locations for files
 > Communicate directly with Data Nodes to read/write files
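
As a hedged illustration of that client flow (not code from the talk): with the standard FileSystem API, a client asks the Name Node where each block of a file lives before streaming the block data from the Data Nodes. The path below is hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockLocations {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus st = fs.getFileStatus(new Path("/data/ingest/events.dat"));  // hypothetical

            // Metadata from the Name Node: block offsets, lengths, and the Data Nodes holding replicas.
            BlockLocation[] blocks = fs.getFileBlockLocations(st, 0, st.getLen());
            for (BlockLocation b : blocks) {
                System.out.printf("offset=%d len=%d hosts=%s%n",
                        b.getOffset(), b.getLength(), String.join(",", b.getHosts()));
            }
            // The block data itself is then read directly from those Data Nodes by the HDFS client.
        }
    }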


Page 9:


Page 10:

IMPALA INTRODUCTION

Page 11:

✛ General-purpose SQL query engine:
 > Should work both for analytical and transactional workloads
 > Will support queries that take from milliseconds to hours
✛ Runs directly within Hadoop:
 > Reads widely used Hadoop file formats
 > Talks directly to HDFS (or HBase)
 > Runs on same nodes that run Hadoop processes


Page 12:

✛ Uses HQL as its query language
 > Hive Query Language – what Apache Hive uses
 > Very close to complete SQL-92 compliance
✛ Extremely high performance
 > C++ instead of Java
 > Runtime code generation
 > Completely new execution engine that doesn't build on MapReduce


Page 13:

✛ Runs as a distributed service in the cluster
 > One Impala daemon on each node with data
 > Doesn’t use Hadoop Map/Reduce at all
✛ User submits query via ODBC/JDBC to any of the daemons
✛ Query is distributed to all nodes with relevant data
✛ If any node fails, the query fails and is re-executed
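
A hedged sketch of that submission path, assuming the commonly used Hive JDBC (HiveServer2) driver and Impala's usual JDBC port 21050. The driver class, host, connection options, and query are assumptions, not details from the talk.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ImpalaJdbcExample {
        public static void main(String[] args) throws Exception {
            // Assumes the Hive JDBC (HiveServer2) driver is on the classpath.
            Class.forName("org.apache.hive.jdbc.HiveDriver");

            // Connect to any impalad; that daemon acts as the coordinator for the query.
            String url = "jdbc:hive2://impalad-host.example.com:21050/default;auth=noSasl";  // assumed URL

            try (Connection conn = DriverManager.getConnection(url);
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM web_logs")) {  // hypothetical table
                while (rs.next()) {
                    System.out.println(rs.getLong(1));
                }
            }
        }
    }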


Page 14:

IMPALA ARCHITECTURE

Page 15:

✛ Two daemons: impalad and statestored
✛ Impala daemon (impalad)
 > Handles client requests
 > Handles all internal requests related to query execution
✛ State store daemon (statestored)
 > Provides name service of cluster members
 > Hive table metadata distribution


Page 16:

✛ Query execution phases
 > Request arrives at an impalad via ODBC/JDBC
 > Planner turns the request into a collection of plan fragments
   ∗ Plan fragments may be executed in parallel
 > Coordinator impalad initiates execution of plan fragments on remote impalad daemons
✛ During execution
 > Intermediate results are streamed between executors
 > Query results are streamed back to the client


Page 17:


✛ During execution, impalad daemons connect directly to HDFS/HBase to read/write data

Page 18:

HDFS IMPROVEMENTS MOTIVATED BY IMPALA

Page 19:

✛ Impala is concerned with very low latency queries
 > Need to make best use of available aggregate disk throughput
✛ Impala’s more efficient execution engine is far more likely to be I/O bound as compared to Hive
 > Implies that for many queries the best performance improvement will come from improved I/O
✛ Impala query execution has no shuffle phase
 > Implies that joins between tables do not necessitate all-to-all communication


Page 20:

✛ Expose HDFS block replica disk location information
✛ Allow for explicitly co-located block replicas across files
✛ In-memory caching of hot tables/files
✛ Reduced copies during reading, short-circuit reads


Page 21:

✛ The problem: the NameNode knows which DataNodes blocks are on, not which disks
 > Only the DNs are aware of the block replica -> disk mapping
✛ Impala wants to make sure that separate plan fragments operate on data on separate disks
 > Maximize aggregate available disk throughput


Page 22:

✛ The solution: add a new RPC call to DataNodes to expose which volumes (disks) replicas are stored on
✛ During the query planning phase, impalad…
 > Determines all DNs that data for the query is stored on
 > Queries those DNs to get volume information
✛ During the query execution phase, impalad…
 > Queues disk reads so that only 1 or 2 reads ever happen to a given disk at a given time
✛ With this additional info, Impala is able to ensure disk reads are large and seeks are minimized
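
A hedged sketch of how this looks from the client side, assuming the block-storage-location API that appeared in the Hadoop 2.x client (DistributedFileSystem#getFileBlockStorageLocations) and the DataNode-side setting that enables it; the path is hypothetical.

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.BlockStorageLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.VolumeId;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    public class ShowVolumeIds {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // DataNode-side support for exposing replica -> volume info must be enabled.
            conf.setBoolean("dfs.datanode.hdfs-blocks-metadata.enabled", true);

            DistributedFileSystem dfs =
                    (DistributedFileSystem) new Path("/").getFileSystem(conf);
            FileStatus st = dfs.getFileStatus(new Path("/data/ingest/events.dat"));  // hypothetical
            BlockLocation[] blocks = dfs.getFileBlockLocations(st, 0, st.getLen());

            // Extra round trip to the DataNodes: which disk (volume) holds each replica?
            BlockStorageLocation[] locs =
                    dfs.getFileBlockStorageLocations(Arrays.asList(blocks));
            for (BlockStorageLocation loc : locs) {
                for (VolumeId v : loc.getVolumeIds()) {
                    System.out.println("block at offset " + loc.getOffset() + " -> volume " + v);
                }
            }
        }
    }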


Page 23:

✛ The problem: when performing a join, a single impalad may have to read from both a local file and a remote file on another DN
✛ Local reads at full disk throughput: ~800 MB/s
✛ Remote reads on a 1 gigabit network: ~128 MB/s
✛ Ideally all reads should be done on local disks


Page 24:

✛ The solution: add a feature to HDFS to specify that a set of files should have their replicas placed on the same set of nodes
✛ Gives Impala more control over how data is laid out
✛ Can ensure that tables/files which are joined frequently have their data co-located
✛ Additionally, more fine-grained block placement control allows for potential improvements in columnar formats like Parquet
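
The placement feature itself lives inside HDFS, but a client can at least verify co-location with the stock API. A hedged sketch, with hypothetical paths, that compares the DataNode host sets holding two files' replicas.

    import java.util.Arrays;
    import java.util.Set;
    import java.util.TreeSet;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CheckCoLocation {
        // Collect the set of DataNode hosts holding any replica of the file.
        static Set<String> hostsOf(FileSystem fs, Path p) throws Exception {
            FileStatus st = fs.getFileStatus(p);
            Set<String> hosts = new TreeSet<>();
            for (BlockLocation b : fs.getFileBlockLocations(st, 0, st.getLen())) {
                hosts.addAll(Arrays.asList(b.getHosts()));
            }
            return hosts;
        }

        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Set<String> a = hostsOf(fs, new Path("/warehouse/orders/part-0"));     // hypothetical
            Set<String> b = hostsOf(fs, new Path("/warehouse/customers/part-0"));  // hypothetical
            System.out.println("co-located: " + a.equals(b) + " (" + a + " vs " + b + ")");
        }
    }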


Page 25:

✛ The problem: Impala queries are often bottlenecked at maximum disk throughput
✛ Memory throughput is much higher
✛ Memory is getting cheaper/denser
 > Routinely seeing DNs with 48GB-96GB of RAM
✛ We’ve observed substantial Impala speedups when file data ends up in the OS buffer cache


Page 26:

✛ The solution: add a facility to HDFS to explicitly read specific HDFS files into main memory
✛ Allows Impala to read data at full memory bandwidth (several GB/s)
✛ Gives the cluster operator control over which files/tables are queried frequently and thus should be kept in memory
 > Don’t want an MR job to inadvertently evict that data from memory via the OS buffer cache
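
A hedged sketch assuming HDFS centralized cache management, the facility that later shipped for this purpose (cache pools plus cache directives on DistributedFileSystem); the pool name and path are hypothetical. The same effect is commonly achieved from the command line with hdfs cacheadmin -addPool / -addDirective.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.CacheDirectiveInfo;
    import org.apache.hadoop.hdfs.protocol.CachePoolInfo;

    public class PinHotTable {
        public static void main(String[] args) throws Exception {
            DistributedFileSystem dfs =
                    (DistributedFileSystem) new Path("/").getFileSystem(new Configuration());

            // Operator-defined pool that bounds how much memory "hot" tables may pin.
            dfs.addCachePool(new CachePoolInfo("hot-tables"));  // hypothetical pool name

            // Ask the DataNodes to keep this file's replicas resident in memory.
            dfs.addCacheDirective(new CacheDirectiveInfo.Builder()
                    .setPath(new Path("/warehouse/orders"))  // hypothetical hot table/file
                    .setPool("hot-tables")
                    .setReplication((short) 1)               // cache one replica in memory
                    .build());
        }
    }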


Page 27:

✛ The problem: a typical read in HDFS must be read from disk by the DN, copied into DN memory, sent over the network, copied into client buffers, etc.
✛ All of these extraneous copies use unnecessary memory and CPU resources


Page 28:

✛ The solution: allow reads to be performed directly on local files, using direct buffers
✛ Added a facility to HDFS to allow reads to completely bypass the DataNode when the client is co-located with the block replica files
✛ Added an API in libhdfs to supply direct byte buffers to HDFS read operations, reducing the number of copies to the bare minimum
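
A hedged sketch of the short-circuit half as seen from a Java client, using the standard client settings for it; the domain-socket path is an assumption and must match the DataNode's configuration. The libhdfs direct byte buffer API is not shown here.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShortCircuitReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Let a co-located client read block files directly instead of streaming through the DataNode.
            conf.setBoolean("dfs.client.read.shortcircuit", true);
            // Domain socket shared with the DataNode; this path is an assumption.
            conf.set("dfs.domain.socket.path", "/var/run/hdfs-sockets/dn");

            FileSystem fs = FileSystem.get(conf);
            try (FSDataInputStream in = fs.open(new Path("/data/ingest/events.dat"))) {  // hypothetical
                byte[] buf = new byte[1024 * 1024];
                int n = in.read(buf);
                System.out.println("read " + n + " bytes (short-circuited when the replica is local)");
            }
        }
    }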


Page 29:

✛ For simpler queries (no joins, tpch-q*) on large datasets (1TB)
 > 5-10x faster than Hive
✛ For complex queries on large datasets (1TB)
 > 20-50x faster than Hive
✛ For complex queries out of buffer cache (300GB)
 > 25-150x faster than Hive
✛ Due to Impala’s improved execution engine, low startup time, improved I/O, etc.


Page 30: