HOW CLOUDERA IMPALA HAS PUSHED HDFS IN NEW WAYS
How HDFS is evolving to meet new needs
Jan 26, 2015
✛ Aaron T. Myers > Email: [email protected], [email protected] > Twitter: @atm
✛ Hadoop PMC Member / Committer at ASF ✛ Software Engineer at Cloudera ✛ Primarily work on HDFS and Hadoop Security
✛ HDFS introduction/architecture ✛ Impala introduction/architecture ✛ New requirements for HDFS
> Block replica / disk placement info > Correlated file/block replica placement > In-memory caching for hot files > Short-circuit reads, reduced copy overhead
HDFS INTRODUCTION
✛ HDFS is the Hadoop Distributed File System ✛ Append-only distributed file system ✛ Intended to store many very large files
> Block sizes usually 64MB – 512MB > Files composed of several blocks
✛ Write a file once during ingest ✛ Read a file many times for analysis
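The block-based layout above is easy to picture with a little arithmetic. This is an illustrative sketch (the 128 MB block size is just a common default, chosen here for the example):

```python
# Sketch: how HDFS splits a file into fixed-size blocks.
# The block size here is illustrative; HDFS block sizes typically
# range from 64 MB to 512 MB.

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> list[int]:
    """Return the sizes of the blocks a file of `file_size` bytes occupies."""
    full, rem = divmod(file_size, block_size)
    return [block_size] * full + ([rem] if rem else [])

# A 300 MB file becomes two full 128 MB blocks plus one 44 MB tail block.
blocks = split_into_blocks(300 * 1024 * 1024)
print(len(blocks))  # 3
```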
✛ HDFS originally designed specifically for Map/Reduce > Each MR task typically operates on one HDFS block > MR tasks run co-located on HDFS nodes > Data locality: move the code to the data
✛ Each block of each file is replicated 3 times
  > For reliability in the face of machine and drive failures
  > Provides a few options for data locality during processing
HDFS ARCHITECTURE
✛ Each cluster has…
  > A single NameNode
    ∗ Stores file system metadata
    ∗ Stores “Block ID” -> DataNode mapping
  > Many DataNodes
    ∗ Store actual file data
✛ Clients of HDFS…
  > Communicate with the NameNode to browse the file system and get block locations for files
  > Communicate directly with DataNodes to read/write files
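The split between metadata and data paths can be modeled in a few lines. This is a toy model with invented names, not the real HDFS API; it just shows that the NameNode answers "where are the blocks?" while the bytes come straight from the DataNodes:

```python
# Toy model (class and field names invented for illustration) of the
# HDFS read path: the client asks the NameNode for block locations,
# then fetches block data directly from the DataNodes -- the NameNode
# itself never serves file data.

class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}           # block_id -> bytes

class NameNode:
    def __init__(self):
        self.files = {}            # path -> [block_id, ...]
        self.block_map = {}        # block_id -> [DataNode, ...] (replicas)

def read_file(namenode, path):
    data = b""
    for block_id in namenode.files[path]:          # 1. metadata from NameNode
        replica = namenode.block_map[block_id][0]  # 2. pick a replica holder
        data += replica.blocks[block_id]           # 3. read from the DataNode
    return data

dn1, dn2 = DataNode("dn1"), DataNode("dn2")
dn1.blocks["b0"] = b"hello "; dn2.blocks["b0"] = b"hello "   # replicated
dn1.blocks["b1"] = b"world";  dn2.blocks["b1"] = b"world"

nn = NameNode()
nn.files["/demo.txt"] = ["b0", "b1"]
nn.block_map = {"b0": [dn1, dn2], "b1": [dn2, dn1]}

print(read_file(nn, "/demo.txt"))  # b'hello world'
```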
IMPALA INTRODUCTION
✛ General-purpose SQL query engine:
  > Should work for both analytical and transactional workloads
  > Will support queries that take from milliseconds to hours
✛ Runs directly within Hadoop:
  > Reads widely used Hadoop file formats
  > Talks directly to HDFS (or HBase)
  > Runs on the same nodes that run Hadoop processes
✛ Uses HQL for query language
  > Hive Query Language – what Apache Hive uses
  > Very close to complete SQL-92 compliance
✛ Extremely high performance
  > C++ instead of Java
  > Runtime code generation
  > Completely new execution engine that doesn't build on MapReduce
✛ Runs as a distributed service in cluster > One Impala daemon on each node with data > Doesn’t use Hadoop Map/Reduce at all
✛ User submits query via ODBC/JDBC to any of the daemons
✛ Query is distributed to all nodes with relevant data
✛ If any node fails, the query fails and is re-executed
IMPALA ARCHITECTURE
✛ Two daemons: impalad and statestored
✛ Impala daemon (impalad)
  > Handles client requests
  > Handles all internal requests related to query execution
✛ State store daemon (statestored)
  > Provides a name service for cluster members
  > Hive table metadata distribution
✛ Query execution phases
  > Request arrives at an impalad via ODBC/JDBC
  > Planner turns the request into a collection of plan fragments
    ∗ Plan fragments may be executed in parallel
  > Coordinator impalad initiates execution of plan fragments on remote impalad daemons
✛ During execution
  > Intermediate results are streamed between executors
  > Query results are streamed back to the client
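The coordinator's placement decision, assigning each fragment to a node that already holds the data it scans, can be sketched as a simple locality-aware assignment. This is an illustrative simplification with invented names, not Impala's actual scheduler:

```python
# Illustrative sketch of the coordinator's placement step: assign each
# scan fragment to a node holding a replica of the block it reads, so
# fragments run where their data lives. Ties are broken by picking the
# least-loaded candidate node, spreading work across the cluster.

def assign_fragments(block_locations):
    """block_locations: block_id -> list of nodes holding a replica.
    Returns block_id -> chosen node."""
    load = {}
    assignment = {}
    for block_id, nodes in block_locations.items():
        # prefer the least-loaded node among those with a local replica
        node = min(nodes, key=lambda n: load.get(n, 0))
        assignment[block_id] = node
        load[node] = load.get(node, 0) + 1
    return assignment

locs = {"b0": ["nodeA", "nodeB"], "b1": ["nodeA", "nodeC"], "b2": ["nodeB", "nodeC"]}
print(assign_fragments(locs))
```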
✛ During execution, impalad daemons connect directly to HDFS/HBase to read/write data
HDFS IMPROVEMENTS MOTIVATED BY IMPALA
✛ Impala is concerned with very low latency queries
  > Needs to make the best use of available aggregate disk throughput
✛ Impala's more efficient execution engine is far more likely to be I/O bound than Hive's
  > Implies that for many queries the best performance improvement will come from improved I/O
✛ Impala query execution has no shuffle phase
  > Implies that joins between tables do not necessitate all-to-all communication
✛ Expose HDFS block replica disk location information
✛ Allow for explicitly co-located block replicas across files
✛ In-memory caching of hot tables/files
✛ Reduced copies during reading, short-circuit reads
✛ The problem: the NameNode knows which DataNodes a block's replicas are on, but not which disks
  > Only the DNs are aware of the block replica -> disk mapping
✛ Impala wants to make sure that separate plan fragments operate on data on separate disks
  > Maximizes aggregate available disk throughput
✛ The solution: add a new RPC call to the DataNodes to expose which volumes (disks) replicas are stored on
✛ During the query planning phase, impalad…
  > Determines all DNs the data for the query is stored on
  > Queries those DNs to get volume information
✛ During the query execution phase, impalad…
  > Queues disk reads so that only 1 or 2 reads ever happen to a given disk at a given time
✛ With this additional info, Impala is able to ensure disk reads are large and minimize seeks
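The queuing idea can be sketched as a per-disk scheduler. This is a toy batch scheduler, not Impala's actual I/O manager; it only illustrates the invariant that no more than a couple of reads are in flight against any one disk at a time:

```python
# Sketch of the scheduling idea: given per-replica volume (disk) info
# from the DataNodes, issue reads in rounds so that at most MAX_PER_DISK
# reads hit any one disk at a time, keeping each read large and
# sequential instead of seek-bound.

from collections import defaultdict, deque

MAX_PER_DISK = 2   # Impala keeps only 1-2 reads in flight per disk

def schedule_reads(reads):
    """reads: list of (read_id, disk). Returns a list of rounds such
    that no round issues more than MAX_PER_DISK reads to one disk."""
    queues = defaultdict(deque)
    for read_id, disk in reads:
        queues[disk].append(read_id)
    rounds = []
    while any(queues.values()):
        batch = []
        for disk, q in queues.items():
            for _ in range(min(MAX_PER_DISK, len(q))):
                batch.append(q.popleft())
        rounds.append(batch)
    return rounds

reads = [("r%d" % i, "disk%d" % (i % 2)) for i in range(6)]
for r in schedule_reads(reads):
    print(r)
```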
✛ The problem: when performing a join, a single impalad may have to read from both a local file and a remote file on another DN
✛ Local reads at full disk throughput: ~800 MB/s ✛ Remote reads in a 1 gigabit network: ~128 MB/s ✛ Ideally all reads should be done on local disks
✛ The solution: add feature to HDFS to specify that a set of files should have their replicas placed on the same set of nodes
✛ Gives Impala more control over data layout
✛ Can ensure that tables/files which are joined frequently have their data co-located
✛ Additionally, more fine-grained block placement control allows for potential improvements in columnar formats like Parquet
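A co-located placement policy can be sketched as "pick one node set per file group." This is a simplified toy policy, not the actual HDFS block placement code; the real feature lets a client request that a set of files share replica nodes:

```python
# Sketch of a co-located placement policy: every file in a group gets
# its replicas on the same set of nodes, so joins between those files
# never need remote reads. (Simplified toy; real HDFS placement also
# considers racks, free space, and failures.)

import random

def place_replicas(groups, nodes, replication=3):
    """groups: group_name -> list of file paths. Returns file -> node
    list, with every file in a group assigned the same node set."""
    placement = {}
    for group, files in groups.items():
        chosen = random.sample(nodes, replication)  # one node set per group
        for f in files:
            placement[f] = chosen
    return placement

nodes = ["dn%d" % i for i in range(10)]
groups = {"join_group": ["/tables/orders", "/tables/lineitem"]}
p = place_replicas(groups, nodes)
# Both tables' replicas land on the identical node set:
print(p["/tables/orders"] == p["/tables/lineitem"])  # True
```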
✛ The problem: Impala queries are often bottlenecked at maximum disk throughput
✛ Memory throughput is much higher ✛ Memory is getting cheaper/denser
> Routinely seeing DNs with 48GB-96GB of RAM
✛ We’ve observed substantial Impala speedups when file data ends up in OS buffer cache
✛ The solution: Add facility to HDFS to explicitly read specific HDFS files into main memory
✛ Allows Impala to read data at full memory bandwidth speeds (several GB/s)
✛ Gives the cluster operator control over which files/tables are queried frequently and thus should be kept in memory
  > Don't want an MR job to inadvertently evict data from memory via the OS buffer cache
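The key policy difference from the OS buffer cache is pinning: explicitly cached files are exempt from eviction. The toy cache below (invented names, not the HDFS implementation) models just that policy:

```python
# Sketch of the pinning idea behind explicit HDFS caching: pinned files
# are exempt from normal LRU-style eviction, so a scan by an unrelated
# MR job cannot push a hot table out of memory. Toy model only.

from collections import OrderedDict

class PinnedCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # path -> size, oldest first
        self.pinned = set()

    def put(self, path, size, pinned=False):
        self.entries[path] = size
        if pinned:
            self.pinned.add(path)
        # evict oldest *unpinned* entries until we fit
        while sum(self.entries.values()) > self.capacity:
            victim = next((p for p in self.entries if p not in self.pinned), None)
            if victim is None:
                break                  # everything left is pinned
            del self.entries[victim]

cache = PinnedCache(capacity=100)
cache.put("/tables/hot", 60, pinned=True)
cache.put("/mr/tmp1", 30)
cache.put("/mr/tmp2", 30)              # evicts /mr/tmp1, not the pinned table
print("/tables/hot" in cache.entries)  # True
```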
✛ The problem: a typical read in HDFS must be read from disk by the DN, copied into DN memory, sent over the network, copied into client buffers, etc.
✛ All of these extraneous copies use unnecessary memory, CPU resources
✛ The solution: Allow for reads to be performed directly on local files, use direct buffers
✛ Added facility to HDFS to allow for reads to completely bypass DataNode when client co-located with block replica files
✛ Added API in libhdfs to supply direct byte buffers to HDFS read operations to reduce number of copies to bare minimum
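As an analogy for what a short-circuit read buys, consider reading a local replica file directly via `mmap` instead of pulling the bytes through a daemon and a socket. This is only an illustration of the principle; real HDFS clients obtain the block file from the DataNode and then read it locally:

```python
# Illustrative analogy for a short-circuit read: when the client is on
# the same machine as the block replica, it can read the replica file
# directly, bypassing the DataNode and its extra buffer copies.
# (Stand-in sketch, not the actual HDFS client mechanism.)

import mmap, os, tempfile

# stand-in for a local block replica file on a DataNode's disk
fd, block_path = tempfile.mkstemp()
os.write(fd, b"block-data" * 1000)
os.close(fd)

def short_circuit_read(path, offset, length):
    """Read directly from the local replica file via mmap -- no daemon,
    no socket, no intermediate copies into client-side buffers."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return bytes(m[offset:offset + length])

print(short_circuit_read(block_path, 0, 10))  # b'block-data'
os.remove(block_path)
```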
✛ For simpler queries (no joins, tpch-q*) on large datasets (1TB) > 5-10x faster than Hive
✛ For complex queries on large datasets (1TB) > 20-50x faster than Hive
✛ For complex queries out of buffer cache (300GB) > 25-150x faster than Hive
✛ Due to Impala’s improved execution engine, low startup time, improved I/O, etc.