OSCON 2013: Apache Drill Workshop > Execution & ValueVectors

Apache Drill: Execution

Jacques Nadeau, OSCON July 23, 2013

jacques@apache.org | @intjesus

Drill is…

– Optimistic & Pipelined
– Columnar & Late Materialized
– Vectorized
– Language Agnostic
– MPP Query Engine

Optimistic Execution

Optimistic Recovery
Pipelined Scheduling
Pipelined Communication

Optimistic Recovery

Assume failures
Don’t overbuild for them
– The shorter the queries, the less work lost on failure

Graceful management of node failure at a system level
– Individual queries must be rerun

Avoid the overhead of persistence and barriers.

Pipelined Operators

Pipelining: push data along as soon as it is available
– Cross-operator and cross-node

Straightforward for simple operators like filter and project
Also possible with less common things like sort and radix hash join
– External sort: merge only what is needed to push the first part of the data down the pipeline

Destination buffering rather than source buffering

Full pipelining requires query-at-once scheduling

Query at Once: schedule the entire query at once
Pros:
– Fastest data movement
– Less herd effect
Cons:
– Poorer workload distribution
– Failure checkpoints hard

Task by Task: schedule each task when all previous tasks are completed
Pros:
– Potentially better workload distribution
– Failure checkpoints straightforward
Cons:
– Slower data movement
– Poorer routing decisions

Comparison with MapReduce

Barriers
– Map completion required before shuffle/reduce commencement
– All maps must complete before reduce can start
– In chained jobs, one job must finish entirely before the next one can start

Persistence and Recoverability
– Data is persisted to disk between each barrier
– Serialization and deserialization are required between execution phases

Record versus Columnar Representation

Record Column

Data Format Example

Donut                          Price  Icing
Bacon Maple Bar                2.19   [Maple Frosting, Bacon]
Portland Cream                 1.79   [Chocolate]
The Loop                       2.29   [Vanilla, Fruitloops]
Triple Chocolate Penetration   2.79   [Chocolate, Cocoa Puffs]

Record Encoding
Bacon Maple Bar, 2.19, Maple Frosting, Bacon, Portland Cream, 1.79, Chocolate, The Loop, 2.29, Vanilla, Fruitloops, Triple Chocolate Penetration, 2.79, Chocolate, Cocoa Puffs

Columnar Encoding
Bacon Maple Bar, Portland Cream, The Loop, Triple Chocolate Penetration
2.19, 1.79, 2.29, 2.79
Maple Frosting, Bacon, Chocolate, Vanilla, Fruitloops, Chocolate, Cocoa Puffs
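The two encodings can be sketched in Java as follows. This is only an illustration of the layouts, not Drill code: the class and field names are invented, and the offsets array is an added assumption showing one way to recover which icing values belong to which donut once they are flattened into a single column.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; none of these names come from Drill.
final class DonutLayouts {
    // Record (row) layout: each object keeps all fields of one donut together.
    static final class Donut {
        final String name;
        final double price;
        final List<String> icing;

        Donut(String name, double price, List<String> icing) {
            this.name = name;
            this.price = price;
            this.icing = icing;
        }
    }

    static final List<Donut> ROWS = List.of(
        new Donut("Bacon Maple Bar", 2.19, List.of("Maple Frosting", "Bacon")),
        new Donut("Portland Cream", 1.79, List.of("Chocolate")),
        new Donut("The Loop", 2.29, List.of("Vanilla", "Fruitloops")),
        new Donut("Triple Chocolate Penetration", 2.79, List.of("Chocolate", "Cocoa Puffs")));

    // Columnar layout: one array per field.
    static final String[] NAMES = {
        "Bacon Maple Bar", "Portland Cream", "The Loop", "Triple Chocolate Penetration"};
    static final double[] PRICES = {2.19, 1.79, 2.29, 2.79};
    static final String[] ICING_VALUES = {
        "Maple Frosting", "Bacon", "Chocolate", "Vanilla", "Fruitloops", "Chocolate", "Cocoa Puffs"};
    // Assumed helper: record i owns ICING_VALUES[ICING_OFFSETS[i] .. ICING_OFFSETS[i + 1]).
    static final int[] ICING_OFFSETS = {0, 2, 3, 5, 7};

    // Row layout: every Donut object is visited even though only price is needed.
    static double totalPriceFromRows() {
        double total = 0;
        for (Donut d : ROWS) {
            total += d.price;
        }
        return total;
    }

    // Columnar layout: only the contiguous price column is read.
    static double totalPriceFromColumn() {
        double total = 0;
        for (double p : PRICES) {
            total += p;
        }
        return total;
    }

    // Rebuilds the icing list of one record from the flattened column.
    static List<String> icingOf(int record) {
        return Arrays.asList(ICING_VALUES)
            .subList(ICING_OFFSETS[record], ICING_OFFSETS[record + 1]);
    }

    public static void main(String[] args) {
        System.out.println(totalPriceFromColumn());
        System.out.println(icingOf(2));
    }
}
```

Summing prices in the columnar form never touches the name or icing data, which is the property the storage and execution slides rely on.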

Places to Apply Columnar

Columnar Storage (on disk)
– Improved compression when similar data is co-located
– Alternative compression techniques: dictionary, RLE, delta
– Avoid column reads when not needed

Columnar Execution (in memory)
– Improved cache locality
– Improved CPU pipelining (especially with things like null checks)
– Can reduce memory copies
– Maintain unusual encoding schemes for direct relational operator use

Columnar Execution: When to materialize

Users want rows
Data is columnar
When do you transform?
– On read into memory
– On return to user
– Somewhere in between

Later is generally better
– Not always :)

Late Decompression

Don’t necessarily materialize each value
Reduce memory consumption
Reduce CPU cost
Examples: RLE, Bit Dictionary

Example: RLE and Sum

Dataset (value, run length)
– 2, 4
– 8, 10

Goal
– Sum all the records

Normal Work
– Decompress & store: 2, 2, 2, 2, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8
– Add: 2 + 2 + 2 + 2 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 8

Optimized Work
– 2 * 4 + 8 * 10
– Less memory, fewer operations
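A minimal Java sketch of the two approaches, assuming the run-length-encoded data arrives as two parallel arrays (values and run lengths). The class and method names are made up for illustration and are not part of Drill.

```java
// Hypothetical illustration; not Drill's RLE representation.
final class RleSum {
    // Optimized work: one multiply-add per run, nothing is materialized.
    static long sumRle(int[] values, int[] runLengths) {
        long sum = 0;
        for (int i = 0; i < values.length; i++) {
            sum += (long) values[i] * runLengths[i];
        }
        return sum;
    }

    // Normal work: decompress every value into a buffer, then add them up.
    static long sumDecompressed(int[] values, int[] runLengths) {
        int total = 0;
        for (int n : runLengths) {
            total += n;
        }
        int[] decompressed = new int[total];
        int pos = 0;
        for (int i = 0; i < values.length; i++) {
            for (int j = 0; j < runLengths[i]; j++) {
                decompressed[pos++] = values[i];
            }
        }
        long sum = 0;
        for (int v : decompressed) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] values = {2, 8};
        int[] runLengths = {4, 10};
        System.out.println(sumRle(values, runLengths));          // 2 * 4 + 8 * 10 = 88
        System.out.println(sumDecompressed(values, runLengths)); // 88
    }
}
```

For the slide's dataset the optimized path does two multiplications and two additions, while the baseline allocates 14 integers and performs 14 additions.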

Example: Bitpacked Dictionary VarChar Sort

Dataset:
– Dictionary: [Rupert, Bill, Larry]
– Values: [1,0,1,2,1,2,1,0]

Normal Work:
– Decompress & store: Bill, Rupert, Bill, Larry, Bill, Larry, Bill, Rupert
– Sort: ~24 comparisons of variable-width strings (requiring length lookup and check during comparisons)

Optimized Work:
– Sort dictionary: {Bill: 1, Larry: 2, Rupert: 0}
– Sort bitpacked values
– Work: max 3 string comparisons, ~24 comparisons of fixed-width dictionary bits
– Data in 16 bits as opposed to 368/736 bits for UTF-8/UTF-16
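The optimized path can be sketched roughly as below. This is an assumption-laden illustration: it uses plain int arrays rather than bitpacked buffers, the names are invented, dictionary entries are assumed to be distinct, and the binary search used to build the rank table does a few more string comparisons than the slide's "max 3" estimate. The point it shows is that string comparisons touch only the dictionary, while the per-record sort compares fixed-width integers.

```java
import java.util.Arrays;

// Hypothetical illustration; not Drill's dictionary encoding.
final class DictionarySort {
    // Returns the dictionary in sorted order; the only place strings are compared
    // besides the rank lookup below.
    static String[] sortedDictionary(String[] dictionary) {
        String[] sorted = dictionary.clone();
        Arrays.sort(sorted);
        return sorted;
    }

    // Rewrites each code as its rank in the sorted dictionary, then sorts the codes.
    static int[] sortedCodes(String[] dictionary, int[] codes) {
        String[] sorted = sortedDictionary(dictionary);
        int[] rank = new int[dictionary.length];
        for (int oldCode = 0; oldCode < dictionary.length; oldCode++) {
            // Assumes distinct dictionary entries, so the lookup always succeeds.
            rank[oldCode] = Arrays.binarySearch(sorted, dictionary[oldCode]);
        }
        int[] remapped = new int[codes.length];
        for (int i = 0; i < codes.length; i++) {
            remapped[i] = rank[codes[i]];
        }
        Arrays.sort(remapped); // fixed-width integer comparisons only
        return remapped;
    }

    public static void main(String[] args) {
        String[] dictionary = {"Rupert", "Bill", "Larry"};
        int[] codes = {1, 0, 1, 2, 1, 2, 1, 0};
        String[] sorted = sortedDictionary(dictionary);
        for (int code : sortedCodes(dictionary, codes)) {
            System.out.println(sorted[code]); // values are decoded only at output time
        }
    }
}
```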

Storage versus Relational operators

How do you write operator implementations for many different data representations?
– If you’re trying to inline, you have to avoid abstractions too complex for the JVM to simplify

Push optimizations to the storage layer for things like RLE
– Rare that data is exactly in the desired format beyond the simplest queries

Define a primary in-memory representation for columnar data
– Support alternative randomly accessible compression schemes in all operators (such as Dictionary/Bitpacked)

Vectorization

Operating on more than one record at the same time
– Old school: use word-sized manipulations when records are stored smaller than word size
– New school: SIMD (single instruction, multiple data) instructions
  • GCC, LLVM and the JVM all do various optimizations automatically
  • More can be gained by manually coding algorithms
– Logical vectorization:
  • Using general record characteristics to reduce CPU cycles per collection of records

Alternative Meaning
– Avoiding branching to speed the CPU pipeline, working on large cache-local data in process

Drill Columnar Approach

A RecordBatch contains one or more ValueVectors corresponding to each Field within a BatchSchema

Operators can operate directly against a ValueVector or work with an alternative view of the data by leveraging a SelectionVector

Leverage simple vectorization and trust the JIT to optimize SIMD by generating simple buffer-based operations and loops
– Explore performance impact of advanced SIMD in C for specific operators

Record Batch

Unit of work for the query system
– Operators always work on a batch of records

All values associated with a particular collection of records

Each record batch must have a single defined schema
– Possibly includes fields that have embedded types if you have a heterogeneous field

Record batches are pipelined between operators and nodes

No more than 65k records
Target single L2 cache (~256k)
Operator reconfiguration is done at RecordBatch boundaries

[Figure: three RecordBatches, each containing four ValueVectors (VV)]

SelectionVector

Identifies, by record batch index, which records are under consideration

Avoids early copying of records after applying filtering
– Maintains random accessibility

All operators need to support SelectionVector access

Donut                          Price  Icing
Bacon Maple Bar                2.19   [Maple Frosting, Bacon]
Portland Cream                 1.79   [Chocolate]
The Loop                       2.29   [Vanilla, Fruitloops]
Triple Chocolate Penetration   2.79   [Chocolate, Cocoa Puffs]

Selection Vector: 0
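The idea can be sketched in Java as follows. The filter predicate (price above 2.00), the class, and the method names are assumptions chosen for illustration, and a plain int array stands in for whatever buffer a real SelectionVector uses. The filter records the indices of surviving rows instead of copying them, and a downstream step reads any column through those indices.

```java
import java.util.Arrays;

// Hypothetical illustration; not Drill's SelectionVector class.
final class SelectionVectorSketch {
    // Filter step: returns the batch indices whose price exceeds minPrice.
    // No column data is copied.
    static int[] select(double[] prices, double minPrice) {
        int[] selection = new int[prices.length];
        int count = 0;
        for (int i = 0; i < prices.length; i++) {
            if (prices[i] > minPrice) {
                selection[count++] = i;
            }
        }
        return Arrays.copyOf(selection, count);
    }

    // Downstream step: reads a column through the selection, in selection order.
    static String[] gather(String[] column, int[] selection) {
        String[] out = new String[selection.length];
        for (int i = 0; i < selection.length; i++) {
            out[i] = column[selection[i]];
        }
        return out;
    }

    public static void main(String[] args) {
        String[] names = {
            "Bacon Maple Bar", "Portland Cream", "The Loop", "Triple Chocolate Penetration"};
        double[] prices = {2.19, 1.79, 2.29, 2.79};
        int[] selection = select(prices, 2.00);
        System.out.println(Arrays.toString(selection));
        System.out.println(Arrays.toString(gather(names, selection)));
    }
}
```

Because the selection holds indices into the original batch, the underlying vectors stay randomly accessible and several columns can share one selection.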

ValueVector

One or more contiguous buffers of data containing values
– Stored in native order
– In-memory representation fully specified for cross-language portability

Associated with a single field
– Synonymous with a column in traditional flat tables

Nested fields are separate ValueVectors
Randomly accessible
Defined for each system datatype
Each has an Accessor and a Mutator
– Primitives and simple primitive “structs” are access interfaces

Drill DataTypes

MajorType = MinorType + DataMode + (Width|Scale)?

MinorType
– Describes width and nature of data: smallint, bigint, uint32, varchar4 (utf8), var16char4 (utf16)

DataMode
– Optional (nullable)
– Required (non-nullable)
– Repeated (non item list/array)

Traditional 3 value semantics & Drill 4 value

SQL’s 3-Valued Semantics
– True
– False
– Unknown

Drill adds a fourth
– Repeated

Fixed Value Vectors

Nullable Values

Repeated Values

Variable Width

Repeated Map
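The diagrams for these vector layouts are not reproduced here. As a rough stand-in for the nullable case only, the sketch below pairs a fixed-width, off-heap value buffer in native byte order with a second buffer that marks which slots hold a value. The one-byte-per-slot marker, the class name, and the methods are assumptions for illustration and should not be read as Drill's specified in-memory format.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration of a nullable fixed-width vector; not Drill's layout.
final class NullableIntVectorSketch {
    private final ByteBuffer values; // 4 bytes per slot, native byte order, off-heap
    private final ByteBuffer isSet;  // 1 byte per slot: 1 = value present, 0 = null
    private final int capacity;

    NullableIntVectorSketch(int capacity) {
        this.capacity = capacity;
        // Direct buffers start zero-filled, so every slot begins as null.
        this.values = ByteBuffer.allocateDirect(capacity * Integer.BYTES)
            .order(ByteOrder.nativeOrder());
        this.isSet = ByteBuffer.allocateDirect(capacity);
    }

    // Mutator side: write a value and mark the slot as set.
    void set(int index, int value) {
        values.putInt(index * Integer.BYTES, value);
        isSet.put(index, (byte) 1);
    }

    // Accessor side: random access by slot index.
    boolean isNull(int index) {
        return isSet.get(index) == 0;
    }

    int get(int index) {
        if (isNull(index)) {
            throw new IllegalStateException("null at index " + index);
        }
        return values.getInt(index * Integer.BYTES);
    }

    // Branch-free sum: null slots contribute value * 0, so the loop has no null check.
    long sumNonNull() {
        long sum = 0;
        for (int i = 0; i < capacity; i++) {
            sum += isSet.get(i) * (long) values.getInt(i * Integer.BYTES);
        }
        return sum;
    }

    public static void main(String[] args) {
        NullableIntVectorSketch v = new NullableIntVectorSketch(4);
        v.set(0, 10);
        v.set(2, 32);
        System.out.println(v.sumNonNull()); // 42; slots 1 and 3 stay null
    }
}
```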

Strengths of RecordBatch + ValueVectors

RecordBatch separates high performance/low performance space
– Record-by-record, avoid method invocation
– Batch-by-batch, trust JVM

Avoid serialization/deserialization
Off-heap means large memory footprint without GC woes
Full specification combined with off-heap and batch-level execution allows C/C++ operators as necessary
Random access: sort without restructuring

Code Play Time

Get Latest Drill
    git clone git://git.apache.org/incubator-drill.git
    cd incubator-drill/sandbox/prototype
    git checkout 9f69ed0
    mvn clean install

Download OSCON Drill examples:
    git clone https://github.com/jacques-n/oscon-drill.git
    cd oscon-drill
    mvn install
    cd vectors

http://bit.ly/19goc7R

Vectors Exercise

Goals
– RPC implementation to minimize data copies and support keeping all data off-heap
– Basic benchmark analysis comparing ValueVectors and straight ProtoBuf encoding

Logic
– C = A + B
– Assume two lists of fixed four-byte integers (list a and list b)
– Send them to a remote node
– Remote node decodes them, adds the two numbers together for each record, then returns the list (list c)
– First node sums all returning numbers and verifies the expected result
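The arithmetic part of the exercise can be sketched locally, leaving the RPC transport out entirely. The code below is not taken from the oscon-drill repository; the class and method names are hypothetical, and it only assumes that each list is held as a direct (off-heap) buffer of four-byte integers in native byte order.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration of C = A + B over off-heap buffers; no RPC involved.
final class VectorAdd {
    // Packs an int array into a direct buffer of fixed four-byte values.
    static ByteBuffer encode(int[] ints) {
        ByteBuffer buf = ByteBuffer.allocateDirect(ints.length * Integer.BYTES)
            .order(ByteOrder.nativeOrder());
        buf.asIntBuffer().put(ints); // the view inherits the native byte order
        return buf;
    }

    // "Remote node" step: adds the two lists record by record into a new buffer.
    static ByteBuffer add(ByteBuffer a, ByteBuffer b) {
        if (a.capacity() != b.capacity()) {
            throw new IllegalArgumentException("lists must have the same length");
        }
        ByteBuffer c = ByteBuffer.allocateDirect(a.capacity())
            .order(ByteOrder.nativeOrder());
        for (int offset = 0; offset < a.capacity(); offset += Integer.BYTES) {
            c.putInt(offset, a.getInt(offset) + b.getInt(offset));
        }
        return c;
    }

    // "First node" step: sums the returned list to verify the result.
    static long sum(ByteBuffer c) {
        long total = 0;
        for (int offset = 0; offset < c.capacity(); offset += Integer.BYTES) {
            total += c.getInt(offset);
        }
        return total;
    }

    public static void main(String[] args) {
        ByteBuffer a = encode(new int[] {1, 2, 3});
        ByteBuffer b = encode(new int[] {10, 20, 30});
        System.out.println(sum(add(a, b))); // 11 + 22 + 33 = 66
    }
}
```

The loops read and write the buffers in place with absolute offsets, so no per-record objects are created and nothing is copied onto the Java heap.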

Vectors Exercise

├── pom.xml
└── src
    ├── main
    │   ├── java/org/apache/drill/oscon/rpc
    │   │   ├── ClientConnectFuture.java
    │   │   ├── ExampleClient.java
    │   │   ├── ExampleConfig.java
    │   │   └── ExampleServer.java
    │   └── protobuf
    │       └── Example.proto
    └── test/java/org/apache/drill/oscon/rpc
        └── TestRpc.java
