© 2017 Dremio Corporation @DremioHQ The future of column-oriented data processing with Arrow and Parquet Julien Le Dem, Principal Architect Dremio, VP Apache Parquet
Mule soft mar 2017 Parquet Arrow

Jan 21, 2018

Julien Le Dem
Transcript
Page 1: Mule soft mar 2017 Parquet Arrow

© 2017 Dremio Corporation @DremioHQ

The future of column-oriented data processing with Arrow and Parquet

Julien Le Dem, Principal Architect Dremio, VP Apache Parquet

Page 2: Mule soft mar 2017 Parquet Arrow

• Architect at @DremioHQ

• Formerly Tech Lead at Twitter on Data Platforms.

• Creator of Parquet

• Apache member

• Apache PMCs: Arrow, Kudu, Incubator, Pig, Parquet

Julien Le Dem, @J_

Page 3: Mule soft mar 2017 Parquet Arrow

Agenda

• Community Driven Standard

– Interoperability and Ecosystem

• Benefits of Columnar representation

– On disk (Apache Parquet)

– In memory (Apache Arrow)

• Future of columnar

Page 4: Mule soft mar 2017 Parquet Arrow

Community Driven Standard

• Parquet: Common need for on disk columnar.

• Arrow: Common need for in memory columnar.

• Arrow building on the success of Parquet.

• Benefits:

– Share the effort

– Create an ecosystem

• Standard from the start

Page 5: Mule soft mar 2017 Parquet Arrow

Arrow goals

• Well-documented and cross-language compatible

• Designed to take advantage of the modern CPU

• Embeddable in execution engines, storage layers, etc.

• Interoperable

Page 6: Mule soft mar 2017 Parquet Arrow

Interoperability and Ecosystem

Before: High Interoperability cost

• Each system has its own internal memory format

• 70-80% CPU wasted on marshalling

• Duplication and unnecessary conversions

With Arrow: High Performance Sharing & Interchange

• All systems utilize the same memory format

• No overhead for cross-system communication

• Shared functionality (Parquet-to-Arrow reader)

Page 7: Mule soft mar 2017 Parquet Arrow

Benefits of Columnar formats

@EmrgencyKittens

Page 8: Mule soft mar 2017 Parquet Arrow

Columnar layout

[Figure: a logical table representation, shown next to its row layout and its column layout]
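
The difference between the two layouts in the figure can be sketched in a few lines. This is only an illustration: the table contents and the variable names (`rows`, `row_layout`, `column_layout`) are made up for the example and are not taken from the slides.

```python
# Logical table: three columns (a, b, c) and three rows.
rows = [
    ("a1", "b1", "c1"),
    ("a2", "b2", "c2"),
    ("a3", "b3", "c3"),
]

# Row layout: all values of one row are stored next to each other.
row_layout = [value for row in rows for value in row]

# Column layout: all values of one column are stored next to each other.
column_layout = [value for column in zip(*rows) for value in column]

print(" ".join(row_layout))
print(" ".join(column_layout))
```

In the column layout, reading only column `a` touches one contiguous run of values instead of every third value.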

Page 9: Mule soft mar 2017 Parquet Arrow

On Disk and in Memory

• Different trade-offs

– On disk: Storage.

  • Accessed by multiple queries.

  • Priority to I/O reduction (but still needs good CPU throughput).

  • Mostly Streaming access.

– In memory: Transient.

  • Specific to one query execution.

  • Priority to CPU throughput (but still needs good I/O).

  • Streaming and Random access.

Page 10: Mule soft mar 2017 Parquet Arrow

Parquet on disk columnar format

• Nested data structures

• Compact format:

– type aware encodings

– better compression

• Optimized I/O:

– Projection push down (column pruning)

– Predicate push down (filters based on stats)
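
To give a feel for why a columnar layout allows more compact encodings, here is a minimal sketch of two common ideas, dictionary encoding and run-length encoding, applied to a single column. It is not Parquet's actual on-disk encoding (which is more elaborate); the function names and the sample `country` column are assumptions made for illustration.

```python
from itertools import groupby


def dictionary_encode(values):
    """Replace each value by a small integer index into a list of distinct values."""
    dictionary = []
    index_of = {}
    indices = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Hypothetical column: values of the same type and domain sit together,
# so repeats are frequent and cheap to encode.
country = ["FR", "FR", "FR", "US", "US", "FR"]
dictionary, indices = dictionary_encode(country)
runs = run_length_encode(indices)
```

Because a column holds values of one type, the encoder can be chosen per column, and a general-purpose compressor applied afterwards tends to see more regular input.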

Page 11: Mule soft mar 2017 Parquet Arrow

Access only the data you need

[Figure: the same table (columns a, b, c; rows a1 b1 c1 through a5 b5 c5) shown three times: Columnar + Statistics = Read only the data you need!]
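
The combination shown in the figure (column pruning plus statistics-based filtering) can be sketched as follows. This is a toy in-memory model, not the Parquet reader API: `make_row_group`, `scan`, and the sample data are hypothetical names and values chosen for the example.

```python
def make_row_group(columns):
    """Hypothetical row group: one list of values per column plus min/max stats."""
    stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
    return {"columns": columns, "stats": stats}


row_groups = [
    make_row_group({"a": [1, 2, 3], "b": [10, 20, 30], "c": ["x", "y", "z"]}),
    make_row_group({"a": [4, 5, 6], "b": [40, 50, 60], "c": ["u", "v", "w"]}),
]


def scan(row_groups, projection, column, lower_bound):
    """Return projected rows where `column` >= lower_bound, and how many groups were read."""
    out, groups_read = [], 0
    for group in row_groups:
        _, column_max = group["stats"][column]
        if column_max < lower_bound:
            continue  # predicate push down: the stats prove no row can match
        groups_read += 1
        filter_values = group["columns"][column]
        # Projection push down: only the requested columns are touched.
        projected = [group["columns"][name] for name in projection]
        for i, value in enumerate(filter_values):
            if value >= lower_bound:
                out.append(tuple(col[i] for col in projected))
    return out, groups_read


result, groups_read = scan(row_groups, projection=["c"], column="a", lower_bound=5)
```

Here the first row group is skipped entirely because its maximum for `a` is below the bound, and column `b` is never read.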

Page 12: Mule soft mar 2017 Parquet Arrow

Arrow in memory columnar format

• Nested Data Structures

• Maximize CPU throughput

– Pipelining

– SIMD

– cache locality

• Scatter/gather I/O

Page 13: Mule soft mar 2017 Parquet Arrow

Focus on CPU Efficiency

[Figure: Arrow Memory Buffer]

• Cache Locality

• Super-scalar & vectorized operation

• Minimal Structure Overhead

• Constant value access

– With minimal structure overhead

• Operate directly on columnar data
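
As a rough illustration of constant-time value access with little structural overhead, the sketch below packs a fixed-width column into one contiguous buffer and a string column into an offsets buffer plus a data buffer. It is a simplification under stated assumptions: it ignores validity bitmaps, padding, and alignment, and the helper names (`pack_int32_column`, `get_string`, and so on) are invented for this example rather than taken from any Arrow library.

```python
import struct

INT32 = struct.Struct("<i")  # little-endian 32-bit signed integer


def pack_int32_column(values):
    """Fixed-width column: values packed back to back in one buffer."""
    return b"".join(INT32.pack(value) for value in values)


def get_int32(buffer, i):
    """Constant-time access: the byte offset is computed, nothing is scanned."""
    return INT32.unpack_from(buffer, i * INT32.size)[0]


def pack_string_column(strings):
    """Variable-width column: n + 1 int32 offsets plus one UTF-8 data buffer."""
    offsets, data = [0], bytearray()
    for s in strings:
        data += s.encode("utf-8")
        offsets.append(len(data))
    return pack_int32_column(offsets), bytes(data)


def get_string(offsets_buffer, data, i):
    """Two offset lookups give the byte range of the i-th string."""
    start = get_int32(offsets_buffer, i)
    end = get_int32(offsets_buffer, i + 1)
    return data[start:end].decode("utf-8")


ids = pack_int32_column([7, 11, 13, 17])
offsets, data = pack_string_column(["ab", "", "cde"])
third_id = get_int32(ids, 2)
last_name = get_string(offsets, data, 2)
```

Keeping each column in a few flat buffers is what makes tight loops over the values friendly to caches and to vectorized instructions.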

Page 14: Mule soft mar 2017 Parquet Arrow

Arrow Messages, RPC & IPC

Page 15: Mule soft mar 2017 Parquet Arrow

Common Message Pattern

• Schema Negotiation

– Logical Description of structure

– Identification of dictionary encoded Nodes

• Dictionary Batch

– Dictionary ID, Values

• Record Batch

– Batches of records up to 64K

– Leaf nodes up to 2B values

[Diagram: Schema Negotiation, then Dictionary Batch (0..N Batches), then Record Batch, Record Batch, Record Batch (1..N Batches)]
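
The message sequence above (schema first, then dictionary batches, then record batches) can be mimicked with plain Python objects. This is a sketch of the pattern only, not Arrow's wire format or API: the dataclasses, `write_stream`, `read_stream`, and the integer dictionary id are all assumptions made for the example, and the 64K batch limit is taken from the slide.

```python
from dataclasses import dataclass
from typing import Dict, List

MAX_BATCH_ROWS = 64 * 1024  # "up to 64K" records per batch, per the slide


@dataclass
class Schema:
    fields: List[str]
    dictionary_encoded: Dict[str, int]  # field name -> dictionary id


@dataclass
class DictionaryBatch:
    dictionary_id: int
    values: List[str]


@dataclass
class RecordBatch:
    columns: Dict[str, list]


def write_stream(ids, cities, batch_rows=MAX_BATCH_ROWS):
    """Yield one schema, one dictionary batch, then the record batches."""
    yield Schema(fields=["id", "city"], dictionary_encoded={"city": 0})
    distinct = sorted(set(cities))
    yield DictionaryBatch(dictionary_id=0, values=distinct)
    index_of = {value: i for i, value in enumerate(distinct)}
    for start in range(0, len(ids), batch_rows):
        end = start + batch_rows
        yield RecordBatch(columns={
            "id": ids[start:end],
            "city": [index_of[c] for c in cities[start:end]],
        })


def read_stream(messages):
    """Rebuild rows, decoding dictionary-encoded columns as batches arrive."""
    schema, dictionaries, rows = None, {}, []
    for message in messages:
        if isinstance(message, Schema):
            schema = message
        elif isinstance(message, DictionaryBatch):
            dictionaries[message.dictionary_id] = message.values
        else:
            columns = dict(message.columns)
            for name, dict_id in schema.dictionary_encoded.items():
                columns[name] = [dictionaries[dict_id][i] for i in columns[name]]
            rows.extend(zip(*(columns[field] for field in schema.fields)))
    return rows


messages = list(write_stream([1, 2, 3], ["Paris", "Oslo", "Paris"], batch_rows=2))
rows = read_stream(messages)
```

Sending the dictionary once and only indices in each record batch keeps repeated values out of the per-batch payload.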

Page 16: Mule soft mar 2017 Parquet Arrow

Multi-system IPC

[Diagram: a SQL engine (SQL Operator 1, SQL Operator 2) and a Python process (User defined function); arrows labelled "reads"]

Page 17: Mule soft mar 2017 Parquet Arrow

Summary and Future

Page 18: Mule soft mar 2017 Parquet Arrow

Current activity:

• Spark Integration (SPARK-13534)

• Dictionary encoding (ARROW-542)

• Time related types finalization (ARROW-617)

• Arrow REST API

• Bindings:

– C, Ruby (ARROW-631)

– JavaScript (ARROW-541)

Page 19: Mule soft mar 2017 Parquet Arrow

Some results

- PySpark Integration: 53x speedup (IBM Spark work on SPARK-13534)
  http://s.apache.org/arrowstrata1

- Streaming Arrow Performance: 7.75GB/s data movement
  http://s.apache.org/arrowstrata2

- Arrow Parquet C++ Integration: 4GB/s reads
  http://s.apache.org/arrowstrata3

- Pandas Integration: 9.71GB/s
  http://s.apache.org/arrowstrata4

Page 20: Mule soft mar 2017 Parquet Arrow

Get Involved

• Join the community

– dev@{arrow,parquet}.apache.org

– Slack:

• https://apachearrowslackin.herokuapp.com/

• https://parquet-slack-invite.herokuapp.com/

– http://{arrow,parquet}.apache.org

– Follow @Apache{Parquet,Arrow}