Enterprise DataLake Consumption Layer powered by Presto @ WalmartLabs
Ashish Tadose, Principal Engineer


Jun 28, 2020

Page 1

Enterprise DataLake Consumption Layer powered by Presto @ WalmartLabs

Ashish Tadose

Principal Engineer

Page 2

Agenda

• Data stores @ Walmart Labs

• Motivation for Presto as Distributed Query service

• Multi-tenant Distributed Query service

• Presto deployment & auto-scaling in GCP

• Security integrations

• Overall architecture

• Monitoring

• Best practices and tuning


Page 3

Data stores @ Walmart Labs

Access needs vary from team to team – one solution does not fit all.

Page 4

Motivation for Presto

• DataLake cluster - powered by on-prem Hadoop/HDFS

• Compute storage colocation – GOOD

• Need to ingest data from all diverse sources – CHALLENGING

• Scaling out compute with growing needs – CHALLENGING

• Need to separate storage & compute and support federated query capability – PRESTO

• Isolated clusters in private cloud powering dedicated data-marts

Data journey

Page 5

Multi-tenant Query service - requirements

• Simplified query access layer

• Leverage cloud elastic compute

• Better scalability & effective cluster utilization by auto-scaling

• Performant query response times

• Security
  – Authentication – LDAP
  – Authorization – work with existing policies

• Handle sensitive data – encryption at rest & over the wire

• Efficient monitoring & alerting

• Resource quotas – SLA guarantees

• Flexibility to configure query configuration per tenant

Page 6

Presto & Alluxio work well together

[Charts: Presto vs. Presto + Alluxio. Panels: small range query response time (lower is better); large scan query response time (lower is better); concurrency (higher is better).]

• Avoids unpredictable network

• Consistent query latency

• Higher throughput and better concurrency

Page 7

Presto on GCP

• Cloud DataProc init scripts or optional image - https://cloud.google.com/dataproc/docs/tutorials/presto-dataproc
  – Super easy to spawn a Presto cluster
  – Elevated cost due to managed services such as DataProc
  – Overhead of additional Hadoop components
  – Difficult to source a new catalog or deploy config changes

• Alluxio – no GCP managed deployment

• Presto-admin – can be used for deployment and configuration, but not for auto-scaling

• Need for a lower-level deployment strategy

Page 8

GCP Presto auto-deployment

• WalmartLabs internal auto-scaler: Presto deployer

• Framework to deploy and auto-scale a Presto cluster in GCP

• Leverages Ansible & GCP Deployment Manager

• Auto-scaling via configurable cluster-wide CPU & memory usage thresholds

• Our recent changes – will be released soon to the open community
  – Alluxio deployment co-located with Presto workers
  – Efficient configurability – suitable for multiple envs
  – More auto-scaling configs
  – Terraform integration – making it cloud agnostic
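
The slides do not show how the deployer turns CPU and memory thresholds into scaling decisions. The following is only a minimal sketch of one way such a threshold rule could look. Every name and number in it (`ScalingPolicy`, `desired_workers`, the threshold values, the step size, the worker bounds) is hypothetical and not taken from the WalmartLabs deployer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    """Hypothetical policy; none of these names or values come from the slides."""
    scale_up_cpu: float = 0.75    # cluster-wide CPU fraction that triggers scale-up
    scale_up_mem: float = 0.80    # cluster-wide memory fraction that triggers scale-up
    scale_down_cpu: float = 0.30  # scale down only when BOTH metrics are at or below these
    scale_down_mem: float = 0.40
    step: int = 2                 # workers added or removed per decision
    min_workers: int = 3
    max_workers: int = 50


def desired_workers(current: int, cpu: float, mem: float, policy: ScalingPolicy) -> int:
    """Return the target worker count for cluster-wide cpu/mem usage in [0, 1]."""
    if cpu >= policy.scale_up_cpu or mem >= policy.scale_up_mem:
        target = current + policy.step      # either resource is under pressure
    elif cpu <= policy.scale_down_cpu and mem <= policy.scale_down_mem:
        target = current - policy.step      # both resources are mostly idle
    else:
        target = current                    # inside the dead band: do nothing
    # Clamp to the configured cluster size bounds.
    return max(policy.min_workers, min(policy.max_workers, target))
```

Scaling up when either metric is high but scaling down only when both are low leaves a dead band between the thresholds, which is a common way to avoid flapping. Whether the actual deployer does this is not stated in the slides.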

Page 9

Presto Security integrations

• Ranger plugin for Hive catalog

• Caching Ranger policies

• Hive MetaStore impersonation

Page 10

Hive MetaStore, Alluxio integration & views

• Automated approach to sync metadata

• Hive MetaStore event listeners

• External metastore clients

• Waggle-dance (WIP) - https://github.com/HotelsDotCom/waggle-dance

• Hive native views access

Page 11

Presto Alluxio – overall stack

Page 12

Presto monitoring & archiving

• Presto event listeners
  – Track latencies
  – Analyze failures
  – Faulty clients
  – Frequently queried tables for caching

• On-prem monitoring - Prometheus & Grafana

• GCP Stackdriver integration

• GCP Stackdriver Presto MBeans integration issue
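
Presto event listeners themselves are Java plugins. The sketch below covers only the downstream analysis the bullets describe (latencies, failures, faulty clients, frequently queried tables), assuming the query-completed events have already been archived somewhere as records. The record schema (`source`, `state`, `wall_ms`, `tables`) and the function name `summarize` are assumptions made for illustration, not the format used at WalmartLabs.

```python
from collections import Counter, defaultdict
from statistics import median


def summarize(events):
    """Summarize archived query-completed events (hypothetical dict schema).

    Each event is assumed to look like:
    {"user": str, "source": str, "state": "FINISHED" or "FAILED",
     "wall_ms": int, "tables": [str, ...]}
    """
    latencies = defaultdict(list)  # source -> wall times of finished queries
    failures = Counter()           # source -> number of failed queries
    table_hits = Counter()         # table -> number of queries that touched it

    for e in events:
        if e["state"] == "FAILED":
            failures[e["source"]] += 1
        else:
            latencies[e["source"]].append(e["wall_ms"])
        # Count each table once per query, even if it is listed twice.
        table_hits.update(set(e["tables"]))

    return {
        "median_latency_ms": {s: median(v) for s, v in latencies.items()},
        "failures_by_source": dict(failures),  # a high count hints at a faulty client
        "hot_tables": [t for t, _ in table_hits.most_common(3)],  # caching candidates
    }
```

Grouping by the client-supplied source is one plausible way to spot faulty clients; grouping by user or client tag would work the same way.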

Page 13

Presto custom connectors

• Kafka – ability to apply timestamp filters based on the Kafka message timestamp
  – https://www.slideshare.net/shubhamtagra/debugging-data-pipelines-ola-by-karan-kumar

• Druid connector
  – Based on the Druid JDBC interface and an extension of Presto’s BaseJdbcClient

• ClickHouse connector

• ThoughtSpot connector

• BigQuery connector

• SAP HANA connector

Page 14

Supporting Multi-tenant cluster

• SLA guarantees via Presto resource groups - https://prestosql.io/docs/current/admin/resource-groups.html

• Each application group has varying query patterns
  – Configurable through session properties
    • join_reordering_strategy
    • optimize_top_n_row_number
    • query_max_execution_time
  – Session Property Managers - https://prestosql.io/docs/current/admin/session-property-managers.html
    • Configure sessions for resource groups, source types, client tags
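
To make the per-tenant setup more concrete, here is a minimal sketch that generates the two JSON documents Presto's file-based resource group manager and session property manager read. The tenant names (`bi`, `etl`), all limits, and the chosen property values are hypothetical. The slides do not give the actual WalmartLabs configuration.

```python
import json

# Hypothetical tenants: (name, memory share, max concurrent queries, max queued).
TENANTS = [("bi", "50%", 40, 200), ("etl", "30%", 10, 50)]


def resource_groups(tenants):
    """Build a resource-groups document with one subgroup per tenant."""
    return {
        "rootGroups": [{
            "name": "global",
            "softMemoryLimit": "80%",
            "hardConcurrencyLimit": 100,
            "maxQueued": 1000,
            "subGroups": [
                {"name": name, "softMemoryLimit": mem,
                 "hardConcurrencyLimit": conc, "maxQueued": queued}
                for name, mem, conc, queued in tenants
            ],
        }],
        # Route each query to a tenant group by its client-supplied source name.
        "selectors": [{"source": f"{name}-.*", "group": f"global.{name}"}
                      for name, _, _, _ in tenants],
    }


def session_property_rules():
    """Per-tenant session defaults, matched on resource group or client tag."""
    return [
        {"group": "global\\.bi",
         "sessionProperties": {"query_max_execution_time": "5m",
                               "join_reordering_strategy": "AUTOMATIC"}},
        {"clientTags": ["etl"],
         "sessionProperties": {"query_max_execution_time": "8h",
                               "optimize_top_n_row_number": "false"}},
    ]
```

Calling `json.dumps(resource_groups(TENANTS), indent=2)` and `json.dumps(session_property_rules(), indent=2)` yields the file contents. Generating them from one tenant list keeps the resource quotas and the session defaults in step as tenants are added.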

Page 15

Distributed query across Data stores

Page 16

ORC storage recommendations

• ORC compression – ZLIB
  – Point-to-point queries perform well with Snappy
  – For large aggregations, ZLIB is better

• Enable bloom filters on columns frequently used in filters

• Enable sorting on frequently used columns (boosts query performance at the cost of higher ingestion time)

• Increase ORC stripe & stride size
  – ORC files are splittable at the stripe level, which affects parallelism.
  – We observed an 18%-22% increase in Presto parallelism (after setting stripe size = 128 MB and index stride = 16k)

• Enable table & column stats (most important)
  – Stats can now be computed via Presto - https://prestosql.io/docs/current/sql/analyze.html
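
As a rough illustration of these recommendations, the sketch below renders a Hive `CREATE TABLE` statement carrying the corresponding ORC table properties, plus the Presto `ANALYZE` statement for collecting stats. The helper names, the example table, and the reading of "16k" as 16,000 rows are assumptions. The slides do not say how the settings were applied, and sorting is left out because it depends on the ingestion job.

```python
def orc_table_ddl(table, columns, bloom_columns, compress="ZLIB",
                  stripe_bytes=128 * 1024 * 1024, row_index_stride=16_000):
    """Render a Hive CREATE TABLE statement with ORC tuning properties.

    Defaults mirror the slide: ZLIB, 128 MB stripes, and an index stride of
    16k (assumed here to mean 16,000 rows).
    """
    cols = ",\n  ".join(f"{name} {typ}" for name, typ in columns)
    props = {
        "orc.compress": compress,
        "orc.stripe.size": str(stripe_bytes),            # bytes
        "orc.row.index.stride": str(row_index_stride),   # rows per index entry
        "orc.bloom.filter.columns": ",".join(bloom_columns),
    }
    prop_sql = ",\n  ".join(f"'{k}'='{v}'" for k, v in props.items())
    return (f"CREATE TABLE {table} (\n  {cols}\n)\nSTORED AS ORC\n"
            f"TBLPROPERTIES (\n  {prop_sql}\n)")


def analyze_sql(table):
    """Presto statement that collects table and column statistics."""
    return f"ANALYZE {table}"
```

For example, `orc_table_ddl("sales.orders", [("order_id", "BIGINT"), ("store_id", "INT")], ["store_id"])` produces DDL with a bloom filter on the hypothetical `store_id` column, and `analyze_sql("hive.sales.orders")` gives the statement to run from Presto once the data is loaded.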

Page 17

THANKS!
