
1

Enterprise DataLake Consumption Layer powered by Presto @ WalmartLabs

Ashish Tadose

Principal Engineer

2

Agenda

• Data stores @ Walmart Labs

• Motivation for Presto as Distributed Query service

• Multi-tenant Distributed Query service

• Presto deployment & auto-scaling in GCP

• Security integrations

• Overall architecture

• Monitoring

• Best practices and tuning


3

Data stores @ Walmart Labs

Access needs vary from team to team – one solution does not fit all.

4

Motivation for Presto..

• DataLake cluster - powered by on-prem Hadoop/HDFS

• Compute storage colocation – GOOD

• Need to ingest data from all diverse sources – CHALLENGING

• Scaling out compute with growing needs – CHALLENGING

• Need to separate storage & compute / support federated query capability – PRESTO..

• Isolated clusters in private cloud powering dedicated data-marts

Data journey

5

• Simplified query access layer

• Leverage cloud elastic compute

• Better scalability & Effective cluster utilization by auto-scaling

• Performant query response times

• Security

– Authentication – LDAP

– Authorization – work with existing policies

• Handle sensitive data – encryption at rest & over the wire

• Efficient Monitoring & alerting

• Resource quotas – SLA guarantees

• Flexibility to configure query configuration per tenant

Multi-tenant Query service - requirements

6

Presto & Alluxio work well together…

[Charts comparing Presto with Presto + Alluxio: small range query response time (lower is better), large scan query response time (lower is better), concurrency (higher is better)]

• Avoids unpredictable network

• Consistent query latency

• Higher throughput and better concurrency

7

• Cloud DataProc init scripts or optional image - https://cloud.google.com/dataproc/docs/tutorials/presto-dataproc

– Super easy to spawn Presto cluster

– Elevated cost due to managed services such as DataProc

– Overhead of additional Hadoop components

– Difficult to source new catalog or deploy config changes

• Alluxio – no GCP managed deployment

• Presto-admin – can be used for deployment and configuration, but not for auto-scaling

• Need for lower level deployment strategy

Presto on GCP


8

• WalmartLabs internal auto-scaler Presto deployer

• Framework to deploy and auto-scale Presto cluster in GCP

• Leverages Ansible & GCP Deployment Manager

• Auto-scaling via configurable cluster-wide CPU & memory usage thresholds

• Our recent changes – will be released soon to the open community

– Alluxio deployment co-located with Presto workers

– Efficient configurability – suitable for multiple envs

– More auto-scaling configs

– Terraform integration – making it cloud agnostic

GCP Presto auto-deployment
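The threshold-based auto-scaling described on this slide could look roughly like the sketch below. It is not the WalmartLabs deployer's code: the `ScalingPolicy` fields, the threshold values, and the step size are all hypothetical, chosen only to show how cluster-wide CPU and memory usage might drive a worker count.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    # All thresholds and limits below are hypothetical example values.
    cpu_high: float = 0.75   # scale out above this cluster-wide CPU usage
    mem_high: float = 0.80   # scale out above this cluster-wide memory usage
    cpu_low: float = 0.30    # scale in only when CPU is below this...
    mem_low: float = 0.40    # ...and memory is below this
    min_workers: int = 2
    max_workers: int = 20
    step: int = 2            # workers added or removed per decision


def desired_workers(current: int, cpu: float, mem: float,
                    policy: ScalingPolicy) -> int:
    """Return the target worker count for usage ratios in the range 0.0-1.0."""
    if cpu > policy.cpu_high or mem > policy.mem_high:
        # Either resource under pressure is enough to scale out.
        target = current + policy.step
    elif cpu < policy.cpu_low and mem < policy.mem_low:
        # Both resources must be idle before scaling in.
        target = current - policy.step
    else:
        target = current
    # Clamp to the configured cluster size limits.
    return max(policy.min_workers, min(policy.max_workers, target))
```

Scaling out on either signal but scaling in only when both are low is one conservative choice; a real deployer would likely also add a cool-down period and drain workers gracefully before removing them.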

9

• Ranger plugin for Hive catalog

• Caching Ranger policies

• Hive MetaStore impersonation

Presto Security integrations

10

Hive MetaStore, Alluxio integration & Views

• Automated approach to sync metadata

• Hive MetaStore event listeners

• External metastore clients

• Waggle-dance (WIP)

https://github.com/HotelsDotCom/waggle-dance

• Hive native views access


11

Presto Alluxio – overall stack

12

• Presto Event listeners

– Track latencies

– Analyze failures

– Faulty clients

– Frequently queried tables for caching

• On-prem monitoring - Prometheus & Grafana

• GCP Stackdriver integration

• GCP Stackdriver Presto MBeans integration issue

Presto monitoring & archiving
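As an illustration of what the archived event-listener data enables, the sketch below summarizes query events for the four uses listed above. Presto event listeners themselves are Java plugins; the record layout here (`source`, `state`, `wall_ms`, `tables`) is a hypothetical simplification of what such a listener might write out, not the actual event schema.

```python
from collections import Counter
from statistics import median

# Hypothetical, simplified query-completion records.
EVENTS = [
    {"source": "bi-tool", "state": "FINISHED", "wall_ms": 1200,
     "tables": ["sales.orders"]},
    {"source": "etl-job", "state": "FAILED", "wall_ms": 300,
     "tables": ["sales.orders", "sales.items"]},
    {"source": "bi-tool", "state": "FINISHED", "wall_ms": 800,
     "tables": ["sales.orders"]},
    {"source": "etl-job", "state": "FAILED", "wall_ms": 100,
     "tables": ["sales.items"]},
    {"source": "notebook", "state": "FINISHED", "wall_ms": 2000,
     "tables": ["hr.people"]},
]


def summarize(events, top_n=2):
    """Aggregate latency, failures per client source, and hot tables."""
    finished = [e for e in events if e["state"] == "FINISHED"]
    failed = [e for e in events if e["state"] == "FAILED"]
    table_hits = Counter(t for e in events for t in e["tables"])
    return {
        # Track latencies: median wall time of successful queries.
        "median_latency_ms": (median(e["wall_ms"] for e in finished)
                              if finished else None),
        # Analyze failures / faulty clients: failures grouped by source.
        "failures_by_source": dict(Counter(e["source"] for e in failed)),
        # Frequently queried tables: candidates for caching.
        "hot_tables": [t for t, _ in table_hits.most_common(top_n)],
    }


summary = summarize(EVENTS)
```

With the sample records, `summary["hot_tables"]` lists `sales.orders` first, which is the kind of signal that could guide what to pin in a cache such as Alluxio.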

13

• Kafka – ability to apply timestamp filters based on the Kafka message timestamp

– https://www.slideshare.net/shubhamtagra/debugging-data-pipelines-ola-by-karan-kumar

• Druid connector – based on the Druid JDBC interface and an extension of Presto’s BaseJdbcClient

• ClickHouse connector

• ThoughtSpot connector

• BigQuery connector

• SAP HANA connector

Presto custom connectors


14

• SLA guarantees via Presto resource groups - https://prestosql.io/docs/current/admin/resource-groups.html

• Each application group has varying query patterns

– Configurable through session properties

  • join_reordering_strategy

  • optimize_top_n_row_number

  • query_max_execution_time

– Session Property Managers - https://prestosql.io/docs/current/admin/session-property-managers.html

  • Configure sessions for resource groups, source types, client tags

Supporting Multi-tenant cluster
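A minimal sketch of how per-tenant quotas and session defaults might be generated for the file-based resource group and session property managers. The tenant names, limits, and property values are hypothetical, and the JSON field names follow the linked Presto documentation as best understood here; verify them against the documentation for the Presto version in use.

```python
import json
import re

# Hypothetical tenants: names, limits, and session values are examples only.
TENANTS = {
    "reporting": {"memory": "50%", "concurrency": 20, "queued": 100,
                  "session": {"query_max_execution_time": "30m"}},
    "adhoc": {"memory": "30%", "concurrency": 10, "queued": 50,
              "session": {"join_reordering_strategy": "AUTOMATIC"}},
}


def resource_groups(tenants):
    """Build one sub-group per tenant under a single root group."""
    return {
        "rootGroups": [{
            "name": "global",
            "softMemoryLimit": "100%",
            "hardConcurrencyLimit": sum(t["concurrency"]
                                        for t in tenants.values()),
            "maxQueued": sum(t["queued"] for t in tenants.values()),
            "subGroups": [
                {"name": name,
                 "softMemoryLimit": t["memory"],
                 "hardConcurrencyLimit": t["concurrency"],
                 "maxQueued": t["queued"]}
                for name, t in tenants.items()
            ],
        }],
        # Assumption: each tenant's clients send a client tag equal to
        # the tenant name, which routes the query to its group.
        "selectors": [
            {"clientTags": [name], "group": f"global.{name}"}
            for name in tenants
        ],
    }


def session_property_rules(tenants):
    """Build session defaults matched on the resource group name (a regex)."""
    return [
        {"group": re.escape(f"global.{name}"),
         "sessionProperties": t["session"]}
        for name, t in tenants.items()
    ]


resource_groups_json = json.dumps(resource_groups(TENANTS), indent=2)
session_rules_json = json.dumps(session_property_rules(TENANTS), indent=2)
```

Generating both files from one tenant table keeps quotas and session defaults in step when a tenant is added or its limits change.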


15

Distributed query across Data stores

16

• ORC compression – ZLIB

– Point-to-point queries perform well with Snappy

– For large aggregations, ZLIB is better

• Enable bloom filter on frequently used columns in filters

• Enable sorting on frequently used columns (boosts query performance at the cost of higher ingestion time)

• Increase ORC stripe & stride size

– ORC files are splittable at the stripe level, so stripe size affects parallelism.

– We observed an 18%-22% increase in Presto parallelism (after setting stripe size = 128 MB and index stride = 16k)

• Enable table & column stats (most important)

– Now stats can be computed via Presto - https://prestosql.io/docs/current/sql/analyze.html

ORC storage recommendations
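The recommendations above could be applied to a Hive table roughly as sketched below. The helper names and the example table and column are hypothetical. The property keys are the standard Hive ORC table properties, and "16k" is read as 16,000 rows, which is an assumption. Changed table properties typically affect only files written afterwards.

```python
def orc_table_properties(bloom_columns, compression="ZLIB",
                         stripe_mb=128, row_index_stride=16_000):
    """Return Hive ORC table properties for the recommended settings."""
    props = {
        "orc.compress": compression,
        # Hive expects the stripe size in bytes.
        "orc.stripe.size": str(stripe_mb * 1024 * 1024),
        # Number of rows between row-index entries (assumed 16k = 16,000).
        "orc.row.index.stride": str(row_index_stride),
    }
    if bloom_columns:
        props["orc.bloom.filter.columns"] = ",".join(bloom_columns)
    return props


def hive_alter_statement(table, props):
    """Render a Hive ALTER TABLE statement that sets the given properties."""
    body = ", ".join(f"'{key}'='{value}'" for key, value in props.items())
    return f"ALTER TABLE {table} SET TBLPROPERTIES ({body})"


def presto_analyze_statement(table):
    """Render the Presto statement that collects table and column stats."""
    return f"ANALYZE {table}"


# Hypothetical table and filter column.
props = orc_table_properties(["customer_id"])
alter_sql = hive_alter_statement("sales.orders", props)
analyze_sql = presto_analyze_statement("hive.sales.orders")
```

Running the `ANALYZE` statement after each significant load keeps the statistics current for the cost-based optimizer.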


17

THANKS!
