How to ensure Presto scalability in multi use case

Kai Sasaki

Treasure Data Inc.

Kai Sasaki (@Lewuathe)

Software Engineer at Treasure Data Inc.

Hadoop/Presto/Spark

Presto In TD

• 150,000+ queries / day
• 190+ TB processing / day
• 10+ MB processing / query * sec
• 100+ million processed records / query

[Diagram: Presto In TD architecture. Components shown: BI Tool (HTTP), Prestobase Proxy, TD API, PerfectQueue, Presto (query), Plazma (data).]

How to make it scalable

• Prestobase Proxy
• Node scheduler
• Resource Group

Prestobase proxy


Prestobase proxy provides an interface to Presto, especially for BI tools connecting through JDBC/ODBC, and is intended to replace Prestogres.


Prestobase proxy

• Written in Scala
• Finagle-based RPC proxy
• Runs as a Docker container
• Uses Airframe
• VCR-based lightweight test framework

Finagle

Finagle is an extensible RPC system for the JVM, used to construct high-concurrency servers. Finagle implements uniform client and server APIs for several protocols, and is designed for high performance and concurrency.

see: https://twitter.github.io/finagle/

Finagle

protected val service: Service[Request, Response] =
  bind[SomeFilter] andThen
  bind[AnotherHandler] andThen
  LastFilter andThen
  prestoClient

Build the request pipeline by binding filters and handlers with Airframe.
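The idea of chaining filters in front of a final client can be sketched without Finagle or Airframe. The snippet below is a simplified model only: `Service`, `Filter`, `trim`, `rejectEmpty`, and `fakePrestoClient` are hypothetical names for illustration and are not Finagle's types or Prestobase code.

```scala
// Simplified model of a filter pipeline. These are NOT Finagle's types.
object PipelineSketch {
  type Service = String => String   // request => response
  type Filter  = Service => Service // wraps the next service in the chain

  // Hypothetical filter: normalize whitespace before passing the request on.
  val trim: Filter = next => req => next(req.trim)

  // Hypothetical filter: short-circuit empty requests without calling `next`.
  val rejectEmpty: Filter = next => req =>
    if (req.isEmpty) "error: empty query" else next(req)

  // Hypothetical stand-in for the Presto client at the end of the chain.
  val fakePrestoClient: Service = sql => s"result of [$sql]"

  // Compose filters left to right, ending with the last service.
  def pipeline(filters: Seq[Filter], last: Service): Service =
    filters.foldRight(last)((f, next) => f(next))

  val service: Service = pipeline(Seq(trim, rejectEmpty), fakePrestoClient)
}
```

Each filter can either pass the request on or answer it directly, which is what makes it possible to swap one stage for another without touching the rest of the chain.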

Airframe

Airframe is a trait-based dependency injection framework that uses Scala macros.

- https://github.com/wvlet/airframe


Airframe

- Dependency injection tailored to Scala
- Tagged binding with wvlet https://github.com/wvlet/wvlet
- Object lifecycle management

Airframe

import wvlet.airframe._

val design: Design =
  newDesign
    .bind[X].toInstance(new X) // Bind type X to a concrete instance
    .bind[Y].toSingleton       // Bind type Y to a singleton object
    .bind[Z].to[ZImpl]         // Bind type Z to an instance of ZImpl

trait App {
  val x = bind[X]
  val y = bind[Y]
  val z = bind[Z]
  // Do something with X, Y, and Z
}

val session = design.newSession
val app: App = session.build[App]

VCR testing framework

Record the HTTP interactions of the test suite so that tests are stable and deterministic.

See more detail: https://testing.googleblog.com/2016/11/what-test-engineers-do-at-google.html


On CI

protected val service: Service[Request, Response] =
  bind[SomeFilter] andThen
  bind[AnotherHandler] andThen
  QueryRewriter andThen
  bind[RequestVCR] andThen
  prestoClient

On Production

protected val service: Service[Request, Response] =
  bind[SomeFilter] andThen
  bind[AnotherHandler] andThen
  QueryRewriter andThen
  bind[NoRecording] andThen
  prestoClient

[Diagram: Recording. Components shown: Prestobase, RequestVCR, Client, SQLite.]

[Diagram: Replaying. Components shown: Prestobase, RequestVCR, Client, SQLite.]
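The record-and-replay idea can be sketched as a wrapper around a client. This is a minimal illustration under stated assumptions: an in-memory map stands in for the SQLite store, requests and responses are plain strings, and `VcrSketch`, `Cassette`, and `vcr` are hypothetical names rather than the actual RequestVCR implementation.

```scala
import scala.collection.mutable

object VcrSketch {
  type Client = String => String // request => response

  sealed trait Mode
  case object Recording extends Mode
  case object Replaying extends Mode

  // In-memory stand-in for the SQLite store (an assumption for this sketch).
  final class Cassette {
    private val store = mutable.Map.empty[String, String]
    def save(request: String, response: String): Unit = store.update(request, response)
    def load(request: String): Option[String] = store.get(request)
  }

  // Wrap a backend client so that it either records or replays interactions.
  def vcr(mode: Mode, cassette: Cassette, backend: Client): Client = request =>
    mode match {
      case Recording =>
        // Call the real backend and remember its answer.
        val response = backend(request)
        cassette.save(request, response)
        response
      case Replaying =>
        // Never touch the backend; fail loudly if nothing was recorded.
        cassette.load(request).getOrElse(
          throw new NoSuchElementException(s"no recorded response for: $request"))
    }
}
```

In replay mode the backend is never called, so a test run depends only on the recorded data and not on the state of a live cluster.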

Prestobase proxy

Will be open sourced soon

Node Scheduler


Submitting a query proceeds as follows:

- Analyze query AST
- Make query logical/physical plan
- Schedule each stage

Node Scheduler

[Diagram: a query divided into stage2, stage1, and stage0. Task labels shown: task2-0, task2-1, task2-0, task1-0, task1-1, task0-0. Other labels: Table Scan, output.]

Node Scheduler

NodeScheduler creates a NodeSelector, which selects the worker nodes on which tasks are scheduled. The NodeSelector picks worker nodes when there are available splits.

Node Scheduler in TD

Keeps a map of worker nodes that are candidates for launching the next tasks.

- Ignore min candidates
- Limit by available memory pool

Memory pool usage goes back to normal after the task is completed.
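Limiting candidates by available memory might look roughly like the sketch below. It is an interpretation, not TD's scheduler: the `Worker` fields, the `minFreeBytes` threshold, and the tie-breaking by running task count are all assumptions made for illustration. Reading "ignore min candidates" as "do not force a minimum number of candidates", the function may return fewer nodes than requested, or none.

```scala
object NodeSelectionSketch {
  // Hypothetical view of a worker node; field names are assumptions.
  final case class Worker(id: String, freeMemoryBytes: Long, runningTasks: Int)

  // Keep only workers with enough free memory, preferring less loaded ones.
  // No minimum candidate count is enforced, so the result can be empty.
  def candidates(workers: Seq[Worker], minFreeBytes: Long, limit: Int): Seq[Worker] =
    workers
      .filter(_.freeMemoryBytes >= minFreeBytes)
      .sortBy(w => (w.runningTasks, -w.freeMemoryBytes))
      .take(limit)
}
```

A worker that is excluded while its memory pool is busy becomes eligible again once its free memory rises above the threshold.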


Challenges

- Smoothing CPU time metric
- Split type awareness
- Avoid problematic worker nodes

Resource Group


Resource Group was introduced in Presto 0.147 → https://prestodb.io/docs/current/admin/resource-groups.html

Resource Group limits resource usage per account, group, or query.


Resource Group

[Diagram: rootGroup with child groups general and adhoc. Limit labels shown: softMemoryLimit: 100%, maxQueued: 5000, maxRunning: 1000; and softMemoryLimit: 100%, maxRunning: 1000.]

Resource Group limits

- maxQueued
- maxRunning
- softMemoryLimit: once exceeded, following queries will be queued
- softCpuLimit: once exceeded, a penalty is imposed on the maximum number of running queries
- hardCpuLimit: once exceeded, following queries will be queued
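How these limits interact on query submission can be sketched as a small decision function. This is a simplified model, not Presto's implementation: it covers only maxQueued, maxRunning, and softMemoryLimit (the CPU limits are left out), the memory limit is expressed in bytes rather than a percentage, and `Limits`, `State`, and `admit` are hypothetical names.

```scala
object AdmissionSketch {
  // Hypothetical, simplified limits and current state of one resource group.
  final case class Limits(maxQueued: Int, maxRunning: Int, softMemoryLimitBytes: Long)
  final case class State(running: Int, queued: Int, memoryUsageBytes: Long)

  sealed trait Decision
  case object Run extends Decision
  case object Queue extends Decision
  case object Reject extends Decision

  // Run if under both the concurrency and memory limits,
  // otherwise queue if the queue has room, otherwise reject.
  def admit(limits: Limits, state: State): Decision = {
    val canRun = state.running < limits.maxRunning &&
      state.memoryUsageBytes <= limits.softMemoryLimitBytes
    if (canRun) Run
    else if (state.queued < limits.maxQueued) Queue
    else Reject
  }
}
```

For example, with maxRunning = 2 and two queries already running, a new query is queued rather than rejected as long as the queue is below maxQueued.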

Resource Group scheduling

- schedulingPolicy
  - fair : FIFO
  - weighted : Selected stochastically
  - query_priority : Selected according to priority
- schedulingWeight
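Stochastic selection under the weighted policy can be illustrated with a weight-proportional draw. The sketch below assumes non-negative integer weights and picks among sibling groups; `Group` and `pick` are hypothetical names, and Presto's actual selection logic may differ in detail.

```scala
import scala.util.Random

object WeightedPickSketch {
  // Hypothetical sub-group with a non-negative scheduling weight.
  final case class Group(name: String, schedulingWeight: Int)

  // Pick one group with probability proportional to its schedulingWeight.
  def pick(groups: Seq[Group], rng: Random): Option[Group] = {
    val total = groups.map(_.schedulingWeight).sum
    if (total <= 0) None
    else {
      var ticket = rng.nextInt(total) // uniform in [0, total)
      // Walk the groups, subtracting weights until the ticket is used up.
      groups.find { g =>
        ticket -= g.schedulingWeight
        ticket < 0
      }
    }
  }
}
```

With weights 1 and 3, the second group is expected to be chosen about three times as often as the first over many draws.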

Resource Group

Every query must be associated with a resource group. The matching is done by configured selectors.

{ "user": "bob", "group": "general" },
{ "source": ".*adhoc.*", "group": "global.adhoc.adhoc_${USER}" }
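Selector matching for the two rules above can be sketched as follows. The sketch assumes that `user` and `source` are regular expressions matched against the whole value, that the first matching selector wins, and that `${USER}` in the group name is replaced with the submitting user; `Selector` and `select` are hypothetical names, not Presto classes.

```scala
object SelectorSketch {
  // Hypothetical selector: optional user/source regexes and a group template.
  final case class Selector(user: Option[String], source: Option[String], group: String)

  // Return the group of the first selector whose patterns all match.
  // A missing pattern matches anything.
  def select(selectors: Seq[Selector], user: String, source: String): Option[String] =
    selectors.collectFirst {
      case s if s.user.forall(p => user.matches(p)) &&
                s.source.forall(p => source.matches(p)) =>
        s.group.replace("${USER}", user) // expand the template variable
    }
}
```

Under these assumptions, a query from user bob lands in `general`, while a query from alice whose source contains "adhoc" lands in `global.adhoc.adhoc_alice`.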

Resource Group

[Diagram: rootGroup with child groups general and adhoc. Limit labels shown: softMemoryLimit: 100%, maxRunning: 1000. Bob's queries are shown assigned to a group.]

Resource Group DI

Resource group configuration behavior can be changed easily with Guice injection.

- ResourceGroupConfigurationManager
  - configure(ResourceGroup, SelectionContext)
- ResourceGroupSelector
  - match(Statement, SelectionContext)

SelectionContext

SelectionContext holds the information used to associate a submitted query with a resource group.

- Authenticated
- User
- Source
- Query Priority

These are currently available by default.

{
  "runningQueryIds": ["query1", "query2"],
  "accountId": 1,
  "children": [
    {
      "memoryUsage": 12345,
      "runningQueryIds": ["query1"],
      "children": [],
      "runningQueries": 1,
      "queuedQueries": 0,
      "maxRunningQueries": 2,
      "resourceId": "general"
    },
    {
      "memoryUsage": 26296,
      "runningQueryIds": ["query2"],
      "children": [],
      "runningQueries": 1,
      "queuedQueries": 0,
      "maxRunningQueries": 2,
      "resourceId": "scheduled"
    }
  ],
  "runningQueries": 2,
  "maxRunningQueries": 30
}

- Top-level runningQueryIds: queries in parent group
- First child: running query in general
- Second child: running query in scheduled

Recap

A distributed system often requires each component to be stable and scalable. We can make the Presto ecosystem reliable through:

- Code modification reliability with DI
- VCR testing
- Multi-dimensional resource scheduling
- Resource isolation, which makes a multi-tenant distributed SQL engine reliable