Using Spark in Production
Quick Introduction to Localytics
•Localytics is an app marketing service with a strong analytics engine at its core.
•We process billions of events per day in real time, using a backend made of Scala and AWS.
•Every data point we’ve ever received is available for querying, and we integrate analytics insight with marketing automation.
Original Architecture
•Rails
•MySQL
• It was 2009 and people did that back then
Current Backend Architecture
•Scala services loosely coupled through queues
•“Queuer” service is our front line, and collects and enqueues events sent from mobile or web apps
•“Processor” service handles deduplication, device recognition, and a lot more
•Data ends up in MPP DB (querying) and in S3 (export)
Our Processor Service
•Processor service is the place we have traditionally added new functionality
•After four years of this, it’s rather monolithic
•Parts are absolutely critical to our company; other parts are not
•Different parts scale differently
Rethinking Our Data Processing
•Treat the Processor service as the “source of truth”
•Build outboard services which consume this truth
•Simplify the existing Processor by moving some logic out
•Add specialized data sources to take burden off MPP DB
The Source of Truth
•S3 bucket containing batches of events, filed by date and hour of processing
•Events may (will) be out of order
•A single user’s session may (will) be split across files
•At press time: each file ~500MB of JSON objects; ~500 files per hour (~6 TB/day)
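The bucket layout might look something like the sketch below; the exact key scheme is an assumption for illustration, not Localytics' real bucket structure.

```scala
// Sketch of the assumed S3 key layout: event batches filed by the date
// and hour they were processed. The prefix format here is hypothetical.
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

object TruthBucket {
  // e.g. "events/2014/11/05/14/part-00042.json"
  def key(processedAt: LocalDateTime, part: Int): String = {
    val prefix = processedAt.format(DateTimeFormatter.ofPattern("yyyy/MM/dd/HH"))
    f"events/$prefix/part-$part%05d.json"
  }
}
```

Filing by processing time (not event time) is what makes out-of-order events and split sessions unavoidable downstream.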
Side Note: Lambda Architecture
•Batch and stream processing, at once
•Most of what we do right now is batch processing, where “batch” means “all of our data ever”
•MPP DB provides arbitrary querying of entire data set
•Outboard services can provide a speed layer for simple/common queries
Feature Requirements
•Every event we receive has a number of attributes passed along with it
 •e.g., Country, Timezone, OS Version, etc.
•We want to keep the most recent attribute values for each of our users in a key-value store
•This provides fast real-time access to current truths, rather than sifting through a series of events for point-in-time facts
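Conceptually the reduction looks like the minimal pure-Scala sketch below (the event shape and field names are assumptions); in production the same fold runs as a Spark job over each batch.

```scala
// Minimal sketch: keep the most recent value of each attribute per user.
// The Event shape and field names are illustrative assumptions.
case class Event(userId: String, timestamp: Long, attrs: Map[String, String])

object LatestAttributes {
  // Fold events (possibly out of order) into per-user current values:
  // for each attribute, the event with the highest timestamp wins.
  def latest(events: Seq[Event]): Map[String, Map[String, String]] =
    events
      .sortBy(_.timestamp) // oldest first, so later writes win
      .foldLeft(Map.empty[String, Map[String, String]]) { (acc, e) =>
        val current = acc.getOrElse(e.userId, Map.empty[String, String])
        acc.updated(e.userId, current ++ e.attrs)
      }
}
```

The result for each user is exactly what gets written to the key-value store: current truths, no event sifting required at read time.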
Apache Spark
•Most often envisioned as multi-node, long-lived cluster à la Hadoop
•This is a great fit for many batch-layer tasks
•Spark Streaming makes sliding-window querying very simple
•Spark SQL provides a familiar interface to non-production coders (business intelligence, data science, etc.)
A Few Gripes about Spark Clusters
•Weak tooling around cluster spin-up, monitoring, maintenance, scaling
•Tough to scale horizontally, compared to Amazon ELB
•Building fat JARs when you use any non-Spark dependencies at all is a god-damned nightmare
I’m Not Kidding About the Fat JARs

libraryDependencies ++= Seq(
  //
  ("org.apache.spark" %% "spark-sql" % "1.1.0").
    exclude("com.typesafe.akka", "akka-actor_2.10").
    exclude("com.typesafe.akka", "akka-slf4j_2.10").
    exclude("org.mortbay.jetty", "servlet-api").
    exclude("javax.transaction", "jta").
    exclude("commons-beanutils", "commons-beanutils-core").
    exclude("commons-logging", "commons-logging").
    exclude("org.slf4j", "jcl-over-slf4j").
    exclude("org.slf4j", "slf4j-log4j12").
    exclude("commons-collections", "commons-collections").
    exclude("com.esotericsoftware.minlog", "minlog"),
  //
  ("com.typesafe.play" %% "play-json" % "2.4.0-M2").
    exclude("com.typesafe.akka", "akka-slf4j_2.10").
    exclude("javax.transaction", "jta").
    exclude("commons-logging", "commons-logging").
    exclude("org.slf4j", "jcl-over-slf4j").
Spark Standalone FTW
•Great for small bites of data (tens of GB)
•Few dependency issues
•Horizontal scale comes easy/free
•Hadoop file input adapters are excellent
•Works great for tests, too!
A Proposed Product Architecture
•Processor service spits truth into S3 bucket
•S3 Event Notifications are posted to an SQS queue
•A pool of post-processing servers, each running Spark standalone, drains the queue; scale up on queue depth
•Each S3 file generates ~25K updates to be applied
•Actor pool of HTTP clients applies updates to key-value web service
•Akka ask pattern provides per-server synchronization
•Play framework as app container
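The update-application stage might be sketched like this; the real system uses an Akka actor pool of HTTP clients with the ask pattern, so plain Futures on a bounded pool stand in for it here, and the `Update` shape and `put` callback are hypothetical.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Sketch of applying a file's worth of updates to the key-value web
// service. Futures on a fixed-size pool stand in for the Akka actor
// pool of HTTP clients described above.
object UpdateApplier {
  case class Update(userId: String, attrs: Map[String, String])

  // Fixed-size pool, mirroring a fixed-size actor pool. Daemon threads
  // so the JVM can exit without an explicit shutdown.
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(
    Executors.newFixedThreadPool(8, (r: Runnable) => {
      val t = new Thread(r); t.setDaemon(true); t
    }))

  // Apply every update via `put` (standing in for one HTTP call each)
  // and return the number that succeeded.
  def applyAll(updates: Seq[Update], put: Update => Boolean): Int = {
    val done = Future.traverse(updates)(u => Future(put(u)))
    Await.result(done, 1.minute).count(identity)
  }
}
```

Blocking on the aggregate result is the rough analogue of the ask pattern's per-server synchronization: the server doesn't take the next batch of files until this one is fully applied.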
How’s it work?
•Great, that’s how
•Each server ingests 5-10 files at a time (~3-5GB)
•Pre-tuning: T_spark=1m20s, 25K updates per minute on a c3.8xlarge EC2 instance (60GB RAM, 32 cores, 10GB network)
•Post-tuning: T_spark=1m20s, 75K updates per minute on a c3.2xlarge EC2 instance (15GB RAM, 8 cores, 1GB network)
•That’s a 4x speedup on Spark and 12x on HTTP, per unit of hardware: the same Spark time and 3x the update throughput on an instance a quarter of the size
Tuning
• Ingesting files in batches of 5 or more
•Replacing default Play logger with AsyncLogger
•HTTP tuning
 •Gave up play-ws for Apache HttpClient
 •Single-threaded client per actor
 •Keepalive
 •Preëmptive HTTP Basic Auth
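Preëmptive Basic auth just means computing the Authorization header up front and attaching it to every request, rather than waiting for a 401 challenge and retrying. A minimal sketch of building that header (the object name is ours):

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.util.Base64

// Preëmptive HTTP Basic auth: build the Authorization header once and
// send it on every request, skipping the 401 challenge round trip.
object BasicAuth {
  def header(user: String, pass: String): (String, String) = {
    val token = Base64.getEncoder.encodeToString(s"$user:$pass".getBytes(UTF_8))
    "Authorization" -> s"Basic $token"
  }
}
```

At tens of thousands of requests per minute, eliminating one round trip per request adds up quickly.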
Gotchas and Issues
•Mostly transient Spark issues
 •Network timeouts
 •File not found
 •Too many open files
•Solution:
 •Akka SupervisorStrategy + actor restart
 •SQS message timeout ensures data will be processed
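The SQS safety net works because a message only disappears when explicitly deleted after successful processing; if a worker dies mid-file, the visibility timeout expires and the message becomes receivable again. A toy simulation of those semantics (not the AWS SDK):

```scala
import scala.collection.mutable

// Toy simulation of SQS visibility-timeout semantics (not the real SQS
// API): a received message is hidden until its timeout passes; only an
// explicit delete after successful processing removes it for good.
class ToyQueue(visibilityTimeout: Long) {
  private case class Msg(body: String, var invisibleUntil: Long = 0L)
  private val msgs = mutable.ListBuffer.empty[Msg]

  def send(body: String): Unit = msgs += Msg(body)

  // Hand out the first visible message and hide it for the timeout.
  def receive(now: Long): Option[String] =
    msgs.find(_.invisibleUntil <= now).map { m =>
      m.invisibleUntil = now + visibilityTimeout
      m.body
    }

  // Called only after the file has been processed successfully.
  def delete(body: String): Unit =
    msgs.find(_.body == body).foreach(msgs -= _)
}
```

So a crashed or restarted actor costs latency, not data: the un-deleted message resurfaces and another server picks the file up.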
Future Work
•Establish long-lived Spark Streaming cluster(s) for batch layer
•Create more standalone Spark services
•Disassemble the Processor monolith brick by brick
•???
We’re Hiring!
If this kind of thing sounds like fun, let’s talk! Email me at [email protected] or visit www.localytics.com/jobs/.