Case Study: Real-time Analytics With Druid
Salil Kalia, Tech Lead, TO THE NEW Digital
About Presenter
• Over 10 years in the software industry
• Working with TO THE NEW Digital since 2009
• Mainly using the Java/Groovy/Grails ecosystems for development
• Working in the digital marketing domain for the last few years
• Cassandra certified trainer
• Loves traveling and exploring new places
Agenda
Understanding the use-case
• Ad workflow
• Our use case
Experiments with technologies
• Redis
• Cassandra
Introduction to Druid
• Architecture
• Druid in production
• Demo
Understanding the use-case
Understanding The Ad Workflow
[Diagram: USER, PUBLISHER SERVER, AD EXCHANGE, AD AGENCY-1, AD AGENCY-2, AD AGENCY-3; flows labeled Web Page Request, Ad Request, Ad-Content]
Examples From Our Use Case
• How many times has a video been viewed?
• How many times has a video been viewed in a particular time-span?
• How many times has a video been viewed in a particular time-span at a particular site?
• How many times has a video been viewed in a particular time-span at a particular site in a particular country?
• How many times has a video been viewed in a particular time-span at a particular site in a particular country on a particular device?
Video Events For The Analysis
• LOAD
• START
• PLAYING
• VIEW
• STOP / PAUSE
• FINISH
Event Data (Sample)
TIMESTAMP             Ad   Site      Advertiser  Event   Action
2011-01-01T01:01:27Z  123  abc.com   Brand X     Player  Load
2011-01-01T01:01:33Z  234  abcd.com  Brand Y     Player  Load
2011-01-01T01:01:40Z  123  abc.com   Brand X     Player  Start
2011-01-01T01:01:45Z  123  abc.com   Brand X     Player  Playing
2011-01-01T01:01:50Z  123  abc.com   Brand Y     Player  Playing
2011-01-01T01:01:51Z  123  abc.com   Brand X     Player  Stop
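The questions from the use case can be tried against these sample rows. The following is a minimal Python sketch, assuming each event is a flat record. The field names and the `count_events` helper are hypothetical and are not taken from the production system. The sample has no VIEW rows and no country or device columns, so the example counts `Playing` actions instead.

```python
from datetime import datetime, timezone

# Column names are assumptions derived from the sample table's header.
FIELDS = ("timestamp", "ad", "site", "advertiser", "event", "action")

# The sample rows from the table above.
ROWS = [
    ("2011-01-01T01:01:27Z", "123", "abc.com", "Brand X", "Player", "Load"),
    ("2011-01-01T01:01:33Z", "234", "abcd.com", "Brand Y", "Player", "Load"),
    ("2011-01-01T01:01:40Z", "123", "abc.com", "Brand X", "Player", "Start"),
    ("2011-01-01T01:01:45Z", "123", "abc.com", "Brand X", "Player", "Playing"),
    ("2011-01-01T01:01:50Z", "123", "abc.com", "Brand Y", "Player", "Playing"),
    ("2011-01-01T01:01:51Z", "123", "abc.com", "Brand X", "Player", "Stop"),
]


def parse_ts(text):
    """Parse a UTC timestamp such as 2011-01-01T01:01:27Z."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


EVENTS = []
for row in ROWS:
    event = dict(zip(FIELDS, row))
    event["timestamp"] = parse_ts(event["timestamp"])
    EVENTS.append(event)


def count_events(events, start, end, **filters):
    """Count events with start <= timestamp < end whose fields match every filter."""
    return sum(
        1
        for e in events
        if start <= e["timestamp"] < end
        and all(e[name] == value for name, value in filters.items())
    )


start = parse_ts("2011-01-01T01:01:00Z")
end = parse_ts("2011-01-01T01:02:00Z")
playing = count_events(EVENTS, start, end, action="Playing")
playing_brand_x = count_events(EVENTS, start, end, action="Playing", advertiser="Brand X")
```

With the sample rows, `playing` is 2 and `playing_brand_x` is 1. Each extra question in the list above is one more keyword filter.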
What Is Analytics?
Processing the HISTORICAL data to:
•Understand potential trends
•Analyze the effects of certain decisions or events
•Evaluate the performance of a system
•Make better business decisions
What Is Real-time Analytics?
Why (We Need) Real-time Analytics?
• Understand the real-time performance
• Control the velocity
• Avoid over serving
• Avoid under serving
• Control the targeting
Recap – Things We Understood
• How the ad-tech works (in general)
• Our use-case
• Different video player events
• We are expecting a huge amount of data coming at a very high velocity.
Experiments with technologies
Why We Picked Redis
• Great buzz in the market
• Highly scalable
• Easy to set up, configure and use
• We were not yet clear about our use-case
Realizations From Redis
• Not a good fit to deal with time-series (big) data
• Persistence is another issue – we can’t afford to lose data
• There was a huge variety of keys all over the place
• Complexity of the application-side code kept increasing
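One way such a variety of keys can arise is sketched below in Python. It assumes a pre-aggregated counter is kept for every combination of dimensions and every time bucket. The key layout and the `counter_keys` helper are hypothetical; the key scheme actually used with Redis is not described in these slides.

```python
from itertools import combinations


def counter_keys(event, dimensions, bucket):
    """Return every counter key a single event would have to increment."""
    keys = []
    for size in range(len(dimensions) + 1):
        for combo in combinations(dimensions, size):
            parts = [f"{dim}={event[dim]}" for dim in combo]
            keys.append(":".join(["views", bucket] + parts))
    return keys


# Hypothetical event with three dimensions and an hourly time bucket.
event = {"site": "abc.com", "country": "IN", "device": "mobile"}
keys = counter_keys(event, ["site", "country", "device"], "2011-01-01T01")
```

Three dimensions already mean eight keys per event per time bucket, and each additional dimension doubles that. The bookkeeping for all of those keys lives in application code.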
Working With Cassandra
• Very good support for the time-series data
• Extremely good for writing the data at a very high speed
• Very easy to scale horizontally
• Supports aggregations through Counters
Writing into Cassandra
[Diagram: AD PLAYER, ANALYTICS SERVER, CASSANDRA]
Reading from Cassandra
[Diagram: CAMPAIGN MANAGER, ANALYTICS SERVER, CASSANDRA]
What didn’t work with Cassandra
• Inconsistent results
• Unreliable counters
• No support for ad-hoc queries
• Nodes were crashing very frequently
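One general reason counters can drift, which may or may not be what happened in this system, is that a counter increment is not idempotent. A client that retries after a timed-out write can count the same event twice. The sketch below simulates this in plain Python. `FlakyCounterStore` is a hypothetical stand-in, not a Cassandra client.

```python
class FlakyCounterStore:
    """In-memory stand-in for a counter store whose acknowledgements can be lost."""

    def __init__(self, lose_ack_on_calls):
        self.value = 0
        self.calls = 0
        self.lose_ack_on_calls = set(lose_ack_on_calls)

    def increment(self):
        self.calls += 1
        self.value += 1  # the write is applied on the server side...
        if self.calls in self.lose_ack_on_calls:
            # ...but the client never receives the acknowledgement.
            raise TimeoutError("acknowledgement lost")


def record_event(store, max_attempts=2):
    """Record one event, retrying when the write appears to have timed out."""
    for _ in range(max_attempts):
        try:
            store.increment()
            return
        except TimeoutError:
            continue  # the retry re-applies an increment that already succeeded


store = FlakyCounterStore(lose_ack_on_calls=[2])
for _ in range(3):
    record_event(store)
```

Three events were recorded, yet `store.value` ends at 4, because the second increment was applied once before the lost acknowledgement and once more on retry.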
Crossroads – What Next?
• Third-party tools on top of Cassandra for better consistency
• DataStax Enterprise edition
• Taking a deeper dive into Cassandra to reconfigure the whole architecture and setup
• Switching to a different technology
Understanding Druid
About Druid (http://druid.io)
• An open-source analytics data store
• Supports streaming data ingestion
• Flexible filters for ad-hoc queries
• Fast aggregations – sub-second queries
• Distributed, shared-nothing architecture
• Easily scalable
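To illustrate the filters and aggregations listed above, the use-case question (views in a time-span, at a site, in a country, on a device) could be expressed as a Druid native timeseries query. The Python sketch below only builds the JSON body. The data source name `video_events`, the dimension names and the `count` metric are assumptions, not the production schema.

```python
import json


def views_query(interval, **dimension_filters):
    """Build a Druid native timeseries query that sums VIEW events.

    Assumes a hypothetical data source "video_events" with an "action"
    dimension and an ingestion-time metric named "count".
    """
    fields = [{"type": "selector", "dimension": "action", "value": "VIEW"}]
    fields += [
        {"type": "selector", "dimension": dim, "value": value}
        for dim, value in sorted(dimension_filters.items())
    ]
    return {
        "queryType": "timeseries",
        "dataSource": "video_events",
        "granularity": "all",
        "intervals": [interval],
        "filter": {"type": "and", "fields": fields},
        "aggregations": [
            {"type": "longSum", "name": "views", "fieldName": "count"}
        ],
    }


query = views_query(
    "2011-01-01T00:00:00Z/2011-01-02T00:00:00Z",
    site="abc.com",
    country="IN",
    device="mobile",
)
# This body would be POSTed as JSON to a broker node's /druid/v2/ endpoint.
body = json.dumps(query, indent=2)
```

Dropping or adding a keyword argument changes the filter without any change to how the data is stored, which is the ad-hoc flexibility the counter-based approaches lacked.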
Setting Up Druid In Production
[Diagram: AD PLAYER, ANALYTICS SERVER, KAFKA (CLUSTER), DRUID CLUSTER, CASSANDRA]
Druid’s Reliability Check
[Diagram: AD PLAYER, ANALYTICS SERVER, KAFKA (CLUSTER), DRUID CLUSTER, RAW FILE CONSUMER, RAW FILES, Job To Test Druid’s Integrity]
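The integrity job in the diagram can be pictured as a reconciliation between the raw files and the Druid cluster. The Python sketch below is hedged: it assumes the raw files hold one JSON event per line with a `timestamp` field, and that hourly counts have already been fetched from Druid. The function names and the file layout are hypothetical.

```python
import json
from collections import Counter


def hourly_counts(raw_lines):
    """Count raw events per hour, keyed by the "YYYY-MM-DDTHH" timestamp prefix."""
    counts = Counter()
    for line in raw_lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        counts[event["timestamp"][:13]] += 1
    return counts


def find_mismatches(raw_counts, druid_counts):
    """Return {hour: (raw, druid)} for every hour where the two counts differ."""
    mismatches = {}
    for hour in sorted(set(raw_counts) | set(druid_counts)):
        raw = raw_counts.get(hour, 0)
        druid = druid_counts.get(hour, 0)
        if raw != druid:
            mismatches[hour] = (raw, druid)
    return mismatches


# Hypothetical inputs: three raw events, and counts as reported by Druid.
raw_lines = [
    '{"timestamp": "2011-01-01T01:01:27Z", "action": "Load"}',
    '{"timestamp": "2011-01-01T01:01:33Z", "action": "Load"}',
    '{"timestamp": "2011-01-01T02:00:05Z", "action": "Start"}',
]
druid_counts = {"2011-01-01T01": 2}
mismatches = find_mismatches(hourly_counts(raw_lines), druid_counts)
```

Here the job would flag the 02:00 hour, where the raw files hold one event that Druid does not report.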
A Quick Demo
Druid Architecture
[Diagram legend: Druid Nodes, External Dependencies, Queries, MetaData, Data/Segments]
[Diagram components: Client Queries, Streaming Data, REAL-TIME NODES, COORDINATOR NODES, HISTORICAL NODES, BROKER NODES, DEEP STORAGE, ZOOKEEPER, MySQL]
Druid Data Ingestion
[Diagram: same legend and components as in the Druid Architecture diagram]
Druid Data Ingestion (Our System)
[Diagram: AD PLAYER, ANALYTICS SERVER, KAFKA (CLUSTER), DRUID Real-time Node]
Druid Data Retrieval
[Diagram: same legend and components as in the Druid Architecture diagram]
Coordinator Nodes
[Diagram: same legend and components as in the Druid Architecture diagram]
Druid Data Segment Propagation
[Diagram legend: Druid Nodes, External Dependencies, Queries, MetaData, Data/Segments]
[Diagram components: Streaming Data, REAL-TIME NODES, COORDINATOR NODES, HISTORICAL NODES, DEEP STORAGE, ZOOKEEPER, MySQL]
Our Production Stats
• Over 200 million events per day – ingested into Druid cluster
• 4 boxes with 8 cores, 64GB RAM, 1TB SSD
• 2 coordinator nodes (only one master)
• 2 real-time nodes
• 4 historical nodes (on each box)
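For a sense of scale, the daily volume can be converted to an average rate. This is simple arithmetic on the figure above. It assumes traffic is spread evenly over the day, which real ad traffic is not, so peak rates would be higher.

```python
EVENTS_PER_DAY = 200_000_000  # "over 200 million events per day"
SECONDS_PER_DAY = 24 * 60 * 60

# Average ingestion rate if events arrived uniformly through the day.
avg_events_per_second = EVENTS_PER_DAY / SECONDS_PER_DAY
print(f"{avg_events_per_second:.0f} events/second on average")
```

That works out to roughly 2,300 events per second as a lower bound on the sustained ingestion rate.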
Companies Using Druid
Questions?