Page 1
© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Scaling Your Application for Growth Using Automation
November 14, 2013
Ken Leung - Euclid Analytics
Greg Narain - Chute
Page 3
Online Analytics for the Offline World
E-Commerce Physical Stores
Page 4
How Euclid Works
We use Wi-Fi technology to turn in-store behavior into actionable insights
1. Shopper carrying smartphone walks by or into store
2. Wi-Fi AP detects smartphone MAC addresses (XX:XX:XX:XX:XX:XX)
3. Euclid analyzes data for trends and insights
4. Insights on customer acquisition, engagement and retention
Page 5
Market Leader in Real World Analytics
• First to develop proprietary Wi-Fi based analytics
– Most advanced data analytics capabilities and experience in retail environments
– Backed by tier 1 investors: Series A led by NEA, Series B led by Benchmark Capital
• World-class executive team
– Co-founder of Google Analytics, founding team of ShopperTrak
– Executive experience from Google, SAP, Ariba and Tibco
• Experience with the world’s leading retailers
– Specialty retail, QSR, department store, big box, automotive, malls and more
• Largest data scale and rapidly accelerating adoption
– Recording >5B events per day
– Dataset with >100M unique devices (shoppers)
• Market leadership recognized by:
– Gartner Cool Vendor 2012; Idea Innovation Award Winner: Business Technology 2012
Page 6
Euclid is a Data Company
Acquire Data
• Reliable
• Durable
• Scalable
Process Data
• Efficient
• Flexible
• Scalable
• Versatile
Deliver Data
• Richness
• Sophistication
• Value
As of October 2013, the Euclid Network:
• Covers over 600 shopping centers, malls, and street locations
• Processes 50 TB of raw data
• Collects over 30 GB of raw data daily
Page 7
Euclid’s Challenges
Common Challenges
• Scaling
• Performance
• Cost effectiveness
• Removing the technical barriers to innovation
• “Failing fast”
Unique Challenges
• Recomputing the entire history of Euclid data!
– Need fast results
– Need a lot of computational power, sometimes more than 100x regular daily compute needs
Page 8
Euclid’s Use of AWS
Euclid started with AWS from Day One
- Amazon EC2, Amazon RDS, Amazon EMR, Amazon S3
- AWS Elastic Beanstalk
- Amazon Redshift
Heroku from Amazon Partner Network (APN)
Page 10
Data Acquisition
Elastic Beanstalk
- Multi-AZ, multi-region
- Load balancing, auto scaling
- Monitoring, notification
- Deployment Management
- Amazon EBS-backed volume for failover data recovery
- Log rotation to Amazon S3 (99.999999999% durability)
All built-in.
Page 11
Data Acquisition - code
<%@ page import="java.io.*,java.util.*,com.euclid.spongebob.server.*" %><%
    // Per-sensor credentials, stored in the servlet context under "sensor_credentials".
    Properties sensorCredentials = (Properties) this.getServletContext().getAttribute("sensor_credentials");
    String sensor_id = request.getParameter("sensor_id");
    String credential = request.getParameter("credential");
    String body = request.getParameter("body");

    // Reject unknown sensors and mismatched credentials before touching the payload.
    if (sensor_id == null || !sensorCredentials.containsKey(sensor_id) ||
            !sensorCredentials.getProperty(sensor_id).equals(credential)) {
        response.sendError(HttpServletResponse.SC_UNAUTHORIZED);
        return;
    }

    // The only "processing" is appending the raw body to the log.
    java.util.logging.Logger logger = java.util.logging.Logger.getLogger("spongebob");
    logger.log(java.util.logging.Level.INFO, body);
    response.setStatus(HttpServletResponse.SC_OK);
%>
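The credential check in the scriptlet above can also be written as a plain function, which makes it easy to unit test outside a servlet container. The sketch below is illustrative only: the class name `SensorAuth` and method `isAuthorized` are hypothetical and do not come from the Euclid code base, and the sample credentials are made up.

```java
import java.util.Properties;

/** Hypothetical helper mirroring the JSP's sensor credential check. */
final class SensorAuth {
    private SensorAuth() {}

    /**
     * Returns true only when sensorId is known and the supplied credential
     * equals the stored one. Null inputs are rejected instead of throwing.
     */
    static boolean isAuthorized(Properties sensorCredentials, String sensorId, String credential) {
        if (sensorCredentials == null || sensorId == null || credential == null) {
            return false;
        }
        String expected = sensorCredentials.getProperty(sensorId);
        return expected != null && expected.equals(credential);
    }

    public static void main(String[] args) {
        Properties creds = new Properties();
        creds.setProperty("sensor-1", "s3cret"); // made-up sample data
        System.out.println(isAuthorized(creds, "sensor-1", "s3cret"));
        System.out.println(isAuthorized(creds, "sensor-1", "wrong"));
        System.out.println(isAuthorized(creds, "sensor-2", "s3cret"));
    }
}
```

Unlike the scriptlet, this version also tolerates a missing credentials object, which is an added assumption rather than behavior shown on the slide.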
Page 12
Data Acquisition - Principles
• Log to Amazon EBS volume: high I/O performance
• Keep the collector as “dumb” as possible, so it stays reliable
• Fork data from disk to:
– Amazon S3 for batch processing
– Kafka messaging service for real-time processing
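The "fork" principle can be sketched as a fan-out: every logged line goes to a batch sink and a real-time sink, and one sink failing does not block the other. This is a minimal illustration, not Euclid's implementation. `LogForker` is a hypothetical name, and the in-memory lists merely stand in for an Amazon S3 uploader and a Kafka producer; no AWS or Kafka client API is used here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Sketch: fan each logged line out to several independent sinks. */
final class LogForker {
    private final List<Consumer<String>> sinks = new ArrayList<>();

    LogForker addSink(Consumer<String> sink) {
        sinks.add(sink);
        return this;
    }

    /**
     * Delivers one line to every sink. A sink that throws is reported and
     * skipped so the remaining sinks still receive the line.
     * Returns the number of successful deliveries.
     */
    int fork(String line) {
        int delivered = 0;
        for (Consumer<String> sink : sinks) {
            try {
                sink.accept(line);
                delivered++;
            } catch (RuntimeException e) {
                System.err.println("sink failed: " + e.getMessage());
            }
        }
        return delivered;
    }

    public static void main(String[] args) {
        List<String> batch = new ArrayList<>();    // stand-in for an Amazon S3 uploader
        List<String> realTime = new ArrayList<>(); // stand-in for a Kafka producer
        LogForker forker = new LogForker().addSink(batch::add).addSink(realTime::add);
        forker.fork("sensor-1 XX:XX:XX:XX:XX:XX");
        System.out.println(batch.size() + " " + realTime.size());
    }
}
```

Because the data is already on disk before it is forked, a failed delivery in a design like this can be retried later from the log file rather than being lost.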
Page 13
Data Acquisition – System Monitor
• Low latency
• Low CPU utilization
Page 14
Data Processing - Pipeline
[Diagram labels: Raw Data; MapReduce (EMR); R&D; Analytics; Product dashboard, insights]
Page 15
Pipeline – Dual Purposes
Two worlds, one platform
• Big Data Engineering – NoSQL
– Pig Latin with Amazon EMR (Java, Python UDFs)
– Workflows (Jenkins), shell scripting
• Analytics, Analysts, Business – SQL
– Excel
– Tableau
– Maybe some Python, etc.
Page 16
Pipeline - Architecture
[Architecture diagram, two sides: MapReduce on Amazon S3, and SQL DB (MySQL, Redshift)]
[Each side holds: Raw Data (only some raw data on the SQL side), Aggr. Level 1 ... Aggr. Level n, Meta Data, 3rd Party Data]
[Other labels: DB Load; Models, Algorithms; R&D Models, Algorithms; Analytics Direct; MySQL feeding Product dashboard, insights]
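The idea of stacked aggregation levels, where each level reads only the output of the level below it, can be sketched in a few lines. This is an assumption-laden illustration: the `Event` shape, the hourly and per-sensor rollups, and the `sensorId@hour` key format are all hypothetical, and the real pipeline runs as Pig Latin on Amazon EMR rather than in-process Java.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch of two aggregation levels over raw sensor events. */
final class AggregationLevels {
    /** Hypothetical raw event: which sensor saw which device, and when. */
    static final class Event {
        final String sensorId;
        final String device;
        final long epochSeconds;

        Event(String sensorId, String device, long epochSeconds) {
            this.sensorId = sensorId;
            this.device = device;
            this.epochSeconds = epochSeconds;
        }
    }

    /** Level 1: count raw events per "sensorId@hourIndex". */
    static Map<String, Long> level1(List<Event> raw) {
        Map<String, Long> out = new TreeMap<>();
        for (Event e : raw) {
            long hour = e.epochSeconds / 3600;
            out.merge(e.sensorId + "@" + hour, 1L, Long::sum);
        }
        return out;
    }

    /** Level 2: roll hourly counts up to one total per sensor, using only level-1 output. */
    static Map<String, Long> level2(Map<String, Long> hourly) {
        Map<String, Long> out = new TreeMap<>();
        for (Map.Entry<String, Long> e : hourly.entrySet()) {
            String key = e.getKey();
            String sensorId = key.substring(0, key.lastIndexOf('@'));
            out.merge(sensorId, e.getValue(), Long::sum);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Event> raw = Arrays.asList(
            new Event("s1", "d1", 0L),
            new Event("s1", "d2", 10L),
            new Event("s1", "d1", 3600L),
            new Event("s2", "d3", 7200L));
        Map<String, Long> hourly = level1(raw);
        System.out.println(hourly);
        System.out.println(level2(hourly));
    }
}
```

Event counts are used here because they add up cleanly across levels. Distinct-device counts do not, which is one reason a full recomputation from raw data can be needed when a definition changes.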
Page 17
SQL: MySQL and Amazon Redshift, both on AWS
• Started with MySQL; Amazon Redshift preview from Jan 2013
• MySQL 1 TB limit vs. Amazon Redshift PB scale
• Performance: night and day
– E.g., count distinct over 100M rows: 5 h in MySQL, 2 min in Amazon Redshift
• Amazon Redshift: killer data warehouse
– Low cost
– No DBA!
– Easy integration
Page 18
Pipeline - Monitoring
• System monitoring provided by AWS
• Workflow monitoring with Jenkins
– Failure notification
– Dependency management
• Data quality (including acquisition) monitoring
– Also utilize Jenkins
– Scripts that check data at various stages
– Each script as a job in the Jenkins workflow
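A data-quality script run as a Jenkins job can be as simple as a program that exits non-zero when a check fails, since Jenkins marks a build step as failed on a non-zero exit code. The sketch below assumes two hypothetical checks (a minimum row count and a cap on the fraction of rows with a null key); the class name `StageCheck`, the thresholds, and the sample numbers are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of one data-quality check meant to run as a Jenkins job. */
final class StageCheck {
    private StageCheck() {}

    /** Returns human-readable problems; an empty list means the stage looks healthy. */
    static List<String> problems(long rowCount, long minRows, long nullKeys, double maxNullFraction) {
        List<String> out = new ArrayList<>();
        if (rowCount < minRows) {
            out.add("row count " + rowCount + " is below minimum " + minRows);
        }
        if (rowCount > 0 && (double) nullKeys / rowCount > maxNullFraction) {
            out.add("null-key fraction exceeds " + maxNullFraction);
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up numbers; a real job would read them from the stage's output.
        List<String> found = problems(1_000_000L, 500_000L, 120L, 0.01);
        for (String p : found) {
            System.err.println("DATA QUALITY: " + p);
        }
        if (!found.isEmpty()) {
            System.exit(1); // non-zero exit fails the Jenkins build step
        }
        System.out.println("stage OK");
    }
}
```

Chaining such jobs in the Jenkins dependency graph means a failed check stops downstream stages and triggers the same failure notification as any other broken job.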
Page 19
Pipeline - Workflow
Part of the Jenkins Dependency Graph
Page 20
AWS Benefits
• “Apps not Ops” – Euclid does not have/need an Ops team
• Scale up and down on demand
• Pay as we go
• Agile (innovations, time-to-market)
Page 21
Chute
1. Data
2. Automation
3. Uptime
4. Monitoring
Page 22
Data
● Real-time analytics is hard
● Hadoop!
○ Sqoop imports SQL data to HDFS
○ Clojure
○ Scalding (github.com/twitter/scalding)
● Elasticsearch, Logstash
○ parse logs to track activity for customers
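Parsing logs to track per-customer activity boils down to extracting a customer identifier from each line and counting. The sketch below shows that step in plain Java. It is not Logstash or Elasticsearch code, and the log shape (space-separated `key=value` pairs including `customer=<id>`) is an assumption made for the example, not Chute's actual format.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: count log lines per customer from key=value style logs. */
final class CustomerActivity {
    // Assumed log shape: space-separated tokens, one of which is customer=<id>.
    private static final Pattern CUSTOMER = Pattern.compile("(?:^|\\s)customer=(\\S+)");

    private CustomerActivity() {}

    /** Lines without a customer field are skipped rather than treated as errors. */
    static Map<String, Integer> countByCustomer(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = CUSTOMER.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "GET /a customer=acme status=200",
            "customer=acme GET /b status=200",
            "GET /c status=500",
            "GET /d customer=zen status=200");
        System.out.println(countByCustomer(lines));
    }
}
```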
Page 23
[Diagram labels: Sharded Postgres; Sqoop server; Hadoop cluster or EMR; S3; HDFS]
Page 24
[Diagram labels: ELB; N EC2 instances (varnish, logstash); Events Server (nginx, logstash); Redis cluster; ElasticSearch; Kibana; plugin front ends; API]
Page 25
Automation through DevOps
● Chute has 100 servers
○ Configured many manually
○ 82? of 100 now managed by Chef
● Whirr
● Sqoop and Cron to automate data import
● Route 53 with Chef for URLs
Page 26
Uptime
● Architect applications to scale horizontally
○ AWS launches servers on demand
○ Spot and Reserved Instance pricing
● Keep services running with Chef
○ Chef makes it easy to wrap programs as a service on AWS
Page 27
Monitoring
● New Relic
○ server resource monitoring
○ application monitoring
● logstash + kibana
○ elasticsearch backend
○ redis (cluster)
○ can monitor server logs
Page 28
CPN209