Benefits of Big Data: Handling Operations at Scale
Where it all began in 2000
More than 190 million reviews and opinions
from travellers around the world
250,000+ attractions
680,000+ hotels
1,000,000+ restaurants
Operates in 45 countries
315 million unique visitors per month
Where we are now: World’s largest travel site
[World map of visitor share by region: North America 22%, Europe 44%, Middle East & Africa 4%, LATAM 8%, Asia Pacific 22%]
Source: comScore Media Metrix for TripAdvisor Sites, worldwide, August 2014
With nearly 280 million unique monthly visitors
Traffic and Infrastructure
o From 500K to 1.5 million hits per minute
o > 1000 Production Servers (Real, not Virtual)
o Split across multiple data centres in the US
o >2 TB of Compressed Log data per day
Managed by a team of just 12 engineers
So where does Big Data fit In?
Big Data @ TripAdvisor
o 10 TB of Postgres site data
o 2.5 PB of data in Hadoop
o ~160 large Hadoop nodes
o ~280 TB of logs (last 7 months) on site
o Redshift and Tableau for ad hoc exploration of data
o SSAS data cubes for static models
So what about Operations?
Challenges in Operations
o Traditional ops tools don't scale well
o A human can't review 30K graphs (30 metrics x 1,000 servers)
o Aggregate chart data is a hack and not very helpful
o Tools only record and present the data; a human still has to interpret it
Example: Release Day
Start Release → Monitor 30K Cacti Charts → Pray

Imagine 9+ releases a week with Cacti and Nagios…
The tools are not designed for this!
So what’s the Solution?
Solution: Better Analytics
o Change to more flexible technology
o Measure everything! (~700K metrics per second)
o Interpret the data within 10 to 15 minutes
o Tune! Remove as many false positives as possible
o Alert – page someone on unexpected changes
If it's worth doing, it's worth measuring
What we built
Web Servers → Log Central Servers → File Servers
Why Postgres?
o Analysis: 2–3 TB of metrics data
o Anomaly detection: 90+ days of aggregate data
o Holding ~2 TB of data
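One way 90+ days of history can stay within ~2 TB is by rolling raw samples up into coarse aggregates before they land in Postgres. The sketch below is an assumption about how such a rollup might look (the function name and the per-day granularity are illustrative, not the actual schema):

```python
from collections import defaultdict
from datetime import datetime

def aggregate_daily(samples):
    """Roll raw timestamped samples up into per-day min/max/avg buckets,
    so months of history stay compact enough for Postgres.
    (Daily granularity is an illustrative assumption.)"""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts.date()].append(value)
    return {
        day: {"min": min(vals), "max": max(vals), "avg": sum(vals) / len(vals)}
        for day, vals in buckets.items()
    }

# Three raw samples across two days
samples = [
    (datetime(2014, 8, 1, 0, 0), 10.0),
    (datetime(2014, 8, 1, 12, 0), 30.0),
    (datetime(2014, 8, 2, 6, 0), 20.0),
]
daily = aggregate_daily(samples)
print(daily[datetime(2014, 8, 1).date()])  # {'min': 10.0, 'max': 30.0, 'avg': 20.0}
```

Storing one small row per day instead of raw per-second samples is what makes a 90-day lookback cheap to query.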
Anomaly Detection
Capture what is happening now
Compare it against data from 1 day ago
Compare it against what other pools are doing
Look at historical variance over 9 days
Track statistically significant changes
Roll out, then test for anomalies
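Tracking a statistically significant change against the 9-day history can be as simple as a z-score test. The following is a minimal sketch under that assumption; the threshold, function name, and sample data are illustrative, not the production detector:

```python
import statistics

def is_anomalous(current, history, z_threshold=3.0):
    """Flag a metric sample whose deviation from its recent history is
    statistically significant (simple z-score test against, e.g.,
    the same hour on each of the last 9 days)."""
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return current != mean
    return abs(current - mean) / stdev > z_threshold

# Nine days of request counts for one pool at the same hour each day
history = [1020, 980, 1005, 995, 1010, 990, 1000, 1015, 985]
print(is_anomalous(1008, history))  # typical value -> False
print(is_anomalous(1600, history))  # sudden spike -> True
```

Comparing against the same hour a day ago, and against sibling pools, uses the same test with a different choice of `history`.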
Release Day v2
Deploy new code to two pools → Monitor key metrics → Rollback (if needed)
No praying needed!
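The deploy-monitor-rollback cycle above can be sketched as a small control loop. Everything here is hypothetical scaffolding: the `deploy`/`metrics_ok`/`rollback` callables and the canary pool names stand in for whatever the real release tooling does.

```python
def canary_release(deploy, metrics_ok, rollback, checks=15):
    """Deploy to two canary pools, watch key metrics over repeated
    checks, and roll back as soon as an anomaly appears.
    (Callables and pool names are illustrative assumptions.)"""
    deploy(["canary-a", "canary-b"])
    for _ in range(checks):
        if not metrics_ok():
            rollback()
            return "rolled back"
    return "promoted"

# Simulated run: metrics degrade on the third check
results = iter([True, True, False] + [True] * 12)
log = []
outcome = canary_release(
    deploy=lambda pools: log.append(("deploy", tuple(pools))),
    metrics_ok=lambda: next(results),
    rollback=lambda: log.append(("rollback", ())),
)
print(outcome)  # -> rolled back
```

Because anomalies surface within minutes rather than hours, the loop can halt a bad release before most users ever see it.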
Results
Roll back 1 release per week
15 minutes vs 3 hours
Dramatic reduction in user-facing issues
Increasing number of releases
Strict process on rolling back
Why no Hadoop?
What’s Next?
Reduce false negatives
Alert on performance or behaviour changes
Visualise data through dashboards
Expand the set of metrics we're alerting on
Correlate system and application metrics