Transcript

Benefits of Big Data: Handling Operations at Scale

Where it all began in 2000

More than 190 million reviews and opinions from travellers around the world

o 250,000+ attractions

o 680,000+ hotels

o 1,000,000+ restaurants

o Operates in 45 countries

o 315 million unique visitors per month

Where we are now: World’s largest travel site

With nearly 280 million unique monthly visitors, split by region: North America 22%, Europe 44%, Asia Pacific 22%, LATAM 8%, Middle East & Africa 4%

Source: comScore Media Metrix for TripAdvisor Sites, worldwide, August 2014

Traffic and Infrastructure

o From 500K to 1.5 million hits per minute

o > 1000 Production Servers (Real, not Virtual)

o Split across multiple data centres in the US

o >2 TB of Compressed Log data per day

Managed by a team of just 12 engineers

So where does Big Data fit In?

Big Data @ TripAdvisor

o 10 TB of Postgres Site Data

o 2.5 PB of Data in Hadoop

o ~160 Large Hadoop Nodes

o ~280 TB of Logs (last 7 months) on site

o Redshift and Tableau for ad hoc exploration of data

o SSAS Data Cubes for static models

So what about Operations?

Challenges in Operations

o Traditional Ops tools don’t scale well

o A human can't review 30K Graphs (30 metrics x 1000 Servers)

o Aggregate Chart data is a hack and not very helpful

o Tools record and present the data only, a human has to interpret it

Example: Release Day

Start release → Monitor 30K Cacti charts → Pray

Imagine 9+ releases a week with Cacti and Nagios… The tools are not designed for this!

So what’s the Solution?

Solution: Better Analytics

o Change to more flexible technology

o Measure everything! (~700K measurements per second; a collection sketch follows below)

o Interpret the data within 10 to 15 minutes

o Tune! Remove as many false positives as possible

o Alert – Page someone on unexpected changes

If it's worth doing, it's worth measuring
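Recording ~700K measurements per second only works if each measurement is cheap to take. The talk does not name a specific client library, so the Python sketch below shows just one common pattern, under that assumption: aggregate counters in-process and flush a periodic summary instead of shipping every raw event.

```python
import threading
import time
from collections import defaultdict

class MetricBuffer:
    """Aggregate counters in-process and flush a summary at a fixed interval.

    Keeping the per-event cost to a dict increment is what makes
    "measure everything" affordable at very high event rates.
    """

    def __init__(self):
        self._counts = defaultdict(int)
        self._lock = threading.Lock()

    def incr(self, name, value=1):
        # Called on every request/event; must stay cheap.
        with self._lock:
            self._counts[name] += value

    def flush(self):
        # Swap out the current counters and return the snapshot for shipping
        # to whatever central log/metric sink is in use.
        with self._lock:
            snapshot, self._counts = self._counts, defaultdict(int)
        return dict(snapshot)

if __name__ == "__main__":
    buf = MetricBuffer()
    for _ in range(1000):
        buf.incr("web.requests")
    buf.incr("web.errors", 3)
    time.sleep(0.1)
    print(buf.flush())  # {'web.requests': 1000, 'web.errors': 3}
```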

What we built

Web Servers → Log Central Servers → File Servers

Why Postgres?

o Analysis: 2–3 TB of metrics data

o Anomaly detection: 90+ days of aggregate data, holding ~2 TB
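Postgres plays two roles here: ad hoc analysis over a few TB of recent metrics, and a much smaller store of 90+ days of aggregates for anomaly detection. The talk shows no schemas, so the table and column names below (metrics_raw, metrics_daily) are hypothetical; the sketch only illustrates the kind of daily roll-up that keeps the aggregate store around ~2 TB.

```python
# Hypothetical schema: metrics_raw holds per-minute samples, metrics_daily
# holds one row per metric/pool/day for the anomaly detector to query.
import psycopg2

ROLLUP_SQL = """
INSERT INTO metrics_daily
       (metric_name, pool, day, avg_value, stddev_value, sample_count)
SELECT  metric_name,
        pool,
        date_trunc('day', recorded_at) AS day,
        avg(value),
        stddev_samp(value),
        count(*)
FROM    metrics_raw
WHERE   recorded_at >= date_trunc('day', now() - interval '1 day')
  AND   recorded_at <  date_trunc('day', now())
GROUP BY metric_name, pool, date_trunc('day', recorded_at);
"""

def rollup_yesterday(dsn):
    """Fold yesterday's raw samples into daily aggregate rows."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(ROLLUP_SQL)

if __name__ == "__main__":
    rollup_yesterday("dbname=metrics")  # placeholder connection string
```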

Anomaly Detection

Capture what is happening now

Compare it against data 1 day ago

Compare it against what other Pools are doing

Look at historical variance over 9 days

Track statistically significant changes
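A minimal sketch of the comparison described above, in Python: score a metric's current value against its history at the same time of day over the previous 9 days, and require that the deviation is new (yesterday's value was still normal). The threshold and function name are illustrative assumptions, not the actual implementation; comparing one pool against its sibling pools works the same way, with the sibling values as the baseline.

```python
from statistics import mean, stdev

def is_anomalous(current, same_time_yesterday, last_9_days, z_threshold=3.0):
    """Flag a value that is a statistically significant departure from the
    last 9 days of history *and* was not already deviating yesterday."""
    if len(last_9_days) < 2:
        return False                        # not enough history to judge
    baseline = mean(last_9_days)
    spread = stdev(last_9_days) or 1e-9     # guard against flat metrics
    z_now = abs(current - baseline) / spread
    z_yesterday = abs(same_time_yesterday - baseline) / spread
    # Significant only if today is an outlier and yesterday was not,
    # i.e. the change is new rather than an ongoing pattern.
    return z_now > z_threshold and z_yesterday <= z_threshold

# Requests/min for one pool, sampled at the same time of day for 9 days.
history = [980, 1010, 995, 1002, 990, 1005, 998, 1001, 993]
print(is_anomalous(current=1450, same_time_yesterday=997, last_9_days=history))  # True
```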

Release Day v2

Deploy new code to two pools → Monitor key metrics → Test for anomaly → Rollout (or Rollback)

No praying needed!
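The flow above maps to a small control loop. The sketch below is an assumption about shape only: deploy, collect_metrics, check_anomaly, rollout and rollback are placeholders passed in by the caller, standing in for whatever deployment and monitoring tooling is actually used.

```python
import time

CANARY_POOLS = ["pool-a", "pool-b"]   # the "two pools" from the slide
WATCH_MINUTES = 15                    # matches the 10-15 minute window above

def release(build, deploy, collect_metrics, check_anomaly, rollout, rollback):
    """Deploy to the canary pools, watch key metrics, then roll out or roll back."""
    for pool in CANARY_POOLS:
        deploy(build, pool)

    deadline = time.time() + WATCH_MINUTES * 60
    while time.time() < deadline:
        for pool in CANARY_POOLS:
            if check_anomaly(collect_metrics(pool)):
                rollback(build, CANARY_POOLS)   # bad release: stop here
                return False
        time.sleep(60)                          # re-check every minute

    rollout(build)                              # canaries look healthy
    return True
```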

Results

Roll back 1 release per week

15 minutes to catch and roll back a bad release, versus 3 hours before

Dramatic reduction in user-facing issues

Increasing number of releases

Strict process on rolling back

Why no Hadoop?


What’s Next?

o Reduce false negatives

o Alert on performance or behaviour changes

o Visualise data through dashboards

o Expand the set of metrics we're alerting on

o Correlate system and application metrics