Benefits of Big Data: Handling Operations at Scale
Transcript
Page 1: BigDataInOperationsV8

Benefits of Big Data: Handling Operations at Scale

Page 2: BigDataInOperationsV8

Where it all began in 2000

Page 3: BigDataInOperationsV8

More than 190 million reviews and opinions from travellers around the world

o 250,000+ attractions

o 680,000+ hotels

o 1,000,000+ restaurants

o Operates in 45 countries

o 315 million unique visitors per month

Page 4: BigDataInOperationsV8

Where we are now: World’s largest travel site

[Map: share of unique visitors by region]

o North America: 22%

o Europe: 44%

o Middle East & Africa: 4%

o LATAM: 8%

o Asia Pacific: 22%

Source: comScore Media Metrix for TripAdvisor Sites, worldwide, August 2014

With nearly 280 million unique monthly visitors

Page 5: BigDataInOperationsV8

Traffic and Infrastructure

o From 500K to 1.5 million hits per minute

o > 1000 Production Servers (Real, not Virtual)

o Split across multiple data centres in the US

o >2 TB of Compressed Log data per day

Managed by a team of just 12 engineers

Page 6: BigDataInOperationsV8

So where does Big Data fit In?

Page 7: BigDataInOperationsV8

Big Data @ TripAdvisor

o 10 TB of Postgres Site Data

o 2.5 PB of Data in Hadoop

o ~160 Large Hadoop Nodes

o ~280 TB of Logs (last 7 months) on site

o Redshift and Tableau for ad hoc exploration of data

o SSAS Data Cubes for static models

Page 8: BigDataInOperationsV8

So what about Operations?

Page 9: BigDataInOperationsV8

Challenges in Operations

o Traditional Ops tools don’t scale well

o A human can’t review 30K graphs (30 metrics x 1,000 servers)

o Aggregate chart data is a hack and not very helpful

o Tools only record and present the data; a human has to interpret it

Page 10: BigDataInOperationsV8

Example: Release Day

Start Release → Monitor 30K Cacti Charts → Pray

Imagine 9+ releases a week with Cacti and Nagios… The tools are not designed for this!

Page 11: BigDataInOperationsV8

So what’s the Solution?

Page 12: BigDataInOperationsV8

Solution: Better Analytics

o Change to more flexible technology

o Measure everything! (~700K measurements per second)

o Interpret the data within 10 to 15 minutes

o Tune! Remove as many false positives as possible

o Alert – page someone on unexpected changes

If it’s worth doing, it’s worth measuring.
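A stream of ~700K measurements per second can't be stored point by point, so "measure everything" implies rolling raw observations up before they land anywhere. A minimal sketch of that idea in Python; the class name and per-minute granularity are illustrative assumptions, not the deck's actual pipeline:

```python
from collections import defaultdict
import time

class MinuteAggregator:
    """Roll raw observations into per-minute buckets so a ~700K/sec
    stream becomes a few rows per metric per minute."""

    def __init__(self):
        # (metric, minute_epoch) -> [count, running_sum]
        self.buckets = defaultdict(lambda: [0, 0.0])

    def record(self, metric, value, now=None):
        """Fold one raw measurement into its minute bucket."""
        minute = int((now if now is not None else time.time()) // 60) * 60
        bucket = self.buckets[(metric, minute)]
        bucket[0] += 1
        bucket[1] += value

    def flush(self, emit):
        """Hand (metric, minute, count, mean) rows to a storage callback."""
        for (metric, minute), (count, total) in self.buckets.items():
            emit(metric, minute, count, total / count)
        self.buckets.clear()
```

Rollups like this are what make "interpret the data within 10 to 15 minutes" feasible: the alerting layer only ever scans a small number of aggregate rows per metric.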

Page 13: BigDataInOperationsV8

What we built

[Diagram: Web Servers → Log Central Servers → File Servers]
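The diagram shows a log-shipping pipeline: web servers stream their logs to central collectors, which persist them onto file servers. A hedged sketch of the first hop; the hostname, port, and function are hypothetical stand-ins, since the deck names no tooling:

```python
import socket
import time

# Hypothetical log-central endpoint; the real hosts/ports aren't in the deck.
LOG_CENTRAL = ("log-central.internal", 5140)

def tail_and_forward(path):
    """Follow a web server's log file and stream new lines to log central,
    which would then batch them onto the file servers."""
    with socket.create_connection(LOG_CENTRAL) as sock, open(path, "rb") as log:
        log.seek(0, 2)                  # start at the current end of file
        while True:
            line = log.readline()
            if not line:
                time.sleep(0.5)         # no new data yet; poll again shortly
                continue
            sock.sendall(line)
```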

Page 14: BigDataInOperationsV8

Why Postgres?

o Analysis: 2–3 TB of metrics data

o Anomaly Detection: 90+ days of aggregate data

o Holding: ~2 TB of data
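One way the slide's three roles could map onto Postgres tables: a raw landing area plus a long-lived aggregate table for anomaly detection and analysis. This schema is a guess for illustration, not the deck's actual design, and "dbname=ops" is a hypothetical connection string:

```python
import psycopg2  # assumes a reachable Postgres instance

# Guessed schema mapping the slide's roles onto two tables.
DDL = """
-- Holding: the ~2 TB landing area for raw samples.
CREATE TABLE IF NOT EXISTS metric_samples (
    server       text             NOT NULL,
    metric       text             NOT NULL,
    observed_at  timestamptz      NOT NULL,
    value        double precision NOT NULL
);

-- 90+ days of per-minute aggregates feeding anomaly detection and analysis.
CREATE TABLE IF NOT EXISTS metric_aggregates (
    pool         text             NOT NULL,
    metric       text             NOT NULL,
    minute       timestamptz      NOT NULL,
    avg_value    double precision,
    stddev_value double precision,
    PRIMARY KEY (pool, metric, minute)
);
"""

with psycopg2.connect("dbname=ops") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
```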

Page 15: BigDataInOperationsV8

Anomaly Detection

o Capture what is happening now

o Compare it against data from 1 day ago

o Compare it against what other pools are doing

o Look at historical variance over 9 days

o Track statistically significant changes
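A minimal sketch of these comparisons in Python. The function names, the z-score threshold, and the tolerance are illustrative assumptions; for simplicity the "1 day ago" comparison is folded into the 9-day history window rather than handled separately:

```python
import statistics

def is_anomalous(now, history_9d, z_threshold=3.0):
    """Flag a current value whose deviation from its own 9-day history
    is statistically significant (simple z-score test).

    now        -- the value captured right now
    history_9d -- values at this time of day over the last 9 days
    """
    mean = statistics.mean(history_9d)
    stdev = statistics.stdev(history_9d)
    if stdev == 0:
        return now != mean          # no historical variance: any change stands out
    return abs(now - mean) / stdev > z_threshold

def pool_deviates(pool_value, peer_pool_values, tolerance=0.25):
    """Compare one pool against what the other pools are doing."""
    peer_mean = statistics.mean(peer_pool_values)
    if peer_mean == 0:
        return pool_value != 0
    return abs(pool_value - peer_mean) / peer_mean > tolerance
```

Comparing against both the metric's own history and its peer pools is what keeps a global traffic swing (which moves every pool at once) from paging anyone.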

Page 16: BigDataInOperationsV8
Page 17: BigDataInOperationsV8

Release Day v2

Deploy new code to two pools → Monitor key metrics → Test for anomaly → Rollout or Rollback

No praying needed!
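The flow above is essentially a canary release loop. A hedged sketch of that control loop; the pool names, metric names, and the deploy/fetch/rollback/rollout hooks are hypothetical stand-ins for whatever tooling actually drives the process:

```python
import time

CANARY_POOLS = ["pool-a", "pool-b"]          # hypothetical pool names
KEY_METRICS = ["error_rate", "p95_latency"]  # illustrative metric names
SOAK_MINUTES = 15                            # matches the 10-15 minute window

def release(version, deploy, fetch, is_anomalous, rollback, rollout):
    """Deploy to two pools, watch key metrics, then roll out or roll back."""
    for pool in CANARY_POOLS:
        deploy(pool, version)
    deadline = time.time() + SOAK_MINUTES * 60
    while time.time() < deadline:
        for pool in CANARY_POOLS:
            for metric in KEY_METRICS:
                if is_anomalous(fetch(pool, metric)):
                    rollback(CANARY_POOLS, version)   # strict rollback process
                    return False
        time.sleep(60)               # re-check the key metrics once a minute
    rollout(version)                 # clean soak: ship to all pools
    return True
```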

Page 18: BigDataInOperationsV8

Results

o Roll back 1 release per week

o 15 minutes vs 3 hours

o Dramatic reduction in user-facing issues

o Increasing number of releases

o Strict process on rolling back

Page 19: BigDataInOperationsV8

Why no Hadoop?


Page 20: BigDataInOperationsV8

What’s Next?

o Reduce false negatives

o Alert on performance or behaviour changes

o Visualise data through dashboards

o Expand the set of metrics we’re alerting on

o Correlate system and application metrics
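For the last item, correlating a system metric with an application metric can be as simple as a correlation coefficient over aligned time series. A minimal sketch, with the series pairing (CPU vs latency) chosen purely as an example:

```python
def pearson(xs, ys):
    """Correlation between two equal-length metric series, e.g. host CPU
    (system) against request latency (application)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    if sx == 0 or sy == 0:
        return 0.0                  # a flat series carries no signal
    return cov / (sx * sy)
```

A coefficient near +1 or -1 suggests the system metric moves with the application metric, which is exactly the kind of lead an on-call engineer wants when an alert fires.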

Page 21: BigDataInOperationsV8