Using data science to automate event correlation - June 2016 - Dan Turchin - BigPanda

Jan 22, 2018

Dan Turchin
Transcript
Page 1

REDUCING ALERT NOISE WITH DATA SCIENCE

PREPARED FOR THE FUTURE STATE OF OPERATIONS MEETUP

DAN TURCHIN | @DTURCHIN | BIGPANDA | JUNE 2016

Page 2

OBJECTIVES

1. Discuss why data’s eating the world

2. Share how data science is solving the noisy alert problem

3. Discuss the state of innovation… and our role in it

4. Learn from each other

Page 3

SO WHO’S THE SHORT, NERDY DUDE?

Page 4

BUT FIRST…

HTTPS://POLLEV.COM/DANTURCHIN744

Page 5

DATA IS EATING THE WORLD

DATA SCIENCE

Using all available data to make better business decisions.

MACHINE LEARNING

Automating the use of statistics to infer future behavior from past results.

Page 6

DATA SCIENCE + MACHINE LEARNING: CASE STUDY

WHY DON’T UPS TRUCKS MAKE LEFT TURNS?

• Fuel efficiency

• Maintenance records

• Accident reports

• Driver health data

• On-time deliveries

• Package returns

• Customer surveys

• Objective: improve service and reduce costs

• Hypotheses: minimize miles traveled, avoid rush hour

• Collect and analyze data

• Conclusion: only right turns!

Page 7

DATA SCIENCE IS IMPACTING EVERY INDUSTRY

Page 8

…AND IT OPS DESERVES CREDIT (AND BLAME)

“Applications and services are now critical for customer satisfaction. IT is no longer just a cost center. There are more hosts, applications and infrastructure are more complex, and expectations around availability and quality are more aggressive. More data is needed to deliver the same quality of service and often that data isn’t being collected or is hard to find. Legacy approaches to monitoring no longer work.”

JAMES TURNBULL, THE ART OF MONITORING

Page 9

THE STATE OF MONITORING… IS POOR

• 80% AGREE THAT MONITORING IS STRATEGIC.

• 12% ARE SATISFIED WITH THEIR STRATEGY.

• 75% RECEIVE MORE THAN 50 ALERTS PER DAY.

• 31% OF THOSE WITH MTTR GREATER THAN 24 HOURS ARE SATISFIED WITH THEIR MONITORING STRATEGIES… VS. 63% WITH LOWER MTTR.

http://bit.ly/BP_SoM

Page 10

… AND NOISE LEVELS ARE INCREASING

[Chart: alerts per month (y-axis, 0 to 18,000) by year (x-axis, 2000 to 2020)]

Page 11

…BUT HEADCOUNT ISN’T

2000

• 5 incidents per engineer per day

• 96 minutes per incident

2020

• 400 incidents per engineer per day

• 1.2 minutes per incident
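
The per-incident figures on this slide are consistent with splitting one working day evenly across the day's incidents. A quick check, assuming an 8-hour (480-minute) day, which the slide does not state; the function name is hypothetical:

```python
# Assumption: an 8-hour working day (480 minutes); the slide does not state this.
WORKDAY_MINUTES = 8 * 60


def minutes_per_incident(incidents_per_day: int,
                         workday_minutes: int = WORKDAY_MINUTES) -> float:
    """Minutes available per incident if the day is split evenly."""
    return workday_minutes / incidents_per_day


print(minutes_per_incident(5))    # 96.0
print(minutes_per_incident(400))  # 1.2
```

Under that assumption, an 80-fold rise in incidents per engineer leaves each incident with one eightieth of the attention.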

Page 12

IS THERE A BETTER WAY TO FIX PROBLEMS FASTER?

DETECT

INVESTIGATE PREVENT

FIX

Page 13

THREE POSSIBLE APPROACHES…

PEOPLE BOTS AUTOMATION

Page 14

WHAT’S THE BEST WAY TO AUTOMATE EVENT CORRELATION?

HEURISTICS NLP

ASSISTED UNASSISTED ASSISTED UNASSISTED

• Optimal for: dynamic models where new inputs affect outputs

• Examples: air pollution, InfoSec, IT Ops

• Optimal for: static models where known inputs have predictable outputs

• Examples: migration patterns, molecular sequencing, mine detection

Page 15

COR·RE·LA·TION (ˌkôrəˈlāSH(ə)n)

The extent to which two variables have a linear relationship.

ARE THESE EVENTS RELATED… OR CORRELATED?

[Diagram: rows of event markers over a 10 MINUTES window]

HEURISTICS-BASED CORRELATION
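
The definition on this slide corresponds to the Pearson correlation coefficient. A minimal sketch of that calculation follows; the two per-minute alert-count series are invented for illustration and are not data from the talk, and the function name is hypothetical:

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series.

    Returns a value in [-1, 1]; +1 or -1 means a perfect linear relationship.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical alert counts per minute over a 10-minute window.
cpu_alerts = [0, 1, 3, 5, 4, 2, 1, 0, 0, 0]
disk_alerts = [0, 2, 6, 10, 8, 4, 2, 0, 0, 0]  # exactly twice the CPU series

print(round(pearson(cpu_alerts, disk_alerts), 3))  # 1.0
```

A coefficient near 1 says only that the two series rise and fall together. Whether the underlying events are actually related is the question the slide poses.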

Page 16

ARE THESE EVENTS RELATED… OR CORRELATED?

[Diagram: rows of event markers over a 10 MINUTES window]

“WHENEVER THERE’S A CPU ISSUE IT’S FOLLOWED BY A QUERY ERROR AND A DISK I/O ISSUE WITHIN 5 MINUTES WHEN HOSTS ARE IN THE SAME CLUSTER.”

• CPU SPIKE

• LONG QUERY EXECUTION

• DISK I/O BUFFER

• SAME CLUSTER

HEURISTICS-BASED CORRELATION
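
The quoted observation can be written down as an explicit check. The sketch below is one possible reading of it, not the speaker's or BigPanda's implementation; the `Event` type, the `kind` labels ("cpu", "query", "disk_io") and the sample data are all hypothetical:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    kind: str      # hypothetical labels: "cpu", "query", "disk_io"
    cluster: str
    at: datetime


def matches_rule(events, window=timedelta(minutes=5)):
    """Find CPU events followed, in the same cluster and within `window`,
    by both a query event and a disk I/O event.

    Returns a list of (cluster, cpu_event_time) pairs.
    """
    hits = []
    for cpu in (e for e in events if e.kind == "cpu"):
        followers = {
            e.kind
            for e in events
            if e.cluster == cpu.cluster and cpu.at < e.at <= cpu.at + window
        }
        if {"query", "disk_io"} <= followers:
            hits.append((cpu.cluster, cpu.at))
    return hits


t0 = datetime(2016, 6, 1, 12, 0)
events = [
    Event("cpu", "db-a", t0),
    Event("query", "db-a", t0 + timedelta(minutes=2)),
    Event("disk_io", "db-a", t0 + timedelta(minutes=4)),
    Event("cpu", "db-b", t0),
    Event("query", "db-b", t0 + timedelta(minutes=1)),
    Event("disk_io", "db-b", t0 + timedelta(minutes=9)),  # outside the window
]

print([cluster for cluster, _ in matches_rule(events)])  # ['db-a']
```

Only the db-a sequence matches: in db-b the disk I/O event arrives nine minutes after the CPU event, outside the 5-minute window.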

Page 17

DEFINITION: tag(s) + time window + filter = matching events

EXAMPLE: cluster + 30 minutes + source_system=api.* AND cluster NOT IN [“stage-*”] = matching events

All the alerts in an incident correlated by this rule will have the same cluster, the time between the creation of the first and most recent alert will be no more than 30 minutes, and all matching alerts will meet the filter conditions.

SAMPLE HEURISTIC
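
The sample rule can be sketched as a small grouping function. This is an illustration under stated assumptions, not BigPanda's implementation: the `Alert` fields and function names are hypothetical, and `api.*` and `stage-*` are read here as shell-style wildcards (the slide does not say whether they are globs or regular expressions).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from fnmatch import fnmatchcase


@dataclass
class Alert:
    source_system: str
    cluster: str
    created: datetime


def passes_filter(alert):
    """source_system=api.* AND cluster NOT IN ["stage-*"] (patterns read as globs)."""
    return (fnmatchcase(alert.source_system, "api.*")
            and not fnmatchcase(alert.cluster, "stage-*"))


def correlate(alerts, window=timedelta(minutes=30)):
    """Group filtered alerts into incidents that share a cluster tag and whose
    first and most recent alerts are no more than `window` apart."""
    incidents = []          # each incident is a list of alerts
    open_by_cluster = {}    # cluster -> incident currently accepting alerts
    for alert in sorted(alerts, key=lambda a: a.created):
        if not passes_filter(alert):
            continue
        incident = open_by_cluster.get(alert.cluster)
        if incident is None or alert.created - incident[0].created > window:
            incident = []   # window exceeded (or first alert): open a new incident
            incidents.append(incident)
            open_by_cluster[alert.cluster] = incident
        incident.append(alert)
    return incidents


t0 = datetime(2016, 6, 1, 9, 0)
alerts = [
    Alert("api.orders", "prod-1", t0),
    Alert("api.users", "prod-1", t0 + timedelta(minutes=10)),
    Alert("api.orders", "prod-1", t0 + timedelta(minutes=45)),  # past the window
    Alert("api.orders", "stage-1", t0),                         # excluded cluster
    Alert("db.main", "prod-1", t0 + timedelta(minutes=5)),      # not an api.* source
]

print([len(incident) for incident in correlate(alerts)])  # [2, 1]
```

The first two prod-1 alerts fall within 30 minutes of each other and form one incident. The third arrives 45 minutes after the first, so it opens a new incident, and the stage-1 and db.main alerts never pass the filter.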

Page 18

WHO CARES?

What should I work on next?

What’s about to break?

How does that impact the business?

PRIORITIZE

INVESTIGATE

PREVENT

Page 19

YOUR CUSTOMERS CARE.

BEFORE

DURING

AFTER

Page 20

WHAT’S AHEAD?

Page 21

THE EXISTENTIAL THREAT…

…IS REAL

Page 22

THREE… TWO… ONE…

Page 23

WHY ARE WE HERE?

WHAT CAN ONLY WE DO?

CAN MACHINES THINK? SHOULD THEY?

Page 24

AS PROBLEMS GET HARDER…

…WE RISE TO THE CHALLENGE

Page 25

WILL WE AGAIN?

CONTAINERS CLOUDS MICROSERVICES

CI/CD DEVOPS

Page 26

FURTHER LEARNING

O’REILLY DATA SHOW

THIS WEEK IN DATA

Page 27

“THE BEST WAY OUT IS ALWAYS THROUGH.”

-ROBERT FROST

DAN TURCHIN | BIGPANDA

@DTURCHIN | (650)533-0918