Splunk for Data Science

Copyright © 2014 Splunk Inc.

Tom LaGatta, Data Scientist, Splunk

Olivier de Garrigues, Sr Prof Services Consultant, Splunk


Disclaimer


During the course of this presentation, we may make forward-looking statements regarding future events or the expected performance of the company. We caution you that such statements reflect our current expectations and estimates based on factors currently known to us and that actual events or results could differ materially. For important factors that may cause actual results to differ from those contained in our forward-looking statements, please review our filings with the SEC. The forward-looking statements made in this presentation are being made as of the time and date of its live presentation. If reviewed after its live presentation, this presentation may not contain current or accurate information. We do not assume any obligation to update any forward-looking statements we may make. In addition, any information about our roadmap outlines our general product direction and is subject to change at any time without notice. It is for informational purposes only, and shall not be incorporated into any contract or other commitment. Splunk undertakes no obligation either to develop the features or functionality described or to include any such feature or functionality in a future release.

3 Key Takeaways

• Splunk is great for doing Data Science!
• Splunk complements other tools in the Data Science toolkit.
• Data Science is about extracting actionable insights from data.


About Us

• Tom LaGatta, Data Scientist: Tom joined Splunk in Spring 2014 as a Data Scientist specializing in Probability and Statistics. Tom is an expert on the mathematics of inference, and he enjoys functional programming in languages like Clojure, Haskell & R. At Splunk, Tom is helping to develop our internal and external Data Science program and curriculum. Tom has a Ph.D. in Mathematics from the University of Arizona, and until recently was a Courant Instructor at the Courant Institute at New York University. Tom is based in New York City.

• Olivier de Garrigues, Senior Professional Services Consultant: Olivier is based in London on the EMEA Professional Services team and has helped more than 40 customers in 10 countries on various Splunk projects in the past year and a half. Prior to this, he worked as a quantitative analyst with extensive use of MATLAB and R. He developed a keen interest in machine learning, enjoys dreaming about how to make Splunk better for data scientists, and helped develop the R Project App. Olivier holds an MS in Mathematics of Finance from Columbia University.


Splunk for Data Science

What is Data Science?

Data Science is about extracting actionable insights from data.
• Helps people make better decisions
• Can be used for automated decision-making
• Data Science is cross-functional, and blends techniques & theories from:
  – CS / Programming
  – Math and Statistics
  – Machine Learning
  – Data Mining / Databases
  – Data Visualization
  – Substantive / Domain Expertise
  – Social Science
  – Communication and Presentation
  – Accounting, Finance and KPIs
  – Business Analytics
• Don’t be afraid of Data Science!


Data Science & Analytics Teams

There is no “one size fits all” data scientist. Data Science & Analytics teams are made up of people with complementary skill sets.

Source: Schutt & O’Neil. Doing Data Science. 2013


Splunk for Data Science

Splunk is great for doing Data Science!
• Integrate, query & visualize all the data:
  – Platform for machine data
  – Connects with any other data source
• Easy-to-use Analytics capabilities
• Powerful algorithms out-of-the-box
• Sharp visualizations and dashboards
• Deliver results to both IT & Business users
• Complements other Data Science tools (next slide)


Splunk and Data Science Tools

Splunk complements other tools in the Data Science toolkit:
• Hadoop: the workhorse of the Data Science world. Using Hunk, you can integrate Hadoop & HDFS seamlessly into Splunk.
• R & Python: the preferred languages of Data Science. Execute R & Python scripts in your Splunk queries using the R Project App & SDK for Python.
• SQL & other RDBMS: valuable stores for customer & product data. Use Splunk’s DB Connect App to mash relational data up with machine data.
• External tools: export finalized data from Splunk using the ODBC Driver.
  – Tip: do all your data processing in Splunk/Hunk, and export only the final results
• D3 Custom Visualizations: sharp dashboards & reports using Splunk


Splunk and Data Science Use Cases

Splunk is a powerful tool for lots of Data Science use cases:
• Green Use Cases: easy out of the box
• Yellow Use Cases: need tinkering

Use cases:
• Trend Forecasting
• D3 Custom Visualizations
• A/B Testing
• Predictive Modeling
• Root Cause Analysis
• Sentiment Analysis
• Anomaly Detection
• Conversion Funnel/Pathing
• Market Segmentation
• More Algorithms via R & Python
• Topic Modeling
• Capacity Planning
• Correlate Data from 2+ Sources
• Data Munging & Normalization
• KPIs & Executive Dashboards


Data Science Use Cases

Use Case: Trend Forecasting

Trend Forecasting: given past & realtime data, predict future values & events.
• Common applications:
  – Forecast revenue & other KPIs
  – Web server traffic & product downloads
  – Customer conversion rates
  – Estimate MTTR & server outages
  – Resource & capacity planning (AWS App)
  – Security threats (Enterprise Security App)
• The “true” course of events can (and will) take only one of many divergent paths. But which one…?
• Be mindful of rare events & black swans!


Splunk Solution: predict

predict command: forecast future trajectories of time series
• Implements a Kalman filter to identify seasonal trends
• Gives an “uncertainty envelope” as a buffer around the trend
• Tip: always run the predict command on LOTS of past data. Capture low-frequency and high-frequency trends
• Remember: the future is always uncertain…
• Remember: all forecasts are probabilistic. The predict command qualifies its estimate with an “uncertainty envelope”: this also accounts for past measurement error.
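To make the idea of a Kalman-filter forecast with an uncertainty envelope concrete, here is a minimal Python sketch. It is not the predict command's implementation: it assumes the simplest possible model (a "local level" with no seasonal component), and the function name and the process_var and obs_var parameters are hypothetical, hand-picked values rather than fitted ones.

```python
import math


def local_level_forecast(series, horizon, process_var=1.0, obs_var=1.0):
    """Filter a series with a local-level Kalman filter, then forecast.

    Returns one (mean, lower95, upper95) tuple per future step.
    process_var and obs_var are illustrative guesses, not fitted values.
    """
    level, var = series[0], obs_var  # initial state estimate and its variance
    for y in series[1:]:
        var += process_var            # predict step: uncertainty grows
        gain = var / (var + obs_var)  # Kalman gain
        level += gain * (y - level)   # update step: pull toward the observation
        var *= 1.0 - gain             # uncertainty shrinks after the update
    forecasts = []
    for _ in range(horizon):
        var += process_var            # no new data, so variance keeps growing
        half_width = 1.96 * math.sqrt(var + obs_var)
        forecasts.append((level, level - half_width, level + half_width))
    return forecasts


if __name__ == "__main__":
    for mean, low, high in local_level_forecast([10.0, 11.0, 9.0, 10.0, 10.5], 3):
        print(f"{mean:.2f} in [{low:.2f}, {high:.2f}]")
```

The envelope widens with every step into the future, which is the behavior the "future is always uncertain" reminder points at. Feeding the filter more history tightens the state estimate before forecasting begins.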


Splunk Solution: Predict App

David Carasso’s Predict App: forecast future values of individual events.
  – 8 minute walkthrough: https://www.youtube.com/watch?v=ROvaqJigNFg
• Implements a Naïve Bayes classifier
• You have to train models!
• Train a model to predict any target field using any reference field(s):
    fields ref1, ref2, ..., target | train my_model from target
• Guess target field for incoming events:
    guess my_model into target
• Temporal or non-temporal prediction (include _time among reference fields)
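The train/guess workflow can be sketched in a few lines of Python. This is a generic categorical Naïve Bayes classifier with Laplace smoothing, written for illustration only; it is not the Predict App's code, and the function names train and guess simply mirror the commands above.

```python
import math
from collections import Counter, defaultdict


def train(events, target):
    """Count target labels and (field, value) pairs seen under each label."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)  # label -> Counter of (field, value)
    vocab = defaultdict(set)             # field -> set of values seen in training
    for event in events:
        label = event[target]
        label_counts[label] += 1
        for field, value in event.items():
            if field != target:
                value_counts[label][(field, value)] += 1
                vocab[field].add(value)
    return label_counts, value_counts, vocab


def guess(model, event):
    """Return the most probable label for an event lacking the target field."""
    label_counts, value_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior
        for field, value in event.items():
            if field not in vocab:
                continue  # ignore fields never seen during training
            seen = value_counts[label][(field, value)]
            # Laplace smoothing so unseen values do not zero out the score
            score += math.log((seen + 1) / (count + len(vocab[field]) + 1))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    training = [
        {"genre": "comedy", "tag": "funny", "rating": "Like"},
        {"genre": "comedy", "tag": "funny", "rating": "Like"},
        {"genre": "horror", "tag": "gory", "rating": "Dislike"},
    ]
    model = train(training, "rating")
    print(guess(model, {"genre": "horror", "tag": "gory"}))
```

Working in log probabilities avoids numeric underflow when many reference fields are multiplied together.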


Concept: Supervised Learning & Classification

Supervised learning: use observed training data to classify values of unknown testing data
• predict command (Kalman filter): Training data = timechart of past & realtime values. Testing data = time range for future values
• Predict App (Naïve Bayes classifier): Training data = events with reference & target fields. Testing data = events with reference fields but not target field
• Tip: only deploy models & algorithms after extensive testing & evaluation
• More powerful learning algorithms using R Project App or SDK for Python


Demo: Predict App

• Train a model to predict movie Rating based on MovieID, UserID, Genre, Tag:
    index=movielens Timestamp < 1199188800 UserID=593*
    | eval original_rating = case(Rating<3,"Dislike", Rating=3,"Neutral", Rating>3,"Like")
    | fields original_rating MovieID UserID Genre Tag
    | train rating_model from original_rating
• Guess Rating for test data based on trained model:
    index=movielens Timestamp > 1199188800 UserID=593*
    | guess rating_model into guessed_rating
    | top original_rating guessed_rating
• Accuracy of model: correct on 97.6% of values
• Tip: always train on LOTS of training data
• Evaluate before deploying
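An accuracy figure like the one above comes from comparing the original and guessed labels on held-out events. A minimal Python sketch follows, assuming the results are available as (original, guessed) pairs; the function names are hypothetical.

```python
from collections import Counter


def accuracy(pairs):
    """Fraction of (original, guessed) pairs whose labels agree."""
    pairs = list(pairs)
    correct = sum(1 for original, guessed in pairs if original == guessed)
    return correct / len(pairs)


def confusion_counts(pairs):
    """Count each (original, guessed) combination, most common first."""
    return Counter(pairs).most_common()


if __name__ == "__main__":
    results = [
        ("Like", "Like"),
        ("Like", "Like"),
        ("Dislike", "Like"),
        ("Neutral", "Neutral"),
    ]
    print(accuracy(results))
    print(confusion_counts(results))
```

The counts of each combination show where the model goes wrong, which a single accuracy number hides.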


Use Case: Sentiment Analysis

Sentiment Analysis: the assignment of “emotional” labels to textual data
• Can be simple +1 vs. -1, or more sophisticated: “happy”, “angry”, “sad”, etc.
• Analyze tweets, emails, news articles, logs or any other textual data!
  – Social data correlates with other factors
• Typically done via supervised learning:
  – Train a model on labeled corpus of text
  – Test the model on incoming text data
• Read more about Sentiment Analysis:
  – Chapter 14 of Big Data Analytics Using Splunk (pp. 255-282)
  – Michael Wilde & David Carasso. Social Media & Sentiment Analysis. .conf2012

[Figure: 2011 Irish General Election. Labels: 3rd, 8th, 4th, 1st, 2nd; 17%, 1.8%, 10%, 36%, 19%; r = .79]


Splunk Solution: Sentiment Analysis App

David Carasso’s Sentiment Analysis App assigns binary sentiment values to textual data (logs, tweets, email, etc.)
• Naïve Bayes classifier under the hood
• Twitter & IMDB models out of the box
• Can guess language of authorship, and “heat”, a measure of emotional charge
• Tip: compare relative sentiment changes across time & groups
• How to train your own models: http://answers.splunk.com/answers/59743


Demo: Sentiment Analysis App


Use Case: Anomaly Detection

• An anomaly (or outlier) is an event which is vastly dissimilar to other events
• Anomaly Detection is one of Splunk’s most common use cases. Examples:
  – Transactions which occur faster than humanly possible
  – DDoS attacks from IP address ranges
  – High-value customer purchase patterns
• Quick techniques for finding statistical outliers:
  – Non-average outliers: more than 2*stdev from the avg
  – Non-typical outliers: more than 1.5*IQR above perc75 or below perc25
• Tip: save these as eventtypes for automated outlier detection
• Once anomalies have been found, dig deeper to discover root causes
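The two quick outlier rules above can be sketched in Python with the standard statistics module. The function names and the sample response times are made up for illustration, and statistics.quantiles uses its own interpolation, so its quartiles may differ slightly from Splunk's perc25 and perc75.

```python
import statistics


def stdev_outliers(values, k=2.0):
    """Non-average outliers: more than k standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > k * sd]


def iqr_outliers(values, k=1.5):
    """Non-typical outliers: more than k*IQR above Q3 or below Q1."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]


if __name__ == "__main__":
    response_times = [10, 11, 12, 11, 10, 12, 11, 10, 12, 11, 95]
    print(stdev_outliers(response_times))
    print(iqr_outliers(response_times))
```

The IQR rule is usually the more robust of the two, because a single extreme value inflates the standard deviation but barely moves the quartiles.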


Splunk Solution: cluster

• Anomalies are dissimilar to other events (by definition)
• We can use clustering algorithms to help us detect anomalies:
  – Non-anomalous events typically form a few large clusters
  – Anomalous events typically form lots of small clusters
• Cluster your data, sort ascending:
    cluster showcount=true labelonly=true | sort cluster_count cluster_label
• Remember: there is no “right way” to find all anomalies. Explore your data!
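The "small clusters first" idea can be sketched in Python. This is not how the cluster command measures similarity: as a stand-in, the sketch assumes that events sharing the same text once digit runs are collapsed belong to one cluster. The function names are hypothetical.

```python
import re
from collections import Counter


def signature(event):
    """Crude template: collapse digit runs so similar messages share a key."""
    return re.sub(r"\d+", "<n>", event)


def rare_clusters(events):
    """Return (signature, count) pairs sorted ascending by cluster size."""
    counts = Counter(signature(event) for event in events)
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))


if __name__ == "__main__":
    logs = [
        "login ok user=1",
        "login ok user=2",
        "login ok user=3",
        "disk failure on sda1",
    ]
    for sig, count in rare_clusters(logs):
        print(count, sig)
```

Sorting ascending puts the smallest clusters at the top, so the events most worth a closer look are read first.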


Concept: Unsupervised Learning & Clustering

• A clustering algorithm is any process which groups together similar things (events, people, etc.), and separates dissimilar things (events, people, etc.)
• Clustering is unsupervised: choose labels based on patterns in the data
• Clustering is in the eye of the beholder:
  – Lots of different clustering algorithms
  – Lots of different similarity functions
• Do not confuse with:
  – Computer cluster: a group of computers working together as a single system
  – Splunk cluster: a group of Splunk indexers replicating indexes & external data


Demo: cluster


Splunk Solution: Other Commands

• anomalies:
  – Assigns an “unexpectedness” score to each event
• anomalousvalue:
  – Assigns an “anomaly score” to events with anomalous values
• outlier:
  – Removes or truncates outliers
• kmeans:
  – Powerful clustering algorithm. You choose k = # of clusters
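For readers new to k-means, the sketch below shows the standard Lloyd iteration on one-dimensional data in Python. It is an illustration of the general algorithm, not the kmeans command's implementation; the function names are hypothetical, and the starting centroids are supplied by the caller so the result is reproducible.

```python
def assign(values, centroids):
    """Put each value in the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for v in values:
        nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
        clusters[nearest].append(v)
    return clusters


def kmeans_1d(values, centroids, max_iterations=20):
    """Lloyd's algorithm: alternate assignment and centroid updates."""
    centroids = list(centroids)
    for _ in range(max_iterations):
        clusters = assign(values, centroids)
        # An empty cluster keeps its previous centroid
        updated = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break  # converged: assignments can no longer change
        centroids = updated
    return centroids, assign(values, centroids)


if __name__ == "__main__":
    centers, groups = kmeans_1d([1, 2, 3, 10, 11, 12], centroids=[0, 5])
    print(centers, groups)
```

The number of starting centroids is the k you choose; different starting points can converge to different clusterings, which is one reason to try several.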


Splunk Solution: Prelert (Partner App)

• Manages Anomaly Detection directly
  – Pre-built dashboards, alerts, API
  – Use cases: Security, IT Ops / APM, DevOps
  – Godfrey Sullivan: “beautifully adjacent and complementary to what Splunk does”
• Can download from Splunk Apps
  – May save you time with Anomaly Detection
  – Can also be a good source of inspiration for your own Anomaly Detection dashboards
• Keep in mind Prelert is a paid app:
  – Cost: $225/month at 5GB


Use Case: Market Segmentation

• Market Segmentation: group customers according to common needs and priorities, and develop strategies to target them
  – Market segments are internally homogeneous, and externally heterogeneous, i.e., market segments are clusters of customers
• Many reasons for Market Segmentation:
  – Different market segments require different strategies
  – Customers in same segment have similar product preferences. Different segments, different preferences
  – Segments should be reasonably stable, to allow for historical analysis (good for Data Science)
• Use Splunk’s clustering algorithms to identify and label market segments!


Data Visualizations

Intro to Data Visualization

• Data Visualization is the creation and study of the visual representation of data, and is a vital part of Data Science
• The goal of data visualization is to communicate information:
  – Visualizations communicate complex ideas with clarity, precision, and efficiency
  – Transmission speed of the optic nerve is about 9Mb/sec (fast image processing)
  – Pattern matching, edge detection
  – Visualizations pack lots of information into small spaces. More than text alone!


Telling Stories with Data Visualizations

• We process data in linear narratives: even dashboards go top-to-bottom
• Visualizations help pierce the monotony of text, number & data streams
• Think about the story you’re telling:
  – Empathize with the viewer
  – What’s their takeaway?
• A good visualization tells its own story: “Island Nation Obtains Favourable Balance of Trade; Goes On To Rule The World”
• Weave multiple visualizations together to tell more effective stories

[Figure: William Playfair (1786)]


[Figure, shown on two slides: chart annotated “Splunk”. Source: New York Times. May 17, 2012]

Tips for Effective Data Visualizations

• #1 tip: Plot the most important keys on x & y axes
  – You choose “most important.”
  – You might need >1 visualization.
• Manipulate size, color and shape to convey additional information
• Annotate, label and add icons ✔︎
• Use chart overlay to correlate data sources. Mix histograms & line charts ↑↑↑
• Manipulate numerical scale: linear vs. log scales (previous 2 slides)
• Read more about Data Visualization:
  – Tableau’s whitepaper, Visual Analysis Best Practices (2013)
  – Edward Tufte’s The Visual Display of Quantitative Information (2001)


D3 Custom Visualizations in Splunk

• Splunk now supports D3 visualizations with some minor customization
• Satoshi’s talk: “I want that cool viz in Splunk!”
• Resources for Custom Visualizations:
  – Splunk Web Framework Toolkit: https://apps.splunk.com/app/1613/
  – Splunk 6.x Dashboard Examples: https://apps.splunk.com/app/1603/
  – Custom SimpleXML Extensions: http://apps.splunk.com/app/1772/
  – Lots more D3 visualizations: https://github.com/mbostock/d3/wiki/Gallery


Demo: Sankey Chart


How-to for Sankey Charts

• Install the Custom SimpleXML Extensions app: http://apps.splunk.com/app/1772/
• Create your own app, and install Sankey chart components:
  – Drop autodiscover.js in $SPLUNK_HOME/etc/apps/<YOURAPP>/appserver/static
  – Copy & paste /sankeychart/ subfolder into $SPLUNK_HOME/etc/apps/<YOURAPP>/appserver/static/components
  – Restart Splunk
• In your dashboard:
  – Include script="autodiscover.js" in <form> or <dashboard> opening tag
  – Insert XML snippet from 2- or 3-node Sankey dashboard example
  – Change 2 instances of “custom_simplexml_extensions” to <YOURAPP>
  – Update search and “data-options” parameters (nodes) in XML to reflect your data


Know Your Audience

• Finally, keep in mind your audience: who are they, what questions do they care about, and how do they want to consume the data?
  – Executive: KPIs, charts, tables with icons ✔︎
  – Marketing Analyst: KPIs & metrics. Sharp images for their own reports & decks. Tableau
  – Data Scientist: output clean data to organized data stores (Hunk, HDFS, SQL, NoSQL)
  – Sysadmin: sparklines, gauges for activity & MTTR, tables with highlighted anomalies
  – Security Ops: maps with detailed overlays, drill down on anomalous events
• Bring it back to the business problem & use


3 Key Takeaways

• Splunk is great for doing Data Science!
• Splunk complements other tools in the Data Science toolkit.
• Data Science is about extracting actionable insights from data.


List of References

Good books on Data Science:
• Schutt & O’Neil. Doing Data Science. O’Reilly 2013
• Provost & Fawcett. Data Science for Business. O’Reilly 2013
• Max Shron. Thinking With Data. O’Reilly 2014
• Edward Tufte. The Visual Display of Quantitative Information. Graphics Press 2001
• Zumel & Mount. Practical Data Science with R. Manning 2014
• Hastie et al. Elements of Statistical Learning. Springer-Verlag 2009 (free PDF!)

Using Splunk for Data Science:
• Zadrozny, Kodali (and Stout). Big Data Analytics Using Splunk. Apress 2013
• David Carasso. Exploring Splunk. CITO Research 2012
• David Carasso. Data Mining with Splunk. .conf2012
• Michael Wilde & David Carasso. Social Media & Sentiment Analysis. .conf2012

Good free references:
• Tableau. Visual Analysis Best Practices. Tableau 2013
• King & Magoulas. 2013 Data Science Salary Survey. O’Reilly 2013
• DJ Patil. Building Data Science Teams. O’Reilly 2013
• Cathy O’Neil. On Being A Data Skeptic. O’Reilly 2013


THANK YOU

