Page 1
HSD Hochschule Düsseldorf
University of Applied Sciences
Fachbereich Wirtschaftswissenschaften
Faculty of Business Studies
IT Applications in Business Analytics
Business Analytics (M.Sc.)
IT in Business Analytics
SS2016 / Lecture 14 – Wrap Up
Thomas Zeutschler
SS 2016 - IT Applications in Business Analytics - 14. Wrap Up 1
Page 2
HSD Faculty of Business Studies
Thomas Zeutschler
Associate Lecturer
Let’s get started…
Page 3
Targets of Module and Lectures
German
“Die Studierenden erlernen die Anwendung praxisrelevanter IT-Werkzeuge (für Business Analytics) anhand von Fallstudien.”
English
“Students will learn to apply analytical tools to business problems.”
American English
“We’ll try to make you a Bruce Willis in Analytics.”
Page 4
Scope of Module and Lectures
Advanced Analytics
“Advanced Analytics is the autonomous or semi-autonomous examination of data or content using sophisticated techniques and tools, typically beyond those of traditional business intelligence (BI), to discover deeper insights, make predictions, or generate recommendations.”
Source: http://www.gartner.com/it-glossary/
Page 5
In Scope / Out of Scope
Data Science
Data Mining, Text Mining
Predictive Analytics, Simulation, Machine Learning
Database Technologies
Information Retrieval
Data Analysis
Text Analysis, Semantic Web, XML
Data Warehouse, Data Mart, ETL
In Memory Technologies
Reporting, OLAP
Data and Decision Modelling
Data Visualization
Data Quality Management, Data Protection
Specific Business Applications ► Case Studies
Page 6
Sequence of Lectures
1 Introduction (1st April 2016)
2 Methodology and process model for analytics (CRISP DM)
3 Tools, technologies and data sources
4 The R Programming Language
5 KNIME
6 Case Study 1
9 Case Study 2
12 Case Study 3
15 Wrap Up (8th July 2016)
Theory
Tools Training
Hands On Case Studies
Page 7
CRISP DM
Page 8
The Data Mining Process
CRoss-Industry Standard Process for Data Mining (CRISP-DM)
A methodology covering the typical phases of an analytical project, the tasks involved in each phase, and an explanation of the relationships between these tasks.
As a process model, CRISP-DM provides an overview of the data mining life cycle.
CRISP-DM was conceived in 1996 and first published in 1999 by SPSS, NCR and Mercedes and is reported as the leading methodology for data mining/predictive analytics projects.
In 2015 IBM released a new implementation method for data mining/predictive analytics projects called Analytics Solutions Unified Method for Data Mining & Predictive Analytics (ASUM-DM), which is a refined and extended CRISP-DM. But it’s a little bit too complex to start with…
Page 9
Introduction
“The process of knowledge discovery in data mining has to be reproducible and reliable, especially for people who have no background in data science.”
Page 10
CRISP DM
Page 11
CRISP DM – Current Industry Standard
Source: http://www.kdnuggets.com/2014/10/crisp-dm-top-methodology-analytics-data-mining-data-science-projects.html
Other approaches:
KDD: "Knowledge Discovery in Databases", developed by Usama Fayyad (Microsoft Research, 1996), describes methods and technologies to assist humans in extracting useful information (knowledge) from the rapidly growing volumes of digital data.
SEMMA: An acronym that stands for Sample, Explore, Modify, Model and Assess. It is a list of sequential steps developed by SAS Institute in 2009.
Criticism: SEMMA mainly focuses on the modeling tasks of data mining projects, leaving the business aspects out, and is focused on the usage of SAS products.
Page 12
CRISP DM – Objectives and Benefits
Ensure quality of knowledge discovery project results
Reduce skills required for knowledge discovery
Reduce costs and time
General purpose (i.e., stable across varying applications)
Robust (i.e., insensitive to changes in the environment)
Tool and technique independent
Tool supportable
Support documentation of projects
Capture experience for reuse
Support knowledge transfer and training
Page 13
CRISP DM – Phases and Tasks
Business Understanding
  Determine Business Objectives: Background. Business Objectives. Business Success Criteria.
  Assess Situation: Inventory of Resources. Requirements, Assumptions and Constraints. Risks and Contingencies. Terminology. Costs and Benefits.
  Determine Data Mining Goals: Data Mining Goals. Data Mining Success Criteria.
  Produce Project Plan: Project Plan. Initial Assessment of Tools and Techniques.
Data Understanding
  Collect Initial Data: Initial Data Collection Report.
  Describe Data: Data Description Report.
  Explore Data: Data Exploration Report.
  Verify Data Quality: Data Quality Report.
Data Preparation
  Select Data: Rationale for Inclusion/Exclusion.
  Clean Data: Data Cleaning Report.
  Construct Data: Derived Attributes. Generated Records.
  Integrate Data: Merged Data.
  Format Data: Reformatted Data.
  Dataset: Dataset Description.
Modelling
  Select Modelling Technique: Modelling Technique. Modelling Assumptions.
  Generate Test Design: Test Design.
  Build Model: Parameter Settings. Models. Model Description.
  Assess Model: Model Assessment. Revised Parameter Settings.
Evaluation
  Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria. Approved Models.
  Review Process: Review of Process.
  Determine Next Steps: List of Possible Actions. Decision.
Deployment
  Plan Deployment: Deployment Plan.
  Plan Monitoring and Maintenance: Monitoring and Maintenance Plan.
  Produce Final Report: Final Report. Final Presentation.
  Review Project: Experience Documentation.
Page 14
CRISP DM – Objectives and Benefits
Typical Effort per CRISP DM Phase in %
[Figure: bar chart of typical effort per phase (y-axis: effort, ticks at 10%, 20%, 30%) for Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation and Deployment]
Page 15
CRISP DM – 1 Business Understanding
1.1 Determine Business Objectives: Background. Business Objectives. Business Success Criteria.
1.2 Assess Situation: Inventory of Resources. Requirements, Assumptions and Constraints. Risks and Contingencies. Terminology. Costs and Benefits.
1.3 Determine Data Mining Goals: Data Mining Goals. Data Mining Success Criteria.
1.4 Produce Project Plan: Project Plan. Initial Assessment of Tools and Techniques.
Page 16
CRISP DM – 2 Data Understanding
2.1 Collect Initial Data: Initial Data Collection Report.
2.2 Describe Data: Data Description Report.
2.3 Explore Data: Data Exploration Report.
2.4 Verify Data Quality: Data Quality Report.
Page 17
CRISP DM – 3 Data Preparation
3.1 Select Data: Rationale for Inclusion / Exclusion.
3.2 Clean Data: Data Cleaning Report.
3.3 Construct Data: Derived Attributes. Generated Records.
3.4 Integrate Data: Merged Data.
3.5 Format Data: Reformatted Data.
3.6 Dataset: Dataset Description.
Page 18
CRISP DM – 4 Modelling
4.1 Select Modelling Technique: Modelling Technique. Modelling Assumptions.
4.2 Generate Test Design: Test Design.
4.3 Build Model: Parameter Settings. Models. Model Description.
4.4 Assess Model: Model Assessment. Revised Parameter Settings.
Page 19
CRISP DM – 5 Evaluation
5.1 Evaluate Results: Assessment of Data Mining Results with respect to Business Success Criteria. Approved Models.
5.2 Review Process: Review of Process.
5.3 Determine Next Steps: List of Possible Actions. Decision.
Page 20
CRISP DM – 6 Deployment
6.1 Plan Deployment: Deployment Plan.
6.2 Plan Monitoring and Maintenance: Monitoring and Maintenance Plan.
6.3 Produce Final Report: Final Report. Final Presentation.
6.4 Review Project: Experience Documentation.
Page 21
Tools
Page 22
Basic Concept – SQL
SQL (Structured Query Language) is a special-purpose programming language designed for managing data held in a relational database management system (RDBMS).
SQL is based upon relational algebra and tuple relational calculus.
https://en.wikipedia.org/wiki/Relational_algebra
SQL defines 3 language aspects:
Data definition language (DDL) … to define database schemas
Data manipulation language (DML) … to select, insert, delete and update data
Data control language (DCL) … to control access rights in databases
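The language aspects listed above can be tried out from R. The following is a minimal sketch, assuming the DBI and RSQLite packages are installed; the table name `sales` and its columns are made up for illustration. The data control part appears only as a comment because SQLite has no user management.

```r
# Assumption: the DBI and RSQLite packages are installed.
library(DBI)

# In-memory SQLite database, gone once the connection is closed
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# DDL: define the schema (hypothetical example table)
dbExecute(con, "CREATE TABLE sales (region TEXT, amount REAL)")

# DML: insert, update and select data
dbExecute(con, "INSERT INTO sales VALUES ('North', 100), ('South', 250), ('North', 50)")
dbExecute(con, "UPDATE sales SET amount = 300 WHERE region = 'South'")
totals <- dbGetQuery(con,
  "SELECT region, SUM(amount) AS total FROM sales GROUP BY region ORDER BY region")
print(totals)

# DCL would look like: GRANT SELECT ON sales TO analyst
# (not executed here, SQLite does not manage users or access rights)

dbDisconnect(con)
```

On a server database such as PostgreSQL the GRANT statement could be sent through `dbExecute` in the same way.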
Page 23
Database System Classification
SQL Databases
Predefined schema
Standard definition and interface language
Tight consistency
Well defined semantics
NoSQL Databases
No predefined schema
Per-product definition and interface language
Getting an answer quickly is more important than getting a correct answer
Page 24
Big Data Framework
A pre-customized and pre-compiled collection of tools and technologies required for big data processing based on Hadoop.
Page 25
The R Programming Language
The R system contains two major components:
1. Base System – contains the R language software and the high priority add-on packages.
2. User contributed add-on packages.
R includes…
an effective data handling and storage facility,
a suite of operators for calculations on arrays, in particular matrices,
a large collection of intermediate tools for data analysis,
graphical facilities for data analysis and display either on-screen or on hardcopy, and
a simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.
Page 26
RStudio
Native R is a console application; RStudio is a wrapper for convenience…
Page 27
R Basics
Source: http://www.ats.ucla.edu/stat/r/seminars/intro.htm
Variables
# Declaration and usage of variables
A <- 2
B <- 3
x <- seq(0, 2*pi, 0.1)
y <- sin(x)
Simple Mathematics
# Attention: R is case sensitive, Sin(2*3) fails with "could not find function"
1 + 2
sin(2*3)
Charting
# Plot y against x with title, subtitle and axis labels
plot(x, y, main = "Sinus Plot",
     sub = "made with R",
     xlab = "x-axis",
     ylab = "y-axis")
Page 28
R Basics – Install and use packages
Using Packages
Installing Packages (remove the #)
Automatic Load and (if required) Installation of a Package
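The code for these three steps is not part of the extracted text. A minimal sketch of what they might look like is given here; the helper name `load_or_install` is hypothetical, and `lattice` is only used as an example package.

```r
# Installing packages (remove the # to run, needs internet access)
# install.packages("lattice")

# Using packages: attach an installed package to the session
library(lattice)

# Automatic load and (if required) installation of a package.
# load_or_install is a hypothetical helper name, not a base R function.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)          # only reached when the package is missing
  }
  library(pkg, character.only = TRUE)
  invisible(pkg %in% loadedNamespaces())
}

load_or_install("stats")
```

`requireNamespace` is used for the check because it tests availability without attaching the package as a side effect.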
Page 29
R Basics
Loading Data
Assign Data to Objects
Accessing Data
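The code for these steps is missing from the extracted text. The sketch below illustrates them on a small inline CSV; the column names mimic the hsb2 dataset used elsewhere in the lecture, but the four rows are made-up values.

```r
# Made-up sample data in CSV form (assumption, for illustration only)
csv_text <- "id,read,write,prog
1,57,52,general
2,68,59,academic
3,44,33,vocation
4,63,44,academic"

# Loading data and assigning it to an object
d <- read.csv(text = csv_text)

# Accessing data
d$read                              # one column by name
d[2, ]                              # one row by index
d[d$read >= 60, c("id", "read")]    # rows by condition, selected columns
high <- d[d$read >= 60, ]
```

Reading from a file or URL works the same way, with the path passed as the first argument of `read.csv`.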
Page 30
R Basics
Accessing Data continued / Saving Data
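As the original code is not in the extracted text, here is a small sketch of further access patterns and of saving data. The data frame and the temporary file path are assumptions for illustration.

```r
# Made-up sample data (assumption)
d <- data.frame(id = 1:3, read = c(57, 68, 44))

# Accessing data continued
d$passed <- d$read >= 50     # add a derived column
head(d, 2)                   # first two rows
names(d)                     # column names

# Saving data as CSV and reading it back
path <- tempfile(fileext = ".csv")
write.csv(d, path, row.names = FALSE)
d2 <- read.csv(path)

# Saving a single R object in R's own binary format
rds_path <- tempfile(fileext = ".rds")
saveRDS(d, rds_path)
d3 <- readRDS(rds_path)
```

CSV is readable by other tools, while `saveRDS` keeps column types exactly as they were in R.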
Page 31
R Basics
Simple Data Analysis
# dplyr provides filter() for data frames
library(dplyr)
d <- read.csv("http://www.ats.ucla.edu/stat/data/hsb2.csv")
# return the number of observations (rows) and variables (columns) in d
dim(d)
# get the structure of d, including the class (type) of all variables
str(d)
# return the distributional summaries of variables in the dataset
summary(d)
# return a summary of the dataset for all rows where variable 'read' >= 60
# note that filter is in the dplyr package
summary(filter(d, read >= 60))
Page 32
R Basics
Charting
# load the lattice charting package
library(lattice)
# draw a simple scatter plot
xyplot(read ~ write, data = d)
# conditioned scatter plot
xyplot(read ~ write | prog, data = d)
# box and whisker plots
bwplot(read ~ factor(prog), data = d)
More Charting (ggplot2 package)
# load ggplot2 for ggplot() and GGally for ggpairs()
library(ggplot2)
library(GGally)
# draw a kernel density plot
ggplot(d, aes(x = write)) + geom_density()
# draw a kernel density plot per prog
# (the + must end the line, otherwise R stops after geom_density())
ggplot(d, aes(x = write)) + geom_density() +
  facet_wrap(~ prog)
# inspect univariate and bivariate
# relationships using a scatter plot matrix
ggpairs(d[, 7:11])
Page 33
Analytics Data Processing – Sample: Knime
www.knime.org
Page 34
Knime - Essential Nodes
Data Preparation
The input table is split into two partitions (i.e. row-wise), e.g. train and test data. The two partitions are available at the two output ports.
This node helps handle missing values found in cells of the input table.
The node allows for row / column filtering according to certain criteria.
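For comparison, the same three preparation steps can be sketched in base R. This is only an illustration under assumptions: the data frame is made up, mean imputation is one of several possible missing-value strategies, and the 70/30 split ratio is arbitrary.

```r
set.seed(42)   # makes the random partition reproducible

# Made-up sample data with two missing values (assumption)
d <- data.frame(x = c(1, 2, NA, 4, 5, 6, 7, NA, 9, 10),
                g = rep(c("a", "b"), 5))

# Missing values: replace NA in a numeric column by the column mean
d$x[is.na(d$x)] <- mean(d$x, na.rm = TRUE)

# Row / column filtering: keep rows of group "b" and only column x
d_b <- d[d$g == "b", "x", drop = FALSE]

# Partitioning: split row-wise into 70 % train and 30 % test data
train_idx <- sample(nrow(d), size = round(0.7 * nrow(d)))
train <- d[train_idx, ]
test  <- d[-train_idx, ]
```

The two data frames `train` and `test` play the role of the two output ports of the partitioning step.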
Page 35
Knime - Essential Nodes
First Statistical Data Analysis
Calculates statistical moments such as minimum, maximum, mean, standard deviation, variance, median, overall sum, number of missing values and row count across all numeric columns, and counts all nominal values together with their occurrences.
Creates a cross table (also referred to as a contingency table or cross tab). It can be used to analyze the relation of two columns with categorical data and displays the frequency distribution of the categorical variables in a table.
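A rough base R equivalent of these two analyses is sketched below. The data frame and its column names are made-up assumptions; the statistics mirror the ones listed in the node description.

```r
# Made-up sample data (assumption)
d <- data.frame(
  prog   = c("academic", "general", "academic", "vocation", "academic", "general"),
  female = c("yes", "no", "yes", "no", "no", "yes"),
  read   = c(68, 57, 63, 44, NA, 52)
)

# Statistics across all numeric columns
num <- d[sapply(d, is.numeric)]
stats <- sapply(num, function(v) c(
  min     = min(v, na.rm = TRUE),
  max     = max(v, na.rm = TRUE),
  mean    = mean(v, na.rm = TRUE),
  sd      = sd(v, na.rm = TRUE),
  var     = var(v, na.rm = TRUE),
  median  = median(v, na.rm = TRUE),
  sum     = sum(v, na.rm = TRUE),
  missing = sum(is.na(v)),
  rows    = length(v)
))
print(stats)

# Occurrences of nominal values
counts <- table(d$prog)

# Cross table (contingency table) of two categorical columns
ct <- table(d$prog, d$female)
print(ct)
```

`table` with two arguments returns the frequency of every combination of the two categorical variables.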
Page 36
Knime – Data Mining Cheating…
Linear regression
Pros: Very fast (runs in constant time). Easy to understand the model. Less prone to overfitting.
Cons: Unable to model complex relationships. Unable to capture nonlinear relationships without first transforming the inputs.
Good at: The first look at a dataset. Numerical data with lots of features.

Decision trees
Pros: Fast. Robust to noise and missing values. Accurate.
Cons: Complex trees are hard to interpret. Duplication within the same sub-tree is possible.
Good at: Star classification. Medical diagnosis. Credit risk analysis.

Neural networks
Pros: Extremely powerful. Can model even very complex relationships. No need to understand the underlying data. Almost works by “magic”.
Cons: Prone to overfitting. Long training time. Requires significant computing power for large datasets. Model is essentially unreadable.
Good at: Images. Video. “Human-intelligence” type tasks like driving or flying. Robotics.

Support Vector Machines
Pros: Can model complex, nonlinear relationships. Robust to noise (because they maximize margins).
Cons: Need to select a good kernel function. Model parameters are difficult to interpret. Sometimes numerical stability problems. Requires significant memory and processing power.
Good at: Classifying proteins. Text classification. Image classification. Handwriting recognition.

K-Nearest Neighbors
Pros: Simple. Powerful. No training involved (“lazy”). Naturally handles multiclass classification and regression.
Cons: Expensive and slow to predict new instances. Must define a meaningful distance function. Performs poorly on high-dimensionality datasets.
Good at: Low-dimensional datasets. Computer security: intrusion detection. Fault detection in semiconductor manufacturing. Video content retrieval. Gene expression. Protein-protein interaction.
Page 37
Knime – Data Mining Cheating…
http://www.kdnuggets.com/2015/07/good-data-science-machine-learning-cheat-sheets.html
https://github.com/soulmachine/machine-learning-cheat-sheet/raw/master/machine-learning-cheat-sheet.pdf
https://azure.microsoft.com/en-us/documentation/articles/machine-learning-algorithm-cheat-sheet/
Page 38
Time Series
Page 39
Time Series
A time series is a sequence of often equally spaced observations in chronological order over a continuous time interval.
Examples from science and business
Meteorology: weather data, e.g. temperature, pressure, wind.
Economy and finance: economic factors, financial indexes, exchange rates.
Business: sales, production or any activity of business
Industry: electric load, power consumption, sensors.
Medicine: physiological signals (EEG), heart-rate, patient temperature.
Web: views, clicks, logs.
Page 40
Time Series
Time-Series Decomposition
Decompose the variation of a series into 3 main parts…
A. Trend: This is a long-term change in the mean level, e.g. an increasing trend.
B. Seasonal effect: Many time series exhibit variation which is seasonal (e.g. annual) in period. The measurement and removal of such variation is called deseasonalizing the data.
C. Irregular fluctuations: After trend and cyclic variations have been removed from a set of data, there is a series of residuals, which may (or may not) be completely random.
Seasonal Trend Decomposition using LOESS (STL)
STL Method, 1990: http://www.wessa.net/download/stl.pdf
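The three parts can be inspected directly in R. The sketch below uses `nottem`, a monthly temperature series shipped with R, as a stand-in dataset (an assumption, any monthly series would do), and checks that the STL components add up to the original series again.

```r
# nottem: monthly air temperatures at Nottingham, part of R's datasets package
fit <- stl(nottem, s.window = "periodic")

# One column per part: seasonal, trend, remainder
parts <- fit$time.series
colnames(parts)

# STL is additive: seasonal + trend + remainder gives back the series
recomposed <- rowSums(parts)
max_diff <- max(abs(recomposed - as.numeric(nottem)))
print(max_diff)

# Deseasonalizing: remove the seasonal effect from the data
deseasonalized <- nottem - parts[, "seasonal"]
```

The remainder column corresponds to the irregular fluctuations, i.e. what is left after trend and seasonal effect have been removed.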
Page 41
Time Series in R
# load data
births <- scan("http://robjhyndman.com/tsdldata/data/nybirths.dat")
# convert to time series
birthstimeseries <- ts(births, frequency = 12, start = c(1946,1))
# Seasonal trend decomposition using Loess algorithm (STL)
births.stl = stl(birthstimeseries, s.window = "periodic")
# plot trend decomposition
plot(births.stl)
Seasonal Trend Decomposition using LOESS (LOcal regrESSion)
Page 42
Time Series – First Example
# load data
births <- scan("http://robjhyndman.com/tsdldata/data/nybirths.dat")
# convert to time series
birthstimeseries <- ts(births, frequency = 12, start = c(1946,1))
# build seasonal ARIMA model: non-seasonal order (1,0,0), seasonal order (2,1,0), period 12
birthsmodel <- arima(birthstimeseries, order = c(1,0,0), seasonal = list(order = c(2,1,0), period = 12))
# 24 month forecast based on the model
birthsforecast <- predict(birthsmodel, n.ahead = 24)
# calculate approximate bounds for 95% confidence level (1.96 standard errors)
U <- birthsforecast$pred + 1.96 * birthsforecast$se
L <- birthsforecast$pred - 1.96 * birthsforecast$se
# plot for time series, prediction and confidence interval
ts.plot(birthstimeseries, birthsforecast$pred, U, L, col = c(1,2,4,4), lty = c(1,1,2,2))
# add legend to plot
legend("topleft", c("Actual", "Forecast", "Error Bounds (95% Confidence)"),
col =c(1,2,4), lty = c(1,1,2))
Forecasting using ARIMA model
Page 43
Decision Tree Learning
Page 44
Classification Method Comparison
Try to understand the pattern of data...
…by applying visual data analysis
…by applying pairwise comparison of attributes
Is your data linearly separable?
Yes: Logistic Regression, Neural Networks…be cautious with Decision Tree or Random Forest
No: Random Forest or SVM
???: Random Forest…good balance of generalization and accuracy, and its computational cost is relatively low
But: Neural Networks can (not must) be the best solution…but it’s not easy to tune them to deliver good results (many parameters).
Page 45
Decision Tree
Decision Tree (partial) for Bike Sales Sample
A supervised learning method.
Purpose: Predict the target value of an item (record) based on observations from other items.
If the target value is from a finite set of values, we call it a classification tree.
Leaves represent class labels (e.g. Region), whereas branches represent conjunctions of features that lead to those class labels.
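A classification tree can be fitted in R with a few lines. The sketch below is not the bike sales example from the lecture: it uses the built-in `iris` data and the `rpart` package (a recommended package that ships with standard R installations) as stand-ins, and the 100/50 split is an arbitrary assumption.

```r
library(rpart)

set.seed(1)                              # reproducible train/test split
idx <- sample(nrow(iris), 100)           # 100 rows for training, 50 for testing

# Supervised learning: Species is the target, all other columns are features
tree <- rpart(Species ~ ., data = iris[idx, ], method = "class")

# Each path from the root to a leaf is a conjunction of feature tests,
# each leaf carries a class label
print(tree)

# Predict the class label of unseen records
pred <- predict(tree, iris[-idx, ], type = "class")
accuracy <- mean(pred == iris$Species[-idx])
print(accuracy)
```

Because the target `Species` takes values from a finite set, `method = "class"` produces a classification tree; a numeric target would lead to a regression tree instead.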
Page 46
Outlier Detection
Page 47
Outliers – Where are they?
Article: Anomaly Detection with Score functions based on Nearest Neighbour Graphs
https://arxiv.org/pdf/0910.5461.pdf
Page 48
Outlier Detection – Introduction
“An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.”
D. M. Hawkins, 1980
Two reasons for outliers:
Bad data: e.g. measurement errors, typos
Correct data: e.g. random variation of data, heavy-tailed distribution of data
[Figure: LOF - Local Outlier Factor]
Page 49
Outliers – Core Problem
Find them…
How to detect outliers?
Keep them (or not)…
Do we need to keep them?
They are the main subject of interest (e.g. in fraud detection).
They are an integral part of the statistical case.
Do we need to remove them?
For more robust statistics.
For clean data (remove bad data).
Treat them…
What action needs to be done?
Business purpose >>> outlier treatment
Page 50
Outliers – Core Problem
Outlier Labeling
Flag potential outliers for further investigation (i.e., are the potential outliers erroneous data, indicative of an inappropriate distributional model, and so on).
Outlier Accommodation
Use robust statistical techniques that will not be unduly affected by outliers. That is, if we cannot determine that potential outliers are erroneous observations, do we need to modify our statistical analysis to more appropriately account for these observations?
Outlier Identification
Formally test whether observations are outliers.
Source: Boris Iglewicz and David Hoaglin (1993), "Volume 16: How to Detect and Handle Outliers", The ASQC Basic References in Quality Control: Statistical Techniques, Edward F. Mykytka, Ph.D., Editor.
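Outlier labeling and accommodation can be illustrated with a few lines of R. The sketch below uses the modified z-score based on median and MAD, with the commonly cited cutoff of 3.5 from the Iglewicz and Hoaglin reference; the sample values are made up, and the method is only one possible labeling rule, not necessarily the one used in the lecture.

```r
# Made-up measurements with one suspicious value (assumption)
x <- c(9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 25.0)

# Outlier labeling with the modified z-score:
#   M_i = 0.6745 * (x_i - median(x)) / MAD
# where MAD is the unscaled median absolute deviation
mad_raw <- mad(x, constant = 1)
m <- 0.6745 * (x - median(x)) / mad_raw
flagged <- abs(m) > 3.5
print(x[flagged])

# Outlier accommodation: robust statistics are barely affected by the flagged value
center_mean   <- mean(x)      # pulled towards the outlier
center_median <- median(x)    # stays with the bulk of the data
print(c(mean = center_mean, median = center_median))
```

Whether the flagged value is then removed, kept or treated separately depends on the business purpose.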
Page 51
Excursus – PCA
Analysis of environmental controls on tsunami deposit texture
a) PCA loading plot of variables along components 1 and 2 (accounting for 58% of total variance), showing the
spatial relationship of the variables along these dimensions.
b) Scoreplot, showing individual data points plotted in coordinate space along components 1 and 2
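Loadings and scores of this kind can be computed in R with `prcomp`. The sketch below uses the built-in `iris` measurements as a stand-in (an assumption, the tsunami deposit data is not available here) and reports the share of variance covered by components 1 and 2.

```r
# PCA on the four numeric iris columns, centered and scaled to unit variance
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Share of total variance per component, and for components 1 and 2 together
var_share <- pca$sdev^2 / sum(pca$sdev^2)
print(sum(var_share[1:2]))

# Loadings: how each variable contributes to components 1 and 2
loadings <- pca$rotation[, 1:2]

# Scores: each data point expressed in the coordinates of components 1 and 2
scores <- pca$x[, 1:2]

# Loadings and scores in one plot
biplot(pca)
```

Points far away from the bulk in the score plot are candidates for a closer look as potential outliers.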
Page 52
Other Topics
Page 53
Information Gathering
Page 54
S.U.C.C.E.S.S.
SAY: Deliver messages. Reports and presentations serve to convey messages to readers and listeners.
UNIFY: Standardize content. Reports and presentations are more easily understood when the content displayed adheres to a uniform concept of meaning.
CONDENSE: Concentrate information. Reports and presentations are better understood when the contents have a high level of information density.
CHECK: Ensure quality. Reports and presentations are credible when the conveyed content is based on correct, appropriate, and current data.
ENABLE: Implement concept. Organizational, personnel-related, and technical requirements must be met in order to implement the rules.
SIMPLIFY: Avoid complication. Reports and presentations are better understood when noise and redundancy are avoided.
STRUCTURE: Group content. Reports and presentations should adhere to the requirements for homogeneous, mutually exclusive, and exhaustive structures.
Source: http://www.hichert.com
Page 55
S.U.C.C.E.S.S.
Page 56
#TheEnd
Page 57
Any Questions?