Transcript
Page 1: SQLDay2013_MarcinSzeliga_DataInDataMining

OUR SPONSORS AND PARTNERS

Page 2

Data in Data Mining with SQL Server 2012

Marcin Szeliga
www.sqlexpert.pl

http://blog.sqlexpert.pl/
http://www.facebook.com/SQLExpertpl

[email protected]

Page 3

Agenda

• Know Your Data

• What Kind of Data Do You Need?

• How Much Data Do You Need?

• The Problem of Missing Data

• Recap

Page 4

Know Your Data

• Data is not ready for mining

– Even if it comes from a DW/BI system

• In data mining, garbage in means really bad garbage out

• Assessing the data is the key

– What information does it hold, and how much?

– How much data is invalid/missing?

– Does the data support the business problem?

• The Data Profiling Task and Naive Bayes come to the rescue
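The kind of assessment the SSIS Data Profiling Task automates (null ratios, value distributions) can be sketched in plain Python. This is a simplified stand-in, not the task's actual output format, and the sample rows below are made up for illustration:

```python
# Minimal data-profiling sketch: per-column null ratio and distinct-value
# count, the two questions from the slide ("how much is missing?" and
# "how much information does it hold?").

def profile(rows, columns):
    """Return {column: (null_ratio, distinct_count)} for a list of dicts."""
    stats = {}
    for col in columns:
        values = [r.get(col) for r in rows]
        nulls = sum(1 for v in values if v is None)
        distinct = len({v for v in values if v is not None})
        stats[col] = (nulls / len(values), distinct)
    return stats

rows = [
    {"Age": 34, "City": "Warsaw"},
    {"Age": None, "City": "Warsaw"},
    {"Age": 51, "City": "Krakow"},
    {"Age": 29, "City": None},
]

report = profile(rows, ["Age", "City"])
print(report)  # Age: 25% nulls, 3 distinct; City: 25% nulls, 2 distinct
```

A column with a high null ratio or a single distinct value is an immediate candidate for exclusion before mining.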

Page 5

Demo

• Data Profiling

• Checking Attribute Relationships

Page 6

What Kind of Data Do You Need?

• Tabular – most of the time a row = a case, and a column = an attribute (or variable)

• An attribute can be:
– Single-valued or multi-valued
– Discrete, ordered, continuous or cyclical
– Monotonic or not

• As far as relations are concerned, an attribute can be:
– Independent or not
– Redundant or not
– Anachronistic or not

• T-SQL and Mining Structure Column Properties come to the rescue:
– Get rid of single-valued, monotonic, independent, redundant and anachronistic attributes
– Convert discrete attributes into continuous ones or vice versa
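The continuous-to-discrete conversion in the last sub-bullet can be sketched as follows. SSAS mining structures offer their own discretization methods (e.g. EqualAreas, Clusters); the equal-width bucketing below is a simplified Python stand-in, and the ages are made-up sample data:

```python
def equal_width_buckets(values, n_buckets):
    """Discretize a continuous attribute into n_buckets equal-width ranges,
    returning a bucket index (0..n_buckets-1) per value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    # min(..., n_buckets - 1) keeps the maximum value in the last bucket
    return [min(int((v - lo) / width), n_buckets - 1) for v in values]

ages = [18, 22, 25, 31, 40, 44, 58, 63]
buckets = equal_width_buckets(ages, 3)
print(buckets)  # [0, 0, 0, 0, 1, 1, 2, 2]
```

The same pass is a natural place to drop useless columns: a single-valued attribute is simply one where `len(set(values)) <= 1`.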

Page 7

Demo

• Adding variables

• Changing variables’ type

Page 8

How Much Data Do You Need? Part 1

• The raw amount of data is mostly irrelevant for data mining algorithms:
– Only the information hidden in this data matters
– The problem is not specific to any particular algorithm

• Does this mean that you can mine only a handful of data?
– Probably not! (more about this later)

• Data mining algorithms work by analyzing statistical relationships between variables:
– The distribution of each variable's values is the most important factor that determines the results

• Algorithm parameters come to the rescue
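"Only the distribution matters" is easy to demonstrate: a frequency count carries everything the algorithm sees, so duplicating every row changes nothing it can learn. The outcome values below are made up for illustration:

```python
from collections import Counter

# Hypothetical survival attribute: the algorithm learns from these
# frequencies, not from the raw row count.
outcomes = ["survived", "died", "died", "survived", "died", "died"]
dist = Counter(outcomes)
print(dist)                          # Counter({'died': 4, 'survived': 2})
print(dist["died"] / len(outcomes))  # ≈ 0.667

# Doubling the data leaves every relative frequency unchanged.
doubled = Counter(outcomes * 2)
print(doubled["died"] / (2 * len(outcomes)))  # still ≈ 0.667
```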

Page 9

Demo

• Who had the best chance to survive, according to Decision Trees?

• Tweaking Data Mining Algorithms

Page 10

How Much Data Do You Need? Part 2

• Model parameters are called variables because each of them can take on a variety of values:
– Those values contain some sort of pattern
– They are distributed across the variable's range in some specific way

• To see this pattern, display it graphically, as a curve

• At this stage you can only check if there is too little data

• Statistics comes to the rescue:
– You can measure the frequency of an attribute's states/values to get its variability
– Standard deviation can be used as the measure of the variability
• It's a sort of average distance between values and the mean

Page 11

Demo

• Checking Variability

• Variability as a data quantity measure

Page 12

How Much Data Do You Need? Part 3

• In real projects the population is too big to be measured:
– Most of the time we have to deal with sample data, data that represents only some part of the population
– Even if the whole population is available, we still need to divide it into at least two datasets (the training one and the test one)

• Convergence means that by adding more cases the variability will "settle down":
– When the sample is small, each new record can greatly change the value distribution
– As the sample gets bigger, adding new records barely makes any difference

• T-SQL and the OVER clause come to the rescue:
– Checking how much the deviation has changed between samples is easy
– But what if we are unlucky and in our sample some correlations between variables (e.g. between people under 18 and very high salaries) are not properly represented?
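The convergence test itself is language-agnostic; the slides do it with T-SQL and OVER, but the same check can be sketched in Python. The population below is hypothetical (normally distributed values with a true standard deviation of 10), generated just to show the "settling down":

```python
import random
from math import sqrt

def std_dev(values):
    mean = sum(values) / len(values)
    return sqrt(sum((v - mean) ** 2 for v in values) / len(values))

# Made-up population for illustration: 100,000 values ~ N(40, 10).
random.seed(42)
population = [random.gauss(40, 10) for _ in range(100_000)]

# Watch the standard deviation converge as the sample grows; once the
# change between successive sample sizes is tiny, the sample has
# "settled down" and new records barely matter.
prev = None
for n in (100, 1_000, 10_000, 100_000):
    sd = std_dev(population[:n])
    if prev is not None:
        print(f"n={n:>6}  std_dev={sd:6.2f}  change={abs(sd - prev):.3f}")
    prev = sd
```

As the slide warns, a converged standard deviation per variable still says nothing about whether rare combinations of variables are properly represented.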

Page 13

Demo

• Converging on a representative sample

• What about correlations between variables?

Page 14

The Problem of Missing Data

• NULL has two meanings:

– There is no data

– The data exists but it’s unknown

• I should stress the word "meaning":
– By removing NULLs you can lose valuable information
– By replacing NULLs you can severely skew your data

• A Missing Value Pattern (MVP) model comes to the rescue
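Flagging and predicting missing values (the next demo) can be sketched in miniature. The talk builds a real mining model for this; the stand-in "predictor" below is just a per-group mean, and all rows are made up. The key idea survives even in this toy form: keep a flag recording that the value was missing, so that information is not lost when the value is filled in:

```python
# Hypothetical rows with a missing Income value.
rows = [
    {"City": "Warsaw", "Income": 5000},
    {"City": "Warsaw", "Income": None},
    {"City": "Krakow", "Income": 4000},
    {"City": "Krakow", "Income": 6000},
]

# Stand-in "model": the mean of known incomes per city.
groups = {}
for r in rows:
    if r["Income"] is not None:
        groups.setdefault(r["City"], []).append(r["Income"])
means = {city: sum(v) / len(v) for city, v in groups.items()}

for r in rows:
    r["IncomeMissing"] = r["Income"] is None  # flag preserves the "meaning"
    if r["Income"] is None:
        r["Income"] = means[r["City"]]        # predicted replacement

print(rows[1])  # {'City': 'Warsaw', 'Income': 5000.0, 'IncomeMissing': True}
```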

Page 15

Demo

• Building Missing Value Pattern Model

• Flagging and predicting missing values

Page 16

Recap

• Never mine unknown data
• Basic preparation can completely change the results
• There is no easy way to say how much data you need for a particular model
– However, you can check if the sample is representative by measuring the differences in the variability
• This test should be done for the most important variables, if not for all of them
• This has nothing to do with the data mining algorithm itself
• But as long as you plan to use the data mining model to solve some real-world problems, you have to train it using representative data
• Missing data often hides important information – do not lose it
– Use separate data mining models to supplement it

Page 17

OUR SPONSORS AND PARTNERS

Organization: Polskie Stowarzyszenie Użytkowników SQL Server (PLSSUG)
Production: DATA MASTER Maciej Pilecki