Top Banner
UNIT – 1 Data Preprocessing
66

UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

Mar 26, 2015

Download

Documents

Brooke Riley
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

UNIT ndash 1Data Preprocessing

Data Preprocessing Learning Objectives

bull Understand why preprocess the databull Understand how to clean the databull Understand how to integrate and transform the data

Why preprocess the data Data cleaning Data integration and transformation

Why Data Preprocessing1 Data mining aims at discovering relationships and other

forms of knowledge from data in the real world

1 Data map entities in the application domain to symbolic representation through a measurement function

1 Data in the real world is dirty

incomplete missing data lacking attribute values lacking certain attributes of interest or containing only aggregate datanoisy containing errors such as measurement errors or outliersinconsistent containing discrepancies in codes or namesdistorted sampling distortion (A Change for worse)

4 No quality data no quality mining results (GIGO)

5 Quality decisions must be based on quality data

6 Data warehouse needs consistent integration of quality data

Data quality is multidimensional Accuracy Preciseness (=reliability) Completeness Consistency Timeliness Believability (=validity) Value added Interpretability Accessibility

Broad categories intrinsic contextual representational and

accessibility

Data cleaning Fill in missing values smooth noisy data identify or

remove outliers and resolve inconsistencies and errors

Data integration Integration of multiple databases data cubes or files

Data transformation Normalization and aggregation

Data reduction Obtains reduced representation in volume but

produces the same or similar analytical results Data discretization

Part of data reduction but with particular importance especially for numerical data

bull For data preprocessing to be successful it is essential to have an overall picture of your data

bull Descriptive data summarization techniques can be used to identify the typical properties of your data and highlight which

data values should be treated as noise or outliers

bull Thus we first introduce the basic concepts of descriptive data summarization before getting into the concrete workings of

data preprocessing techniques

bull For many data preprocessing tasks users would like to learn about data characteristics regarding both central tendency and dispersion of the data

bull Measures of central tendency include mean median mode and midrange while measures of data dispersion include quartiles interquartile range (IQR) and variance

bull These descriptive statistics are of great help in understanding the distribution of the data

bull Such measures have been studied extensively in the statistical literature

bull From the data mining point of view we need to examine how they can be computed efficiently in large databases

bull In particular it is necessary to introduce the notions of distributive measure algebraic measure and holistic measure

bull Knowing what kind of measure we are dealing with can help us choose an efficient implementation for it

In this section we look at various ways to measure the central tendency of data The most common and most effective numerical measure of the ldquocenterrdquo of a set of data is the (arithmetic) mean

mean1048576mode = 3(mean1048576median)

The degree to which numerical data tend to spread is called the dispersion or variance of the data The most common measures of data dispersion are

1) Range Quartiles Outliers and Boxplots 2) Variance and Standard Deviation

The range of the set is the difference between the largest (max()) and smallest (min()) values

The most commonly used percentiles other than the median are quartiles The first quartile denoted by Q1 is the 25th percentile the third quartile denoted by Q3 is the 75th percentile The quartiles including the median give some indication of the center spread and shape of a distribution The distance between the first and third quartiles is a simple measure of spread that gives the range covered by the middle half of the data

bull Boxplots are a popular way of visualizing a distribution A boxplot incorporates the five-number summary as follows

bull Typically the ends of the box are at the quartiles so that the box length is the interquartile range IQR

bull The median is marked by a line within the box

bull Two lines (called whiskers) outside the box extend to the smallest (Minimum) and largest (Maximum) observations

Aside from the bar charts pie charts and line graphs used in most statistical or graphical data presentation software packages there are other popular types of graphs for the display of data summaries and distributions These include histograms quantile plots q-q plots scatter plots and loess curves Such graphs are very helpful for the visual inspection of your data

23 Graphic Displays of Basic Descriptive Data Summaries

3 Data Cleaningbull Data cleaning tasks

Fill in missing valuesIdentify outliers and smooth out noisy data

Correct inconsistent data1) Missing Databull Data is not always available

a Eg many tuples have no recorded value for several attributes such as customer income in sales data

bull Missing data may be due to a equipment malfunction

b inconsistent with other recorded data and thus deletedc data not entered due to misunderstandingd certain data may not be considered important at the time of entrye not register history or changes of the data

f Missing data may need to be inferred

How to Handle Missing Data

bull Ignore the tuple usually done when class label is missing(assuming the tasks in classificationmdashnot effective when the percentage of missing values per attribute varies considerably)

bull Fill in the missing value manually tedious + infeasiblebull Use a global constant to fill in the missing value eg ldquounknownrdquo a new class bull Use the attribute mean to fill in the missing valuebull Use the attribute mean for all samples belonging to the same class to fill in the missing value smarterbull Use the most probable value to fill in the missing value inference-based such as Bayesian formula or decision tree

bull 2Noisy Data

bull Noise random error or variance in a measured variable

bull Incorrect attribute values may be due tondash faulty data collection instrumentsndash data entry problemsndash data transmission problemsndash technology limitationndash inconsistency in naming convention

bull Other data problems which requires data cleaningndash duplicate recordsndash inconsistent data

How to Handle Noisy Data

bull Binning method

- first sort data and partition into (equi-depth) bins- then one can smooth by bin means smooth by bin median

- smooth by bin boundaries etc

bull Clustering- detect and remove outliers

bull Combined computer and human inspection- detect suspicious values and check by human

bull Regression- smooth by fitting the data into regression functions

Binning Methods for Data SmoothingSorted data for price (in dollars)

48915 21 21 24 25 26 28 29 34

Partition into (equi-depth) bins - Bin 1 4 8 9 15 - Bin 2 21 21 24 25 - Bin 3 26 28 29 34

Smoothing by bin means - Bin 1 9 9 9 9 - Bin 2 23 23 23 23 - Bin 3 29 29 29 29

Smoothing by bin boundaries - Bin 1 4 4 4 15 - Bin 2 21 21 25 25 - Bin 3 26 26 26 34

Cluster Analysis

Regression

Data integration combines data from multiple sources into a

coherent store Schema integration

integrate metadata from different sources entity identification problem identify real world

entities from multiple data sources eg Acust-id Bcust-

Detecting and resolving data value conflicts for the same real world entity attribute values from

different sources are different possible reasons different representations

different scales eg metric vs British units

bull Redundant data occur often when integration of multiple databasesndash The same attribute may have different names in

different databasesndash One attribute may be a ldquoderivedrdquo attribute in

another table eg annual revenuebull Redundant data may be able to be detected by

correlational analysisbull Careful integration of the data from multiple sources

may help reduceavoid redundancies and inconsistencies and improve mining speed and quality

Smoothing remove noise from data Aggregation summarization data cube construction Generalization concept hierarchy climbing Normalization scaled to fall within a small specified

range min-max normalization z-score normalization normalization by decimal scaling

Attributefeature construction New attributes constructed from the given ones

bull min-max normalization

bull z-score normalization

bull normalization by decimal scaling

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(

A

A

devstand

meanvv

_

j

vv

10 Where j is the smallest integer such that Max(| |)lt1v

Min-max normalization to [new_minA new_maxA]

Ex Let income range $12000 to $98000 normalized to [00 10] Then $73600 is mapped to

Z-score normalization (μ mean σ standard deviation)

Ex Let μ = 54000 σ = 16000 Then Normalization by decimal scaling

71600)001(0001200098

0001260073

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(

A

Avv

j

vv

10

Where j is the smallest integer such that Max(|νrsquo|) lt 1

225100016

0005460073

Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume yet closely maintains the integrity of the original data That is mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results

Strategies for data reduction include the following1Data cube aggregation where aggregation operations are applied to the data in the construction of a data cube

2 Attribute subset selection where irrelevant weakly relevant or redundant attributes or dimensions may be detected and removed

3 Dimensionality reduction where encoding mechanisms are used to reduce the data set size

4 Numerosity reduction where the data are replaced or estimated by alternative smaller data representations such as parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods such as clustering sampling and the use of histograms

5 Discretization and concept hierarchy generation where rawdata values for attributes are replaced by ranges or higher conceptual levels

Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes (or dimensions)

The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes

Mining on a reduced set of attributes has an additional benefit

It reduces the number of attributes appearing in the discovered patterns helping to make the patterns easier to understand

The ldquoBestrdquo (and ldquoWorstrdquo) attributes are typically determined using tests of statistical significance which assume that the attributes are independent of one Many other attribute evaluation measures can be used such as the information gain measure used in building decision trees for classification

bull Basic heuristic methods of attribute subset selection include the following techniques some of which are illustrated in Figure

bull 1 Stepwise forward selection The procedure starts with an empty set of attributes as the reduced set The best of the original attributes is determined and added to the reduced set At each subsequent iteration or step the best of the remaining original attributes is added to the set

bull 2 Stepwise backward elimination The procedure starts with the full set of attributes At each step it removes the worst attribute remaining in the set

bull 3 Combination of forward selection and backward elimination The stepwise forward selection and backward elimination methods can be combined so that at each step the procedure selects the best attribute and removes the

worst from among the remaining attributes

bull 4 Decision tree induction Decision tree algorithms such as ID3 C45 and CART were originally intended for classification Decision tree induction constructs a flow chart like structure where each internal (non leaf) node denotes a test on an attribute each branch corresponds to an outcome of the test and each external (leaf) node denotes a class prediction At each node the algorithm chooses the ldquobestrdquo attribute to partition the data into individual classes When decision tree induction is used for attribute subset selection a tree is constructed from the given data All attributes that do not appear in the tree are assumed to be irrelevant The set of attributes appearing in the tree form the reduced subset of attributes

The stopping criteria for the methods may vary The procedure may employ a threshold on the measure used to determine when to stop the attribute selection process

In dimensionality reduction data encoding or transformations are applied so as to obtain a reduced or ldquocompressedrdquo representation of the original data If the original data can be reconstructed from the compressed data without any loss of information the data reduction is called lossless If instead we can reconstruct only an approximation of the original data then the data reduction is called lossy There are several well-tuned algorithms for string compression Although they are typically lossless they allow only limited manipulation of the data

In this section we instead focus on two popular and effective methods of lossy dimensionality reduction wavelet transforms and principal components analysis

bull Wavelet transforms can be applied to multidimensional data such as a data cube

bull This is done by first applying the transform to the first dimension then to the second and so on The computational complexity involved is linear with respect to the number of cells in the cube Wavelet transforms give good results on sparse or skewed data and on data with ordered attributes Lossy compression by wavelets is reportedly better than JPEG compression the current commercial standard Wavelet transforms have many real-world applications including the compression of fingerprint images computer vision analysis of time-series data and data cleaning

bull PCA is computationally inexpensive can be applied to ordered and unordered attributes and can handle sparse data and skewed data Multidimensional data of more than two dimensions can be handled by reducing the problem to two dimensions Principal components may be used as inputs to multiple regression and cluster analysis

bull In comparison with wavelet transforms PCA tends to be better at handling sparse data whereas wavelet transforms are more suitable for data of high dimensionality

ldquoCan we reduce the data volume by choosing alternative lsquosmallerrsquo forms of data representationrdquo

Techniques of numerosity reduction can indeed be applied for this purpose These techniques may be parametric or nonparametric

For parametric methods a model is used to estimate the data so that typically only the data parameters need to be stored instead of the actual data (Outliers may also be stored) Log-linear models which estimate discrete multidimensional probability distributions are an example

Nonparametric methods for storing reduced representations of the data include histograms clustering and sampling

Letrsquos look at each of the numerosity reduction techniques mentioned above

Linear regression Data are modeled to fit a straight line Often uses the least-square method to fit the

line Multiple regression allows a response variable Y

to be modeled as a linear function of multidimensional feature vector

Log-linear model approximates discrete multidimensional probability distributions

bull Linear regression Y = w X + bndash Two regression coefficients w and b specify the

line and are to be estimated by using the data at hand

ndash Using the least squares criterion to the known values of Y1 Y2 hellip X1 X2 hellip

bull Multiple regression Y = b0 + b1 X1 + b2 X2ndash Many nonlinear functions can be transformed into

the above

bull Log-linear modelsndash The multi-way table of joint probabilities is

approximated by a product of lower-order tablesndash Probability p(a b c d) = ab acad bcd

Histograms Histograms use binning to approximate data distributions and are a popular form of data reduction

A histogram for an attribute A partitions the data distribution of A into disjoint subsets or buckets If each bucket represents only a single attribute-valuefrequency pair the buckets are called singleton buckets Often buckets instead represent continuous ranges for the given attribute

There are several partitioning rules including the following

Equal-width In an equal-width histogram the width of each bucket range is uniform

Equal-frequency (or equidepth) In an equal-frequency histogram the buckets are created so that roughly the frequency of each bucket is constant (that is each bucket contains roughly the same number of contiguous data samples)

V-Optimal If we consider all of the possible histograms for a given number of buckets the V-Optimal histogram is the one with the least variance Histogram variance is a weighted sum of the original values that each bucket represents where bucket weight is equal to the number of values in the bucket

MaxDiff In a MaxDiff histogram we consider the difference between each pair of adjacent values

ClusteringClustering techniques consider data tuples as objects They

partition the objects into groups or clusters so that objects within a cluster are ldquosimilarrdquo to one another and ldquodissimilarrdquo to objects in other clusters

In data reduction the cluster representations of the data are used to replace the actual data The effectiveness of this technique depends on the nature of the data It is much more effective for data that can be organized into distinct clusters than for smeared data

Partition data set into clusters based on similarity and store cluster representation (eg centroid and diameter) only

Can be very effective if data is clustered but not if data is ldquosmearedrdquo

Can have hierarchical clustering and be stored in multi-dimensional index tree structures

There are many choices of clustering definitions and clustering algorithms

Cluster analysis will be studied in depth later

SamplingSampling can be used as a data reduction technique because it

allows a large data set to be represented by a much smaller random sample (or subset) of the data Suppose that a large data set D contains N tuples Letrsquos look at the most common ways that we could sample D for data reduction

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample sampling complexity is potentially sublinear to the size of the data

Simple Random sample without replacement

bull For a fixed sample size sampling complexity increases only linearly as the number of data dimensions

bull When applied to data reduction sampling is most commonly used to estimate the answer to an aggregate query

Sampling obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data Simple random sampling may have very poor

performance in the presence of skew Develop adaptive sampling methods

Stratified sampling Approximate the percentage of each class (or

subpopulation of interest) in the overall database Used in conjunction with skewed data

Note Sampling may not reduce database IOs (page at a time)

Sampling with or without Replacement

SRSWOR

(simple random

sample without

replacement)

SRSWR

Raw Data

Raw Data ClusterStratified Sample

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 2: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

Data Preprocessing Learning Objectives

bull Understand why preprocess the databull Understand how to clean the databull Understand how to integrate and transform the data

Why preprocess the data Data cleaning Data integration and transformation

Why Data Preprocessing1 Data mining aims at discovering relationships and other

forms of knowledge from data in the real world

1 Data map entities in the application domain to symbolic representation through a measurement function

1 Data in the real world is dirty

incomplete missing data lacking attribute values lacking certain attributes of interest or containing only aggregate datanoisy containing errors such as measurement errors or outliersinconsistent containing discrepancies in codes or namesdistorted sampling distortion (A Change for worse)

4 No quality data no quality mining results (GIGO)

5 Quality decisions must be based on quality data

6 Data warehouse needs consistent integration of quality data

Data quality is multidimensional Accuracy Preciseness (=reliability) Completeness Consistency Timeliness Believability (=validity) Value added Interpretability Accessibility

Broad categories intrinsic contextual representational and

accessibility

Data cleaning Fill in missing values smooth noisy data identify or

remove outliers and resolve inconsistencies and errors

Data integration Integration of multiple databases data cubes or files

Data transformation Normalization and aggregation

Data reduction Obtains reduced representation in volume but

produces the same or similar analytical results Data discretization

Part of data reduction but with particular importance especially for numerical data

bull For data preprocessing to be successful it is essential to have an overall picture of your data

bull Descriptive data summarization techniques can be used to identify the typical properties of your data and highlight which

data values should be treated as noise or outliers

bull Thus we first introduce the basic concepts of descriptive data summarization before getting into the concrete workings of

data preprocessing techniques

bull For many data preprocessing tasks users would like to learn about data characteristics regarding both central tendency and dispersion of the data

bull Measures of central tendency include mean median mode and midrange while measures of data dispersion include quartiles interquartile range (IQR) and variance

bull These descriptive statistics are of great help in understanding the distribution of the data

bull Such measures have been studied extensively in the statistical literature

bull From the data mining point of view we need to examine how they can be computed efficiently in large databases

bull In particular it is necessary to introduce the notions of distributive measure algebraic measure and holistic measure

bull Knowing what kind of measure we are dealing with can help us choose an efficient implementation for it

In this section we look at various ways to measure the central tendency of data The most common and most effective numerical measure of the ldquocenterrdquo of a set of data is the (arithmetic) mean

mean1048576mode = 3(mean1048576median)

The degree to which numerical data tend to spread is called the dispersion or variance of the data The most common measures of data dispersion are

1) Range Quartiles Outliers and Boxplots 2) Variance and Standard Deviation

The range of the set is the difference between the largest (max()) and smallest (min()) values

The most commonly used percentiles other than the median are quartiles The first quartile denoted by Q1 is the 25th percentile the third quartile denoted by Q3 is the 75th percentile The quartiles including the median give some indication of the center spread and shape of a distribution The distance between the first and third quartiles is a simple measure of spread that gives the range covered by the middle half of the data

bull Boxplots are a popular way of visualizing a distribution A boxplot incorporates the five-number summary as follows

bull Typically the ends of the box are at the quartiles so that the box length is the interquartile range IQR

bull The median is marked by a line within the box

bull Two lines (called whiskers) outside the box extend to the smallest (Minimum) and largest (Maximum) observations

Aside from the bar charts pie charts and line graphs used in most statistical or graphical data presentation software packages there are other popular types of graphs for the display of data summaries and distributions These include histograms quantile plots q-q plots scatter plots and loess curves Such graphs are very helpful for the visual inspection of your data

23 Graphic Displays of Basic Descriptive Data Summaries

3 Data Cleaningbull Data cleaning tasks

Fill in missing valuesIdentify outliers and smooth out noisy data

Correct inconsistent data1) Missing Databull Data is not always available

a Eg many tuples have no recorded value for several attributes such as customer income in sales data

bull Missing data may be due to a equipment malfunction

b inconsistent with other recorded data and thus deletedc data not entered due to misunderstandingd certain data may not be considered important at the time of entrye not register history or changes of the data

f Missing data may need to be inferred

How to Handle Missing Data

bull Ignore the tuple usually done when class label is missing(assuming the tasks in classificationmdashnot effective when the percentage of missing values per attribute varies considerably)

bull Fill in the missing value manually tedious + infeasiblebull Use a global constant to fill in the missing value eg ldquounknownrdquo a new class bull Use the attribute mean to fill in the missing valuebull Use the attribute mean for all samples belonging to the same class to fill in the missing value smarterbull Use the most probable value to fill in the missing value inference-based such as Bayesian formula or decision tree

bull 2Noisy Data

bull Noise random error or variance in a measured variable

bull Incorrect attribute values may be due tondash faulty data collection instrumentsndash data entry problemsndash data transmission problemsndash technology limitationndash inconsistency in naming convention

bull Other data problems which requires data cleaningndash duplicate recordsndash inconsistent data

How to Handle Noisy Data

bull Binning method

- first sort data and partition into (equi-depth) bins- then one can smooth by bin means smooth by bin median

- smooth by bin boundaries etc

bull Clustering- detect and remove outliers

bull Combined computer and human inspection- detect suspicious values and check by human

bull Regression- smooth by fitting the data into regression functions

Binning Methods for Data SmoothingSorted data for price (in dollars)

48915 21 21 24 25 26 28 29 34

Partition into (equi-depth) bins - Bin 1 4 8 9 15 - Bin 2 21 21 24 25 - Bin 3 26 28 29 34

Smoothing by bin means - Bin 1 9 9 9 9 - Bin 2 23 23 23 23 - Bin 3 29 29 29 29

Smoothing by bin boundaries - Bin 1 4 4 4 15 - Bin 2 21 21 25 25 - Bin 3 26 26 26 34

Cluster Analysis

Regression

Data integration combines data from multiple sources into a

coherent store Schema integration

integrate metadata from different sources entity identification problem identify real world

entities from multiple data sources eg Acust-id Bcust-

Detecting and resolving data value conflicts for the same real world entity attribute values from

different sources are different possible reasons different representations

different scales eg metric vs British units

bull Redundant data occur often when integration of multiple databasesndash The same attribute may have different names in

different databasesndash One attribute may be a ldquoderivedrdquo attribute in

another table eg annual revenuebull Redundant data may be able to be detected by

correlational analysisbull Careful integration of the data from multiple sources

may help reduceavoid redundancies and inconsistencies and improve mining speed and quality

Smoothing remove noise from data Aggregation summarization data cube construction Generalization concept hierarchy climbing Normalization scaled to fall within a small specified

range min-max normalization z-score normalization normalization by decimal scaling

Attributefeature construction New attributes constructed from the given ones

bull min-max normalization

bull z-score normalization

bull normalization by decimal scaling

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(

A

A

devstand

meanvv

_

j

vv

10 Where j is the smallest integer such that Max(| |)lt1v

Min-max normalization to [new_minA new_maxA]

Ex Let income range $12000 to $98000 normalized to [00 10] Then $73600 is mapped to

Z-score normalization (μ mean σ standard deviation)

Ex Let μ = 54000 σ = 16000 Then Normalization by decimal scaling

71600)001(0001200098

0001260073

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(

A

Avv

j

vv

10

Where j is the smallest integer such that Max(|νrsquo|) lt 1

225100016

0005460073

Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume yet closely maintains the integrity of the original data That is mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results

Strategies for data reduction include the following1Data cube aggregation where aggregation operations are applied to the data in the construction of a data cube

2 Attribute subset selection where irrelevant weakly relevant or redundant attributes or dimensions may be detected and removed

3 Dimensionality reduction where encoding mechanisms are used to reduce the data set size

4 Numerosity reduction where the data are replaced or estimated by alternative smaller data representations such as parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods such as clustering sampling and the use of histograms

5 Discretization and concept hierarchy generation where rawdata values for attributes are replaced by ranges or higher conceptual levels

Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes (or dimensions)

The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes

Mining on a reduced set of attributes has an additional benefit

It reduces the number of attributes appearing in the discovered patterns helping to make the patterns easier to understand

The ldquoBestrdquo (and ldquoWorstrdquo) attributes are typically determined using tests of statistical significance which assume that the attributes are independent of one Many other attribute evaluation measures can be used such as the information gain measure used in building decision trees for classification

bull Basic heuristic methods of attribute subset selection include the following techniques some of which are illustrated in Figure

bull 1 Stepwise forward selection The procedure starts with an empty set of attributes as the reduced set The best of the original attributes is determined and added to the reduced set At each subsequent iteration or step the best of the remaining original attributes is added to the set

bull 2 Stepwise backward elimination The procedure starts with the full set of attributes At each step it removes the worst attribute remaining in the set

bull 3 Combination of forward selection and backward elimination The stepwise forward selection and backward elimination methods can be combined so that at each step the procedure selects the best attribute and removes the

worst from among the remaining attributes

bull 4 Decision tree induction Decision tree algorithms such as ID3 C45 and CART were originally intended for classification Decision tree induction constructs a flow chart like structure where each internal (non leaf) node denotes a test on an attribute each branch corresponds to an outcome of the test and each external (leaf) node denotes a class prediction At each node the algorithm chooses the ldquobestrdquo attribute to partition the data into individual classes When decision tree induction is used for attribute subset selection a tree is constructed from the given data All attributes that do not appear in the tree are assumed to be irrelevant The set of attributes appearing in the tree form the reduced subset of attributes

The stopping criteria for the methods may vary The procedure may employ a threshold on the measure used to determine when to stop the attribute selection process

In dimensionality reduction data encoding or transformations are applied so as to obtain a reduced or ldquocompressedrdquo representation of the original data If the original data can be reconstructed from the compressed data without any loss of information the data reduction is called lossless If instead we can reconstruct only an approximation of the original data then the data reduction is called lossy There are several well-tuned algorithms for string compression Although they are typically lossless they allow only limited manipulation of the data

In this section we instead focus on two popular and effective methods of lossy dimensionality reduction wavelet transforms and principal components analysis

bull Wavelet transforms can be applied to multidimensional data such as a data cube

bull This is done by first applying the transform to the first dimension then to the second and so on The computational complexity involved is linear with respect to the number of cells in the cube Wavelet transforms give good results on sparse or skewed data and on data with ordered attributes Lossy compression by wavelets is reportedly better than JPEG compression the current commercial standard Wavelet transforms have many real-world applications including the compression of fingerprint images computer vision analysis of time-series data and data cleaning

bull PCA is computationally inexpensive can be applied to ordered and unordered attributes and can handle sparse data and skewed data Multidimensional data of more than two dimensions can be handled by reducing the problem to two dimensions Principal components may be used as inputs to multiple regression and cluster analysis

bull In comparison with wavelet transforms PCA tends to be better at handling sparse data whereas wavelet transforms are more suitable for data of high dimensionality

ldquoCan we reduce the data volume by choosing alternative lsquosmallerrsquo forms of data representationrdquo

Techniques of numerosity reduction can indeed be applied for this purpose These techniques may be parametric or nonparametric

For parametric methods a model is used to estimate the data so that typically only the data parameters need to be stored instead of the actual data (Outliers may also be stored) Log-linear models which estimate discrete multidimensional probability distributions are an example

Nonparametric methods for storing reduced representations of the data include histograms clustering and sampling

Letrsquos look at each of the numerosity reduction techniques mentioned above

Linear regression Data are modeled to fit a straight line Often uses the least-square method to fit the

line Multiple regression allows a response variable Y

to be modeled as a linear function of multidimensional feature vector

Log-linear model approximates discrete multidimensional probability distributions

bull Linear regression Y = w X + bndash Two regression coefficients w and b specify the

line and are to be estimated by using the data at hand

ndash Using the least squares criterion to the known values of Y1 Y2 hellip X1 X2 hellip

bull Multiple regression Y = b0 + b1 X1 + b2 X2ndash Many nonlinear functions can be transformed into

the above

bull Log-linear modelsndash The multi-way table of joint probabilities is

approximated by a product of lower-order tablesndash Probability p(a b c d) = ab acad bcd

Histograms Histograms use binning to approximate data distributions and are a popular form of data reduction

A histogram for an attribute A partitions the data distribution of A into disjoint subsets or buckets If each bucket represents only a single attribute-valuefrequency pair the buckets are called singleton buckets Often buckets instead represent continuous ranges for the given attribute

There are several partitioning rules including the following

Equal-width In an equal-width histogram the width of each bucket range is uniform

Equal-frequency (or equidepth) In an equal-frequency histogram the buckets are created so that roughly the frequency of each bucket is constant (that is each bucket contains roughly the same number of contiguous data samples)

V-Optimal If we consider all of the possible histograms for a given number of buckets the V-Optimal histogram is the one with the least variance Histogram variance is a weighted sum of the original values that each bucket represents where bucket weight is equal to the number of values in the bucket

MaxDiff In a MaxDiff histogram we consider the difference between each pair of adjacent values

ClusteringClustering techniques consider data tuples as objects They

partition the objects into groups or clusters so that objects within a cluster are ldquosimilarrdquo to one another and ldquodissimilarrdquo to objects in other clusters

In data reduction the cluster representations of the data are used to replace the actual data The effectiveness of this technique depends on the nature of the data It is much more effective for data that can be organized into distinct clusters than for smeared data

Partition data set into clusters based on similarity and store cluster representation (eg centroid and diameter) only

Can be very effective if data is clustered but not if data is ldquosmearedrdquo

Can have hierarchical clustering and be stored in multi-dimensional index tree structures

There are many choices of clustering definitions and clustering algorithms

Cluster analysis will be studied in depth later

SamplingSampling can be used as a data reduction technique because it

allows a large data set to be represented by a much smaller random sample (or subset) of the data Suppose that a large data set D contains N tuples Letrsquos look at the most common ways that we could sample D for data reduction

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample sampling complexity is potentially sublinear to the size of the data

Simple Random sample without replacement

bull For a fixed sample size sampling complexity increases only linearly as the number of data dimensions

bull When applied to data reduction sampling is most commonly used to estimate the answer to an aggregate query

Sampling obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data Simple random sampling may have very poor

performance in the presence of skew Develop adaptive sampling methods

Stratified sampling Approximate the percentage of each class (or

subpopulation of interest) in the overall database Used in conjunction with skewed data

Note Sampling may not reduce database IOs (page at a time)

Sampling with or without Replacement

SRSWOR

(simple random

sample without

replacement)

SRSWR

Raw Data

Raw Data ClusterStratified Sample

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 3: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

Why Data Preprocessing1 Data mining aims at discovering relationships and other

forms of knowledge from data in the real world

1 Data map entities in the application domain to symbolic representation through a measurement function

1 Data in the real world is dirty

incomplete missing data lacking attribute values lacking certain attributes of interest or containing only aggregate datanoisy containing errors such as measurement errors or outliersinconsistent containing discrepancies in codes or namesdistorted sampling distortion (A Change for worse)

4 No quality data no quality mining results (GIGO)

5 Quality decisions must be based on quality data

6 Data warehouse needs consistent integration of quality data

Data quality is multidimensional Accuracy Preciseness (=reliability) Completeness Consistency Timeliness Believability (=validity) Value added Interpretability Accessibility

Broad categories intrinsic contextual representational and

accessibility

Data cleaning Fill in missing values smooth noisy data identify or

remove outliers and resolve inconsistencies and errors

Data integration Integration of multiple databases data cubes or files

Data transformation Normalization and aggregation

Data reduction Obtains reduced representation in volume but

produces the same or similar analytical results Data discretization

Part of data reduction but with particular importance especially for numerical data

bull For data preprocessing to be successful it is essential to have an overall picture of your data

bull Descriptive data summarization techniques can be used to identify the typical properties of your data and highlight which

data values should be treated as noise or outliers

bull Thus we first introduce the basic concepts of descriptive data summarization before getting into the concrete workings of

data preprocessing techniques

bull For many data preprocessing tasks users would like to learn about data characteristics regarding both central tendency and dispersion of the data

bull Measures of central tendency include mean median mode and midrange while measures of data dispersion include quartiles interquartile range (IQR) and variance

bull These descriptive statistics are of great help in understanding the distribution of the data

bull Such measures have been studied extensively in the statistical literature

bull From the data mining point of view we need to examine how they can be computed efficiently in large databases

bull In particular it is necessary to introduce the notions of distributive measure algebraic measure and holistic measure

bull Knowing what kind of measure we are dealing with can help us choose an efficient implementation for it

In this section we look at various ways to measure the central tendency of data The most common and most effective numerical measure of the ldquocenterrdquo of a set of data is the (arithmetic) mean

mean1048576mode = 3(mean1048576median)

The degree to which numerical data tend to spread is called the dispersion or variance of the data The most common measures of data dispersion are

1) Range Quartiles Outliers and Boxplots 2) Variance and Standard Deviation

The range of the set is the difference between the largest (max()) and smallest (min()) values

The most commonly used percentiles other than the median are quartiles The first quartile denoted by Q1 is the 25th percentile the third quartile denoted by Q3 is the 75th percentile The quartiles including the median give some indication of the center spread and shape of a distribution The distance between the first and third quartiles is a simple measure of spread that gives the range covered by the middle half of the data

bull Boxplots are a popular way of visualizing a distribution A boxplot incorporates the five-number summary as follows

bull Typically the ends of the box are at the quartiles so that the box length is the interquartile range IQR

bull The median is marked by a line within the box

bull Two lines (called whiskers) outside the box extend to the smallest (Minimum) and largest (Maximum) observations

Aside from the bar charts pie charts and line graphs used in most statistical or graphical data presentation software packages there are other popular types of graphs for the display of data summaries and distributions These include histograms quantile plots q-q plots scatter plots and loess curves Such graphs are very helpful for the visual inspection of your data

23 Graphic Displays of Basic Descriptive Data Summaries

3 Data Cleaningbull Data cleaning tasks

Fill in missing valuesIdentify outliers and smooth out noisy data

Correct inconsistent data1) Missing Databull Data is not always available

a Eg many tuples have no recorded value for several attributes such as customer income in sales data

bull Missing data may be due to a equipment malfunction

b inconsistent with other recorded data and thus deletedc data not entered due to misunderstandingd certain data may not be considered important at the time of entrye not register history or changes of the data

f Missing data may need to be inferred

How to Handle Missing Data

bull Ignore the tuple usually done when class label is missing(assuming the tasks in classificationmdashnot effective when the percentage of missing values per attribute varies considerably)

bull Fill in the missing value manually tedious + infeasiblebull Use a global constant to fill in the missing value eg ldquounknownrdquo a new class bull Use the attribute mean to fill in the missing valuebull Use the attribute mean for all samples belonging to the same class to fill in the missing value smarterbull Use the most probable value to fill in the missing value inference-based such as Bayesian formula or decision tree

bull 2Noisy Data

bull Noise random error or variance in a measured variable

bull Incorrect attribute values may be due tondash faulty data collection instrumentsndash data entry problemsndash data transmission problemsndash technology limitationndash inconsistency in naming convention

bull Other data problems which requires data cleaningndash duplicate recordsndash inconsistent data

How to Handle Noisy Data

bull Binning method

- first sort data and partition into (equi-depth) bins- then one can smooth by bin means smooth by bin median

- smooth by bin boundaries etc

bull Clustering- detect and remove outliers

bull Combined computer and human inspection- detect suspicious values and check by human

bull Regression- smooth by fitting the data into regression functions

Binning Methods for Data SmoothingSorted data for price (in dollars)

48915 21 21 24 25 26 28 29 34

Partition into (equi-depth) bins - Bin 1 4 8 9 15 - Bin 2 21 21 24 25 - Bin 3 26 28 29 34

Smoothing by bin means - Bin 1 9 9 9 9 - Bin 2 23 23 23 23 - Bin 3 29 29 29 29

Smoothing by bin boundaries - Bin 1 4 4 4 15 - Bin 2 21 21 25 25 - Bin 3 26 26 26 34

Cluster Analysis

Regression

Data integration combines data from multiple sources into a

coherent store Schema integration

integrate metadata from different sources entity identification problem identify real world

entities from multiple data sources eg Acust-id Bcust-

Detecting and resolving data value conflicts for the same real world entity attribute values from

different sources are different possible reasons different representations

different scales eg metric vs British units

bull Redundant data occur often when integration of multiple databasesndash The same attribute may have different names in

different databasesndash One attribute may be a ldquoderivedrdquo attribute in

another table eg annual revenuebull Redundant data may be able to be detected by

correlational analysisbull Careful integration of the data from multiple sources

may help reduceavoid redundancies and inconsistencies and improve mining speed and quality

Smoothing remove noise from data Aggregation summarization data cube construction Generalization concept hierarchy climbing Normalization scaled to fall within a small specified

range min-max normalization z-score normalization normalization by decimal scaling

Attributefeature construction New attributes constructed from the given ones

bull min-max normalization

bull z-score normalization

bull normalization by decimal scaling

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(

A

A

devstand

meanvv

_

j

vv

10 Where j is the smallest integer such that Max(| |)lt1v

Min-max normalization to [new_minA new_maxA]

Ex Let income range $12000 to $98000 normalized to [00 10] Then $73600 is mapped to

Z-score normalization (μ mean σ standard deviation)

Ex Let μ = 54000 σ = 16000 Then Normalization by decimal scaling

71600)001(0001200098

0001260073

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(

A

Avv

j

vv

10

Where j is the smallest integer such that Max(|νrsquo|) lt 1

225100016

0005460073

Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume yet closely maintains the integrity of the original data That is mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results

Strategies for data reduction include the following1Data cube aggregation where aggregation operations are applied to the data in the construction of a data cube

2 Attribute subset selection where irrelevant weakly relevant or redundant attributes or dimensions may be detected and removed

3 Dimensionality reduction where encoding mechanisms are used to reduce the data set size

4 Numerosity reduction where the data are replaced or estimated by alternative smaller data representations such as parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods such as clustering sampling and the use of histograms

5 Discretization and concept hierarchy generation where rawdata values for attributes are replaced by ranges or higher conceptual levels

Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes (or dimensions)

The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes

Mining on a reduced set of attributes has an additional benefit

It reduces the number of attributes appearing in the discovered patterns helping to make the patterns easier to understand

The ldquoBestrdquo (and ldquoWorstrdquo) attributes are typically determined using tests of statistical significance which assume that the attributes are independent of one Many other attribute evaluation measures can be used such as the information gain measure used in building decision trees for classification

bull Basic heuristic methods of attribute subset selection include the following techniques some of which are illustrated in Figure

bull 1 Stepwise forward selection The procedure starts with an empty set of attributes as the reduced set The best of the original attributes is determined and added to the reduced set At each subsequent iteration or step the best of the remaining original attributes is added to the set

bull 2 Stepwise backward elimination The procedure starts with the full set of attributes At each step it removes the worst attribute remaining in the set

bull 3 Combination of forward selection and backward elimination The stepwise forward selection and backward elimination methods can be combined so that at each step the procedure selects the best attribute and removes the

worst from among the remaining attributes

bull 4 Decision tree induction Decision tree algorithms such as ID3 C45 and CART were originally intended for classification Decision tree induction constructs a flow chart like structure where each internal (non leaf) node denotes a test on an attribute each branch corresponds to an outcome of the test and each external (leaf) node denotes a class prediction At each node the algorithm chooses the ldquobestrdquo attribute to partition the data into individual classes When decision tree induction is used for attribute subset selection a tree is constructed from the given data All attributes that do not appear in the tree are assumed to be irrelevant The set of attributes appearing in the tree form the reduced subset of attributes

The stopping criteria for the methods may vary The procedure may employ a threshold on the measure used to determine when to stop the attribute selection process

In dimensionality reduction data encoding or transformations are applied so as to obtain a reduced or ldquocompressedrdquo representation of the original data If the original data can be reconstructed from the compressed data without any loss of information the data reduction is called lossless If instead we can reconstruct only an approximation of the original data then the data reduction is called lossy There are several well-tuned algorithms for string compression Although they are typically lossless they allow only limited manipulation of the data

In this section we instead focus on two popular and effective methods of lossy dimensionality reduction wavelet transforms and principal components analysis

bull Wavelet transforms can be applied to multidimensional data such as a data cube

bull This is done by first applying the transform to the first dimension then to the second and so on The computational complexity involved is linear with respect to the number of cells in the cube Wavelet transforms give good results on sparse or skewed data and on data with ordered attributes Lossy compression by wavelets is reportedly better than JPEG compression the current commercial standard Wavelet transforms have many real-world applications including the compression of fingerprint images computer vision analysis of time-series data and data cleaning

bull PCA is computationally inexpensive can be applied to ordered and unordered attributes and can handle sparse data and skewed data Multidimensional data of more than two dimensions can be handled by reducing the problem to two dimensions Principal components may be used as inputs to multiple regression and cluster analysis

bull In comparison with wavelet transforms PCA tends to be better at handling sparse data whereas wavelet transforms are more suitable for data of high dimensionality

ldquoCan we reduce the data volume by choosing alternative lsquosmallerrsquo forms of data representationrdquo

Techniques of numerosity reduction can indeed be applied for this purpose These techniques may be parametric or nonparametric

For parametric methods a model is used to estimate the data so that typically only the data parameters need to be stored instead of the actual data (Outliers may also be stored) Log-linear models which estimate discrete multidimensional probability distributions are an example

Nonparametric methods for storing reduced representations of the data include histograms clustering and sampling

Letrsquos look at each of the numerosity reduction techniques mentioned above

Linear regression Data are modeled to fit a straight line Often uses the least-square method to fit the

line Multiple regression allows a response variable Y

to be modeled as a linear function of multidimensional feature vector

Log-linear model approximates discrete multidimensional probability distributions

bull Linear regression Y = w X + bndash Two regression coefficients w and b specify the

line and are to be estimated by using the data at hand

ndash Using the least squares criterion to the known values of Y1 Y2 hellip X1 X2 hellip

bull Multiple regression Y = b0 + b1 X1 + b2 X2ndash Many nonlinear functions can be transformed into

the above

bull Log-linear modelsndash The multi-way table of joint probabilities is

approximated by a product of lower-order tablesndash Probability p(a b c d) = ab acad bcd

Histograms Histograms use binning to approximate data distributions and are a popular form of data reduction

A histogram for an attribute A partitions the data distribution of A into disjoint subsets or buckets If each bucket represents only a single attribute-valuefrequency pair the buckets are called singleton buckets Often buckets instead represent continuous ranges for the given attribute

There are several partitioning rules including the following

Equal-width In an equal-width histogram the width of each bucket range is uniform

Equal-frequency (or equidepth) In an equal-frequency histogram the buckets are created so that roughly the frequency of each bucket is constant (that is each bucket contains roughly the same number of contiguous data samples)

V-Optimal If we consider all of the possible histograms for a given number of buckets the V-Optimal histogram is the one with the least variance Histogram variance is a weighted sum of the original values that each bucket represents where bucket weight is equal to the number of values in the bucket

MaxDiff In a MaxDiff histogram we consider the difference between each pair of adjacent values

ClusteringClustering techniques consider data tuples as objects They

partition the objects into groups or clusters so that objects within a cluster are ldquosimilarrdquo to one another and ldquodissimilarrdquo to objects in other clusters

In data reduction the cluster representations of the data are used to replace the actual data The effectiveness of this technique depends on the nature of the data It is much more effective for data that can be organized into distinct clusters than for smeared data

Partition data set into clusters based on similarity and store cluster representation (eg centroid and diameter) only

Can be very effective if data is clustered but not if data is ldquosmearedrdquo

Can have hierarchical clustering and be stored in multi-dimensional index tree structures

There are many choices of clustering definitions and clustering algorithms

Cluster analysis will be studied in depth later

SamplingSampling can be used as a data reduction technique because it

allows a large data set to be represented by a much smaller random sample (or subset) of the data Suppose that a large data set D contains N tuples Letrsquos look at the most common ways that we could sample D for data reduction

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample sampling complexity is potentially sublinear to the size of the data

Simple Random sample without replacement

bull For a fixed sample size sampling complexity increases only linearly as the number of data dimensions

bull When applied to data reduction sampling is most commonly used to estimate the answer to an aggregate query

Sampling obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data Simple random sampling may have very poor

performance in the presence of skew Develop adaptive sampling methods

Stratified sampling Approximate the percentage of each class (or

subpopulation of interest) in the overall database Used in conjunction with skewed data

Note Sampling may not reduce database IOs (page at a time)

Sampling with or without Replacement

SRSWOR

(simple random

sample without

replacement)

SRSWR

Raw Data

Raw Data ClusterStratified Sample

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 4: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

Data quality is multidimensional Accuracy Preciseness (=reliability) Completeness Consistency Timeliness Believability (=validity) Value added Interpretability Accessibility

Broad categories intrinsic contextual representational and

accessibility

Data cleaning Fill in missing values smooth noisy data identify or

remove outliers and resolve inconsistencies and errors

Data integration Integration of multiple databases data cubes or files

Data transformation Normalization and aggregation

Data reduction Obtains reduced representation in volume but

produces the same or similar analytical results Data discretization

Part of data reduction but with particular importance especially for numerical data

bull For data preprocessing to be successful it is essential to have an overall picture of your data

bull Descriptive data summarization techniques can be used to identify the typical properties of your data and highlight which

data values should be treated as noise or outliers

bull Thus we first introduce the basic concepts of descriptive data summarization before getting into the concrete workings of

data preprocessing techniques

bull For many data preprocessing tasks users would like to learn about data characteristics regarding both central tendency and dispersion of the data

bull Measures of central tendency include mean median mode and midrange while measures of data dispersion include quartiles interquartile range (IQR) and variance

bull These descriptive statistics are of great help in understanding the distribution of the data

bull Such measures have been studied extensively in the statistical literature

bull From the data mining point of view we need to examine how they can be computed efficiently in large databases

bull In particular it is necessary to introduce the notions of distributive measure algebraic measure and holistic measure

bull Knowing what kind of measure we are dealing with can help us choose an efficient implementation for it

In this section we look at various ways to measure the central tendency of data The most common and most effective numerical measure of the ldquocenterrdquo of a set of data is the (arithmetic) mean

mean1048576mode = 3(mean1048576median)

The degree to which numerical data tend to spread is called the dispersion or variance of the data The most common measures of data dispersion are

1) Range Quartiles Outliers and Boxplots 2) Variance and Standard Deviation

The range of the set is the difference between the largest (max()) and smallest (min()) values

The most commonly used percentiles other than the median are quartiles The first quartile denoted by Q1 is the 25th percentile the third quartile denoted by Q3 is the 75th percentile The quartiles including the median give some indication of the center spread and shape of a distribution The distance between the first and third quartiles is a simple measure of spread that gives the range covered by the middle half of the data

bull Boxplots are a popular way of visualizing a distribution A boxplot incorporates the five-number summary as follows

bull Typically the ends of the box are at the quartiles so that the box length is the interquartile range IQR

bull The median is marked by a line within the box

bull Two lines (called whiskers) outside the box extend to the smallest (Minimum) and largest (Maximum) observations

Aside from the bar charts pie charts and line graphs used in most statistical or graphical data presentation software packages there are other popular types of graphs for the display of data summaries and distributions These include histograms quantile plots q-q plots scatter plots and loess curves Such graphs are very helpful for the visual inspection of your data

23 Graphic Displays of Basic Descriptive Data Summaries

3 Data Cleaningbull Data cleaning tasks

Fill in missing valuesIdentify outliers and smooth out noisy data

Correct inconsistent data1) Missing Databull Data is not always available

a Eg many tuples have no recorded value for several attributes such as customer income in sales data

bull Missing data may be due to a equipment malfunction

b inconsistent with other recorded data and thus deletedc data not entered due to misunderstandingd certain data may not be considered important at the time of entrye not register history or changes of the data

f Missing data may need to be inferred

How to Handle Missing Data

bull Ignore the tuple usually done when class label is missing(assuming the tasks in classificationmdashnot effective when the percentage of missing values per attribute varies considerably)

bull Fill in the missing value manually tedious + infeasiblebull Use a global constant to fill in the missing value eg ldquounknownrdquo a new class bull Use the attribute mean to fill in the missing valuebull Use the attribute mean for all samples belonging to the same class to fill in the missing value smarterbull Use the most probable value to fill in the missing value inference-based such as Bayesian formula or decision tree

bull 2Noisy Data

bull Noise random error or variance in a measured variable

bull Incorrect attribute values may be due tondash faulty data collection instrumentsndash data entry problemsndash data transmission problemsndash technology limitationndash inconsistency in naming convention

bull Other data problems which requires data cleaningndash duplicate recordsndash inconsistent data

How to Handle Noisy Data

bull Binning method

- first sort data and partition into (equi-depth) bins- then one can smooth by bin means smooth by bin median

- smooth by bin boundaries etc

bull Clustering- detect and remove outliers

bull Combined computer and human inspection- detect suspicious values and check by human

bull Regression- smooth by fitting the data into regression functions

Binning Methods for Data SmoothingSorted data for price (in dollars)

48915 21 21 24 25 26 28 29 34

Partition into (equi-depth) bins - Bin 1 4 8 9 15 - Bin 2 21 21 24 25 - Bin 3 26 28 29 34

Smoothing by bin means - Bin 1 9 9 9 9 - Bin 2 23 23 23 23 - Bin 3 29 29 29 29

Smoothing by bin boundaries - Bin 1 4 4 4 15 - Bin 2 21 21 25 25 - Bin 3 26 26 26 34

Cluster Analysis

Regression

Data integration combines data from multiple sources into a

coherent store Schema integration

integrate metadata from different sources entity identification problem identify real world

entities from multiple data sources eg Acust-id Bcust-

Detecting and resolving data value conflicts for the same real world entity attribute values from

different sources are different possible reasons different representations

different scales eg metric vs British units

bull Redundant data occur often when integration of multiple databasesndash The same attribute may have different names in

different databasesndash One attribute may be a ldquoderivedrdquo attribute in

another table eg annual revenuebull Redundant data may be able to be detected by

correlational analysisbull Careful integration of the data from multiple sources

may help reduceavoid redundancies and inconsistencies and improve mining speed and quality

Smoothing remove noise from data Aggregation summarization data cube construction Generalization concept hierarchy climbing Normalization scaled to fall within a small specified

range min-max normalization z-score normalization normalization by decimal scaling

Attributefeature construction New attributes constructed from the given ones

bull min-max normalization

bull z-score normalization

bull normalization by decimal scaling

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(

A

A

devstand

meanvv

_

j

vv

10 Where j is the smallest integer such that Max(| |)lt1v

Min-max normalization to [new_minA new_maxA]

Ex Let income range $12000 to $98000 normalized to [00 10] Then $73600 is mapped to

Z-score normalization (μ mean σ standard deviation)

Ex Let μ = 54000 σ = 16000 Then Normalization by decimal scaling

71600)001(0001200098

0001260073

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(

A

Avv

j

vv

10

Where j is the smallest integer such that Max(|νrsquo|) lt 1

225100016

0005460073

Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume yet closely maintains the integrity of the original data That is mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results

Strategies for data reduction include the following1Data cube aggregation where aggregation operations are applied to the data in the construction of a data cube

2 Attribute subset selection where irrelevant weakly relevant or redundant attributes or dimensions may be detected and removed

3 Dimensionality reduction where encoding mechanisms are used to reduce the data set size

4 Numerosity reduction where the data are replaced or estimated by alternative smaller data representations such as parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods such as clustering sampling and the use of histograms

5 Discretization and concept hierarchy generation where rawdata values for attributes are replaced by ranges or higher conceptual levels

Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes (or dimensions)

The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes

Mining on a reduced set of attributes has an additional benefit

It reduces the number of attributes appearing in the discovered patterns helping to make the patterns easier to understand

The ldquoBestrdquo (and ldquoWorstrdquo) attributes are typically determined using tests of statistical significance which assume that the attributes are independent of one Many other attribute evaluation measures can be used such as the information gain measure used in building decision trees for classification

bull Basic heuristic methods of attribute subset selection include the following techniques some of which are illustrated in Figure

bull 1 Stepwise forward selection The procedure starts with an empty set of attributes as the reduced set The best of the original attributes is determined and added to the reduced set At each subsequent iteration or step the best of the remaining original attributes is added to the set

bull 2 Stepwise backward elimination The procedure starts with the full set of attributes At each step it removes the worst attribute remaining in the set

bull 3 Combination of forward selection and backward elimination The stepwise forward selection and backward elimination methods can be combined so that at each step the procedure selects the best attribute and removes the

worst from among the remaining attributes

bull 4 Decision tree induction Decision tree algorithms such as ID3 C45 and CART were originally intended for classification Decision tree induction constructs a flow chart like structure where each internal (non leaf) node denotes a test on an attribute each branch corresponds to an outcome of the test and each external (leaf) node denotes a class prediction At each node the algorithm chooses the ldquobestrdquo attribute to partition the data into individual classes When decision tree induction is used for attribute subset selection a tree is constructed from the given data All attributes that do not appear in the tree are assumed to be irrelevant The set of attributes appearing in the tree form the reduced subset of attributes

The stopping criteria for the methods may vary The procedure may employ a threshold on the measure used to determine when to stop the attribute selection process

In dimensionality reduction data encoding or transformations are applied so as to obtain a reduced or ldquocompressedrdquo representation of the original data If the original data can be reconstructed from the compressed data without any loss of information the data reduction is called lossless If instead we can reconstruct only an approximation of the original data then the data reduction is called lossy There are several well-tuned algorithms for string compression Although they are typically lossless they allow only limited manipulation of the data

In this section we instead focus on two popular and effective methods of lossy dimensionality reduction wavelet transforms and principal components analysis

bull Wavelet transforms can be applied to multidimensional data such as a data cube

bull This is done by first applying the transform to the first dimension then to the second and so on The computational complexity involved is linear with respect to the number of cells in the cube Wavelet transforms give good results on sparse or skewed data and on data with ordered attributes Lossy compression by wavelets is reportedly better than JPEG compression the current commercial standard Wavelet transforms have many real-world applications including the compression of fingerprint images computer vision analysis of time-series data and data cleaning

bull PCA is computationally inexpensive can be applied to ordered and unordered attributes and can handle sparse data and skewed data Multidimensional data of more than two dimensions can be handled by reducing the problem to two dimensions Principal components may be used as inputs to multiple regression and cluster analysis

bull In comparison with wavelet transforms PCA tends to be better at handling sparse data whereas wavelet transforms are more suitable for data of high dimensionality

ldquoCan we reduce the data volume by choosing alternative lsquosmallerrsquo forms of data representationrdquo

Techniques of numerosity reduction can indeed be applied for this purpose These techniques may be parametric or nonparametric

For parametric methods a model is used to estimate the data so that typically only the data parameters need to be stored instead of the actual data (Outliers may also be stored) Log-linear models which estimate discrete multidimensional probability distributions are an example

Nonparametric methods for storing reduced representations of the data include histograms clustering and sampling

Letrsquos look at each of the numerosity reduction techniques mentioned above

Linear regression Data are modeled to fit a straight line Often uses the least-square method to fit the

line Multiple regression allows a response variable Y

to be modeled as a linear function of multidimensional feature vector

Log-linear model approximates discrete multidimensional probability distributions

bull Linear regression Y = w X + bndash Two regression coefficients w and b specify the

line and are to be estimated by using the data at hand

ndash Using the least squares criterion to the known values of Y1 Y2 hellip X1 X2 hellip

bull Multiple regression Y = b0 + b1 X1 + b2 X2ndash Many nonlinear functions can be transformed into

the above

bull Log-linear modelsndash The multi-way table of joint probabilities is

approximated by a product of lower-order tablesndash Probability p(a b c d) = ab acad bcd

Histograms Histograms use binning to approximate data distributions and are a popular form of data reduction

A histogram for an attribute A partitions the data distribution of A into disjoint subsets or buckets If each bucket represents only a single attribute-valuefrequency pair the buckets are called singleton buckets Often buckets instead represent continuous ranges for the given attribute

There are several partitioning rules including the following

Equal-width In an equal-width histogram the width of each bucket range is uniform

Equal-frequency (or equidepth) In an equal-frequency histogram the buckets are created so that roughly the frequency of each bucket is constant (that is each bucket contains roughly the same number of contiguous data samples)

V-Optimal If we consider all of the possible histograms for a given number of buckets the V-Optimal histogram is the one with the least variance Histogram variance is a weighted sum of the original values that each bucket represents where bucket weight is equal to the number of values in the bucket

MaxDiff In a MaxDiff histogram we consider the difference between each pair of adjacent values

ClusteringClustering techniques consider data tuples as objects They

partition the objects into groups or clusters so that objects within a cluster are ldquosimilarrdquo to one another and ldquodissimilarrdquo to objects in other clusters

In data reduction the cluster representations of the data are used to replace the actual data The effectiveness of this technique depends on the nature of the data It is much more effective for data that can be organized into distinct clusters than for smeared data

Partition data set into clusters based on similarity and store cluster representation (eg centroid and diameter) only

Can be very effective if data is clustered but not if data is ldquosmearedrdquo

Can have hierarchical clustering and be stored in multi-dimensional index tree structures

There are many choices of clustering definitions and clustering algorithms

Cluster analysis will be studied in depth later

SamplingSampling can be used as a data reduction technique because it

allows a large data set to be represented by a much smaller random sample (or subset) of the data Suppose that a large data set D contains N tuples Letrsquos look at the most common ways that we could sample D for data reduction

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample sampling complexity is potentially sublinear to the size of the data

Simple Random sample without replacement

bull For a fixed sample size sampling complexity increases only linearly as the number of data dimensions

bull When applied to data reduction sampling is most commonly used to estimate the answer to an aggregate query

Sampling obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data Simple random sampling may have very poor

performance in the presence of skew Develop adaptive sampling methods

Stratified sampling Approximate the percentage of each class (or

subpopulation of interest) in the overall database Used in conjunction with skewed data

Note Sampling may not reduce database IOs (page at a time)

Sampling with or without Replacement

SRSWOR

(simple random

sample without

replacement)

SRSWR

Raw Data

Raw Data ClusterStratified Sample

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 5: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

Data cleaning Fill in missing values smooth noisy data identify or

remove outliers and resolve inconsistencies and errors

Data integration Integration of multiple databases data cubes or files

Data transformation Normalization and aggregation

Data reduction Obtains reduced representation in volume but

produces the same or similar analytical results Data discretization

Part of data reduction but with particular importance especially for numerical data

bull For data preprocessing to be successful it is essential to have an overall picture of your data

bull Descriptive data summarization techniques can be used to identify the typical properties of your data and highlight which

data values should be treated as noise or outliers

bull Thus we first introduce the basic concepts of descriptive data summarization before getting into the concrete workings of

data preprocessing techniques

bull For many data preprocessing tasks users would like to learn about data characteristics regarding both central tendency and dispersion of the data

bull Measures of central tendency include mean median mode and midrange while measures of data dispersion include quartiles interquartile range (IQR) and variance

bull These descriptive statistics are of great help in understanding the distribution of the data

bull Such measures have been studied extensively in the statistical literature

bull From the data mining point of view we need to examine how they can be computed efficiently in large databases

bull In particular it is necessary to introduce the notions of distributive measure algebraic measure and holistic measure

bull Knowing what kind of measure we are dealing with can help us choose an efficient implementation for it

In this section we look at various ways to measure the central tendency of data The most common and most effective numerical measure of the ldquocenterrdquo of a set of data is the (arithmetic) mean

mean1048576mode = 3(mean1048576median)

The degree to which numerical data tend to spread is called the dispersion or variance of the data The most common measures of data dispersion are

1) Range Quartiles Outliers and Boxplots 2) Variance and Standard Deviation

The range of the set is the difference between the largest (max()) and smallest (min()) values

The most commonly used percentiles other than the median are quartiles The first quartile denoted by Q1 is the 25th percentile the third quartile denoted by Q3 is the 75th percentile The quartiles including the median give some indication of the center spread and shape of a distribution The distance between the first and third quartiles is a simple measure of spread that gives the range covered by the middle half of the data

bull Boxplots are a popular way of visualizing a distribution A boxplot incorporates the five-number summary as follows

bull Typically the ends of the box are at the quartiles so that the box length is the interquartile range IQR

bull The median is marked by a line within the box

bull Two lines (called whiskers) outside the box extend to the smallest (Minimum) and largest (Maximum) observations

Aside from the bar charts pie charts and line graphs used in most statistical or graphical data presentation software packages there are other popular types of graphs for the display of data summaries and distributions These include histograms quantile plots q-q plots scatter plots and loess curves Such graphs are very helpful for the visual inspection of your data

23 Graphic Displays of Basic Descriptive Data Summaries

3 Data Cleaningbull Data cleaning tasks

Fill in missing valuesIdentify outliers and smooth out noisy data

Correct inconsistent data1) Missing Databull Data is not always available

a Eg many tuples have no recorded value for several attributes such as customer income in sales data

bull Missing data may be due to a equipment malfunction

b inconsistent with other recorded data and thus deletedc data not entered due to misunderstandingd certain data may not be considered important at the time of entrye not register history or changes of the data

f Missing data may need to be inferred

How to Handle Missing Data

bull Ignore the tuple usually done when class label is missing(assuming the tasks in classificationmdashnot effective when the percentage of missing values per attribute varies considerably)

bull Fill in the missing value manually tedious + infeasiblebull Use a global constant to fill in the missing value eg ldquounknownrdquo a new class bull Use the attribute mean to fill in the missing valuebull Use the attribute mean for all samples belonging to the same class to fill in the missing value smarterbull Use the most probable value to fill in the missing value inference-based such as Bayesian formula or decision tree

bull 2Noisy Data

bull Noise random error or variance in a measured variable

bull Incorrect attribute values may be due tondash faulty data collection instrumentsndash data entry problemsndash data transmission problemsndash technology limitationndash inconsistency in naming convention

bull Other data problems which requires data cleaningndash duplicate recordsndash inconsistent data

How to Handle Noisy Data

bull Binning method

- first sort data and partition into (equi-depth) bins- then one can smooth by bin means smooth by bin median

- smooth by bin boundaries etc

bull Clustering- detect and remove outliers

bull Combined computer and human inspection- detect suspicious values and check by human

bull Regression- smooth by fitting the data into regression functions

Binning Methods for Data SmoothingSorted data for price (in dollars)

48915 21 21 24 25 26 28 29 34

Partition into (equi-depth) bins - Bin 1 4 8 9 15 - Bin 2 21 21 24 25 - Bin 3 26 28 29 34

Smoothing by bin means - Bin 1 9 9 9 9 - Bin 2 23 23 23 23 - Bin 3 29 29 29 29

Smoothing by bin boundaries - Bin 1 4 4 4 15 - Bin 2 21 21 25 25 - Bin 3 26 26 26 34

Cluster Analysis

Regression

Data integration combines data from multiple sources into a

coherent store Schema integration

integrate metadata from different sources entity identification problem identify real world

entities from multiple data sources eg Acust-id Bcust-

Detecting and resolving data value conflicts for the same real world entity attribute values from

different sources are different possible reasons different representations

different scales eg metric vs British units

bull Redundant data occur often when integration of multiple databasesndash The same attribute may have different names in

different databasesndash One attribute may be a ldquoderivedrdquo attribute in

another table eg annual revenuebull Redundant data may be able to be detected by

correlational analysisbull Careful integration of the data from multiple sources

may help reduceavoid redundancies and inconsistencies and improve mining speed and quality

Smoothing remove noise from data Aggregation summarization data cube construction Generalization concept hierarchy climbing Normalization scaled to fall within a small specified

range min-max normalization z-score normalization normalization by decimal scaling

Attributefeature construction New attributes constructed from the given ones

bull min-max normalization

bull z-score normalization

bull normalization by decimal scaling

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(

A

A

devstand

meanvv

_

j

vv

10 Where j is the smallest integer such that Max(| |)lt1v

Min-max normalization to [new_minA new_maxA]

Ex Let income range $12000 to $98000 normalized to [00 10] Then $73600 is mapped to

Z-score normalization (μ mean σ standard deviation)

Ex Let μ = 54000 σ = 16000 Then Normalization by decimal scaling

71600)001(0001200098

0001260073

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(

A

Avv

j

vv

10

Where j is the smallest integer such that Max(|νrsquo|) lt 1

225100016

0005460073

Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume yet closely maintains the integrity of the original data That is mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results

Strategies for data reduction include the following1Data cube aggregation where aggregation operations are applied to the data in the construction of a data cube

2 Attribute subset selection where irrelevant weakly relevant or redundant attributes or dimensions may be detected and removed

3 Dimensionality reduction where encoding mechanisms are used to reduce the data set size

4 Numerosity reduction where the data are replaced or estimated by alternative smaller data representations such as parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods such as clustering sampling and the use of histograms

5 Discretization and concept hierarchy generation where rawdata values for attributes are replaced by ranges or higher conceptual levels

Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes (or dimensions)

The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes

Mining on a reduced set of attributes has an additional benefit

It reduces the number of attributes appearing in the discovered patterns helping to make the patterns easier to understand

The ldquoBestrdquo (and ldquoWorstrdquo) attributes are typically determined using tests of statistical significance which assume that the attributes are independent of one Many other attribute evaluation measures can be used such as the information gain measure used in building decision trees for classification

bull Basic heuristic methods of attribute subset selection include the following techniques some of which are illustrated in Figure

bull 1 Stepwise forward selection The procedure starts with an empty set of attributes as the reduced set The best of the original attributes is determined and added to the reduced set At each subsequent iteration or step the best of the remaining original attributes is added to the set

bull 2 Stepwise backward elimination The procedure starts with the full set of attributes At each step it removes the worst attribute remaining in the set

bull 3 Combination of forward selection and backward elimination The stepwise forward selection and backward elimination methods can be combined so that at each step the procedure selects the best attribute and removes the

worst from among the remaining attributes

bull 4 Decision tree induction Decision tree algorithms such as ID3 C45 and CART were originally intended for classification Decision tree induction constructs a flow chart like structure where each internal (non leaf) node denotes a test on an attribute each branch corresponds to an outcome of the test and each external (leaf) node denotes a class prediction At each node the algorithm chooses the ldquobestrdquo attribute to partition the data into individual classes When decision tree induction is used for attribute subset selection a tree is constructed from the given data All attributes that do not appear in the tree are assumed to be irrelevant The set of attributes appearing in the tree form the reduced subset of attributes

The stopping criteria for the methods may vary The procedure may employ a threshold on the measure used to determine when to stop the attribute selection process

In dimensionality reduction data encoding or transformations are applied so as to obtain a reduced or ldquocompressedrdquo representation of the original data If the original data can be reconstructed from the compressed data without any loss of information the data reduction is called lossless If instead we can reconstruct only an approximation of the original data then the data reduction is called lossy There are several well-tuned algorithms for string compression Although they are typically lossless they allow only limited manipulation of the data

In this section we instead focus on two popular and effective methods of lossy dimensionality reduction wavelet transforms and principal components analysis

bull Wavelet transforms can be applied to multidimensional data such as a data cube

bull This is done by first applying the transform to the first dimension then to the second and so on The computational complexity involved is linear with respect to the number of cells in the cube Wavelet transforms give good results on sparse or skewed data and on data with ordered attributes Lossy compression by wavelets is reportedly better than JPEG compression the current commercial standard Wavelet transforms have many real-world applications including the compression of fingerprint images computer vision analysis of time-series data and data cleaning

bull PCA is computationally inexpensive can be applied to ordered and unordered attributes and can handle sparse data and skewed data Multidimensional data of more than two dimensions can be handled by reducing the problem to two dimensions Principal components may be used as inputs to multiple regression and cluster analysis

bull In comparison with wavelet transforms PCA tends to be better at handling sparse data whereas wavelet transforms are more suitable for data of high dimensionality

ldquoCan we reduce the data volume by choosing alternative lsquosmallerrsquo forms of data representationrdquo

Techniques of numerosity reduction can indeed be applied for this purpose These techniques may be parametric or nonparametric

For parametric methods a model is used to estimate the data so that typically only the data parameters need to be stored instead of the actual data (Outliers may also be stored) Log-linear models which estimate discrete multidimensional probability distributions are an example

Nonparametric methods for storing reduced representations of the data include histograms clustering and sampling

Letrsquos look at each of the numerosity reduction techniques mentioned above

Linear regression Data are modeled to fit a straight line Often uses the least-square method to fit the

line Multiple regression allows a response variable Y

to be modeled as a linear function of multidimensional feature vector

Log-linear model approximates discrete multidimensional probability distributions

bull Linear regression Y = w X + bndash Two regression coefficients w and b specify the

line and are to be estimated by using the data at hand

ndash Using the least squares criterion to the known values of Y1 Y2 hellip X1 X2 hellip

bull Multiple regression Y = b0 + b1 X1 + b2 X2ndash Many nonlinear functions can be transformed into

the above

bull Log-linear modelsndash The multi-way table of joint probabilities is

approximated by a product of lower-order tablesndash Probability p(a b c d) = ab acad bcd

Histograms Histograms use binning to approximate data distributions and are a popular form of data reduction

A histogram for an attribute A partitions the data distribution of A into disjoint subsets or buckets If each bucket represents only a single attribute-valuefrequency pair the buckets are called singleton buckets Often buckets instead represent continuous ranges for the given attribute

There are several partitioning rules including the following

Equal-width In an equal-width histogram the width of each bucket range is uniform

Equal-frequency (or equidepth) In an equal-frequency histogram the buckets are created so that roughly the frequency of each bucket is constant (that is each bucket contains roughly the same number of contiguous data samples)

V-Optimal If we consider all of the possible histograms for a given number of buckets the V-Optimal histogram is the one with the least variance Histogram variance is a weighted sum of the original values that each bucket represents where bucket weight is equal to the number of values in the bucket

MaxDiff In a MaxDiff histogram we consider the difference between each pair of adjacent values

ClusteringClustering techniques consider data tuples as objects They

partition the objects into groups or clusters so that objects within a cluster are ldquosimilarrdquo to one another and ldquodissimilarrdquo to objects in other clusters

In data reduction the cluster representations of the data are used to replace the actual data The effectiveness of this technique depends on the nature of the data It is much more effective for data that can be organized into distinct clusters than for smeared data

Partition data set into clusters based on similarity and store cluster representation (eg centroid and diameter) only

Can be very effective if data is clustered but not if data is ldquosmearedrdquo

Can have hierarchical clustering and be stored in multi-dimensional index tree structures

There are many choices of clustering definitions and clustering algorithms

Cluster analysis will be studied in depth later

SamplingSampling can be used as a data reduction technique because it

allows a large data set to be represented by a much smaller random sample (or subset) of the data Suppose that a large data set D contains N tuples Letrsquos look at the most common ways that we could sample D for data reduction

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample sampling complexity is potentially sublinear to the size of the data

Simple Random sample without replacement

bull For a fixed sample size sampling complexity increases only linearly as the number of data dimensions

bull When applied to data reduction sampling is most commonly used to estimate the answer to an aggregate query

Sampling obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data Simple random sampling may have very poor

performance in the presence of skew Develop adaptive sampling methods

Stratified sampling Approximate the percentage of each class (or

subpopulation of interest) in the overall database Used in conjunction with skewed data

Note Sampling may not reduce database IOs (page at a time)

Sampling with or without Replacement

SRSWOR

(simple random

sample without

replacement)

SRSWR

Raw Data

Raw Data ClusterStratified Sample

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 6: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

bull For data preprocessing to be successful it is essential to have an overall picture of your data

bull Descriptive data summarization techniques can be used to identify the typical properties of your data and highlight which

data values should be treated as noise or outliers

bull Thus we first introduce the basic concepts of descriptive data summarization before getting into the concrete workings of

data preprocessing techniques

bull For many data preprocessing tasks users would like to learn about data characteristics regarding both central tendency and dispersion of the data

bull Measures of central tendency include mean median mode and midrange while measures of data dispersion include quartiles interquartile range (IQR) and variance

bull These descriptive statistics are of great help in understanding the distribution of the data

bull Such measures have been studied extensively in the statistical literature

bull From the data mining point of view we need to examine how they can be computed efficiently in large databases

bull In particular it is necessary to introduce the notions of distributive measure algebraic measure and holistic measure

bull Knowing what kind of measure we are dealing with can help us choose an efficient implementation for it

In this section we look at various ways to measure the central tendency of data The most common and most effective numerical measure of the ldquocenterrdquo of a set of data is the (arithmetic) mean

mean1048576mode = 3(mean1048576median)

The degree to which numerical data tend to spread is called the dispersion or variance of the data The most common measures of data dispersion are

1) Range Quartiles Outliers and Boxplots 2) Variance and Standard Deviation

The range of the set is the difference between the largest (max()) and smallest (min()) values

The most commonly used percentiles other than the median are quartiles The first quartile denoted by Q1 is the 25th percentile the third quartile denoted by Q3 is the 75th percentile The quartiles including the median give some indication of the center spread and shape of a distribution The distance between the first and third quartiles is a simple measure of spread that gives the range covered by the middle half of the data

bull Boxplots are a popular way of visualizing a distribution A boxplot incorporates the five-number summary as follows

bull Typically the ends of the box are at the quartiles so that the box length is the interquartile range IQR

bull The median is marked by a line within the box

bull Two lines (called whiskers) outside the box extend to the smallest (Minimum) and largest (Maximum) observations

Aside from the bar charts pie charts and line graphs used in most statistical or graphical data presentation software packages there are other popular types of graphs for the display of data summaries and distributions These include histograms quantile plots q-q plots scatter plots and loess curves Such graphs are very helpful for the visual inspection of your data

23 Graphic Displays of Basic Descriptive Data Summaries

3 Data Cleaningbull Data cleaning tasks

Fill in missing valuesIdentify outliers and smooth out noisy data

Correct inconsistent data1) Missing Databull Data is not always available

a Eg many tuples have no recorded value for several attributes such as customer income in sales data

bull Missing data may be due to a equipment malfunction

b inconsistent with other recorded data and thus deletedc data not entered due to misunderstandingd certain data may not be considered important at the time of entrye not register history or changes of the data

f Missing data may need to be inferred

How to Handle Missing Data

bull Ignore the tuple usually done when class label is missing(assuming the tasks in classificationmdashnot effective when the percentage of missing values per attribute varies considerably)

bull Fill in the missing value manually tedious + infeasiblebull Use a global constant to fill in the missing value eg ldquounknownrdquo a new class bull Use the attribute mean to fill in the missing valuebull Use the attribute mean for all samples belonging to the same class to fill in the missing value smarterbull Use the most probable value to fill in the missing value inference-based such as Bayesian formula or decision tree

bull 2Noisy Data

bull Noise random error or variance in a measured variable

bull Incorrect attribute values may be due tondash faulty data collection instrumentsndash data entry problemsndash data transmission problemsndash technology limitationndash inconsistency in naming convention

bull Other data problems which requires data cleaningndash duplicate recordsndash inconsistent data

How to Handle Noisy Data

bull Binning method

- first sort data and partition into (equi-depth) bins- then one can smooth by bin means smooth by bin median

- smooth by bin boundaries etc

bull Clustering- detect and remove outliers

bull Combined computer and human inspection- detect suspicious values and check by human

bull Regression- smooth by fitting the data into regression functions

Binning Methods for Data SmoothingSorted data for price (in dollars)

48915 21 21 24 25 26 28 29 34

Partition into (equi-depth) bins - Bin 1 4 8 9 15 - Bin 2 21 21 24 25 - Bin 3 26 28 29 34

Smoothing by bin means - Bin 1 9 9 9 9 - Bin 2 23 23 23 23 - Bin 3 29 29 29 29

Smoothing by bin boundaries - Bin 1 4 4 4 15 - Bin 2 21 21 25 25 - Bin 3 26 26 26 34

Cluster Analysis

Regression

Data integration combines data from multiple sources into a

coherent store Schema integration

integrate metadata from different sources entity identification problem identify real world

entities from multiple data sources eg Acust-id Bcust-

Detecting and resolving data value conflicts for the same real world entity attribute values from

different sources are different possible reasons different representations

different scales eg metric vs British units

bull Redundant data occur often when integration of multiple databasesndash The same attribute may have different names in

different databasesndash One attribute may be a ldquoderivedrdquo attribute in

another table eg annual revenuebull Redundant data may be able to be detected by

correlational analysisbull Careful integration of the data from multiple sources

may help reduceavoid redundancies and inconsistencies and improve mining speed and quality

Smoothing remove noise from data Aggregation summarization data cube construction Generalization concept hierarchy climbing Normalization scaled to fall within a small specified

range min-max normalization z-score normalization normalization by decimal scaling

Attributefeature construction New attributes constructed from the given ones

bull min-max normalization

bull z-score normalization

bull normalization by decimal scaling

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(

A

A

devstand

meanvv

_

j

vv

10 Where j is the smallest integer such that Max(| |)lt1v

Min-max normalization to [new_minA new_maxA]

Ex Let income range $12000 to $98000 normalized to [00 10] Then $73600 is mapped to

Z-score normalization (μ mean σ standard deviation)

Ex Let μ = 54000 σ = 16000 Then Normalization by decimal scaling

71600)001(0001200098

0001260073

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(

A

Avv

j

vv

10

Where j is the smallest integer such that Max(|νrsquo|) lt 1

225100016

0005460073

Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume yet closely maintains the integrity of the original data That is mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results

Strategies for data reduction include the following1Data cube aggregation where aggregation operations are applied to the data in the construction of a data cube

2 Attribute subset selection where irrelevant weakly relevant or redundant attributes or dimensions may be detected and removed

3 Dimensionality reduction where encoding mechanisms are used to reduce the data set size

4 Numerosity reduction where the data are replaced or estimated by alternative smaller data representations such as parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods such as clustering sampling and the use of histograms

5 Discretization and concept hierarchy generation where rawdata values for attributes are replaced by ranges or higher conceptual levels

Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes (or dimensions)

The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes

Mining on a reduced set of attributes has an additional benefit

It reduces the number of attributes appearing in the discovered patterns helping to make the patterns easier to understand

The ldquoBestrdquo (and ldquoWorstrdquo) attributes are typically determined using tests of statistical significance which assume that the attributes are independent of one Many other attribute evaluation measures can be used such as the information gain measure used in building decision trees for classification

bull Basic heuristic methods of attribute subset selection include the following techniques some of which are illustrated in Figure

bull 1 Stepwise forward selection The procedure starts with an empty set of attributes as the reduced set The best of the original attributes is determined and added to the reduced set At each subsequent iteration or step the best of the remaining original attributes is added to the set

bull 2 Stepwise backward elimination The procedure starts with the full set of attributes At each step it removes the worst attribute remaining in the set

bull 3 Combination of forward selection and backward elimination The stepwise forward selection and backward elimination methods can be combined so that at each step the procedure selects the best attribute and removes the

worst from among the remaining attributes

bull 4 Decision tree induction Decision tree algorithms such as ID3 C45 and CART were originally intended for classification Decision tree induction constructs a flow chart like structure where each internal (non leaf) node denotes a test on an attribute each branch corresponds to an outcome of the test and each external (leaf) node denotes a class prediction At each node the algorithm chooses the ldquobestrdquo attribute to partition the data into individual classes When decision tree induction is used for attribute subset selection a tree is constructed from the given data All attributes that do not appear in the tree are assumed to be irrelevant The set of attributes appearing in the tree form the reduced subset of attributes

The stopping criteria for the methods may vary The procedure may employ a threshold on the measure used to determine when to stop the attribute selection process

In dimensionality reduction data encoding or transformations are applied so as to obtain a reduced or ldquocompressedrdquo representation of the original data If the original data can be reconstructed from the compressed data without any loss of information the data reduction is called lossless If instead we can reconstruct only an approximation of the original data then the data reduction is called lossy There are several well-tuned algorithms for string compression Although they are typically lossless they allow only limited manipulation of the data

In this section we instead focus on two popular and effective methods of lossy dimensionality reduction wavelet transforms and principal components analysis

bull Wavelet transforms can be applied to multidimensional data such as a data cube

bull This is done by first applying the transform to the first dimension then to the second and so on The computational complexity involved is linear with respect to the number of cells in the cube Wavelet transforms give good results on sparse or skewed data and on data with ordered attributes Lossy compression by wavelets is reportedly better than JPEG compression the current commercial standard Wavelet transforms have many real-world applications including the compression of fingerprint images computer vision analysis of time-series data and data cleaning

bull PCA is computationally inexpensive can be applied to ordered and unordered attributes and can handle sparse data and skewed data Multidimensional data of more than two dimensions can be handled by reducing the problem to two dimensions Principal components may be used as inputs to multiple regression and cluster analysis

bull In comparison with wavelet transforms PCA tends to be better at handling sparse data whereas wavelet transforms are more suitable for data of high dimensionality

ldquoCan we reduce the data volume by choosing alternative lsquosmallerrsquo forms of data representationrdquo

Techniques of numerosity reduction can indeed be applied for this purpose These techniques may be parametric or nonparametric

For parametric methods a model is used to estimate the data so that typically only the data parameters need to be stored instead of the actual data (Outliers may also be stored) Log-linear models which estimate discrete multidimensional probability distributions are an example

Nonparametric methods for storing reduced representations of the data include histograms clustering and sampling

Letrsquos look at each of the numerosity reduction techniques mentioned above

Linear regression Data are modeled to fit a straight line Often uses the least-square method to fit the

line Multiple regression allows a response variable Y

to be modeled as a linear function of multidimensional feature vector

Log-linear model approximates discrete multidimensional probability distributions

bull Linear regression Y = w X + bndash Two regression coefficients w and b specify the

line and are to be estimated by using the data at hand

ndash Using the least squares criterion to the known values of Y1 Y2 hellip X1 X2 hellip

bull Multiple regression Y = b0 + b1 X1 + b2 X2ndash Many nonlinear functions can be transformed into

the above

bull Log-linear modelsndash The multi-way table of joint probabilities is

approximated by a product of lower-order tablesndash Probability p(a b c d) = ab acad bcd

Histograms Histograms use binning to approximate data distributions and are a popular form of data reduction

A histogram for an attribute A partitions the data distribution of A into disjoint subsets or buckets If each bucket represents only a single attribute-valuefrequency pair the buckets are called singleton buckets Often buckets instead represent continuous ranges for the given attribute

There are several partitioning rules including the following

Equal-width In an equal-width histogram the width of each bucket range is uniform

Equal-frequency (or equidepth) In an equal-frequency histogram the buckets are created so that roughly the frequency of each bucket is constant (that is each bucket contains roughly the same number of contiguous data samples)

V-Optimal If we consider all of the possible histograms for a given number of buckets the V-Optimal histogram is the one with the least variance Histogram variance is a weighted sum of the original values that each bucket represents where bucket weight is equal to the number of values in the bucket

MaxDiff In a MaxDiff histogram we consider the difference between each pair of adjacent values

ClusteringClustering techniques consider data tuples as objects They

partition the objects into groups or clusters so that objects within a cluster are ldquosimilarrdquo to one another and ldquodissimilarrdquo to objects in other clusters

In data reduction the cluster representations of the data are used to replace the actual data The effectiveness of this technique depends on the nature of the data It is much more effective for data that can be organized into distinct clusters than for smeared data

Partition data set into clusters based on similarity and store cluster representation (eg centroid and diameter) only

Can be very effective if data is clustered but not if data is ldquosmearedrdquo

Can have hierarchical clustering and be stored in multi-dimensional index tree structures

There are many choices of clustering definitions and clustering algorithms

Cluster analysis will be studied in depth later

SamplingSampling can be used as a data reduction technique because it

allows a large data set to be represented by a much smaller random sample (or subset) of the data Suppose that a large data set D contains N tuples Letrsquos look at the most common ways that we could sample D for data reduction

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample sampling complexity is potentially sublinear to the size of the data

Simple Random sample without replacement

bull For a fixed sample size sampling complexity increases only linearly as the number of data dimensions

bull When applied to data reduction sampling is most commonly used to estimate the answer to an aggregate query

Sampling obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data Simple random sampling may have very poor

performance in the presence of skew Develop adaptive sampling methods

Stratified sampling Approximate the percentage of each class (or

subpopulation of interest) in the overall database Used in conjunction with skewed data

Note Sampling may not reduce database IOs (page at a time)

Sampling with or without Replacement

SRSWOR

(simple random

sample without

replacement)

SRSWR

Raw Data

Raw Data ClusterStratified Sample

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 7: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

bull Measures of central tendency include mean median mode and midrange while measures of data dispersion include quartiles interquartile range (IQR) and variance

bull These descriptive statistics are of great help in understanding the distribution of the data

bull Such measures have been studied extensively in the statistical literature

bull From the data mining point of view we need to examine how they can be computed efficiently in large databases

bull In particular it is necessary to introduce the notions of distributive measure algebraic measure and holistic measure

bull Knowing what kind of measure we are dealing with can help us choose an efficient implementation for it

In this section we look at various ways to measure the central tendency of data The most common and most effective numerical measure of the ldquocenterrdquo of a set of data is the (arithmetic) mean

mean1048576mode = 3(mean1048576median)

The degree to which numerical data tend to spread is called the dispersion or variance of the data The most common measures of data dispersion are

1) Range Quartiles Outliers and Boxplots 2) Variance and Standard Deviation

The range of the set is the difference between the largest (max()) and smallest (min()) values

The most commonly used percentiles other than the median are quartiles The first quartile denoted by Q1 is the 25th percentile the third quartile denoted by Q3 is the 75th percentile The quartiles including the median give some indication of the center spread and shape of a distribution The distance between the first and third quartiles is a simple measure of spread that gives the range covered by the middle half of the data

bull Boxplots are a popular way of visualizing a distribution A boxplot incorporates the five-number summary as follows

bull Typically the ends of the box are at the quartiles so that the box length is the interquartile range IQR

bull The median is marked by a line within the box

bull Two lines (called whiskers) outside the box extend to the smallest (Minimum) and largest (Maximum) observations

Aside from the bar charts pie charts and line graphs used in most statistical or graphical data presentation software packages there are other popular types of graphs for the display of data summaries and distributions These include histograms quantile plots q-q plots scatter plots and loess curves Such graphs are very helpful for the visual inspection of your data

23 Graphic Displays of Basic Descriptive Data Summaries

3 Data Cleaningbull Data cleaning tasks

Fill in missing valuesIdentify outliers and smooth out noisy data

Correct inconsistent data1) Missing Databull Data is not always available

a Eg many tuples have no recorded value for several attributes such as customer income in sales data

bull Missing data may be due to a equipment malfunction

b inconsistent with other recorded data and thus deletedc data not entered due to misunderstandingd certain data may not be considered important at the time of entrye not register history or changes of the data

f Missing data may need to be inferred

How to Handle Missing Data

bull Ignore the tuple usually done when class label is missing(assuming the tasks in classificationmdashnot effective when the percentage of missing values per attribute varies considerably)

bull Fill in the missing value manually tedious + infeasiblebull Use a global constant to fill in the missing value eg ldquounknownrdquo a new class bull Use the attribute mean to fill in the missing valuebull Use the attribute mean for all samples belonging to the same class to fill in the missing value smarterbull Use the most probable value to fill in the missing value inference-based such as Bayesian formula or decision tree

bull 2Noisy Data

bull Noise random error or variance in a measured variable

bull Incorrect attribute values may be due tondash faulty data collection instrumentsndash data entry problemsndash data transmission problemsndash technology limitationndash inconsistency in naming convention

bull Other data problems which requires data cleaningndash duplicate recordsndash inconsistent data

How to Handle Noisy Data

bull Binning method

- first sort data and partition into (equi-depth) bins- then one can smooth by bin means smooth by bin median

- smooth by bin boundaries etc

bull Clustering- detect and remove outliers

bull Combined computer and human inspection- detect suspicious values and check by human

bull Regression- smooth by fitting the data into regression functions

Binning Methods for Data SmoothingSorted data for price (in dollars)

48915 21 21 24 25 26 28 29 34

Partition into (equi-depth) bins - Bin 1 4 8 9 15 - Bin 2 21 21 24 25 - Bin 3 26 28 29 34

Smoothing by bin means - Bin 1 9 9 9 9 - Bin 2 23 23 23 23 - Bin 3 29 29 29 29

Smoothing by bin boundaries - Bin 1 4 4 4 15 - Bin 2 21 21 25 25 - Bin 3 26 26 26 34

Cluster Analysis

Regression

Data integration combines data from multiple sources into a

coherent store Schema integration

integrate metadata from different sources entity identification problem identify real world

entities from multiple data sources eg Acust-id Bcust-

Detecting and resolving data value conflicts for the same real world entity attribute values from

different sources are different possible reasons different representations

different scales eg metric vs British units

bull Redundant data occur often when integration of multiple databasesndash The same attribute may have different names in

different databasesndash One attribute may be a ldquoderivedrdquo attribute in

another table eg annual revenuebull Redundant data may be able to be detected by

correlational analysisbull Careful integration of the data from multiple sources

may help reduceavoid redundancies and inconsistencies and improve mining speed and quality

Smoothing remove noise from data Aggregation summarization data cube construction Generalization concept hierarchy climbing Normalization scaled to fall within a small specified

range min-max normalization z-score normalization normalization by decimal scaling

Attributefeature construction New attributes constructed from the given ones

bull min-max normalization

bull z-score normalization

bull normalization by decimal scaling

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(

A

A

devstand

meanvv

_

j

vv

10 Where j is the smallest integer such that Max(| |)lt1v

Min-max normalization to [new_minA new_maxA]

Ex Let income range $12000 to $98000 normalized to [00 10] Then $73600 is mapped to

Z-score normalization (μ mean σ standard deviation)

Ex Let μ = 54000 σ = 16000 Then Normalization by decimal scaling

71600)001(0001200098

0001260073

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(

A

Avv

j

vv

10

Where j is the smallest integer such that Max(|νrsquo|) lt 1

225100016

0005460073

Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume yet closely maintains the integrity of the original data That is mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results

Strategies for data reduction include the following1Data cube aggregation where aggregation operations are applied to the data in the construction of a data cube

2 Attribute subset selection where irrelevant weakly relevant or redundant attributes or dimensions may be detected and removed

3 Dimensionality reduction where encoding mechanisms are used to reduce the data set size

4 Numerosity reduction where the data are replaced or estimated by alternative smaller data representations such as parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods such as clustering sampling and the use of histograms

5 Discretization and concept hierarchy generation where rawdata values for attributes are replaced by ranges or higher conceptual levels

Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes (or dimensions)

The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes

Mining on a reduced set of attributes has an additional benefit

It reduces the number of attributes appearing in the discovered patterns helping to make the patterns easier to understand

The ldquoBestrdquo (and ldquoWorstrdquo) attributes are typically determined using tests of statistical significance which assume that the attributes are independent of one Many other attribute evaluation measures can be used such as the information gain measure used in building decision trees for classification

bull Basic heuristic methods of attribute subset selection include the following techniques some of which are illustrated in Figure

bull 1 Stepwise forward selection The procedure starts with an empty set of attributes as the reduced set The best of the original attributes is determined and added to the reduced set At each subsequent iteration or step the best of the remaining original attributes is added to the set

bull 2 Stepwise backward elimination The procedure starts with the full set of attributes At each step it removes the worst attribute remaining in the set

bull 3 Combination of forward selection and backward elimination The stepwise forward selection and backward elimination methods can be combined so that at each step the procedure selects the best attribute and removes the

worst from among the remaining attributes

bull 4 Decision tree induction Decision tree algorithms such as ID3 C45 and CART were originally intended for classification Decision tree induction constructs a flow chart like structure where each internal (non leaf) node denotes a test on an attribute each branch corresponds to an outcome of the test and each external (leaf) node denotes a class prediction At each node the algorithm chooses the ldquobestrdquo attribute to partition the data into individual classes When decision tree induction is used for attribute subset selection a tree is constructed from the given data All attributes that do not appear in the tree are assumed to be irrelevant The set of attributes appearing in the tree form the reduced subset of attributes

The stopping criteria for the methods may vary The procedure may employ a threshold on the measure used to determine when to stop the attribute selection process

In dimensionality reduction data encoding or transformations are applied so as to obtain a reduced or ldquocompressedrdquo representation of the original data If the original data can be reconstructed from the compressed data without any loss of information the data reduction is called lossless If instead we can reconstruct only an approximation of the original data then the data reduction is called lossy There are several well-tuned algorithms for string compression Although they are typically lossless they allow only limited manipulation of the data

In this section we instead focus on two popular and effective methods of lossy dimensionality reduction wavelet transforms and principal components analysis

bull Wavelet transforms can be applied to multidimensional data such as a data cube

bull This is done by first applying the transform to the first dimension then to the second and so on The computational complexity involved is linear with respect to the number of cells in the cube Wavelet transforms give good results on sparse or skewed data and on data with ordered attributes Lossy compression by wavelets is reportedly better than JPEG compression the current commercial standard Wavelet transforms have many real-world applications including the compression of fingerprint images computer vision analysis of time-series data and data cleaning

bull PCA is computationally inexpensive can be applied to ordered and unordered attributes and can handle sparse data and skewed data Multidimensional data of more than two dimensions can be handled by reducing the problem to two dimensions Principal components may be used as inputs to multiple regression and cluster analysis

bull In comparison with wavelet transforms PCA tends to be better at handling sparse data whereas wavelet transforms are more suitable for data of high dimensionality

ldquoCan we reduce the data volume by choosing alternative lsquosmallerrsquo forms of data representationrdquo

Techniques of numerosity reduction can indeed be applied for this purpose These techniques may be parametric or nonparametric

For parametric methods a model is used to estimate the data so that typically only the data parameters need to be stored instead of the actual data (Outliers may also be stored) Log-linear models which estimate discrete multidimensional probability distributions are an example

Nonparametric methods for storing reduced representations of the data include histograms clustering and sampling

Letrsquos look at each of the numerosity reduction techniques mentioned above

Linear regression Data are modeled to fit a straight line Often uses the least-square method to fit the

line Multiple regression allows a response variable Y

to be modeled as a linear function of multidimensional feature vector

Log-linear model approximates discrete multidimensional probability distributions

bull Linear regression Y = w X + bndash Two regression coefficients w and b specify the

line and are to be estimated by using the data at hand

ndash Using the least squares criterion to the known values of Y1 Y2 hellip X1 X2 hellip

bull Multiple regression Y = b0 + b1 X1 + b2 X2ndash Many nonlinear functions can be transformed into

the above

bull Log-linear modelsndash The multi-way table of joint probabilities is

approximated by a product of lower-order tablesndash Probability p(a b c d) = ab acad bcd

Histograms Histograms use binning to approximate data distributions and are a popular form of data reduction

A histogram for an attribute A partitions the data distribution of A into disjoint subsets or buckets If each bucket represents only a single attribute-valuefrequency pair the buckets are called singleton buckets Often buckets instead represent continuous ranges for the given attribute

There are several partitioning rules including the following

Equal-width In an equal-width histogram the width of each bucket range is uniform

Equal-frequency (or equidepth) In an equal-frequency histogram the buckets are created so that roughly the frequency of each bucket is constant (that is each bucket contains roughly the same number of contiguous data samples)

V-Optimal If we consider all of the possible histograms for a given number of buckets the V-Optimal histogram is the one with the least variance Histogram variance is a weighted sum of the original values that each bucket represents where bucket weight is equal to the number of values in the bucket

MaxDiff In a MaxDiff histogram we consider the difference between each pair of adjacent values

ClusteringClustering techniques consider data tuples as objects They

partition the objects into groups or clusters so that objects within a cluster are ldquosimilarrdquo to one another and ldquodissimilarrdquo to objects in other clusters

In data reduction the cluster representations of the data are used to replace the actual data The effectiveness of this technique depends on the nature of the data It is much more effective for data that can be organized into distinct clusters than for smeared data

Partition data set into clusters based on similarity and store cluster representation (eg centroid and diameter) only

Can be very effective if data is clustered but not if data is ldquosmearedrdquo

Can have hierarchical clustering and be stored in multi-dimensional index tree structures

There are many choices of clustering definitions and clustering algorithms

Cluster analysis will be studied in depth later

SamplingSampling can be used as a data reduction technique because it

allows a large data set to be represented by a much smaller random sample (or subset) of the data Suppose that a large data set D contains N tuples Letrsquos look at the most common ways that we could sample D for data reduction

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample sampling complexity is potentially sublinear to the size of the data

Simple Random sample without replacement

bull For a fixed sample size sampling complexity increases only linearly as the number of data dimensions

bull When applied to data reduction sampling is most commonly used to estimate the answer to an aggregate query

Sampling obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data Simple random sampling may have very poor

performance in the presence of skew Develop adaptive sampling methods

Stratified sampling Approximate the percentage of each class (or

subpopulation of interest) in the overall database Used in conjunction with skewed data

Note Sampling may not reduce database IOs (page at a time)

Sampling with or without Replacement

SRSWOR

(simple random

sample without

replacement)

SRSWR

Raw Data

Raw Data ClusterStratified Sample

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 8: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

In this section we look at various ways to measure the central tendency of data The most common and most effective numerical measure of the ldquocenterrdquo of a set of data is the (arithmetic) mean

mean1048576mode = 3(mean1048576median)

The degree to which numerical data tend to spread is called the dispersion or variance of the data The most common measures of data dispersion are

1) Range Quartiles Outliers and Boxplots 2) Variance and Standard Deviation

The range of the set is the difference between the largest (max()) and smallest (min()) values

The most commonly used percentiles other than the median are quartiles The first quartile denoted by Q1 is the 25th percentile the third quartile denoted by Q3 is the 75th percentile The quartiles including the median give some indication of the center spread and shape of a distribution The distance between the first and third quartiles is a simple measure of spread that gives the range covered by the middle half of the data

bull Boxplots are a popular way of visualizing a distribution A boxplot incorporates the five-number summary as follows

bull Typically the ends of the box are at the quartiles so that the box length is the interquartile range IQR

bull The median is marked by a line within the box

bull Two lines (called whiskers) outside the box extend to the smallest (Minimum) and largest (Maximum) observations

Aside from the bar charts pie charts and line graphs used in most statistical or graphical data presentation software packages there are other popular types of graphs for the display of data summaries and distributions These include histograms quantile plots q-q plots scatter plots and loess curves Such graphs are very helpful for the visual inspection of your data

23 Graphic Displays of Basic Descriptive Data Summaries

3 Data Cleaningbull Data cleaning tasks

Fill in missing valuesIdentify outliers and smooth out noisy data

Correct inconsistent data1) Missing Databull Data is not always available

a Eg many tuples have no recorded value for several attributes such as customer income in sales data

bull Missing data may be due to a equipment malfunction

b inconsistent with other recorded data and thus deletedc data not entered due to misunderstandingd certain data may not be considered important at the time of entrye not register history or changes of the data

f Missing data may need to be inferred

How to Handle Missing Data

bull Ignore the tuple usually done when class label is missing(assuming the tasks in classificationmdashnot effective when the percentage of missing values per attribute varies considerably)

bull Fill in the missing value manually tedious + infeasiblebull Use a global constant to fill in the missing value eg ldquounknownrdquo a new class bull Use the attribute mean to fill in the missing valuebull Use the attribute mean for all samples belonging to the same class to fill in the missing value smarterbull Use the most probable value to fill in the missing value inference-based such as Bayesian formula or decision tree

bull 2Noisy Data

bull Noise random error or variance in a measured variable

bull Incorrect attribute values may be due tondash faulty data collection instrumentsndash data entry problemsndash data transmission problemsndash technology limitationndash inconsistency in naming convention

bull Other data problems which requires data cleaningndash duplicate recordsndash inconsistent data

How to Handle Noisy Data

bull Binning method

- first sort data and partition into (equi-depth) bins- then one can smooth by bin means smooth by bin median

- smooth by bin boundaries etc

bull Clustering- detect and remove outliers

bull Combined computer and human inspection- detect suspicious values and check by human

bull Regression- smooth by fitting the data into regression functions

Binning Methods for Data SmoothingSorted data for price (in dollars)

48915 21 21 24 25 26 28 29 34

Partition into (equi-depth) bins - Bin 1 4 8 9 15 - Bin 2 21 21 24 25 - Bin 3 26 28 29 34

Smoothing by bin means - Bin 1 9 9 9 9 - Bin 2 23 23 23 23 - Bin 3 29 29 29 29

Smoothing by bin boundaries - Bin 1 4 4 4 15 - Bin 2 21 21 25 25 - Bin 3 26 26 26 34

Cluster Analysis

Regression

Data integration combines data from multiple sources into a

coherent store Schema integration

integrate metadata from different sources entity identification problem identify real world

entities from multiple data sources eg Acust-id Bcust-

Detecting and resolving data value conflicts for the same real world entity attribute values from

different sources are different possible reasons different representations

different scales eg metric vs British units

bull Redundant data occur often when integration of multiple databasesndash The same attribute may have different names in

different databasesndash One attribute may be a ldquoderivedrdquo attribute in

another table eg annual revenuebull Redundant data may be able to be detected by

correlational analysisbull Careful integration of the data from multiple sources

may help reduceavoid redundancies and inconsistencies and improve mining speed and quality

Smoothing remove noise from data Aggregation summarization data cube construction Generalization concept hierarchy climbing Normalization scaled to fall within a small specified

range min-max normalization z-score normalization normalization by decimal scaling

Attributefeature construction New attributes constructed from the given ones

bull min-max normalization

bull z-score normalization

bull normalization by decimal scaling

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(

A

A

devstand

meanvv

_

j

vv

10 Where j is the smallest integer such that Max(| |)lt1v

Min-max normalization to [new_minA new_maxA]

Ex Let income range $12000 to $98000 normalized to [00 10] Then $73600 is mapped to

Z-score normalization (μ mean σ standard deviation)

Ex Let μ = 54000 σ = 16000 Then Normalization by decimal scaling

71600)001(0001200098

0001260073

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(

A

Avv

j

vv

10

Where j is the smallest integer such that Max(|νrsquo|) lt 1

225100016

0005460073

Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume yet closely maintains the integrity of the original data That is mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results

Strategies for data reduction include the following1Data cube aggregation where aggregation operations are applied to the data in the construction of a data cube

2 Attribute subset selection where irrelevant weakly relevant or redundant attributes or dimensions may be detected and removed

3 Dimensionality reduction where encoding mechanisms are used to reduce the data set size

4 Numerosity reduction where the data are replaced or estimated by alternative smaller data representations such as parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods such as clustering sampling and the use of histograms

5 Discretization and concept hierarchy generation where rawdata values for attributes are replaced by ranges or higher conceptual levels

Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes (or dimensions)

The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes

Mining on a reduced set of attributes has an additional benefit

It reduces the number of attributes appearing in the discovered patterns helping to make the patterns easier to understand

The ldquoBestrdquo (and ldquoWorstrdquo) attributes are typically determined using tests of statistical significance which assume that the attributes are independent of one Many other attribute evaluation measures can be used such as the information gain measure used in building decision trees for classification

bull Basic heuristic methods of attribute subset selection include the following techniques some of which are illustrated in Figure

bull 1 Stepwise forward selection The procedure starts with an empty set of attributes as the reduced set The best of the original attributes is determined and added to the reduced set At each subsequent iteration or step the best of the remaining original attributes is added to the set

bull 2 Stepwise backward elimination The procedure starts with the full set of attributes At each step it removes the worst attribute remaining in the set

bull 3 Combination of forward selection and backward elimination The stepwise forward selection and backward elimination methods can be combined so that at each step the procedure selects the best attribute and removes the

worst from among the remaining attributes

bull 4 Decision tree induction Decision tree algorithms such as ID3 C45 and CART were originally intended for classification Decision tree induction constructs a flow chart like structure where each internal (non leaf) node denotes a test on an attribute each branch corresponds to an outcome of the test and each external (leaf) node denotes a class prediction At each node the algorithm chooses the ldquobestrdquo attribute to partition the data into individual classes When decision tree induction is used for attribute subset selection a tree is constructed from the given data All attributes that do not appear in the tree are assumed to be irrelevant The set of attributes appearing in the tree form the reduced subset of attributes

The stopping criteria for the methods may vary The procedure may employ a threshold on the measure used to determine when to stop the attribute selection process

In dimensionality reduction data encoding or transformations are applied so as to obtain a reduced or ldquocompressedrdquo representation of the original data If the original data can be reconstructed from the compressed data without any loss of information the data reduction is called lossless If instead we can reconstruct only an approximation of the original data then the data reduction is called lossy There are several well-tuned algorithms for string compression Although they are typically lossless they allow only limited manipulation of the data

In this section we instead focus on two popular and effective methods of lossy dimensionality reduction wavelet transforms and principal components analysis

bull Wavelet transforms can be applied to multidimensional data such as a data cube

bull This is done by first applying the transform to the first dimension then to the second and so on The computational complexity involved is linear with respect to the number of cells in the cube Wavelet transforms give good results on sparse or skewed data and on data with ordered attributes Lossy compression by wavelets is reportedly better than JPEG compression the current commercial standard Wavelet transforms have many real-world applications including the compression of fingerprint images computer vision analysis of time-series data and data cleaning

bull PCA is computationally inexpensive can be applied to ordered and unordered attributes and can handle sparse data and skewed data Multidimensional data of more than two dimensions can be handled by reducing the problem to two dimensions Principal components may be used as inputs to multiple regression and cluster analysis

bull In comparison with wavelet transforms PCA tends to be better at handling sparse data whereas wavelet transforms are more suitable for data of high dimensionality

ldquoCan we reduce the data volume by choosing alternative lsquosmallerrsquo forms of data representationrdquo

Techniques of numerosity reduction can indeed be applied for this purpose These techniques may be parametric or nonparametric

For parametric methods a model is used to estimate the data so that typically only the data parameters need to be stored instead of the actual data (Outliers may also be stored) Log-linear models which estimate discrete multidimensional probability distributions are an example

Nonparametric methods for storing reduced representations of the data include histograms clustering and sampling

Letrsquos look at each of the numerosity reduction techniques mentioned above

Linear regression Data are modeled to fit a straight line Often uses the least-square method to fit the

line Multiple regression allows a response variable Y

to be modeled as a linear function of multidimensional feature vector

Log-linear model approximates discrete multidimensional probability distributions

bull Linear regression Y = w X + bndash Two regression coefficients w and b specify the

line and are to be estimated by using the data at hand

ndash Using the least squares criterion to the known values of Y1 Y2 hellip X1 X2 hellip

bull Multiple regression Y = b0 + b1 X1 + b2 X2ndash Many nonlinear functions can be transformed into

the above

bull Log-linear modelsndash The multi-way table of joint probabilities is

approximated by a product of lower-order tablesndash Probability p(a b c d) = ab acad bcd

Histograms Histograms use binning to approximate data distributions and are a popular form of data reduction

A histogram for an attribute A partitions the data distribution of A into disjoint subsets or buckets If each bucket represents only a single attribute-valuefrequency pair the buckets are called singleton buckets Often buckets instead represent continuous ranges for the given attribute

There are several partitioning rules including the following

Equal-width In an equal-width histogram the width of each bucket range is uniform

Equal-frequency (or equidepth) In an equal-frequency histogram the buckets are created so that roughly the frequency of each bucket is constant (that is each bucket contains roughly the same number of contiguous data samples)

V-Optimal If we consider all of the possible histograms for a given number of buckets the V-Optimal histogram is the one with the least variance Histogram variance is a weighted sum of the original values that each bucket represents where bucket weight is equal to the number of values in the bucket

MaxDiff In a MaxDiff histogram we consider the difference between each pair of adjacent values

ClusteringClustering techniques consider data tuples as objects They

partition the objects into groups or clusters so that objects within a cluster are ldquosimilarrdquo to one another and ldquodissimilarrdquo to objects in other clusters

In data reduction the cluster representations of the data are used to replace the actual data The effectiveness of this technique depends on the nature of the data It is much more effective for data that can be organized into distinct clusters than for smeared data

Partition data set into clusters based on similarity and store cluster representation (eg centroid and diameter) only

Can be very effective if data is clustered but not if data is ldquosmearedrdquo

Can have hierarchical clustering and be stored in multi-dimensional index tree structures

There are many choices of clustering definitions and clustering algorithms

Cluster analysis will be studied in depth later

SamplingSampling can be used as a data reduction technique because it

allows a large data set to be represented by a much smaller random sample (or subset) of the data Suppose that a large data set D contains N tuples Letrsquos look at the most common ways that we could sample D for data reduction

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample sampling complexity is potentially sublinear to the size of the data

Simple Random sample without replacement

bull For a fixed sample size sampling complexity increases only linearly as the number of data dimensions

bull When applied to data reduction sampling is most commonly used to estimate the answer to an aggregate query

Sampling obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data Simple random sampling may have very poor

performance in the presence of skew Develop adaptive sampling methods

Stratified sampling Approximate the percentage of each class (or

subpopulation of interest) in the overall database Used in conjunction with skewed data

Note Sampling may not reduce database IOs (page at a time)

Sampling with or without Replacement

SRSWOR

(simple random

sample without

replacement)

SRSWR

Raw Data

Raw Data ClusterStratified Sample

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 9: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

The degree to which numerical data tend to spread is called the dispersion or variance of the data The most common measures of data dispersion are

1) Range Quartiles Outliers and Boxplots 2) Variance and Standard Deviation

The range of the set is the difference between the largest (max()) and smallest (min()) values

The most commonly used percentiles other than the median are quartiles The first quartile denoted by Q1 is the 25th percentile the third quartile denoted by Q3 is the 75th percentile The quartiles including the median give some indication of the center spread and shape of a distribution The distance between the first and third quartiles is a simple measure of spread that gives the range covered by the middle half of the data

bull Boxplots are a popular way of visualizing a distribution A boxplot incorporates the five-number summary as follows

bull Typically the ends of the box are at the quartiles so that the box length is the interquartile range IQR

bull The median is marked by a line within the box

bull Two lines (called whiskers) outside the box extend to the smallest (Minimum) and largest (Maximum) observations

Aside from the bar charts pie charts and line graphs used in most statistical or graphical data presentation software packages there are other popular types of graphs for the display of data summaries and distributions These include histograms quantile plots q-q plots scatter plots and loess curves Such graphs are very helpful for the visual inspection of your data

23 Graphic Displays of Basic Descriptive Data Summaries

3 Data Cleaningbull Data cleaning tasks

Fill in missing valuesIdentify outliers and smooth out noisy data

Correct inconsistent data1) Missing Databull Data is not always available

a Eg many tuples have no recorded value for several attributes such as customer income in sales data

bull Missing data may be due to a equipment malfunction

b inconsistent with other recorded data and thus deletedc data not entered due to misunderstandingd certain data may not be considered important at the time of entrye not register history or changes of the data

f Missing data may need to be inferred

How to Handle Missing Data

bull Ignore the tuple usually done when class label is missing(assuming the tasks in classificationmdashnot effective when the percentage of missing values per attribute varies considerably)

bull Fill in the missing value manually tedious + infeasiblebull Use a global constant to fill in the missing value eg ldquounknownrdquo a new class bull Use the attribute mean to fill in the missing valuebull Use the attribute mean for all samples belonging to the same class to fill in the missing value smarterbull Use the most probable value to fill in the missing value inference-based such as Bayesian formula or decision tree

bull 2Noisy Data

bull Noise random error or variance in a measured variable

bull Incorrect attribute values may be due tondash faulty data collection instrumentsndash data entry problemsndash data transmission problemsndash technology limitationndash inconsistency in naming convention

bull Other data problems which requires data cleaningndash duplicate recordsndash inconsistent data

How to Handle Noisy Data

bull Binning method

- first sort data and partition into (equi-depth) bins- then one can smooth by bin means smooth by bin median

- smooth by bin boundaries etc

bull Clustering- detect and remove outliers

bull Combined computer and human inspection- detect suspicious values and check by human

bull Regression- smooth by fitting the data into regression functions

Binning Methods for Data SmoothingSorted data for price (in dollars)

48915 21 21 24 25 26 28 29 34

Partition into (equi-depth) bins - Bin 1 4 8 9 15 - Bin 2 21 21 24 25 - Bin 3 26 28 29 34

Smoothing by bin means - Bin 1 9 9 9 9 - Bin 2 23 23 23 23 - Bin 3 29 29 29 29

Smoothing by bin boundaries - Bin 1 4 4 4 15 - Bin 2 21 21 25 25 - Bin 3 26 26 26 34

Cluster Analysis

Regression

Data integration combines data from multiple sources into a

coherent store Schema integration

integrate metadata from different sources entity identification problem identify real world

entities from multiple data sources eg Acust-id Bcust-

Detecting and resolving data value conflicts for the same real world entity attribute values from

different sources are different possible reasons different representations

different scales eg metric vs British units

bull Redundant data occur often when integration of multiple databasesndash The same attribute may have different names in

different databasesndash One attribute may be a ldquoderivedrdquo attribute in

another table eg annual revenuebull Redundant data may be able to be detected by

correlational analysisbull Careful integration of the data from multiple sources

may help reduceavoid redundancies and inconsistencies and improve mining speed and quality

Smoothing remove noise from data Aggregation summarization data cube construction Generalization concept hierarchy climbing Normalization scaled to fall within a small specified

range min-max normalization z-score normalization normalization by decimal scaling

Attributefeature construction New attributes constructed from the given ones

bull min-max normalization

bull z-score normalization

bull normalization by decimal scaling

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(

A

A

devstand

meanvv

_

j

vv

10 Where j is the smallest integer such that Max(| |)lt1v

Min-max normalization to [new_minA new_maxA]

Ex Let income range $12000 to $98000 normalized to [00 10] Then $73600 is mapped to

Z-score normalization (μ mean σ standard deviation)

Ex Let μ = 54000 σ = 16000 Then Normalization by decimal scaling

71600)001(0001200098

0001260073

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(

A

Avv

j

vv

10

Where j is the smallest integer such that Max(|νrsquo|) lt 1

225100016

0005460073

Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume yet closely maintains the integrity of the original data That is mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results

Strategies for data reduction include the following1Data cube aggregation where aggregation operations are applied to the data in the construction of a data cube

2 Attribute subset selection where irrelevant weakly relevant or redundant attributes or dimensions may be detected and removed

3 Dimensionality reduction where encoding mechanisms are used to reduce the data set size

4 Numerosity reduction where the data are replaced or estimated by alternative smaller data representations such as parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods such as clustering sampling and the use of histograms

5 Discretization and concept hierarchy generation where rawdata values for attributes are replaced by ranges or higher conceptual levels

Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes (or dimensions)

The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes

Mining on a reduced set of attributes has an additional benefit

It reduces the number of attributes appearing in the discovered patterns helping to make the patterns easier to understand

The ldquoBestrdquo (and ldquoWorstrdquo) attributes are typically determined using tests of statistical significance which assume that the attributes are independent of one Many other attribute evaluation measures can be used such as the information gain measure used in building decision trees for classification

bull Basic heuristic methods of attribute subset selection include the following techniques some of which are illustrated in Figure

bull 1 Stepwise forward selection The procedure starts with an empty set of attributes as the reduced set The best of the original attributes is determined and added to the reduced set At each subsequent iteration or step the best of the remaining original attributes is added to the set

bull 2 Stepwise backward elimination The procedure starts with the full set of attributes At each step it removes the worst attribute remaining in the set

bull 3 Combination of forward selection and backward elimination The stepwise forward selection and backward elimination methods can be combined so that at each step the procedure selects the best attribute and removes the

worst from among the remaining attributes

bull 4 Decision tree induction Decision tree algorithms such as ID3 C45 and CART were originally intended for classification Decision tree induction constructs a flow chart like structure where each internal (non leaf) node denotes a test on an attribute each branch corresponds to an outcome of the test and each external (leaf) node denotes a class prediction At each node the algorithm chooses the ldquobestrdquo attribute to partition the data into individual classes When decision tree induction is used for attribute subset selection a tree is constructed from the given data All attributes that do not appear in the tree are assumed to be irrelevant The set of attributes appearing in the tree form the reduced subset of attributes

The stopping criteria for the methods may vary The procedure may employ a threshold on the measure used to determine when to stop the attribute selection process

In dimensionality reduction data encoding or transformations are applied so as to obtain a reduced or ldquocompressedrdquo representation of the original data If the original data can be reconstructed from the compressed data without any loss of information the data reduction is called lossless If instead we can reconstruct only an approximation of the original data then the data reduction is called lossy There are several well-tuned algorithms for string compression Although they are typically lossless they allow only limited manipulation of the data

In this section we instead focus on two popular and effective methods of lossy dimensionality reduction wavelet transforms and principal components analysis

bull Wavelet transforms can be applied to multidimensional data such as a data cube

bull This is done by first applying the transform to the first dimension then to the second and so on The computational complexity involved is linear with respect to the number of cells in the cube Wavelet transforms give good results on sparse or skewed data and on data with ordered attributes Lossy compression by wavelets is reportedly better than JPEG compression the current commercial standard Wavelet transforms have many real-world applications including the compression of fingerprint images computer vision analysis of time-series data and data cleaning

bull PCA is computationally inexpensive can be applied to ordered and unordered attributes and can handle sparse data and skewed data Multidimensional data of more than two dimensions can be handled by reducing the problem to two dimensions Principal components may be used as inputs to multiple regression and cluster analysis

bull In comparison with wavelet transforms PCA tends to be better at handling sparse data whereas wavelet transforms are more suitable for data of high dimensionality

ldquoCan we reduce the data volume by choosing alternative lsquosmallerrsquo forms of data representationrdquo

Techniques of numerosity reduction can indeed be applied for this purpose These techniques may be parametric or nonparametric

For parametric methods a model is used to estimate the data so that typically only the data parameters need to be stored instead of the actual data (Outliers may also be stored) Log-linear models which estimate discrete multidimensional probability distributions are an example

Nonparametric methods for storing reduced representations of the data include histograms clustering and sampling

Letrsquos look at each of the numerosity reduction techniques mentioned above

Linear regression Data are modeled to fit a straight line Often uses the least-square method to fit the

line Multiple regression allows a response variable Y

to be modeled as a linear function of multidimensional feature vector

Log-linear model approximates discrete multidimensional probability distributions

bull Linear regression Y = w X + bndash Two regression coefficients w and b specify the

line and are to be estimated by using the data at hand

ndash Using the least squares criterion to the known values of Y1 Y2 hellip X1 X2 hellip

bull Multiple regression Y = b0 + b1 X1 + b2 X2ndash Many nonlinear functions can be transformed into

the above

bull Log-linear modelsndash The multi-way table of joint probabilities is

approximated by a product of lower-order tablesndash Probability p(a b c d) = ab acad bcd

Histograms Histograms use binning to approximate data distributions and are a popular form of data reduction

A histogram for an attribute A partitions the data distribution of A into disjoint subsets or buckets If each bucket represents only a single attribute-valuefrequency pair the buckets are called singleton buckets Often buckets instead represent continuous ranges for the given attribute

There are several partitioning rules including the following

Equal-width In an equal-width histogram the width of each bucket range is uniform

Equal-frequency (or equidepth) In an equal-frequency histogram the buckets are created so that roughly the frequency of each bucket is constant (that is each bucket contains roughly the same number of contiguous data samples)

V-Optimal If we consider all of the possible histograms for a given number of buckets the V-Optimal histogram is the one with the least variance Histogram variance is a weighted sum of the original values that each bucket represents where bucket weight is equal to the number of values in the bucket

MaxDiff In a MaxDiff histogram we consider the difference between each pair of adjacent values

ClusteringClustering techniques consider data tuples as objects They

partition the objects into groups or clusters so that objects within a cluster are ldquosimilarrdquo to one another and ldquodissimilarrdquo to objects in other clusters

In data reduction the cluster representations of the data are used to replace the actual data The effectiveness of this technique depends on the nature of the data It is much more effective for data that can be organized into distinct clusters than for smeared data

Partition data set into clusters based on similarity and store cluster representation (eg centroid and diameter) only

Can be very effective if data is clustered but not if data is ldquosmearedrdquo

Can have hierarchical clustering and be stored in multi-dimensional index tree structures

There are many choices of clustering definitions and clustering algorithms

Cluster analysis will be studied in depth later

SamplingSampling can be used as a data reduction technique because it

allows a large data set to be represented by a much smaller random sample (or subset) of the data Suppose that a large data set D contains N tuples Letrsquos look at the most common ways that we could sample D for data reduction

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample sampling complexity is potentially sublinear to the size of the data

Simple Random sample without replacement

bull For a fixed sample size sampling complexity increases only linearly as the number of data dimensions

bull When applied to data reduction sampling is most commonly used to estimate the answer to an aggregate query

Sampling obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data Simple random sampling may have very poor

performance in the presence of skew Develop adaptive sampling methods

Stratified sampling Approximate the percentage of each class (or

subpopulation of interest) in the overall database Used in conjunction with skewed data

Note Sampling may not reduce database IOs (page at a time)

Sampling with or without Replacement

SRSWOR

(simple random

sample without

replacement)

SRSWR

Raw Data

Raw Data ClusterStratified Sample

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 10: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

bull Boxplots are a popular way of visualizing a distribution A boxplot incorporates the five-number summary as follows

bull Typically the ends of the box are at the quartiles so that the box length is the interquartile range IQR

bull The median is marked by a line within the box

bull Two lines (called whiskers) outside the box extend to the smallest (Minimum) and largest (Maximum) observations

Aside from the bar charts pie charts and line graphs used in most statistical or graphical data presentation software packages there are other popular types of graphs for the display of data summaries and distributions These include histograms quantile plots q-q plots scatter plots and loess curves Such graphs are very helpful for the visual inspection of your data

23 Graphic Displays of Basic Descriptive Data Summaries

3 Data Cleaningbull Data cleaning tasks

Fill in missing valuesIdentify outliers and smooth out noisy data

Correct inconsistent data1) Missing Databull Data is not always available

a Eg many tuples have no recorded value for several attributes such as customer income in sales data

bull Missing data may be due to a equipment malfunction

b inconsistent with other recorded data and thus deletedc data not entered due to misunderstandingd certain data may not be considered important at the time of entrye not register history or changes of the data

f Missing data may need to be inferred

How to Handle Missing Data

bull Ignore the tuple usually done when class label is missing(assuming the tasks in classificationmdashnot effective when the percentage of missing values per attribute varies considerably)

bull Fill in the missing value manually tedious + infeasiblebull Use a global constant to fill in the missing value eg ldquounknownrdquo a new class bull Use the attribute mean to fill in the missing valuebull Use the attribute mean for all samples belonging to the same class to fill in the missing value smarterbull Use the most probable value to fill in the missing value inference-based such as Bayesian formula or decision tree

bull 2Noisy Data

bull Noise random error or variance in a measured variable

bull Incorrect attribute values may be due tondash faulty data collection instrumentsndash data entry problemsndash data transmission problemsndash technology limitationndash inconsistency in naming convention

bull Other data problems which requires data cleaningndash duplicate recordsndash inconsistent data

How to Handle Noisy Data

bull Binning method

- first sort data and partition into (equi-depth) bins- then one can smooth by bin means smooth by bin median

- smooth by bin boundaries etc

bull Clustering- detect and remove outliers

bull Combined computer and human inspection- detect suspicious values and check by human

bull Regression- smooth by fitting the data into regression functions

Binning Methods for Data SmoothingSorted data for price (in dollars)

48915 21 21 24 25 26 28 29 34

Partition into (equi-depth) bins - Bin 1 4 8 9 15 - Bin 2 21 21 24 25 - Bin 3 26 28 29 34

Smoothing by bin means - Bin 1 9 9 9 9 - Bin 2 23 23 23 23 - Bin 3 29 29 29 29

Smoothing by bin boundaries - Bin 1 4 4 4 15 - Bin 2 21 21 25 25 - Bin 3 26 26 26 34

Cluster Analysis

Regression

Data integration combines data from multiple sources into a

coherent store Schema integration

integrate metadata from different sources entity identification problem identify real world

entities from multiple data sources eg Acust-id Bcust-

Detecting and resolving data value conflicts for the same real world entity attribute values from

different sources are different possible reasons different representations

different scales eg metric vs British units

bull Redundant data occur often when integration of multiple databasesndash The same attribute may have different names in

different databasesndash One attribute may be a ldquoderivedrdquo attribute in

another table eg annual revenuebull Redundant data may be able to be detected by

correlational analysisbull Careful integration of the data from multiple sources

may help reduceavoid redundancies and inconsistencies and improve mining speed and quality

Smoothing remove noise from data Aggregation summarization data cube construction Generalization concept hierarchy climbing Normalization scaled to fall within a small specified

range min-max normalization z-score normalization normalization by decimal scaling

Attributefeature construction New attributes constructed from the given ones

bull min-max normalization

bull z-score normalization

bull normalization by decimal scaling

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(

A

A

devstand

meanvv

_

j

vv

10 Where j is the smallest integer such that Max(| |)lt1v

Min-max normalization to [new_minA new_maxA]

Ex Let income range $12000 to $98000 normalized to [00 10] Then $73600 is mapped to

Z-score normalization (μ mean σ standard deviation)

Ex Let μ = 54000 σ = 16000 Then Normalization by decimal scaling

71600)001(0001200098

0001260073

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(

A

Avv

j

vv

10

Where j is the smallest integer such that Max(|νrsquo|) lt 1

225100016

0005460073

Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume yet closely maintains the integrity of the original data That is mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results

Strategies for data reduction include the following1Data cube aggregation where aggregation operations are applied to the data in the construction of a data cube

2 Attribute subset selection where irrelevant weakly relevant or redundant attributes or dimensions may be detected and removed

3 Dimensionality reduction where encoding mechanisms are used to reduce the data set size

4 Numerosity reduction where the data are replaced or estimated by alternative smaller data representations such as parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods such as clustering sampling and the use of histograms

5 Discretization and concept hierarchy generation where rawdata values for attributes are replaced by ranges or higher conceptual levels

Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes (or dimensions)

The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes

Mining on a reduced set of attributes has an additional benefit

It reduces the number of attributes appearing in the discovered patterns helping to make the patterns easier to understand

The ldquoBestrdquo (and ldquoWorstrdquo) attributes are typically determined using tests of statistical significance which assume that the attributes are independent of one Many other attribute evaluation measures can be used such as the information gain measure used in building decision trees for classification

bull Basic heuristic methods of attribute subset selection include the following techniques some of which are illustrated in Figure

bull 1 Stepwise forward selection The procedure starts with an empty set of attributes as the reduced set The best of the original attributes is determined and added to the reduced set At each subsequent iteration or step the best of the remaining original attributes is added to the set

bull 2 Stepwise backward elimination The procedure starts with the full set of attributes At each step it removes the worst attribute remaining in the set

bull 3 Combination of forward selection and backward elimination The stepwise forward selection and backward elimination methods can be combined so that at each step the procedure selects the best attribute and removes the

worst from among the remaining attributes

bull 4 Decision tree induction Decision tree algorithms such as ID3 C45 and CART were originally intended for classification Decision tree induction constructs a flow chart like structure where each internal (non leaf) node denotes a test on an attribute each branch corresponds to an outcome of the test and each external (leaf) node denotes a class prediction At each node the algorithm chooses the ldquobestrdquo attribute to partition the data into individual classes When decision tree induction is used for attribute subset selection a tree is constructed from the given data All attributes that do not appear in the tree are assumed to be irrelevant The set of attributes appearing in the tree form the reduced subset of attributes

The stopping criteria for the methods may vary The procedure may employ a threshold on the measure used to determine when to stop the attribute selection process

In dimensionality reduction data encoding or transformations are applied so as to obtain a reduced or ldquocompressedrdquo representation of the original data If the original data can be reconstructed from the compressed data without any loss of information the data reduction is called lossless If instead we can reconstruct only an approximation of the original data then the data reduction is called lossy There are several well-tuned algorithms for string compression Although they are typically lossless they allow only limited manipulation of the data

In this section we instead focus on two popular and effective methods of lossy dimensionality reduction wavelet transforms and principal components analysis

bull Wavelet transforms can be applied to multidimensional data such as a data cube

bull This is done by first applying the transform to the first dimension then to the second and so on The computational complexity involved is linear with respect to the number of cells in the cube Wavelet transforms give good results on sparse or skewed data and on data with ordered attributes Lossy compression by wavelets is reportedly better than JPEG compression the current commercial standard Wavelet transforms have many real-world applications including the compression of fingerprint images computer vision analysis of time-series data and data cleaning

bull PCA is computationally inexpensive can be applied to ordered and unordered attributes and can handle sparse data and skewed data Multidimensional data of more than two dimensions can be handled by reducing the problem to two dimensions Principal components may be used as inputs to multiple regression and cluster analysis

bull In comparison with wavelet transforms PCA tends to be better at handling sparse data whereas wavelet transforms are more suitable for data of high dimensionality

ldquoCan we reduce the data volume by choosing alternative lsquosmallerrsquo forms of data representationrdquo

Techniques of numerosity reduction can indeed be applied for this purpose These techniques may be parametric or nonparametric

For parametric methods a model is used to estimate the data so that typically only the data parameters need to be stored instead of the actual data (Outliers may also be stored) Log-linear models which estimate discrete multidimensional probability distributions are an example

Nonparametric methods for storing reduced representations of the data include histograms clustering and sampling

Letrsquos look at each of the numerosity reduction techniques mentioned above

Linear regression Data are modeled to fit a straight line Often uses the least-square method to fit the

line Multiple regression allows a response variable Y

to be modeled as a linear function of multidimensional feature vector

Log-linear model approximates discrete multidimensional probability distributions

bull Linear regression Y = w X + bndash Two regression coefficients w and b specify the

line and are to be estimated by using the data at hand

ndash Using the least squares criterion to the known values of Y1 Y2 hellip X1 X2 hellip

bull Multiple regression Y = b0 + b1 X1 + b2 X2ndash Many nonlinear functions can be transformed into

the above

bull Log-linear modelsndash The multi-way table of joint probabilities is

approximated by a product of lower-order tablesndash Probability p(a b c d) = ab acad bcd

Histograms Histograms use binning to approximate data distributions and are a popular form of data reduction

A histogram for an attribute A partitions the data distribution of A into disjoint subsets or buckets If each bucket represents only a single attribute-valuefrequency pair the buckets are called singleton buckets Often buckets instead represent continuous ranges for the given attribute

There are several partitioning rules including the following

Equal-width In an equal-width histogram the width of each bucket range is uniform

Equal-frequency (or equidepth) In an equal-frequency histogram the buckets are created so that roughly the frequency of each bucket is constant (that is each bucket contains roughly the same number of contiguous data samples)

V-Optimal If we consider all of the possible histograms for a given number of buckets the V-Optimal histogram is the one with the least variance Histogram variance is a weighted sum of the original values that each bucket represents where bucket weight is equal to the number of values in the bucket

MaxDiff In a MaxDiff histogram we consider the difference between each pair of adjacent values

ClusteringClustering techniques consider data tuples as objects They

partition the objects into groups or clusters so that objects within a cluster are ldquosimilarrdquo to one another and ldquodissimilarrdquo to objects in other clusters

In data reduction the cluster representations of the data are used to replace the actual data The effectiveness of this technique depends on the nature of the data It is much more effective for data that can be organized into distinct clusters than for smeared data

Partition data set into clusters based on similarity and store cluster representation (eg centroid and diameter) only

Can be very effective if data is clustered but not if data is ldquosmearedrdquo

Can have hierarchical clustering and be stored in multi-dimensional index tree structures

There are many choices of clustering definitions and clustering algorithms

Cluster analysis will be studied in depth later

SamplingSampling can be used as a data reduction technique because it

allows a large data set to be represented by a much smaller random sample (or subset) of the data Suppose that a large data set D contains N tuples Letrsquos look at the most common ways that we could sample D for data reduction

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample sampling complexity is potentially sublinear to the size of the data

Simple Random sample without replacement

bull For a fixed sample size sampling complexity increases only linearly as the number of data dimensions

bull When applied to data reduction sampling is most commonly used to estimate the answer to an aggregate query

Sampling obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data Simple random sampling may have very poor

performance in the presence of skew Develop adaptive sampling methods

Stratified sampling Approximate the percentage of each class (or

subpopulation of interest) in the overall database Used in conjunction with skewed data

Note Sampling may not reduce database IOs (page at a time)

Sampling with or without Replacement

SRSWOR

(simple random

sample without

replacement)

SRSWR

Raw Data

Raw Data ClusterStratified Sample

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 11: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

Aside from the bar charts pie charts and line graphs used in most statistical or graphical data presentation software packages there are other popular types of graphs for the display of data summaries and distributions These include histograms quantile plots q-q plots scatter plots and loess curves Such graphs are very helpful for the visual inspection of your data

23 Graphic Displays of Basic Descriptive Data Summaries

3 Data Cleaningbull Data cleaning tasks

Fill in missing valuesIdentify outliers and smooth out noisy data

Correct inconsistent data1) Missing Databull Data is not always available

a Eg many tuples have no recorded value for several attributes such as customer income in sales data

bull Missing data may be due to a equipment malfunction

b inconsistent with other recorded data and thus deletedc data not entered due to misunderstandingd certain data may not be considered important at the time of entrye not register history or changes of the data

f Missing data may need to be inferred

How to Handle Missing Data

bull Ignore the tuple usually done when class label is missing(assuming the tasks in classificationmdashnot effective when the percentage of missing values per attribute varies considerably)

bull Fill in the missing value manually tedious + infeasiblebull Use a global constant to fill in the missing value eg ldquounknownrdquo a new class bull Use the attribute mean to fill in the missing valuebull Use the attribute mean for all samples belonging to the same class to fill in the missing value smarterbull Use the most probable value to fill in the missing value inference-based such as Bayesian formula or decision tree

bull 2Noisy Data

bull Noise random error or variance in a measured variable

bull Incorrect attribute values may be due tondash faulty data collection instrumentsndash data entry problemsndash data transmission problemsndash technology limitationndash inconsistency in naming convention

bull Other data problems which requires data cleaningndash duplicate recordsndash inconsistent data

How to Handle Noisy Data

bull Binning method

- first sort data and partition into (equi-depth) bins- then one can smooth by bin means smooth by bin median

- smooth by bin boundaries etc

bull Clustering- detect and remove outliers

bull Combined computer and human inspection- detect suspicious values and check by human

bull Regression- smooth by fitting the data into regression functions

Binning Methods for Data SmoothingSorted data for price (in dollars)

48915 21 21 24 25 26 28 29 34

Partition into (equi-depth) bins - Bin 1 4 8 9 15 - Bin 2 21 21 24 25 - Bin 3 26 28 29 34

Smoothing by bin means - Bin 1 9 9 9 9 - Bin 2 23 23 23 23 - Bin 3 29 29 29 29

Smoothing by bin boundaries - Bin 1 4 4 4 15 - Bin 2 21 21 25 25 - Bin 3 26 26 26 34

Cluster Analysis

Regression

Data integration combines data from multiple sources into a

coherent store Schema integration

integrate metadata from different sources entity identification problem identify real world

entities from multiple data sources eg Acust-id Bcust-

Detecting and resolving data value conflicts for the same real world entity attribute values from

different sources are different possible reasons different representations

different scales eg metric vs British units

bull Redundant data occur often when integration of multiple databasesndash The same attribute may have different names in

different databasesndash One attribute may be a ldquoderivedrdquo attribute in

another table eg annual revenuebull Redundant data may be able to be detected by

correlational analysisbull Careful integration of the data from multiple sources

may help reduceavoid redundancies and inconsistencies and improve mining speed and quality

Smoothing remove noise from data Aggregation summarization data cube construction Generalization concept hierarchy climbing Normalization scaled to fall within a small specified

range min-max normalization z-score normalization normalization by decimal scaling

Attributefeature construction New attributes constructed from the given ones

bull min-max normalization

bull z-score normalization

bull normalization by decimal scaling

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(

A

A

devstand

meanvv

_

j

vv

10 Where j is the smallest integer such that Max(| |)lt1v

Min-max normalization to [new_minA new_maxA]

Ex Let income range $12000 to $98000 normalized to [00 10] Then $73600 is mapped to

Z-score normalization (μ mean σ standard deviation)

Ex Let μ = 54000 σ = 16000 Then Normalization by decimal scaling

71600)001(0001200098

0001260073

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(

A

Avv

j

vv

10

Where j is the smallest integer such that Max(|νrsquo|) lt 1

225100016

0005460073

Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume yet closely maintains the integrity of the original data That is mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results

Strategies for data reduction include the following1Data cube aggregation where aggregation operations are applied to the data in the construction of a data cube

2 Attribute subset selection where irrelevant weakly relevant or redundant attributes or dimensions may be detected and removed

3 Dimensionality reduction where encoding mechanisms are used to reduce the data set size

4 Numerosity reduction where the data are replaced or estimated by alternative smaller data representations such as parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods such as clustering sampling and the use of histograms

5 Discretization and concept hierarchy generation where rawdata values for attributes are replaced by ranges or higher conceptual levels

Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes (or dimensions)

The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes

Mining on a reduced set of attributes has an additional benefit

It reduces the number of attributes appearing in the discovered patterns helping to make the patterns easier to understand

The ldquoBestrdquo (and ldquoWorstrdquo) attributes are typically determined using tests of statistical significance which assume that the attributes are independent of one Many other attribute evaluation measures can be used such as the information gain measure used in building decision trees for classification

bull Basic heuristic methods of attribute subset selection include the following techniques some of which are illustrated in Figure

bull 1 Stepwise forward selection The procedure starts with an empty set of attributes as the reduced set The best of the original attributes is determined and added to the reduced set At each subsequent iteration or step the best of the remaining original attributes is added to the set

bull 2 Stepwise backward elimination The procedure starts with the full set of attributes At each step it removes the worst attribute remaining in the set

bull 3 Combination of forward selection and backward elimination The stepwise forward selection and backward elimination methods can be combined so that at each step the procedure selects the best attribute and removes the

worst from among the remaining attributes

bull 4 Decision tree induction Decision tree algorithms such as ID3 C45 and CART were originally intended for classification Decision tree induction constructs a flow chart like structure where each internal (non leaf) node denotes a test on an attribute each branch corresponds to an outcome of the test and each external (leaf) node denotes a class prediction At each node the algorithm chooses the ldquobestrdquo attribute to partition the data into individual classes When decision tree induction is used for attribute subset selection a tree is constructed from the given data All attributes that do not appear in the tree are assumed to be irrelevant The set of attributes appearing in the tree form the reduced subset of attributes

The stopping criteria for the methods may vary The procedure may employ a threshold on the measure used to determine when to stop the attribute selection process

In dimensionality reduction data encoding or transformations are applied so as to obtain a reduced or ldquocompressedrdquo representation of the original data If the original data can be reconstructed from the compressed data without any loss of information the data reduction is called lossless If instead we can reconstruct only an approximation of the original data then the data reduction is called lossy There are several well-tuned algorithms for string compression Although they are typically lossless they allow only limited manipulation of the data

In this section we instead focus on two popular and effective methods of lossy dimensionality reduction wavelet transforms and principal components analysis

bull Wavelet transforms can be applied to multidimensional data such as a data cube

bull This is done by first applying the transform to the first dimension then to the second and so on The computational complexity involved is linear with respect to the number of cells in the cube Wavelet transforms give good results on sparse or skewed data and on data with ordered attributes Lossy compression by wavelets is reportedly better than JPEG compression the current commercial standard Wavelet transforms have many real-world applications including the compression of fingerprint images computer vision analysis of time-series data and data cleaning

bull PCA is computationally inexpensive can be applied to ordered and unordered attributes and can handle sparse data and skewed data Multidimensional data of more than two dimensions can be handled by reducing the problem to two dimensions Principal components may be used as inputs to multiple regression and cluster analysis

bull In comparison with wavelet transforms PCA tends to be better at handling sparse data whereas wavelet transforms are more suitable for data of high dimensionality

ldquoCan we reduce the data volume by choosing alternative lsquosmallerrsquo forms of data representationrdquo

Techniques of numerosity reduction can indeed be applied for this purpose These techniques may be parametric or nonparametric

For parametric methods a model is used to estimate the data so that typically only the data parameters need to be stored instead of the actual data (Outliers may also be stored) Log-linear models which estimate discrete multidimensional probability distributions are an example

Nonparametric methods for storing reduced representations of the data include histograms clustering and sampling

Letrsquos look at each of the numerosity reduction techniques mentioned above

Linear regression Data are modeled to fit a straight line Often uses the least-square method to fit the

line Multiple regression allows a response variable Y

to be modeled as a linear function of multidimensional feature vector

Log-linear model approximates discrete multidimensional probability distributions

bull Linear regression Y = w X + bndash Two regression coefficients w and b specify the

line and are to be estimated by using the data at hand

ndash Using the least squares criterion to the known values of Y1 Y2 hellip X1 X2 hellip

bull Multiple regression Y = b0 + b1 X1 + b2 X2ndash Many nonlinear functions can be transformed into

the above

bull Log-linear modelsndash The multi-way table of joint probabilities is

approximated by a product of lower-order tablesndash Probability p(a b c d) = ab acad bcd

Histograms Histograms use binning to approximate data distributions and are a popular form of data reduction

A histogram for an attribute A partitions the data distribution of A into disjoint subsets or buckets If each bucket represents only a single attribute-valuefrequency pair the buckets are called singleton buckets Often buckets instead represent continuous ranges for the given attribute

There are several partitioning rules including the following

Equal-width In an equal-width histogram the width of each bucket range is uniform

Equal-frequency (or equidepth) In an equal-frequency histogram the buckets are created so that roughly the frequency of each bucket is constant (that is each bucket contains roughly the same number of contiguous data samples)

V-Optimal If we consider all of the possible histograms for a given number of buckets the V-Optimal histogram is the one with the least variance Histogram variance is a weighted sum of the original values that each bucket represents where bucket weight is equal to the number of values in the bucket

MaxDiff In a MaxDiff histogram we consider the difference between each pair of adjacent values

ClusteringClustering techniques consider data tuples as objects They

partition the objects into groups or clusters so that objects within a cluster are ldquosimilarrdquo to one another and ldquodissimilarrdquo to objects in other clusters

In data reduction the cluster representations of the data are used to replace the actual data The effectiveness of this technique depends on the nature of the data It is much more effective for data that can be organized into distinct clusters than for smeared data

Partition data set into clusters based on similarity and store cluster representation (eg centroid and diameter) only

Can be very effective if data is clustered but not if data is ldquosmearedrdquo

Can have hierarchical clustering and be stored in multi-dimensional index tree structures

There are many choices of clustering definitions and clustering algorithms

Cluster analysis will be studied in depth later

SamplingSampling can be used as a data reduction technique because it

allows a large data set to be represented by a much smaller random sample (or subset) of the data Suppose that a large data set D contains N tuples Letrsquos look at the most common ways that we could sample D for data reduction

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample sampling complexity is potentially sublinear to the size of the data

Simple Random sample without replacement

bull For a fixed sample size sampling complexity increases only linearly as the number of data dimensions

bull When applied to data reduction sampling is most commonly used to estimate the answer to an aggregate query

Sampling obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data Simple random sampling may have very poor

performance in the presence of skew Develop adaptive sampling methods

Stratified sampling Approximate the percentage of each class (or

subpopulation of interest) in the overall database Used in conjunction with skewed data

Note Sampling may not reduce database IOs (page at a time)

Sampling with or without Replacement

SRSWOR

(simple random

sample without

replacement)

SRSWR

Raw Data

Raw Data ClusterStratified Sample

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 12: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

3 Data Cleaningbull Data cleaning tasks

Fill in missing valuesIdentify outliers and smooth out noisy data

Correct inconsistent data1) Missing Databull Data is not always available

a Eg many tuples have no recorded value for several attributes such as customer income in sales data

bull Missing data may be due to a equipment malfunction

b inconsistent with other recorded data and thus deletedc data not entered due to misunderstandingd certain data may not be considered important at the time of entrye not register history or changes of the data

f Missing data may need to be inferred

How to Handle Missing Data

bull Ignore the tuple usually done when class label is missing(assuming the tasks in classificationmdashnot effective when the percentage of missing values per attribute varies considerably)

bull Fill in the missing value manually tedious + infeasiblebull Use a global constant to fill in the missing value eg ldquounknownrdquo a new class bull Use the attribute mean to fill in the missing valuebull Use the attribute mean for all samples belonging to the same class to fill in the missing value smarterbull Use the most probable value to fill in the missing value inference-based such as Bayesian formula or decision tree

bull 2Noisy Data

bull Noise random error or variance in a measured variable

bull Incorrect attribute values may be due tondash faulty data collection instrumentsndash data entry problemsndash data transmission problemsndash technology limitationndash inconsistency in naming convention

bull Other data problems which requires data cleaningndash duplicate recordsndash inconsistent data

How to Handle Noisy Data

bull Binning method

- first sort data and partition into (equi-depth) bins- then one can smooth by bin means smooth by bin median

- smooth by bin boundaries etc

bull Clustering- detect and remove outliers

bull Combined computer and human inspection- detect suspicious values and check by human

bull Regression- smooth by fitting the data into regression functions

Binning Methods for Data SmoothingSorted data for price (in dollars)

48915 21 21 24 25 26 28 29 34

Partition into (equi-depth) bins - Bin 1 4 8 9 15 - Bin 2 21 21 24 25 - Bin 3 26 28 29 34

Smoothing by bin means - Bin 1 9 9 9 9 - Bin 2 23 23 23 23 - Bin 3 29 29 29 29

Smoothing by bin boundaries - Bin 1 4 4 4 15 - Bin 2 21 21 25 25 - Bin 3 26 26 26 34

Cluster Analysis

Regression

Data integration combines data from multiple sources into a

coherent store Schema integration

integrate metadata from different sources entity identification problem identify real world

entities from multiple data sources eg Acust-id Bcust-

Detecting and resolving data value conflicts for the same real world entity attribute values from

different sources are different possible reasons different representations

different scales eg metric vs British units

bull Redundant data occur often when integration of multiple databasesndash The same attribute may have different names in

different databasesndash One attribute may be a ldquoderivedrdquo attribute in

another table eg annual revenuebull Redundant data may be able to be detected by

correlational analysisbull Careful integration of the data from multiple sources

may help reduceavoid redundancies and inconsistencies and improve mining speed and quality

Smoothing remove noise from data Aggregation summarization data cube construction Generalization concept hierarchy climbing Normalization scaled to fall within a small specified

range min-max normalization z-score normalization normalization by decimal scaling

Attributefeature construction New attributes constructed from the given ones

bull min-max normalization

bull z-score normalization

bull normalization by decimal scaling

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(

A

A

devstand

meanvv

_

j

vv

10 Where j is the smallest integer such that Max(| |)lt1v

Min-max normalization to [new_minA new_maxA]

Ex Let income range $12000 to $98000 normalized to [00 10] Then $73600 is mapped to

Z-score normalization (μ mean σ standard deviation)

Ex Let μ = 54000 σ = 16000 Then Normalization by decimal scaling

71600)001(0001200098

0001260073

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(

A

Avv

j

vv

10

Where j is the smallest integer such that Max(|νrsquo|) lt 1

225100016

0005460073

Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume yet closely maintains the integrity of the original data That is mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results

Strategies for data reduction include the following1Data cube aggregation where aggregation operations are applied to the data in the construction of a data cube

2 Attribute subset selection where irrelevant weakly relevant or redundant attributes or dimensions may be detected and removed

3 Dimensionality reduction where encoding mechanisms are used to reduce the data set size

4 Numerosity reduction where the data are replaced or estimated by alternative smaller data representations such as parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods such as clustering sampling and the use of histograms

5 Discretization and concept hierarchy generation where rawdata values for attributes are replaced by ranges or higher conceptual levels

Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes (or dimensions)

The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes

Mining on a reduced set of attributes has an additional benefit

It reduces the number of attributes appearing in the discovered patterns helping to make the patterns easier to understand

The ldquoBestrdquo (and ldquoWorstrdquo) attributes are typically determined using tests of statistical significance which assume that the attributes are independent of one Many other attribute evaluation measures can be used such as the information gain measure used in building decision trees for classification

bull Basic heuristic methods of attribute subset selection include the following techniques some of which are illustrated in Figure

bull 1 Stepwise forward selection The procedure starts with an empty set of attributes as the reduced set The best of the original attributes is determined and added to the reduced set At each subsequent iteration or step the best of the remaining original attributes is added to the set

bull 2 Stepwise backward elimination The procedure starts with the full set of attributes At each step it removes the worst attribute remaining in the set

bull 3 Combination of forward selection and backward elimination The stepwise forward selection and backward elimination methods can be combined so that at each step the procedure selects the best attribute and removes the

worst from among the remaining attributes

bull 4 Decision tree induction Decision tree algorithms such as ID3 C45 and CART were originally intended for classification Decision tree induction constructs a flow chart like structure where each internal (non leaf) node denotes a test on an attribute each branch corresponds to an outcome of the test and each external (leaf) node denotes a class prediction At each node the algorithm chooses the ldquobestrdquo attribute to partition the data into individual classes When decision tree induction is used for attribute subset selection a tree is constructed from the given data All attributes that do not appear in the tree are assumed to be irrelevant The set of attributes appearing in the tree form the reduced subset of attributes

The stopping criteria for the methods may vary The procedure may employ a threshold on the measure used to determine when to stop the attribute selection process

In dimensionality reduction data encoding or transformations are applied so as to obtain a reduced or ldquocompressedrdquo representation of the original data If the original data can be reconstructed from the compressed data without any loss of information the data reduction is called lossless If instead we can reconstruct only an approximation of the original data then the data reduction is called lossy There are several well-tuned algorithms for string compression Although they are typically lossless they allow only limited manipulation of the data

In this section we instead focus on two popular and effective methods of lossy dimensionality reduction wavelet transforms and principal components analysis

bull Wavelet transforms can be applied to multidimensional data such as a data cube

bull This is done by first applying the transform to the first dimension then to the second and so on The computational complexity involved is linear with respect to the number of cells in the cube Wavelet transforms give good results on sparse or skewed data and on data with ordered attributes Lossy compression by wavelets is reportedly better than JPEG compression the current commercial standard Wavelet transforms have many real-world applications including the compression of fingerprint images computer vision analysis of time-series data and data cleaning

bull PCA is computationally inexpensive can be applied to ordered and unordered attributes and can handle sparse data and skewed data Multidimensional data of more than two dimensions can be handled by reducing the problem to two dimensions Principal components may be used as inputs to multiple regression and cluster analysis

bull In comparison with wavelet transforms PCA tends to be better at handling sparse data whereas wavelet transforms are more suitable for data of high dimensionality

ldquoCan we reduce the data volume by choosing alternative lsquosmallerrsquo forms of data representationrdquo

Techniques of numerosity reduction can indeed be applied for this purpose These techniques may be parametric or nonparametric

For parametric methods a model is used to estimate the data so that typically only the data parameters need to be stored instead of the actual data (Outliers may also be stored) Log-linear models which estimate discrete multidimensional probability distributions are an example

Nonparametric methods for storing reduced representations of the data include histograms clustering and sampling

Letrsquos look at each of the numerosity reduction techniques mentioned above

Linear regression Data are modeled to fit a straight line Often uses the least-square method to fit the

line Multiple regression allows a response variable Y

to be modeled as a linear function of multidimensional feature vector

Log-linear model approximates discrete multidimensional probability distributions

bull Linear regression Y = w X + bndash Two regression coefficients w and b specify the

line and are to be estimated by using the data at hand

ndash Using the least squares criterion to the known values of Y1 Y2 hellip X1 X2 hellip

bull Multiple regression Y = b0 + b1 X1 + b2 X2ndash Many nonlinear functions can be transformed into

the above

bull Log-linear modelsndash The multi-way table of joint probabilities is

approximated by a product of lower-order tablesndash Probability p(a b c d) = ab acad bcd

Histograms Histograms use binning to approximate data distributions and are a popular form of data reduction

A histogram for an attribute A partitions the data distribution of A into disjoint subsets or buckets If each bucket represents only a single attribute-valuefrequency pair the buckets are called singleton buckets Often buckets instead represent continuous ranges for the given attribute

There are several partitioning rules including the following

Equal-width In an equal-width histogram the width of each bucket range is uniform

Equal-frequency (or equidepth) In an equal-frequency histogram the buckets are created so that roughly the frequency of each bucket is constant (that is each bucket contains roughly the same number of contiguous data samples)

V-Optimal If we consider all of the possible histograms for a given number of buckets the V-Optimal histogram is the one with the least variance Histogram variance is a weighted sum of the original values that each bucket represents where bucket weight is equal to the number of values in the bucket

MaxDiff In a MaxDiff histogram we consider the difference between each pair of adjacent values

ClusteringClustering techniques consider data tuples as objects They

partition the objects into groups or clusters so that objects within a cluster are ldquosimilarrdquo to one another and ldquodissimilarrdquo to objects in other clusters

In data reduction the cluster representations of the data are used to replace the actual data The effectiveness of this technique depends on the nature of the data It is much more effective for data that can be organized into distinct clusters than for smeared data

Partition data set into clusters based on similarity and store cluster representation (eg centroid and diameter) only

Can be very effective if data is clustered but not if data is ldquosmearedrdquo

Can have hierarchical clustering and be stored in multi-dimensional index tree structures

There are many choices of clustering definitions and clustering algorithms

Cluster analysis will be studied in depth later

SamplingSampling can be used as a data reduction technique because it

allows a large data set to be represented by a much smaller random sample (or subset) of the data Suppose that a large data set D contains N tuples Letrsquos look at the most common ways that we could sample D for data reduction

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample sampling complexity is potentially sublinear to the size of the data

Simple Random sample without replacement

bull For a fixed sample size sampling complexity increases only linearly as the number of data dimensions

bull When applied to data reduction sampling is most commonly used to estimate the answer to an aggregate query

Sampling obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data Simple random sampling may have very poor

performance in the presence of skew Develop adaptive sampling methods

Stratified sampling Approximate the percentage of each class (or

subpopulation of interest) in the overall database Used in conjunction with skewed data

Note Sampling may not reduce database IOs (page at a time)

Sampling with or without Replacement

SRSWOR

(simple random

sample without

replacement)

SRSWR

Raw Data

Raw Data ClusterStratified Sample

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 13: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

How to Handle Missing Data

bull Ignore the tuple usually done when class label is missing(assuming the tasks in classificationmdashnot effective when the percentage of missing values per attribute varies considerably)

bull Fill in the missing value manually tedious + infeasiblebull Use a global constant to fill in the missing value eg ldquounknownrdquo a new class bull Use the attribute mean to fill in the missing valuebull Use the attribute mean for all samples belonging to the same class to fill in the missing value smarterbull Use the most probable value to fill in the missing value inference-based such as Bayesian formula or decision tree

bull 2Noisy Data

bull Noise random error or variance in a measured variable

bull Incorrect attribute values may be due tondash faulty data collection instrumentsndash data entry problemsndash data transmission problemsndash technology limitationndash inconsistency in naming convention

bull Other data problems which requires data cleaningndash duplicate recordsndash inconsistent data

How to Handle Noisy Data

bull Binning method

- first sort data and partition into (equi-depth) bins- then one can smooth by bin means smooth by bin median

- smooth by bin boundaries etc

bull Clustering- detect and remove outliers

bull Combined computer and human inspection- detect suspicious values and check by human

bull Regression- smooth by fitting the data into regression functions

Binning Methods for Data SmoothingSorted data for price (in dollars)

48915 21 21 24 25 26 28 29 34

Partition into (equi-depth) bins - Bin 1 4 8 9 15 - Bin 2 21 21 24 25 - Bin 3 26 28 29 34

Smoothing by bin means - Bin 1 9 9 9 9 - Bin 2 23 23 23 23 - Bin 3 29 29 29 29

Smoothing by bin boundaries - Bin 1 4 4 4 15 - Bin 2 21 21 25 25 - Bin 3 26 26 26 34

Cluster Analysis

Regression

Data integration combines data from multiple sources into a

coherent store Schema integration

integrate metadata from different sources entity identification problem identify real world

entities from multiple data sources eg Acust-id Bcust-

Detecting and resolving data value conflicts for the same real world entity attribute values from

different sources are different possible reasons different representations

different scales eg metric vs British units

bull Redundant data occur often when integration of multiple databasesndash The same attribute may have different names in

different databasesndash One attribute may be a ldquoderivedrdquo attribute in

another table eg annual revenuebull Redundant data may be able to be detected by

correlational analysisbull Careful integration of the data from multiple sources

may help reduceavoid redundancies and inconsistencies and improve mining speed and quality

Smoothing remove noise from data Aggregation summarization data cube construction Generalization concept hierarchy climbing Normalization scaled to fall within a small specified

range min-max normalization z-score normalization normalization by decimal scaling

Attributefeature construction New attributes constructed from the given ones

bull min-max normalization

bull z-score normalization

bull normalization by decimal scaling

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(

A

A

devstand

meanvv

_

j

vv

10 Where j is the smallest integer such that Max(| |)lt1v

Min-max normalization to [new_minA new_maxA]

Ex Let income range $12000 to $98000 normalized to [00 10] Then $73600 is mapped to

Z-score normalization (μ mean σ standard deviation)

Ex Let μ = 54000 σ = 16000 Then Normalization by decimal scaling

71600)001(0001200098

0001260073

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(

A

Avv

j

vv

10

Where j is the smallest integer such that Max(|νrsquo|) lt 1

225100016

0005460073

Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume yet closely maintains the integrity of the original data That is mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results

Strategies for data reduction include the following1Data cube aggregation where aggregation operations are applied to the data in the construction of a data cube

2 Attribute subset selection where irrelevant weakly relevant or redundant attributes or dimensions may be detected and removed

3 Dimensionality reduction where encoding mechanisms are used to reduce the data set size

4 Numerosity reduction where the data are replaced or estimated by alternative smaller data representations such as parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods such as clustering sampling and the use of histograms

5 Discretization and concept hierarchy generation where rawdata values for attributes are replaced by ranges or higher conceptual levels

Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes (or dimensions)

The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes

Mining on a reduced set of attributes has an additional benefit

It reduces the number of attributes appearing in the discovered patterns helping to make the patterns easier to understand

The ldquoBestrdquo (and ldquoWorstrdquo) attributes are typically determined using tests of statistical significance which assume that the attributes are independent of one Many other attribute evaluation measures can be used such as the information gain measure used in building decision trees for classification

bull Basic heuristic methods of attribute subset selection include the following techniques some of which are illustrated in Figure

bull 1 Stepwise forward selection The procedure starts with an empty set of attributes as the reduced set The best of the original attributes is determined and added to the reduced set At each subsequent iteration or step the best of the remaining original attributes is added to the set

bull 2 Stepwise backward elimination The procedure starts with the full set of attributes At each step it removes the worst attribute remaining in the set

bull 3 Combination of forward selection and backward elimination The stepwise forward selection and backward elimination methods can be combined so that at each step the procedure selects the best attribute and removes the

worst from among the remaining attributes

bull 4 Decision tree induction Decision tree algorithms such as ID3 C45 and CART were originally intended for classification Decision tree induction constructs a flow chart like structure where each internal (non leaf) node denotes a test on an attribute each branch corresponds to an outcome of the test and each external (leaf) node denotes a class prediction At each node the algorithm chooses the ldquobestrdquo attribute to partition the data into individual classes When decision tree induction is used for attribute subset selection a tree is constructed from the given data All attributes that do not appear in the tree are assumed to be irrelevant The set of attributes appearing in the tree form the reduced subset of attributes

The stopping criteria for the methods may vary The procedure may employ a threshold on the measure used to determine when to stop the attribute selection process

In dimensionality reduction data encoding or transformations are applied so as to obtain a reduced or ldquocompressedrdquo representation of the original data If the original data can be reconstructed from the compressed data without any loss of information the data reduction is called lossless If instead we can reconstruct only an approximation of the original data then the data reduction is called lossy There are several well-tuned algorithms for string compression Although they are typically lossless they allow only limited manipulation of the data

In this section we instead focus on two popular and effective methods of lossy dimensionality reduction wavelet transforms and principal components analysis

bull Wavelet transforms can be applied to multidimensional data such as a data cube

bull This is done by first applying the transform to the first dimension then to the second and so on The computational complexity involved is linear with respect to the number of cells in the cube Wavelet transforms give good results on sparse or skewed data and on data with ordered attributes Lossy compression by wavelets is reportedly better than JPEG compression the current commercial standard Wavelet transforms have many real-world applications including the compression of fingerprint images computer vision analysis of time-series data and data cleaning

bull PCA is computationally inexpensive can be applied to ordered and unordered attributes and can handle sparse data and skewed data Multidimensional data of more than two dimensions can be handled by reducing the problem to two dimensions Principal components may be used as inputs to multiple regression and cluster analysis

bull In comparison with wavelet transforms PCA tends to be better at handling sparse data whereas wavelet transforms are more suitable for data of high dimensionality

ldquoCan we reduce the data volume by choosing alternative lsquosmallerrsquo forms of data representationrdquo

Techniques of numerosity reduction can indeed be applied for this purpose These techniques may be parametric or nonparametric

For parametric methods a model is used to estimate the data so that typically only the data parameters need to be stored instead of the actual data (Outliers may also be stored) Log-linear models which estimate discrete multidimensional probability distributions are an example

Nonparametric methods for storing reduced representations of the data include histograms clustering and sampling

Letrsquos look at each of the numerosity reduction techniques mentioned above

Linear regression Data are modeled to fit a straight line Often uses the least-square method to fit the

line Multiple regression allows a response variable Y

to be modeled as a linear function of multidimensional feature vector

Log-linear model approximates discrete multidimensional probability distributions

bull Linear regression Y = w X + bndash Two regression coefficients w and b specify the

line and are to be estimated by using the data at hand

ndash Using the least squares criterion to the known values of Y1 Y2 hellip X1 X2 hellip

bull Multiple regression Y = b0 + b1 X1 + b2 X2ndash Many nonlinear functions can be transformed into

the above

bull Log-linear modelsndash The multi-way table of joint probabilities is

approximated by a product of lower-order tablesndash Probability p(a b c d) = ab acad bcd

Histograms Histograms use binning to approximate data distributions and are a popular form of data reduction

A histogram for an attribute A partitions the data distribution of A into disjoint subsets or buckets If each bucket represents only a single attribute-valuefrequency pair the buckets are called singleton buckets Often buckets instead represent continuous ranges for the given attribute

There are several partitioning rules including the following

Equal-width In an equal-width histogram the width of each bucket range is uniform

Equal-frequency (or equidepth) In an equal-frequency histogram the buckets are created so that roughly the frequency of each bucket is constant (that is each bucket contains roughly the same number of contiguous data samples)

V-Optimal If we consider all of the possible histograms for a given number of buckets the V-Optimal histogram is the one with the least variance Histogram variance is a weighted sum of the original values that each bucket represents where bucket weight is equal to the number of values in the bucket

MaxDiff In a MaxDiff histogram we consider the difference between each pair of adjacent values

ClusteringClustering techniques consider data tuples as objects They

partition the objects into groups or clusters so that objects within a cluster are ldquosimilarrdquo to one another and ldquodissimilarrdquo to objects in other clusters

In data reduction the cluster representations of the data are used to replace the actual data The effectiveness of this technique depends on the nature of the data It is much more effective for data that can be organized into distinct clusters than for smeared data

Partition data set into clusters based on similarity and store cluster representation (eg centroid and diameter) only

Can be very effective if data is clustered but not if data is ldquosmearedrdquo

Can have hierarchical clustering and be stored in multi-dimensional index tree structures

There are many choices of clustering definitions and clustering algorithms

Cluster analysis will be studied in depth later

SamplingSampling can be used as a data reduction technique because it

allows a large data set to be represented by a much smaller random sample (or subset) of the data Suppose that a large data set D contains N tuples Letrsquos look at the most common ways that we could sample D for data reduction

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample sampling complexity is potentially sublinear to the size of the data

Simple Random sample without replacement

bull For a fixed sample size sampling complexity increases only linearly as the number of data dimensions

bull When applied to data reduction sampling is most commonly used to estimate the answer to an aggregate query

Sampling obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data Simple random sampling may have very poor

performance in the presence of skew Develop adaptive sampling methods

Stratified sampling Approximate the percentage of each class (or

subpopulation of interest) in the overall database Used in conjunction with skewed data

Note Sampling may not reduce database IOs (page at a time)

Sampling with or without Replacement

SRSWOR

(simple random

sample without

replacement)

SRSWR

Raw Data

Raw Data ClusterStratified Sample

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 14: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

bull 2Noisy Data

bull Noise random error or variance in a measured variable

bull Incorrect attribute values may be due tondash faulty data collection instrumentsndash data entry problemsndash data transmission problemsndash technology limitationndash inconsistency in naming convention

bull Other data problems which requires data cleaningndash duplicate recordsndash inconsistent data

How to Handle Noisy Data

bull Binning method

- first sort data and partition into (equi-depth) bins- then one can smooth by bin means smooth by bin median

- smooth by bin boundaries etc

bull Clustering- detect and remove outliers

bull Combined computer and human inspection- detect suspicious values and check by human

bull Regression- smooth by fitting the data into regression functions

Binning Methods for Data SmoothingSorted data for price (in dollars)

48915 21 21 24 25 26 28 29 34

Partition into (equi-depth) bins - Bin 1 4 8 9 15 - Bin 2 21 21 24 25 - Bin 3 26 28 29 34

Smoothing by bin means - Bin 1 9 9 9 9 - Bin 2 23 23 23 23 - Bin 3 29 29 29 29

Smoothing by bin boundaries - Bin 1 4 4 4 15 - Bin 2 21 21 25 25 - Bin 3 26 26 26 34

Cluster Analysis

Regression

Data integration combines data from multiple sources into a

coherent store Schema integration

integrate metadata from different sources entity identification problem identify real world

entities from multiple data sources eg Acust-id Bcust-

Detecting and resolving data value conflicts for the same real world entity attribute values from

different sources are different possible reasons different representations

different scales eg metric vs British units

bull Redundant data occur often when integration of multiple databasesndash The same attribute may have different names in

different databasesndash One attribute may be a ldquoderivedrdquo attribute in

another table eg annual revenuebull Redundant data may be able to be detected by

correlational analysisbull Careful integration of the data from multiple sources

may help reduceavoid redundancies and inconsistencies and improve mining speed and quality

Smoothing remove noise from data Aggregation summarization data cube construction Generalization concept hierarchy climbing Normalization scaled to fall within a small specified

range min-max normalization z-score normalization normalization by decimal scaling

Attributefeature construction New attributes constructed from the given ones

bull min-max normalization

bull z-score normalization

bull normalization by decimal scaling

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(

A

A

devstand

meanvv

_

j

vv

10 Where j is the smallest integer such that Max(| |)lt1v

Min-max normalization to [new_minA new_maxA]

Ex Let income range $12000 to $98000 normalized to [00 10] Then $73600 is mapped to

Z-score normalization (μ mean σ standard deviation)

Ex Let μ = 54000 σ = 16000 Then Normalization by decimal scaling

71600)001(0001200098

0001260073

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(

A

Avv

j

vv

10

Where j is the smallest integer such that Max(|νrsquo|) lt 1

225100016

0005460073

Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume yet closely maintains the integrity of the original data That is mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results

Strategies for data reduction include the following1Data cube aggregation where aggregation operations are applied to the data in the construction of a data cube

2 Attribute subset selection where irrelevant weakly relevant or redundant attributes or dimensions may be detected and removed

3 Dimensionality reduction where encoding mechanisms are used to reduce the data set size

4 Numerosity reduction where the data are replaced or estimated by alternative smaller data representations such as parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods such as clustering sampling and the use of histograms

5 Discretization and concept hierarchy generation where rawdata values for attributes are replaced by ranges or higher conceptual levels

Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes (or dimensions)

The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes

Mining on a reduced set of attributes has an additional benefit

It reduces the number of attributes appearing in the discovered patterns helping to make the patterns easier to understand

The ldquoBestrdquo (and ldquoWorstrdquo) attributes are typically determined using tests of statistical significance which assume that the attributes are independent of one Many other attribute evaluation measures can be used such as the information gain measure used in building decision trees for classification

bull Basic heuristic methods of attribute subset selection include the following techniques some of which are illustrated in Figure

bull 1 Stepwise forward selection The procedure starts with an empty set of attributes as the reduced set The best of the original attributes is determined and added to the reduced set At each subsequent iteration or step the best of the remaining original attributes is added to the set

bull 2 Stepwise backward elimination The procedure starts with the full set of attributes At each step it removes the worst attribute remaining in the set

bull 3 Combination of forward selection and backward elimination The stepwise forward selection and backward elimination methods can be combined so that at each step the procedure selects the best attribute and removes the

worst from among the remaining attributes

bull 4 Decision tree induction Decision tree algorithms such as ID3 C45 and CART were originally intended for classification Decision tree induction constructs a flow chart like structure where each internal (non leaf) node denotes a test on an attribute each branch corresponds to an outcome of the test and each external (leaf) node denotes a class prediction At each node the algorithm chooses the ldquobestrdquo attribute to partition the data into individual classes When decision tree induction is used for attribute subset selection a tree is constructed from the given data All attributes that do not appear in the tree are assumed to be irrelevant The set of attributes appearing in the tree form the reduced subset of attributes

The stopping criteria for the methods may vary The procedure may employ a threshold on the measure used to determine when to stop the attribute selection process

In dimensionality reduction data encoding or transformations are applied so as to obtain a reduced or ldquocompressedrdquo representation of the original data If the original data can be reconstructed from the compressed data without any loss of information the data reduction is called lossless If instead we can reconstruct only an approximation of the original data then the data reduction is called lossy There are several well-tuned algorithms for string compression Although they are typically lossless they allow only limited manipulation of the data

In this section we instead focus on two popular and effective methods of lossy dimensionality reduction wavelet transforms and principal components analysis

bull Wavelet transforms can be applied to multidimensional data such as a data cube

bull This is done by first applying the transform to the first dimension then to the second and so on The computational complexity involved is linear with respect to the number of cells in the cube Wavelet transforms give good results on sparse or skewed data and on data with ordered attributes Lossy compression by wavelets is reportedly better than JPEG compression the current commercial standard Wavelet transforms have many real-world applications including the compression of fingerprint images computer vision analysis of time-series data and data cleaning

bull PCA is computationally inexpensive can be applied to ordered and unordered attributes and can handle sparse data and skewed data Multidimensional data of more than two dimensions can be handled by reducing the problem to two dimensions Principal components may be used as inputs to multiple regression and cluster analysis

bull In comparison with wavelet transforms PCA tends to be better at handling sparse data whereas wavelet transforms are more suitable for data of high dimensionality

ldquoCan we reduce the data volume by choosing alternative lsquosmallerrsquo forms of data representationrdquo

Techniques of numerosity reduction can indeed be applied for this purpose These techniques may be parametric or nonparametric

For parametric methods a model is used to estimate the data so that typically only the data parameters need to be stored instead of the actual data (Outliers may also be stored) Log-linear models which estimate discrete multidimensional probability distributions are an example

Nonparametric methods for storing reduced representations of the data include histograms clustering and sampling

Letrsquos look at each of the numerosity reduction techniques mentioned above

Linear regression Data are modeled to fit a straight line Often uses the least-square method to fit the

line Multiple regression allows a response variable Y

to be modeled as a linear function of multidimensional feature vector

Log-linear model approximates discrete multidimensional probability distributions

bull Linear regression Y = w X + bndash Two regression coefficients w and b specify the

line and are to be estimated by using the data at hand

ndash Using the least squares criterion to the known values of Y1 Y2 hellip X1 X2 hellip

bull Multiple regression Y = b0 + b1 X1 + b2 X2ndash Many nonlinear functions can be transformed into

the above

bull Log-linear modelsndash The multi-way table of joint probabilities is

approximated by a product of lower-order tablesndash Probability p(a b c d) = ab acad bcd

Histograms Histograms use binning to approximate data distributions and are a popular form of data reduction

A histogram for an attribute A partitions the data distribution of A into disjoint subsets or buckets If each bucket represents only a single attribute-valuefrequency pair the buckets are called singleton buckets Often buckets instead represent continuous ranges for the given attribute

There are several partitioning rules including the following

Equal-width In an equal-width histogram the width of each bucket range is uniform

Equal-frequency (or equidepth) In an equal-frequency histogram the buckets are created so that roughly the frequency of each bucket is constant (that is each bucket contains roughly the same number of contiguous data samples)

V-Optimal If we consider all of the possible histograms for a given number of buckets the V-Optimal histogram is the one with the least variance Histogram variance is a weighted sum of the original values that each bucket represents where bucket weight is equal to the number of values in the bucket

MaxDiff In a MaxDiff histogram we consider the difference between each pair of adjacent values

ClusteringClustering techniques consider data tuples as objects They

partition the objects into groups or clusters so that objects within a cluster are ldquosimilarrdquo to one another and ldquodissimilarrdquo to objects in other clusters

In data reduction the cluster representations of the data are used to replace the actual data The effectiveness of this technique depends on the nature of the data It is much more effective for data that can be organized into distinct clusters than for smeared data

Partition data set into clusters based on similarity and store cluster representation (eg centroid and diameter) only

Can be very effective if data is clustered but not if data is ldquosmearedrdquo

Can have hierarchical clustering and be stored in multi-dimensional index tree structures

There are many choices of clustering definitions and clustering algorithms

Cluster analysis will be studied in depth later

SamplingSampling can be used as a data reduction technique because it

allows a large data set to be represented by a much smaller random sample (or subset) of the data Suppose that a large data set D contains N tuples Letrsquos look at the most common ways that we could sample D for data reduction

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample sampling complexity is potentially sublinear to the size of the data

Simple Random sample without replacement

bull For a fixed sample size sampling complexity increases only linearly as the number of data dimensions

bull When applied to data reduction sampling is most commonly used to estimate the answer to an aggregate query

Sampling obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data Simple random sampling may have very poor

performance in the presence of skew Develop adaptive sampling methods

Stratified sampling Approximate the percentage of each class (or

subpopulation of interest) in the overall database Used in conjunction with skewed data

Note Sampling may not reduce database IOs (page at a time)

Sampling with or without Replacement

SRSWOR

(simple random

sample without

replacement)

SRSWR

Raw Data

Raw Data ClusterStratified Sample

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 15: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

How to Handle Noisy Data

bull Binning method

- first sort data and partition into (equi-depth) bins- then one can smooth by bin means smooth by bin median

- smooth by bin boundaries etc

bull Clustering- detect and remove outliers

bull Combined computer and human inspection- detect suspicious values and check by human

bull Regression- smooth by fitting the data into regression functions

Binning Methods for Data SmoothingSorted data for price (in dollars)

48915 21 21 24 25 26 28 29 34

Partition into (equi-depth) bins - Bin 1 4 8 9 15 - Bin 2 21 21 24 25 - Bin 3 26 28 29 34

Smoothing by bin means - Bin 1 9 9 9 9 - Bin 2 23 23 23 23 - Bin 3 29 29 29 29

Smoothing by bin boundaries - Bin 1 4 4 4 15 - Bin 2 21 21 25 25 - Bin 3 26 26 26 34

Cluster Analysis

Regression

Data integration combines data from multiple sources into a

coherent store Schema integration

integrate metadata from different sources entity identification problem identify real world

entities from multiple data sources eg Acust-id Bcust-

Detecting and resolving data value conflicts for the same real world entity attribute values from

different sources are different possible reasons different representations

different scales eg metric vs British units

bull Redundant data occur often when integration of multiple databasesndash The same attribute may have different names in

different databasesndash One attribute may be a ldquoderivedrdquo attribute in

another table eg annual revenuebull Redundant data may be able to be detected by

correlational analysisbull Careful integration of the data from multiple sources

may help reduceavoid redundancies and inconsistencies and improve mining speed and quality

Smoothing remove noise from data Aggregation summarization data cube construction Generalization concept hierarchy climbing Normalization scaled to fall within a small specified

range min-max normalization z-score normalization normalization by decimal scaling

Attributefeature construction New attributes constructed from the given ones

bull min-max normalization

bull z-score normalization

bull normalization by decimal scaling

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(

A

A

devstand

meanvv

_

j

vv

10 Where j is the smallest integer such that Max(| |)lt1v

Min-max normalization to [new_minA new_maxA]

Ex Let income range $12000 to $98000 normalized to [00 10] Then $73600 is mapped to

Z-score normalization (μ mean σ standard deviation)

Ex Let μ = 54000 σ = 16000 Then Normalization by decimal scaling

71600)001(0001200098

0001260073

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(

A

Avv

j

vv

10

Where j is the smallest integer such that Max(|νrsquo|) lt 1

225100016

0005460073

Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume yet closely maintains the integrity of the original data That is mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results

Strategies for data reduction include the following1Data cube aggregation where aggregation operations are applied to the data in the construction of a data cube

2 Attribute subset selection where irrelevant weakly relevant or redundant attributes or dimensions may be detected and removed

3 Dimensionality reduction where encoding mechanisms are used to reduce the data set size

4 Numerosity reduction where the data are replaced or estimated by alternative smaller data representations such as parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods such as clustering sampling and the use of histograms

5 Discretization and concept hierarchy generation where rawdata values for attributes are replaced by ranges or higher conceptual levels

Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes (or dimensions)

The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes

Mining on a reduced set of attributes has an additional benefit

It reduces the number of attributes appearing in the discovered patterns helping to make the patterns easier to understand

The ldquoBestrdquo (and ldquoWorstrdquo) attributes are typically determined using tests of statistical significance which assume that the attributes are independent of one Many other attribute evaluation measures can be used such as the information gain measure used in building decision trees for classification

bull Basic heuristic methods of attribute subset selection include the following techniques some of which are illustrated in Figure

bull 1 Stepwise forward selection The procedure starts with an empty set of attributes as the reduced set The best of the original attributes is determined and added to the reduced set At each subsequent iteration or step the best of the remaining original attributes is added to the set

bull 2 Stepwise backward elimination The procedure starts with the full set of attributes At each step it removes the worst attribute remaining in the set

bull 3 Combination of forward selection and backward elimination The stepwise forward selection and backward elimination methods can be combined so that at each step the procedure selects the best attribute and removes the

worst from among the remaining attributes

bull 4 Decision tree induction Decision tree algorithms such as ID3 C45 and CART were originally intended for classification Decision tree induction constructs a flow chart like structure where each internal (non leaf) node denotes a test on an attribute each branch corresponds to an outcome of the test and each external (leaf) node denotes a class prediction At each node the algorithm chooses the ldquobestrdquo attribute to partition the data into individual classes When decision tree induction is used for attribute subset selection a tree is constructed from the given data All attributes that do not appear in the tree are assumed to be irrelevant The set of attributes appearing in the tree form the reduced subset of attributes

The stopping criteria for the methods may vary The procedure may employ a threshold on the measure used to determine when to stop the attribute selection process

In dimensionality reduction data encoding or transformations are applied so as to obtain a reduced or ldquocompressedrdquo representation of the original data If the original data can be reconstructed from the compressed data without any loss of information the data reduction is called lossless If instead we can reconstruct only an approximation of the original data then the data reduction is called lossy There are several well-tuned algorithms for string compression Although they are typically lossless they allow only limited manipulation of the data

In this section we instead focus on two popular and effective methods of lossy dimensionality reduction wavelet transforms and principal components analysis

bull Wavelet transforms can be applied to multidimensional data such as a data cube

bull This is done by first applying the transform to the first dimension then to the second and so on The computational complexity involved is linear with respect to the number of cells in the cube Wavelet transforms give good results on sparse or skewed data and on data with ordered attributes Lossy compression by wavelets is reportedly better than JPEG compression the current commercial standard Wavelet transforms have many real-world applications including the compression of fingerprint images computer vision analysis of time-series data and data cleaning

bull PCA is computationally inexpensive can be applied to ordered and unordered attributes and can handle sparse data and skewed data Multidimensional data of more than two dimensions can be handled by reducing the problem to two dimensions Principal components may be used as inputs to multiple regression and cluster analysis

bull In comparison with wavelet transforms PCA tends to be better at handling sparse data whereas wavelet transforms are more suitable for data of high dimensionality

ldquoCan we reduce the data volume by choosing alternative lsquosmallerrsquo forms of data representationrdquo

Techniques of numerosity reduction can indeed be applied for this purpose These techniques may be parametric or nonparametric

For parametric methods a model is used to estimate the data so that typically only the data parameters need to be stored instead of the actual data (Outliers may also be stored) Log-linear models which estimate discrete multidimensional probability distributions are an example

Nonparametric methods for storing reduced representations of the data include histograms clustering and sampling

Letrsquos look at each of the numerosity reduction techniques mentioned above

Linear regression Data are modeled to fit a straight line Often uses the least-square method to fit the

line Multiple regression allows a response variable Y

to be modeled as a linear function of multidimensional feature vector

Log-linear model approximates discrete multidimensional probability distributions

bull Linear regression Y = w X + bndash Two regression coefficients w and b specify the

line and are to be estimated by using the data at hand

ndash Using the least squares criterion to the known values of Y1 Y2 hellip X1 X2 hellip

bull Multiple regression Y = b0 + b1 X1 + b2 X2ndash Many nonlinear functions can be transformed into

the above

bull Log-linear modelsndash The multi-way table of joint probabilities is

approximated by a product of lower-order tablesndash Probability p(a b c d) = ab acad bcd

Histograms Histograms use binning to approximate data distributions and are a popular form of data reduction

A histogram for an attribute A partitions the data distribution of A into disjoint subsets or buckets If each bucket represents only a single attribute-valuefrequency pair the buckets are called singleton buckets Often buckets instead represent continuous ranges for the given attribute

There are several partitioning rules including the following

Equal-width In an equal-width histogram the width of each bucket range is uniform

Equal-frequency (or equidepth) In an equal-frequency histogram the buckets are created so that roughly the frequency of each bucket is constant (that is each bucket contains roughly the same number of contiguous data samples)

V-Optimal If we consider all of the possible histograms for a given number of buckets the V-Optimal histogram is the one with the least variance Histogram variance is a weighted sum of the original values that each bucket represents where bucket weight is equal to the number of values in the bucket

MaxDiff In a MaxDiff histogram we consider the difference between each pair of adjacent values

ClusteringClustering techniques consider data tuples as objects They

partition the objects into groups or clusters so that objects within a cluster are ldquosimilarrdquo to one another and ldquodissimilarrdquo to objects in other clusters

In data reduction the cluster representations of the data are used to replace the actual data The effectiveness of this technique depends on the nature of the data It is much more effective for data that can be organized into distinct clusters than for smeared data

Partition data set into clusters based on similarity and store cluster representation (eg centroid and diameter) only

Can be very effective if data is clustered but not if data is ldquosmearedrdquo

Can have hierarchical clustering and be stored in multi-dimensional index tree structures

There are many choices of clustering definitions and clustering algorithms

Cluster analysis will be studied in depth later

SamplingSampling can be used as a data reduction technique because it

allows a large data set to be represented by a much smaller random sample (or subset) of the data Suppose that a large data set D contains N tuples Letrsquos look at the most common ways that we could sample D for data reduction

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample sampling complexity is potentially sublinear to the size of the data

Simple Random sample without replacement

bull For a fixed sample size sampling complexity increases only linearly as the number of data dimensions

bull When applied to data reduction sampling is most commonly used to estimate the answer to an aggregate query

Sampling obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data Simple random sampling may have very poor

performance in the presence of skew Develop adaptive sampling methods

Stratified sampling Approximate the percentage of each class (or

subpopulation of interest) in the overall database Used in conjunction with skewed data

Note Sampling may not reduce database IOs (page at a time)

Sampling with or without Replacement

SRSWOR

(simple random

sample without

replacement)

SRSWR

Raw Data

Raw Data ClusterStratified Sample

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 16: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

Binning Methods for Data SmoothingSorted data for price (in dollars)

48915 21 21 24 25 26 28 29 34

Partition into (equi-depth) bins - Bin 1 4 8 9 15 - Bin 2 21 21 24 25 - Bin 3 26 28 29 34

Smoothing by bin means - Bin 1 9 9 9 9 - Bin 2 23 23 23 23 - Bin 3 29 29 29 29

Smoothing by bin boundaries - Bin 1 4 4 4 15 - Bin 2 21 21 25 25 - Bin 3 26 26 26 34

Cluster Analysis

Regression

Data integration combines data from multiple sources into a

coherent store Schema integration

integrate metadata from different sources entity identification problem identify real world

entities from multiple data sources eg Acust-id Bcust-

Detecting and resolving data value conflicts for the same real world entity attribute values from

different sources are different possible reasons different representations

different scales eg metric vs British units

bull Redundant data occur often when integration of multiple databasesndash The same attribute may have different names in

different databasesndash One attribute may be a ldquoderivedrdquo attribute in

another table eg annual revenuebull Redundant data may be able to be detected by

correlational analysisbull Careful integration of the data from multiple sources

may help reduceavoid redundancies and inconsistencies and improve mining speed and quality

Smoothing remove noise from data Aggregation summarization data cube construction Generalization concept hierarchy climbing Normalization scaled to fall within a small specified

range min-max normalization z-score normalization normalization by decimal scaling

Attributefeature construction New attributes constructed from the given ones

bull min-max normalization

bull z-score normalization

bull normalization by decimal scaling

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(

A

A

devstand

meanvv

_

j

vv

10 Where j is the smallest integer such that Max(| |)lt1v

Min-max normalization to [new_minA new_maxA]

Ex Let income range $12000 to $98000 normalized to [00 10] Then $73600 is mapped to

Z-score normalization (μ mean σ standard deviation)

Ex Let μ = 54000 σ = 16000 Then Normalization by decimal scaling

71600)001(0001200098

0001260073

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(

A

Avv

j

vv

10

Where j is the smallest integer such that Max(|νrsquo|) lt 1

225100016

0005460073

Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume yet closely maintains the integrity of the original data That is mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results

Strategies for data reduction include the following1Data cube aggregation where aggregation operations are applied to the data in the construction of a data cube

2 Attribute subset selection where irrelevant weakly relevant or redundant attributes or dimensions may be detected and removed

3 Dimensionality reduction where encoding mechanisms are used to reduce the data set size

4 Numerosity reduction where the data are replaced or estimated by alternative smaller data representations such as parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods such as clustering sampling and the use of histograms

5 Discretization and concept hierarchy generation where rawdata values for attributes are replaced by ranges or higher conceptual levels

Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes (or dimensions)

The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes

Mining on a reduced set of attributes has an additional benefit

It reduces the number of attributes appearing in the discovered patterns helping to make the patterns easier to understand

The ldquoBestrdquo (and ldquoWorstrdquo) attributes are typically determined using tests of statistical significance which assume that the attributes are independent of one Many other attribute evaluation measures can be used such as the information gain measure used in building decision trees for classification

bull Basic heuristic methods of attribute subset selection include the following techniques some of which are illustrated in Figure

bull 1 Stepwise forward selection The procedure starts with an empty set of attributes as the reduced set The best of the original attributes is determined and added to the reduced set At each subsequent iteration or step the best of the remaining original attributes is added to the set

bull 2 Stepwise backward elimination The procedure starts with the full set of attributes At each step it removes the worst attribute remaining in the set

bull 3 Combination of forward selection and backward elimination The stepwise forward selection and backward elimination methods can be combined so that at each step the procedure selects the best attribute and removes the

worst from among the remaining attributes

bull 4 Decision tree induction Decision tree algorithms such as ID3 C45 and CART were originally intended for classification Decision tree induction constructs a flow chart like structure where each internal (non leaf) node denotes a test on an attribute each branch corresponds to an outcome of the test and each external (leaf) node denotes a class prediction At each node the algorithm chooses the ldquobestrdquo attribute to partition the data into individual classes When decision tree induction is used for attribute subset selection a tree is constructed from the given data All attributes that do not appear in the tree are assumed to be irrelevant The set of attributes appearing in the tree form the reduced subset of attributes

The stopping criteria for the methods may vary The procedure may employ a threshold on the measure used to determine when to stop the attribute selection process

In dimensionality reduction data encoding or transformations are applied so as to obtain a reduced or ldquocompressedrdquo representation of the original data If the original data can be reconstructed from the compressed data without any loss of information the data reduction is called lossless If instead we can reconstruct only an approximation of the original data then the data reduction is called lossy There are several well-tuned algorithms for string compression Although they are typically lossless they allow only limited manipulation of the data

In this section we instead focus on two popular and effective methods of lossy dimensionality reduction wavelet transforms and principal components analysis

bull Wavelet transforms can be applied to multidimensional data such as a data cube

bull This is done by first applying the transform to the first dimension then to the second and so on The computational complexity involved is linear with respect to the number of cells in the cube Wavelet transforms give good results on sparse or skewed data and on data with ordered attributes Lossy compression by wavelets is reportedly better than JPEG compression the current commercial standard Wavelet transforms have many real-world applications including the compression of fingerprint images computer vision analysis of time-series data and data cleaning

bull PCA is computationally inexpensive can be applied to ordered and unordered attributes and can handle sparse data and skewed data Multidimensional data of more than two dimensions can be handled by reducing the problem to two dimensions Principal components may be used as inputs to multiple regression and cluster analysis

bull In comparison with wavelet transforms PCA tends to be better at handling sparse data whereas wavelet transforms are more suitable for data of high dimensionality

ldquoCan we reduce the data volume by choosing alternative lsquosmallerrsquo forms of data representationrdquo

Techniques of numerosity reduction can indeed be applied for this purpose These techniques may be parametric or nonparametric

For parametric methods a model is used to estimate the data so that typically only the data parameters need to be stored instead of the actual data (Outliers may also be stored) Log-linear models which estimate discrete multidimensional probability distributions are an example

Nonparametric methods for storing reduced representations of the data include histograms clustering and sampling

Letrsquos look at each of the numerosity reduction techniques mentioned above

Linear regression Data are modeled to fit a straight line Often uses the least-square method to fit the

line Multiple regression allows a response variable Y

to be modeled as a linear function of multidimensional feature vector

Log-linear model approximates discrete multidimensional probability distributions

bull Linear regression Y = w X + bndash Two regression coefficients w and b specify the

line and are to be estimated by using the data at hand

ndash Using the least squares criterion to the known values of Y1 Y2 hellip X1 X2 hellip

bull Multiple regression Y = b0 + b1 X1 + b2 X2ndash Many nonlinear functions can be transformed into

the above

bull Log-linear modelsndash The multi-way table of joint probabilities is

approximated by a product of lower-order tablesndash Probability p(a b c d) = ab acad bcd

Histograms Histograms use binning to approximate data distributions and are a popular form of data reduction

A histogram for an attribute A partitions the data distribution of A into disjoint subsets or buckets If each bucket represents only a single attribute-valuefrequency pair the buckets are called singleton buckets Often buckets instead represent continuous ranges for the given attribute

There are several partitioning rules including the following

Equal-width In an equal-width histogram the width of each bucket range is uniform

Equal-frequency (or equidepth) In an equal-frequency histogram the buckets are created so that roughly the frequency of each bucket is constant (that is each bucket contains roughly the same number of contiguous data samples)

V-Optimal If we consider all of the possible histograms for a given number of buckets the V-Optimal histogram is the one with the least variance Histogram variance is a weighted sum of the original values that each bucket represents where bucket weight is equal to the number of values in the bucket

MaxDiff In a MaxDiff histogram we consider the difference between each pair of adjacent values

ClusteringClustering techniques consider data tuples as objects They

partition the objects into groups or clusters so that objects within a cluster are ldquosimilarrdquo to one another and ldquodissimilarrdquo to objects in other clusters

In data reduction the cluster representations of the data are used to replace the actual data The effectiveness of this technique depends on the nature of the data It is much more effective for data that can be organized into distinct clusters than for smeared data

Partition data set into clusters based on similarity and store cluster representation (eg centroid and diameter) only

Can be very effective if data is clustered but not if data is ldquosmearedrdquo

Can have hierarchical clustering and be stored in multi-dimensional index tree structures

There are many choices of clustering definitions and clustering algorithms

Cluster analysis will be studied in depth later

SamplingSampling can be used as a data reduction technique because it

allows a large data set to be represented by a much smaller random sample (or subset) of the data Suppose that a large data set D contains N tuples Letrsquos look at the most common ways that we could sample D for data reduction

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample sampling complexity is potentially sublinear to the size of the data

Simple Random sample without replacement

bull For a fixed sample size sampling complexity increases only linearly as the number of data dimensions

bull When applied to data reduction sampling is most commonly used to estimate the answer to an aggregate query

Sampling obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data Simple random sampling may have very poor

performance in the presence of skew Develop adaptive sampling methods

Stratified sampling Approximate the percentage of each class (or

subpopulation of interest) in the overall database Used in conjunction with skewed data

Note Sampling may not reduce database IOs (page at a time)

Sampling with or without Replacement

SRSWOR

(simple random

sample without

replacement)

SRSWR

Raw Data

Raw Data ClusterStratified Sample

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 17: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

Cluster Analysis

Regression

Data integration combines data from multiple sources into a

coherent store Schema integration

integrate metadata from different sources entity identification problem identify real world

entities from multiple data sources eg Acust-id Bcust-

Detecting and resolving data value conflicts for the same real world entity attribute values from

different sources are different possible reasons different representations

different scales eg metric vs British units

bull Redundant data occur often when integration of multiple databasesndash The same attribute may have different names in

different databasesndash One attribute may be a ldquoderivedrdquo attribute in

another table eg annual revenuebull Redundant data may be able to be detected by

correlational analysisbull Careful integration of the data from multiple sources

may help reduceavoid redundancies and inconsistencies and improve mining speed and quality

Smoothing remove noise from data Aggregation summarization data cube construction Generalization concept hierarchy climbing Normalization scaled to fall within a small specified

range min-max normalization z-score normalization normalization by decimal scaling

Attributefeature construction New attributes constructed from the given ones

bull min-max normalization

bull z-score normalization

bull normalization by decimal scaling

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(

A

A

devstand

meanvv

_

j

vv

10 Where j is the smallest integer such that Max(| |)lt1v

Min-max normalization to [new_minA new_maxA]

Ex Let income range $12000 to $98000 normalized to [00 10] Then $73600 is mapped to

Z-score normalization (μ mean σ standard deviation)

Ex Let μ = 54000 σ = 16000 Then Normalization by decimal scaling

71600)001(0001200098

0001260073

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(

A

Avv

j

vv

10

Where j is the smallest integer such that Max(|νrsquo|) lt 1

225100016

0005460073

Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume yet closely maintains the integrity of the original data That is mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results

Strategies for data reduction include the following1Data cube aggregation where aggregation operations are applied to the data in the construction of a data cube

2 Attribute subset selection where irrelevant weakly relevant or redundant attributes or dimensions may be detected and removed

3 Dimensionality reduction where encoding mechanisms are used to reduce the data set size

4 Numerosity reduction where the data are replaced or estimated by alternative smaller data representations such as parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods such as clustering sampling and the use of histograms

5 Discretization and concept hierarchy generation where rawdata values for attributes are replaced by ranges or higher conceptual levels

Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes (or dimensions)

The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes

Mining on a reduced set of attributes has an additional benefit

It reduces the number of attributes appearing in the discovered patterns helping to make the patterns easier to understand

The ldquoBestrdquo (and ldquoWorstrdquo) attributes are typically determined using tests of statistical significance which assume that the attributes are independent of one Many other attribute evaluation measures can be used such as the information gain measure used in building decision trees for classification

bull Basic heuristic methods of attribute subset selection include the following techniques some of which are illustrated in Figure

bull 1 Stepwise forward selection The procedure starts with an empty set of attributes as the reduced set The best of the original attributes is determined and added to the reduced set At each subsequent iteration or step the best of the remaining original attributes is added to the set

bull 2 Stepwise backward elimination The procedure starts with the full set of attributes At each step it removes the worst attribute remaining in the set

bull 3 Combination of forward selection and backward elimination The stepwise forward selection and backward elimination methods can be combined so that at each step the procedure selects the best attribute and removes the

worst from among the remaining attributes

bull 4 Decision tree induction Decision tree algorithms such as ID3 C45 and CART were originally intended for classification Decision tree induction constructs a flow chart like structure where each internal (non leaf) node denotes a test on an attribute each branch corresponds to an outcome of the test and each external (leaf) node denotes a class prediction At each node the algorithm chooses the ldquobestrdquo attribute to partition the data into individual classes When decision tree induction is used for attribute subset selection a tree is constructed from the given data All attributes that do not appear in the tree are assumed to be irrelevant The set of attributes appearing in the tree form the reduced subset of attributes

The stopping criteria for the methods may vary The procedure may employ a threshold on the measure used to determine when to stop the attribute selection process

In dimensionality reduction data encoding or transformations are applied so as to obtain a reduced or ldquocompressedrdquo representation of the original data If the original data can be reconstructed from the compressed data without any loss of information the data reduction is called lossless If instead we can reconstruct only an approximation of the original data then the data reduction is called lossy There are several well-tuned algorithms for string compression Although they are typically lossless they allow only limited manipulation of the data

In this section we instead focus on two popular and effective methods of lossy dimensionality reduction wavelet transforms and principal components analysis

bull Wavelet transforms can be applied to multidimensional data such as a data cube

bull This is done by first applying the transform to the first dimension then to the second and so on The computational complexity involved is linear with respect to the number of cells in the cube Wavelet transforms give good results on sparse or skewed data and on data with ordered attributes Lossy compression by wavelets is reportedly better than JPEG compression the current commercial standard Wavelet transforms have many real-world applications including the compression of fingerprint images computer vision analysis of time-series data and data cleaning

bull PCA is computationally inexpensive can be applied to ordered and unordered attributes and can handle sparse data and skewed data Multidimensional data of more than two dimensions can be handled by reducing the problem to two dimensions Principal components may be used as inputs to multiple regression and cluster analysis

bull In comparison with wavelet transforms PCA tends to be better at handling sparse data whereas wavelet transforms are more suitable for data of high dimensionality

ldquoCan we reduce the data volume by choosing alternative lsquosmallerrsquo forms of data representationrdquo

Techniques of numerosity reduction can indeed be applied for this purpose These techniques may be parametric or nonparametric

For parametric methods a model is used to estimate the data so that typically only the data parameters need to be stored instead of the actual data (Outliers may also be stored) Log-linear models which estimate discrete multidimensional probability distributions are an example

Nonparametric methods for storing reduced representations of the data include histograms clustering and sampling

Letrsquos look at each of the numerosity reduction techniques mentioned above

Linear regression Data are modeled to fit a straight line Often uses the least-square method to fit the

line Multiple regression allows a response variable Y

to be modeled as a linear function of multidimensional feature vector

Log-linear model approximates discrete multidimensional probability distributions

bull Linear regression Y = w X + bndash Two regression coefficients w and b specify the

line and are to be estimated by using the data at hand

ndash Using the least squares criterion to the known values of Y1 Y2 hellip X1 X2 hellip

bull Multiple regression Y = b0 + b1 X1 + b2 X2ndash Many nonlinear functions can be transformed into

the above

bull Log-linear modelsndash The multi-way table of joint probabilities is

approximated by a product of lower-order tablesndash Probability p(a b c d) = ab acad bcd

Histograms Histograms use binning to approximate data distributions and are a popular form of data reduction

A histogram for an attribute A partitions the data distribution of A into disjoint subsets or buckets If each bucket represents only a single attribute-valuefrequency pair the buckets are called singleton buckets Often buckets instead represent continuous ranges for the given attribute

There are several partitioning rules including the following

Equal-width In an equal-width histogram the width of each bucket range is uniform

Equal-frequency (or equidepth) In an equal-frequency histogram the buckets are created so that roughly the frequency of each bucket is constant (that is each bucket contains roughly the same number of contiguous data samples)

V-Optimal If we consider all of the possible histograms for a given number of buckets the V-Optimal histogram is the one with the least variance Histogram variance is a weighted sum of the original values that each bucket represents where bucket weight is equal to the number of values in the bucket

MaxDiff In a MaxDiff histogram we consider the difference between each pair of adjacent values

ClusteringClustering techniques consider data tuples as objects They

partition the objects into groups or clusters so that objects within a cluster are ldquosimilarrdquo to one another and ldquodissimilarrdquo to objects in other clusters

In data reduction the cluster representations of the data are used to replace the actual data The effectiveness of this technique depends on the nature of the data It is much more effective for data that can be organized into distinct clusters than for smeared data

Partition data set into clusters based on similarity and store cluster representation (eg centroid and diameter) only

Can be very effective if data is clustered but not if data is ldquosmearedrdquo

Can have hierarchical clustering and be stored in multi-dimensional index tree structures

There are many choices of clustering definitions and clustering algorithms

Cluster analysis will be studied in depth later

SamplingSampling can be used as a data reduction technique because it

allows a large data set to be represented by a much smaller random sample (or subset) of the data Suppose that a large data set D contains N tuples Letrsquos look at the most common ways that we could sample D for data reduction

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample sampling complexity is potentially sublinear to the size of the data

Simple Random sample without replacement

bull For a fixed sample size sampling complexity increases only linearly as the number of data dimensions

bull When applied to data reduction sampling is most commonly used to estimate the answer to an aggregate query

Sampling obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data Simple random sampling may have very poor

performance in the presence of skew Develop adaptive sampling methods

Stratified sampling Approximate the percentage of each class (or

subpopulation of interest) in the overall database Used in conjunction with skewed data

Note Sampling may not reduce database IOs (page at a time)

Sampling with or without Replacement

SRSWOR

(simple random

sample without

replacement)

SRSWR

Raw Data

Raw Data ClusterStratified Sample

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 18: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

Regression

Data integration combines data from multiple sources into a

coherent store Schema integration

integrate metadata from different sources entity identification problem identify real world

entities from multiple data sources eg Acust-id Bcust-

Detecting and resolving data value conflicts for the same real world entity attribute values from

different sources are different possible reasons different representations

different scales eg metric vs British units

bull Redundant data occur often when integration of multiple databasesndash The same attribute may have different names in

different databasesndash One attribute may be a ldquoderivedrdquo attribute in

another table eg annual revenuebull Redundant data may be able to be detected by

correlational analysisbull Careful integration of the data from multiple sources

may help reduceavoid redundancies and inconsistencies and improve mining speed and quality

Smoothing remove noise from data Aggregation summarization data cube construction Generalization concept hierarchy climbing Normalization scaled to fall within a small specified

range min-max normalization z-score normalization normalization by decimal scaling

Attributefeature construction New attributes constructed from the given ones

bull min-max normalization

bull z-score normalization

bull normalization by decimal scaling

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(

A

A

devstand

meanvv

_

j

vv

10 Where j is the smallest integer such that Max(| |)lt1v

Min-max normalization to [new_minA new_maxA]

Ex Let income range $12000 to $98000 normalized to [00 10] Then $73600 is mapped to

Z-score normalization (μ mean σ standard deviation)

Ex Let μ = 54000 σ = 16000 Then Normalization by decimal scaling

71600)001(0001200098

0001260073

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(

A

Avv

j

vv

10

Where j is the smallest integer such that Max(|νrsquo|) lt 1

225100016

0005460073

Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume yet closely maintains the integrity of the original data That is mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results

Strategies for data reduction include the following1Data cube aggregation where aggregation operations are applied to the data in the construction of a data cube

2 Attribute subset selection where irrelevant weakly relevant or redundant attributes or dimensions may be detected and removed

3 Dimensionality reduction where encoding mechanisms are used to reduce the data set size

4 Numerosity reduction where the data are replaced or estimated by alternative smaller data representations such as parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods such as clustering sampling and the use of histograms

5 Discretization and concept hierarchy generation where rawdata values for attributes are replaced by ranges or higher conceptual levels

Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes (or dimensions)

The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes

Mining on a reduced set of attributes has an additional benefit

It reduces the number of attributes appearing in the discovered patterns helping to make the patterns easier to understand

The ldquoBestrdquo (and ldquoWorstrdquo) attributes are typically determined using tests of statistical significance which assume that the attributes are independent of one Many other attribute evaluation measures can be used such as the information gain measure used in building decision trees for classification

bull Basic heuristic methods of attribute subset selection include the following techniques some of which are illustrated in Figure

bull 1 Stepwise forward selection The procedure starts with an empty set of attributes as the reduced set The best of the original attributes is determined and added to the reduced set At each subsequent iteration or step the best of the remaining original attributes is added to the set

bull 2 Stepwise backward elimination The procedure starts with the full set of attributes At each step it removes the worst attribute remaining in the set

bull 3 Combination of forward selection and backward elimination The stepwise forward selection and backward elimination methods can be combined so that at each step the procedure selects the best attribute and removes the

worst from among the remaining attributes

bull 4 Decision tree induction Decision tree algorithms such as ID3 C45 and CART were originally intended for classification Decision tree induction constructs a flow chart like structure where each internal (non leaf) node denotes a test on an attribute each branch corresponds to an outcome of the test and each external (leaf) node denotes a class prediction At each node the algorithm chooses the ldquobestrdquo attribute to partition the data into individual classes When decision tree induction is used for attribute subset selection a tree is constructed from the given data All attributes that do not appear in the tree are assumed to be irrelevant The set of attributes appearing in the tree form the reduced subset of attributes

The stopping criteria for the methods may vary The procedure may employ a threshold on the measure used to determine when to stop the attribute selection process

In dimensionality reduction data encoding or transformations are applied so as to obtain a reduced or ldquocompressedrdquo representation of the original data If the original data can be reconstructed from the compressed data without any loss of information the data reduction is called lossless If instead we can reconstruct only an approximation of the original data then the data reduction is called lossy There are several well-tuned algorithms for string compression Although they are typically lossless they allow only limited manipulation of the data

In this section we instead focus on two popular and effective methods of lossy dimensionality reduction wavelet transforms and principal components analysis

bull Wavelet transforms can be applied to multidimensional data such as a data cube

bull This is done by first applying the transform to the first dimension then to the second and so on The computational complexity involved is linear with respect to the number of cells in the cube Wavelet transforms give good results on sparse or skewed data and on data with ordered attributes Lossy compression by wavelets is reportedly better than JPEG compression the current commercial standard Wavelet transforms have many real-world applications including the compression of fingerprint images computer vision analysis of time-series data and data cleaning

bull PCA is computationally inexpensive can be applied to ordered and unordered attributes and can handle sparse data and skewed data Multidimensional data of more than two dimensions can be handled by reducing the problem to two dimensions Principal components may be used as inputs to multiple regression and cluster analysis

bull In comparison with wavelet transforms PCA tends to be better at handling sparse data whereas wavelet transforms are more suitable for data of high dimensionality

ldquoCan we reduce the data volume by choosing alternative lsquosmallerrsquo forms of data representationrdquo

Techniques of numerosity reduction can indeed be applied for this purpose These techniques may be parametric or nonparametric

For parametric methods a model is used to estimate the data so that typically only the data parameters need to be stored instead of the actual data (Outliers may also be stored) Log-linear models which estimate discrete multidimensional probability distributions are an example

Nonparametric methods for storing reduced representations of the data include histograms clustering and sampling

Letrsquos look at each of the numerosity reduction techniques mentioned above

Linear regression Data are modeled to fit a straight line Often uses the least-square method to fit the

line Multiple regression allows a response variable Y

to be modeled as a linear function of multidimensional feature vector

Log-linear model approximates discrete multidimensional probability distributions

bull Linear regression Y = w X + bndash Two regression coefficients w and b specify the

line and are to be estimated by using the data at hand

ndash Using the least squares criterion to the known values of Y1 Y2 hellip X1 X2 hellip

bull Multiple regression Y = b0 + b1 X1 + b2 X2ndash Many nonlinear functions can be transformed into

the above

bull Log-linear modelsndash The multi-way table of joint probabilities is

approximated by a product of lower-order tablesndash Probability p(a b c d) = ab acad bcd

Histograms Histograms use binning to approximate data distributions and are a popular form of data reduction

A histogram for an attribute A partitions the data distribution of A into disjoint subsets or buckets If each bucket represents only a single attribute-valuefrequency pair the buckets are called singleton buckets Often buckets instead represent continuous ranges for the given attribute

There are several partitioning rules including the following

Equal-width In an equal-width histogram the width of each bucket range is uniform

Equal-frequency (or equidepth) In an equal-frequency histogram the buckets are created so that roughly the frequency of each bucket is constant (that is each bucket contains roughly the same number of contiguous data samples)

V-Optimal If we consider all of the possible histograms for a given number of buckets the V-Optimal histogram is the one with the least variance Histogram variance is a weighted sum of the original values that each bucket represents where bucket weight is equal to the number of values in the bucket

MaxDiff In a MaxDiff histogram we consider the difference between each pair of adjacent values

ClusteringClustering techniques consider data tuples as objects They

partition the objects into groups or clusters so that objects within a cluster are ldquosimilarrdquo to one another and ldquodissimilarrdquo to objects in other clusters

In data reduction the cluster representations of the data are used to replace the actual data The effectiveness of this technique depends on the nature of the data It is much more effective for data that can be organized into distinct clusters than for smeared data

Partition data set into clusters based on similarity and store cluster representation (eg centroid and diameter) only

Can be very effective if data is clustered but not if data is ldquosmearedrdquo

Can have hierarchical clustering and be stored in multi-dimensional index tree structures

There are many choices of clustering definitions and clustering algorithms

Cluster analysis will be studied in depth later

SamplingSampling can be used as a data reduction technique because it

allows a large data set to be represented by a much smaller random sample (or subset) of the data Suppose that a large data set D contains N tuples Letrsquos look at the most common ways that we could sample D for data reduction

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample sampling complexity is potentially sublinear to the size of the data

Simple Random sample without replacement

bull For a fixed sample size sampling complexity increases only linearly as the number of data dimensions

bull When applied to data reduction sampling is most commonly used to estimate the answer to an aggregate query

Sampling obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data Simple random sampling may have very poor

performance in the presence of skew Develop adaptive sampling methods

Stratified sampling Approximate the percentage of each class (or

subpopulation of interest) in the overall database Used in conjunction with skewed data

Note Sampling may not reduce database IOs (page at a time)

Sampling with or without Replacement

SRSWOR

(simple random

sample without

replacement)

SRSWR

Raw Data

Raw Data ClusterStratified Sample

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 19: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

Data integration combines data from multiple sources into a

coherent store Schema integration

integrate metadata from different sources entity identification problem identify real world

entities from multiple data sources eg Acust-id Bcust-

Detecting and resolving data value conflicts for the same real world entity attribute values from

different sources are different possible reasons different representations

different scales eg metric vs British units

bull Redundant data occur often when integration of multiple databasesndash The same attribute may have different names in

different databasesndash One attribute may be a ldquoderivedrdquo attribute in

another table eg annual revenuebull Redundant data may be able to be detected by

correlational analysisbull Careful integration of the data from multiple sources

may help reduceavoid redundancies and inconsistencies and improve mining speed and quality

Smoothing remove noise from data Aggregation summarization data cube construction Generalization concept hierarchy climbing Normalization scaled to fall within a small specified

range min-max normalization z-score normalization normalization by decimal scaling

Attributefeature construction New attributes constructed from the given ones

bull min-max normalization

bull z-score normalization

bull normalization by decimal scaling

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(

A

A

devstand

meanvv

_

j

vv

10 Where j is the smallest integer such that Max(| |)lt1v

Min-max normalization to [new_minA new_maxA]

Ex Let income range $12000 to $98000 normalized to [00 10] Then $73600 is mapped to

Z-score normalization (μ mean σ standard deviation)

Ex Let μ = 54000 σ = 16000 Then Normalization by decimal scaling

71600)001(0001200098

0001260073

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(

A

Avv

j

vv

10

Where j is the smallest integer such that Max(|νrsquo|) lt 1

225100016

0005460073

Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume yet closely maintains the integrity of the original data That is mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results

Strategies for data reduction include the following1Data cube aggregation where aggregation operations are applied to the data in the construction of a data cube

2 Attribute subset selection where irrelevant weakly relevant or redundant attributes or dimensions may be detected and removed

3 Dimensionality reduction where encoding mechanisms are used to reduce the data set size

4 Numerosity reduction where the data are replaced or estimated by alternative smaller data representations such as parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods such as clustering sampling and the use of histograms

5 Discretization and concept hierarchy generation where rawdata values for attributes are replaced by ranges or higher conceptual levels

Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes (or dimensions)

The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes

Mining on a reduced set of attributes has an additional benefit

It reduces the number of attributes appearing in the discovered patterns helping to make the patterns easier to understand

The ldquoBestrdquo (and ldquoWorstrdquo) attributes are typically determined using tests of statistical significance which assume that the attributes are independent of one Many other attribute evaluation measures can be used such as the information gain measure used in building decision trees for classification

bull Basic heuristic methods of attribute subset selection include the following techniques some of which are illustrated in Figure

bull 1 Stepwise forward selection The procedure starts with an empty set of attributes as the reduced set The best of the original attributes is determined and added to the reduced set At each subsequent iteration or step the best of the remaining original attributes is added to the set

bull 2 Stepwise backward elimination The procedure starts with the full set of attributes At each step it removes the worst attribute remaining in the set

bull 3 Combination of forward selection and backward elimination The stepwise forward selection and backward elimination methods can be combined so that at each step the procedure selects the best attribute and removes the

worst from among the remaining attributes

bull 4 Decision tree induction Decision tree algorithms such as ID3 C45 and CART were originally intended for classification Decision tree induction constructs a flow chart like structure where each internal (non leaf) node denotes a test on an attribute each branch corresponds to an outcome of the test and each external (leaf) node denotes a class prediction At each node the algorithm chooses the ldquobestrdquo attribute to partition the data into individual classes When decision tree induction is used for attribute subset selection a tree is constructed from the given data All attributes that do not appear in the tree are assumed to be irrelevant The set of attributes appearing in the tree form the reduced subset of attributes

The stopping criteria for the methods may vary The procedure may employ a threshold on the measure used to determine when to stop the attribute selection process

In dimensionality reduction data encoding or transformations are applied so as to obtain a reduced or ldquocompressedrdquo representation of the original data If the original data can be reconstructed from the compressed data without any loss of information the data reduction is called lossless If instead we can reconstruct only an approximation of the original data then the data reduction is called lossy There are several well-tuned algorithms for string compression Although they are typically lossless they allow only limited manipulation of the data

In this section we instead focus on two popular and effective methods of lossy dimensionality reduction wavelet transforms and principal components analysis

bull Wavelet transforms can be applied to multidimensional data such as a data cube

bull This is done by first applying the transform to the first dimension then to the second and so on The computational complexity involved is linear with respect to the number of cells in the cube Wavelet transforms give good results on sparse or skewed data and on data with ordered attributes Lossy compression by wavelets is reportedly better than JPEG compression the current commercial standard Wavelet transforms have many real-world applications including the compression of fingerprint images computer vision analysis of time-series data and data cleaning

bull PCA is computationally inexpensive can be applied to ordered and unordered attributes and can handle sparse data and skewed data Multidimensional data of more than two dimensions can be handled by reducing the problem to two dimensions Principal components may be used as inputs to multiple regression and cluster analysis

bull In comparison with wavelet transforms PCA tends to be better at handling sparse data whereas wavelet transforms are more suitable for data of high dimensionality

ldquoCan we reduce the data volume by choosing alternative lsquosmallerrsquo forms of data representationrdquo

Techniques of numerosity reduction can indeed be applied for this purpose These techniques may be parametric or nonparametric

For parametric methods a model is used to estimate the data so that typically only the data parameters need to be stored instead of the actual data (Outliers may also be stored) Log-linear models which estimate discrete multidimensional probability distributions are an example

Nonparametric methods for storing reduced representations of the data include histograms clustering and sampling

Letrsquos look at each of the numerosity reduction techniques mentioned above

Linear regression Data are modeled to fit a straight line Often uses the least-square method to fit the

line Multiple regression allows a response variable Y

to be modeled as a linear function of multidimensional feature vector

Log-linear model approximates discrete multidimensional probability distributions

bull Linear regression Y = w X + bndash Two regression coefficients w and b specify the

line and are to be estimated by using the data at hand

ndash Using the least squares criterion to the known values of Y1 Y2 hellip X1 X2 hellip

bull Multiple regression Y = b0 + b1 X1 + b2 X2ndash Many nonlinear functions can be transformed into

the above

bull Log-linear modelsndash The multi-way table of joint probabilities is

approximated by a product of lower-order tablesndash Probability p(a b c d) = ab acad bcd

Histograms Histograms use binning to approximate data distributions and are a popular form of data reduction

A histogram for an attribute A partitions the data distribution of A into disjoint subsets or buckets If each bucket represents only a single attribute-valuefrequency pair the buckets are called singleton buckets Often buckets instead represent continuous ranges for the given attribute

There are several partitioning rules including the following

Equal-width In an equal-width histogram the width of each bucket range is uniform

Equal-frequency (or equidepth) In an equal-frequency histogram the buckets are created so that roughly the frequency of each bucket is constant (that is each bucket contains roughly the same number of contiguous data samples)

V-Optimal If we consider all of the possible histograms for a given number of buckets the V-Optimal histogram is the one with the least variance Histogram variance is a weighted sum of the original values that each bucket represents where bucket weight is equal to the number of values in the bucket

MaxDiff In a MaxDiff histogram we consider the difference between each pair of adjacent values

ClusteringClustering techniques consider data tuples as objects They

partition the objects into groups or clusters so that objects within a cluster are ldquosimilarrdquo to one another and ldquodissimilarrdquo to objects in other clusters

In data reduction the cluster representations of the data are used to replace the actual data The effectiveness of this technique depends on the nature of the data It is much more effective for data that can be organized into distinct clusters than for smeared data

Partition data set into clusters based on similarity and store cluster representation (eg centroid and diameter) only

Can be very effective if data is clustered but not if data is ldquosmearedrdquo

Can have hierarchical clustering and be stored in multi-dimensional index tree structures

There are many choices of clustering definitions and clustering algorithms

Cluster analysis will be studied in depth later

SamplingSampling can be used as a data reduction technique because it

allows a large data set to be represented by a much smaller random sample (or subset) of the data Suppose that a large data set D contains N tuples Letrsquos look at the most common ways that we could sample D for data reduction

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample sampling complexity is potentially sublinear to the size of the data

Simple Random sample without replacement

bull For a fixed sample size sampling complexity increases only linearly as the number of data dimensions

bull When applied to data reduction sampling is most commonly used to estimate the answer to an aggregate query

Sampling obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data Simple random sampling may have very poor

performance in the presence of skew Develop adaptive sampling methods

Stratified sampling Approximate the percentage of each class (or

subpopulation of interest) in the overall database Used in conjunction with skewed data

Note Sampling may not reduce database IOs (page at a time)

Sampling with or without Replacement

SRSWOR

(simple random

sample without

replacement)

SRSWR

Raw Data

Raw Data ClusterStratified Sample

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 20: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

bull Redundant data occur often when integration of multiple databasesndash The same attribute may have different names in

different databasesndash One attribute may be a ldquoderivedrdquo attribute in

another table eg annual revenuebull Redundant data may be able to be detected by

correlational analysisbull Careful integration of the data from multiple sources

may help reduceavoid redundancies and inconsistencies and improve mining speed and quality

Smoothing remove noise from data Aggregation summarization data cube construction Generalization concept hierarchy climbing Normalization scaled to fall within a small specified

range min-max normalization z-score normalization normalization by decimal scaling

Attributefeature construction New attributes constructed from the given ones

bull min-max normalization

bull z-score normalization

bull normalization by decimal scaling

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(

A

A

devstand

meanvv

_

j

vv

10 Where j is the smallest integer such that Max(| |)lt1v

Min-max normalization to [new_minA new_maxA]

Ex Let income range $12000 to $98000 normalized to [00 10] Then $73600 is mapped to

Z-score normalization (μ mean σ standard deviation)

Ex Let μ = 54000 σ = 16000 Then Normalization by decimal scaling

71600)001(0001200098

0001260073

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(

A

Avv

j

vv

10

Where j is the smallest integer such that Max(|νrsquo|) lt 1

225100016

0005460073

Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume yet closely maintains the integrity of the original data That is mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results

Strategies for data reduction include the following1Data cube aggregation where aggregation operations are applied to the data in the construction of a data cube

2 Attribute subset selection where irrelevant weakly relevant or redundant attributes or dimensions may be detected and removed

3 Dimensionality reduction where encoding mechanisms are used to reduce the data set size

4 Numerosity reduction where the data are replaced or estimated by alternative smaller data representations such as parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods such as clustering sampling and the use of histograms

5 Discretization and concept hierarchy generation where rawdata values for attributes are replaced by ranges or higher conceptual levels

Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes (or dimensions)

The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes

Mining on a reduced set of attributes has an additional benefit

It reduces the number of attributes appearing in the discovered patterns helping to make the patterns easier to understand

The ldquoBestrdquo (and ldquoWorstrdquo) attributes are typically determined using tests of statistical significance which assume that the attributes are independent of one Many other attribute evaluation measures can be used such as the information gain measure used in building decision trees for classification

bull Basic heuristic methods of attribute subset selection include the following techniques some of which are illustrated in Figure

bull 1 Stepwise forward selection The procedure starts with an empty set of attributes as the reduced set The best of the original attributes is determined and added to the reduced set At each subsequent iteration or step the best of the remaining original attributes is added to the set

bull 2 Stepwise backward elimination The procedure starts with the full set of attributes At each step it removes the worst attribute remaining in the set

bull 3 Combination of forward selection and backward elimination The stepwise forward selection and backward elimination methods can be combined so that at each step the procedure selects the best attribute and removes the

worst from among the remaining attributes

bull 4 Decision tree induction Decision tree algorithms such as ID3 C45 and CART were originally intended for classification Decision tree induction constructs a flow chart like structure where each internal (non leaf) node denotes a test on an attribute each branch corresponds to an outcome of the test and each external (leaf) node denotes a class prediction At each node the algorithm chooses the ldquobestrdquo attribute to partition the data into individual classes When decision tree induction is used for attribute subset selection a tree is constructed from the given data All attributes that do not appear in the tree are assumed to be irrelevant The set of attributes appearing in the tree form the reduced subset of attributes

The stopping criteria for the methods may vary The procedure may employ a threshold on the measure used to determine when to stop the attribute selection process

In dimensionality reduction data encoding or transformations are applied so as to obtain a reduced or ldquocompressedrdquo representation of the original data If the original data can be reconstructed from the compressed data without any loss of information the data reduction is called lossless If instead we can reconstruct only an approximation of the original data then the data reduction is called lossy There are several well-tuned algorithms for string compression Although they are typically lossless they allow only limited manipulation of the data

In this section we instead focus on two popular and effective methods of lossy dimensionality reduction wavelet transforms and principal components analysis

bull Wavelet transforms can be applied to multidimensional data such as a data cube

bull This is done by first applying the transform to the first dimension then to the second and so on The computational complexity involved is linear with respect to the number of cells in the cube Wavelet transforms give good results on sparse or skewed data and on data with ordered attributes Lossy compression by wavelets is reportedly better than JPEG compression the current commercial standard Wavelet transforms have many real-world applications including the compression of fingerprint images computer vision analysis of time-series data and data cleaning

bull PCA is computationally inexpensive can be applied to ordered and unordered attributes and can handle sparse data and skewed data Multidimensional data of more than two dimensions can be handled by reducing the problem to two dimensions Principal components may be used as inputs to multiple regression and cluster analysis

bull In comparison with wavelet transforms PCA tends to be better at handling sparse data whereas wavelet transforms are more suitable for data of high dimensionality

ldquoCan we reduce the data volume by choosing alternative lsquosmallerrsquo forms of data representationrdquo

Techniques of numerosity reduction can indeed be applied for this purpose These techniques may be parametric or nonparametric

For parametric methods a model is used to estimate the data so that typically only the data parameters need to be stored instead of the actual data (Outliers may also be stored) Log-linear models which estimate discrete multidimensional probability distributions are an example

Nonparametric methods for storing reduced representations of the data include histograms clustering and sampling

Letrsquos look at each of the numerosity reduction techniques mentioned above

Linear regression Data are modeled to fit a straight line Often uses the least-square method to fit the

line Multiple regression allows a response variable Y

to be modeled as a linear function of multidimensional feature vector

Log-linear model approximates discrete multidimensional probability distributions

bull Linear regression Y = w X + bndash Two regression coefficients w and b specify the

line and are to be estimated by using the data at hand

ndash Using the least squares criterion to the known values of Y1 Y2 hellip X1 X2 hellip

bull Multiple regression Y = b0 + b1 X1 + b2 X2ndash Many nonlinear functions can be transformed into

the above

bull Log-linear modelsndash The multi-way table of joint probabilities is

approximated by a product of lower-order tablesndash Probability p(a b c d) = ab acad bcd

Histograms Histograms use binning to approximate data distributions and are a popular form of data reduction

A histogram for an attribute A partitions the data distribution of A into disjoint subsets or buckets If each bucket represents only a single attribute-valuefrequency pair the buckets are called singleton buckets Often buckets instead represent continuous ranges for the given attribute

There are several partitioning rules including the following

Equal-width In an equal-width histogram the width of each bucket range is uniform

Equal-frequency (or equidepth) In an equal-frequency histogram the buckets are created so that roughly the frequency of each bucket is constant (that is each bucket contains roughly the same number of contiguous data samples)

V-Optimal If we consider all of the possible histograms for a given number of buckets the V-Optimal histogram is the one with the least variance Histogram variance is a weighted sum of the original values that each bucket represents where bucket weight is equal to the number of values in the bucket

MaxDiff In a MaxDiff histogram we consider the difference between each pair of adjacent values

ClusteringClustering techniques consider data tuples as objects They

partition the objects into groups or clusters so that objects within a cluster are ldquosimilarrdquo to one another and ldquodissimilarrdquo to objects in other clusters

In data reduction the cluster representations of the data are used to replace the actual data The effectiveness of this technique depends on the nature of the data It is much more effective for data that can be organized into distinct clusters than for smeared data

Partition data set into clusters based on similarity and store cluster representation (eg centroid and diameter) only

Can be very effective if data is clustered but not if data is ldquosmearedrdquo

Can have hierarchical clustering and be stored in multi-dimensional index tree structures

There are many choices of clustering definitions and clustering algorithms

Cluster analysis will be studied in depth later

SamplingSampling can be used as a data reduction technique because it

allows a large data set to be represented by a much smaller random sample (or subset) of the data Suppose that a large data set D contains N tuples Letrsquos look at the most common ways that we could sample D for data reduction

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample sampling complexity is potentially sublinear to the size of the data

Simple Random sample without replacement

bull For a fixed sample size sampling complexity increases only linearly as the number of data dimensions

bull When applied to data reduction sampling is most commonly used to estimate the answer to an aggregate query

Sampling obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data Simple random sampling may have very poor

performance in the presence of skew Develop adaptive sampling methods

Stratified sampling Approximate the percentage of each class (or

subpopulation of interest) in the overall database Used in conjunction with skewed data

Note Sampling may not reduce database IOs (page at a time)

Sampling with or without Replacement

SRSWOR

(simple random

sample without

replacement)

SRSWR

Raw Data

Raw Data ClusterStratified Sample

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 21: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

Smoothing remove noise from data Aggregation summarization data cube construction Generalization concept hierarchy climbing Normalization scaled to fall within a small specified

range min-max normalization z-score normalization normalization by decimal scaling

Attributefeature construction New attributes constructed from the given ones

bull min-max normalization

bull z-score normalization

bull normalization by decimal scaling

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(

A

A

devstand

meanvv

_

j

vv

10 Where j is the smallest integer such that Max(| |)lt1v

Min-max normalization to [new_minA new_maxA]

Ex Let income range $12000 to $98000 normalized to [00 10] Then $73600 is mapped to

Z-score normalization (μ mean σ standard deviation)

Ex Let μ = 54000 σ = 16000 Then Normalization by decimal scaling

71600)001(0001200098

0001260073

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(

A

Avv

j

vv

10

Where j is the smallest integer such that Max(|νrsquo|) lt 1

225100016

0005460073

Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume yet closely maintains the integrity of the original data That is mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results

Strategies for data reduction include the following1Data cube aggregation where aggregation operations are applied to the data in the construction of a data cube

2 Attribute subset selection where irrelevant weakly relevant or redundant attributes or dimensions may be detected and removed

3 Dimensionality reduction where encoding mechanisms are used to reduce the data set size

4 Numerosity reduction where the data are replaced or estimated by alternative smaller data representations such as parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods such as clustering sampling and the use of histograms

5 Discretization and concept hierarchy generation where rawdata values for attributes are replaced by ranges or higher conceptual levels

Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes (or dimensions)

The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes

Mining on a reduced set of attributes has an additional benefit

It reduces the number of attributes appearing in the discovered patterns helping to make the patterns easier to understand

The ldquoBestrdquo (and ldquoWorstrdquo) attributes are typically determined using tests of statistical significance which assume that the attributes are independent of one Many other attribute evaluation measures can be used such as the information gain measure used in building decision trees for classification

bull Basic heuristic methods of attribute subset selection include the following techniques some of which are illustrated in Figure

bull 1 Stepwise forward selection The procedure starts with an empty set of attributes as the reduced set The best of the original attributes is determined and added to the reduced set At each subsequent iteration or step the best of the remaining original attributes is added to the set

bull 2 Stepwise backward elimination The procedure starts with the full set of attributes At each step it removes the worst attribute remaining in the set

bull 3 Combination of forward selection and backward elimination The stepwise forward selection and backward elimination methods can be combined so that at each step the procedure selects the best attribute and removes the

worst from among the remaining attributes

bull 4 Decision tree induction Decision tree algorithms such as ID3 C45 and CART were originally intended for classification Decision tree induction constructs a flow chart like structure where each internal (non leaf) node denotes a test on an attribute each branch corresponds to an outcome of the test and each external (leaf) node denotes a class prediction At each node the algorithm chooses the ldquobestrdquo attribute to partition the data into individual classes When decision tree induction is used for attribute subset selection a tree is constructed from the given data All attributes that do not appear in the tree are assumed to be irrelevant The set of attributes appearing in the tree form the reduced subset of attributes

The stopping criteria for the methods may vary The procedure may employ a threshold on the measure used to determine when to stop the attribute selection process

In dimensionality reduction data encoding or transformations are applied so as to obtain a reduced or ldquocompressedrdquo representation of the original data If the original data can be reconstructed from the compressed data without any loss of information the data reduction is called lossless If instead we can reconstruct only an approximation of the original data then the data reduction is called lossy There are several well-tuned algorithms for string compression Although they are typically lossless they allow only limited manipulation of the data

In this section we instead focus on two popular and effective methods of lossy dimensionality reduction wavelet transforms and principal components analysis

bull Wavelet transforms can be applied to multidimensional data such as a data cube

bull This is done by first applying the transform to the first dimension then to the second and so on The computational complexity involved is linear with respect to the number of cells in the cube Wavelet transforms give good results on sparse or skewed data and on data with ordered attributes Lossy compression by wavelets is reportedly better than JPEG compression the current commercial standard Wavelet transforms have many real-world applications including the compression of fingerprint images computer vision analysis of time-series data and data cleaning

bull PCA is computationally inexpensive can be applied to ordered and unordered attributes and can handle sparse data and skewed data Multidimensional data of more than two dimensions can be handled by reducing the problem to two dimensions Principal components may be used as inputs to multiple regression and cluster analysis

bull In comparison with wavelet transforms PCA tends to be better at handling sparse data whereas wavelet transforms are more suitable for data of high dimensionality

ldquoCan we reduce the data volume by choosing alternative lsquosmallerrsquo forms of data representationrdquo

Techniques of numerosity reduction can indeed be applied for this purpose These techniques may be parametric or nonparametric

For parametric methods a model is used to estimate the data so that typically only the data parameters need to be stored instead of the actual data (Outliers may also be stored) Log-linear models which estimate discrete multidimensional probability distributions are an example

Nonparametric methods for storing reduced representations of the data include histograms clustering and sampling

Letrsquos look at each of the numerosity reduction techniques mentioned above

Linear regression Data are modeled to fit a straight line Often uses the least-square method to fit the

line Multiple regression allows a response variable Y

to be modeled as a linear function of multidimensional feature vector

Log-linear model approximates discrete multidimensional probability distributions

bull Linear regression Y = w X + bndash Two regression coefficients w and b specify the

line and are to be estimated by using the data at hand

ndash Using the least squares criterion to the known values of Y1 Y2 hellip X1 X2 hellip

bull Multiple regression Y = b0 + b1 X1 + b2 X2ndash Many nonlinear functions can be transformed into

the above

bull Log-linear modelsndash The multi-way table of joint probabilities is

approximated by a product of lower-order tablesndash Probability p(a b c d) = ab acad bcd

Histograms Histograms use binning to approximate data distributions and are a popular form of data reduction

A histogram for an attribute A partitions the data distribution of A into disjoint subsets or buckets If each bucket represents only a single attribute-valuefrequency pair the buckets are called singleton buckets Often buckets instead represent continuous ranges for the given attribute

There are several partitioning rules including the following

Equal-width In an equal-width histogram the width of each bucket range is uniform

Equal-frequency (or equidepth) In an equal-frequency histogram the buckets are created so that roughly the frequency of each bucket is constant (that is each bucket contains roughly the same number of contiguous data samples)

V-Optimal If we consider all of the possible histograms for a given number of buckets the V-Optimal histogram is the one with the least variance Histogram variance is a weighted sum of the original values that each bucket represents where bucket weight is equal to the number of values in the bucket

MaxDiff In a MaxDiff histogram we consider the difference between each pair of adjacent values

ClusteringClustering techniques consider data tuples as objects They

partition the objects into groups or clusters so that objects within a cluster are ldquosimilarrdquo to one another and ldquodissimilarrdquo to objects in other clusters

In data reduction the cluster representations of the data are used to replace the actual data The effectiveness of this technique depends on the nature of the data It is much more effective for data that can be organized into distinct clusters than for smeared data

Partition data set into clusters based on similarity and store cluster representation (eg centroid and diameter) only

Can be very effective if data is clustered but not if data is ldquosmearedrdquo

Can have hierarchical clustering and be stored in multi-dimensional index tree structures

There are many choices of clustering definitions and clustering algorithms

Cluster analysis will be studied in depth later

SamplingSampling can be used as a data reduction technique because it

allows a large data set to be represented by a much smaller random sample (or subset) of the data Suppose that a large data set D contains N tuples Letrsquos look at the most common ways that we could sample D for data reduction

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample sampling complexity is potentially sublinear to the size of the data

Simple Random sample without replacement

bull For a fixed sample size sampling complexity increases only linearly as the number of data dimensions

bull When applied to data reduction sampling is most commonly used to estimate the answer to an aggregate query

Sampling obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data Simple random sampling may have very poor

performance in the presence of skew Develop adaptive sampling methods

Stratified sampling Approximate the percentage of each class (or

subpopulation of interest) in the overall database Used in conjunction with skewed data

Note Sampling may not reduce database IOs (page at a time)

Sampling with or without Replacement

SRSWOR

(simple random

sample without

replacement)

SRSWR

Raw Data

Raw Data ClusterStratified Sample

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 22: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

bull min-max normalization

bull z-score normalization

bull normalization by decimal scaling

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(

A

A

devstand

meanvv

_

j

vv

10 Where j is the smallest integer such that Max(| |)lt1v

Min-max normalization to [new_minA new_maxA]

Ex Let income range $12000 to $98000 normalized to [00 10] Then $73600 is mapped to

Z-score normalization (μ mean σ standard deviation)

Ex Let μ = 54000 σ = 16000 Then Normalization by decimal scaling

71600)001(0001200098

0001260073

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(

A

Avv

j

vv

10

Where j is the smallest integer such that Max(|νrsquo|) lt 1

225100016

0005460073

Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume yet closely maintains the integrity of the original data That is mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results

Strategies for data reduction include the following1Data cube aggregation where aggregation operations are applied to the data in the construction of a data cube

2 Attribute subset selection where irrelevant weakly relevant or redundant attributes or dimensions may be detected and removed

3 Dimensionality reduction where encoding mechanisms are used to reduce the data set size

4 Numerosity reduction where the data are replaced or estimated by alternative smaller data representations such as parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods such as clustering sampling and the use of histograms

5 Discretization and concept hierarchy generation where rawdata values for attributes are replaced by ranges or higher conceptual levels

Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes (or dimensions)

The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes

Mining on a reduced set of attributes has an additional benefit

It reduces the number of attributes appearing in the discovered patterns helping to make the patterns easier to understand

The ldquoBestrdquo (and ldquoWorstrdquo) attributes are typically determined using tests of statistical significance which assume that the attributes are independent of one Many other attribute evaluation measures can be used such as the information gain measure used in building decision trees for classification

bull Basic heuristic methods of attribute subset selection include the following techniques some of which are illustrated in Figure

bull 1 Stepwise forward selection The procedure starts with an empty set of attributes as the reduced set The best of the original attributes is determined and added to the reduced set At each subsequent iteration or step the best of the remaining original attributes is added to the set

bull 2 Stepwise backward elimination The procedure starts with the full set of attributes At each step it removes the worst attribute remaining in the set

bull 3 Combination of forward selection and backward elimination The stepwise forward selection and backward elimination methods can be combined so that at each step the procedure selects the best attribute and removes the

worst from among the remaining attributes

bull 4 Decision tree induction Decision tree algorithms such as ID3 C45 and CART were originally intended for classification Decision tree induction constructs a flow chart like structure where each internal (non leaf) node denotes a test on an attribute each branch corresponds to an outcome of the test and each external (leaf) node denotes a class prediction At each node the algorithm chooses the ldquobestrdquo attribute to partition the data into individual classes When decision tree induction is used for attribute subset selection a tree is constructed from the given data All attributes that do not appear in the tree are assumed to be irrelevant The set of attributes appearing in the tree form the reduced subset of attributes

The stopping criteria for the methods may vary The procedure may employ a threshold on the measure used to determine when to stop the attribute selection process

In dimensionality reduction data encoding or transformations are applied so as to obtain a reduced or ldquocompressedrdquo representation of the original data If the original data can be reconstructed from the compressed data without any loss of information the data reduction is called lossless If instead we can reconstruct only an approximation of the original data then the data reduction is called lossy There are several well-tuned algorithms for string compression Although they are typically lossless they allow only limited manipulation of the data

In this section we instead focus on two popular and effective methods of lossy dimensionality reduction wavelet transforms and principal components analysis

bull Wavelet transforms can be applied to multidimensional data such as a data cube

bull This is done by first applying the transform to the first dimension then to the second and so on The computational complexity involved is linear with respect to the number of cells in the cube Wavelet transforms give good results on sparse or skewed data and on data with ordered attributes Lossy compression by wavelets is reportedly better than JPEG compression the current commercial standard Wavelet transforms have many real-world applications including the compression of fingerprint images computer vision analysis of time-series data and data cleaning

bull PCA is computationally inexpensive can be applied to ordered and unordered attributes and can handle sparse data and skewed data Multidimensional data of more than two dimensions can be handled by reducing the problem to two dimensions Principal components may be used as inputs to multiple regression and cluster analysis

bull In comparison with wavelet transforms PCA tends to be better at handling sparse data whereas wavelet transforms are more suitable for data of high dimensionality

ldquoCan we reduce the data volume by choosing alternative lsquosmallerrsquo forms of data representationrdquo

Techniques of numerosity reduction can indeed be applied for this purpose These techniques may be parametric or nonparametric

For parametric methods a model is used to estimate the data so that typically only the data parameters need to be stored instead of the actual data (Outliers may also be stored) Log-linear models which estimate discrete multidimensional probability distributions are an example

Nonparametric methods for storing reduced representations of the data include histograms clustering and sampling

Letrsquos look at each of the numerosity reduction techniques mentioned above

Linear regression Data are modeled to fit a straight line Often uses the least-square method to fit the

line Multiple regression allows a response variable Y

to be modeled as a linear function of multidimensional feature vector

Log-linear model approximates discrete multidimensional probability distributions

bull Linear regression Y = w X + bndash Two regression coefficients w and b specify the

line and are to be estimated by using the data at hand

ndash Using the least squares criterion to the known values of Y1 Y2 hellip X1 X2 hellip

bull Multiple regression Y = b0 + b1 X1 + b2 X2ndash Many nonlinear functions can be transformed into

the above

bull Log-linear modelsndash The multi-way table of joint probabilities is

approximated by a product of lower-order tablesndash Probability p(a b c d) = ab acad bcd

Histograms Histograms use binning to approximate data distributions and are a popular form of data reduction

A histogram for an attribute A partitions the data distribution of A into disjoint subsets or buckets If each bucket represents only a single attribute-valuefrequency pair the buckets are called singleton buckets Often buckets instead represent continuous ranges for the given attribute

There are several partitioning rules including the following

Equal-width In an equal-width histogram the width of each bucket range is uniform

Equal-frequency (or equidepth) In an equal-frequency histogram the buckets are created so that roughly the frequency of each bucket is constant (that is each bucket contains roughly the same number of contiguous data samples)

V-Optimal If we consider all of the possible histograms for a given number of buckets the V-Optimal histogram is the one with the least variance Histogram variance is a weighted sum of the original values that each bucket represents where bucket weight is equal to the number of values in the bucket

MaxDiff In a MaxDiff histogram we consider the difference between each pair of adjacent values

ClusteringClustering techniques consider data tuples as objects They

partition the objects into groups or clusters so that objects within a cluster are ldquosimilarrdquo to one another and ldquodissimilarrdquo to objects in other clusters

In data reduction the cluster representations of the data are used to replace the actual data The effectiveness of this technique depends on the nature of the data It is much more effective for data that can be organized into distinct clusters than for smeared data

Partition data set into clusters based on similarity and store cluster representation (eg centroid and diameter) only

Can be very effective if data is clustered but not if data is ldquosmearedrdquo

Can have hierarchical clustering and be stored in multi-dimensional index tree structures

There are many choices of clustering definitions and clustering algorithms

Cluster analysis will be studied in depth later

SamplingSampling can be used as a data reduction technique because it

allows a large data set to be represented by a much smaller random sample (or subset) of the data Suppose that a large data set D contains N tuples Letrsquos look at the most common ways that we could sample D for data reduction

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample sampling complexity is potentially sublinear to the size of the data

Simple Random sample without replacement

bull For a fixed sample size sampling complexity increases only linearly as the number of data dimensions

bull When applied to data reduction sampling is most commonly used to estimate the answer to an aggregate query

Sampling obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data Simple random sampling may have very poor

performance in the presence of skew Develop adaptive sampling methods

Stratified sampling Approximate the percentage of each class (or

subpopulation of interest) in the overall database Used in conjunction with skewed data

Note Sampling may not reduce database IOs (page at a time)

Sampling with or without Replacement

SRSWOR

(simple random

sample without

replacement)

SRSWR

Raw Data

Raw Data ClusterStratified Sample

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 23: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

Min-max normalization to [new_minA new_maxA]

Ex Let income range $12000 to $98000 normalized to [00 10] Then $73600 is mapped to

Z-score normalization (μ mean σ standard deviation)

Ex Let μ = 54000 σ = 16000 Then Normalization by decimal scaling

71600)001(0001200098

0001260073

AAA

AA

A

minnewminnewmaxnewminmax

minvv _)__(

A

Avv

j

vv

10

Where j is the smallest integer such that Max(|νrsquo|) lt 1

225100016

0005460073

Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume yet closely maintains the integrity of the original data That is mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results

Strategies for data reduction include the following1Data cube aggregation where aggregation operations are applied to the data in the construction of a data cube

2 Attribute subset selection where irrelevant weakly relevant or redundant attributes or dimensions may be detected and removed

3 Dimensionality reduction where encoding mechanisms are used to reduce the data set size

4 Numerosity reduction where the data are replaced or estimated by alternative smaller data representations such as parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods such as clustering sampling and the use of histograms

5 Discretization and concept hierarchy generation where rawdata values for attributes are replaced by ranges or higher conceptual levels

Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes (or dimensions)

The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes

Mining on a reduced set of attributes has an additional benefit

It reduces the number of attributes appearing in the discovered patterns helping to make the patterns easier to understand

The ldquoBestrdquo (and ldquoWorstrdquo) attributes are typically determined using tests of statistical significance which assume that the attributes are independent of one Many other attribute evaluation measures can be used such as the information gain measure used in building decision trees for classification

bull Basic heuristic methods of attribute subset selection include the following techniques some of which are illustrated in Figure

bull 1 Stepwise forward selection The procedure starts with an empty set of attributes as the reduced set The best of the original attributes is determined and added to the reduced set At each subsequent iteration or step the best of the remaining original attributes is added to the set

bull 2 Stepwise backward elimination The procedure starts with the full set of attributes At each step it removes the worst attribute remaining in the set

bull 3 Combination of forward selection and backward elimination The stepwise forward selection and backward elimination methods can be combined so that at each step the procedure selects the best attribute and removes the

worst from among the remaining attributes

bull 4 Decision tree induction Decision tree algorithms such as ID3 C45 and CART were originally intended for classification Decision tree induction constructs a flow chart like structure where each internal (non leaf) node denotes a test on an attribute each branch corresponds to an outcome of the test and each external (leaf) node denotes a class prediction At each node the algorithm chooses the ldquobestrdquo attribute to partition the data into individual classes When decision tree induction is used for attribute subset selection a tree is constructed from the given data All attributes that do not appear in the tree are assumed to be irrelevant The set of attributes appearing in the tree form the reduced subset of attributes

The stopping criteria for the methods may vary The procedure may employ a threshold on the measure used to determine when to stop the attribute selection process

In dimensionality reduction data encoding or transformations are applied so as to obtain a reduced or ldquocompressedrdquo representation of the original data If the original data can be reconstructed from the compressed data without any loss of information the data reduction is called lossless If instead we can reconstruct only an approximation of the original data then the data reduction is called lossy There are several well-tuned algorithms for string compression Although they are typically lossless they allow only limited manipulation of the data

In this section we instead focus on two popular and effective methods of lossy dimensionality reduction wavelet transforms and principal components analysis

bull Wavelet transforms can be applied to multidimensional data such as a data cube

bull This is done by first applying the transform to the first dimension then to the second and so on The computational complexity involved is linear with respect to the number of cells in the cube Wavelet transforms give good results on sparse or skewed data and on data with ordered attributes Lossy compression by wavelets is reportedly better than JPEG compression the current commercial standard Wavelet transforms have many real-world applications including the compression of fingerprint images computer vision analysis of time-series data and data cleaning

bull PCA is computationally inexpensive can be applied to ordered and unordered attributes and can handle sparse data and skewed data Multidimensional data of more than two dimensions can be handled by reducing the problem to two dimensions Principal components may be used as inputs to multiple regression and cluster analysis

bull In comparison with wavelet transforms PCA tends to be better at handling sparse data whereas wavelet transforms are more suitable for data of high dimensionality

ldquoCan we reduce the data volume by choosing alternative lsquosmallerrsquo forms of data representationrdquo

Techniques of numerosity reduction can indeed be applied for this purpose These techniques may be parametric or nonparametric

For parametric methods a model is used to estimate the data so that typically only the data parameters need to be stored instead of the actual data (Outliers may also be stored) Log-linear models which estimate discrete multidimensional probability distributions are an example

Nonparametric methods for storing reduced representations of the data include histograms clustering and sampling

Letrsquos look at each of the numerosity reduction techniques mentioned above

Linear regression Data are modeled to fit a straight line Often uses the least-square method to fit the

line Multiple regression allows a response variable Y

to be modeled as a linear function of multidimensional feature vector

Log-linear model approximates discrete multidimensional probability distributions

bull Linear regression Y = w X + bndash Two regression coefficients w and b specify the

line and are to be estimated by using the data at hand

ndash Using the least squares criterion to the known values of Y1 Y2 hellip X1 X2 hellip

bull Multiple regression Y = b0 + b1 X1 + b2 X2ndash Many nonlinear functions can be transformed into

the above

bull Log-linear modelsndash The multi-way table of joint probabilities is

approximated by a product of lower-order tablesndash Probability p(a b c d) = ab acad bcd

Histograms Histograms use binning to approximate data distributions and are a popular form of data reduction

A histogram for an attribute A partitions the data distribution of A into disjoint subsets or buckets If each bucket represents only a single attribute-valuefrequency pair the buckets are called singleton buckets Often buckets instead represent continuous ranges for the given attribute

There are several partitioning rules including the following

Equal-width In an equal-width histogram the width of each bucket range is uniform

Equal-frequency (or equidepth) In an equal-frequency histogram the buckets are created so that roughly the frequency of each bucket is constant (that is each bucket contains roughly the same number of contiguous data samples)

V-Optimal If we consider all of the possible histograms for a given number of buckets the V-Optimal histogram is the one with the least variance Histogram variance is a weighted sum of the original values that each bucket represents where bucket weight is equal to the number of values in the bucket

MaxDiff In a MaxDiff histogram we consider the difference between each pair of adjacent values

ClusteringClustering techniques consider data tuples as objects They

partition the objects into groups or clusters so that objects within a cluster are ldquosimilarrdquo to one another and ldquodissimilarrdquo to objects in other clusters

In data reduction the cluster representations of the data are used to replace the actual data The effectiveness of this technique depends on the nature of the data It is much more effective for data that can be organized into distinct clusters than for smeared data

Partition data set into clusters based on similarity and store cluster representation (eg centroid and diameter) only

Can be very effective if data is clustered but not if data is ldquosmearedrdquo

Can have hierarchical clustering and be stored in multi-dimensional index tree structures

There are many choices of clustering definitions and clustering algorithms

Cluster analysis will be studied in depth later

SamplingSampling can be used as a data reduction technique because it

allows a large data set to be represented by a much smaller random sample (or subset) of the data Suppose that a large data set D contains N tuples Letrsquos look at the most common ways that we could sample D for data reduction

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample sampling complexity is potentially sublinear to the size of the data

Simple Random sample without replacement

bull For a fixed sample size sampling complexity increases only linearly as the number of data dimensions

bull When applied to data reduction sampling is most commonly used to estimate the answer to an aggregate query

Sampling obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data Simple random sampling may have very poor

performance in the presence of skew Develop adaptive sampling methods

Stratified sampling Approximate the percentage of each class (or

subpopulation of interest) in the overall database Used in conjunction with skewed data

Note Sampling may not reduce database IOs (page at a time)

Sampling with or without Replacement

SRSWOR

(simple random

sample without

replacement)

SRSWR

Raw Data

Raw Data ClusterStratified Sample

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 24: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume yet closely maintains the integrity of the original data That is mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results

Strategies for data reduction include the following1Data cube aggregation where aggregation operations are applied to the data in the construction of a data cube

2 Attribute subset selection where irrelevant weakly relevant or redundant attributes or dimensions may be detected and removed

3 Dimensionality reduction where encoding mechanisms are used to reduce the data set size

4 Numerosity reduction where the data are replaced or estimated by alternative smaller data representations such as parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods such as clustering sampling and the use of histograms

5 Discretization and concept hierarchy generation where rawdata values for attributes are replaced by ranges or higher conceptual levels

Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes (or dimensions)

The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes

Mining on a reduced set of attributes has an additional benefit

It reduces the number of attributes appearing in the discovered patterns helping to make the patterns easier to understand

The ldquoBestrdquo (and ldquoWorstrdquo) attributes are typically determined using tests of statistical significance which assume that the attributes are independent of one Many other attribute evaluation measures can be used such as the information gain measure used in building decision trees for classification

bull Basic heuristic methods of attribute subset selection include the following techniques some of which are illustrated in Figure

bull 1 Stepwise forward selection The procedure starts with an empty set of attributes as the reduced set The best of the original attributes is determined and added to the reduced set At each subsequent iteration or step the best of the remaining original attributes is added to the set

bull 2 Stepwise backward elimination The procedure starts with the full set of attributes At each step it removes the worst attribute remaining in the set

bull 3 Combination of forward selection and backward elimination The stepwise forward selection and backward elimination methods can be combined so that at each step the procedure selects the best attribute and removes the

worst from among the remaining attributes

bull 4 Decision tree induction Decision tree algorithms such as ID3 C45 and CART were originally intended for classification Decision tree induction constructs a flow chart like structure where each internal (non leaf) node denotes a test on an attribute each branch corresponds to an outcome of the test and each external (leaf) node denotes a class prediction At each node the algorithm chooses the ldquobestrdquo attribute to partition the data into individual classes When decision tree induction is used for attribute subset selection a tree is constructed from the given data All attributes that do not appear in the tree are assumed to be irrelevant The set of attributes appearing in the tree form the reduced subset of attributes

The stopping criteria for the methods may vary The procedure may employ a threshold on the measure used to determine when to stop the attribute selection process

In dimensionality reduction data encoding or transformations are applied so as to obtain a reduced or ldquocompressedrdquo representation of the original data If the original data can be reconstructed from the compressed data without any loss of information the data reduction is called lossless If instead we can reconstruct only an approximation of the original data then the data reduction is called lossy There are several well-tuned algorithms for string compression Although they are typically lossless they allow only limited manipulation of the data

In this section we instead focus on two popular and effective methods of lossy dimensionality reduction wavelet transforms and principal components analysis

bull Wavelet transforms can be applied to multidimensional data such as a data cube

bull This is done by first applying the transform to the first dimension then to the second and so on The computational complexity involved is linear with respect to the number of cells in the cube Wavelet transforms give good results on sparse or skewed data and on data with ordered attributes Lossy compression by wavelets is reportedly better than JPEG compression the current commercial standard Wavelet transforms have many real-world applications including the compression of fingerprint images computer vision analysis of time-series data and data cleaning

bull PCA is computationally inexpensive can be applied to ordered and unordered attributes and can handle sparse data and skewed data Multidimensional data of more than two dimensions can be handled by reducing the problem to two dimensions Principal components may be used as inputs to multiple regression and cluster analysis

bull In comparison with wavelet transforms PCA tends to be better at handling sparse data whereas wavelet transforms are more suitable for data of high dimensionality

ldquoCan we reduce the data volume by choosing alternative lsquosmallerrsquo forms of data representationrdquo

Techniques of numerosity reduction can indeed be applied for this purpose These techniques may be parametric or nonparametric

For parametric methods a model is used to estimate the data so that typically only the data parameters need to be stored instead of the actual data (Outliers may also be stored) Log-linear models which estimate discrete multidimensional probability distributions are an example

Nonparametric methods for storing reduced representations of the data include histograms clustering and sampling

Letrsquos look at each of the numerosity reduction techniques mentioned above

Linear regression Data are modeled to fit a straight line Often uses the least-square method to fit the

line Multiple regression allows a response variable Y

to be modeled as a linear function of multidimensional feature vector

Log-linear model approximates discrete multidimensional probability distributions

bull Linear regression Y = w X + bndash Two regression coefficients w and b specify the

line and are to be estimated by using the data at hand

ndash Using the least squares criterion to the known values of Y1 Y2 hellip X1 X2 hellip

bull Multiple regression Y = b0 + b1 X1 + b2 X2ndash Many nonlinear functions can be transformed into

the above

bull Log-linear modelsndash The multi-way table of joint probabilities is

approximated by a product of lower-order tablesndash Probability p(a b c d) = ab acad bcd

Histograms Histograms use binning to approximate data distributions and are a popular form of data reduction

A histogram for an attribute A partitions the data distribution of A into disjoint subsets or buckets If each bucket represents only a single attribute-valuefrequency pair the buckets are called singleton buckets Often buckets instead represent continuous ranges for the given attribute

There are several partitioning rules including the following

Equal-width In an equal-width histogram the width of each bucket range is uniform

Equal-frequency (or equidepth) In an equal-frequency histogram the buckets are created so that roughly the frequency of each bucket is constant (that is each bucket contains roughly the same number of contiguous data samples)

V-Optimal If we consider all of the possible histograms for a given number of buckets the V-Optimal histogram is the one with the least variance Histogram variance is a weighted sum of the original values that each bucket represents where bucket weight is equal to the number of values in the bucket

MaxDiff In a MaxDiff histogram we consider the difference between each pair of adjacent values

ClusteringClustering techniques consider data tuples as objects They

partition the objects into groups or clusters so that objects within a cluster are ldquosimilarrdquo to one another and ldquodissimilarrdquo to objects in other clusters

In data reduction the cluster representations of the data are used to replace the actual data The effectiveness of this technique depends on the nature of the data It is much more effective for data that can be organized into distinct clusters than for smeared data

Partition data set into clusters based on similarity and store cluster representation (eg centroid and diameter) only

Can be very effective if data is clustered but not if data is ldquosmearedrdquo

Can have hierarchical clustering and be stored in multi-dimensional index tree structures

There are many choices of clustering definitions and clustering algorithms

Cluster analysis will be studied in depth later

SamplingSampling can be used as a data reduction technique because it

allows a large data set to be represented by a much smaller random sample (or subset) of the data Suppose that a large data set D contains N tuples Letrsquos look at the most common ways that we could sample D for data reduction

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample sampling complexity is potentially sublinear to the size of the data

Simple Random sample without replacement

bull For a fixed sample size sampling complexity increases only linearly as the number of data dimensions

bull When applied to data reduction sampling is most commonly used to estimate the answer to an aggregate query

Sampling obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data Simple random sampling may have very poor

performance in the presence of skew Develop adaptive sampling methods

Stratified sampling Approximate the percentage of each class (or

subpopulation of interest) in the overall database Used in conjunction with skewed data

Note Sampling may not reduce database IOs (page at a time)

Sampling with or without Replacement

SRSWOR

(simple random

sample without

replacement)

SRSWR

Raw Data

Raw Data ClusterStratified Sample

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 25: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

Strategies for data reduction include the following1Data cube aggregation where aggregation operations are applied to the data in the construction of a data cube

2 Attribute subset selection where irrelevant weakly relevant or redundant attributes or dimensions may be detected and removed

3 Dimensionality reduction where encoding mechanisms are used to reduce the data set size

4 Numerosity reduction where the data are replaced or estimated by alternative smaller data representations such as parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods such as clustering sampling and the use of histograms

5 Discretization and concept hierarchy generation where rawdata values for attributes are replaced by ranges or higher conceptual levels

Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes (or dimensions)

The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes

Mining on a reduced set of attributes has an additional benefit

It reduces the number of attributes appearing in the discovered patterns helping to make the patterns easier to understand

The ldquoBestrdquo (and ldquoWorstrdquo) attributes are typically determined using tests of statistical significance which assume that the attributes are independent of one Many other attribute evaluation measures can be used such as the information gain measure used in building decision trees for classification

bull Basic heuristic methods of attribute subset selection include the following techniques some of which are illustrated in Figure

bull 1 Stepwise forward selection The procedure starts with an empty set of attributes as the reduced set The best of the original attributes is determined and added to the reduced set At each subsequent iteration or step the best of the remaining original attributes is added to the set

bull 2 Stepwise backward elimination The procedure starts with the full set of attributes At each step it removes the worst attribute remaining in the set

bull 3 Combination of forward selection and backward elimination The stepwise forward selection and backward elimination methods can be combined so that at each step the procedure selects the best attribute and removes the

worst from among the remaining attributes

bull 4 Decision tree induction Decision tree algorithms such as ID3 C45 and CART were originally intended for classification Decision tree induction constructs a flow chart like structure where each internal (non leaf) node denotes a test on an attribute each branch corresponds to an outcome of the test and each external (leaf) node denotes a class prediction At each node the algorithm chooses the ldquobestrdquo attribute to partition the data into individual classes When decision tree induction is used for attribute subset selection a tree is constructed from the given data All attributes that do not appear in the tree are assumed to be irrelevant The set of attributes appearing in the tree form the reduced subset of attributes

The stopping criteria for the methods may vary The procedure may employ a threshold on the measure used to determine when to stop the attribute selection process

In dimensionality reduction data encoding or transformations are applied so as to obtain a reduced or ldquocompressedrdquo representation of the original data If the original data can be reconstructed from the compressed data without any loss of information the data reduction is called lossless If instead we can reconstruct only an approximation of the original data then the data reduction is called lossy There are several well-tuned algorithms for string compression Although they are typically lossless they allow only limited manipulation of the data

In this section we instead focus on two popular and effective methods of lossy dimensionality reduction wavelet transforms and principal components analysis

bull Wavelet transforms can be applied to multidimensional data such as a data cube

bull This is done by first applying the transform to the first dimension then to the second and so on The computational complexity involved is linear with respect to the number of cells in the cube Wavelet transforms give good results on sparse or skewed data and on data with ordered attributes Lossy compression by wavelets is reportedly better than JPEG compression the current commercial standard Wavelet transforms have many real-world applications including the compression of fingerprint images computer vision analysis of time-series data and data cleaning

bull PCA is computationally inexpensive can be applied to ordered and unordered attributes and can handle sparse data and skewed data Multidimensional data of more than two dimensions can be handled by reducing the problem to two dimensions Principal components may be used as inputs to multiple regression and cluster analysis

bull In comparison with wavelet transforms PCA tends to be better at handling sparse data whereas wavelet transforms are more suitable for data of high dimensionality

ldquoCan we reduce the data volume by choosing alternative lsquosmallerrsquo forms of data representationrdquo

Techniques of numerosity reduction can indeed be applied for this purpose These techniques may be parametric or nonparametric

For parametric methods a model is used to estimate the data so that typically only the data parameters need to be stored instead of the actual data (Outliers may also be stored) Log-linear models which estimate discrete multidimensional probability distributions are an example

Nonparametric methods for storing reduced representations of the data include histograms clustering and sampling

Letrsquos look at each of the numerosity reduction techniques mentioned above

Linear regression Data are modeled to fit a straight line Often uses the least-square method to fit the

line Multiple regression allows a response variable Y

to be modeled as a linear function of multidimensional feature vector

Log-linear model approximates discrete multidimensional probability distributions

bull Linear regression Y = w X + bndash Two regression coefficients w and b specify the

line and are to be estimated by using the data at hand

ndash Using the least squares criterion to the known values of Y1 Y2 hellip X1 X2 hellip

bull Multiple regression Y = b0 + b1 X1 + b2 X2ndash Many nonlinear functions can be transformed into

the above

bull Log-linear modelsndash The multi-way table of joint probabilities is

approximated by a product of lower-order tablesndash Probability p(a b c d) = ab acad bcd

Histograms Histograms use binning to approximate data distributions and are a popular form of data reduction

A histogram for an attribute A partitions the data distribution of A into disjoint subsets or buckets If each bucket represents only a single attribute-valuefrequency pair the buckets are called singleton buckets Often buckets instead represent continuous ranges for the given attribute

There are several partitioning rules including the following

Equal-width In an equal-width histogram the width of each bucket range is uniform

Equal-frequency (or equidepth) In an equal-frequency histogram the buckets are created so that roughly the frequency of each bucket is constant (that is each bucket contains roughly the same number of contiguous data samples)

V-Optimal If we consider all of the possible histograms for a given number of buckets the V-Optimal histogram is the one with the least variance Histogram variance is a weighted sum of the original values that each bucket represents where bucket weight is equal to the number of values in the bucket

MaxDiff In a MaxDiff histogram we consider the difference between each pair of adjacent values

ClusteringClustering techniques consider data tuples as objects They

partition the objects into groups or clusters so that objects within a cluster are ldquosimilarrdquo to one another and ldquodissimilarrdquo to objects in other clusters

In data reduction the cluster representations of the data are used to replace the actual data The effectiveness of this technique depends on the nature of the data It is much more effective for data that can be organized into distinct clusters than for smeared data

Partition data set into clusters based on similarity and store cluster representation (eg centroid and diameter) only

Can be very effective if data is clustered but not if data is ldquosmearedrdquo

Can have hierarchical clustering and be stored in multi-dimensional index tree structures

There are many choices of clustering definitions and clustering algorithms

Cluster analysis will be studied in depth later

SamplingSampling can be used as a data reduction technique because it

allows a large data set to be represented by a much smaller random sample (or subset) of the data Suppose that a large data set D contains N tuples Letrsquos look at the most common ways that we could sample D for data reduction

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample sampling complexity is potentially sublinear to the size of the data

Simple Random sample without replacement

bull For a fixed sample size sampling complexity increases only linearly as the number of data dimensions

bull When applied to data reduction sampling is most commonly used to estimate the answer to an aggregate query

Sampling obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data Simple random sampling may have very poor

performance in the presence of skew Develop adaptive sampling methods

Stratified sampling Approximate the percentage of each class (or

subpopulation of interest) in the overall database Used in conjunction with skewed data

Note Sampling may not reduce database IOs (page at a time)

Sampling with or without Replacement

SRSWOR

(simple random

sample without

replacement)

SRSWR

Raw Data

Raw Data ClusterStratified Sample

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 26: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

Attribute subset selection reduces the data set size by removing irrelevant or redundant attributes (or dimensions)

The goal of attribute subset selection is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes

Mining on a reduced set of attributes has an additional benefit

It reduces the number of attributes appearing in the discovered patterns helping to make the patterns easier to understand

The ldquoBestrdquo (and ldquoWorstrdquo) attributes are typically determined using tests of statistical significance which assume that the attributes are independent of one Many other attribute evaluation measures can be used such as the information gain measure used in building decision trees for classification

bull Basic heuristic methods of attribute subset selection include the following techniques some of which are illustrated in Figure

bull 1 Stepwise forward selection The procedure starts with an empty set of attributes as the reduced set The best of the original attributes is determined and added to the reduced set At each subsequent iteration or step the best of the remaining original attributes is added to the set

bull 2 Stepwise backward elimination The procedure starts with the full set of attributes At each step it removes the worst attribute remaining in the set

bull 3 Combination of forward selection and backward elimination The stepwise forward selection and backward elimination methods can be combined so that at each step the procedure selects the best attribute and removes the

worst from among the remaining attributes

bull 4 Decision tree induction Decision tree algorithms such as ID3 C45 and CART were originally intended for classification Decision tree induction constructs a flow chart like structure where each internal (non leaf) node denotes a test on an attribute each branch corresponds to an outcome of the test and each external (leaf) node denotes a class prediction At each node the algorithm chooses the ldquobestrdquo attribute to partition the data into individual classes When decision tree induction is used for attribute subset selection a tree is constructed from the given data All attributes that do not appear in the tree are assumed to be irrelevant The set of attributes appearing in the tree form the reduced subset of attributes

The stopping criteria for the methods may vary The procedure may employ a threshold on the measure used to determine when to stop the attribute selection process

In dimensionality reduction data encoding or transformations are applied so as to obtain a reduced or ldquocompressedrdquo representation of the original data If the original data can be reconstructed from the compressed data without any loss of information the data reduction is called lossless If instead we can reconstruct only an approximation of the original data then the data reduction is called lossy There are several well-tuned algorithms for string compression Although they are typically lossless they allow only limited manipulation of the data

In this section we instead focus on two popular and effective methods of lossy dimensionality reduction wavelet transforms and principal components analysis

bull Wavelet transforms can be applied to multidimensional data such as a data cube

bull This is done by first applying the transform to the first dimension then to the second and so on The computational complexity involved is linear with respect to the number of cells in the cube Wavelet transforms give good results on sparse or skewed data and on data with ordered attributes Lossy compression by wavelets is reportedly better than JPEG compression the current commercial standard Wavelet transforms have many real-world applications including the compression of fingerprint images computer vision analysis of time-series data and data cleaning

bull PCA is computationally inexpensive can be applied to ordered and unordered attributes and can handle sparse data and skewed data Multidimensional data of more than two dimensions can be handled by reducing the problem to two dimensions Principal components may be used as inputs to multiple regression and cluster analysis

bull In comparison with wavelet transforms PCA tends to be better at handling sparse data whereas wavelet transforms are more suitable for data of high dimensionality

ldquoCan we reduce the data volume by choosing alternative lsquosmallerrsquo forms of data representationrdquo

Techniques of numerosity reduction can indeed be applied for this purpose These techniques may be parametric or nonparametric

For parametric methods a model is used to estimate the data so that typically only the data parameters need to be stored instead of the actual data (Outliers may also be stored) Log-linear models which estimate discrete multidimensional probability distributions are an example

Nonparametric methods for storing reduced representations of the data include histograms clustering and sampling

Letrsquos look at each of the numerosity reduction techniques mentioned above

Linear regression Data are modeled to fit a straight line Often uses the least-square method to fit the

line Multiple regression allows a response variable Y

to be modeled as a linear function of multidimensional feature vector

Log-linear model approximates discrete multidimensional probability distributions

bull Linear regression Y = w X + bndash Two regression coefficients w and b specify the

line and are to be estimated by using the data at hand

ndash Using the least squares criterion to the known values of Y1 Y2 hellip X1 X2 hellip

bull Multiple regression Y = b0 + b1 X1 + b2 X2ndash Many nonlinear functions can be transformed into

the above

bull Log-linear modelsndash The multi-way table of joint probabilities is

approximated by a product of lower-order tablesndash Probability p(a b c d) = ab acad bcd

Histograms Histograms use binning to approximate data distributions and are a popular form of data reduction

A histogram for an attribute A partitions the data distribution of A into disjoint subsets or buckets If each bucket represents only a single attribute-valuefrequency pair the buckets are called singleton buckets Often buckets instead represent continuous ranges for the given attribute

There are several partitioning rules including the following

Equal-width In an equal-width histogram the width of each bucket range is uniform

Equal-frequency (or equidepth) In an equal-frequency histogram the buckets are created so that roughly the frequency of each bucket is constant (that is each bucket contains roughly the same number of contiguous data samples)

V-Optimal If we consider all of the possible histograms for a given number of buckets the V-Optimal histogram is the one with the least variance Histogram variance is a weighted sum of the original values that each bucket represents where bucket weight is equal to the number of values in the bucket

MaxDiff In a MaxDiff histogram we consider the difference between each pair of adjacent values

ClusteringClustering techniques consider data tuples as objects They

partition the objects into groups or clusters so that objects within a cluster are ldquosimilarrdquo to one another and ldquodissimilarrdquo to objects in other clusters

In data reduction the cluster representations of the data are used to replace the actual data The effectiveness of this technique depends on the nature of the data It is much more effective for data that can be organized into distinct clusters than for smeared data

Partition data set into clusters based on similarity and store cluster representation (eg centroid and diameter) only

Can be very effective if data is clustered but not if data is ldquosmearedrdquo

Can have hierarchical clustering and be stored in multi-dimensional index tree structures

There are many choices of clustering definitions and clustering algorithms

Cluster analysis will be studied in depth later

SamplingSampling can be used as a data reduction technique because it

allows a large data set to be represented by a much smaller random sample (or subset) of the data Suppose that a large data set D contains N tuples Letrsquos look at the most common ways that we could sample D for data reduction

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample sampling complexity is potentially sublinear to the size of the data

Simple Random sample without replacement

bull For a fixed sample size sampling complexity increases only linearly as the number of data dimensions

bull When applied to data reduction sampling is most commonly used to estimate the answer to an aggregate query

Sampling obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data Simple random sampling may have very poor

performance in the presence of skew Develop adaptive sampling methods

Stratified sampling Approximate the percentage of each class (or

subpopulation of interest) in the overall database Used in conjunction with skewed data

Note Sampling may not reduce database IOs (page at a time)

Sampling with or without Replacement

SRSWOR

(simple random

sample without

replacement)

SRSWR

Raw Data

Raw Data ClusterStratified Sample

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 27: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

The ldquoBestrdquo (and ldquoWorstrdquo) attributes are typically determined using tests of statistical significance which assume that the attributes are independent of one Many other attribute evaluation measures can be used such as the information gain measure used in building decision trees for classification

bull Basic heuristic methods of attribute subset selection include the following techniques some of which are illustrated in Figure

bull 1 Stepwise forward selection The procedure starts with an empty set of attributes as the reduced set The best of the original attributes is determined and added to the reduced set At each subsequent iteration or step the best of the remaining original attributes is added to the set

bull 2 Stepwise backward elimination The procedure starts with the full set of attributes At each step it removes the worst attribute remaining in the set

bull 3 Combination of forward selection and backward elimination The stepwise forward selection and backward elimination methods can be combined so that at each step the procedure selects the best attribute and removes the

worst from among the remaining attributes

bull 4 Decision tree induction Decision tree algorithms such as ID3 C45 and CART were originally intended for classification Decision tree induction constructs a flow chart like structure where each internal (non leaf) node denotes a test on an attribute each branch corresponds to an outcome of the test and each external (leaf) node denotes a class prediction At each node the algorithm chooses the ldquobestrdquo attribute to partition the data into individual classes When decision tree induction is used for attribute subset selection a tree is constructed from the given data All attributes that do not appear in the tree are assumed to be irrelevant The set of attributes appearing in the tree form the reduced subset of attributes

The stopping criteria for the methods may vary The procedure may employ a threshold on the measure used to determine when to stop the attribute selection process

In dimensionality reduction data encoding or transformations are applied so as to obtain a reduced or ldquocompressedrdquo representation of the original data If the original data can be reconstructed from the compressed data without any loss of information the data reduction is called lossless If instead we can reconstruct only an approximation of the original data then the data reduction is called lossy There are several well-tuned algorithms for string compression Although they are typically lossless they allow only limited manipulation of the data

In this section we instead focus on two popular and effective methods of lossy dimensionality reduction wavelet transforms and principal components analysis

bull Wavelet transforms can be applied to multidimensional data such as a data cube

bull This is done by first applying the transform to the first dimension then to the second and so on The computational complexity involved is linear with respect to the number of cells in the cube Wavelet transforms give good results on sparse or skewed data and on data with ordered attributes Lossy compression by wavelets is reportedly better than JPEG compression the current commercial standard Wavelet transforms have many real-world applications including the compression of fingerprint images computer vision analysis of time-series data and data cleaning

bull PCA is computationally inexpensive can be applied to ordered and unordered attributes and can handle sparse data and skewed data Multidimensional data of more than two dimensions can be handled by reducing the problem to two dimensions Principal components may be used as inputs to multiple regression and cluster analysis

bull In comparison with wavelet transforms PCA tends to be better at handling sparse data whereas wavelet transforms are more suitable for data of high dimensionality

ldquoCan we reduce the data volume by choosing alternative lsquosmallerrsquo forms of data representationrdquo

Techniques of numerosity reduction can indeed be applied for this purpose These techniques may be parametric or nonparametric

For parametric methods a model is used to estimate the data so that typically only the data parameters need to be stored instead of the actual data (Outliers may also be stored) Log-linear models which estimate discrete multidimensional probability distributions are an example

Nonparametric methods for storing reduced representations of the data include histograms clustering and sampling

Letrsquos look at each of the numerosity reduction techniques mentioned above

Linear regression Data are modeled to fit a straight line Often uses the least-square method to fit the

line Multiple regression allows a response variable Y

to be modeled as a linear function of multidimensional feature vector

Log-linear model approximates discrete multidimensional probability distributions

bull Linear regression Y = w X + bndash Two regression coefficients w and b specify the

line and are to be estimated by using the data at hand

ndash Using the least squares criterion to the known values of Y1 Y2 hellip X1 X2 hellip

bull Multiple regression Y = b0 + b1 X1 + b2 X2ndash Many nonlinear functions can be transformed into

the above

bull Log-linear modelsndash The multi-way table of joint probabilities is

approximated by a product of lower-order tablesndash Probability p(a b c d) = ab acad bcd

Histograms Histograms use binning to approximate data distributions and are a popular form of data reduction

A histogram for an attribute A partitions the data distribution of A into disjoint subsets or buckets If each bucket represents only a single attribute-valuefrequency pair the buckets are called singleton buckets Often buckets instead represent continuous ranges for the given attribute

There are several partitioning rules including the following

Equal-width In an equal-width histogram the width of each bucket range is uniform

Equal-frequency (or equidepth) In an equal-frequency histogram the buckets are created so that roughly the frequency of each bucket is constant (that is each bucket contains roughly the same number of contiguous data samples)

V-Optimal If we consider all of the possible histograms for a given number of buckets the V-Optimal histogram is the one with the least variance Histogram variance is a weighted sum of the original values that each bucket represents where bucket weight is equal to the number of values in the bucket

MaxDiff In a MaxDiff histogram we consider the difference between each pair of adjacent values

ClusteringClustering techniques consider data tuples as objects They

partition the objects into groups or clusters so that objects within a cluster are ldquosimilarrdquo to one another and ldquodissimilarrdquo to objects in other clusters

In data reduction the cluster representations of the data are used to replace the actual data The effectiveness of this technique depends on the nature of the data It is much more effective for data that can be organized into distinct clusters than for smeared data

Partition data set into clusters based on similarity and store cluster representation (eg centroid and diameter) only

Can be very effective if data is clustered but not if data is ldquosmearedrdquo

Can have hierarchical clustering and be stored in multi-dimensional index tree structures

There are many choices of clustering definitions and clustering algorithms

Cluster analysis will be studied in depth later

SamplingSampling can be used as a data reduction technique because it

allows a large data set to be represented by a much smaller random sample (or subset) of the data Suppose that a large data set D contains N tuples Letrsquos look at the most common ways that we could sample D for data reduction

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample sampling complexity is potentially sublinear to the size of the data

Simple Random sample without replacement

bull For a fixed sample size sampling complexity increases only linearly as the number of data dimensions

bull When applied to data reduction sampling is most commonly used to estimate the answer to an aggregate query

Sampling obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data Simple random sampling may have very poor

performance in the presence of skew Develop adaptive sampling methods

Stratified sampling Approximate the percentage of each class (or

subpopulation of interest) in the overall database Used in conjunction with skewed data

Note Sampling may not reduce database IOs (page at a time)

Sampling with or without Replacement

SRSWOR

(simple random

sample without

replacement)

SRSWR

Raw Data

Raw Data ClusterStratified Sample

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 28: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

bull Basic heuristic methods of attribute subset selection include the following techniques some of which are illustrated in Figure

bull 1 Stepwise forward selection The procedure starts with an empty set of attributes as the reduced set The best of the original attributes is determined and added to the reduced set At each subsequent iteration or step the best of the remaining original attributes is added to the set

bull 2 Stepwise backward elimination The procedure starts with the full set of attributes At each step it removes the worst attribute remaining in the set

bull 3 Combination of forward selection and backward elimination The stepwise forward selection and backward elimination methods can be combined so that at each step the procedure selects the best attribute and removes the

worst from among the remaining attributes

bull 4 Decision tree induction Decision tree algorithms such as ID3 C45 and CART were originally intended for classification Decision tree induction constructs a flow chart like structure where each internal (non leaf) node denotes a test on an attribute each branch corresponds to an outcome of the test and each external (leaf) node denotes a class prediction At each node the algorithm chooses the ldquobestrdquo attribute to partition the data into individual classes When decision tree induction is used for attribute subset selection a tree is constructed from the given data All attributes that do not appear in the tree are assumed to be irrelevant The set of attributes appearing in the tree form the reduced subset of attributes

The stopping criteria for the methods may vary The procedure may employ a threshold on the measure used to determine when to stop the attribute selection process

In dimensionality reduction data encoding or transformations are applied so as to obtain a reduced or ldquocompressedrdquo representation of the original data If the original data can be reconstructed from the compressed data without any loss of information the data reduction is called lossless If instead we can reconstruct only an approximation of the original data then the data reduction is called lossy There are several well-tuned algorithms for string compression Although they are typically lossless they allow only limited manipulation of the data

In this section we instead focus on two popular and effective methods of lossy dimensionality reduction wavelet transforms and principal components analysis

bull Wavelet transforms can be applied to multidimensional data such as a data cube

bull This is done by first applying the transform to the first dimension then to the second and so on The computational complexity involved is linear with respect to the number of cells in the cube Wavelet transforms give good results on sparse or skewed data and on data with ordered attributes Lossy compression by wavelets is reportedly better than JPEG compression the current commercial standard Wavelet transforms have many real-world applications including the compression of fingerprint images computer vision analysis of time-series data and data cleaning

bull PCA is computationally inexpensive can be applied to ordered and unordered attributes and can handle sparse data and skewed data Multidimensional data of more than two dimensions can be handled by reducing the problem to two dimensions Principal components may be used as inputs to multiple regression and cluster analysis

bull In comparison with wavelet transforms PCA tends to be better at handling sparse data whereas wavelet transforms are more suitable for data of high dimensionality

ldquoCan we reduce the data volume by choosing alternative lsquosmallerrsquo forms of data representationrdquo

Techniques of numerosity reduction can indeed be applied for this purpose These techniques may be parametric or nonparametric

For parametric methods a model is used to estimate the data so that typically only the data parameters need to be stored instead of the actual data (Outliers may also be stored) Log-linear models which estimate discrete multidimensional probability distributions are an example

Nonparametric methods for storing reduced representations of the data include histograms clustering and sampling

Letrsquos look at each of the numerosity reduction techniques mentioned above

Linear regression Data are modeled to fit a straight line Often uses the least-square method to fit the

line Multiple regression allows a response variable Y

to be modeled as a linear function of multidimensional feature vector

Log-linear model approximates discrete multidimensional probability distributions

bull Linear regression Y = w X + bndash Two regression coefficients w and b specify the

line and are to be estimated by using the data at hand

ndash Using the least squares criterion to the known values of Y1 Y2 hellip X1 X2 hellip

bull Multiple regression Y = b0 + b1 X1 + b2 X2ndash Many nonlinear functions can be transformed into

the above

bull Log-linear modelsndash The multi-way table of joint probabilities is

approximated by a product of lower-order tablesndash Probability p(a b c d) = ab acad bcd

Histograms Histograms use binning to approximate data distributions and are a popular form of data reduction

A histogram for an attribute A partitions the data distribution of A into disjoint subsets or buckets If each bucket represents only a single attribute-valuefrequency pair the buckets are called singleton buckets Often buckets instead represent continuous ranges for the given attribute

There are several partitioning rules including the following

Equal-width In an equal-width histogram the width of each bucket range is uniform

Equal-frequency (or equidepth) In an equal-frequency histogram the buckets are created so that roughly the frequency of each bucket is constant (that is each bucket contains roughly the same number of contiguous data samples)

V-Optimal If we consider all of the possible histograms for a given number of buckets the V-Optimal histogram is the one with the least variance Histogram variance is a weighted sum of the original values that each bucket represents where bucket weight is equal to the number of values in the bucket

MaxDiff In a MaxDiff histogram we consider the difference between each pair of adjacent values

ClusteringClustering techniques consider data tuples as objects They

partition the objects into groups or clusters so that objects within a cluster are ldquosimilarrdquo to one another and ldquodissimilarrdquo to objects in other clusters

In data reduction the cluster representations of the data are used to replace the actual data The effectiveness of this technique depends on the nature of the data It is much more effective for data that can be organized into distinct clusters than for smeared data

Partition data set into clusters based on similarity and store cluster representation (eg centroid and diameter) only

Can be very effective if data is clustered but not if data is ldquosmearedrdquo

Can have hierarchical clustering and be stored in multi-dimensional index tree structures

There are many choices of clustering definitions and clustering algorithms

Cluster analysis will be studied in depth later

SamplingSampling can be used as a data reduction technique because it

allows a large data set to be represented by a much smaller random sample (or subset) of the data Suppose that a large data set D contains N tuples Letrsquos look at the most common ways that we could sample D for data reduction

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample sampling complexity is potentially sublinear to the size of the data

Simple Random sample without replacement

bull For a fixed sample size sampling complexity increases only linearly as the number of data dimensions

bull When applied to data reduction sampling is most commonly used to estimate the answer to an aggregate query

Sampling obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data Simple random sampling may have very poor

performance in the presence of skew Develop adaptive sampling methods

Stratified sampling Approximate the percentage of each class (or

subpopulation of interest) in the overall database Used in conjunction with skewed data

Note Sampling may not reduce database IOs (page at a time)

Sampling with or without Replacement

SRSWOR

(simple random

sample without

replacement)

SRSWR

Raw Data

Raw Data ClusterStratified Sample

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 29: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

bull 4 Decision tree induction Decision tree algorithms such as ID3 C45 and CART were originally intended for classification Decision tree induction constructs a flow chart like structure where each internal (non leaf) node denotes a test on an attribute each branch corresponds to an outcome of the test and each external (leaf) node denotes a class prediction At each node the algorithm chooses the ldquobestrdquo attribute to partition the data into individual classes When decision tree induction is used for attribute subset selection a tree is constructed from the given data All attributes that do not appear in the tree are assumed to be irrelevant The set of attributes appearing in the tree form the reduced subset of attributes

The stopping criteria for the methods may vary The procedure may employ a threshold on the measure used to determine when to stop the attribute selection process

In dimensionality reduction data encoding or transformations are applied so as to obtain a reduced or ldquocompressedrdquo representation of the original data If the original data can be reconstructed from the compressed data without any loss of information the data reduction is called lossless If instead we can reconstruct only an approximation of the original data then the data reduction is called lossy There are several well-tuned algorithms for string compression Although they are typically lossless they allow only limited manipulation of the data

In this section we instead focus on two popular and effective methods of lossy dimensionality reduction wavelet transforms and principal components analysis

bull Wavelet transforms can be applied to multidimensional data such as a data cube

bull This is done by first applying the transform to the first dimension then to the second and so on The computational complexity involved is linear with respect to the number of cells in the cube Wavelet transforms give good results on sparse or skewed data and on data with ordered attributes Lossy compression by wavelets is reportedly better than JPEG compression the current commercial standard Wavelet transforms have many real-world applications including the compression of fingerprint images computer vision analysis of time-series data and data cleaning

bull PCA is computationally inexpensive can be applied to ordered and unordered attributes and can handle sparse data and skewed data Multidimensional data of more than two dimensions can be handled by reducing the problem to two dimensions Principal components may be used as inputs to multiple regression and cluster analysis

bull In comparison with wavelet transforms PCA tends to be better at handling sparse data whereas wavelet transforms are more suitable for data of high dimensionality

ldquoCan we reduce the data volume by choosing alternative lsquosmallerrsquo forms of data representationrdquo

Techniques of numerosity reduction can indeed be applied for this purpose These techniques may be parametric or nonparametric

For parametric methods a model is used to estimate the data so that typically only the data parameters need to be stored instead of the actual data (Outliers may also be stored) Log-linear models which estimate discrete multidimensional probability distributions are an example

Nonparametric methods for storing reduced representations of the data include histograms clustering and sampling

Letrsquos look at each of the numerosity reduction techniques mentioned above

Linear regression Data are modeled to fit a straight line Often uses the least-square method to fit the

line Multiple regression allows a response variable Y

to be modeled as a linear function of multidimensional feature vector

Log-linear model approximates discrete multidimensional probability distributions

bull Linear regression Y = w X + bndash Two regression coefficients w and b specify the

line and are to be estimated by using the data at hand

ndash Using the least squares criterion to the known values of Y1 Y2 hellip X1 X2 hellip

bull Multiple regression Y = b0 + b1 X1 + b2 X2ndash Many nonlinear functions can be transformed into

the above

bull Log-linear modelsndash The multi-way table of joint probabilities is

approximated by a product of lower-order tablesndash Probability p(a b c d) = ab acad bcd

Histograms Histograms use binning to approximate data distributions and are a popular form of data reduction

A histogram for an attribute A partitions the data distribution of A into disjoint subsets or buckets If each bucket represents only a single attribute-valuefrequency pair the buckets are called singleton buckets Often buckets instead represent continuous ranges for the given attribute

There are several partitioning rules including the following

Equal-width In an equal-width histogram the width of each bucket range is uniform

Equal-frequency (or equidepth) In an equal-frequency histogram the buckets are created so that roughly the frequency of each bucket is constant (that is each bucket contains roughly the same number of contiguous data samples)

V-Optimal If we consider all of the possible histograms for a given number of buckets the V-Optimal histogram is the one with the least variance Histogram variance is a weighted sum of the original values that each bucket represents where bucket weight is equal to the number of values in the bucket

MaxDiff In a MaxDiff histogram we consider the difference between each pair of adjacent values

ClusteringClustering techniques consider data tuples as objects They

partition the objects into groups or clusters so that objects within a cluster are ldquosimilarrdquo to one another and ldquodissimilarrdquo to objects in other clusters

In data reduction the cluster representations of the data are used to replace the actual data The effectiveness of this technique depends on the nature of the data It is much more effective for data that can be organized into distinct clusters than for smeared data

Partition data set into clusters based on similarity and store cluster representation (eg centroid and diameter) only

Can be very effective if data is clustered but not if data is ldquosmearedrdquo

Can have hierarchical clustering and be stored in multi-dimensional index tree structures

There are many choices of clustering definitions and clustering algorithms

Cluster analysis will be studied in depth later

SamplingSampling can be used as a data reduction technique because it

allows a large data set to be represented by a much smaller random sample (or subset) of the data Suppose that a large data set D contains N tuples Letrsquos look at the most common ways that we could sample D for data reduction

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample sampling complexity is potentially sublinear to the size of the data

Simple Random sample without replacement

bull For a fixed sample size sampling complexity increases only linearly as the number of data dimensions

bull When applied to data reduction sampling is most commonly used to estimate the answer to an aggregate query

Sampling obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data Simple random sampling may have very poor

performance in the presence of skew Develop adaptive sampling methods

Stratified sampling Approximate the percentage of each class (or

subpopulation of interest) in the overall database Used in conjunction with skewed data

Note Sampling may not reduce database IOs (page at a time)

Sampling with or without Replacement

SRSWOR

(simple random

sample without

replacement)

SRSWR

Raw Data

Raw Data ClusterStratified Sample

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 30: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

In dimensionality reduction data encoding or transformations are applied so as to obtain a reduced or ldquocompressedrdquo representation of the original data If the original data can be reconstructed from the compressed data without any loss of information the data reduction is called lossless If instead we can reconstruct only an approximation of the original data then the data reduction is called lossy There are several well-tuned algorithms for string compression Although they are typically lossless they allow only limited manipulation of the data

In this section we instead focus on two popular and effective methods of lossy dimensionality reduction wavelet transforms and principal components analysis

bull Wavelet transforms can be applied to multidimensional data such as a data cube

bull This is done by first applying the transform to the first dimension then to the second and so on The computational complexity involved is linear with respect to the number of cells in the cube Wavelet transforms give good results on sparse or skewed data and on data with ordered attributes Lossy compression by wavelets is reportedly better than JPEG compression the current commercial standard Wavelet transforms have many real-world applications including the compression of fingerprint images computer vision analysis of time-series data and data cleaning

bull PCA is computationally inexpensive can be applied to ordered and unordered attributes and can handle sparse data and skewed data Multidimensional data of more than two dimensions can be handled by reducing the problem to two dimensions Principal components may be used as inputs to multiple regression and cluster analysis

bull In comparison with wavelet transforms PCA tends to be better at handling sparse data whereas wavelet transforms are more suitable for data of high dimensionality

ldquoCan we reduce the data volume by choosing alternative lsquosmallerrsquo forms of data representationrdquo

Techniques of numerosity reduction can indeed be applied for this purpose These techniques may be parametric or nonparametric

For parametric methods a model is used to estimate the data so that typically only the data parameters need to be stored instead of the actual data (Outliers may also be stored) Log-linear models which estimate discrete multidimensional probability distributions are an example

Nonparametric methods for storing reduced representations of the data include histograms clustering and sampling

Letrsquos look at each of the numerosity reduction techniques mentioned above

Linear regression Data are modeled to fit a straight line Often uses the least-square method to fit the

line Multiple regression allows a response variable Y

to be modeled as a linear function of multidimensional feature vector

Log-linear model approximates discrete multidimensional probability distributions

bull Linear regression Y = w X + bndash Two regression coefficients w and b specify the

line and are to be estimated by using the data at hand

ndash Using the least squares criterion to the known values of Y1 Y2 hellip X1 X2 hellip

bull Multiple regression Y = b0 + b1 X1 + b2 X2ndash Many nonlinear functions can be transformed into

the above

bull Log-linear modelsndash The multi-way table of joint probabilities is

approximated by a product of lower-order tablesndash Probability p(a b c d) = ab acad bcd

Histograms Histograms use binning to approximate data distributions and are a popular form of data reduction

A histogram for an attribute A partitions the data distribution of A into disjoint subsets or buckets If each bucket represents only a single attribute-valuefrequency pair the buckets are called singleton buckets Often buckets instead represent continuous ranges for the given attribute

There are several partitioning rules including the following

Equal-width In an equal-width histogram the width of each bucket range is uniform

Equal-frequency (or equidepth) In an equal-frequency histogram the buckets are created so that roughly the frequency of each bucket is constant (that is each bucket contains roughly the same number of contiguous data samples)

V-Optimal If we consider all of the possible histograms for a given number of buckets the V-Optimal histogram is the one with the least variance Histogram variance is a weighted sum of the original values that each bucket represents where bucket weight is equal to the number of values in the bucket

MaxDiff In a MaxDiff histogram we consider the difference between each pair of adjacent values

ClusteringClustering techniques consider data tuples as objects They

partition the objects into groups or clusters so that objects within a cluster are ldquosimilarrdquo to one another and ldquodissimilarrdquo to objects in other clusters

In data reduction the cluster representations of the data are used to replace the actual data The effectiveness of this technique depends on the nature of the data It is much more effective for data that can be organized into distinct clusters than for smeared data

Partition data set into clusters based on similarity and store cluster representation (eg centroid and diameter) only

Can be very effective if data is clustered but not if data is ldquosmearedrdquo

Can have hierarchical clustering and be stored in multi-dimensional index tree structures

There are many choices of clustering definitions and clustering algorithms

Cluster analysis will be studied in depth later

SamplingSampling can be used as a data reduction technique because it

allows a large data set to be represented by a much smaller random sample (or subset) of the data Suppose that a large data set D contains N tuples Letrsquos look at the most common ways that we could sample D for data reduction

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample sampling complexity is potentially sublinear to the size of the data

Simple Random sample without replacement

bull For a fixed sample size sampling complexity increases only linearly as the number of data dimensions

bull When applied to data reduction sampling is most commonly used to estimate the answer to an aggregate query

Sampling obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data Simple random sampling may have very poor

performance in the presence of skew Develop adaptive sampling methods

Stratified sampling Approximate the percentage of each class (or

subpopulation of interest) in the overall database Used in conjunction with skewed data

Note Sampling may not reduce database IOs (page at a time)

Sampling with or without Replacement

SRSWOR

(simple random

sample without

replacement)

SRSWR

Raw Data

Raw Data ClusterStratified Sample

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 31: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

bull Wavelet transforms can be applied to multidimensional data such as a data cube

bull This is done by first applying the transform to the first dimension then to the second and so on The computational complexity involved is linear with respect to the number of cells in the cube Wavelet transforms give good results on sparse or skewed data and on data with ordered attributes Lossy compression by wavelets is reportedly better than JPEG compression the current commercial standard Wavelet transforms have many real-world applications including the compression of fingerprint images computer vision analysis of time-series data and data cleaning

bull PCA is computationally inexpensive can be applied to ordered and unordered attributes and can handle sparse data and skewed data Multidimensional data of more than two dimensions can be handled by reducing the problem to two dimensions Principal components may be used as inputs to multiple regression and cluster analysis

bull In comparison with wavelet transforms PCA tends to be better at handling sparse data whereas wavelet transforms are more suitable for data of high dimensionality

ldquoCan we reduce the data volume by choosing alternative lsquosmallerrsquo forms of data representationrdquo

Techniques of numerosity reduction can indeed be applied for this purpose These techniques may be parametric or nonparametric

For parametric methods a model is used to estimate the data so that typically only the data parameters need to be stored instead of the actual data (Outliers may also be stored) Log-linear models which estimate discrete multidimensional probability distributions are an example

Nonparametric methods for storing reduced representations of the data include histograms clustering and sampling

Letrsquos look at each of the numerosity reduction techniques mentioned above

Linear regression Data are modeled to fit a straight line Often uses the least-square method to fit the

line Multiple regression allows a response variable Y

to be modeled as a linear function of multidimensional feature vector

Log-linear model approximates discrete multidimensional probability distributions

bull Linear regression Y = w X + bndash Two regression coefficients w and b specify the

line and are to be estimated by using the data at hand

ndash Using the least squares criterion to the known values of Y1 Y2 hellip X1 X2 hellip

bull Multiple regression Y = b0 + b1 X1 + b2 X2ndash Many nonlinear functions can be transformed into

the above

bull Log-linear modelsndash The multi-way table of joint probabilities is

approximated by a product of lower-order tablesndash Probability p(a b c d) = ab acad bcd

Histograms Histograms use binning to approximate data distributions and are a popular form of data reduction

A histogram for an attribute A partitions the data distribution of A into disjoint subsets or buckets If each bucket represents only a single attribute-valuefrequency pair the buckets are called singleton buckets Often buckets instead represent continuous ranges for the given attribute

There are several partitioning rules including the following

Equal-width In an equal-width histogram the width of each bucket range is uniform

Equal-frequency (or equidepth) In an equal-frequency histogram the buckets are created so that roughly the frequency of each bucket is constant (that is each bucket contains roughly the same number of contiguous data samples)

V-Optimal If we consider all of the possible histograms for a given number of buckets the V-Optimal histogram is the one with the least variance Histogram variance is a weighted sum of the original values that each bucket represents where bucket weight is equal to the number of values in the bucket

MaxDiff In a MaxDiff histogram we consider the difference between each pair of adjacent values

ClusteringClustering techniques consider data tuples as objects They

partition the objects into groups or clusters so that objects within a cluster are ldquosimilarrdquo to one another and ldquodissimilarrdquo to objects in other clusters

In data reduction the cluster representations of the data are used to replace the actual data The effectiveness of this technique depends on the nature of the data It is much more effective for data that can be organized into distinct clusters than for smeared data

Partition data set into clusters based on similarity and store cluster representation (eg centroid and diameter) only

Can be very effective if data is clustered but not if data is ldquosmearedrdquo

Can have hierarchical clustering and be stored in multi-dimensional index tree structures

There are many choices of clustering definitions and clustering algorithms

Cluster analysis will be studied in depth later

SamplingSampling can be used as a data reduction technique because it

allows a large data set to be represented by a much smaller random sample (or subset) of the data Suppose that a large data set D contains N tuples Letrsquos look at the most common ways that we could sample D for data reduction

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample sampling complexity is potentially sublinear to the size of the data

Simple Random sample without replacement

bull For a fixed sample size sampling complexity increases only linearly as the number of data dimensions

bull When applied to data reduction sampling is most commonly used to estimate the answer to an aggregate query

Sampling obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data Simple random sampling may have very poor

performance in the presence of skew Develop adaptive sampling methods

Stratified sampling Approximate the percentage of each class (or

subpopulation of interest) in the overall database Used in conjunction with skewed data

Note Sampling may not reduce database IOs (page at a time)

Sampling with or without Replacement

SRSWOR

(simple random

sample without

replacement)

SRSWR

Raw Data

Raw Data ClusterStratified Sample

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 32: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

bull PCA is computationally inexpensive can be applied to ordered and unordered attributes and can handle sparse data and skewed data Multidimensional data of more than two dimensions can be handled by reducing the problem to two dimensions Principal components may be used as inputs to multiple regression and cluster analysis

bull In comparison with wavelet transforms PCA tends to be better at handling sparse data whereas wavelet transforms are more suitable for data of high dimensionality

ldquoCan we reduce the data volume by choosing alternative lsquosmallerrsquo forms of data representationrdquo

Techniques of numerosity reduction can indeed be applied for this purpose These techniques may be parametric or nonparametric

For parametric methods a model is used to estimate the data so that typically only the data parameters need to be stored instead of the actual data (Outliers may also be stored) Log-linear models which estimate discrete multidimensional probability distributions are an example

Nonparametric methods for storing reduced representations of the data include histograms clustering and sampling

Letrsquos look at each of the numerosity reduction techniques mentioned above

Linear regression Data are modeled to fit a straight line Often uses the least-square method to fit the

line Multiple regression allows a response variable Y

to be modeled as a linear function of multidimensional feature vector

Log-linear model approximates discrete multidimensional probability distributions

bull Linear regression Y = w X + bndash Two regression coefficients w and b specify the

line and are to be estimated by using the data at hand

ndash Using the least squares criterion to the known values of Y1 Y2 hellip X1 X2 hellip

bull Multiple regression Y = b0 + b1 X1 + b2 X2ndash Many nonlinear functions can be transformed into

the above

bull Log-linear modelsndash The multi-way table of joint probabilities is

approximated by a product of lower-order tablesndash Probability p(a b c d) = ab acad bcd

Histograms Histograms use binning to approximate data distributions and are a popular form of data reduction

A histogram for an attribute A partitions the data distribution of A into disjoint subsets or buckets If each bucket represents only a single attribute-valuefrequency pair the buckets are called singleton buckets Often buckets instead represent continuous ranges for the given attribute

There are several partitioning rules including the following

Equal-width In an equal-width histogram the width of each bucket range is uniform

Equal-frequency (or equidepth) In an equal-frequency histogram the buckets are created so that roughly the frequency of each bucket is constant (that is each bucket contains roughly the same number of contiguous data samples)

V-Optimal If we consider all of the possible histograms for a given number of buckets the V-Optimal histogram is the one with the least variance Histogram variance is a weighted sum of the original values that each bucket represents where bucket weight is equal to the number of values in the bucket

MaxDiff In a MaxDiff histogram we consider the difference between each pair of adjacent values

ClusteringClustering techniques consider data tuples as objects They

partition the objects into groups or clusters so that objects within a cluster are ldquosimilarrdquo to one another and ldquodissimilarrdquo to objects in other clusters

In data reduction the cluster representations of the data are used to replace the actual data The effectiveness of this technique depends on the nature of the data It is much more effective for data that can be organized into distinct clusters than for smeared data

Partition data set into clusters based on similarity and store cluster representation (eg centroid and diameter) only

Can be very effective if data is clustered but not if data is ldquosmearedrdquo

Can have hierarchical clustering and be stored in multi-dimensional index tree structures

There are many choices of clustering definitions and clustering algorithms

Cluster analysis will be studied in depth later

SamplingSampling can be used as a data reduction technique because it

allows a large data set to be represented by a much smaller random sample (or subset) of the data Suppose that a large data set D contains N tuples Letrsquos look at the most common ways that we could sample D for data reduction

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample sampling complexity is potentially sublinear to the size of the data

Simple Random sample without replacement

bull For a fixed sample size sampling complexity increases only linearly as the number of data dimensions

bull When applied to data reduction sampling is most commonly used to estimate the answer to an aggregate query

Sampling obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data Simple random sampling may have very poor

performance in the presence of skew Develop adaptive sampling methods

Stratified sampling Approximate the percentage of each class (or

subpopulation of interest) in the overall database Used in conjunction with skewed data

Note Sampling may not reduce database IOs (page at a time)

Sampling with or without Replacement

SRSWOR

(simple random

sample without

replacement)

SRSWR

Raw Data

Raw Data ClusterStratified Sample

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 33: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

ldquoCan we reduce the data volume by choosing alternative lsquosmallerrsquo forms of data representationrdquo

Techniques of numerosity reduction can indeed be applied for this purpose These techniques may be parametric or nonparametric

For parametric methods a model is used to estimate the data so that typically only the data parameters need to be stored instead of the actual data (Outliers may also be stored) Log-linear models which estimate discrete multidimensional probability distributions are an example

Nonparametric methods for storing reduced representations of the data include histograms clustering and sampling

Letrsquos look at each of the numerosity reduction techniques mentioned above

Linear regression Data are modeled to fit a straight line Often uses the least-square method to fit the

line Multiple regression allows a response variable Y

to be modeled as a linear function of multidimensional feature vector

Log-linear model approximates discrete multidimensional probability distributions

bull Linear regression Y = w X + bndash Two regression coefficients w and b specify the

line and are to be estimated by using the data at hand

ndash Using the least squares criterion to the known values of Y1 Y2 hellip X1 X2 hellip

bull Multiple regression Y = b0 + b1 X1 + b2 X2ndash Many nonlinear functions can be transformed into

the above

bull Log-linear modelsndash The multi-way table of joint probabilities is

approximated by a product of lower-order tablesndash Probability p(a b c d) = ab acad bcd

Histograms Histograms use binning to approximate data distributions and are a popular form of data reduction

A histogram for an attribute A partitions the data distribution of A into disjoint subsets or buckets If each bucket represents only a single attribute-valuefrequency pair the buckets are called singleton buckets Often buckets instead represent continuous ranges for the given attribute

There are several partitioning rules including the following

Equal-width In an equal-width histogram the width of each bucket range is uniform

Equal-frequency (or equidepth) In an equal-frequency histogram the buckets are created so that roughly the frequency of each bucket is constant (that is each bucket contains roughly the same number of contiguous data samples)

V-Optimal If we consider all of the possible histograms for a given number of buckets the V-Optimal histogram is the one with the least variance Histogram variance is a weighted sum of the original values that each bucket represents where bucket weight is equal to the number of values in the bucket

MaxDiff In a MaxDiff histogram we consider the difference between each pair of adjacent values

ClusteringClustering techniques consider data tuples as objects They

partition the objects into groups or clusters so that objects within a cluster are ldquosimilarrdquo to one another and ldquodissimilarrdquo to objects in other clusters

In data reduction the cluster representations of the data are used to replace the actual data The effectiveness of this technique depends on the nature of the data It is much more effective for data that can be organized into distinct clusters than for smeared data

Partition data set into clusters based on similarity and store cluster representation (eg centroid and diameter) only

Can be very effective if data is clustered but not if data is ldquosmearedrdquo

Can have hierarchical clustering and be stored in multi-dimensional index tree structures

There are many choices of clustering definitions and clustering algorithms

Cluster analysis will be studied in depth later

SamplingSampling can be used as a data reduction technique because it

allows a large data set to be represented by a much smaller random sample (or subset) of the data Suppose that a large data set D contains N tuples Letrsquos look at the most common ways that we could sample D for data reduction

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample sampling complexity is potentially sublinear to the size of the data

Simple Random sample without replacement

bull For a fixed sample size sampling complexity increases only linearly as the number of data dimensions

bull When applied to data reduction sampling is most commonly used to estimate the answer to an aggregate query

Sampling obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data Simple random sampling may have very poor

performance in the presence of skew Develop adaptive sampling methods

Stratified sampling Approximate the percentage of each class (or

subpopulation of interest) in the overall database Used in conjunction with skewed data

Note Sampling may not reduce database IOs (page at a time)

Sampling with or without Replacement

SRSWOR

(simple random

sample without

replacement)

SRSWR

Raw Data

Raw Data ClusterStratified Sample

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 34: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

Linear regression Data are modeled to fit a straight line Often uses the least-square method to fit the

line Multiple regression allows a response variable Y

to be modeled as a linear function of multidimensional feature vector

Log-linear model approximates discrete multidimensional probability distributions

bull Linear regression Y = w X + bndash Two regression coefficients w and b specify the

line and are to be estimated by using the data at hand

ndash Using the least squares criterion to the known values of Y1 Y2 hellip X1 X2 hellip

bull Multiple regression Y = b0 + b1 X1 + b2 X2ndash Many nonlinear functions can be transformed into

the above

bull Log-linear modelsndash The multi-way table of joint probabilities is

approximated by a product of lower-order tablesndash Probability p(a b c d) = ab acad bcd

Histograms Histograms use binning to approximate data distributions and are a popular form of data reduction

A histogram for an attribute A partitions the data distribution of A into disjoint subsets or buckets If each bucket represents only a single attribute-valuefrequency pair the buckets are called singleton buckets Often buckets instead represent continuous ranges for the given attribute

There are several partitioning rules including the following

Equal-width In an equal-width histogram the width of each bucket range is uniform

Equal-frequency (or equidepth) In an equal-frequency histogram the buckets are created so that roughly the frequency of each bucket is constant (that is each bucket contains roughly the same number of contiguous data samples)

V-Optimal If we consider all of the possible histograms for a given number of buckets the V-Optimal histogram is the one with the least variance Histogram variance is a weighted sum of the original values that each bucket represents where bucket weight is equal to the number of values in the bucket

MaxDiff In a MaxDiff histogram we consider the difference between each pair of adjacent values

ClusteringClustering techniques consider data tuples as objects They

partition the objects into groups or clusters so that objects within a cluster are ldquosimilarrdquo to one another and ldquodissimilarrdquo to objects in other clusters

In data reduction the cluster representations of the data are used to replace the actual data The effectiveness of this technique depends on the nature of the data It is much more effective for data that can be organized into distinct clusters than for smeared data

Partition data set into clusters based on similarity and store cluster representation (eg centroid and diameter) only

Can be very effective if data is clustered but not if data is ldquosmearedrdquo

Can have hierarchical clustering and be stored in multi-dimensional index tree structures

There are many choices of clustering definitions and clustering algorithms

Cluster analysis will be studied in depth later

SamplingSampling can be used as a data reduction technique because it

allows a large data set to be represented by a much smaller random sample (or subset) of the data Suppose that a large data set D contains N tuples Letrsquos look at the most common ways that we could sample D for data reduction

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample sampling complexity is potentially sublinear to the size of the data

Simple Random sample without replacement

bull For a fixed sample size sampling complexity increases only linearly as the number of data dimensions

bull When applied to data reduction sampling is most commonly used to estimate the answer to an aggregate query

Sampling obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data Simple random sampling may have very poor

performance in the presence of skew Develop adaptive sampling methods

Stratified sampling Approximate the percentage of each class (or

subpopulation of interest) in the overall database Used in conjunction with skewed data

Note Sampling may not reduce database IOs (page at a time)

Sampling with or without Replacement

SRSWOR

(simple random

sample without

replacement)

SRSWR

Raw Data

Raw Data ClusterStratified Sample

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 35: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

bull Linear regression Y = w X + bndash Two regression coefficients w and b specify the

line and are to be estimated by using the data at hand

ndash Using the least squares criterion to the known values of Y1 Y2 hellip X1 X2 hellip

bull Multiple regression Y = b0 + b1 X1 + b2 X2ndash Many nonlinear functions can be transformed into

the above

bull Log-linear modelsndash The multi-way table of joint probabilities is

approximated by a product of lower-order tablesndash Probability p(a b c d) = ab acad bcd

Histograms Histograms use binning to approximate data distributions and are a popular form of data reduction

A histogram for an attribute A partitions the data distribution of A into disjoint subsets or buckets If each bucket represents only a single attribute-valuefrequency pair the buckets are called singleton buckets Often buckets instead represent continuous ranges for the given attribute

There are several partitioning rules including the following

Equal-width In an equal-width histogram the width of each bucket range is uniform

Equal-frequency (or equidepth) In an equal-frequency histogram the buckets are created so that roughly the frequency of each bucket is constant (that is each bucket contains roughly the same number of contiguous data samples)

V-Optimal If we consider all of the possible histograms for a given number of buckets the V-Optimal histogram is the one with the least variance Histogram variance is a weighted sum of the original values that each bucket represents where bucket weight is equal to the number of values in the bucket

MaxDiff In a MaxDiff histogram we consider the difference between each pair of adjacent values

ClusteringClustering techniques consider data tuples as objects They

partition the objects into groups or clusters so that objects within a cluster are ldquosimilarrdquo to one another and ldquodissimilarrdquo to objects in other clusters

In data reduction the cluster representations of the data are used to replace the actual data The effectiveness of this technique depends on the nature of the data It is much more effective for data that can be organized into distinct clusters than for smeared data

Partition data set into clusters based on similarity and store cluster representation (eg centroid and diameter) only

Can be very effective if data is clustered but not if data is ldquosmearedrdquo

Can have hierarchical clustering and be stored in multi-dimensional index tree structures

There are many choices of clustering definitions and clustering algorithms

Cluster analysis will be studied in depth later

SamplingSampling can be used as a data reduction technique because it

allows a large data set to be represented by a much smaller random sample (or subset) of the data Suppose that a large data set D contains N tuples Letrsquos look at the most common ways that we could sample D for data reduction

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample sampling complexity is potentially sublinear to the size of the data

Simple Random sample without replacement

bull For a fixed sample size sampling complexity increases only linearly as the number of data dimensions

bull When applied to data reduction sampling is most commonly used to estimate the answer to an aggregate query

Sampling obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data Simple random sampling may have very poor

performance in the presence of skew Develop adaptive sampling methods

Stratified sampling Approximate the percentage of each class (or

subpopulation of interest) in the overall database Used in conjunction with skewed data

Note Sampling may not reduce database IOs (page at a time)

Sampling with or without Replacement

SRSWOR

(simple random

sample without

replacement)

SRSWR

Raw Data

Raw Data ClusterStratified Sample

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 36: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

Histograms Histograms use binning to approximate data distributions and are a popular form of data reduction

A histogram for an attribute A partitions the data distribution of A into disjoint subsets or buckets If each bucket represents only a single attribute-valuefrequency pair the buckets are called singleton buckets Often buckets instead represent continuous ranges for the given attribute

There are several partitioning rules including the following

Equal-width In an equal-width histogram the width of each bucket range is uniform

Equal-frequency (or equidepth) In an equal-frequency histogram the buckets are created so that roughly the frequency of each bucket is constant (that is each bucket contains roughly the same number of contiguous data samples)

V-Optimal If we consider all of the possible histograms for a given number of buckets the V-Optimal histogram is the one with the least variance Histogram variance is a weighted sum of the original values that each bucket represents where bucket weight is equal to the number of values in the bucket

MaxDiff In a MaxDiff histogram we consider the difference between each pair of adjacent values

ClusteringClustering techniques consider data tuples as objects They

partition the objects into groups or clusters so that objects within a cluster are ldquosimilarrdquo to one another and ldquodissimilarrdquo to objects in other clusters

In data reduction the cluster representations of the data are used to replace the actual data The effectiveness of this technique depends on the nature of the data It is much more effective for data that can be organized into distinct clusters than for smeared data

Partition data set into clusters based on similarity and store cluster representation (eg centroid and diameter) only

Can be very effective if data is clustered but not if data is ldquosmearedrdquo

Can have hierarchical clustering and be stored in multi-dimensional index tree structures

There are many choices of clustering definitions and clustering algorithms

Cluster analysis will be studied in depth later

SamplingSampling can be used as a data reduction technique because it

allows a large data set to be represented by a much smaller random sample (or subset) of the data Suppose that a large data set D contains N tuples Letrsquos look at the most common ways that we could sample D for data reduction

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample sampling complexity is potentially sublinear to the size of the data

Simple Random sample without replacement

bull For a fixed sample size sampling complexity increases only linearly as the number of data dimensions

bull When applied to data reduction sampling is most commonly used to estimate the answer to an aggregate query

Sampling obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data Simple random sampling may have very poor

performance in the presence of skew Develop adaptive sampling methods

Stratified sampling Approximate the percentage of each class (or

subpopulation of interest) in the overall database Used in conjunction with skewed data

Note Sampling may not reduce database IOs (page at a time)

Sampling with or without Replacement

SRSWOR

(simple random

sample without

replacement)

SRSWR

Raw Data

Raw Data ClusterStratified Sample

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 37: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

There are several partitioning rules including the following

Equal-width In an equal-width histogram the width of each bucket range is uniform

Equal-frequency (or equidepth) In an equal-frequency histogram the buckets are created so that roughly the frequency of each bucket is constant (that is each bucket contains roughly the same number of contiguous data samples)

V-Optimal If we consider all of the possible histograms for a given number of buckets the V-Optimal histogram is the one with the least variance Histogram variance is a weighted sum of the original values that each bucket represents where bucket weight is equal to the number of values in the bucket

MaxDiff In a MaxDiff histogram we consider the difference between each pair of adjacent values

ClusteringClustering techniques consider data tuples as objects They

partition the objects into groups or clusters so that objects within a cluster are ldquosimilarrdquo to one another and ldquodissimilarrdquo to objects in other clusters

In data reduction the cluster representations of the data are used to replace the actual data The effectiveness of this technique depends on the nature of the data It is much more effective for data that can be organized into distinct clusters than for smeared data

Partition data set into clusters based on similarity and store cluster representation (eg centroid and diameter) only

Can be very effective if data is clustered but not if data is ldquosmearedrdquo

Can have hierarchical clustering and be stored in multi-dimensional index tree structures

There are many choices of clustering definitions and clustering algorithms

Cluster analysis will be studied in depth later

SamplingSampling can be used as a data reduction technique because it

allows a large data set to be represented by a much smaller random sample (or subset) of the data Suppose that a large data set D contains N tuples Letrsquos look at the most common ways that we could sample D for data reduction

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample sampling complexity is potentially sublinear to the size of the data

Simple Random sample without replacement

bull For a fixed sample size sampling complexity increases only linearly as the number of data dimensions

bull When applied to data reduction sampling is most commonly used to estimate the answer to an aggregate query

Sampling obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data Simple random sampling may have very poor

performance in the presence of skew Develop adaptive sampling methods

Stratified sampling Approximate the percentage of each class (or

subpopulation of interest) in the overall database Used in conjunction with skewed data

Note Sampling may not reduce database IOs (page at a time)

Sampling with or without Replacement

SRSWOR

(simple random

sample without

replacement)

SRSWR

Raw Data

Raw Data ClusterStratified Sample

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 38: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

ClusteringClustering techniques consider data tuples as objects They

partition the objects into groups or clusters so that objects within a cluster are ldquosimilarrdquo to one another and ldquodissimilarrdquo to objects in other clusters

In data reduction the cluster representations of the data are used to replace the actual data The effectiveness of this technique depends on the nature of the data It is much more effective for data that can be organized into distinct clusters than for smeared data

Partition data set into clusters based on similarity and store cluster representation (eg centroid and diameter) only

Can be very effective if data is clustered but not if data is ldquosmearedrdquo

Can have hierarchical clustering and be stored in multi-dimensional index tree structures

There are many choices of clustering definitions and clustering algorithms

Cluster analysis will be studied in depth later

SamplingSampling can be used as a data reduction technique because it

allows a large data set to be represented by a much smaller random sample (or subset) of the data Suppose that a large data set D contains N tuples Letrsquos look at the most common ways that we could sample D for data reduction

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample sampling complexity is potentially sublinear to the size of the data

Simple Random sample without replacement

bull For a fixed sample size sampling complexity increases only linearly as the number of data dimensions

bull When applied to data reduction sampling is most commonly used to estimate the answer to an aggregate query

Sampling obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data Simple random sampling may have very poor

performance in the presence of skew Develop adaptive sampling methods

Stratified sampling Approximate the percentage of each class (or

subpopulation of interest) in the overall database Used in conjunction with skewed data

Note Sampling may not reduce database IOs (page at a time)

Sampling with or without Replacement

SRSWOR

(simple random

sample without

replacement)

SRSWR

Raw Data

Raw Data ClusterStratified Sample

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 39: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

Partition data set into clusters based on similarity and store cluster representation (eg centroid and diameter) only

Can be very effective if data is clustered but not if data is ldquosmearedrdquo

Can have hierarchical clustering and be stored in multi-dimensional index tree structures

There are many choices of clustering definitions and clustering algorithms

Cluster analysis will be studied in depth later

SamplingSampling can be used as a data reduction technique because it

allows a large data set to be represented by a much smaller random sample (or subset) of the data Suppose that a large data set D contains N tuples Letrsquos look at the most common ways that we could sample D for data reduction

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample sampling complexity is potentially sublinear to the size of the data

Simple Random sample without replacement

bull For a fixed sample size sampling complexity increases only linearly as the number of data dimensions

bull When applied to data reduction sampling is most commonly used to estimate the answer to an aggregate query

Sampling obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data Simple random sampling may have very poor

performance in the presence of skew Develop adaptive sampling methods

Stratified sampling Approximate the percentage of each class (or

subpopulation of interest) in the overall database Used in conjunction with skewed data

Note Sampling may not reduce database IOs (page at a time)

Sampling with or without Replacement

SRSWOR

(simple random

sample without

replacement)

SRSWR

Raw Data

Raw Data ClusterStratified Sample

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 40: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

SamplingSampling can be used as a data reduction technique because it

allows a large data set to be represented by a much smaller random sample (or subset) of the data Suppose that a large data set D contains N tuples Letrsquos look at the most common ways that we could sample D for data reduction

An advantage of sampling for data reduction is that the cost of obtaining a sample is proportional to the size of the sample sampling complexity is potentially sublinear to the size of the data

Simple Random sample without replacement

bull For a fixed sample size sampling complexity increases only linearly as the number of data dimensions

bull When applied to data reduction sampling is most commonly used to estimate the answer to an aggregate query

Sampling obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data Simple random sampling may have very poor

performance in the presence of skew Develop adaptive sampling methods

Stratified sampling Approximate the percentage of each class (or

subpopulation of interest) in the overall database Used in conjunction with skewed data

Note Sampling may not reduce database IOs (page at a time)

Sampling with or without Replacement

SRSWOR

(simple random

sample without

replacement)

SRSWR

Raw Data

Raw Data ClusterStratified Sample

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 41: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

Simple Random sample without replacement

bull For a fixed sample size sampling complexity increases only linearly as the number of data dimensions

bull When applied to data reduction sampling is most commonly used to estimate the answer to an aggregate query

Sampling obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data Simple random sampling may have very poor

performance in the presence of skew Develop adaptive sampling methods

Stratified sampling Approximate the percentage of each class (or

subpopulation of interest) in the overall database Used in conjunction with skewed data

Note Sampling may not reduce database IOs (page at a time)

Sampling with or without Replacement

SRSWOR

(simple random

sample without

replacement)

SRSWR

Raw Data

Raw Data ClusterStratified Sample

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 42: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

bull For a fixed sample size sampling complexity increases only linearly as the number of data dimensions

bull When applied to data reduction sampling is most commonly used to estimate the answer to an aggregate query

Sampling obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data Simple random sampling may have very poor

performance in the presence of skew Develop adaptive sampling methods

Stratified sampling Approximate the percentage of each class (or

subpopulation of interest) in the overall database Used in conjunction with skewed data

Note Sampling may not reduce database IOs (page at a time)

Sampling with or without Replacement

SRSWOR

(simple random

sample without

replacement)

SRSWR

Raw Data

Raw Data ClusterStratified Sample

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 43: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

Sampling obtaining a small sample s to represent the whole data set N

Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data

Choose a representative subset of the data Simple random sampling may have very poor

performance in the presence of skew Develop adaptive sampling methods

Stratified sampling Approximate the percentage of each class (or

subpopulation of interest) in the overall database Used in conjunction with skewed data

Note Sampling may not reduce database IOs (page at a time)

Sampling with or without Replacement

SRSWOR

(simple random

sample without

replacement)

SRSWR

Raw Data

Raw Data ClusterStratified Sample

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 44: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

Sampling with or without Replacement

SRSWOR

(simple random

sample without

replacement)

SRSWR

Raw Data

Raw Data ClusterStratified Sample

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 45: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

Raw Data ClusterStratified Sample

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 46: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals Interval labels can then be used to replace actual data values Replacing numerous values of a continuous attribute by a small number of interval labels thereby reduces and simplifies the original data This leads to a concise easy-to-use knowledge-level representation of mining results

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 47: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

bull Discretization techniques can be categorized based on how the discretization is performed such as whether it uses class information or which direction it proceeds (ie top-down vs bottom-up) If the discretization process uses class information then we say it is supervised discretization Otherwise it is unsupervised If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range and then repeats this recursively on the resulting intervals it is called top-down discretization or splitting This contrasts with bottom-up discretization or merging which starts by considering all of the continuous values as potential split-points removes some by merging neighborhood values to form intervals and then recursively applies this process to the resulting intervals Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values known as a concept hierarchy Concept hierarchies are useful for mining at multiple levels of abstraction

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 48: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

Three types of attributes Nominal mdash values from an unordered set eg color

profession Ordinal mdash values from an ordered set eg military or

academic rank Continuous mdash real numbers eg integer or real

numbers Discretization

Divide the range of a continuous attribute into intervals Some classification algorithms only accept categorical

attributes Reduce data size by discretization Prepare for further analysis

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 49: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

Typical methods All the methods can be applied recursively Binning (covered above)

Top-down split unsupervised Histogram analysis (covered above)

Top-down split unsupervised Clustering analysis (covered above)

Either top-down split or bottom-up merge unsupervised Entropy-based discretization supervised top-down

split Interval merging by 2 Analysis unsupervised bottom-

up merge Segmentation by natural partitioning top-down split

unsupervised

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 50: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

Given a set of samples S if S is partitioned into two intervals S1 and S2 using boundary T the information gain after partitioning is

Entropy is calculated based on class distribution of the samples in the set Given m classes the entropy of S1 is

where pi is the probability of class i in S1 The boundary that minimizes the entropy function over

all possible boundaries is selected as a binary discretization

The process is recursively applied to partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy

)Entropy(S|S|

|S|+)Entropy(S

|S|

|S|=T)I(S 2

21

1

m

iii ppSEntropy

121 )(log)(

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 51: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

Merging-based (bottom-up) vs splitting-based methods Merge Find the best neighboring intervals and merge

them to form larger intervals recursively ChiMerge [Kerber AAAI 1992 See also Liu et al DMKD

2002] Initially each distinct value of a numerical attr A is

considered to be one interval 2 tests are performed for every pair of adjacent

intervals Adjacent intervals with the least 2 values are merged

together since low 2 values for a pair indicate similar class distributions

This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level max-interval max inconsistency etc)

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 52: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

Cluster analysis is a popular data discretization method A clustering algorithm can be applied to discretize a numerical attribute A by partitioning the values of A into clusters or groups Clustering takes the distribution of A into consideration as well as the closeness of data points and therefore is able to produce high-quality discretization results

Clustering can be used to generate a concept hierarchy for A by following either a topdown splitting strategy or a bottom-up merging strategy where each cluster forms a node of the concept hierarchy In the former each initial cluster or partition may be further

decomposed into several subclusters forming a lower level of the hierarchy In the latter clusters are formed by repeatedly grouping neighboring clusters in order to form

higher-level concepts

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 53: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

bull Discretization by Intuitive Partitioningbull Although the above discretization methods are useful

in the generation of numerical hierarchies many users would like to see numerical ranges partitioned into relatively uniform easy-to-read intervals that appear intuitive or ldquonaturalrdquo

bull If an interval covers 3 6 7 or 9 distinct values at the most significant digit then partition the range into 3 intervals (3 equal-width intervals for 3 6 and 9 and 3 intervals in the grouping of 2-3-2 for 7)

bull If it covers 2 4 or 8 distinct values at the most significant digit then partition the range into 4 equal-width intervals

bull If it covers 1 5 or 10 distinct values at the most significant digit then partition the range into 5 equal-width intervals

bull The rule can be recursively applied to each interval creating a concept hierarchy for the given numerical attribute Real-world data often contain extremely large positive andor negative outlier values which could distort any top-down discretization method based on minimum and maximum data values

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 54: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

A simply 3-4-5 rule can be used to segment numeric data into relatively uniform ldquonaturalrdquo intervals If an interval covers 3 6 7 or 9 distinct values

at the most significant digit partition the range into 3 equi-width intervals

If it covers 2 4 or 8 distinct values at the most significant digit partition the range into 4 intervals

If it covers 1 5 or 10 distinct values at the most significant digit partition the range into 5 intervals

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 55: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

Categorical data are discrete data Categorical attributes have a finite (but possibly large) number of distinct values with no ordering among the values Examples include geographic location job category and itemtype There are several methods for the generation of concept hierarchies for categorical data

Specification of a partial ordering of attributes explicitly at the schema level by users or Experts

Specification of a portion of a hierarchy by explicit data grouping

Specification of a set of attributes but not of their partial ordering

Specification of only a partial set of attributes

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 56: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

Specification of a partialtotal ordering of attributes explicitly at the schema level by users or experts street lt city lt state lt country

Specification of a hierarchy for a set of values by explicit data grouping Urbana Champaign Chicago lt Illinois

Specification of only a partial set of attributes Eg only street lt city not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values Eg for a set of attributes street city state

country

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 57: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

Some hierarchies can be automaatically generated based on the analysis of the number of distinct values per attribute in the data set The attribute with the most distinct values is

placed at the lowest level of the hierarchy Exceptions eg weekday month quarter year

country

province_or_ state

city

street

15 distinct values

365 distinct values

3567 distinct values

674339 distinct values

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary
Page 58: UNIT – 1 Data Preprocessing. Data Preprocessing Learning Objectives Understand why preprocess the data. Understand how to clean the data. Understand how.

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Discriptive data summarization is need for quality data preprocessing

Data preparation includes Data cleaning and data integration Data reduction and feature selection Discretization

A lot a methods have been developed but data preprocessing still an active area of research

  • Slide 1
  • Slide 2
  • Slide 3
  • Multi-Dimensional Measure of Data Quality
  • Major Tasks in Data Preprocessing
  • Figure Forms of data preprocessing
  • 2 Descriptive Data Summarization
  • Slide 8
  • 21 Measuring the Central Tendency
  • 22 Measuring the Dispersion of Data
  • Slide 11
  • Slide 12
  • Slide 13
  • Slide 14
  • Slide 15
  • Slide 16
  • Slide 17
  • Slide 18
  • Slide 19
  • Slide 20
  • Slide 21
  • Slide 22
  • Data Integration
  • Handling Redundant Data in Data Integration
  • Data Transformation
  • Data Transformation Normalization
  • Slide 27
  • 5 Data Reduction
  • Slide 29
  • 51 Data Cube Aggregation
  • 52 Attribute Subset Selection
  • Slide 32
  • Slide 33
  • Slide 34
  • 53 Dimensionality Reduction
  • Slide 36
  • Slide 37
  • Slide 38
  • 4 Numerosity Reduction
  • Data Reduction Method (1) Regression and Log-Linear Models
  • Regress Analysis and Log-Linear Models
  • Slide 42
  • Slide 43
  • Slide 44
  • Slide 45
  • Data Reduction Method (3) Clustering
  • Slide 47
  • Slide 48
  • Slide 49
  • Data Reduction Method (4) Sampling
  • Slide 51
  • Sampling Cluster or Stratified Sampling
  • Data Discretization and Concept Hierarchy Generation
  • Slide 54
  • Discretization
  • Discretization and Concept Hierarchy Generation for Numeric Data
  • Entropy-Based Discretization
  • Interval Merge by 2 Analysis
  • Cluster Analysis
  • Slide 60
  • Segmentation by Natural Partitioning
  • 62 Concept Hierarchy Generation for Categorical Data
  • Slide 63
  • Concept Hierarchy Generation for Categorical Data
  • Automatic Concept Hierarchy Generation
  • Summary