Data Preprocessing

Data Preprocessing. An important issue for data warehousing and data mining real world data tend to be incomplete, noisy and inconsistent includes –data.

Dec 21, 2015

Transcript
Page 1: Data Preprocessing. An important issue for data warehousing and data mining real world data tend to be incomplete, noisy and inconsistent includes –data.

Data Preprocessing


Data Preprocessing

• An important issue for data warehousing and data mining

• Real-world data tend to be incomplete, noisy and inconsistent

• Includes
– data cleaning
– data integration
– data transformation
– data reduction


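Forms of Data Preprocessing

[Figure: the four forms of data preprocessing. Data cleaning; data integration; data reduction (a table with attributes A1 ... A126 and tuples T1 ... T2000 reduced to attributes A1 ... A115 and tuples T1, T4, ... T1456); data transformation (-2, 32, 100, 59, 48 becomes -0.02, 0.32, 1.00, 0.59, 0.48)]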


Data Preprocessing

• Data cleaning
– fill in missing values
– smooth noisy data
– identify outliers
– correct data inconsistency


Data Preprocessing

• Data integration
– combines data from multiple sources to form a coherent data store
– metadata, correlation analysis, data conflict detection and resolution of semantic heterogeneity contribute towards smooth data integration


Data Preprocessing

• Data transformation

– convert the data into appropriate forms for mining.

– E.g. attribute data may be normalized to fall within a small range such as 0.0 to 1.0


Data Preprocessing

• Data reduction

– data cube aggregation, dimension reduction, data compression, numerosity reduction and discretization.

– Used to obtain a reduced representation of the data while minimizing the loss of information content.


Data Preprocessing

• Automatic generation of concept hierarchies for numeric data
– binning, histogram analysis
– cluster analysis, entropy-based discretization
– segmentation by natural partitioning

• For categorical data, concept hierarchies may be generated based on the number of distinct values of the attributes defining the hierarchies.


Data Cleaning

• Handling data that are
– incomplete
– noisy
– inconsistent

It is an imperfect world


Data Cleaning: Missing Values

• Methods of filling in missing values
– Ignore the tuple
– Fill in the missing value manually
– Use a global constant
– Use the attribute mean
– Use the attribute mean for all samples belonging to the same class
– Use the most probable value
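
Two of the strategies listed above (the attribute mean, and the attribute mean within the same class) can be sketched in a few lines of Python. This is only an illustration: the `income` and `risk` values are made up, missing values are assumed to be encoded as `None`, and every class is assumed to have at least one observed value.

```python
from collections import defaultdict
from statistics import mean


def fill_with_mean(values):
    """Replace each None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    attribute_mean = mean(observed)
    return [attribute_mean if v is None else v for v in values]


def fill_with_class_mean(values, classes):
    """Replace each None with the mean of the observed values of its class."""
    by_class = defaultdict(list)
    for v, c in zip(values, classes):
        if v is not None:
            by_class[c].append(v)
    class_mean = {c: mean(vs) for c, vs in by_class.items()}
    return [class_mean[c] if v is None else v for v, c in zip(values, classes)]


# Hypothetical toy data: income (in thousands) and a class label per tuple.
income = [30, None, 50, 100, None, 120]
risk = ["low", "low", "low", "high", "high", "high"]

print(fill_with_mean(income))
print(fill_with_class_mean(income, risk))
```

The class-conditional version usually distorts the data less, because a missing value is filled from tuples that resemble it.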


Data Cleaning: Noisy Data

• Noise: random error or variance in a measured variable

• Smooth out the data to remove the noise


Data Cleaning: Noisy Data

• Data smoothing techniques
• Binning
– smooths a sorted data value by consulting its neighborhood
– the sorted values are distributed into a number of buckets or bins
• smoothing by bin means
• smoothing by bin medians
• smoothing by bin boundaries


Simple Discretization Methods: Binning

• Equal-width (distance) partitioning:
– Divides the range into N intervals of equal size: uniform grid
– If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B – A)/N
– The most straightforward, but outliers may dominate presentation
– Skewed data is not handled well

• Equal-depth (frequency) partitioning:
– Divides the range into N intervals, each containing approximately the same number of samples
– Good data scaling
– Managing categorical attributes can be tricky
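
The two partitioning schemes can be compared side by side with a short Python sketch. The function names are illustrative, the price list is the one used in the binning example of this deck, and the code assumes N > 0 and a non-zero value range.

```python
def equal_width_bins(values, n):
    """Split values into n intervals of equal width W = (B - A) / n."""
    a, b = min(values), max(values)
    width = (b - a) / n
    bins = [[] for _ in range(n)]
    for v in values:
        # The maximum value lies on the last edge, so clamp it into the last bin.
        index = min(int((v - a) / width), n - 1)
        bins[index].append(v)
    return bins


def equal_depth_bins(values, n):
    """Split sorted values into n bins with roughly the same number of samples."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n)
    bins, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(equal_width_bins(prices, 3))
print(equal_depth_bins(prices, 3))
```

With equal width, the bins hold 3, 3 and 6 values; with equal depth, each bin holds 4 values regardless of how the data are spread.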


Binning Methods for Data Smoothing

* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
 - Bin 1: 4, 8, 9, 15
 - Bin 2: 21, 21, 24, 25
 - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
 - Bin 1: 9, 9, 9, 9
 - Bin 2: 23, 23, 23, 23
 - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
 - Bin 1: 4, 4, 4, 15
 - Bin 2: 21, 21, 25, 25
 - Bin 3: 26, 26, 26, 34
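
The smoothing steps of this example can be reproduced with a small Python sketch. It assumes the bins are already formed, rounds bin means to the nearest integer as the slide does, and sends a value that is equally far from both boundaries to the lower one (a tie-breaking choice made here, not stated in the slide).

```python
def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) mean of that bin."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = min(b), max(b)
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```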


Cluster Analysis

– Clustering
• Outliers may be detected by clustering, where similar values are organized into groups or clusters.

– Combined computer and human inspection

– Regression


Regression

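[Figure: data plotted on axes x and y with the fitted regression line y = x + 1; the observed value Y1 at X1 is replaced by the value Y1’ on the line]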


Data Smoothing Techniques: Binning

• Example
– sorted data for price: 4, 8, 15, 21, 21, 24, 25, 28, 34
– Partition into equidepth bins

• Bin 1: 4, 8, 15

• Bin 2: 21, 21, 24

• Bin 3: 25, 28, 34


Data Smoothing Techniques: Binning

– smoothing by bin means
• Bin 1: 9, 9, 9
• Bin 2: 22, 22, 22
• Bin 3: 29, 29, 29

– smoothing by bin boundaries
• Bin 1: 4, 4, 15
• Bin 2: 21, 21, 24
• Bin 3: 25, 25, 34


Data Cleaning: Inconsistent Data

• Can be corrected manually using external references

• Source of inconsistency
– errors made at data entry, which can be corrected using a paper trace


Data Integration and Transformation

• Data integration
– combines data from multiple sources into a coherent data store, e.g. a data warehouse
– sources may include multiple databases, data cubes or flat files
– issues in data integration
• schema integration
• redundancy
• detection and resolution of data value conflicts

• Data transformation
– data are transformed or consolidated into forms appropriate for mining
– involves
• smoothing
• aggregation
• generalization
• normalization
• attribute construction


Data Integration

• Schema integration
– integrate metadata from different sources
– Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#

• Detecting and resolving data value conflicts
– for the same real-world entity, attribute values from different sources are different
– possible reasons: different representations, different scales, e.g., metric vs. British units


Data Integration

• Redundant data often occur when multiple databases are integrated

– The same attribute may have different names in different databases

– One attribute may be a “derived” attribute in another table, e.g., annual revenue

• Redundant data can often be detected by correlation analysis

• Careful integration of the data from multiple sources may help reduce or avoid redundancies and inconsistencies, and improve mining speed and quality
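
As a rough illustration of correlation analysis for redundancy detection, the sketch below computes the Pearson correlation coefficient between two numeric attributes. The attribute names, the toy values and the idea of flagging pairs above some threshold (0.95 is used here) are assumptions made for the example, not part of the slides.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y)


# Hypothetical attributes coming from two different source tables.
monthly_revenue = [10, 12, 9, 15, 20]
annual_revenue = [12 * m for m in monthly_revenue]  # a "derived" attribute
store_rating = [3, 1, 4, 1, 5]

THRESHOLD = 0.95  # assumed cut-off for calling two attributes redundant
print(abs(pearson(monthly_revenue, annual_revenue)) > THRESHOLD)  # redundant pair
print(abs(pearson(monthly_revenue, store_rating)) > THRESHOLD)    # unrelated pair
```

A coefficient near +1 or -1 suggests that one attribute can be derived from the other, so one of the two may be dropped after integration.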


Data Transformation

• Smoothing: remove noise from data

• Aggregation: summarization, data cube construction

• Generalization: concept hierarchy climbing

• Normalization: scaled to fall within a small, specified range

– min-max normalization

– z-score normalization

– normalization by decimal scaling

• Attribute/feature construction

– New attributes constructed from the given ones


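Data Transformation: Normalization

• min-max normalization

$$v' = \frac{v - \mathit{min}_A}{\mathit{max}_A - \mathit{min}_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A$$

• z-score normalization

$$v' = \frac{v - \mathit{mean}_A}{\mathit{stand\_dev}_A}$$

• normalization by decimal scaling

$$v' = \frac{v}{10^{j}}$$

where j is the smallest integer such that Max(|v'|) < 1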
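
The three normalization formulas translate almost directly into Python. The sketch below is illustrative only: the function names are made up, the z-score uses the population standard deviation (the slide does not say which variant is meant), and the inputs are assumed to be non-constant so that no denominator is zero.

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization into the range [new_min, new_max]."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) * (new_max - new_min) + new_min
            for v in values]


def z_score(values):
    """Z-score normalization using the population standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def decimal_scaling(values):
    """Divide by 10^j, with j the smallest integer giving max(|v'|) < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


print(min_max([10, 20, 30]))
print(z_score([2, 4, 4, 4, 5, 5, 7, 9]))
print(decimal_scaling([-986, 917]))
```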


Data Reduction

• To obtain a reduced representation of the data set that is

– much smaller in volume

– but closely maintains the integrity of the original data

– mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results


Data Reduction

[Diagram: data reduction strategies]
• Data cube aggregation
• Dimensionality reduction
• Data compression
• Numerosity reduction
• Discretization and concept hierarchy generation


Data Cube Aggregation

• The lowest level of a data cube

– the aggregated data for an individual entity of interest

– e.g., a customer in a phone calling data warehouse.

• Multiple levels of aggregation in data cubes

– Further reduce the size of data to deal with

• Reference appropriate levels

– Use the smallest representation which is enough to solve the task

• Queries regarding aggregated information should be answered using the data cube, when possible


Data Cube Aggregation

Year = 1997
Quarter   Sales
Q1        $224,000
Q2        $408,000
Q3        $350,000
Q4        $586,000

Year = 1998

Year = 1999

Year   Sales
1997   $1,568,000
1998   $2,356,000
1999   $3,594,000

Sales data for company AllElectronics for 1997 - 1999
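
The roll-up from quarterly to yearly sales shown in the tables can be sketched as a simple group-and-sum in Python. Only the 1997 quarterly figures appear in the slide, so only those are used; the tuple layout and the function name are illustrative.

```python
from collections import defaultdict

# (year, quarter, sales in dollars) taken from the 1997 table.
quarterly_sales = [
    (1997, "Q1", 224_000),
    (1997, "Q2", 408_000),
    (1997, "Q3", 350_000),
    (1997, "Q4", 586_000),
]


def roll_up_to_year(rows):
    """Aggregate quarterly sales to one total per year."""
    totals = defaultdict(int)
    for year, _quarter, sales in rows:
        totals[year] += sales
    return dict(totals)


print(roll_up_to_year(quarterly_sales))
```

Four rows per year collapse into one, and the 1997 total matches the yearly table.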


Dimensionality Reduction

[Figure: the role of dimension reduction in data mining. Stages shown: data preparation, standard form, dimension reduction, data subset, prediction methods, evaluation]


Dimensionality Reduction

– Data sets for analysis may contain hundreds of attributes, many of which may be irrelevant to the mining task or redundant

– Dimensionality reduction reduces the data set size by removing such attributes


Dimensionality Reduction

– How can we find a good subset of the original attributes?

– Attribute subset selection aims to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes.


Dimensionality Reduction

• Attribute subset selection techniques
– Forward selection
• start with an empty set of attributes
• the best of the original attributes is determined and added to the set
• at each subsequent iteration or step, the best of the remaining original attributes is added to the set
– Stepwise backward elimination
• starts with the full set of attributes
• at each step, it removes the worst attribute remaining in the set
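
Forward selection can be sketched as a greedy loop in Python. The slides do not say how "best" is measured or when to stop, so the sketch makes two assumptions: a caller-supplied `score` function rates a subset of attributes (higher is better), and the loop stops as soon as adding any remaining attribute no longer improves the score. The scoring function and its weights are entirely hypothetical.

```python
def forward_selection(attributes, score):
    """Greedily add the attribute that improves score(subset) the most."""
    selected = []
    best = score(selected)
    remaining = list(attributes)
    while remaining:
        candidate = max(remaining, key=lambda a: score(selected + [a]))
        candidate_score = score(selected + [candidate])
        if candidate_score <= best:  # assumed stopping rule: no improvement
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best = candidate_score
    return selected


# Hypothetical usefulness of each attribute; unlisted attributes add nothing.
USEFULNESS = {"A1": 0.3, "A4": 0.5, "A6": 0.2}


def toy_score(subset):
    """Toy score: total usefulness minus a small penalty per attribute."""
    return sum(USEFULNESS.get(a, 0.0) for a in subset) - 0.01 * len(subset)


print(forward_selection(["A1", "A2", "A3", "A4", "A5", "A6"], toy_score))
```

Backward elimination follows the same pattern in reverse: start from the full list and repeatedly drop the attribute whose removal hurts the score the least.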


Dimensionality Reduction

• Attribute subset selection techniques

– Combination of forward selection and backward elimination

• at each step, the procedure selects the best attribute and removes the worst from among the remaining attributes


Dimensionality Reduction

• Attribute subset selection techniques
– Decision tree induction

• ID3 and C4.5 were originally intended for classification

• they construct a flowchart-like structure where each internal (nonleaf) node denotes a test on an attribute

• each branch corresponds to an outcome of the test and each external (leaf) node denotes a class prediction

• at each node the algorithm chooses the best attribute to partition the data into individual classes


Example of Decision Tree Induction

Initial attribute set:{A1, A2, A3, A4, A5, A6}

A4?
├── A1?
│   ├── Class 1
│   └── Class 2
└── A6?
    ├── Class 1
    └── Class 2

=> Reduced attribute set: {A1, A4, A6}


Dimensionality Reduction

• Attribute subset selection techniques
– Reduct computation by rough set theory
– the attributes to select are identified using the concept of discernibility relations of classes in the data set
– Will be discussed in the next class.


Data Compression

• Apply data encoding or transformation to obtain a reduced or compressed representation of the original data

• lossless
– although typically lossless, they allow only limited manipulation of data

• lossy


Data Compression

• Two methods of lossy data compression
– Wavelet transforms
– Principal component analysis


Data Compression

• Wavelet transforms
– a linear signal processing technique that, when applied to a data vector D, transforms it to a numerically different vector D’ of wavelet coefficients
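
To make the idea of a vector of wavelet coefficients concrete, here is a minimal Python sketch of one simple wavelet transform, the Haar transform in its averaging and differencing form. The slides do not name a wavelet family, so the choice of Haar is an assumption, as are the function names and the sample vector; the input length is assumed to be a power of two.

```python
def haar_transform(data):
    """Haar wavelet transform: overall average followed by detail coefficients."""
    coefficients = []
    current = list(data)
    while len(current) > 1:
        pairs = range(0, len(current), 2)
        averages = [(current[i] + current[i + 1]) / 2 for i in pairs]
        details = [(current[i] - current[i + 1]) / 2 for i in pairs]
        coefficients = details + coefficients  # coarser details go in front
        current = averages
    return current + coefficients


def haar_inverse(coefficients):
    """Rebuild the original vector from the output of haar_transform."""
    current = coefficients[:1]
    position = 1
    while position < len(coefficients):
        details = coefficients[position:position + len(current)]
        expanded = []
        for average, detail in zip(current, details):
            expanded += [average + detail, average - detail]
        position += len(current)
        current = expanded
    return current


D = [2, 2, 0, 2, 3, 5, 4, 4]
D_prime = haar_transform(D)
print(D_prime)
print(haar_inverse(D_prime))
```

D’ has the same length as D. Compression comes from keeping only the largest coefficients and treating the rest as zero, which is what makes the method lossy.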


Data Compression

• Principal Component Analysis
– suppose the data to be compressed consist of N tuples from k dimensions
– PCA searches for c k-dimensional orthogonal vectors that can best be used to represent the data, where c ≤ k

– the original data are projected onto a much smaller space


Numerosity Reduction

• Numerosity reduction techniques can be applied to reduce the data volume by choosing alternative, smaller forms of data representation

• Techniques
– Regression and log-linear models

– Histograms

– Clustering

– Sampling
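
Two of the listed techniques, histograms and sampling, are easy to sketch in Python. The example below is an assumption-laden illustration: it reuses the price list from the binning slides, builds an equal-width histogram keyed by the lower edge of each bucket, and draws a simple random sample without replacement with a fixed seed for repeatability.

```python
import random
from collections import Counter


def equal_width_histogram(values, width):
    """Count values per bucket; each key is the lower edge of its bucket."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))


def simple_random_sample(values, k, seed=0):
    """Draw k values without replacement (seeded so the run is repeatable)."""
    rng = random.Random(seed)
    return rng.sample(values, k)


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
histogram = equal_width_histogram(prices, 10)
sample = simple_random_sample(prices, 4)
print(histogram)
print(sample)
```

In both cases a much smaller object (four bucket counts, or four sampled values) stands in for the twelve original values.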


Discretization

• Three types of attributes:
– Nominal: values from an unordered set
– Ordinal: values from an ordered set
– Continuous: real numbers

• Discretization:
– divide the range of a continuous attribute into intervals
– some classification algorithms only accept categorical attributes
– reduce data size by discretization
– prepare for further analysis


Discretization and Concept Hierarchy

• Discretization
– reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values

• Concept hierarchies
– reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) by higher-level concepts (such as young, middle-aged, or senior)


Discretization

• Example:
– Manual discretization of AUS data set


Discretization and Concept Hierarchy Generation for Numeric Data

• Binning (see sections before)

• Histogram analysis (see sections before)

• Clustering analysis (see sections before)

• Entropy-based discretization

• Segmentation by natural partitioning


Entropy-Based Discretization

• Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is

$$E(S, T) = \frac{|S_1|}{|S|}\,Ent(S_1) + \frac{|S_2|}{|S|}\,Ent(S_2)$$

• The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.
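
A single binary split of this kind can be sketched in Python as follows. The sketch assumes that Ent is the usual class entropy in bits, that candidate boundaries are the midpoints between consecutive distinct values, and that samples with value ≤ T go to S1; the function names and toy data are illustrative.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Class entropy Ent(S) in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def partition_entropy(values, labels, boundary):
    """E(S, T): weighted entropy of the two intervals split at boundary T."""
    s1 = [l for v, l in zip(values, labels) if v <= boundary]
    s2 = [l for v, l in zip(values, labels) if v > boundary]
    n = len(labels)
    return len(s1) / n * entropy(s1) + len(s2) / n * entropy(s2)


def best_boundary(values, labels):
    """Boundary T that minimizes E(S, T) over all candidate midpoints."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return min(candidates, key=lambda t: partition_entropy(values, labels, t))


values = [1, 2, 3, 10, 11, 12]
labels = ["a", "a", "a", "b", "b", "b"]
print(best_boundary(values, labels))
```

Applying `best_boundary` again to each resulting interval gives the recursive procedure; the stopping criterion is left to the caller.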


Entropy-Based Discretization

• The process is recursively applied to the partitions obtained until some stopping criterion is met, e.g. a threshold on the information gain $Ent(S) - E(S, T)$

• Experiments show that it may reduce data size and improve classification accuracy


Segmentation by Natural Partitioning

• A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, “natural” intervals.

– If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals

– If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals

– If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals (see fig 3.16, pg 137)
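
One level of the 3-4-5 rule can be sketched in Python as below. Several things are assumptions of the sketch rather than statements from the slide: `low` and `high` are taken to be already rounded to multiples of `unit` (the value of the most significant digit position), the number of distinct values is computed as `(high - low) // unit`, and a count of 7 is split into a 2-3-2 grouping, which is the usual textbook convention since 7 cannot be divided into three equal widths.

```python
def natural_partition(low, high, unit):
    """Split [low, high] into 3, 4 or 5 'natural' intervals (one level only)."""
    count = (high - low) // unit  # distinct values at the most significant digit
    if count == 7:
        widths = [2, 3, 2]            # assumed 2-3-2 grouping for 7 values
    elif count in (3, 6, 9):
        widths = [count // 3] * 3
    elif count in (2, 4, 8):
        widths = [count / 4] * 4
    elif count in (1, 5, 10):
        widths = [count / 5] * 5
    else:
        raise ValueError("range does not fit the 3-4-5 rule")
    edges = [low]
    for w in widths:
        edges.append(edges[-1] + w * unit)
    return list(zip(edges, edges[1:]))


# Hypothetical range, already rounded to the nearest million.
print(natural_partition(-1_000_000, 2_000_000, 1_000_000))
```

Each resulting interval can be partitioned again with the same rule to build deeper levels of a hierarchy.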


Concept Hierarchy Generation

• Many techniques can be applied recursively in order to provide a hierarchical partitioning of the attribute, i.e. a concept hierarchy

• Concept hierarchies are useful for mining at multiple levels of abstraction


Concept Hierarchy Generation for Categorical Data

• Specification of a partial ordering of attributes explicitly at the schema level by users or experts

– street < city < state < country

• Specification of a portion of a hierarchy by explicit data grouping

– {Urbana, Champaign, Chicago} < Illinois

• Specification of a set of attributes.

– System automatically generates partial ordering by analysis of the number of distinct values

– E.g., street < city < state < country

• Specification of only a partial set of attributes

– E.g., only street < city, not others


Automatic Concept Hierarchy Generation

• Some concept hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the given data set
– The attribute with the most distinct values is placed at the lowest level of the hierarchy
– Note: exceptions include weekday, month, quarter, year

country: 15 distinct values
province_or_state: 365 distinct values
city: 3567 distinct values
street: 674,339 distinct values
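
The heuristic amounts to sorting the attributes by their number of distinct values. A minimal Python sketch, using the counts from the figure (the function name and the dictionary layout are illustrative):

```python
def generate_hierarchy(distinct_counts):
    """Order attributes from the top level (fewest distinct values) downward."""
    return sorted(distinct_counts, key=distinct_counts.get)


counts = {
    "street": 674_339,
    "country": 15,
    "city": 3567,
    "province_or_state": 365,
}
hierarchy = generate_hierarchy(counts)
print(" < ".join(reversed(hierarchy)))
```

As the note on exceptions says, the heuristic fails for attributes such as weekday, month, quarter and year, where a higher level (year) may have more distinct values than a lower one (weekday).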


Discretization and Concept Hierarchy Generation

• Manual discretization
– The information needed to convert continuous values into discrete values is obtained from an expert in the domain area
– Example (refer to the UCI machine learning data banks)


Data Discretization

Symbol h02 h03 h11 h12 h13 h21 h22 h30 h31

0.86711 0.18849 0.08184 0.16839 0.12728 0.01923 0.24873 0.12638 0.04125

  0.54536 0.02198 0.02583 0.0241 0.01231 0.01844 0.1193 0.00087 0.00535

0.58806 0.05518 0.08122 0.00895 0.07504 0.01626 0.18318 0.03664 0.05776

  0.61814 0.00880 0.05408 0.01927 0.05894 0.00178 0.07934 0.01363 0.02165

0.88477 0.14812 0.01660 0.13137 0.06236 0.02861 0.21195 0.04551 0.00528

  0.80491 0.05006 0.03593 0.01596 0.04019 0.00195 0.12116 0.01324 0.01841

0.73293 0.05052 0.16291 0.05135 0.11263 0.02107 0.1385 0.00799 0.07375

  0.66253 0.08034 0.03918 0.01415 0.10883 0.01978 0.11662 0.0049 0.01161

0.91948 0.02059 0.01081 0.06653 0.00924 0.01543 0.15602 0.00388 0.00697

  0.82281 0.06182 0.02135 0.03221 0.03237 0.01006 0.12365 0.00398 0.00606

2.213 0.71402 0.059 0.22918 0.00903 0.01181 0.63556 0.05279 0.08960

  2.15402 0.18761 0.08548 0.33771 0.81689 0.11741 0.70659 0.03468 0.13071

0.15565 0.00002 0.00662 0.00547 0.00182 0.00775 0.03896 0.02263 0.00017

  0.16081 0.01299 0.01091 0.00812 0.00205 0.01267 0.04902 0.04908 0.01069

Table 5: The invariance features for mathematical symbols


Data Discretization

Table 6: Discretization of the mathematical symbols

Orientation h02 h03 h11 h12 h13 h21 h22 h30 h31 Results

Orientation #1 1 2 1 2 2 2 2 1 2

Orientation #2 0 1 0 1 1 1 1 0 0

Orientation #1 0 1 1 0 2 1 2 1 2

Orientation #2 0 0 1 1 1 0 0 1 1

Orientation #1 2 2 0 2 1 2 2 1 0

Orientation #2 1 1 0 1 1 0 1 1 1

Orientation #1 0 1 1 1 2 2 1 0 2

Orientation #2 0 2 1 0 2 2 0 0 1

Orientation #1 2 0 0 2 0 1 1 0 1

Orientation #2 1 1 0 1 1 0 1 0 0

Orientation #1 2 2 1 2 0 1 2 1 2

Orientation #2 2 2 1 2 2 2 2 1 2

Orientation #1 0 0 0 0 0 0 0 1 0

Orientation #2 0 0 0 0 0 1 0 1 1


Summary

• Data preparation is a big issue for both warehousing and mining

• Data preparation includes

– Data cleaning and data integration

– Data reduction and feature selection

– Discretization

• A lot of methods have been developed, but this is still an active area of research