Data Preprocessing Presented by P.Veeralakshmi M.C.A
Page 1: Dmblog

Data Preprocessing

Presented by P.Veeralakshmi

M.C.A

Page 2: Dmblog

Why Data Preprocessing?
• Data in the real world is dirty
• Incomplete: incomplete data may come from human/hardware/software problems, e.g., occupation=“ ”
• Noisy: noisy data may come from faulty data collection instruments, e.g., Salary=“-10”

Page 3: Dmblog

Cont….

• Inconsistent: Functional dependency violation e.g., Age=“42” Birthday=“03/07/1997”

Page 4: Dmblog

Major Tasks in Data Preprocessing
• Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration: integration of multiple databases, data cubes, or files
• Data transformation: normalization and aggregation
• Data reduction: obtains a reduced representation in volume but produces the same or similar analytical results
• Data discretization: part of data reduction but with particular importance, especially for numerical data

Page 5: Dmblog

Descriptive Data Summarization

• Descriptive data summarization techniques can be used to identify which data values should be treated as noise or outliers.

• Measures of central tendency include:
1. mean
2. median
3. mode
4. midrange
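The four measures listed above can be computed with Python's standard library. This is a minimal sketch; the function name `central_tendency` and the sample values are illustrative assumptions, not part of the original slides.

```python
from statistics import mean, median, multimode


def central_tendency(values):
    """Return mean, median, mode(s), and midrange for a list of numbers."""
    return {
        "mean": mean(values),
        "median": median(values),
        # multimode returns a list because data may have more than one mode
        "mode": multimode(values),
        # midrange is the average of the smallest and largest values
        "midrange": (min(values) + max(values)) / 2,
    }


# Illustrative sample values (assumed for this sketch)
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
summary = central_tendency(salaries)
print(summary)
```

Note how the single large value (110) pulls the midrange well above the median, which is one hint that it may deserve a closer look as a possible outlier.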

Page 6: Dmblog

Graphic Displays of Basic Descriptive Data Summaries

• Aside from the bar charts, pie charts, and line graphs used in most statistical or graphical data presentation, other popular displays include:
• Histograms
• Quantile plots
• q-q plots
• Scatter plots
• Loess curves

Page 7: Dmblog

HISTOGRAM

Page 8: Dmblog

Quantile plot

Page 9: Dmblog

Q-Q plot

Page 10: Dmblog

Loess Curve

Page 11: Dmblog

Data Cleaning

• Data cleaning (or data cleansing) routines attempt to:
• fill in missing values
• smooth out noise
• identify outliers
• correct inconsistencies

Page 12: Dmblog

Missing Values
1. Ignore the tuple
2. Fill in the missing value manually
3. Use a global constant to fill in the missing value
4. Use the attribute mean to fill in the missing value
5. Use the attribute mean for all samples belonging to the same class as the given tuple
6. Use the most probable value to fill in the missing value
Method 6 is a popular strategy.
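Methods 4 and 5 can be sketched as follows. This is only an illustration: the record layout (a list of dicts with `None` marking a missing value), the attribute names `income` and `risk`, and both function names are assumptions made for the example.

```python
from statistics import mean


def fill_with_mean(rows, attr):
    """Method 4: replace a missing value with the mean of the known values."""
    known = [r[attr] for r in rows if r[attr] is not None]
    fill = mean(known)
    return [{**r, attr: fill if r[attr] is None else r[attr]} for r in rows]


def fill_with_class_mean(rows, attr, label):
    """Method 5: replace a missing value with the mean of its own class.

    Assumes every class has at least one known value for `attr`.
    """
    by_class = {}
    for r in rows:
        if r[attr] is not None:
            by_class.setdefault(r[label], []).append(r[attr])
    class_means = {c: mean(v) for c, v in by_class.items()}
    return [
        {**r, attr: class_means[r[label]] if r[attr] is None else r[attr]}
        for r in rows
    ]


# Hypothetical customer tuples; None marks a missing income
customers = [
    {"income": 40, "risk": "low"},
    {"income": 60, "risk": "low"},
    {"income": None, "risk": "low"},
    {"income": 20, "risk": "high"},
    {"income": None, "risk": "high"},
]

by_mean = fill_with_mean(customers, "income")
by_class_mean = fill_with_class_mean(customers, "income", "risk")
print(by_mean[2]["income"], by_class_mean[2]["income"], by_class_mean[4]["income"])
```

The class-based fill gives different values to the two missing incomes because it uses the class label, whereas the plain attribute mean assigns the same value to both.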

Page 13: Dmblog

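Noisy Data
1. Binning:
• The sorted values are distributed into a number of “buckets,” or bins, e.g., Bin = 4, 8, 15
• Smoothing by bin means: each value is replaced by the bin mean, Bin = 9, 9, 9
• Smoothing by bin boundaries: each value is replaced by the closest bin boundary, Bin = 4, 4, 15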
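The bin example above (4, 8, 15) can be reproduced with a short sketch. The function names are illustrative, equal-frequency partitioning is assumed for building the bins, and ties between the two boundaries are resolved toward the lower boundary (also an assumption).

```python
def equal_frequency_bins(values, size):
    """Sort the values and split them into consecutive bins of `size` items."""
    ordered = sorted(values)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def smooth_by_means(bin_values):
    """Replace every value in the bin with the bin mean."""
    bin_mean = sum(bin_values) / len(bin_values)
    return [bin_mean] * len(bin_values)


def smooth_by_boundaries(bin_values):
    """Replace every value with the closer of the bin's minimum and maximum."""
    lo, hi = min(bin_values), max(bin_values)
    # Ties go to the lower boundary (an assumption of this sketch)
    return [lo if v - lo <= hi - v else hi for v in bin_values]


bins = equal_frequency_bins([8, 15, 4, 21, 28, 21, 24, 25, 34], 3)
first_bin = bins[0]
print(first_bin, smooth_by_means(first_bin), smooth_by_boundaries(first_bin))
```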

Page 14: Dmblog

2. Regression

• Data can be smoothed by fitting the data to a function.

• Linear regression involves finding the “best” line to fit two attributes, so that one attribute can be used to predict the other.

• Multiple linear regression is an extension of linear regression in which more than two attributes are involved.
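A minimal sketch of smoothing by simple linear regression, assuming the “best” line means the ordinary least-squares fit. The function name `fit_line` and the sample data are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for two attributes."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up noisy observations of two attributes
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
# Smoothed values: replace each observed y with the value predicted by the line
smoothed = [slope * x + intercept for x in xs]
print(slope, intercept, smoothed)
```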

Page 15: Dmblog

3. Clustering

• Outliers may be detected by clustering, where similar values are organized into groups, or “clusters.”

• The values that fall outside of the set of clusters may be considered outliers
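The idea can be illustrated on a single numeric attribute. The sketch below uses a deliberately simple stand-in for a real clustering algorithm: sorted values are grouped whenever neighbours are within `max_gap` of each other, and groups smaller than `min_size` are reported as outliers. Both parameters, the function names, and the data are assumptions made for this example.

```python
def gap_clusters(values, max_gap):
    """Group sorted values; a gap larger than `max_gap` starts a new cluster."""
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters


def find_outliers(values, max_gap, min_size):
    """Values in clusters with fewer than `min_size` members are outliers."""
    return [
        v
        for cluster in gap_clusters(values, max_gap)
        if len(cluster) < min_size
        for v in cluster
    ]


ages = [21, 22, 23, 25, 41, 42, 44, 45, 97]
print(gap_clusters(ages, max_gap=3))
print(find_outliers(ages, max_gap=3, min_size=2))
```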


Page 16: Dmblog

SOME RULES

• The data should also be examined regarding:
• unique rules - each value of the given attribute must be different from all other values of that attribute
• consecutive rules - no missing values between the lowest and highest values
• null rules - a null rule specifies the use of blanks, question marks, or special characters to indicate the null condition
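The three rules above can be checked mechanically. In this sketch the set of null markers, the function names, and the sample values are all assumptions; a real data set would define its own null rule.

```python
# Hypothetical null rule: these strings (after trimming spaces) mean "no value"
NULL_MARKERS = {"", "?", "N/A"}


def unique_violations(values):
    """Unique rule: return the values that occur more than once."""
    seen, duplicates = set(), set()
    for v in values:
        if v in seen:
            duplicates.add(v)
        seen.add(v)
    return sorted(duplicates)


def consecutive_violations(values):
    """Consecutive rule: return integers missing between the min and max."""
    present = set(values)
    return [n for n in range(min(present), max(present) + 1) if n not in present]


def null_positions(values):
    """Null rule: return the positions whose value indicates the null condition."""
    return [
        i
        for i, v in enumerate(values)
        if v is None or str(v).strip() in NULL_MARKERS
    ]


print(unique_violations([101, 102, 102, 104]))
print(consecutive_violations([101, 102, 104]))
print(null_positions(["clerk", " ", "?", "teacher", None]))
```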

Page 17: Dmblog

Data Integration

• Data integration combines data from multiple sources into a coherent data store.

• Data integration techniques:
• Schema integration
• Redundancy detection
• Correlation analysis
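For numeric attributes, correlation analysis is commonly done with the Pearson correlation coefficient; a value close to +1 or -1 suggests that one attribute may be redundant given the other. The sketch below assumes that measure, and the attribute names and values are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two numeric attributes."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical attributes from two sources: one is derivable from the other
annual_income = [24000, 36000, 48000, 60000]
monthly_income = [2000, 3000, 4000, 5000]

r = pearson(annual_income, monthly_income)
print(r)  # a value near 1 flags monthly_income as a redundancy candidate
```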

Page 18: Dmblog

Data Transformation
• In data transformation, the data are transformed or consolidated into forms appropriate for mining.

• Data transformation can involve the following:
• Smoothing - to remove noise from the data.
• Aggregation - summary or aggregation operations are applied to the data.
• Ex: the daily sales data may be aggregated so as to compute monthly and annual total amounts.
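The sales example can be sketched as follows, assuming daily sales arrive as `(date, amount)` pairs. The function name and the figures are illustrative.

```python
from collections import defaultdict
from datetime import date


def aggregate_sales(daily):
    """Roll daily (date, amount) pairs up into monthly and annual totals."""
    monthly = defaultdict(float)
    annual = defaultdict(float)
    for day, amount in daily:
        monthly[(day.year, day.month)] += amount
        annual[day.year] += amount
    return dict(monthly), dict(annual)


# Made-up daily sales
daily_sales = [
    (date(2023, 1, 5), 100.0),
    (date(2023, 1, 20), 150.0),
    (date(2023, 2, 3), 200.0),
    (date(2024, 1, 10), 50.0),
]

monthly_totals, annual_totals = aggregate_sales(daily_sales)
print(monthly_totals)
print(annual_totals)
```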

Page 19: Dmblog

Cont….

• Generalization - low-level or “primitive” (raw) data are replaced by higher-level concepts through the use of concept hierarchies.

• Normalization - the attribute data are scaled so as to fall within a small specified range, such as -1.0 to 1.0, or 0.0 to 1.0.

• Attribute construction - new attributes are constructed and added from the given set of attributes.
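One common way to carry out the normalization step is min-max scaling, sketched below. The slides do not name a specific method, so treat the choice of min-max scaling, the function name, and the sample values as assumptions.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they fall within [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalize a constant attribute")
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


incomes = [12000, 73600, 98000]  # illustrative values
unit_range = min_max_normalize(incomes)                        # 0.0 to 1.0
signed_range = min_max_normalize([0, 5, 10], -1.0, 1.0)        # -1.0 to 1.0
print(unit_range, signed_range)
```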

Page 20: Dmblog

Data Reduction
• Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume.
1. Data cube aggregation, where aggregation operations are applied to the data in the construction of a data cube.
2. Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed.

Page 21: Dmblog

Cont…
3. Dimensionality reduction, where encoding mechanisms are used to reduce the data set size.
4. Numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations such as:
• clustering
• sampling
• histograms
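Of the numerosity reduction options, sampling is the simplest to illustrate. The sketch below draws a simple random sample without replacement; the function name, the `seed` parameter, and the data are assumptions made for the example.

```python
import random


def simple_random_sample(rows, n, seed=None):
    """Draw n distinct rows at random (sampling without replacement)."""
    if n > len(rows):
        raise ValueError("sample size exceeds data set size")
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    return rng.sample(rows, n)


transactions = list(range(1, 101))  # stand-in for 100 tuples
sample = simple_random_sample(transactions, 10, seed=42)
print(sample)
```

The reduced data set is a tenth of the original size; whether it yields similar analytical results depends on how representative the sample is.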

Page 22: Dmblog

Data Discretization
• Data discretization techniques can be used to reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals.
• Binning
• Histogram Analysis
• Entropy-Based Discretization
• Interval Merging by χ² Analysis
• Cluster Analysis
• Discretization by Intuitive Partitioning
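Binning is the easiest of these techniques to sketch. The example below assumes equal-width intervals and replaces each continuous value with the index of its interval; the function name and the sample ages are illustrative.

```python
def equal_width_discretize(values, k):
    """Map each value to the index (0..k-1) of its equal-width interval."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot discretize a constant attribute")
    width = (hi - lo) / k
    labels = []
    for v in values:
        # The maximum value would land in interval k, so clamp it to k - 1
        labels.append(min(int((v - lo) / width), k - 1))
    return labels


ages = [13, 15, 22, 25, 33, 35, 46, 52, 70]
labels = equal_width_discretize(ages, 3)
print(labels)
```

Nine distinct ages are reduced to three interval labels, which is the reduction in the number of values that discretization aims for.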

Page 23: Dmblog

THANK YOU