TRINITY INSTITUTE OF PROFESSIONAL STUDIES Sector – 9, Dwarka Institutional Area, New Delhi-75 Affiliated Institution of G.G.S.IP.U, Delhi BCA Data Warehouse & Data Mining 20302 Data Preprocessing
Data Preprocessing- Data Warehouse & Data Mining

Apr 11, 2017

Trinity Dwarka
Transcript
Page 1: Data Preprocessing- Data Warehouse & Data Mining

TRINITY INSTITUTE OF PROFESSIONAL STUDIES
Sector – 9, Dwarka Institutional Area, New Delhi-75
Affiliated Institution of G.G.S.IP.U, Delhi

BCA
Data Warehouse & Data Mining
20302

Data Preprocessing

Page 2: Data Preprocessing- Data Warehouse & Data Mining

Data Preprocessing

• Data Preprocessing: An Overview
  – Data Quality
  – Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation and Data Discretization
• Summary

Page 3: Data Preprocessing- Data Warehouse & Data Mining

Data Quality: Why Preprocess the Data?

Measures for data quality: a multidimensional view
◦ Accuracy: correct or wrong, accurate or not
◦ Completeness: not recorded, unavailable, …
◦ Consistency: some records modified but others not, dangling references, …
◦ Timeliness: are the data updated in a timely manner?
◦ Believability: how much are the data trusted to be correct?
◦ Interpretability: how easily can the data be understood?

Page 4: Data Preprocessing- Data Warehouse & Data Mining

Reasons for inaccurate data

• Data collection instruments may be faulty
• Human or computer errors may occur at data entry
• Users may purposely submit incorrect data for mandatory fields when they don’t want to share personal information
• Technology limitations, such as limited buffer size
• Incorrect data may also result from inconsistencies in naming conventions or inconsistent formats
• Duplicate tuples also require cleaning

Page 5: Data Preprocessing- Data Warehouse & Data Mining

Reasons for incomplete data

• Attributes of interest may not be available
• Other data may not be included because it was not considered important at the time of entry
• Relevant data may not be recorded due to misunderstanding or equipment malfunctions
• Inconsistent data may be deleted
• Data history or modifications may be overlooked
• Missing data may need to be inferred

Page 6: Data Preprocessing- Data Warehouse & Data Mining

Major Tasks in Data Preprocessing

• Data cleaning
  ◦ Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
  ◦ Integration of multiple databases, data cubes, or files
• Data reduction
  ◦ Dimensionality reduction
  ◦ Numerosity reduction
  ◦ Data compression
• Data transformation and data discretization
  ◦ Normalization
  ◦ Concept hierarchy generation

Page 7: Data Preprocessing- Data Warehouse & Data Mining

Forms of Data Preprocessing

Page 8: Data Preprocessing- Data Warehouse & Data Mining

Why Is Data Preprocessing Important?

• No quality data, no quality mining results!
  ◦ Quality decisions must be based on quality data, e.g., duplicate or missing data may cause incorrect or even misleading statistics.
• A data warehouse needs consistent integration of quality data
• Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse

Page 9: Data Preprocessing- Data Warehouse & Data Mining

Data Cleaning

Data in the real world is dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission error
◦ Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
   – e.g., Occupation = “ ” (missing data)
◦ Noisy: containing noise, errors, or outliers
   – e.g., Salary = “−10” (an error)
◦ Inconsistent: containing discrepancies in codes or names, e.g.,
   – Age = “42”, Birthday = “03/07/2010”
   – Was rating “1, 2, 3”, now rating “A, B, C”
   – Discrepancy between duplicate records
◦ Intentional (e.g., disguised missing data)
   – Jan. 1 as everyone’s birthday?
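The three kinds of dirty data listed above can be illustrated with a small check over one record. This is a minimal sketch, not a complete validator: the field names, the MM/DD/YYYY reading of the birthday, and the reference date passed as `as_of` are all assumptions made for the example.

```python
from datetime import date, datetime


def find_problems(record, as_of):
    """Return a list of data-quality problems found in one record (a dict)."""
    problems = []

    # Incomplete: an attribute value that is None, empty, or only whitespace.
    for field, value in record.items():
        if value is None or str(value).strip() == "":
            problems.append(f"incomplete: {field} is missing")

    # Noisy: a salary cannot be negative.
    salary = record.get("salary")
    if isinstance(salary, (int, float)) and salary < 0:
        problems.append("noisy: salary is negative")

    # Inconsistent: the stated age must match the age implied by the birthday.
    age = record.get("age")
    birthday = record.get("birthday")
    if isinstance(age, int) and isinstance(birthday, str) and birthday.strip():
        # Assumption: birthdays are written as MM/DD/YYYY.
        born = datetime.strptime(birthday, "%m/%d/%Y").date()
        had_birthday = (as_of.month, as_of.day) >= (born.month, born.day)
        implied = as_of.year - born.year - (0 if had_birthday else 1)
        if implied != age:
            problems.append(f"inconsistent: age {age} but birthday implies {implied}")

    return problems


# Hypothetical record combining the examples from the slide.
record = {"occupation": " ", "salary": -10, "age": 42, "birthday": "03/07/2010"}
problems = find_problems(record, as_of=date(2017, 4, 11))
for problem in problems:
    print(problem)
```

Disguised missing data (such as Jan. 1 as everyone’s birthday) cannot be caught by a per-record check like this one; it only shows up when the distribution of values across many records is examined.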

Page 10: Data Preprocessing- Data Warehouse & Data Mining

Incomplete (Missing) Data

• Data is not always available
  ◦ E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
• Missing data may be due to
  ◦ equipment malfunction
  ◦ data inconsistent with other recorded data and thus deleted
  ◦ data not entered due to misunderstanding
  ◦ certain data not considered important at the time of entry
  ◦ history or changes of the data not registered
• Missing data may need to be inferred

Page 11: Data Preprocessing- Data Warehouse & Data Mining

How to Handle Missing Data?

• Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the percentage of missing values per attribute varies considerably
• Fill in the missing value manually: tedious and often infeasible
• Fill it in automatically with
  ◦ a global constant, e.g., “unknown” (a new class?!)
  ◦ the attribute mean
  ◦ the attribute mean for all samples belonging to the same class: smarter
  ◦ the most probable value: inference-based, such as a Bayesian formula or decision tree
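The first three automatic fill strategies above can be sketched in a few lines. This is an illustration under stated assumptions: missing values are represented as `None`, and the attribute names `income` and `risk` and the sample rows are made up for the example. The inference-based strategy is not shown.

```python
from statistics import mean


def known(rows, attr):
    """Values of attr that are actually recorded (not None)."""
    return [r[attr] for r in rows if r[attr] is not None]


def fill_constant(rows, attr, constant="unknown"):
    """Replace every missing value of attr with one global constant."""
    return [{**r, attr: constant} if r[attr] is None else dict(r) for r in rows]


def fill_mean(rows, attr):
    """Replace every missing value of attr with the mean of the known values."""
    m = mean(known(rows, attr))
    return [{**r, attr: m} if r[attr] is None else dict(r) for r in rows]


def fill_class_mean(rows, attr, label):
    """Replace a missing value with the mean of attr within the same class."""
    class_means = {}
    for c in {r[label] for r in rows}:
        values = known([r for r in rows if r[label] == c], attr)
        if values:  # a class with no known values is left unfilled
            class_means[c] = mean(values)
    filled = []
    for r in rows:
        if r[attr] is None and r[label] in class_means:
            filled.append({**r, attr: class_means[r[label]]})
        else:
            filled.append(dict(r))
    return filled


# Hypothetical sales data: income in thousands, risk is the class label.
rows = [
    {"income": 30, "risk": "low"},
    {"income": 50, "risk": "low"},
    {"income": None, "risk": "low"},
    {"income": 90, "risk": "high"},
    {"income": None, "risk": "high"},
    {"income": 110, "risk": "high"},
]

by_constant = [r["income"] for r in fill_constant(rows, "income")]
by_mean = [r["income"] for r in fill_mean(rows, "income")]
by_class_mean = [r["income"] for r in fill_class_mean(rows, "income", "risk")]
print(by_constant)
print(by_mean)
print(by_class_mean)
```

With this data the overall mean pulls both missing incomes to the same value, while the class-conditional mean gives each missing value a figure typical of its own class, which is why the slide calls it smarter.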

Page 12: Data Preprocessing- Data Warehouse & Data Mining

Noisy Data

• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
  – faulty data collection instruments
  – data entry problems
  – data transmission problems
  – technology limitation
  – inconsistency in naming convention
• Other data problems which require data cleaning
  – duplicate records
  – incomplete data
  – inconsistent data

Page 13: Data Preprocessing- Data Warehouse & Data Mining

How to Handle Noisy Data?

• Binning
  ◦ first sort data and partition into (equal-frequency) bins
  ◦ then one can smooth by bin means, smooth by bin medians, smooth by bin boundaries, etc.
• Regression
  ◦ smooth by fitting the data to regression functions
• Outlier analysis by clustering
  ◦ detect and remove outliers
• Combined computer and human inspection
  ◦ detect suspicious values and have a human check them (e.g., deal with possible outliers)
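Smoothing by regression, mentioned in the list above, can be sketched with a simple least-squares line: each noisy value is replaced by the value the fitted line predicts. The choice of a straight line and the sample data are assumptions made for illustration; other regression functions could be used in the same way.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes the x values are not all identical
    intercept = mean_y - slope * mean_x
    return slope, intercept


def smooth_by_regression(xs, ys):
    """Replace each y with the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


# Hypothetical noisy measurements that roughly follow a straight line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(xs, ys)
smoothed = smooth_by_regression(xs, ys)
print(slope, intercept)
print(smoothed)
```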

Page 14: Data Preprocessing- Data Warehouse & Data Mining

Binning method to Smooth data

• Binning methods smooth a sorted data value by consulting its neighborhood, i.e., the values around it.
• The sorted values are distributed into a number of “buckets” or “bins”.
• Because binning methods consult the neighborhood of values, they perform local smoothing.

Page 15: Data Preprocessing- Data Warehouse & Data Mining

Binning Methods for Data Smoothing

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

* Partition into equal-frequency (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 15, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 34, 34

Page 16: Data Preprocessing- Data Warehouse & Data Mining

Data Cleaning as a Process

I. Data discrepancy detection

Discrepancies can be caused by the following factors:

• Poorly designed data entry forms that have many optional fields
• Human error in data entry
• Deliberate errors
• Data decay (e.g., outdated addresses)
• Inconsistent data representations
• Instrumental and device errors
• System errors

Page 17: Data Preprocessing- Data Warehouse & Data Mining

Data Cleaning as a Process

Data discrepancies can be detected by:
• Using metadata (e.g., domain, range, dependency, distribution)
• Checking for field overloading
• Checking the uniqueness rule, consecutive rule, and null rule
• Using commercial tools
  ◦ Data scrubbing tools: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections
  ◦ Data auditing tools: analyze data to discover rules and relationships and detect violators (e.g., correlation and clustering to find outliers)

II. Data migration and integration
• Data migration tools: allow transformations to be specified
• ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface

# Potter’s Wheel is a publicly available data cleaning tool that integrates discrepancy detection and transformation.
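The uniqueness, consecutive, and null rules used in discrepancy detection can each be sketched as a small check over one attribute's values. This is a minimal illustration: the function names, the set of null markers, and the sample data are hypothetical, and the consecutive rule is shown for integer values only.

```python
def duplicate_values(values):
    """Uniqueness rule: report values that occur more than once."""
    seen, duplicates = set(), set()
    for v in values:
        if v in seen:
            duplicates.add(v)
        seen.add(v)
    return sorted(duplicates)


def missing_in_sequence(values):
    """Consecutive rule: report integers missing between the lowest and highest value."""
    present = set(values)
    return [n for n in range(min(present), max(present) + 1) if n not in present]


# Assumption: these strings are the agreed ways of recording "no value".
NULL_MARKERS = {"", "?", "N/A", "null"}


def null_positions(values, markers=NULL_MARKERS):
    """Null rule: report positions whose value is None or a known null marker."""
    return [
        i for i, v in enumerate(values)
        if v is None or str(v).strip() in markers
    ]


# Hypothetical attribute values.
check_numbers = [1001, 1002, 1002, 1004, 1006]
phones = ["555-0100", "?", "", None, "N/A "]

duplicates = duplicate_values(check_numbers)
gaps = missing_in_sequence(check_numbers)
nulls = null_positions(phones)

print(duplicates)
print(gaps)
print(nulls)
```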

Page 18: Data Preprocessing- Data Warehouse & Data Mining

THANK YOU