06 Data Mining-Data Preprocessing-overview

HAN 10-ch03-083-124-9780123814791 2011/6/1 3:16 Page 84 #2

84 Chapter 3 Data Preprocessing

3.1 Data Preprocessing: An Overview

This section presents an overview of data preprocessing. Section 3.1.1 illustrates the many elements defining data quality. This provides the incentive behind data preprocessing. Section 3.1.2 outlines the major tasks in data preprocessing.

3.1.1 Data Quality: Why Preprocess the Data?

Data have quality if they satisfy the requirements of the intended use. There are many factors comprising data quality, including accuracy, completeness, consistency, timeliness, believability, and interpretability.

Imagine that you are a manager at AllElectronics and have been charged with analyzing the company's data with respect to your branch's sales. You immediately set out to perform this task. You carefully inspect the company's database and data warehouse, identifying and selecting the attributes or dimensions (e.g., item, price, and units sold) to be included in your analysis. Alas! You notice that several of the attributes for various tuples have no recorded value. For your analysis, you would like to include information as to whether each item purchased was advertised as on sale, yet you discover that this information has not been recorded. Furthermore, users of your database system have reported errors, unusual values, and inconsistencies in the data recorded for some transactions. In other words, the data you wish to analyze by data mining techniques are incomplete (lacking attribute values or certain attributes of interest, or containing only aggregate data); inaccurate or noisy (containing errors, or values that deviate from the expected); and inconsistent (e.g., containing discrepancies in the department codes used to categorize items). Welcome to the real world!

This scenario illustrates three of the elements defining data quality: accuracy, completeness, and consistency. Inaccurate, incomplete, and inconsistent data are commonplace properties of large real-world databases and data warehouses. There are many possible reasons for inaccurate data (i.e., having incorrect attribute values). The data collection instruments used may be faulty. There may have been human or computer errors occurring at data entry. Users may purposely submit incorrect data values for mandatory fields when they do not wish to submit personal information (e.g., by choosing the default value "January 1" displayed for birthday). This is known as disguised missing data. Errors in data transmission can also occur. There may be technology limitations such as limited buffer size for coordinating synchronized data transfer and consumption. Incorrect data may also result from inconsistencies in naming conventions or data codes, or inconsistent formats for input fields (e.g., date). Duplicate tuples also require data cleaning.
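One way to look for disguised missing data is to check whether a single value occurs far more often than chance would suggest. The following is only a minimal sketch under stated assumptions: the function name `flag_frequent_values`, the 25% threshold, and the sample birthdays are all hypothetical, and a value flagged this way still needs human review before it is treated as missing.

```python
from collections import Counter
from datetime import date


def flag_frequent_values(values, threshold=0.25):
    """Return {value: share} for values whose share of records exceeds threshold.

    The threshold is an illustrative assumption, not a recommended setting.
    """
    counts = Counter(values)
    total = len(values)
    return {v: n / total for v, n in counts.items() if n / total > threshold}


# Hypothetical birthday column; half of the entries use the form's default date.
birthdays = [
    date(1980, 1, 1), date(1975, 6, 14), date(1990, 1, 1), date(1988, 1, 1),
    date(1969, 11, 3), date(1992, 3, 27), date(1985, 1, 1), date(1979, 8, 9),
]

# Compare on (month, day) only, since the default value ignores the year.
month_day = [(d.month, d.day) for d in birthdays]
suspects = flag_frequent_values(month_day)
print(suspects)
```

Here (1, 1) accounts for 4 of the 8 records, so it is the only value reported.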

Incomplete data can occur for a number of reasons. Attributes of interest may not always be available, such as customer information for sales transaction data. Other data may not be included simply because they were not considered important at the time of entry. Relevant data may not be recorded due to a misunderstanding or because of equipment malfunctions. Data that were inconsistent with other recorded data may have been deleted. Furthermore, the recording of the data history or modifications may have been overlooked. Missing data, particularly for tuples with missing values for some attributes, may need to be inferred.

Recall that data quality depends on the intended use of the data. Two different users may have very different assessments of the quality of a given database. For example, a marketing analyst may need to access the database mentioned before for a list of customer addresses. Some of the addresses are outdated or incorrect, yet overall, 80% of the addresses are accurate. The marketing analyst considers this to be a large customer database for target marketing purposes and is pleased with the database's accuracy, although, as sales manager, you found the data inaccurate.

Timeliness also affects data quality. Suppose that you are overseeing the distribution of monthly sales bonuses to the top sales representatives at AllElectronics. Several sales representatives, however, fail to submit their sales records on time at the end of the month. There are also a number of corrections and adjustments that flow in after the month's end. For a period of time following each month, the data stored in the database are incomplete. However, once all of the data are received, they are correct. The fact that the month-end data are not updated in a timely fashion has a negative impact on the data quality.

Two other factors affecting data quality are believability and interpretability. Believability reflects how much the data are trusted by users, while interpretability reflects how easily the data are understood. Suppose that a database, at one point, had several errors, all of which have since been corrected. The past errors, however, had caused many problems for sales department users, and so they no longer trust the data. The data also use many accounting codes, which the sales department does not know how to interpret. Even though the database is now accurate, complete, consistent, and timely, sales department users may regard it as of low quality due to poor believability and interpretability.

3.1.2 Major Tasks in Data Preprocessing

In this section, we look at the major steps involved in data preprocessing, namely, data cleaning, data integration, data reduction, and data transformation.

Data cleaning routines work to "clean" the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies. If users believe the data are dirty, they are unlikely to trust the results of any data mining that has been applied. Furthermore, dirty data can cause confusion for the mining procedure, resulting in unreliable output. Although most mining routines have some procedures for dealing with incomplete or noisy data, they are not always robust. Instead, they may concentrate on avoiding overfitting the data to the function being modeled. Therefore, a useful preprocessing step is to run your data through some data cleaning routines. Section 3.2 discusses methods for data cleaning.
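As a small illustration of one cleaning step, filling in missing values, the sketch below replaces missing entries of a numeric attribute with the mean (or median) of the recorded values. It is a sketch under assumptions: missing values are represented as `None`, and the function name `fill_missing` and the price data are hypothetical. Using a central value is only one of several possible strategies.

```python
from statistics import mean, median


def fill_missing(values, strategy="mean"):
    """Replace None entries with the mean or median of the recorded values."""
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("cannot fill an attribute that has no recorded values")
    if strategy == "mean":
        fill = mean(known)
    elif strategy == "median":
        fill = median(known)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


# Hypothetical price attribute with two unrecorded values.
prices = [10.0, None, 12.0, 14.0, None]
filled = fill_missing(prices)
print(filled)
```

The recorded values 10.0, 12.0, and 14.0 have mean 12.0, so both gaps are filled with 12.0.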

Getting back to your task at AllElectronics, suppose that you would like to include data from multiple sources in your analysis. This would involve integrating multiple databases, data cubes, or files (i.e., data integration). Yet some attributes representing a given concept may have different names in different databases, causing inconsistencies and redundancies. For example, the attribute for customer identification may be referred to as customer_id in one data store and cust_id in another. Naming inconsistencies may also occur for attribute values. For example, the same first name could be registered as "Bill" in one database, "William" in another, and "B." in a third. Furthermore, you suspect that some attributes may be inferred from others (e.g., annual revenue). Having a large amount of redundant data may slow down or confuse the knowledge discovery process. Clearly, in addition to data cleaning, steps must be taken to help avoid redundancies during data integration. Typically, data cleaning and data integration are performed as a preprocessing step when preparing data for a data warehouse. Additional data cleaning can be performed to detect and remove redundancies that may have resulted from data integration.
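The naming problems described above can be sketched in a few lines. This is an illustration only: the lookup tables `ATTRIBUTE_ALIASES` and `VALUE_ALIASES`, the attribute spellings `customer_id` and `cust_id`, the `first_name` and `city` fields, and the sample records are all assumptions, and in practice such mappings would have to come from metadata or domain knowledge.

```python
# Hypothetical lookup tables mapping source-specific names to one canonical form.
ATTRIBUTE_ALIASES = {"cust_id": "customer_id"}
VALUE_ALIASES = {"bill": "William", "b.": "William", "william": "William"}


def normalize_record(record):
    """Rename aliased attributes and canonicalize the first-name value."""
    out = {ATTRIBUTE_ALIASES.get(key, key): value for key, value in record.items()}
    name = out.get("first_name")
    if isinstance(name, str):
        out["first_name"] = VALUE_ALIASES.get(name.strip().lower(), name)
    return out


def integrate(*sources):
    """Merge records from several sources, keyed on the canonical customer_id."""
    merged = {}
    for source in sources:
        for record in source:
            rec = normalize_record(record)
            merged.setdefault(rec["customer_id"], {}).update(rec)
    return list(merged.values())


# Three hypothetical data stores describing overlapping customers.
store_a = [{"customer_id": 1, "first_name": "Bill"}]
store_b = [{"cust_id": 1, "first_name": "William", "city": "Urbana"},
           {"cust_id": 2, "first_name": "Ana"}]
store_c = [{"cust_id": 1, "first_name": "B."}]

customers = integrate(store_a, store_b, store_c)
print(customers)
```

After integration, customer 1 appears once, with a single attribute name and a single spelling of the first name, rather than as three partly redundant records.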

"Hmmm," you wonder, as you consider your data even further. "The data set I have selected for analysis is HUGE, which is sure to slow down the mining process. Is there a way I can reduce the size of my data set without jeopardizing the data mining results?" Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results. Data reduction strategies include dimensionality reduction and numerosity reduction.

In dimensionality reduction, data encoding schemes are applied so as to obtain a reduced or "compressed" representation of the original data. Examples include data compression techniques (e.g., wavelet transforms and principal components analysis), attribute subset selection (e.g., removing irrelevant attributes), and attribute construction (e.g., where a small set of more useful attributes is derived from the original set).

In numerosity reduction, the data are replaced by alternative, smaller representations using parametric models (e.g., regression or log-linear models) or nonparametric models (e.g., histograms, clusters, sampling, or data aggregation). Data reduction is the topic of Section 3.4.
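Two of the nonparametric representations mentioned above, sampling and histograms, can be sketched as follows. The function names, the fixed seed, the choice of equal-width buckets, and the price data are assumptions made for illustration; other sampling schemes and bucketing rules are equally valid.

```python
import random


def simple_random_sample(data, n, seed=0):
    """Draw n items without replacement; the seed makes the sketch repeatable."""
    rng = random.Random(seed)
    return rng.sample(data, n)


def equal_width_histogram(values, num_buckets):
    """Summarize values as (bucket_start, bucket_end, count) triples."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        # The maximum value falls on the last boundary, so clamp it into the last bucket.
        idx = min(int((v - lo) / width), num_buckets - 1) if width > 0 else 0
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, c) for i, c in enumerate(counts)]


# Hypothetical price attribute with 100 values.
prices = list(range(1, 101))

sample = simple_random_sample(prices, 10)
histogram = equal_width_histogram(prices, 4)
print(len(sample), [count for _, _, count in histogram])
```

The sample keeps 10 of the 100 values, while the histogram replaces all 100 values with four bucket counts.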

Getting back to your data, you have decided, say, that you would like to use a distance-based mining algorithm for your analysis, such as neural networks, nearest-neighbor classifiers, or clustering.¹ Such methods provide better results if the data to be analyzed have been normalized, that is, scaled to a smaller range such as [0.0, 1.0]. Your customer data, for example, contain the attributes age and annual salary. The annual salary attribute usually takes much larger values than age. Therefore, if the attributes are left unnormalized, the distance measurements taken on annual salary will generally outweigh distance measurements taken on age. Discretization and concept hierarchy generation can also be useful, where raw data values for attributes are replaced by ranges or higher conceptual levels. For example, raw values for age may be replaced by higher-level concepts, such as youth, adult, or senior.
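Scaling to [0.0, 1.0] and replacing raw ages with higher-level concepts can both be sketched briefly. Min-max scaling is used here as one common way to reach that range; the function names, the age cut points of 30 and 60, and the sample values are hypothetical and are not taken from the text.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no spread; map every value to new_min.
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]


def age_concept(age):
    """Map a raw age to a higher-level concept using assumed cut points."""
    if age < 30:
        return "youth"
    if age < 60:
        return "adult"
    return "senior"


# Hypothetical customer attributes on very different scales.
ages = [20, 35, 50, 65]
salaries = [30000, 45000, 60000, 75000]

norm_ages = min_max_normalize(ages)
norm_salaries = min_max_normalize(salaries)
concepts = [age_concept(a) for a in ages]
print(norm_ages, norm_salaries, concepts)
```

After normalization both attributes span the same range, so neither dominates a distance computation merely because of its units.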

Discretization and concept hierarchy generation are powerful tools for data mining in that they allow data mining at multiple abstraction levels. Normalization, data discretization, and concept hierarchy generation are forms of data transformation. You soon realize such data transformation operations are additional data preprocessing procedures that would contribute toward the success of the mining process. Data integration is discussed in Section 3.3, and data transformation and data discretization are discussed in Section 3.5.

¹ Neural networks and nearest-neighbor classifiers are described in Chapter 9, and clustering is discussed in Chapters 10 and 11.

Figure 3.1 summarizes the data preprocessing steps described here. Note that the previous categorization is not mutually exclusive. For example, the removal of redundant data may be seen as a form of data cleaning, as well as data reduction.

In summary, real-world data tend to be dirty, incomplete, and inconsistent. Data preprocessing techniques can improve data quality, thereby helping to improve the accuracy and efficiency of the subsequent mining process. Data preprocessing is an important step in the knowledge discovery process, because quality decisions must be based on quality data. Detecting data anomalies, rectifying them early, and reducing the data to be analyzed can lead to huge payoffs for decision making.

Figure 3.1 Forms of data preprocessing. [Panels: data cleaning; data integration; data reduction (a table of attributes A1, A2, A3, ..., A126 by transactions T1, T2, T3, T4, ..., T2000 reduced to attributes A1, A3, ..., A115 by transactions T1, T4, ..., T1456); data transformation (2, 32, 100, 59, 48 to 0.02, 0.32, 1.00, 0.59, 0.48).]
