Data Mining and Data Warehousing


    DCS 008 Data Mining and Data Warehousing Unit I

    Structure of the Unit

1.1 Introduction

    1.2 Learning Objectives

    1.3 Data mining concepts

    1.3.1 An overview

    1.3.2 Data mining Tasks

    1.3.3 Data mining Process

    1.4 Information and the production factor

    1.5 Data mining vs Query tools

    1.6 Data Mining in Marketing

    1.7 Self-learning Computer Systems

    1.8 Concept Learning

    1.9 Data Learning

    1.10 Data Mining and Data Warehousing

    1.11 Summary

    1.12 Exercises


1.1 Introduction

    As a student who knows the basics of computers and data, you are aware that the modern world is surrounded by data of many types (numbers, images, video, sound); the whole world, in short, is data driven. As the years pass, the volume of this data grows very large, and old, historical data is often treated as waste by its owners. This has happened in many areas: supermarket transactions, credit card processing, records of telephone calls dialed and received, ration card details, election and voter records, and so on. In the spirit of "waste to wealth", this data can be analyzed and organized to extract vital information, answer important decision-making questions, and suggest beneficial courses of action. Statistical and other techniques are used to extract such information from very large data sets, and one of the major disciplines devoted to this today is known as DATA MINING. Just as you mine land for treasure, you mine large data sets for the precious information that lies within them, such as relationships and patterns.

    1.2 Learning Objectives

To understand the necessity of analyzing and processing complex, large, information-rich data sets

    To introduce the initial concepts related to data mining

    1.3 Data mining concepts

    1.3.1 An overview

    Data is growing at a phenomenal rate, and users expect more and more sophisticated information. How do we get it? We have to uncover the hidden information in the large data sets, and data mining is used to do that. You may be familiar with ordinary queries used to explore the information in a database; how are data mining queries different? Compare the following examples and you will see the difference.

    Examples of database queries:
    Find all credit applicants with the last name Smith.
    Identify customers who have purchased more than $10,000 in the last month.
    Find all customers who have purchased milk.


Examples of data mining queries:
    Find all credit applicants who are poor credit risks. (classification)
    Identify customers with similar buying habits. (clustering)
    Find all items which are frequently purchased with milk. (association rules)

    In short, DATA MINING can be defined as follows: data mining is an analytic process designed to explore data (usually large amounts of data, typically business or market related) in search of consistent patterns and/or systematic relationships between variables, then to summarize the data in novel ways that are understandable and useful (the hidden information), and finally to validate the findings by applying the detected patterns to new subsets of data. The concept of data mining is becoming increasingly popular as a business information management tool, where it is expected to reveal knowledge structures that can guide decisions under conditions of limited certainty. Recently, there has been increased interest in developing new analytic techniques specifically designed to address the issues relevant to business data mining (e.g., classification trees).
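    As a hedged illustration of the difference, here is a minimal Python sketch (not part of the original unit) of two of the mining-style queries above, classification and clustering, using scikit-learn on made-up customer records; all column names and values are invented for illustration.

```python
# A minimal sketch of two data mining tasks on made-up data:
# classification (supervised) and clustering (unsupervised).
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

# Hypothetical customer records: [monthly_spend, visits_per_month]
X = np.array([[120, 2], [800, 9], [90, 1], [950, 12], [60, 1], [700, 8]])
y = np.array(["low", "high", "low", "high", "low", "high"])  # known classes

# Classification: learn a mapping from attributes to the predefined classes.
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[100, 2], [850, 10]]))    # e.g. ['low' 'high']

# Clustering: group similar records without any class labels.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                            # cluster id for each record
```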

    Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by retrospective tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.

1.3.2 Data mining Tasks

    The basic data mining tasks can be defined as follows:

    Classification maps data into predefined groups or classes (supervised learning, pattern recognition, prediction).
    Regression is used to map a data item to a real-valued prediction variable.
    Clustering groups similar data together into clusters (unsupervised learning, segmentation, partitioning).
    Summarization maps data into subsets with associated simple descriptions (characterization, generalization).
    Link analysis uncovers relationships among data (affinity analysis, association rules).
    Sequential analysis determines sequential patterns, e.g. time series analysis. Example: stock market data, where we predict future values, determine similar patterns over time, and classify behavior.

    1.3.3 Data mining Process

    The ultimate goal of data mining is prediction - and predictive data mining is the most common type of data mining and one that has the most direct business applications. The process of data mining consists of three stages: (1) the initial exploration, (2) model building or pattern identification with validation/verification, and (3) deployment (i.e., the application of the model to new data in order to generate predictions).

Stage 1: Exploration. This stage usually starts with data preparation which may involve cleaning data, data transformations, selecting subsets of records and - in case of data sets with large numbers of variables ("fields") - performing some preliminary feature selection operations to bring the number of variables to a manageable range (depending on the statistical methods which are being considered). Then, depending on the nature of the analytic problem, this first stage of the process of data mining may involve anywhere between a simple choice of straightforward predictors for a regression model, to elaborate exploratory analyses using a wide variety of graphical and statistical methods (see Exploratory Data Analysis (EDA)) in order to identify the most relevant variables and determine the complexity and/or the general nature of models that can be taken into account in the next stage.

    Stage 2: Model building and validation. This stage involves considering various models and choosing the best one based on their predictive performance (i.e., explaining the variability in question and producing stable results across samples). This may sound like a simple operation, but in fact, it sometimes involves a very elaborate process. There are a variety of techniques developed to achieve that goal - many of which are based on so-called "competitive evaluation of models," that is, applying different models to the same data set and then comparing their performance to choose the best. These techniques - which are often considered the core of predictive data mining - include: Bagging (Voting, Averaging), Boosting, Stacking (Stacked Generalizations), and Meta-Learning.
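    The following is a hedged sketch, not the unit's own procedure, of such a competitive evaluation: three candidate models (a single decision tree, bagging, and boosting) are applied to the same data set and compared by cross-validated accuracy with scikit-learn. The data set (scikit-learn's bundled breast cancer data) is chosen only for illustration.

```python
# Competitive evaluation of models: apply several candidates to the same data
# and compare their cross-validated performance.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier

X, y = load_breast_cancer(return_X_y=True)
candidates = {
    "single tree": DecisionTreeClassifier(random_state=0),
    "bagging":     BaggingClassifier(random_state=0),
    "boosting":    AdaBoostClassifier(random_state=0),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)      # 5-fold validation
    print(f"{name:12s} mean accuracy = {scores.mean():.3f}")
```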

    Stage 3: Deployment. That final stage involves using the model selected as best in the previous stage and applying it to new data in order to generate predictions or estimates of the expected outcome.

1.4 Information and the production factor

    Information and knowledge can behave as a factor of production. According to elementary economics texts, the raw material for any productive activity falls into one of three categories: land (raw materials in general), labor, and capital. Some economists mention entrepreneurship as a fourth factor, but none talk about knowledge. This is strange, since know-how is the key determinant of the most important kind of output: increased production. Then again, it is not that strange, since knowledge has unusual properties: there is no metric for it, and one cannot calculate a monetary rate for it (cf. $/acre for land).

    1.4.1 An example from agriculture

    Imagine that you are a crop farmer. Your inputs are land and other raw materials like fertilizer and seed; your labor in planting, cultivating and harvesting the crop; and the money you have borrowed from the bank to pay for your tractor. You can increase output by increasing any of these factors: cultivating more land, working more hours, or borrowing money to buy a better tractor or better seed. However, you can also increase output through know-how. For example, you might discover that your land is better suited to one kind of corn than another. You could make a more substantial improvement in output if you changed your practices, for example by implementing crop rotation. Farmers in Europe had practiced a three-year rotation since the Middle Ages: rye or winter wheat, followed by spring oats or barley, then letting the soil rest (fallow) during the third stage. Four-field rotation (wheat, barley, turnips, and clover; no fallow) was a key development in the British Agricultural Revolution in the 18th century. This system removed the need for a fallow period and allowed livestock to be bred year-round. (One suspects that if four-crop rotation were invented today, it would be eligible for a business process patent.) Most of the increases in our material well-being have come about through innovation, that is, the application of knowledge. How is it, then, that knowledge as a factor of production gets such cursory treatment in traditional economics?

    1.4.2 Measuring knowledge

    A key difficulty is that knowledge is easy to describe but very hard to measure. One can talk about uses of knowledge, but there is so far no simple metric; it is even hard to measure information content. There are many different perspectives, such as library science (e.g. a user-centered measure of information), information theory (measuring data channel capacity), and algorithmic complexity (e.g. Kolmogorov complexity), and all give different results. One can always argue, of course, that money is the ultimate metric: the knowledge value of something is what someone will pay for it. However, this is true of anything, including all the other factors of production. The difference is that land, labor and capital all have an underlying objective measure, and one cannot calculate a $/something rate for knowledge the way one can for the other three. Say land is measured in acres, labor in hours, and money in dollars. You will pay me so much per acre of land, so much per hour of labor, and so many cents of interest per dollar I loan you. Land in different locations, labor of different kinds, and loans of different risks will earn different payment rates. Knowledge does have some value when it is sold, e.g. when a patent is licensed or when a list of customer names is valued on a balance sheet. However, there is no rate, no $/something, for the knowledge purchased. That suggests that the underlying concept is indefinite, perhaps so indefinite that we are fooling ourselves by even imagining that it exists.

    1.5 Data mining vs query tools

    Various data mining tools are available commercially. Users can apply them to obtain the required results and models. Some of them are described below for your reference.

    1.5.1 Clementine


    SPSS Clementine, the premier data mining workbench, allows experts in business processes, data, and modeling to collaborate in exploring data and building models. It also supports the proven, industry-standard CRISP-DM methodology, which enables predictive insights to be developed consistently, repeatedly.

    No wonder that organizations from FORTUNE 500 companies to government agencies and academic institutions point to Clementine as a critical factor in their success.

    1.5.2 CART

CART is a robust data mining tool that automatically searches for important patterns and relationships in large data sets and quickly uncovers hidden structure even in highly complex data sets. It works on Windows, Mac and Unix platforms.

1.5.3 Web Information Extractor

    Web Information Extractor is a powerful tool for web data mining, content extraction and content update monitoring. It can extract structured or unstructured data (including text, pictures and other files) from web pages, reformat it into local files or save it to a database, or post it to a web server. There is no need to define complex template rules; just browse to the web page you are interested in, click what you want in order to define the extraction task, and run it as you wish.

    1.5.4 The Query Tool

    The Query Tool is a powerful data mining application. It allows you to perform data analysis on any SQL database and was developed predominantly for the non-technical user; no knowledge of SQL is required. Newer features include a Query Builder, to quickly and simply build powerful queries; a Summary facility, to summarise any two columns against an aggregate function (MIN, AVG, etc.) of any numerical column; and a Query Editor, with which you can create your own scripts.

    1.6 Data Mining in Marketing

    1.6.1 Marketing Optimization

If you are the owner of a business, you should already be aware that there are multiple techniques you can use to market to your customers: the internet, direct mail, and telemarketing.

    While using these techniques can help your business succeed, there is even more you can do to tip the odds in your favor. You will want to become familiar with a technique called marketing optimization, which is intricately connected to data mining. With marketing optimization, you take a group of offers and customers and, after reviewing the limits of the campaign, use data mining to decide which marketing offers should be made to which specific customers. Marketing optimization is a powerful tool that will take your marketing to the next level. Instead of mass-marketing a product to a broad group of people who may not respond to it, you can take a group of marketing strategies and market them to different people based on patterns and relationships. The first step in marketing optimization is to create a group of marketing offers. Each offer is created separately from the others, and each has its own financial attributes, for example the cost required to run the campaign. Each offer also has a model connected to it that makes a prediction based on the customer information presented to it. The prediction can come in the form of a score, for instance the probability of the customer purchasing the product. These models are created by data mining tools and can be added to your marketing strategy. After you have set up your offers, you will next want to look at the purchasing habits of the customers you already have. Your goal is to analyze each offer you are making and optimize it in a way that brings in the largest profits.
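    To make the idea concrete, here is a hedged sketch (not from the unit) of assigning each customer the offer with the highest expected profit, given per-offer response scores. The offers, margins and scoring rules are invented placeholders standing in for models that data mining tools would actually build.

```python
# Offer assignment: score each (customer, offer) pair, then pick the offer
# with the highest expected profit per customer. Everything here is hypothetical.
OFFERS = {
    # offer name: (profit if accepted, cost of making the offer)
    "college_savings_plan": (200.0, 2.0),
    "student_account":      (80.0, 0.5),
}

def response_probability(offer, customer):
    """Stand-in for a data mining model's score for this customer/offer pair."""
    if offer == "college_savings_plan":
        return 0.15 if customer["has_children"] else 0.02
    return 0.25 if customer["is_student"] else 0.03

def best_offer(customer):
    def expected_profit(offer):
        profit, cost = OFFERS[offer]
        return response_probability(offer, customer) * profit - cost
    return max(OFFERS, key=expected_profit)

customers = [
    {"name": "A", "has_children": True,  "is_student": False},
    {"name": "B", "has_children": False, "is_student": True},
]
for c in customers:
    print(c["name"], "->", best_offer(c))
```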

    1.6.2 Illustration with an example

To illustrate marketing optimization with data mining, let us use an example. Suppose you are the marketing director for a financial institution such as a bank. You have a number of products which you offer to your customers: CDs, credit cards, gold credit cards, and savings accounts. Although your company offers these products, it is your job to market checking accounts and savings accounts, and your goal is to figure out which customers will be interested in savings accounts as opposed to checking accounts. After thinking about how you can successfully market your products to your customers, you have come up with two possible strategies that you will present to your manager. The first strategy is to market to customers who would like to save money for their children so they can attend college when they turn 18. The second strategy is to market to students who are already attending college. Now that you have two offers you are interested in marketing, you will next want to study the data you have obtained. In this example, you work for a large company that has a data warehouse, and you look at customer data from the last few years to make a marketing decision. Your company uses a data mining tool that predicts the chances of people signing up for your products, and you will want to create mathematical models that allow you to predict the possible responses. In this example, you are targeting young parents who may be looking to save money for their children, and young people who are already in college.


Computer algorithms will be able to look at the history of customer transactions to determine the chances of success for your marketing campaign. In this example, the best way to find out if young parents and college students will be interested in your offer is by looking at the historical response rate. If the historical response rate is only 10%, it is likely that it will remain roughly the same for your new marketing strategy. However, the historical response rate is only a simple estimate; to be more precise, you will want to use more complex data mining strategies.

By this time, however, you will have realized how data mining concepts are used in marketing and in optimizing it.

    1.7 Self-learning Computer Systems

A self-learning computer system, also known as a knowledge-based system or an expert system, is a computer program that contains the knowledge and analytical skills of one or more human experts in a specific subject. This class of program was first developed by researchers in artificial intelligence during the 1960s and 1970s and applied commercially throughout the 1980s.

A self-learning computer system is a software system that incorporates concepts derived from experts in a field and uses their knowledge to provide problem analysis to users of the software.

The most common form of self-learning computer system is a program with a set of rules that analyzes information (usually supplied by the user of the system) about a specific class of problems and recommends one or more courses of action. The expert system may also provide mathematical analysis of the problem(s), and it uses what appear to be reasoning capabilities to reach conclusions.
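    As a toy illustration only, such a rule-based structure can be reduced to a small forward-chaining loop over if-then rules; the rules and facts below are invented and far simpler than any real expert system.

```python
# A tiny rule-based sketch: a knowledge base of if-then rules is applied to
# user-supplied facts until no more conclusions can be drawn.
RULES = [
    # (conditions that must all be true, conclusion to add)
    ({"fever", "cough"}, "possible_flu"),
    ({"possible_flu", "short_of_breath"}, "see_doctor"),
]

def infer(facts):
    facts = set(facts)
    changed = True
    while changed:                      # simple forward chaining
        changed = False
        for conditions, conclusion in RULES:
            if conditions <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

print(infer({"fever", "cough", "short_of_breath"}))
# -> includes 'possible_flu' and 'see_doctor'
```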

    A related term is wizard. A wizard is an interactive computer program that helps a user solve a problem. Originally the term wizard was used for programs that construct a database search query based on criteria supplied by the user. However, some rule-based expert systems are also called wizards. Other "Wizards" are a sequence of online forms that guide users through a series of choices, such as the ones which manage the installation of new software on computers, and these are not expert systems.

In other words, a self-learning computer system, or expert system, is a computer program that simulates the judgement and behavior of a human or an organization that has expert knowledge and experience in a particular field. Typically, such a system contains a knowledge base of accumulated experience and a set of rules for applying the knowledge base to each particular situation that is described to the program. Sophisticated expert systems can be enhanced with additions to the knowledge base or to the set of rules.

    Among the best-known expert systems have been those that play chess and that assist in medical diagnosis.

1.8 Concept Learning

    1.8.1 Analyzing Concepts

    Concepts are categories of stimuli that have certain features in common. The shapes in the accompanying figure are all members of one conceptual category: rectangle. Their common features are (1) four lines; (2) opposite lines parallel; (3) lines connected at their ends; (4) lines forming four right angles. The fact that they are of different colors and sizes and have different orientations is irrelevant; color, size, and orientation are not defining features of the concept. If a stimulus is a member of a specified conceptual category, it is referred to as a positive instance; if it is not a member, it is referred to as a negative instance. The figure also shows shapes that are negative instances of the rectangle concept: as rectangles are defined, a stimulus is a negative instance if it lacks any one of the specified features. Every concept has two components. Attributes: the features of a stimulus that one must look for to decide whether that stimulus is a positive instance of the concept. A rule: a statement that specifies which attributes must be present or absent for a stimulus to qualify as a positive instance of the concept.


For rectangles, the attributes would be the four features discussed earlier, and the rule would be that all the attributes must be present. The simplest rules refer to the presence or absence of a single attribute. This rule is called affirmation: a stimulus must possess a single specified attribute to qualify as a positive instance of a concept. For example, a vertebrate animal is defined as an animal with a backbone. The opposite or complement of affirmation is negation: to qualify as a positive instance, a stimulus must lack a single specified attribute. An invertebrate animal, for example, is one that lacks a backbone. More complex conceptual rules involve two or more specified attributes. For example, the conjunction rule states that a stimulus must possess two or more specified attributes to qualify as a positive instance of the concept; this was the rule used earlier to define the concept of a rectangle. A small code sketch of these three rules is given below.
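    The sketch encodes the three rules (affirmation, negation, conjunction) as simple Python predicates over a stimulus described by its attributes; the attribute names are invented for illustration.

```python
# Concept rules as predicates over a stimulus (a dict of attributes).
def affirmation(stimulus):                 # single attribute must be present
    return stimulus.get("has_backbone", False)          # "vertebrate"

def negation(stimulus):                    # single attribute must be absent
    return not stimulus.get("has_backbone", False)      # "invertebrate"

def conjunction(stimulus):                 # all specified attributes present
    return (stimulus.get("four_lines", False)
            and stimulus.get("opposite_sides_parallel", False)
            and stimulus.get("lines_connected", False)
            and stimulus.get("four_right_angles", False))   # "rectangle"

shape = {"four_lines": True, "opposite_sides_parallel": True,
         "lines_connected": True, "four_right_angles": True, "colour": "red"}
print(conjunction(shape))   # True: a positive instance; colour is irrelevant
```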

1.8.2 Behavioral Processes

In behavioral terms, when a concept is learned, two processes control how we respond to a stimulus. Generalization: we generalize a certain response (like the name of an object) to all members of the conceptual class based on their common attributes. Discrimination: we discriminate between stimuli which belong to the conceptual class and those that don't because they lack one or more of the defining attributes. For example, we generalize the word rectangle to those stimuli that possess the defining attributes, and we discriminate between these stimuli and others that are outside the conceptual class, to which we respond with a different word.

    1.9 Data learning

    Learning from the given data can be done in many ways, and the data can be arranged in particular formats to learn from it. The following are some examples:

    (i) A database system, also called a database management system (DBMS), consists of a collection of interrelated data, known as a database, and a set of software programs to manage and access the data. The software programs involve mechanisms for the definition of database structures; for data storage; for concurrent, shared, or distributed data access; and for ensuring the consistency and security of the information stored, despite system crashes or attempts at unauthorized access.


(ii) A relational database is a collection of tables, each of which is assigned a unique name. Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows). Each tuple in a relational table represents an object identified by a unique key and described by a set of attribute values. (iii) A semantic data model, such as an entity-relationship (ER) data model, is often constructed for relational databases; an ER data model represents the database as a set of entities and their relationships. (iv) A data warehouse is a repository of information collected from multiple sources and stored under a unified schema. (v) A data mart is a subset of a data warehouse that focuses on selected subjects.

    1.10 Data mining and Data Warehousing

    A data warehouse is an integrated and consolidated collection of data. It can be defined as a repository of purposely selected and adapted operational data which can successfully answer ad hoc, complex, analytical and statistical queries. Time-dependent data will be present in a data warehouse. Data warehousing can be defined as the process of organizing the storage of large, multivariate data sets in a way that facilitates the retrieval of information for analytic purposes.

    1.10.1 Functional requirements of a Data Warehouse

A data warehouse provides the needed support for all the informational applications of a company. It must support various types of applications, each of which has its own requirements in terms of data and the way data are modeled and used. A data warehouse must support:

    1. Decision support processing
    2. Informational applications
    3. Model building
    4. Consolidation

The data in the warehouse is processed to support the decisions to be taken at crucial times in the business. Certain information in the data warehouse is derived as needed, modeling can be done by exploring the data in the warehouse, and consolidation of the data and information can be carried out through various tools. Data in a data warehouse must therefore be organized such that it can be analyzed or explored along different contextual dimensions.


Data sources, users, and informational applications for a data warehouse

    Fig. 1.1 shows the many sources and the different types of users. A data warehouse can get data from many sources (corporate, external, offline, etc.). In a warehouse the data can be structured or unstructured (large text objects, pictures, audio, video, etc.). The people who use the data warehouse can be executives, administrative officials, operational end users, external users, and data and business analysts. Applications such as decision support processing and extended data warehouse applications can be run on the data in a warehouse.

    1.10.2 Data warehousing

Data warehousing is essentially what you need to do in order to create a data warehouse, and what you do with it. It is the process of creating, populating, and then querying a data warehouse, and it can involve a number of discrete technologies.

In a dimensional model, the context of the measurements is represented in dimension tables. You can think of the context of a measurement as its characteristics: the who, what, where, when and how of the measurement (the subject). In a Sales business process, the characteristics of the 'monthly sales number' measurement can be a Location (where), a Time (when), and a Product Sold (what).

[Fig. 1.1: The data warehouse environment. Sources: corporate data, offline data, external data, structured and unstructured data. Users: CEOs and executives, external users, and others.]

    The Dimension Attributes are the various columns in a dimension table. In the Location dimension, the attributes can be Location Code, State, Country, Zip code. Generally the Dimension Attributes are used in report labels, and query constraints such as where Country='USA'. The dimension attributes also contain one or more hierarchical relationships.

Before designing your data warehouse, you need to decide what the warehouse will contain. Say you want to build a data warehouse containing monthly sales numbers across multiple store locations, across time and across products; then your dimensions are:

    Location
    Time
    Product

    Each dimension table contains data for one dimension. In the above example you get all your store location information and put that into one single table called Location. Your store location data may be spanned across multiple tables in your OLTP system (unlike OLAP), but you need to de-normalize all that data into one single table.

Dimensional modeling is the design concept used by many data warehouse designers to build their data warehouses, and the dimensional model is the underlying data model used by many of the commercial OLAP products available in the market today. In this model, all data is contained in two types of tables, called fact tables and dimension tables. A small sketch of such a model follows.
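    This is a hedged sketch of that structure, assuming pandas and invented table and column names: a fact table is joined to one of its dimension tables and a measure is analyzed along a dimension.

```python
# A minimal star-schema sketch: one fact table, two dimension tables.
import pandas as pd

location_dim = pd.DataFrame({
    "location_key": [1, 2],
    "city": ["Chennai", "Madurai"],
    "state": ["TN", "TN"],
    "country": ["India", "India"],
})
product_dim = pd.DataFrame({
    "product_key": [10, 11],
    "product_name": ["Milk", "Bread"],
    "category": ["Dairy", "Bakery"],
})
sales_fact = pd.DataFrame({          # one row per (month, location, product)
    "month": ["2015-01", "2015-01", "2015-02"],
    "location_key": [1, 2, 1],
    "product_key": [10, 10, 11],
    "monthly_sales": [12000.0, 9500.0, 4300.0],
})

# Analyse a measure along a dimension: total sales by state.
report = (sales_fact.merge(location_dim, on="location_key")
                    .groupby("state")["monthly_sales"].sum())
print(report)
```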

1.11 Summary

    In this unit you have learnt the basic concepts involved in data mining. You should now have the idea that data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. The role of information as a factor of production has also been explained, and you have had an overview of various tools used in mining, such as CART and Clementine.

Marketing can be done in a powerful way by using data mining results; marketing people are eager to use these facilities, as you will have understood from the example given earlier in the unit. The concepts of learning and the details of self-learning or expert systems have also been explained, and learning from data has been covered briefly. Lastly, the necessity of the data warehouse and its usage in various contexts has been explained.

1.12 Exercises

    1. What is data mining? In your answer, address the following: (a) Is it another hype? (b) Is it a simple transformation of technology developed from databases, statistics, and machine learning?
    2. Explain how information behaves as a factor of production. Illustrate with an example of your own (not one given in the book).
    3. Give brief notes on the various mining tools known to you.
    4. What do you mean by data mining in marketing? Explain with a suitable example.
    5. What is a concept? How can one learn a concept? Explain with examples the components of a concept.
    6. In what ways can one learn from data?
    7. Explain the concepts of a data warehouse.


Unit II

    Structure of the Unit

    2.1 Introduction

    2.2 Learning Objectives

    2.3 Knowledge discovery process

    2.3.1 Data Selection

    2.3.2 Data Cleaning

    2.3.3 Data Enrichment

    2.4 Preliminary Analysis of Data using traditional query tools

    2.5 Visualization techniques

    2.6 OLAP Tools

    2.7 Decision trees

    2.8 Association Rules

    2.9 Neural Networks

2.10 Genetic Algorithms

2.11 KDD in Databases

    2.12 Summary

    2.13 Exercises


    2.1 Introduction

There are various processes and tools involved in data mining. To extract knowledge from large databases, one of the processes used is KDD (Knowledge Discovery in Databases). There are also processes such as data selection, data cleaning and data enrichment that prepare the data for mining and help obtain results from it. Data mining methods such as decision trees, association rules, neural networks and genetic algorithms can be used in various businesses and fields to give useful and suitable solutions to many problems. To visualize the data and the results there are visualization techniques, through which one can view the various effects of a situation and understand the results easily. The data in a large database can also be analyzed through traditional queries to obtain suitable information and knowledge.

2.2 Learning Objectives

    To know the concepts of the knowledge discovery process in mining large databases, and to understand the processes of data selection, data cleaning and data enrichment under KDD.

    To know about the visualization techniques used in data mining and the various methods involved in the mining process, such as decision trees and association rules.

2.3 Knowledge discovery process

    2.3.1 An Overview

    Why do we need KDD? The traditional method of turning data into knowledge relies on manual analysis and interpretation. For example, in the health-care industry, it is common for specialists to periodically analyze current trends and changes in health-care data, say, on a quarterly basis. The specialists then provide a report detailing the analysis to the sponsoring health-care organization; this report becomes the basis for future decision making and planning for health-care management. In a totally different type of application, planetary geologists sift through remotely sensed images of planets and asteroids, carefully locating and cataloging such geologic objects of interest as impact craters. Be it science, marketing, finance, health care, retail, or any other field, the classical approach to data analysis relies fundamentally on one or more analysts becoming intimately familiar with the data and serving as an interface between the data and the users and products. For these (and many other) applications, this form of manual probing of a data set is slow, expensive, and highly subjective. In fact, as data volumes grow dramatically, this type of manual data analysis is becoming completely impractical in many domains. Databases are increasing in size in two ways: (1) the number N of records or objects in the database and (2) the number d of fields or attributes of an object. Databases containing on the order of N = 10^9 objects are becoming increasingly common, for example, in the astronomical sciences. Similarly, the number of fields d can easily be on the order of 10^2 or even 10^3, for example, in medical diagnostic applications. Who could be expected to digest millions of records, each having tens or hundreds of fields? We believe that this job is certainly not one for humans; hence, analysis work needs to be automated, at least partially. The need to scale up human analysis capabilities to handle the large number of bytes that we can collect is both economic and scientific. Businesses use data to gain competitive advantage, increase efficiency, and provide more valuable services to customers. Data we capture about our environment are the basic evidence we use to build theories and models of the universe we live in. Because computers have enabled humans to gather more data than we can digest, it is only natural to turn to computational techniques to help us unearth meaningful patterns and structures from the massive volumes of data. Hence, KDD is an attempt to address a problem that the digital information era has made a fact of life for all of us: data overload.

    Data Mining and Knowledge Discovery in the Real World

    A large degree of the current interest in KDD is the result of the media interest surrounding successful KDD applications. In science, one of the primary application areas is astronomy. Here, a notable success was achieved by SKICAT, a system used by astronomers to perform image analysis, classification, and cataloging of sky objects from sky-survey images (Fayyad, Djorgovski, and Weir 1996). In its first application, the system was used to process the 3 terabytes (10^12 bytes) of image data resulting from the Second Palomar Observatory Sky Survey, where it is estimated that on the order of 10^9 sky objects are detectable. SKICAT can outperform humans and traditional computational techniques in classifying faint sky objects. See Fayyad, Haussler, and Stolorz (1996) for a survey of scientific applications. In business, the main KDD application areas include marketing, finance (especially investment), fraud detection, manufacturing, telecommunications, and Internet agents.


The Interdisciplinary Nature of KDD

    KDD has evolved, and continues to evolve, from the intersection of research fields such as machine learning, pattern recognition, databases, statistics, AI, knowledge acquisition for expert systems, data visualization, and high-performance computing. The unifying goal is extracting high-level knowledge from low-level data in the context of large data sets. Knowledge discovery is the non-trivial extraction of implicit, previously unknown, and potentially useful information from databases. Both the number and the size of databases are growing rapidly because of the large amount of data obtained from satellite images, X-ray crystallography and other scientific equipment. This growth far exceeds human capacity to analyze the databases in order to find the implicit regularities, rules or clusters hidden in the data; therefore, knowledge discovery is becoming more and more important in databases. Typical tasks for knowledge discovery are the identification of classes (clustering), the prediction of new, unknown objects (classification), and the discovery of associations or deviations in spatial databases. The term 'visual data mining' refers to the emphasis on integrating the user into the knowledge discovery process. Since these are challenging tasks, knowledge discovery algorithms should be incremental, i.e. when the database is updated the algorithm does not have to be applied again to the whole database.

KDD (Knowledge Discovery in Databases, or Knowledge Discovery and Data Mining) is a recent term related to data mining and involves sorting through huge quantities of data to pick out useful and relevant information. The basic steps in the knowledge discovery process, sketched in code after the list, are:

    Data selection
    Data cleaning/cleansing
    Data enrichment
    Data mining
    Pattern evaluation
    Knowledge presentation
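    Purely as an orientation, and not the unit's own code, the following Python skeleton strings these steps together as placeholder functions; every rule inside them (what counts as relevant, how to clean, what to derive) is a stand-in to be replaced in a real project.

```python
# A skeletal KDD pipeline: select -> clean -> enrich -> mine.
def select(raw_records):
    """Data selection: keep only the records relevant to the question."""
    return [r for r in raw_records if r.get("relevant", True)]

def clean(records):
    """Data cleaning: drop records with missing values (one simple policy)."""
    return [r for r in records if None not in r.values()]

def enrich(records):
    """Data enrichment: add derived or external attributes."""
    for r in records:
        r["high_value"] = r.get("amount", 0) > 100
    return records

def mine(records):
    """Data mining: here, a trivial 'pattern' - average amount per group."""
    totals, counts = {}, {}
    for r in records:
        g = r["high_value"]
        totals[g] = totals.get(g, 0) + r["amount"]
        counts[g] = counts.get(g, 0) + 1
    return {g: totals[g] / counts[g] for g in totals}

data = [{"amount": 150, "relevant": True}, {"amount": 40, "relevant": True},
        {"amount": None, "relevant": True}, {"amount": 500, "relevant": False}]
patterns = mine(enrich(clean(select(data))))
print(patterns)     # pattern evaluation / knowledge presentation would follow
```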

Knowledge discovery process - an overview

    [Figure (after Han, Introduction to KDD): Data Mining: A KDD Process. Data mining is the core of the knowledge discovery process; the flow runs from databases through data cleaning and data integration into a data warehouse, then selection of task-relevant data, data mining, and pattern evaluation.]

2.3.1 Data Selection: The selection of data is the first step of a KDD process. It means selecting the data relevant to the field of interest so as to arrive at meaningful knowledge.

    Identification of relevant data. In a large and vast data bank one has to select the relevant and necessary data or information that is important for the project or process being undertaken to obtain the targeted knowledge. For example, in a supermarket, if one wants knowledge about the sales of milk products, then only the transaction data relevant to milk product sales has to be gathered and processed; the other sales details are not necessary. But if the shopkeeper wants to know the overall performance, then every transaction becomes necessary for the process.

    Representation of data. After choosing the relevant data, it has to be represented in a suitable structure. The structure or format, such as a database or text files, can also be decided, and the data represented in that format.


2.3.2 Data cleaning: Data cleaning is the act of detecting and correcting (or removing) corrupt or inaccurate attributes or records. It is an early step in the knowledge discovery process, and it is necessary because data in the real world is dirty:

    Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    Noisy: containing errors or outliers
    Inconsistent: containing discrepancies in codes or names

    No quality data, no quality mining results: quality decisions must be based on quality data, and a data warehouse needs consistent integration of quality data. Before proceeding to the further steps of the knowledge discovery process, data cleaning has to be done; it involves filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies, in order to get successful results. Organizations are therefore forced to think about a unified logical view of the wide variety of data and databases they possess; they have to address the issues of mapping data to a single naming convention, uniformly representing and handling missing data, and handling noise and errors when possible. Some of the data cleaning tasks are:

    Data acquisition and metadata
    Filling in missing values
    Unifying date formats
    Converting nominal to numeric values
    Identifying outliers and smoothing out noisy data
    Correcting inconsistent data

    How to handle missing data? (A small code sketch of two of these strategies follows the list.)

    Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
    Fill in the missing value manually: tedious and often infeasible.
    Use a global constant to fill in the missing value: e.g. "unknown", or a new class.
    Imputation: use the attribute mean to fill in the missing value, or use the attribute mean for all samples belonging to the same class: smarter.
    Use the most probable value to fill in the missing value: inference-based methods such as a Bayesian formula or a decision tree.
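    As a hedged illustration of two of these strategies, the sketch below uses pandas on a made-up table: a global constant for a missing categorical value, and the class-wise attribute mean for a missing numeric value. Column names and values are invented.

```python
# Missing-value handling: global constant and class-wise mean imputation.
import pandas as pd

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "city":   ["Chennai", None, "Madurai", "Madurai", None],
    "income": [30000, None, 52000, 48000, None],
})

df["city"] = df["city"].fillna("unknown")                    # global constant
df["income"] = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean()))                            # class-wise mean
print(df)
```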


Noisy data: noise is random error or variance in a measured variable. Incorrect attribute values may be due to:

    faulty data collection instruments
    data entry problems
    data transmission problems
    technology limitations
    inconsistency in naming conventions

    Other data problems which require data cleaning:

    duplicate records
    incomplete data
    inconsistent data

    How to handle noisy data? (A short sketch of the binning method is given after this list.)

    Binning method: first sort the data and partition it into (equi-depth) bins, then smooth by bin means, bin medians, bin boundaries, etc.
    Clustering: detect and remove outliers.
    Combined computer and human inspection: detect suspicious values and have a human check them.
    Regression: smooth by fitting the data to regression functions.
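    The sketch below is a minimal interpretation of the binning method, assuming equal-frequency (equi-depth) bins and smoothing by bin means; the price values are illustrative.

```python
# Equi-depth binning and smoothing by bin means.
def smooth_by_bin_means(values, n_bins):
    ordered = sorted(values)
    depth = len(ordered) // n_bins
    smoothed = []
    for i in range(n_bins):
        start = i * depth
        end = start + depth if i < n_bins - 1 else len(ordered)
        bin_vals = ordered[start:end]
        mean = sum(bin_vals) / len(bin_vals)
        smoothed.extend([round(mean, 2)] * len(bin_vals))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]     # illustrative values
print(smooth_by_bin_means(prices, 3))
# -> [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```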

2.3.3 Data Enrichment: The represented data has to be enriched with various additional details beyond the base details that have been gathered. The data used for enrichment can be:

    Behavioral: purchases from related businesses (e.g. Air Miles), number of vehicles, travel frequency
    Demographic: e.g. age, gender, marital status, children, income level
    Psychographic: e.g. risk taker, conservative, cultured, hi-tech averse, creditworthy, trustworthy


2.4 Preliminary Analysis of the Data Set

    The gathered data set can be analysed for various purposes before proceeding with the KDD process. One of the most needed analyses is statistical analysis.

    Statistical Analysis

    Mean and Confidence Interval.

    Probably the most often used descriptive statistic is the mean. The mean is a particularly informative measure of the "central tendency" of the variable if it is reported along with its confidence intervals. Usually we are interested in statistics (such as the mean) from our sample data set only to the extent to which they can infer information about the population. The confidence intervals for the mean give us a range of values around the mean where we expect the "true" (population) mean is located .

    For example, if the mean in your sample is 23, and the lower and upper limits of the p=.05 confidence interval are 19 and 27 respectively, then you can conclude that there is a 95% probability that the population mean is greater than 19 and lower than 27. If you set the p-level to a smaller value, then the interval would become wider thereby increasing the "certainty" of the estimate, and vice versa; as we all know from the weather forecast, the more "vague" the prediction (i.e., wider the confidence interval), the more likely it will materialize. Note that the width of the confidence interval depends on the sample size and on the variation of data values. The larger the sample size, the more reliable its mean. The larger the variation, the less reliable the mean. The calculation of confidence intervals is based on the assumption that the variable is normally distributed in the population. The estimate may not be valid if this assumption is not met, unless the sample size is large, say n=100 or more.
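    As a small, hedged illustration (not part of the original text), the sketch below computes a sample mean and its 95% confidence interval with scipy on a made-up sample; the resulting interval will of course differ from the hypothetical 19 to 27 example quoted above.

```python
# Sample mean with a t-based 95% confidence interval.
import numpy as np
from scipy import stats

sample = np.array([21, 25, 19, 27, 23, 22, 26, 20, 24, 23])
mean = sample.mean()
sem = stats.sem(sample)                      # standard error of the mean
low, high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
print(f"mean = {mean:.1f}, 95% CI = ({low:.1f}, {high:.1f})")
```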

    Shape of the Distribution, Normality.

An important aspect of the "description" of a variable is the shape of its distribution, which tells you the frequency of values from different ranges of the variable. Typically, a researcher is interested in how well the distribution can be approximated by the normal distribution. Simple descriptive statistics can provide some information relevant to this issue. For example, if the skewness (which measures the deviation of the distribution from symmetry) is clearly different from 0, then that distribution is asymmetrical, while normal distributions are perfectly symmetrical. If the kurtosis (which measures "peakedness" of the distribution) is clearly different from 0, then the distribution is either flatter or more peaked than normal; the kurtosis of the normal distribution is 0.

More precise information can be obtained by performing one of the tests of normality to determine the probability that the sample came from a normally distributed population of observations (e.g., the so-called Kolmogorov-Smirnov test, or the Shapiro-Wilk W test).
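    As a short, hedged illustration, scipy provides the descriptive measures and the Shapiro-Wilk test mentioned above; the sample below is generated at random purely for demonstration.

```python
# Skewness, kurtosis and a normality test on a simulated sample.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=10, size=200)

print("skewness:", stats.skew(sample))        # ~0 for a symmetric sample
print("kurtosis:", stats.kurtosis(sample))    # ~0 for a normal sample (Fisher definition)
w, p = stats.shapiro(sample)
print("Shapiro-Wilk p-value:", p)             # large p: no evidence against normality
```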


    However, none of these tests can entirely substitute for a visual examination of the data using a histogram (i.e., a graph that shows the frequency distribution of a variable).

    The graph allows you to evaluate the normality of the empirical distribution because it also shows the normal curve superimposed over the histogram. It also allows you to examine various aspects of the distribution qualitatively. For example, the distribution could be bimodal (have 2 peaks). This might suggest that the sample is not homogeneous but possibly its elements came from two different populations, each more or less normally distributed. In such cases, in order to understand the nature of the variable in question, you should look for a way to quantitatively identify the two sub-samples.

    Correlations

    Purpose (What is Correlation?) Correlation is a measure of the relation between two or more variables. The measurement scales used should be at least interval scales, but other correlation coefficients are available to handle other types of data. Correlation coefficients can range from -1.00 to +1.00. The value of -1.00 represents a perfect negative correlation while a value of +1.00 represents a perfect positive correlation. A value of 0.00 represents a lack of correlation.


The most widely-used type of correlation coefficient is Pearson r, also called linear or product-moment correlation.

    Simple Linear Correlation (Pearson r). Pearson correlation (hereafter called correlation), assumes that the two variables are measured on at least interval scales (see Elementary Concepts), and it determines the extent to which values of the two variables are "proportional" to each other. The value of correlation (i.e., correlation coefficient) does not depend on the specific measurement units used; for example, the correlation between height and weight will be identical regardless of whether inches and pounds, or centimeters and kilograms are used as measurement units. Proportional means linearly related; that is, the correlation is high if it can be "summarized" by a straight line (sloped upwards or downwards).


    This line is called the regression line or least squares line, because it is determined such that the sum of the squared distances of all the data points from the line is the lowest possible. Note that the concept of squared distances will have important functional consequences on how the value of the correlation coefficient reacts to various specific arrangements of data (as we will later see).

    How to Interpret the Values of Correlations. As mentioned before, the correlation coefficient (r) represents the linear relationship between two variables. If the correlation coefficient is squared, then the resulting value (r2, the coefficient of determination) will represent the proportion of common variation in the two variables (i.e., the "strength" or "magnitude" of the relationship). In order to evaluate the correlation between variables, it is important to know this "magnitude" or "strength" as well as the significance of the correlation.
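    As a small, hedged illustration (not part of the original text), the sketch below computes Pearson r and r^2 for two made-up variables with scipy; the height and weight numbers are invented.

```python
# Pearson correlation and the coefficient of determination.
from scipy import stats

height = [150, 160, 165, 170, 175, 180, 185]
weight = [52, 58, 63, 66, 72, 78, 83]

r, p_value = stats.pearsonr(height, weight)
print(f"r = {r:.3f}, r^2 = {r*r:.3f}, p = {p_value:.4f}")
# r is unchanged if the same data are expressed in other units (cm/kg, in/lb).
```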

Significance of Correlations. The significance level calculated for each correlation is a primary source of information about the reliability of the correlation. As explained before (see Elementary Concepts), the significance of a correlation coefficient of a particular magnitude will change depending on the size of the sample from which it was computed. The test of significance is based on the assumption that the distribution of the residual values (i.e., the deviations from the regression line) for the dependent variable y follows the normal distribution, and that the variability of the residual values is the same for all values of the independent variable x. However, Monte Carlo studies suggest that meeting those assumptions closely is not absolutely crucial if your sample size is not very small and when the departure from normality is not very large. It is impossible to formulate precise recommendations based on those Monte Carlo results, but many researchers follow a rule of thumb that if your sample size is 50 or more then serious biases are unlikely, and if your sample size is over 100 then you should not be concerned at all with the normality assumptions. There are, however, much more common and serious threats to the validity of information that a correlation coefficient can provide; they are briefly discussed in the following paragraphs.

    Outliers.

    Outliers are atypical (by definition), infrequent observations. Because of the way in which the regression line is determined (especially the fact that it is based on minimizing not the sum of simple distances but the sum of squares of distances of data points from the line), outliers have a profound influence on the slope of the regression line and consequently on the value of the correlation coefficient. A single outlier is capable of considerably changing the slope of the regression line and, consequently, the value of the correlation, as demonstrated in the following example. Note, that as shown on that illustration, just one outlier can be entirely responsible for a high value of the correlation that otherwise (without the outlier) would be close to zero. Needless to say, one should never base important conclusions on the value of the correlation coefficient alone (i.e., examining the respective scatterplot is always recommended).


    Note that if the sample size is relatively small, then including or excluding specific data points that are not as clearly "outliers" as the one shown in the previous example may have a profound influence on the regression line (and the correlation coefficient). This is illustrated in the following example where we call the points being excluded "outliers;" one may argue, however, that they are not outliers but rather extreme values.

    Typically, we believe that outliers represent a random error that we would like to be able to control. Unfortunately, there is no widely accepted method to remove outliers automatically (however, see the next paragraph), thus what we are left with is to identify any outliers by examining a scatter plot of each important correlation. Needless to say, outliers may not only artificially increase the value of a correlation coefficient, but they can also decrease the value of a "legitimate" correlation.

    t-test for independent samples

Purpose, Assumptions. The t-test is the most commonly used method to evaluate the differences in means between two groups. For example, the t-test can be used to test for a difference in test scores between a group of patients who were given a drug and a control group who received a placebo. Theoretically, the t-test can be used even if the sample sizes are very small (e.g., as small as 10; some researchers claim that even smaller n's are possible), as long as the variables are normally distributed within each group and the variation of scores in the two groups is not reliably different (see also Elementary Concepts). As mentioned before, the normality assumption can be evaluated by looking at the distribution of the data (via histograms) or by performing a normality test. The equality of variances assumption can be verified with the F test, or you can use the more robust Levene's test. If these conditions are not met, then you can evaluate the differences in means between two groups using one of the nonparametric alternatives to the t-test (see Nonparametric and Distribution Fitting).

    The p-level reported with a t-test represents the probability of error involved in accepting our research hypothesis about the existence of a difference. Technically speaking, this is the probability of error associated with rejecting the hypothesis of no difference between the two categories of observations (corresponding to the groups) in the population when, in fact, the hypothesis is true. Some researchers suggest that if the difference is in the predicted direction, you can consider only one half (one "tail") of the probability distribution and thus divide the standard p-level reported with a t-test (a "two-tailed" probability) by two. Others, however, suggest that you should always report the standard, two-tailed t-test probability.

    Arrangement of Data. In order to perform the t-test for independent samples, one independent (grouping) variable (e.g., Gender: male/female) and at least one dependent variable (e.g., a test score) are required. The means of the dependent variable will be compared between selected groups based on the specified values (e.g., male and female) of the independent variable. The following data set can be analyzed with a t-test comparing the average WCC score in males and females.

            GENDER   WCC
    case 1  male     111
    case 2  male     110
    case 3  male     109
    case 4  female   102
    case 5  female   104

    mean WCC in males = 110
    mean WCC in females = 103
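    Purely as an illustration, the same five WCC values can be fed to scipy's independent-samples t-test; with such a tiny sample this only demonstrates the mechanics, not a meaningful analysis.

```python
# Independent-samples t-test on the WCC example values.
from scipy import stats

wcc_males = [111, 110, 109]
wcc_females = [102, 104]

t_stat, p_value = stats.ttest_ind(wcc_males, wcc_females)
print(f"mean males = {sum(wcc_males)/len(wcc_males):.0f}, "
      f"mean females = {sum(wcc_females)/len(wcc_females):.0f}")
print(f"t = {t_stat:.2f}, two-tailed p = {p_value:.4f}")
# Levene's test for the equal-variance assumption is also available: stats.levene(...)
```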

t-test graphs. In the t-test analysis, comparisons of means and measures of variation in the two groups can be visualized in box and whisker plots.


    These graphs help you to quickly evaluate and "intuitively visualize" the strength of the relation between the grouping and the dependent variable.

    Breakdown: Descriptive Statistics by Groups

    Purpose. The breakdowns analysis calculates descriptive statistics and correlations for dependent variables in each of a number of groups defined by one or more grouping (independent) variables.

    Arrangement of Data. In the following example data set (spreadsheet), the dependent variable WCC (White Cell Count) can be broken down by 2 independent variables: Gender (values: males and females), and Height (values: tall and short).

            GENDER   HEIGHT   WCC
    case 1  male     short    101
    case 2  male     tall     110
    case 3  male     tall      92
    case 4  female   tall     112
    case 5  female   short     95
    ...     ...      ...      ...

    The resulting breakdowns might look as follows (we are assuming that Gender was specified as the first independent variable, and Height as the second).

Entire sample:   Mean=100  SD=13  N=120

    Males:           Mean=99   SD=13  N=60
    Females:         Mean=101  SD=13  N=60

    Tall/males:      Mean=98   SD=13  N=30
    Short/males:     Mean=100  SD=13  N=30
    Tall/females:    Mean=101  SD=13  N=30
    Short/females:   Mean=101  SD=13  N=30

    The composition of the "intermediate" level cells of the "breakdown tree" depends on the order in which independent variables are arranged. For example, in the above example, you see the means for "all males" and "all females" but you do not see the means for "all tall subjects" and "all short subjects" which would have been produced had you specified independent variable Height as the first grouping variable rather than the second.

    Statistical Tests in Breakdowns. Breakdowns are typically used as an exploratory data analysis technique; the typical question that this technique can help answer is very simple: Are the groups created by the independent variables different regarding the dependent variable? If you are interested in differences concerning the means, then the appropriate test is the breakdowns one-way ANOVA (F test). If you are interested in variation differences, then you should test for homogeneity of variances.

    Other Related Data Analysis Techniques. Although for exploratory data analysis, breakdowns can use more than one independent variable, the statistical procedures in breakdowns assume the existence of a single grouping factor (even if, in fact, the breakdown results from a combination of a number of grouping variables). Thus, those statistics do not reveal or even take into account any possible interactions between grouping variables in the design. For example, there could be differences between the influence of one independent variable on the dependent variable at different levels of another independent variable (e.g., tall people could have lower WCC than short ones, but only if they are males; see the "tree" data above). You can explore such effects by examining breakdowns "visually," using different orders of independent variables, but the magnitude or significance of such effects cannot be estimated by the breakdown statistics.

    Frequency tables

    Purpose. Frequency or one-way tables represent the simplest method for analyzing categorical (nominal) data (refer to Elementary Concepts). They are often used as one of the exploratory procedures to review how different categories of values are distributed in the sample. For example, in a survey of spectator interest in different sports, we could summarize the respondents' interest in watching football in a frequency table as follows:

    STATISTICA BASIC STATS

    FOOTBALL: "Watching football"

    Category                          Count   Cumulatv   Percent     Cumulatv
                                              Count                  Percent
    ALWAYS  : Always interested         39       39      39.00000     39.0000
    USUALLY : Usually interested        16       55      16.00000     55.0000
    SOMETIMS: Sometimes interested      26       81      26.00000     81.0000
    NEVER   : Never interested          19      100      19.00000    100.0000
    Missing                              0      100       0.00000    100.0000

    The table above shows the number, proportion, and cumulative proportion of respondents who characterized their interest in watching football as either (1) Always interested, (2) Usually interested, (3) Sometimes interested, or (4) Never interested

    Applications. In practically every research project, a first "look" at the data usually includes frequency tables. For example, in survey research, frequency tables can show the number of males and females who participated in the survey, the number of respondents from particular ethnic and racial backgrounds, and so on. Responses on some labeled attitude measurement scales (e.g., interest in watching football) can also be nicely summarized via the frequency table. In medical research, one may tabulate the number of patients displaying specific symptoms; in industrial research one may tabulate the frequency of different causes leading to catastrophic failure of products during stress tests (e.g., which parts are actually responsible for the complete malfunction of television sets under extreme temperatures?). Customarily, if a data set includes any categorical data, then one of the first steps in the data analysis is to compute a frequency table for those categorical variables.
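As a small illustration, a one-way frequency table with counts, cumulative counts and percentages can be computed with pandas (one possible tool); the category labels follow the football example above.

import pandas as pd

answers = pd.Series(
    ["ALWAYS"] * 39 + ["USUALLY"] * 16 + ["SOMETIMS"] * 26 + ["NEVER"] * 19
)

counts = answers.value_counts().reindex(["ALWAYS", "USUALLY", "SOMETIMS", "NEVER"])
table = pd.DataFrame({
    "Count": counts,
    "Cumulative Count": counts.cumsum(),
    "Percent": 100 * counts / counts.sum(),
    "Cumulative Percent": (100 * counts / counts.sum()).cumsum(),
})
print(table)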

    Tools for this analysis

    To carry out this kind of statistical analysis there are various tools, such as SPSS and Microsoft Excel. One can use these tools for a preliminary analysis of the data selected for KDD. Some tools:

    Microsoft Excel: The Analysis ToolPak is a tool in Microsoft Excel for performing basic statistical procedures. Microsoft Excel is spreadsheet software used to store information in columns and rows, which can then be organized and/or processed. In addition to the basic spreadsheet functions, the Analysis ToolPak contains procedures such as ANOVA, correlations, descriptive statistics, histograms, percentiles, regression, and t-tests, so it can be used to obtain basic descriptive statistics and to perform an ANOVA, a t-test, and a linear regression. The primary reason to use Excel for statistical data analysis is that it is so widely available. The Analysis ToolPak is an add-in that can be installed for free if you have the installation disk for Microsoft Office; it is also publicly available.

    SPSS


    SPSS is among the most widely used programs for statistical analysis in social science. It is used by market researchers, health researchers, survey companies, government, education researchers, marketing organizations and others. In addition to statistical analysis, data management (case selection, file reshaping, creating derived data) and data documentation (a metadata dictionary is stored with the data) are features of the base software.

    Statistics included in the base software:

    Descriptive statistics: Cross tabulation, Frequencies, Descriptives, Explore, Descriptive Ratio Statistics

    Bivariate statistics: Means, t-test, ANOVA, Correlation (bivariate, partial, distances), Nonparametric tests

    Prediction for numerical outcomes: Linear regression

    Prediction for identifying groups: Factor analysis, cluster analysis (two-step, K-means, hierarchical), Discriminant analysis

    2.5 Visualization techniques

    The human mind has boundless potential, and humans have been exploring ways to use it for thousands of years. The technique of visualization can help you acquire new knowledge and skills more quickly than conventional techniques.

    The amount of data stored on electronic media is growing exponentially fast, and making sense of such data is becoming harder and more challenging. Online retailing in the Internet age, for example, is very different from retailing a decade ago, because the three most important factors of the past (location, location, and location) are irrelevant for online stores. One of the greatest challenges we face today is making sense of all this data. Data mining, or knowledge discovery, is the process of identifying new patterns and insights in data, whether it is for understanding the Human Genome to develop new drugs, for discovering new patterns in recent Census data to warn about hidden trends, or for understanding your customers better at an electronic web store in order to provide a personalized one-to-one experience.

    Data mining, sometimes referred to as knowledge discovery, lies at the intersection of multiple research areas, including Machine Learning, Statistics, Pattern Recognition, Databases and Visualization. Good marketing and business-oriented data mining books are also available. With the maturity of databases and constant improvements in computational speed, data mining algorithms that were once too expensive to execute are now within reach. Data mining serves two goals:


    1. Insight: identify patterns and trends that are comprehensible, so that action can be taken based on the insight. For example, characterize the heavy spenders on a web site, or the people that buy product X. By understanding the underlying patterns, the web site can be personalized and improved. The insight may also lead to decisions that affect other channels, such as brick-and-mortar stores: placement of products, marketing efforts, and cross-sells.

    2. Prediction: a model is built that predicts (or scores) based on input data. For example, a model can be built to predict the propensity of customers to buy product X based on their demographic data and browsing patterns on a web site. Customers with high scores can be targeted in a direct marketing campaign. If the prediction is for a discrete variable with a few values (e.g., buy product X or not), the task is called classification; if the prediction is for a continuous variable (e.g., customer spending in the next year), the task is called regression.

    The majority of research in data mining has concentrated on building the best models for prediction. Part of the reason, no doubt, is that a prediction task is well defined and can be objectively measured on an independent test set. Given a dataset that is labeled with the correct predictions, it is split into a training set and a test set. A learning algorithm is given the training set and produces a model that can map new, unseen data into a prediction. The model can then be evaluated for its accuracy in making predictions on the unseen test set.

    Descriptive data mining, which yields human insight, is harder to evaluate, yet necessary in many domains because the users may not trust predictions coming out of a black box, or because legally one must explain the predictions. For example, even if a Perceptron algorithm [20] outperforms a loan officer in predicting who will default on a loan, the person requesting a loan cannot be rejected simply because he is on the wrong side of a 37-dimensional hyperplane; legally, the loan officer must explain the reason for the rejection.

    The choice of a predictive model can have a profound influence on the resulting accuracy and on the ability of humans to gain insight from it. Some models are naturally easier to understand than others. For example, a model consisting of if-then rules is easy to understand, unless the number of rules is too large. Decision trees are also relatively easy to understand. Linear models get a little harder, especially if discrete inputs are used. Nearest-neighbor algorithms in high dimensions are almost impossible for users to understand, and non-linear models in high dimensions, such as neural networks, are the most opaque.

    One way to aid users in understanding the models is to visualize them. MineSet, for example, is a data mining tool that integrates data mining and visualization very tightly. Models built can be viewed and interacted with. Figure 1 shows a visualization of the Naive-Bayes classifier. Given a target value, which in this case was who earns over $50,000 in the US working population, the visualization shows a small set of "important" attributes (measured using mutual information or cross-entropy). For each attribute, a bar chart shows how much "evidence" each value (or range of values) of that attribute provides for the target label. For example, higher education levels (right bars in the education row) imply higher salaries because the bars are higher. Similarly, salary increases with age up to a point and then decreases, and salary increases with the number of hours worked per week. The combination of a back-end algorithm that bins the data, computes the importance of hundreds of attributes, and then a visualization that shows the important attributes visually, makes this a very useful tool that helps identify patterns. Users can interact with the model by clicking on attribute values and seeing the predictions that the model makes.

    Figure 1: A visualization of the Naive-Bayes classifier
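The following is a minimal, illustrative sketch - using scikit-learn rather than MineSet - of the two ideas just described: a model is fitted on a training set and evaluated on an unseen test set, and the attributes are ranked by their mutual information with the target. The income data and column names are purely hypothetical.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import mutual_info_classif

people = pd.DataFrame({
    "age":             [25, 38, 49, 52, 29, 61, 44, 33, 58, 27, 41, 36],
    "education_years": [12, 16, 18, 12, 14, 10, 16, 18, 12, 12, 16, 14],
    "hours_per_week":  [35, 45, 50, 40, 38, 30, 55, 48, 37, 36, 44, 42],
    "earns_over_50k":  [0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1],
})

X = people[["age", "education_years", "hours_per_week"]]
y = people["earns_over_50k"]

# Split into training and test sets, fit, then score on unseen data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)
model = GaussianNB().fit(X_train, y_train)
print("test-set accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Rank the attributes by mutual information with the target label.
scores = mutual_info_classif(X, y, random_state=0)
print(dict(zip(X.columns, scores.round(3))))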

    Examples of Visualization tools

    Miner3D
    Create engaging data visualizations and live data-driven graphics! Miner3D delivers new data insights by allowing you to visually spot trends, clusters, patterns or outliers.

    A. Unsupervised Visual Data Clustering - Kohonen's Self-Organizing Maps
    Miner3D now includes a visual implementation of Self-Organizing Maps. Users looking for an unattended data clustering tool will find this module surprisingly powerful.


    Users looking for an unattended, unsupervised data clustering tool capable of generating convincing results will recognize the strong data analysis potential of Kohonen's Self-Organizing Maps (SOMs). Kohonen maps are a tool for arranging data points into a manageable 2D or 3D space in a way that preserves closeness. Also known as self-organizing maps (SOM), Kohonen maps are biologically inspired: the SOM computational mechanism reflects how many scientists think the human brain organizes many-faceted concepts into its 3D structure.

    The SOM algorithm lays out a 2D grid of "neuronal units" and assigns each data point to the unit that will "recognize" it. The assignment is made in such a way that neighboring units recognize similar data. The result of applying a Kohonen map to a data set is a 2D plot (Miner3D can also support 3D Kohonen maps). In this plot, data points (rows) that are similar in the chosen set of attributes are grouped close together, while dissimilar rows are separated by a greater distance in the plot space. This allows you, the user, to tease out salient data patterns. Self-Organizing Maps have long been the data clustering method sought by many people from different areas of business and science, and this enhancement of Miner3D's already powerful set of data analysis tools further broadens its application portfolio.
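As a rough illustration of the mechanism (not Miner3D's implementation), the sketch below trains a small SOM in plain NumPy: each row is assigned to its best-matching unit, and that unit and its grid neighbours are pulled towards the row, so that neighbouring units end up recognizing similar data.

import numpy as np

def train_som(data, grid=(5, 5), epochs=50, lr0=0.5, sigma0=2.0, seed=0):
    rng = np.random.default_rng(seed)
    rows, cols = grid
    # One weight vector per grid unit, initialised randomly.
    weights = rng.random((rows, cols, data.shape[1]))
    # Grid coordinates, used to compute neighbourhood distances.
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                  indexing="ij"), axis=-1)
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.permutation(data):
            # Decaying learning rate and neighbourhood radius.
            frac = step / n_steps
            lr = lr0 * (1 - frac)
            sigma = sigma0 * (1 - frac) + 0.5
            # Best-matching unit: the unit whose weights are closest to x.
            dist = np.linalg.norm(weights - x, axis=-1)
            bmu = np.unravel_index(np.argmin(dist), dist.shape)
            # Pull the BMU and its grid neighbours towards x.
            grid_dist = np.linalg.norm(coords - np.array(bmu), axis=-1)
            influence = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))
            weights += lr * influence[..., None] * (x - weights)
            step += 1
    return weights

# Toy data: two well-separated clusters; after training, similar rows
# map to neighbouring units of the grid.
data = np.vstack([np.random.default_rng(1).normal(0.2, 0.05, (20, 3)),
                  np.random.default_rng(2).normal(0.8, 0.05, (20, 3))])
som = train_som(data)
print("trained SOM weight grid shape:", som.shape)   # (5, 5, 3)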

    B. K-Means clustering
    A powerful K-Means clustering method can be used to visually cluster data sets and for data set reduction. Cluster analysis is a set of mathematical techniques for partitioning a series of data objects into a smaller number of groups, or clusters, so that the data objects within one cluster are more similar to each other than to those in other clusters. Miner3D provides the popular K-means method of clustering. K-Means Clustering and K-Means Data Reduction give you more power and more options to process large data sets: K-means can be used either for clustering data sets visually in 3D or for row reduction and compression of large data sets. Miner3D's implementation of K-Means uses a high-performance proprietary scheme based on filtering algorithms and multidimensional binary search trees.


    K-means clustering is only available in the Miner3D Enterprise and Miner3D Developer packages.
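A minimal K-means sketch using scikit-learn (a stand-in for the proprietary Miner3D implementation); the cluster labels can then be used to colour points in a 2D/3D plot or to reduce a large data set to its cluster centres.

import numpy as np
from sklearn.cluster import KMeans

# Three synthetic groups of points in 2D.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal((0, 0), 1, (100, 2)),
                    rng.normal((6, 6), 1, (100, 2)),
                    rng.normal((0, 6), 1, (100, 2))])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print("cluster sizes:", np.bincount(kmeans.labels_))
print("cluster centres:\n", kmeans.cluster_centers_)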

    2.6 OLAP (or Online Analytical Processing)

    OLAP (or Online Analytical Processing) has been growing in popularity due to the increase in data volumes and the recognition of the business value of analytics. Until the mid-nineties, performing OLAP analysis was an extremely costly process, mainly restricted to larger organizations.

    The major OLAP vendors are Hyperion, Cognos, Business Objects and MicroStrategy. The cost per seat was in the range of $1,500 to $5,000 per annum, and setting up the environment to perform OLAP analysis also required substantial investments in time and money.

    This has changed as the major database vendors have started to incorporate OLAP modules within their database offerings - Microsoft SQL Server 2000 with Analysis Services, Oracle with Express and Darwin, and IBM with DB2.

    What is OLAP?

    OLAP allows business users to slice and dice data at will. Normally, data in an organization is distributed across multiple data sources that are incompatible with each other. A retail example: point-of-sale data and sales made via the call center or the Web are stored in different locations and formats. It would be a time-consuming process for an executive to obtain OLAP reports such as: What are the most popular products purchased by customers between the ages of 15 and 30?

    Part of the OLAP implementation process involves extracting data from the various data repositories and making them compatible. Making data compatible involves ensuring that the meaning of the data in one repository matches all other repositories. An example of incompatible data: customer ages can be stored as a birth date for purchases made over the web, but as age categories (e.g., between 15 and 30) for in-store sales.
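As a small, hypothetical illustration of making such data compatible, birth dates collected on the web can be mapped onto the same age categories used for in-store sales. This pandas sketch assumes a fixed reference date and example category boundaries.

import pandas as pd

web_sales = pd.DataFrame({"birth_date": ["1995-04-02", "1980-11-20", "2002-06-15"]})

# Approximate age at a chosen reference date, then bin into categories.
ages = (pd.Timestamp("2010-01-01")
        - pd.to_datetime(web_sales["birth_date"])).dt.days // 365
web_sales["age_category"] = pd.cut(
    ages, bins=[0, 14, 30, 45, 120],
    labels=["under 15", "15-30", "31-45", "over 45"],
)
print(web_sales)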

    It is not always necessary to create a data warehouse for OLAP analysis. Data stored by operational systems, such as point-of-sale, are held in types of databases called OLTPs. OLTP (Online Transaction Processing) databases do not differ from other databases from a structural perspective; the main, and only, difference is the way in which data is stored.

    Examples of OLTPs include ERP, CRM, SCM, point-of-sale, and call center applications.

    OLTPs are designed for optimal transaction speed. When a consumer makes a purchase online, they expect the transaction to occur instantaneously. With a database design (called data modeling) optimized for transactions, the record 'Consumer Name, Address, Telephone, Order Number, Order Name, Price, Payment Method' is created quickly in the database, and the results can be recalled by managers equally quickly if needed.

    Data Model for OLTP

    Data are not typically stored for an extended period on OLTPs for storage cost and transaction speed reasons.


    OLAPs have a different mandate from OLTPs. OLAPs are designed to give an overview analysis of what happened. Hence the data storage (i.e. data modeling) has to be set up differently. The most common method is called the star design.

    Star Data Model for OLAP

    The central table in an OLAP star data model is called the fact table. The surrounding tables are called the dimensions. Using the above data model, it is possible to build reports that answer questions such as:

    The supervisor that gave the most discounts.
    The quantity shipped on a particular date, month, year or quarter.
    In which zip code did product A sell the most.

    To obtain answers, such as the ones above, from a data model OLAP cubes are created. OLAP cubes are not strictly cuboids - it is the name given to the process of linking data from the different dimensions. The cubes can be developed along business units such as sales or marketing. Or a giant cube can be formed with all the dimensions.
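As a rough illustration of the cube idea (independent of any particular OLAP product), a small fact table can be aggregated along hypothetical Time, Location and Product dimensions with pandas; slicing and dicing then amounts to choosing which dimensions to put on the rows and columns.

import pandas as pd

sales_facts = pd.DataFrame({
    "quarter":  ["Q1", "Q1", "Q2", "Q2", "Q2", "Q3"],
    "zip_code": ["60601", "60601", "10001", "10001", "94105", "94105"],
    "product":  ["A", "B", "A", "A", "B", "A"],
    "quantity": [10, 4, 7, 3, 8, 5],
})

# Slice and dice: total quantity by product and quarter...
cube = pd.pivot_table(sales_facts, values="quantity",
                      index="product", columns="quarter",
                      aggfunc="sum", fill_value=0)
print(cube)

# ...and the zip code in which product A sold the most.
product_a = sales_facts[sales_facts["product"] == "A"]
print(product_a.groupby("zip_code")["quantity"].sum().idxmax())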


    OLAP Cube with Time, Customer and Product Dimensions

    OLAP can be a valuable and rewarding business tool. Aside from producing reports, OLAP analysis can help an organization evaluate balanced scorecard targets.

    Steps in the OLAP Creation Process

    OLAP (Online Analytical Processing) Tools

    OLAP (online analytical processing) is a function of business intelligence software that enables a user to easily and selectively extract and view data from different points of view. Designed for managers looking to make sense of their information, OLAP tools structure data hierarchically - the way managers think of their enterprises - but also allow business analysts to rotate that data, changing the relationships to get more detailed insight into corporate information.

    Examples of OLAP tools

    1. WebFOCUS

    WebFOCUS OLAP combines all the functionality of query tools, reporting tools, and OLAP into a single powerful solution with one common interface, so business analysts can slice and dice the data and see business processes in a new way. WebFOCUS makes data part of an organization's natural culture by giving developers premier design environments for automated ad hoc and parameter-driven reporting, and by giving everyone else the ability to receive and retrieve data in any format, performing analysis with whatever device or application is part of their daily working life.

    WebFOCUS ad hoc reporting and OLAP features allow users to slice and dice data in an almost unlimited number of ways. Satisfying the broadest range of analytical needs, business intelligence application developers can easily enhance reports with extensive data-analysis functionality so that end users can dynamically interact with the information. WebFOCUS also supports the real-time creation of Excel spreadsheets and Excel PivotTables with full styling, drill-downs, and formula capabilities so that Excel power users can analyze their corporate data in a tool with which they are already familiar.


    2. PivotCubeX

    PivotCubeX is a visual ActiveX control for OLAP analysis and reporting. You can use it to load data from huge relational databases, look for information or details, and create summaries and reports that help the end user make accurate decisions. It provides a highly dynamic interface for interactive data analysis.

    3. OlapCube

    OlapCube is a simple, yet powerful tool to analyze data. OlapCube will let you create local cubes (files with .cub extension) from data stored in any relational database (including MySQL, PostgreSQL, Microsoft Access, SQL Server, SQL Server Express, Oracle, Oracle Express). You can explore the resulting cube with our OlapCube Reader. Or you can use Microsoft Excel to create rich and customized reports.

    2.7 Decision Trees

    What is a Decision Tree?


    A decision tree is a predictive model that, as its name implies, can be viewed as a tree. Specifically, each branch of the tree is a classification question, and the leaves of the tree are partitions of the dataset with their classification. For instance, if we were going to classify customers who churn (don't renew their phone contracts) in the cellular telephone industry, a decision tree might look something like that found in Figure 2.1.

    Figure 2.1: A decision tree is a predictive model that makes a prediction on the basis of a series of decisions, much like the game of 20 questions.

    You may notice some interesting things about the tree:

    It divides up the data on each branch point without losing any of the data (the number of total records in a given parent node is equal to the sum of the records contained in its two children).

    The number of churners and non-churners is conserved as you move up or down the tree

    It is pretty easy to understand how the model is being built (in contrast to the models from neural networks or from standard statistics).

    It would also be pretty easy to use this model if you actually had to target those customers that are likely to churn with a targeted marketing offer.

    You may also build some intuitions about your customer base. E.g., customers who have been with you for a couple of years and have up-to-date cellular phones are pretty loyal.
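As an illustration (not the exact tree of Figure 2.1), a small churn tree can be grown with scikit-learn and printed as readable if-then splits; the customer attributes and values below are hypothetical.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

customers = pd.DataFrame({
    "years_as_customer": [0.5, 0.8, 1.0, 2.5, 3.0, 4.2, 0.6, 2.8, 3.5, 1.1],
    "phone_age_years":   [2.5, 3.0, 2.8, 0.5, 0.8, 1.0, 2.2, 0.6, 1.2, 2.9],
    "churned":           [1, 1, 1, 0, 0, 0, 1, 0, 0, 1],
})

X = customers[["years_as_customer", "phone_age_years"]]
y = customers["churned"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each branch is a question; each leaf is a segment with its class.
print(export_text(tree, feature_names=list(X.columns)))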

    Viewing decision trees as segmentation with a purpose

    From a business perspective, decision trees can be viewed as creating a segmentation of the original dataset (each segment would be one of the leaves of the tree). Segmentation of customers, products, and sales regions is something that marketing managers have been doing for many years. In the past this segmentation was performed to get a high-level view of a large amount of data, with no particular reason for creating the segmentation except that the records within each segment were somewhat similar to each other. Here, the segmentation is done for a particular reason - namely, the prediction of some important piece of information. The records that fall within each segment do so because they are similar with respect to the information being predicted, not merely "similar" in some ill-defined sense. These predictive segments derived from the decision tree also come with a description of the characteristics that define them. Thus, although the decision trees and the algorithms that create them may be complex, the results can be presented in an easy-to-understand way that can be quite useful to the business user.

    Applying decision trees to Business

    Because of their tree structure and their ability to easily generate rules, decision trees are the favored technique for building understandable models. Because of this clarity, they also allow more complex profit and ROI models to be layered easily on top of the predictive model. For instance, once a customer population with a high predicted likelihood to attrite is found, a variety of cost models can be used to decide whether an expensive marketing intervention should be used because the customers are highly valuable, or a less expensive intervention should be used because the revenue from this sub-population of customers is marginal. Because of their high level of automation and the ease of translating decision tree models into SQL for deployment in relational databases, the technology has also proven easy to integrate with existing IT processes, requiring little preprocessing and cleansing of the data, or extraction of a special-purpose file specifically for data mining.

    Where can decision trees be used?

    Decision trees are a data mining technology that has been around, in a form very similar to today's, for almost twenty years, and early versions of the algorithms date back to the 1960s. Often these techniques were originally developed to help statisticians automate the process of determining which fields in their database were actually useful or correlated with the particular problem they were trying to understand. Partially because of this history, decision tree algorithms tend to automate the entire process of hypothesis generation and validation much more completely, and in a much more integrated way, than any other data mining technique. They are also particularly adept at handling raw data with little or no pre-processing. Perhaps also because they were originally developed to mimic the way an analyst interactively performs data mining, they provide a simple-to-understand predictive model based on rules (such as "90% of the time, credit card customers of less than 3 months who max out their credit limit are going to default on their credit card loan").

    Because decision trees score so highly on so many of the critical features of data mining, they can be used in a wide variety of business problems, both for exploration and for prediction. They have been used for problems ranging from credit card attrition prediction to time series prediction of the exchange rates of different international currencies. There are also some problems where decision trees will not do as well. Some very simple problems, where the prediction is just a simple multiple of the predictor, can be solved much more quickly and easily by linear regression. Usually, however, the models to be built and the interactions to be detected are much more complex in real-world problems, and this is where decision trees excel.

    Using decision trees for Exploration

    Decision tree technology can be used for exploration of the dataset and the business problem. This is often done by looking at the predictors and values that are chosen for each split of the tree. Often these predictors provide usable insights or propose questions that need to be answered. For instance, if you ran across the following rule in your database for cellular phone churn, you might seriously wonder about the way your telesales operators were making their calls, and perhaps change the way they are compensated: IF customer lifetime < 1.1 years AND sales channel = telesales THEN chance of churn is 65%.

    Using decision trees for Data Preprocessing

    Another way that decision tree technology has been used is for preprocessing data for other prediction algorithms. Because the algorithm is fairly robust with respect to a variety of predictor types (e.g., numeric, categorical, etc.) and because it can be run relatively quickly, decision trees can be used on a first pass of a data mining run to create a subset of possibly useful predictors that can then be fed into neural networks, nearest-neighbor and standard statistical routines - which can take a considerable amount of time to run if there are large numbers of possible predictors to be used in the model.

    Decision trees for Prediction

    Some forms of decision trees were initially developed as exploratory tools to refine and preprocess data for more standard statistical techniques like logistic regression, but they have also been used - and are increasingly being used - for prediction. This is interesting because many statisticians still use decision trees for exploratory analysis, effectively building a predictive model as a by-product, but then ignore that model in favor of techniques they are most comfortable with. Sometimes veteran analysts will do this even when the decision tree model is superior to the one produced by other techniques. With a host of new products and skilled users now appearing, this tendency to use decision trees only for exploration seems to be changing.

    The first step is Growing the Tree

    The first step in the process is that of growing the tree. Specifically, the algorithm seeks to create a tree that works as perfectly as possible on all the data that is available. Most of the time it is not possible to have the algorithm work perfectly: there is always some noise in the database (there are variables that are not being collected that have an impact on the target you are trying to predict). The name of the game in growing the tree is finding the best possible question to ask at each branch point, so that at the bottom of the tree you end up with nodes that are, as far as possible, all of one type or the other. Thus the question "Are you over 40?" probably does not sufficiently distinguish between those who are churners and those who are not - let's say it splits them 40%/60%. On the other hand, there may be a series of questions that does quite a nice job of distinguishing those cellular phone customers who will churn from those who won't. Maybe the series of questions would be something like: Have you been a customer for less than a year, do you have a telephone that is more than two years old, and were you originally landed as a customer via telesales rather than direct sales? This series of questions defines a segment of the customer population in which 90% churn. These are then relevant questions to be asking in relation to predicting churn.

    The difference between a good question and a bad question

    The difference between a good question and a bad question has to do with how much the question can organize the data - or, in this case, change the likelihood of a churner appearing in the customer segment. If we started off with a population that is half churners and half non-churners, then a question that didn't organize the data to some degree - into one segment more likely to churn than the other - wouldn't be a very useful question to ask. On the other hand, if we asked a question that was very good at distinguishing between churners and non-churners - say, one that split 100 customers into one segment of 50 churners and another segment of 50 non-churners - then this would be considered a good question: it has decreased the disorder of the original segment as much as possible. The process in decision tree algorithms is very similar when they build trees. These algorithms look at all possible distinguishing questions that could break up the original training dataset into segments that are nearly homogeneous with respect to the different classes being predicted. Some decision tree algorithms may use heuristics to pick the questions, or even pick them at random. CART picks the questions in a very unsophisticated way: it tries them all. After it has tried them all, CART picks the best one, uses it to split the data into two more organized segments, and then again asks all possible questions on each of those new segments individually.
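As a rough illustration of how CART-style algorithms score candidate questions, the sketch below computes the reduction in Gini impurity for the two situations just described: a question that barely reorganizes the data versus one that separates the 50 churners from the 50 non-churners completely. The counts are illustrative.

def gini(churners, non_churners):
    # Impurity of a segment: 0 when it is all one class, 0.5 at 50/50.
    n = churners + non_churners
    if n == 0:
        return 0.0
    p = churners / n
    return 1.0 - p ** 2 - (1 - p) ** 2

def split_quality(parent, left, right):
    # Weighted impurity of the children, subtracted from the parent's.
    n = parent[0] + parent[1]
    weighted = (sum(left) / n) * gini(*left) + (sum(right) / n) * gini(*right)
    return gini(*parent) - weighted

parent = (50, 50)                                   # 50 churners, 50 non-churners

print(split_quality(parent, (22, 28), (28, 22)))    # weak question: ~0.007
print(split_quality(parent, (50, 0), (0, 50)))      # perfect separation: 0.5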

    When does the tree stop growing?

    If the decision tree algorithm just continued growing the tree like this, it could conceivably create more and more questions and branches until eventually each segment contained only one record. Letting the tree grow to this size is both computationally expensive and unnecessary. Most decision tree algorithms stop growing the tree when one of three criteria is met:

    The segment contains only one record. (There is no further question that you could ask which could further refine a segment of just one.)

    All the records