
Introduction to Informatics - Fall 02

Alternative methods of accessing digital information: Data mining

I. Introduction

• What is it?

II. How does it work?

• The virtuous cycle of data mining

• Techniques of data mining

III. Data mining applications

• What is it good for?

• DM and CRM


I. Introduction

• What is it?

Data mining is a process of knowledge discovery in databases

It involves the extraction of interesting information, patterns, or rules from data in large databases

These data are non-trivial, implicit, previously unknown and potentially useful

It is a search for valuable information in large volumes of data

It uses statistical techniques to explore and analyze large quantities of data in order to discover meaningful patterns and rules


Why has data mining become so popular?

Large amounts of data are being produced as more functions become automated

Many algorithms require large data sets for training and learning

Data are being warehoused

They are being extracted from various systems (accounting, billing, ordering etc) and stored in a central location

They are stored in a common format

Consistent definitions for fields and keys

Computer power is increasing and costs are decreasing


And

Strong competitive pressures

Information-intensive activities (business, science) are competing for market share, funding etc.

Realization of the increasing value of information (especially as a source of revenue)

There is value in what can be discovered in data

For business, there is value in customization

Commercial data mining software is now available

There are off-the-shelf products


Data mining can be directed

The goal is to use the available data to build a model that describes a variable of interest in relation to the data set

Given what we know about people in Bloomington, which types of people are likely to subscribe to DSL?

Data mining can also be undirected

There is no variable of interest

The goal is to search through the available data to look for patterns and relationships

What can we learn about students at IU who default on their student loans?


Data mining provides an organization with “memory” and “intelligence”

Noticing

Uses on-line transaction processing systems (OLTP)

Remembering

Capturing as much of the transaction process as possible

Phone records, communications, CRM exchanges

Learning

The records must be organized into “data warehouses”

Data mining is used to analyze these data

Intelligence involves patterns, rules, and predictions


Data mining typically involves six activities

1. Classification

Examining the features of a data instance and assigning it to a predefined class

Records in a database are updated by filling in a field with a “class code”

The process uses a “training set” to sort unclassified data into discrete classes

Assigning keywords to articles as they arrive

Sorting credit card applicants according to risk levels

Assigning people to demographic categories
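A minimal sketch of classification with a training set, assuming a nearest-neighbour rule (one of many possible classifiers). The applicant fields, the training records, and the class codes are all hypothetical, and the features are left unscaled for brevity, so income dominates the distance.

```python
import math

# Hypothetical training set: (income in $1000s, debt ratio) -> risk class code.
TRAINING_SET = [
    ((20.0, 0.60), "HIGH"),
    ((25.0, 0.55), "HIGH"),
    ((60.0, 0.30), "MEDIUM"),
    ((65.0, 0.35), "MEDIUM"),
    ((110.0, 0.10), "LOW"),
    ((120.0, 0.15), "LOW"),
]


def classify(features):
    """Return the class code of the closest training record (1-nearest neighbour)."""
    nearest = min(TRAINING_SET, key=lambda rec: math.dist(rec[0], features))
    return nearest[1]


# Unclassified records are updated by filling in a "class_code" field.
records = [
    {"income": 22.0, "debt_ratio": 0.58},
    {"income": 115.0, "debt_ratio": 0.12},
]
for record in records:
    record["class_code"] = classify((record["income"], record["debt_ratio"]))
```

The classes are fixed in advance by the training set; the classifier only decides which existing class a new record belongs to.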


2. Estimation

This process sorts continuously valued outcomes

Using new data to predict whether a given data instance is above or below a threshold

This requires a model to determine the threshold level

It can be used to make predictions

Use customer data to determine churn rates

Estimate how long a person is likely to remain a customer

Assess the probability that people will respond to an offer of a home equity loan

The model outputs a score between 0 and 1, with a threshold of 0.83
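A sketch of estimation against a threshold, assuming a logistic scoring model. Only the 0.83 threshold comes from the slide; the input fields and the weights are hypothetical and would normally be fitted to historical data.

```python
import math

THRESHOLD = 0.83  # threshold from the example above


def response_score(home_equity, years_as_customer):
    """Return a score in (0, 1); the weights here are illustrative, not fitted."""
    z = -4.0 + 0.03 * home_equity + 0.25 * years_as_customer
    return 1.0 / (1.0 + math.exp(-z))


def above_threshold(score):
    """True when the continuous score clears the threshold."""
    return score >= THRESHOLD


likely = above_threshold(response_score(home_equity=200, years_as_customer=10))
unlikely = above_threshold(response_score(home_equity=50, years_as_customer=2))
```

The score itself is the estimate; the threshold only turns it into a yes/no decision about whom to contact.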


3. Prediction

Similar to estimation but with the expectation that there will be some check in the future

Uses a training set with historical data and a known predicted variable

Predicting the size of a balance that is likely to be transferred when a person accepts a credit card offer

Determining which customers will leave in a given time period

Predicting which customers will add a new service such as caller ID in a given area


4. Affinity grouping or association rules

The goal is to explore an available data set to determine which data instances should be grouped together

This involves discovering relationships among data

Which items should be placed near each other in a supermarket?

Which products can be grouped for cross-selling?
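A sketch of how affinity between items can be measured, assuming the usual support and confidence measures for association rules. The baskets are made-up data.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market baskets, one set of items per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

# Count how often each item and each pair of items appears.
item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)


def support(a, b):
    """Fraction of all baskets that contain both a and b."""
    return pair_counts[tuple(sorted((a, b)))] / len(baskets)


def confidence(a, b):
    """Fraction of the baskets containing a that also contain b."""
    return pair_counts[tuple(sorted((a, b)))] / item_counts[a]
```

Pairs with high support and high confidence are candidates for shelf placement or cross-selling.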

5. Clustering

The task is to sort undifferentiated data into like groups

This process does not begin with predefined classes

What do the book and music purchases tell us about our customers?


6. Description and visualization

Developing a preliminary understanding of the data

This is a first step in developing an explanation

What can we tell about the people who shop in a food coop?

Visualization is the graphic representation of the data

Directed data mining

Classification, estimation, prediction

Undirected data mining

Affinity grouping, clustering, description


Classes of data mining activity

Information Discovery, Inc. (2001). A Characterization of Data Mining Technologies and Processes. http://www.datamining.com/dm-tech.htm


Types of data mining

Hypothesis testing

Top down approach designed to test careful guesses

Process

Hypotheses are formulated to be falsified

This is done in scientific and business applications

Specific kinds of data are proposed to test the hypothesis

A data requirements document is created

The data are gathered and prepared

Profile the data, especially if it is derived from heterogeneous sources


Process continued:

Data preparation (cont)

The transformation is very important and will vary with the type of software being used

Computer model is built based on the data

The model is evaluated to reject or fail to reject the hypothesis

This is done by applying it to the data set

The end result is an analysis which statistically tests the hypothesis

The results are stated with the appropriate margin of error


Issues in data preparation

Summarization: developing the appropriate level of detail

The original data should not be summarized at all

The fine grained data may be irrelevant to the question

There may be too few examples at the finest level of detail

Incompatible computer architectures

Data transport software can translate among different languages and formats

(COBOL, C, C++, ASCII encoding, single- and double-precision floating point, integers…)


Inconsistent data encoding

Different sources represent the same data in different ways

If not caught, these can introduce error into later analysis

Textual data

Mostly not useful

What is useful should be encoded in another form

This is best done by hand (to key UK, Wales, Scotland to country code “44”)

Missing values

Most software is not good at handling these
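A sketch of recoding textual data and making missing values explicit. The lookup table is the hand-built part: it uses the UK, Wales, Scotland example from the slide, and the function name is hypothetical.

```python
# Hand-built lookup table: deciding these mappings is the manual step.
COUNTRY_CODES = {"uk": "44", "wales": "44", "scotland": "44"}


def encode_country(raw):
    """Return the country code, or None when the value is missing or unknown."""
    if raw is None or not raw.strip():
        return None  # missing value, flagged explicitly rather than guessed
    return COUNTRY_CODES.get(raw.strip().lower())


codes = [encode_country(value) for value in ["UK", " Wales ", "scotland", "", None]]
```

Returning None for blanks and unknown values keeps them visible, so they can be handled deliberately before the data reach the mining software.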


Knowledge discovery: takes two major forms

Directed

Goal is to explain the value of some field (income, genetic information) or a specific relationship

Analysis seeks to estimate, classify, and predict the target field

This is an explanatory function

Finding patterns in data to explain the past and predict the future

What type of person is likely to default on a loan?

If these genetic markers are found, what future predispositions are indicated?


Process

Identify available data sources

It’s best to have preclassified data

Preparing the data for analysis

Similar issues are involved

Also involves adding fields to the data to clarify what we take for granted but that software cannot

Based on our experience with the data

Improves the chances of finding patterns

Training set: build the initial model

Test set: adjust to improve generality

Evaluation set: test the model
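A sketch of dividing preclassified records into the three sets. The 60/20/20 proportions and the fixed random seed are assumptions for illustration, not values from the slides.

```python
import random


def three_way_split(records, train_frac=0.6, test_frac=0.2, seed=0):
    """Shuffle the records and cut them into training, test, and evaluation sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_test = int(len(shuffled) * test_frac)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    evaluation = shuffled[n_train + n_test:]  # whatever remains
    return train, test, evaluation


train, test, evaluation = three_way_split(range(100))
```

The sets do not overlap, so the evaluation set stays unseen while the model is built and adjusted.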


Process (cont)

Building and training the model

Toss in as many variables as seem relevant and let the algorithm sort them out

Goal is to develop an explanation of dependent (target) variables based on independent (input) variables

The test set is used to minimize the problem of overfitting

Evaluating the model

Error rate of the evaluation set is a good indicator of the error rate with new data
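A sketch of the error rate on an evaluation set, assuming it is measured as the share of records whose predicted class differs from the known class. The function name and sample labels are hypothetical.

```python
def error_rate(predicted, actual):
    """Fraction of records where the prediction differs from the known value."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("the evaluation set must not be empty")
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return wrong / len(actual)


rate = error_rate(["yes", "no", "no", "yes"], ["yes", "no", "yes", "yes"])
```

A much lower error rate on the training set than on the evaluation set is a sign of overfitting.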


Undirected knowledge discovery

There is no target field to serve as the focus of analysis

The goal is to search for meaningful patterns

Question might be: what goes together?

Process

Similar to directed knowledge discovery

Identify potential targets for directed knowledge discovery analysis

At the end of the process, one result is often new variables

Generate new hypotheses to test



II. How does it work?



The virtuous cycle of data mining

Identify problems where DM can provide value

Transform data into useful information with DM

Act on the information

Measure the results to reuse the data


In business applications, data mining does not seek to replicate previous efforts

The goal is to discover new markets, not saturate old ones

In science, replication of results is more important

Data mining is a creative activity

Many patterns will be found, but the art is in focusing on the meaningful ones

Data mining results can change over time

Models can become less useful over time as data changes and markets change


Characteristics of DM systems

The focus is on the analysis of current and historical data to predict future action

The analytic work depends on the flow of data (which is not regular)

Typically the emphasis is on working with large data sets

The purpose is to support decision making in business and hypothesis testing in science

Response times are slower due to the computing cycles involved in analysis


Another way to think about it

Aggregate the data

Prepare it in a common format

Find patterns in the data

There is a range of techniques that can be used

Respond to the patterns (what do they mean?)

Data → information

Act on the patterns

Information → action

Action generates value


Building a DM model (flowchart)

Steps: identify data requirements; obtain data; validate, explore, and clean data; transpose data; add derived variables; create model set; choose modeling technique; train model; check model performance; choose best model

Feedback loops to earlier steps: if data are not available; if values don’t look correct; if a new derived variable improves performance; if new segmentation improves performance; if a new technique improves performance; if improvements, obtain more data


Data mining depends on three main elements

DM techniques

These are algorithmic approaches to problem solving that are statistically based

Data

DM data should be clean, simple, and in a table with well-defined columns

Data modeling

This is a process of developing predictive models for directed DM

The method for building these models is based on principles of experimental design


Techniques of DM

Automatic cluster detection

Used for undirected DM to find groupings in data

Algorithms sort the data set top down (divisive) or bottom-up (agglomerative)

k-means cluster detection is a common method

Start with an arbitrary number of “seeds” (the initial clusters)

Assign the records to the closest seed (“centroid”)

Then recalculate the mean of each cluster and move its centroid to that mean

Continue until each is in the center of a cluster of records and there are clear boundaries
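The k-means steps above can be sketched as follows. This is a minimal version assuming Euclidean distance and seeds supplied by the caller; the sample points and seeds are made up.

```python
import math


def k_means(points, seeds, max_iter=100):
    """Return (centroids, clusters) after iterating until the centroids stop moving."""
    centroids = list(seeds)
    clusters = []
    for _ in range(max_iter):
        # Assign each record to the closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Move each centroid to the mean of its cluster (unchanged if the cluster is empty).
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # no centroid moved, so the assignment is stable
        centroids = new_centroids
    return centroids, clusters


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = k_means(points, seeds=[(0.0, 0.0), (10.0, 10.0)])
```

With different seeds or a different number of seeds the same data can yield different clusters, which is why the starting points matter.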


Figure: Results of a cluster analysis (seeds 1-3 and the new centroids 1-3 after recalculation)


This is a discovery technique but is difficult to interpret

It shows that some records are closer together given the arbitrary starting points

The number of initial seeds is important

The number should minimize the distance between members of a cluster and maximize the distance between clusters

The results should be combined with other techniques to see if they have any meaning

Use it when you think there are patterns that you can’t see

Or when there are too many patterns and you want to reduce complexity


Decision trees

A classification tree labels records and assigns them to classes

The data is split iteratively until the groupings become useful

The point of the initial split is critical

Each branch cuts the space into two or more pieces and is a test on a record

Each record is tested at each node of the tree until it reaches the “leaf” or terminal node (categorical - yes/no)

Is record X greater than Y? - if yes, keep moving
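A sketch of how a record is tested at each node until it reaches a leaf. The tree, its variables, the income cut-off, and the leaf labels are hypothetical; a real tree would be grown from a training set.

```python
# Hypothetical tree: each internal node tests a single variable of the record.
TREE = {
    "test": lambda r: r["owns_home"],
    "yes": {
        "test": lambda r: r["income"] > 50000,
        "yes": "likely responder",
        "no": "unlikely responder",
    },
    "no": "unlikely responder",
}


def walk_tree(record, node=TREE):
    """Apply the test at each node and follow the matching branch down to a leaf."""
    while isinstance(node, dict):
        node = node["yes"] if node["test"](record) else node["no"]
    return node  # a leaf holds the class label


label = walk_tree({"owns_home": True, "income": 80000})
```

Reading the tests along one path gives a rule, for example: owns a home and income above 50000, so likely responder.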


A sample decision tree (figure)

Root: Respond 8%

Split questions: Own home? Family income? Mortgage? Savings account?

Node response rates: Rent 5%, Own 15%; Low 9%, Medium 13%, High 24%; Yes 16%, No 45%; Yes 4%, No 18%

http://www.spss.com/datamine/trees.htm


Decision trees are good when the goal is to develop a set of rules to organize data for predictive purposes

This works best when the tree has a manageable number of branches

The rule is formulated by tracing the branches back to the root

They are not good for discovering relationships among variables

Each split in the tree is a test of a single variable

They also may produce errors when the training set is too small


Neural networks

They use data models that simulate the structure of the brain to generalize and learn from data

They learn from a set of inputs and adjust the parameters of the model using new knowledge to find patterns in data

They fit a model to a set of historical data to classify or make predictions

They can find interaction effects among variables

You do not need to have any specific model in mind when running the analysis

They require extensive preparation of data

They also require a lengthy training period
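A sketch of how a small network turns inputs into an output, assuming one hidden layer and sigmoid units. The weights below are arbitrary placeholders; training is the process of adjusting them, and it is not shown here.

```python
import math


def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """One forward pass: inputs -> hidden layer -> single output."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)


# Two scaled inputs, two hidden units, one output; all weights are illustrative.
output = forward(
    inputs=[0.5, 0.2],
    hidden_weights=[[0.4, -0.6], [0.3, 0.8]],
    hidden_biases=[0.1, -0.2],
    output_weights=[1.2, -0.7],
    output_bias=0.05,
)
```

Inputs are usually rescaled to a small range first, which is part of the data preparation a network needs.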


Neural network model

Example of a neural network model (figure)

Inputs: Floor space, Size of garage, Age of house, Acreage, Other factors

Output: Appraised value

Berry, M.J. and Linoff, G. (1997). Data Mining Techniques. Wiley. p290


Neural networks are useful in predicting a target variable when the data is highly non-linear with interactions

A disadvantage is that it is difficult to interpret the resultant model with its layers of weights and transformations

The result is a set of weights distributed throughout the network

The weights provide no insight into why the solution is valid

This makes the use of neural nets a “black box” process

They are not very useful when the relationships in the data need to be explained





III. What is it good for?

Data mining is used for

Research

Pharmaceutical companies use DM to predict which chemicals are likely to produce powerful drugs

Process improvement

Using DM to determine thresholds for manufacturing (to separate good from bad product)

Marketing

Learning about customers to refine and target marketing campaigns and save money


The federal government uses DM to search for criminals and terrorists

Analyse FBI field agent reports

Look for patterns in international funds transfer

Customer relationship management

Developing sophisticated customer profiles shared across the business

Learning from customer behavior
