Chapter 13 Data Mining
Page 1: Chapter 13 Data Mining

Chapter 13: Data Mining

Page 2: Chapter 13 Data Mining

Recommended References
• This lecture assumes some knowledge of learning systems. We recommend:
– P. Langley: Elements of Machine Learning. Morgan Kaufmann, 1996.
– T.M. Mitchell: Machine Learning. McGraw-Hill, 1997.
– R. Bergmann: Slides on “Lernende Systeme”, wwwagr.informatik.uni-kl.de/~bergmann; also: M.M. Richter: Lernende Systeme, lecture notes, Kaiserslautern.
– Bergmann, R. & Stahl, S. (1998). Similarity Measures for Object-Oriented Case Representations. Proceedings of the European Workshop on Case-Based Reasoning, EWCBR'98.
• Data Mining references:
– P. Adriaans, D. Zantinge: Data Mining. Addison-Wesley, 1996.
– Th. Reinartz: Focusing Solutions for Data Mining. Springer Lecture Notes in AI 1623, 1998.
– S.M. Weiss, N. Indurkhya: Predictive Data Mining. Morgan Kaufmann, 1997.

Page 3: Chapter 13 Data Mining

Data Mining, Learning and Performance (1)

• The ultimate goal is to achieve optimal performance of some process P.
• The meaning of this is given by the user's utility.
• To achieve optimal performance, certain knowledge is necessary. This knowledge may be present only implicitly in the available data and has to be made usable, i.e. it has to be learned.
• For learning one needs to know:
– What precisely are the goals?
– How to measure the achievement of the goals?
– How to react if goals are not achieved?

Page 4: Chapter 13 Data Mining

Data Mining, Learning and Performance (2)

The performance of the process P is tested in experiments which generate certain data D. These data are the input to some evaluation function F.

[Figure: the user's view on the performance of P is compared with the formal evaluation function F for P.]

• Coincidence of the user's view on the performance and the result of the evaluation is wanted.
• Often the coincidence can only be approximated.

Page 5: Chapter 13 Data Mining

Data Mining, Learning and Performance (3)

[Figure: the improvement loop. An experiment with process P and knowledge K generates data D; the data are evaluated; data mining analyzes the data and the evaluation result; learning turns the analysis result into an improved process P' and knowledge K', which update P and K.]
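To make the loop concrete, here is a toy sketch in Python. Everything in it is an illustrative assumption (a process reduced to a single tunable parameter, a target utility of 1.0, invented helper names); it only shows how experiment, evaluation, data mining and learning feed each other.

```python
# Toy sketch of the loop in the figure above: run an experiment, evaluate the
# generated data, mine it for a pattern, and learn an improved process.
# The "process" (one tunable parameter) and all helpers are illustrative assumptions.

import random

def run_experiment(parameter, n=50):
    # Generated data D: noisy outcomes of process P with the current parameter.
    return [parameter + random.gauss(0, 0.1) for _ in range(n)]

def evaluate(data, target=1.0):
    # Evaluation function F: mean deviation from the user's target utility.
    return sum(abs(x - target) for x in data) / len(data)

def mine(data, target=1.0):
    # Data mining step: extract a simple pattern (are we above or below target?).
    mean = sum(data) / len(data)
    return "too_high" if mean > target else "too_low"

def learn(parameter, pattern, step=0.2):
    # Learning step: update the process P -> P' using the mined pattern.
    return parameter - step if pattern == "too_high" else parameter + step

parameter = 0.0
for _ in range(10):
    data = run_experiment(parameter)      # experiment -> data D
    score = evaluate(data)                # evaluation result
    pattern = mine(data)                  # analysis result
    parameter = learn(parameter, pattern) # improved process P'
print("final parameter:", parameter)
```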

Page 6: Chapter 13 Data Mining

KDD: Knowledge Discovery in Data Bases

• Knowledge Discovery in Data Bases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data (Fayyad).
• Data Mining is often used as a synonym for KDD but is sometimes restricted to a crucial step in KDD:
• The step of applying data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns over the data.

Page 7: Chapter 13 Data Mining

KDD Phases

Business Understanding → Data Understanding → Data Preparation → Data Exploration → Data Mining → Evaluation → Deployment

Page 8: Chapter 13 Data Mining

Requirement Analysis for KDD Processes

[Figure: the requirement analysis derives the application requirements from the application properties and application goals, the data characteristics (volume, quality, representation), the domain characteristics, and the system context characteristics.]

Page 9: Chapter 13 Data Mining

Data Mining and the Pre-Sales Process

• The purpose of data mining for the pre-sales process is to obtain knowledge which allows the supplier to attract more customers from the intended target groups.
• The knowledge obtained can be concerned with
– The market in general
– The market with respect to certain products
– The behavior of certain customer classes:
• Marketing Campaign Management: How do customers react to marketing actions?
• Basket Analysis: What do customers typically buy?
– Individual customers and their behavior
• The general strategy for data mining of a company is the strategic model, which in turn is influenced by feedback from the results obtained.

Page 10: Chapter 13 Data Mining

Data Mining and the Sales Process

• The purpose of data mining for the sales process is to obtain knowledge which allows the supplier to improve the quality of his processes in such a way that customers who have contacted the supplier
– are guided efficiently through the sales process
– make a positive decision for the sale
• This includes
– offering the products appropriately
– offering adequate alternatives
– guiding efficiently through the dialogue
• This influences the diagnostic and the action model.

Page 11: Chapter 13 Data Mining

Data Mining in the After-Sales Process

• The purpose of data mining for the after-sales process is to obtain knowledge which allows customer questions and complaints to be dealt with more efficiently.
• The goals are to
– improve the recognition of reasons for calls
– avoid repeated calls
– arrive at solutions efficiently
• Useful knowledge is mainly contained in experiences, and therefore the collection of experiences is central.
• Experiences are best stored as cases in CBR.

Page 12: Chapter 13 Data Mining

The Starting Point: Data (1)

• Data have a certain quality
– Correctness and completeness problem
• It is essential to address the problem of data quality: if you feed garbage into the system, you will get garbage out!
– the insights obtained from the data lead to incorrect consequences (wrong data)
– the insights are too general to be useful (incomplete data)

Page 13: Chapter 13 Data Mining

Starting Point: Data (2)

• Data may be noisy
• Incorrect data
– wrong values for the attributes
– incorrect classification
– duplicate data
• Incomplete data
– missing values for some attributes
– missing attributes
– missing objects
• Data not usable
– free text difficult to cope with
– terminology not understood
– not suitable for the intended goals

Page 14: Chapter 13 Data Mining

Starting Point: Data (3)

• Knowledge management task:
• Quality management!
• Data sampling
– Define the goals
– Quality is more important than quantity
– Make use of existing information sources to ensure completeness of the base
– Create your own sources
– Data have to arrive in time: data which are too old are not useful (updating problem)
• See chapter 15.

Page 15: Chapter 13 Data Mining

Data for what Knowledge?

• The way data are obtained depends on the type of knowledge one is interested in.
• We distinguish three main types:
– Knowledge about some market. This will influence the strategic, the diagnostic and the action model of the supplier.
– Knowledge about individual customers. It is used to treat the customer individually, e.g. by making special offers.
– Knowledge about technical objects: their quality, how to explain how to operate them, etc.
• With the type of knowledge, different
– goals of the supplier
– data sources
are connected.

Page 16: Chapter 13 Data Mining

Data Warehouse

Idea: Store knowledge like physical objects.
Allows: Access, delivery and manipulation as for physical objects.

Data Warehouse:
• Access to knowledge for immediate use
• Makes knowledge available for improving the quality

The data warehouse is managed by the knowledge manager.

Page 17: Chapter 13 Data Mining

From Data to Knowledge (1)

[Figure: the data-to-wisdom hierarchy, ordered by increasing understanding (association, relationships, connectivity):]
• Data: facts (What?)
• Information: description, definition, perspective (What, when, where, who? Understand relations)
• Knowledge: strategy, practice, method (How, why? Understand models, rules, patterns)
• Wisdom: insight, moral (Implications; understand principles)

Page 18: Chapter 13 Data Mining

From Data to Knowledge (2)

• Data are raw products
• Information pieces are semi-finished products
• Knowledge and wisdom are high-quality products (abstraction increases along this chain)

But: when using knowledge, access to up-to-date data and information is necessary. How to do this?

Page 19: Chapter 13 Data Mining

From Data to Knowledge (3)

It is a knowledge management task to provide, for each application of knowledge, the needed up-to-date data:

[Figure: a task to perform draws on the knowledge applied and the data needed.]

Page 20: Chapter 13 Data Mining

From Data to Knowledge (4)

• Only explicit knowledge can be used directly
• Explicit knowledge is directly formulated:
– Prescriptions, rules, norms
– Suggestions, ways to behave
– General laws, exceptions
– Hierarchical relations
– Properties, constraints
– ...

Page 21: Chapter 13 Data Mining

From Data to Knowledge (5)

• Implicit knowledge cannot be used directly
• Implicit knowledge is:
– contained in data and information
– often hidden and difficult to discover
– not directly applicable
– tacit (“silent”) knowledge

Page 22: Chapter 13 Data Mining

From Data to Knowledge (6)

Implicit knowledge:
– Sales statistics contain implicit knowledge about customer preferences
– Data bases about accidents contain implicit knowledge about dangerous situations
– Test data contain implicit knowledge about quality

Page 23: Chapter 13 Data Mining

From Data to Knowledge (7)

• Data and pieces of information have to be correct (or exact tolerances have to be given)
• Knowledge does not have to be totally correct in order to be useful:
– Probabilities, evidences
– Heuristics
– Rules of thumb
– Vague statements (“this is not reliable”, “the weather there is not nice in November”)
– Fuzzy statements
• A correct statement in a complex situation may even be useless because it is too complicated

Page 24: Chapter 13 Data Mining

Wisdom

• Wisdom is usually referred to as a very advanced type of knowledge
• It refers to the understanding of basic background principles
• Only in the exact sciences can it be expressed in precise terms
• Wisdom is of relevance for the strategic model (which is mainly informal)

Page 25: Chapter 13 Data Mining

Make Knowledge Explicit (1)

• General properties of products need to be represented differently in different situations:
• Vacations in Tirol are
– nice and warm (for persons from Alaska)
– nice and cool (for persons from Brazil)
• A car
– is good and speedy on small and hilly roads (Germany)
– is comfortable (USA)

Page 26: Chapter 13 Data Mining

Make Knowledge Explicit (2)

• Use the properties of a product in order to
– guarantee the satisfaction of different safety regulations
– satisfy different types of demands
– respect different types of sensitivities
• Describe these properties in different ways
• For such purposes one has to extract the specific views from the overall knowledge

Page 27: Chapter 13 Data Mining

Reliability of Knowledge (1)

[Figure: the extension of knowledge obtained by different methods; darkness indicates reliability. From knowledge obtained by direct retrieval, over logical deduction, approximative reasoning and CBR, to knowledge obtained by learning and data mining, the extension grows while the reliability decreases.]

This assumes that the underlying data and information bases are reliable.

Page 28: Chapter 13 Data Mining

Reliability of Knowledge (2)

• This schema is only a rough and general indication.
• The success in applications depends heavily on, e.g.,
– correctness, amount and typicality of the data
– adequate choice of the specific method and the precision with which it is applied
– the number of experiments carried out
– testing of the results
• Therefore the success depends on the invested effort.
• There is again the utility question: costs of obtaining knowledge versus gain of applying knowledge.

Page 29: Chapter 13 Data Mining

Sources of Data

• General analysis, public domain
– accessible to everyone but often widely distributed and hard to collect
• General analysis, performed by the company itself or some paid institution
– expensive, but can be tailored to the needs of the company
• History of customers
– requires customers who buy regularly
– has to be updated regularly
• Internal analysis of customer behavior
– reaction to changes in
• prices
• dialogue strategies, etc.
• Cases
– collected experiences, failure statistics, etc.

Page 30: Chapter 13 Data Mining

History of Customers

• Knowledge about the behavior of individual customers should in general not be obtained by asking personal questions but rather automatically.
• One possibility is to do this at the cash desk if the customer pays with a customer card or credit card. In e-commerce this is possible when the customer orders directly over the net.
• There may be certain restrictions by law.
• The history can contain, among others,
– main products ordered and their quantities
– times or events when ordered (weekend, holidays, time of the year, ...)
• The history should contain (if possible) information about the customer (for the description of customer classes)
– age, sex, profession, place of residence, ... (a small record sketch follows below)

Page 31: Chapter 13 Data Mining

Cases (1)

• In the after-sales process, histories have to be recorded if they are available; they are the material for the cases.
• Often there are not enough cases available to cover all or most of the relevant problem situations.
• In this situation artificial cases can be created by varying relevant parameters (see the sketch after this list).
• Both collecting and creating cases require some a priori understanding of the tasks to be performed.
• To build a CBR system one has to define the four containers: vocabulary, case base, similarity measure and solution transformation.

Page 32: Chapter 13 Data Mining

Cases (2)

• There are commercial systems like CBR-Works which support the collection and representation of cases (see also chapters 3 and 12).
• A general methodology for developing CBR systems for applications in the help desk area is described in
– R. Bergmann, S. Breen, M. Göker, M. Manago, S. Wess: Developing Industrial Case-Based Reasoning Applications: The INRECA Methodology. Springer Lecture Notes in AI 1612, 1999.

Page 33: Chapter 13 Data Mining

From Data to Information Using Knowledge

Raw data become valuable information by using knowledge.

Example query: Customer: Company X, architects. PC component: Matrox G100?

Raw data:
– Company X: 1x PC Dual-Pentium XL437, sold 4/97; 2x ML 649 (P233/124/9,6), sold 5/97; SW: high-end, CAD & 3D visualization, TCP/IP networking, ...
– G100: entry-level graphics card, AGP slot necessary, very good price/performance ratio, limited 3D power, ...

Resulting information: “The G100 is of little use for Company X because the architects use high-end 3D graphics software. The G100 is an entry-level graphics card and additionally needs an AGP slot, which is not built into the current HW configuration of the PCs.”

Page 34: Chapter 13 Data Mining

Three Main Phases

• Measurement: collects numerical data about the intended utility.
• Evaluation: extracts statements about the utility from the data (excellent, good, sufficient, improved, insufficient, ...).
• Sensitivity Analysis: extracts the influence factors responsible for the result of the evaluation.
• The learning and data mining tools can
– use the results of all three phases
– improve these phases

Page 35: Chapter 13 Data Mining

Measurement

• The utility is often only informal and implicit in the head of the user.
• The measurement problem is
– to map it onto quantitative magnitudes
– to define procedures which measure these quantities.
• The measurement procedures are often difficult to define and expensive.
• The parameters in the procedures have to be named precisely so that the procedure can be applied repeatedly (as, e.g., in the exact sciences).

Page 36: Chapter 13 Data Mining

Evaluation

• The evaluation of the measured data has to close the gap between the data and the utility of the user:
– the evaluation predicate should (at least ideally) coincide with the predicate which is given by the user to the performance (see also the relation between similarity and utility in chapter 6).
• The evaluation should contain a statement about its reliability (see the sketch below), e.g.
– tolerances for errors
– error probabilities
– confidence intervals
• The reliability depends heavily on the input data (volume, representativeness, correctness, noise, etc.)
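The sketch announced above: a minimal way to attach a reliability statement to an evaluation, here a normal-approximation confidence interval for the mean of measured utility scores. The sample values and the 95% level are assumptions.

```python
# Minimal sketch: attach a confidence interval to an evaluated mean utility score.
# The sample data and the 95% confidence level are illustrative assumptions.

import math
import statistics

measured_scores = [0.72, 0.81, 0.77, 0.69, 0.85, 0.74, 0.79, 0.80]

mean = statistics.mean(measured_scores)
stderr = statistics.stdev(measured_scores) / math.sqrt(len(measured_scores))
z = 1.96  # normal approximation for a 95% confidence interval
low, high = mean - z * stderr, mean + z * stderr

print(f"evaluated utility: {mean:.2f}, 95% CI: [{low:.2f}, {high:.2f}]")
```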

Page 37: Chapter 13 Data Mining

Sensitivity Analysis

• This is the most difficult and the most important phase.
• The evaluation is given as a function Ev(d1, ..., dn), where the di are data obtained by the measurement.
• The data di are, on the other hand, an indirect consequence of parameters pi which can be directly influenced by the person who designs the process (or product etc.) which is evaluated:
– Ev(d1, ..., dn) = Influence(p1, ..., pm)
– where the function Influence is in general unknown.
• We call a parameter pi an (important) influence factor if small variations of pi result in large variations of the function Influence(..., pi, ...).
• The determination of influence factors is the basis for learning improvements of the object under consideration (see the sketch below).
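The sketch announced above: the unknown function Influence is probed by finite differences, and parameters whose small variation changes the evaluation strongly are reported as influence factors. The stand-in function and the step size are assumptions.

```python
# Minimal sketch: rank parameters as influence factors by finite-difference sensitivity.
# The black-box "influence" function and the step size are illustrative assumptions.

def influence(p):
    # Stand-in for the unknown Influence(p1, ..., pm) observed via experiments.
    return 3.0 * p[0] - 0.2 * p[1] + 1.5 * p[2] ** 2

def sensitivities(f, p, step=0.01):
    """Estimate |dEv/dpi| for each parameter by a small variation of pi."""
    base = f(p)
    result = []
    for i in range(len(p)):
        varied = list(p)
        varied[i] += step
        result.append(abs(f(varied) - base) / step)
    return result

p = [1.0, 1.0, 1.0]
for i, s in enumerate(sensitivities(influence, p), start=1):
    print(f"p{i}: sensitivity ~ {s:.2f}")
# Parameters with large values are the (important) influence factors.
```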

Page 38: Chapter 13 Data Mining

QMCB: Quantitative Models of Consumer Behavior

• Goal: the calculation and prediction of meaningful market diagnostics on the basis of data.
• A possible approach: integration of statistical methods and models as well as econometric models in a knowledge-based system.
• Tasks:
– Descriptive (a posteriori) analysis of data
– Model-based simulation of future buying behavior
• The special types of tasks require special data representations for useful evaluations.

Page 39: Chapter 13 Data Mining

Different Types of Forecasts

• The types vary with respect to the knowledge they contain and the usefulness of the prognosis. From the QMCB one should be able to compute directly (examples):
– Market share of a product
– Product purchase probability, expectation and variance
– Brand purchase probability, expectation and variance
– Heterogeneity in purchase rates
• Indirect consequences:
– relative product attraction
– relative brand attraction
– etc.

- 40 - (c) 2000 Dr. Ralph Bergmann and Prof. Dr. Michael M. Richter, Universität Kaiserslautern

Example System: KVASS (1)

• KVASS (KaufVerhaltensAnalyse und SimulationsSystem) is an example of a model and knowledge based data analysis system.– Reference: R. Decker:Knowledge Based Selection and

Application of Quantitative Models of Consumer Behavior. Information Systems and Data Analysis (ed. H.H.Bock, W.Lenski, M.M.Richter), Springer Verlag 1994, p. 405-414.

• Basic idea: Model data with a predefined set of descriptors. These are essentially attributes with there domains, e.g.– estimation method : {undefined, least squares, ...., moments}– type of recording : {undefined, diary, ..., interview}

Page 41: Chapter 13 Data Mining

Example System: KVASS (2)

• Classes of descriptors are:
– Essential aspects for a general description (type of recording, market share, etc.)
– Temporal aspects (periods for data collection, etc.)
– Information on the models used for computation (e.g. estimation method)
– Technical descriptors for the interpretation of the representation (e.g. ordinal, nominal, etc.)
– Combinations of descriptors allow one to represent complex situations; these can be translated into more understandable relational representations (see chapter 4).

Page 42: Chapter 13 Data Mining

Example System: KVASS (3)

• The system essentially describes a measurement procedure, i.e. the first phase.
• The purpose is not to make an evaluation of the success of a product or process of the company.
• The correctness condition is that the results provided by the analysis of the system coincide with reality.
• The results of the system are, on the other hand, important for the sensitivity analysis concerning the success or failure of processes or products designed by the company.

Page 43: Chapter 13 Data Mining

Causal Analysis (1)

• Causal analysis is a kind of sensitivity analysis. Task: make causal relations explicit.
• Suppose the Xi are activities and the Yi are sales results. Notation:
– Xi → Yi labelled “+”: positive influence
– Xi → Yi labelled “-”: negative influence
– no arrow: neutral
• Initial situation: a suspected model for the influence.
• Either experiment (variation of the Xi and measurement of the Yi) or analysis in several companies.
• Data analysis: e.g. by analyzing the covariance structure.
• Result: revised model and refined model.

Page 44: Chapter 13 Data Mining

Causal Analysis (2)

• Example (artificially created):
– X1: effort in catalogues
– X2: effort in dynamic forms
– X3: effort in recording and applying customer histories
– Y1: return from book sales
– Y2: return from high-tech product sales
• Initial model based on qualitative knowledge:

[Figure: the suspected initial model, with positive (+) influences from the activities Xi on the returns Y1 and Y2.]

Page 45: Chapter 13 Data Mining

Causal Analysis (3)

Revised model:

[Figure: X1 influences Y1 and Y2 positively, X2 influences Y2 positively, X3 influences Y1 negatively and Y2 positively.]

A possibility for arriving at a refined quantitative model is to assume a linear model (which may be justified by some knowledge). This leads to the linear equations

Y1 = a11·X1 - a13·X3
Y2 = a21·X1 + a22·X2 + a23·X3

The solutions for the coefficients aik determine a quantitative model (see the sketch below).
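The sketch announced above: estimating the coefficients aik by ordinary least squares from observed (X1, X2, X3, Y1, Y2) records. The records are synthetic, and NumPy's lstsq is used as a generic solver.

```python
# Minimal sketch: estimate the coefficients a_ik of the refined linear model
# Y1 = a11*X1 - a13*X3 and Y2 = a21*X1 + a22*X2 + a23*X3 by least squares.
# The observation data are synthetic, illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 3))                            # columns: X1, X2, X3
Y1 = 2.0 * X[:, 0] - 0.5 * X[:, 2] + rng.normal(0, 0.2, 100)     # book sales
Y2 = 1.0 * X[:, 0] + 1.5 * X[:, 1] + 0.8 * X[:, 2] + rng.normal(0, 0.2, 100)

# Y1 depends only on X1 and X3 in the revised model, Y2 on all three activities.
coef_y1, *_ = np.linalg.lstsq(X[:, [0, 2]], Y1, rcond=None)
coef_y2, *_ = np.linalg.lstsq(X, Y2, rcond=None)

# The X3 coefficient of Y1 comes out negative, matching the explicit minus on the slide.
print("a11, signed a13 ~", np.round(coef_y1, 2))
print("a21, a22, a23  ~", np.round(coef_y2, 2))
```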

Page 46: Chapter 13 Data Mining

Quality Management: Internal Analysis

• As the first step, the goals of the analysis have to be defined:
– Where are the weak points?
– What has to be improved or optimized?
– Where are improvements possible?
• This is part of the requirements analysis.
• Further steps include
– identify groups of objects with similar quality characteristics
– identify properties of these groups
– describe these groups
– draw conclusions for quality improvements

Page 47: Chapter 13 Data Mining

Example: Quality Analysis for Dialogues (1)

• Classification of dialogues (evaluation by the user):
– successfully finished
– quit because no adequate product was available
– quit for unknown reasons: this is the failure class.
• Measurement:
– Has to collect data which arise during the dialogue
– These data may not be recorded during an ordinary dialogue, e.g.
• Which questions raised by the customer were dealing with a certain property type of the product
• Which actions were performed by customers from a certain customer class
– The quality of the measured data has to be considered

Page 48: Chapter 13 Data Mining

Quality Analysis for Dialogues (2)

• The evaluation is simple because it is the same as that of the user.
• The sensitivity analysis has two phases here:
– (1) Describe the evaluation result in terms of measured quantities and determine the influence factors of this description.
– (2) Describe the evaluation result in terms of factors which define the dialogue.
• The first phase already involves a learning step:
– The classification of the dialogue in terms of measured quantities has to be learned. This classification approximates the real classes obtained from the evaluation.

Page 49: Chapter 13 Data Mining

Quality Analysis for Dialogues (3)

• The analysis of the first phase is based on the dialogue situations and additionally measured data.
• Typical candidates for interesting data in order to classify types of situations are
– length of the dialogue
– terms that were not understood
– customer questions (How often? Typical ones?)
– etc.
• The selection of these candidates depends on a hypothesis for a preliminary dependency model. The data mining and learning methods are used to refine and correct this model.

Page 50: Chapter 13 Data Mining

Quality Analysis for Dialogues (4)

• The result allows a prognosis of the dialogue class from the occurrence of dialogue situations which are important influence factors (but here in terms of measured data!), in particular a description of failure situations, i.e. situations which lead with high probability to a failure dialogue.
• The description of the failure situations is refined in order to
– discover dependencies between influence factors
– in particular, obtain definitions of earliest failure situations in dialogues, i.e. the earliest situations in the dialogue which will lead to a failure.
• The earliest failure situations give rise to the second phase of the sensitivity analysis.

Page 51: Chapter 13 Data Mining

Quality Analysis for Dialogues (5)

• Second phase: analysis of the reasons for reaching earliest failure situations, mainly:
– Which elements in the strategy are responsible?
– Weak points of the knowledge base (e.g. wrong prices for products)?
• These reasons can be influenced directly when the dialogue is designed.
• Consequences of the analysis (learned results):
– improved knowledge base
– possible changes of the strategy
– possible disadvantages of changes
• Final recommendation: update

Page 52: Chapter 13 Data Mining

Discussion

• The dialogue and the situations can be given in a (possibly object-oriented) attribute-value representation. Some virtual attributes (like the length of dialogues) can be useful; they contain valuable knowledge.
• One way to proceed is to use cluster analysis techniques and machine learning algorithms (e.g. CN2, C4.5) for learning the classification (see the sketch below).
• Another way is to consider the data base as a case base and start with an initial similarity measure which is improved during the development of a CBR system for the classification and the improvement suggestions.
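The sketch announced above for the first route: a decision-tree learner (scikit-learn's CART implementation as a stand-in for C4.5 or CN2) learns the dialogue classification from attribute-value data. The attributes and labels are invented.

```python
# Minimal sketch: learn a dialogue classification from attribute-value data with a
# decision tree (scikit-learn's CART as a stand-in for C4.5/CN2). The dialogue
# attributes and class labels below are invented, illustrative data.

from sklearn.tree import DecisionTreeClassifier, export_text

# Attributes per dialogue: [length, unknown_terms, customer_questions]
X = [
    [ 5, 0, 1], [ 7, 1, 2], [30, 6, 9], [25, 4, 7],
    [ 6, 0, 2], [28, 5, 8], [ 8, 1, 1], [27, 7, 6],
]
# Classes: 0 = successfully finished, 1 = failure (quit for unknown reasons)
y = [0, 0, 1, 1, 0, 1, 0, 1]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["length", "unknown_terms", "questions"]))
print("prediction for a new dialogue [26, 5, 7]:", tree.predict([[26, 5, 7]])[0])
```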

Page 53: Chapter 13 Data Mining

Learning Informal Concepts

• Many concepts in e-commerce, in particular in connection with CRM and customer classes, are of an informal character for which no direct formal equivalent exists.
• Computer support requires a formal notion which approximates the informal concept as well as possible.
• Such formal versions have to be learned, and the learning process requires data mining activities which are again based on studies of customers and their behavior.
• It has to be taken into account that informal concepts are usually not stable over time.

Page 54: Chapter 13 Data Mining

The Correctness Problem

• The correctness problem for the statement that two expressions are logically equivalent reduces to a formal proof.
• How to “prove” that an informal and a formal concept are equivalent?
– Formal systems do not have access to informal notions.
– Humans usually have difficulties comparing both types of notions because this refers to a broad scope of intended uses.
• Required is a kind of Turing test which decides that a human who uses the informal version and a machine which uses the formal version refer to the same concept.
• The ordering principle is that the test does not deal with the concept itself but with partial orderings related to the concept.

Page 55: Chapter 13 Data Mining

The Ordering Principle and a Turing Test (1)

Suppose there is a partial ordering “<” associated with the concept C. The partial ordering then again has two versions: formal and informal. The Turing test refers to these two versions of “<”:

[Figure: the informal human version of C is compared with the formal version of C. The goal is that when variations of the arguments of < are presented, the human says “up” if and only if the formal system says “up”.]

Page 56: Chapter 13 Data Mining

The Ordering Principle and a Turing Test (2)

Concept to grasp: typical lion.
• The formal version uses the ordering: quotient of length/height.
• The human uses an aesthetic property.

[Figure: a sequence of lion shapes, ordered as “better” by both versions.]

The partial ordering approximates the concept C in the sense that the semantics of y < z is: z is more typical for C than y is.

Page 57: Chapter 13 Data Mining

The Ordering Principle and a Turing Test (3)

• Advantages of the ordering principle:
– The equivalence of formal and informal concepts can be effectively validated by Turing tests, i.e. by experiments.
– If there are several orderings involved, this can be done for all of them.
– The search for a formal counterpart of an informal concept can be performed in an approximative way, and partial validation is possible.
• The formal partial ordering is what has to be learned.
• The learning process is an approximation process in order to pass the Turing test sufficiently well.

Page 58: Chapter 13 Data Mining

The Learning Scenario

• (1) The informal concept C on a set U is regarded as a fuzzy set for which a set of prototypes P ⊆ U is known.
• (2) An informal relation rx(y, z) states “y is more similar to x than z is”.
• The object to be learned is a similarity measure sim: U × P → [0, 1].
• Turing test: the relation induced by the formal similarity measure (y ranked above z with respect to x iff sim(y, x) ≥ sim(z, x)) and rx agree (see the sketch below).
• We decompose the approach into two basic steps:
– A first step to get a suitable representation language: concept learning.
– A second step for learning the similarity measure: subsymbolic learning.
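The sketch announced above: for sampled pairs (y, z) and a prototype x, the ordering induced by a candidate sim is compared with the informal judgments rx(y, z). The objects, the toy measure and the "human" answers are assumptions.

```python
# Minimal sketch: measure how well the ordering induced by a candidate similarity
# measure agrees with informal human judgments r_x(y, z). Objects, the toy measure
# and the "human" answers are illustrative assumptions.

def sim(y, x, weights=(0.7, 0.3)):
    # Toy weighted similarity over two numeric attributes in [0, 1].
    return 1.0 - sum(w * abs(a - b) for w, a, b in zip(weights, y, x)) / len(y)

def agreement(prototype, pairs, human_says_y_closer):
    """Fraction of pairs (y, z) where sim and the human ordering r_x agree."""
    hits = 0
    for (y, z), human in zip(pairs, human_says_y_closer):
        formal = sim(y, prototype) >= sim(z, prototype)
        hits += (formal == human)
    return hits / len(pairs)

prototype = (0.5, 0.5)
pairs = [((0.4, 0.5), (0.9, 0.1)), ((0.2, 0.2), (0.45, 0.5)), ((0.6, 0.4), (0.1, 0.9))]
human_says_y_closer = [True, False, True]   # r_x(y, z) answers collected from users
print("agreement with r_x:", agreement(prototype, pairs, human_says_y_closer))
```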

Page 59: Chapter 13 Data Mining

Learning of Weights

• Learning similarities is an example of subsymbolic learning and often reduces to learning weights. We distinguish:
– global weights:
  sim(q, c) = Σ (i = 1..n) wi · simi(qi, ci)
– prototype-specific weights (wi,c: relevance matrix):
  sim(q, c) = Σ (i = 1..n) wi,c · simi(qi, ci)
• Change of weights = change of the relevance of features.
• The error function is determined by the Turing test.
• Learning procedures can be supervised or unsupervised (a small computation sketch follows below).
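The computation sketch announced above: the weighted similarity with global weights wi and with prototype-specific weights wi,c. The local similarity function and all values are illustrative assumptions.

```python
# Minimal sketch of the weighted similarity sim(q, c) = sum_i w_i * sim_i(q_i, c_i),
# with either global weights or prototype-specific weights w_{i,c} (relevance matrix).
# The local similarity function and all example values are illustrative assumptions.

def local_sim(qi, ci, value_range=10.0):
    # Simple local similarity for numeric attributes.
    return 1.0 - abs(qi - ci) / value_range

def weighted_sim(query, case, weights):
    total = sum(weights)
    return sum(w * local_sim(q, c) for w, q, c in zip(weights, query, case)) / total

query = [3.0, 7.0, 1.0]
case_a = [4.0, 7.0, 9.0]

global_weights = [0.5, 0.3, 0.2]
relevance_matrix = {"case_a": [0.2, 0.2, 0.6]}   # prototype-specific weights w_{i,c}

print("global:  ", round(weighted_sim(query, case_a, global_weights), 3))
print("per-case:", round(weighted_sim(query, case_a, relevance_matrix["case_a"]), 3))
```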

Page 60: Chapter 13 Data Mining

Learning of Weights with/without Feedback

• Many algorithms for both learning types are known.
• Learning without feedback for retrieval / reuse
– Use the distribution of cases in the case base in order to determine the relevance of attributes
• Learning with feedback
– Correct or incorrect choice of cases / classification
– the result leads to a change of the weights

[Figure: positive and negative cases separate well along attribute A1 but not along A2; A1 is more important than A2.]

Page 61: Chapter 13 Data Mining

Learning of Weights without Feedback

• Determination of class-specific weights:
– Binary coding of the attributes by
• discretizing real-valued attributes
• transforming each symbolic attribute into n binary attributes
– Suppose
• wik is the weight for attribute i for class k
• class(c) is the class (solution) in case c
• ci is attribute i in case c
– Put: wik = P(class(c) = k | ci), the conditional probability that the class of a case is k under the condition that attribute i is present.
– The probabilities are estimated using samples from the case base (see the sketch below).
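The sketch announced above: wik = P(class(c) = k | ci) estimated by counting over a binary-coded case base. The case base is invented.

```python
# Minimal sketch: estimate class-specific weights w_ik = P(class(c) = k | c_i = 1)
# by counting over a case base of binary-coded attributes. The case base is invented.

from collections import defaultdict

# Each case: (binary attribute vector, class label k)
case_base = [
    ([1, 0, 1], "A"), ([1, 1, 0], "A"), ([1, 0, 0], "A"),
    ([0, 1, 1], "B"), ([0, 1, 0], "B"), ([1, 1, 1], "B"),
]

attr_count = defaultdict(int)         # how often attribute i is present
attr_class_count = defaultdict(int)   # how often attribute i is present in class k

for attributes, label in case_base:
    for i, value in enumerate(attributes):
        if value:
            attr_count[i] += 1
            attr_class_count[(i, label)] += 1

def weight(i, k):
    """w_ik = P(class = k | attribute i present), estimated from the case base."""
    return attr_class_count[(i, k)] / attr_count[i] if attr_count[i] else 0.0

for i in range(3):
    print(f"attribute {i}: w_iA = {weight(i, 'A'):.2f}, w_iB = {weight(i, 'B'):.2f}")
```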

Page 62: Chapter 13 Data Mining

Learning of Weights with Feedback

• Correct or incorrect classification leads to a correction of the weights: wik := wik ± Δwik
• There are several ways to adapt the weights.
• Approach of Salzberg (1991) for binary attributes (see the sketch below):
– Feedback = positive (i.e. correct classification):
• the weight of attributes with the same values increases
• the weight of attributes with different values decreases
– Feedback = negative (i.e. wrong classification):
• the weight of attributes with the same values decreases
• the weight of attributes with different values increases
• The increment Δwik remains constant.
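The sketch announced above: a simplified reading of the Salzberg-style feedback rule for binary attributes, not a faithful reimplementation. The constant increment and the example vectors are assumptions.

```python
# Minimal sketch of the feedback rule above for binary attributes: after a retrieval,
# matching attributes are rewarded and mismatching ones penalized when the
# classification was correct, and vice versa when it was wrong. The constant
# increment and the example data are illustrative assumptions (simplified Salzberg-style).

DELTA = 0.05  # constant increment

def update_weights(weights, query, retrieved_case, correct):
    """Return adjusted attribute weights after one classification feedback."""
    new_weights = []
    for w, q, c in zip(weights, query, retrieved_case):
        same = (q == c)
        if correct:
            w += DELTA if same else -DELTA
        else:
            w += -DELTA if same else DELTA
        new_weights.append(max(w, 0.0))   # keep weights non-negative
    return new_weights

weights = [0.5, 0.5, 0.5]
query = [1, 0, 1]
retrieved_case = [1, 1, 1]

weights = update_weights(weights, query, retrieved_case, correct=True)
print("after positive feedback:", weights)   # [0.55, 0.45, 0.55]
weights = update_weights(weights, query, retrieved_case, correct=False)
print("after negative feedback:", weights)   # back toward [0.5, 0.5, 0.5]
```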

Page 63: Chapter 13 Data Mining

Summary

• Relations between data mining and KDD.
• Relations between data mining, learning and performance.
• The way from data to knowledge.
• Making knowledge explicit.
• Collecting cases and building a CBR system.
• Examples:
– Quantitative models of consumer behavior (external analysis)
– Causal analysis (external analysis)
– Quality analysis for dialogues (internal analysis)
• Learning of informal concepts can be reduced to learning of similarity measures.