
An Evaluation of Commercial Data Mining

Oracle Data Mining

Emily Davis

Computer Science Department

Rhodes University

Supervisor: John Ebden

November 2004

Submitted in partial fulfilment of the requirements for BSc. Honours in Computer Science


Acknowledgements

I am very grateful for all the advice and assistance given to me by my supervisor, John Ebden. I am exceedingly thankful for all the time and effort he put into helping me produce this work. I am also grateful for the funding provided by the Andrew Mellon Foundation in the form of an Honours Degree Scholarship.

I must acknowledge the financial and technical support of this project by Telkom SA, Business Connexion, Comverse SA, and Verso Technologies through the Telkom Centre of Excellence at Rhodes University.

I must also thank the technical division in the Computer Science Department at Rhodes University, and especially Jody Balarin and Chris Morley, for their help.


Table of Contents

Abstract

Section 1 Introduction
  Chapter 1 Introduction
    1.1 Background to Data Mining
    1.2 Supervised Learning and Classification Techniques
    1.3 Oracle Data Mining (ODM)
      1.3.1 Oracle Data Mining Algorithms
      1.3.2 Functionality of Oracle Data Mining Algorithms and ODM
    1.4 Chapter Summary

Section 2 Evaluation of Oracle Data Mining
  Chapter 2 Methodology of the Evaluation
    2.1 Approach
    2.2 Choice of Data Mining Tool
    2.3 The Data
    2.4 Classification Algorithms
      2.4.1 Naïve Bayes
      2.4.2 Adaptive Bayes Network
    2.5 Algorithm Settings
      2.5.1 Naïve Bayes Settings
      2.5.2 Adaptive Bayes Network Settings
    2.6 Chapter Summary
  Chapter 3 Classification Models
    3.1 Preparing the Data
      3.1.1 Build and Test Data Sets
      3.1.2 Priors
    3.2 Building the Models
      3.2.1 Building the Naïve Bayes Models
        3.2.1.1 nbBuild
        3.2.1.2 nbBuild2
      3.2.2 Building the Adaptive Bayes Network Models
        3.2.2.1 abnBuild
        3.2.2.2 abnBuild2
    3.3 Testing the Models
      3.3.1 Model Accuracy
      3.3.2 Model Confusion Matrices
    3.4 Calculating Model Lift
    3.5 Training and Tuning the Models
    3.6 Applying the Models to New Data
    3.7 Chapter Summary
  Chapter 4 Model Results
    4.1 Results of Application to New Data
      4.1.1 Rules Associated with Adaptive Bayes Network Predictions
    4.2 Comparison of Model Results
    4.3 Chapter Summary
  Chapter 5 Interpretation of Results
    5.1 Comparison of Model Results
      5.1.1 Comparison 1
      5.1.2 Comparison 2
      5.1.3 Comparison 3
      5.1.4 Comparison 4
    5.2 Effectiveness of Models
    5.3 Significance of Results
    5.4 Chapter Summary

Section 3 Conclusion
  Chapter 6 Conclusions Drawn from Results
    6.1 Conclusions Regarding Model Results
    6.2 Conclusions Regarding Data
    6.3 Conclusions Regarding Oracle Data Mining
    6.4 Chapter Summary
  Chapter 7 Conclusion
    7.1 Conclusion
    7.2 Possible Extensions to Research

List of Figures
List of Tables
References


Abstract

This project describes an investigation of a commercial data mining suite, the one available with Oracle9i database software.

This investigation was conducted in order to determine the type of results achieved when data mining models were created using Oracle's data mining components and applied to data. Issues investigated in this process included whether the algorithms used in the evaluation found a pattern in a data set, which of the algorithms built the most effective data mining model, the manner in which the data mining models were tested and the effect the distribution of the data set had on the testing process.

Two algorithms in the Classification category, Naïve Bayes and Adaptive Bayes Network, were used to build the data mining models. The models were then tested to determine their accuracy and applied to new data to establish their effectiveness. The results of the testing process and the results of applying the models to new data were analysed and compared as part of this investigation.

A number of conclusions were drawn from this investigation, namely that Oracle Data Mining provides all the functionality necessary to easily build an effective data mining model and that the Adaptive Bayes Network algorithm produced the most effective data mining model. As far as actual results were concerned, the accuracy the models displayed during testing was not a good indication of the accuracy they would display when applied to new data, and the distribution of the target attribute in the data sets had an impact on the data mining models and the testing thereof.


Section 1 Introduction

Chapter 1 Introduction

The purpose of this evaluation is to determine how the Oracle Data Mining suite provides data mining functionality. This involves investigating a number of issues:

1. How easy the tools available with the data mining software are to use and in what ways they provide aspects of data mining like data preparation, building of data mining models and testing of these models.

2. Whether the algorithms selected for this evaluation found a useful pattern in a data set and what happened when the models produced by the algorithms were applied to a new data set.

3. Which of the algorithms investigated built the most effective data mining model and under what circumstances this occurred.

4. How the models were tested and whether test results gave an indication of how the models would perform when applied to new data.

5. Lastly, the manner in which the distribution of the data used to build the data mining models affected the models and how the distribution of the data used to test the models affected the test results.

1.1 Background to Data Mining

Data mining is a relatively new offshoot of database technology which has arisen primarily as a result of the ability of computers to:

- Store vast quantities of data in data warehouses. (Data warehouses differ from operational databases in that the data in a warehouse is historical; the data does not only consist of active records in a database.)
- Implement various algorithms for the mining of data.
- Use these algorithms to analyse these vast quantities of data in a reasonable amount of time.

The ability to store vast amounts of data is of little use if the data cannot somehow be organised in a meaningful way. Data mining achieves this by discovering the patterns in data that represent knowledge and providing some sort of description or abstraction of what is contained in a data set. These patterns allow organisations to learn from past behaviour stored in historical data and exploit those patterns that work best for them.

There are various ways to classify data mining into categories, as suggested by a number of authors. Berry and Linoff [2000] attempt to classify into categories the various techniques of data mining and specify two main categories: directed data mining and undirected data mining. Geatz and Roiger [2003] divide data mining into two categories, supervised and unsupervised learning. Al-Attar [2004] makes a distinction between data mining and data modelling.

Berry and Linoff [2000] suggest considering the goals of the data mining project when classifying data mining and, accordingly, what techniques can be used to fulfil these goals. Prescriptive techniques are useful for making predictions and descriptive techniques help with understanding of a problem space.

According to Berry and Linoff [2000], directed data mining involves using the data to build a model that describes one particular variable of interest in terms of the rest of the data. This category includes techniques such as classification, estimation and prediction. Undirected data mining builds a model with no single target variable, aiming instead to establish the relationships among all the variables. Included in this category are affinity groupings or association discovery, clustering (classification with no predefined data) and description or visualization. [Berry and Linoff, 2000]


Geatz and Roiger [2003] define input variables as independent variables and output variables as dependent variables. It can then be deduced that dependent variables do not exist in unsupervised learning, as no output variable is produced but rather a descriptive relationship. In supervised learning a predictive, dependent variable is produced as output.

According to Al-Attar [2004], data mining results in patterns that are understandable, such as decision trees, rules and associations. Data modelling produces a model that fits the data and that can either be understandable (trees, rules) or be presented as a black box, as in neural networks.

In keeping with these definitions it is possible to say that directed data mining, supervised learning and Al-Attar's [2004] definition of data mining describe similar predictive techniques and fall into the category of supervised learning. Undirected data mining, unsupervised learning and Al-Attar's [2004] data modelling are in the same class as descriptive techniques and fall into the category of unsupervised learning.

1.2 Supervised Learning and Classification Techniques

Algorithms are used to implement the techniques in these various data mining categories. Supervised learning covers techniques that include prediction, classification, estimation, decision trees and association rules. As this evaluation investigates classification techniques, these will be discussed in further detail.

Geatz and Roiger [2003] describe classification as a technique where the dependent or output variable is categorical. The emphasis of the model is to assign new instances of data to categorical classes. The authors describe estimation as a similar technique that is used to determine the value of an unknown output attribute that is numerical. Geatz and Roiger [2003] state that prediction only differs from the two techniques mentioned above in that it is used to determine future outcomes of data. Classification techniques such as these are generally used when there is a set of input and output data, as dependent and independent variables exist in the data.


1.3 Oracle Data Mining (ODM)

Oracle embeds data mining in the Oracle 9i Enterprise Edition version 9.2.0.5.0 database, which allows for integration with other database applications. All data mining functions are provided through the Java API, giving the data miner complete control over the data mining functions. [Oracle9i Data Mining Concepts Release 2 (9.2), 2002]

The Oracle Data Mining suite is made up of two components, the data mining Java API and the Data Mining Server (DMS). [Oracle9i Data Mining Concepts Release 2 (9.2), 2002] The DMS is a server side component that provides a repository of metadata of the input and result objects of data mining. The DMS also provides a connection to the database and access to the data that is mined. It is possible to use JDeveloper 10g to provide the access to the Java API and the DMS. The data mining can then be performed using Data Mining for Java (DM4J) 9.0.4 or by writing Java code. DM4J provides a number of wizards that automatically produce the Java code. [Oracle Data Mining Tutorial, Release 9.0.4, 2004]

1.3.1 Oracle Data Mining Algorithms

ODM supports a number of algorithms, and the choice of algorithm for ODM depends on the data available for mining as well as the format of results required. This project has made use of the Adaptive Bayes Network and Naïve Bayes algorithms, which are Classification algorithms that assign new instances of data to categorical classes and can be used to make predictions when applied to new data.

1.3.2 Functionality of Oracle Data Mining Algorithms and ODM

Mining tasks are available to perform data mining operations using these algorithms; these tasks include building and testing of models, computing model lift and applying models to new data (scoring).

DM4J wizards control the preparation and mining of data as well as evaluation and scoring of models. DM4J has the ability to automatically generate Java and SQL code to transfer the data mining into integrated data mining or business intelligence applications. [Oracle Data Mining for Java (DM4J), 2004]

1.4 Chapter Summary

This chapter introduces the evaluation and describes what is hoped to be achieved by investigating the Oracle Data Mining suite. A short background to data mining is presented and supervised learning and Classification techniques are introduced. A short introduction to ODM is also presented. The next chapter will describe the approach taken by this evaluation and will present reasons for some of the design decisions.


Section 2 Evaluation of Oracle Data Mining

Chapter 2 Methodology of the Evaluation

This chapter aims to provide an explanation of the approach that has been taken during this evaluation. It will explain why ODM was selected as the data mining tool to be evaluated as well as why the Naïve Bayes and Adaptive Bayes Network algorithms were used to build the data mining models. The parameters required by these algorithms are explained and the data used during this evaluation is described.

2.1 Approach

One purpose of this evaluation is to determine what functionality is provided with ODM as well as to ascertain what kinds of models can be produced by ODM. In order to make these discoveries, it is necessary to use a number of algorithms in the data mining suite to build data mining models, to test the accuracy of these models and to validate the results these models produce when applied to new data.

To be able to perform comparisons of the results the models produce, it has been necessary to select two forms of data mining algorithm that fall into the same categories, in this case, supervised learning and classification. For this reason, Naïve Bayes for Classification and Adaptive Bayes Network for Classification have been selected, as both algorithms fall into the supervised learning category and can be used to make predictions. These predictions could then be compared to determine which models, built using the different algorithms, are more effective. Both algorithms allow for building the model, testing the model, computing model lift (providing a measure of how quickly the model finds actual positive target values) and application of the model to new data.

An Oracle 9i Enterprise Edition version 9.2.0.5.0 database was configured and the tools and software for data mining installed and configured for use with the database. For the purposes of this investigation, JDeveloper 10g provides the access to the Java API and the DMS. The data mining itself is performed using DM4J 9.0.4, an extension of JDeveloper that provides the user with a number of wizards that automatically create the Java programs that perform the data mining when these programs are run. [Oracle Data Mining Tutorial, Release 9.0.4, 2004]

The data used during the evaluation was obtained at http://www.ru.ac.za/weather/, which provides an archive of weather data in the Grahamstown area for a number of years. This data was chosen because determining whether a pattern was present in real local data would yield results of more interest than sample data with little relevance to Rhodes University.

The two Classification algorithms were then used to build, test and apply a number of data mining models to the data, and it was then possible to compare the predictions made by each model. During the model building stage it was possible to build the models using prepared and unprepared data, as well as to build models using the different techniques, to determine the effect this had on the results. During testing of the models it was possible to compare the models' accuracy and to measure how quickly the model finds actual positive target values (model lift). Once the models had been built and tested it was possible to apply the models to new data and then compare the predictions made by the models to those of the other models as well as to the actual values in the historical data. It was also of interest to compare the results of testing the models to those of applying the models to new data.

2.2 Choice of Data Mining Tool

The data mining functionality provided with the Oracle9i Enterprise Edition database was chosen for evaluation. An aspect of ODM that supported its use was that all data mining processing occurs within the database. This removes the need to extract data from the database in order to perform the mining, as well as reducing the need for hardware and software to store and manage this data. According to Berger [2004] this results in a more secure and stable data management and mining environment and enhances productivity, as the data does not have to be extracted from the database before it is mined.

ODM uses Java code to build, test and apply the models. It was decided to use DM4J 9.0.4 (an extension of JDeveloper 10g) to conduct the data mining, as DM4J provides wizards that allow the user to adjust the settings for the data mining and automatically generates the Java code that is run when the mining is performed. This functionality allows novice users to use the default settings for the various algorithms, while more advanced users can experiment with the different settings without having to rewrite vast amounts of code. DM4J also provides access to the Oracle 9i database and the data used for the data mining, which allows the user to carry out data preparation within the database using similar wizards. These factors would allow the ease of use of the tools to be evaluated and a determination of how the various stages of the data mining process are supported by ODM.

In the study of related literature it is apparent that a number of authors feel data mining should be conducted in a procedural manner. Al-Attar [2004] feels that a step by step data mining methodology needs to be developed to allow non-experts to conduct data mining, and that this methodology should be repeatable for most data mining projects. This and similar statements show the need for a well defined data mining process to be used by data miners.

Geatz and Roiger [2003] introduce the KDD (Knowledge Discovery and Data Mining) data mining process, where emphasis is placed on data preparation for model building. This involves:

- Identification of the goal to be achieved using data mining.
- Selecting the data to be mined.
- Data preprocessing in order to deal with noisy data.
- Data transformation, which involves the addition or removal of attributes and instances, normalizing of data and type conversions.
- The actual data mining; at this stage the model is built from training and test data sets.
- Interpretation of the resulting model to determine if the results it presents are useful or interesting.
- Application of the model or acquired knowledge to the problem.

When this suggested process is compared to the process used by ODM, as depicted in Figure 1, it is apparent that ODM makes use of similar stages in its data mining and places the necessary emphasis on preparation of data and evaluation of results. This suggests that ODM provides access to the necessary stages involved in conducting a more successful data mining project.

Figure 1. The Oracle Data Mining Process [Berger, 2004]

2.3 The Data

The data used in this evaluation consists of a number of tables that are stored in the Oracle database and available in Appendix B on the CD-ROM that accompanies this project. The data was created from a weather data archive available at http://www.ru.ac.za/weather/ compiled by Jacot-Guillarmod, F. According to the explanation on the web page, the data available at the site represents data gathered at 5 minute intervals throughout a day. Data recorded includes:

- Temperature (degrees F)
- Humidity (percent)
- Barometer (inches of mercury)
- Wind Direction (degrees, 360 = North, 90 = East)
- Wind Speed (MPH)
- High Wind Speed (MPH)
- Solar Radiation (Watts/m^2)
- Rainfall (inches)
- Wind Chill (computed from high wind speed and temperature)

Preparing the data to create the database tables involved removing the reading of rainfall in inches from the records and replacing it with a 'yes' or 'no' value, depending on whether rain had been measured or not. This implies that the 5 minute interval measurements are used to determine whether rain had been recorded on the day the measurements were taken. Although information is lost regarding the amount of rain that had fallen on a specific day, for the purposes of this evaluation it is of interest whether rain fell at all on a specific day, as the predictions made by the algorithms are categorical.
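The report does not reproduce this transformation itself, but its logic is simple to sketch. The following Java fragment is a hypothetical illustration (the class and method names are not from the project code) of collapsing a day's 5 minute rainfall readings into the categorical RAIN value:

    import java.util.List;

    // Hypothetical sketch: collapse a day's 5-minute rainfall readings
    // (inches) into the categorical RAIN value used by the models.
    public class RainLabel {

        // Returns "yes" if any reading that day recorded rain, else "no".
        static String rainLabel(List<Double> fiveMinuteRainfallInches) {
            for (double reading : fiveMinuteRainfallInches) {
                if (reading > 0.0) {
                    return "yes"; // rain was measured at least once that day
                }
            }
            return "no"; // no rain measured in any interval
        }

        public static void main(String[] args) {
            System.out.println(rainLabel(List.of(0.0, 0.0, 0.02))); // yes
            System.out.println(rainLabel(List.of(0.0, 0.0, 0.0)));  // no
        }
    }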

This categorical variable, which was named RAIN, would then be predicted by the models when applied to new data of the same format. The resulting structure of the tables of data is depicted in Table 1.

Name      Data Type   Size   Nulls?
THETIME   NUMBER             NO
TEMP      NUMBER             YES
HUM       NUMBER             YES
BARO      NUMBER             YES
WDIR      NUMBER      3      YES
WSPD      NUMBER             YES
WSHI      NUMBER             YES
SRAD      NUMBER             YES
CHILL     NUMBER             YES
RAIN      VARCHAR     3      YES

Table 1. Mining Data Table Structure

The data set WEATHER_BUILD is used for the building of the data mining models for both algorithms. This data set consists of 2601 records and is created from a number of daily weather archives recorded in September 2004.

The test data set used to evaluate the effectiveness of the models is created from WEATHER_BUILD, and the process of creating this data set will be explained in more detail later in the project.

WEATHER_APPLY consists of 290 records and is the data set to which the built and tested model is applied in order to make predictions. All the actual values of the RAIN attribute had been removed and stored for later comparison. This means the models will predict whether the value of RAIN will be 'yes' or 'no', and it will then be possible to compare these predictions with the actual values in the original data used to create WEATHER_APPLY. The results of the application of the models to the data are stored by DM4J for inspection and use. It is also possible to export the results to spreadsheet format, which has been done in this case to allow for comparison between models and with the actual data values.

2.4 Classification Algorithms

The two algorithms selected for the evaluation were Naïve Bayes and Adaptive Bayes Network. Both are classification algorithms that allow the data miner to build a model using historical data and then apply this model to new data in order to make predictions regarding a dependent, categorical variable in the data. Berger [2004] states that both algorithms should be used in a data mining project to see which algorithm is able to build the better model. This provides a further justification for the comparison of these two algorithms within the data mining suite.

2.4.1 Naïve Bayes


The Naïve Bayes algorithm builds a model that predicts the probability of a variable falling into a categorical class. This is achieved by discovering patterns present in the data and counting the number of times certain conditions or relationships in the data occur. [Berger, 2004] The data mining model represents these relationships and can be applied to new data to make predictions. The algorithm makes use of Bayes' Theorem, which is statistical in nature. [Berger, 2004]

The algorithm is said to provide quicker model building and faster application to new data than the Adaptive Bayes Network algorithm. Naïve Bayes can also be used to make predictions of categorical classes that consist of binary-type outcomes or multiple categories of outcomes. [Berger, 2004]
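As a rough illustration of the counting idea behind Bayes' Theorem, the following minimal Java sketch estimates class probabilities from frequency counts over categorical attributes. It is not ODM's implementation and omits refinements such as smoothing and the threshold settings described in section 2.5.1; the attribute values are invented for the example:

    import java.util.*;

    // A minimal, illustrative Naive Bayes scorer over categorical attributes.
    public class TinyNaiveBayes {
        public static void main(String[] args) {
            // Each row: attribute values, then the class label (e.g. RAIN).
            String[][] rows = {
                {"lowChill", "west", "no"},
                {"lowChill", "east", "yes"},
                {"highChill", "east", "yes"},
                {"highChill", "west", "no"},
                {"lowChill", "west", "no"},
            };
            String[] query = {"lowChill", "east"};

            Map<String, Integer> classCounts = new HashMap<>();
            // key: class|attrIndex|value -> count of that value in the class
            Map<String, Integer> condCounts = new HashMap<>();
            for (String[] row : rows) {
                String label = row[row.length - 1];
                classCounts.merge(label, 1, Integer::sum);
                for (int i = 0; i < row.length - 1; i++) {
                    condCounts.merge(label + "|" + i + "|" + row[i], 1, Integer::sum);
                }
            }

            // Score each class: P(c) times the product of P(attr_i = v_i | c).
            for (String label : classCounts.keySet()) {
                double score = classCounts.get(label) / (double) rows.length;
                for (int i = 0; i < query.length; i++) {
                    int k = condCounts.getOrDefault(label + "|" + i + "|" + query[i], 0);
                    score *= k / (double) classCounts.get(label);
                }
                System.out.printf("score(%s | query) = %.4f%n", label, score);
            }
        }
    }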

2.4.2 Adaptive Bayes Network

The Adaptive Bayes Network model provides similar functionality to that of Naïve Bayes but can also be used to generate rules or decision tree-like outcomes when built and, again, to make predictions when applied to new data. The rules that are generated are easy to interpret, taking the form of "if...then" statements. Berger [2004] states that this algorithm can be used to build better models than Naïve Bayes, but it does require a larger number of parameters to be set and it tends to take a longer time to build such a model.

2.5 Algorithm Settings

2.5.1 Naïve Bayes Settings

Naïve Bayes works by looking at the build data and calculating conditional probabilities for the target value. This is done by observing the frequency of certain attribute values and combinations thereof. [Oracle Data Mining Tutorial, Release 9.0.4, 2004] The two parameters that must be supplied to the Naïve Bayes build wizard, as shown in Figure 2, indicate how outliers in the data should be treated; occurrences below the threshold values are ignored when creating the model. [Oracle Data Mining Tutorial, Release 9.0.4, 2004]

The singleton threshold value provides a threshold for the count of items that occur frequently in the data. Given k as the number of times the item occurs in the data, P as the number of records and t as the singleton threshold expressed as a percentage of P, the item is considered to occur frequently if k >= t*P. [Oracle Help for Java, 1997-2004]

The pairwise threshold provides a threshold for the count of pairs of items that occur frequently in the data. Given k as the number of times two items appear together in the records, and P and t as above, a pair is frequent if k > t*P. [Oracle Help for Java, 1997-2004]
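Assuming t is supplied as a fraction of the record count P, these two checks amount to the following small Java sketch (illustrative only; the example values are hypothetical):

    // Illustrative check of the Naive Bayes frequency thresholds.
    public class BayesThresholds {
        static boolean singletonFrequent(int k, int P, double t) {
            return k >= t * P; // item counted frequent if k >= t*P
        }

        static boolean pairFrequent(int k, int P, double t) {
            return k > t * P;  // pair counted frequent if k > t*P
        }

        public static void main(String[] args) {
            int P = 1951;        // e.g. the number of build records
            double t = 0.001;    // hypothetical threshold value
            System.out.println(singletonFrequent(5, P, t)); // true: 5 >= 1.951
            System.out.println(pairFrequent(1, P, t));      // false: 1 < 1.951
        }
    }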

Figure 2. Naïve Bayes algorithm settings

2.5.2 Adaptive Bayes Network Settings

Adaptive Bayes Network works by ranking the attributes in a data set and then building a Naïve Bayes model in order of the ranked attributes. The algorithm then builds a set of features or 'trees' using these attributes, which are in turn tested against the model in order to determine whether they improve the accuracy of the model or not. If no improvement is found, the feature is discarded. When the number of discarded features reaches a certain level the building stops, and the model consists of those features that remain. [Oracle Data Mining Tutorial, Release 9.0.4, 2004]
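The following Java sketch caricatures this greedy build loop: candidate features are tried in ranked order, kept only if they improve accuracy, and building stops after too many consecutive discards. It illustrates the idea only; the actual Adaptive Bayes Network algorithm is considerably more involved, and the toy accuracy function is invented:

    import java.util.*;
    import java.util.function.ToDoubleFunction;

    // Simplified greedy feature selection in the spirit described above.
    public class GreedyFeatureBuild {

        static <F> List<F> build(List<F> rankedCandidates,
                                 ToDoubleFunction<List<F>> accuracyOf,
                                 int maxConsecutiveFailures) {
            List<F> model = new ArrayList<>();
            double bestAccuracy = accuracyOf.applyAsDouble(model);
            int failures = 0;
            for (F candidate : rankedCandidates) {
                model.add(candidate);
                double accuracy = accuracyOf.applyAsDouble(model);
                if (accuracy > bestAccuracy) {
                    bestAccuracy = accuracy; // keep the feature
                    failures = 0;
                } else {
                    model.remove(model.size() - 1); // discard the feature
                    if (++failures >= maxConsecutiveFailures) break;
                }
            }
            return model; // the model is the features that remain
        }

        public static void main(String[] args) {
            // Toy accuracy function: BARO is pretended to hurt accuracy.
            List<String> candidates = List.of("CHILL", "WDIR", "SRAD", "BARO");
            List<String> model = build(candidates,
                    m -> 0.5 + 0.1 * m.size() - (m.contains("BARO") ? 0.5 : 0), 2);
            System.out.println(model); // [CHILL, WDIR, SRAD]
        }
    }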

The choice of settings when building an Adaptive Bayes Network model allows the user to choose from three types of models: SingleFeatureBuild, MultiFeatureBuild and NaiveBayesBuild.

The SingleFeatureBuild model produces rules in an "if...then" format and produces only one feature. The parameters required by this type of model are shown in Figure 3 and include the maximum depth of the feature (number of attributes in the feature) and the number of predictors to use during the building of the model. It is then possible for the algorithm to determine which attributes to include in the feature and how many to include, up to the specified maximum. [Oracle Help for Java, 1997-2004] A greater feature depth, as well as a greater number of predictors included, will result in a slower model building process.

The MultiFeatureBuild model does not generate any rules. This model builds a form of Naïve Bayes model and creates one or more features made up of a number of attributes. The parameters required by this kind of model are the maximum number of features to build and, as with the SingleFeatureBuild model type, the maximum number of predictors or attributes to use while the model is built. Also to be specified are the maximum number of failures to allow when a feature is tested against model accuracy before it is discarded, and the number of attributes allowed in a feature. [Oracle Help for Java, 1997-2004] Again, a greater feature depth, a greater number of predictors and a greater number of failures allowed will result in a slower model build process.

Figure 3. Adaptive Bayes Network algorithm settings

The NaiveBayesBuild model type does not generate rules either and, like the MultiFeatureBuild, also builds a form of Naïve Bayes model. The maximum number of predictors to consider during the build process must be specified by the user. [Oracle Help for Java, 1997-2004] Again, the greater the number of predictors the algorithm must consider, the slower the model building will be.


The type of model created in all the Adaptive Bayes Network models in this evaluation was the SingleFeatureBuild. This model type was chosen because, in the explanations of the model types, it appears to be the one that results in a model least similar to a Naïve Bayes model. Also, it is the only model type that produces rules, and the rules produced by the model would be of interest in determining what aspects of the data influenced the predictions made by the model.

2.6 Chapter Summary

This chapter has described what it was hoped would be achieved by building data mining models using the Naïve Bayes and Adaptive Bayes Network algorithms in the Oracle Data Mining suite. The reasons for selecting Oracle Data Mining for this research have been highlighted. The models built using the algorithms have been outlined and the parameters required by each algorithm have been described. The source of the data used for this evaluation has been explained, as well as how the data sets for the data mining were created.

The next chapter will describe the process of preparing the data, building the models, testing the models and training and tuning the models.


Chapter 3 Classification Models

This chapter describes the process of building the Classification models. The process of preparing the data to create the build and test data sets is discussed and the Priors technique is introduced. The actual model building is explained in this chapter. The model testing process is described, including aspects like model accuracy, confusion matrices and model lift. The process of training and tuning the models to increase their effectiveness is explained. This chapter provides insight into how ODM provides data mining functionality.

3.1 Preparing the Data

3.1.1 Build and Test Data Sets

Pyle [2000] emphasises the importance of proper data preparation for data mining and says the benefits of mining properly prepared data include the faster creation of more effective models. He states that at least two outputs are required from data preparation: the training data set, which is used for building the model, and the testing data set, which helps detect overtraining (noise trained into the model). These data sets are used by the data mining suite later in the data mining process.

In the case of this evaluation it was necessary to use the data in WEATHER_BUILD to create the training and testing data sets. DM4J provides a tool which allows the user to create randomized build and test tables from the existing data. The wizard is known as the Transformation Split wizard and is specifically developed for use with Classification models.

The wizard allows the user to select which data is to be used to create the new tables, as well as to specify what percentage of records in the original data should be allocated to each of the build and test tables. WEATHER_BUILD was used as the original data; 75% of the records were allocated to the build table and 25% were placed in the test table. That is, 1951 records were randomly selected from WEATHER_BUILD and placed in the build table and the remaining 650 records were placed in the test table. These ratios were chosen because the varying nature of the weather data meant it would be more beneficial to have a larger number of cases in the build data set, thus allowing the data mining model to be aware of a larger number of cases that influenced the target attribute RAIN.

The wizard produced the Transformation Split component, which was run; the resulting tables were named THE_BUILD and THE_TEST and were stored in the database along with the original data.
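Conceptually, the split performed by the wizard amounts to the following Java sketch (a stand-in for the wizard's internals, not its actual code; the records here are placeholder integers):

    import java.util.*;

    // Sketch of a randomized 75/25 build/test split of 2601 source records.
    public class BuildTestSplit {
        public static void main(String[] args) {
            List<Integer> records = new ArrayList<>();
            for (int i = 0; i < 2601; i++) records.add(i); // stand-in rows

            Collections.shuffle(records, new Random(42));  // randomize order
            int buildSize = (int) Math.round(records.size() * 0.75);

            List<Integer> theBuild = records.subList(0, buildSize);
            List<Integer> theTest = records.subList(buildSize, records.size());
            System.out.println(theBuild.size() + " build, " + theTest.size() + " test");
            // prints "1951 build, 650 test" for the 2601 source records
        }
    }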

3.1.2 Priors

In a number of scenarios where the variable that is being predicted is binary in nature, one outcome of this variable may occur more frequently in the data than the other. When the model is built from such data, the model may not observe enough of the one case to build an accurate model and may predict the other case nearly every time, yet still show a high accuracy during testing. In order to prevent this from occurring, it is necessary to create a build table that has approximately equal numbers of each outcome and also to supply the algorithm with the original distribution of the data, or the prior distribution. This technique, known as Priors, should result in a more effective model. However, the model must be tested against data of the original distribution. [Oracle Data Mining Tutorial, Release 9.0.4, 2004]
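One common way of compensating for a balanced build set, and a plausible reading of what supplying the prior distribution achieves, is to re-weight the model's probabilities by the ratio of original to balanced priors. The arithmetic is sketched below; this illustrates the idea only, is not a description of ODM's internal method, and uses approximate prior values:

    // Sketch of prior-distribution correction for a model trained on
    // balanced data but applied to data with the original distribution.
    public class PriorsAdjustment {

        // pBalanced: P(yes | x) from the balanced-data model.
        // priorOrig / priorBal: P(yes) in the original and balanced sets.
        static double adjustedYesProbability(double pBalanced,
                                             double priorOrig, double priorBal) {
            double yes = pBalanced * priorOrig / priorBal;
            double no = (1 - pBalanced) * (1 - priorOrig) / (1 - priorBal);
            return yes / (yes + no); // renormalise over both classes
        }

        public static void main(String[] args) {
            // Roughly the situation in this evaluation: about 37% 'yes'
            // originally, about 50% 'yes' after stratified sampling.
            System.out.println(adjustedYesProbability(0.60, 0.37, 0.50));
            // prints ~0.468: the 'yes' score is dampened toward the true prior
        }
    }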

In order to determine the effect of using such a technique as a form of data preparation, it was decided to build models using both algorithms that would use data prepared in this way. When the data in the THE_BUILD data set was examined, it was apparent that the 'no' outcome occurred more frequently than the 'yes' for the target attribute RAIN, as shown in Figure 4. An outcome of 'no' occurred 1242 times and a 'yes' occurred 727 times.

[Figure: histogram showing bin counts for RAIN values 'yes' and 'no' in THE_BUILD]

Figure 4. Data Distribution for RAIN Attribute from THE_BUILD data set

It was possible to create a build data set with a more even distribution of the target attribute. This was accomplished using the ODM browser and a Transformation wizard, which created a stratified sample of the data with a balanced distribution of the target attribute. Stratified random sampling divides the data set into subpopulations, and samples are then taken from these in proportion to subpopulation size. [Fernandez, 2003] As there were 727 cases of 'yes' for the RAIN attribute, creating a balanced data set would require a data set of approximately twice that size (1454). This data set was created by the wizard and named THE_BUILD1. When the distribution of the RAIN attribute was inspected again, a more balanced distribution was shown, as depicted in Figure 5.


[Figure: histogram showing bin counts for RAIN values 'yes' and 'no' in THE_BUILD1]

Figure 5. Data Distribution for RAIN Attribute from THE_BUILD1 data set
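A balanced build set of this kind can be sketched in a few lines of Java: keep every record of the minority outcome and randomly sample an equal number of majority records. This approximates the effect of the wizard's stratified sample rather than reproducing its implementation:

    import java.util.*;

    // Sketch: balance a build set by downsampling the majority outcome.
    public class BalancedSample {
        public static void main(String[] args) {
            List<String> labels = new ArrayList<>();
            for (int i = 0; i < 1242; i++) labels.add("no");
            for (int i = 0; i < 727; i++) labels.add("yes");

            List<String> yes = new ArrayList<>();
            List<String> no = new ArrayList<>();
            for (String l : labels) (l.equals("yes") ? yes : no).add(l);

            Collections.shuffle(no, new Random(7));
            List<String> balanced = new ArrayList<>(yes);  // all 727 'yes'
            balanced.addAll(no.subList(0, yes.size()));    // 727 sampled 'no'
            System.out.println(balanced.size());           // 1454 records
        }
    }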

The data sets THE_BUILD and THE_BUILD1 were used to build models using each algorithm, and these were tested on the same test data, THE_TEST, in order to allow for an evaluation of the effect the distribution of the data has on the resulting models.

3.2 Building the Models

In total, 8 classification models were built using DM4J: four using the Naïve Bayes for Classification algorithm and four using the Adaptive Bayes Network for Classification algorithm. Of the four for each algorithm, two models were built using the data set THE_BUILD, where the Priors technique was not made use of (weighting was used in one of the two), and two models were built using THE_BUILD1, using the Priors technique (again, weighting was used in one of the two). Weighting and its effects on the models will be discussed later in this chapter. All the models were built using the attribute RAIN as the target value. This means the models were built in order to predict the outcome, 'yes' or 'no', of RAIN when applied to new data.

3.2.1 Building the Naïve Bayes Models

3.2.1.1 nbBuild


The first model built was named nbBuild and used the data set THE_BUILD, which had the uneven distribution of the target attribute RAIN (as discussed in section 3.1.2). The Naïve Bayes algorithm was used with the default algorithm settings: a singleton threshold of 0.1 and a pairwise threshold of 0.1.

3.2.1.2 nbBuild2

The second model was named nbBuild2 and made use of the data set THE_BUILD1, which was adjusted using stratified sampling and the Priors technique to have an even distribution of the target value RAIN. When making use of the Priors technique it was necessary to specify in the model build wizard what the original distribution of the data had been, in order for the algorithm to be aware of this when making its classifications. The values supplied at this stage of the model build process are shown in Figure 6. Again, the default algorithm settings of 0.1 for the pairwise and singleton thresholds were used.

3.2.2 Building the Adaptive Bayes Network Models

3.2.2.1 abnBuild

The third model was named abnBuild and made use of the data set THE_BUILD. The Adaptive Bayes Network algorithm was used and a model type of SingleFeatureBuild was selected. This model type produces rules along with its predictions. The settings for the model type were left at the defaults. These settings included a maximum number of predictors of 25, a maximum network feature depth of 10 and no time limit for the running of the algorithm.

3.2.2.2 abnBuild2

The fourth model was named abnBuild2 and made use of THE_BUILD1, the adjusted data set. Again it was necessary to specify the distribution of the original data set, as shown in Figure 6. A SingleFeatureBuild model type was selected and the default settings as described above were used.


Figure 6. Extract from Classification Model Build Wizard, Priors Settings.

3.3 Testing the Models

Roiger and Geatz [2003] state that evaluation of supervised learning models involves determining the level of predictive accuracy and that supervised learning models can be evaluated using test data sets. Such models can be evaluated by comparing the test set error rates of supervised learning models created from the same training data to determine the accuracy of the models and which model is most effective. It is of interest how ODM supports testing, whether the accuracy a model displays during testing indicates how it will perform on new data, and how the data used during testing affects the results of testing the models.

The test model results produced by DM4J are depicted in confusion matrices. Confusion matrices can be used to determine the accuracy of Classification models and to show the number of false negative or false positive predictions made by the model on the test data. Confusion matrices are best used for evaluating the accuracy of models using categorical data, which is the case here. [Roiger and Geatz, 2003]

Roiger and Geatz [2003] provide an example of a confusion matrix, shown in Table 2, in which Model A is used to classify categorical data into two classes, Accept and Reject. The rows in the table represent the actual values in the data and the columns represent the predicted values. The model correctly classified 600 Accept instances from the data and correctly classified 300 Reject instances. However, there were actually 625 Accept instances in the data and 375 Reject instances. The model also classified 675 instances as Accept and 325 instances as Reject. The accuracy of the model is then determined by dividing the 900 correct classifications (600 + 300) by the 1000 total instances, giving an accuracy of 90% or an error rate of 10%.

Example Model   Predicted Accept   Predicted Reject
Actual Accept   600                25
Actual Reject   75                 300

Table 2. Example Confusion Matrix
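The accuracy calculation in this worked example generalises to any confusion matrix: correct predictions lie on the diagonal, so accuracy is the diagonal sum over the total. A small Java sketch:

    // Accuracy from a confusion matrix indexed as matrix[actual][predicted].
    public class ConfusionMatrixAccuracy {

        static double accuracy(int[][] matrix) {
            int correct = 0, total = 0;
            for (int actual = 0; actual < matrix.length; actual++) {
                for (int predicted = 0; predicted < matrix[actual].length; predicted++) {
                    total += matrix[actual][predicted];
                    if (actual == predicted) correct += matrix[actual][predicted];
                }
            }
            return correct / (double) total;
        }

        public static void main(String[] args) {
            int[][] modelA = {{600, 25}, {75, 300}}; // Table 2
            System.out.println(accuracy(modelA));    // 0.9, i.e. 90% accuracy
        }
    }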

3.3.1 Model Accuracy

The four models discussed in the previous section were each tested on the same test data set, THE_TEST, consisting of 633 records. The test accuracy for each model is shown in Table 3. It is interesting to note the greater accuracy of the models built using the Adaptive Bayes Network algorithm, and that using the prior distribution technique appears to have had a negative impact on the test accuracy of the models.

Model           nbBuild     nbBuild2    abnBuild    abnBuild2
Test Accuracy   72.35387%   71.09005%   85.15008%   84.9921%

Table 3. Model Test Accuracy Rates


3.3.2 Model Confusion Matrices

Testing the models produced a confusion matrix for each model which, when examined, showed the tendencies of the individual model's predictions. Tables 4-7 depict each model's confusion matrix, each of which is then discussed. Again, the rows represent actual values and the columns represent the predicted values.

nbBuild      Predicted no   Predicted yes
Actual no    384            34
Actual yes   141            74

Table 4. Confusion Matrix for Model nbBuild Testing

When nbBuild was tested, the model correctly predicted the value of the RAIN attribute in 384 + 74 = 458 cases out of 633. As can be seen in the lower left corner of the matrix, the model also incorrectly predicts a large number (141) of 'no' values that are actually 'yes' values. This error will be adjusted for when the model is tuned.

nbBuild2     Predicted no   Predicted yes
Actual no    320            98
Actual yes   85             130

Table 5. Confusion Matrix for Model nbBuild2 Testing

The nbBuild2 model correctly predicted the value of the RAIN attribute in 320 + 130 = 450 cases out of 633. When tested, this model shows less of a tendency towards errors in one particular direction, i.e. 'yes' or 'no', as the false prediction counts of 98 and 85 are close. This can be attributed to the fact that the model was built using the Priors technique, to compensate for the lower proportion of 'yes' values for RAIN in the original data.

abnBuild     Predicted no   Predicted yes
Actual no    353            65
Actual yes   29             186

Table 6. Confusion Matrix for Model abnBuild Testing

The abnBuild model correctly predicted the value of the RAIN attribute in 353 + 186 = 539 cases out of 633. This model shows a higher accuracy during testing than the previous models built using Naïve Bayes. Testing also shows that this model makes a larger number of incorrect 'yes' predictions. This effect could also be minimised during tuning.

abnBuild2    Predicted no   Predicted yes
Actual no    346            72
Actual yes   23             192

Table 7. Confusion Matrix for Model abnBuild2 Testing

The abnBuild2 model correctly predicted the value of the RAIN attribute in 346 + 192 = 538 cases out of 633. Similarly, this model shows a higher accuracy during testing than those models built using Naïve Bayes. This model also tends to make a larger number of incorrect 'yes' predictions. This too could be dealt with during model tuning.

Once the accuracy of the models has been tested, it is possible to perform another kind of model testing using cumulative gains charts or lift charts.

3.4 Calculating Model Lift

A lift or cumulative gains chart shows how well the model improves predictions of positive target attribute outcomes over a sample of the data containing actual results. The usefulness of such a technique would be apparent in a business problem where predicted positive values in a model may indicate possible business opportunities. Lift allows the miner to estimate how well the model will perform when applied to new data. [Oracle Help for Java, 1997-2004]
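DM4J computes and charts lift automatically; for reference, the standard calculation behind such a chart can be sketched as follows (sort records by predicted probability of the positive class, then compare each cumulative decile's positive rate with the overall rate). The example data is invented:

    import java.util.*;

    // Sketch of cumulative lift per decile, as plotted in lift charts.
    public class CumulativeLift {

        // scores[i]: model probability of 'yes'; actual[i]: true outcome.
        static double[] lift(double[] scores, boolean[] actual) {
            Integer[] order = new Integer[scores.length];
            for (int i = 0; i < order.length; i++) order[i] = i;
            Arrays.sort(order, (a, b) -> Double.compare(scores[b], scores[a]));

            int totalPositives = 0;
            for (boolean a : actual) if (a) totalPositives++;
            double overallRate = totalPositives / (double) actual.length;

            double[] lifts = new double[10];
            int seen = 0, positivesSeen = 0;
            for (int d = 0; d < 10; d++) {
                int end = (d + 1) * actual.length / 10;
                for (; seen < end; seen++) if (actual[order[seen]]) positivesSeen++;
                lifts[d] = seen == 0 ? 0
                        : (positivesSeen / (double) seen) / overallRate;
            }
            return lifts; // a first entry of ~2.4 would match nbBuild's chart
        }

        public static void main(String[] args) {
            double[] scores = {0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.15, 0.1, 0.05, 0.01};
            boolean[] actual = {true, true, false, false, false,
                                false, false, false, false, false};
            System.out.println(Arrays.toString(lift(scores, actual)));
        }
    }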


Figure 7. nbBuild Lift Chart

Figure 7 shows the cumulative lift chart for nbBuild when applied to the test data set, THE_TEST. The value in the first column, approximately 2.4, indicates that the model should find approximately 2.4 times as many actual positive values for the RAIN attribute as a random selection of 10% of the data would show.

Figure 8. nbBuild2 Lift Chart


Figure 8 depicts the cumulative lift chart for nbBuild2 when applied to the test data set. In the first and second columns the graph indicates that the model should find approximately 2.4 times as many positive values as random selection would.

Figure 9. abnBuild Lift Chart

Figure 9 shows the cumulative lift chart for abnBuild when applied to the test data set. The value of approximately 2.6 indicates that the model finds approximately 2.6 times as many positive values as random selection would.


Figure 10. abnBuild2 Lift Chart

Figure 10 shows the cumulative lift chart for abnBuild2 when applied to the test data set. Similarly, the value of approximately 2.6 indicates that the model finds approximately 2.6 times as many positive values as random selection would.

It is evident from the above charts that, although the accuracy of the models is not high in all cases, when applied to new data they should provide a far greater level of accuracy than attempting to make predictions using no model at all.

3.5 Training and Tuning the Models

Using ODM it is possible to assign weights to the target value when using Naïve Bayes or Adaptive Bayes Network, so that the model predicts more of one kind of outcome if it appears that there are a large number of false predictions of a certain kind when testing the model. [Oracle Data Mining Tutorial, Release 9.0.4, 2004] This bias can be built into the model to increase predictions of the desired target value.

In this investigation weighting was used to introduce this bias because, when testing the nbBuild model, it was apparent from the confusion matrix that a significant error was encountered: the model predicted a large number of false negatives, that is, 'no' values for the target attribute RAIN that were in fact 'yes' values. These predictions were false in 141 of the cases. This level of false predictions was high, thus it was viable to use weighting in order to decrease the number of false negative predictions.

A weighting value is often chosen by trial and error and is then associated with a certain type of prediction, false negative or positive; the model will then treat a false prediction of that kind as 'the weighting value' times as costly as an error of the other kind. This forces the model to make more predictions in the other direction. [Oracle Data Mining Tutorial, Release 9.0.4, 2004]
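The effect of such a weighting on an individual prediction can be sketched as a cost comparison: with false negatives weighted 3, a 'yes' probability that would otherwise be too low to win can still produce a 'yes' prediction. The following Java fragment is illustrative only; ODM applies its weights internally during build and apply:

    // Sketch of a cost-sensitive decision under asymmetric error weights.
    public class CostSensitiveDecision {

        static String predict(double pYes, double falseNegativeWeight,
                              double falsePositiveWeight) {
            // Expected cost of predicting 'no' = weight * P(actually yes);
            // expected cost of predicting 'yes' = weight * P(actually no).
            double costOfNo = falseNegativeWeight * pYes;
            double costOfYes = falsePositiveWeight * (1 - pYes);
            return costOfYes <= costOfNo ? "yes" : "no";
        }

        public static void main(String[] args) {
            // Unweighted, 0.4 yields 'no'; a weighting of 3 flips it to 'yes'.
            System.out.println(predict(0.4, 1, 1)); // no
            System.out.println(predict(0.4, 3, 1)); // yes
        }
    }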

As it was apparent during testing that nbBuild predicted a large number of false negatives, and as this was the most substantial error out of all the models, it was decided to build another four models, two more for each algorithm, which incorporated a weighting of 3 against false negatives. The weighting value of 3 was chosen after some experimentation and used on all the new models, as shown in the extract of the model build wizard for abnBuild4 in Figure 11. The Priors technique was used in one case for each algorithm.

The models were then tested on the same test data set, THE_TEST, on which the previous models had been tested. Table 8 presents the previous models' test accuracy rates and Table 9 presents the new weighted models' test accuracy rates.

Model           nbBuild (no Priors)   nbBuild2    abnBuild (no Priors)   abnBuild2
Test Accuracy   72.35387%             71.09005%   85.15008%              84.9921%

Table 8. Unweighted Models' Test Accuracy Rates

Model           nbBuild3 (no Priors)   nbBuild4    abnBuild3 (no Priors)   abnBuild4
Test Accuracy   72.511846%             68.24645%   77.40916%               77.40916%

Table 9. Weighted Models' Test Accuracy Rates


Figure 11. Extract Showing Weighting of Model Build Wizard for abnBuild4

In only one case, nbBuild3, did weighting improve model test accuracy when compared to the model with the same settings, nbBuild, before weighting was added. It is of interest to compare the confusion matrices for these two models.

nbBuild      Predicted no   Predicted yes
Actual no    384            34
Actual yes   141            74

Table 10. nbBuild Confusion Matrix


nbBuild3     Predicted no   Predicted yes
Actual no    381            37
Actual yes   137            78

Table 11. nbBuild3 Confusion Matrix

Table 10 shows the confusion matrix for nbBuild and Table 11 shows the confusion matrix for the weighted model nbBuild3. nbBuild3 was weighted 3 against false negatives. The effects of this weighting are shown in the decrease of false 'no' predictions, from 141 to 137, the increase in correct 'yes' predictions, from 74 to 78, and the increase in false 'yes' predictions, from 34 to 37. The effect of the weighting seems minimal but can be increased by increasing the value of the weighting. However, since the weighting appears to have had a negative impact on the test accuracy of other models, it was decided to leave the value at 3.

3.6 Applying the Models to New Data

At this stage it is necessary to provide a summary of the models built and tested thus far. This summary is provided in Table 12.

Classification Algorithm         Naïve Bayes   Adaptive Bayes Network
No weighting, no use of Priors   nbBuild       abnBuild
No weighting, use of Priors      nbBuild2      abnBuild2
Weighting, no use of Priors      nbBuild3      abnBuild3
Weighting, use of Priors         nbBuild4      abnBuild4

Table 12. Summary of Classification Models

The models were applied to the new data in the WEATHER_APPLY set. The results were depicted according to the unique THE_TIME attribute for each record and showed a prediction, 'yes' or 'no', of whether it was likely to rain. The results were exported to spreadsheets to allow for inspection, and the comparisons are discussed in the following chapters.


3.7 Chapter Summary

The eight Classification models that have been built have been discussed, and it is apparent that the algorithms have found a pattern in the data. The support ODM provides for the process of preparing the data to build the models has been described and the Priors technique has been explained. The model testing process has been described and has given an indication of the accuracy of the models. It will be interesting to compare this accuracy with the accuracy the models exhibit when applied to new data. Model lift has been calculated for the models. Four of the models have been tuned by introducing weighting into the models. The models have been applied to new data, and the results of this are described in the next chapter.


Chapter 4 Model Results

This chapter describes the results obtained when the models were applied to new data. Extracts of the results are provided to show how these can be interpreted. The rules associated with the predictions made by the Adaptive Bayes Network models are explained. As a form of external validation, the predictions made by the models are compared to the actual values in the original data. The results of this validation are compared for the eight models in order to determine which model is most effective when applied to new data and with what settings this model was built.

4.1 Results of Application to New Data

The eight classification models were applied to the new data in the WEATHER_APPLY data set. This data set consisted of 290 records, all of which had had the value for the RAIN attribute removed. These values had been stored for later comparison. The results were depicted by THE_TIME attribute and showed a prediction, 'yes' or 'no', of whether it was likely to rain for all 290 records. The probability of this prediction was also depicted, as shown in a sample from the results for nbBuild in Table 13. The results in this extract can be interpreted as follows: at THE_TIME attribute with value 1, it is predicted that no rain will have been measured, and this prediction is given with a probability of 0.9999. At THE_TIME attribute with value 138 it is predicted that rain will have been measured, with a probability of 0.6711.

PREDICTION   PROBABILITY   THE_TIME
no           0.9999        1
yes          0.6711        138

Table 13. Extract of results from model nbBuild
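For readers who wish to experiment with this style of prediction outside ODM, the idea can be sketched in a few lines of Python using scikit-learn. This is an illustrative stand-in, not the ODM implementation; the data values and bin indices are hypothetical.

import pandas as pd
from sklearn.naive_bayes import CategoricalNB

# Hypothetical build data: binned CHILL and WDIR values and the RAIN target.
weather = pd.DataFrame({
    "CHILL": [0, 0, 1, 1, 0, 1],          # bin index, e.g. 0 = 37-46.6
    "WDIR":  [3, 1, 2, 3, 0, 2],          # bin index for wind direction ranges
    "RAIN":  ["no", "no", "yes", "yes", "no", "yes"],
})

model = CategoricalNB()
model.fit(weather[["CHILL", "WDIR"]], weather["RAIN"])

# Score new records, analogous to applying a model to WEATHER_APPLY.
apply_set = pd.DataFrame({"CHILL": [0, 1], "WDIR": [3, 2]})
for prediction, probabilities in zip(model.predict(apply_set),
                                     model.predict_proba(apply_set)):
    print(prediction, probabilities.max())   # prediction and its probability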


Those models that were weighted provided predictions and cost figures. The cost figure is provided instead of a probability because the model makes its predictions based on the cost an incorrect prediction imposes on the model’s accuracy. This cost figure is determined by the weighting of a certain type of false prediction when the model is tuned, and the algorithm then attempts to minimise costs when making predictions. An extract from these types of results is shown in Table 14. This extract can be interpreted as follows: at THE_TIME value 1, it is predicted that no rain will have been measured, and the cost of such a prediction is 0; at THE_TIME value 138, it is predicted that rain will have been measured, and if this prediction is incorrect the cost is higher, at 0.3288, because the target value ‘yes’ was weighted to avoid false negatives. Low cost can be interpreted as high probability, as can be seen by comparing the two extracts, but it is not possible to calculate probability directly from cost. [Oracle Data Mining Tutorial, Release 9.0.4, 2004]

PREDICTION   COST     THE_TIME
no           0        1
yes          0.3288   138

Table 14. Extract of results from model nbBuild3
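The relationship between probability and cost can be illustrated with a small calculation. The sketch below is an interpretation using a standard expected-cost rule, not ODM's internal code: each candidate prediction is charged the probability-weighted cost of being wrong, and the cheapest prediction is chosen. The probabilities are taken from the record at THE_TIME 138 in Table 13.

import numpy as np

# Cost matrix: rows = actual class, columns = predicted class, order ['no', 'yes'].
# A false negative (actual 'yes' predicted as 'no') costs 3 times a false positive.
cost_matrix = np.array([[0.0, 1.0],
                        [3.0, 0.0]])

# Posterior probabilities P(no), P(yes) for the record at THE_TIME 138.
posterior = np.array([0.3289, 0.6711])

expected_cost = posterior @ cost_matrix          # cost of predicting 'no', 'yes'
prediction = ["no", "yes"][int(expected_cost.argmin())]
print(prediction, expected_cost.min())

With these numbers the minimum-cost prediction is ‘yes’ at an expected cost of roughly 0.3289, which is consistent with the cost of 0.3288 reported for that record in Table 14.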

4.1.1 Rules Associated with Adaptive Bayes Network Predictions

Those models that were built using the Adaptive Bayes Network algorithm provide results in the same format as shown in Tables 13 and 14 but also provide the rule with which each prediction was made. These rules are generated during the model build stage, and predictions are then made using them when the model is applied to new data. However, not all of the generated rules are used at this stage. The format of these results is shown in Table 15.

PREDICTION   PROBABILITY   RULE_ID   THE_TIME
no           0.5418        52        1
yes          0.6677        53        138

Table 15 Extract of Results from model abnBuild showing rules


After inspecting the spreadsheets containing the results of those models built using the Adaptive Bayes Network algorithm, it was apparent that when the models were applied to the new data, only 8 of the 61 rules generated during the model building process were used to make the predictions. These 8 rules are listed in Table 16.

Rule ID   If (Condition)                                       Then (Classification)   Confidence   Support
2         CHILL in (37 - 46.6)                                 no                      0.63258135   0.104113765
38        CHILL in (37 - 46.6) and WDIR in (22 - 89.6)         yes                     0.6427132    0.019299136
43        CHILL in (37 - 46.6) and WDIR in (89.6 - 157.2)      yes                     0.94884205   0.019807009
44        CHILL in (46.6 - 56.2) and WDIR in (89.6 - 157.2)    yes                     0.9037015    0.014728288
48        CHILL in (46.6 - 56.2) and WDIR in (157.2 - 224.8)   yes                     0.8486806    0.031488065
52        CHILL in (37 - 46.6) and WDIR in (224.8 - 292.4)     no                      0.54187334   0.019807009
53        CHILL in (46.6 - 56.2) and WDIR in (224.8 - 292.4)   yes                     0.6677172    0.1777552
57        CHILL in (37 - 46.6) and WDIR in (292.4 - 360)       no                      0.93961054   0.0726257

Table 16 Rules used by Adaptive Bayes Network Models to Make Predictions

These rules can be interpreted as follows for rule 52:

IF
    CHILL in (37 - 46.6) and WDIR in (224.8 - 292.4)
THEN
    RAIN equal (no)
Confidence = 0.54187334
Support = 0.019807009

The support value given with a rule indicates the percentage of cases in the build data set that meet the conditions of the rule and have the same value of the target attribute as the rule predicts. The confidence value indicates the improvement in the accuracy of the model made by adding the rule. [Oracle Data Mining Tutorial, Release 9.0.4, 2004]
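Under the conventional definitions of these measures (support as the fraction of build records that satisfy the rule’s condition and carry its predicted class, confidence as the fraction of condition-matching records that carry the class), such figures can be recomputed from a data set. The pandas sketch below is illustrative only, with hypothetical stand-in values; note that ODM’s confidence is described above as a model-improvement measure, so this shows the related conventional definitions rather than ODM’s exact computation.

import pandas as pd

# Hypothetical build records carrying the attributes used by rule 52.
weather = pd.DataFrame({
    "CHILL": [40.0, 42.5, 50.1, 38.7, 45.0, 55.2],
    "WDIR":  [230.0, 250.4, 240.0, 300.0, 260.1, 100.0],
    "RAIN":  ["no", "no", "yes", "no", "yes", "yes"],
})

# Condition of rule 52: CHILL in (37 - 46.6) and WDIR in (224.8 - 292.4).
condition = weather["CHILL"].between(37, 46.6) & weather["WDIR"].between(224.8, 292.4)
is_no = weather["RAIN"] == "no"                  # the class rule 52 predicts

support = (condition & is_no).mean()             # fraction of all build records
confidence = is_no[condition].mean()             # fraction of matching records
print(f"support={support:.3f} confidence={confidence:.3f}")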

4.2 Comparison of Model Results

Once the models had been applied to new data and the results of this step had been

exported to spreadsheets, it was possible to replace the original values of the RAIN

attribute in the WEATHER_APPLY data set to allow the effectiveness of the models

to be evaluated. After the RAIN attribute was replaced in the original data, each

prediction made by each model was compared to the actual value of the RAIN

attribute for that record. The number of correct predictions was counted and the

percentage of correct predictions was calculated. It is also of interest to consider the

accuracy of the model during testing when evaluating the effectiveness of the model

when applied to new data. This makes it possible to determine whether testing results

give a good indication of model performance when applied to new data. These results

are depicted in Table 17.


Model       Model Settings                    Correct Predictions (out of 290)   Percentage of Correct Predictions   Model Accuracy During Testing
nbBuild     No weighting, no use of Priors    40                                 13.79%                              72.35386%
nbBuild2    No weighting, use of Priors       107                                36.90%                              71.09005%
nbBuild3    Weighting, no use of Priors       40                                 13.79%                              72.511846%
nbBuild4    Weighting, use of Priors          185                                63.79%                              68.24645%
abnBuild    No weighting, no use of Priors    123                                42.41%                              85.15008%
abnBuild2   No weighting, use of Priors       123                                42.41%                              84.9921%
abnBuild3   Weighting, no use of Priors       212                                73.10%                              77.40916%
abnBuild4   Weighting, use of Priors          212                                73.10%                              77.40916%

Table 17 Summary of Accuracy of Predictions When Compared to Actual Data
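The comparison procedure used to produce Table 17 can be sketched generically. The file names and column layout below are assumptions about the exported spreadsheets, not the exact files produced in this project.

import pandas as pd

# Hypothetical exports: one model's predictions and the withheld actual values.
predictions = pd.read_csv("nbBuild_results.csv")     # columns: THE_TIME, PREDICTION
actuals = pd.read_csv("weather_apply_actuals.csv")   # columns: THE_TIME, RAIN

# Match each prediction to the actual outcome of the same record.
merged = predictions.merge(actuals, on="THE_TIME")

correct = int((merged["PREDICTION"] == merged["RAIN"]).sum())
print(f"{correct} correct out of {len(merged)} "
      f"({100.0 * correct / len(merged):.2f}%)")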


It is also of interest to directly compare the results obtained on new data by those models built using different algorithms but the same settings in terms of weighting and use of Priors. This comparison is depicted in Table 18.

Models                  Settings                         Naïve Bayes % Correct   Adaptive Bayes Network % Correct
nbBuild vs abnBuild     No weighting, no use of Priors   13.79%                  42.41%
nbBuild2 vs abnBuild2   No weighting, use of Priors      36.90%                  42.41%
nbBuild3 vs abnBuild3   Weighting, no use of Priors      13.79%                  73.10%
nbBuild4 vs abnBuild4   Weighting, use of Priors         63.79%                  73.10%

Table 18 Comparison of Models Built using Same Settings

4.3 Chapter Summary

This chapter has provided a description of the results that were obtained when the models were applied to new data, including the rules the Adaptive Bayes Network algorithm used to make its predictions. The predictions made by the models were compared to the original values for RAIN in the data. The accuracy of these predictions was compared between the models as well as with the accuracy the models showed during testing.

In the following chapter the results of applying the models to new data and the results

of the comparisons between the models will be interpreted as part of the evaluation of

ODM.


Chapter 5 Interpretation of Results

This chapter provides an interpretation of the results obtained from the data mining

models built using similar techniques but different algorithms. Each comparison

between the models is interpreted and reasons presented for the results obtained. The

effectiveness of all the models is compared and the significance of these observations

discussed.

5.1 Comparison of Model Results

As presented in Table 18 in the previous chapter, the percentage of

correct predictions for each model built using the Naïve Bayes

algorithm was compared to that of the model built using the

Adaptive Bayes Network algorithm using similar techniques, in

terms of Priors and weighting. Table 19 includes the accuracy during testing

for the models along with the other results.

Comparison   Models                  Settings                         NB % Correct   ABN % Correct   NB Test Accuracy   ABN Test Accuracy
1            nbBuild vs abnBuild     No weighting, no use of Priors   13.79%         42.41%          72.35386%          85.15008%
2            nbBuild2 vs abnBuild2   No weighting, use of Priors      36.90%         42.41%          71.09005%          84.9921%
3            nbBuild3 vs abnBuild3   Weighting, no use of Priors      13.79%         73.10%          72.511846%         77.40916%
4            nbBuild4 vs abnBuild4   Weighting, use of Priors         63.79%         73.10%          68.24645%          77.40916%

(NB = Naïve Bayes, ABN = Adaptive Bayes Network)

Table 19 Comparison of Models built using same techniques and showing accuracy during testing

When each comparison is inspected, it is apparent that in all cases, when the models have been applied to new data, those models built using the Adaptive Bayes Network algorithm outperform those built using Naïve Bayes. In all except comparison 2, the percentage of correct predictions for the Adaptive Bayes Network models is markedly higher than that for the Naïve Bayes models.

During testing, those models built using the Adaptive Bayes

Network algorithm showed a higher level of accuracy than those

models built using Naïve Bayes. However, this difference in test

accuracy between the models in each comparison is not nearly as

large as that demonstrated when the models are applied to new

data.

In the following subsections each comparison is interpreted and

discussed.

5.1.1 Comparison 1

The models in this comparison were built without making use of the Priors technique and were not tuned using weighting to introduce bias. The model built using the Adaptive Bayes Network algorithm, abnBuild, correctly predicted 42.41% of the RAIN attribute outcomes when applied to new data, whereas the model built using Naïve Bayes, nbBuild, correctly predicted only 13.79% of the outcomes. During testing, nbBuild showed an accuracy of 72.35386% and abnBuild showed an accuracy of 85.15008%.

The fact that nbBuild showed a relatively high test accuracy and a low accuracy when

applied to new data can be attributed to the fact that the data set used for building the

model, THE_BUILD, had an unbalanced distribution of outcomes for the RAIN

attribute. During the model building stage the model did not observe enough of one

outcome of the target attribute to build an accurate model but still showed a high level

of accuracy during testing as the data distribution of the test data set, THE_TEST, was

similar to that of the build data. Thus, when applied to the new data,

WEATHER_APPLY, the model was shown to be ineffective.

abnBuild showed a higher overall accuracy than nbBuild, which was expected, as the Adaptive Bayes Network algorithm is said to build more effective models [Berger, 2004].

5.1.2 Comparison 2

The models in this comparison were built using the Priors technique

in order to minimise the effect of an unbalanced distribution of

outcomes for the RAIN attribute in the build data set. When applied

to the new data, nbBuild2 correctly predicted 36.90% of the

outcomes of RAIN and abnBuild2 correctly predicted 42.41%.

During testing, nbBuild2 showed an accuracy of 71.09005% and

abnBuild2 showed an accuracy of 84.9921%.

When the Priors technique was implemented and the models applied to new data, nbBuild2 showed an increase in accuracy of 23.11% compared to nbBuild. However, the test accuracy of this model decreased by a little over 1%.

The increase in accuracy when the model was applied to new data can be attributed to the fact that the build data set used to build this model had a balanced distribution of outcomes for the RAIN attribute, thus allowing the model to observe a sufficient number of cases of each outcome to ensure a more effective model.

The slight decrease in test accuracy is due to the test data set having a distribution similar to that of the original build data set, with its uneven distribution of the target attribute.

abnBuild2 showed the same accuracy when applied to new data as the Adaptive

Bayes Network model that did not make use of the Priors technique. However,

abnBuild2 showed a decrease in test accuracy from that of abnBuild. The fact that

both Adaptive Bayes Network models showed the same accuracy when applied to

new data is indicative of the effectiveness of the algorithm for building a model from

data with varying distributions of the target attribute. The decrease in test accuracy of

abnBuild2 can be attributed to the fact that the build data has a more even

distribution of the target attribute and the test data set has a less even distribution of

the target attribute.

5.1.3 Comparison 3

The models in this comparison were built from the original build data with the uneven distribution of the target attribute, RAIN. These models were tuned because it was evident during testing of the model nbBuild that it predicted a large number of false ‘no’ values for the target attribute. Thus, it was viable to introduce bias into the model by using weighting. The models were weighted 3 against false negatives, implying that the cost of predicting a false negative was 3 times that of predicting a false positive.

With this weighting in place, nbBuild3 showed no improvement in accuracy when

applied to new data compared to the first Naïve Bayes model. The accuracy remained

at 13.79%. The accuracy of nbBuild3 during testing only improved by 0.157986%

compared to the first model.


It is unexpected that, after introducing weighting into the model, nbBuild3’s accuracy should not improve when the model is applied to new data. This could indicate that this model’s ineffectiveness is largely due to the use of build data that does not incorporate the Priors technique.

The improvement in the accuracy of nbBuild3 during testing could be attributed to the weighting, but the unchanged results on new data indicate that the weighting has had little effect on the accuracy of the model when applied to new data.

A dramatic improvement in the Adaptive Bayes Network model was observed after

introducing weighting. The accuracy of the model when applied to new data increased

to 73.10% even though during testing the accuracy of abnBuild3 dropped to

77.40916%.

It can be deduced that introducing the weighting has had a significant impact on abnBuild3’s accuracy when applied to new data, as the model has been sufficiently tuned to avoid errors of a certain kind.

The accuracy of the first Adaptive Bayes Network model during testing was 85.15008%. Introducing weighting reduced this accuracy to 77.40916%, even though the weighted model markedly outperforms the non-weighted model when applied to new data. It must also be emphasised that in this case the accuracy during testing and the accuracy during application to new data are relatively close, which has not occurred with previous models. For this reason, it can be said that introducing weighting into this model has improved its effectiveness both during testing and when applied to new data.

5.1.4 Comparison 4

The models in this comparison, nbBuild4 and abnBuild4, were built using the Priors

technique and tuned by introducing a weighting of 3 against false negatives into the

model.


nbBuild4 showed a significant improvement in accuracy when applied to new data,

correctly predicting 63.79% of the outcomes. The accuracy during testing and

application of abnBuild4 was unchanged from that of the previous model that only

made use of weighting.

It is apparent that nbBuild4 showed an improvement due to the combination of the Priors technique and the introduction of weighting. These results show that building the model on data with a balanced distribution of the target attribute enhances the effect of the weighting: a more effective model is built first and is then tuned to increase its overall accuracy.

Also to be noted is that the accuracy of nbBuild4 during testing, 68.24645%, is closer to the accuracy of the model when applied to new data than is the case for the other Naïve Bayes models. This could indicate the increased effectiveness, both during testing and on new data, of this model built using Priors and weighting.

The accuracy of abnBuild4 during application to new data and during testing remained the same as for the previous model, which did not make use of the Priors technique. For this reason, it is apparent that the use of the Priors technique has had no impact on the effectiveness of this model, whereas tuning the model using weighting has significantly improved it.

5.2 Effectiveness of Models

Figure 12 graphically portrays the accuracy of the models when applied to the new

data set, WEATHER_APPLY, as well as the settings for each pair of models.

In the case of the Naïve Bayes models, the most effective model correctly predicted

63.79% of the RAIN outcomes and was built using the Priors technique and

introducing a weighting of 3 against false negatives.


The most effective Adaptive Bayes model correctly predicted 73.10% of the RAIN

outcomes and was built by introducing a weighting of 3 against false negatives into

the model. The Priors technique had no influence on the accuracy of the Adaptive

Bayes models when they were applied to new data.

Figure 13 graphically portrays the accuracy of the models when tested on the test data

set, THE_TEST, as well as the settings for each pair of models.

The most accurate model during testing built with the Naïve Bayes algorithm was the model that used weighting but not the Priors technique. This model showed an accuracy of 72.51%. However, this model was also one of the two models that performed most poorly when applied to new data, correctly predicting only 13.79% of the RAIN outcomes. The high test accuracy in this case is attributed to the build and test data sets used during building and testing of this model having a similarly unbalanced distribution of outcomes for the target attribute. This caused the model to perform well during testing but to be ineffective when applied to the new data set with a different distribution of outcomes for the target attribute.

The model built using the Adaptive Bayes Network algorithm that demonstrated the greatest accuracy during testing was the model that did not make use of the Priors technique and had no bias introduced in the form of weighting. This model demonstrated an accuracy of 85.15% during testing. It was also one of the two Adaptive Bayes Network models that performed most poorly when applied to new data, correctly predicting only 42.41% of the outcomes of RAIN. Since the use of the Priors technique had no effect on the effectiveness of the models when applied to new data, it is possible to deduce that the introduction of weighting improved the models’ accuracy when applied to new data even though this was not reflected during testing. These results raise some questions about the effectiveness of the model testing process.


[Bar chart “Model Results”: accuracy (0–80%) of the Naïve Bayes and Adaptive Bayes Network models on new data, for each combination of model settings (no weighting/no priors; no weighting/priors; weighting/no priors; weighting/priors).]

Figure 12 Model results and settings of application to new data

[Bar chart “Testing Results”: accuracy (0–90%) of the Naïve Bayes and Adaptive Bayes Network models during testing, for each combination of model settings (no weighting/no priors; no weighting/priors; weighting/no priors; weighting/priors).]

Figure 13 Model results and settings during testing

5.3 Significance of Results


It is significant that the effectiveness on new data of those models built using the Adaptive Bayes Network algorithm was not affected by the use of the Priors technique, which attempts to ensure a balanced distribution of the target attribute in the build data set. This could indicate the effectiveness of the algorithm in incorporating rare occurrences in the data into the model. This is in contrast to the models built using the Naïve Bayes algorithm, whose effectiveness was markedly improved by use of the Priors technique, indicating that the algorithm requires this form of data preparation.

Also to be emphasised is the effect of introducing weighting into the models of both

kinds. In the case of those models built using the Adaptive Bayes Network algorithm,

the introduction of weighting provided a dramatic increase in the accuracy of the

results when the model was applied to new data. Those models built using the Naïve

Bayes algorithm only benefited from the introduction of weighting when the Priors

technique had also been used in the model. This could indicate that the effect of

introducing weighting in order to tune the model is most beneficial when the model is

already at a relatively high level of effectiveness.

It was interesting to note the discrepancies between the models’ accuracy during

testing and accuracy when applied to new data. In all cases, the test accuracy was

higher than the accuracy calculated when the model was applied to new data. In some

cases this difference was significant. This could indicate the impact the nature of the

test data has on the results of model test accuracy. The data used for testing the

models was created from the data set used to build the models using the

Transformation Split wizard as discussed in Chapter 3. For this reason, the

distribution of the target attribute in both data sets was similar, which positively

influenced the accuracy of the models built from and tested on similar data. External

validation of the models’ performance on the new data emphasised this influence.

These findings indicate the need for test data sets that show fewer similarities to the

build data sets and question the use of the Transformation Split wizard to create build

and test data sets from data that shows a specific distribution of the target attribute.


5.4 Chapter Summary

After interpreting the results obtained from the different models it is apparent that the

most effective model was built using the Adaptive Bayes Network algorithm with a

weighting of 3 against false negatives. It was apparent that the results obtained from

those models built using the Adaptive Bayes Network algorithm were not affected by

the use of the Priors technique whereas the results of the models built using the Naïve

Bayes algorithm were. Weighting had an effect on the results obtained from both

kinds of models but was only noticeable in the case of the Naïve Bayes models when

the Priors technique was used. Also to be noted is that the accuracy of the models

during testing does not always indicate the effectiveness of the models when applied

to new data.

The following chapter will discuss the conclusions that can be drawn from the results

obtained in this chapter.

Section 3 Conclusion

Chapter 6 Conclusions Drawn from Results

This chapter will draw conclusions from the results presented in the previous

chapters. The first set of conclusions will be made from the actual results obtained

when the models were applied to new data. The next set will consider the effect the

data used during the data mining had on the results obtained and lastly, conclusions

regarding Oracle Data Mining will be drawn.

6.1 Conclusions Regarding Model Results

The most effective model built using the Naïve Bayes algorithm correctly predicted the outcome of the RAIN attribute for 63.79% of the 290 records. The model built using this algorithm and no other techniques correctly predicted only 13.79% of the outcomes. Introduction of bias into the model using weighting alone had no effect on this accuracy. Use of the Priors technique increased this accuracy to 36.90%. A combination of weighting and the use of Priors increased the accuracy of the model when applied to new data to 63.79%.

Tuning was accomplished by introducing bias into the model using weighting. It was viable to introduce bias because, during testing, the confusion matrix showed that the model tended to make errors of a particular kind. Bias makes these errors more costly to the effectiveness of the model, and the algorithm therefore attempts to minimise them when building the model.
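A generic way to realise this kind of bias is to weight the training records of the outcome that must not be missed, so that errors on it are penalised more heavily during fitting. The sketch below uses scikit-learn's sample_weight as an illustrative analogue of ODM's weighting, not ODM's actual mechanism (ODM expresses the bias through a cost matrix); the data values are hypothetical.

import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Hypothetical binned build data: columns CHILL, WDIR; target RAIN.
X = np.array([[0, 3], [0, 1], [1, 2], [1, 3], [0, 0], [1, 2]])
y = np.array(["no", "no", "yes", "yes", "no", "yes"])

# Weight 'yes' records 3 times as heavily, so missed 'yes' outcomes
# (false negatives) cost the fitted model more.
weights = np.where(y == "yes", 3.0, 1.0)

model = CategoricalNB()
model.fit(X, y, sample_weight=weights)
print(model.predict(X))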

These observations indicate that the Naïve Bayes algorithm requires the use of the

Priors technique when the build data has an uneven distribution of the target attribute.

This ensures the algorithm observes enough of each target attribute outcome to build a

model that will be effective when applied to data with a different distribution of the

target attribute.
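In this project that balance was achieved with the Stratified Sampling wizard; the same idea can be sketched generically in pandas. The miniature data set below is hypothetical and illustrative only.

import pandas as pd

# Hypothetical unbalanced build data: many more 'no' than 'yes' outcomes.
build = pd.DataFrame({
    "THE_TIME": range(12),
    "RAIN":     ["no"] * 9 + ["yes"] * 3,
})

# Downsample every outcome to the size of the rarest one so the algorithm
# observes an equal number of 'yes' and 'no' cases while building.
n = int(build["RAIN"].value_counts().min())
balanced = build.groupby("RAIN").sample(n=n, random_state=0)
print(balanced["RAIN"].value_counts())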

Adjusting the settings of the algorithm parameters would also be beneficial when

using this algorithm to build a model of data with an uneven target attribute

distribution. These parameters, the pairwise and singleton thresholds, affect how the

algorithm treats outliers in the data. By reducing the values of these parameters a

more accurate model can be built but this would only be beneficial if the model

already observes enough of a certain type of target attribute outcome.

Introducing bias into the Naïve Bayes model was most beneficial when the model had

been built using the Priors technique. This could indicate that the effect of weighting

is enhanced when the model is already relatively effective.

The most effective of all the models was built using the Adaptive Bayes Network

algorithm. This model correctly predicted 73.10% of 290 RAIN attribute outcomes

when applied to the new weather data set. This level of accuracy was increased from

42.41% by tuning the model.

The use of the Priors technique had no effect on the models built using the Adaptive Bayes Network algorithm. This indicates that the effectiveness of the resulting models was not affected by the distribution of the target attribute in the data set used to build the model. Thus, it can be concluded that the algorithm effectively considers occurrences of instances in the data even when these occurrences are rare.

6.2 Conclusions Regarding Data

It is apparent from the results of applying the models to the WEATHER_APPLY data set that the algorithms found a pattern in the data that allowed them to correctly predict the outcome of the RAIN attribute in a significant number of cases. According to the rules generated by the Adaptive Bayes Network algorithm, these predictions were mostly influenced by the measurements for wind chill factor and wind direction in the records. Although unexpected, these measurements appear to allow the models to make accurate predictions in most cases.

The Transformation Split wizard allows a data set to be split into build and test data sets by randomly selecting a predetermined number of records for the build data set and placing the remainder in the test data set. However, use of this technique to create the data sets results in both data sets showing a similar distribution of the target attribute.
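The effect can be demonstrated with a generic random split in the spirit of the wizard (an illustrative sketch, not the wizard itself, with a hypothetical 80/20 distribution of outcomes):

import pandas as pd

# Hypothetical data set with an uneven RAIN distribution (80% 'no').
data = pd.DataFrame({"RAIN": ["no"] * 80 + ["yes"] * 20})

# Randomly select a predetermined number of records for building;
# the remainder becomes the test data set.
build = data.sample(n=60, random_state=0)
test = data.drop(build.index)

# Both subsets inherit roughly the same uneven distribution.
print(build["RAIN"].value_counts(normalize=True))
print(test["RAIN"].value_counts(normalize=True))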

If the distribution of the target attribute is uneven, both data sets will show this to an

extent. This is depicted in Figures 14, 15 and 16. Figure 14 shows the distribution of

the RAIN attribute in the data set used with the Transformation Split wizard to create

the test and build data sets from. Figure 15 shows the distribution of this attribute in

the build data set and Figure 16 in the test data set created from this wizard.


[Bar chart “Original Data Distribution”: Bin Count (0–1800) of the ‘yes’ and ‘no’ outcomes of the RAIN attribute.]

Figure 14 Distribution of RAIN attribute in the original data set

It appears that similar distributions of the target attribute in both the build and test

data sets influence the accuracy of the model during testing. The result of testing a

model using a data set that resembles the build data set is an inflated accuracy. This

was evident from the significantly lower levels of accuracy the models showed when

applied to the new data of a different distribution. The distribution of the apply data

set is shown in Figure 17.

These findings indicate the need to test a model on a variety of data sets of different

distributions in order to properly validate model accuracy and effectiveness when

applied to data sets with different distributions of the target attribute.

Further, it appears it would be more beneficial to use the largest data set possible to

build and test models on. This would result in a more effective model as a wider range

of occurrences in the data would be incorporated into the model.


[Bar chart “Build Data Distribution”: Bin Count (0–1800) of the ‘yes’ and ‘no’ outcomes of the RAIN attribute.]

Figure 15 Distribution of RAIN attribute in the build data set

[Bar chart “Test Data Distribution”: Bin Count (0–1800) of the ‘yes’ and ‘no’ outcomes of the RAIN attribute.]

Figure 16 Distribution of RAIN attribute in the test data set


[Bar chart “Apply Data Distribution”: Bin Count (0–300) of the ‘yes’ and ‘no’ outcomes of the RAIN attribute.]

Figure 17 Distribution of RAIN attribute in the apply data set

6.3 Conclusions Regarding Oracle Data Mining

Oracle Data Mining, and DM4J in particular, provides the user with wizards that are easy to use and understand and that cover all aspects of the data mining process. Wizards are available to create the build and test data sets from an original data set, to prepare the data for use with the Priors technique, to build models, to test these models and to apply these models to new data.

Although data preparation is an important aspect of the data mining process [Berger, 2004], it is not explicitly emphasised in the wizards used for model building. Techniques for data preparation, and the benefits of using them, are accessible through the Data Mining Browser but are not given prominence.

The Data Mining Browser in DM4J allows the user to easily access the results of

model testing and application of models to new data. These results can also be

exported to spreadsheets allowing increased accessibility and ensuring they can be

easily worked with.

DM4J provides easy and reliable access to the database and the tables stored in the database. This makes it possible to search for a specific data set during the data mining process. The Data Mining Browser also allows the user to view summaries of the data, including distributions of attributes in the data set, which is of use during the data preparation phase.

It is apparent from the results of the model testing during this evaluation that testing a model on a single data set does not provide an indication of the effectiveness of the model when applied to new data; test accuracy can be misleading. For this reason, models should be externally validated using a technique similar to the one used in this investigation (applying the model to data where the outcome of the target attribute is known) or tested on a number of data sets with varying distributions to better determine model accuracy. The need to validate a model on a variety of data sets is not emphasised in the documentation or by the wizards.
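Such an external validation loop can be sketched as follows; the miniature data sets and model are hypothetical placeholders for WEATHER_APPLY-style sets with differing target distributions.

import pandas as pd
from sklearn.naive_bayes import CategoricalNB

# Hypothetical build data and two apply sets with known RAIN outcomes.
build = pd.DataFrame({"CHILL": [0, 0, 1, 1], "WDIR": [1, 2, 3, 1],
                      "RAIN": ["no", "no", "yes", "yes"]})
apply_sets = {
    "balanced":   pd.DataFrame({"CHILL": [0, 1], "WDIR": [2, 3],
                                "RAIN": ["no", "yes"]}),
    "mostly yes": pd.DataFrame({"CHILL": [1, 1, 0], "WDIR": [3, 1, 2],
                                "RAIN": ["yes", "yes", "no"]}),
}

model = CategoricalNB().fit(build[["CHILL", "WDIR"]], build["RAIN"])

# Compare predictions to the known outcomes on each data set.
for name, data in apply_sets.items():
    accuracy = (model.predict(data[["CHILL", "WDIR"]]) == data["RAIN"]).mean()
    print(f"{name}: {accuracy:.2f}")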

It must be emphasised that the ease and speed of building and testing a model using

the wizards allows for a number of models to be built and tests to be conducted. This

approach is recommended in order to ensure the most effective model possible is

produced.

6.4 Chapter Summary

This chapter has drawn a number of conclusions from the results obtained during the

data mining. Conclusions have been made regarding model results, the effect of data

used during the data mining and Oracle Data Mining itself. The following chapter will

conclude this evaluation.


Chapter 7 Conclusion

This chapter presents the conclusions drawn from the evaluation and suggests

possible extensions to the research area.

7.1 Conclusion

Oracle Data Mining provides data mining functionality through a series of wizards.

These wizards allow the user to perform data preparation, to build models, to test

these models and to apply the models to new data. The data preparation in this

evaluation was performed using the Transformation Split wizard and the Stratified

Sampling wizard. A number of wizards were used to build, test and apply the models

to new data.

The wizards were easy to use and understand and allowed a number of models to be

built in a short amount of time. Access to the database was provided through the

wizards. However, it was found that the wizards for building the data mining models

placed little emphasis on data preparation.

The two Classification algorithms used in this evaluation found a distinct pattern in

the weather data sets. This allowed the models to be used to make predictions of the

outcome of the RAIN attribute when the models were applied to the new data set. It is

possible to conclude that given a new set of weather data, the data mining models

would be able to make fairly accurate predictions of the outcome of the RAIN

attribute.

Of the algorithms investigated, the Adaptive Bayes Network algorithm produced the most effective model when applied to new data, correctly predicting 73.10% of the RAIN attribute’s outcomes. This model was tuned using a weighting of 3 against false negatives to introduce bias into the model. The most effective model built using the Naïve Bayes algorithm correctly predicted 63.79% of the RAIN attribute’s outcomes when applied to new data. This model made use of the Priors technique and was weighted 3 against false negatives.


The models were tested against a data set created from the original data,

WEATHER_BUILD, to determine the level of predictive accuracy. This testing was

conducted using the test wizards in DM4J and resulted in confusion matrices showing

test accuracy. It can be concluded that testing a model on a single data set does not

provide an accurate indication of how the model will perform when applied to new

data. This is because any similarities between the target attribute distribution (that of

RAIN) in the build and test data sets appear to inflate the test accuracy results. This is

evident from the fact that when the test accuracy of the models is compared to the

accuracy of the models when applied to new data, the test accuracy is significantly

higher. It is recommended that models be tested on a number of data sets of varying

distribution in order to gauge more accurately how they will perform when applied to

new data.

In conclusion, Oracle Data Mining provides the functionality required to build

effective data mining models using the two Classification algorithms, namely

Adaptive Bayes Network and Naïve Bayes. However, in order to build an effective

data mining model it is necessary to perform data preparation and to test the models

on a number of data sets of varying distribution. These aspects are not explicitly

emphasised when using the wizards to build the models. It can be stated that building

an effective model is an iterative process and requires the data miner to have an

awareness of the data sets used in the process, how these could influence the

outcomes of applying the models to new data and what techniques are available to

train and tune the models to be most effective.

7.2 Possible Extensions to Research

Possible extensions to this research could involve conducting similar evaluations of

the other algorithms available in the Oracle Data Mining Suite. These algorithms

include Attribute Importance, Association Rules, O-Cluster and Enhanced k-Means

Clustering. The results of models built on a similar data set using these other

algorithms could be compared to the models built in this evaluation to determine

when a specific algorithm is most applicable to a data mining project.


It would also be possible to compare the two clustering algorithms in ODM, O-Cluster and Enhanced k-Means, in a manner similar to the one used in this evaluation.

Another extension of the research could involve comparing the functionality

demonstrated by Oracle Data Mining in this evaluation to that provided by another

data mining suite. Other data mining suites that could be compared include IBM

Intelligent Miner and SQL Server Data Mining and Analysis Server.

A further possibility would be to compare the results obtained using Oracle Data

Mining with those obtained after performing a regression analysis on the weather

data.


List of Figures

Figure 1: The Oracle Data Mining Process [Berger, 2004]……………………... 14

Figure 2: Naïve Bayes algorithm settings……………………………………….. 18

Figure 3: Adaptive Bayes Network algorithm settings………………………….. 20

Figure 4: Data Distribution for Rain Attribute from THE_BUILD data set…….. 24

Figure 5: Data Distribution for RAIN Attribute from THE_BUILD1 data set…. 24

Figure 6: Extract from Classification Model Build Wizard, Priors Settings……. 27

Figure 7: nbBuild Lift Chart……………………………………………………. 31

Figure 8: nbBuild2 Lift Chart…………………………………………………... 31

Figure 9: abnBuild Lift Chart…………………………………………………... 32

Figure 10: abnBuild2 Lift Chart…………………………………………………. 33

Figure 11: Extract Showing Weighting of Model Build Wizard for abnBuild4 .. 35

Figure 12: Model results and settings of application to new data………………... 50

Figure 13: Model results and settings during testing…………………………….. 51

Figure 14: Distribution of RAIN attribute in the original data set……………….. 55

Figure 15: Distribution of RAIN attribute in the build data set………………….. 56

Figure 16: Distribution of RAIN attribute in the test data set……………………. 56

Figure 17: Distribution of RAIN attribute in the apply data set…………………. 58


List of Tables

Table 1: Mining Data Table Structure…………………………………………... 15

Table 2: Example Confusion Matrix...................................................................... 28

Table 3: Model Test Accuracy Rates……………………………………………. 28

Table 4: Confusion Matrix for nbBuild Testing……………………………........ 29

Table 5: Confusion Matrix for nbBuild2 Testing ………………………………. 29

Table 6: Confusion Matrix for abnBuild Testing ………………………………. 29

Table 7: Confusion Matrix for abnBuild2 Testing ……………………………... 30

Table 8: Unweighted Models’ Test Accuracy Rates…………………………….. 34

Table 9: Weighted Models’ Test Accuracy Rates ………………………………. 34

Table 10: nbBuild Confusion Matrix …………………………………………… 35

Table 11: nbBuild3 Confusion Matrix ………………………………………….. 36

Table 12: Summary of Classification Models……………………………………. 36

Table 13: Extract of results from model nbBuild................................................. 38

Table 14: Extract of results from model nbBuild3 …………………….............. 39

Table 15: Extract of results from model abnBuild showing rules ……………... 39

Table 16: Rules used by Adaptive Bayes Network Models to make predictions... 40

Table 17: Summary of Accuracy of Predictions When Compared to Actual Data. 42

Table 18: Comparison of Models Built using Same Settings……………………. 43

Table 19: Comparison of Models built using same settings and showing accuracy during testing…………………………………………………………… 44

References

Al-Attar, A., 2004, White Paper: Data Mining - Beyond Algorithms, URL: <http://www.attar.com/tutor/mining.htm>, Accessed: 06/2004.

Berger, C., 09/2004, Oracle Data Mining, Know More, Do More, Spend Less - An Oracle White Paper, URL: <http://www.oracle.com/technology/products/bi/odm/pdf/bwp_db_odm_10gr1_0904.pdf>, Accessed: 10/2004.

Berry, M. J. A. and Linoff, G. S., 2000, Mastering Data Mining: The Art and Science of Customer Relationship Management, USA, Wiley Computer Publishing.

Fernandez, G., 2003, Data Mining Using SAS Applications, USA, Chapman and Hall/CRC.

Jacot-Guillarmod, F., 2004, Index of /weather/ARCHIVE/2004, URL: <http://www.ru.ac.za/weather/>, Accessed: 10/2004.

Oracle9i Data Mining Concepts Release 2 (9.2), 2002, Oracle Home Page, URL: <http://www.lc.leidenuniv.nl/awcourse/oracle/datamine.920/a95961/preface.htm>, Accessed: 06/2004.

Oracle Data Mining Tutorial, Release 9.0.4, Oracle Home Page, 02/2004, URL: <http://www.oracle.com/technology/products/bi/odm/9idm4jv2.html>, Accessed: 09/2004.

Oracle Help for Java, Version 4.2.5.1.0, Copyright 1997-2004, Available via Oracle Data Mining Browser Help Menu in JDeveloper 10g.

Oracle Data Mining for Java (DM4J), 24/03/2004, Oracle Technology Network, URL: <http://www.oracle.com/technology/products/bi/pdf/odm4java.pdf>, Accessed: 09/2004.

Pyle, D., 2000, Data Preparation for Data Mining, San Francisco, California, Morgan Kauffman.

Roiger, R. J. and Geatz M. W., 2003, Data mining: a tutorial- based primer, Boston, Massachusetts, Addison Wesley.

Appendix A – Introductory Manual

This appendix aims to provide the reader with an introductory manual for using Data Mining for Java (DM4J) 9.0.4 and JDeveloper 10g in order to perform data mining. It assumes the reader has access to Oracle Data Mining Tutorial, Release 9.0.4, which is available on the CD-ROM that accompanies this project.

The data mining performed during this project was conducted with DM4J 9.0.4. This

component is an extension of JDeveloper 10g and provides the user interface to the

data mining components.

A.1 Preparing for Data Mining

It is possible to log in to the machine used in this project with the username ‘g01d1801’ and the password ‘810412’.

Any data sets that will be mined should be loaded into the ODM or ODM_MTR

schema in the Oracle database. This can be accomplished in the Enterprise Manager

Console using the Load wizard which is accessible under the menu items:

Tools>Database Tools>Data Management>Load. The most success was achieved

loading data sets in the form of .txt files with items in records separated by commas

and each record on a new line.
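For illustration, such a file can be produced from any tabular source. The sketch below writes a miniature, hypothetical weather table in the format that loaded most successfully (comma-separated items, one record per line, no header row); the values are illustrative only.

import pandas as pd

# Hypothetical weather records matching the attributes used in this project.
records = pd.DataFrame({
    "THE_TIME": [1, 2, 3],
    "CHILL":    [40.5, 42.1, 50.3],
    "WDIR":     [230.0, 250.4, 100.7],
    "RAIN":     ["no", "yes", "yes"],
})

# One comma-separated record per line, with no header row.
records.to_csv("weather_build2.txt", index=False, header=False)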

The Enterprise Manager Console is accessible from the Start Menu:

Start>All Programs>Oracle-Ora1>Enterprise Manager Console.

At the login select the ‘Login to the Oracle Management Server’ radio button. The

administrator is ‘sysman’, the password is ‘810412’ and the Management Server is

ora1.ict.ru.ac.za.

Once the console opens expand the ‘Database’ item in the navigation pane. Right

click on Ora1.ict.ru.ac.za. Select ‘Connect’ from the menu. The username is ‘sys’ and

the password ‘emily’. Connect as ‘sysdba’. At this stage it is possible to use the Load

wizard which explains the process of loading data into the database.

A.2 Starting JDeveloper

JDeveloper can be launched by double clicking on the shortcut icon on the desktop. It is then necessary to connect to the Oracle database using JDeveloper as follows:

- Click on Connections to expand it in the System Navigator pane in JDeveloper.

- Right click on the Database item in the list.

- Select New Connection in the menu that appears.

- Follow the instructions in the Connection Wizard.

  o At step 2 of the wizard a username and password are required. The username is ‘odm’ and the password is ‘odm’. The Deploy Password check box must be checked.

  o Any other information required can be left as the default settings.

A.3 Data Mining

It should now be possible to conduct data mining on a data set.

In JDeveloper click on the ‘File’ menu and select the ‘New’ menu item. In the dialog

box that appears select ‘General’ in the left pane and ‘Workspace’ in the right pane.

This will create a workspace to store the components of data mining. Fill in a name

for the workspace and note where it is saved. Click OK, fill in a project name and

click OK. This will be where any data mining models created are stored. It is possible

to view the workspace and projects in it in the System Navigator pane in JDeveloper.

Highlight the project name in the System Navigator pane, select ‘File’ in the

JDeveloper menu. Select ‘New’ and in the dialog box that appears expand the

‘Business Tier’ item in the left pane. Select ‘Data Mining Components’. In the pane

on the right will appear the wizards available for data mining.

The workspace created in this evaluation was named weather.jws. The projects

created within the workspace were named theWeather.jpr and finalWeather.jpr.

theWeather.jpr contains the component created with the Transformation Split wizard.

This wizard allows the user to create the build and test data sets from a single data set.

It is easy to use and provides clear instructions to be followed. The data used by this wizard in this project was stored in the odm_mtr schema and named weather_build2. The wizard produced the tables theBuild and theTest, which were stored in the odm schema.

finalWeather.jpr contains all the components of all the data mining models created

during this project. The data set the models were applied to was named

weather_apply2 and was stored in the odm_mtr schema.

At this stage the reader should refer to Oracle Data Mining Tutorial, Release 9.0.4.

This tutorial provides step-by-step instructions for running the data mining wizards to

perform data mining. In reference to this project the reader should refer to chapters 2

to 8 in the tutorial.
