Application of Computational Intelligence in Data Mining

Brasov, 2011

“Transilvania” University of Braşov

Faculty of Electrical Engineering and Computer

Science

Applications of computational intelligence in data mining

By

Ioan Bogdan CRIVA Ţ

A thesis submitted in partial

fulfillment of the requirements for

the degree of

PhD

Advisor: Prof. Univ. Dr. Razvan Andonie

Brasov, 2011

Abstract

The objective of this work is a synthesis of some of the recent efforts in

the domain of predictive and associative rules extraction and processing

as well as a presentation of certain original contributions to the area.

The first two chapters of the theses present data mining and the

some recent results in the area of rule extraction. The second chapter,

“Rules in the Data Mining Context” introduces data mining with a focus

on rule extraction. We discuss association rules and their properties as

well as some notions of fuzzy modeling and fuzzy rules. The third

chapter, “Methods for Rules Extraction”, presents the most commonly

used methods for extracting rules. A special section describes the

specifics of rules analysis in Microsoft SQL Server. The following chapters

contain some original contributions in their context. The fourth chapter,

“Contributions to Rules Generalization”, reviews some of the existing

methods for simplifying rule models, and focuses on measures for

detecting rules similarity. Similar rules can be merged, resulting in

simpler rule systems. The fifth chapter, “Measuring the Usage

Prediction Accuracy of Recommendation Systems”, presents the area of

accuracy measurements for recommendation systems, one of the most

common applications of association rules. A new instrument for

assessing the accuracy of a recommender is presented, together with

some experimental results. The sixth chapter presents some

experimental results for the techniques introduced in the third and

fourth chapters. The results are detailed for datasets used in presenting

the methods or compared against results from other authors. The last

chapter contains conclusions of this thesis as well as certain directions

for further research.

Brasov, 2011

Contents Contents .......................................................................................................... iii List of figures ................................................................................................... 1

Acknowledgments........................................................................................... 2

Publications, Patents and Patent Applications by the Author ........................ 3

Books ......................................................................................................... 3

Articles ...................................................................................................... 3

Issued Patents (United States Patents and Trademark Office) ................ 3

Pending patent applications (United States Patents and Trademark Office) ........................................................................................................ 4

1 Introduction .............................................................................................. 5

1.1 Objectives...................................................................................... 5

1.2 Contributions ................................................................................ 5

1.3 The Structure of this Thesis .......................................................... 8

2 Rules in the Data Mining context............................................................ 10

2.1 Data mining in industry: an overview ........................................ 10

2.2 Data Mining Problems, Tasks and Processes .............................. 11

2.2.1 Business Problems .................................................................................... 11

2.2.2 Implementation Tasks ............................................................................... 13

2.2.3 Data Mining Project Cycle ......................................................................... 14

2.3 Rules in Data Mining ................................................................... 17

2.3.1 Association Rules ...................................................................................... 18

2.3.2 Classifications of association rules ............................................................ 20

2.3.3 The Market Basket Analysis problem ....................................................... 21

2.3.4 Itemsets and Rules in dense representation ............................................ 23

2.3.5 Equivalence of dense and sparse representations ................................... 24

2.4 Fuzzy Rules .................................................................................. 27

2.4.1 Conceptualizing in Fuzzy Terms ................................................................ 27

2.4.2 Fuzzy Modeling ......................................................................................... 28

3 Methods for Rule Extraction ................................................................... 31

3.1 Extraction of Association Rules ................................................... 31

3.1.1 The Apriori algorithm ................................................................................ 31

3.1.2 The FP-Growth algorithm ......................................................................... 35

3.1.3 Other algorithms and a performance comparison ................................... 38

3.1.4 Problems raised by Minimum Support itemset extraction systems ................................................................................................................. 40

3.2 An implementation perspective: Support for association analysis in Microsoft SQL Server ® 2008 ................................................. 45

3.3 Rules as expression of patterns detected by other algorithms .. 50

3.3.1 Rules based on Decision Trees .................................................................. 51

3.3.2 Rules from Neural Networks ..................................................................... 52

4 Contributions to Rule Generalization ..................................................... 59

Brasov, 2011

4.1 Fuzzy Rules Generalization ......................................................... 59

4.1.1 Redundancy .............................................................................................. 60

4.1.2 Similarity ................................................................................................... 61

4.1.3 Interpolation based rule generalization techniques ................................ 62

4.2 Rule Model Simplification Techniques ....................................... 63

4.2.1 Feature set alterations .............................................................................. 63

4.2.2 Changes of the Fuzzy sets definition ........................................................ 65

4.2.3 Merging and Removal Based Reduction ................................................... 65

4.3 Similarity Measures and Rule Base Simplification ...................... 66

4.4 Rule Generalization ..................................................................... 70

4.4.1 Problem and context................................................................................. 71

4.4.2 The rule generalization algorithm ............................................................ 72

4.4.3 Applying the RGA to an apriori-derived set of rules ................................. 76

4.5 Conclusion ................................................................................... 79

4.5.1 Future directions for the basic rule generalization algorithm ............................................................................................................... 80

4.5.2 Further work for the apriori specialization of the RGA ............................ 84

5 Measuring the Usage Prediction Accuracy of Recommendation Systems ......................................................................................................... 85

5.1 Association Rules as Recommender Systems ............................. 86

5.2 Evaluating Recommendation Systems ........................................ 86

5.3 Instruments for offline measuring the accuracy of usage predictions .............................................................................................. 88

5.3.1 Accuracy measurements for a single user ................................................ 89

5.3.2 Accuracy Measurements for Multiple Users ............................................ 92

5.4 The Itemized Accuracy Curve ...................................................... 93

5.4.1 A visual interpretation of the itemized accuracy curve ............................ 98

5.4.2 Impact of the N parameter on the Lift and Area Under Curve measures .................................................................................................... 99

5.5 An Implementation for the Itemized Accuracy Curve .............. 101

5.5.1 Accuracy measures ................................................................................. 101

5.5.2 Real data test strategies ......................................................................... 102

5.5.3 The algorithm for constructing Itemized Accuracy Curve ...................... 103

5.6 Conclusions and further work ................................................... 104

6 Experimental Results ............................................................................ 107

6.1 Datasets used in this material................................................... 107

6.1.1 IC50 prediction dataset ............................................................................ 107

6.1.2 Movies Recommendation ....................................................................... 108

6.1.3 Movie Lens .............................................................................................. 109

6.1.4 Iris ............................................................................................................ 109

6.2 Experimental results for the Rule Generalization algorithm .... 110

6.2.1 Rule set and results used in Section 4.4 on generalization ................... 110

Brasov, 2011

6.2.2 Results for the apriori post-processing algorithm .................................. 112

6.3 Experimental results for the Itemized Accuracy Curve ............ 113

6.3.1 Movie Recommendation Results ............................................................ 115

6.3.2 Movie Lens Results ................................................................................. 116

7 Conclusions and directions for further research .................................. 118

7.1 Conclusions ............................................................................... 118

7.2 Further Work ............................................................................. 119

Appendix A: Key Algorithms ....................................................................... 122

Apriori ................................................................................................... 122

FP-Growth ............................................................................................. 124

Bibliography ................................................................................................ 126

List of figures

Figure 2-1 The CRISP-DM process .................................................................................... 17

Figure 2-2 Standard types of membership functions (from (20) ) .................................... 28

Figure 3-1: Finding frequent itemsets .............................................................................. 34

Figure 3-2 An FP-Tree structure ........................................................................................ 36

Figure 3-3 A mining case containing tabular features ...................................................... 46

Figure 3-4 A RDBMS representation of the data supporting mining cases with nested tables ................................................................................................. 47

Figure 3-5 Using a structure nested table as source for multiple model nested tables ......................................................................................................... 50

Figure 3-6 A decision tree built for rules extraction (part of a SQL Server forest) .................................................................................................................... 52

Figure 3-7 An artificial neural network ............................................................................. 53

Figure 4-1 - Creating a fuzzy set C to replace two similar sets A and B ............................ 69

Figure 4-2 Merging of similar rules ................................................................................... 70

Figure 4-3 A visual representation of the RGA ................................................................. 74

Figure 4-4 A finer grain approach to rule generalization ................................................. 80

Figure 4-5Accuracy of a fuzzy rule as a measure of similarity with the universal set .......................................................................................................... 83

Figure 5-1 Example of ROC Curve ..................................................................................... 91

Figure 5-2 Itemized Accuracy Curve for a top-N recommender ....................................... 98

Figure 5-3 Evolution of Lift and Area Under Curve for different values of N ................. 100

Figure 5-4 Aggregated Itemized Accuracy Curve based on the Movie Recommendations dataset (for N=5 recommendations) ................................... 105

Figure 6-1 Itemized Accuracy Chart for n=3 (Movie recommendations) ....................... 114

Figure 6-2 Evolution of Lift for various values of N for test models (Movie Recommendations dataset) ................................................................................ 116

Figure 6-3 Evolution of Lift for various values of N for test models (Movie Lens dataset) ....................................................................................................... 117

2

Acknowledgments

I would like to express my deepest gratitude to Prof. Dr. Răzvan Andonie for

his guidance, patience and encouragements. Above all, I would like to thank

him for rekindling my passion for academic research after years of industrial

experience.

Deep thanks also go to the Faculty of Electrical Engineering and Computer

Science at the “Transilvania” University for their help and advice with the

intermediate steps of the doctoral research as well as to dr. Daniela Drăgoi,

always a tremendous help for the doctoral program procedures.

I am also grateful to the amazing people that I met in my academic life,

particularly to prof. Petru Moroșanu and prof. dr. Tudor Bălănescu, and to

the wonderful colleagues at Microsoft Corporation and Predixion Software,

for their friendship, knowledge and experience.

At last, but certainly not least, my heartfelt thanks go to my family, Irinel

and Cosmin, for their most consistent help and support.

3

Publications, Patents and Patent Applications by the

Author

Books

1. MacLennan Jamie, Crivat Bogdan and Tang ZhaoHui Data Mining with Microsoft SQL Server 2008 [Book]. - Indianapolis, Indiana, United States of America : Wiley Publishing, Inc., 2009. - 978-0-470-27774-4.

2. Crivat Bogdan, Grewal Jasjit Singh, Kumar Pranish and Lee Eric ATL Server: High Performance C++ on .Net [Book]. – Berkeley, CA, United States of America : APress, Inc., 2003. - 1-59059-128-3.

Articles

3. Andonie Razvan, Crivat B [et al.] Fuzzy ARTMAP rule extraction in computational chemistry [Conference] // IJCNN. - 2009. - pp. 157-163. - DOI: 10.1109/IJCNN.2009.5179007.

4. Crivat, Ioan Bogdan SQL Server Data Mining Programability [Online] March 2005 [Cited: 6 22, 2011.] http://msdn.microsoft.com/en-US/library/ms345148(v=SQL.90).aspx.

Issued Patents (United States Patents and Trademark Office)

5. Crivat Ioan B, Petculescu Cristian and Netz Amir Explaining changes

in measures thru data mining [Patent] : 7899776. - United States of America, 2011.

6. Crivat Ioan B, Petculescu Cristian and Netz Amir Random access in run-length encoded structures [Patent] : 7952499. - United States of America, 2011.

7. Crivat Ioan B, Iyer Raman and MacLennan C James Detecting and displaying exceptions in tabular data [Patent] : 7797264. - United States of America, 2010.

4

8. Crivat Ioan B, Iyer Raman and MacLennan C. James Dynamically detecting exceptions based on data changes [Patent] : 7797356. - United States of America, 2010.

9. Crivat Ioan B, Iyer Raman and MacLennan James Partitioning of a data mining training set [Patent] : 7756881. - United States of America, 2010.

10. Crivat Ioan B, Petculescu Cristian and Netz Amir Efficient Column Based Data Encoding for Large Scale Data Storage [Patent] : 20100030796 . - United States of America, 2010.

11. Crivat Ioan B. [et al.] Extensible data mining framework [Patent] : 7383234. - United States of America, 2008.

12. Crivat Ioan Bogdan [et al.] Systems and methods that facilitate data mining [Patent] : 7398268. - United States of America, 2008.

13. Crivat Ioan, B [et al.] Using a rowset as a query parameter [Patent] : 7451137. - United States of America, 2008.

14. Crivat Ioan, B, MacLennan C, James and Iyer Raman Goal seeking using predictive analytics [Patent] : 7788200. - United States of America, 2010.

15. Crivat Ioan, Bogdan [et al.] Unstructured data in a mining model language [Patent] : 7593927. - United States of America, 2009.

16. Crivat Ioan, Bogdan, Cristofor Elena, D. and MacLennan C. James Analyzing mining pattern evolutions by comparing labels, algorithms, or data patterns chosen by a reasoning component [Patent] : 7636698. - United States of America, 2009.

17. Crivat Bogdan [et al.] Systems and methods of utilizing and expanding standard protocol [Patent] : 7689703. - United States of America, 2010.

Pending patent applications (United States Patents and

Trademark Office)

18. Crivat Ioan Bogdan [et al.] Techniques for Evaluating Recommendation Systems [Patent Application] : 20090319330 - United States of America, 2009.

5

1 Introduction

1.1 Objectives

The objective of this work is a synthesis of some of the recent efforts in the

domain of predictive and associative rules extraction and processing as well

as a presentation of certain original contributions to the area.

As used in this work, data mining is the process of analyzing data in order to

find hidden patterns using automatic methodologies. Due in part to major

computational advances in the last decades, extensive research in the area

of data mining led to development of many classes of pattern extraction

algorithms. These algorithms are often employed in systems that yield high

accuracy predictions but the patterns detected by such algorithms are,

more often than not, difficult to interpret.

A direct consequence of this difficulty is the high barrier encountered by

data mining to acceptance in the common information worker’s toolset.

The author spent the most of the last decade as one of the principal

designers and implementers of the Microsoft SQL Server Data Mining

platform, a product with the goal of making data mining more accessible to

information workers. This work is strongly influenced by this industrial

perspective.

1.2 Contributions

This work synthesizes the original contributions of the author over a period

of time longer than the actual doctoral studies, as illustrated by the author’s

6

patents and publications: [1], [2] , [3], [4], [5], [6], [7], [8], [9], [10], [11],

[12], [13], [14], [15]. This thesis does not include all these works, but it is

certainly a result of theirs.

Rule systems are collections of easily understandable patterns which often

can be translated to plain language statements. Significant research starting

around 1990 aimed to employ rule systems in data mining. This research

produced multiple algorithms for rule extraction as well as many techniques

for converting other patterns to rule sets.

Chapters 2 and 3 describe the area of rules mining. The author had an

extensive activity in this area, leading to the „Data Mining with Microsoft

SQL Server” [2] volume as well as to the materialization of some of the

author’s patents, such as [8], [13] and [14], [7], [9]. The volume has been

translated to Russian and Chinese, becoming the reference work for the

users of the SQL Server Data Mining. Sections of chapters 2 and 3 are based

on this volume.

In Chapter 4 we investigate some of the efforts for simplifying rule sets and

generalizing rules. The original contributions were initially published in the

Proceedings of the 2009 edition of International Joint Conference of Neural

Networks, an IEEE conference, and received the Best Poster Award Runner-

up distinction. Contributions, presented in Section 4.4, include:

7

Subsection 4.4.2: A novel method for post-processing a set of rules

in order to improve its generalization capability. The method is

developed specifically for rules extracted from a fuzzy ARTMAP

incremental learning system used for classification, hence for rule

generated indirectly (as Fuzzy ARTMAP does not directly produce

rules).

Subsection 4.4.3: An extension of the aforementioned method to

rule sets from common rule extraction algorithms, such as apriori.

Experimental results suggest a 5 to 10 times reduction in size for the

rules set, essentially by removing redundant rules.

Property 4.1: a theoretical result introduced and proven in

Subsection 4.4.3, describes the equivalence of generalizations for

predictive and associative rules and justifies the previously

mentioned extension.

Chapter 5 treats the area of measuring the accuracy of recommendation

systems. The original contributions were introduced, as a patent, in [16]

(“Techniques for Evaluating Recommendation Systems”). Contributions

include:

Section 5.4: The itemized accuracy curve, a novel instrument for

evaluating the quality of recommendation systems. This diagram

offers some advantages over the existing accuracy curves,

advantages presented in Section 5.6

8

Lemma 5.1 (in the same section) is an original theoretical result

which justifies using a certain recommendation system (MFnR, the

Most Frequent n-Items Recommender) as a baseline for comparing

other recommendation systems.

Section 5.5: Implementation details and optimization suggestions

for the algorithm which computes the Itemized Accuracy Curve,

intended for large data sets.

1.3 The Structure of this Thesis

The second chapter, Rules in the Data Mining context, introduces

data mining with a focus on rule extraction. The CRISP-DM standard for the

life cycle of a data mining project is described, together with some business

problems commonly approachable by data mining techniques. We focus,

then, on rules in data mining. We discuss association rules and their

properties as well as some notions of fuzzy modeling and fuzzy rules.

The third chapter, Methods for Rule Extraction, presents the most

commonly used methods for extracting rules. We start by presenting some

algorithms designed specifically for rule extraction, such as apriori and FP-

Growth. We discuss some of the problems raised by these algorithms as

well as solutions identified for those problems. Next, we present some

techniques for extracting rules from patterns detected by other algorithms

and focus on rule extractions from neural networks, a topic of significant

9

interest in the next chapter. A special section describes the specifics of rules

analysis in Microsoft SQL Server.

The fourth chapter, Contributions to Rule Generalization, reviews

some of the existing methods for simplifying rule models, and focuses on

measures for detecting rules similarity. Similar rules can be merged,

resulting in simpler rule systems. By interpreting one of these similarity

measures from the data mining rules analysis perspective, a novel

generalization method is proposed, which reduces the complexity of certain

rule sets and improves the interpretability of the model. The method,

introduced for very specific rules extracted from a Fuzzy ARTMAP predictor,

is extended for rule sets discovered by algorithms such as apriori.

The fifth chapter, “Measuring the Usage Prediction Accuracy of

Recommendation Systems”, presents the area of accuracy measurements

for recommendation systems, one of the most common applications of

association rules. A new instrument for assessing the accuracy of a

recommender is presented, together with some experimental results.

The sixth chapter, Experimental Results, presents some experimental

results for the techniques introduced in the 4th and 5th chapters. The results

are detailed for datasets used in presenting the methods or compared

against results from other authors.

The last chapter contains conclusions of this thesis as well as certain

directions for further research.

10

2 Rules in the Data Mining context

The term data mining has been used, in various publications, to mean

anything from ad-hoc queries [17] or pivot chart analysis [18] to

government domestic spying programs [19]. As used in this work, data

mining is the process of analyzing data to find hidden patterns using

automatic methodologies. Parts of this process may be referred, in cited

works, as machine learning, predictive analytics or knowledge discovery in

databases (KDD).

In this chapter we describe the main problems addressed by data

mining as well as the CRISP-DM standard for the life cycle of a data mining

project. We focus, then, in Section 2.3, on rules in data mining. We discuss

association rules and their properties. Section2.4 discusses rules from a

different perspective, that of fuzzy modeling.

2.1 Data mining in industry: an overview

Over the past several decades, computing power, growing according to

Moore’s law [20], and significant advances in storage technology (following

the so-called “Kryder’s Law”) [21] led to production of unprecedented data

volumes. Various industry and academic studies [22,23], estimate the

amount of data produced or consumed in the world to the order of

zettabytes (1 ZB = 103EC = 109TB = 1021 bytes). A large chunk of this data is

produced by business software, such as enterprise resource planning (ERP)

systems, customer relationship management (CRM) systems or database

servers. The result of this tremendous data production is that organizations

are rich in data and poor in knowledge. The sheer vastness of the data

11

collections makes practical use of the data stores limited. The main purpose

of data mining is to extract non-trivial, previously unknown and potentially

useful knowledge from the data [24].

2.2 Data Mining Problems, Tasks and Processes

Data Mining can be used in virtually all business applications, answering

various types of questions. Data mining technology was introduced initially

in highly specialized, dedicated software packages targeting highly skilled

specialists. More recently, though, data mining technology is perceived as a

commodity of any serious business intelligence platform. With a large offer

of data mining software integrated in major spreadsheet products, such as

11 Ants Model Builder [25], Predixion Insight [26] or SQL Server Data Mining

add-ins [27] , all one needs is the motivation and a bit of know how to apply

data mining on one’s business data.

Unless otherwise specified the material in this chapter is based on our

previously published volume [2].

2.2.1 Business Problems

Examples of common business data mining problems include:

Recommendation generation – What products or services should

one offer customers? Generating recommendations is an important

business challenge for retailers and service providers. Customers

provided with accurate and timely recommendations are likely more

valuable (because they purchase more) and more loyal (they feel a

stronger relationship to the vendor). Such recommendations are

12

derived from using data mining to analyze purchase behavior of

other of the retailer’s customers.

Anomaly detection – how to detect whether data is “good” or not?

Data Mining can be employed to analyze data, detect patterns that

govern most of the data then pick out those items that do not fit the

rest by not matching the common patterns. Credit card companies

frequently use data mining to determine whether a particular

transaction is valid (typical) or likely to be a fraud. Insurance

companies use anomaly detection to determine whether claims are

likely to be fraudulent.

Churn analysis – The telecommunication, banking and insurance

industries face severe competition and every business would like to

retain as many customers as possible. Churn analysis may help

marketing managers identify the customers that are likely to leave

and why. As a result, named managers can improve customer

satisfaction and consequently retain those customers.

Risk management – Data mining techniques are used to determine

the risk of a loan application, helping loan officers make appropriate

decisions regarding the cost and validity of each loan application.

Customer segmentation – This kind of analysis determines the

behavioral and descriptive profiles for customers. These profiles can

then be used to provide personalized marketing programs and

strategies that are appropriate for each group.

Targeted ads – web retailers or portal sites personalize their content

for their web customers. Using navigation patterns or online

13

purchase patterns, these sites can use data mining solutions to

display targeted advertisements that are most likely to trigger the

interest of these customers.

Forecasting – data mining techniques can be used to identify trends,

periodicities and noise levels in various numeric series and then

extrapolate, based on these parameters, the future evolution of

those series. These techniques are frequently used for future sales

or consumption estimations and consequently for inventory

planning.

Quality control for manufacturing processes – data mining can

provide tremendous help in identifying the root cause of various

manufacturing processes failures and defects.

2.2.2 Implementation Tasks

From the perspective of modeling data and choosing a data mining

technique or algorithm to use, the business problems described above, as

well as many others, can be translated to one or more general data mining

tasks such as those described below:

- Classification is the most commonly used data mining task. It

consists in learning, from a set of data points, the patterns that lead

to correctly assigning a category to a new data point. The machine

learning algorithms employed in classification tasks depend on the

user to indicate which of the features of the training data points

contains the target category. They detect then patterns linking the

other data features to the target. Because a user needs to indicate

14

the learning target, the classification algorithms are said to perform

supervised learning.

- Clustering (or segmentation) consists in identifying natural groups

of data points based on a set of features of these data points. Points

within the same group have more or less similar features. Clustering

is an unsupervised data mining task, as none of the features is of

special importance to the machine learning process.

- Association – detects common occurrences of specific features over

typically large data point populations. The next section of this

chapter discuss in more detail the association process

- Regression – this task is very similar to classification, but rather than

assigning a category to a new data point, regression’s goal is to

determine a numerical value. Simple linear line fitting techniques,

such as the Least Squares Method, are an example of regression.

More advanced regression algorithm support various non-numeric

features as inputs.

- Time Series Forecasting is the use of a model to forecast future

events based on known past events, in order to predict data points

before they are measured.

2.2.3 Data Mining Project Cycle

Starting in 1996, a cross-industry effort emerged to describe and formalize

the most commonly used approaches that expert data miners employ for

tackling various problems. This effort materialized in the first Cross Industry

15

Standard Process for Data Mining (CRISP-DM), an industry-neutral, tool-

neutral standard for approaching data mining problems. This standard is

now hosted and published by IBM and can be accessed here: [28] . The

CRISP-DM standard breaks down the data mining process in six major

phases. These phases are presented below, as described by CRISP-DM (from

[28]):

- Business understanding -- This initial phase focuses on

understanding the project objectives and requirements from a

business perspective, then converting this knowledge into a data

mining problem definition and a preliminary plan designed to

achieve the objectives.

- Data understanding -- The data understanding phase starts with

initial data collection and proceeds with activities that enable you to

become familiar with the data, identify data quality problems,

discover first insights into the data, and/or detect interesting

subsets to form hypotheses regarding hidden information.

- Data preparation. The initial raw data is seldom ready to be

consumed by modeling tools. The data preparation phase covers all

activities needed to construct the final dataset, ready for modeling

tools, from the initial raw data. Data preparation tasks are likely to

be performed multiple times and not in any prescribed order. Tasks

include table, record, and attribute selection, as well as

transformation and cleaning of data for modeling tools

16

- Modeling In this phase, various modeling techniques are selected

and applied, and their parameters are calibrated to optimal values.

Typically, there are several techniques for the same data mining

problem type. Some techniques have specific requirements on the

form of data. Therefore, going back to the data preparation phase is

often necessary.

- Evaluation -- At this stage in the project, you have built a model (or

models) that appear to have high quality from a data analysis

perspective. Before proceeding to final deployment of the model, it

is important to thoroughly evaluate it and review the steps executed

to create it, to be certain the model properly achieves the business

objectives. A key objective is to determine if there is some important

business issue that has not been sufficiently considered. At the end

of this phase, a decision on the use of the data mining results should

be reached.

- Deployment - At this stage in the project, you have built a model (or

models) that appear to have high quality from a data analysis

perspective. Before proceeding to final deployment of the model, it

is important to thoroughly evaluate it and review the steps executed

to create it, to be certain the model properly achieves the business

objectives. A key objective is to determine if there is some important

business issue that has not been sufficiently considered. At the end

of this phase, a decision on the use of the data mining results should

be reached.

17

The CRISP phases are iterative, as presented in Figure 2-1 below.

Figure 2-1 The CRISP-DM process (from [28])

2.3 Rules in Data Mining

Implementations of data mining processes (as described above) produce

comprehensible patterns of the data, patterns that can be used to describe

the data or make inferences about similar data pieces. These

comprehensible patterns can take multiple forms such as decision trees,

regression equations, distributions etc. Rules are a natural style of

representing patterns extracted from data. A rule consists of an antecedent,

or precondition (also known as left hand side, or LHS), and the consequent

or conclusion (right hand side, RHS). The antecedent consists of one or

18

more logical predicates (or tests). The conclusion consists in one or more

classes that apply to data points (instances) that satisfy the antecedent

conditions. For the scope of this work, the antecedent is a conjunction of

predicates. In this section, we will present a formal definition of rules and

some properties associated with rules. The rules are introduced from a

transactional database perspective (association rules). We will show how

this transactional perspective can be applied to classification rules (non-

transactional data).

2.3.1 Association Rules

Association Rules were introduced by Agrawal in [29] with the purpose of

analyzing vast transaction databases from large retail companies.

Let be a set of items. Let D be a set of data points

(database transactions). Each transaction T is a set of items such that T I.

Each transaction typically has a unique identifier, called Transaction Id

(TxId). For a set of items A, T is said to contain A if and only if AT.

An association rule is a logical statement of the form A B, where A B =

and A, B I. A is the antecedent, while B is the consequent of the rule

and are both itemsets (i.e. sets of items from the catalog I).

An itemset containing k items is said to be a k-itemset. The number of

transactions containing an itemset is the count or support count of the

itemset. Note that the support supp(A) for an itemset may be expressed as

19

an integer (the absolute number of transactions containing the itemset) or

as a percentage (of transactions, out of the total transactions in the

database, that contain an itemset). For the purpose of this work, supp is

always a percent. The absolute support is always denoted as supp count.

It is important to notice that a rule is itself an itemset, consisting of the

items in the set.

Rules are characterized by certain properties. Among them, referred in this

work:

- Confidence, defined as:

(2.1)

Confidence can be interpreted as an estimation of the conditional

probability of finding the right hand side part of the rule among transactions

containing the left hand side of the rule, P(B|A)

- Lift, defined as

(2.2)

the ratio between the observed support and the expected support (if A

and B were independent)

20

- Importance , defined [2] as:

(2.3)

acts as a measure of interestingness for a rule.

2.3.2 Classifications of association rules

As discussed in detail [30], association rules can be classified based on

different criteria. The most common rules classifications systems are

presented below and used throughout this thesis.

Rules can be classified:

- Based on the type of values handled in the rule. If a rule associates

the presence of certain items, it is a Boolean rule. If a rule describes

associations between quantitative items or attributes, it is a

quantitative association rule.

- Based on the dimensions of data involved in the rule: if the LHS

component has a single predicate, referring only one dimension,

then the rule is called single-dimensional. Conversely, a rule

referring multiple dimensions is a multi-dimensional association rule.

21

- Based on the level of abstractions involved in the rule set. Items

used in the various predicates composing the rule may appear at

different levels of abstraction, for example:

(Age = 20-30)Computer Games

(Age = 20-30)Computer Software

“Computer Software”, in this example, is a higher level of abstraction

than “Computer Games”. Rule sets mined at different abstraction

levels consist of multilevel association rules. If all rules refer the

same abstraction level, then the set is said to contain single-level

association rules.

2.3.3 The Market Basket Analysis problem

Association rules mining finds interesting associations among a large set of

items. A typical example is the market basket analysis. This process analyzes

customer buying habits by finding associations between items that are

frequently purchased together by customers (i.e. appear frequently

together in the same transaction, are frequently placed together in the

same “shopping basket”).

Such data is typically represented in a database as a transaction table,

similar to Table 2-1.

Order Number Model

SO51176 SO51176

Milk Bread

22

SO51177 SO51177

Bread Butter

SO51178 SO51178 SO51178

Milk Bread Butter

SO51179 SO51179 SO51179 SO51179

Apples Butter Bread Pears

Table 2-1 Representation of shopping basket data

For a data set organized like Table 2-1 the association rules concepts are

mapped like below:

- The space of items (I) is the set of all distinct values in the Model

column

- A transaction identifier (TxId) is a distinct value in the Order Number

column.

- A transaction, identified by a transaction id (e.g.”S051176”) is the

set of distinct Model values that are associated with all occurrences

of the specified transaction identifier.

- An itemset is a non-empty collection of distinct values in the Model

column

- A rule is, therefore, a logical statement like

(2.4)

where Mi is an item

23

Learning a set of association rules from market basket analysis serves

multiple purposes, such as to describe the frequent itemsets or to generate

recommendations based on the shopping basket content. Generating

recommendations for a given shopping basket is generally a two-step

process:

- Identify the rules whose precondition matches the current shopping

basket content

- Sort these rules based on some rule property, confidence, lift or

importance being the most frequently used such properties, the

recommend those consequences at the top of the sorted list that

are not already part of the shopping basket

2.3.4 Itemsets and Rules in dense representation

The data in Table 2-1 can be thought as a normalized representation of a

(very wide) table, organized in attribute/value pairs, like below:

Tx Id Milk Bread Butter Apples Pears

SO51176 1 1 0 0 0

SO51177 0 1 1 0 0

SO51178 1 1 1 0 0

SO51179 0 1 1 1 1 Table 2-2 Shopping basket data as attribute/value pairs

For most attributes in Table 2-2, a value of 0 signifies the absence (and a

value of 1, the presence) of an item in a transaction.

24

The representation in Table 2-2 is not efficient for a large catalog and

typically impossible in most RDBMS which handle up to around 1000

columns, as described in [31], [32]. However, this representation allows

adding new attributes to a transaction, attributes that are not necessarily

related to the items that are present in the shopping basket. For example,

demographic information about the customer or geographical information

about the store where the transaction has been recorded may be added to

the table. This information is typically describing a different dimension of a

transaction. (A discussion of multi-dimensional data warehouses does not

make the object of this work, but [2] as well as [30] contain thorough

discussions of the concepts).

A representation such as the one in Table 2-2 is said to be dense, as all the

features are explicitly present in data, with specific values indicating the

presence (1) or absence (0) of an item in a transaction. By contrast, a

representation such as the one in Table 2-1 is said to be sparse, as features

are implied from the presence (or absence) of an item in a transaction.

2.3.5 Equivalence of dense and sparse representations

The dense and sparse representations of transactions are equivalent with

regard to rules and itemsets.

Let now A={Ai} be the set of all the attributes that can be associated with a

densely represented transaction, attributes which may span multiple

dimensions (for the dense dataset, each attribute Ai is a column in the

dataset)

25

Let Vi={vij}, the set of all possible values of the Ai attribute.

Let an item be a certain state of an attribute, Ai=vij

Under this definition of an item, the association rules can be thought over

the dense transaction space by mapping the concepts like below:

- The space of items (I) is the set of all distinct attribute/value pairs

- A transaction identifier (TxId) is a distinct value in the Order Number

column.

- A transaction, identified by a transaction id (e.g.”S051176”) is the

set of attribute/value pairs defining the transaction row.

- An itemset is a non-empty collection of attribute/value pairs.

With these concepts, a rule, defined in equation (2.4) becomes a logical

statement like:

{A1=v1, …, Ai=vi} {Aj=vj, …, An=vn} (2.5) It is interesting to notice that, in the particular case when the number of

items in the consequent is exactly 1, an association rule becomes a

predictive rule, as it can be used to predict, with a certain confidence, the

value of a single attribute. As we will show in Section 3.1.4 below,

association rules may be employed, in commercial software packages, to

produce predictive rules, by mapping dense datasets using this

representation.

26

Note that range type columns (attributes) may have a very large number of

states, so for such attributes the corresponding set of values Vi may have

very high cardinality, leading to a very large number of 1-itemsets. Binning

(discretization) is often used to reduce the number of states of an attribute.

Rules which apply to range intervals are called quantitative association rules

(as opposed to the Boolean association rules which deal with qualitative

statements). Srikant and Agrawal, in [33], introduce a method of fine-

partitioning the values of an attribute and then combining the adjacent

partitions as necessary. This work also introduces a modified version of the

apriori rule detection algorithm (described in detail below), version which

detects quantitative association rules.

In a typical industrial system, the transaction table used to store this

information is likely to be significantly more complex. The item catalog may

contain millions of distinct items, a fact that raises significant challenges in

finding significant rules (more in the next section, “Methods for Rules

Extraction”). Also, in an industrial implementation, the transactions are

likely to be stored for analysis in a data warehouse, together with additional

related information, supporting multidimensional analysis of the data.

Dimensions associated with a transaction may include customer

information, time or geo-location information etc.

27

2.4 Fuzzy Rules

Fuzzy modeling is one of the techniques being used for modeling of

nonlinear, uncertain, and complex systems. An important characteristic of

fuzzy models is the partitioning of the space of system variables into fuzzy

regions using fuzzy sets [34]. In each region, the characteristics of the

system can be simply described using a rule. A fuzzy model typically consists

of a rule base with a rule for each particular region. Fuzzy transitions

between these rules allow for the modeling of complex nonlinear systems

with a good global accuracy. One of the aspects that distinguish fuzzy

modeling from other black-box approaches like neural nets is that fuzzy

models are transparent to interpretation and analysis (to a certain degree).

However, the transparency of a fuzzy model is not achieved automatically.

A system can be described with a few rules using distinct and interpretable

fuzzy sets but also with a large number of highly overlapping fuzzy sets that

hardly allow for any interpretation.

2.4.1 Conceptualizing in Fuzzy Terms

Supposing that a particular concept is not well defined, a function can

be used to measure the grade to which an event is a member of that

concept. E.g.: today is a rainy day – may have a very low value for sunny

days, a higher value for an autumn day, and a very high value for a

torrential rain day.

This membership function is typically defined to have values in the [0,1]

space, with 0 meaning that the event does not belong at all to a concept,

and 1 meaning that an event completely belongs to a certain concept. Such

28

a membership function µ may look like a Gaussian bell, a triangle, a

trapezoid or, in general, any take shape in the 0-1 interval (see Figure 2-2)

Figure 2-2 Standard types of membership functions (from [34] )

2.4.2 Fuzzy Modeling

Fuzzy modeling is a technique for modeling based on data. The result of this

modeling is a set of IF-THEN rules, with fuzzy predicates which establish

relations between relevant system variables. The fuzzy predicates are

0

0.2

0.4

0.6

0.8

1

1.2

Crisp

Trapezoidal

Triangular

Sigmoid

Z-function

Gaussian

29

associated with linguistic labels, so the model is in fact a qualitative

description of a system, with rules like:

IF temperature is moderate and volume is small THEN pressure is low

The meanings of the linguistic terms moderate, small and low are defined

by fuzzy sets in the domain of the respective system variables. Such models

are often called linguistic models.

Different types of linguistic models exist:

- The Mamdani model [35] uses linguistic rules with a fuzzy premise

part and a fuzzy consequent part

- The Takagi Sugeno (TS) model [36] uses rules that differ from

Mamdani models in that their consequents are mathematical

functions instead of fuzzy sets.

In a Mamdani model, the inference is the result of the rule that applies

in a certain point. The rule base represents a static mapping between the

antecedent and the consequent.

The TS model is based on the idea that the rules in the model will have

the following structure:

Ri: wi(IF X1 is Ai1 AND … AND Xn is Ain THEN Yi = fi(.)) (2.6) Where:

- Wi – is the rule weight (typically, 1, but it can be adjusted)

- fi – is usually a linear function of the premise variables, x1 … xn

30

The inference (prediction) of a TS model is computed as

∑

∑

(2.7)

i.e. the weighted average of the consequences of all the rules, where N

is the number of rules, Yi is the contribution of a certain rule and βi is

the degree of activation of the i-th rule’s premise. Given the input X=(x1,

x2, … xn), βi is computed like below (the product of the membership

function for all the predicates of the current rule)

∏ ( )

(2.8)

Because of the linear structure of the rule consequents, well known

parameter estimation techniques (e.g. least squares) can be used to

estimate the consequent parameters.

31

3 Methods for Rule Extraction

In this chapter, we present the most commonly used methods for

extracting rules. Section 3.1 below presents some algorithms designed

specifically for rule extraction, such as apriori and FP-Growth. We discuss

some of the problems raised by these algorithms as well as solutions

identified for those problems. Next, in Section 3.3 we present some

techniques for extracting rules from patterns detected by other algorithms

and focus on rule extractions from neural networks, a topic of significant

interest in the next chapter. A special section, 3.2, describes the specifics of

rules analysis in Microsoft SQL Server.

3.1 Extraction of Association Rules

In this section we present some of the algorithms designed specifically for

the extraction of association rules as well as some results comparing the

real-life performance of various rules extraction algorithms.

3.1.1 The Apriori algorithm

Apriori is an influential algorithm for mining frequent itemsets for Boolean

association rules, introduced by Agrawal in [37]. The algorithm uses prior

knowledge of frequent itemset properties. Its purpose is to avoid counting

the support of every possible itemset derivable from I. Apriori exploits the

downward closure property of itemsets: if any n-itemset is frequent, then

all its subsets must also be frequent. Frequent, in this context, means that

the support (supp) of an itemset exceeds a minsup minimum support

parameter of the algorithm. Itemsets that appear less frequently than the

specified minimum support are considered infrequent and ignored by the

32

algorithm. An itemset generation and test algorithm that was not using the

apriori property was introduced also by Agrawal in [29].

The apriori algorithm is initialized by counting the occurrences of each

individual item, therefore finding the frequencies for all itemsets of size 1.

The algorithm does this by scanning the data set and counting the support

of each item. The 1-itemsets with a frequency lower than minsup are

removed. The remaining 1-itemsets constitute L1, the set of frequent 1-

itemsets that are interesting for the algorithm.

Once initialized, the algorithm performs iteratively the following steps:

1. The join step: a set of candidate n-itemsets, Cn, is generated by

joining Ln-1 with itself. (By convention, apriori assumes that items

within a transaction are sorted lexicographically). The join is

performed by the compound key represented by the first n-2 items

in an itemset. Consider the (n-1)-itemsets A and B defined as below

A = {a1, a2, …, an-2, an-1}

B = {b1, b2, …, bn-2, bn-1} (3.1)

A and B are joined if the share the join key, i.e. if

⋀

(3.2)

33

As a result of joining A and B on the compound key, a new candidate

n-itemset is produced and inserted in the Cn set of candidate n-

itemsets:

C = {a1, a2, …, an-2, an-1, bn-1} (3.3)

The predicate in the join condition is not actually

part of the join key, but ensures that no duplicate itemsets are

generated as a result of the join.

2. The pruning step: Not all itemsets in the Cn set of candidate n-

itemsets meet the minsup requirement. Determining those itemsets

that meet the minsup requirement can be done with a scan of the

database. However, this is not always possible or practical, as Cn can

be huge. The downward closure property is now used to prune

some of the Cn items. If any (n-1) component of an n-itemset

candidate is not frequent, then the candidate cannot be frequent.

This test can be done quickly by maintaining a quick lookup

structure (e.g. tree, hash table) of all the frequent itemsets

discovered so far. For this step, actually, only the n-1 frequent

itemsets Ln-1 need to be stored in memory.

Upon completion of the pruning step, the reminder of the set Cn of

candidate n-itemsets becomes the set of frequent n-itemsets, Ln. The

iteration stops when either Ln is empty or the n+1 length of frequent

itemsets to be detected next exceeds a user defined threshold.

34

Figure 3-1: Finding frequent itemsets

Figure 3-1 illustrates the process of identifying frequent itemsets. The

minsup is set to 2.5% for a population of 10000 transactions, therefore 250.

At the first iteration, cheese and cake are filtered out. At the second

iteration, the candidate {diaper, milk} is disqualified. At the third iteration,

the candidate {beer, bread, diaper} has enough support, whereas the

candidate {beer, milk, bread} is filtered out, because it contains the {diaper,

milk} subset which has already been discounted as infrequent.

Once the frequent itemsets have been detected, the association rules can

be extracted easily. Typically, only rules exceeding a certain confidence

threshold are interesting. Let minconf be the minimum confidence

threshold, an algorithm input parameter. As mentioned in Section 2.3.1

above, the confidence of a rule A B is defined as

35

(3.4)

For each frequent itemset I, association rules can be generated like below:

- Generate all non-empty strict subsets {Si I} of the itemset

- For every non-empty subset, Si, determine the confidence of the rule

Ri: Si{I-Si}:

supp(Ri) =

(3.5)

- If supp(Ri)>minconf then add Ri to the set of rules

The apriori method of detecting frequent itemsets may need to generate a

huge number of candidate sets. For example, if there are 10,000 frequent

items, the algorithm will need to generate more than 107 candidate 2-

itemsets and then scan the database in order to test their occurrence

frequencies. Some other issues raised by the apriori algorithm (and, in

general, by any algorithm driven by a minsupp parameter) are discussed in

Section 3.1.4 which treats the problem of rare rules.

3.1.2 The FP-Growth algorithm

The Frequent Pattern Growth (FP-Growth) algorithm was introduced by

Jiawei Han in [38] and refined in [39], with the purpose of extracting the

complete set of frequent itemsets, without candidate generation.

36

Figure 3-2 An FP-Tree structure

The algorithm uses a novel data structure, called a Frequent pattern Tree

(FP-tree). A FP-tree is an extended prefix tree which stores information

about frequent patterns. Only the frequent 1-items appear in the tree and

the nodes are arranged in such a way that the frequently occurring items

have better chances of node sharing than the less frequently occurring

ones. An item header table can be built to facilitate the tree’s traversal.

Figure 3-2 presents such a tree, together with the associated item header

table. Once an FP tree is built, mining frequent patterns in a database us

transformed to that of mining the FP-Tree. Experiments [38] show that such

a tree may be orders of magnitude smaller than the dataset it represents.

The full algorithm for building the tree is presented in Appendix A: Key

Algorithms. While building the tree, the item header table is updated to

37

contain a node link (pointer) to the first occurrence of each item in the tree.

Any new occurrence of the item in the tree (as part of a different sub-tree)

ends up being linked to the previous occurrence, so that from the item

header table one can traverse all the tree occurrences of each individual

item.

Each transaction in the database is represented on one of the paths from

the FP-tree root to a tree leaf. Consequently, for each itemset α, any larger

itemset suffixed by α may only appear on a path containing α.

The second step of the algorithm consists in mining the tree to extract the

frequent itemsets. Each 1-itemset, in reverse order of the frequency, is

considered as an initial suffix pattern. By traversing the linked list of

occurrences of the initial suffix pattern in the tree, a conditional pattern

base is created, consisting of full prefix paths in the FP tree that co-occur

with the current suffix pattern. The conditional pattern base is used to

create a conditional FP-tree. This conditional tree is then mined recursively.

All the detected patterns are concatenated with the original suffix pattern

used to create the conditional FP-tree.

As opposed to Apriori, which performs restricted itemset generation and

testing, the frequent pattern mining algorithm performs only a restricted

testing of the itemsets. Also, the mining of the FP-tree is based on

partitioning, reducing dramatically the size of the conditional pattern base.

Refinements to the original FP-Tree mining algorithm are proposed in [39].,

including a method to scale the FP-tree mining by using database

38

projections. For ai a itemset and DB a database, the ai-projected database is

based on DB and contains all transactions which contain ai, after eliminating

from them infrequent items (all items that appear after ai in the list of

frequent items).

Also included in [39] is a comparative analysis of the FP-growth algorithm

and an alternative database projection-based algorithm, TreeProjection

(described in Section 3.1.3). The FP-Growth algorithm is determined to be

more efficient both in terms of memory consumption and computational

complexity.

3.1.3 Other algorithms and a performance comparison

Partition, an algorithm introduced in 1995 by Savasere et al. in [40]

generates all the frequent itemsets and rules in at most 2 scans of the

database. In the first scan, it divides the database in a number of non-

overlapping partitions and computes, for each partition, the frequent

itemsets. The union of these partition frequent itemsets is a superset of all

frequent itemsets, so it may contain items that are not globally frequent. A

second scan of the database is employed to compute the actual support for

all candidate itemsets (and remove those that are not globally frequent).

We mentioned, in the previous section, the TreeProjection algorithm. It was

introduced in 2000 by Agarwal, in [41], and uses a lexicographic tree to

39

represent the itemsets. Transactions are projected onto the tree nodes for

counting the support of frequent itemsets.

A different approach to rules mining is to discover the closed itemsets, a

small representative subset that captures frequent itemsets without loss of

information. This idea was introduced in 1999 by Pasquier et al. in [42]. An

algorithm to detect closed itemsets called CLOSE was introduced in the

same paper. After finding the frequent k-itemsets, Close compares the

support of each set with its subsets at the previous level. If the support of

an itemset matches the support of any of its subsets, the itemset is pruned.

The second step in Close is to compute the closure of all the itemsets found

in the first step. An improved version, A-CLOSE, was introduced in [43],

which generates a reduced set of association rules without having to

produce all frequent itemset, reducing, this way, the computational cost.

Charm is another algorithm for generating closed frequent itemsets for

association rules, introduced in [44]. Charm explores simultaneously the

itemset space as well as the transaction space and uses a very efficient

search method to identify the frequent closed itemsets (instead of

enumerating many possible subsets)

A 2001 study compared the performance of some of the commonly used

rules or frequent itemset detection algorithms [45]. Apriori, FP-Growth and

TreeProjection were included among the tested algorithms. The study used

three real-world datasets as well as one artificial dataset, T10I4D100K from

40

IBM Almaden. The original URL indicated for the data generator,

http://www.almaden.ibm.com/software/quest/Resources/index.shtml,

seems unavailable now (June 2011), but the test datasets can be

downloaded from http://fimi.ua.ac.be/data/ ). The algorithm performances

claimed by their respective authors were confirmed on artificial datasets,

but some of these gains did not seem to carry to the real datasets. As

reported in [45], a very quick growth in the number of rules is associated

with very small changes in the minimum support threshold, suggesting that

the choice of algorithm only matters at support levels that generate more

rules than would be useful in practice.

3.1.4 Problems raised by Minimum Support rule itemset

extraction systems

The most commonly used algorithms for rule extraction, apriori and FP-tree,

just like most of the other algorithms mentioned previously, focus on

finding frequent itemsets, i.e. itemsets that exceed a certain minimum

support. All itemsets (and, consequently, rules) that do not meet the

minsup threshold are ignored by these algorithms.

Rules with low support and high confidence, however, may be very

interesting for certain applications, particularly for e-Commerce

applications which aim to yield high profit margin by suggesting customers

items of interest. Customers with exotic tastes may be a small minority, but

http://www.almaden.ibm.com/software/quest/Resources/index.shtml

http://fimi.ua.ac.be/data/

41

they share, in their respective clusters, similar interests, and

recommendation systems should, at least theoretically, be able to make

good appropriate suggestions in their case.

Rare rules may be of two forms:

- Both the antecedent and the consequent have small support and fail

the minsup test. In this case, they are never considered by common

algorithms

- The predicates in antecedent and/or the consequent exceed the

minsup criterion, but they only rarely co-occur, and the combination

ends up being ignored by the algorithms.

The simple solution of reducing the minsup threshold will not function

practically. On a theoretical level, the minimum support criterion is what

makes both apriori and FP-tree practical for large datasets. The comparative

study in [45] (discussed in Section 3.1.3 above) shows that, on certain

datasets, small reductions on minimum support value may lead to

extremely rapid growth in the number of rules.

Some research has been carried recently in the area of rare rules detection.

A collection of most significant results in this area is available in [46]. A few

different approaches have been taken in solving this problem.

One of the approaches consists in using a variable minimum support

threshold. Each itemset may have a different support threshold, which can

be predefined or can be dynamically lowered to allow for rare itemset

inclusion.

42

Multiple Support Apriori (MSApriori), introduced in [47], allows each

database item to have its own minimum support. The minimum support for

an n-itemset, n>1, is computed as the minimum per components. To

facilitate the detection of small support itemsets, the items are sorted in

ascending order of their minimum support values rather than in the

conventional lexicographic order used by apriori. As it is impractical to

associate an individual minimum support with each item in a large product

catalog, the authors suggest a Lowest Allowable Minimum Support (LS) and

a constant β[0,1] as algorithm parameter. An arbitrary item’s minimum

support will then be

(3.6) The algorithm detects certain rare itemset and rules, but the criterion is

user’s β value rather than the frequency of items.

Relative Support Apriori, introduced in [48], is a refinement on top of

MSApriori which avoids the user input (the β parameter of the MSApriori

algorithm) and defines a new threshold for itemsets, the relative support,

which measures the confidence of rare items. The relative support

threshold (defined below) imposes a higher support limit for items that are

globally infrequent.

43

(3.7)

Adaptive Apriori, introduced in [49], introduces the idea of support

constraints, a function which produces minimum support for specified

itemsets. Multiple constraints are combined by picking the minimum. The

resulting apriori implementation generates only necessary itemsets, i.e.

itemsets that meet the set of predefined constraints.

LPMiner, introduced in [50], also uses a variable minimum support

threshold. The authors propose a support threshold which decreases with

the length of the itemset. The implementation is based on the FP-tree

algorithm.

A very different approach consists in completely eliminating the minimum

support threshold.

A family of algorithms based on MinHashing is presented in [51]. These

algorithms detect rare itemsets of very highly correlated items. The

algorithms represent transactions, conceptually, as a 0/1 matrix with one

row per transaction and as many columns as distinct items. In this

representation, the confidence of a rule is the number of rows with 1 in

44

both columns divided by the number of rows with1 in either column. This

representation is not practical, as it would be very large. The authors

suggest computing a hashing signature for each column so that the

probability that two columns have the same signature is proportional to

their similarity.

An example of such a hash, a random order of rows is selected and a

column’s hash is the first row index (under the new order) where the

column has a 1. The article shows that the probability that two columns

share a signature is proportional to their similarity.

To reduce the number of false positives and false negatives, multiple

signatures are selected (by repeating the process independently) . The

resulting candidate pairs are generated and checked against the real

database (the original matrix). The algorithm is implemented for rules with

2 or 3 itemsets but is not yet extended beyond this size.

Apriori Inverse, proposed in [52], is also a variation of the apriori algorithm

but it uses maximum support instead of minsup. Candidates of interest are

below maxsup, but still above an absolute minimum support (minabssup,

noise threshold). A rule X is interesting if sup(X)<maxsup AND

sup(X)>minabssup.

45

Apriori Rare, proposed in [53], splits the problem of detecting rare itemsets

in two tasks. The authors introduced the concepts of:

- Maximal frequent itemset (MFI), an itemset which is frequent, but

all its proper supersets are rare

- Minimal rare itemset (mRI) , a rare itemset having all proper subsets

frequent

- Generator, an itemset that has no proper subset with the same

support (i.e. )

The mRIs can be detected naively, using apriori, or by using a new algorithm

introduced in the paper, called MRG-Exp, which avoids exploring all

itemsets and instead only looks for frequent generators in the itemset

lattice. The second part consists in restoring rare itemsets from mRIs, using

an algorithm called Arima (A Rare Itemset Miner Algorithm).

3.2 An implementation perspective: Support for association

analysis in Microsoft SQL Server ® 2008

This section describes some of the innovations supporting association

analysis in the Microsoft SQL Server ® Analysis Services 2008 platform (AS)

as a context for some for some of the work presented in this document. We

originally published the core of this material in our previously published

volume [2].

46

Analysis Services separates between data storage objects (mining

structures), which are in essence multi-dimensional data stores , and mining

models, instantiations of data mining algorithms on top of projections of

mining structures.

In the simplest case, a mining structure may be a table. A mining model

belonging to that structure may use some or all of the table columns.

Mining structure columns can be referred more than once in the same

mining model. A mining case is a data point used in training a data mining

algorithm or one that needs to be scored by a trained algorithm.

A significant innovation in the AS product is the concept of nested tables.

From data mining modeling perspective, a nested table is a tabular feature

of a mining case.

Figure 3-3 A mining case containing tabular features

Figure 3-3 represents such a mining case (a customer, in this case). The case

contains certain scalar features such as Key (a unique identifier), Gender,

Age or name. Tabular features, such as the list of purchases or ratings

47

produced by this costumer for certain movies, can also be logically

associated with the customer.

From a relational database perspective, a customer with the related

information is represented as join relationships between several tables.

Figure 3-4 presents the relational database structure associated with the

mining case represented by the customer plus purchases and movie ratings.

A mining structure can store data from multiple tables and models built

inside that structure can access data from multiple tables as features of the

same mining case. The modeling of nested tables is centered on the key

columns of the nested table. Each individual value of a nested table key is

mapped to one or more modeling attributes.

Figure 3-4 A RDBMS representation of the data supporting mining cases with nested tables

48

For example, consider a classification model that aims to predict a

customer’s age based on gender and the lists of purchases as well as movie

ratings. Each mining case will have the following attributes:

- Gender , Age – from the People table

- Purchases(Milk), Purchases (Bread), Purchases (Apples) – all of

them with values of Existing/Missing

- Purchase(Bread).Quantity, Purchases(Milk).Quantity,

Purchases(Apples).Quantity – either missing or mapped to the

Quantity column of the Purchases relational table

The feature space for a mining case is very wide and contains all possible

values for each nested table key (and the related attributes). However, a

mining case is represented sparsely, only those nested attributes having the

Existing state are presented to the mining algorithm. Given that the mining

algorithm has full access to the feature space information (dimensionality,

data types), it can effectively mine the very large feature space.

The abstraction on top of the physical feature set is part of the AS platform

and all the data mining algorithms running on the AS platform must,

therefore, support sparse feature sets.

The nested table concept in AS allows mining complex patterns directly

from relational database, without a need to move the data to an internal

representation.

49

The nested tables are particularly useful in mining association rules, as they

map to the database representation of transactions. Using equivalence

(shown in Section 2.3.4 above) between the transactional and tabular data

for the association rules algorithm, the result is an implementation that can

detect association rules between nested table items (transactional items)

and scalar features. The AS implementation of association rules is,

therefore, able to produce associative rules combining multidimensional

predicates, like below:

⇒ (3.8)

Models, inside mining structures, use projections of the data in the

structure. Columns from the mining structure may appear once, multiple

times or not at all in a model. Rows of the mining structure may be filtered

out of models as well.

50

Figure 3-5 Using a structure nested table as source for multiple model nested tables

Figure 3-5 presents an example of complex modeling using filters:

- The mining structure on left contains a single nested table with 2

columns: product name and a flag indicating whether the product

was On Sale when purchased or not

- A model is built inside the mining structure, containing two nested

tables, both linked to the single mining structure nested table, but

with different row filters.

Rules can be mined, now, to detect how On Sale products drive sales of

other products.

3.3 Rules as expression of patterns detected by other

algorithms

The descriptive power of rules makes them a frequently used tool for

explaining the patterns extracted by various machine learning algorithms.

51

3.3.1 Rules based on Decision Trees

Decision trees building algorithms are frequently used for rule extraction.

Trees induction methods are producing patterns that can be easily be

converted to rule sets. Every node in a classification tree (such as ID3,

iterative dichotomiser 3, introduced by Quinlan [54]) or classification-and-

regression-trees (CART, introduced by Breiman at al., [55]) can be easily

converted to a rule by treating the full path, from root to the respective

node, as antecedent and the histogram of the node as consequent.

Collections of trees (forests) can be used to extract association rules, similar

to the ones detected by the apriori algorithm. An example for this is

implemented in Microsoft’s SQL Server data mining product, as we

described in [2]. In such an implementation, a tree is built for each item in

the item catalog, with the purpose of extracting rules that have that

respective item as a consequent. Figure 3-6 shows such a tree, built for the

“Eyes Wide Shut” movie item as a consequent. An example of such rule is:

R1: (“Full metal Jacket”)(“Eyes Wide Shut”)

supp(R1)=(total support for the leaf node)=56

conf(R1)=(from the histogram of the leaf node)=11/56=0.1964

52

Figure 3-6 A decision tree built for rules extraction (part of a SQL Server forest)

3.3.2 Rules from Neural Networks

An artificial neural network (ANN) is a mathematical (or computational)

model inspired from functional aspects of biological neural networks. An

ANN consists of groups of artificial interconnected neurons. A very

thorough description of artificial neural networks does not make the object

of this work and can be found in [56]. Some concepts and properties of

ANNs that are relevant to this work are summarized from [56] in this

section.

53

Figure 3-7 An artificial neural network

Each artificial neuron is a simplified abstraction of a biological neuron. A

neuron receives one or more inputs and sums them to produce an output. A

neuron typically combines the inputs by means of some weighted sum, and

then the result is passed through a non-linear function called activation or

transfer function for the neuron. The output of a neuron is:

∑

(3.9)

54

Where:

- m is the number of inputs for the current neuron

- wkj is, respectively, the weight associated with the connection

between input j and the current neuron

- xj is the actual input value

- is the activation function for the neuron.

Frequently used activation functions include the step function or a sigmoid

function.

The ANN in Figure 3-7 has neurons disposed in 3 layers: an input layer, a

hidden one and an output layer. Complex systems may have more hidden

layers. For the purpose of this work, networks can be organized as any

directed acyclic graph (feed forward networks).

An artificial neural network is usually defined by

- The topology of the network (the connections between neurons)

- The learning process for updating the weights of the

interconnections

- The activation functions of the neurons

Neural networks can be used to model complex relationships between

inputs and outputs and are frequently employed in such tasks as

classification or pattern recognition.

55

More complex neural network types were proposed for modeling complex

biological processes, such as cortical development and reinforcement

learning. The Adaptive Resonance Theory (ART), for example, described in

detail in [57], is a special kind of neural network with sequential learning

ability.

The internal structure of neural networks, specifically the presence of the

hidden layers, makes them capable of solving certain classes of difficult

classification problems (such as the non-linearly separable problems). It is

the same complexity, on the other hand, that makes neural networks less

intuitive and more difficult to interpret. A very large corpus of research has

been produced in the last decades on changing the black-box status of

neural networks and exposing the patterns inside.

Three classes of techniques are often used to describe the patterns learned

by a neural network:

- Visualization of the neural network consists of directly describing

the network topology, the weights associated with the connections

and the activation functions of the neurons

- Sensitivity analysis consists in probing the ANN with different test

inputs then recording the outputs and determining, in the process,

the impact or effect of an input variable on the output.

- Rule extraction consists in producing a set of rules that explain the

classification process

56

Visualization and sensitivity analysis do not make the object of this work.

The rest of this section presents some of the methods used in extracting

rules from neural network.

The rules extracted from a network may be crisp or fuzzy. A crisp rule is a

proposition offering crisp Yes and No answers, such as the one below:

(3.6) A fuzzy rule is a mapping from the X input space to the space of fuzzy class

labels, as described in Section 2.4 above.

While chronologically not the first work in the area of rule extraction from

neural network, a 1995 survey on rule extraction, [58], is of particular

interest, as it introduced a frequently used taxonomy of the methods used

for rule extraction from ANNs, based on the expressive power of the rules,

the translucency of the technique (relationship between rules and ANN’s

structure), quality of the rules (accuracy, fidelity to the ANN’s

computations, comprehensibility), algorithmic complexity and the

treatment of variables. The taxonomy has been updated in 1998 in [59] to

cover a broader range of ANNs, such as recurrent networks. One of the first

methods for extracting rules from a neural network was proposed by Saito

and Nakano in 1988, in [60]. It is a sensitivity analysis approach, which

observes the effects that changes in the inputs cause on the network

output. The problem raises challenges due to the large number of input

combinations that need to be evaluated. The authors employ a couple of

57

heuristics to deal with this problem, such as limiting the number of

predicates that may appear in an input.

In 1999, it is shown, in [61], that multilayer feed-forward networks are

universal approximators, i.e. can uniformly approximate any real

continuous function on a compact domain. In 1994, the same thing is

shown, in [62], for certain fuzzy rules based systems (FRBS), specifically

fuzzy additive systems, i.e. systems based on rules such as:

⇒

(3.11)

where pjk is a linear function on the inputs.

This equivalence led authors to discuss the equivalence of neural nets and

fuzzy expert systems, as shown in [63]. In 1998, Benitez at al. offer a

constructive proof in [64] for the equivalence of certain neural networks

and certain fuzzy rules based systems (FRBS). They show how to create a

fuzzy additive system from a neural networks with 3 layers (single hidden

layer) which uses a logistic activation function in hidden neurons and an

identity function in output neurons. The area of neuro fuzzy systems is

particularly interesting in the context of this work as it provides context for

some of the results presented in Chapter 4 below.

More work on the level of equivalence between fuzzy rule-based systems

and neural networks is presented in [65]. The authors provide a survey of

58

neuro-fuzzy rule generation algorithms. This work is used in 2005 in [66] to

extract rules IF-THEN rules from a fuzzy neural network and explain to drug

designers, in a human-comprehensible form, how the network arrives at a

particular decision.

More recently, in 2011, Chorowski and Zurada introduced a new method in

[67], called LORE (Local Rule Extraction), suited for multilayer networks with

logical or categorical (discrete) inputs. A multilayer perceptron is trained

under standard regime and then converted to an equivalent form that

mimics the original network and allows rule extraction. A new data

structure, the Decision Diagram, is introduced, which allows efficient partial

rule merging. Also, a rule format is introduced which explicitly separates

between subsets of inputs for which the answer is known from those with

an undetermined answer.

59

4 Contributions to Rule Generalization

This chapter is organized as follows. The first subsection describes some

concepts related to fuzzy rules generalization and simplification, while the

second section briefly discusses several methods for optimizing and

simplifying the rule sets. The third section focuses on one of these methods

(the Rule Base Simplification based on Similarity Measures).

The fourth section presents a rule generalization algorithm introduced in [1]

for rules extracted from Fuzzy ARTMAP classifiers. The algorithm is then

adapted to rule sets produced by common rule extraction algorithms, such

as apriori. The last section contains some ideas for further research and

some conclusions.

4.1 Fuzzy Rules Generalization

One of the aspects that distinguish fuzzy modeling from other black-box

approaches like neural nets is that fuzzy models are, to a certain degree,

transparent to interpretation and analysis. However, the transparency of a

fuzzy model is not achieved automatically. A system can be described with a

few rules using distinct and interpretable fuzzy sets but also with a large

number of highly overlapping fuzzy sets that hardly allow for any

interpretation.

Description of a system using natural language is an advantage of fuzzy

modeling. A simplified rule base makes it easier to assign qualitatively

meaningful linguistic terms to the fuzzy sets, and it reduces the number of

60

terms needed. It becomes easier for experts to validate the model and the

users can understand better and more quickly the operation of the system.

A model with fewer fuzzy sets and fewer rules is also better suited for the

design and implementation of a nonlinear (model-based) controller, or for

simulation purposes, and it has lower computational demands. Several

methods have been proposed for optimizing the size of the rule base

obtained with automated modeling techniques, and some of them are

discussed in this chapter. One of them, discussed in detail in Section 4.3 on

Similarity Measures and Rule Base Simplification, consists in measuring the

similarity of fuzzy rules and sets and merging them in order to simplify the

model. We build on the concepts introduced by this work and propose a

new method of simplifying the rule set by generalizing the rules in the

model, using data mining rule concepts such as support and accuracy.

4.1.1 Redundancy

Fuzzy models, especially if acquired from data, may contain redundant

information in the form of similarity between fuzzy sets. Three unwanted

effects that can be recognized are

1) Similarity between fuzzy sets in the model;

2) Similarity of a fuzzy set to the universal set;

3) Similarity of a fuzzy set to a singleton set.

61

As similar fuzzy sets represent compatible concepts in the rule base, a

model with many similar fuzzy sets becomes redundant, unnecessarily

complex and computationally demanding.

Some of the fuzzy sets extracted from data may be similar to the universal

set. Such fuzzy sets are irrelevant. The opposite effect is similarity to a

singleton set. During adaptation, membership functions may get narrow,

resulting in fuzzy sets almost like singletons (spikes). If a rule has one or

more such fuzzy sets in its premise, it will practically never fire, and thus the

rule does not contribute to the output of the model. However, it should be

noted that such rules may represent exceptions from the overall model

behavior

4.1.2 Similarity

Different measures have been proposed for similarity of fuzzy sets. In

general, they can be divided in

- Geometric similarity measures (e.g Minkowski class of distance

functions)

∑

⁄ (4.1)

- Set-theoretic similarity measures (e.g. consistency index):

[ ] (4.2)

Where ˄ is the minimum operator

62

Setnes et al., in [68], describe some of the problems associated with using

these measures. The paper defines a set of criteria for such a measure and

introduces such a measure, which will be discussed in detail in Subsection

4.3 below.

4.1.3 Interpolation based rule generalization techniques

Takagi-Sugeno and Mamdani models perform inferences under the

assumption that the rule set completely covers the inference space (i.e. it is

dense). Interpolative reasoning methods address the problem of sparse rule

sets, which do not cover the whole inference space.

Mizumoto and Zimmermann, in [69], analyze the properties of rule

models and the possibility to interpolate new rules in the “generalized

modus tollens”. A modus tollens rule may be written, in logical operator

notation, as

(4.3)

In 1993, in [70], Kóczy and Hirota propose a method (KH-rule

interpolation) for interpolations where results are inferred based on

computation of each -cut level, and the resulting points are connected by

linear pieces to yield an approximate conclusion.

63

4.2 Rule Model Simplification Techniques

Extensive research is available for rule model simplification techniques.

Such techniques may target the feature set considered for rule inference,

the definition of the fuzzy sets participating in the rules or the structure of

the rules models.

4.2.1 Feature set alterations

Feature set alteration techniques share the goal of reducing the number of

features that participate in the inference process. A direct consequence of

applying such alteration techniques is that they result in simplified rule

systems, because a reduction in the number of features implies a smaller

number of predicates in rules’ premises. Such alterations can be classified

as Feature Extraction or Feature Selection techniques.

Feature Extraction techniques allow synthesizing of a new, lower-

dimension feature set which encompasses all or most of the variance of the

original feature set (i.e. the original information is preserved or the loss is

minimal). Such techniques include Principal Component Analysis (aka

Karhunen-Loewe transform), described in [71], which consists in identifying

the eigenvectors of the covariance matrix of the training data and

projecting the data on these eigenvectors. The eigenvalues associated with

these eigenvectors provide a measure of the variance of the whole system

along these vectors and consequently allow sorting the new coordinates

(the eigenvectors) in the order of variance. Frequently, for real data sets, a

low number of eigenvectors can account for 95% or more of the variance in

data.

64

A similar feature extraction technique is Sammon’s non-linear

projection [72]. In this approach, a set of high-dimensional vectors are

projected in a low-dimension space (2 or 3 dimensions) and a gradient

descent technique is used to adjust the projections so that the distance

between projections is as close as possible to the distance between the

original pairs of vectors. As the preservation of the semantic meaning is a

major advantage of the fuzzy rule models, techniques for feature

transformation (which inherently alter the model’s semantics) are not

treated in depth in this paper.

Feature Selection techniques do not create new features, but rather

identify the top most significant features to be used in building a model. On

real data sets, this approach often provides very good results because of

redundancy, co-linearity or irrelevance of certain data dimensions. Dash

and Liu, in [73], provide an extensive overview of the feature selection

techniques commonly used in classification systems. A very popular

technique for feature selection is the information gain method, introduced

in [54]. The information gain feature selection method sorts the input

features by the amount of entropy they reduce from the whole system and

can be used to determine which features should be retained, by keeping

those whose information gains are greater than a predetermined threshold.

Feature selection does not affect the semantic meaning of the rule model

and is used for rule simplification techniques.

65

4.2.2 Changes of the Fuzzy sets definition

Song et al., in [74], suggest using supervised learning to adapt the

parameters of the of the fuzzy membership functions defining the

components of the rules. With the assumption that the inference surface is

relatively smooth, over-fitting of the fuzzy system can be detected in two

ways. Two membership functions coming sufficiently close to each other

can be fused into a single membership function, and membership functions

becoming too narrow can be deleted. In both cases, this adaptive pruning

improves the interpretability of the fuzzy system. This approach is related to

our proposed method for rules generalization and the methods will be

compared in Subsection 4.4 below.

4.2.3 Merging and Removal Based Reduction

Automatically generated rule systems often produce redundant, similar,

inconsistent or inactive rules. Handling of similar rules is detailed in the next

section, covering Similarity Measures and Rule Base Simplification.

Inconsistent rules destroy the logical consistency of the models. Xiong and

Lits, in [75], propose a “consistency index” numerical assessment which

helps measuring the level of consistency/inconsistency of a rule base. They

use this index in the fitness function of a genetic algorithm which searches a

set of optimal rules under two criteria: good accuracy and minimal

inconsistency.

66

4.3 Similarity Measures and Rule Base Simplification

Setnes at al., in [68], propose a similarity measure for rules in a model.

Based on this measure, similar fuzzy sets are merged to create a common

fuzzy set to replace them in the rule base, with the goal of creating a more

efficient and more linguistically tractable model.

A similarity measure for two fuzzy sets, A and B, is defined as a function

[ ] (4.4)

A set of 4 criteria for a similarity measure is first introduced in [68]:

- Non-overlapping fuzzy sets should be totally non-equal.

That is,

(4.5)

- Overlapping fuzzy sets should have a similarity value

greater than 0

(4.6)

- Only equal fuzzy sets should have a similarity value of 1

(4.7)

67

- Similarity between two fuzzy sets should not be

influenced by scaling or shifting the domain on which

they are defined

With these criteria, [68] proposes a new similarity measure, based on set

theory, defined as:

(4.8)

This measure is, therefore, the ratio between the cardinality of intersection

and reunion of the sets. When the equation is rewritten using the

membership functions, in a discrete space X=(x1, x2, …, xn) it becomes:

∑ [ ]

∑ [ ]

(4.9)

The operators are, respectively, minimum (˄) and maximum (˅). This

similarity measure complies with the four criteria above and reflects the

idea of gradual transition from equal to completely non equal fuzzy sets/

With this measure defined, [68] proceeds to simplifying the rule base. Rules

that are similar to the universal fuzzy set (S(A,U)~1, x in X) can, for

example, be removed.

68

The paper also provides a solution for merging similar rules. For this, it uses

a parametric trapezoidal representation of fuzzy sets, each rule being

described by parameters:

{

(4.10)

The merging of two similar fuzzy sets, A and B, defined by µA(x; a1, a2, a3,

a4) and µB(x; b1, b2, b3, b4) is defined as a new fuzzy set, C, defined by µC(x;

c1, c2, c3, c4), where:

c1 = min(a1, b1)

c4 = max(a4, b4)

c2 = λ2a2 + (1- λ2b2)

c3 = λ3a3 + (1- λ3b3)

(4.11)

In the definition of the C fuzzy set, λ2, λ3 are between 0 and 1 and

determine which fuzzy set, A or B, has more influence on the newly

generated set C, with a default value for both of 0.5.

69

Figure 4-1 - Creating a fuzzy set C to replace two similar sets A and B (from [68])

With the merging solution described above, the authors propose an

algorithm for simplifying the rules in the model. The algorithm performs the

following steps:

- Select the most similar pair of fuzzy sets

- If the similarity score exceeds a certain parameter, λ,

then merge the two fuzzy sets and update the rule set

- Repeat until no pair of fuzzy sets exceeds the λ threshold

- For each rule in the system, compute the similarity with

the universal set (U, µU(x)=1 x in X). If the similarity with

the universal set exceeds a certain threshold , then

remove the rule from the set (too universal)

- Merge the rules with identical premise part

70

Figure 4-2 Merging of similar rules (from [68])

Further work in [76] refines the method in [68] by the following steps:

- Reduce the feature set by feature selection

- Apply the method in [68]

- Apply a Genetic Algorithm to improve the accuracy of the

rules. To maintain the interpretability of the rule set, the

genetic algorithm step is restricted to the neighborhood

of the initial rule set

4.4 Rule Generalization

In [1], four molecular descriptors are used (molecular weight, number of H-

bond donors and acceptors, and ClogP) to predict biological activity (IC50). In

the paper, we introduced a novel rule generalization algorithm and a rule

inference procedure able to improve the rules extracted from a neural

71

network. This section describes the rule generalization algorithm, discusses

the results and proposes some directions for further research.

4.4.1 Problem and context

In [1], the IC50 prediction task uses a FAM-type prediction technique called

Fuzzy ARTMAP with Relevance (FAMR).

The Adaptive Resonance Theory (ART), described in detail in [57], is a

special kind of neural network with sequential learning ability. ART’s pattern

recognition features are enhanced with fuzzy logic in the Fuzzy ART model,

introduced in [77].

The FAMR is an incremental, neural network-based learning system used for

classification, probability estimation, and function approximation,

introduced in [78]. The FAMR architecture is able to sequentially

accommodate input-output sample pairs. Each such pair may be assigned a

relevance factor, proportional to the importance of that pair during the

learning phase.

FAM networks have the capability to easily expose the learned knowledge

in the form of fuzzy IF/THEN rules; several authors have addressed this issue

for classification tasks, such as [79] , [80]. The final goal in generating such

rules would be to explain, in human-comprehensible form, how the

network arrives at a particular decision, and to provide insight into the

influence of the input features on the target. To the best of our knowledge,

no author has discussed FAM rule extraction for function approximation

tasks, such as IC50 prediction.

72

Carpenter and Tan, in [79] and [81] were the first who introduced a FAM

rule extraction procedure. To reduce complexity of the fuzzy ARTMAP, a

pruning procedure was also introduced. In [1] we adapt Carpenter and Tan‘s

rule extraction method for function approximation tasks with the FAMR.

4.4.2 The rule generalization algorithm

Let O be the set of rules extracted from the FAMR model. In this section,

the quality of the rules in O is analyzed from the perspective of the

confidence (conf) and support (supp) properties described in Section 2.3.1

above.

The rules in O have support between 0.0% and 16.47%, and confidence

between 0.00% and 100.00%. To ensure the quality of the final rule set, we

use a minimum confidence and a minimum support criterion for the output

rules and prune the rules, from the extracted set, which do not meet these

minimum support and confidence criteria.

The set of rules extracted this way has the following characteristics:

All rules are complete with regard to the input descriptors (the antecedent of each rule contains, therefore, one predicate for each descriptor), a consequence of the rule extraction algorithm.

Certain descriptor fuzzy categories do not appear in any rule. To further analyze this rule set, we introduce two new measures for the rule

set:

73

- Coverage: The percentage of training data points which have the following property: There exists at least one rule for which the molecule‘s descriptors fall within the range of the antecedent (i.e. the percentage of points for which at least one rule is triggered).

- Accuracy: The percentage of training data points which have the following property: There exists at least one rule for which the molecule‘s descriptors fall within the range of all antecedents and, in addition, the output falls within the range of the consequent (i.e. the percentage of points for which a correct rule is triggered).

Assuming that some rules are too specific to the training set (over fitting),

we attempt to generalize them, by applying a greedy Rule Generalization

Algorithm (RGA). The RGA is applied to each rule in the set.

Rule Generalization Algorithm (RGA). Let a rule R be represented as

R: (X1 = x1,X2 = x2, . . . ,Xn = xn) ⇒ (Y = y) (4.12) Relax R by replacing one predicate Xi = xi with a wild card value,

representing any possible state and designated by the (Xi = ) notation. By

definition, the newly formed rule has the same or better support, as its

antecedent is less restrictive. If the newly formed rule’s confidence meets

the minimum confidence criterion, then keep it in a pool of candidates. This

procedure is applied for all the predicates in the rule, resulting in at most n

generalized rules (where n is the number of predicates in the original rule)

which have support better or equal with the original rule. If the candidate

pool is not empty, replace the original with the candidate which maximizes

the confidence. The algorithm is applied recursively to the best

74

generalization and it stops when the candidate pool is empty (no better

generalization can be found).

The RGA’s goal is to relax the rules by trying to improve, at each step, the

rule support, without sacrificing accuracy beyond the minimum acceptable

confidence level.

Figure 4-3 A visual representation of the RGA

Figure 4-3 provides a visual representation of the way the RGA works.

Consider a rule R: (X=High, Y=High)⇒ (Target = t). If, after relaxing the

Y=High condition the new rule R’: (X=High, Y=*)⇒ (Target = t) has sufficient

accuracy (the support is already guaranteed), then R’ becomes a candidate

for replacing R.

In the worst case, the number of predicate replacements for each rule is in

O(n2). Any relaxation of a rule increases (or does not change) the support of

that rule. Therefore, relaxing a rule improves both its confidence and

support.

75

Example of iteratively applying RGA: This example is extracted from the

original experimental results presented with [1] . Let R be a complete rule in

the original O set. As mentioned, previously, all rules contain one predicate

for each of the four inputs.

The values for each of the descriptors are binned in 5 buckets (B1-B5), see

Chapter 6 below, presenting experimental results, for details.

R: (X1 = B1, X2 = B2, X3 = B2, X4 = B3) ⇒ (Y = Excellent),

with sup(R)=6.25%, conf(R)=90.9% (4.13)

Upon relaxing all the predicates associated with R and evaluating the

confidence and support for the relaxed derivatives, the best derivative is

selected:

R’: (X1 = *, X2 = B2, X3 = B2, X4 = B3) ⇒ (Y = Excellent),

with sup(R’)=8.52%, conf(R’)=93.3% (4.14)

After applying the algorithm one more time to the generalized rule R’, we

obtain a better generalization:

R’’: (X1 = *, X2 = B2, X3 = *, X4 = B3) ⇒ (Y = Excellent),

with sup(R’’)=13.06%, conf(R’’)=95.65% (4.15)

76

4.4.3 Applying the RGA to an apriori-derived set of rules

As described in Section 3.1.1, the most commonly used rule extraction

algorithm, apriori, produces a set of variable-length rules, having the

predicates in the antecedent sorted, usually lexicographically. Certain

apriori derivatives, such as Multiple Support apriori (discussed in 3.1.4) may

use a different sort order, but this order is preserved for all the rules that

are extracted by the algorithm.

This common sort order of the predicates, shared among all the rules in the

rule set, allows for a fast way of applying the Rules Generalization Algorithm

(introduced in the previous section) to apriori-produced rule sets. The

following property justifies the application of RGA to sets of rules

characterized by a shared sort order of the antecedent predicates.

Property 4.1: Consider two rules in a rule set having the same consequence,

C, each rule defined by a set of predicates Pi in its antecedent: R1: ({P1}->C},

R2: ({P2}->C}. If P1 P2 then R1 is a generalization of R2, similar to candidate

wildcard rules introduced in RCA.

Rationale: if P1is a proper subset of P2, then P2 contains at least one

predicate Ci:Xi=xi, CiP1. Each such predicate Ci in the definition of P2 can be

relaxed, resulting in P2’=,P1, Xi=*}. By repeating this for each Ci P2, CiP1 a

relaxation of P2 is obtained which is identical with P1.

77

Based on property 4.1, we propose an algorithm for simplifying apriori-like

rule sets. The algorithm traverses the set of lexicographically sorted rules

maintaining a stack of rule antecedents encountered during the scan. If a

rule matches one of the stacked prefixes, we check if the rule can be

generalized by one of the previous rules.

The algorithm is presented below:

Parameters:

T – A set of rules sharing the predicate order in ant.

Output

A set of generalized rules T’

Initialization:

Sort rules by consequent (resulting in subgroups GiT, where

all rules in one such Gi share the consequent.

For each group Gi Reset prefix stack S

for each rule RGi(as all rules in Gi share the consequent, R

can be considered to be the antecedent)

while S (traverse the stack)

if S.topR then

if S.confidence is satisfactory then

S.top is a generalization of R (and R can be dismissed)

end if

else Pop(S) // the stacked prefix does not match, remove it

end while // stack traversal is complete

if R has not been dismissed then

copy R to T’

push R onto stack S

End if

End for Each

For a simple example, consider a trivial rule set consisting of three rules, as

below:

78

R1: X1=a⇒ Y = Excellent

R2: X1=a AND X2=b ⇒ Y = Excellent R3: X1=c ⇒ Y = Excellent

(4.16)

Rule R1 is the first rule being read. The stack is empty, so it will not be

dismissed by a previous generalization. After processing R1 it is added both

to the stack and to the output set T’.

When rule R2 is being read, the top of the stack contains the antecedent of

R1, X1=x1, which is included in the antecedent of R2. R1 is, therefore, a

candidate generalization of R2 and R2 may be dismissed.

When rule R3 is being read, the content of the stack does not share the

prefix of the rule, so the stack will be emptied.

As shown in the Chapter 6 below, presenting Experimental Results, this

algorithm produces very significant rule set simplifications. The experiments

suggest that the number of rules in a system is reduced, by this algorithm to

10%-20%. The complexity of the calculations is relatively small, at most

O(n2) in-memory operations (using the stack) where n is the cardinality of

the rule set.

Also, the RGA presented in the previous section needs to estimate the

support and confidence for each generalized rule. This is typically done by

scanning the data set (or by using additional memory in an index structure,

79

such as an FP-tree). The apriori flavor of the RGA does not require any

additional scans of the data.

Some weaknesses of the algorithm are easy to point out. For example, it is

easy to show that the greedy nature of algorithm prevents detection of all

possible generalizations of the rule set. Consider a system containing rules

like:

R1: X1=a AND X2=b ⇒ Y = Excellent

R2: X2=b ⇒ Y = Excellent (4.17)

Although R2 is a generalization of R1, it will not be detected by the algorithm

because it appears in lexicographic order after R1.

4.5 Conclusion

We presented some of the recent research work regarding the rule systems

generalization and simplification. Much of this work is related to the space

of fuzzy rules.

The rule generalization algorithm introduced in this chapter produced very

promising experimental results, as shown in Chapter 6 below. Some known

weaknesses of the proposed algorithm suggest directions for further

research.

80

4.5.1 Future directions for the basic rule generalization algorithm

The RGA algorithm discussed above currently works by eliminating entire

slices of the premise space from the rule antecedents. While this approach

produced good results in our experiments, it is probably too coarse. A

better solution, although more computationally intensive, may be to check

the neighborhood of the initial antecedent and merge those areas which,

when added to the antecedent, keep the rule’s accuracy above the

minimum confidence criteria.

Figure 4-4 A finer grain approach to rule generalization

81

Figure 4-4 describes such a possible implementation. Consider a rule R:

(X=High, Y=High)⇒ (Target = t). The current algorithm relaxes, say, the

Y=High condition and produces a new rule R’: (X=High, Y=*)⇒ (Target = t),

which may not have sufficient accuracy to replace R. Rather than removing

Y=High, the algorithm could investigate the vicinities of the original

antecedent cell (such as Y=M or Y=VH). The generalization would then

result in rules such as:

R’’:(X=High, Y{High, Medium, Very High})⇒

(Target = t).

(4.18)

82

This consists, in essence, in merging the antecedent part of two rules as

long as they are adjacent, they share the consequent and the resulting rule

does not fall below the minimum confidence threshold.

In the data problem we treated in [1], as well as in many real applications of

rule systems, the predicates in the antecedent, as well as in the consequent,

represent binning ranges of continuous variables. In this case, for a rule

R:(Xi=xi Y=yi) we can define a function p:(Xi=xi )[0,1] which describes

the probability density for the Y=yi predicate over the Xi=xi area of the

space. The rule accuracy can then be thought of as a ratio between the

integral of this probability density function p and a constant function u=1

defined on the same area, (Xi=xi). The accuracy of the R rule can be thought

of as :

∫

∫

(4.19)

83

Figure 4-5Accuracy of a fuzzy rule as a measure of similarity with the universal set

Let’s consider p, the probability density function, is considered the

membership function for a fuzzy set. In this interpretation and using the

similarity measure introduced by Setnes in [68] and discussed in 4.3 above

the confidence of the rule becomes the similarity measure between the

fuzzy set defined by (p, X=xi ) and the Universal set.

It may be interesting to investigate whether this idea might be converted in

the space of fuzzy rules, as a way of merging adjacent fuzzy sets that serve

as premises for Takagi-Sugeno rules with similar consequents, as suggested

in Figure 4-5.

From an implementation perspective, it is interesting to notice that the

algorithm allows block evaluation of multiple measurements. In a typical

relational database, all the neighbors of the premise space could be

evaluated in a single pass over data using GROUP BY relational algebra

84

constructs. This will likely produce significant performance gains. Recent

developments in the space of in-memory database systems (such as [82],

[83] ) may be useful in addressing the cost of computing the accuracy and

support while relaxing predicates.

4.5.2 Further work for the apriori specialization of the RGA

The reduction in the number of rules, as presented by the experimental

results, is significant. This reduction makes the rules set more accessible

and easier to interpret. Additional work is required, though, to estimate the

predictive power of the reduced rule set and to measure the accuracy

tradeoff that is being introduced by this rule set simplification technique.

As mentioned in Section 4.4.3, the greedy nature of algorithm prevents

detection of all possible generalizations of the rule set. A different direction

for further work is investigating whether a more complex data structure,

possibly combined with a new sort order which takes into account the

antecedent’s length before the lexicographic order, may address this issue.

More work is also needed to study the possibility of applying the rule

generalization algorithm to the area of multiple-level association rules

described in [84] (and also in Section 2.3.2 above).

85

5 Measuring the Usage Prediction Accuracy of

Recommendation Systems

Recommendation systems are some of the most popular applications for

data mining technologies. They are generally employed to use opinions of a

community of users in order to identify content of interest for other users.

Commercial implementations, such as Amazon’s, described in [85], are

helping users choose from an overwhelming set of products. The

importance of recommendation systems for industry is emphasized by the

Netflix prize [86], which attracted 51051 contestants, on 41305 teams from

186 countries (as of June 1011) in trying to build a movie recommendation

system to exceed the performance of Netflix’s in-house developed system,

Cinematch.

In this chapter, we focus on metrics used for usage prediction accuracy on

offline datasets. The remaining content is structured as follows:

An introduction to the usage of Association Rules for

recommendation systems

An overview of the most commonly used instruments and metrics

for evaluating usage prediction accuracy

A new instrument (diagram) proposed for evaluating usage

predictions accuracy and comparing different recommendation

systems.

Implementation observations for the aforementioned instrument

86

5.1 Association Rules as Recommender Systems

Developers of one of the first recommender systems, [87], coined the term

Collaborative Filtering (CF) to describe a system “which entails people

collaborating to help each other perform filtering by recording their

reactions to documents they read. The reactions are called annotations;

they can be accessed by other people’s filters”. The term ended up being

used interchangeably with the term “recommender system”.

This area generated lots of scientific interest and some recent surveys, such

as [88], present in detail the algorithms and techniques being employed in

recommender systems. Item-based collaborative filtering recommendation

algorithms were introduced in [89], where the authors compare such a

system vs. user-based recommender systems. In [90], the authors show that

the Apriori algorithm offers large improvement in stability and robustness

and can achieve comparable recommendation accuracy to other commonly

employed methods, such as k-Nearest Neighbor systems.

5.2 Evaluating Recommendation Systems

Recommendation systems may be employed to annotate entities in their

context (such as filtering through structured discussion postings to discover

which may be of interest to a reader, [87] ) or to find good items, such as

the Netflix prize [86] or the Amazon recommendation engine [85].

From an implementation perspective, some of these systems may predict

item ratings (such as Netflix, [86]), while others are predicting the

probability of usage (e.g. of purchase), such as Amazon’s [85]. More

87

complex systems may serve as intelligent advisors, comprehensive tools

which use behavioral science techniques to guide a customer through a

purchase decision process and learn while doing this, as described in [91].

These differences in usage make comparing and evaluating accuracy

systems a difficult task, as such systems are often tuned for specific

problems or datasets. A very thorough analysis of the problem of

evaluating and comparing recommendation systems is presented by J.

Herlocker et al. in [92] and, more recently, by A. Gunawardana in [93]. Both

surveys present the tasks that are commonly accomplished by

recommendation systems, the types of analysis and datasets that can be

used and the ways in which prediction quality can be measured,

Most of the research on evaluating recommendation systems focuses on

the problem of accuracy, under the assumption that a system that provides

more accurate predictions will be preferred by a user or will yield better

results to the commercial system that deploys it. Accuracy measurements

are very different when a system predicts user opinions (such as ratings) or

probabilities of usage (e.g. purchase).

Accuracy evaluations can be completed using offline analysis, controlled live

user experiments [94], or a combination of the two. In offline evaluation,

the algorithm is used to predict certain withheld values from a dataset, and

the results are analyzed using one or more of the metrics that we’ll discuss

in the following section. Offline evaluations are inexpensive and quick to

conduct, even on multiple datasets or recommendation systems at the

88

same time. Datasets including timestamps may be used to replay usage

(ratings and recommendations) scenarios: every time a new rating or usage

decision is made by a user, it is compared with the prediction based on the

prior data about that user.

5.3 Instruments for offline measuring the accuracy of usage

predictions

During offline evaluation, a dataset is typically available consisting in items

used by each user. A typical test consists in selecting a test user, then hiding

some of the selections and asking the recommendation system to predict a

set of items that the user will use, based on the remaining selections. The

recommended and hidden items may produce 4 different outcomes, as

shown in Table 5-1

Recommended Not Recommended

Used True Positives (TP) False Negatives (FN)

Not used False Positives (FP) True Negatives (TN)

Table 5-1 Classification of the possible result of a recommendation of an item to a user

The test may be more sophisticated when the items selected by a user are

qualified by time stamps, as is the case for retailers tracking recurrent visits

from customers (e.g. Amazon.com). In that case, a user’s items can be

revealed to the recommendation system in the actual chronological order.

89

5.3.1 Accuracy measurements for a single user

Upon counting the number of items in each cell of the Table 5-1 table, the

following quantities can be computed:

(7.1)

(7.2)

Precision and Recall were introduced in [95] as key metrics. These metrics

started being used for evaluation of recommendation systems in 1988 in

[96] and later in [97]. Precision represents the probability that a selected

item is relevant, while Recall represents the probability that a relevant item

will be selected. Relevance is, in the case of recommender systems, a

subjective concept, as the test user is the only person who to decide

whether a recommendation meets their requirements and the transaction

record is the only information about that user’s decision.

Precision and Recall are inversely related, as shown in [95] so while allowing

longer recommendation lists typically improves recall, it is likely to reduce

the precision. Several approaches have been taken to combine prevision

and recall into a single metric. One approach is the F1 metric, introduced in

[98], then used as a classifier metric in [99] and used for recommendation

systems in [97], defined as below:

90

(7.3)

In certain applications, the number of recommendations that can be

presented to a user is predefined. For such applications, the measures of

interest are Precision and Recall at N, where N is the number of presented

recommendations. For other applications, the number is not predefined, or

an optimal value needs to be determined. For the latter, curves can be

computed for metrics for various numbers of recommendations. Such

curves may compare precision to recall or true positives to false positives

rates.

The true positive/false positive curves, also known as ROC curves, are more

commonly used. ROC curves were introduced in 1969 in [100] under the

name of “Relative Operating Characteristics” but are more commonly

known under the name “Receiver Operating Characteristics”, which evolved

from their use in signal detection theory (see [101]). An example of an ROC

curve, plotting True Positives against False Positives, is shown in Figure 5-1.

The curve is obtained, for a test user, by sorting the ranked

recommendations in descending order of confidence. Then, for each

predicted item, starting at the origin of the diagram, one of the following

actions is executed:

a) If it is indeed relevant (e.g. used by the user, part of the hidden user

items) then draw the curve one step vertically

91

b) If the item is not relevant (not part of the hidden items) draw the curve

one step horizontally to the right.

Figure 5-1 Example of ROC Curve

A perfect predictive system will generate a ROC curve that goes straight up

until 100% of the items have been encountered, then straight right for the

remaining items. For multiple recommender systems, multiple ROC curves

can be plotted, one for each algorithm. If one curve completely dominates

the others, it is easy to pick the best system. When the curves intersect, the

decision depends on the application requirements. For example, an

application that can only expose a small number of recommendations may

choose the curve that is dominant in the left side of the ROC chart. Hanley

0

0.2

0.4

0.6

0.8

1

1.2

0 0.2 0.4 0.6 0.8 1 1.2

Pe

rce

nt

of

Re

leva

nt

Ite

ms

Percent of non-relevant Items

92

and McNeil, in [101], propose Area under Curve as a measure for comparing

implementations independently of application.

5.3.2 Accuracy Measurements for Multiple Users

We presented, in the previous section, some of the metrics used to

measure the accuracy of usage predictions for individual test users in offline

experiments. A number of strategies have been developed to aggregate the

results across test populations.

For applications that expose fixed length N recommendation lists, the

average precision and recall can be computed across the test population (at

length N), as shown in [97].

This aggregation approach is used in [102] to introduce an aggregated ROC

curve, computed over multiple users, using the same fixed number of

recommendations, called Customer ROC (CROC).

A special class of applications consists of those where the recommendation

process is more interactive, and users are allowed to obtain more and more

recommendations. Such applications can be modeled, in offline

experiments, when a timestamp is associated with each item ever used by

any test user. A ROC curve can be computed, in such as test, for each user.

The number of recommendations requested for each user depends on the

number of items used, in the test dataset, by each user. Certain

competitions, such as TREC’s (Text Retrieval Conference) - [103] – compute

ROC or precision/recall curves in this manner.

93

5.4 The Itemized Accuracy Curve

The accuracy measurements for recommendation systems described in the

previous section are commonly used in academic competitions or to

evaluate new systems. However, they are not commonly used in data

mining products. While lift charts, classification ROC diagrams and scatter

plots are common for classification and regression algorithms, most

products do not offer a “built-in” tool or comparing recommendation

systems, such as association rules models.

We propose a new instrument, introduced in [16], for evaluating the quality

of usage prediction on offline datasets. This instrument consists of a family

of curves that can be used to compare the recall for each individual item in

an item catalog for a family of recommendation systems.

The itemized accuracy curve was developed from a product need to present

users with an easy to understand diagram which allows comparing

recommendation systems as easily as the cumulative gain charts allow

comparing classification models.

A top-N recommender is a recommendation system configured to return the

top N most likely items for each input. In an industry setting such a

recommender takes as input information about one user and outputs N

items that the system predicts will be preferred by the user. A simple top-N

recommender is the Most-Frequent N-Item Recommender. It simply returns

the top N items that appear most frequently in a transaction database. A

94

more sophisticated top-N recommender may be an association rules

engine, which looks at all the user properties specified as input then

extracts those rules that match, in the antecedent, the input and sorts their

consequents by a certain rule measure, such as probability, importance, lift

etc.

Let be a set of items and D a set of data points (database

transactions). Each transaction D is defined as a tuple = (C, I) where:

- C is an optional set of characteristic properties of the transaction.

These characteristic properties may be transaction attribute from a

separate data dimension, such as customer demographic attributes,

store geographic attributes.

- I I is the set of items, in the transaction, to be used in testing the

top-N recommender.

The process of testing a top-N recommender using the D data set consists in

evaluating how well the recommendation system predicts the items that

appear in each transaction D. Testing for an item iI consists in

presenting the recommender with a ’i transaction as input (’i derives

from but does not include the item i) and then verifying the relationship

between the left out item i and the recommendation (presence, possibly

rank etc.). Ways to construct ’i from are described in Section 5.5.2 below.

Per definition, the top-N recommender will produce n recommendations

based on the specified input. Upon analysis of the n recommendations:

- A True Positive prediction is defined as the presence of item i in the

recommendation set. In this case, the recommender correctly

identified the item which was part of the test set

- A False Negative prediction is defined as the absence of item i in the

top N recommendations.

95

Let be a positive metric which describes the usage prediction accuracy

and that can be computed for each individual item that is part of the item

catalog I. Examples of such metrics include the number of true positives,

recall, precision, recall value (defined as recall multiplied by item value) etc.

Concepts such as True Positive or False Negative, which may appear in the

definition of such a metric, need to be adjusted for the particular case of a

top-N recommender (as shown above).

The rationale of the itemized accuracy curve for a top-N recommender is as

follows:

- Compute (over the test set) the accuracy measure for each

individual item in the item catalog I.

- Aggregate the system’s accuracy measure over the item catalog.

Note that the aggregation may be any additive measure, not

necessarily a sum

∑

- Compare the aggregated M measure with a minimum and maximum

theoretical baseline measures, Mmin and Mmax

- Compute two new quantities:

o

o

96

The Lift describes the performance of the current top-N recommender as

compared against the minimum acceptable baseline measure (and the

improvement on top of Mmin). The Area Under Curve describes the

performance of the current top-N recommender as compared against the

maximum theoretical baseline measure (and the improvement on top of

Mmax). If the minimum baseline measure is associated with a baseline

recommender, then the lift of that recommender is by definition 1,

regardless of the value of n.

Similarly, the Area Under Curve metric is less or equal to 1 (as it represents

the ratio to the theoretical maximum value of the accuracy measure) and, if

the maximum baseline measure is associated with a recommender, then

that recommender’s Area Under Curve is by definition 1, regardless of the

value of n.

Note that the Area Under Curve aggregation is not the homonymous metric

associated with ROC curves, although it shares some of its’ properties, such

as being upper bound by 1 or associating a value of 1 with an ideal model.

For practical purposes, the minimum theoretical baseline measure needs

not be worse than the measure yielded by the Most-Frequent n-Item

Recommender (MFnR). A few reasons for using the MFnR as a minimum

include:

- It is practically a zero-cost recommendation system, in terms of

implementation costs

97

- It is commonly used in industry if a more sophisticated

recommendation system is not available (e.g. “Would you like fries

with that?” in any fast food restaurant)

For the aforementioned reasons, we will use MFnR as the minimum

recommender in the rest of this chapter. An interesting property of the

MFnR recommender is that its accuracy grows with n.

Lemma 5.1 The number of True Positives of the MFnR recommender grows,

and the number of False Negatives decreases, with the value of the n

parameter, until n reaches the cardinality of the itemset.

Rationale:

Let , the cardinality of the items catalog. For any given the

following properties derive from the definition of True Positive and False

Negatives:

(7.4) , (7.5)

In fact, when n reaches X, False Negatives becomes 0, True Positives

becomes n and the MFnR becomes an optimal recommender with regard to

the True Positive and False Negative measures.

98

The itemized accuracy curve is obtained by plotting the accuracy measure

for each individual item in the item catalog I (as ordinate). The sort order of

the items on the abscissa improves the clarity of the diagram. For example,

sorting the items in I, the item catalog, in descending order of the max

metric (as computed for the maximum theoretical baseline measure) may

give a good intuitive perspective on the performance of the recommender

being analyzed.

5.4.1 A visual interpretation of the itemized accuracy curve

Figure 5-2 Itemized Accuracy Curve for a top-N recommender

99

Figure 5-2 presents such an itemized accuracy curve. The upper line

represents the max metric (as computed for the maximum theoretical

baseline measure) for each item, while the lower line is the metric,

computed for the top-N recommender being evaluated.

The aggregations of the metrics are equivalent to integrating the

measure over the item catalog I. Therefore, the aggregated measures of Lift

and Area Under Curve can be defined below

∫

∫

, ∫

∫

,

Both aggregations become, therefore, ratios between areas under curve for

the graphs defined by the metrics for different recommenders.

5.4.2 Impact of the N parameter on the Lift and Area Under Curve

measures

An interesting aspect of the Lift and Area Under Curve metrics is that they

allow comparing different values of N, the number of recommendations

being produced by the recommendations system. In an e-commerce

implementation of the recommendation system, the number of

recommendations presented on the screen must be a trade-off between

the potential value of the recommendations and that of other page

elements (such as advertisements) which may compete for the same page

real estate as the recommendations. It is, therefore, useful to analyze the

value (in terms of Lift and Area Under Curve) of various values for N, the

number of recommendations being presented.

100

Figure 5-3 Evolution of Lift and Area Under Curve for different values of N

Figure 5-3 presents the evolution of the Lift and Area Under Curve

measures for a top-N recommender as the value of N changes from 1 to

100.

The horizontal line at the ordinate 1 is the minimum baseline Lift,

associated with the MFnR minimum baseline. The upper line (on top of the

baseline lift) presents the lift yielded by the top-N recommender. As shown

101

previously, in Lemma 5.1, the MFnR recommender’s accuracy grows, so the

lift of the top-N recommender decreases with the growth of N.

The lines in the lower part of the diagram represent the evolution of the

Area Under Curve measure with the growth of the N parameter. The Area

Under Curve of an ideal recommender is by definition 1, while the Area

Under Curve values associated with the MFnR recommender as well as the

top-N recommender being evaluated are growing to reach 1, in the worst

case when N reaches the cardinality of the itemset.

5.5 An Implementation for the Itemized Accuracy Curve

5.5.1 Accuracy measures

We found the number of True Positives (and certain derivatives) to be a

convenient measure for the accuracy measure for each individual item in

the item catalog I. It is a simple additive measure, which can be summed up

across the transaction space as well as across the item space.

As exemplified previously, we consider an ideal predictor as the source for

the Mmax aggregation, therefore a predictor that produces zero False

Negatives. The difference between M and Mmax is, therefore, the number of

False Negatives produced by the M recommendation system being

assessed.

102

A consequence of this choice is that the Area under Curve aggregated

measure is exactly the recall associated with the recommendation system.

A related additive measure that can be used for may include the catalog

value associated with an item, (i) = Value(i)*TP(i). This allows for a more

flexible estimation of the value propose by the recommender.

5.5.2 Real data test strategies

The test dataset D consists of transactions D defined as tuple = (C, I)

where C, are the transaction specific properties while I is the set of items

known to be included in the transaction and which should be tested against

the real recommendations. Testing for an item iI consists in presenting

the recommender with a ’i transaction as input. ’i derives from but does

not include the item i. Two different ways to construct ’i from are

described below.

The simplest strategy is to treat each transaction as a bag of items. In that

case, the test for that respective transaction is performed by successively

leaving each item iI out, and requesting from the target system a

recommendation.

A more elaborate strategy may take into account a timestamp associated

with the moment when an item has been added to a transaction. In this

103

case, a possible strategy is to create the test input ’i by including, besides

the characteristic transaction properties, only the items that appeared, in

the transaction, before item i (chronologically). This approach may be more

realistic for certain e-commerce scenarios.

5.5.3 The algorithm for constructing Itemized Accuracy Curve

The algorithm, presented below uses a test population to compute counts

of true positives and false negative recommendations. The number of

recommendations to be presented to a user is an algorithm parameter.

The algorithm collects the number of occurrences and True Positive

recommendations for each item in the catalog in two item-indexed

structures, GlobalCounts and TruePositives.

When the iteration is complete, the metrics of interest can be computed as:

- M – the sum of the True Positives counts

- Mmax – the sum of the GlobalCounts values

- Mmin – the sum of those GlobalCounts values with indices in the top

N most popular items

Note that a frequency table for the most popular items can be computed in

the same iteration. This algorithm does not compute the frequency table as

104

real world database systems may have more efficient ways of returning the

top N most popular values in a table column.

Parameters :

n – number of recommendations to be presented

D – test set of transactions

Initialization:

Initialize GlobalCounts, TruePositives – item-indexed

vectors of counts, initialized on 0

for each transaction Tx=(Cx, Ix) in the test dataset D

for each item i in the Ix

increment GlobalCounts[i]

Let Txi = (Cx, (Ix – i))

Let RIn = TopRecommendations(n, Txi)

if i RIn then

increment TruePositives[i]

IterationEnd: compute the aggregated metrics

The algorithm traverses the space of test transactions and executes one

recommendation request for each item to be tested. The complexity of the

algorithm is, therefore where |D| is the number of

transactions in the test dataset and the Avg(|I|) is the average number of

test items in a transaction. Naturally, the execution time depends on the

recommendation system’s implementation.

5.6 Conclusions and further work

The Itemized Accuracy Curve provides an intuitive way to compare

recommendation systems. It can be used with count or profit oriented

105

measures and it can provide very specific information about the behavior of

a recommendation system for each specific item in a product catalog.

Combined with a taxonomy of the items, such as an OLAP product

dimension and a hierarchy on top of that dimension, the Itemized Accuracy

Curve can be used to select specific recommendation systems for areas of

the product catalog.

Figure 5-4 Aggregated Itemized Accuracy Curve based on the Movie Recommendations dataset (for N=5 recommendations)

0

500

1000

1500

2000

2500

3000

3500

MA-apriori_p40

(Ideal)

MA_Trees_2048

(MFNR)

106

Figure 5-4 presents aggregated accuracy results across the Category

attribute of the Movies Recommendations dataset.

The Itemized Accuracy Curve, however, does not take into account the

ranking of an item in the recommendation list. Investigating accuracy

measures that can be used with the Itemized Accuracy Curve in conjunction

with the ranking of items may provide more value.

Another direction of further research is integrating in the algorithm for

computing the itemized accuracy diagram the evaluation of other

performance characteristics of recommendation systems, such as:

- the degree to which a recommendation system covers the entire set

of items (see [104]),

- the computing time,

- the novelty of recommendations

- the robustness of recommendations (as defined in [105])

107

6 Experimental Results

We present here some of the experimental results discussed in the previous

chapters. The first section describes the datasets being used for each

experiment.

6.1 Datasets used in this material

6.1.1 IC50 prediction dataset

The IC50 prediction dataset contains most of the data and experimental

results mentioned in Section 4.4 on rule generalization. The data as well as

some results are available at:

http://www.bogdancrivat.net/FAMR/FAMR_Rules.zip .

The package consists of multiple files:

- trainMols.txt – 176 molecule descriptors and their associated IC50 ,

used as training set.

- trainMols_Discretized.csv – the above mentioned molecules

discretized

- testMols.txt -20 test molecules and their associated IC50

- testMols_Discretized.csv – the test set, discretized

- LatestRules.txt – the result of the FAM rules extraction process (used

as input in the Rule Generalization algorithm)

Following are the discretization ranges used in our experiments.

http://www.bogdancrivat.net/FAMR/FAMR_Rules.zip

108

The descriptor range: Bi should be read as “bin i”

B1 (Low): [0, 0.125)

B2 (Low-Medium): [0.125, 0.375)

B3 (Medium): [0.375, 0.625)

B4 (Medium-High): [0.625, 0.875)

B5 (High): [0.875, 1.0].

The IC50 value range

Excellent: [0, 20)

Good: [20,50)

OK: [50, 100)

Mediocre: [100, 500)

Terrible: [500,MaxValue].

It should be noted that low IC50 is optimal.

6.1.2 Movies Recommendation

The Movie Recommendation dataset was collected for [2] and available

with the book or as Chapter 11 downloads at

http://www.wiley.com/WileyCDA/WileyTitle/productCd-

0470277742,descCd-DOWNLOAD.html

The Movies Recommendation dataset consists of 3200 responses to a

survey collecting movie (2707 movies), director (508 directors) and actor

http://www.wiley.com/WileyCDA/WileyTitle/productCd-0470277742,descCd-DOWNLOAD.html

http://www.wiley.com/WileyCDA/WileyTitle/productCd-0470277742,descCd-DOWNLOAD.html

109

(1192 actors) preferences. On average, each response contains 15 movies,

with a minimum of 1, a maximum of 106 and a standard deviation of ~20.

The dataset also contains demographic data about customers participating

in the survey.

6.1.3 Movie Lens

The Movie Lens dataset is an older recommendation data set, used first in

[106]. This dataset is publicly available at

http://www.grouplens.org/node/73 . The subset containing 1 million ratings

for 3900 movies by 6040 users has been used for experiments. On average,

each user has rated 165 movies, with a minimum of 20, a maximum of 2314

and a standard deviation of ~192.

6.1.4 Iris

A commonly used dataset, Iris is available as part of the Weka suite or

accessible on the web at http://archive.ics.uci.edu/ml/datasets/Iris.

Four continuous attributes (petal length, petal width, sepal length, sepal

width) are used to predict one of 3 classes of flowers (setosa, virginica or

versicolor). There are 50 data points in each class.

http://www.grouplens.org/node/73

http://archive.ics.uci.edu/ml/datasets/Iris

110

6.2 Experimental results for the Rule Generalization

algorithm

6.2.1 Rule set and results used in Section 4.4 (on rules

generalization)

The IC50 prediction dataset described above has been used for this

experiment. All rules are described in the form:

(Molecular Weight, hDonors, hAcceptors, ClogP) ⇒ IC50.

From the trained FAMR we obtain the following set of rules:

O1 : (Low-Medium, Low, Low, Low-Medium) ⇒ Terrible

O2 : (Low-Medium, Low, Low, Medium) ⇒ Mediocre

O3 : (Low-Medium, Low, Low-Medium, Medium-High)⇒ Excellent

O4 : (Low-Medium, Low-Medium, Low-Medium, Low-Medium) ⇒ OK

O5 : (Low-Medium, Low-Medium, Low-Medium,Medium) ⇒ Excellent

O6 : (Low-Medium, Medium, Low-Medium, Medium) ⇒Excellent

O7 : (Medium, Low-Medium, Medium, Medium) ⇒Excellent

O8 : (Medium, Low-Medium, Medium, Medium-High) ⇒Excellent

O9 : (Medium, Low-Medium, Medium, Highm) ⇒ Excellent

O10 : (Medium-High, Low-Medium, Medium, High) ⇒Excellent

O11 : (Medium-High, Medium, Medium-High, Medium-High) ⇒ Mediocre

O12 : (Medium-High, Medium-High, Medium-High,Medium) ⇒ Terrible

O13 : (Medium-High, High, Medium-High, Medium) ⇒Mediocre

Rules {O1, . . . ,O13} have support between 0.0% and 16.47%, and confidence

between 0.00% and 100.00%. In order to remove irrelevant rules (pruning),

111

we introduce a minimum confidence criterion of 25% and a minimum

support criterion of 2.5%. Rule O3 does not meet these criteria and was

removed from the set.

After applying the algorithm described in 4.4.2 above to the rule set {O1, . . .

,O10- − ,O3} , the following generalized rules are obtained:

G1 : (*, Low-Medium, *, *) ⇒ Excellent

G2 : (*, Medium, Low-Medium, *) ⇒ Excellent

G3 : (Medium, *, Medium, *) ⇒ Excellent

G4 : (*, Low, Low, Medium) ⇒ Mediocre

G5 : (*, Medium-High, Medium-High, *) ⇒ Terrible

As certain descriptor values do not appear in any rule, simple one-predicate

rules were produced to cover those slices of the descriptor space. Only one

such rule is produced for this dataset (after pruning those which do not

meet the minimum confidence and support criteria)

I1 : (Low,*, *, *) ⇒ Terrible

The combined rule set {G1, . . . ,G5} ∪ {I1 } is our end result. Finally, we

compared our FAMR rule extractor to the FNN [6]–[8] and to the following

standard decision trees implementations:

- CART (WEKA implementation - simpleCart) trees [107]

- Microsoft SQL Server 2008 Decision Trees [2]

112

For the decision trees, rules were extracted from each non-root node.

Naturally, the decision-tree derived rules have 100% coverage. The

complete comparison results are presented in Table 6-1.

Method/rule set Training Set Coverage

Training Set Accuracy

Test Set Coverage

Test Set Accuracy

FAMR: {O1, . . . ,O13}

57.39% 36.93% 20% 20%

FAMR: {G1, . . . ,G5}

86.36% 65.34% 90% 75%

FAMR: {G1, . . . ,G5} {I1}

88.64% 67.61% 90% 75%

CART 100% 64.20% 100% 75%

Microsoft Decision Trees

100% 69.32% 100% 80%

Table 6-1 Rules set comparison

The FAMR {G1, . . . ,G5} {I1 } rule set has very good coverage and accuracy. For the test set, the {G1, . . . ,G5} {I1 } rules have almost the same accuracy as the rule set derived from classic decision trees system (the test set consists of 20 molecules, so a difference of 5% translates to one incorrect prediction). This is rather surprising, considering that the fact that decision trees are a dedicated tool for rule generation, whereas the FAMR was essentially designed as a primary prediction/classification model.

6.2.2 Results for the apriori post-processing algorithm

We present here some experimental results obtained after applying the

Rules Generalization Algorithm on various datasets:

Dataset Apriori params Initial Rules Rules after generalization

IC50 minconf=60%, 135 31

Movie Recommendations

minconf=60% minsup=3

18436 1788

113

Demographics, predicting Home Ownership

MovieRecommendations (associative)

Minconf=60% Minsup=10

25058 5677

Iris (discretized) Minconf=60% 208 38

For the movie recommendation dataset, the demographic table has been

used. The apriori algorithm was employed to extract rules predicting home

ownership status from other demographic attributes.

6.3 Experimental results for the Itemized Accuracy Curve

A Windows application has been developed to illustrate and test the

Itemized Accuracy Curve concepts. The application functions as a client for

the Microsoft SQL Server Analysis Service platform, which allows

instantiation of multiple data mining algorithms on the same datasets.

Multiple association rules models were investigated using the IAC client

application. The application uses DMX [108] statements for executing the

recommendation queries.

The UI of the respective application is presented in Figure 6-1. The

application uses the True Positive count as accuracy metric and sorts the

items in the product catalog in descending order of their popularity on the

abscissa. The dominant curve (red line) is associated with an ideal

recommendation system which produces zero False Negatives (and, hence,

the curve is identical to the popularity curve). The green curve, present in

the left part of the diagram, is associated with the Most Frequent n-Items

114

Recommender. The other lines are associated with different

recommendation systems. Clicking at any point on the chart surface

presents the item rendered at the specified location on the abscissa

together with the number of True Positives yielded by each of the

recommenders, as in Table 6-2.

Model Correct Recommendations

(Ideal Model) 233

(MFnR) 0

MA_apriori_p20 120

MA_Trees_2048 165 Table 6-2 True Positive counts for the selected item

Figure 6-1 Itemized Accuracy Chart for n=3 (Movie recommendations)

115

6.3.1 Movie Recommendation Results

We have built four recommendation models, using Microsoft SQL Server:

MA_apriori_p20 and MA_apriori_p40 use the Microsoft Association Rules

algorithms, an optimized implementation of the Apriori algorithm. It uses a

minimum rule probability threshold of 0.2 and 0.4, respectively. They both

use a minimum support of 10 (meaning approximately 0.3% for this

dataset).

MA_Trees_256 and MA_Trees_2048 use the Microsoft Decision Trees

algorithm to build a forest of trees to be used for recommendations. They

build, respectively, 256 (default) and 2048 trees.

Figure 6-2 presents the lift of the 4 models as a function of n, the number of

recommendations:

116

Figure 6-2 Evolution of Lift for various values of N for test models (Movie Recommendations dataset)

6.3.2 Movie Lens Results

We have built four recommendation models, using Microsoft SQL Server:

apriori, apriori_min_supp_10 and apriori_min_supp_100 use the Microsoft

Association Rules algorithm. It uses a minimum rule probability threshold of

0.2 and minimum support thresholds of 1000, 10 and 100.

DecisionTrees uses the Microsoft Decision Trees algorithm to build a forest

of trees to be used for recommendations. It contains 2048 trees.

0.9

1.1

1.3

1.5

1.7

1.9

2.1

2.3

2.5

0 10 20 30

(Baseline MFnR Lift)

MA_apriori_p20-L

MA_Trees_2048-L

MA_Trees_256-L

MA-apriori_p40-L

117

Figure 6-3 presents the lift of the 4 models as a function of n, the number of

recommendations

Figure 6-3 Evolution of Lift for various values of N for test models (Movie Lens dataset)

It is interesting to notice that the decision tree outperforms the apriori

models and that some of the apriori models actually perform worse than

the Most Frequent n-Item Recommender.

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

0 10 20 30

(Baseline MFnR Lift)

apriori-L

apriori_min_supp_10-L

apriori_min_supp_100-L

DecisionTrees-L

118

7 Conclusions and directions for further research

The thesis presents a synthesis of recent research in the area of associative

and predictive rules and post processing of these rules. The original

contributions are focused on practical improvements of rule systems.

7.1 Conclusions

In Chapter 4, we introduced a novel method for post-processing a set of

rules in order to improve its generalization capability. The method is

developed specifically for rules extracted from a fuzzy ARTMAP incremental

learning system used for classification, hence for rule generated indirectly

(as Fuzzy ARTMAP does not directly produce rules).

We also proposed an algorithm for generalizing rule sets produced by

common rule extraction algorithms, such as apriori. The experimental

results for this algorithm look very promising as they reduce the size of rule

sets by 5-10 times. More work is necessary to fully determine the properties

of this generalization algorithm, as shown in the next section.

In the second part of the thesis, in Chapter 5, we proposed a novel

instrument for evaluating the quality of recommendation systems, in the

context of recent research regarding the accuracy of recommendation

systems. The instrument has been introduced, as a patent, in [16]. The

Itemized Accuracy Curve has certain interesting properties. Among them:

119

- It provides an intuitive way of comparing different recommendation

systems

- It allows aggregations of the accuracy metrics across items

dimensions

7.2 Further Work

The Rules Generalization Algorithm introduced in Chapter 4 works by

eliminating entire slices of the premise space from the rule antecedents.

While this approach produced good results in our experiments, it is

probably too coarse. A better solution, although more computationally

intensive, may be to check the neighborhood of the initial antecedent and

merge those areas which, when added to the antecedent, keep the rule’s

accuracy above the minimum confidence criteria. Section 4.5.1 suggests

refinement of the algorithm which would result in rules such as:

R’’:(X=High, Y{High, Medium, Very High})⇒

(Target = t).

(4.20)

This consists, in essence, in merging the antecedent part of two rules as

long as they are adjacent, they share the consequent and the resulting rule

does not fall below the minimum confidence threshold.

Section 4.5.1 also describes a direction for research whether the refinement

may be applied to fuzzy rule sets, as a way of merging adjacent fuzzy sets

that serve as premises for Takagi-Sugeno rules with similar consequents.

120

From an implementation perspective, it is interesting to notice that the

algorithm allows block evaluation of multiple measurements. In a typical

relational database, all the neighbors of the premise space could be

evaluated in a single pass over data using GROUP BY relational algebra

constructs. This will likely produce significant performance gains. Recent

developments in the space of in-memory database systems (see [82], [83] )

may be useful in addressing the cost of computing the accuracy and support

while relaxing predicates.

In Section 4.4.3 we proposed an algorithm for generalizing the rule sets

produced by algorithms such as apriori, with significant reduction in the

number of rules, as presented by the experimental results. This reduction

makes the rules set more accessible and easier to interpret. Additional work

is required, though, to estimate the predictive power of the reduced rule

set and to measure the accuracy tradeoff that is being introduced by this

rule set simplification technique. The greedy nature of algorithm prevents

detection of all possible generalizations of the rule set. A different direction

for further work is investigating whether a more complex data structure,

possibly combined with a new sort order which takes into account the

antecedent’s length before the lexicographic order, may address this issue.

121

More work is also needed to study the possibility of applying the rule

generalization algorithm to the area of multiple-level association rules

described in [84] (and also in section 2.3.2 above).

Chapter 5 introduced the Itemized Accuracy Curve as an intuitive way to

compare recommendation systems. The Itemized Accuracy Curve, however,

does not take into account the ranking of an item in the recommendation

list. Investigating accuracy measures that can be used with the Itemized

Accuracy Curve in conjunction with the ranking of items may provide more

value.

Another direction of further research is integrating the evaluation of other

performance characteristics of recommendation systems, such as the

degree to which a recommendation system covers the entire set of items

(see [104]), the computing time, the novelty of recommendations or its’

robustness [105] in the algorithm for computing the itemized accuracy

diagram.

122

Appendix A: Key Algorithms

Apriori

The following pseudo-code is the main procedure for generating frequent

itemsets (from [2]):

F: result set of all frequent itemsets

F[k]: set of frequent itemsets of size k

C[k]: set of candidate itemsets of size k

SetOfItemsets generateFrequentItemsets(Int minimumSupport){

F[1] = {frequent items};

for (k =1, F[k] >0; k++) {

C[k+1] = generateCandidates(k, F[k]);

for each transaction t in databases {

For each candidate c in C[k+1] {

if t contains c then c.count++

}

} //Scan the dataset.

for each candidate c in C[k+1] {

//Select the qualified candidates

if c.count >=Minimum_Support F[k+1] = F[k+1] U {c}

}

}

//Union all frequent itemsets of different size

while k>=1 do {

F = F U F[k];

k--;

}

return F;

}

To generate candidate itemsets Ck+1 from frequent itemsets Fk, you use the

following SQL join statement:

Insert into Ck+1

Select x1.a1, x1.a2, ..., x1.ak, x2.ak

From Fk as x1, Fk as X2

Where

//match the itemset prefixes of size k-1

x1.a1 = x2.a1 And

123

x1.a2 = x2.a2 And

...

x1.ak-1 = x2.ak-1 And

//avoid duplicates

x1.ak <> x2.ak

This SQL statement generates candidate itemsets of size k having prefixes of

itemsets size k-1. However, it doesn’t guarantee that all the subsets of

candidate itemsets are frequent itemsets. So, you must prune the

candidates containing infrequent subsets by using the following procedure:

Boolean hasInfrequentSubset(Itemset c, SetofItemsets F) {

For each (k-1) subset s of c {

If s not in F then return true;

}

return false;

}

The following procedure generates all of the qualified association rules:

For each frequent itemset f,

generate all the subset x and its complimentary set y = f - x

If Support(f)/Support(x) > Minimum_Probability, then

x => y is a qualified association rule

probability = Support(f)/Support(x)

End If

124

FP-Growth

As discussed in Section 3.1.2 above, the FP-growth algorithm extracts the

frequent items into a frequent pattern tree (FP-tree), retaining the itemset

association information, then divides the database into a set of conditional

databases, each associated with one frequent item, and mine each such

database separately.

A FP-tree is populated in the following steps [38] . A procedure called

BuildFrequentItemsList is supposed to exist and scan the transaction space,

creating a sorted list of items, in descending order of support. The

procedure is also supposed to eliminate infrequent items. The procedure is

not part of the implementation as it can often be optimized in a Database

or platform (e.g. SQL Server Analysis services). Another procedure, Sort, is

supposed to sort items in a transaction in the order specified in the list

argument.

Procedure FP_Create(TransactionSpace)

Let Tree = new node

Tree.item-name = null

Let L = BuildFrequentItemsList(TransactionSpace)

Foreach Trans in TransactionSpace

Let SortedTrans = Sort (Trans, L)

FP_Insert(Tree,SortedTrans)

Next

End Procedure

Procedure FP_Insert(Tree,Trans)

Let p = First Item in Trans

Let q = Reminder of Trans (excluding p)

If Tree has a child node N such as N.item-name = p.item-name

Then

N.count++

125

Else

Create new node N, child of Tree

N.item-name = p.item-name

N.count = 1

End If

FP_Insert(N, q)

End Procedure

Mining of a FP-tree is performed by calling FP_growth(FP_tree, null),

implemented as below (as described in [38]):

Procedure FP_Growth(Tree, x) If Tree contains a single path P then

For each combination β of the nodes in the path P

Generate pattern β x with supp=minimum support of nodes in β

Else for each ai in the header of Tree

Generate pattern β = ai x with supp=supp(ai) Construct β’s conditional pattern base,

Construct β’s conditional tree, Treeβ

If Treeβ then

call FP_Growth(Treeβ , β)

End if

End If

End Procedure

126

Bibliography

[1] Razvan Andonie, Levente Fabry-Asztalos, Ioan Bogdan Crivat, Sarah Abdul-Wahid, and Badi Abdul-Wahid, "Fuzzy ARTMAP rule extraction in computational chemistry," in Proceedings of the International Joint Conference on Neural Networks (IJCNN), Atlanta, GA, 2009, pp. 157-163.

[2] Jamie MacLennan, Ioan Bogdan Crivat, and ZhaoHui Tang, Data Mining with Microsoft SQL Server 2008. Indianapolis, Indiana, United States of America: Wiley Publishing, Inc., 2009.

[3] Ioan Bogdan Crivat, Paul Sanders, Mosha Pasumansky, Marius Dumitru, Adrian Dumitrascu, Cristian Petculescu, Akshai Mirchandani, T.K Anand, Richard Tkachuk, Raman Iyer, Thomas Conlon, Alexander Berger, Sergei Gringauze, James MacLennan, and Rong Guan, "Systems and methods of utilizing and expanding standard protocol," USPTO Patent/Application Nbr. 7689703, 2010.

[4] Ioan B Crivat, Raman Iyer, and C James MacLennan, "Detecting and displaying exceptions in tabular data," USPTO Patent/Application Nbr. 7797264, 2010.

[5] Ioan B Crivat, Raman Iyer, and C. James MacLennan, "Dynamically detecting exceptions based on data changes," USPTO Patent/Application Nbr. 7797356, 2010.

[6] Ioan B Crivat, Raman Iyer, and James MacLennan, "Partitioning of a data mining training set," USPTO Patent/Application Nbr. 7756881, 2010.

[7] Ioan B Crivat, Cristian Petculescu, and Amir Netz, "Efficient Column Based Data Encoding for Large Scale Data Storage," USPTO Patent/Application Nbr. 20100030796 , 2010.

[8] Ioan B Crivat, Cristian Petculescu, and Amir Netz, "Explaining changes in measures thru data mining," USPTO Patent/Application Nbr. 7899776, 2011.

[9] Ioan B Crivat, Cristian Petculescu, and Amir Netz, "Random access in run-length encoded structures," USPTO Patent/Application Nbr. 7952499, 2011.

127

[10] Ioan B. Crivat, Raman Iyer, C. James MacLennan, Scott Oveson, Rong Guan, Zhaohui Tang, Pyungchul Kim, and Irina Gorbach, "Extensible data mining framework ," USPTO Patent/Application Nbr. 7383234, 2008.

[11] Ioan Bogdan Crivat, Pyungchul Kim, ZhaoHui Tang, James MacLennan, Raman Iyer, and Irina Gorbach, "Systems and methods that facilitate data mining," USPTO Patent/Application Nbr. 7398268, 2008.

[12] Ioan Bogdan Crivat, C. James MacLennan, Yue Liu, and Michael Moore, "Techniques for Evaluating Recommendation Systems," Application USPTO Patent/Application Nbr. 20090319330, 2009.

[13] Ioan, B Crivat, C, James MacLennan, and Raman Iyer, "Goal seeking using predictive analytics," USPTO Patent/Application Nbr. 7788200, 2010.

[14] Ioan, Bogdan Crivat, Elena, D. Cristofor, and C. James MacLennan, "Analyzing mining pattern evolutions by comparing labels, algorithms, or data patterns chosen by a reasoning component ," USPTO Patent/Application Nbr. 7636698, 2009.

[15] Ioan, Bogdan Crivat, C., James MacLennan, ZhaoHui Tang, and Raman S. Iyer, "Unstructured data in a mining model language," Patent (USPTO) USPTO Patent/Application Nbr. 7593927, 2009.

[16] Ioan Bogdan Crivat, C. James MacLennan, Yue Liu, and Michael Moore, "Techniques for Evaluating Recommendation Systems," Patent Application (USPTO) USPTO Patent/Application Nbr. 20090319330, 2009.

[17] Jeff Davis. (2002, July)Data Mining with Access Queries [Online]. http://www.techrepublic.com/article/data-mining-with-access-queries/1043734

[18] devexpress. Pivot Table® Style Data Mining Control for ASP.NET AJAX [Online]. http://www.devexpress.com/Products/NET/Controls/ASP/Pivot_Grid/

[19] Laura W. Murphy. (2010)Testimony Regarding Civil Liberties and National Security: Stopping the Flow of Power to the Executive Branch [Online].

http://www.techrepublic.com/article/data-mining-with-access-queries/1043734

http://www.techrepublic.com/article/data-mining-with-access-queries/1043734

http://www.devexpress.com/Products/NET/Controls/ASP/Pivot_Grid/

http://www.devexpress.com/Products/NET/Controls/ASP/Pivot_Grid/

128

http://judiciary.house.gov/hearings/pdf/Murphy101209.pdf

[20] Intel Corporation. (2005)Excerpts from A Conversation with Gordon Moore: Moore’s Law *Online+. ftp://download.intel.com/museum/Moores_Law/Video-Transcripts/Excepts_A_Conversation_with_Gordon_Moore.pdf

[21] Chip Walter. (2005, July)Kryder's Law [Online]. http://www.scientificamerican.com/article.cfm?id=kryders-law

[22] John Gantz and David Reinsel. (2010, May)The Digital Universe Decade – Are You Ready? [Online]. http://idcdocserv.com/925

[23] Roger, E. Bohn and James, E. Short. (2010, January)How Much Information? 2009 [Online]. http://hmi.ucsd.edu/pdf/HMI_2009_ConsumerReport_Dec9_2009.pdf

[24] Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth, "Knowledge Discovery and Data Mining: Towards a Unifying Framework," in KDD, 1996.

[25] 11 Ants Analytics. www.11antsanalytics.com. [Online]. http://www.11antsanalytics.com/products/default.aspx

[26] Predixion Software. (2011) PredixionSoftware.com. [Online]. https://www.predixionsoftware.com/predixion/Products.aspx

[27] Microsoft Corp. (2008) www.microsoft.com. [Online]. http://www.microsoft.com/sqlserver/2008/en/us/data-mining-addins.aspx

[28] IBM. ftp://public.dhe.ibm.com. [Online]. ftp://public.dhe.ibm.com/common/ssi/ecm/en/ytw03084usen/YTW03084USEN.PDF

[29] Rakesh Agrawal, Tomasz Imielinski, and Arun N Swami, "Mining association rules between sets of items in large databases," vol. 22, pp. 207-216, 1993, p207-agrawal.pdf.

[30] Jiawei Han and Micheline Kamber, Data Mining Concepts and Techniques. San Diego, CA, USA: Academic Press, 2001.

[31] Microsoft Corporation. Maximum Capacity Specifications for SQL Server [Online]. http://msdn.microsoft.com/en-us/library/ms143432.aspx

http://judiciary.house.gov/hearings/pdf/Murphy101209.pdf

ftp://download.intel.com/museum/Moores_Law/Video-Transcripts/Excepts_A_Conversation_with_Gordon_Moore.pdf

ftp://download.intel.com/museum/Moores_Law/Video-Transcripts/Excepts_A_Conversation_with_Gordon_Moore.pdf

http://www.scientificamerican.com/article.cfm?id=kryders-law

http://idcdocserv.com/925

http://hmi.ucsd.edu/pdf/HMI_2009_ConsumerReport_Dec9_2009.pdf

http://hmi.ucsd.edu/pdf/HMI_2009_ConsumerReport_Dec9_2009.pdf

http://www.11antsanalytics.com/products/default.aspx

https://www.predixionsoftware.com/predixion/Products.aspx

http://www.microsoft.com/sqlserver/2008/en/us/data-mining-addins.aspx

http://www.microsoft.com/sqlserver/2008/en/us/data-mining-addins.aspx

ftp://public.dhe.ibm.com/common/ssi/ecm/en/ytw03084usen/YTW03084USEN.PDF

ftp://public.dhe.ibm.com/common/ssi/ecm/en/ytw03084usen/YTW03084USEN.PDF

http://msdn.microsoft.com/en-us/library/ms143432.aspx


129

[32] Oracle. Logical Database Limits [Online]. http://download.oracle.com/docs/cd/B19306_01/server.102/b14237/limits003.htm

[33] Ramakrishnan Srikant and Rakesh Agrawal, "Mining quantitative association rules in large relational tables," in International Conference on Management of Data - SIGMOD, vol. 25, 1996, pp. 1-12, srikant96.pdf.

[34] Nikola K. Kasabov, Foundations of Neural Networks, Fuzzy Systems, and Knowledge Engineering.: Massachusetts Institute of Technology, 1998.

[35] E.H. Mamdani, "Application of Fuzzy Logic to Approximate Reasoning Using Linguistic Synthesis," IEEE Transactions on Computers - TC, vol. 26, no. 12, pp. 1182-1191.

[36] T. Takagi and M Sugeno, "Fuzzy identification of systems and its applications to modelling and control," IEEE Transactions on Systems, Man and Cybernetics, no. 15, pp. 116-132, 1985, http://pisis.unalmed.edu.co/vieja/cursos/s4405/Lecturas/Takagi%20Sugeno%20Modelling.pdf.

[37] Rakesh Agrawal and Ramakrishnan Srikant, "Fast Algorithms for Mining Association Rules," in Very Large Databases VLDB, 1994, http://www.eecs.umich.edu/~jag/eecs584/papers/apriori.pdf.

[38] Jiawei Han, Jian Pei, and Yiwen Yin, "Mining frequent patterns without candidate generation," in International Conference on Management of Data - SIGMOD, vol. 29, 2000, pp. 1-12, dami04_fptree.pdf.

[39] Jiawei Han, Jian Pei, Yiwen Yin, and Runying Mao, "Mining Frequent Patterns without Candidate: A Frequent-Pattern Tree Approach," Data Mining and Knowledge Discovery, vol. 8, pp. 53-87, 2004, dami04_fptree.pdf.

[40] Ashok Savasere, Edward Omiecinski, and Shamkant B. Navathe, "An Efficient Algorithm for Mining Association Rules in Large Databases," in Very large Databases VLDB, 1995, pp. 432-444, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.103.5437&rep=rep1&type=pdf.

[41] Ramesh, C. Agarwal, Charu C. Aggarwal, and V.V.V. Prasad, "A Tree

http://download.oracle.com/docs/cd/B19306_01/server.102/b14237/limits003.htm

http://download.oracle.com/docs/cd/B19306_01/server.102/b14237/limits003.htm

130

Projection Algorithm For Generation of Frequent Itemsets," Journal of Parallel and Distributed Computing , 1999.

[42] Nicolas Pasquier, Yves Bastide, Rafik Taouil, and Lotfi Lakhal, "Efficient Mining of Association Rules Using Closed Itemset Lattices," Information Systems - IS, vol. 24, no. 1, pp. 25-46, 1999, http://cchen1.csie.ntust.edu.tw:8080/students/2009/Efficient%20mining%20of%20association%20rules%20using%20closed%20itemset%20lattices.pdf.

[43] Nicolas Pasquier, Yves Bastide, Rafik Taouil, and Lotfi Lakhal, "Discovering Frequent Closed Itemsets for Association Rules," International Conference on Database Theory - ICDT, pp. 398-416, 1999, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.37.1102&rep=rep1&type=pdf.

[44] Mohammed Javeed Zaki and Ching-jiu Hsiao, "CHARM: An Efficient Algorithm for Closed Itemset Mining," in SIAM International Conference on Data Mining - SDM, 2002, CHARM.pdf.

[45] Zijian Zheng, Ron Kohavi, and Llew Mason, "Real world performance of association rule algorithms," in Knowledge Discovery and Data Mining - KDD, 2001, pp. 401-406, RealWorldPerf01.pdf.

[46] Yun Sing Koh and Nathan Rountree, Rare Association Rule Mining And Knowledge Discovery - Technologies for Infrequent and Critical Event Detection. Hershey, PA: Information Science Reference, 2010.

[47] Bing Liu, Wynne Hsu, and Yiming Ma, "Mining association rules with multiple minimum supports," in Knowledge Discovery and Data Mining - KDD, 1999, pp. 337-341.

[48] Hyunyoon Yun, Danshim Ha, Buhyun Hwang, and Keun Ho Ryu, "Mining association rules on significant rare data using relative support," Journal of Systems and Software - JSS, vol. 67, no. 3, pp. 181-191, 2003.

[49] Ke Wang, Yu He, and Jiawei Han, "Pushing Support Constraints Into Association Rules Mining," IEEE Transactions on Knowledge and Data Engineering : TKDE, pp. 642-658, 2003.

[50] Masakazu Seno and George Karypis, "LPMiner: An Algorithm for

131

Finding Frequent Itemsets Using Length-Decreasing Support," in IEEE: International Conference on Data Mining ICDM, 2001.

[51] E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J.D. Ullman, and C. Yang, "Finding interesting associations without support pruning," IEEE Transactions on Knowledge and Data Engineering - TKDE, vol. 13, no. 1, pp. 64-78, 2001, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.96.7294&rep=rep1&type=pdf.

[52] Yun Sing Koh and Nathan Rountree, "Finding Sporadic Rules Using Apriori-Inverse," Lecture Notes in Computer Science, vol. 3518/2005, pp. 153-168, 2005.

[53] L. Szathmary, A. Napoli, P. Valtchev, and Vandceuvre-les-Nancy LORIA, "Towards Rare Itemset Mining," in IEEE International Conference on Tools with Artificial Intelligence - ICTAI 2007, 2007, pp. 305-312, http://hal.archives-ouvertes.fr/docs/00/18/94/24/PDF/szathmary-ictai07.pdf.

[54] J. R. Quinlan, "Induction of Decision Trees," Machine Learning - ML, vol. 1, no. 1, pp. 81-106, 1986, InductionOfDT.pdf.

[55] Leo Breiman, Jerome Friedman, Charles J Stone, and R A Olshen, Classification and Regression Trees.: Chapman & Hall, 1984.

[56] Cristopher M. Bishop, Neural Networks for Pattern Recognition. New York: Oxford University Press, Inc, 1995.

[57] G.A. Carpenter and S Grossberg, The Handbook of Brain Theory and Neural Networks, Michael A. Arbib, Ed. Cambridge, MA: MIT Press, 2003, http://cns.bu.edu/Profiles/Grossberg/CarGro2003HBTNN2.pdf.

[58] Robert Andrews, Joachim Diederich, and Alan B. Tickle, "Survey and critique of techniques for extracting rules from trained artificial neural networks," Knowledge Based Systems - KBS, vol. 8, no. 6, pp. 373-389, 1995.

[59] Alan B. Tickle, Robert Andrews, Mostefa Golea, and Joachim Diederich, "The Truth Will Come to Light: Directions and Challenges in Extracting the Knowledge Embedded Within Trained Artificial Neural Networks," IEEE TRANSACTIONS ON NEURAL NETWORKS, vol. 9, no. 6, 1998,

132

TruthWillComeToLight.pdf.

[60] K Saito and R. Nakano, "Medical diagnosis expert system based on PDP model," in IEEE International Conference on Neural Networks, New York, 1988, pp. 1255-1262.

[61] Kurt Hornik, Maxwell B. Stinchcombe, and Halbert White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, no. 5, pp. 359-366, 1989.

[62] Bart Kosko, "Fuzzy Systems as Universal Approximators," IEEE Transactions on Computers - TC, vol. 43, no. 11, pp. 1329-1333, 1994, http://sipi.usc.edu/~kosko/FuzzyUniversalApprox.pdf.

[63] J. J. Buckley, Y. Hayashi, and E. Czogala, "On the equivalence of neural nets and fuzzy expert systems," Fuzzy Sets and Systems, vol. 53, no. 2, pp. 129-134, 1993.

[64] J.M. Benitez, J.L. Castro, and I. Requena, "Are artificial neural networks black boxes?," IEEE Transactions on neural Networks, pp. 1156 - 1164 , 1997, http://www.imamu.edu.sa/Scientific_selections/abstracts/Math/Are%20Artificial%20Neural%20Networks%20Black%20Boxes.pdf.

[65] S. Mitra and Y. Hayashi, "Neuro-fuzzy rule generation: survey in soft computing framework," IEEE Transactions on Neural Networks, vol. 11, no. 3, pp. 748-768, 2000.

[66] Razvan Andonie, Levente Fabry-asztalos, Catharine Collar, Sarah Abdul-wahid, and Nicholas Salim, "Neuro-fuzzy Prediction of Biological Activity and Rule Extraction for HIV-1 Protease Inhibitors," in Symposium on Computational Intelligence in Bioinformatics and Computational Biology - CIBCB, 2005, pp. 113-120.

[67] J. Chorowski and J. M. Zurada, "Extracting Rules from Neural Networks as Decision Diagrams," IEEE Transactions on Neural Networks, vol. PP, no. 99, pp. 1 - 12, 2011, ExtRulesNNDecisionDiagrams.pdf.

[68] Magne Setnes, Robert Babuska, Uzay Kaymak, and Hans R. van Nauta Lemke, "Similarity Measures in Fuzzy Rule Base Simplification," IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, vol. 28, no. 3, June 1998.

133

[69] M. Mizumoto and H. J. Zimmermann, "Comparison of fuzzy reasoning methods," Fuzzy Sets and Systems - FSS, vol. 8, no. 3, pp. 253-283, 1982.

[70] László T. Kóczy and Kaoru Hirota, "Approximate reasoning by linear rule interpolation and general approximation," International Journal of Approximate Reasoning - IJAR , vol. 9, no. 3, pp. 197-225, 1993.

[71] I.T. Jolliffe, Principal Component Analysis.: Springer, 2002.

[72] J.W. Sammon, "A Nonlinear Mapping for Data Structure Analysis," IEEE Transactions on Computers - TC, vol. C-18, no. 5, pp. 401-409, 1969, http://www.mec.ita.br/~rodrigo/Disciplinas/MB213/Sammon1969.pdf.

[73] Manoranjan Dash and Huan Liu, "Feature Selection for Classification," Intelligent Data Analysis - IDA, vol. 1, no. 1-4, pp. 131-156, 1997, http://reference.kfupm.edu.sa/content/f/e/feature_selection_for_classification__39093.pdf.

[74] B.G. Song, R.J., II Marks, S. Oh, P. Arabshahi, T.P. Caudell, and J.J. Choi, "Adaptive membership function fusion and annihilation in fuzzy if-then rules," in Second IEEE International Conference on Fuzzy Systems, vol. 2, 1993, pp. 961 - 967.

[75] N. Xiong and Lothar Litz, "Reduction of fuzzy control rules by means of premise learning - method and case study," Fuzzy Sets and Systems - FSS, vol. 132, no. 2, pp. 217-231, 2002, http://www.sciencedirect.com/science/article/pii/S0165011402001124.

[76] Johannes A. Roubos, Magne Setnes, and János Abonyi, "Learning fuzzy classification rules from labeled data," Information Sciences - ISCI, vol. 150, no. 1-2, pp. 77-93, 2003, http://sci2s.ugr.es/keel/pdf/specific/articulo/15-E.pdf.

[77] Gail A. Carpenter, Stephen Grossberg, and David B. Rosen, "Fuzzy ART: Fast stable learning and categorization of analog patterns by an adaptive resonance system," Neural Networks, vol. 4, no. 6, pp. 759-771, 1991,

134

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.64.2379&rep=rep1&type=pdf.

[78] R Andonie and L. Sasu, "Fuzzy ARTMAP with input relevances," IEEE Transactions on Neural Networks, vol. 17, pp. 929–941, 2006.

[79] Gail Carpenter and H. A. Tan, "Rule Extraction: From Neural Architecture to Symbolic Representation," Connection Science, vol. 7, no. 1, pp. 3-27, 1995.

[80] S. C. Tan, Chee Peng Lim, and M. V. C. Rao, "A hybrid neural network model for rule generation and its application to process fault detection and diagnosis," Engineering Applications of Artificial Intelligence - EAAI, vol. 20, no. 2, pp. 203-213, 2007.

[81] G. A Carpenter and A.-H. Tan, "Rule Extraction, Fuzzy ARTMAP and medical databases," in Proceedings of the World Congress on Neural Networks, Portland, Oregon; Hillsdale, NJ, 1993, pp. 501-506, http://digilib.bu.edu/journals/ojs/index.php/trs/article/view/430.

[82] Ioan B Crivat, Cristian Petculescu, and Amir Netz, "Efficient Column Based Data Encoding for Large Scale Data Storage," Patent Application (USPTO) USPTO Patent/Application Nbr. 20100030796, 2010.

[83] Ioan B Crivat, Cristian Petculescu, and Amir Netz, "Random access in run-length encoded structures," Patent (USPTO) USPTO Patent/Application Nbr. 7952499, 2011.

[84] Jiawei Han and Yongjian Fu, "Discovery of Multiple-Level Association Rules from Large Databases," in Very Large Databases - VLDB, 1995, pp. 420-431, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.64.3214&rep=rep1&type=pdf.

[85] Greg Linden, B. Smith, and J. York, "Amazon.com recommendations: item-to-item collaborative filtering," Internet Computing, IEEE , vol. 7, no. 1, pp. 76 - 80, January 2003.

[86] Netflix. Netflix Prize. [Online]. http://www.netflixprize.com

[87] David Goldberg, David A. Nichols, Brian M. Oki, and Douglas Terry, "Using collaborative filtering to weave an information tapestry," Communications of the ACM - CACM, vol. 35, no. 12, pp. 61-70,

http://www.netflixprize.com/

135

1992, http://www.ischool.utexas.edu/~i385d/readings/Goldberg_UsingCollaborative_92.pdf.

[88] Xiaoyuan Su and Taghi M. Khoshgoftaar, "A Survey of Collaborative Filtering Techniques," Advances in Artificial Intelligence, no. January 2009, 2009, http://www.hindawi.com/journals/aai/2009/421425/.

[89] Badrul Sarwar, George Karypis, Joseph Konstan, and John Reidl, "Item-based collaborative filtering recommendation algorithms," in World Wide Web Conference Series - WWW, 2001, pp. 285-295, http://glaros.dtc.umn.edu/gkhome/fetch/papers/www10_sarwar.pdf.

[90] Jeff J. Sandvig, Bamshad Mobasher, and Robin D. Burke, "Robustness of collaborative recommendation based on association rule mining," in Conference on Recommender Systems - RecSys, 2007, pp. 105-112, http://maya.cs.depaul.edu/~mobasher/papers/smb-recsys07.pdf.

[91] R Andonie, J.E. Russo, and R. Dean, "Crossing the Rubicon: A Generic Intelligent Advisor," International Journal of Computers, Communications & Control, vol. 2, pp. 5-16, 2007, http://www.cwu.edu/~andonie/MyPapers/Advisor%202005.pdf.

[92] Jonathan L. Herlocker, Joseph A. Konstan, Loren G. Terveen, and John T. Riedl, "Evaluating collaborative filtering recommender systems," ACM Transactions on Information Systems - TOIS, vol. 22, no. 1, pp. 5-53, 2004, http://web.engr.oregonstate.edu/~herlock/papers/tois2004.pdf.

[93] Asela Gunawardana and Guy Shani, "A Survey of Accuracy Evaluation Metrics of Recommendation Tasks," Journal of Machine Learning Research - JMLR, vol. 10, pp. 2935-2962, 2009, http://research.microsoft.com/pubs/118124/gunawardana09a.pdf.

[94] Ron Kohavi, Roger Longbotham, Dan Sommerfield, and Randal M. Henne, "Controlled experiments on the web: survey and practical

136

guide," Data Mining and Knowledge Discovery, vol. 18, no. 1, pp. 140-181, http://www.springerlink.com/content/r28m75k77u145115/fulltext.pdf.

[95] Cyril W. Cleverdon and Michael Keen, "Aslib Cranfield research project - Factors determining the performance of indexing systems; Volume 2, Test results," 1966.

[96] Daniel Billsus and Michael J. Pazzani, "Learning Collaborative Information Filters," in International Conference on Machine Learning - ICML, 1998, pp. 46-54, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.40.4781&rep=rep1&type=pdf.

[97] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl, "Analysis of recommendation algorithms for e-commerce," in ACM Conference on Electronic Commerce - EC, 2000, pp. 158-167.

[98] C. J. Van Rijsbergen, Information Retrieval.: Butterworth-Heinemann, 1979.

[99] Yiming Yang and Xin Liu, "A re-examination of text categorization methods," in Research and Development in Information Retrieval - SIGIR, 1999, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.11.9519&rep=rep1&type=pdf.

[100] John A. Swets, "EFFECTIVENESS OF INFORMATION RETRIEVAL METHODS," 1969.

[101] James A. Hanley and Barbara J. McNeil, "The Meaning and Use of the Area undera Receiver Operating Characteristics (ROC) Curve," Radiology, vol. 143, no. 1, pp. 29-36, April 1982, http://www.medicine.mcgill.ca/epidemiology/hanley/software/Hanley_McNeil_Radiology_82.pdf.

[102] Andrew I. Schein, Alexandrin Popescul, Lyle H. Ungar, David M. Pennock, and David Ungar, "Methods and metrics for cold-start recommendations.," in Research and Development in Information Retrieval - SIGIR, 2002, MethodMetricsColdStart.pdf.

[103] Ellen M. Voorhees, "Overview of the TREC 2002 Question Answering Track," in Text Retrieval Conference - TREC, 2002,

137

http://trec.nist.gov/pubs/trec11/papers/QA11.pdf.

[104] Bamshad Mobasher, Honghua Dai, Tao Luo, and Miki Nakagawa, "Effective personalization based on association rule discovery from web usage data," in Web Information and Data Management - WIDM, 2001, pp. 9-15.

[105] François Fouss and Marco Saerens, "Evaluating Performance of Recommender Systems: An Experimental Comparison," in Web Intelligence - WI, 2008, pp. 735-738.

[106] B. J. Dahlen, J. A. Konstan, J. L. Herlocker, N. Good, A. Borchers, and Riedl J., "Jump-starting movielens: user benefits of starting a collaborative filtering system with "dead data"," , 1998.

[107] Ian, H. Witten and Eibe Frank, Data Mining - Practical Machine Learning Tools and Techniques. San Francisco, CA, USA: Morgan Kauffman, 2005.

[108] Microsoft Corp. Data Mining Extensions (DMX) Reference [Online]. http://msdn.microsoft.com/en-us/library/ms132058.aspx

[109] Usama Fayyad, Georges, G. Grinstein, and Andreas Wierse, Information Visualization in Data Mining and Knowledge Discovery. San Diego, CA, USA: Academic Press, 2002.

[110] D. Bamber, "The area above the ordinal dominance graph and the area below the receiver operating characteristic graph.," Journal of Mathematical Psychology, vol. 12, pp. 387-415, 1975.

[111] (2011) http://citeseerx.ist.psu.edu/. [Online]. http://citeseerx.ist.psu.edu/

[112] Microsoft Academic Search. [Online]. http://academic.research.microsoft.com/

[113] Google Scholar. [Online]. http://scholar.google.com/

[114] Ioan, B Crivat, C, James MacLennan, Raman Iyer, and Dumitru Marius, "Using a rowset as a query parameter," Patent (USPTO) USPTO Patent/Application Nbr. 7451137, 2008.


http://citeseerx.ist.psu.edu/

http://academic.research.microsoft.com/

http://scholar.google.com/

Application of Computational Intelligence in Data Mining

Documents