
Artif Intell Rev (2007) 27:21–56 DOI 10.1007/s10462-008-9067-4

Knowledge discovery standards

Sarabjot Singh Anand · Marko Grobelnik · Frank Herrmann · Mark Hornick · Christoph Lingenfelder · Niall Rooney · Dietrich Wettschereck

Published online: 3 September 2008
© Springer Science+Business Media B.V. 2008

Abstract As knowledge discovery (KD) matures and enters the mainstream, there is an onus on the technology developers to provide the technology in a deployable, embeddable form. This transition from a stand-alone technology, in the control of the knowledgeable few, to a widely accessible and usable technology will require the development of standards. These standards need to be designed to address various aspects of KD, ranging from the actual process of applying the technology in a business environment, so as to make the process more transparent and repeatable, through to the representation of the knowledge generated and the support for application developers. The large variety of data and model formats that researchers and practitioners have to deal with and the lack of procedural support in KD have prompted a number of standardization efforts in recent years, led by industry and supported

S. S. Anand (B)
Department of Computer Science, University of Warwick, Coventry CV4 7AL, England, UK
e-mail: [email protected]

M. Grobelnik
Jozef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia
e-mail: [email protected]

F. Herrmann · D. Wettschereck
School of Computing, The Robert Gordon University, St. Andrew Street, Aberdeen AB25 1HG, Scotland, UK
e-mail: [email protected]

M. Hornick
Oracle Corporation, 10 Van de Graaff Drive, Burlington, MA 01803, USA
e-mail: [email protected]

C. Lingenfelder
1101 Kitchawan Road, Route 134, P.O. Box 218, Yorktown Heights, NY 10562-1301, USA
e-mail: [email protected]

N. Rooney
University of Ulster at Jordanstown, Newtownabbey, County Antrim, Northern Ireland
e-mail: [email protected]


by the KD community at large. This paper provides an overview of the most prominent of these standards and highlights how they relate to each other, using some example applications of these standards.

Keywords Knowledge discovery · Data Mining · Standards · CRISP-DM · PMML · JDM · OLE-DB

1 Introduction

Since its inception in the early 1990s, the field of knowledge discovery (KD) has found application in a wide variety of industrial settings, ranging from early applications in telecom churn analysis and the marketing of financial and retail products through to the analysis of scientific (Grossman et al. 2001) and sports-related (Bhandari et al. 1997) data. KD has quickly grown into an extremely broad field that involves a variety of technical and non-technical people, from such diverse fields as business analysis and health services to algorithmic development and database management. All of these people contribute to and use some aspects of KD.

The last decade has also seen the emergence of a number of robust data mining workbench applications from traditional application software providers such as SPSS and SAS, and of database-integrated tool sets from vendors such as IBM, Oracle and Microsoft. Apart from these key players in the data mining space, there are also a number of smaller vendors, such as RuleQuest, Megaputer Inc, KXEN, Quadstone and CorporateIntellect, to mention a selected few, that market specialist algorithms and often provide high-end data mining solutions not otherwise available from the larger vendors. Each vendor provides its own form of application programming interfaces, user interfaces, algorithms, knowledge representation and meta-data capabilities. As a result, real-world applications of data mining involve a lot of effort in terms of extracting relevant data from existing business IT systems, integrating that data with additional data sources, deploying the knowledge generated and evaluating the deployed knowledge. Due to this effort, most data mining applications tend to be costly exercises that are not easily productionized within an organisation's operational systems. Experience gained in these applications made it obvious that, for data mining to truly make its mark as a useful business tool, it must be embedded in vertical business applications, and processes must be put in place to standardise the deployment of this technology at the heart of business processes. KD standards, the theme of this paper, are a necessary precondition to make this transition from stand-alone workbench solutions to embedded business processes possible.

The first attempt at standardising any aspect of KD dates back to 1994 with the proposal for a KD process (Brachman and Anand 1994), outlining the high-level steps required to conduct a successful KD project. The process evolved over numerous iterations within academia (Fayyad et al. 1996a,b; Anand and Büchner 1998; Klösgen and Zytkow 2002) prior to the development of an industry-led standard called CRISP-DM (see Sect. 2). CRISP-DM is based on experience gained within academia and industry regarding the application of data mining techniques to real business problems.

Within this process, the three stages viewed as being able to profit most from standardisation were the modelling, evaluation and deployment stages, which deal with building, testing and applying the knowledge generated by the use of data mining algorithms. Standardizing the build and test phases was aimed at reducing the risk of adopting the technology, as the business is freed from the need for vendor-specific integration. It also opens up a secondary market


where smaller algorithm vendors can make their products more accessible to end-users by supporting standard interfaces. Standardizing knowledge representation and deployment reduces the cost of integration with other IT applications that constitute the IT landscape of a business. To support these tasks, three key standardisation efforts were undertaken, namely, SQL/MM (see Sect. 3.2) (Eisenberg and Melton 2001), OLE DB DM (see Sect. 3.3), a Microsoft initiative aimed at extending OLE DB to include some data mining functionality, and JDM, a Java API for Data Mining (see Sect. 3.1). A further standardisation effort associated with these stages is the development of a knowledge representation standard. The aim is to facilitate the exchange of knowledge between vendor applications, both knowledge producers and consumers. The Predictive Modelling Markup Language (PMML) (see Sect. 4) is an XML-based standard under development by members of the Data Mining Group (DMG). Two further standardisation efforts in progress that aim to provide Web Services support for deploying data mining within an organisation are XMLA (a Microsoft-led initiative) and the JDM Web Services initiative, bringing together the expertise of the JDM and PMML teams.

It is important to realize that there are two reasons why more than one standard is discussed in this article: (a) most standards address different aspects of KD; some are geared towards the user of KD while others address a particular programming language or are somewhat vendor-specific; (b) the field is still moving very quickly and new approaches emerge continuously. It is unavoidable that in such circumstances standards are overtaken or subsumed by new developments and often overlap to a certain degree.

The remainder of this paper addresses a number of specific KD standards. We start in Sect. 2 with a description of CRISP-DM, a high-level standard that covers the KD process. The following sections then address particular aspects/steps of this process, such as application programming interfaces for data mining (Sect. 3), knowledge representation standards (Sect. 4) and Web services support (Sect. 5). We conclude with a discussion in Sect. 6 on the contributions and shortcomings of these standards and an outlook on their further development.

2 CRISP-DM

The CRoss-Industry Standard Process for Data Mining (CRISP-DM) (Chapman et al. 2000) has been developed as an industry-, tool- and application-neutral process for conducting KD. It defines tasks, outputs from these tasks, terminology and mining problem type characterizations. Founding members of the consortium working on the CRISP-DM standard were DaimlerChrysler, SPSS, NCR and OHRA, though the standard has benefited from input from the CRISP-DM Special Interest Group. Membership of the group now numbers about 200, ranging from management consultants to data warehousing and data mining practitioners.

CRISP-DM divides the KD process into the six top-level phases depicted in Fig. 1. It must be noted that CRISP-DM is neither a pure waterfall nor a rapid prototyping model. It is recommended that each phase is conducted as thoroughly as possible before advancing to the next phase, but in most KD projects it will also be necessary to backtrack to earlier phases, for example, if one discovers during the modelling phase that additional data pre-processing steps must be taken, or if the evaluation phase indicates that results are insufficiently robust or expressive, requiring the collection or purchase of additional data or a redefinition of the initial, overly ambitious project aims. Each of these six top-level phases has been divided by the CRISP-DM consortium into between three and seven Generic Tasks. An example of a generic task within the Data Preparation phase is Data Cleaning. A generic task breaks the top-level phase up into a sequence of typical tasks to be performed in this phase.

Fig. 1 The CRISP-DM process (Chapman et al. 2000)

The CRISP-DM consortium has tested this set of generic tasks on a number of applications and found it to be stable, general and complete, so that it can be assumed that these generic tasks are sufficient for the vast majority of real KD projects. Associated with each of these tasks, the CRISP-DM consortium provides a list of output reports that document the work carried out during the process. These reports are not prescriptive, as the tasks themselves are only described at a fairly high level. Hence the standard is meant to be more of a guideline than a prescriptive recipe for carrying out a data mining project. Each Generic Task in turn is composed of a set of Specialized Tasks. In our example, the generic task of Data Cleaning may be (partially) achieved by the specialized task of Missing Value Handling. The set of specialized tasks to be performed may differ from project to project, and the list provided in the CRISP-DM step-by-step guide may be incomplete or may contain specialized tasks that need not be performed in specific projects. The final level of the four levels of abstraction introduced in CRISP-DM is the Process Instance. The Process Instance is a specific instantiation of the specialized task with a particular method and its parameters. In our example, missing values may be replaced by the mean value for numeric attributes and the most frequent value for categorical attributes.
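The process-instance level can be made concrete with a small example. The following is a self-contained sketch, not part of CRISP-DM or any vendor tool (the class name and data are invented for illustration), of the specialized task of Missing Value Handling instantiated as mean imputation for numeric attributes and mode imputation for categorical attributes:

```java
import java.util.*;

public class ImputeExample {
    // Replace nulls in a numeric column by the mean of the observed values.
    static List<Double> imputeMean(List<Double> col) {
        double sum = 0;
        int n = 0;
        for (Double v : col) if (v != null) { sum += v; n++; }
        final double mean = n > 0 ? sum / n : 0.0;
        List<Double> out = new ArrayList<>();
        for (Double v : col) out.add(v != null ? v : mean);
        return out;
    }

    // Replace nulls in a categorical column by the most frequent observed value
    // (assumes at least one non-null value is present).
    static List<String> imputeMode(List<String> col) {
        Map<String, Integer> freq = new HashMap<>();
        for (String v : col) if (v != null) freq.merge(v, 1, Integer::sum);
        String mode = Collections.max(freq.entrySet(), Map.Entry.comparingByValue()).getKey();
        List<String> out = new ArrayList<>();
        for (String v : col) out.add(v != null ? v : mode);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(imputeMean(Arrays.asList(1.0, null, 3.0)));    // null -> 2.0
        System.out.println(imputeMode(Arrays.asList("a", null, "a", "b"))); // null -> "a"
    }
}
```

A different process instance of the same specialized task might instead drop incomplete records or use a model-based imputation; the point is that the method and its parameters are fixed only at this lowest level of abstraction.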

The CRISP-DM consortium further defines four dimensions along which the context of a KD project can be defined. The data mining context helps to construct a mapping between the generic and the specialized task levels in CRISP-DM. The four dimensions of data mining context defined in CRISP-DM are Application Domain, Data Mining Problem Type, Technical Aspects, and Tools and Techniques. The Application Domain essentially defines the specific business problem that is being addressed, for example, Churn Analysis or Response Modelling. The CRISP-DM consortium defines six different classes of data mining problems, such as Segmentation, Data Summarization and Classification. The second dimension of context identifies the KD project as consisting of one or more of these problem types. The technical aspects dimension specifies the technical challenges associated with the project based on issues within the domain, data, tools or deployment of discovered knowledge.


Table 1 The four contextual dimensions in CRISP-DM

Dimension                  Example
Application Domain         Medical Prognosis
Data Mining Problem Type   Regression
Technical Aspect           Censored Observations
Tools and Techniques       Cox's Regression, CIL's GENNA

Finally, the tools and techniques dimension specifies what tools and/or techniques are to be used in the project. Constraints could be technical or commercial. The choice of tools/techniques could have a bearing on the technical aspects of the project and vice versa. Table 1 shows an example description of the context of a KD project undertaken by some of the authors.

The Business Understanding phase is aimed at developing a common understanding, between the different project stake-holders, of the business processes and goals that drive the KD project. This phase is probably the most important, though tedious, phase of a KD project, and its importance cannot be overemphasized. It is during this stage that the primary business objective and the related (quantitative) metrics for determining the success or failure of the project must be discussed, understood and agreed. It is important to keep these measures realistic and in business terms rather than technical terms such as model accuracy. Another important aspect of this phase is ensuring that the right resources (human and technical) are available for making the project successful. The Generic Tasks that constitute this phase are: Understand the Business Objectives, Assess the Situation (with respect to resources/constraints and generic factors affecting the project), Determine Data Mining Goals and Produce Project Plan.

The aim of the Data Understanding phase is to gain an understanding of what data is available for use in the KD project and of its quality. The key generic tasks at this stage are Initial Data Collection, Data Description, Data Exploration and Data Quality Verification. Any issues with the data regarding quality or unexpected distributions/values are logged during this phase and translate into requirements for the Data Preparation phase. This phase is typically relatively straightforward with the support of the advanced OLAP and visualization tools available today.
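As a small illustration of the Data Quality Verification task, the sketch below (a hypothetical, self-contained example; the column name and bounds are invented) counts how many values in a numeric column are missing or fall outside an expected range, producing the kind of summary that would be logged during this phase:

```java
import java.util.*;

public class DataQualityExample {
    // Summarize a numeric column: how many values are missing (null)
    // and how many lie outside the expected range [lo, hi].
    static Map<String, Integer> qualityReport(List<Double> col, double lo, double hi) {
        int missing = 0, outOfRange = 0;
        for (Double v : col) {
            if (v == null) missing++;
            else if (v < lo || v > hi) outOfRange++;
        }
        Map<String, Integer> report = new LinkedHashMap<>();
        report.put("missing", missing);
        report.put("outOfRange", outOfRange);
        return report;
    }

    public static void main(String[] args) {
        // Ages should lie in [0, 120]; nulls and the sentinel 999 are problems to log.
        List<Double> age = Arrays.asList(34.0, null, 999.0, 27.0, null);
        System.out.println(qualityReport(age, 0, 120)); // {missing=2, outOfRange=1}
    }
}
```

Each such finding then becomes a requirement for the Data Preparation phase, for example a missing value handling or outlier treatment task.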

The Data Preparation phase involves the selection, cleaning, integration and formatting of the data collected during the Data Understanding phase. It often involves summarization, scaling and transformation of the data, as well as the derivation of new attributes from the underlying data attributes. Once again, CRISP-DM provides guidelines for documenting this phase, as more often than not during a KD project this phase is revisited after the Modelling and Evaluation phases with the intent of improving on the knowledge discovered through the inclusion of new data. It is therefore essential that every task undertaken during this phase is well documented, along with the rationale for carrying it out. The varied nature of the specific data preparation tasks has resulted in this phase being the least well supported with regard to off-the-shelf tool availability and related standards.

The Modelling phase, on the other hand, is very well supported by a large variety of commercial and non-commercial tools and is often considered the fun part of a project. The danger in many projects is that this phase is overrated and that the practitioner dives right into it, thinking that a good modelling tool will take care of insufficient data preparation or will be able to automatically filter out the relevant data. This is typically not true and is one of the major pitfalls of many KD projects (Pyle 2004). CRISP-DM defines as generic tasks within this phase: Modelling Tool Selection, Test Design, Model Building and Model Assessment.


The Evaluation phase consists of evaluating the results, reviewing the process followed during the project and determining the next steps. Note that the evaluation phase does not seek to evaluate the model with respect to technical metrics such as accuracy; it is the model assessment task within the Modelling phase that is responsible for assessing the accuracy of the model. The evaluation phase instead aims to evaluate the model against the business objectives as laid out within the Business Understanding phase. Based on the evaluation, the next steps may be a further iteration of the CRISP-DM life cycle, deployment of the results or, in the worst case, the abandoning of the project. A common result at this stage is that a technically valid model may not achieve the desired business goals or may be too costly to implement, thus requiring a return to one of the earlier phases.
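The gap between technical model assessment and business evaluation can be illustrated with a toy example. In the sketch below (all figures are invented for illustration; the retention value and contact cost are assumed economics, not from any real project), a churn model with lower accuracy outperforms a more accurate one on a simple profit measure:

```java
public class EvalExample {
    // Business evaluation of a binary churn model from its confusion matrix.
    // tp: churners correctly targeted, fp: loyal customers needlessly targeted,
    // fn: churners missed, tn: loyal customers correctly left alone.
    static double accuracy(int tp, int fp, int fn, int tn) {
        return (double) (tp + tn) / (tp + fp + fn + tn);
    }

    // Illustrative economics: retaining a churner is worth 100, each contact costs 10.
    static double profit(int tp, int fp, int fn, int tn) {
        return tp * (100 - 10) - fp * 10;
    }

    public static void main(String[] args) {
        // Model A: cautious, high accuracy, but targets very few churners.
        System.out.printf("A: acc=%.2f profit=%.0f%n",
                accuracy(10, 5, 90, 895), profit(10, 5, 90, 895));
        // Model B: lower accuracy, but catches far more churners.
        System.out.printf("B: acc=%.2f profit=%.0f%n",
                accuracy(80, 200, 20, 700), profit(80, 200, 20, 700));
    }
}
```

Here model A is the more accurate of the two, yet model B generates far more profit under the assumed economics, which is exactly the kind of discrepancy the Evaluation phase is meant to surface.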

The Deployment phase involves the development of a plan for productionizing the data mining project and integrating the results with the operations of the business. Understandably, this phase requires heavy interaction with the domain experts and the business functions affected by the deployment. Initial planning for this stage should already have been carried out during the feasibility study conducted as part of the Business Understanding phase. As the deployment of results is where the Return on Investment is generated, any major obstacle identified that relates to the deployment of results should lead to the project being abandoned.

Support for CRISP-DM within some data mining workbenches is now appearing, and a number of organisations are using it to document their data mining projects. However, the benefits, while self-evident, remain undocumented, as most businesses shy away from publicizing successes. The evident adoption of the standard is in itself a positive sign for the industry and the discipline.

3 Model generation and deployment

Existing proposals and initiatives that try to systematically cover and standardize the model generation and deployment functionality required by applications follow two main approaches: SQL-based and Application Program Interface (API)-based. Not surprisingly, all the major database vendors have initiatives for embedding data mining functionality within their database management systems. The natural extension to this is to provide access to data mining functionality through extending SQL. The key standardisation effort based on extending SQL is the SQL Multimedia and Application Packages Standard (SQL/MM) (ISO/IEC CD 13249-6 2004; Eisenberg and Melton 2001), an initiative developed and published by the International Organization for Standardization (ISO). SQL/MM consists of several parts, of which Part 6 specifies an SQL interface to data mining applications and services through accessing data from SQL/MM-compliant relational databases. SQL/MM attempts to provide a standardized interface to data mining algorithms that can be layered atop any object-relational database system and even deployed as middleware when required. We discuss this standard in greater detail in Sect. 3.2. A number of vendor-specific extensions to SQL have also been proposed, including:

– Microsoft’s OLE DB for Data Mining (OLE-DB 2000), described in more detail below,is an approach which is specially designed for data mining needs—it combines SQL witha low level API (a set of COM interfaces) to achieve interoperability with other clientand server technologies.

– IBM’s DB2 Intelligent Miner products contain a set of DB2 database extenders (DB2-IM2004), released in the time of the writing of this article. The idea is to incorporate data

123

Page 7: Knowledge discovery standardsvillasen/index_archivos/... · Artif Intell Rev (2007) 27:21–56 DOI 10.1007/s10462-008-9067-4 Knowledge discovery standards Sarabjot Singh Anand ·

Knowledge discovery standards 27

mining functionality into standard database SQL language in a relatively standard way.Functionality is based on IBM’s “Intelligent Miner” data mining product.1

– Oracle Data Mining (Oracle 2004) is a set of functions available in Oracle's database and accessible through PL/SQL (a programming language available to database programmers) and through a Java interface. It covers some of the most relevant data mining approaches, such as Support Vector Machines and Text Mining.

There is currently an API-based standard being developed as part of the Java Community Process (JCP) which is attracting wide attention. The Java Data Mining API (JDM 2004, see Sect. 3.1) is an initiative with participation from most of the leading data mining and database companies, such as Oracle, Sun, IBM, SAS and SPSS, among others. The idea is to define a pure Java API for developing data mining applications that can be used by clients without users being aware of, or affected by, the actual vendor implementations of data mining. JDM also leverages several other data mining standards, including PMML (Grossman et al. 2002), SQL/MM (ISO/IEC CD 13249-6 2004), and the Common Warehouse Metamodel (CWM 2004). The JDM specification supports the creation, storage, access and maintenance of data and metadata supporting data mining models, data scoring and data mining results.

In Sect. 5 we discuss two initiatives aimed at producing XML-based interfaces to support data mining web services. These are XML for Analysis (XMLA), led by Microsoft, and JDM Web Services, as defined by the expert group developing the JDM standard. We now discuss the two key standards, JDM and SQL/MM, as well as the Microsoft-led OLE DB for Data Mining initiative aimed at allowing data mining algorithms to be plugged into the MS Analysis Services.

3.1 Java Data Mining API

The Java Data Mining API (JSR-73) has been approved under the auspices of the Java Community Process (JCP 1995). The expert group developing this standard, led by Oracle, includes BEA Systems, Computer Associates, CorporateIntellect, CalTech, Fair Isaac, Hyperion, IBM, KXEN, Quadstone, SAP, SAS, SPSS, Strategic Analytics, Sun Microsystems and the University of Ulster.

JDM hopes to do for data mining what JDBC does for databases, essentially supporting key tasks through a standard interface. JDM supports the creation, storage and maintenance of data and metadata supporting data mining tasks. From the perspective of implementers of data mining applications, JDM allows you to expose a single, standard API understood by a wide variety of client applications and components. From the perspective of the data mining market, this means that providers of specialist algorithms can now offer their components adhering to the JDM standard, making it easier for data mining solution providers to plug the components into their existing data mining investment. From the perspective of data mining client application developers, applications can be coded against a single API that is independent of the underlying data mining system or vendor, reducing the risk of their investment in the technology. The supportsCapability methods provide a mechanism for JDM users to determine individual vendor capabilities.
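The idea of probing vendor capabilities before committing to a provider can be sketched as follows. Note that this is a simplified, hypothetical stand-in rather than the JSR-73 API itself: the MiningService interface and supportsFunction method are invented for illustration, whereas in JDM such queries go through the connection's capability methods.

```java
import java.util.*;

public class CapabilityExample {
    // Hypothetical stand-in for a vendor's Data Mining Engine; NOT the JSR-73 interface.
    interface MiningService {
        boolean supportsFunction(String function); // e.g. "classification"
    }

    // A client picks, from several vendor services, the first one that can
    // run the requested mining function.
    static MiningService selectProvider(List<MiningService> vendors, String function) {
        for (MiningService s : vendors)
            if (s.supportsFunction(function)) return s;
        throw new NoSuchElementException("no provider supports " + function);
    }

    public static void main(String[] args) {
        MiningService clusteringOnly = f -> f.equals("clustering");
        MiningService fullSuite = f -> true;
        MiningService chosen = selectProvider(Arrays.asList(clusteringOnly, fullSuite),
                                              "classification");
        System.out.println(chosen == fullSuite); // the second vendor is selected
    }
}
```

The point of the pattern is that the client code never names a vendor or algorithm implementation; it asks each provider what it can do and proceeds accordingly, which is what makes specialist components substitutable behind the standard interface.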

JDM is task oriented. The key tasks supported by JDM are the building of models, the testing of models, applying models to data, and the import/export of models and related metadata. Note that JDM is an interface, not an actual implementation of a set of data mining algorithms.

1 Intelligent Miner is now part of the IBM DB2 Data Warehouse Edition V9.1 and fully implements SQL/MM data mining as well as most of PMML. Details can be found at http://www.ibm.com/software/data/db2/dwe/ and http://www.ibm.com/software/data/db2/dwe/mining.html


Data mining software vendors conforming to the JDM interface provide implementations of data mining algorithms that are utilized through the standard interface provided by JDM. JDM conforms to the J2EE Connector Architecture (JCA 2003). The J2EE Connector architecture defines a standard approach for connecting the J2EE platform to heterogeneous Enterprise Information Systems (EIS). The connector architecture defines a set of scalable, secure, and transactional mechanisms that enable the integration of EISs with application servers and other enterprise applications. It also enables the data mining vendor to provide a standard resource adapter for its server. A resource adapter is a system-level software driver that is used by a Java application to connect to an EIS. The resource adapter plugs into an application server and provides connectivity between the EIS (the Data Mining Engine (DME) in this instance), the application server, and the enterprise application. The resource adapter serves as a protocol adapter that allows any arbitrary EIS communication protocol to be used for connectivity. The Connection object in JDM allows the user to obtain information on the capabilities of the data mining server and a connection for communicating with the DME. A Connection object also provides the ability to save and retrieve metadata objects associated with user tasks, execute tasks, retrieve models and results generated by the tasks, and load models in memory for single-record scoring. The code in Listing 1 shows how a connection is established using JDM and how an instance of a factory class (BuildTaskFactory) is obtained.

Listing 1 Establishing a connection using JDM (example)

ConnectionSpec connSpec = (javax.datamining.resource.ConnectionSpec)
    jdmCFactory.getConnectionSpec();
connSpec.setName("user1");
connSpec.setPassword("pswd");
connSpec.setURI("myDME");
javax.datamining.resource.Connection dmeConn = jdmCFactory.getConnection(connSpec);
BuildTaskFactory btf = (BuildTaskFactory) dmeConn.getFactory("javax.datamining.task.BuildTask");

The classes defined in JDM can be broadly classified into data-, mining-, and model-related classes. Throughout this section we use an example build task to illustrate features of the JDM interface related to the tasks of building and applying a model. Note that it is beyond the scope of this paper to exhaustively discuss all features of JDM. An interested reader should refer to the JDM 1.0 specification for a complete listing of the API.

3.1.1 Data related classes

At the data level, a distinction is made within JDM between a single record and a data set. The single record is aimed at supporting real-time scoring using data mining models. The PhysicalDataSet class, within the javax.datamining.data package, allows a user of JDM to specify meta-data regarding the data to be used in the data mining task. A PhysicalDataSet object contains the URI for the data and metadata regarding the collection of attributes in the data, including the attribute name, datatype and role. The attribute role specifies whether the attribute exhibits special behaviour, such as acting as the case identifier, or is simply data to be mined. Other roles cater for special types of data, such as taxonomies stored in a relational form or data stored as a pivoted table.

The sample code in Listing 2 shows typical usage of the PhysicalDataSet. The create method of the PhysicalDataSetFactory takes a single parameter, a String representing the URI of the data to be mined. If supported by the data source and the data mining server, meta data can be imported from the data source. Alternatively, the PhysicalDataSet class provides methods for specifying the meta data related to the data at the specified URI. Finally, the instance of the PhysicalDataSet class is saved to the metadata repository for retrieval and use by a task.

Listing 2 Typical usage of the PhysicalDataSet

PhysicalDataSetFactory pdsFactory = (PhysicalDataSetFactory)
    dmeConn.getFactory("javax.datamining.data.PhysicalDataSet");
PhysicalDataSet pd = pdsFactory.create("housing.data");
pd.importMetaData();
dmeConn.saveObject("myPD", pd);

Another key, but optional, data-related class is LogicalData. This class provides a logical layer above the actual data, specifying how the attributes should be interpreted for use within the mining task. The key functions provided by the LogicalData class are the ability to filter out certain attributes that are not to be used for mining but are specified in the physical data, to set the usage type for attributes, such as numeric or categorical, and to specify legal, illegal and missing data values for attributes. If not provided, a vendor may make assumptions about how to interpret the provided data. The sample code in Listing 3 shows the typical usage of the LogicalData object. The create method of the LogicalDataFactory class takes a PhysicalDataSet instance as a parameter. This method returns an instance of LogicalData that mirrors the meta data within the PhysicalDataSet passed to the create method used to instantiate it. The getAttribute method is used with an attribute name as parameter to gain access to a particular LogicalAttribute object that contains the metadata about the attribute. In the example code, the attribute type of the attribute area is set to numeric. Next, an instance of CategorySet is created. A category set contains the list of all valid, missing or illegal values for an attribute. In Listing 3, a null value within the area attribute is specified as being a missing value and would be treated as such during the execution of a task on the data. Finally, the LogicalData, as with any metadata created using JDM, is stored within the metadata repository by calling the saveObject method on the connection.

Listing 3 Typical usage of the LogicalData object

11 LogicalDataFactory ldFactory = (LogicalDataFactory) dmeConn.getFactory("javax.datamining.data.LogicalData");
12 LogicalData ld = ldFactory.create(pd); // Specify how attributes should be used
13 LogicalAttribute area = ld.getAttribute("area");
14 area.setAttributeType(AttributeType.numerical);
15 CategorySet catSet = (CategorySet) CategorySetFactory.create(AttributeDataType.stringType);
16 Category c = new Category(null, CategoryProperty.missing);
17 catSet.addCategory(c);
18 area.setCategorySet(catSet);
19 cnn.saveObject("housingLogicalData", ld, true);

3.1.2 Mining related classes

With the metadata on the data to be mined specified and stored in the metadata repository, the next step within the build task is to define the data mining function to be carried out. JDM makes a distinction between data mining functions and algorithms. Data mining functions are generic data mining problems that can be solved through the use of a number of alternative algorithms. An example of a data mining function is classification, which can be performed by decision tree induction, neural networks, rule induction, SVM, naïve Bayes, nearest neighbour algorithms, among others. Within JDM, the user may specify only the data mining function to be performed and not specify the actual algorithm to be used. It is then up to the data mining service provider to use an appropriate algorithm with appropriate default settings. The intention is to make data mining more accessible to novice users. Alternatively, the user may specify the algorithm to be used and its settings. The extent to which a data mining power user will avail of such functionality will depend on the ability of the data mining vendor to optimize the selection of the algorithm used. As of now, JDM defines the following functions:

– Classification
– Associations
– Clustering
– Attribute Importance

The example code in Listing 4 shows some of the main features of the ClassificationSettings object from the javax.datamining.supervised.classification package. As classification is a supervised task, the user can specify the name of the target attribute within the ClassificationSettings using the setTargetAttributeName() method inherited from the SupervisedSettings class defined in the javax.datamining.supervised package. If the user chooses to specify a particular algorithm to be used for the data mining function, the user can specify settings specific to the algorithm using the appropriate settings object (TreeSettings in the example in Listing 4) and then associate the algorithm settings with the function settings using the setAlgorithmSettings() method inherited by the ClassificationSettings object from the BuildSettings object provided in the javax.datamining.base package. The ClassificationSettings are then stored in the metadata repository for retrieval during execution of the build task. Listing 4 shows how this is achieved.

Listing 4 The ClassificationSettings object

20 ClassificationSettingsFactory csf = (ClassificationSettingsFactory) cnn.getFactory("ClassificationSettings");
21 ClassificationSettings cs = csf.create();
22 cs.setLogicalDataName("housingLogicalData");
23 cs.setTargetAttributeName("priceGroup");
24 TreeSettingsFactory tsf = (TreeSettingsFactory) cnn.getFactory("TreeSettings");
25 TreeSettings ts = (TreeSettings) tsf.create();
26 ts.setMaxSurrogates(2);
27 ts.setMaxDepth(3);
28 ts.setMaxSplits(5);
29 ts.computeNodeStatistics(true);
30 cs.setAlgorithmSettings(ts);
31 cnn.saveObject("housingBuildSettings", cs, true);

Currently the algorithms supported by JDM are decision tree induction, regression, SVM, k-means, feedforward neural networks and naïve Bayes. Having set the function and algorithm settings, the user is now in a position to create a BuildTask using the BuildTaskFactory object declared in the statement in line 6 of the example in Listing 1. This is done by calling the create method on the BuildTaskFactory object. The three parameters passed to the create method are the name of the PhysicalDataSet object, the name of the BuildSettings object and the name of the model that will be built by the BuildTask and stored in the repository. After the BuildTask object has been stored in the metadata repository, the task can be executed on the connection object as in line 35 in Listing 5.


Listing 5 Executing the task

32 BuildTask bt = (BuildTask) btf.create("myPD", "housingBuildSettings", "myModel");
33 cnn.saveObject("housingBuildTask", bt, true);
34 ExecutionStatus eStatus = null;
35 eStatus = cnn.execute("housingBuildTask");

3.1.3 Model related classes

Once a model has been created using a build task and stored in the metadata repository, it can be retrieved and interrogated using interfaces provided by JDM. The javax.datamining.base.Model interface is the key model class. As models vary based on the data mining function, there are more specific model interfaces defined in JDM for each data mining function, each of them inheriting from the javax.datamining.base.Model interface. For example, the ClassificationModel interface defined within the javax.datamining.supervised.classification package extends the SupervisedModel interface (defined in the javax.datamining.supervised package), which in turn extends the Model interface. A model within JDM has the following key components:

– Model Signature: The model signature is a definition of the set of attributes that must be present in any data record to be scored by the model. The signature may be a subset of the attributes present in the Logical Data defined during the build task, as some algorithms select attribute subsets during the building of the model. An example of such an algorithm is Decision Tree Induction. It is the model signature that defines whether the model is appropriate for scoring a particular record or data set.

– Model Detail: The model detail, and more specifically a subclass of the ModelDetail class, provides access to the actual knowledge structures within the model. The javax.datamining.modeldetail package contains sub-packages with interfaces for interrogating naïve Bayes, decision tree, neural network and SVM models. For clustering models the equivalent interfaces are in the javax.datamining.clustering package.

– Build Settings: These are the settings that were used for building the model. The user may specify some of these to be “System Determined” or “System Default”, in which case it is up to the data mining service provider to determine the actual settings used in the model build and reflect these settings within the build settings object referenced by the model.

– Mining Function: It provides access to the type of data mining function that the model can be used to perform.

– Mining Algorithm: It provides access to the algorithm that generated the model.

– Attribute Statistics (optional): It consists of statistics on each of the attributes within the data that the model was built from. The AttributeStatisticsSet interface within the javax.datamining.statistics package provides access to these statistics.
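Of these components, the model detail is the one applications typically navigate. The real javax.datamining TreeModelDetail and TreeNode interfaces are not reproduced here; the following self-contained sketch instead uses a simplified, hypothetical TreeNode class of our own to illustrate how a client might walk a classification tree's model detail and render one rule per leaf.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for a tree model detail: each node carries
// the split predicate on its incoming branch and, at a leaf, a predicted class.
public class TreeWalk {
    static class TreeNode {
        String predicate;   // condition on the incoming branch, e.g. "area <= 50"
        String prediction;  // class value at a leaf, null for inner nodes
        List<TreeNode> children = new ArrayList<>();
        TreeNode(String predicate, String prediction) {
            this.predicate = predicate;
            this.prediction = prediction;
        }
    }

    // Collect one human-readable rule per leaf by concatenating the predicates
    // on the path from the root, mimicking how an application could render the
    // knowledge structures exposed by a classification model's detail.
    static void rules(TreeNode node, String path, List<String> out) {
        String newPath = path.isEmpty() ? node.predicate : path + " AND " + node.predicate;
        if (node.children.isEmpty()) {
            out.add("IF " + newPath + " THEN " + node.prediction);
        } else {
            for (TreeNode child : node.children) rules(child, newPath, out);
        }
    }

    static List<String> demo() {
        TreeNode root = new TreeNode("true", null);
        root.children.add(new TreeNode("area <= 50", "priceGroup=low"));
        root.children.add(new TreeNode("area > 50", "priceGroup=high"));
        List<String> out = new ArrayList<>();
        rules(root, "", out);
        return out;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

The recursive traversal is the essential pattern; against the real API the recursion would descend via the child-access methods of the tree model detail rather than a field.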

3.1.4 Applying a model

After building and testing the model, the typical last stage within the mining process is that of deploying the knowledge (see Sect. 2). This may involve exporting the model or using the model to score cases, such as classifying customers or predicting outcomes. JDM supports the application of models to a single record and in batch mode through its definition of an apply task. In the example code in Listing 6, we show some typical code for applying a model in batch mode, using the tree model built in previous sections. The generic steps carried out when applying a model in batch mode are: selecting the data to which the model must be applied (lines 36–38), mapping attributes in the data onto attributes within the signature of the model (line 39), defining the format of the output from the apply task (lines 40–44), creating the settings for the apply task (lines 45–47) and executing the apply task (line 48). Further in Listing 6, the map created in line 39 between the input data and the model signature is added to the classification apply settings in line 43. Lines 41 and 42 describe the format of the output from the apply task. The mapByRank method takes three parameters. The first one describes the content to be output, while the second parameter provides a name for the attribute within the output data in which the content will be stored. The final parameter states that the ranking must be in descending order. So line 41 states that the apply task will add a new attribute called Prediction which will contain the top ranking value of the predicted attribute, while line 42 states that a new attribute called Probability will be added and it will contain the probability associated with the prediction. Lines 45 to 47 create and store the specification of the apply task containing the name of the apply data, the name of the model to be used, the name of the settings object and the output URI.

Listing 6 Applying a model

36 PhysicalDataSet pds = pdsFactory.create("housingApply.data");
37 pds.importMetaData();
38 cnn.saveObject("housingApplyData", pds, true);
39 // create a Map, sdMap, that maps attributes in the physical data set onto attributes in the model's signature
40 ClassificationApplySettings cas = ClassificationApplySettingsFactory.create();
41 cas.mapByRank(ClassificationApplyContent.predictedCategory, "Prediction", true);
42 cas.mapByRank(ClassificationApplyContent.probability, "Probability", true);
43 cas.setSourceDestinationMap(sdMap);
44 myConnection.saveObject("housingApplySettings", cas, true);
45 DataSetApplyTaskFactory dsatf = (DataSetApplyTaskFactory) myConnection.getFactory("DataSetApplyTask");
46 DataSetApplyTask dsat = dsatf.create("housingApplyData", "myModel", "housingApplySettings", "outputFile");
47 cnn.saveObject("housingApplyTask", dsat, true);
48 ExecutionStatus es = myConnection.execute("housingApplyTask");

3.1.5 Conformance

JDM takes an à la carte approach to conformance, meaning that vendors need only implement classes pertaining to the data mining functions and algorithms that their products support. A vendor must, however, support at least one data mining function specified by JDM. All core packages must be supported, as must all methods within an implemented class. Further, the semantics specified for each method must be implemented to ensure common interpretation of a given result. JDM compliant vendors must support J2EE and/or J2SE, and the recommended approach to extending JDM classes is through subclassing.

3.1.6 Example application: using WEKA as the data mining service provider

WEKA (Witten and Frank 1999) is an open source collection of data mining algorithms implemented in Java, available under the GNU General Public License. The algorithms can either be applied directly to a dataset or called from a Java program through the instantiation and use of objects of particular algorithm classes. WEKA contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. In recent years, WEKA has become a popular choice within academia and, while the original set of algorithms constituting WEKA were developed at the University of Waikato, data mining researchers from various universities have added their own algorithms to WEKA. As described above, JDM is a set of interfaces that supports the creation and storage of metadata and data related to data mining. JDM also supports the execution of data mining tasks such as the building, testing and applying of models. This scope subsumes the purpose of WEKA as a data mining algorithm set. The purpose of this section is to describe how a tool such as WEKA can be used to support the algorithmic side of JDM. In its current form WEKA cannot provide the full functionality of JDM. However, if the core metadata management and other core interfaces defined in JDM were to be implemented, WEKA could provide a rich source of algorithms that could be used to provide JDM compliant data mining services. As such, the key deficiencies in WEKA that could be fulfilled by an implementation of JDM core interfaces and wrapper classes for WEKA algorithms are:

– Compliance with the Java Connector Architecture, hence supporting the development of distributed applications involving data mining

– The explicit creation, storage and management of metadata relating to data mining tasks that can be reused in the future when productionizing data mining tasks

– Standard interfaces for model navigation and reuse. In WEKA, alternative tree induction algorithms such as ID3 and J48 use different representations for the resulting models, using their own Java objects with differing interfaces. The wrapping of these classes within Java classes that support the JDM interfaces within the javax.datamining.modeldetail.tree package will provide for a consistent approach to interrogating these knowledge structures.

The rest of this section continues the example in Sect. 3.1 and shows how the WEKA algorithm J48 can be used as the underlying algorithm when the execute() method on the JDM Connection object is called (line 35 of Listing 5 in Sect. 3.1.2). More specifically, we will look at what classes will need to be implemented, what changes will need to be made to existing WEKA classes and what JDM interfaces will need to be implemented by existing WEKA classes. JDM provides the following interfaces to support decision trees:

– TreeSettingsFactory: A factory class, the create method of which returns an object of type TreeSettings.

– TreeSettings: An interface for setting tree building parameters.

– TreeModelDetail: An interface for accessing information from the tree, such as a particular node or set of nodes.

– TreeNode: An interface for accessing information stored in a single node of the tree.

The code in Listing 7 shows how a classifier can be built from data stored in a data file called housing.data using the J48 algorithm. The induction of a tree consists of instantiating an Instances object and loading it with the data to be used for inducing the tree, creating an instance of the J48 class, setting parameters and calling the buildClassifier method on the J48 instance, passing the Instances object to it as a parameter.

Listing 7 Building a classifier from a data file

1 public static void main(String[] args) {
2   String dataFileName = "housing.data";
3   Instances data;
4   Reader dataReader;
5   J48 javaTree = new J48();
6   try {
7     dataReader = new FileReader(dataFileName);
8     data = new Instances(dataReader);
9     javaTree.setMinNumObj(10); // example parameter setting
10    javaTree.buildClassifier(data);
11  } catch (IOException ioe) {
12    ioe.printStackTrace();
13  } catch (Exception e) {
14    e.printStackTrace();
15  }
16 }

When using JDM, data to be used in the build task is specified in the PhysicalDataSet object and stored in the metadata repository prior to calling the execute method on the connection. Listing 8 shows a JDM compliant version of Listing 7 for building a classifier using J48. The buildTaskName is passed to the execute method as a parameter (see line 35 of the code in Sect. 3.1.2). The getObject() method is used to retrieve the BuildTask, PhysicalDataSet and BuildSettings objects from the repository. Next a new instance of the Instances class is created and data is loaded into it from the URI specified in the PhysicalDataSet object (see line 8 of Listing 2 in Sect. 3.1.1).

Listing 8 JDM compliant version of Listing 7

1 public ExecutionStatus execute(String buildTaskName) throws JDMException {
2   Instances data;
3   J48 javaTree = new J48();
4   try {
5     BuildTask bt = (BuildTask) this.getObject(buildTaskName, NamedObject.task);
6     PhysicalDataSet pds = (PhysicalDataSet) this.getObject(bt.getBuildDataName(), NamedObject.physicalDataSet);
7     BuildSettings bs = (BuildSettings) this.getObject(bt.getBuildSettingsName(), NamedObject.buildSettings);
8     data = new Instances(pds, bs.getLogicalData(), bt.getBuildDataMap());
9     TreeSettings ts = (TreeSettings) bs.getAlgorithmSettings();
10    // use ts to set parameters of javaTree
11    javaTree.buildClassifier(data);
12    ClassificationModel cm = new ClassificationModel();
13    cm.setModelDetail(javaTree.getClassifierTree());
14    saveObject(bt.getModelName(), cm, true);
15  } catch (Exception e) {
16    e.printStackTrace();
17  }
18 }

Note that the Instances constructor shown in Listing 8 is not part of WEKA but would need to be implemented to enable the PhysicalDataSet, LogicalData and BuildDataMap to be used by WEKA algorithms. The TreeSettings can then be used to set parameter values supported by J48. Finally the buildClassifier() method is called to build the tree. Once the tree has been built, it is stored in the metadata repository for retrieval and use at a later date. For the code in Listing 8 to work, the following assumptions are made:

– A method getClassifierTree() is implemented in the J48 class that returns the ClassifierTree object

– The ClassifierTree implements the TreeModelDetail and TreeNode interfaces
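These assumptions amount to an adapter: a vendor-specific tree is wrapped so that client code only ever sees the JDM-style tree interfaces. The sketch below uses simplified, hypothetical stand-ins of our own for both the vendor tree and the JDM TreeNode interface, to show the shape such a wrapper could take.

```java
// Adapter sketch: VendorTree stands in for a WEKA-style ClassifierTree with
// its own representation; TreeNode stands in for a simplified JDM-style
// interface. Neither is a real WEKA or javax.datamining type.
public class TreeAdapterSketch {

    // Simplified stand-in for a JDM-style tree node interface
    interface TreeNode {
        int getNumberOfChildren();
        TreeNode getChild(int index);
        Object getPrediction();
    }

    // Stand-in for a vendor tree node with its own, different representation
    static class VendorTree {
        Object classValue;
        VendorTree[] successors = new VendorTree[0];
    }

    // The wrapper translates the vendor representation into the common one
    static class VendorTreeNode implements TreeNode {
        private final VendorTree delegate;
        VendorTreeNode(VendorTree delegate) { this.delegate = delegate; }
        public int getNumberOfChildren() { return delegate.successors.length; }
        public TreeNode getChild(int i) { return new VendorTreeNode(delegate.successors[i]); }
        public Object getPrediction() { return delegate.classValue; }
    }

    // Client code is written purely against the common TreeNode interface
    static int countLeaves(TreeNode n) {
        if (n.getNumberOfChildren() == 0) return 1;
        int total = 0;
        for (int i = 0; i < n.getNumberOfChildren(); i++)
            total += countLeaves(n.getChild(i));
        return total;
    }

    static int demo() {
        VendorTree root = new VendorTree();
        root.successors = new VendorTree[] { new VendorTree(), new VendorTree() };
        root.successors[0].classValue = "low";
        root.successors[1].classValue = "high";
        return countLeaves(new VendorTreeNode(root));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The same pattern would let ID3 and J48 trees, despite their differing internal objects, be interrogated through one interface.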

3.2 SQL/MM

The SQL/MM effort has been published as ISO/IEC 13249 by ISO as an extension to the database SQL standard ISO/IEC 9075:2003. It defines a number of packages of generic data types common to various kinds of data used in multimedia and application areas, to enable the data to be stored and manipulated in an SQL database. ISO/IEC 13249 consists of the parts listed below, under the general title Information Technology – Database Languages – SQL Multimedia and Application Packages:

Part 1: Framework
Part 2: Full-Text
Part 3: Spatial
Part 5: Still Image
Part 6: Data Mining

The functionality for the various parts of SQL/MM has been designed to build upon the object-relational extensions of SQL using user-defined types and methods. This ensures that the interface can be implemented on top of any database system supporting these extensions without the need to encroach upon the base language SQL, for example by augmenting the basic query syntax. Such user-defined types can be used in different ways:

– as the basis of a column in a database table
– as the basis of an attribute of some other user-defined type
– as the basis of a new derived subtype

Therefore, SQL/MM does not provide a repository for values of these types, as the application can store any such values directly in database tables with columns of the appropriate types.

3.2.1 Data Mining techniques in SQL/MM

The typical application scenario for the user-defined types of part 6 of ISO/IEC 13249 is a data warehouse application. Warehouse applications frequently need to execute different tasks on different data sets in a flexible way. To this end, user-defined types are provided for the key data mining functions, namely, Association Rule Discovery, Clustering, Classification and Regression. SQL/MM also defines routines for the generation and manipulation of values of these user-defined types. The specifications thus make provision for analyzing data and capturing the results of the analysis. Existing data mining products provide a large number of different algorithms to support these data mining techniques. The standard therefore deliberately refrains from formalizing any specific data mining algorithm, such as K-means clustering. This lets an application program deploy the data mining techniques independent of a particular implementation.

3.2.2 Types related to data

In a database context it is natural to expect the input data for data mining to reside in database tables. This is not a severe restriction, as most SQL database products offer import mechanisms from various other formats, for example CSV, as well as federated access to data outside the database. SQL/MM provides the following user-defined types:

– DM_LogicalDataSpec

– DM_MiningData

– DM_ApplicationData

To allow the definition of data mining activities independent of the presence of the actual data tables, the notion of logical data is introduced, which lets the application define all the fields for a data mining activity and how they are to be used in a mining task, without specifying the source of this data. DM_LogicalDataSpec is an abstraction for a set of data mining fields identified by their names. Each of the fields also has an associated logical field type. The type is introduced to represent the input data needed for both the training and the test of data mining models. Possible types are categorical and numerical. The mapping from the columns of an actual source table to the fields in logical data can be made using the type DM_MiningData. It is a description of data contained in tables, which represents the metadata required to access the data during training, test or application runs. The constructor method DM_defMiningData takes a table name as input and reads the metadata (column names and types). In the case of straightforward mining from a single table, the necessary logical data specification can be generated using the method DM_genDataSpec defined on the type DM_MiningData. In this case the SQL types are mapped to default data mining types. DM_ApplicationData is used to submit a single record of data for model application. These data are rarely stored in a database table before they are used. Therefore a constructor method DM_impApplData is provided with which to create the value from an XML representation.
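Since DM_impApplData consumes an XML representation, an application typically assembles that document in its host language. The following sketch builds such a row; the helper and field names are our own, and the ROW/COLUMN layout follows the form used in the SQL/MM examples in this article.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of how an application could assemble the XML row representation
// passed to DM_impApplData for single-record scoring. Helper name, field
// names and values are illustrative assumptions.
public class ApplDataRow {
    static String toApplDataXml(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("<ROW>");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append("<COLUMN name=\"").append(e.getKey()).append("\">")
              .append(e.getValue()).append("</COLUMN>");
        }
        return sb.append("</ROW>").toString();
    }

    public static void main(String[] args) {
        Map<String, String> record = new LinkedHashMap<>();
        record.put("AGE", "48");
        record.put("INCOME", "66000");
        record.put("EDUCATION", "PhD");
        // The resulting string would be supplied to DM_impApplData, e.g. as a
        // parameter of an SQL statement issued through JDBC.
        System.out.println(toApplDataXml(record));
    }
}
```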

3.2.3 Types supporting the data mining phases

The CRISP data mining methodology (see Sect. 2) introduces six major phases in a data mining project. SQL/MM data mining supports the last three of these phases, namely, modelling, evaluation and deployment. These phases are often called the training phase, testing phase and application phase. For the training phase, the structure is the same for all four data mining techniques. The following user-defined types are involved:

– DM_<Technique>Settings

– DM_<Technique>BldTask

– DM_<Technique>Model

The settings types contain a logical data specification and capture all the parameters of a training run. These parameters influence the model only in a functional way; for example, the method DM_setMaxNumClus restricts the number of clusters discovered. However, settings are independent of any particular data mining algorithm. There is therefore no standard way of restricting the number of iterations of a clustering algorithm or even of selecting a specific algorithm. Values of type DM_<Technique>Settings can be defined and stored in the database independent of the existence of the actual mining data. This allows the reuse of settings for future executions of the training phase with different data tables. The build task types (DM_<Technique>BldTask) contain the metadata required for training. This includes the settings for the technique to be used and a description of the training data, a value of type DM_MiningData. For predictive techniques a second value of type DM_MiningData may be provided, representing the verification data to be used during the building of the model. The build tasks are executed by DM_build<Technique>Model methods. Tasks can also be stored for execution at a later stage, possibly triggered by a data warehouse scheduler. The build methods return data mining models as values of DM_<Technique>Model. These models can be used for testing and application. They can also be exported as PMML for consumption by other PMML-conformant tools. In the testing phase, predictive models are analyzed to determine their quality. This is done via the following user-defined types:

– DM_<Technique>TestTask

– DM_<Technique>Model

– DM_<Technique>TestResult


Like the build tasks, test tasks hold all the information necessary to test a predictive model, namely the model, a value of type DM_<Technique>Model, and the description of the test data, a value of type DM_MiningData. Execution of a test task generates a test result, which encapsulates quality information about the model based on the test data, for example the confusion matrix for a classification model.
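To make concrete the kind of quality information such a test result encapsulates, the following generic sketch (our own illustration, not part of the SQL/MM API) computes a confusion matrix and the classification error from actual and predicted labels.

```java
import java.util.Arrays;

// Generic illustration of test-phase quality information: a confusion matrix
// and the classification error derived from it.
public class ConfusionMatrix {
    // rows: actual class index, columns: predicted class index
    static int[][] confusion(int[] actual, int[] predicted, int numClasses) {
        int[][] m = new int[numClasses][numClasses];
        for (int i = 0; i < actual.length; i++) m[actual[i]][predicted[i]]++;
        return m;
    }

    // Classification error: fraction of records off the diagonal
    static double error(int[][] m) {
        int total = 0, correct = 0;
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < m[i].length; j++) {
                total += m[i][j];
                if (i == j) correct += m[i][j];
            }
        return 1.0 - (double) correct / total;
    }

    public static void main(String[] args) {
        int[] actual    = {0, 0, 1, 1};
        int[] predicted = {0, 1, 1, 1};
        int[][] m = confusion(actual, predicted, 2);
        System.out.println(Arrays.deepToString(m)); // [[1, 1], [0, 2]]
        System.out.println(error(m));               // 0.25
    }
}
```

A method such as DM_getClasError would report a summary of exactly this kind of tabulation, computed by the database from the test data.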

In the application phase, data mining models are applied to new data in one of two ways. The first is the application to database tables and is often called batch application. Usually batch application is a recurring event in a data warehouse and will be scheduled on a regular basis. This can be done in a manner similar to that outlined for model testing, using the following user-defined types:

– DM_<Technique>ApplTask

– DM_<Technique>Model

– DM_<Technique>Result

– DM_ApplicationData

The application tasks (DM_<Technique>ApplTask) hold all the information necessary to apply a model to a table with new data and store the results in the database. This includes the model (DM_<Technique>Model) to be applied, the description of the application data (DM_MiningData) and the description of the output table (DM_<Technique>Result), including the name of the result column.

Often, however, models are applied to data that are provided by an application. This is also known as single record scoring. To support this capability, methods of the type DM_apply<Technique>Model take a single record as input and return a corresponding value of DM_<Technique>Result. The example below shows how such an application to a single record is supported by SQL/MM. The example assumes the existence of a table MODELTABLE with two columns MODELNAME and MODEL of types CHARACTER VARYING and DM_CLUSMODEL, respectively. Further we assume that a clustering model has been generated and stored in that table with MODELNAME Model_1. The fact that the model uses values for AGE, INCOME, and EDUCATION can be determined using the method DM_getClusMdlSpec(). The application retrieves the model with the statement:

SELECT MODEL FROM MODELTABLE WHERE MODELNAME = 'Model_1'

and calls the following to compute the cluster to which the record belongs:

WITH Segmentation AS (
    SELECT MODEL FROM MODELTABLE WHERE MODELNAME = 'Model_1'
)
SELECT Segmentation.MODEL.DM_applyClusModel(
    DM_ApplicationData::DM_impApplData(
        '<ROW>' ||
        '<COLUMN name="AGE">' || '48' || '</COLUMN>' ||
        '<COLUMN name="INCOME">' || '66000' || '</COLUMN>' ||
        '<COLUMN name="EDUCATION">' || 'PhD' || '</COLUMN>' ||
        '</ROW>')).DM_getClusterID()
FROM Segmentation;

3.2.4 Conformance

Conformance to SQL/MM data mining can be claimed per data mining technique. All conformant implementations must support the types DM_LogicalDataSpec and DM_MiningData.


In addition, for each mining technique the corresponding user-defined types must be supported. Types are supported by providing the type itself, so that values can be stored in the database, and a function for each of its methods.

3.2.5 Example application

In this example a classification model is generated and then tested. Finally the model is applied to determine the predicted classes and confidence values for the prediction for all the records in a table. Training data are available in a table CUSTOMERS with columns C_NAME, C_AGE, C_INCOME, C_EDUCATION and C_BUYS. The goal is to determine a classification model that predicts values for the column C_BUYS. As a first step, a classification task is generated and stored in a table TASKS in Listing 9. Then the task is executed and the resulting classification model stored in a model table (Listing 10). Next, the model is tested using independent test data from table TEST (Listing 11). Assuming the result of the test to be satisfactory, the model can now be applied to predict the class for new customers. The new customer data are taken from a table NEW_CUSTOMERS with columns N_NAME, N_AGE, N_INCOME, N_EDUCATION but no information about buying behaviour. The result consists of the name of the customer, the predicted value for the customer's buying behaviour and the model's confidence in the predicted value (Listing 12).

Listing 9 Generating and storing a task

WITH TrainingData (DATA) AS (
    VALUES (DM_MiningData::DM_defMiningData('CUSTOMERS'))
)
INSERT INTO TASKS (TASKNAME, TASK)
    SELECT 'CUSTOMER_CLASSIFICATION',
           DM_ClasBldTask::DM_defClasBldTask(
               DATA, NULL,
               ((NEW DM_ClasSettings()).DM_useClasDataSpec(
                   DATA.DM_genDataSpec()).DM_clasSetTarget('C_BUYS')))
    FROM TrainingData;

Listing 10 Store model in model table

WITH Task(T) AS (
    SELECT TASK FROM TASKS WHERE TASKNAME = 'CUSTOMER_CLASSIFICATION'
)
INSERT INTO MODELS (MODELNAME, MODEL)
    SELECT 'BUYING_PREDICTION', T.DM_buildClasModel()
    FROM Task;

3.3 OLE DB for Data Mining

Microsoft's approach to the integration of data mining into their SQL Server product is called "OLE DB for Data Mining". The specification of OLE DB for Data Mining was prepared in 2000 (OLE-DB 2000) and its first implementation was released as a part of the Analysis Services accompanying SQL Server 2000, which provide OLAP and data mining functionality. The next release of the Analysis Services software, together with a more extensive implementation of OLE DB for Data Mining, was released as part of SQL Server 2005 (codename "Yukon").


Listing 11 Testing the model

SELECT MODEL.DM_testClasModel(
           DM_MiningData::DM_defMiningData('TEST')).DM_getClasError()
FROM MODELS WHERE MODELNAME = 'BUYING_PREDICTION';

Listing 12 Results

WITH CLASSIFICATION (NAME, APPLY_RESULT) AS (
    SELECT NEW_CUSTOMERS.N_NAME, MODEL.DM_applyClasModel(
        DM_ApplicationData::DM_impApplData(
            '<ROW>' ||
            '<COLUMN name="C_AGE">' || CAST(N_AGE AS CHAR(3)) || '</COLUMN>' ||
            '<COLUMN name="C_INCOME">' || CAST(N_INCOME AS CHAR(8)) || '</COLUMN>' ||
            '<COLUMN name="C_EDUCATION">' || N_EDUCATION || '</COLUMN>' ||
            '</ROW>'))
    FROM MODELS, NEW_CUSTOMERS
    WHERE MODELS.MODELNAME = 'BUYING_PREDICTION'
)
SELECT NAME, APPLY_RESULT.DM_getPredClass(),
       APPLY_RESULT.DM_getConfidence()
FROM CLASSIFICATION;

At the lowest level, OLE DB for Data Mining is defined as a set of Component Object Model (COM) interfaces (COM is Microsoft's paradigm for software components; each component exposes an interface in the form of an API) that comply with the higher level OLE DB standard (a set of data-access COM interfaces providing consistent access to SQL and non-SQL data sources), and it adds on top of OLE DB an additional set of interfaces specific to data mining. At the higher level, the most important part of OLE DB for Data Mining is an adapted SQL-like language called Data Mining eXtensions (DMX), used at the client side for issuing requests to the Analysis Services server. Figure 2 shows the architecture of a typical system working with OLE DB for Data Mining. Each data mining algorithm is a software component that is installed as a plug-in into the Analysis Services server, which offers a unified interface towards client side software. Clients can be anything from user built applications to generic clients such as Excel, Access, etc. On the other side, software plug-ins implementing data mining algorithms can store and retrieve data stored in any database with an OLE DB driver.

One of the key concepts introduced in OLE DB for Data Mining is the language DMX. The central object within DMX is the Mining Model, which is a container acting externally as a relational table. A Mining Model does not store the data explicitly; it stores the patterns which data mining algorithms discovered within the data. Each Mining Model has an associated data mining algorithm together with a list of parameters. This makes DMX conceptually close to inductive databases (Raedt 2002). DMX defines three main operations on Mining Models: creation, training, and prediction, which are implemented through the SQL-like commands CREATE, INSERT and SELECT. In Listings 13, 14 and 15, borrowed from (Tang and Kim 2004), we show how all three commands are used for a typical data mining task of predicting credit risk based on a customer's demographic information. The CREATE MINING MODEL command is used to define the data domain in the data mining sense. For each column we define its data type and its role within the mining context. For example, the attribute that is the class variable is denoted using the keyword predict, as specified in the example in Listing 13 for the attribute RiskLevel.

The next step is to train the mining model defined with the previous CREATE command. Training data can come from any data source accessible through an OLE DB driver. As


40 S. S. Anand et al.

[Figure 2 diagram: clients (user applications, custom OLAP clients, Excel, Access, …) issue "OLE DB for DM & DMX commands" to the Analysis Services server, into which data mining algorithm plug-ins (decision trees, clustering, association rules, a user's own data mining algorithm) are installed; the server in turn exchanges data and meta-data models over OLE DB with any database server that has an OLE DB driver (Oracle, DB2, SQL Server, Access, …).]

Fig. 2 Architecture of a typical system using “OLE DB for Data Mining”

Listing 13 CREATE

CREATE MINING MODEL CreditRisk (
  CustomerId long key,
  Profession text discrete,
  Income text discrete,
  Age long continuous,
  RiskLevel text discrete predict
) USING [Microsoft Decision Tree]

a result of the training process, discovered statistical models or patterns are stored within the Mining Model (in the case above, the mining model has stored a decision tree). An example statement for executing the training phase is shown in Listing 14 below. The OPENROWSET statement defines the data to be used for the training. The first parameter, sqloledb, is the name of the OLE DB provider through which the data is accessed. The second and third parameters are the username and password for the database respectively, while the final parameter is a valid SQL select statement that specifies the actual target data to be used in the training (Listing 14).

Listing 14 INSERT

INSERT INTO CreditRisk (CustomerId, Profession, Income, Age, RiskLevel)
OPENROWSET('sqloledb', 'sa', 'mypass', '',
  'SELECT CustomerID, Profession, Income, Age, Risk FROM Customers')

On completion of the training phase, the Mining Model can be used for prediction on unseen cases (data records). Prediction within the DMX context is a join between a trained model and a set of data records; this type of join operation is different from a classical relational join and is therefore called a prediction join. In the example in Listing 15 below we make the


prediction join between the Mining Model trained in the previous step and the database table Customers, where the columns Profession, Income, and Age match. The output from the prediction task is a table with three columns: the customer id, the predicted risk level and its probability (Listing 15).

Listing 15 SELECT

SELECT Customers.ID, CreditRisk.RiskLevel,
       PredictProbability(CreditRisk.RiskLevel)
FROM CreditRisk PREDICTION JOIN Customers
  ON CreditRisk.Profession = Customers.Profession
  AND CreditRisk.Income = Customers.Income
  AND CreditRisk.Age = Customers.Age

4 Model representation: the predictive model markup language

The Predictive Model Markup Language (PMML) (Grossman et al. 2002) is an industry-led standard for representing the output of data mining algorithms. PMML is defined by the Data Mining Group (DMG, see also http://www.dmg.org). The group is composed of a number of full members (IBM, Oracle, Magnify, SPSS, SAS, StatSoft, Microsoft, MicroStrategy, Prudential Systems Software GmbH, KXEN, and Salford Systems) as well as numerous associate members. The objective of the development of PMML is to enable a large variety of users to define and share predictive models using an open standard.

The rationale for this endeavour is that currently a complex mosaic of software applications, produced by a number of data mining vendors, function as knowledge generators, using various data mining algorithms and employing different languages/methods for expressing the knowledge discovered, e.g. C/C++ routines and text/binary representations of the knowledge. The generation of knowledge is not an end in itself; it is the deployment of this knowledge that produces ROI for a business. As a result of these varied approaches to knowledge representation, knowledge consumers such as real-time scoring tools, personalisation engines, and marketing or visualization tools, often produced by vendors other than the knowledge generators, incur substantial integration costs. PMML aims to provide an XML-based standard representation for knowledge and the associated meta data required by a knowledge consumer. PMML is designed to produce the following benefits to vendors and users of KD systems:

– Proprietary issues and incompatibilities are no longer a barrier to the exchange of models between applications.
– PMML is based on XML, a widely accepted and promoted language.
– Models can be generated using any knowledge generator vendor and deployed using any consumer vendor application.

PMML Version 2.1 is supported by most current releases of member vendors' applications and is discussed in this section. A PMML document must be a valid XML document that obeys PMML conformance rules, even though a DOCTYPE declaration is not required. The root element PMML contains six child elements, of which two (Header and DataDictionary) are required and four are optional. We will explain these XML elements subsequently and provide some examples. Each XML element is described by its attributes and the (sub)elements it may contain.

4.1 Header

The Header element of PMML (see Listing 16) consists of two attributes, copyright and description, and three elements, Application, Annotation and TimeStamp.

Listing 16 Example of a Header XML element of PMML

<?xml version="1.0" ?>
<PMML version="1.0">
  <Header copyright="CorporateIntellect" description="Results of CAPRI">
  </Header>
  ...
  ...
</PMML>

Information regarding the copyright of the model should be contained within the copyright attribute of the Header element. The description attribute contains any human-readable text considered useful, by the producer, to a potential consumer of the model that cannot be captured in any other element within the document. PMML is not prescriptive with respect to what the description should or should not contain. The Application element identifies the application that generated the knowledge contained within the document. The Application element contains two elements, name and version. The Annotation element allows the user to provide free-text comments that help the user identify, for example, what segment of data the model is intended to be applied to. The TimeStamp element allows the model creation date and time to be stored within the document.
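As a quick illustration, a Header of the shape just described can be parsed with a few lines of Python. The fragment below is hypothetical: the application name, annotation text and timestamp value are invented for the example, with element names taken from the description above.

```python
import xml.etree.ElementTree as ET

# Hypothetical PMML Header fragment; element and attribute names follow the
# description in the text, and all values are invented for illustration.
header_xml = """
<Header copyright="CorporateIntellect" description="Results of CAPRI">
  <Application name="CAPRI" version="1.0"/>
  <Annotation>Built for the high-value customer segment only</Annotation>
  <TimeStamp>2003-07-01T12:00:00</TimeStamp>
</Header>
"""

header = ET.fromstring(header_xml)
print(header.get("copyright"))                 # -> CorporateIntellect
print(header.find("Application").get("name"))  # -> CAPRI
```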

4.2 Mining Build Task

The Mining Build Task is an element that allows the user to describe, in a non-prescriptive manner, information regarding the building of the model/knowledge described within the document. This may include the data source used, the specific parameter settings used, etc. While the specific content structure is not specified by the DMG, if the PMML is generated by a knowledge generator that conforms to one of the standards for model generation (see Sect. 3), it is likely that this element will contain information described within those standards specific to the building of the knowledge, for example the BuildTask within JDM. As PMML is only aimed at the exchange of knowledge, as opposed to the ability to reproduce this knowledge within an alternative knowledge generator, the information provided in this element is considered ancillary to PMML and is treated as meta-data, not for use specifically within the deployment of the model by the PMML consumer.

4.3 DataDictionary

The DataDictionary element contains a description of the data used as input when generating the model. The DataDictionary element has one attribute, numberOfFields, that takes as a value a non-negative number representing the number of fields within the data used as input when generating the model. This number also corresponds to the number of DataField elements contained within the DataDictionary element, each of which describes a distinct field within the data. A DataField element consists of the following attributes:

– name specifies the name of the attribute. The name must be unique within the DataDictionary, as well as among any derived attributes defined within the TransformationDictionary (see Sect. 4.4) or within any of the models within the document.

– displayName may contain an alternative name used by the knowledge generator application to identify the attribute.

– opType declares the field as being either categorical, ordinal or continuous. This classification defines the types of operations that can be defined on the field. For example, the


“=” comparison of a categorical value is valid whereas a “<” or “>” operation would not be valid.

– dataType declares the type of data contained within the field. This attribute can take values from an enumerated set consisting of boolean, integer, double, float and string.

– taxonomy is an optional attribute that provides the name of a taxonomy defined on the field and declared within the DataDictionary as a separate element.

– isCyclic takes a boolean value. If the value of this attribute is set to true, it indicates that a distance measure is defined on the attribute such that the minimum and maximum values are actually close together. An example of such an attribute is time, where 23:59 is close to 00:00.

In addition to the above attributes, each DataField may define the set of valid, invalid and missing values. For ordinal and categorical fields, this is achieved through the use of the Value element. The Value element contains three attributes, namely value, displayValue and property. The value attribute contains a unique value that the associated field can take. displayValue specifies an alternative value that an application may use for that particular value when displaying it to a user, so as to make it more readable. For example, the value ‘M’ for the gender field may have a displayValue of ‘Male’. The property attribute may take one of the values valid, invalid and missing, to indicate whether the value specified within the value attribute should be treated as a valid, invalid or missing value respectively.
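A small, hypothetical DataDictionary fragment illustrates the Value element just described, together with the Interval element used for continuous fields. Field names and values are invented, and we use the lowercase spelling optype found in the PMML schema itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical DataDictionary fragment; field names and values are invented.
dd_xml = """
<DataDictionary numberOfFields="2">
  <DataField name="gender" optype="categorical" dataType="string">
    <Value value="M" displayValue="Male" property="valid"/>
    <Value value="F" displayValue="Female" property="valid"/>
    <Value value="?" property="missing"/>
  </DataField>
  <DataField name="age" optype="continuous" dataType="integer">
    <Interval closure="closedClosed" leftMargin="18" rightMargin="90"/>
  </DataField>
</DataDictionary>
"""

dd = ET.fromstring(dd_xml)

def valid_values(field_name):
    """Collect the values declared valid for a categorical field (a Value
    element without a property attribute defaults to valid)."""
    field = dd.find(f"DataField[@name='{field_name}']")
    return [v.get("value") for v in field.findall("Value")
            if v.get("property", "valid") == "valid"]

print(valid_values("gender"))  # -> ['M', 'F']
```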

For continuous fields, the Interval element is used to define the valid range of values that the field can take. The Interval element has three attributes. The closure attribute can take one of four values, namely openClosed, closedOpen, openOpen and closedClosed; it defines whether the interval is open or closed on the left or right. The leftMargin and rightMargin attributes specify the left and right extreme values, if defined, for the field. In addition to the DataField elements, the DataDictionary may also contain a number of Taxonomy elements. A Taxonomy element contains one attribute, its name. The name specified for the Taxonomy must be unique, as it is what links the taxonomy to a specific field within the DataDictionary (using the taxonomy attribute within the DataField element). The ChildParent element within the Taxonomy element is where the taxonomy is defined. It has the following attributes:

– childField specifies the name of the field that represents the child value within the table containing the taxonomy data.
– parentField specifies the name of the field that represents the parent value within the table containing the taxonomy data.
– parentLevelField specifies the name of the field that represents the level in the hierarchy within the table containing the taxonomy data.
– isRecursive is set to either yes or no. A value of yes specifies that the whole hierarchy is defined in the same table. A value of no specifies that an individual table defines each level in the hierarchy. In this case, the field within the table identified by the parentLevelField attribute may specify which level within the taxonomy the parent value belongs to.

The Taxonomy element (see Listing 17) specifies the taxonomy data in one of two ways. If the taxonomy is already stored in a database, the TableLocator element can be used to specify the URL for the tables containing the taxonomy. PMML is not prescriptive regarding the format of the TableLocator element. As an alternative, the InlineTable element can be used to specify the data within the PMML document itself. The InlineTable element consists of a number of row elements, each of which contains one row of the table containing the taxonomy data. The format of the row element is again not specified, but it would need to contain at least a pair of parent and child values within the taxonomy.


Listing 17 Example of a PMML Taxonomy element

<Taxonomy name="Location">
  <ChildParent childColumn="Post Code" parentColumn="District">
    <TableLocator x-dbname="myDB" x-tableName="PostCode_District"/>
  </ChildParent>
  <ChildParent childColumn="member" parentColumn="group" isRecursive="yes">
    <InlineTable>
      <Extension extender="MySystem">
        <row member="W9" group="CentralLondon"/>
        <row member="NW9" group="NorthLondon"/>
        <row member="NW2" group="NorthLondon"/>
        <row member="W1" group="CentralLondon"/>
        <row member="CentralLondon" group="London"/>
        <row member="NorthLondon" group="London"/>
        <row member="EastLondon" group="London"/>
        <row member="London" group="England"/>
        ........
      </Extension>
    </InlineTable>
  </ChildParent>
</Taxonomy>

4.4 TransformationDictionary

The TransformationDictionary defines mappings of source data values to values more suited for use by the mining algorithm. As a complete set of transformations is very difficult to prescribe within a standard, PMML supports a subset of common transformations. These are:

– Normalization: map values to numbers; the input can be continuous or discrete.
– Discretization: map continuous values to discrete values.
– Value mapping: map discrete values to discrete values.
– Aggregation: summarize or collect groups of values, e.g. compute an average.

The TransformationDictionary consists of a set of zero or more DerivedField elements. Each DerivedField contains a single transformation on a field. A DerivedField has two attributes, name and displayName, representing the name of the new field created using the transformation and an optional name used for displaying the attribute within an application. The actual transformation can be one of seven possible elements:

– Constant: The Constant element is used to define a constant value, used for example in the definition of another transformation.

– FieldRef: This is used to reference a field in other parts of the PMML document. The FieldRef element has one attribute, name, which is the name of the field to which a reference is being created.

– NormContinuous: The NormContinuous element defines a transformation of a continuous attribute to another continuous attribute using piecewise linear transformations. Each linear transformation is defined in a LinearNorm element. The LinearNorm element is defined by two attributes, orig and norm, that define the value of the field being transformed and the new value that it is mapped to during the transformation respectively. The linear transformation is defined from one LinearNorm orig value to the next LinearNorm orig value within the NormContinuous element. The NormContinuous element also has an attribute, name, where the name of the original field that is being transformed is stored.


– NormDiscrete: The NormDiscrete element is used to define a transformation of a discrete field to a new field based on a unique attribute value of the original field. A NormDiscrete element consists of three attributes: name, the name of the original field; method, the method of the transformation (only indicator is currently supported in PMML, where the new field is set to 1 if the value of the original field matches that stored in the value attribute); and value, the value of the original field that defines this new field.

– Discretize: The Discretize element is used to define a transformation of a continuous attribute to a discrete (ordinal) attribute by associating a new value with a range of values of the original field. The field attribute stores the name of the original continuous field being transformed. The DiscretizeBin elements contain the interval on the original field and the new discrete value that it maps to.

– MapValues: The MapValues element transforms a discrete field into a new discrete field. MapValues has two attributes, outputColumn and defaultValue, and two elements. The first element is a FieldColumnPair element and the other is either an InlineTable or a TableLocator. The InlineTable or TableLocator contains the actual mapping: the value pairs of an original value and the corresponding value of the new derived field. The FieldColumnPair element has two attributes, field and column, that specify the field being transformed and the column within the table defining the mapping that contains the values of the original field.

– Aggregate: The Aggregate element defines a transformation akin to the GROUP BY statement in SQL. The element has four attributes. The first is field, which contains the name of the field being aggregated. The second is function, which specifies how the aggregation should take place. This is an enumerated type restricted to one of: count, sum, average, min, max, multiset. While the first five of these functions are defined only on numeric fields, the function multiset is defined on discrete fields and essentially groups all the values of the field being transformed into a set of values (most commonly used in association and sequence rule discovery). The final two attributes are groupField (the field by which the values are grouped) and sqlWhere, which optionally specifies a subset of data for which the transformation must be carried out.
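To make the semantics of two of these transformations concrete, the sketch below implements NormContinuous (piecewise linear interpolation between LinearNorm orig/norm pairs) and Discretize (interval-to-value mapping) in plain Python; the data structures are our own stand-ins for the XML elements, not part of PMML.

```python
def norm_continuous(x, linear_norms):
    """linear_norms: sorted (orig, norm) pairs standing in for LinearNorm
    elements; interpolate linearly between consecutive pairs."""
    for (o1, n1), (o2, n2) in zip(linear_norms, linear_norms[1:]):
        if o1 <= x <= o2:
            return n1 + (x - o1) * (n2 - n1) / (o2 - o1)
    raise ValueError("value outside the defined piecewise range")

def discretize(x, bins):
    """bins: ((left, right), value) pairs standing in for DiscretizeBin
    elements, here assumed to use closedOpen intervals."""
    for (left, right), value in bins:
        if left <= x < right:
            return value
    return None  # no bin matched; a defaultValue could be returned instead

print(norm_continuous(50, [(0, 0.0), (100, 1.0)]))  # -> 0.5
print(discretize(42, [((0, 18), "minor"), ((18, 65), "adult"),
                      ((65, 120), "senior")]))      # -> adult
```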

4.5 Model

PMML currently supports the following data mining model types:
– Tree Model
– Neural Networks
– Clustering Model
– Regression Model
– General Regression Model
– Naive Bayes Model
– Association Rules
– Sequence Rule Model

Each model type has its own specific XML element. Describing each of these model types as specified within PMML is beyond the scope of this paper; the interested reader should refer to the DMG web site for detailed descriptions of the models. However, the models have some elements and attributes in common, which we briefly describe here. Each model has a name, functionName and algorithmName attribute. The functionName attribute describes the type of mining goal that the model addresses. Currently, PMML supports five mining functions, namely association rules, sequences, classification, regression and clustering. Each model also contains a MiningSchema and ModelStatistics element.


4.5.1 Mining schema

As described above, a PMML document can contain more than one model. While the PMML document contains only one DataDictionary element, each model may use only a subset of the variables specified within the DataDictionary. The MiningSchema element is model specific, describing the data used by the specific model. It also describes how outliers and missing values were handled during the model build. The MiningSchema consists of a set of MiningField elements. Each MiningField describes meta data related to one attribute using the following attributes:

– name refers to a field in the DataDictionary or TransformationDictionary.
– usageType describes how the field was used by the model and can take one of four values, namely active, predicted, supplementary and group. An active field is one that was used by the model as an input. A predicted field is the field that is predicted by the model. A supplementary field is one that is not actively used by the model but is available to the end user of the model for investigating the model further. A group field is one that is used in an aggregation transformation (see Sect. 4.4).

– outliers refers to the method used for dealing with outlier values. Three options for dealing with outlier values are supported by PMML, namely asIs, asMissingValue and asExtremeValues. The value asIs indicates that the outlier value was passed through to the model without any changes. A value of asMissingValue indicates that the outlier value was treated just as another missing value, to be handled as specified by the missingValueTreatment attribute below. The alternative, asExtremeValues, replaces the outlier value with the appropriate extreme valid value, as specified by the lowValue and highValue attributes of the MiningField.

– lowValue specifies the smallest value of the field that the model treated as being valid.
– highValue specifies the largest value of the field that the model treated as being valid.
– missingValueTreatment refers to the method employed during the model build for dealing with missing values. The value taken by this attribute can be one of asIs, asMean, asMode, asMedian and asValue. The value asIs implies that the missing value should be treated as a distinct and valid value for the field. asMean, asMode and asMedian imply that the mean, mode and median value, respectively, for the field was used to replace the missing value. asValue implies that the missing value was replaced with the value specified in the missingValueReplacement attribute of the MiningField.
– missingValueReplacement is the value with which a missing value must be replaced if the missingValueTreatment method is set to asValue.
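The outlier and missing-value attributes above amount to a small preprocessing policy that a consumer applies to each incoming value. The sketch below is one plausible reading of that policy, with a plain dict standing in for a parsed MiningField element; the field itself is invented.

```python
def prepare_value(x, field):
    """Apply the outliers and missingValueTreatment policies of a MiningField
    (represented here as a plain dict) to a single input value."""
    low, high = field.get("lowValue"), field.get("highValue")
    # Outlier handling per the 'outliers' attribute.
    if x is not None and low is not None and not (low <= x <= high):
        if field.get("outliers") == "asMissingValue":
            x = None
        elif field.get("outliers") == "asExtremeValues":
            x = low if x < low else high
        # 'asIs': pass the value through unchanged.
    # Missing-value handling (only the asValue case is sketched here).
    if x is None and field.get("missingValueTreatment") == "asValue":
        x = field.get("missingValueReplacement")
    return x

# Hypothetical MiningField for an 'age' attribute.
age = {"outliers": "asExtremeValues", "lowValue": 18, "highValue": 90,
       "missingValueTreatment": "asValue", "missingValueReplacement": 35}
print(prepare_value(150, age))   # -> 90 (capped at highValue)
print(prepare_value(None, age))  # -> 35 (replacement value)
```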

4.5.2 ModelStatistics

The ModelStatistics element provides univariate statistics for each of the mining fields within the mining schema. The ModelStatistics element consists of a set of UnivariateStatistics elements, each consisting of statistics pertaining to a unique mining field identified by the name attribute. A UnivariateStatistics element may contain zero or one of each of the following elements, though clearly some elements are more appropriate than others given the data type of the field:

– The Counts element has three attributes, totalFreq, missingFreq and invalidFreq, that provide the total, missing and invalid value counts respectively.


– The NumericInfo element has six attributes storing the minimum, maximum, mean, standard deviation, median and interQuartileRange of the field. It also contains a set of elements called Quantile that store information regarding the distribution of values of the field.

– The DiscrStats element has one attribute that contains the modal value of the field, and two arrays, one consisting of string values and the other containing integer values. The string values are the unique values that the field takes, and the corresponding value in the integer array is its count.

– The ContStats element has two attributes that contain the sum of values and the sum of squares of the values in the data used for building the model. It also has a set of Interval elements and three arrays containing the corresponding frequency, sum, and sum of squares for each interval.
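The statistics listed above are what a producer computes per mining field before writing them out; a minimal sketch with invented sample data, covering Counts and part of NumericInfo only:

```python
import statistics

# Invented sample values for one mining field; None marks a missing value.
values = [23, 35, None, 41, 35, 29, None, 52]
present = [v for v in values if v is not None]

counts = {"totalFreq": len(values),           # Counts: totalFreq
          "missingFreq": values.count(None),  # Counts: missingFreq
          "invalidFreq": 0}                   # Counts: invalidFreq

numeric_info = {"minimum": min(present),
                "maximum": max(present),
                "mean": statistics.mean(present),
                "standardDeviation": statistics.stdev(present),
                "median": statistics.median(present)}

print(counts["missingFreq"])   # -> 2
print(numeric_info["median"])  # -> 35.0
```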

4.6 PMML producers and consumers

For the purpose of this section, PMML should be seen as the vehicle that bridges the gap between (data mining) tools that produce models and those that consume models. This distinction between producers and consumers of models was blurred until very recently, when the definition of PMML made the development of vendor- (and algorithm-) independent model consumers possible. Model consumers can function as advanced post-processing, visualization, verification and evaluation, hybrid or meta-learning, or deployment tools that primarily enable end users to work with models without having to purchase, install, learn and handle relatively complex KD suites.

A relatively large number of KD/data mining tools, such as IBM's Intelligent Miner, Oracle Data Mining, SPSS's Clementine, SAS's Enterprise Miner, and WEKA (with the extension produced by Li (2003)), are already capable of producing PMML. Some of these tools can also import PMML, but that alone utilizes the power of PMML only in a very restricted manner. This may be beneficial for users working jointly on a specific analysis problem, for example in a medical application of high public interest, who need to exchange their (preliminary) models. However, analysis experts using these rather complex tools can utilize consumers to present their results to application experts who may not have access to these KD tools. Furthermore, consumers may allow for the presentation of data mining models on the Internet, thus enabling wide dissemination of interesting findings.

In the remainder of this section, we describe three consumers of PMML in support of the argument that pure PMML consumers offer benefits that are hard to achieve with currently available KD tools. It is worth noting that all three consumers described in this section are independent of the application that produced the knowledge they consume. The only requirement is that the producing software must be PMML compliant.

4.6.1 PEAR: post-processing association rules

The PEAR (Jorge et al. 2002) system is a web-based system that was specifically designed for viewing and post-processing PMML association rules. A given set of association rules is managed by PEAR in a way similar to a set of linked web pages. In this case, links are created by common conditions in rules; hence, a hyperlink exists between each pair of rules that have a common condition. The user can navigate through the input set of rules by drilling down (refine) or zooming out (generalize) until the most interesting and relevant rules are discovered. PEAR can thus handle very large sets of association rules and, since its user interface is entirely based on HTML, it can be utilized by any user who has access to an Internet browser.


4.6.2 ROCOn—visualizing ROC graphs

ROCOn (Farrand and Flach 2003) is a tool developed for the visualization of Receiver Operating Characteristic (ROC) graphs (Provost and Fawcett 2001). Figure 3 summarizes the predictive performance of six data mining algorithms on the Cleveland Heart Disease domain (Blake and Merz 1998) with the help of a ROC plot. The plot shows the “true positive” versus the “false positive” rates. The ideal point on this plot is the upper left corner, as a model reaching this point would not only correctly classify all positive instances, but also not misclassify any negative instances. The points (“model performance”) lying on or near the outer hull of the curve connecting the points (0,0) and (100,100) (“Naive Bayes” and “Neural Network” in the case of Fig. 3) indicate the best models produced. Depending on the characteristics of the particular task at hand, a user should use one of these two models or a combination thereof. Ongoing research investigates how this technique can be extended to problems with more than two classes.
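The two rates plotted on such a graph come straight from a model's confusion matrix; a minimal sketch with invented counts (note that the plot described above uses a 0–100 scale, while the fractions below are on 0–1):

```python
def roc_point(tp, fn, fp, tn):
    """Return the (false positive rate, true positive rate) ROC coordinates
    for a confusion matrix given as raw counts."""
    tpr = tp / (tp + fn)  # fraction of positive instances correctly classified
    fpr = fp / (fp + tn)  # fraction of negative instances misclassified
    return fpr, tpr

# A perfect model sits in the upper left corner: no false positives and all
# positives recovered.
print(roc_point(tp=50, fn=0, fp=0, tn=50))   # -> (0.0, 1.0)
print(roc_point(tp=40, fn=10, fp=5, tn=45))  # -> (0.1, 0.8)
```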

The models shown in the plot were produced using algorithms implemented within WEKA (see Sect. 3.1.6), using default parameters in all cases. Note that WEKA has been extended by Li (2003) to produce PMML output. It is this PMML output that is used by ROCOn to generate the plot shown in Fig. 3. Note that while in this case all models were produced using the same data mining toolset, using PMML as input to ROCOn implies that each of the models displayed on the ROC curve could have been produced by different data mining software.

4.6.3 VizWiz—PMML visualization

VizWiz (Wettschereck et al. 2003) was developed as a highly interactive tool for the visualization and evaluation of data mining results. This tool utilizes information visualization

Fig. 3 The Receiver Operating Characteristic (ROC) curve for six models


Fig. 4 VizWiz detail view of the decision tree model

techniques to enable application experts, who are not necessarily experts in KD, to explore, evaluate and select those data mining results that are most suited for their purpose. The primary arguments for this enabling technology are that the number and complexity of data mining methods is significantly higher than the number of distinct model types, and that the model generation process is much more complex than the model understanding process, especially when models are properly visualized.

VizWiz is written in Java and can be run as an applet with the Java Runtime Environment, version 1.4. VizWiz reads and writes modified models in PMML format. The graphical user interface of VizWiz offers two primary viewing options to the user: either a plot showing the relative performance of each model on a ROC plot (Fig. 3) or a detailed graphical rendering of each model (Fig. 4).2 The first view can thus serve as an overview window that supports the user in quickly zooming in on the models most interesting to him/her. The second view can be used to learn more about each model, to edit the model and to test the model on selected or all data records of a given test data set. VizWiz currently offers interactive visualizations for the following model types:

– Linear regression
– Decision and regression trees
– Association rules
– Propositional and first-order rules3
– Subgroups (Klösgen 1996)4

VizWiz can represent and visualise the output of a variety of data mining and machine learning systems in a coherent fashion. For the KD system developer, that means a reduced coding effort, since there is no need to be concerned with visualisation aspects. As for the user, a familiar front-end is provided, which facilitates the visualisation and evaluation of highly diverse models, for example decision trees and models from Inductive Logic Programming. VizWiz clearly focusses on the visualisation of decision trees and classification rules. The main reason is that both models represent a natural and very "human way" of knowledge representation.

² The panel on the left hand side shows selected records of the test data set that was used to generate the ROC curve shown in Fig. 3.
³ Propositional rules are typically of the form "if variable_a = value_1 and variable_b = value_2 then class = class_1"; more complicated conditions are of course possible and supported. First-order rules are typically of the form "if pred_1(A,B) and pred_2(B,C) then class_1(A)", where pred_1 and pred_2 are predicates such as "father_of" and A, B, and C are variables.
⁴ The latter two model types, 'propositional and first-order rules' and 'subgroups', are not natively supported by PMML. VizWiz represents them as multi-variate decision trees with some minor extensions.

Initial investigations are often aimed at extracting such models, sometimes merely resulting from the need to better understand the inherent structure of large and feature-rich data sets before deciding on further investigations. Many sophisticated and efficient algorithms have been developed to establish models from data. They are a fundamental part of any commercial or non-commercial KD system. Yet, there are no strict rules determining which particular model or algorithm applies to a particular data mining task without explicit and thorough knowledge of the generating process. Indeed, an equivalent process may be the object of one's interest. Hence, experiments often involve a variety of diverse approaches and candidate models, and the final selection depends on a problem-specific fitness score. Because of such variety, there is an urgent need for consistent visualisation and evaluation, irrespective of the algorithmic interiors of a particular implementation. VizWiz's aim is to provide such an interface, allowing comparison of a wide variety of data models as long as their representation is PMML-compliant. Besides simplifying KD systems development, it enables and moreover encourages users to experiment with diverse KD systems.
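To illustrate what consuming a PMML-compliant model involves, the sketch below walks a hand-written fragment in the style of a PMML 2.x TreeModel. The fragment and the traversal are illustrative only (namespace declarations are omitted and the model is not complete); this is not VizWiz code.

```python
import xml.etree.ElementTree as ET

# A minimal, hand-written fragment in the style of a PMML 2.x TreeModel
# (namespace declarations omitted for brevity; not a complete model).
PMML = """<PMML version="2.1">
  <TreeModel modelName="toy" functionName="classification">
    <Node score="no">
      <True/>
      <Node score="yes">
        <SimplePredicate field="age" operator="lessThan" value="30"/>
      </Node>
      <Node score="no">
        <SimplePredicate field="age" operator="greaterOrEqual" value="30"/>
      </Node>
    </Node>
  </TreeModel>
</PMML>"""

def walk(node, depth=0, out=None):
    """Collect (depth, score, condition) triples from a tree of Node elements."""
    if out is None:
        out = []
    pred = node.find("SimplePredicate")
    cond = ("{field} {operator} {value}".format(**pred.attrib)
            if pred is not None else "True")
    out.append((depth, node.get("score"), cond))
    for child in node.findall("Node"):
        walk(child, depth + 1, out)
    return out

root_node = ET.fromstring(PMML).find("TreeModel/Node")
tree = walk(root_node)
```

Because the model is plain XML, a visualiser needs only such a traversal to render any producer's decision tree, which is precisely the decoupling that PMML-based tools exploit.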

5 Web Services support for Data Mining

Web services provide the means for service-oriented networks. Applications can connect to other software on remote systems over networks such as the web using standard HTTP. The communication between the service clients and the service host is based on XML messages that follow the Simple Object Access Protocol (SOAP 2004) for reliable machine-to-machine interoperability. Service descriptions, including methods, method parameters and return types, are contained in XML documents using the Web Services Description Language (WSDL). Web services can be published in Universal Description, Discovery and Integration (UDDI) directory services.

5.1 XMLA—XML for analysis

XML for Analysis (XMLA 2004) is a standardization initiative for the definition of data access interaction between a client application and an analytical data provider (OLAP or data mining). The initiative is sponsored and guided by three major software vendors in the area: Hyperion, Microsoft and the SAS Institute. The initiative started in 2001, and by 2004 over 30 members had joined the advisory council, among them some of the major data mining software vendors. Technically speaking, XMLA is a set of XML message interfaces that use the industry standard SOAP to define the data exchange between a client application and an analytical data provider on the server side, using the standard HTTP protocol over the Internet. In other words, XMLA is an API defined within SOAP (as a state-of-the-art remote procedure call standard) to standardize communication between client and server for analytical software solutions using OLAP and data mining.

Figure 5 shows a prototypical software architecture that implements the XMLA standard. On the left side we see the analytical client, for example bespoke end-user applications, MS Excel etc., where the user (usually interactively) issues requests that are interpreted on the server side (right hand side of Fig. 5), which returns the result. Requests are formulated through two methods: Discover and Execute. The Discover method is used for retrieving meta-data from the analytical server—examples of such meta-data are the list of available data sources on a server or details about a specific data source. Once the client has collected the appropriate meta-data about the server side, the Execute method is used for sending requests for actual data processing on the server side and retrieving the result data. This includes requests for retrieving data from OLAP cubes, creating and using data mining models etc. The actual processing is performed on the server side, while the client issues requests and shows the retrieved results to the user (typically through some kind of graphical user interface). Figure 5 also illustrates the communication process a few steps lower, at the protocol level: each Discover and Execute method call is wrapped into a SOAP envelope specified in XML. Next, the SOAP/XML package is sent via HTTP (the standard protocol of the World-Wide-Web) over TCP/IP to the server, where the request is decoded and processed, and the results are sent back to the client in the same way.

Fig. 5 Typical software architecture implementing the XMLA protocol

Listing 18 (XMLA-spec 2004, reproduction) shows the Execute method call issuing an MDX (Microsoft's SQL-like language for communication with an OLAP server) command. The actual MDX command is within the <Statement> tags, which are embedded in two XMLA tags (<Command> and <Execute>); the rest are SOAP infrastructural tags defining the appropriate schema, data types etc.
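The Discover counterpart of such an Execute call can be sketched as follows. The envelope mirrors the structure of Listing 18, and DISCOVER_DATASOURCES is one of the request types named in the XMLA specification, but the Restrictions and Properties content a real server expects is omitted here; this is an illustration, not a complete client.

```python
# Build the SOAP body for an XMLA Discover call. A real client would POST
# this text over HTTP to the analytical server's XMLA endpoint.
DISCOVER_TEMPLATE = """<SOAP-ENV:Envelope
    xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
  <SOAP-ENV:Body>
    <Discover xmlns="urn:schemas-microsoft-com:xml-analysis">
      <RequestType>{request_type}</RequestType>
      <Restrictions/>
      <Properties/>
    </Discover>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>"""

def discover_request(request_type="DISCOVER_DATASOURCES"):
    """Return the XML text of a Discover call for the given request type."""
    return DISCOVER_TEMPLATE.format(request_type=request_type)

request_body = discover_request()
```

The server's response to such a call is the tabular meta-data (e.g. the list of available data sources) that the client then uses to formulate subsequent Execute calls.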

5.2 JDM for web services

JDM defines a web services interface based on the JDM UML model, thereby enabling Service Oriented Architecture (SOA) design. Although JDM-based web services map closely to the Java interface, JDM web services address needs beyond the Java community, being based on WSDL and XML, a programming-language-neutral interface. Vendors of JDM can now leverage their investment in a JDM server for both the Java and web service interfaces, using common metadata, object structure, and capabilities. Moreover, non-JDM vendors can also leverage this same interface to be interoperable with a broader range of vendor implementations.

Leveraging the Simple Object Access Protocol (SOAP), JDM web services use document style and include methods for interacting with the object repository, validating that objects are correct or consistent, and controlling tasks. The methods are listed as follows:

– Object Repository: listContents, getObject(s), removeObject, saveObject, renameObject
– Controlling Tasks: executeTask, getExecutionStatus, terminateTask
– Object Validation: verifyObject
– Identifying DME capabilities: getCapabilities

Listing 18 Execute method call (reproduction XMLA-spec [2004])

 1 <SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
 2     SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
 3   <SOAP-ENV:Body>
 4     <Execute xmlns="urn:schemas-microsoft-com:xml-analysis"
         SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
 5       <Command>
 6         <Statement>
 7           select [Measures].members on Columns from Sales
 8         </Statement>
 9       </Command>
10       <Properties>
11         <PropertyList>
12           <DataSourceInfo>
13             Provider=Essbase;Data Source=local;
14           </DataSourceInfo>
15           <Catalog>Foodmart 2000</Catalog>
16           <Format>Multidimensional</Format>
17           <AxisFormat>ClusterFormat</AxisFormat>
18         </PropertyList>
19       </Properties>
20     </Execute>
21   </SOAP-ENV:Body>
22 </SOAP-ENV:Envelope>

These methods give the user explicit control over the DME. In Listing 19, we illustrate the executeTask web service as defined in the JDM WSDL document, together with an example executing apply on a single record using a classification model. See the JDM specification and Javadoc documentation for details of the particular objects referenced.

Listing 19 executeTask web service (example)

 1 <complexType name="executeTask">
 2   <sequence>
 3     <choice>
 4       <element name="taskName" type="xsd:string"/>
 5       <element name="task" type="Task"/>
 6     </choice>
 7   </sequence>
 8 </complexType>
 9 <complexType name="executeTaskResponse">
10   <sequence>
11     <choice>
12       <element name="status" type="ExecutionStatus"/>
13       <element name="recordValue" type="RecordElement"
14           maxOccurs="unbounded"/>
15     </choice>
16   </sequence>
17 </complexType>

The execution of a task can be specified either by naming a task already present in the DME (line 4) or by specifying the task content inline (line 5). The ExecutionStatus used in the executeTaskResponse in line 12 provides task progress. However, some tasks return values, as in the case of real-time scoring (record apply) as specified in line 13. In Listing 20, lines 17 through 30, we illustrate executing a task called RecordApplyTask. A standard header is expected in line 17. Line 20 specifies the apply record for the model ChurnClassification32, a classification model, stored in the metadata repository, predicting customer churn. Lines 21–23 provide the record to score, consisting of two predictors, age and income, and the customer identifier. Lines 24 through 28 specify the content of the output from the apply task. In line 25, we specify the customer identifier to be mapped from the input record to the output. In line 26, the top predicted category should be mapped to the destination attribute churn. Similarly, in line 27, the probability of this prediction should be mapped to the destination attribute churnProb.

Listing 20 RecordApplyTask

17 <SOAP-ENV:Envelope ... > <SOAP-ENV:Header ... />
18 <SOAP-ENV:Body>
19 <executeTask xmlns="http://www.jsr-73.org/2004/webservices/">
20   <task xsi:type="RecordApplyTask" modelName="ChurnClassification32">
21     <recordValue name="CustomerAge" value="23"/>
22     <recordValue name="CustomerIncome" value="50000"/>
23     <recordValue name="CustomerID" value="1003-2203-120"/>
24     <applySettingsName xsi:type="ClassificationApplySettings">
25       <sourceDestinationMap sourceAttrName="CustomerID" destinationAttrName="CustId"/>
26       <applyMap content="predCat" destPhysAttrName="churn" rank="1"/>
27       <applyMap content="prob" destPhysAttrName="churnProb" rank="1"/>
28     </applySettingsName>
29   </task>
30 </executeTask> </SOAP-ENV:Body> </SOAP-ENV:Envelope>

In Listing 21, lines 31 through 39, we depict the task response to the RecordApplyTask, in this case a prediction result. In lines 34–36, the apply output for the customer identifier, prediction and probability is provided.

Listing 21 Task response to the RecordApplyTask

31 <SOAP-ENV:Envelope ... >
32 <SOAP-ENV:Body>
33 <executeTaskResponse xmlns="http://www.jsr-73.org/2004/webservices/"
       xmlns:jdm="http://www.jsr-73.org/2004/JDMSchema">
34   <recordValue name="CustomerID" value="1003-2203-120"/>
35   <recordValue name="churn" value="1"/>
36   <recordValue name="churnProb" value=".87"/>
37 </executeTaskResponse>
38 </SOAP-ENV:Body>
39 </SOAP-ENV:Envelope>
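On the client side, extracting the prediction from such a response is straightforward. The sketch below parses a simplified, namespace-free version of the Listing 21 message and is illustrative only; real messages qualify the elements with the JDM web services namespace.

```python
import xml.etree.ElementTree as ET

# A simplified (namespace-free) executeTaskResponse body in the style
# of Listing 21.
RESPONSE = """<executeTaskResponse>
  <recordValue name="CustomerID" value="1003-2203-120"/>
  <recordValue name="churn" value="1"/>
  <recordValue name="churnProb" value=".87"/>
</executeTaskResponse>"""

def apply_output(xml_text):
    """Return the name/value pairs of a record-apply response as a dict."""
    root = ET.fromstring(xml_text)
    return {rv.get("name"): rv.get("value")
            for rv in root.findall("recordValue")}

result = apply_output(RESPONSE)
```

An application would then read the churn prediction and its probability from this dictionary and, for example, route the customer to a retention campaign.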

6 Discussion

In this paper we have described various KD-related standards emerging from various industry consortia with the common aim of making data mining more accessible to practitioners. Commonly, data mining is viewed as a standalone application with high integration costs. Standardising key data mining tasks and the resulting output can significantly reduce these integration costs, making data mining more accessible to system integrators developing business solutions that require data mining as a component of the overall solution. CRISP-DM was developed with the aim of standardising on a process that documents all stages of a data mining exercise in a vendor-neutral manner. Software support for the standard now appears in SPSS's Clementine data mining product.

Two stages of the CRISP-DM process lend themselves to further standardisation: the modelling and deployment phases. From the perspective of a database vendor, data mining is a natural extension to the data access and manipulation facilities provided by SQL, and hence one proposed approach to standardising access to data mining functionality is that of extending SQL. The resulting standard is SQL/MM, which is now published by ISO. An alternative approach is that of defining a standard API, an approach similar to JDBC for accessing DBMS services. The most mature API in this respect is JDM, which is being developed as part of the JCP. In keeping with an increasing trend in industry towards the creation of XML-based, language-independent APIs for web services, the expert group developing JDM has also provided, as part of a broader standard, a clearly defined set of methods and data mining objects in XML. Customers can develop XML- and Java-based applications, depending on architectural requirements, while still using the same vendor DME, for those vendors providing support for both interfaces. Developers already familiar with the JDM Java API will find parallel concepts and capabilities in the web services interface, greatly lowering the learning curve and minimizing the impedance mismatch among vendor products and capabilities. Vendors can leverage their investment in a JDM-based DME to reach both Java and web-service-based application developer markets.

Both JDM and SQL/MM support another XML standard called the Predictive Model Markup Language (PMML). PMML is an XML-based standard for representing the models that are generated by data mining algorithms. The key objective in developing PMML was to provide a standard format for exchanging models between various knowledge generator software and knowledge consumer software. Given that both JDM and SQL/MM support this standard, PMML can now be used as a common exchange format between vendors supporting either SQL/MM or JDM. In general, the groups developing the three standards, PMML, JDM and SQL/MM, have exchanged a number of ideas between them, which makes the understanding and usage of these different standards a lot easier for a practitioner. For example, the LogicalData object in JDM, the DM_LogicalDataSpec in SQL/MM and the Mining Schema in PMML are conceptually similar. There are two Microsoft-led initiatives that have been described in this paper, namely OLE DB for Data Mining and XML for Analysis. While a number of companies now support OLE DB for Data Mining, the API is platform-specific, aimed at integrating data mining functionality into MS Analysis Services alone. XMLA is a standard for defining data access interaction between a client application and an analytical data provider. While initially aimed more at On-line Analytical Processing than at data mining, XMLA is generic enough for it to be used to support a broad range of analytical functions including data mining. Given this fact, unlike the JDM web services initiative, XMLA does not provide an XML representation of data mining functionality but rather simply embeds DMX. The standardisation initiatives continue, and some standards, more than others, are now being supported by data mining software vendors, whether due to their relative maturity or the general overhead associated with supporting a standard. On the whole the emerging standards can only be seen as a positive sign, though time will tell what impact they will have on the industry and the level of adoption that these standards will see in the future.


Fig. 6 The main window of the PEAR browser for association rules

Acknowledgements This article is based on a tutorial presented at ECML/PKDD 2003, which was made possible by the generous support of the FP5 Network of Excellence KD-Net (IST-2001-33086) and the organizers of ECML/PKDD 2003. Some of the work in preparing the material for the tutorial was carried out as part of the European-funded SolEuNet (Data Mining and Decision Support for Business Competitiveness: A European Virtual Enterprise) project (IST-1999-11495). The ROC plotting software was developed by J. Farrand from the University of Bristol. Figure 6 was generously provided by A. Jorge from the University of Porto.

References

Anand S, Büchner A (1998) Data mining for decision support. Financial Times Management
Bhandari I, Colet E, Parker J, Pines Z, Pratap R, Ramanujam K (1997) Advanced scout: data mining and knowledge discovery in NBA data. Data Mining Knowl Discov 1:121–125
Blake C, Merz C (1998) UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html
Brachman R, Anand T (1994) The process of knowledge discovery in databases: a first sketch. In: Fayyad U, Uthurusamy R (eds) Knowledge discovery in databases: papers from the 1994 AAAI Workshop. AAAI Press, Seattle, Washington, pp 1–12
Chapman P, Clinton J, Kerber R, Khabaza T, Reinartz T, Shearer C, Wirth R (2000) CRISP-DM 1.0: step-by-step data mining guide. http://www.crisp-dm.org
CWM (2004) Common warehouse metamodel (CWM). http://www.omg.org/technology/documents/formal/cwm.htm
DB2-IM (2004) DB2 intelligent miner for data. http://www-306.ibm.com/software/data/iminer/fordata/
Eisenberg A, Melton J (2001) SQL multimedia and application packages (SQL/MM). SIGMOD Record 30(4)
Farrand J, Flach P (2003) ROCOn: a tool for visualising ROC graphs. http://www.cs.bris.ac.uk/
Fayyad U, Piatetsky-Shapiro G, Smyth P (1996a) From data mining to knowledge discovery: an overview. In: Fayyad U, Piatetsky-Shapiro G, Smyth P, Uthurusamy R (eds) Advances in knowledge discovery and data mining. AAAI Press/The MIT Press, Menlo Park, CA, pp 1–34
Fayyad U, Piatetsky-Shapiro G, Smyth P (1996b) The KDD process for extracting useful knowledge from volumes of data. Commun ACM 39(11):27–34
Grossman R, Kamath C, Kegelmeyer P, Kumar V, Namburu R (2001) Data mining for scientific and engineering applications. Kluwer Academic Publishers
Grossman R, Hornick M, Meyer G (2002) Data mining standards initiatives. Commun ACM 45(8). http://www.dmg.org
ISO/IEC 9075:2003 (2003) ISO/IEC 9075:2003 Database language SQL
ISO/IEC CD 13249-6 (2004) Information technology—database languages—SQL multimedia and application packages—Part 6: data mining
JCA (2003) Java connection architecture. http://www.xmla.org/faq.asp
JCP (1995) The Java community process. http://jcp.org/en/home/index
JDM (2004) Java specification request 73. http://www.jcp.org/en/jsr/detail?id=73
Jorge A, Poças J, Azevedo P (2002) Post-processing operators for browsing large sets of association rules. In: Lange S, Satoh K, Smith C (eds) Proceedings of Discovery Science 02, LNCS 2534. Springer-Verlag, Lübeck, Germany
Klösgen W (1996) EXPLORA: a multipattern and multistrategy discovery assistant. In: Fayyad U, Piatetsky-Shapiro G, Smyth P, Uthurusamy R (eds) Advances in knowledge discovery and data mining. AAAI Press, Menlo Park, CA, pp 249–271
Klösgen W, Zytkow J (eds) (2002) Knowledge discovery in databases: the purpose, necessity, and challenges. In: Handbook of data mining and knowledge discovery. Oxford University Press, Inc.
Li J (2003) PMML output and visualization for WEKA. PhD thesis, Department of Computer Science, University of Bristol. http://www.cs.bris.ac.uk/home/jl2092/project/thesis.pdf
OLE-DB (2000) OLE-DB for data mining specification 1.0. Microsoft. http://msdn2.microsoft.com/en-us/library/ms146608.aspx
Oracle (2004) Oracle data mining. http://www.oracle.com/technology/products/bi/odm/index.html
Provost F, Fawcett T (2001) Robust classification for imprecise environments. Mach Learn 42(3):203–231
Pyle D (2004) Nine simple rules you won't want to follow. DB2 Magazine 9(1)
Raedt LD (2002) A perspective on inductive databases. ACM SIGKDD Explor Newslett 4(2)
SOAP (2004) Simple object access protocol (SOAP). http://www.w3.org/TR/SOAP/
Tang Z, Kim P (2004) Building data mining solutions with SQL Server 2000. http://www.dmreview.com/whitepaper/wid292.pdf
Wettschereck D, Jorge A, Moyle S (2003) Data mining and decision support integration through the predictive model markup language standard and visualization. In: Data mining and decision support: integration and collaboration. Kluwer
Witten I, Frank E (1999) Data mining: practical machine learning tools with Java implementations. Morgan Kaufmann, San Francisco
XMLA (2004) XML for analysis (XMLA). http://www.xmla.org/
XMLA-spec (2004) XML for analysis (XMLA) specification version 1.1. http://www.xmla.org/docs_pub.asp