Data Mining with JDM API
Regina Wang
Data Mining
Knowledge-Discovery in Databases (KDD)
Searching large volumes of data for patterns.
The nontrivial extraction of implicit, previously unknown, and potentially useful information from data.
The science of extracting useful information from large data sets or databases.
Uses computational techniques from statistics, machine learning, and pattern recognition.
Descriptive Statistics
Collect data
Classify data
Summarize data
Present data
Make inferences to draw conclusions
--Point and interval estimation
--Hypothesis testing
--Prediction
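
As a small illustration of point and interval estimation, the following Java sketch computes a sample mean and an approximate 95% interval around it. The class name IntervalEstimate, the sample values, and the normal critical value 1.96 are assumptions made for this example; for small samples a t-distribution value would be more appropriate.

```java
final class IntervalEstimate {
    /** Sample mean: the point estimate of the population mean. */
    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    /** Sample standard deviation (divides by n - 1). */
    static double sampleStdDev(double[] xs) {
        double m = mean(xs);
        double squares = 0.0;
        for (double x : xs) {
            squares += (x - m) * (x - m);
        }
        return Math.sqrt(squares / (xs.length - 1));
    }

    /** Approximate 95% interval: mean +/- 1.96 * s / sqrt(n), a normal approximation. */
    static double[] interval95(double[] xs) {
        double m = mean(xs);
        double halfWidth = 1.96 * sampleStdDev(xs) / Math.sqrt(xs.length);
        return new double[] { m - halfWidth, m + halfWidth };
    }

    public static void main(String[] args) {
        double[] sample = { 2, 4, 4, 4, 5, 5, 7, 9 }; // made-up observations
        double[] ci = interval95(sample);
        System.out.printf("mean=%.3f, 95%% interval=[%.3f, %.3f]%n", mean(sample), ci[0], ci[1]);
    }
}
```
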
Machine Learning
Concerned with the development of techniques which allow computers to "learn".
Concerned with the algorithmic complexity of computational implementations.
Many inference problems turn out to be NP-hard or harder.
Common Machine Learning Algorithms
Supervised learning: prior knowledge
Unsupervised learning: statistical regularity of the patterns
Semi-supervised learning
Reinforcement learning
Transduction
Learning to learn
Pattern Recognition
The act of taking in raw data and taking an action based on the category of the data.
Aims to classify data patterns based on prior knowledge or on statistical information.
Based on availability of a training set: supervised and unsupervised learning.
Two approaches: statistical (decision theory) and syntactic (structural).
Supervised Techniques
Classification:
--k-Nearest Neighbors
--Naïve Bayes
--Classification Trees
--Discriminant Analysis
--Logistic Regression
--Neural Nets
Supervised Techniques (con't)
Prediction (Estimation):
--Regression
--Regression Trees
--k-Nearest Neighbors
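
k-Nearest Neighbors appears in both the classification and the prediction lists. The following is a minimal sketch of the classification variant, assuming Euclidean distance and a simple majority vote among the k closest training cases. The class KnnSketch, the Example type, and the toy data are hypothetical and are not part of JDM.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class KnnSketch {
    /** One labeled training case (hypothetical type for this sketch). */
    static final class Example {
        final double[] features;
        final String label;

        Example(double[] features, String label) {
            this.features = features;
            this.label = label;
        }
    }

    /** Euclidean distance between two points of equal dimension. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Returns the most frequent label among the k training cases nearest to the query. */
    static String classify(List<Example> training, double[] query, int k) {
        List<Example> sorted = new ArrayList<>(training);
        sorted.sort(Comparator.comparingDouble(e -> distance(e.features, query)));

        Map<String, Integer> votes = new HashMap<>();
        String best = null;
        int bestCount = 0;
        // Neighbors are visited nearest first, so on a tie the label that
        // reached the winning count first is kept.
        for (Example e : sorted.subList(0, Math.min(k, sorted.size()))) {
            int count = votes.merge(e.label, 1, Integer::sum);
            if (count > bestCount) {
                bestCount = count;
                best = e.label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Example> training = Arrays.asList(
            new Example(new double[] { 1, 1 }, "low"),
            new Example(new double[] { 1, 2 }, "low"),
            new Example(new double[] { 2, 1 }, "low"),
            new Example(new double[] { 8, 8 }, "high"),
            new Example(new double[] { 9, 8 }, "high"),
            new Example(new double[] { 8, 9 }, "high"));
        System.out.println(classify(training, new double[] { 2, 2 }, 3));
    }
}
```

For prediction (estimation) rather than classification, the same neighbor search could be reused, with the vote replaced by an average of the neighbors' numeric target values.
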
Unsupervised Techniques
Cluster Analysis
Principal Components
Association Rules
Collaborative Filtering
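
To make one of these techniques concrete, the sketch below computes the two standard measures used to rank association rules: support (the fraction of transactions containing an itemset) and confidence (how often the consequent appears when the antecedent does). The class AssociationRuleSketch and the shopping-basket data are made up for illustration; this is not how a JDM engine would expose the result.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class AssociationRuleSketch {
    /** Convenience helper to build one transaction (a set of items). */
    static Set<String> basket(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    /** Fraction of transactions that contain every item of the itemset. */
    static double support(List<Set<String>> transactions, Set<String> itemset) {
        int hits = 0;
        for (Set<String> t : transactions) {
            if (t.containsAll(itemset)) {
                hits++;
            }
        }
        return (double) hits / transactions.size();
    }

    /**
     * confidence(A -> B) = support(A union B) / support(A).
     * Yields NaN when the antecedent never occurs.
     */
    static double confidence(List<Set<String>> transactions,
                             Set<String> antecedent, Set<String> consequent) {
        Set<String> both = new HashSet<>(antecedent);
        both.addAll(consequent);
        return support(transactions, both) / support(transactions, antecedent);
    }

    public static void main(String[] args) {
        List<Set<String>> baskets = Arrays.asList(
            basket("bread", "milk"),
            basket("bread", "butter"),
            basket("bread", "milk", "butter"),
            basket("milk"));
        System.out.println("support(bread, milk) = "
            + support(baskets, basket("bread", "milk")));
        System.out.println("confidence(bread -> milk) = "
            + confidence(baskets, basket("bread"), basket("milk")));
    }
}
```
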
Java Data Mining API (JDM)
Data-mining tools were traditionally provided in products with vendor-specific interfaces.
The Java Data Mining API (JDM) defines a common Java API to interact with data-mining systems.
Developed by the Java Community Process (JCP) data mining expert group.
JDM Current Versions
JDM 1.0 (JSR 73): final specification in August 2004
http://www.jcp.org/en/jsr/detail?id=73
JDM 2.0 (JSR 247): Early Review
http://www.jcp.org/en/jsr/detail?id=247
JDM is for the Java™ 2 Platform, Enterprise Edition (J2EE™) and Standard Edition (J2SE™).
Data Mining System
A typical data-mining system consists of
--a data-mining engine
--a repository that persists the data-mining artifacts, such as the models, created in the process.
The actual data is obtained via a database connection, or via a file-system API.
JDM Architectural Components
Application programming interface (API)
Data mining engine (DME), or data mining server (DMS): provides the infrastructure that offers a set of data mining services to its API clients.
Mining object repository (MOR): the DME uses a mining object repository, which serves to persist data mining objects.
Key JDM API benefit: abstracts out the physical components, tasks, and algorithms to Java classes.
Figure 1. Components of a data-mining system
Building a data-mining model
1. Decide what you want to learn.
2. Select and prepare your data.
3. Choose mining tasks and configure the mining algorithms.
4. Build your data-mining model.
5. Test and refine the models.
6. Report findings or predict future outcomes.
Usage of JDM API
Use JDM to explore the mining object repository (MOR) and find out which models and model-building parameters work best.
Follow a few simple steps that map the process to JDM interactions.
Build a Java data-mining GUI application.
Using the JDM API
1. Identify the data you wish to use to build your model (your build data) with a URL that points to that data.
2. Specify the type of model you want to build (a mining task such as clustering, classification, or association rules) and the parameters to the build process. Such parameters are termed build settings in JDM. The mining tasks are represented by API classes.
3. Create a logical representation of your data to select certain attributes of the physical data, and then map those attributes to logical values.
Using the JDM API (con't)
4. Specify the parameters to your data-mining algorithms.
5. Create a build task, and apply to that task the physical data references and the build settings.
6. Finally, execute the task. The outcome of that execution is your data model. That model will have a signature, a kind of interface, that describes the possible input attributes for later applying the model to additional data.
Using data model and results
Once you've created a model, you can test that model, and then even apply the model to additional data. Building, testing, and applying the model to additional data is an iterative process that, ideally, yields increasingly accurate models.
Those models can then be saved in the MOR, and used to either explain data, or to predict the outcome of new data in relation to your data-mining objective.
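
The testing step can be pictured as scoring a model against held-out cases whose outcomes are already known. The sketch below measures plain accuracy. It deliberately avoids the JDM classes: the HoldoutEvaluation class, the threshold rule standing in for a built model, and the test cases are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

final class HoldoutEvaluation {
    /** Fraction of test cases for which the model's prediction equals the known label. */
    static <X> double accuracy(Function<X, String> model, List<X> inputs, List<String> labels) {
        int correct = 0;
        for (int i = 0; i < inputs.size(); i++) {
            if (model.apply(inputs.get(i)).equals(labels.get(i))) {
                correct++;
            }
        }
        return (double) correct / inputs.size();
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for a built model: a fixed threshold rule.
        Function<Double, String> model = x -> x >= 5.0 ? "high" : "low";

        // Held-out test data with known outcomes (made up for this sketch).
        List<Double> testInputs = Arrays.asList(1.0, 4.0, 6.0, 9.0, 5.5);
        List<String> testLabels = Arrays.asList("low", "high", "high", "high", "low");

        System.out.println("accuracy = " + accuracy(model, testInputs, testLabels));
    }
}
```

In an iterative workflow, a score like this would be compared across rebuilt models with different build settings, keeping the one that performs best on the held-out data.
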
JDM Data Connection
Here a JDM connection is represented by the engine variable, which is of type javax.datamining.resource.Connection. JDM connections are very similar to JDBC connections, with one connection per thread.
PhysicalDataSetFactory dataSetFactory = (PhysicalDataSetFactory) engine.getFactory("javax.datamining.data.PhysicalDataSet");
JDM Data Connection (con't)
Build data is referenced via a PhysicalDataSet object, which, in turn, loads the data from a file or a database table, referenced with a URL.
PhysicalDataSet dataSet = dataSetFactory.create("file:///export/data/textFileData.data", true);
Code Example: Building a clustering model
// Assumes dmeConn is an open javax.datamining.resource.Connection
// and uri is a String that locates the build data.
import javax.datamining.*;
import javax.datamining.clustering.*;
import javax.datamining.data.*;
import javax.datamining.task.*;

// Create the physical representation of the data
PhysicalDataSetFactory pdsFactory = (PhysicalDataSetFactory) dmeConn.getFactory("javax.datamining.data.PhysicalDataSet");
PhysicalDataSet buildData = pdsFactory.create(uri, true);
dmeConn.saveObject("myBuildData", buildData, false);

// Create the logical representation of the data from physical data
LogicalDataFactory ldFactory = (LogicalDataFactory) dmeConn.getFactory("javax.datamining.data.LogicalData");
LogicalData ld = ldFactory.create(buildData);
dmeConn.saveObject("myLogicalData", ld, false);

// Create the settings to build a clustering model
ClusteringSettingsFactory csFactory = (ClusteringSettingsFactory) dmeConn.getFactory("javax.datamining.clustering.ClusteringSettings");
ClusteringSettings clusteringSettings = csFactory.create();
clusteringSettings.setLogicalDataName("myLogicalData");
clusteringSettings.setMaxNumberOfClusters(20);
Code Example: Building a clustering model (con't)
clusteringSettings.setMinClusterCaseCount(5);
dmeConn.saveObject("myClusteringBS", clusteringSettings, false);

// Create a task to build a clustering model with data and settings
BuildTaskFactory btFactory = (BuildTaskFactory) dmeConn.getFactory("javax.datamining.task.BuildTask");
BuildTask task = btFactory.create("myBuildData", "myClusteringBS", "myClusteringModel");
dmeConn.saveObject("myClusteringTask", task, false);

// Execute the task and check the status
ExecutionHandle handle = dmeConn.execute("myClusteringTask");
handle.waitForCompletion(Integer.MAX_VALUE); // wait until done
ExecutionStatus status = handle.getLatestStatus();
if (ExecutionState.success.equals(status.getState())) {
    // task completed successfully...
}
References
Java Data Mining Specification
http://www.jcp.org/en/jsr/detail?id=73
Mine Your Own Data with the JDM API, Frank Sommers, July 7, 2005
http://www.artima.com/lejava/articles/data_mining.html
http://www.stanford.edu/class/cs345a/#handouts