Data Mining Workbenches: an overview & comparison focusing on open-source packages. CS240B notes by C. Zaniolo.


2

Comparing KDD/DM Toolsets

Many packages and very few in-depth comparisons
An evaluation by the USDA Forest Service comparing R, WEKA, Orange, and SAS®
Several user-satisfaction/popularity surveys:
  KDnuggets
  Rexer Analytics Survey (annual)

3

An Evaluation of CART Programs by the USDA Forest Service (USFS)

USFS uses classification and regression-tree (CART) technology to map USFS Forest Inventory and Analysis (FIA) biomass, forest type, forest type groups, and National Forest vegetation.

The results of the study were reported by: B. Ruefenacht, G. Liknes, A. J. Lister, H. Fisk and Dan Wendt, “Evaluation of Open Source Data Mining Software Packages”, Proc. Symposium on Forest Inventory and Analysis (FIA), October 2008, Park City, UT.

http://www.fs.fed.us/rm/pubs/rmrs_p056/rmrs_p056_59_ruefenacht.pdf
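As a rough illustration of what a CART program does at each node of a classification tree, the sketch below searches one numeric feature for the threshold that minimizes the weighted Gini impurity of the two child nodes. This is a minimal sketch in plain Python, not code from the USFS study or from any of the packages compared; the function names and the toy canopy-cover data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best split 'value <= threshold'.

    The threshold is None when no split improves on the unsplit node.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical toy data: percent canopy cover and a vegetation class.
canopy_cover = [10, 20, 30, 60, 70, 80]
veg_class = ["shrub", "shrub", "shrub", "conifer", "conifer", "conifer"]
threshold, impurity = best_split(canopy_cover, veg_class)
```

A full CART implementation repeats this search over every feature, recurses on the two child nodes, and then prunes the tree; regression trees replace Gini impurity with a variance-based criterion.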




4

R: (http://www.r-project.org)

Developed at the University of Auckland, NZ, in 1993
Released under the GNU General Public License (GPL) in 1995
An extension of the S language (Bell Labs)
Twelve packages are supplied with the basic R distribution, each including many functions
http://cran.r-project.org offers 1,364 additional packages extending the basic R functionality

5

WEKA: www.cs.waikato.ac.nz/ml/weka/

Waikato Environment for Knowledge Analysis, by the University of Waikato, New Zealand, which supports the software with funding from the NZ government
Started in 1993 and released in 1996
A GPL package
WEKA is a collection of machine-learning algorithms implemented in Java, plus:
  data preprocessing tools
  visualization tools
  interface tools (R, SQL)

6

Orange: www.ailab.si/orange/

By the University of Ljubljana, Slovenia, in 2004, under GPL
Still evolving: frequent new releases
Main routines & libraries are in C++, but Python is used to call the routines and access the libraries (see www.ailab.si/orange/doc/ofb/)
Users can add their own machine-learning algorithms using both scripting and GUI environments
Orange also has a GUI version called Orange Canvas, which allows for interactive machine-learning “visual programming”

7

SAS® (Statistical Analysis System)

By Jim Goodnight and North Carolina State University associates in the early 1970s. In 1976 the SAS Institute was founded to distribute and further develop the increasingly popular software.

SAS® currently has 10,658 employees and is the largest privately held software company, with annual revenue of $2.15 billion (in 2007).

SAS® is used in 109 countries and in different industries, with 44,000 customer sites worldwide.

SAS® is purchased by contacting a distributor directly: it can cost several thousand dollars depending on the options. The purchase includes the software, technical support, and licenses, which are renewed regularly, incurring more costs.

8

Evaluation Criteria

Cost
Usability:
  How easy is the interface to use and understand?
  Is a variety of models and options available?
  How easy to use is the software’s programming language?
  Does the software integrate easily with other programs?
Performance w.r.t. speed, stability, and accuracy
Critical mass: how widespread is the software?
Uniqueness of useful features & algorithms
Defensibility w.r.t. citations and academic repute

9

Usability

SAS®: The Enterprise Guide for SAS® has a user-friendly GUI system that allows for the building of graphical models.
GUIs also exist for other SAS® modules, but unlike WEKA and Orange there is no universal GUI for SAS®.
SAS® is primarily driven by its own programming language; a new user will require some training.
R, like SAS®, is used by numerous industries and thus has a wide variety of models and options.
R is driven by its own scripting language, which does require some training and/or experience.
R has GUIs for specific functions only.

10

Usability (Cont.)

WEKA does have a comprehensive GUI with many models and options available.
WEKA’s GUI is easy to use, but users need a good understanding of modeling techniques.
Familiarity with Java is needed to extend WEKA and to integrate it with other software programs.
WEKA can also be extended and used from within R.

Orange: open-source data visualization and analysis for novices and experts. Data mining through visual programming or Python scripting.
Orange has a good website (http://www.ailab.si/orange/) on how to integrate Orange with Python.
The number of models and options available in Orange lags behind not only SAS® and R but WEKA as well.

11

Performance notes

R is significantly faster than WEKA and Orange on classification trees.

Orange is the least stable, although new versions are released monthly.

WEKA is a stable program, but it does not work well with large datasets. The WEKA project recently introduced MOA to process massive data sets in a stream-like mode.

12

Evaluation Results

13

Most Popular Data Mining Software

The Rexer Analytics Survey (early 2007) asked about the tools used often and occasionally.

Clearly more popular than the rest were:
  SPSS or SPSS Clementine (http://www.the-data-mine.com/bin/view/Software/SPSSClementine)
  "Own Code"
  SAS or SAS Enterprise Miner (http://www.the-data-mine.com/bin/view/Software/SASEnterpriseMiner)
Followed by:
  R (http://www.the-data-mine.com/bin/view/Software/R)
  Weka (http://www.the-data-mine.com/bin/view/Software/Weka)
  C4.5 / C5.0

14

Critical Mass and Popularity

Top ten most used packages by the KDnuggets Survey (May 2007):
  SPSS / SPSS Clementine (http://www.the-data-mine.com/bin/view/Software/SPSSClementine)
  Salford Systems CART/MARS/TreeNet/RF (http://www.the-data-mine.com/bin/view/Organizations/SalfordSystems)
  Yale, now RapidMiner (http://www.the-data-mine.com/bin/view/Software/RapidMiner)
  SAS / SAS Enterprise Miner (http://www.the-data-mine.com/bin/view/Software/SASEnterpriseMiner)
  Angoss Knowledge Studio / Knowledge Seeker (http://www.the-data-mine.com/bin/view/Software/ANGOSSKnowledgeSTUDIO, http://www.the-data-mine.com/bin/view/Software/ANGOSSKnowledgeSEEKER)
  KXEN (http://www.the-data-mine.com/bin/view/Software/KXEN)
  Weka (http://www.the-data-mine.com/bin/view/Software/Weka)
  R (http://www.the-data-mine.com/bin/view/Software/R)
  Microsoft SQL Server
  MATLAB

Note: Microsoft Excel is omitted as it is not really "data mining" software, and the tools offered by a single vendor (SPSS and SAS) have been merged.

The full survey results are available at http://www.kdnuggets.com/polls/2007/data_mining_software_tools.htm

15

Comments

Gregory Piatetsky-Shapiro, KDnuggets Editor: Votes from tool vendors were removed. Compared with the 2008 KDnuggets Poll on data mining tools/software used, the big changes are growth in SPSS, RapidMiner, and R.

16

Popular Data Mining Software (cont.)

The Rexer Analytics Survey is taken every year, and the summary report can be obtained free.

2009 SURVEY HIGHLIGHTS:
  Open-source tools Weka and R made substantial movement up data miners’ tool rankings this year, and are now used by large numbers of both academic and for-profit data miners.
  SAS Enterprise Miner dropped in data miners’ tool rankings.

2010 SURVEY HIGHLIGHTS:
  R: After a steady rise across the past few years, R overtook other tools to become the tool used by more data miners (43%) than any other.
  STATISTICA has also been climbing in the rankings.
  STATISTICA, IBM SPSS Modeler, and R received the strongest satisfaction ratings in both 2010 and 2009.

http://www.rexeranalytics.com/Data-Miner-Survey-Results-2010.html



17

18

Selected References

Witten, I.H.; Frank, E. Data Mining: Practical machine learning tools and techniques. 2nd Edition, Morgan Kaufmann, 2005.

Bouckaert, R. R. et al. WEKA Manual for Version 3.6.0, 2008.

Demsar, J.; Zupan, B.; Leban, G. “Orange: From experimental machine learning to interactive data mining”, 2004. (http://www.ailab.si/orange)

R Development Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, 2008.

http://biostat.mc.vanderbilt.edu/twiki/pub/Main/JeffreyHorner/JCGSR.pdf

19

About WEKA

Compared to R, WEKA is weaker in classical statistics but stronger in machine learning (data mining) algorithms.
WEKA has developed a set of extensions covering diverse areas, such as text mining, visualization and bioinformatics.
WEKA 3.6 includes support for importing PMML (Predictive Model Markup Language) models. PMML is an XML-based standard for expressing statistical and data mining models.
WEKA can interface with many systems and formats: SQL, LibSVM and SVM-Light,….
WEKA has 2 limitations:
  The Java implementation is somewhat slower than an equivalent in C/C++.
  Most of the algorithms require all the data to be stored in main memory, which restricts application to small or medium-sized datasets.

20

MOA: Massive Online Analysis

MOA supports bi-directional interaction with WEKA and addresses the problem of scaling up implementations of state-of-the-art algorithms to real-world dataset sizes by using a streaming setting.

MOA is a software environment for testing algorithms and running experiments for online learning from evolving data streams.

A data stream management system (DSMS) will then be required to deploy these algorithms on actual data streams; MOA is not a DSMS.
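To make the streaming setting concrete, here is a minimal sketch of test-then-train (prequential) evaluation, a common way to run experiments on data streams: each example is seen once, used first to test the current model and then to update it, so memory use does not grow with the length of the stream. This is plain Python and does not use the MOA API; the perceptron learner, the function names, and the toy stream are assumptions chosen for illustration.

```python
import random

class OnlinePerceptron:
    """Tiny incremental linear classifier; it stores only its weights."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        s = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1 if s > 0 else 0

    def learn(self, x, y):
        err = y - self.predict(x)  # -1, 0, or +1
        if err:
            self.w = [wi + self.lr * err * xi for wi, xi in zip(self.w, x)]
            self.b += self.lr * err

def prequential_accuracy(stream, learner):
    """Test-then-train: predict each example first, then train on it."""
    correct = total = 0
    for x, y in stream:
        correct += int(learner.predict(x) == y)
        learner.learn(x, y)
        total += 1
    return correct / total if total else 0.0

def toy_stream(n, seed=0):
    """Hypothetical stream: the label is 1 when x0 > x1."""
    rng = random.Random(seed)
    for _ in range(n):
        x = (rng.random(), rng.random())
        yield x, int(x[0] > x[1])

learner = OnlinePerceptron(n_features=2)
acc = prequential_accuracy(toy_stream(5000), learner)
```

Because `toy_stream` is a generator, the 5000 examples are never held in memory together, which is the property that batch algorithms requiring all data in main memory lack.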

21

Downloads available under the GNU GPL license

Several data sets (stream generators) used:
  SEA Concepts Generator: artificial dataset with abrupt concept drift
  STAGGER Concepts Generator by Schlimmer and Granger
  Rotating Hyperplane: used as testbed for CVFDT versus VFDT
  Random RBF Generator
  Waveform Generator
  Function Generator: introduced by Agrawal et al.
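To show what abrupt concept drift looks like in a generated stream, the following is a minimal sketch of a SEA-style generator. It assumes the setup usually described for the SEA concepts: three numeric attributes drawn uniformly from [0, 10], of which only the first two matter, with the class equal to 1 when their sum is at most a threshold that changes abruptly from one block to the next. The threshold values (8, 9, 7, 9.5), the omission of label noise, and the function name are assumptions of this sketch, not MOA's implementation.

```python
import random

def sea_stream(n_per_concept, thresholds=(8.0, 9.0, 7.0, 9.5), seed=0):
    """Yield (attributes, label, threshold) with an abrupt drift per block.

    Assumed setup: label = 1 when x0 + x1 <= threshold; x2 is irrelevant.
    """
    rng = random.Random(seed)
    for theta in thresholds:
        # Within one block the concept (threshold) is fixed; the drift
        # happens instantly at the block boundary.
        for _ in range(n_per_concept):
            x = tuple(rng.uniform(0.0, 10.0) for _ in range(3))
            yield x, int(x[0] + x[1] <= theta), theta

examples = list(sea_stream(n_per_concept=100))
```

A learner trained on the first block keeps using the old threshold after the boundary, so its error rate jumps there until it adapts; that jump is what drift-aware stream algorithms are designed to detect and recover from.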

MOA currently supports classification and clustering methods

The system is easily extensible and has a nice GUI

Good documentation:
  Albert Bifet, G. Holmes, R. Kirkby & B. Pfahringer: Data Stream Mining: A Practical Approach. May 2011. http://voxel.dl.sourceforge.net/project/moa-datastream/documentation/StreamMining.pdf
  Albert Bifet et al.: MOA: Massive Online Analysis, a Framework for Stream Classification and Clustering (2010). http://www.informatik.uni-trier.de/~ley/db/journals/jmlr/jmlrp11.html%23BifetHPKKJS10


