Data Mining Workbenches: an overview & comparison focusing on open-source packages
CS240B notes by C. Zaniolo

Dec 22, 2015

Comparing KDD/DM Toolsets

Many packages, but very few in-depth comparisons
An evaluation by the USDA Forest Service comparing R, WEKA, Orange, and SAS®
Several user-satisfaction/popularity surveys
  KDnuggets
  Rexer Analytics Survey (annual)

An Evaluation of CART Programs by the USDA Forest Service (USFS)

USFS uses classification and regression-tree (CART) technology to map USFS Forest Inventory and Analysis (FIA) biomass, forest type, forest type groups, and National Forest vegetation.

The results of the study were reported in: B. Ruefenacht, G. Liknes, A. J. Lister, H. Fisk, and Dan Wendt, “Evaluation of Open Source Data Mining Software Packages”, Proc. Symposium on Forest Inventory and Analysis (FIA), Park City, UT, October 2008.
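
As a rough illustration of the core step in CART-style tree building, the sketch below searches for the binary split that minimizes weighted Gini impurity. The function names (gini, best_split) and the toy records are hypothetical, invented for this example; this is not the USFS workflow or the code of any package compared in these notes.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) of the best binary split.

    If no split improves on the unsplit impurity, the first two entries are None.
    """
    n = len(rows)
    best = (None, None, gini(labels))
    for j in range(len(rows[0])):
        values = sorted({row[j] for row in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[j] <= t]
            right = [y for row, y in zip(rows, labels) if row[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, t, score)
    return best


# Made-up records with two numeric features each (illustration only).
rows = [(2.0, 10.0), (3.0, 40.0), (6.0, 15.0), (7.0, 35.0)]
labels = ["conifer", "conifer", "hardwood", "hardwood"]
print(best_split(rows, labels))
```

A full tree learner would apply best_split recursively to the left and right partitions and add a stopping rule; pruning, which CART also includes, is omitted here.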

R: (http://www.r-project.org)

By the University of Auckland, NZ, in 1993
Released under the GNU General Public License (GPL) in 1995
An extension of the S language (Bell Labs)
Twelve packages are supplied with the basic R distribution, each including many functions
http://cran.r-project.org offers 1,364 additional packages extending the basic R functionality

WEKA: www.cs.waikato.ac.nz/ml/weka/

Waikato Environment for Knowledge Analysis
By the University of Waikato, New Zealand, which supports the software with funds from the NZ government
Started in 1993 and released in 1996
A GPL package
WEKA is a collection of machine-learning algorithms implemented in Java, plus
  data preprocessing tools,
  visualization tools, and
  interface tools (R, SQL)

Orange: www.ailab.si/orange/

By the University of Ljubljana, Slovenia, in 2004, under GPL
Still evolving: frequent new releases

Main routines and libraries are in C++, but Python is used to call the routines and access the libraries (www.ailab.si/orange/doc/ofb/).

Users can add their machine-learning algorithms using both scripting and GUI environments

Orange also has a GUI version called Orange Canvas, which allows for interactive machine-learning “visual programming”.

SAS® (Statistical Analysis System)

Developed by Jim Goodnight and North Carolina State University associates in the early 1970s. In 1976 the SAS Institute was founded to distribute and further develop the increasingly popular software.

SAS® currently has 10,658 employees and is the largest privately held software company, with annual revenue of $2.15 billion (in 2007).

SAS® is used in 109 countries and across different industries, with 44,000 customer sites worldwide.

SAS® is purchased by contacting a distributor directly: it can cost several thousand dollars depending on the options. The purchase includes the software, technical support, and licenses, which are renewed regularly, incurring more costs.

Evaluation Criteria

Cost
Usability:
  How easy is the interface to use and understand?
  Is a variety of models and options available?
  How easy to use is the software’s programming language?
  Does the software integrate easily with other programs?
Performance w.r.t. speed, stability, and accuracy
Critical mass: how widespread is the software?
Uniqueness of useful features & algorithms
Defensibility w.r.t. citations and academic repute

Usability

SAS®:
  The Enterprise Guide for SAS® has a user-friendly GUI system that allows for the building of graphical models.
  GUIs also exist for other SAS® modules, but unlike WEKA and Orange there is no universal GUI for SAS®.
  SAS® is primarily driven by its own programming language; a new user will require some training.
R:
  R, like SAS®, is used by numerous industries and thus has a wide variety of models and options.
  R is driven by its own scripting language, which does require some training and/or experience.
  GUIs exist for specific functions only.

Usability (Cont.)

WEKA:
  WEKA does have a comprehensive GUI with many models and options available.
  WEKA’s GUI is easy to use, but users need a good understanding of modeling techniques.
  Familiarity with Java is needed to extend WEKA and to integrate it with other software programs.
  WEKA can be expanded and used within R.
Orange:
  “Open source data visualization and analysis for novice and experts. Data mining through visual programming or Python scripting.” (Orange website, http://www.ailab.si/orange/)
  Orange has a good website on how to integrate Orange with Python.
  The number of models and options available in Orange lags behind not only SAS® and R but WEKA as well.

Performance notes

R is significantly faster than WEKA and Orange on classification trees.

Orange is the least stable, although new versions are released monthly.

WEKA is a stable program, but it does not work well with large datasets. The WEKA developers recently introduced MOA to process massive data sets in a stream-like mode.

Evaluation Results

Most Popular Data Mining Software

Rexer Analytics Survey (early 2007) asked about the tools used often and occasionally.
Clearly more popular than the rest were:
  SPSS or SPSS Clementine
  "Own Code"
  SAS or SAS Enterprise Miner
Followed by:
  R
  Weka
  C4.5 / C5.0

Comments

Gregory Piatetsky-Shapiro, KDnuggets Editor: Votes from tool vendors were removed. Comparing with the 2008 KDnuggets Poll on data mining tools/software used, the big changes are growth in SPSS, RapidMiner, and R.

Popular Data Mining Software (cont.)

The Rexer Analytics Survey is taken every year, and the summary report can be obtained free.
2009 SURVEY HIGHLIGHTS:
  Open-source tools Weka and R made substantial movement up data miners’ tool rankings this year, and are now used by large numbers of both academic and for-profit data miners.
  SAS Enterprise Miner dropped in data miners’ tool rankings.
2010 SURVEY HIGHLIGHTS:
  R: After a steady rise across the past few years, R overtook other tools to become the tool used by more data miners (43%) than any other.
  STATISTICA has also been climbing in the rankings.
  STATISTICA, IBM SPSS Modeler, and R received the strongest satisfaction ratings in both 2010 and 2009.

Selected References

I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. 2nd edition, Morgan Kaufmann, 2005.

R. R. Bouckaert et al. WEKA Manual for Version 3.6.0. 2008.

J. Demsar, B. Zupan, and G. Leban. “Orange: From experimental machine learning to interactive data mining”, 2004. (http://www.ailab.si/orange)

R Development Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, 2008.

About Weka

Compared to R, WEKA is weaker in classical statistics but stronger in machine learning (data mining) algorithms.
WEKA has developed a set of extensions covering diverse areas, such as text mining, visualization, and bioinformatics.
WEKA 3.6 includes support for importing PMML (Predictive Model Markup Language) models. PMML is an XML-based standard for expressing statistical and data mining models.
WEKA can interface with many systems and formats: SQL, LibSVM and SVM-Light, …
WEKA has 2 limitations:
  The Java implementation is somewhat slower than an equivalent in C/C++.
  Most of the algorithms require all the data to be stored in main memory, which restricts application to small or medium-sized datasets.
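
The main-memory limitation can be made concrete by contrasting batch processing with an incremental computation. The sketch below assumes Welford's online algorithm as the illustration (it is not taken from WEKA): it maintains a running mean and variance in constant memory, seeing each value exactly once. RunningStats is a hypothetical name chosen for this example.

```python
class RunningStats:
    """Running mean and sample variance using Welford's online update."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        """Fold one new observation into the statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance (n - 1 denominator); 0.0 with fewer than two values."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for value in (1.0, 2.0, 3.0, 4.0):  # could equally be an unbounded generator
    stats.update(value)
print(stats.n, stats.mean, stats.variance)
```

Only three numbers are kept regardless of how many values arrive, whereas a batch algorithm must hold the whole dataset before it can start.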

MOA: Massive Online Analysis

MOA supports bi-directional interaction with WEKA to deal with scaling up the implementation of state-of-the-art algorithms to real-world dataset sizes in a streaming setting.

MOA is a software environment for testing algorithms and running experiments for online learning from evolving data streams.

A data stream management system (DSMS) will then be required to deploy these algorithms on actual data streams: MOA is not a DSMS.
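
To give a feel for what running an experiment on a data stream can look like, here is a minimal sketch of interleaved test-then-train (prequential) evaluation, a common scheme for online learners. Choosing this scheme is an assumption made for illustration; the code does not use MOA's API, and MajorityClass and prequential_accuracy are hypothetical names.

```python
from collections import Counter


class MajorityClass:
    """Trivial online learner: always predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        if not self.counts:
            return None  # nothing learned yet
        return self.counts.most_common(1)[0][0]

    def learn(self, x, y):
        self.counts[y] += 1


def prequential_accuracy(stream, learner):
    """Test each example first, then train on it; return the fraction predicted correctly."""
    correct = total = 0
    for x, y in stream:
        if learner.predict(x) == y:
            correct += 1
        learner.learn(x, y)
        total += 1
    return correct / total if total else 0.0


# Tiny made-up stream of (features, label) pairs.
toy_stream = [((0,), "a"), ((1,), "a"), ((2,), "b"), ((3,), "a")]
print(prequential_accuracy(toy_stream, MajorityClass()))
```

Because every example is used for testing before training, no separate hold-out set is needed and the stream is read only once.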

Downloads available under the GNU GPL license
Several data sets used:
  SEA Concepts Generator: artificial dataset with abrupt concept drift
  STAGGER Concepts Generator, by Schlimmer and Granger
  Rotating Hyperplane: used as testbed for CVFDT versus VFDT
  Random RBF Generator
  Waveform Generator
  Function Generator: introduced by Agrawal et al.
MOA currently supports classification and clustering methods
System is easily extensible and has a nice GUI
Good documentation:
  Albert Bifet, G. Holmes, R. Kirkby & B. Pfahringer: Data Stream Mining: A Practical Approach. May 2011.
  Albert Bifet et al.: MOA: Massive Online Analysis, a Framework for Stream Classification and Clustering (2010)
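
To show what an artificial stream with abrupt concept drift looks like, the following sketch generates SEA-style data. The details are assumptions based on the usual description of the SEA concepts (three attributes uniform in [0, 10], class decided by whether the first two sum to at most a threshold, thresholds 8, 9, 7, 9.5, about 10% label noise), and sea_stream is a hypothetical function, not MOA's generator.

```python
import random


def sea_stream(n_per_concept, thresholds=(8.0, 9.0, 7.0, 9.5), noise=0.1, seed=0):
    """Yield (features, label, threshold) tuples with one abrupt drift per threshold change.

    The threshold values and noise rate are assumed defaults, not taken from the notes.
    """
    rng = random.Random(seed)
    for theta in thresholds:
        for _ in range(n_per_concept):
            x = tuple(rng.uniform(0.0, 10.0) for _ in range(3))
            # Only the first two attributes matter; the third is irrelevant.
            label = int(x[0] + x[1] <= theta)
            if rng.random() < noise:
                label = 1 - label  # flip the label to simulate class noise
            yield x, label, theta


examples = list(sea_stream(50, noise=0.0))
print(len(examples), examples[49][2], examples[50][2])
```

A learner trained on the first block sees its decision boundary become wrong the moment the threshold changes, which is what makes such streams useful for testing drift handling.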