
Scalable Benchmarks and Kernels for Data Mining and Analytics

Vipin Kumar

University of Minnesota [email protected]

www.cs.umn.edu/~kumar

Joint work with Alok Choudhary and Gokhan Memik (Northwestern) and Michael Steinbach (University of Minnesota)

Research funded by NSF







Need for High Performance Data Mining

Today’s digital society has seen enormous data growth in both commercial and scientific databases

Data Mining is becoming a commonly used tool to extract information from large and complex datasets

Advances in computing capabilities and technological innovation are needed to harvest the available wealth of data

Computational Simulations

Internet

Sensor Networks

Geo-spatial data

Biomedical Data

Homeland Security

[Figure: gridded global data for SST, precipitation, NPP, and pressure; axes: longitude, latitude, and time; labels: grid cell, zone]

Data Mining for Climate Data

NASA ESE questions: How is the global Earth system changing?

What are the primary forcings?

How does the Earth system respond to natural & human-induced changes?

What are the consequences of changes in the Earth system?

How well can we predict future changes?

Global snapshots of values for a number of variables on land surfaces or water

NASA DATA MINING REVEALS A NEW HISTORY OF NATURAL DISASTERS

NASA is using satellite data to paint a detailed global picture of the interplay among natural disasters, human activities and the rise of carbon dioxide in the Earth's atmosphere during the past 20 years….

http://www.nasa.gov/centers/ames/news/releases/2003/03_51AR.html

High Resolution EOS Data:

• EOS satellites provide high resolution measurements
  • Finer spatial grids
    • 1 km × 1 km grid produces 694,315,008 data points
    • Going from 0.5° × 0.5° data to 1 km × 1 km data results in a 2500-fold increase in the data size
  • More frequent measurements
  • Multiple instruments

• High resolution data allows us to answer more detailed questions:
  • Detecting patterns such as trajectories, fronts, and movements of regions with uniform properties
  • Finding relationships between leaf area index (LAI) and topography of a river drainage basin
  • Finding relationships between fire frequency and elevation as well as topographic position

• Leads to substantially higher computational and memory requirements

Disturbance Viewer

This interactive module displays the locations on the earth surface where significant disturbance events have been detected.

Detection of Ecosystem Disturbances:

Data Mining for Cyber Security

• Due to the proliferation of the Internet, more and more organizations are becoming vulnerable to sophisticated cyber attacks

• Traditional Intrusion Detection Systems (IDS) have well-known limitations
  – Too many false alarms
  – Unable to detect sophisticated and novel attacks
  – Unable to detect insider abuse / policy abuse

• Data Mining is well suited to address these challenges


• Incorporated into Interrogator architecture at ARL Center for Intrusion Monitoring and Protection (CIMP)

• Helps analyze data from multiple sensors at DoD sites around the country

• Routinely detects Insider Abuse / Policy Violations / Worms / Scans

Large Scale Data Analysis is needed for

• Correlation of suspicious events across network sites

– Helps detect sophisticated attacks not identifiable by single site analyses

• Analysis of long term data (months/years)

– Uncover suspicious stealth activities (e.g. insiders leaking/modifying information)

MINDS – Minnesota Intrusion Detection System

http://www.caida.org/outreach/papers/2003/sapphire/sql-after.gif

Data Mining for Biomedical Informatics

Recent technological advances are helping to generate large amounts of both medical and genomic data

• High-throughput experiments/techniques
  - Gene and protein sequences
  - Gene-expression data
  - Biological networks and phylogenetic profiles

• Electronic Medical Records
  - IBM-Mayo clinic partnership has created a DB of 5 million patients
  - NIH Roadmap

Data mining offers a potential solution for the analysis of large-scale data

• Automated analysis of patient histories for customized treatment

• Design of drugs/chemicals

• Prediction of the functions of anonymous genes

Protein Interaction Network

Role of Benchmarks in Architecture Design

Benchmarks guide the development of new processor architectures in addition to measuring the relative performance of different systems

• SPEC: General purpose architecture (“Advances in the microprocessor industry would not have been possible without the SPEC benchmarks” - David Patterson)

• TPC: Database Systems

• SPLASH: Parallel machine architectures

• Mediabench: Media and Communication Processors

• NetBench: Network/Embedded processors

Do We Need Benchmarks Specific to Data Mining?

Performance metrics of several benchmarks gathered from VTune
• Cache miss ratios, bus usage, page faults, etc.

Benchmark applications were grouped using Kohonen clustering to spot trends:

[Figure: Kohonen cluster number (0 to 11) assigned to each benchmark application, grouped by suite.
SPEC INT: gcc, bzip2, gzip, mcf, twolf, vortex, vpr, parser.
SPEC FP: apsi, art, equake, lucas, mesa, mgrid, swim, wupwise.
MediaBench: rawcaudio, epic, encode, cjpeg, mpeg2, pegwit, gs, toast.
TPC-H: Q17, Q3, Q4, Q6.
MineBench: apriori, bayesian, birch, eclat, hop, scalparc, kMeans, fuzzy, rsearch, semphy, snp, genenet, svm-rfe.]

Reference: [Pisharath J., Zambreno J., Ozisikyilmaz B., Choudhary A., 2006]

Recently funded NSF project: Scalable Benchmarks, Software and Data for Data Mining, Analytics and Scientific Discoveries

PIs: A. Choudhary and Gokhan Memik (NW) , V. Kumar and M. Steinbach (UM)

Goal: Establish a comprehensive benchmarking suite for data mining applications.

Motivate the development of new processor architectures and system design for data mining

Motivate the implementation of more sophisticated data mining algorithms that can work with the constraints imposed by current architecture designs

Improve the productivity of scientists and engineers using data mining applications in a wide variety of domains

[Figure: Profiling dimensions of the benchmark suite:
• Types of data (streaming, file I/O)
• Types of applications (scientific, bioinformatics, security, …)
• Types of storage (memory, disks, …)
• Scalability (data-level, processor)
• Performance (execution time, cache behavior, …)]

Data Mining Tasks …

Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes
12    Yes      Divorced         220K             No
13    No       Single           85K              Yes
15    No       Single           90K              Yes

[Figure: the data table feeds four data mining tasks: predictive modeling, clustering, association rules, and anomaly detection]

Key Data Mining Algorithms

Clustering
• K-means, EM, SOM
• Single link / Group Average hierarchical clustering
• DBSCAN, SNN

Classification
• Bayes
• SVM
• Decision trees, Rule based systems

Association Rule Mining
• Apriori, FP-Growth

Anomaly Detection
• Statistical methods
• Distance-based
• Clustering-based

Preprocessing
• SVD, PCA

Major Data Mining Kernels

Counting
• Given a set of data records, count types of different categories to build a contingency table
• Count the occurrence of a set of items in a set of transactions

Pairwise computations
• Given a set of data records, perform pairwise distance/similarity computations

Linear Algebra operations
• SVD, PCA
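The counting and pairwise kernels listed above can be illustrated with a minimal Python sketch. The function names, the toy records, and the choice of Euclidean distance are assumptions made for illustration; they are not taken from the MineBench code.

```python
from collections import Counter
from itertools import combinations
import math


def contingency_table(records, attr, label):
    """Counting kernel: tally (attribute value, class label) pairs."""
    return Counter((r[attr], r[label]) for r in records)


def pairwise_distances(points):
    """Pairwise kernel: Euclidean distance for every unordered pair of points."""
    return {
        (i, j): math.dist(points[i], points[j])
        for i, j in combinations(range(len(points)), 2)
    }


# Hypothetical toy inputs, not taken from the slides.
records = [
    {"refund": "Yes", "cheat": "No"},
    {"refund": "No", "cheat": "No"},
    {"refund": "No", "cheat": "Yes"},
    {"refund": "No", "cheat": "Yes"},
]
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]

table = contingency_table(records, "refund", "cheat")
dists = pairwise_distances(points)
print(table)
print(dists)
```

The counting kernel makes one pass over the records, while the pairwise kernel touches every pair and so grows quadratically with the number of records.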

General Characteristics of Data Mining Algorithms

Dense/Sparse data

Hash table / Hash tree

Linked Lists

Iterative nature

Data often too large to fit in main memory
• Spatial locality is critical

Constructing a Decision Tree

Tid   Employed   Level of Education   # years at present address   Credit Worthy
1     Yes        Graduate             5                            Yes
2     Yes        High School          2                            No
3     No         Undergrad            1                            No
4     Yes        High School          10                           Yes
5     Yes        Graduate             2                            No
6     No         High School          2                            No
7     Yes        Undergrad            3                            No
10    No         Graduate             1                            No

[Figure: candidate splits of the training records on Employed (Yes / No) and on Education (Graduate / High School or Undergrad). For Employed = Yes: Worthy: 4, Not Worthy: 3.]

Key Computation

                 Worthy   Not Worthy
Employed = Yes   4        3
Employed = No    0        3
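To show how the contingency table from Key Computation (Employed = Yes: 4 worthy, 3 not worthy; Employed = No: 0 worthy, 3 not worthy) could be used to score a split, here is a small sketch. The Gini index is an assumption: the slides do not say which impurity measure is used, and the function names are hypothetical.

```python
def gini(counts):
    """Gini impurity of one node from its class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def split_gini(table):
    """Weighted Gini impurity of the children produced by a split."""
    n = sum(sum(row) for row in table.values())
    return sum(sum(row) / n * gini(row) for row in table.values())


# (Worthy, Not Worthy) counts per value of Employed, from the Key Computation table.
employed = {"Yes": (4, 3), "No": (0, 3)}

parent = gini((4, 6))          # class totals before the split
after = split_gini(employed)   # impurity after splitting on Employed
print(round(parent, 4), round(after, 4))
```

A lower weighted impurity after the split means the attribute separates the classes better. The score needs only the counts, which is why counting dominates the work at each tree node.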

Constructing a Decision Tree

[Figure: the training records are partitioned into the Employed = Yes and Employed = No subsets.]

Constructing a Decision Tree in Parallel

Partitioning of data only
– global reduction per node is required
– large number of classification tree nodes gives high communication cost

[Figure: n records with m categorical attributes, with three example contingency tables:]

        Worthy   Not Worthy
Yes     4        3
No      0        3

        Worthy   Not Worthy
Yes     2        5
No      1        2

        Worthy   Not Worthy
Yes     6        1
No      1        2
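The data-partitioning scheme can be sketched as follows: each processor counts its own records, and the local tables are summed in a global reduction for the tree node. This is a single-process simulation under stated assumptions. The three partitions are hypothetical, and the reduction stands in for an all-reduce on a real parallel machine.

```python
from collections import Counter
from functools import reduce


def local_counts(records):
    """Count (attribute value, class label) pairs in one processor's partition."""
    return Counter(records)


def global_reduction(partials):
    """Sum the per-processor tables; stands in for an all-reduce."""
    return reduce(lambda a, b: a + b, partials, Counter())


# Hypothetical (Employed, Credit Worthy) records split across three "processors".
partitions = [
    [("Yes", "Worthy"), ("Yes", "Not Worthy"), ("No", "Not Worthy")],
    [("Yes", "Worthy"), ("No", "Not Worthy"), ("Yes", "Not Worthy")],
    [("Yes", "Worthy"), ("Yes", "Worthy"), ("Yes", "Not Worthy"), ("No", "Not Worthy")],
]

partials = [local_counts(p) for p in partitions]
table = global_reduction(partials)
for key in sorted(table):
    print(key, table[key])
```

One such reduction is needed per tree node, so the communication cost grows with the number of nodes, as the slide notes.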

Constructing a Decision Tree in Parallel

Partitioning of classification tree nodes
– natural concurrency
– load imbalance
– the amount of work associated with each node varies
– limited concurrency on the upper portion of the tree
– child nodes use the same data as used by parent node
– loss of locality
– high data movement cost

[Figure: a tree with 10,000 training records at the root, split into nodes of 7,000 and 3,000 records, then into nodes of 2,000, 5,000, 2,000, and 1,000 records]

Speedup Comparison of the Three Parallel Algorithms

Data set used in SLIQ paper (Ref: Mehta, Agrawal and Rissanen, 1996)

IBM SP2 with 128 processors

[Figure: two speedup plots, for 0.8 million examples and 1.6 million examples, each comparing the hybrid, data partitioning, and tree partitioning algorithms]

Dynamic load balancing inspired by parallel sparse Cholesky factorization and parallel tree search

Speedup of the Hybrid Algorithm with Different Size Data Sets

Hash Table Access

• Some efficient decision tree algorithms require random access to large data structures.

• Example: SPRINT (Ref: Shafer, Agrawal, Mehta, 1996)

ID   Income
0    25K
2    28K
8    30K
4    30K
5    35K
1    50K
3    52K
6    55K
7    70K

ID   Age
2    25
5    31
8    33
1    37
3    41
6    52
4    55
7    60
0    61

Hash Table

ID   Left/Right
0    Left
1    Left
2    Right
3    Right
4    Right
5    Left
6    Right
7    Left
8    Left

[Figure labels: Processor P0, Processor P1, Processor P2]

[Figure: a node's records are split into Left and Right children; processors P0, P1, and P2 are each shown with the Income and Age attribute lists and the ID to Left/Right hash table]

Storing the entire hash table on one processor makes the algorithm unscalable

[Figure: the full ID to Left/Right hash table and the Income and Age attribute lists on Processor P0]
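The role of the hash table can be sketched with the Income and Age lists and the Left/Right table from the slides. Once a split is chosen, each sorted attribute list is divided between the two children by probing the table with the record ID. This is a simplified illustration, not the SPRINT implementation; the function name is hypothetical and incomes are written in thousands.

```python
def split_attribute_list(attr_list, node_of):
    """Split one sorted attribute list into Left/Right lists, preserving order."""
    left, right = [], []
    for rid, value in attr_list:
        # Random access by record ID: this probe is the costly step.
        (left if node_of[rid] == "Left" else right).append((rid, value))
    return left, right


# (record ID, value) pairs sorted by value, as on the slides.
income = [(0, 25), (2, 28), (8, 30), (4, 30), (5, 35),
          (1, 50), (3, 52), (6, 55), (7, 70)]
age = [(2, 25), (5, 31), (8, 33), (1, 37), (3, 41),
       (6, 52), (4, 55), (7, 60), (0, 61)]

# Hash table from record ID to child node, as on the slides.
node_of = {0: "Left", 1: "Left", 2: "Right", 3: "Right", 4: "Right",
           5: "Left", 6: "Right", 7: "Left", 8: "Left"}

income_left, income_right = split_attribute_list(income, node_of)
age_left, age_right = split_attribute_list(age, node_of)
print(income_left, income_right)
print(age_left, age_right)
```

Because the lists are sorted by value rather than by ID, the probes hit the table in an essentially random order, and the table has one entry per training record.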

ScalParC (Ref: Joshi, Karypis, Kumar, 1998)

ScalParC is a scalable parallel decision tree construction algorithm
• Scales to a large number of processors
• Scales to large training sets

ScalParC is memory efficient
• The hash table is distributed among the processors

ScalParC performs a minimal amount of communication

This ScalParC design is inspired by the communication structure of parallel sparse matrix-vector algorithms

[Figure: hash table entries distributed among processors P0, P1, and P2, and the communication pattern between them]
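A distributed hash table of this kind can be simulated in a single process as shown below. The block distribution of record IDs, the class name, and the batching of requests per owner are all assumptions made for illustration; the slides do not specify how ScalParC assigns entries to processors.

```python
from collections import defaultdict


class DistributedHashTable:
    """Simulated hash table whose entries are spread over several processors."""

    def __init__(self, num_records, num_procs):
        self.block = -(-num_records // num_procs)  # ceiling division
        self.shards = [dict() for _ in range(num_procs)]

    def owner(self, rid):
        """Processor that stores the entry for this record ID (block distribution)."""
        return rid // self.block

    def bulk_update(self, assignments):
        """Group updates by owner, then deliver each group in one message."""
        outbox = defaultdict(list)
        for rid, side in assignments:
            outbox[self.owner(rid)].append((rid, side))
        for proc, messages in outbox.items():
            self.shards[proc].update(messages)

    def bulk_lookup(self, rids):
        """Group queries by owner and collect the answers."""
        requests = defaultdict(list)
        for rid in rids:
            requests[self.owner(rid)].append(rid)
        answers = {}
        for proc, batch in requests.items():
            for rid in batch:
                answers[rid] = self.shards[proc][rid]
        return answers


# Left/Right assignments for record IDs 0 to 8, as in the slides' hash table.
sides = ["Left", "Left", "Right", "Right", "Right", "Left", "Right", "Left", "Left"]
table = DistributedHashTable(num_records=9, num_procs=3)
table.bulk_update(enumerate(sides))
print([len(shard) for shard in table.shards])
print(table.bulk_lookup([2, 5, 8]))
```

Each simulated processor stores only its own block of entries, so memory per processor shrinks as processors are added. Batching requests by owner gives an exchange pattern loosely similar to that of a parallel sparse matrix-vector product.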

Parallel Runtime (Ref: Joshi, Karypis, Kumar, 1998)

[Figure: runtime in seconds (0 to 120) versus number of processors (0 to 150) on a 128 Processor Cray T3D, with one curve per training set size: 0.2M, 0.4M, 0.8M, 1.6M, 3.2M, and 6.4M]

Computing Association Patterns

1. Market-basket transactions

2. Find item combinations (itemsets) that occur frequently in data

{Diaper}{Bread}

{Diaper, Milk} → {Beer}

3. Generate association rules

Counting Candidates

Frequent Itemsets are found by counting candidates

Simple way:
• Search for each candidate in each transaction

Naïve approach requires O(NM) comparisons

Reduce the number of comparisons (NM) by using hash tables to store the candidate itemsets

Transactions (N):
A B C D
A C E
B C D
A B D E
B C E
B D

Candidates (M), each with a count initialized to 0:
A B
A C
A D
A E
B C
B D
A B E
B C D
A B D E
A B C D E

[Figure: animation in which the candidate counts are incremented as each transaction is scanned]
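The two counting strategies can be compared on the six transactions and ten candidates shown on the slide. The hashed version below stores the candidates in a Python dict and looks up subsets of each transaction. That is a simplified stand-in for the hash tree or hash table of a real Apriori implementation, and the function names are hypothetical.

```python
from itertools import combinations

transactions = [set("ABCD"), set("ACE"), set("BCD"),
                set("ABDE"), set("BCE"), set("BD")]
candidates = [frozenset(c) for c in
              ("AB", "AC", "AD", "AE", "BC", "BD", "ABE", "BCD", "ABDE", "ABCDE")]


def count_naive(transactions, candidates):
    """Test every candidate against every transaction: N * M comparisons."""
    counts = dict.fromkeys(candidates, 0)
    comparisons = 0
    for t in transactions:
        for c in candidates:
            comparisons += 1
            if c <= t:  # candidate is a subset of the transaction
                counts[c] += 1
    return counts, comparisons


def count_hashed(transactions, candidates):
    """Enumerate each transaction's subsets and look them up in a hash table."""
    counts = dict.fromkeys(candidates, 0)
    sizes = sorted({len(c) for c in candidates})
    for t in transactions:
        items = sorted(t)
        for k in sizes:
            for subset in combinations(items, k):
                key = frozenset(subset)
                if key in counts:  # expected constant-time lookup
                    counts[key] += 1
    return counts


naive_counts, comparisons = count_naive(transactions, candidates)
hashed_counts = count_hashed(transactions, candidates)
print(comparisons)
for c in candidates:
    print("".join(sorted(c)), naive_counts[c], hashed_counts[c])
```

With the hash table, the work per transaction depends on the transaction's length and the candidate sizes rather than on M, which helps when the candidate set is large.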

Parallel Association Rules: Scaleup Results (100K,0.25%) (Ref: Han, Karypis, and Kumar, 2000)

DD (Agrawal & Shafer, 1996)

IDD (Han, Karypis, Kumar, 2000)

HD (Han, Karypis, Kumar, 2000)

Efficient implementation of collective communication

Dynamic restructuring of computation

Candidates for MineBench

Algorithm | Category | Description | Lang. | Parallel
PCA | Preprocessing | Principal component analysis | C/C++/FORT. | Y
ABB | Preprocessing | Automatic Branch and Bound | C/C++ | N
LVF | Preprocessing | A probabilistic feature selection algorithm | C/C++ | N
Normalization | Preprocessing | Variable transformation | C/C++ | Y
ScalParC | Predictive Modeling | Decision tree classifier | C | Y
Naïve Bayesian | Predictive Modeling | Statistical classifier based on class conditional independence | C++ | N
RIPPER | Predictive Modeling | Rule-based predictive modeling | C/C++ | Y
SVMlight | Predictive Modeling | Support Vector Machines | C/C++ | N
K-means | Clustering | Partitioning method | C | Y
Bisecting K-means | Clustering | Partitioning method | C | Y
Fuzzy K-means | Clustering | Fuzzy logic based K-means | C | Y
EM Clustering | Clustering | Partitioning method | C/C++ | Y
MAFIA(N) | Clustering | Multidimensional Clustering | C | Y
BIRCH | Clustering | Hierarchical method | C++ | N
AHC | Clustering | Agglomerative Hierarchical Clustering | C/C++ | N
DBSCAN | Clustering | Density-based method | C/C++ | Y
HOP | Clustering | Density-based method | C | Y
LOF | Anomaly Detection | Local Outlier Factor | C/C++ | Y
Outlier Detection | Anomaly Detection | Distance-based outlier detection | C/C++ | Y
Apriori | ARM | Horizontal database, level-wise mining based on Apriori property | C/C++ | Y
MAFIA(C) | ARM | Maximal frequent itemset mining | C/C++ | N
Eclat | ARM | Vertical database, break large search space into equivalence classes | C++ | N
FP-growth | ARM | Encodes database into a compact FP-tree | C/C++ | N

Analysis of Benchmark Algorithms

Explore the bottlenecks associated with the current general purpose sequential and parallel machines

Explore how different architectural features impact the performance of data mining algorithms

Preliminary Evaluation of Some Sample Data Sets

Example small (S), medium (M), and large (L) data set

Dataset | Classification: Parameter | Classification: DB Size (MB) | ARM: Parameter | ARM: DB Size (MB)
Small | F26-A32-D125K | 27 | T10-I4-D1000K | 47
Medium | F26-A32-D250K | 54 | T20-I6-D2000K | 175
Large | F26-A64-D250K | 108 | T20-I6-D4000K | 350

Execution time for some algorithms in the MineBench suite.

Program | S: P1 | S: P4 | S: P8 | M: P1 | M: P4 | M: P8 | L: P1 | L: P4 | L: P8
HOP | 6.3 | 1.8 | 1.2 | 52.7 | 27.4 | 18.7 | 435.3 | 128.0 | 81.5
K-means | 5.7 | 2.0 | 1.3 | 12.9 | 3.3 | 2.6 | - | - | -
Fuzzy K-means | 164.1 | 54.6 | 26.4 | 146.8 | 42.7 | 27.1 | - | - | -
BIRCH | 3.5 | - | - | 31.7 | - | - | 172.6 | - | -
ScalParC | 51.0 | 13.5 | 10.4 | 110.6 | 28.5 | 21.6 | 225.9 | 56.2 | 36.5
Bayesian | 12.6 | - | - | 25.1 | - | - | 51.5 | - | -
Apriori | 6.1 | 3.0 | 2.6 | 102.7 | 38.6 | 30.5 | 200.2 | 72.6 | 63.0
Eclat | 11.8 | - | - | 81.5 | - | - | 127.8 | - | -

Reference: [Liu Y., Pisharath J., Liao W., Memik G., Choudhary A., Dubey P., 2004]
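The execution times can be turned into speedup and parallel efficiency with a few lines of Python. The sketch uses the large (L) data set figures for the three programs that report all of P1, P4, and P8, and assumes that P1, P4, and P8 denote runs on 1, 4, and 8 processors. Speedup is taken as T(1) / T(p) and efficiency as speedup / p, which are the standard definitions rather than something stated on the slide.

```python
# Execution times on the large data set for P1, P4, and P8, from the table.
times = {
    "HOP": (435.3, 128.0, 81.5),
    "ScalParC": (225.9, 56.2, 36.5),
    "Apriori": (200.2, 72.6, 63.0),
}


def speedup_and_efficiency(t1, tp, p):
    """Return (speedup, efficiency) for a run on p processors."""
    s = t1 / tp
    return s, s / p


results = {}
for name, (t1, t4, t8) in times.items():
    results[name] = {
        4: speedup_and_efficiency(t1, t4, 4),
        8: speedup_and_efficiency(t1, t8, 8),
    }

for name, by_p in results.items():
    for p, (s, e) in by_p.items():
        print(f"{name:9s} P{p}: speedup {s:.2f}, efficiency {e:.2f}")
```

Efficiency well below 1 at eight processors points to communication or load-balance overheads of the kind the benchmark study is meant to expose.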

Designing Efficient Kernels for Data Mining

Frequency of Kernel Operations in Representative Applications

Understanding the bottlenecks in executing DM algorithms on current architectures will help design new, more efficient algorithms

Focus will be on designing the frequently used kernels that dominate the execution time of most DM algorithms

Both sequential and parallel versions will be developed

Reference: [Pisharath J., Zambreno J., Ozisikyilmaz B., Choudhary A., 2006]

Conclusions

Data mining applications are becoming increasingly important

Current system design approaches are not adequate for DM applications

MineBench – a new benchmark suite which encompasses many algorithms found in data mining

Initial findings:
• Data mining applications are unique in terms of performance characteristics
• There exists much room for optimization with regards to data mining workloads

Bibliography

• Introduction to Data Mining, Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Addison-Wesley, April 2005

• Introduction to Parallel Computing (Second Edition), by Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar. Addison-Wesley, 2003

• Data Mining for Scientific and Engineering Applications, edited by R. Grossman, C. Kamath, W. P. Kegelmeyer, V. Kumar, and R. Namburu, Kluwer Academic Publishers, 2001

• J. Han, R. B. Altman, V. Kumar, H. Mannila, and D. Pregibon, "Emerging Scientific Applications in Data Mining", Communications of the ACM, Volume 45, Number 8, pp 54-58, August 2002

• C. Potter, P. Tan, M. Steinbach, S. Klooster, V. Kumar, R. Myneni, V. Genovese, Major Disturbance Events in Terrestrial Ecosystems Detected using Global Satellite Data Sets, Global Change Biology 9 (7), 1005-1021, 2003

• Vipin Kumar, "Parallel and Distributed Computing for Cyber Security". An article based on the keynote talk by the author at 17th International Conference on Parallel and Distributed Computing Systems (PDCS-2004). DS Online Journal, Volume 6, Number 10, October 2005

• Ying Liu, Jayaprakash Pisharath, Wei-keng Liao, Gokhan Memik, Alok Choudhary, and Pradeep Dubey. Performance Evaluation and Characterization of Scalable Data Mining Algorithms. In Proceedings of the 16th International Conference on Parallel and Distributed Computing and Systems (PDCS), November 2004.

• Joseph Zambreno, Berkin Ozisikyilmaz, Jayaprakash Pisharath, Gokhan Memik, and Alok Choudhary. Performance Characterization of Data Mining Applications using MineBench. In Proceedings of the 9th Workshop on Computer Architecture Evaluation using Commercial Workloads (CAECW-9), February 2006.

• Jayaprakash Pisharath, Joseph Zambreno, Berkin Ozisikyilmaz, and Alok Choudhary. Accelerating Data Mining Workloads: Current Approaches and Future Challenges in System Architecture Design. In Proceedings of the 9th International Workshop on High Performance and Distributed Mining (HPDM), April 2006.
