Scalable Benchmarks and Kernels for Data Mining and Analytics
Vipin Kumar, University of Minnesota
kumar@cs.umn.edu
www.cs.umn.edu/~kumar
Joint work with Alok Choudhary and Gokhan Memik (Northwestern) and Michael Steinbach (University of Minnesota)
Research funded by NSF
Today’s digital society has seen enormous data growth in both commercial and scientific databases
Data Mining is becoming a commonly used tool to extract information from large and complex datasets
Advances in computing capabilities and technological innovation are needed to harvest the available wealth of data
Computational Simulations
Internet
Sensor Networks
Geo-spatial data
Biomedical Data
Homeland Security
[Figure: climate variables (SST, Precipitation, NPP, Pressure) stored per grid cell by longitude, latitude, and time; labels: grid cell, zone]
Data Mining for Climate Data
NASA ESE questions: How is the global Earth system changing?
What are the primary forcings?
How does the Earth system respond to natural & human-induced changes?
What are the consequences of changes in the Earth system?
How well can we predict future changes?
Global snapshots of values for a number of variables on land surfaces or water
NASA DATA MINING REVEALS A NEW HISTORY OF NATURAL DISASTERS
NASA is using satellite data to paint a detailed global picture of the interplay among natural disasters, human activities and the rise of carbon dioxide in the Earth's atmosphere during the past 20 years….
• EOS satellites provide high resolution measurements
  – Finer spatial grids
    – 1 km × 1 km grid produces 694,315,008 data points
    – Going from 0.5º × 0.5º data to 1 km × 1 km data results in a 2500-fold increase in the data size
  – More frequent measurements
  – Multiple instruments
• High resolution data allows us to answer more detailed questions:
  – Detecting patterns such as trajectories, fronts, and movements of regions with uniform properties
  – Finding relationships between leaf area index (LAI) and topography of a river drainage basin
  – Finding relationships between fire frequency and elevation as well as topographic position
• Leads to substantially higher computational and memory requirements
Disturbance Viewer
This interactive module displays the locations on the earth surface where significant disturbance events have been detected.
Detection of Ecosystem Disturbances:
Data Mining for Cyber Security
• Due to the proliferation of the Internet, more and more organizations are becoming vulnerable to sophisticated cyber attacks
• Traditional Intrusion Detection Systems (IDS) have well-known limitations
– Too many false alarms
– Unable to detect sophisticated and novel attacks
– Unable to detect insider abuse / policy abuse
• Data Mining is well suited to address these challenges
• Incorporated into Interrogator architecture at ARL Center for Intrusion Monitoring and Protection (CIMP)
• Helps analyze data from multiple sensors at DoD sites around the country
• Routinely detects Insider Abuse / Policy Violations / Worms / Scans
Large Scale Data Analysis is needed for
• Correlation of suspicious events across network sites
– Helps detect sophisticated attacks not identifiable by single site analyses
Recent technological advances are helping to generate large amounts of both medical and genomic data
• High-throughput experiments/techniques
- Gene and protein sequences
- Gene-expression data
- Biological networks and phylogenetic profiles
• Electronic Medical Records
- IBM-Mayo Clinic partnership has created a DB of 5 million patients
- NIH Roadmap
Data mining offers a potential solution for analysis of large-scale data
• Automated analysis of patient histories for customized treatment
• Design of drugs/chemicals
• Prediction of the functions of anonymous genes
Protein Interaction Network
Role of Benchmarks in Architecture Design
Benchmarks guide the development of new processor architectures in addition to measuring the relative performance of different systems
• SPEC: General purpose architecture (“Advances in the microprocessor industry would not have been possible without the SPEC benchmarks” - David Patterson)
• TPC: Database Systems
• SPLASH: Parallel machine architectures
• Mediabench: Media and Communication Processors
• NetBench: Network/Embedded processors
Do We Need Benchmarks Specific to Data Mining?
Performance metrics of several benchmarks gathered from VTune
• Cache miss ratios, bus usage, page faults, etc.
Benchmark applications were grouped using Kohonen clustering to spot trends:
[Figure: Kohonen cluster number (0 to 11) assigned to each benchmark application. MineBench: apriori, bayesian, birch, eclat, hop, scalparc, kMeans, fuzzy, rsearch, semphy, snp, genenet, svm-rfe. SPEC INT: gcc, bzip2, gzip, mcf, twolf, vortex, vpr, parser. SPEC FP: apsi, art, equake, lucas, mesa, mgrid, swim, wupwise. MediaBench: rawcaudio, epic, encode, cjpeg, mpeg2, pegwit, gs, toast. TPC-H: Q17, Q3, Q4, Q6.]
Reference: [Pisharath J., Zambreno J., Ozisikyilmaz B., Choudhary A., 2006]
Recently funded NSF project: Scalable Benchmarks, Software and Data for Data Mining, Analytics and Scientific Discoveries
PIs: A. Choudhary and Gokhan Memik (NW), V. Kumar and M. Steinbach (UM)
Goal: Establish a comprehensive benchmarking suite for data mining applications.
Motivate the development of new processor architectures and system design for data mining
Motivate the implementation of more sophisticated data mining algorithms that can work with the constraints imposed by current architecture designs
Improve the productivity of scientists and engineers using data mining applications in a wide variety of domains
Profiling
• Types of data (streaming, file I/O)
• Types of applications (scientific, bioinformatics, security, …)
• Types of storage (memory, disks, …)
• Scalability (data-level, processor)
• Performance (execution time, cache behavior, …)
Data Mining Tasks …
[Figure: Data, with the four tasks Predictive Modeling, Clustering, Association Rules, and Anomaly Detection]

Tid | Refund | Marital Status | Taxable Income | Cheat
1 | Yes | Single | 125K | No
2 | No | Married | 100K | No
3 | No | Single | 70K | No
4 | Yes | Married | 120K | No
5 | No | Divorced | 95K | Yes
6 | No | Married | 60K | No
7 | Yes | Divorced | 220K | No
8 | No | Single | 85K | Yes
9 | No | Married | 75K | No
10 | No | Single | 90K | Yes
11 | No | Married | 60K | No
12 | Yes | Divorced | 220K | No
13 | No | Single | 85K | Yes
14 | No | Married | 75K | No
15 | No | Single | 90K | Yes
Key Data Mining Algorithms
Clustering
• K-means, EM, SOM
• Single link / Group Average hierarchical clustering
• DBSCAN, SNN
Classification
• Bayes
• SVM
• Decision trees, Rule based systems
1. Market-basket transactions
2. Find item combinations (itemsets) that occur frequently in data
3. Generate association rules, e.g.
{Diaper} → {Bread}
{Milk, Diaper} → {Beer}
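Rule generation is usually judged with two standard measures, support and confidence. The sketch below is an illustration only, not taken from the slides: the five transactions are made-up market-basket data, and the helper names count, support, and confidence are hypothetical. It evaluates a rule over Milk, Diaper, and Beer like the one shown on the slide.

```python
# Illustrative market-basket data (an assumption, not from the slides).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


def support(antecedent, consequent, transactions):
    """Fraction of all transactions containing antecedent and consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Fraction of transactions with the antecedent that also hold the consequent."""
    covered = count(antecedent, transactions)
    if covered == 0:
        return 0.0  # rule never applies; report zero rather than divide by zero
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / covered


# Rule {Milk, Diaper} -> {Beer}
rule = ({"Milk", "Diaper"}, {"Beer"})
print(support(*rule, transactions), confidence(*rule, transactions))
```

A rule is typically kept only if both values exceed user-chosen thresholds; the thresholds themselves are not specified in the slides.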
Counting Candidates
Frequent Itemsets are found by counting candidates
Simple way: • Search for each candidate in each transaction
Transactions (N):
A B C D
A C E
B C D
A B D E
B C E
B D

Candidates (M), each with a counter that starts at 0. Counts after all transactions have been scanned:

Candidate | Count
A B | 2
A C | 2
A D | 2
A E | 2
B C | 3
B D | 4
A B E | 1
B C D | 2
A B D E | 1
A B C D E | 0

Naïve approach requires O(NM) comparisons
Reduce the number of comparisons (NM) by using hash tables to store the candidate itemsets
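The two counting strategies can be sketched as follows, using the six transactions and ten candidates from the slide. This is an illustrative sketch rather than the authors' implementation: the function names count_naive and count_hashed are hypothetical, and a plain Python dict stands in for the hash table of candidates.

```python
from itertools import combinations

transactions = [
    {"A", "B", "C", "D"},
    {"A", "C", "E"},
    {"B", "C", "D"},
    {"A", "B", "D", "E"},
    {"B", "C", "E"},
    {"B", "D"},
]

# Candidates are stored as sorted tuples so they can serve as dict keys.
candidates = [
    ("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"), ("B", "C"), ("B", "D"),
    ("A", "B", "E"), ("B", "C", "D"), ("A", "B", "D", "E"),
    ("A", "B", "C", "D", "E"),
]


def count_naive(transactions, candidates):
    """Test every candidate against every transaction: N * M comparisons."""
    counts = {c: 0 for c in candidates}
    comparisons = 0
    for t in transactions:
        for c in candidates:
            comparisons += 1
            if set(c) <= t:
                counts[c] += 1
    return counts, comparisons


def count_hashed(transactions, candidates):
    """Enumerate each transaction's subsets and look them up in a hash table."""
    counts = {c: 0 for c in candidates}        # dict acts as the hash table
    sizes = sorted({len(c) for c in candidates})
    for t in transactions:
        items = sorted(t)                      # sorted so subsets match the keys
        for k in sizes:
            for subset in combinations(items, k):
                if subset in counts:           # expected O(1) lookup
                    counts[subset] += 1
    return counts


naive_counts, comparisons = count_naive(transactions, candidates)
hashed_counts = count_hashed(transactions, candidates)
print(comparisons, naive_counts == hashed_counts)
```

In the hashed version the work per transaction depends on how many subsets the transaction has, not on M, so the benefit shows up when the candidate set is large and transactions are short. On a toy input like this one the difference is negligible.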
Parallel Association Rules: Scaleup Results (100K,0.25%) (Ref: Han, Karypis, and Kumar, 2000)
DD (Agrawal & Shafer, 1996)
IDD (Han, Karypis, Kumar, 2000)
HD (Han, Karypis, Kumar, 2000)
Efficient implementation of collective communication
Dynamic restructuring of computation
Candidates for MineBench

Algorithms | Category | Description | Lang. | Parallel
PCA | Preprocessing | Principal component analysis | C/C++/FORT. | Y
ABB | Preprocessing | Automatic Branch and Bound | C/C++ | N
LVF | Preprocessing | A probabilistic feature selection algorithm | C/C++ | N
Normalization | Preprocessing | Variable transformation | C/C++ | Y
ScalParC | Predictive Modeling | Decision tree classifier | C | Y
Naïve Bayesian | Predictive Modeling | Statistical classifier based on class conditional independence | C++ | N
RIPPER | Predictive Modeling | Rule-based predictive modeling | C/C++ | Y
SVMlight | Predictive Modeling | Support Vector Machines | C/C++ | N
K-means | Clustering | Partitioning method | C | Y
Bisecting K-means | Clustering | Partitioning method | C | Y
Fuzzy K-means | Clustering | Fuzzy logic based K-means | C | Y
EM Clustering | Clustering | Partitioning method | C/C++ | Y
MAFIA(N) | Clustering | Multidimensional Clustering | C | Y
BIRCH | Clustering | Hierarchical method | C++ | N
AHC | Clustering | Agglomerative Hierarchical Clustering | C/C++ | N
DBSCAN | Clustering | Density-based method | C/C++ | Y
HOP | Clustering | Density-based method | C | Y
LOF | Anomaly Detection | Local Outlier Factor | C/C++ | Y
Outlier Detection | Anomaly Detection | Distance-based outlier detection | C/C++ | Y
Apriori | ARM | Horizontal database, level-wise mining based on Apriori property | C/C++ | Y
MAFIA(C) | ARM | Maximal frequent itemset mining | C/C++ | N
Eclat | ARM | Vertical database, break large search space into equivalence classes | C++ | N
FP-growth | ARM | Encodes database into a compact FP-tree | C/C++ | N
Analysis of Benchmark Algorithms
Explore the bottlenecks associated with the current general purpose sequential and parallel machines
Explore how different architectural features impact the performance of data mining algorithms
Preliminary Evaluation of Some Sample Data Sets
Example small (S), medium (M), and large (L) data sets
Execution time for some algorithms in the MineBench suite.

Dataset | Classification: Parameter | Classification: DB Size (MB) | ARM: Parameter | ARM: DB Size (MB)
Small | F26-A32-D125K | 27 | T10-I4-D1000K | 47
Medium | F26-A32-D250K | 54 | T20-I6-D2000K | 175
Large | F26-A64-D250K | 108 | T20-I6-D4000K | 350

ARM = Association Rule Mining
Reference: [Liu Y., Pisharath J., Liao W., Memik G., Choudhary A., Dubey P., 2004]
Designing Efficient Kernels for Data Mining
Frequency of Kernel Operations in Representative Applications
An understanding of the bottlenecks in executing DM algorithms on current architectures will help in designing new, more efficient algorithms
The focus will be on designing frequently used kernels that dominate the execution time of most DM algorithms
Both sequential and parallel versions will be developed
Reference: [Pisharath J., Zambreno J., Ozisikyilmaz B., Choudhary A., 2006]
Conclusions
Data mining applications are becoming increasingly important
Current system design approaches are not adequate for DM applications
MineBench – a new benchmark suite which encompasses many algorithms found in data mining
Initial findings:
• Data mining applications are unique in terms of performance characteristics
• There exists much room for optimization with regard to data mining workloads
Bibliography
• Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining. Addison-Wesley, April 2005.
• Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar. Introduction to Parallel Computing (Second Edition). Addison-Wesley, 2003.
• R. Grossman, C. Kamath, W. P. Kegelmeyer, V. Kumar, and R. Namburu (editors). Data Mining for Scientific and Engineering Applications. Kluwer Academic Publishers, 2001.
• J. Han, R. B. Altman, V. Kumar, H. Mannila, and D. Pregibon. "Emerging Scientific Applications in Data Mining". Communications of the ACM, Volume 45, Number 8, pp. 54-58, August 2002.
• C. Potter, P. Tan, M. Steinbach, S. Klooster, V. Kumar, R. Myneni, and V. Genovese. Major Disturbance Events in Terrestrial Ecosystems Detected using Global Satellite Data Sets. Global Change Biology 9 (7), 1005-1021, 2003.
• Vipin Kumar. "Parallel and Distributed Computing for Cyber Security". An article based on the keynote talk by the author at the 17th International Conference on Parallel and Distributed Computing Systems (PDCS-2004). DS Online Journal, Volume 6, Number 10, October 2005.
• Ying Liu, Jayaprakash Pisharath, Wei-keng Liao, Gokhan Memik, Alok Choudhary, and Pradeep Dubey. Performance Evaluation and Characterization of Scalable Data Mining Algorithms. In Proceedings of the 16th International Conference on Parallel and Distributed Computing and Systems (PDCS), November 2004.
• Joseph Zambreno, Berkin Ozisikyilmaz, Jayaprakash Pisharath, Gokhan Memik, and Alok Choudhary. Performance Characterization of Data Mining Applications using MineBench. In Proceedings of the 9th Workshop on Computer Architecture Evaluation using Commercial Workloads (CAECW-9), February 2006.
• Jayaprakash Pisharath, Joseph Zambreno, Berkin Ozisikyilmaz, and Alok Choudhary. Accelerating Data Mining Workloads: Current Approaches and Future Challenges in System Architecture Design. In Proceedings of the 9th International Workshop on High Performance and Distributed Mining (HPDM), April 2006.