Studies of HPCC Systems® from Machine Learning Perspectives Ying Xie, Pooja Chenna, Ken Hoganson Department of Computer Science Kennesaw State University

Studies of HPCC Systems from Machine Learning Perspectives

Jan 22, 2018

Transcript
Page 1: Studies of HPCC Systems from Machine Learning Perspectives

Studies of HPCC Systems® from Machine Learning Perspectives

Ying Xie, Pooja Chenna, Ken Hoganson Department of Computer Science

Kennesaw State University

Page 2: Studies of HPCC Systems from Machine Learning Perspectives

Outline

• Leverage and Enhance Deep Learning Capability of HPCC Systems®

• Comparative Studies of HPCC Systems® and Hadoop Systems

Page 3: Studies of HPCC Systems from Machine Learning Perspectives

Leverage and Enhance Deep Learning Capability of HPCC Systems®

• Deep learning has emerged as a recent breakthrough in machine learning and has revitalized research activity in Artificial Intelligence (AI).

• The development of deep learning techniques has been primarily driven by big data analytics.

• Since HPCC Systems is a well-established big data platform, it is important to leverage and enhance its deep learning capability.

Page 4: Studies of HPCC Systems from Machine Learning Perspectives

Images are copied from neuralnetworksanddeeplearning.com

• From Neural Network to Deep Learning

A typical multilayer perceptron

Page 5: Studies of HPCC Systems from Machine Learning Perspectives

A deep neural architecture

Page 6: Studies of HPCC Systems from Machine Learning Perspectives

Stacked Autoencoder

Page 7: Studies of HPCC Systems from Machine Learning Perspectives

Images are copied from ufldl.stanford.edu

• Stacked Autoencoder

Page 8: Studies of HPCC Systems from Machine Learning Perspectives

This code was implemented by Maryam Mousaarab Najafabadi from F

Page 9: Studies of HPCC Systems from Machine Learning Perspectives

Deep Belief Network

Page 10: Studies of HPCC Systems from Machine Learning Perspectives

Restricted Boltzmann Machine

Images are copied from http://deeplearning4j.org/

Page 11: Studies of HPCC Systems from Machine Learning Perspectives

Deep Belief Network

Page 12: Studies of HPCC Systems from Machine Learning Perspectives

Visualize High Dimensional Data

Page 13: Studies of HPCC Systems from Machine Learning Perspectives

• Iris Data

Page 14: Studies of HPCC Systems from Machine Learning Perspectives

Use DBN to conduct dimension reduction

Page 15: Studies of HPCC Systems from Machine Learning Perspectives
Page 16: Studies of HPCC Systems from Machine Learning Perspectives
Page 17: Studies of HPCC Systems from Machine Learning Perspectives

• Breast Cancer Wisconsin (Original) Data Set

Page 18: Studies of HPCC Systems from Machine Learning Perspectives

Use DBN to conduct dimension reduction

Page 19: Studies of HPCC Systems from Machine Learning Perspectives
Page 20: Studies of HPCC Systems from Machine Learning Perspectives
Page 21: Studies of HPCC Systems from Machine Learning Perspectives

• Glass Data

Page 22: Studies of HPCC Systems from Machine Learning Perspectives

Use stacked auto-encoder to conduct dimension reduction

Page 23: Studies of HPCC Systems from Machine Learning Perspectives
Page 24: Studies of HPCC Systems from Machine Learning Perspectives
Page 25: Studies of HPCC Systems from Machine Learning Perspectives
Page 26: Studies of HPCC Systems from Machine Learning Perspectives
Page 27: Studies of HPCC Systems from Machine Learning Perspectives

• Based on the visualization, we may even get a good idea of which classification algorithm is likely to work well

Page 28: Studies of HPCC Systems from Machine Learning Perspectives

Classification Algorithm | #Misclassified Instances
Logistic | 72
MLP | 66
Simple Logistic | 70
7NN | 76
5NN | 70
3NN | 64

Page 29: Studies of HPCC Systems from Machine Learning Perspectives

Mapping Data to Higher Dimensional Spaces

Page 30: Studies of HPCC Systems from Machine Learning Perspectives

• Blood Transfusion Data

Page 31: Studies of HPCC Systems from Machine Learning Perspectives
Page 32: Studies of HPCC Systems from Machine Learning Perspectives

• Classification performance on higher dimensional space

Classifier | Original Space (4 Dim), #Misclassified Instances | Higher Dim (10 Dim), #Misclassified Instances
Logistic | 171 | 157
MLP | 160 | 166

Mapped to higher dimensional space by Stacked Auto-Encoder

Mapped to higher dimensional space by DBN

Page 33: Studies of HPCC Systems from Machine Learning Perspectives

• Wine Data

Page 34: Studies of HPCC Systems from Machine Learning Perspectives

• Classification performance on higher dimensional space

Classifier | Original Space (13 Dim), #Misclassified Instances | Higher Dim (15 Dim), #Misclassified Instances
Logistic | 10 | 6
MLP | 4 | 4

Classifier | Original Space (13 Dim), #Misclassified Instances | Higher Dim (15 Dim), #Misclassified Instances
Logistic | 10 | 7
MLP | 4 | 2

Mapped to higher dimensional space by Stacked Auto-Encoder

Page 35: Studies of HPCC Systems from Machine Learning Perspectives

• For a given data set, we can explore which combination of deep learning mapping technique, dimensional space, and supervised learning model yields the best classification result

Page 36: Studies of HPCC Systems from Machine Learning Perspectives

• For instance – Breast Cancer Data

Classifier | 3 Dim. Space | 6 Dim. Space | 9 Dim. Space | 12 Dim. Space (all columns: #Misclassified Instances)
Logistic | 22 | 21 | 24 | 25
MLP | 21 | 22 | 30 | 21

Classifier | 3 Dim. Space | 6 Dim. Space | 9 Dim. Space | 12 Dim. Space (all columns: #Misclassified Instances)
Logistic | 22 | 23 | 24 | 25
MLP | 20 | 21 | 30 | 22

Mapped to different dimensional spaces by Stacked Auto-Encoder

Mapped to different dimensional spaces by DBN

Page 37: Studies of HPCC Systems from Machine Learning Perspectives

• Our next step: implement a meta supervised learning algorithm on HPCC

– The algorithm will automatically map the given data to different dimensional spaces using both the stacked auto-encoder and the DBN

– Classification models will then be trained on all dimensional spaces in a distributed manner

– Cross-validation will be used to select the best-performing model as the final output
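
The selection loop described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the planned HPCC implementation: the two entries in `mappings` are hypothetical placeholders standing in for trained stacked auto-encoder and DBN encoders, the classifier is a simple nearest-centroid rule rather than Logistic or MLP, and everything runs on one machine instead of being distributed.

```python
def nearest_centroid_fit(X, y):
    """Return {label: centroid} computed from the training rows."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for k, value in enumerate(row):
            acc[k] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def nearest_centroid_predict(centroids, row):
    """Predict the label whose centroid is closest to the row."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(row, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def cv_accuracy(X, y, k=5):
    """k-fold cross-validated accuracy; fold f holds the rows with index % k == f."""
    correct = 0
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        test = [i for i in range(len(X)) if i % k == fold]
        model = nearest_centroid_fit([X[i] for i in train], [y[i] for i in train])
        correct += sum(nearest_centroid_predict(model, X[i]) == y[i] for i in test)
    return correct / len(X)


def select_best_space(X, y, mappings, k=5):
    """Score the classifier in every candidate space and keep the best one."""
    scores = {name: cv_accuracy([f(row) for row in X], y, k)
              for name, f in mappings.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical placeholder mappings; a real run would plug in trained
# stacked auto-encoder and DBN encoders for each target dimensionality.
mappings = {
    "original (1 dim)": lambda row: row,
    "expanded (2 dim)": lambda row: row + [row[0] ** 2],
}

# Toy data: class 1 lies on both sides of class 0, so it is not separable in 1 dim.
X = [[(i - 20) / 10] for i in range(41)]
y = [1 if abs(row[0]) > 1 else 0 for row in X]
best, scores = select_best_space(X, y, mappings)
print(best, scores)
```

Each candidate space is scored independently of the others, which is what would make the training step straightforward to distribute across nodes.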

Page 38: Studies of HPCC Systems from Machine Learning Perspectives

Implementation of Deep Belief Network on HPCC

Page 39: Studies of HPCC Systems from Machine Learning Perspectives

Our Implementations of Deep Learning on HPCC

• Restricted Boltzmann Machine (RBM) with Contrastive Divergence learning algorithm

• Deep Belief Network by stacking RBMs

• Supervised Deep Belief Network

Page 40: Studies of HPCC Systems from Machine Learning Perspectives

Machine Learning Routines

• Utility Module

• Matrix Library

• Dense Matrix Library

• PBblas

Page 41: Studies of HPCC Systems from Machine Learning Perspectives

Restricted Boltzmann Machine - RBM

[Figure: RBM with visible units v and hidden units h, and the corresponding v’ and h’]

Page 42: Studies of HPCC Systems from Machine Learning Perspectives

$$w(t+1) = w(t) + \alpha \left( v h^{T} - v' h'^{T} \right)$$
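
A minimal sketch of one such weight update, written in plain Python rather than ECL. It assumes binary units with sigmoid activations, a single training sample, and no bias terms (the rule above only mentions w), and it reads v’ and h’ as the reconstruction of v and the hidden response to that reconstruction, as in one-step Contrastive Divergence. All function names are illustrative and are not taken from the HPCC implementation.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probs(W, v):
    """P(h_j = 1 | v) for each hidden unit j; W[i][j] links visible i to hidden j."""
    return [sigmoid(sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(W[0]))]


def visible_probs(W, h):
    """P(v_i = 1 | h) for each visible unit i."""
    return [sigmoid(sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(W))]


def cd1_update(W, v, alpha, rng):
    """Apply one update in place: W += alpha * (v h^T - v' h'^T)."""
    h = hidden_probs(W, v)
    # Sample binary hidden states before reconstructing the input.
    h_sample = [1.0 if rng.random() < p else 0.0 for p in h]
    v_recon = visible_probs(W, h_sample)   # v': reconstruction of the input
    h_recon = hidden_probs(W, v_recon)     # h': hidden response to v'
    for i in range(len(W)):
        for j in range(len(W[0])):
            W[i][j] += alpha * (v[i] * h[j] - v_recon[i] * h_recon[j])
    return v_recon


rng = random.Random(0)
# 4 visible units, 2 hidden units, small random initial weights.
W = [[rng.gauss(0.0, 0.01) for _ in range(2)] for _ in range(4)]
v = [1.0, 0.0, 1.0, 0.0]
v_recon = cd1_update(W, v, alpha=0.1, rng=rng)
```

In practice the update is averaged over a mini-batch and expressed as matrix products, which is where a distributed matrix library such as PBblas would come in.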

Page 43: Studies of HPCC Systems from Machine Learning Perspectives
Page 44: Studies of HPCC Systems from Machine Learning Perspectives

Stacking Boltzmann Machines - Deep Belief Network
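
Stacking can be sketched as greedy layer-wise training: each RBM is trained on the hidden activations of the one below it. The plain-Python sketch below makes the same simplifying assumptions as a basic Contrastive Divergence step (binary units, sigmoid activations, no bias terms); `train_rbm`, `train_dbn`, and `dbn_transform` are illustrative names, not the HPCC routines.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def encode(W, v):
    """Hidden-unit activation probabilities for input v (biases omitted)."""
    return [sigmoid(sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(W[0]))]


def decode(W, h):
    """Visible-unit activation probabilities for hidden vector h."""
    return [sigmoid(sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(W))]


def train_rbm(data, n_hidden, alpha=0.1, epochs=20, seed=0):
    """Train one RBM with one-step Contrastive Divergence; return its weights."""
    rng = random.Random(seed)
    n_visible = len(data[0])
    W = [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)] for _ in range(n_visible)]
    for _ in range(epochs):
        for v in data:
            h = encode(W, v)
            h_sample = [1.0 if rng.random() < p else 0.0 for p in h]
            v_recon = decode(W, h_sample)
            h_recon = encode(W, v_recon)
            for i in range(n_visible):
                for j in range(n_hidden):
                    W[i][j] += alpha * (v[i] * h[j] - v_recon[i] * h_recon[j])
    return W


def train_dbn(data, layer_sizes):
    """Greedy layer-wise training: each RBM learns from the layer below it."""
    weights, layer_input = [], data
    for n_hidden in layer_sizes:
        W = train_rbm(layer_input, n_hidden)
        weights.append(W)
        layer_input = [encode(W, v) for v in layer_input]
    return weights


def dbn_transform(weights, v):
    """Pass one sample up through every trained layer."""
    for W in weights:
        v = encode(W, v)
    return v


data = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0], [1.0, 1.0, 0.0, 0.0]]
weights = train_dbn(data, layer_sizes=[3, 2])
codes = [dbn_transform(weights, v) for v in data]
```

Here `codes` holds a 2-dimensional representation of each 4-dimensional sample; choosing a final layer smaller or larger than the input gives a mapping to a lower or higher dimensional space.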

Page 45: Studies of HPCC Systems from Machine Learning Perspectives

Final Output

Input Parameters

• Iris Data Sample

Page 46: Studies of HPCC Systems from Machine Learning Perspectives

Supervised Boltzmann Machine – Deep Belief Network

y – actual output

h – hidden samples

v – visible samples

Page 47: Studies of HPCC Systems from Machine Learning Perspectives

Supervised Deep Belief Network

Page 48: Studies of HPCC Systems from Machine Learning Perspectives

• Our ultimate goal is to implement a full stack of deep learning techniques on HPCC and conduct a wide range of experiments to show how powerful the deep learning engine on HPCC will be.

Page 49: Studies of HPCC Systems from Machine Learning Perspectives

Comparative Studies of HPCC and Hadoop

Page 50: Studies of HPCC Systems from Machine Learning Perspectives

• HPCC and Hadoop clusters on CSCloud

– HPCC cluster:
  • 5 Thor nodes
  • 5 Roxie nodes
  • 2 middleware nodes
  • 1 landing zone node for uploading files

– Hadoop cluster:
  • 1 JobTracker / NameNode
  • 1 support system (Web UI, Hadoop ecosystem, etc.)
  • 4 worker nodes: TaskTracker / DataNode

Page 51: Studies of HPCC Systems from Machine Learning Perspectives

• Algorithms for comparison

– Text Processing Algorithms:
  • Word Count
  • Inverted Index

– Machine Learning Algorithms:
  • Supervised Learning Algorithm - Random Forests
  • Unsupervised Learning Algorithm - KMeans

– Graph Algorithm:
  • Page Rank

Page 52: Studies of HPCC Systems from Machine Learning Perspectives

Text Processing Algorithms

Data: Authorized version of Bible downloaded from http://av1611.com/

HPCC Implementation of Inverted Index: http://www.dabhand.org/ECL/construct_a_simple_bible_search.htm (implemented by David Bayliss, Chief Data Scientist and VP of LexisNexis Risk Solutions)

Hadoop Implementation of Inverted Index: Victor Guana and Joshua Davidson. On Comparing Inverted Index Parallel Implementations Using Map/Reduce. Technical Report, May 09, 2012.

Algorithm | HPCC | Hadoop
Word Count | 1.003 seconds | 23.466 seconds
Inverted Index | 34.205 seconds | 27.047 seconds
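
For readers unfamiliar with the two text-processing tasks, here is a minimal single-machine sketch in plain Python. It is neither the ECL nor the Map/Reduce implementation that was benchmarked; the sample texts, document ids, and the simple lowercase tokenizer are made up for illustration.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())


def word_count(docs):
    """Total number of occurrences of each word across all documents."""
    counts = Counter()
    for text in docs.values():
        counts.update(tokenize(text))
    return counts


def inverted_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample documents keyed by id.
docs = {
    "d1": "the quick brown fox",
    "d2": "the lazy dog",
    "d3": "the quick dog jumps",
}
counts = word_count(docs)
index = inverted_index(docs)
hits = search(index, "quick dog")
```

Word count needs only one aggregation over all tokens, while the inverted index must keep per-document postings for every word, so the two tasks stress a platform differently.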

Page 53: Studies of HPCC Systems from Machine Learning Perspectives

Machine Learning Algorithms

Data: KDD Network Intrusion Dataset

Reference: C. Blake and C. J. Merz. UCI Repository of machine learning databases. Irvine, CA: University of California, Department of Information and Computer Science. [http://www.ics.uci.edu/~mlearn/MLRepository.html]

Description:

– Total number of instances: 4,000,000

– Used instances: 20,394, randomly picked

– Number of attributes: 42

Hadoop Libraries – Apache Mahout:

– Version: CDH-5.4.2-1 (Cloudera)

HPCC Machine Learning Library

Page 54: Studies of HPCC Systems from Machine Learning Perspectives

Efficiency:

Algorithm | HPCC | Hadoop
Random Forests | 1 minute 50 seconds | 18 seconds
KMeans | 36.675 seconds | 1 minute 45 seconds

Page 55: Studies of HPCC Systems from Machine Learning Perspectives

Graph Algorithm

Data set: randomly generated graph with 25 nodes and maximum degree 5

HPCC implementation of PageRank: our team’s implementation

Hadoop implementation of PageRank: http://blog.xebia.com/2011/09/wiki-pagerank-with-hadoop/

Algorithm | HPCC | Hadoop
Page Rank | 29.817 seconds | 36 minutes
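
The test setup can be approximated with a short plain-Python sketch: generate a random graph and run PageRank by power iteration. Several details are assumptions, since the slides do not state them: "maximum degree 5" is read as at most 5 outgoing links per node, every node gets at least one link, and the damping factor is the commonly used 0.85. This is not the team's ECL code or the Hadoop job.

```python
import random


def random_graph(n_nodes=25, max_degree=5, seed=0):
    """Directed graph {node: [targets]} with 1..max_degree out-links per node."""
    rng = random.Random(seed)
    graph = {}
    for node in range(n_nodes):
        others = [m for m in range(n_nodes) if m != node]
        graph[node] = rng.sample(others, rng.randint(1, max_degree))
    return graph


def pagerank(graph, damping=0.85, tol=1e-10, max_iter=200):
    """Power iteration; assumes every node has at least one out-link."""
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}
    for _ in range(max_iter):
        # Every node starts each round with the teleportation share.
        new_rank = {node: (1.0 - damping) / n for node in graph}
        for node, targets in graph.items():
            share = damping * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        delta = sum(abs(new_rank[node] - rank[node]) for node in graph)
        rank = new_rank
        if delta < tol:
            break
    return rank


graph = random_graph()
ranks = pagerank(graph)
top5 = sorted(ranks, key=ranks.get, reverse=True)[:5]
```

On a 25-node graph this converges in well under a second on a single machine, which suggests the timings above are dominated by per-job and per-iteration overhead rather than by the computation itself.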

Page 56: Studies of HPCC Systems from Machine Learning Perspectives

THANK YOU