PASCAL: A Parallel Algorithmic SCALable Framework for N-body Problems
Laleh Aghababaie Beni, Aparna Chandramowlishwaran
HPC Factory, Euro-Par 2017

PASCAL - University of California, Irvine

Feb 12, 2022

Page 1: PASCAL - University of California, Irvine

HPC Factory

PASCAL

A Parallel Algorithmic SCALable Framework for N-body Problems

Laleh Aghababaie Beni, Aparna Chandramowlishwaran

Euro-Par 2017

Page 2: PASCAL - University of California, Irvine

Outline

• Introduction
• PASCAL Framework
  • Space Partitioning Trees
  • Tree Traversal
  • Prune/Approximate Generators
• Optimizations & Parallelization
• Experiments & Results
• Conclusions & Future Work

Page 6: PASCAL - University of California, Irvine

N-body calculations

Force computation: $\forall q \in Q:\ F(q) = \sum_{r \in Q \setminus \{q\}} \frac{C\,(r - q)}{\lVert r - q \rVert^3}$

Nearest neighbors: $\forall q \in Q:\ \mathrm{AllNN}(q) = \arg\min_{r \in R} d(q, r)$

Kernel density estimation: $\forall q \in Q:\ \mathrm{KDE}(q) = \frac{1}{\lvert R \rvert} \sum_{r \in R} K(q, r)$

Range count: $\forall q \in Q:\ \mathrm{Range}(q) = \sum_{r \in R} I(\mathrm{dist}(q, r) \le h)$

What do these have in common?

Consider pairs of points: naïvely $O(N^2)$
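The naïve all-pairs evaluation that these problems share can be sketched as follows. This is only an illustration of the $O(N^2)$ baseline under assumed names (`naive_range_count`, `naive_all_nn`) and Euclidean distance; it is not code from the talk.

```python
import math

def naive_range_count(queries, references, h):
    """For every query q, count the reference points r with dist(q, r) <= h."""
    counts = []
    for q in queries:                      # outer loop: every query point
        c = 0
        for r in references:               # inner loop: every reference point
            if math.dist(q, r) <= h:
                c += 1
        counts.append(c)
    return counts

def naive_all_nn(queries, references):
    """For every query q, return the index of its closest reference point."""
    return [min(range(len(references)),
                key=lambda j: math.dist(q, references[j]))
            for q in queries]

# Hypothetical toy data
Q = [(0.0, 0.0), (1.0, 1.0)]
R = [(0.0, 0.1), (0.9, 1.0), (5.0, 5.0)]
print(naive_range_count(Q, R, 0.5))
print(naive_all_nn(Q, R))
```

Both functions touch every (query, reference) pair, which is the quadratic cost the tree-based methods in the rest of the talk aim to avoid.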

Page 7: PASCAL - University of California, Irvine

Commonality: Optimal approximation algorithms

Force computation: $\forall q \in Q:\ F(q) = \sum_{r \in Q \setminus \{q\}} \frac{C\,(r - q)}{\lVert r - q \rVert^3}$

• Hierarchical tree-based approximation algorithms for force computations, e.g., Barnes-Hut or FMM
• Evaluate interactions → Tree traversals
• Store aggregate data at nodes, e.g., bounding box, mass

Page 8: PASCAL - University of California, Irvine

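N-body problems in other domains

| Problem | Operators | Kernel Function |
|---|---|---|
| All Nearest Neighbors | $\forall, \arg\min$ | $\lVert x_q - x_r \rVert$ |
| All Range Search | $\forall, \cup\arg$ | $I(h_{\min} < \lVert x_q - x_r \rVert < h_{\max})$ |
| All Range Count | $\forall, \sum$ | $I(h_{\min} < \lVert x_q - x_r \rVert < h_{\max})$ |
| Naive Bayes Classifier | $\forall, \arg\max$ | $\frac{1}{\sqrt{2\pi \lvert \Sigma_k \rvert}} e^{-\frac{1}{2}(x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k)} P(C_k)$ |
| Mixture Model E-step | $\forall, \forall$ | $\frac{1}{\sqrt{2\pi \lvert \Sigma_k \rvert}} e^{-\frac{1}{2}(x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k)}$ |
| K-means E-step | $\forall, \arg\min$ | $\lVert x_q - x_r \rVert$ |
| Mixture Model Log-likelihood | $\sum, \log\sum$ | $\frac{1}{\sqrt{2\pi \lvert \Sigma_k \rvert}} e^{-\frac{1}{2}(x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k)}$ |
| Kernel Density Estimation | $\forall, \sum$ | $K\!\left(\frac{\lVert x_q - x_r \rVert}{h}\right)$ |
| Kernel Density Bayes Classifier | $\forall, \arg\max\sum$ | $K\!\left(\frac{\lVert x_q - x_r \rVert}{h}\right) P(C_k)$ |
| 2-point (cross-)correlation | $\sum, \sum$ | $I(\lVert x_q - x_r \rVert < h)$ |
| Nadaraya-Watson Regression | $\forall, \sum$ | $y_r\, K\!\left(\frac{\lVert x_q - x_r \rVert}{h}\right)$ |
| Thermodynamic Average | $\sum, \sum$ | $K(\lVert x_q - x_r \rVert)$ |
| Largest-span set | $\max, \ldots, \max$ | $\Sigma(\lVert x_q - x_r \rVert)$ |
| Closest Pair | $\arg\min, \arg\min$ | $\lVert x_q - x_r \rVert$ |
| Minimum Spanning Tree | $\forall, \arg\min$ | $\lVert x_q - x_r \rVert$ |
| Coulombic Interaction | $\forall, \sum$ | $\frac{\alpha_q \alpha_r}{\lVert x_q - x_r \rVert}$ |
| Average Density | $\sum, \sum$ | $I(\lVert x_q - x_r \rVert < h)$ |
| Wave function | $\forall, \prod$ | $K(\lVert x_q - x_r \rVert)$ |
| Hausdorff Distance | $\max, \min$ | $\lVert x_q - x_r \rVert$ |
| Intrinsic (fractal) Dimension | $\sum, \sum$ | $I(\lVert x_q - x_r \rVert < h)$ |

Each problem has a set of operators and a kernel function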

Page 9: PASCAL - University of California, Irvine

Why N-body methods?

• One of the original seven dwarfs or motifs
• FMM listed among the top 10 algorithms having the greatest influence in the 20th century
• EM is one of the top 10 algorithms having the highest impact in data mining
• Applications
  • Machine learning
  • Computer vision
  • Computational geometry
  • Scientific computing …

Page 10: PASCAL - University of California, Irvine

Key Ideas and Findings

• An algorithmic framework for N-body problems
  • Automatically generates prune & approximation conditions
  • Results in O(N log N) and O(N) algorithms
  • Domain-specific optimizations and parallelization
• Show 10-230x speedup on 6 different algorithms compared to state-of-the-art libraries/software
• Out-of-the-box new optimal algorithms
  • O(N log N) EM algorithm for GMMs
  • O(N) algorithm for Hausdorff distance

Page 12: PASCAL - University of California, Irvine

PASCAL Framework

[Figure: PASCAL framework diagram with the following components]

• Datasets
• N-body spec.: Operators & Kernel function
• Space-partitioning Trees: Kd-tree
• Prune/Approximate condition generator
• Tree Traversal Schemes: Multi-tree traversals (BaseCase, Prune/Approximate, ComputeApproximate)
• Domain-Specific Optimizations
• Parallelization
• Optimized code

Page 13: PASCAL - University of California, Irvine

Tree Construction

[Figure source: http://www.cs.cmu.edu/~dpelleg/kmeans.html]

Recursively divide space until each box has at most q points.
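A minimal sketch of this recursive construction is shown below. The slide only states the stopping rule (at most q points per box); splitting the widest dimension at the median, and the names `Node`, `build_kdtree`, and `leaves`, are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, ...]

@dataclass
class Node:
    points: List[Point]               # points contained in this box
    lo: Point                         # lower corner of the bounding box
    hi: Point                         # upper corner of the bounding box
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    @property
    def is_leaf(self) -> bool:
        return self.left is None

def build_kdtree(points: List[Point], q: int = 2) -> Node:
    """Recursively split until every box holds at most q points."""
    dims = range(len(points[0]))
    lo = tuple(min(p[d] for p in points) for d in dims)
    hi = tuple(max(p[d] for p in points) for d in dims)
    node = Node(points, lo, hi)
    if len(points) <= q:              # stopping rule from the slide
        return node
    # Assumed split rule: widest dimension, cut at the median point.
    d = max(dims, key=lambda k: hi[k] - lo[k])
    ordered = sorted(points, key=lambda p: p[d])
    mid = len(ordered) // 2
    node.left = build_kdtree(ordered[:mid], q)
    node.right = build_kdtree(ordered[mid:], q)
    return node

def leaves(node: Node) -> List[Node]:
    """Collect the leaf boxes of the tree."""
    if node.is_leaf:
        return [node]
    return leaves(node.left) + leaves(node.right)

# Hypothetical toy data
pts = [(float(i), float(i % 3)) for i in range(10)]
tree = build_kdtree(pts, q=2)
print([len(leaf.points) for leaf in leaves(tree)])
```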

Page 16: PASCAL - University of California, Irvine

Tree Traversal

[Figure sequence: query tree Q and reference tree R; at each visited pair of nodes the traversal asks Prune/Approx()? and the answer is NO]

Page 23: PASCAL - University of California, Irvine

Tree Traversal

[Figure: query tree Q and reference tree R]

BaseCase(): direct $Q_L \otimes R_L$ → $O(q^2)$

Page 27: PASCAL - University of California, Irvine

Tree Traversal

[Figure: query tree Q and reference tree R]

If Prune/Approx() is true, discard the entire subtree for pruning problems
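The pruning case can be sketched for range count with a dual-tree traversal. This is an illustrative sketch, not the PASCAL implementation: the dict-based tree, the helper names, and the prune test (minimum distance between bounding boxes exceeds h) are assumptions, and query points are assumed distinct so they can serve as dictionary keys.

```python
import math

def build(points, q=2):
    """Tiny kd-tree as nested dicts: bounding box, points, children."""
    dims = range(len(points[0]))
    lo = [min(p[d] for p in points) for d in dims]
    hi = [max(p[d] for p in points) for d in dims]
    node = {"pts": points, "lo": lo, "hi": hi, "kids": []}
    if len(points) > q:
        d = max(dims, key=lambda k: hi[k] - lo[k])
        s = sorted(points, key=lambda p: p[d])
        m = len(s) // 2
        node["kids"] = [build(s[:m], q), build(s[m:], q)]
    return node

def min_box_dist(a, b):
    """Smallest possible distance between a point in box a and a point in box b."""
    gaps = [max(0.0, a["lo"][d] - b["hi"][d], b["lo"][d] - a["hi"][d])
            for d in range(len(a["lo"]))]
    return math.sqrt(sum(g * g for g in gaps))

def range_count(qn, rn, h, counts):
    """Dual-tree traversal: accumulate, per query point, the references within h."""
    if min_box_dist(qn, rn) > h:            # Prune/Approx()? YES: discard this pair
        return
    if not qn["kids"] and not rn["kids"]:   # BaseCase(): direct leaf-leaf evaluation
        for p in qn["pts"]:
            counts[p] += sum(1 for r in rn["pts"] if math.dist(p, r) <= h)
        return
    for qc in (qn["kids"] or [qn]):         # Prune/Approx()? NO: recurse
        for rc in (rn["kids"] or [rn]):
            range_count(qc, rc, h, counts)

# Hypothetical toy data
Q = [(0, 0), (1, 0), (4, 4), (5, 5), (9, 9)]
R = [(0, 1), (1, 1), (4, 5), (8, 8), (9, 8)]
counts = {p: 0 for p in Q}
range_count(build(Q), build(R), 1.5, counts)
print(counts)
```

Because the box-to-box minimum distance is a lower bound on every point-to-point distance in the pair, discarding the pair cannot change the counts.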

Page 31: PASCAL - University of California, Irvine

Tree Traversal

[Figure: query tree Q and reference tree R]

If Prune/Approx() is true, replace the subtree with the centroid for approximation problems

Page 32: PASCAL - University of California, Irvine

Tree Traversal

[Figure: query tree Q and reference tree R]

ApproxCompute()
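The approximation case can be sketched for a kernel sum, where a whole reference node is replaced by its centroid. This is a simplified single-tree illustration (one query point against the reference tree), not the authors' code: the Gaussian kernel, the dict-based tree, the helper names, and the parameter name `tau` for the accuracy threshold are all assumptions.

```python
import math

def build(points, q=2):
    """Tiny kd-tree as nested dicts, storing bounding box and centroid per node."""
    dims = range(len(points[0]))
    lo = [min(p[d] for p in points) for d in dims]
    hi = [max(p[d] for p in points) for d in dims]
    center = [sum(p[d] for p in points) / len(points) for d in dims]
    node = {"pts": points, "lo": lo, "hi": hi, "center": center, "kids": []}
    if len(points) > q:
        d = max(dims, key=lambda k: hi[k] - lo[k])
        s = sorted(points, key=lambda p: p[d])
        m = len(s) // 2
        node["kids"] = [build(s[:m], q), build(s[m:], q)]
    return node

def kernel(dist, h):
    """Assumed Gaussian kernel of the scaled distance."""
    return math.exp(-0.5 * (dist / h) ** 2)

def box_dists(qp, node):
    """Minimum and maximum distance from point qp to the node's bounding box."""
    near = [min(max(x, lo), hi) for x, lo, hi in zip(qp, node["lo"], node["hi"])]
    far = [lo if abs(x - lo) > abs(x - hi) else hi
           for x, lo, hi in zip(qp, node["lo"], node["hi"])]
    return math.dist(qp, near), math.dist(qp, far)

def kernel_sum(qp, node, h, tau):
    """Sum of kernel values from qp to all points under node, with approximation."""
    dmin, dmax = box_dists(qp, node)
    k_max, k_min = kernel(dmin, h), kernel(dmax, h)
    k_center = kernel(math.dist(qp, node["center"]), h)
    if k_max - k_min < tau * k_center:       # Prune/Approx()? YES
        return len(node["pts"]) * k_center   # ApproxCompute(): centroid stands in
    if not node["kids"]:                     # BaseCase(): direct evaluation
        return sum(kernel(math.dist(qp, r), h) for r in node["pts"])
    return sum(kernel_sum(qp, c, h, tau) for c in node["kids"])

# Hypothetical toy data
R = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (5.2, 4.9), (9.0, 1.0)]
tree = build(R)
print(kernel_sum((1.0, 1.0), tree, h=1.0, tau=0.5) / len(R))  # approximate KDE value
```

With `tau = 0` the approximation branch never fires and the traversal reduces to the exact sum; larger values trade accuracy for fewer kernel evaluations.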

Page 35: PASCAL - University of California, Irvine

Prune/Approximate Condition Generator

• Prune, e.g., Hausdorff Distance

Page 36: PASCAL - University of California, Irvine

Hausdorff Distance

[Figure sequence: point sets Q and R]
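For reference, the Hausdorff distance pairs a max operator over Q with a min operator over R, applied to the kernel $\lVert x_q - x_r \rVert$. A brute-force sketch of that directed form, with an assumed function name, is:

```python
import math

def directed_hausdorff(Q, R):
    """max over q in Q of (min over r in R of ||q - r||), evaluated naively."""
    return max(min(math.dist(q, r) for r in R) for q in Q)

# Hypothetical toy data
Q = [(0, 0), (3, 0)]
R = [(0, 1), (3, 4)]
print(directed_hausdorff(Q, R))
```

This evaluates every pair of points; the symmetric Hausdorff distance is the larger of `directed_hausdorff(Q, R)` and `directed_hausdorff(R, Q)`.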

Page 44: PASCAL - University of California, Irvine

Prune/Approximate Condition Generator

• Prune, e.g., Hausdorff Distance
• Approximation, e.g., Expectation Maximization (EM)
  • E-step
  • M-step
  • Log-likelihood

Page 56: PASCAL - University of California, Irvine

Approximate Condition for EM

[Figure: nodes Q and R with their centers; kernel values $K_{\min}$, $K_{\max}$, and $K_{\text{center}}$]

Log-likelihood: $K_{\max} - K_{\min} < \tau\, K_{\text{center}}$

E-step: $(r_{i,\max} - r_{i,\min}) < \tau\, r_{i,\text{mean}} \quad (i = 1, \ldots, K)$

$\tau$: user-controlled accuracy
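The E-step condition above can be read as a per-component check on the spread of a quantity r_i over a node. A hedged sketch follows, assuming the bounds r_min, r_max and the mean r_mean are already available for each of the K components; the function name and the parameter name `tau` for the user-controlled accuracy are hypothetical.

```python
def can_approximate(r_min, r_max, r_mean, tau):
    """True if (r_max[i] - r_min[i]) < tau * r_mean[i] holds for every component i."""
    return all((hi - lo) < tau * mean
               for lo, hi, mean in zip(r_min, r_max, r_mean))

# Hypothetical bounds for K = 2 components over one tree node
r_min = [0.10, 0.80]
r_max = [0.11, 0.82]
r_mean = [0.105, 0.81]
print(can_approximate(r_min, r_max, r_mean, tau=0.2))   # looser accuracy
print(can_approximate(r_min, r_max, r_mean, tau=0.05))  # tighter accuracy
```

A smaller `tau` makes the test harder to satisfy, so fewer nodes are approximated and more are evaluated directly.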

Page 61: PASCAL - University of California, Irvine

Prune Condition for Hausdorff distance:

[Figure: nodes Q and R with their border points]

$\mathrm{op}_1(\tau_1, K(x_q, x_r) \mid \mathrm{op}_2(\tau_2, K(x_q, x_r)))$ s.t. $\forall x_q \in N_q^{\text{border}},\ \forall x_r \in N_r^{\text{border}}$

Page 66: PASCAL - University of California, Irvine

Optimizations

• Incremental bounding box calculation
  • Only update the dimension that is split at each node
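One way to read this optimization: a child's box differs from its parent's box only along the split dimension, so only that coordinate needs to be recomputed. The sketch below assumes boxes are the kd-tree cells stored as (lo, hi) tuples; the function name and arguments are hypothetical.

```python
def child_boxes(lo, hi, split_dim, split_value):
    """Derive both child boxes from the parent box by changing one coordinate each."""
    left_hi = hi[:split_dim] + (split_value,) + hi[split_dim + 1:]
    right_lo = lo[:split_dim] + (split_value,) + lo[split_dim + 1:]
    return (lo, left_hi), (right_lo, hi)

# Hypothetical 2-D parent box [0, 4] x [0, 2], split along dimension 0 at 1.5
left, right = child_boxes((0, 0), (4, 2), split_dim=0, split_value=1.5)
print(left)
print(right)
```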

Page 69: PASCAL - University of California, Irvine

Optimizations

• Incremental bounding box calculation
• Optimal metric calculation
  • Reduced distance
    • e.g., squared Euclidean distance
    • Eliminates the expensive, long-latency sqrt instruction
  • Partial distance
    • Big payoff for high-dimensional datasets
• Incremental distance calculation
  • Node-to-node distance computed incrementally from parent's distance in constant time
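The two metric optimizations can be illustrated with a nearest-neighbor scan. This sketch assumes that "partial distance" means stopping the per-dimension accumulation once the running sum already exceeds the best distance found so far; the function names are hypothetical.

```python
def reduced_dist(a, b):
    """Squared Euclidean distance: preserves distance ordering without a sqrt."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def partial_dist(a, b, bound):
    """Accumulate squared differences; give up once the running sum exceeds bound."""
    total = 0.0
    for x, y in zip(a, b):
        total += (x - y) ** 2
        if total > bound:
            return None          # remaining dimensions are skipped
    return total

def nearest(q, refs):
    """Nearest neighbor of q using reduced and partial distances."""
    best, best_d2 = None, float("inf")
    for r in refs:
        d2 = partial_dist(q, r, best_d2)
        if d2 is not None and d2 < best_d2:
            best, best_d2 = r, d2
    return best, best_d2

# Hypothetical toy data
print(nearest((0, 0, 0), [(3, 0, 0), (1, 1, 0), (0, 0, 5)]))
print(reduced_dist((0, 0, 0), (1, 1, 0)))
```

The early exit saves more work as the number of dimensions grows, which matches the slide's remark about high-dimensional datasets.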

Page 75: PASCAL - University of California, Irvine

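Parallelization

[Figure: query tree Q; upper levels are spawned with cilk_spawn and the resulting subtrees are assigned to threads t0, t1, t2, t3]

• Task parallelism
  • cilk_spawn
  • Stop spawning when #cilk threads = #physical threads
• Data parallelism
• Pruning/Approximation causes load imbalance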
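The "stop spawning" rule can be sketched in Python, with a thread pool standing in for cilk_spawn. This is only an analogy under assumed names and a toy workload (counting leaves); it is not the Cilk code used in the talk, and Python threads will not show the same speedups.

```python
from concurrent.futures import ThreadPoolExecutor

def make_tree(depth):
    """Hypothetical full binary tree; each leaf stands for one unit of work."""
    if depth == 0:
        return {"children": []}
    return {"children": [make_tree(depth - 1), make_tree(depth - 1)]}

def count_leaves(node):
    """Sequential work performed inside a single task."""
    if not node["children"]:
        return 1
    return sum(count_leaves(c) for c in node["children"])

def top_level_subtrees(root, n_workers):
    """Descend level by level until there is at least one subtree per worker."""
    frontier = [root]
    while len(frontier) < n_workers and any(n["children"] for n in frontier):
        nxt = []
        for n in frontier:
            nxt.extend(n["children"] or [n])   # leaves stay in the frontier
        frontier = nxt
    return frontier

def parallel_count(root, n_workers=4):
    """One task per top-level subtree; no further spawning below that level."""
    tasks = top_level_subtrees(root, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(count_leaves, tasks))

print(parallel_count(make_tree(5), n_workers=4))
```

With a fixed one-subtree-per-thread assignment, a subtree that is pruned early finishes sooner than the others, which is the load imbalance the slide points out.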

Page 77: PASCAL - University of California, Irvine

Experimental Setup

• Architecture
  • Dual-socket Intel Xeon E5-2630 v3 processor (Haswell-EP)
  • Each socket has 8 cores
  • Theoretical peak performance of 614.4 GFlops
• Compiler
  • Intel C++ compiler (icpc v15.0.2)
  • Python v2.7.6 (Scikit-learn)
  • Java v1.8.0 (Weka)
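The quoted peak can be reproduced with a short calculation. The 2.4 GHz base clock and the 16 double-precision flops per cycle per core (two 256-bit FMA units) are assumptions not stated on the slide; the socket and core counts are from the slide.

```latex
\underbrace{2}_{\text{sockets}}
\times \underbrace{8}_{\text{cores per socket}}
\times \underbrace{2.4\ \text{GHz}}_{\text{assumed base clock}}
\times \underbrace{16}_{\text{assumed DP flops per cycle per core}}
= 614.4\ \text{GFlop/s}
```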

Page 78: PASCAL - University of California, Irvine

Case Studies (Direct)

• Kernel Density Estimation
• Nearest Neighbors
• Range-Search: $I(\lVert x_q - x_r \rVert \le h)$
• Hausdorff Distance

Page 79: PASCAL - University of California, Irvine

Case Studies (Iterative)

• Euclidean Minimum Spanning Tree
• Expectation Maximization (EM)
  • E-step
  • M-step
  • Log-likelihood

Page 80: PASCAL - University of California, Irvine

Library Comparison

• Weka: 6,677,053 downloads, written in Java
• Scikit-learn: 121,841 downloads, written in Python
• MATLAB: over 1,000,000 licensed users, uses C in backend
• MLPACK: exploits C++ language features to provide maximum performance

[Figure: two bar charts of speedup (y-axis, 0 to 250) for EM and kNN on the Yahoo!, HIGGS, Census, KDD, and IHEPC datasets; legend: MATLAB, WEKA, MLPACK, Scikit, PASCAL]

Page 81: PASCAL - University of California, Irvine

Speedup Breakdown

7 Results and Discussion

The combined benefits of asymptotically optimal algorithms, optimizations, and parallelization are substantial. In this section, we first compare our performance against state-of-the-art ML libraries and software. Then, we break down the performance gain step by step and, finally, evaluate the scalability of our algorithms.

Performance Summary. Figure 2 presents the performance of k-NN and EM. We chose these two algorithms because they are the only ones supported by all competing libraries and therefore make good candidates for a comprehensive comparison. Moreover, given the space constraints, they represent two ends of the spectrum: k-NN is a direct pruning algorithm, while EM is an iterative approximation algorithm.

[Figure 2: bar charts of speedup (y-axis, 0 to 250) on the Yahoo!, HIGGS, Census, KDD, and IHEPC datasets; legend: MATLAB, WEKA, MLPACK, Scikit, PASCAL]

Fig. 2: Speedup summary of single-tree EM (top) and dual-tree k-NN for k = 3 (bottom). The slowest library is used as the baseline for comparison.

Across the board, our implementation shows significantly better performance compared to Scikit-learn, MLPACK, MATLAB, and Weka.

Performance Breakdown. To gain a better understanding of the factors contributing to the performance improvement, we break down the speedups in Table 2. Specifically, it helps distinguish the improvements that are purely algorithmic (tree algorithm) from improvements via optimization and parallelization. For example, for the Yahoo! dataset, we observe a 3.1× speedup from an asymptotically faster algorithm, 12.1× due to optimizations on top of the tree algorithm, and 173.1× with parallelization for k-NN. The breakdown for EM is 1.6×, 3.2×, and 53.7×, respectively, for the same dataset.

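| Dataset | KNN Alg | KNN +Opt | KNN +Par | EM Alg | EM +Opt | EM +Par | KDE Alg | KDE +Opt | KDE +Par | HD Alg | HD +Opt | HD +Par | RS Alg | RS +Opt | RS +Par | EMST Alg | EMST +Opt | EMST +Par |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Yahoo! | 3.1 | 12.1 | 173.1 | 1.6 | 3.2 | 53.7 | 2.1 | 9.1 | 92.1 | 2.5 | 11.5 | 161.1 | 2.2 | 9.1 | 126.8 | 2.9 | 11.9 | 166.7 |
| HIGGS | 2.1 | 7.3 | 108.1 | 1.5 | 6.8 | 117.6 | 1.7 | 4.7 | 50.1 | 1.9 | 6.1 | 89.6 | 1.9 | 6.3 | 86.5 | 2.0 | 6.9 | 102.8 |
| Census | 1.4 | 6.5 | 90.8 | 1.3 | 11.2 | 190.0 | 1.4 | 8.1 | 75.6 | 1.3 | 10.2 | 141.8 | 1.3 | 10.4 | 144.9 | 1.4 | 10.9 | 151.6 |
| KDD | 1.6 | 6.8 | 100.7 | 1.4 | 4.1 | 70.9 | 1.5 | 3.1 | 33.5 | 1.4 | 3.8 | 54.4 | 1.4 | 5.1 | 70.5 | 1.5 | 3.8 | 55.5 |
| IHEPC | 3.0 | 4.3 | 61.5 | 1.5 | 7.6 | 127.6 | 2.0 | 5.4 | 53.6 | 2.5 | 6.8 | 101.3 | 2.1 | 6.3 | 94.1 | 2.9 | 7.1 | 107.1 |

Table 2: Speedup breakdown. Alg stands for algorithmic improvement, +Opt refers to optimization on top of Alg, and +Par is parallelization on top of Opt.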
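Since the columns in Table 2 are cumulative, the contribution of each stage can be estimated by dividing adjacent columns. The sketch below does this for the Yahoo! k-NN row; the function name is hypothetical and the ratios are a derived reading of the table, not figures reported by the authors.

```python
def incremental_factors(alg, opt, par):
    """Split cumulative speedups (Alg, +Opt, +Par) into per-stage factors."""
    return {
        "algorithm": alg,
        "optimization": opt / alg,      # gain added by optimizations alone
        "parallelization": par / opt,   # gain added by parallelization alone
    }

# Yahoo! k-NN row of Table 2: Alg 3.1, +Opt 12.1, +Par 173.1
factors = incremental_factors(3.1, 12.1, 173.1)
print({name: round(value, 1) for name, value in factors.items()})
```

For this row the parallelization stage alone accounts for roughly a 14× gain on the 16-core test machine.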

Page 82: PASCAL - University of California, Irvine

Scalability

Page 83: PASCAL - University of California, Irvine

Summary and Status

• First generalized algorithmic framework for N-body problems
• Out-of-the-box new optimal algorithms
  • O(N log N) EM algorithm
  • O(N) Hausdorff distance algorithm
• Generalizes to more than two operators
• 10-230x speedup from optimal tree algorithm, domain-specific optimizations and parallelization
• Short-term: DSL + code generator for base-case, optimizations and parallelization
• Long-term: Extend to GPUs and distributed memory systems