Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525.

Portability and Scalability of Sparse Tensor Decompositions on CPU/MIC/GPU Architectures
Christopher Forster, Keita Teranishi, Greg Mackey, Daniel Dunlavy and Tamara Kolda
SAND2017-6575 C

May 27, 2018

Page 1: Portability and Scalability of Sparse Tensor ...users.wfu.edu/ballard/SIAM-AN17/forster.pdf · SAND2017-6575 C. Sparse Tensor Decomposition ... Parallel Programing with Kokkos ...

Portability and Scalability of Sparse Tensor Decompositions on CPU/MIC/GPU Architectures

Christopher Forster, Keita Teranishi, Greg Mackey, Daniel Dunlavy and Tamara Kolda

SAND2017-6575 C

Page 2: Sparse Tensor Decomposition

§ Develop production-quality library software to perform CP factorization with Alternating Poisson Regression on HPC platforms
  § SparTen
§ Support several HPC platforms
  § Node parallelism (multicore, manycore and GPUs)
§ Major questions
  § Software design
  § Performance tuning
§ This talk
  § We are interested in two major variants
    § Multiplicative Updates
    § Projected Damped Newton for Row-subproblems

Page 3: CP Tensor Decomposition

§ Express the important features of the data using a small number of vector outer products

Key references: Hitchcock (1927), Harshman (1970), Carroll and Chang (1970)

CANDECOMP/PARAFAC (CP) Model

Model:
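In the standard three-way form of the CP model (assumed here), each tensor entry is approximated by m_ijk = sum over r of lambda_r * a_ir * b_jr * c_kr, which is a weighted sum of R vector outer products. The sketch below evaluates one such entry. The `Factor` type and the `cp_entry` function are hypothetical names chosen for illustration and are not part of SparTen.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dense factor matrix with `rank` columns, stored row-major
// (hypothetical helper type for this sketch).
struct Factor {
    std::size_t rows;
    std::size_t rank;
    std::vector<double> data;  // size rows * rank
    double operator()(std::size_t i, std::size_t r) const {
        return data[i * rank + r];
    }
};

// Model value m_ijk = sum_r lambda[r] * A(i,r) * B(j,r) * C(k,r).
double cp_entry(const std::vector<double>& lambda,
                const Factor& A, const Factor& B, const Factor& C,
                std::size_t i, std::size_t j, std::size_t k) {
    assert(A.rank == lambda.size());
    assert(B.rank == lambda.size());
    assert(C.rank == lambda.size());
    double m = 0.0;
    for (std::size_t r = 0; r < lambda.size(); ++r) {
        m += lambda[r] * A(i, r) * B(j, r) * C(k, r);
    }
    return m;
}
```

With R = 2, weights (2, 1) and single-row factors (1, 2), (3, 4) and (5, 6), the entry is 2*1*3*5 + 1*2*4*6 = 78.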

Page 4: Poisson for Sparse Count Data

Gaussian (typical): The random variable x is a continuous real-valued number.

Poisson: The random variable x is a discrete nonnegative integer.

Page 5: Sparse Poisson Tensor Factorization

Model: Poisson distribution (nonnegative factorization)

§ Nonconvex problem!
  § Assume R is given
§ Minimization problem with constraint
  § The decomposed vectors must be non-negative
§ Alternating Poisson Regression (Chi and Kolda, 2011)
  § Assume (d-1) factor matrices are known and solve for the remaining one

Page 6: New Method: Alternating Poisson Regression (CP-APR)

Repeat until converged:
  Fix B, C; solve for A
  Fix A, C; solve for B
  Fix A, B; solve for C

Convergence Theory

Theorem: The CP-APR algorithm will converge to a constrained stationary point if the subproblems are strictly convex and solved exactly at each iteration. (Chi and Kolda, 2011)


Page 8: CP-APR

Minimization problem is expressed as:
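In the notation of Chi and Kolda (2011), and assuming the usual Poisson negative log-likelihood with constant terms dropped, the problem can be written as below. Here $i$ ranges over all tensor entries, $x_i$ is an observed count and $m_i$ is the corresponding model value. The constraint set $\Omega$ (nonnegative weights $\lambda$ and nonnegative factor columns that each sum to one) is taken from that paper, not from this slide.

```latex
\min_{\mathcal{M}} \; f(\mathcal{M}) \equiv \sum_{i} \left( m_{i} - x_{i} \log m_{i} \right)
\quad \text{subject to} \quad
\mathcal{M} = [\lambda; A^{(1)}, \ldots, A^{(N)}] \in \Omega
```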

• 2 major approaches
  • Multiplicative Updates, like Lee & Seung (2000) for matrices, but extended by Chi and Kolda (2011) for tensors
  • Newton and Quasi-Newton methods for row-subproblems by Hansen, Plantenga and Kolda (2014)

Page 9: Key Elements of MU and PDNR Methods

Multiplicative Update (MU)
§ Key computations
  § Khatri-Rao product
  § Modifier (10+ iterations)
§ Key features
  § Factor matrix is updated all at once
  § Exploits the convexity of row subproblems for global convergence

Projected Damped Newton for Row-subproblems (PDNR)
§ Key computations
  § Khatri-Rao product
  § Constrained nonlinear Newton-based optimization for each row
§ Key features
  § Factor matrix can be updated by rows
  § Exploits the convexity of row-subproblems
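Both methods list the Khatri-Rao product as a key computation. As a minimal dense sketch, assuming row-major storage, the Khatri-Rao product of A (I x R) and B (J x R) is the (I*J) x R matrix whose row i*J + j is the elementwise product of row i of A and row j of B. The function name `khatri_rao` is illustrative. A sparse solver would typically form only the rows that correspond to nonzero tensor entries and not the full product.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Khatri-Rao (column-wise Kronecker) product of A (I x R) and B (J x R),
// both stored row-major. The result is (I*J) x R, row-major, and
// row (i*J + j) holds A(i,r) * B(j,r) for r = 0..R-1.
std::vector<double> khatri_rao(const std::vector<double>& A, std::size_t I,
                               const std::vector<double>& B, std::size_t J,
                               std::size_t R) {
    assert(A.size() == I * R);
    assert(B.size() == J * R);
    std::vector<double> K(I * J * R);
    for (std::size_t i = 0; i < I; ++i) {
        for (std::size_t j = 0; j < J; ++j) {
            for (std::size_t r = 0; r < R; ++r) {
                K[(i * J + j) * R + r] = A[i * R + r] * B[j * R + r];
            }
        }
    }
    return K;
}
```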

Page 10: CP-APR-MU

Key Computations
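The multiplicative update of Chi and Kolda (2011) scales the current factor B elementwise by Phi = (X_(n) ⊘ (B Pi)) Pi^T, where ⊘ is elementwise division. The sketch below applies one such step to the mode-1 factor of a three-way tensor held as a list of nonzeros. It is an illustration under stated assumptions and not the SparTen kernel: the names `Nonzero` and `mu_step_mode1` are hypothetical, storage is row-major, and the safeguards a production code needs (zero-division handling, convergence checks, the repeated inner iterations) are omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One nonzero of a 3-way sparse tensor (hypothetical type for this sketch).
struct Nonzero {
    std::size_t i, j, k;
    double value;
};

// One multiplicative update of the mode-1 factor B (I x R, row-major).
// A2 (J x R) and A3 (K x R) are the fixed factors of the other two modes.
void mu_step_mode1(const std::vector<Nonzero>& nnz,
                   std::vector<double>& B, std::size_t I,
                   const std::vector<double>& A2,
                   const std::vector<double>& A3,
                   std::size_t R) {
    assert(B.size() == I * R);
    std::vector<double> Phi(I * R, 0.0);
    std::vector<double> pi(R);
    for (const Nonzero& e : nnz) {
        // Row of the Khatri-Rao product for this nonzero, and the model value.
        double m = 0.0;
        for (std::size_t r = 0; r < R; ++r) {
            pi[r] = A2[e.j * R + r] * A3[e.k * R + r];
            m += B[e.i * R + r] * pi[r];
        }
        assert(m > 0.0);  // this sketch assumes a strictly positive model value
        for (std::size_t r = 0; r < R; ++r) {
            Phi[e.i * R + r] += e.value / m * pi[r];
        }
    }
    // Elementwise scaling: B <- B * Phi.
    for (std::size_t idx = 0; idx < B.size(); ++idx) {
        B[idx] *= Phi[idx];
    }
}
```

Each nonzero contributes only to its own row of Phi, so the accumulation is data parallel over nonzeros as long as writes to the same row are coordinated.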

Page 11: CP-APR-PDNR

Key Computations

Algorithm 1: CPAPR-PDNR algorithm

CPAPR_PDNR($\mathcal{X}$, $\mathcal{M}$)
Input: Sparse $N$-mode tensor $\mathcal{X}$ of size $I_1 \times I_2 \times \cdots \times I_N$ and the number of components $R$
Output: Kruskal tensor $\mathcal{M} = [\lambda; A^{(1)} \ldots A^{(N)}]$

Initialize
repeat
  for $n = 1, \ldots, N$ do
    Let $\Pi^{(n)} = (A^{(N)} \odot \cdots \odot A^{(n+1)} \odot A^{(n-1)} \odot \cdots \odot A^{(1)})^T$
    for $i = 1, \ldots, I_n$ do
      Find $b_i^{(n)}$ s.t. $\min_{b_i^{(n)} \ge 0} f_{\mathrm{row}}(b_i^{(n)}, x_i^{(n)}, \Pi^{(n)})$
    end
    $\lambda = e^T B^{(n)}$, where $B^{(n)} = [b_1^{(n)} \ldots b_{I_n}^{(n)}]^T$
    $A^{(n)} \leftarrow B^{(n)} \Lambda^{-1}$, where $\Lambda = \mathrm{diag}(\lambda)$
  end
until all mode subproblems converged
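The last two steps of the inner loop of Algorithm 1 rescale the updated factor: lambda collects the column sums of B^(n), and A^(n) is B^(n) with each column divided by its sum. A minimal sketch of that normalization follows, assuming row-major storage. The name `normalize_columns` is illustrative, and leaving all-zero columns untouched is a choice made for this sketch and not something the algorithm specifies.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Computes lambda = e^T B (the column sums of B) and rescales B in place
// to B * diag(lambda)^{-1}, so that every nonzero column sums to one.
// B is rows x R, row-major.
std::vector<double> normalize_columns(std::vector<double>& B,
                                      std::size_t rows, std::size_t R) {
    assert(B.size() == rows * R);
    std::vector<double> lambda(R, 0.0);
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t r = 0; r < R; ++r) {
            lambda[r] += B[i * R + r];
        }
    }
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t r = 0; r < R; ++r) {
            // Assumption of this sketch: skip columns whose sum is zero.
            if (lambda[r] > 0.0) {
                B[i * R + r] /= lambda[r];
            }
        }
    }
    return lambda;
}
```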

Page 12: Parallel CP-APR Algorithms

Page 13: Parallelizing CP-APR

§ Focus on on-node parallelism for multiple architectures
  § Multiple choices for programming
    § OpenMP, OpenACC, CUDA, Pthread …
  § Manage different low-level hardware features (cache, device memory, NUMA …)
§ Our solution: use Kokkos for productivity and performance portability
  § Abstraction of parallel loops
  § Abstraction of data layout (row-major, column-major, programmable memory)
  § Same code to support multiple architectures

Kokkos supports multiple architectures: Intel Multicore, Intel Manycore, NVIDIA GPU, IBM Power, AMD Multicore/APU, ARM

Page 14: What is Kokkos?

§ Templated C++ library by Sandia National Labs (Edwards, et al.)
  § Serves as a substrate layer for sparse matrix and vector kernels
  § Supports any machine precision
    § Float
    § Double
    § Quad and half float if needed
§ Kokkos::View() accommodates performance-aware multidimensional array data objects
  § Light-weight C++ class to
§ Parallelizing loops using the C++ language standard
  § Lambda
  § Functors
§ Extensive support of atomics

Page 15: Parallel Programming with Kokkos

§ Provides parallel loop operations using C++ language features
§ Conceptually, the usage is no more difficult than OpenMP. The annotations just go in different places.

Serial:
for (size_t i = 0; i < N; ++i) {
  /* loop body */
}

OpenMP:
#pragma omp parallel for
for (size_t i = 0; i < N; ++i) {
  /* loop body */
}

Kokkos:
parallel_for(N, [=] (const size_t i) {
  /* loop body */
});

Kokkos information courtesy of Carter Edwards
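The lambda and functor styles can be compared without a Kokkos installation. In the sketch below, `for_each_index` is a hypothetical serial stand-in for a loop dispatcher such as `parallel_for`: it only calls the body for each index in order, whereas a real back-end would distribute those calls across threads. The `Scale` functor and the two wrapper functions are likewise illustrative names.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical serial stand-in for a parallel loop dispatcher:
// calls body(i) for every i in [0, n).
template <typename Body>
void for_each_index(std::size_t n, const Body& body) {
    for (std::size_t i = 0; i < n; ++i) {
        body(i);
    }
}

// Functor form: the loop body is a class with a const operator().
struct Scale {
    double* data;
    double factor;
    void operator()(std::size_t i) const { data[i] *= factor; }
};

std::vector<double> scale_with_functor(std::vector<double> v, double factor) {
    for_each_index(v.size(), Scale{v.data(), factor});
    return v;
}

// Lambda form: the same body, with data and factor captured by value.
std::vector<double> scale_with_lambda(std::vector<double> v, double factor) {
    double* data = v.data();
    for_each_index(v.size(), [=](std::size_t i) { data[i] *= factor; });
    return v;
}
```

Both forms pass a callable that takes the loop index, which is why the loop body itself does not change when the dispatcher does.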

Page 16: Why Kokkos?

§ Complies with the C++ language standard!
§ Supports multiple back-ends
  § Pthread, OpenMP, CUDA, Intel TBB and Qthread
§ Supports multiple data layout options
  § Column- vs. row-major
  § Device/CPU memory
§ Supports different kinds of parallelism
  § Nesting support
  § Vector, threads, warp, etc.
  § Task parallelism (under development)

Page 17: Array Access by Kokkos

Kokkos::View<double **, Layout, Space>

Row-major: View<double **, Right, Space>
Column-major: View<double **, Left, Space>

[Figure: elements read by Thread 0 and Thread 1 under each layout]

Page 18: Array Access by Kokkos

Kokkos::View<double **, Layout, Space>

Row-major: View<double **, Right, Host>. Contiguous reads per thread.
Column-major: View<double **, Left, CUDA>. Coalesced reads within warp.

[Figure: elements read by Thread 0 and Thread 1 under each layout]
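The effect of the layout parameter can be seen from the index arithmetic alone. The sketch below gives the linear offset of element (i, j) in an n0 x n1 array under each convention. The function names are illustrative and this is plain C++, not Kokkos code. With the row-major (Right) mapping, a CPU thread that owns row i and walks over j touches consecutive addresses. With the column-major (Left) mapping, GPU threads i and i+1 reading the same j touch adjacent addresses, which is what allows coalescing.

```cpp
#include <cassert>
#include <cstddef>

// Row-major offset (the "Right" layout): the last index is contiguous.
std::size_t offset_right(std::size_t i, std::size_t j,
                         std::size_t n0, std::size_t n1) {
    assert(i < n0 && j < n1);
    return i * n1 + j;
}

// Column-major offset (the "Left" layout): the first index is contiguous.
std::size_t offset_left(std::size_t i, std::size_t j,
                        std::size_t n0, std::size_t n1) {
    assert(i < n0 && j < n1);
    return i + j * n0;
}
```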

Page 19: Parallel CP-APR-MU

Data Parallel

Page 20: Parallel CP-APR-PDNR

Data Parallel

Task Parallel

Page 21: Notes on Data Structure

§ Use Kokkos::View
§ Sparse tensor
  § Similar to the coordinate (COO) format in sparse matrix representation
§ Kruskal tensor & Khatri-Rao product
  § Provides options for row- or column-major
    § Kokkos::View provides an option to define the leading dimension.
    § Determined at compile time or runtime
§ Avoid atomics
  § Expensive on CPUs and manycore
  § Use an extra indexing data structure
§ CP-APR-PDNR
  § Creates a pool of tasks
  § A dedicated buffer space (Kokkos::View) is assigned to each individual task
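One way to read "COO format" and "extra indexing data structure" is sketched below. The slides do not say which structure is used, so this is an assumption: a CSR-style index that groups nonzeros by their mode-1 coordinate, letting one thread accumulate an entire factor row without atomic updates. `SparseTensor3`, `RowIndex` and `build_row_index` are hypothetical names, and standard vectors stand in for Kokkos::View.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sparse 3-way tensor in coordinate (COO) form:
// one index triple and one value per nonzero.
struct SparseTensor3 {
    std::vector<std::size_t> i, j, k;
    std::vector<double> value;
};

// Nonzeros grouped by one mode's index. Row r owns the nonzeros
// perm[row_ptr[r]] .. perm[row_ptr[r + 1] - 1].
struct RowIndex {
    std::vector<std::size_t> row_ptr;  // size rows + 1
    std::vector<std::size_t> perm;     // size nnz
};

// Builds the grouping with a counting sort over the given index array.
RowIndex build_row_index(const std::vector<std::size_t>& idx,
                         std::size_t rows) {
    RowIndex out;
    out.row_ptr.assign(rows + 1, 0);
    out.perm.resize(idx.size());
    for (std::size_t p = 0; p < idx.size(); ++p) {
        assert(idx[p] < rows);
        ++out.row_ptr[idx[p] + 1];  // count nonzeros per row
    }
    for (std::size_t r = 0; r < rows; ++r) {
        out.row_ptr[r + 1] += out.row_ptr[r];  // prefix sum gives row starts
    }
    std::vector<std::size_t> next(out.row_ptr.begin(), out.row_ptr.end() - 1);
    for (std::size_t p = 0; p < idx.size(); ++p) {
        out.perm[next[idx[p]]++] = p;  // place each nonzero in its row's slot
    }
    return out;
}
```

The index costs one extra integer per nonzero per mode, plus one per row.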

Page 22: Performance

Page 23: Performance Test

§ Strong scalability
  § Problem size is fixed
§ Random tensor
  § 3K x 4K x 5K, 10M nonzero entries
  § 100 outer iterations
§ Realistic problems
  § Count data (non-negative)
  § Available at http://frostt.io/
  § 10 outer iterations
§ Double precision

Data      | Dimensions               | Nonzeros | Rank
LBNL      | 2K x 4K x 2K x 4K x 866K | 1.7M     | 10
NELL-2    | 12K x 9K x 29K           | 77M      | 10
NELL-1    | 3M x 2M x 25M            | 144M     | 10
Delicious | 500K x 17M x 3M x 1K     | 140M     | 10

Page 24: CPAPR-MU on CPU (Random)

[Figure: execution time (s), 0 to 1800, versus core count, 1 to 28. Legend: Pi, Phi+ Update.]

CP-APR-MU method, 100 outer iterations, (3000 x 4000 x 5000, 10M nonzero entries), R=10, PC cluster, 2 Haswell (14 core) CPUs per node, MKL-11.3.3, HyperThreading disabled

Page 25: Results: CPAPR-MU Scalability

Platforms: Haswell CPU (1 core); 2 Haswell CPUs (14 cores); 2 Haswell CPUs (28 cores); KNL (68-core CPU); NVIDIA P100 GPU.
Each entry is Time (s) / Speedup.

Random:    1715* / 1,  279 / 6.14,   165 / 10.39,  20 / 85.74,  10 / 171.5
LBNL:      131 / 1,    32 / 4.09,    32 / 4.09,    103 / 1.27
NELL-2:    1226 / 1,   159 / 7.77,   92 / 13.32,   873 / 1.40
NELL-1:    5410 / 1,   569 / 9.51,   349 / 15.50,  1690 / 3.20
Delicious: 5761 / 1,   2542 / 2.26,  2524 / 2.28

100 outer iterations for the random problem
10 outer iterations for realistic problems
* Pre-Kokkos C++ code on 2 Haswell CPUs: 1 core, 2136 sec; 14 cores, 762 sec; 28 cores, 538 sec
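The speedup figures in these results are the one-core time divided by the parallel time. As a small worked example using the random-tensor numbers reported here, 1715 s on one core and 165 s on 28 cores give a speedup of about 10.39 and a parallel efficiency of about 0.37. The helper names below are illustrative.

```cpp
#include <cassert>

// Strong-scaling speedup: serial time divided by parallel time.
double speedup(double t_serial, double t_parallel) {
    assert(t_parallel > 0.0);
    return t_serial / t_parallel;
}

// Parallel efficiency: speedup divided by the number of cores used.
double efficiency(double t_serial, double t_parallel, int cores) {
    assert(cores > 0);
    return speedup(t_serial, t_parallel) / cores;
}
```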

Page 26: CPAPR-PDNR on CPU (Random)

[Figure: execution time (s), 0 to 900, versus core count, 1 to 28. Legend: Pi, RowSub.]

CPAPR-PDNR method, 100 outer iterations, 1831221 inner iterations total, (3000 x 4000 x 5000, 10M nonzero entries), R=10, PC cluster, 2 Haswell (14 core) CPUs per node, MKL-11.3.3, HyperThreading disabled

Page 27: Results: CPAPR-PDNR Scalability

          | Haswell CPU, 1 core | 2 Haswell CPUs, 14 cores | 2 Haswell CPUs, 28 cores
Data      | Time (s) | Speedup  | Time (s) | Speedup       | Time (s) | Speedup
Random    | 817*     | 1        | 73       | 11.19         | 44       | 18.58
LBNL      | 441      | 1        | 187      | 2.35          | 191      | 2.30
NELL-2    | 2162     | 1        | 326      | 6.63          | 319      | 6.77
NELL-1    | 17212    | 1        | 4241     | 4.05          | 3974     | 4.33
Delicious | 18992    | 1        | 3684     | 5.15          | 3138     | 6.05

100 outer iterations for the random problem
10 outer iterations for realistic problems
* Pre-Kokkos C++ code spends 3270 sec on 1 core

Page 28: Performance Issues

§ Our implementation exhibits very good scalability with the random tensor.
  § Similar mode sizes
  § Regular distribution of nonzero entries
  § Some cache effects
  § Kokkos is NUMA-aware for contiguous memory access (first-touch)
§ Some scalability issues with the realistic tensor problems.
  § Irregular nonzero distribution and disparity in mode sizes
  § Task-parallel code may have some memory locality issues when accessing the sparse tensor, Kruskal tensor, and Khatri-Rao product
  § Preprocessing could improve the locality
    § Explicit data partitioning (Smith and Karypis)
    § Possible to implement using Kokkos

Page 29: Memory Bandwidth (Stream Benchmark)

§ All cores deliver approximately 8x performance improvement from a single thread
§ Hard to scale using all cores with memory-bound code.

[Figure: Stream Benchmark on 2x 14 core Intel Haswell CPUs; bandwidth (MBytes/sec, 0 to 120000) versus number of cores (0 to 30).]

Page 30: Conclusion

§ Development of portable on-node parallel CP-APR solvers
  § Data parallelism for MU method
  § Mixed data/task parallelism for PDNR method
  § Multiple architecture support using Kokkos
§ Scalable performance for random sparse tensor
§ Future work
  § Projected Quasi-Newton for Row-subproblems (PQNR)
  § GPU and manycore support for PDNR and PQNR
  § Performance tuning to handle irregular nonzero distributions and disparity in mode sizes