Page 1: S6211 - CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs

Wei Tan, IBM T. J. Watson Research Center

[email protected]

CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs

Page 2: S6211 - CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs

Agenda

Why accelerate recommendation (matrix factorization) using GPUs?

What are the challenges?

How does cuMF tackle the challenges?

So what is the result?


Page 3: S6211 - CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs

Why do we need a fast and scalable recommender system?

Recommendation is pervasive

Drives 80% of Netflix's watch hours¹

Digital ads market in the US: $37.3 billion²

Need: fast, scalable, and economical

Page 4: S6211 - CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs

ALS (alternating least squares) for collaborative filtering

Input: users' ratings on some items

Output: predictions for all missing ratings

How: factorize the rating matrix R into X and Θᵀ (R ≈ XΘᵀ) and minimize the loss on the observed ratings (written out below).

ALS: iteratively solve for X with Θ fixed, then for Θ with X fixed.

[Figure: R factorized into user factors X (rows xu) and item factors Θᵀ (columns θv)]
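Written out (in the slide's notation, with Ω the set of observed ratings and λ the regularization weight), the objective being minimized is

$$ \min_{X,\Theta} \sum_{(u,v)\in\Omega} \left(r_{uv} - x_u^\top \theta_v\right)^2 + \lambda\left(\sum_u \lVert x_u\rVert^2 + \sum_v \lVert \theta_v\rVert^2\right), $$

and with Θ fixed each xu has the closed-form update

$$ x_u = \Big(\sum_{v\in\Omega_u} \theta_v \theta_v^\top + \lambda I\Big)^{-1} \sum_{v\in\Omega_u} r_{uv}\,\theta_v, $$

with the symmetric update for each θv when X is fixed.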

Page 5: S6211 - CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs

Matrix factorization/ALS is versatile

Recommendation: factorize the user × item rating matrix (R ≈ XΘᵀ)

Word embedding: factorize the word × word co-occurrence matrix

Topic model: factorize the document × word matrix

A very important algorithm to accelerate.

Page 6: S6211 - CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs

Challenges of fast and scalable ALS

• ALS needs to solve, for every user (and symmetrically every item), the normal equation above: the sparse aggregation is an spmm-style operation (cublas), and the resulting f × f system is solved by LU or Cholesky decomposition (cublas). A minimal batched-solve sketch follows below.

• Challenge 1: accessing and aggregating many θvs is irregular (R is sparse) and memory intensive.

• Challenge 2: a single GPU cannot handle big m, n, and nnz.
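As a concrete illustration of the solve step annotated above, here is a minimal sketch (not cuMF's actual code) that LU-factorizes and solves all per-user f × f systems in two batched cublas calls; the function name and pointer-array layout are illustrative, and the Cholesky route is analogous.

```c
#include <cuda_runtime.h>
#include <cublas_v2.h>

/* Sketch: solve A_u x_u = b_u for all m users in one batch.
 * dA, dB are device arrays of m pointers to the f*f matrices
 * A_u = sum_v theta_v*theta_v^T + lambda*I and the f*1 right-hand
 * sides; each x_u overwrites its b_u. */
int solve_users_batched(cublasHandle_t handle, float **dA, float **dB,
                        int f, int m)
{
    int *dPiv, *dInfo, hInfo = 0;
    cudaMalloc((void **)&dPiv,  sizeof(int) * f * m);  /* pivots, per matrix */
    cudaMalloc((void **)&dInfo, sizeof(int) * m);      /* per-matrix status  */

    /* LU-factorize every A_u in one call... */
    cublasSgetrfBatched(handle, f, dA, f, dPiv, dInfo, m);
    /* ...then back-substitute every right-hand side in one call. */
    cublasSgetrsBatched(handle, CUBLAS_OP_N, f, 1,
                        (const float *const *)dA, f, dPiv, dB, f, &hInfo, m);

    cudaFree(dPiv);
    cudaFree(dInfo);
    return hInfo;  /* 0 on success */
}
```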

Page 7: S6211 - CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs

Challenge 1: memory access

Nvidia K40: memory bandwidth 288 GB/s, compute 4.3 Tflops.

Higher flops demand higher operational intensity (more flops per byte), which in turn demands caching!

[Roofline figure: x-axis is operational intensity (flops/byte), y-axis is Gflops/s; the memory ceiling is 288 GB/s and the compute ceiling 4.3 Tflops, so kernels below about 15 flops/byte sit on the bandwidth slope (under-utilized) and only above it can the GPU be fully-utilized.]
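The ridge point in the figure follows directly from the two hardware ceilings:

$$ \mathrm{OI}_{\text{ridge}} = \frac{4.3~\text{Tflops}}{288~\text{GB/s}} \approx 15~\text{flops/byte}. $$

The θv aggregation in ALS streams a lot of factor data per flop, so without caching it lands well below 15 flops/byte and is bandwidth-bound.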

Page 8: S6211 - CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs

Address challenge 1: memory-optimized ALS

• To obtain each xu, the dominant cost is forming the aggregate Σv θv θvᵀ (the matrix in the update above). cuMF therefore:

1. Reuses θvs across many users

2. Loads a portion of Θ into shared memory (smem)

3. Tiles and aggregates in registers (see the sketch below)
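A minimal sketch of steps 2 and 3 under simplifying assumptions: f = 32, R in CSR form, one thread block per user, and one thread per output element. cuMF's actual kernel tiles registers more aggressively and handles larger f; this only illustrates the staging-and-accumulating pattern.

```c
#define F    32   /* latent dimension (illustrative) */
#define TILE 8    /* theta vectors staged per shared-memory pass */

/* Compute A_u = sum over rated items v of theta_v * theta_v^T.
 * Launch with one block per user and F*F threads per block. */
__global__ void form_A(const float *theta,  /* F x n item factors */
                       const int *row_ptr,  /* CSR row pointers, per user */
                       const int *col_idx,  /* rated item ids */
                       float *A)            /* one F x F output per user */
{
    __shared__ float s[TILE][F];
    int u = blockIdx.x;
    int i = threadIdx.x % F, j = threadIdx.x / F; /* this thread owns A[i][j] */
    float acc = 0.f;                              /* register accumulator */

    for (int p = row_ptr[u]; p < row_ptr[u + 1]; p += TILE) {
        int m = min(TILE, row_ptr[u + 1] - p);
        /* stage up to TILE theta vectors into shared memory */
        for (int t = threadIdx.x; t < m * F; t += blockDim.x)
            s[t / F][t % F] = theta[(size_t)col_idx[p + t / F] * F + t % F];
        __syncthreads();
        for (int k = 0; k < m; ++k)               /* aggregate in register */
            acc += s[k][i] * s[k][j];
        __syncthreads();
    }
    A[(size_t)u * F * F + j * F + i] = acc;       /* column-major A_u */
}
```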

Page 9: S6211 - CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs

Address challenge 1: memory-optimized ALS

[Figure: the f × f aggregate θv θvᵀ is computed in tiles; tiles of θv are staged in shared memory and the partial products are accumulated in registers.]

Page 10: S6211 - CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs

Address challenge 2: scale-up ALS on multiple GPUs

Model parallel: each GPU solves a portion of the model.

[Figure: model-parallel partitioning of the factor matrices across GPUs]

Page 11: S6211 - CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs

Address challenge 2: scale-up ALS on multiple GPUs

Data parallel: each GPU solves with a portion of the training data.

[Figure: data-parallel partitioning of the ratings, combined with model parallelism within each partition]
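A minimal sketch of how the data-parallel outer loop can be driven from the host, with one OpenMP thread per GPU. The partition type and the form_and_solve stub are illustrative, not cuMF's API; the implementation slide later confirms OpenMP is used.

```c
#include <omp.h>
#include <cuda_runtime.h>

/* Illustrative type: one ratings partition per GPU. */
typedef struct { const int *rows, *cols; const float *vals; long nnz; } partition_t;

/* Stub for the per-partition form-and-solve step sketched earlier. */
static void form_and_solve(const partition_t *p) { (void)p; /* ... */ }

/* One OpenMP thread drives one GPU; each GPU solves against its own
 * slice of the training data, and the partial results are combined by
 * the cross-GPU reduction on the next slide. */
void solve_partitions(int ngpu, const partition_t *parts)
{
#pragma omp parallel num_threads(ngpu)
    {
        int g = omp_get_thread_num();
        cudaSetDevice(g);
        form_and_solve(&parts[g]);
    }
}
```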

Page 12: S6211 - CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs

Address challenge 2: parallel reduction

Data parallel needs cross-GPU reduction.

[Figure: one-phase parallel reduction vs. two-phase parallel reduction, where GPUs reduce intra-socket first and then inter-socket]
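A minimal sketch of the two-phase scheme for four GPUs, two per socket; the pairing of GPUs {0,1} and {2,3} to sockets and the buffer layout are assumptions for illustration, not cuMF's actual code.

```c
#include <cuda_runtime.h>
#include <cublas_v2.h>

/* dst (on dst_dev) += src (on src_dev), staging through tmp on dst_dev. */
static void reduce_into(cublasHandle_t h, float *dst, int dst_dev,
                        const float *src, int src_dev, float *tmp, int n)
{
    const float one = 1.0f;
    cudaMemcpyPeer(tmp, dst_dev, src, src_dev, n * sizeof(float));
    cublasSaxpy(h, n, &one, tmp, 1, dst, 1);  /* dst += staged src */
}

/* buf[g]: GPU g's partial result; after the call, buf[0] holds the sum.
 * h[g] is a cublas handle created on GPU g; tmp[g] is scratch on GPU g. */
void two_phase_reduce(float *buf[4], float *tmp[4], cublasHandle_t h[4], int n)
{
    /* Phase 1: intra-socket; each socket reduces over its fast local links. */
    cudaSetDevice(0); reduce_into(h[0], buf[0], 0, buf[1], 1, tmp[0], n);
    cudaSetDevice(2); reduce_into(h[2], buf[2], 2, buf[3], 3, tmp[2], n);
    /* Phase 2: a single inter-socket transfer combines the socket sums. */
    cudaSetDevice(0); reduce_into(h[0], buf[0], 0, buf[2], 2, tmp[0], n);
}
```

Phase 1 confines most traffic to the faster intra-socket links, so only one slower inter-socket transfer remains.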

Page 13: S6211 - CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs

Recap: cuMF tackled two challenges

• ALS needs to solve the normal equations from page 4: spmm-style aggregation and LU or Cholesky decomposition, both via cublas.

• Challenge 1: accessing and aggregating many θvs is irregular (R is sparse) and memory intensive; tackled by the memory-optimized kernels (pages 7-9).

• Challenge 2: a single GPU cannot handle big m, n, and nnz; tackled by model/data parallelism and parallel reduction (pages 10-12).

Page 14: S6211 - CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs

Connect cuMF to Spark MLlib

Spark applications relying on mllib/ALS need no change.

The modified mllib/ALS detects GPUs and offloads the computation.

Leverage the best of Spark (scale-out) and GPUs (scale-up).

[Stack: ALS apps → mllib/ALS → cuMF JNI → cuMF]
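A minimal sketch of what the JNI boundary might look like; the Java class, method name, and argument layout here are hypothetical, not cuMF's actual binding.

```c
#include <jni.h>
#include <cuda_runtime.h>

/* Hypothetical native method backing the modified mllib/ALS: Spark hands
 * one partition's ratings across JNI; the native side moves them to the
 * GPU and runs the cuMF solver there. */
JNIEXPORT void JNICALL
Java_CuMFJNI_solvePartition(JNIEnv *env, jobject self,
                            jfloatArray jratings, jint rank)
{
    (void)self; (void)rank;  /* rank would select the factor dimension f */
    jsize n = (*env)->GetArrayLength(env, jratings);
    jfloat *host = (*env)->GetFloatArrayElements(env, jratings, NULL);

    float *dev;
    cudaMalloc((void **)&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    /* ... launch cuMF ALS kernels on dev ... */

    cudaFree(dev);
    (*env)->ReleaseFloatArrayElements(env, jratings, host, JNI_ABORT);
}
```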

Page 15: S6211 - CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs

Connect cuMF to Spark MLlib

[Figure: two Power 8 nodes, each with two K40 GPUs; ratings are partitioned into RDDs on the CPUs, Spark shuffles parameters between partitions, and CUDA solver kernels run on each GPU.]

RDD on CPU: distributes the rating data and shuffles the parameters.

Solver on GPU: forms and solves the least-squares subproblems.

Able to run on multiple nodes, and on multiple GPUs per node.

Page 16: S6211 - CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs

Implementation

In C (ca. 10k LOC)

CUDA 7.0/7.5, GCC with OpenMP v3.0

Baselines:

– LIBMF: SGD on 1 node [RecSys '14]

– NOMAD: SGD on >1 node [VLDB '14]

– SparkALS: ALS on Spark

– Factorbird: SGD + parameter server for MF

– Facebook: enhanced Giraph

Page 17: S6211 - CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs

CuMF performance

X-axis: time in seconds; Y-axis: Root Mean Square Error (RMSE) on the test set.

• 1 GPU vs. 30 cores: cuMF is slightly faster than LIBMF and NOMAD.

• cuMF scales well on 1, 2, and 4 GPUs.

Page 18: S6211 - CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs

Effectiveness of memory optimization

X-axis: time in seconds; Y-axis: Root Mean Square Error (RMSE) on the test set.

• Aggressively using registers: 2x speedup.

• Using texture: 25%-35% faster.
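One common way to obtain the texture benefit on Kepler-class GPUs like the K40 is to route read-only factor loads through the read-only (texture) data cache with __ldg(); this is a generic idiom, not necessarily how cuMF binds its textures.

```c
/* Dot products over theta routed through the read-only (texture) cache:
 * __ldg() hints that theta is immutable for the kernel's lifetime. */
__global__ void dot_readonly(const float *__restrict__ theta,
                             const float *__restrict__ x,
                             float *out, int f)
{
    float acc = 0.f;
    for (int i = threadIdx.x; i < f; i += blockDim.x)
        acc += __ldg(&theta[i]) * x[i];
    atomicAdd(out, acc);  /* combine per-thread partial sums */
}
```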

Page 19: S6211 - CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs

CuMF performance and cost

• cuMF @4 GPUs ≈ NOMAD @64 HPC nodes ≈ 10x NOMAD @32 AWS nodes.

• cuMF @4 GPUs ≈ 10x SparkALS @50 nodes, at ≈1% of its cost.

Page 20: S6211 - CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs

CuMF accelerated Spark on Power 8

• cuMF @2 K40s achieves a 6+x training speedup (from 1.3k sec down to 193 sec).

*GUI designed by Amir Sanjar

Page 21: S6211 - CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs

Conclusion

Why accelerate recommendation (matrix factorization) using GPUs?

– Need to be fast, scalable and economic

What are the challenges?

– Memory access, scale to multiple GPUs

How does cuMF tackle the challenges?

– Optimize memory access, parallelism and communication

So what is the result?

– Up to 10x as fast, 100x as cost-efficient

– Use cuMF standalone or with Spark

– GPU can tackle ML problems beyond deep learning!


Page 22: S6211 - CuMF: Large-Scale Matrix Factorization on Just One Machine with GPUs

Wei Tan, IBM T. J. Watson Research Center

[email protected]

http://ibm.biz/wei_tan

Thank you. Questions?

Faster and Cheaper: Parallelizing Large-Scale Matrix Factorization on GPUs. Wei Tan, Liangliang Cao, Liana Fong. HPDC 2016. http://arxiv.org/abs/1603.03820

Source code available soon.