Snap ML: A Hierarchical Framework for Machine Learning

C. Dünner*, T. Parnell*, D. Sarigiannis*, N. Ioannou*, A. Anghel*, G. Ravi+, M. Kandasamy+, H. Pozidis*

Snap ML is a new framework for efficient training of generalized linear models. It implements novel out-of-core techniques to enable GPU acceleration at scale, and it is built on a novel hierarchical version of the popular CoCoA framework [4] to enable multi-level distributed training. Snap ML can train a logistic regression classifier on the Criteo Terabyte Click Logs data in 1.5 minutes.

Contributions
A unique feature of Snap ML is its design, which is aligned with the architecture of modern heterogeneous systems:

Framework     | Models   | GPU Acceleration | Distributed Training | Sparse Data Support
Scikit-learn  | ML\{DL}  | No               | No                   | Yes
Spark MLlib   | ML\{DL}  | No               | Yes                  | Yes
TensorFlow    | ML       | Yes              | Yes                  | Limited
Snap ML       | GLMs     | Yes              | Yes                  | Yes
(ML\{DL}: machine learning excluding deep learning; GLMs: generalized linear models)

References
[1] Parallel Model Training Without Compromising Convergence. N. Ioannou, C. Dünner, K. Kourtis, T. Parnell. MLSys Workshop (2018); oral, Fri. 7 December.
[2] Tera-Scale Coordinate Descent on GPUs. T. Parnell, C. Dünner, K. Atasu, M. Sifalakis, H. Pozidis. FGCS (2018).
[3] Efficient Use of Limited-Memory Accelerators for Linear Learning on Heterogeneous Systems. C. Dünner, T. Parnell, M. Jaggi. NIPS (2017).
[4] CoCoA: A General Framework for Communication-Efficient Distributed Optimization. V. Smith, S. Forte, C. Ma, M. Takac, M. Jordan, M. Jaggi. JMLR (2018).

Tera-Scale Benchmark
[Figure: cluster of four worker nodes, each with CPUs and GPUs, connected by a 10 Gbit/s network.]

Local Solver
For large datasets the CPU-GPU link can become the bottleneck. Snap ML therefore uses a streaming pipeline:
1. Using CUDA streams, the next batch of data is copied to the GPU while the current batch is being trained on.
2. The CPU is used to generate the random numbers needed for sampling in the GPU solver.
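The streaming pipeline can be illustrated with a small, CPU-only sketch. Snap ML overlaps host-to-device copies with GPU compute via CUDA streams; in this toy version a background thread stands in for the copy stream, and `load_batch` / `train_on` are hypothetical stand-ins for the data transfer and the GPU solver (all names are illustrative, not Snap ML code).

```python
import threading
import queue

def load_batch(i):
    """Stand-in for a host-to-device copy of batch i."""
    return [i] * 4  # pretend this is the copied chunk of training data

def train_on(batch):
    """Stand-in for the GPU solver working on the current batch."""
    return sum(batch)

def streaming_train(num_batches):
    # Double buffer: at most one batch is prefetched ahead of the solver,
    # mirroring the copy-next-while-training-current pattern.
    prefetched = queue.Queue(maxsize=1)

    def copier():
        for i in range(num_batches):
            prefetched.put(load_batch(i))  # "copy" the next batch in the background

    t = threading.Thread(target=copier)
    t.start()
    results = []
    for _ in range(num_batches):
        batch = prefetched.get()          # current batch is already resident
        results.append(train_on(batch))   # train while the copier fetches the next one
    t.join()
    return results

print(streaming_train(3))  # -> [0, 4, 8]
```

The bounded queue is the key design choice: `put` blocks once one batch is waiting, so copying and training proceed in lockstep without unbounded buffering, just as a double-buffered CUDA stream pipeline would.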
The Snap ML Framework
Snap ML supports generalized linear models, e.g. SVM, Lasso, Ridge Regression, and Logistic Regression. The user describes the application using high-level Python APIs for single-node and multi-node applications.
https://www.zurich.ibm.com/snapml/

Hierarchical Optimization Framework
Snap ML takes advantage of non-uniform interconnects by distributing the training workload at three levels:
Level 1: distribution across nodes in a cluster
Level 2: distribution across heterogeneous compute units
Level 3: distribution across cores/threads

CPU solver: parallel primal/dual coordinate descent solver [1]
GPU solver: Twice-Parallel Asynchronous Coordinate Descent [2]

Single-node performance
Task: logistic regression. Dataset: criteo-kaggle (45 million examples). Infrastructure: Power AC922 server with a V100 GPU.

Tera-scale data
# examples: 4.5 billion; # features: 1 million; size: ~3 TB.
Performance of Snap ML is compared against other frameworks and previously published results for training a logistic regression classifier on the Terabyte Click-Logs dataset.

Convergence. Consider Algorithm 1 applied to an objective of the form min_α f(Aα) + Σ_i g_i(α_i) (as in [4]), where the local subproblems are solved with relative accuracy θ in each iteration. Let f be L-smooth and convex and let the g_i be general convex functions.
Then, after T1 outer iterations with T2 inner iterations each, the suboptimality is bounded by a term that decreases sublinearly in T1·T2. Furthermore, if the g_i are strongly convex, this rate improves to a linear one (the exact bounds follow the analysis of [4]).

Data-local subtasks are defined by recursively applying a block-separable upper bound to the objective, similar to [4], so that the subtasks operate on disjoint partitions of the data.

Trade-off between the parameters T1 and T2:
[Figure: time to suboptimality (seconds, 65-90) vs. number of inner iterations T2 (1-100), for a fast network (InfiniBand) and a slow network (1 Gbit Ethernet).]

In the Python API, the user can specify which GPUs to use (e.g. [0,1,2,3]) and which nodes to use (e.g. node1,node2,node3,node4).

Tera-scale benchmark results
Task: logistic regression. Dataset: criteo-tera-byte (1 billion examples). Infrastructure: 4x Power AC922 servers with 4x V100 GPUs each.
[Figure: test LogLoss (0.128-0.133) vs. training time (1-10,000 minutes) for LIBLINEAR [1 core], Vowpal Wabbit [12 cores], Spark MLlib [512 cores], TensorFlow [60 worker machines, 29 parameter machines], TensorFlow on Spark [12 executors], TensorFlow [16 V100 GPUs], and Snap ML [16 V100 GPUs]. Black: previously published results; orange: run on our hardware (4x IBM Power9 with 4x NVIDIA V100 GPUs each).]

[Figure (single-node): test LogLoss (0.45-0.56) vs. training time (10-1000 seconds) for TensorFlow, Sklearn (LIBLINEAR), and Snap ML.]

*IBM Research – Zurich, Switzerland    +IBM Systems – Bangalore, India
NeurIPS | 2018
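The recursive construction of data-local subtasks over disjoint partitions can be sketched as a toy partitioning scheme. This is an illustrative sketch only (all names and sizes are invented, not Snap ML code): coordinates are split across nodes (level 1), each node's share across its GPUs (level 2), and each GPU's share across its threads (level 3), so every compute unit owns a disjoint block.

```python
def split(indices, parts):
    """Split a list of coordinate indices into `parts` disjoint strided chunks."""
    return [indices[i::parts] for i in range(parts)]

def hierarchical_partition(num_coordinates, num_nodes, gpus_per_node, threads_per_gpu):
    """Map each (node, gpu, thread) triple to its disjoint block of coordinates."""
    coords = list(range(num_coordinates))
    plan = {}
    for n, node_part in enumerate(split(coords, num_nodes)):               # level 1: nodes
        for g, gpu_part in enumerate(split(node_part, gpus_per_node)):     # level 2: GPUs
            for t, thr_part in enumerate(split(gpu_part, threads_per_gpu)):  # level 3: threads
                plan[(n, g, t)] = thr_part
    return plan

plan = hierarchical_partition(32, num_nodes=2, gpus_per_node=2, threads_per_gpu=2)
# The union of all subtasks covers every coordinate exactly once,
# so the subproblems are block-separable over disjoint partitions.
assert sorted(i for part in plan.values() for i in part) == list(range(32))
```

In the actual framework, each such block corresponds to a local subproblem obtained from the block-separable upper bound, solved independently at its level of the hierarchy before results are aggregated upward.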