Deep Learning at Twitter's Scale
Cibele Montez Halasz, Machine Learning Engineer @ Twitter Cortex
October 11, 2018
Agenda
1. Background
2. Workflow/Platform
3. Modeling and Optimizations
4. Additional Performance Gains
5. GPU
Background
● Challenges: Characteristics of Platform
● Data Shift
Sparsity of Data
Challenges
Source: Luca Belli/Dan Shiebler, June, 2018
[Figure: very sparse data]
Speed
Source: Jacob Kastrenakes, The Verge, July, 2018
Data Shift
Source: Luca Belli/Dan Shiebler, June, 2018
Machine Learning at the Company
● Environment
● Modeling: some use cases
Environment
Environment: ML Overview
Source: Luca Belli/Dan Shiebler, June, 2018
[Diagram: shared ML environment]
● Team A’s Data Aggregation and Feature Extraction Job
● Cortex Embedding Generation Pipeline
● Feature Registry
● Team A’s / Team B’s / Team C’s Machine Learning Models
Environment: ML Workflows
Source: Devin Goodsell, NYC Machine Learning Meetup, June 20, 2018
Environment: ML Training
Environment: Priorities
● Feature Addition → Scalable data
● Data Addition → Scalable data
● Training → Fast, robust training engine
● Deployment → Seamless and tested ML services
● A/B Test → Good A/B test environment
Modeling and Optimizations with TensorFlow
[Model architecture: Discretizer → Full Sparse → Dense MLP → Full Sparse]
Sparse Linear Layer
[Diagram: input values V1 … Vn mapped into K weight buckets]
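The layer can be sketched in plain NumPy (a hedged illustration, not Twitter's implementation; the modulo stands in for whatever hashing maps raw feature ids into the K weight buckets):

```python
import numpy as np

def sparse_linear(feature_ids, feature_values, weights, bias):
    # Map each (potentially enormous) raw feature id into one of the
    # K weight buckets, then take the weighted sum of the gathered
    # weight rows -- the sparse equivalent of x @ W + b.
    buckets = feature_ids % weights.shape[0]   # stand-in for a real hash
    return feature_values @ weights[buckets] + bias

rng = np.random.default_rng(0)
K, dim = 1 << 10, 8                   # K buckets, 8 output units
W = rng.normal(size=(K, dim))
b = np.zeros(dim)
out = sparse_linear(np.array([37, 123456789, 42]),
                    np.array([1.0, 0.5, 2.0]), W, b)
```

Only three weight rows are ever touched, no matter how large the raw feature space is.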
Sparse Linear Layer: First Approach
Source: TensorFlow
Sparse Linear Layer: Final Approach
Source: TensorFlow
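A rough NumPy sketch of the difference (assumed details; the slides' actual TensorFlow code is not reproduced): the first approach effectively multiplies a densified batch against the full weight matrix, analogous to tf.sparse_tensor_dense_matmul, while the final approach gathers only the weight rows of active features and scatter-adds them:

```python
import numpy as np

def baseline(sp_indices, sp_values, batch_size, weights):
    # First approach: densify the sparse batch, then an ordinary matmul.
    # Assumes no duplicate (row, feature) pairs in sp_indices.
    dense = np.zeros((batch_size, weights.shape[0]))
    dense[sp_indices[:, 0], sp_indices[:, 1]] = sp_values
    return dense @ weights

def gather_based(sp_indices, sp_values, batch_size, weights):
    # Final approach: touch only the weight rows of active features and
    # scatter-add each scaled row into its example's output row.
    out = np.zeros((batch_size, weights.shape[1]))
    np.add.at(out, sp_indices[:, 0],
              sp_values[:, None] * weights[sp_indices[:, 1]])
    return out

rng = np.random.default_rng(1)
W = rng.normal(size=(10_000, 4))
idx = np.array([[0, 17], [0, 9_999], [1, 512]])   # (example, feature) pairs
vals = np.array([1.0, -2.0, 0.5])
```

The gather-based version never materializes the mostly-zero batch, which is where the savings come from.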
Sparse Linear Layer
● Optimizers
  ○ SGD
  ○ Lazy Adam
Source: Berkeley Artificial Intelligence Research (BAIR) Lab
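The lazy-Adam idea can be sketched in NumPy (hedged: the real TensorFlow LazyAdamOptimizer differs in bookkeeping details) — moments and parameters are updated only for rows that actually received a sparse gradient:

```python
import numpy as np

def lazy_adam_step(param, m, v, rows, grads, t,
                   lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # Update first/second moment estimates and parameters only for the
    # `rows` touched by this sparse gradient; all other rows (and their
    # moments) stay untouched, which is what makes it "lazy".
    # Assumes `rows` contains no duplicates.
    m[rows] = b1 * m[rows] + (1 - b1) * grads
    v[rows] = b2 * v[rows] + (1 - b2) * grads ** 2
    m_hat = m[rows] / (1 - b1 ** t)
    v_hat = v[rows] / (1 - b2 ** t)
    param[rows] -= lr * m_hat / (np.sqrt(v_hat) + eps)

param = np.ones((5, 2))
m = np.zeros_like(param)
v = np.zeros_like(param)
lazy_adam_step(param, m, v, rows=np.array([1, 3]),
               grads=np.array([[0.5, -0.5], [1.0, 2.0]]), t=1)
```

Plain Adam would decay the moments of every row on every step; skipping the untouched rows keeps sparse-layer updates cheap.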
Sparse Linear Layer: Variable Partitioning
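Variable partitioning can be sketched as a row-wise split of the large weight matrix across shards (in the spirit of TensorFlow's partitioned variables; the shard-assignment details here are illustrative, not Twitter's code):

```python
import numpy as np

def partition(weights, num_shards):
    # Row-wise split; the first (rows % num_shards) shards get one extra
    # row, mirroring an even fixed-size partitioning.
    return np.array_split(weights, num_shards, axis=0)

def lookup(shards, row):
    # Route a global row id to the shard that owns it, then index locally.
    sizes = np.array([s.shape[0] for s in shards])
    starts = np.concatenate(([0], np.cumsum(sizes)[:-1]))
    shard_id = int(np.searchsorted(starts, row, side="right")) - 1
    return shards[shard_id][row - starts[shard_id]]

W = np.arange(20).reshape(10, 2)
shards = partition(W, 3)
```

Spreading the rows across parameter servers lets lookups and gradient pushes for different feature buckets land on different machines instead of one hot spot.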
Sparse Linear Layer: Variable Partitioning: Profiling
~33% reduction
Sparse Linear Layer: Online Normalization
● Example:
  Input: input_feature (value == 1M)
  ⇒ weight_gradient == 1M
  ⇒ update = 1M * learning_rate
  ⇒ ?
Source: Nicolas Koumchatzky
Sparse Linear Layer: Online Normalization
● Normalization of input values
Source: Nicolas Koumchatzky
● Normalized value belongs to [-1, 1]
● Trainable per-feature bias: discriminate absence and presence of features
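A hedged sketch of online input normalization (running per-feature min/max here; the exact statistics tracked are not specified on the slide): the 1M-valued feature from the earlier example gets squashed into [-1, 1] before it can blow up the gradients. A trainable per-feature bias (not modeled here) would then let the network distinguish an absent feature from one present with value 0.

```python
import numpy as np

class OnlineMinMaxNormalizer:
    def __init__(self, num_features):
        self.lo = np.full(num_features, np.inf)
        self.hi = np.full(num_features, -np.inf)

    def update(self, batch):
        # Fold a new batch into the running per-feature min/max.
        self.lo = np.minimum(self.lo, batch.min(axis=0))
        self.hi = np.maximum(self.hi, batch.max(axis=0))

    def normalize(self, batch):
        # Affinely map each feature into [-1, 1] using the stats so far;
        # clip handles values outside the range seen during update().
        span = np.where(self.hi > self.lo, self.hi - self.lo, 1.0)
        return np.clip(2.0 * (batch - self.lo) / span - 1.0, -1.0, 1.0)

norm = OnlineMinMaxNormalizer(2)
norm.update(np.array([[0.0, -3.0], [1e6, 3.0]]))  # the pathological 1M input
scaled = norm.normalize(np.array([[1e6, 0.0]]))
```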
Additional Performance Gains
Hogwild
Source: Hogwild!, UC Berkeley
Source: TensorFlow
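Hogwild!-style asynchronous SGD, sketched with Python threads (illustrative; the slides show the TensorFlow version): all workers update one shared weight vector with no locks, accepting the occasional clobbered update because sparse gradients rarely collide.

```python
import threading
import numpy as np

def hogwild_sgd(X, y, num_threads=4, steps=2000, lr=0.02):
    w = np.zeros(X.shape[1])            # shared, deliberately unlocked

    def worker(seed):
        rng = np.random.default_rng(seed)
        for _ in range(steps):
            i = rng.integers(len(X))
            grad = (X[i] @ w - y[i]) * X[i]   # squared-loss gradient
            w[:] = w - lr * grad              # racy read-modify-write

    threads = [threading.Thread(target=worker, args=(s,))
               for s in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 3.0])
X = rng.normal(size=(200, 3))
y = X @ true_w                          # noiseless linear targets
w = hogwild_sgd(X, y)
```

Despite the races, the shared weights still converge close to the true solution, which is the Hogwild! observation.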
Custom Ops
GPU vs. CPU Metrics
GPU Benchmarks

Batch Size | CPU: Baseline     | CPU: After Optimizations | GPU: Baseline     | GPU: After Optimizations
256        | 1024 samples/s    | 7372 samples/s           | 5504 samples/s    | 22528 samples/s
512        | 1638 samples/s    | 11264 samples/s          | 8448 samples/s    | 21504 samples/s
1024       | 2355 samples/s    | 13312 samples/s          | 10752 samples/s   | 22528 samples/s

Baseline = tf.sparse_tensor_dense_matmul
GPU benchmarks were run on NVIDIA Tesla K80 GPUs; CPU benchmarks on Intel Xeon Platinum 8180 processors.
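As a quick back-of-the-envelope reading of the table, the optimized GPU path versus the CPU baseline at each batch size:

```python
# Throughput in samples/s, copied from the benchmark table above.
cpu_baseline = {256: 1024, 512: 1638, 1024: 2355}
gpu_optimized = {256: 22528, 512: 21504, 1024: 22528}

# Speedup of the optimized GPU path over the CPU baseline: ~22x at
# batch 256, shrinking toward ~10x as the CPU baseline scales up.
speedup = {bs: gpu_optimized[bs] / cpu_baseline[bs] for bs in cpu_baseline}
```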
Acknowledgements
● Andrew Bean
● Ricardo Cervera-Navarro
● Priyank Jain
● Ruhua Jiang
● Nicholas Léonard
● Briac Marcatté
● Mahak Patidar
● Tim Sweeney
● Pavan Yalamanchili
● Yi Zhuang
Thank you! Questions?
Follow me on Twitter: @cibelemh
Email me at: [email protected]