Deep Learning at Twitter's Scale
Cibele Montez Halasz, Machine Learning Engineer @ Twitter Cortex
October 11, 2018
Agenda
1. Background
2. Workflow/Platform
3. Modeling and Optimizations
4. Additional Performance Gains
5. GPU
Background
● Challenges: Characteristics of Platform
● Data Shift
Sparsity of Data
Challenges
Source: Luca Belli/Dan Shiebler, June, 2018
[Figure: very sparse data]
Speed
Source: Jacob Kastrenakes, The Verge, July, 2018
Data Shift
Source: Luca Belli/Dan Shiebler, June, 2018
Machine Learning at the Company
● Environment
● Modeling: some use cases
Environment
Environment: ML Overview
Source: Luca Belli/Dan Shiebler, June, 2018
[Diagram: shared ML environment]
● Team A’s Data Aggregation and Feature Extraction Job
● Cortex Embedding Generation Pipeline
● Feature Registry
● Team A’s / Team B’s / Team C’s Machine Learning Models
Environment: ML Workflows
Source: Devin Goodsell, NYC Machine Learning Meetup, June 20, 2018
Environment: ML Training
Environment: Priorities
● Feature Addition → Scalable data
● Data Addition → Scalable data
● Training → Fast, robust training engine
● Deployment → Seamless and tested ML services
● A/B Test → Good A/B test environment
Modeling and Optimizations with TensorFlow
[Model architecture: Discretizer → Full Sparse → Dense MLP → Full Sparse]
Sparse Linear Layer
[Diagram: input values V1 … Vn mapped into K weight buckets]
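The layer can be sketched in plain NumPy (a hedged illustration, not Twitter's implementation; the modulo stands in for whatever hashing maps raw feature ids into the K weight buckets):

```python
import numpy as np

def sparse_linear(feature_ids, feature_values, weights, bias):
    # Map each (potentially enormous) raw feature id into one of the
    # K weight buckets, then take the weighted sum of the gathered
    # weight rows -- the sparse equivalent of x @ W + b.
    buckets = feature_ids % weights.shape[0]   # stand-in for a real hash
    return feature_values @ weights[buckets] + bias

rng = np.random.default_rng(0)
K, dim = 1 << 10, 8                   # K buckets, 8 output units
W = rng.normal(size=(K, dim))
b = np.zeros(dim)
out = sparse_linear(np.array([37, 123456789, 42]),
                    np.array([1.0, 0.5, 2.0]), W, b)
```

Only three weight rows are ever touched, no matter how large the raw feature space is.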
Sparse Linear Layer: First Approach
Source: TensorFlow
Sparse Linear Layer: Final Approach
Source: TensorFlow
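A rough NumPy sketch of the difference (assumed details; the slides' actual TensorFlow code is not reproduced): the first approach effectively multiplies a densified batch against the full weight matrix, analogous to tf.sparse_tensor_dense_matmul, while the final approach gathers only the weight rows of active features and scatter-adds them:

```python
import numpy as np

def baseline(sp_indices, sp_values, batch_size, weights):
    # First approach: densify the sparse batch, then an ordinary matmul.
    # Assumes no duplicate (row, feature) pairs in sp_indices.
    dense = np.zeros((batch_size, weights.shape[0]))
    dense[sp_indices[:, 0], sp_indices[:, 1]] = sp_values
    return dense @ weights

def gather_based(sp_indices, sp_values, batch_size, weights):
    # Final approach: touch only the weight rows of active features and
    # scatter-add each scaled row into its example's output row.
    out = np.zeros((batch_size, weights.shape[1]))
    np.add.at(out, sp_indices[:, 0],
              sp_values[:, None] * weights[sp_indices[:, 1]])
    return out

rng = np.random.default_rng(1)
W = rng.normal(size=(10_000, 4))
idx = np.array([[0, 17], [0, 9_999], [1, 512]])   # (example, feature) pairs
vals = np.array([1.0, -2.0, 0.5])
```

The gather-based version never materializes the mostly-zero batch, which is where the savings come from.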
Sparse Linear Layer
● Optimizers
  ○ SGD
  ○ Lazy Adam
Source: Berkeley Artificial Intelligence Research (BAIR) Lab
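The lazy-Adam idea can be sketched in NumPy (hedged: the real TensorFlow LazyAdamOptimizer differs in bookkeeping details) — moments and parameters are updated only for rows that actually received a sparse gradient:

```python
import numpy as np

def lazy_adam_step(param, m, v, rows, grads, t,
                   lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # Update first/second moment estimates and parameters only for the
    # `rows` touched by this sparse gradient; all other rows (and their
    # moments) stay untouched, which is what makes it "lazy".
    # Assumes `rows` contains no duplicates.
    m[rows] = b1 * m[rows] + (1 - b1) * grads
    v[rows] = b2 * v[rows] + (1 - b2) * grads ** 2
    m_hat = m[rows] / (1 - b1 ** t)
    v_hat = v[rows] / (1 - b2 ** t)
    param[rows] -= lr * m_hat / (np.sqrt(v_hat) + eps)

param = np.ones((5, 2))
m = np.zeros_like(param)
v = np.zeros_like(param)
lazy_adam_step(param, m, v, rows=np.array([1, 3]),
               grads=np.array([[0.5, -0.5], [1.0, 2.0]]), t=1)
```

Plain Adam would decay the moments of every row on every step; skipping the untouched rows keeps sparse-layer updates cheap.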
Sparse Linear Layer: Variable Partitioning
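Variable partitioning can be sketched as a row-wise split of the large weight matrix across shards (in the spirit of TensorFlow's partitioned variables; the shard-assignment details here are illustrative, not Twitter's code):

```python
import numpy as np

def partition(weights, num_shards):
    # Row-wise split; the first (rows % num_shards) shards get one extra
    # row, mirroring an even fixed-size partitioning.
    return np.array_split(weights, num_shards, axis=0)

def lookup(shards, row):
    # Route a global row id to the shard that owns it, then index locally.
    sizes = np.array([s.shape[0] for s in shards])
    starts = np.concatenate(([0], np.cumsum(sizes)[:-1]))
    shard_id = int(np.searchsorted(starts, row, side="right")) - 1
    return shards[shard_id][row - starts[shard_id]]

W = np.arange(20).reshape(10, 2)
shards = partition(W, 3)
```

Spreading the rows across parameter servers lets lookups and gradient pushes for different feature buckets land on different machines instead of one hot spot.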
Sparse Linear Layer: Variable Partitioning: Profiling
~33% reduction
Sparse Linear Layer: Online Normalization
● Example:
  Input: input_feature (value == 1M)
  ⇒ weight_gradient == 1M
  ⇒ update = 1M * learning_rate
  ⇒ ?
Source: Nicolas Koumchatzky
Sparse Linear Layer: Online Normalization
● Normalization of input values
Source: Nicolas Koumchatzky
● Normalized value belongs to [-1, 1]
● Trainable per-feature bias: discriminate absence and presence of features
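A hedged sketch of online input normalization (running per-feature min/max here; the exact statistics tracked are not specified on the slide): the 1M-valued feature from the earlier example gets squashed into [-1, 1] before it can blow up the gradients. A trainable per-feature bias (not modeled here) would then let the network distinguish an absent feature from one present with value 0.

```python
import numpy as np

class OnlineMinMaxNormalizer:
    def __init__(self, num_features):
        self.lo = np.full(num_features, np.inf)
        self.hi = np.full(num_features, -np.inf)

    def update(self, batch):
        # Fold a new batch into the running per-feature min/max.
        self.lo = np.minimum(self.lo, batch.min(axis=0))
        self.hi = np.maximum(self.hi, batch.max(axis=0))

    def normalize(self, batch):
        # Affinely map each feature into [-1, 1] using the stats so far;
        # clip handles values outside the range seen during update().
        span = np.where(self.hi > self.lo, self.hi - self.lo, 1.0)
        return np.clip(2.0 * (batch - self.lo) / span - 1.0, -1.0, 1.0)

norm = OnlineMinMaxNormalizer(2)
norm.update(np.array([[0.0, -3.0], [1e6, 3.0]]))  # the pathological 1M input
scaled = norm.normalize(np.array([[1e6, 0.0]]))
```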
Additional Performance Gains
Hogwild
Source: Hogwild!, UC Berkeley
Source: TensorFlow
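Hogwild!-style asynchronous SGD, sketched with Python threads (illustrative; the slides show the TensorFlow version): all workers update one shared weight vector with no locks, accepting the occasional clobbered update because sparse gradients rarely collide.

```python
import threading
import numpy as np

def hogwild_sgd(X, y, num_threads=4, steps=2000, lr=0.02):
    w = np.zeros(X.shape[1])            # shared, deliberately unlocked

    def worker(seed):
        rng = np.random.default_rng(seed)
        for _ in range(steps):
            i = rng.integers(len(X))
            grad = (X[i] @ w - y[i]) * X[i]   # squared-loss gradient
            w[:] = w - lr * grad              # racy read-modify-write

    threads = [threading.Thread(target=worker, args=(s,))
               for s in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 3.0])
X = rng.normal(size=(200, 3))
y = X @ true_w                          # noiseless linear targets
w = hogwild_sgd(X, y)
```

Despite the races, the shared weights still converge close to the true solution, which is the Hogwild! observation.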
Custom Ops
GPU vs. CPU Metrics
GPU Benchmarks

Batch Size | CPU: Baseline     | CPU: After Optimizations | GPU: Baseline     | GPU: After Optimizations
256        | 1024 samples/s    | 7372 samples/s           | 5504 samples/s    | 22528 samples/s
512        | 1638 samples/s    | 11264 samples/s          | 8448 samples/s    | 21504 samples/s
1024       | 2355 samples/s    | 13312 samples/s          | 10752 samples/s   | 22528 samples/s

Baseline = tf.sparse_tensor_dense_matmul
GPU benchmarks were run on NVIDIA Tesla K80 GPUs; CPU benchmarks on Intel Xeon Platinum 8180 processors.
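As a quick back-of-the-envelope reading of the table, the optimized GPU path versus the CPU baseline at each batch size:

```python
# Throughput in samples/s, copied from the benchmark table above.
cpu_baseline = {256: 1024, 512: 1638, 1024: 2355}
gpu_optimized = {256: 22528, 512: 21504, 1024: 22528}

# Speedup of the optimized GPU path over the CPU baseline: ~22x at
# batch 256, shrinking toward ~10x as the CPU baseline scales up.
speedup = {bs: gpu_optimized[bs] / cpu_baseline[bs] for bs in cpu_baseline}
```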
Acknowledgements
● Andrew Bean
● Ricardo Cervera-Navarro
● Priyank Jain
● Ruhua Jiang
● Nicholas Léonard
● Briac Marcatté
● Mahak Patidar
● Tim Sweeney
● Pavan Yalamanchili
● Yi Zhuang
Thank you! Questions?
Follow me on Twitter: @cibelemh
Email me at: [email protected]