ADVANCES IN TRINITY OF AI: DATA, ALGORITHMS & COMPUTE
Anima Anandkumar
Bren Professor at Caltech; Director of ML Research at NVIDIA
ALGORITHMS
• OPTIMIZATION
• SCALABILITY
• MULTI-DIMENSIONALITY

DATA
• COLLECTION
• AGGREGATION
• AUGMENTATION

INFRASTRUCTURE: FULL STACK FOR ML
• APPLICATION SERVICES
• ML PLATFORM
• GPUS
TRINITY FUELING ARTIFICIAL INTELLIGENCE
• COLLECTION: ACTIVE LEARNING, PARTIAL LABELS, …
• AGGREGATION: CROWDSOURCING MODELS, …
• AUGMENTATION: GENERATIVE MODELS, SYMBOLIC EXPRESSIONS, …
DATA
ACTIVE LEARNING
Labeled data
Unlabeled data
Goal
• Reach SOTA with a smaller dataset

• Active learning is well analyzed in theory
• In practice, only small classical models
Can it work at scale with deep learning?
TASK: NAMED ENTITY RECOGNITION
RESULTS: NER task on the largest open benchmark (OntoNotes)
Active learning heuristics:
• Least confidence (LC)
• Maximum normalized log probability (MNLP)
• Deep active learning matches:
  • SOTA with just 25% of the data on English, 30% on Chinese.
  • The best shallow model (trained on full data) with 12% of the data on English, 17% on Chinese.
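The two acquisition heuristics above can be sketched in a few lines. The per-token probabilities below are hypothetical stand-ins for what a real NER model would produce from its softmax outputs:

```python
import math

def least_confidence(token_probs):
    # LC: 1 - probability of the most likely labeling, approximated here
    # by the product of per-token max probabilities (greedy sequence).
    p_seq = 1.0
    for p in token_probs:
        p_seq *= p
    return 1.0 - p_seq

def mnlp(token_probs):
    # MNLP: maximum normalized log-probability, i.e. the average per-token
    # log-probability, which removes LC's bias toward long sentences.
    return sum(math.log(p) for p in token_probs) / len(token_probs)

# Hypothetical per-token probabilities for two unlabeled sentences.
short_confident = [0.99, 0.98]
long_uncertain = [0.9] * 12

# Query the sentence the model is least sure about:
# highest LC score, or lowest MNLP score.
assert least_confidence(long_uncertain) > least_confidence(short_confident)
assert mnlp(long_uncertain) < mnlp(short_confident)
```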
Test F1 score vs. % of labeled words
English
Published as a conference paper at ICLR 2018
[Figure 1: F1 score on the test dataset vs. percent of words annotated, comparing MNLP, LC, RAND, the best deep model, and the best shallow model on (a) OntoNotes-5.0 English and (b) OntoNotes-5.0 Chinese.]
Figure 2: Genre distribution of the top 1,000 sentences chosen by an active learning algorithm
Detection of under-explored genres. To better understand how active learning algorithms choose informative examples, we designed the following experiment. The OntoNotes datasets consist of six genres: broadcast conversation (bc), broadcast news (bn), magazine (mz), newswire (nw), telephone conversation (tc), and weblogs (wb). We created three training datasets: half-data, which contains a random 50% of the original training data; nw-data, which contains sentences only from newswire (51.5% of words in the original data); and no-nw-data, which is the complement of nw-data. We then trained the CNN-CNN-LSTM model on each dataset. The model trained on half-data achieved 85.10 F1, significantly outperforming the models trained on biased datasets (no-nw-data: 81.49, nw-only-data: 82.08). This showed the importance of good genre coverage in training data. We then analyzed the genre distribution of the 1,000 sentences MNLP chose for each model (see Figure 2). For no-nw-data, the algorithm chose many more newswire (nw) sentences than it did for unbiased half-data (367 vs. 217). On the other hand, it undersampled newswire sentences for nw-only-data and increased the proportion of broadcast news and telephone conversation, genres distant from newswire. Impressively, although we did not provide the genre of sentences to the algorithm, it was able to automatically detect under-explored genres.
REFERENCES
Ashwinkumar Badanidiyuru, Baharan Mirzasoleiman, Amin Karbasi, and Andreas Krause. Streamingsubmodular maximization: Massive data summarization on the fly. In Proceedings of the 20th
ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 671–680.ACM, 2014.
Jason PC Chiu and Eric Nichols. Named entity recognition with bidirectional lstm-cnns. Transactions
of the Association for Computational Linguistics, 4:357–370, 2016.
Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa.Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537, 2011.
David Graff and Christopher Cieri. English gigaword, ldc catalog no. LDC2003T05. Linguistic Data
Consortium, University of Pennsylvania, 2003.
3
Chinese
TAKE-AWAY
• Uncertainty sampling works; normalizing for sentence length helps in the low-data regime.
• With active learning, deep models beat shallow ones even in the low-data regime.
• With active learning, SOTA is achieved with far fewer labeled samples.
ACTIVE LEARNING WITH PARTIAL FEEDBACK
[Diagram: images are labeled by asking binary questions ("dog?"), partitioning them into "dog" and "non-dog" partial labels.]
• Hierarchical class labeling: labeling effort is proportional to the number of binary questions asked
• Can we actively pick informative questions?
RESULTS ON TINY IMAGENET (100K SAMPLES)
• Yields 8% higher accuracy at 30% of the questions (w.r.t. Uniform)
• Obtains full annotation with 40% fewer binary questions
Method     Data selection   Question selection
ALPF-ERC   active           active
Uniform    inactive         inactive
AL-ME      active           inactive
AQ-ERC     inactive         active
[Plot: accuracy vs. number of questions for Uniform, AL-ME, AQ-ERC, and ALPF-ERC; ALPF-ERC gains +8% accuracy and reaches full annotation with 40% fewer questions.]
TWO TAKE-AWAYS
• Don’t annotate from scratch: select questions actively based on the learned model.
• Don’t sleep on partial labels: re-train the model from partial labels.
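The "select questions actively" idea can be sketched as greedy expected-entropy-reduction scoring. This is an illustrative criterion rather than the exact ERC objective, and the posterior and candidate question sets below are made up:

```python
import math

def entropy(ps):
    # Shannon entropy in bits of a discrete distribution.
    return -sum(p * math.log2(p) for p in ps if p > 0)

def expected_remaining_entropy(posterior, subset):
    # Ask the binary question "is the label in `subset`?" and compute the
    # expected entropy of the label posterior after hearing the answer.
    p_yes = sum(posterior[c] for c in subset)
    p_no = 1.0 - p_yes
    h = 0.0
    if p_yes > 0:
        h += p_yes * entropy([posterior[c] / p_yes for c in subset])
    if p_no > 0:
        no_set = [c for c in range(len(posterior)) if c not in subset]
        h += p_no * entropy([posterior[c] / p_no for c in no_set])
    return h

# Hypothetical model posterior over 4 classes for one image.
posterior = [0.4, 0.3, 0.2, 0.1]
# Candidate questions, e.g. "dog?" = {0} vs. broader "animal?" sets.
questions = [[0], [0, 1], [0, 1, 2]]
best = min(questions, key=lambda s: expected_remaining_entropy(posterior, set(s)))
```

For this posterior, asking about the single most likely class removes the most uncertainty in expectation.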
CROWDSOURCING: AGGREGATION OF CROWD ANNOTATIONS
Majority rule
• Simple and common.
• Wasteful: ignores differences in annotator quality across workers.

Annotator-quality models
• Can improve accuracy.
• Hard: quality must be estimated without ground truth.
PROPOSED CROWDSOURCING ALGORITHM
Repeat:
1. Compute the posterior of ground-truth labels from the noisy crowdsourced annotations, given the annotator-quality model.
2. Train the prediction model with a weighted loss, using the posterior as weights.
3. Use the trained model to infer ground-truth labels.
4. MLE: update annotator quality using the labels inferred by the model.
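A minimal sketch of the aggregation loop, assuming a one-coin worker model with binary labels and omitting the learned predictor (the slide's full version also trains a model with a posterior-weighted loss):

```python
def em_aggregate(annotations, n_items, n_workers, iters=20):
    # annotations: list of (item, worker, label) with label in {0, 1}.
    quality = [0.8] * n_workers    # initial annotator quality
    posterior = [0.5] * n_items    # P(true label = 1)
    for _ in range(iters):
        # E-step: posterior of ground truth given worker qualities.
        for i in range(n_items):
            p1, p0 = 1.0, 1.0
            for item, w, y in annotations:
                if item != i:
                    continue
                q = quality[w]
                p1 *= q if y == 1 else (1 - q)
                p0 *= q if y == 0 else (1 - q)
            posterior[i] = p1 / (p1 + p0)
        # M-step: MLE of each worker's quality from the inferred labels.
        for w in range(n_workers):
            num, den = 0.0, 0.0
            for item, worker, y in annotations:
                if worker != w:
                    continue
                num += posterior[item] if y == 1 else (1 - posterior[item])
                den += 1.0
            if den:
                quality[w] = num / den
    return posterior, quality

# Hypothetical toy data: workers 0 and 1 are reliable, worker 2 flips labels.
truth = [1, 0, 1]
ann = [(i, w, t if w < 2 else 1 - t) for i, t in enumerate(truth) for w in range(3)]
posterior, quality = em_aggregate(ann, n_items=3, n_workers=3)
```

The loop recovers the true labels and drives the adversarial worker's estimated quality toward zero, without ever seeing ground truth.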
LABELING ONCE IS OPTIMAL: BOTH IN THEORY AND PRACTICE
MS-COCO dataset. Fixed budget: 35k annotations. [Plot: results vs. number of workers per sample.]
Theorem: Under a fixed budget, generalization error is minimized with a single annotation per sample.

Assumptions:
• The best predictor is accurate enough (under no label noise).
• Simplified case: all workers have the same quality.
• Probability of a worker being correct > 83%.
• ~5% gain w.r.t. majority rule
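A quick sanity check on the budget trade-off (illustrative arithmetic, not the theorem's proof): repeating labels buys cleaner annotations, but at a steep cost in coverage.

```python
def majority3(p):
    # Probability that a 3-worker majority vote is correct when each
    # worker is independently correct with probability p.
    return p**3 + 3 * p**2 * (1 - p)

# With budget B, one label per sample covers B samples at accuracy p;
# 3-way majority vote covers only B/3 samples at accuracy majority3(p).
p = 0.9
print(majority3(p))   # 0.972: slightly cleaner labels, 3x fewer samples
```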
DATA AUGMENTATION 1: GENERATIVE MODELING
Merits
• Captures the statistics of natural images
• Learnable

Peril
• Feedback is real vs. fake: different from the prediction task.
• Introduces artifacts
GAN
PREDICTIVE VS GENERATIVE MODELS
[Diagram: a predictive model learns P(y | x); a generative model learns P(x | y).]
One model to do both?
• SOTA prediction comes from CNN models.
• What class of p(x|y) yields CNN models for p(y|x)?
NEURAL DEEP RENDERING MODEL (NRM)
[Diagram: the NRM renders an image x from an object category y through intermediate renderings controlled by latent variables.]
Design joint priors for latent variables based on reverse-engineering CNN predictive architectures
NEURAL RENDERING MODEL (NRM)
[Diagram: NRM generation (top-down) vs. CNN inference (bottom-up). Generation: starting from a class label (dog = 1.0), choose whether to render, upsample and select locations, and render: class template → masked template → upsampled template → rendered image. Inference: image → unpooled feature map → pooled feature map → rectified feature map → class probabilities (dog 0.5, cat 0.2, horse 0.1, …).]
MAX-MIN CROSS-ENTROPY ➡ MAX-MIN NETWORKS
Cross-Entropy Loss for Training the CNNs with Labeled Data
$$\min_{\theta}\; H_{p,q}(y \mid x, z_{\max}) \;\equiv\; \min_{(z_i)_{i=1}^{n},\,\theta}\; \frac{1}{n}\sum_{i=1}^{n} -\log p(y_i \mid x_i, z_i; \theta)$$
Max-Min Loss for Training the CNNs with Labeled Data
$$\alpha_{\max}\, H_{p,q}(y \mid x, z_{\max}) \;+\; \alpha_{\min}\, H_{p,q}(y \mid x, z_{\min})$$
[Diagram: the input image is fed through max-cross-entropy and min-cross-entropy branches with shared weights, combined into the max-min cross-entropy.]
• Max cross-entropy maximizes the posteriors of correct labels. Min cross-entropy minimizes the posteriors of incorrect labels.
• Co-learning: Max and Min networks try to learn from each other
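One way to read the max-min objective in code. The weights `a_max`, `a_min` and the choice of a single incorrect label to suppress are illustrative assumptions, not the paper's exact formulation:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def max_min_loss(logits, y, y_wrong, a_max=1.0, a_min=0.5):
    # Max term: usual cross-entropy, raises the posterior of the correct label y.
    # Min term: -log(1 - p[y_wrong]), pushes down an incorrect label's posterior.
    # (a_max, a_min, and picking one offending class y_wrong are assumptions.)
    p = softmax(logits)
    return a_max * -math.log(p[y]) + a_min * -math.log(1.0 - p[y_wrong])

# As the correct class becomes more confident, both terms shrink.
loss_before = max_min_loss([2.0, 1.0, 0.0], y=0, y_wrong=1)
loss_after = max_min_loss([4.0, 1.0, 0.0], y=0, y_wrong=1)
assert loss_after < loss_before
```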
STATISTICAL GUARANTEES FOR THE NRM
• Bound on the generalization error: Risk ≤ …
• Rendering path normalization: a new form of regularization
Training loss in the CNNs equivalent to likelihood in NRM
Max-Min NRM with RPN achieves SOTA on benchmarks
DATA AUGMENTATION 2: SYMBOLIC EXPRESSIONS
Goal: Learn a domain of functions (sin, cos, log, add…)
• Training on numerical input-output does not generalize.
Solution: data augmentation with symbolic expressions
• Efficiently encode relationships between functions.
• Design networks to use both symbolic and numerical data.
ARCHITECTURE : TREE LSTM
sin²(θ) + cos²(θ) = 1        sin(−2.5) ≈ −0.6
• Symbolic expression trees and function evaluation trees.
• Decimal trees: encode numbers with decimal representation (numerical).
• Can encode any expression, function evaluation and number.
Decimal tree for 2.5
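A minimal sketch of symbolic expression trees and their numerical evaluation; the nested-tuple encoding is an illustrative stand-in for the trees a TreeLSTM would consume node by node:

```python
import math

def evaluate(node, env):
    # Evaluate an expression tree: ("const", v), ("var", name),
    # or (op, child, ...) for unary/binary operators.
    op = node[0]
    if op == "const":
        return node[1]
    if op == "var":
        return env[node[1]]
    args = [evaluate(child, env) for child in node[1:]]
    ops = {
        "add": lambda a, b: a + b,
        "mul": lambda a, b: a * b,
        "pow": lambda a, b: a ** b,
        "sin": lambda a: math.sin(a),
        "cos": lambda a: math.cos(a),
    }
    return ops[op](*args)

# sin^2(theta) + cos^2(theta) = 1 for any theta
identity = ("add",
            ("pow", ("sin", ("var", "theta")), ("const", 2)),
            ("pow", ("cos", ("var", "theta")), ("const", 2)))
```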
RESULTS
• Vastly improved numerical evaluation: 90% gain over the function-fitting baseline.
• Generalization to verifying symbolic equations of higher depth.
• Combining symbolic + numerical data yields better generalization on both tasks: symbolic verification and numerical evaluation.
Model                          Accuracy
LSTM: symbolic                 76.40%
TreeLSTM: symbolic             93.27%
TreeLSTM: symbolic + numeric   96.17%
ALGORITHMS
• OPTIMIZATION: ANALYSIS OF CONVERGENCE
• SCALABILITY: GRADIENT QUANTIZATION
• MULTI-DIMENSIONALITY: TENSOR ALGEBRA
DISTRIBUTED TRAINING INVOLVES COMPUTATION & COMMUNICATION
Parameter server
GPU 1 GPU 2
With 1/2 data With 1/2 data
Can we compress the gradients exchanged with the parameter server?
DISTRIBUTED TRAINING BY MAJORITY VOTE
[Diagram: GPUs 1–3 each send sign(g) to the parameter server; the server broadcasts sign[sum(sign(g))], a per-coordinate majority vote.]
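One round of the scheme above can be sketched as follows; the gradients and learning rate are made-up stand-ins for per-GPU minibatch gradients:

```python
def sign(x):
    return 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)

def majority_vote_step(worker_grads, params, lr=0.01):
    # Each worker sends only sign(g): one bit per coordinate.
    signed = [[sign(g) for g in grads] for grads in worker_grads]
    # The server aggregates by per-coordinate majority vote and
    # broadcasts the one-bit result back to the workers.
    update = [sign(sum(col)) for col in zip(*signed)]
    return [p - lr * u for p, u in zip(params, update)]

params = [0.5, -0.3, 1.2]
worker_grads = [
    [0.9, -0.2, 0.1],    # hypothetical gradient on GPU 1's shard
    [1.1, -0.4, -0.3],   # GPU 2
    [0.8, 0.1, 0.2],     # GPU 3
]
params = majority_vote_step(worker_grads, params)
```

Both directions of communication are 1 bit per coordinate, which is where the throughput gain comes from.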
SIGNSGD PROVIDES "FREE LUNCH"
Throughput gain with only tiny accuracy loss
p3.2xlarge machines on AWS, ResNet-50 on ImageNet
SIGNSGD ACROSS DOMAINS AND ARCHITECTURES
Huge throughput gain!
TAKE-AWAYS FOR SIGN-SGD
• Convergence even under biased gradients and noise.
• Faster than SGD in theory and in practice.
• For distributed training, similar variance reduction as SGD.
• In practice, similar accuracy but with far less communication.
TENSORS FOR LEARNING IN MANY DIMENSIONS
Tensors: beyond the 2D world
Modern data is inherently multi-dimensional
Images: 3 dimensions. Videos: 4 dimensions.
TENSORS FOR MULTI-DIMENSIONAL DATA AND HIGHER ORDER MOMENTS
Pairwise correlations Triplet correlations
OPERATIONS ON TENSORS: TENSOR CONTRACTION
Extends the notion of the matrix product.

Matrix product:  Mv = Σ_j v_j M_j

Tensor contraction:  T(u, v, ·) = Σ_{i,j} u_i v_j T_{i,j,:}
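The contraction formula can be checked directly. This is a pure-Python sketch over nested lists; a real implementation would use an einsum-style kernel:

```python
def contract(T, u, v):
    # T(u, v, .) = sum_{i,j} u_i * v_j * T[i][j][:] for an I x J x K tensor.
    I, J, K = len(T), len(T[0]), len(T[0][0])
    out = [0.0] * K
    for i in range(I):
        for j in range(J):
            for k in range(K):
                out[k] += u[i] * v[j] * T[i][j][k]
    return out

# The matrix product Mv = sum_j v_j M_j is the order-2 special case.
T = [[[1.0, 0.0], [0.0, 1.0]],
     [[2.0, 0.0], [0.0, 2.0]]]
u = [1.0, 1.0]
v = [1.0, 0.0]
result = contract(T, u, v)   # -> [3.0, 0.0]
```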
DEEP NEURAL NETS: TRANSFORMING TENSORS
DEEP TENSORIZED NETWORKS
SPACE SAVING IN DEEP TENSORIZED NETWORKS
Tensor Train RNN and LSTMs
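A back-of-the-envelope illustration of the space saving from a tensor-train (TT) factorization of a weight matrix; the mode shapes and TT-rank below are hypothetical:

```python
def tt_params(in_modes, out_modes, rank):
    # A d-core TT layer stores cores of shape (r_left, m_k, n_k, r_right),
    # with rank 1 on the outer boundaries.
    d = len(in_modes)
    total = 0
    for k in range(d):
        r_left = 1 if k == 0 else rank
        r_right = 1 if k == d - 1 else rank
        total += r_left * in_modes[k] * out_modes[k] * r_right
    return total

# Factor a 1024 x 1024 weight matrix with modes 4*4*4*4*4 on each side.
in_modes = out_modes = [4, 4, 4, 4, 4]
full = 1024 * 1024                      # 1,048,576 parameters dense
tt = tt_params(in_modes, out_modes, rank=8)   # 3,328 parameters in TT form
```

With these (made-up) shapes the TT form is roughly 300x smaller than the dense matrix, which is the kind of saving the slide refers to.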
TENSORS FOR LONG-TERM FORECASTING
Challenges:
• Long-term dependencies
• High-order correlations
• Error propagation
Datasets: climate and traffic.
TENSOR LSTM FOR LONG-TERM FORECASTING
TENSORLY: HIGH-LEVEL API FOR TENSOR ALGEBRA
• Python programming
• User-friendly API
• Multiple backends: flexible + scalable
• Example notebooks in repository
A New Vision for Autonomy
Center for Autonomous Systems and Technologies
CAST: BRINGING ROBOTICS AND AI TOGETHER
FIRST SET OF RESULTS: LEARNING TO LAND
SOME RESEARCH LEADERS AT NVIDIA
Robotics
Dieter Fox
Learning & Perception
Jan KautzBill Dally Dave Luebke Alex Keller Aaron Lefohn
Graphics
Steve Keckler Dave Nellans Mike O’Connor
Architecture / Programming
Michael Garland
VLSI
Brucek Khailany
Circuits
Tom Gray
Networks
Larry Dennison
Chief Scientist
Computer vision / Core ML
Sanja Fidler Me !
Applied research
Bryan Catanzaro
• DATA
• Collection: Active learning and partial feedback
• Aggregation: Crowdsourcing models
• Augmentation: Graphics rendering + GANs, Symbolic expressions
• ALGORITHMS
• Convergence: SignSGD has good rates in theory and practice
• Scalability: SignSGD has the same variance reduction as SGD in the multi-machine setting
• Multi-dimensionality: Tensor algebra for neural networks and probabilistic models.
• INFRASTRUCTURE:
• Frameworks: TensorLy is a high-level API for deep tensorized networks.
CONCLUSION
AI needs integration of data, algorithms and infrastructure
COLLABORATORS (LIMITED LIST)
Thank you