ADVANCES IN TRINITY OF AI: DATA, ALGORITHMS & COMPUTE
Anima Anandkumar
Bren Professor at Caltech; Director of ML Research at NVIDIA
ALGORITHMS
• OPTIMIZATION
• SCALABILITY
• MULTI-DIMENSIONALITY

DATA
• COLLECTION
• AGGREGATION
• AUGMENTATION

INFRASTRUCTURE: FULL STACK FOR ML
• APPLICATION SERVICES
• ML PLATFORM
• GPUS
TRINITY FUELING ARTIFICIAL INTELLIGENCE
• COLLECTION: ACTIVE LEARNING, PARTIAL LABELS, …
• AGGREGATION: CROWDSOURCING MODELS, …
• AUGMENTATION: GENERATIVE MODELS, SYMBOLIC EXPRESSIONS, …
DATA
ACTIVE LEARNING
Labeled data
Unlabeled data
Goal
• Reach SOTA with a smaller dataset

• Active learning is well analyzed in theory
• In practice, only small classical models
Can it work at scale with deep learning?
TASK: NAMED ENTITY RECOGNITION
RESULTS: NER task on the largest open benchmark (OntoNotes)
Active learning heuristics:
• Least confidence (LC)
• Maximum normalized log probability (MNLP)
• Deep active learning matches:
  • SOTA with just 25% of the data on English, 30% on Chinese.
  • The best shallow model (trained on full data) with 12% of the data on English, 17% on Chinese.
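The two acquisition heuristics above can be sketched in a few lines. The per-token probabilities below are hypothetical stand-ins for what a real NER model would produce from its softmax outputs:

```python
import math

def least_confidence(token_probs):
    # LC: 1 - probability of the most likely labeling, approximated here
    # by the product of per-token max probabilities (greedy sequence).
    p_seq = 1.0
    for p in token_probs:
        p_seq *= p
    return 1.0 - p_seq

def mnlp(token_probs):
    # MNLP: maximum normalized log-probability, i.e. the average per-token
    # log-probability, which removes LC's bias toward long sentences.
    return sum(math.log(p) for p in token_probs) / len(token_probs)

# Hypothetical per-token probabilities for two unlabeled sentences.
short_confident = [0.99, 0.98]
long_uncertain = [0.9] * 12

# Query the sentence the model is least sure about:
# highest LC score, or lowest MNLP score.
assert least_confidence(long_uncertain) > least_confidence(short_confident)
assert mnlp(long_uncertain) < mnlp(short_confident)
```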
Test F1 score vs. % of labeled words
English
Published as a conference paper at ICLR 2018
[Figure 1: F1 score on the test dataset vs. percent of words annotated, comparing MNLP, LC, RAND, the best deep model, and the best shallow model on (a) OntoNotes-5.0 English and (b) OntoNotes-5.0 Chinese.]
Figure 2: Genre distribution of the top 1,000 sentences chosen by an active learning algorithm
Detection of under-explored genres. To better understand how active learning algorithms choose informative examples, we designed the following experiment. The OntoNotes datasets consist of six genres: broadcast conversation (bc), broadcast news (bn), magazine (mz), newswire (nw), telephone conversation (tc), and weblogs (wb). We created three training datasets: half-data, which contains a random 50% of the original training data; nw-data, which contains sentences only from newswire (51.5% of words in the original data); and no-nw-data, which is the complement of nw-data. We then trained the CNN-CNN-LSTM model on each dataset. The model trained on half-data achieved 85.10 F1, significantly outperforming the models trained on biased datasets (no-nw-data: 81.49, nw-only-data: 82.08). This showed the importance of good genre coverage in training data. We then analyzed the genre distribution of the 1,000 sentences MNLP chose for each model (see Figure 2). For no-nw-data, the algorithm chose many more newswire (nw) sentences than it did for unbiased half-data (367 vs. 217). On the other hand, it undersampled newswire sentences for nw-only-data and increased the proportion of broadcast news and telephone conversation, genres distant from newswire. Impressively, although we did not provide the genre of sentences to the algorithm, it was able to automatically detect under-explored genres.
REFERENCES
Ashwinkumar Badanidiyuru, Baharan Mirzasoleiman, Amin Karbasi, and Andreas Krause. Streamingsubmodular maximization: Massive data summarization on the fly. In Proceedings of the 20th
ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 671–680.ACM, 2014.
Jason PC Chiu and Eric Nichols. Named entity recognition with bidirectional lstm-cnns. Transactions
of the Association for Computational Linguistics, 4:357–370, 2016.
Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa.Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537, 2011.
David Graff and Christopher Cieri. English gigaword, ldc catalog no. LDC2003T05. Linguistic Data
Consortium, University of Pennsylvania, 2003.
3
Chinese
TAKE-AWAY
• Uncertainty sampling works; normalizing for sentence length helps in the low-data regime.
• With active learning, deep models beat shallow ones even in the low-data regime.
• With active learning, SOTA is achieved with far fewer labeled samples.
ACTIVE LEARNING WITH PARTIAL FEEDBACK
[Diagram: images are labeled by asking binary questions ("dog?"), partitioning them into "dog" and "non-dog" partial labels.]
• Hierarchical class labeling: labeling effort is proportional to the number of binary questions asked
• Can we actively pick informative questions?
RESULTS ON TINY IMAGENET (100K SAMPLES)
• Yields 8% higher accuracy at 30% of the questions (w.r.t. Uniform)
• Obtains full annotation with 40% fewer binary questions
Method     Data selection   Question selection
ALPF-ERC   active           active
Uniform    inactive         inactive
AL-ME      active           inactive
AQ-ERC     inactive         active
[Plot: accuracy vs. number of questions for Uniform, AL-ME, AQ-ERC, and ALPF-ERC; ALPF-ERC gains +8% accuracy and reaches full annotation with 40% fewer questions.]
TWO TAKE-AWAYS
• Don’t annotate from scratch: select questions actively based on the learned model.
• Don’t sleep on partial labels: re-train the model from partial labels.
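The "select questions actively" idea can be sketched as greedy expected-entropy-reduction scoring. This is an illustrative criterion rather than the exact ERC objective, and the posterior and candidate question sets below are made up:

```python
import math

def entropy(ps):
    # Shannon entropy in bits of a discrete distribution.
    return -sum(p * math.log2(p) for p in ps if p > 0)

def expected_remaining_entropy(posterior, subset):
    # Ask the binary question "is the label in `subset`?" and compute the
    # expected entropy of the label posterior after hearing the answer.
    p_yes = sum(posterior[c] for c in subset)
    p_no = 1.0 - p_yes
    h = 0.0
    if p_yes > 0:
        h += p_yes * entropy([posterior[c] / p_yes for c in subset])
    if p_no > 0:
        no_set = [c for c in range(len(posterior)) if c not in subset]
        h += p_no * entropy([posterior[c] / p_no for c in no_set])
    return h

# Hypothetical model posterior over 4 classes for one image.
posterior = [0.4, 0.3, 0.2, 0.1]
# Candidate questions, e.g. "dog?" = {0} vs. broader "animal?" sets.
questions = [[0], [0, 1], [0, 1, 2]]
best = min(questions, key=lambda s: expected_remaining_entropy(posterior, set(s)))
```

For this posterior, asking about the single most likely class removes the most uncertainty in expectation.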
CROWDSOURCING: AGGREGATION OF CROWD ANNOTATIONS
Majority rule
• Simple and common.
• Wasteful: ignores differences in annotator quality across workers.

Annotator-quality models
• Can improve accuracy.
• Hard: quality must be estimated without ground truth.
PROPOSED CROWDSOURCING ALGORITHM
Repeat:
1. Compute the posterior of ground-truth labels from the noisy crowdsourced annotations, given the annotator-quality model.
2. Train the prediction model with a weighted loss, using the posterior as weights.
3. Use the trained model to infer ground-truth labels.
4. MLE: update annotator quality using the labels inferred by the model.
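A minimal sketch of the aggregation loop, assuming a one-coin worker model with binary labels and omitting the learned predictor (the slide's full version also trains a model with a posterior-weighted loss):

```python
def em_aggregate(annotations, n_items, n_workers, iters=20):
    # annotations: list of (item, worker, label) with label in {0, 1}.
    quality = [0.8] * n_workers    # initial annotator quality
    posterior = [0.5] * n_items    # P(true label = 1)
    for _ in range(iters):
        # E-step: posterior of ground truth given worker qualities.
        for i in range(n_items):
            p1, p0 = 1.0, 1.0
            for item, w, y in annotations:
                if item != i:
                    continue
                q = quality[w]
                p1 *= q if y == 1 else (1 - q)
                p0 *= q if y == 0 else (1 - q)
            posterior[i] = p1 / (p1 + p0)
        # M-step: MLE of each worker's quality from the inferred labels.
        for w in range(n_workers):
            num, den = 0.0, 0.0
            for item, worker, y in annotations:
                if worker != w:
                    continue
                num += posterior[item] if y == 1 else (1 - posterior[item])
                den += 1.0
            if den:
                quality[w] = num / den
    return posterior, quality

# Hypothetical toy data: workers 0 and 1 are reliable, worker 2 flips labels.
truth = [1, 0, 1]
ann = [(i, w, t if w < 2 else 1 - t) for i, t in enumerate(truth) for w in range(3)]
posterior, quality = em_aggregate(ann, n_items=3, n_workers=3)
```

The loop recovers the true labels and drives the adversarial worker's estimated quality toward zero, without ever seeing ground truth.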
LABELING ONCE IS OPTIMAL: BOTH IN THEORY AND PRACTICE
MS-COCO dataset. Fixed budget: 35k annotations. [Plot: results vs. number of workers per sample.]
Theorem: Under a fixed budget, generalization error is minimized with a single annotation per sample.

Assumptions:
• The best predictor is accurate enough (under no label noise).
• Simplified case: all workers have the same quality.
• Probability of a worker being correct > 83%.
• ~5% gain w.r.t. majority rule
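A quick sanity check on the budget trade-off (illustrative arithmetic, not the theorem's proof): repeating labels buys cleaner annotations, but at a steep cost in coverage.

```python
def majority3(p):
    # Probability that a 3-worker majority vote is correct when each
    # worker is independently correct with probability p.
    return p**3 + 3 * p**2 * (1 - p)

# With budget B, one label per sample covers B samples at accuracy p;
# 3-way majority vote covers only B/3 samples at accuracy majority3(p).
p = 0.9
print(majority3(p))   # 0.972: slightly cleaner labels, 3x fewer samples
```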
DATA AUGMENTATION 1: GENERATIVE MODELING
Merits
• Captures the statistics of natural images
• Learnable

Peril
• Feedback is real vs. fake: different from the prediction task.
• Introduces artifacts
GAN
PREDICTIVE VS GENERATIVE MODELS
[Diagram: a predictive model learns P(y | x); a generative model learns P(x | y).]
One model to do both?
• SOTA prediction comes from CNN models.
• What class of p(x|y) yields CNN models for p(y|x)?
NEURAL DEEP RENDERING MODEL (NRM)
[Diagram: the NRM renders an image x from an object category y through intermediate renderings controlled by latent variables.]
Design joint priors for latent variables based on reverse-engineering CNN predictive architectures
NEURAL RENDERING MODEL (NRM)
[Diagram: NRM generation (top-down) vs. CNN inference (bottom-up). Generation: starting from a class label (dog = 1.0), choose whether to render, upsample and select locations, and render: class template → masked template → upsampled template → rendered image. Inference: image → unpooled feature map → pooled feature map → rectified feature map → class probabilities (dog 0.5, cat 0.2, horse 0.1, …).]
MAX-MIN CROSS-ENTROPY ➡ MAX-MIN NETWORKS
Cross-Entropy Loss for Training the CNNs with Labeled Data
$$\min_{\theta}\; H_{p,q}(y \mid x, z_{\max}) \;\equiv\; \min_{(z_i)_{i=1}^{n},\,\theta}\; \frac{1}{n}\sum_{i=1}^{n} -\log p(y_i \mid x_i, z_i; \theta)$$
Max-Min Loss for Training the CNNs with Labeled Data
$$\alpha_{\max}\, H_{p,q}(y \mid x, z_{\max}) \;+\; \alpha_{\min}\, H_{p,q}(y \mid x, z_{\min})$$
[Diagram: the input image is fed through max-cross-entropy and min-cross-entropy branches with shared weights, combined into the max-min cross-entropy.]
• Max cross-entropy maximizes the posteriors of correct labels. Min cross-entropy minimizes the posteriors of incorrect labels.
• Co-learning: Max and Min networks try to learn from each other
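One way to read the max-min objective in code. The weights `a_max`, `a_min` and the choice of a single incorrect label to suppress are illustrative assumptions, not the paper's exact formulation:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def max_min_loss(logits, y, y_wrong, a_max=1.0, a_min=0.5):
    # Max term: usual cross-entropy, raises the posterior of the correct label y.
    # Min term: -log(1 - p[y_wrong]), pushes down an incorrect label's posterior.
    # (a_max, a_min, and picking one offending class y_wrong are assumptions.)
    p = softmax(logits)
    return a_max * -math.log(p[y]) + a_min * -math.log(1.0 - p[y_wrong])

# As the correct class becomes more confident, both terms shrink.
loss_before = max_min_loss([2.0, 1.0, 0.0], y=0, y_wrong=1)
loss_after = max_min_loss([4.0, 1.0, 0.0], y=0, y_wrong=1)
assert loss_after < loss_before
```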
STATISTICAL GUARANTEES FOR THE NRM
• Bound on the generalization error: Risk ≤ …
• Rendering path normalization: a new form of regularization
Training loss in the CNNs equivalent to likelihood in NRM
Max-Min NRM with RPN achieves SOTA on benchmarks
DATA AUGMENTATION 2: SYMBOLIC EXPRESSIONS
Goal: Learn a domain of functions (sin, cos, log, add…)
• Training on numerical input-output does not generalize.
Solution: data augmentation with symbolic expressions
• Efficiently encode relationships between functions.
• Design networks to use both symbolic and numerical data.
ARCHITECTURE : TREE LSTM
sin²(θ) + cos²(θ) = 1        sin(−2.5) ≈ −0.6
• Symbolic expression trees and function evaluation trees.
• Decimal trees: encode numbers with decimal representation (numerical).
• Can encode any expression, function evaluation and number.
Decimal tree for 2.5
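A minimal sketch of symbolic expression trees and their numerical evaluation; the nested-tuple encoding is an illustrative stand-in for the trees a TreeLSTM would consume node by node:

```python
import math

def evaluate(node, env):
    # Evaluate an expression tree: ("const", v), ("var", name),
    # or (op, child, ...) for unary/binary operators.
    op = node[0]
    if op == "const":
        return node[1]
    if op == "var":
        return env[node[1]]
    args = [evaluate(child, env) for child in node[1:]]
    ops = {
        "add": lambda a, b: a + b,
        "mul": lambda a, b: a * b,
        "pow": lambda a, b: a ** b,
        "sin": lambda a: math.sin(a),
        "cos": lambda a: math.cos(a),
    }
    return ops[op](*args)

# sin^2(theta) + cos^2(theta) = 1 for any theta
identity = ("add",
            ("pow", ("sin", ("var", "theta")), ("const", 2)),
            ("pow", ("cos", ("var", "theta")), ("const", 2)))
```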
RESULTS
• Vastly improved numerical evaluation: 90% gain over the function-fitting baseline.
• Generalization to verifying symbolic equations of higher depth.
• Combining symbolic + numerical data yields better generalization on both tasks: symbolic verification and numerical evaluation.
Model                          Accuracy
LSTM: symbolic                 76.40%
TreeLSTM: symbolic             93.27%
TreeLSTM: symbolic + numeric   96.17%
ALGORITHMS
• OPTIMIZATION: ANALYSIS OF CONVERGENCE
• SCALABILITY: GRADIENT QUANTIZATION
• MULTI-DIMENSIONALITY: TENSOR ALGEBRA
DISTRIBUTED TRAINING INVOLVES COMPUTATION & COMMUNICATION
Parameter server
GPU 1 GPU 2
With 1/2 data With 1/2 data
Can we compress the gradients exchanged with the parameter server?
DISTRIBUTED TRAINING BY MAJORITY VOTE
[Diagram: GPUs 1–3 each send sign(g) to the parameter server; the server broadcasts sign[sum(sign(g))], a per-coordinate majority vote.]
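One round of the scheme above can be sketched as follows; the gradients and learning rate are made-up stand-ins for per-GPU minibatch gradients:

```python
def sign(x):
    return 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)

def majority_vote_step(worker_grads, params, lr=0.01):
    # Each worker sends only sign(g): one bit per coordinate.
    signed = [[sign(g) for g in grads] for grads in worker_grads]
    # The server aggregates by per-coordinate majority vote and
    # broadcasts the one-bit result back to the workers.
    update = [sign(sum(col)) for col in zip(*signed)]
    return [p - lr * u for p, u in zip(params, update)]

params = [0.5, -0.3, 1.2]
worker_grads = [
    [0.9, -0.2, 0.1],    # hypothetical gradient on GPU 1's shard
    [1.1, -0.4, -0.3],   # GPU 2
    [0.8, 0.1, 0.2],     # GPU 3
]
params = majority_vote_step(worker_grads, params)
```

Both directions of communication are 1 bit per coordinate, which is where the throughput gain comes from.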
SIGNSGD PROVIDES "FREE LUNCH"
Throughput gain with only tiny accuracy loss
p3.2xlarge machines on AWS, ResNet-50 on ImageNet
SIGNSGD ACROSS DOMAINS AND ARCHITECTURES
Huge throughput gain!
TAKE-AWAYS FOR SIGN-SGD
• Convergence even under biased gradients and noise.
• Faster than SGD in theory and in practice.
• For distributed training, similar variance reduction as SGD.
• In practice, similar accuracy but with far less communication.
TENSORS FOR LEARNING IN MANY DIMENSIONS
Tensors: beyond the 2D world
Modern data is inherently multi-dimensional
Images: 3 dimensions. Videos: 4 dimensions.
TENSORS FOR MULTI-DIMENSIONAL DATA AND HIGHER ORDER MOMENTS
Pairwise correlations Triplet correlations
OPERATIONS ON TENSORS: TENSOR CONTRACTION
Extends the notion of the matrix product.

Matrix product:  Mv = Σ_j v_j M_j

Tensor contraction:  T(u, v, ·) = Σ_{i,j} u_i v_j T_{i,j,:}
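The contraction formula can be checked directly. This is a pure-Python sketch over nested lists; a real implementation would use an einsum-style kernel:

```python
def contract(T, u, v):
    # T(u, v, .) = sum_{i,j} u_i * v_j * T[i][j][:] for an I x J x K tensor.
    I, J, K = len(T), len(T[0]), len(T[0][0])
    out = [0.0] * K
    for i in range(I):
        for j in range(J):
            for k in range(K):
                out[k] += u[i] * v[j] * T[i][j][k]
    return out

# The matrix product Mv = sum_j v_j M_j is the order-2 special case.
T = [[[1.0, 0.0], [0.0, 1.0]],
     [[2.0, 0.0], [0.0, 2.0]]]
u = [1.0, 1.0]
v = [1.0, 0.0]
result = contract(T, u, v)   # -> [3.0, 0.0]
```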
DEEP NEURAL NETS: TRANSFORMING TENSORS
DEEP TENSORIZED NETWORKS
SPACE SAVING IN DEEP TENSORIZED NETWORKS
Tensor Train RNN and LSTMs
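A back-of-the-envelope illustration of the space saving from a tensor-train (TT) factorization of a weight matrix; the mode shapes and TT-rank below are hypothetical:

```python
def tt_params(in_modes, out_modes, rank):
    # A d-core TT layer stores cores of shape (r_left, m_k, n_k, r_right),
    # with rank 1 on the outer boundaries.
    d = len(in_modes)
    total = 0
    for k in range(d):
        r_left = 1 if k == 0 else rank
        r_right = 1 if k == d - 1 else rank
        total += r_left * in_modes[k] * out_modes[k] * r_right
    return total

# Factor a 1024 x 1024 weight matrix with modes 4*4*4*4*4 on each side.
in_modes = out_modes = [4, 4, 4, 4, 4]
full = 1024 * 1024                      # 1,048,576 parameters dense
tt = tt_params(in_modes, out_modes, rank=8)   # 3,328 parameters in TT form
```

With these (made-up) shapes the TT form is roughly 300x smaller than the dense matrix, which is the kind of saving the slide refers to.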
TENSORS FOR LONG-TERM FORECASTING
Challenges:
• Long-term dependencies
• High-order correlations
• Error propagation
Datasets: climate and traffic.
TENSOR LSTM FOR LONG-TERM FORECASTING
TENSORLY: HIGH-LEVEL API FOR TENSOR ALGEBRA
• Python programming
• User-friendly API
• Multiple backends: flexible + scalable
• Example notebooks in repository
A New Vision for Autonomy
Center for Autonomous Systems and Technologies
CAST: BRINGING ROBOTICS AND AI TOGETHER
FIRST SET OF RESULTS: LEARNING TO LAND
SOME RESEARCH LEADERS AT NVIDIA
Robotics
Dieter Fox
Learning & Perception
Jan KautzBill Dally Dave Luebke Alex Keller Aaron Lefohn
Graphics
Steve Keckler Dave Nellans Mike O’Connor
Architecture / Programming
Michael Garland
VLSI
Brucek Khailany
Circuits
Tom Gray
Networks
Larry Dennison
Chief Scientist
Computer vision / Core ML
Sanja Fidler Me !
Applied research
Bryan Catanzaro
• DATA
• Collection: Active learning and partial feedback
• Aggregation: Crowdsourcing models
• Augmentation: Graphics rendering + GANs, Symbolic expressions
• ALGORITHMS
• Convergence: SignSGD has good rates in theory and practice
• Scalability: SignSGD has the same variance reduction as SGD in the multi-machine setting
• Multi-dimensionality: Tensor algebra for neural networks and probabilistic models.
• INFRASTRUCTURE:
• Frameworks: TensorLy is a high-level API for deep tensorized networks.
CONCLUSION
AI needs integration of data, algorithms and infrastructure
COLLABORATORS (LIMITED LIST)
Thank you