Deep Learning for New User Interactions (Gestures, Speech and Emotions) Olivia Klose, Software Development Engineer, Microsoft Dr. Marcel Tilly, Program Manager, Microsoft

Apr 16, 2017

Transcript
Page 1: Deep Learning for New User Interactions (Gestures, Speech and Emotions)

Deep Learning for New User Interactions (Gestures, Speech and Emotions)

Olivia Klose, Software Development Engineer, Microsoft

Dr. Marcel Tilly, Program Manager, Microsoft

Page 2: Deep Learning for New User Interactions (Gestures, Speech and Emotions)

https://www.technologyreview.com/lists/technologies/2013/

Page 3: Deep Learning for New User Interactions (Gestures, Speech and Emotions)

Deep Neural Networks

… are inspired by the neural network in the brain

# of neurons in the brain (~100 billion) ≈ # of trees in the Amazon rainforest (~300 billion)

# of synapses (~100 to 1,000 trillion) ≈ # of leaves in the Amazon rainforest

Page 4: Deep Learning for New User Interactions (Gestures, Speech and Emotions)
Page 5: Deep Learning for New User Interactions (Gestures, Speech and Emotions)

https://www.youtube.com/watch?v=V1eYniJ0Rnk

Page 6: Deep Learning for New User Interactions (Gestures, Speech and Emotions)

• Scale in Compute

• Scale in Data

• Better Algorithms

• More Investment

Page 7: Deep Learning for New User Interactions (Gestures, Speech and Emotions)
Page 8: Deep Learning for New User Interactions (Gestures, Speech and Emotions)
Page 9: Deep Learning for New User Interactions (Gestures, Speech and Emotions)
Page 10: Deep Learning for New User Interactions (Gestures, Speech and Emotions)

[Chart: speech recognition word error rate (WER %), axis 0% to 100%, by year from 1993 to 2012. Annotation: Improving domain knowledge]
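The charts on these slides plot WER (word error rate). As a minimal sketch of how that metric is commonly computed, the function below divides the word-level edit distance between a reference transcript and a recognizer hypothesis by the number of reference words. The function name and the two example sentences are illustrative assumptions, not taken from the evaluation behind the chart.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (one fewer than the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences (assumed): one substitution plus one insertion
# against a three-word reference.
print(word_error_rate("this is big", "this is hum pig"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.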

Page 11: Deep Learning for New User Interactions (Gestures, Speech and Emotions)

[Chart: speech recognition word error rate (WER %), axis 0% to 100%, by year from 1993 to 2012. Annotation: stuck]

Page 12: Deep Learning for New User Interactions (Gestures, Speech and Emotions)

[Chart: speech recognition word error rate (WER %), axis 0% to 100%, by year from 1993 to 2012. Annotation: Deep learning + Big Data + scalable tools]

Page 13: Deep Learning for New User Interactions (Gestures, Speech and Emotions)

http://arxiv.org/abs/1609.03528

http://blogs.microsoft.com/next/2016/10/18/historic-achievement-microsoft-researchers-reach-human-parity-conversational-speech-recognition

Page 14: Deep Learning for New User Interactions (Gestures, Speech and Emotions)

Speech Recognition Breakthrough for the Spoken, Translated Word

Page 15: Deep Learning for New User Interactions (Gestures, Speech and Emotions)

Skype Translator

Components: Skype | Translator | Bots | Skype Service

Pipeline: Automatic Speech Recognition → Speech Correction → Translation → Text To Speech

Page 16: Deep Learning for New User Interactions (Gestures, Speech and Emotions)

Skype Translator

Bots: software “robots” that separate and manage audio streams

Page 17: Deep Learning for New User Interactions (Gestures, Speech and Emotions)

Skype Translator

Automatic Speech Recognition

• Machine Learning

• Deep Neural Network

• New language = new training

Output: this is hum pig

Page 18: Deep Learning for New User Interactions (Gestures, Speech and Emotions)

Skype Translator

Speech Correction

• Punctuation

• Capitalization

• Disfluency removal

• Lattice Rescoring

this is hum pig → this is hum pig. → This is hum pig. → This is pig. → This is big.
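The four correction steps on this slide can be sketched as a chain of small text transforms. This is a toy illustration only: the filler-word list, the alternatives table, and the score table are hypothetical stand-ins, and the last step merely imitates the idea of lattice rescoring by choosing the best-scoring alternative sentence. It is not how the Skype Translator service implements these stages.

```python
DISFLUENCIES = {"hum", "um", "uh", "er"}  # hypothetical filler-word list
ALTERNATIVES = {"pig": ["big"]}  # hypothetical acoustically similar words
SCORES = {"this is pig": 0.2, "this is big": 0.8}  # made-up language-model scores


def add_punctuation(text: str) -> str:
    """Append a sentence-final period if one is missing."""
    return text if text.endswith((".", "?", "!")) else text + "."


def capitalize(text: str) -> str:
    """Upper-case the first character of the sentence."""
    return text[:1].upper() + text[1:]


def remove_disfluencies(text: str) -> str:
    """Drop filler words, ignoring case and attached punctuation."""
    kept = [w for w in text.split() if w.strip(".,?!").lower() not in DISFLUENCIES]
    return " ".join(kept)


def rescore(text: str) -> str:
    """Toy rescoring: keep the best-scoring single-word alternative sentence."""
    ending = text[-1] if text.endswith((".", "?", "!")) else ""
    words = text.rstrip(".?!").split()
    candidates = [words]
    for i, word in enumerate(words):
        for alt in ALTERNATIVES.get(word.lower(), []):
            candidates.append(words[:i] + [alt] + words[i + 1:])
    best = max(candidates, key=lambda c: SCORES.get(" ".join(c).lower(), 0.0))
    return " ".join(best) + ending


def correct(text: str) -> str:
    """Apply the four steps in the order listed on the slide."""
    for step in (add_punctuation, capitalize, remove_disfluencies, rescore):
        text = step(text)
        print(text)
    return text


correct("this is hum pig")
```

With these assumed tables, the printed intermediate strings follow the same progression as the slide's example.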

Page 19: Deep Learning for New User Interactions (Gestures, Speech and Emotions)

Skype Translator

this is hum pig → this is hum pig. → This is hum pig. → This is pig. → This is big.

Page 20: Deep Learning for New User Interactions (Gestures, Speech and Emotions)

Skype Translator

Translation

• Microsoft Translator core API

• Statistical Machine Translation

• 45 supported languages

this is hum pig → this is hum pig. → This is hum pig. → This is pig. → This is big. → C’est grand.

Page 21: Deep Learning for New User Interactions (Gestures, Speech and Emotions)

Skype Translator

Text To Speech

• Microsoft Translator TTS API

this is hum pig → this is hum pig. → This is hum pig. → This is pig. → This is big. → C’est grand.

Page 22: Deep Learning for New User Interactions (Gestures, Speech and Emotions)

Skype Translator

this is hum pig → this is hum pig. → This is hum pig. → This is pig. → This is big. → C’est grand.

Page 23: Deep Learning for New User Interactions (Gestures, Speech and Emotions)
Page 24: Deep Learning for New User Interactions (Gestures, Speech and Emotions)
Page 25: Deep Learning for New User Interactions (Gestures, Speech and Emotions)

[Figure labels: front view, top view, side view, input depth, inferred body parts (no tracking or smoothing)]

https://www.microsoft.com/en-us/research/video/real-time-human-pose-recognition-in-parts-from-single-depth-images-2/

Page 26: Deep Learning for New User Interactions (Gestures, Speech and Emotions)
Page 28: Deep Learning for New User Interactions (Gestures, Speech and Emotions)

https://www.microsoft.com/en-us/research/video/handpose-fully-articulated-hand-tracking/

Page 29: Deep Learning for New User Interactions (Gestures, Speech and Emotions)

[Figure: semantic segmentation examples with region labels bicycle, road, building, cat, car, grass, water, cow]

https://www.microsoft.com/en-us/research/publication/semantic-segmentation-as-image-representation-for-scene-recognition/

Page 30: Deep Learning for New User Interactions (Gestures, Speech and Emotions)

ImageNet Classification top-5 error (%)

ILSVRC 2010, NEC America: 28.2
ILSVRC 2011, Xerox: 25.8
ILSVRC 2012, AlexNet: 16.4
ILSVRC 2013, Clarifai: 11.7
ILSVRC 2014, VGG: 7.3
ILSVRC 2014, GoogLeNet: 6.7
Human Performance: 5.1
ILSVRC 2015, ResNet: 3.5

Microsoft researchers win ImageNet computer vision challenge
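The metric on this slide, top-5 error, counts a prediction as correct when the true class is among the five highest-scoring classes. The sketch below shows that calculation on a tiny made-up score matrix (four examples, seven classes). The function name and all numbers are illustrative assumptions, not ILSVRC data.

```python
def top5_error(scores, labels):
    """Fraction of examples whose true label is not among the 5 top-scoring classes."""
    misses = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score, truncated to five.
        top5 = sorted(range(len(row)), key=lambda k: row[k], reverse=True)[:5]
        if label not in top5:
            misses += 1
    return misses / len(labels)


# Made-up scores: one row per example, one column per class.
EXAMPLE_SCORES = [
    [0.05, 0.30, 0.10, 0.20, 0.15, 0.12, 0.08],  # true class 1: ranked first
    [0.40, 0.05, 0.20, 0.10, 0.15, 0.07, 0.03],  # true class 3: ranked fourth
    [0.30, 0.25, 0.20, 0.10, 0.08, 0.05, 0.02],  # true class 6: ranked last
    [0.02, 0.03, 0.05, 0.10, 0.20, 0.25, 0.35],  # true class 6: ranked first
]
EXAMPLE_LABELS = [1, 3, 6, 6]

print(top5_error(EXAMPLE_SCORES, EXAMPLE_LABELS))
```

Only the third example misses its true class in the top five, so one of the four examples counts as an error.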

Page 31: Deep Learning for New User Interactions (Gestures, Speech and Emotions)

AlexNet, 8 layers (ILSVRC 2012)

11x11 conv, 96, /4, pool/2
5x5 conv, 256, pool/2
3x3 conv, 384
3x3 conv, 384
3x3 conv, 256, pool/2
fc, 4096
fc, 4096
fc, 1000

VGG, 19 layers (ILSVRC 2014)

3x3 conv, 64
3x3 conv, 64, pool/2
3x3 conv, 128
3x3 conv, 128, pool/2
3x3 conv, 256 (x3)
3x3 conv, 256, pool/2
3x3 conv, 512 (x3)
3x3 conv, 512, pool/2
3x3 conv, 512 (x3)
3x3 conv, 512, pool/2
fc, 4096
fc, 4096
fc, 1000

GoogLeNet, 22 layers (ILSVRC 2014)

[Architecture diagram. Stem: Conv 7x7+2(S), MaxPool 3x3+2(S), LocalRespNorm, Conv 1x1+1(V), Conv 3x3+1(S), LocalRespNorm, MaxPool 3x3+2(S). Then nine inception modules, each with parallel Conv 1x1+1(S), Conv 3x3+1(S), Conv 5x5+1(S) and MaxPool 3x3+1(S) branches joined by DepthConcat, with a MaxPool 3x3+2(S) after the second and the seventh module. Two auxiliary classifiers (AveragePool 5x5+3(V), Conv 1x1+1(S), FC, FC, SoftmaxActivation: softmax0 and softmax1) and a final AveragePool 7x7+1(V), FC, SoftmaxActivation: softmax2.]

ResNet, 152 layers (ILSVRC 2015)

7x7 conv, 64, /2, pool/2
[1x1 conv, 64; 3x3 conv, 64; 1x1 conv, 256] x 3
[1x1 conv, 128; 3x3 conv, 128; 1x1 conv, 512] x 8 (first 1x1 conv with /2)
[1x1 conv, 256; 3x3 conv, 256; 1x1 conv, 1024] x 36 (first 1x1 conv with /2)
[1x1 conv, 512; 3x3 conv, 512; 1x1 conv, 2048] x 3 (first 1x1 conv with /2)
ave pool, fc 1000
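The layer counts quoted for these networks can be checked by counting the convolutional and fully connected layers in the listings above. The short sketch below does that arithmetic. The block counts for ResNet (3, 8, 36 and 3 three-layer blocks) and the conv/fc counts for AlexNet and VGG are read off the slide's listings, and the variable and function names are illustrative.

```python
# ResNet-152 stages as listed on the slide:
# (1x1 width, 3x3 width, 1x1 output width, number of three-layer blocks)
RESNET152_STAGES = [
    (64, 64, 256, 3),
    (128, 128, 512, 8),
    (256, 256, 1024, 36),
    (512, 512, 2048, 3),
]
CONVS_PER_BLOCK = 3  # 1x1, 3x3, 1x1


def resnet_depth(stages):
    """Count weighted layers: the 7x7 stem conv, all block convs, and the final fc."""
    blocks = sum(count for *_, count in stages)
    return 1 + blocks * CONVS_PER_BLOCK + 1


# (conv layers, fully connected layers) counted from the AlexNet and VGG listings.
PLAIN_NETS = {"AlexNet": (5, 3), "VGG": (16, 3)}

for name, (convs, fcs) in PLAIN_NETS.items():
    print(name, convs + fcs)
print("ResNet", resnet_depth(RESNET152_STAGES))
```

Pooling layers carry no weights and are left out of the count, which is how the 8, 19 and 152 figures come about.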

Page 32: Deep Learning for New User Interactions (Gestures, Speech and Emotions)

CNTK (http://cntk.ai)

• Open-source, cross-platform toolkit for learning and evaluating deep neural networks.

• Expresses (nearly) arbitrary neural networks by composing simple building blocks into complex computational networks.

• Production-ready: state-of-the-art accuracy, efficient, and scales to multi-GPU/multi-server.

Page 33: Deep Learning for New User Interactions (Gestures, Speech and Emotions)

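[Diagram: network with one hidden layer. Input X. Hidden Layer: P(1) from X with W(1), b(1), then S(1) = Sigmoid(P(1)). Output Layer: P(2) from S(1) with W(2), b(2), then O = Softmax(P(2)).]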

Page 34: Deep Learning for New User Interactions (Gestures, Speech and Emotions)

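# SDim, HDim and LDim are the input, hidden and label dimensions (not defined on this slide)
B1=Parameter(HDim)
W1=Parameter(HDim, SDim)
X=Input(SDim)
labels=Input(LDim)
# Hidden layer: S1 = Sigmoid(W1 * X + B1)
T1=Times(W1, X)
P1=Plus(T1, B1)
S1=Sigmoid(P1)
# Output layer: P2 = W2 * S1 + B2
B2=Parameter(LDim, 1)
W2=Parameter(LDim, HDim)
T2=Times(W2, S1)
P2=Plus(T2, B2)
# Training criterion and evaluation metric
CrossEntropy=CrossEntropyWithSoftmax(labels, P2)
ErrPredict=ErrorPrediction(labels, P2)
# Node groups used by the trainer and evaluator
FeatureNodes=(X)
LabelNodes=(labels)
CriteriaNodes=(CrossEntropy)
EvalNodes=(ErrPredict)
OutputNodes=(P2)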
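To make the computation in this network description concrete, here is a minimal plain-Python sketch of the same forward pass: an affine layer followed by a sigmoid, a second affine layer, and softmax with cross-entropy. It does not use CNTK. The helper names, the tiny dimensions (SDim=3, HDim=2, LDim=2) and the weight values are assumptions chosen for illustration.

```python
import math


def affine(W, x, b):
    """Compute W * x + b for a weight matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def softmax(v):
    m = max(v)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, W1, b1, W2, b2):
    """Hidden layer S1 = Sigmoid(W1 x + b1); output = Softmax(W2 S1 + b2)."""
    s1 = sigmoid(affine(W1, x, b1))
    p2 = affine(W2, s1, b2)
    return softmax(p2)


def cross_entropy(probs, one_hot):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t)


# Assumed toy dimensions and weights: SDim=3, HDim=2, LDim=2.
W1 = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.4]]
b1 = [0.0, 0.1]
W2 = [[0.7, -0.3], [-0.6, 0.2]]
b2 = [0.05, -0.05]

probs = forward([1.0, 2.0, 3.0], W1, b1, W2, b2)
print(probs, cross_entropy(probs, [1, 0]))
```

In the network description the softmax is folded into the CrossEntropyWithSoftmax node, so P2 holds the pre-softmax values; the sketch separates the two steps for readability.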

Page 35: Deep Learning for New User Interactions (Gestures, Speech and Emotions)

https://github.com/azure/ObjectDetectionUsingCntk

Page 36: Deep Learning for New User Interactions (Gestures, Speech and Emotions)

https://github.com/azure/ObjectDetectionUsingCntk

Page 37: Deep Learning for New User Interactions (Gestures, Speech and Emotions)

https://github.com/azure/ObjectDetectionUsingCntk

Page 38: Deep Learning for New User Interactions (Gestures, Speech and Emotions)

https://github.com/azure/ObjectDetectionUsingCntk

Page 39: Deep Learning for New User Interactions (Gestures, Speech and Emotions)
Page 40: Deep Learning for New User Interactions (Gestures, Speech and Emotions)

Cognitive Services: Give your solutions a human side

Vision: Computer Vision | Emotion | Face | Video

Speech: Computer Recognition | Speaker Recognition | Speech | Translator

Language: Bing Spell Check | Language Understanding | Linguistic Analysis | Text Analytics | Web Language Model

Knowledge: Academic Knowledge | Entity Linking | Knowledge Exploration | Recommendations

Search: Bing Auto Suggest | Bing Image Search | Bing News Search | Bing Video Search | Bing Web Search

http://microsoft.com/cognitive

Page 41: Deep Learning for New User Interactions (Gestures, Speech and Emotions)

Computer Vision API

Content of Image:

Categories v0: [{ "name": "animal", "score": 0.9765625 }]

V1: [{ "name": "grass", "confidence": 0.9999992847442627 },
{ "name": "outdoor", "confidence": 0.9999072551727295 },
{ "name": "cow", "confidence": 0.99954754114151 },
{ "name": "field", "confidence": 0.9976195693016052 },
{ "name": "brown", "confidence": 0.988935649394989 },
{ "name": "animal", "confidence": 0.97904372215271 },
{ "name": "standing", "confidence": 0.9632768630981445 },
{ "name": "mammal", "confidence": 0.9366017580032349, "hint": "animal" },
{ "name": "wire", "confidence": 0.8946959376335144 },
{ "name": "green", "confidence": 0.8844101428985596 },
{ "name": "pasture", "confidence": 0.8332059383392334 },
{ "name": "bovine", "confidence": 0.5618471503257751, "hint": "animal" },
{ "name": "grassy", "confidence": 0.48627158999443054 },
{ "name": "lush", "confidence": 0.1874018907546997 },
{ "name": "staring", "confidence": 0.165890634059906 }]

Describe:
0.975 "a brown cow standing on top of a lush green field"
0.974 "a cow standing on top of a lush green field"
0.965 "a large brown cow standing on top of a lush green field"
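A result like the tag list above is typically consumed by parsing the JSON and keeping only the tags above a chosen confidence. The sketch below does this with the standard library on a subset of the tags shown on the slide. It makes no call to the service; the function name and the 0.9 threshold are assumptions for illustration.

```python
import json

# A subset of the tag array shown on the slide, as a JSON string.
TAGS_JSON = """
[{ "name": "grass", "confidence": 0.9999992847442627 },
 { "name": "cow", "confidence": 0.99954754114151 },
 { "name": "mammal", "confidence": 0.9366017580032349, "hint": "animal" },
 { "name": "wire", "confidence": 0.8946959376335144 },
 { "name": "bovine", "confidence": 0.5618471503257751, "hint": "animal" },
 { "name": "lush", "confidence": 0.1874018907546997 }]
"""


def confident_tags(tags, threshold=0.9):
    """Return the names of tags whose confidence is at least the threshold."""
    return [tag["name"] for tag in tags if tag["confidence"] >= threshold]


tags = json.loads(TAGS_JSON)
print(confident_tags(tags))
```

The right threshold depends on the application: a lower value surfaces more descriptive tags such as "wire" at the cost of more uncertain ones.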

Page 42: Deep Learning for New User Interactions (Gestures, Speech and Emotions)

https://www.youtube.com/watch?v=R2mC-NUAmMk