Deep Learning for New User Interactions (Gestures, Speech and Emotions)

Olivia Klose, Software Development Engineer, Microsoft

Dr. Marcel Tilly, Program Manager, Microsoft

https://www.technologyreview.com/lists/technologies/2013/

Deep Neural Networks

… are inspired by the neural networks in the brain

# of neurons in the brain (~100 billion) ≈ # of trees in the Amazon rainforest (~300 billion)

# of synapses (~100 to 1,000 trillion) ≈ # of leaves in the Amazon rainforest

https://www.youtube.com/watch?v=V1eYniJ0Rnk

Scale in Compute

Scale in Data

Better Algorithms

More Investment

[Chart: speech recognition word error rate (WER %), 1993 to 2012, y-axis 0% to 100%. Three phases are annotated in turn: improving domain knowledge; stuck; deep learning + big data + scalable tools.]
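WER (word error rate) is conventionally the word-level edit distance between the recognizer's output and a reference transcript, divided by the number of reference words. The sketch below illustrates that definition only; the function name and the example sentences are assumptions chosen for illustration, not the evaluation code behind the chart.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # reference word missing from hypothesis
                            curr[j - 1] + 1,      # extra word in hypothesis
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One extra word ("hum") and one substitution ("pig" for "big")
# against a three-word reference.
print(word_error_rate("this is big", "this is hum pig"))
```

Because extra words count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.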

http://arxiv.org/abs/1609.03528

http://blogs.microsoft.com/next/2016/10/18/historic-achievement-microsoft-researchers-reach-human-parity-conversational-speech-recognition

Speech Recognition Breakthrough for the Spoken, Translated Word

https://www.youtube.com/watch?v=Nu-nlQqFCKg

Skype Translator

Components: Skype | Translator Bots | Skype Service

Stages: Automatic Speech Recognition → Speech Correction → Translation → Text To Speech

Translator Bots

• Software “robots”

• Separate and manage audio streams

Automatic Speech Recognition

• Machine Learning

• Deep Neural Network

• New language = new training

Example output: this is hum pig

Speech Correction

• Punctuation

• Capitalization

• Disfluency removal

• Lattice Rescoring

Example: this is hum pig → this is hum pig. → This is hum pig. → This is pig. → This is big.

Translation

• Microsoft Translator core API

• Statistical Machine Translation

• 45 supported languages

Example: This is big. → C’est grand.

Text To Speech

• Microsoft Translator TTS API
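The speech correction steps can be sketched as a chain of small text transforms. This is a toy illustration, not Skype Translator's implementation: the filler-word list, the `TOY_LATTICE` score table, and all function names are assumptions, and real lattice rescoring works on the recognizer's full hypothesis lattice rather than a per-word lookup.

```python
import re

# Hypothetical filler words; a production system would learn these from data.
FILLERS = {"hum", "um", "uh", "er"}

# Hypothetical stand-in for an ASR lattice: for a recognized word, the
# alternative candidates and a made-up language-model score for each.
TOY_LATTICE = {"pig": [("pig", 0.2), ("big", 0.8)]}


def add_punctuation(text):
    text = text.strip()
    return text if text.endswith((".", "?", "!")) else text + "."


def capitalize(text):
    return text[:1].upper() + text[1:]


def remove_disfluencies(text):
    kept = [w for w in text.split() if w.strip(".,!?").lower() not in FILLERS]
    return " ".join(kept)


def rescore(text):
    def best(match):
        word = match.group(0)
        candidates = TOY_LATTICE.get(word.lower())
        if not candidates:
            return word
        # Keep the candidate with the highest toy score.
        return max(candidates, key=lambda c: c[1])[0]
    return re.sub(r"[A-Za-z']+", best, text)


def correct(text):
    """Return the text after each stage, starting with the raw input."""
    steps = [text]
    for stage in (add_punctuation, capitalize, remove_disfluencies, rescore):
        steps.append(stage(steps[-1]))
    return steps


for step in correct("this is hum pig"):
    print(step)
```

Run on the slide's example, the stages reproduce the same sequence: punctuation, then capitalization, then removal of "hum", then "pig" rescored to "big".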

[Figure: input depth image and inferred body parts, shown in front view, side view, and top view (no tracking or smoothing)]

https://www.microsoft.com/en-us/research/video/real-time-human-pose-recognition-in-parts-from-single-depth-images-2/

Kinect Gesture Data Set

https://www.microsoft.com/en-us/download/details.aspx?id=52283

https://www.microsoft.com/en-us/research/video/handpose-fully-articulated-hand-tracking/

[Figure: semantic segmentation examples with region labels bicycle, road, building, cat, car, grass, water, cow]

https://www.microsoft.com/en-us/research/publication/semantic-segmentation-as-image-representation-for-scene-recognition/

ImageNet Classification top-5 error (%)

ILSVRC 2010, NEC America: 28.2
ILSVRC 2011, Xerox: 25.8
ILSVRC 2012, AlexNet: 16.4
ILSVRC 2013, Clarifai: 11.7
ILSVRC 2014, VGG: 7.3
ILSVRC 2014, GoogLeNet: 6.7
Human performance: 5.1
ILSVRC 2015, ResNet: 3.5
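Top-5 error counts a prediction as wrong only when the true class is not among the five highest-scoring classes. A minimal sketch of that metric follows; the function name and the tiny score matrix are illustrative assumptions, not the ILSVRC evaluation code.

```python
def top_k_error(scores, labels, k=5):
    """Fraction of examples whose true label is not among the k best scores."""
    wrong = 0
    for row, label in zip(scores, labels):
        # Indices of the k highest-scoring classes for this example.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        if label not in top_k:
            wrong += 1
    return wrong / len(labels)


# Two examples over three classes; with k=2 only the second one misses.
scores = [[0.1, 0.5, 0.4],
          [0.7, 0.2, 0.1]]
labels = [2, 2]
print(top_k_error(scores, labels, k=2))
```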

Microsoft researchers win ImageNet computer vision challenge

http://blogs.microsoft.com/next/2015/12/10/microsoft-researchers-win-imagenet-computer-vision-challenge/

AlexNet, 8 layers (ILSVRC 2012)

11x11 conv, 96, /4, pool/2
5x5 conv, 256, pool/2
3x3 conv, 384
3x3 conv, 384
3x3 conv, 256, pool/2
fc, 4096
fc, 4096
fc, 1000

VGG, 19 layers (ILSVRC 2014)

3x3 conv, 64
3x3 conv, 64, pool/2
3x3 conv, 128
3x3 conv, 128, pool/2
3x3 conv, 256 (x3)
3x3 conv, 256, pool/2
3x3 conv, 512 (x3)
3x3 conv, 512, pool/2
3x3 conv, 512 (x3)
3x3 conv, 512, pool/2
fc, 4096
fc, 4096
fc, 1000

GoogLeNet, 22 layers (ILSVRC 2014)

[Figure: network diagram. A stem (7x7 conv, max pooling, local response normalization, 1x1 and 3x3 conv) feeds nine stacked modules, each combining parallel 1x1, 3x3, and 5x5 convolutions and 3x3 max pooling through DepthConcat. Two auxiliary classifiers (softmax0, softmax1) branch off mid-network; the final classifier (softmax2) follows a 7x7 average pool and a fully connected layer.]

ResNet, 152 layers (ILSVRC 2015)

7x7 conv, 64, /2, pool/2
[1x1 conv, 64 | 3x3 conv, 64 | 1x1 conv, 256] x 3
[1x1 conv, 128 | 3x3 conv, 128 | 1x1 conv, 512] x 8 (first block /2)
[1x1 conv, 256 | 3x3 conv, 256 | 1x1 conv, 1024] x 36 (first block /2)
[1x1 conv, 512 | 3x3 conv, 512 | 1x1 conv, 2048] x 3 (first block /2)
ave pool, fc 1000
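The 152-layer figure can be checked by counting the weighted layers in the ResNet listing: the initial 7x7 convolution, three convolutions per block, and the final fully connected layer. The sketch assumes pooling layers are not counted; the names `STAGES` and `resnet_depth` are illustrative.

```python
# (number of blocks, convolutions per block) for each stage, matching the
# 64-, 128-, 256-, and 512-filter groups in the ResNet-152 listing.
STAGES = [(3, 3), (8, 3), (36, 3), (3, 3)]


def resnet_depth(stages):
    """Count weighted layers: stem conv + all block convs + final fc."""
    stem = 1  # 7x7 conv, 64, /2
    fc = 1    # fc 1000
    return stem + sum(blocks * convs for blocks, convs in stages) + fc


print(resnet_depth(STAGES))  # 1 + (9 + 24 + 108 + 9) + 1 = 152
```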

CNTK: open-source, cross-platform toolkit for training and evaluating deep neural networks.

Expresses (nearly) arbitrary neural networks by composing simple building blocks into complex computational networks

Production-ready: State-of-the-art accuracy, efficient, and scales to multi-GPU/multi-server. http://cntk.ai

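[Figure: network with one hidden layer. Input X; hidden layer with parameters W(1), b(1), sum P(1), and sigmoid activation S(1); output layer with parameters W(2), b(2), sum P(2), and softmax; output O.]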

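# SDim, HDim, LDim: input, hidden, and label dimensions (defined elsewhere in the config)
# Inputs
X=Input(SDim)
labels=Input(LDim)
# Hidden layer: S1 = Sigmoid(W1 * X + B1)
W1=Parameter(HDim, SDim)
B1=Parameter(HDim, 1)
T1=Times(W1, X)
P1=Plus(T1, B1)
S1=Sigmoid(P1)
# Output layer: P2 = W2 * S1 + B2 (softmax is applied inside the criterion)
W2=Parameter(LDim, HDim)
B2=Parameter(LDim, 1)
T2=Times(W2, S1)
P2=Plus(T2, B2)
# Training criterion and evaluation metric
CrossEntropy=CrossEntropyWithSoftmax(labels, P2)
ErrPredict=ErrorPrediction(labels, P2)
# Node roles
FeatureNodes=(X)
LabelNodes=(labels)
CriteriaNodes=(CrossEntropy)
EvalNodes=(ErrPredict)
OutputNodes=(P2)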
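To make the data flow of this network description concrete, here is a plain-Python sketch of the same forward pass: a sigmoid hidden layer, a linear output layer, and a softmax cross-entropy criterion. It is an illustration only and does not use CNTK; all function names are assumptions, and the output bias is taken to be a separate parameter b2, as the network diagram's b(2) suggests.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-z)) for z in v]


def softmax(v):
    m = max(v)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in v]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, W1, b1, W2, b2):
    """Return P2, the pre-softmax output of the two-layer network."""
    p1 = [t + b for t, b in zip(matvec(W1, x), b1)]   # P1 = W1 * X + B1
    s1 = sigmoid(p1)                                  # S1 = Sigmoid(P1)
    return [t + b for t, b in zip(matvec(W2, s1), b2)]  # P2 = W2 * S1 + B2


def cross_entropy_with_softmax(labels, p2):
    probs = softmax(p2)
    return -sum(y * math.log(p) for y, p in zip(labels, probs) if y > 0)


def error_prediction(labels, p2):
    """1.0 if the highest-scoring class differs from the labeled class."""
    return 0.0 if p2.index(max(p2)) == labels.index(max(labels)) else 1.0


# Toy dimensions: SDim=2, HDim=3, LDim=2, with all-zero weights.
x = [1.0, 2.0]
W1 = [[0.0, 0.0] for _ in range(3)]
b1 = [0.0, 0.0, 0.0]
W2 = [[0.0, 0.0, 0.0] for _ in range(2)]
b2 = [0.0, 1.0]
p2 = forward(x, W1, b1, W2, b2)
print(p2, error_prediction([1.0, 0.0], p2))
```

With zero weights the output equals the bias b2, so the second class wins and the one-hot label for the first class is counted as an error.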

https://github.com/azure/ObjectDetectionUsingCntk








Cognitive Services: give your solutions a human side

Vision: Computer Vision | Emotion | Face | Video

Speech: Computer Recognition | Speaker Recognition | Speech | Translator

Language: Bing Spell Check | Language Understanding | Linguistic Analysis | Text Analytics | Web Language Model

Knowledge: Academic Knowledge | Entity Linking | Knowledge Exploration | Recommendations

Search: Bing Auto Suggest | Bing Image Search | Bing News Search | Bing Video Search | Bing Web Search

http://microsoft.com/cognitive

Computer Vision API

Content of Image:

Categories v0: [{ "name": "animal", "score": 0.9765625 }]

V1: [{ "name": "grass", "confidence": 0.9999992847442627 },

{ "name": "outdoor", "confidence": 0.9999072551727295 },

{ "name": "cow", "confidence": 0.99954754114151 },

{ "name": "field", "confidence": 0.9976195693016052 },

{ "name": "brown", "confidence": 0.988935649394989 },

{ "name": "animal", "confidence": 0.97904372215271 },

{ "name": "standing", "confidence": 0.9632768630981445 },

{ "name": "mammal", "confidence": 0.9366017580032349, "hint": "animal" },

{ "name": "wire", "confidence": 0.8946959376335144 },

{ "name": "green", "confidence": 0.8844101428985596 },

{ "name": "pasture", "confidence": 0.8332059383392334 },

{ "name": "bovine", "confidence": 0.5618471503257751, "hint": "animal" },

{ "name": "grassy", "confidence": 0.48627158999443054 },

{ "name": "lush", "confidence": 0.1874018907546997 },

{ "name": "staring", "confidence": 0.165890634059906 }]

Describe:

0.975 "a brown cow standing on top of a lush green field"

0.974 "a cow standing on top of a lush green field"

0.965 "a large brown cow standing on top of a lush green field"
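A client would typically keep only the tags it can trust. The sketch below parses a few of the tag entries shown in the response and filters them by confidence. It makes no call to the service; the function name, the 0.9 default threshold, and the choice of sample entries are assumptions for illustration.

```python
import json

# A subset of the tag entries from the response, copied as JSON text.
SAMPLE_TAGS = """[
  {"name": "grass", "confidence": 0.9999992847442627},
  {"name": "cow", "confidence": 0.99954754114151},
  {"name": "wire", "confidence": 0.8946959376335144},
  {"name": "bovine", "confidence": 0.5618471503257751, "hint": "animal"},
  {"name": "staring", "confidence": 0.165890634059906}
]"""


def confident_tags(tags_json, threshold=0.9):
    """Return tag names whose confidence is at least the threshold."""
    tags = json.loads(tags_json)
    return [tag["name"] for tag in tags if tag["confidence"] >= threshold]


print(confident_tags(SAMPLE_TAGS))        # high-confidence tags only
print(confident_tags(SAMPLE_TAGS, 0.5))   # looser threshold
```

The right threshold depends on the application: a caption generator may prefer few, reliable tags, while a search index may accept lower-confidence tags such as "bovine".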

https://www.youtube.com/watch?v=R2mC-NUAmMk
