  • ModDrop: Adaptive Multi-Modal Gesture Recognition

    Presented by Chongyang Bai, May 14, 2020

    Dartmouth

    Natalia Neverova, Christian Wolf, Graham Taylor, and Florian Nebout

    PAMI 2016

  • Gesture Recognition from RGBD Video
    • Multi-modal inputs


  • Gesture Recognition from RGBD Video
    • Multi-modal inputs
    • Localization
      ▫ Start frame
      ▫ End frame
    • Recognition
      ▫ Which gesture?


  • Challenges
    • Gestures at various spatial and temporal scales
    • Noisy/missing signals (e.g., depth)
    • Limited training data vs. flexible gestures
    • Real-time


  • Contributions
    • Multi-modal and multi-scale deep learning framework

    • Challenges
      ▫ Gestures at various spatial and temporal scales
      ▫ Flexible gestures
      ▫ Real-time


  • Contributions
    • Multi-modal and multi-scale deep learning framework
      ▫ ModDrop: multi-modal dropout training technique

    • Challenges
      ▫ Noisy/missing signals (e.g., depth)
      ▫ Limited training data


  • Contributions
    • Multi-modal and multi-scale deep learning framework
      ▫ ModDrop: multi-modal dropout training technique
      ▫ Model initialization for multi-modal fusion
    • Challenges
      ▫ Limited training data


  • Results Summary
    • Achieves 0.87 Jaccard index (rank 1) in the ChaLearn 2014 challenge
      ▫ Improves to 0.88 when adding the audio modality (ChaLearn+A)
    • A localization refinement technique further improves the accuracy
    • ModDrop is robust to noisy or missing samples at test time on
      ▫ MNIST
      ▫ ChaLearn+A
    • The initialization for multi-modal fusion is effective


  • Related Work
    • Gesture Recognition
      ▫ Classification with motion trajectories [1]
      ▫ HoG features from RGB and depth images [2]
      ▫ 3D CNN to learn spatio-temporal representations [3]
    • Multi-modal Fusion
      ▫ Early fusion and late fusion [4]
      ▫ Multiple Kernel Learning (MKL) [5]
      ▫ Deep neural nets [6]


  • Methodology Outline
    • Overall architecture
    • Multi-modal framework
      ▫ Initialization for multi-modal fusion
      ▫ ModDrop: method and regularization properties

    • Inter-scale late fusion
    • Gesture localization as refinement


  • Overall Architecture
    • Multi-scale sampling
    • Single-scale multi-modal fusion
    • Inter-scale late fusion
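    A minimal sketch of what the multi-scale temporal sampling could look like, assuming 5-frame blocks and sampling strides 2-4 as mentioned on later slides; the function name and the clipping at clip boundaries are my own choices, not the paper's:

      import numpy as np

      def sample_multiscale_windows(video, t, steps=(2, 3, 4), n_frames=5):
          """Cut one n_frames block around frame t for every temporal scale.

          video: array of frames with shape (T, H, W); steps: sampling strides
          (scales) whose predictions are combined by inter-scale late fusion.
          """
          windows = {}
          for s in steps:
              # centred, stride-s frame indices, clipped to the clip boundaries
              offsets = (np.arange(n_frames) - n_frames // 2) * s
              idx = np.clip(t + offsets, 0, len(video) - 1)
              windows[s] = video[idx]          # shape (n_frames, H, W)
          return windows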


  • Single-scale Multi-modal Fusion

    • Four paths
    • Single-path pre-training
    • Initialization for multi-modal fusion
    • ModDrop


  • Single-scale Multi-modal Fusion

    • Paths V1/V2 for hands (a sketch follows below)
      ▫ Input: depth volume (W×H×5) and gray-scale volume (W×H×5)
      ▫ Architecture: Conv3D, max pooling over time, Conv2D; flattened concatenation for HLV1
      ▫ Horizontally flipped input for V2, which shares parameters with V1
      ▫ Active hands for training are detected from the trajectory of the hand joint
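    A rough PyTorch-style sketch of one hand path under the architecture listed above; channel counts, kernel sizes, and the 72×72 crop are illustrative assumptions, and V2 is obtained by reusing the same module on a mirrored crop so the parameters are shared:

      import torch
      import torch.nn as nn

      class HandPath(nn.Module):
          """One hand path (V1); V2 reuses the same module on a horizontally
          flipped crop.  Layer sizes are illustrative, not the paper's values."""
          def __init__(self):
              super().__init__()
              self.conv3d = nn.Conv3d(1, 8, kernel_size=(3, 5, 5))  # over (time, H, W)
              self.pool_t = nn.AdaptiveMaxPool3d((1, None, None))   # max pooling over time
              self.conv2d = nn.Conv2d(8, 16, kernel_size=5)
              self.pool2d = nn.MaxPool2d(2)

          def forward(self, x):                 # x: (B, 1, n_frames, H, W)
              x = torch.relu(self.conv3d(x))
              x = self.pool_t(x).squeeze(2)     # collapse the time dimension
              x = self.pool2d(torch.relu(self.conv2d(x)))
              return x.flatten(1)               # flattened features for HLV1

      hand_path = HandPath()
      crop = torch.randn(2, 1, 5, 72, 72)               # depth or gray-scale hand volume
      v1 = hand_path(crop)
      v2 = hand_path(torch.flip(crop, dims=[-1]))       # mirrored input, shared weights
      hlv1_input = torch.cat([v1, v2], dim=1)           # flattened concatenation for HLV1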


  • Single-scale Multi-modal Fusion

    • Input normalization
      ▫ Normalize hand bounding boxes in each frame according to the hand's distance to the sensor [7]
      ▫ H_x: bounding box width (pixels); h_x: actual hand size (mm); z: distance to the sensor; X: image width
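    The exact normalization formula is not reproduced in this transcript; a minimal sketch under a simple pinhole-camera assumption (Kinect-like horizontal field of view, hypothetical values) shows the intended effect, a crop whose pixel size does not depend on the hand's distance:

      import math

      def hand_box_pixels(h_x_mm, z_mm, image_width_px=640, fov_x_deg=57.0):
          """Pixel width H_x of a box around a hand of metric size h_x (mm)
          seen at depth z (mm), under a pinhole model.  fov_x_deg and the
          pinhole assumption are illustrative, not the paper's formula [7]."""
          visible_width_mm = 2.0 * z_mm * math.tan(math.radians(fov_x_deg) / 2.0)
          return image_width_px * h_x_mm / visible_width_mm

      # A 180 mm hand at 1.5 m spans roughly hand_box_pixels(180, 1500) ~ 70 px,
      # so crops taken this way keep the hand at a depth-independent scale.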


  • Single-scale Multi-modal Fusion
    • Path M (articulated pose)
      ▫ 3-layer MLP
      ▫ Per-frame input features:
        Normalized joint positions
        Joint velocities and accelerations
        Inclination angles
        Azimuth angles
        Bending angles
        Pairwise distances
      ▫ A rich representation of individual articulation differences
      ▫ Concatenated over 5 frames (a sketch of such a descriptor follows below)
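    A minimal sketch of a per-frame descriptor stacked over 5 frames, using only the position, velocity/acceleration, and pairwise-distance features (the angle features are omitted); the joint layout and the choice of root joint are assumptions:

      import numpy as np
      from itertools import combinations

      def pose_descriptor(joints, t, n_frames=5):
          """joints: (T, J, 3) array of 3-D joint positions for one clip.
          Returns the concatenated descriptor for the n_frames ending at t
          (assumes t >= n_frames + 1 so the finite differences are defined)."""
          feats = []
          for f in range(t - n_frames + 1, t + 1):
              p = joints[f]
              root = p[0]                                   # e.g. a hip-centre joint
              pos = (p - root).ravel()                      # normalized positions
              vel = (joints[f] - joints[f - 1]).ravel()     # finite-difference velocity
              acc = (joints[f] - 2 * joints[f - 1] + joints[f - 2]).ravel()  # acceleration
              dists = [np.linalg.norm(p[i] - p[j])
                       for i, j in combinations(range(len(p)), 2)]           # pairwise distances
              feats.append(np.concatenate([pos, vel, acc, dists]))
          return np.concatenate(feats)                      # input to the 3-layer MLP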


    (Body pose image source: [7])

  • Single-scale Multi-modal Fusion
    • Path M (articulated pose)
      ▫ Input features [7]: normalized joint positions, azimuth angles, bending angles, pairwise distances


    (Pose image source: [7])

  • Single-scale Multi-modal Fusion
    • Path A (audio)
      ▫ Input: mel-frequency histograms (time-frequency-amplitude)
      ▫ Fed to a Conv2D layer + 2 hidden layers
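    A minimal sketch of building such a mel-frequency input with librosa; the number of mel bands, sampling rate, and dB scaling are illustrative choices rather than the paper's settings:

      import librosa

      def mel_input(wav_path, n_mels=40, sr=16000):
          """Time-frequency-amplitude map fed to the Conv2D audio path."""
          y, sr = librosa.load(wav_path, sr=sr)
          mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
          return librosa.power_to_db(mel)      # shape (n_mels, n_frames)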


  • Single-scale Multi-modal Fusion
    • Single-path pre-training
      ▫ The whole network has too many parameters

    • Early fusion of heterogeneous data sources is not effective

    • Fuse the paths in a late hidden layer (HLS)


    (Figure: each path is pre-trained with its own softmax output)

  • Initialization of Multi-modal Fusion
    • Shared layer 1: W_1
      ▫ N classes
      ▫ K modalities with feature dimensions d_1, ..., d_K; d = Σ_k d_k is the input dimension
      ▫ NK: output dimension


  • Initialization of Multi-modal Fusion
    • Shared layer 1: W_1
      ▫ N classes
      ▫ K modalities with feature dimensions d_1, ..., d_K; Σ_k d_k is the input dimension
      ▫ NK: output dimension
      ▫ x_i^(k): feature i of modality k (blue circles)
      ▫ u_j^(m): unit j associated with modality m (pink circle)
      ▫ w_{i,j}^(m,k): weight between the two
      ▫ Initialization: the within-modality weights w_{i,j}^(k,k) come from pre-training; cross-modality weights start at zero (a sketch follows below)
      ▫ The cross-modality weights are then increased during fusion training to learn cross-modality correlations
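    A minimal sketch of this block-structured initialization, assuming the per-modality weight matrices from single-path pre-training are available; the function and variable names are mine:

      import numpy as np

      def init_shared_layer(pretrained_blocks):
          """pretrained_blocks: list of K per-modality weight matrices W_k of
          shape (d_k, n_out_k) taken from single-path pre-training (for the
          first shared layer each n_out_k would be N, giving NK outputs).
          Returns a (sum d_k, sum n_out_k) matrix whose within-modality blocks
          are the pre-trained weights and whose cross-modality blocks are zero."""
          d = sum(w.shape[0] for w in pretrained_blocks)
          n = sum(w.shape[1] for w in pretrained_blocks)
          W = np.zeros((d, n))
          r = c = 0
          for w in pretrained_blocks:
              W[r:r + w.shape[0], c:c + w.shape[1]] = w   # diagonal (k, k) block
              r += w.shape[0]
              c += w.shape[1]
          return W   # cross-modality (m, k), m != k, blocks grow during fusion training

    The second shared layer, which takes the per-modality pre-softmax class scores, could be initialized with the same per-modality block pattern.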


  • Initialization of Multi-modal Fusion
    • Shared layer 2: W_2
      ▫ NK: input dimension


    (Figure: per-path pre-training, each path with its own softmax, before fusion)

  • Initialization of Multi-modal Fusion
    • Shared layer 2: W_2
      ▫ NK: input dimension
      ▫ N: output dimension (number of classes)
      ▫ Input h_c^(k) is the pre-softmax score for class c predicted via modality k during pre-training
      ▫ W_2^(k) ∈ R^(N×N): the per-modality block of W_2

  • ModDrop: Multi-modal Dropout
    • Inspired by Dropout
    • Avoid false/redundant co-adaptation between modalities

    • How can we obtain robust predictions when some modalities are missing or noisy at test time?

    • L: cross-entropy loss
    • f^(k): the model component specific to modality k
    • W_h: weights of layer h


    (Equation on slide: the ideal loss)

  • ModDrop: Multi-modal Dropout
    • Inspired by Dropout
    • Avoid false co-adaptation between modalities
    • Obtain robust predictions when some modalities are missing or noisy

    • L: cross-entropy loss
    • W_h: weights


    (Equation on slide: the ideal loss)

    Huge amount of computation: ~2^K terms!
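    One way to write the "ideal" loss the slide refers to, assuming each modality k is kept with probability p_k and δ^(k) is its on/off indicator (the notation is mine, not necessarily the paper's); the sum runs over all 2^K presence patterns, which is what makes it intractable:

      % Expected loss over every on/off pattern of the K modality inputs
      \mathcal{L}_{\text{ideal}}
        = \sum_{\boldsymbol{\delta} \in \{0,1\}^{K}}
          \Bigl( \prod_{k=1}^{K} p_k^{\delta^{(k)}} (1 - p_k)^{1 - \delta^{(k)}} \Bigr)\,
          \mathcal{L}\bigl( f(\delta^{(1)} x^{(1)}, \dots, \delta^{(K)} x^{(K)}),\, y \bigr)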

  • ModDrop: Multi-modal Dropout
    • Solution: randomly drop modality inputs when training each batch
    • Input sampled from modality k
    • Bernoulli selector δ^(k) as the random indicator variable
    • The resulting multi-modal network (equations on slide)
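    A minimal sketch of a ModDrop training step under these definitions: one Bernoulli selector per modality is drawn for the batch and the corresponding input is zeroed; the keep probabilities and the loop skeleton are placeholders, not the paper's values:

      import numpy as np

      def moddrop_batch(inputs, keep_probs, rng=None):
          """Zero out whole modalities for one training batch.

          inputs:     dict modality name -> input array for the batch
          keep_probs: dict modality name -> Bernoulli keep probability p_k
                      (the values used below are hyper-parameter placeholders)
          """
          rng = rng or np.random.default_rng()
          out = {}
          for name, x in inputs.items():
              delta = rng.random() < keep_probs[name]    # Bernoulli selector delta^(k)
              out[name] = x if delta else np.zeros_like(x)
          return out

      # Sketch of the training loop (model/loss names are placeholders):
      #   for batch in loader:
      #       x = moddrop_batch(batch.inputs,
      #                         {"depth": 0.8, "gray": 0.8, "pose": 0.8, "audio": 0.8})
      #       loss = cross_entropy(model(x), batch.labels)
      #       loss.backward(); optimizer.step()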


  • ModDrop: Multi-modal Dropout
    • Regularization properties on a one-layer network with sigmoid activation


  • ModDrop: Multi-modal dropout
    Regularization properties on a one-layer network + sigmoid activation

    (Equations on slide: the original network vs. the ModDrop network)

  • ModDrop: Multi-modal dropout
    Regularization properties on a one-layer network + sigmoid activation

    (Equations on slide: Original vs. ModDrop; the dropped input's index is marked)

  • ModDrop: Multi-modal dropout
    Regularization properties on a one-layer network + sigmoid activation

    (Equations on slide: Original vs. ModDrop)

    Take the expectation of the equation over the Bernoulli selectors δ^(k)
    Approximation 1: E[σ(x)] ≈ σ(E[x]) [8]
    Approximation 2: first-order Taylor expansion around s

  • ModDrop: Multi-modal dropout
    Regularization properties on a one-layer network + sigmoid activation

    • Substitution (equation on slide)
    • The gradient for ModDrop is the probability times the original gradient, minus a regularization term

  • ModDrop: Multi-modal dropout
    Regularization properties on a one-layer network + sigmoid activation

    • Substitution (equation on slide)
    • The gradient for ModDrop is the probability times the original gradient, minus a regularization term

    • If ... then ... (condition and result shown on the slide)
    • Integrate the partial derivative and take the summation over all k and i

  • ModDrop: Multi-modal dropout
    Regularization properties on a one-layer network + sigmoid activation

    • Substitute the gradient of the original network into the gradient for ModDrop
    • The gradient for ModDrop is the probability times the original gradient, minus a regularization term
    • If ... then ... (condition and result shown on the slide)
    • Integrate the partial derivative and take the summation over all k and i

    The ModDrop loss is p times the complete-model loss minus a regularization term.

    The regularization term contains only cross-modality multiplications!

  • ModDrop: Multi-modal dropout
    Regularization properties on a one-layer network + sigmoid activation

    • For any features i, j from modalities k, m:
      ▫ x_i^(k) and x_j^(m)
      ▫ E[x_i^(k)] = E[x_j^(m)] = 0, which can always be achieved by input normalization
      ▫ E[x_i^(k) x_j^(m)] = E[x_i^(k)] E[x_j^(m)] + Cov(x_i^(k), x_j^(m)) = Cov(x_i^(k), x_j^(m))

    Loss for one sample!

  • ModDrop: Multi-modal dropout
    Regularization properties on a one-layer network + sigmoid activation

    • For any features i, j from modalities k, m:
      ▫ x_i^(k) and x_j^(m)
      ▫ E[x_i^(k)] = E[x_j^(m)] = 0, due to input normalization
      ▫ E[x_i^(k) x_j^(m)] = E[x_i^(k)] E[x_j^(m)] + Cov(x_i^(k), x_j^(m)) = Cov(x_i^(k), x_j^(m))

    • Case 1: x_i^(k) and x_j^(m) are positively correlated
      ▫ E[x_i^(k) x_j^(m)] > 0
      ▫ The training process encourages w_i^(k) w_j^(m) to be positive

    Loss for one sample!

  • ModDrop: Multi-modal dropout
    Regularization properties on a one-layer network + sigmoid activation

    • Case 1: x_i^(k) and x_j^(m) are positively correlated
      ▫ E[x_i^(k) x_j^(m)] > 0
      ▫ The training process encourages w_i^(k) w_j^(m) to be positive

    • Case 2: x_i^(k) and x_j^(m) are negatively correlated
      ▫ E[x_i^(k) x_j^(m)] < 0
      ▫ The training process encourages w_i^(k) w_j^(m) to be negative

    Loss for one sample!

  • ModDrop: Multi-modal dropout
    Regularization properties on a one-layer network + sigmoid activation

    • For correlated modalities, the regularization term encourages the network to
      ▫ Discover similarities between the modalities
      ▫ Align the modalities by learning the weights

  • ModDrop: Multi-modal dropout
    Regularization properties on a one-layer network + sigmoid activation

    • Case 3: x_i^(k) and x_j^(m) are uncorrelated
      ▫ E[x_i^(k) x_j^(m)] = 0
      ▫ Assumption: the weights obey a unimodal distribution with zero expectation [9]
      ▫ By the Lyapunov central limit theorem, the term tends to zero as the amount of training samples tends to infinity [10]
      ▫ Additional constraint: L2 regularization of the weights

    Loss for one sample!

  • Methodology Outline
    • Overall architecture
    • Multi-modal framework
      ▫ Initialization for multi-modal fusion
      ▫ ModDrop: method and regularization properties

    • Inter-scale late fusion
    • Gesture localization as refinement


  • Inter-scale Late Fusion
    • For frame t and class k, take a weighted sum of the per-scale predictions over scales s = 2, 3, 4 (see the sketch below)
    • Get the frame-wise final prediction
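    A minimal sketch of this weighted late fusion; the per-scale score arrays and the fusion weights are assumed inputs (equal weights are only a placeholder, the actual weights would be chosen on validation data):

      import numpy as np

      def fuse_scales(per_scale_scores, weights=None):
          """per_scale_scores: dict scale s -> (T, K) array of frame-wise class scores.
          Returns the fused (T, K) scores and the frame-wise final prediction."""
          scales = sorted(per_scale_scores)
          if weights is None:
              weights = {s: 1.0 / len(scales) for s in scales}   # placeholder weights
          fused = sum(weights[s] * per_scale_scores[s] for s in scales)
          return fused, fused.argmax(axis=1)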


  • Gesture Localization as Refinement
    • The recognition framework (R) makes predictions over sliding windows
      ▫ Noisy windows: those covering the intersection of a gesture and the rest state
    • Train an MLP (M) to classify "motion" vs. "no motion" for each frame
      ▫ Input: pose descriptors
      ▫ 98% accuracy
    • Post-refinement (sketched below)
      ▫ For each gesture predicted by R, assign its boundary frames to the closest switching frames predicted by M
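    A minimal sketch of the boundary-snapping step, assuming R's output is a list of (start, end, label) spans and M's output is a per-frame boolean motion mask; the names are hypothetical:

      import numpy as np

      def refine_boundaries(gestures, motion_mask):
          """gestures: list of (start, end, label) from the recognition network R.
          motion_mask: boolean array (T,) of per-frame "motion" predictions from M.
          Each boundary is moved to the closest switching frame of M."""
          switches = np.flatnonzero(np.diff(motion_mask.astype(int)) != 0) + 1
          if len(switches) == 0:
              return gestures
          def snap(f):
              return int(switches[np.argmin(np.abs(switches - f))])
          return [(snap(s), snap(e), label) for s, e, label in gestures]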


  • Experiments
    • Dataset and evaluation
    • Multi-modal prediction results
    • Comparison of training techniques
      ▫ Pre-training, Dropout (applied to the input), initialization, ModDrop
      ▫ Datasets: MNIST, ChaLearn 2014


  • Dataset and Evaluation
    • ChaLearn 2014 challenge dataset
      ▫ ~14K labeled gesture clips
      ▫ 20 gesture categories
    • Dataset augmented with audio (vocal phrases)
    • Evaluation metric
      ▫ Jaccard index for sequence s and gesture n, averaged over all s and n (sketched below)

      ▫ For audio, clip-based accuracy is also used: a clip counts as correct if 20% of it is predicted correctly
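    A minimal sketch of the frame-level Jaccard index for one sequence-gesture pair; averaging over all sequences and gestures gives the challenge score:

      def jaccard_index(gt_frames, pred_frames):
          """Jaccard index J_{s,n} for one sequence s and gesture n:
          |intersection| / |union| of ground-truth and predicted frame sets."""
          gt, pred = set(gt_frames), set(pred_frames)
          union = gt | pred
          return len(gt & pred) / len(union) if union else 0.0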


  • Multi-Modal Prediction Results
    • Multi-modal, multi-scale results (Jaccard index)


    1. Except for audio, a larger sampling step yields better results.
    2. Although the audio modality alone performs the worst (due to an alignment issue), it still boosts performance when combined with the pose and video modalities.

  • Multi-Modal Prediction Results
    • Challenge results (without audio, Jaccard index)


    1. Gesture localization corrects predictions.
    2. The performance can be further boosted when combining with the baseline.

    [49] N. Neverova, C. Wolf, G. Taylor, and F. Nebout, "Multi-scale deep learning for gesture detection and localization," in ECCVW, 2014.

  • Training Techniques Comparison
    • MNIST (10K test images, 10 classes)
      ▫ Multi-modal setting


  • Training Techniques Comparison
    • MNIST (10K test images, 10 classes)


    1. Training from scratch increases the error.
    2. Dropout and pre-training are useful.
    4. ModDrop gives no lift here.
    5. The model is lightweight.

    [55] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," arXiv:1207.0580, 2012.

  • Training Techniques Comparison
    • MNIST (10K test images, 10 classes)
      ▫ Effect of ModDrop under occlusion and noise


    Pre-training and initialization are employed.

    In both cases, ModDrop makes the model more robust.

  • Training Techniques Comparison
    • ChaLearn 2014 + audio


    All four training techniques are helpful.

  • Training Techniques Comparison
    • ChaLearn + audio: effect of ModDrop


    When the test input is manipulated, ModDrop is robust when combined with Dropout.

    Pre-training and initialization are employed.

  • Summary
    • Multi-modal, multi-scale deep framework
      ▫ Initialization for multi-modal fusion
      ▫ ModDrop
    • Showed efficacy on ChaLearn and MNIST


  • References
    • [1] H. Wang, A. Kläser, C. Schmid, and C. L. Liu, "Dense trajectories and motion boundary descriptors for action recognition," Int. J. Comput. Vis., vol. 103, pp. 60-79, 2013.
    • [2] J. Sung, C. Ponce, B. Selman, and A. Saxena, "Unstructured human activity detection from RGBD images," in Proc. IEEE Int. Conf. Robot. Autom., 2012, pp. 842-849.
    • [3] S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 1, pp. 221-231, Jan. 2013.
    • [4] S. E. Kahou et al., "Combining modality specific deep neural networks for emotion recognition in video," in Proc. 15th ACM Int. Conf. Multimodal Interaction, 2013, pp. 543-550.
    • [5] F. Bach, G. Lanckriet, and M. Jordan, "Multiple kernel learning, conic duality, and the SMO algorithm," in Proc. 21st Int. Conf. Mach. Learning, 2004, p. 6.
    • [6] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, "Multimodal deep learning," in Proc. 29th Int. Conf. Mach. Learning, 2011, pp. 689-696.
    • [7] N. Neverova, "Deep learning for human motion analysis," doctoral dissertation, 2016.
    • [8] P. Baldi and P. Sadowski, "The dropout learning algorithm," Artificial Intelligence, vol. 210, pp. 78-122, 2014.
    • [9] S. Wang and C. Manning, "Fast dropout training," in Proc. 30th Int. Conf. Mach. Learning, 2013, pp. 118-126.
    • [10] E. L. Lehmann, Elements of Large-Sample Theory, Springer Science & Business Media, 1999.


  • Thanks for listening!
