
DOI: 10.4018/IJCVIP.2019040102

International Journal of Computer Vision and Image Processing, Volume 9 • Issue 2 • April-June 2019

Copyright © 2019, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.


Accelerating Deep Action Recognition Networks for Real-Time Applications

David Ivorra-Piqueres, University of Alicante, Alicante, Spain

John Alejandro Castro Vargas, University of Alicante, Alicante, Spain

Pablo Martinez-Gonzalez, University of Alicante, Alicante, Spain

ABSTRACT

In this work, the authors propose several techniques for accelerating a modern action recognition pipeline. This article reviews several recent and popular action recognition works and selects two of them as part of the tools used for achieving the aforementioned acceleration. Specifically, temporal segment networks (TSN), a convolutional neural network (CNN) framework that makes use of a small number of video frames for obtaining robust predictions, which won first place in the 2016 ActivityNet challenge, and MotionNet, a convolutional-transposed CNN that is capable of inferring optical flow from RGB frames. Together with the last proposal, this article integrates new software for decoding videos that takes advantage of NVIDIA GPUs. This article shows a proof of concept for this approach by training the RGB stream of the TSN network on videos loaded with NVIDIA Video Loader (NVVL) of a subset of daily actions from the University of Central Florida 101 dataset.

KEYWORDS

Action Recognition, Action Understanding, Deep Learning, GPU Acceleration, Machine Learning, Optical Flow, Real-Time, Recurrent Networks, Video Decoding

1. INTRODUCTION

Although in recent years the task of activity recognition has witnessed numerous breakthroughs thanks to the development of new methodologies and the rebirth of deep learning techniques, the natural course of events has not always been like this. For many years, despite being tackled from multiple perspectives, the problem of constructing a system capable of identifying which activity is being performed in a given scene was barely solved. In the state of the art we can find different approaches based on handcrafted traditional methods and machine learning:

• Handcrafted features dominance. The first approaches were motivated by fundamental algorithms such as optical flow (Horn and Schunck, 1981), the Canny edge detector (Canny, 1986), the Hidden Markov Model (HMM) (Rabiner and Juang, 1986) or Dynamic Time Warping (DTW) (Bellman and Kalaba, 1959). Several of these methods were reviewed in (Gavrila, 1999) for hand and whole-body movements, which can be used to obtain relevant information for the recognition of activities.

• Machine learning approaches. More modern methods use optical flow (Efros et al., 2003) to obtain temporal features over the sequences, in addition to using machine learning algorithms such as the Support Vector Machine (SVM) (Schüldt, Laptev and Caputo, 2004) to classify spatiotemporal features.

• Deep learning. CNNs allow robust visual features to be obtained on 2D images (Chéron and Laptev, 2015), but more specifically their variant adapted to work with data defined in three dimensions offers the ability to obtain spatial and temporal features when working with sequences of images. In this way, in addition to the two spatial dimensions (height and width), we have a third dimension defined by time (frames) (Ji et al., 2013) (Simonyan and Zisserman, 2014).

2. APPROACH

In this section we review the most modern action recognition works carried out in the past three years.

Online Inverse Reinforcement Learning (Rhinehart and Kitani, 2017) is a novel method for predicting future behaviors by modeling the interactions between the subject, objects, and their environment through a first-person mounted camera. The system makes use of online inverse reinforcement learning, making it possible to continually discover new long-term goals and relationships. Also, with an approach similar to that of hybrid Siamese networks, it has been shown (Mahmud, Hasan and Roy-Chowdhury, 2017) that it is possible to simultaneously predict future activity labels and their starting times. It does so by taking advantage of features of previously seen activities and of the objects currently present in the scene.

Thanks to the use of Single Shot multi-box Detector (SSD) CNNs, the system proposed in (Singh et al., 2017) is capable of predicting both action labels and their corresponding bounding boxes in real time (28 FPS). Moreover, it can detect more than one action at the same time. All of this is accomplished by using RGB image features combined with optical flow ones (with a decrease in optical flow quality and global accuracy) extracted in real time for the creation of multiple action tubes.

In (Kong, Tao and Fu, 2017), for predicting action class labels before the action finishes, the authors make use of features extracted from fully observed videos processed at train time to fill in the missing information in the incomplete videos to be predicted. Furthermore, thanks to this approach their model obtains a great speedup when compared to similar methods.

A model that is capable of performing visual forecasting at different abstraction levels is presented in (Zeng et al., 2017). For example, the same model can be trained for future frame generation as well as for action anticipation. This is accomplished by following an inverse reinforcement learning approach. Also, the model is forced to imitate natural visual sequences at the pixel level.

The model developed in (Lea et al., 2017) is capable of predicting future activity labels on RGB-D videos in real time. This is accomplished by making use of soft regression to jointly learn both the predictor model and the soft labels. Moreover, real-time performance (around 40 FPS) is obtained by including a novel RGB-D feature named Local Accumulative Frame Feature (LAFF). A TCN encoder-decoder system is built for performing the mentioned tasks; after training, it is able to surpass comparable approaches. Furthermore, the system presents better performance than bidirectional long short-term memory (Bi-LSTM) networks.

In (Buch et al., 2017), a system is presented that produces temporal action proposals for a video with only one forward pass. Thus, there is no need to create overlapping temporal sliding windows. Moreover, the system can work with long untrimmed videos of arbitrary length in a continuous fashion. Finally, by combining the system with action classifiers, temporal action detection performance is increased.

A new convolutional model is presented in (Carreira and Zisserman, 2017), known as the Two-Stream Inflated 3D convolutional neural network (I3D), which is used as a spatio-temporal feature extractor. The authors then pre-train I3D-based models on the Kinetics dataset, showing that with this approach action classification performance on well-known datasets is noticeably increased.


In (Feichtenhofer, Pinz and Wildes, 2017), a fully space-time convolutional two-stream network (named STResNet) is proposed for the task of action recognition in videos. The first stream is fed with RGB data and the second with optical flow features. The main particularity of this model is the interconnections between both streams. Moreover, for learning long-term relationships, identity mapping kernels are injected throughout the network. All of this allows the network to predict in a single forward pass.

New recurrent neural network approaches are presented in (Dave, Russakovsky and Ramanan, 2017), which are used for solving the problem of action detection in videos with satisfactory results. At its core, the model: (1) focuses on changes between frames, (2) predicts the future, and (3) makes corrections by observing what truly happens next.

The authors of (Sigurdsson et al., 2017) propose a model that is capable of reasoning in detail about aspects of an activity, i.e., for each frame the model can predict the current activity, its action and object, the scene, and the temporal progress. This is accomplished by making use of Conditional Random Fields (CRFs) fed by CNN feature extractors. Moreover, to be able to train this system in an end-to-end manner, an asynchronous stochastic inference algorithm is developed.

In (Wang et al., 2017) the authors propose a CNN framework for the recognition of actions in videos, both trimmed and untrimmed, and in (Aliakbarian et al., 2017) a multi-stage long short-term memory (LSTM) architecture combined with a novel loss function is proposed, capable of predicting action class labels in videos even when only the first frames of the sequence have been shown. The model takes advantage of action-aware and context-aware features to succeed in this task.

2.1. TSN Framework

Temporal Segment Networks (TSN) (Wang et al., 2017) is a CNN framework for the recognition of actions in videos, both trimmed and untrimmed. Along with it, a series of guidelines for properly initializing and operating such deep models for this task is proposed. The framework aims to tackle four common limitations of convolutional neural networks on videos. First, the difficulty of using long-range frame sequences, due to high computational and memory costs, which can lead to missing important temporal information. Second, most systems focus on trimmed videos instead of untrimmed ones (where several actions may happen in a video); adapting to the latter would require properly localizing actions and avoiding background (irrelevant) video parts. Third, while deep models grow more complex, many datasets are still small in number of samples and diversity, lacking enough data to properly train such models and avoid overfitting. Fourth, the time consumed by optical flow extraction becomes a bottleneck, both for using large-scale datasets and for deploying the model in real-time applications. Figure 1 shows a schematic view of the network.
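As an illustration of the segmental design just described, the following is a minimal PyTorch sketch that samples one snippet per segment, scores each with a shared stream CNN, and averages the per-snippet scores before the softmax. The function name, the single-frame snippets, and `stream` (any 2D CNN returning class scores) are simplifying assumptions, not the authors' exact implementation.

```python
import random
import torch

def tsn_forward(video_frames, stream, num_segments=3):
    """Split the video into equal segments, sample one snippet per
    segment, score each with the shared stream CNN, aggregate the
    per-snippet class scores (average consensus), and apply softmax."""
    n = video_frames.shape[0]                      # (n, C, H, W)
    bounds = torch.linspace(0, n, num_segments + 1).long().tolist()
    idx = [random.randrange(bounds[i], bounds[i + 1])
           for i in range(num_segments)]           # one index per segment
    snippet_scores = torch.stack(
        [stream(video_frames[i].unsqueeze(0)) for i in idx])
    consensus = snippet_scores.mean(dim=0)         # segmental consensus
    return torch.softmax(consensus, dim=-1)        # video-level class probs
```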

2.2. GPU Video Decoding

Since the beginning of the modern deep learning era, data storage and loading times have been a bottleneck in the pipeline. Although we have recently witnessed great speedups thanks to new hardware technologies like SSDs for storage, or data transfer interconnects (between CPU and GPU, and vice versa) such as NVLINK, the issue persists.

Many of the research areas where this problem is most acute are those which work with videos as the main dataset source. These include predictive learning, video understanding, question answering, activity recognition, and super-resolution networks, among many others.

The main approach to tackling this problem in those areas is to first extract all the frames of each video in the dataset, for example by using FFmpeg, and save them in a high-quality image format, rather than one with lossy compression and artifact generation, in order to properly train the network. This comes with an increasing need for storage space, since the more information we are willing to keep, the larger our converted image dataset will be.
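As a reference point, here is a minimal sketch of that conventional preprocessing step, invoking the FFmpeg command-line tool from Python; the file paths are hypothetical.

```python
import subprocess
from pathlib import Path

def extract_frames(video_path: str, out_dir: str) -> None:
    """Dump every frame of a video as high-quality JPEGs with FFmpeg."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", video_path,
         "-qscale:v", "2",  # near-lossless JPEG quality
         str(Path(out_dir) / "img_%05d.jpg")],
        check=True)

extract_frames("UCF101/v_Diving_g01_c01.avi", "frames/v_Diving_g01_c01")
```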


In Figure 2 we can see the effects of storing the University of Central Florida 101 (UCF101) dataset (Soomro, Zamir and Shah, 2012), composed of only 13320 videos, in different formats. In the JPEG case (images), it occupies 63 GiB, while in AVI format (video) it occupies 9.25 times less, 6.8 GiB. If it is transformed to the MP4 format needed by the NVIDIA Video Loader (NVVL) (Casper, Barker and Catanzaro, 2018), with the corresponding number of frames, it occupies 14.2 GiB, still 4.44 times less. At a finer-grained level, per frame, the storage difference remains large.

To alleviate this problem, a useful solution is to directly load video files into memory, decode the necessary frames, prepare them, and finally feed them to the network. APIs that can manage the first two steps already exist: the FFmpeg libraries themselves, and higher-abstraction modules like PyAV or PIMS, which both load data onto the CPU. The (beta) Hwang project, on the other hand, also supports NVIDIA GPUs through the use of their dedicated hardware decoder units. Furthermore, loaders designed with machine learning tasks in mind, which cover all the mentioned steps, have recently been developed. Two are currently known: Lintel (Duke, 2018) and NVVL (Casper, Barker and Catanzaro, 2018). The first focuses on CPU loading (using FFmpeg as its backend), while the second targets GPU devices. Indeed, although written in C++, NVVL offers off-the-shelf PyTorch modules (dataset and loader). Moreover, another wrapper for CuPy arrays has been created.

Figure 1. Representation of the TSN framework. First, a snippet is extracted from each of a fixed number of segments that equally divide the video. Then, features such as optical flow or RGB-diff (top and bottom images of the second column) are extracted. After passing through the corresponding stream, an aggregation function joins the individual snippet class probabilities. Then, softmax is applied to obtain the final video action class.

Figure 2. Storage comparison between frames and video formats for the UCF101 dataset


Regarding performance, NVVL reduces I/O processing times by a large margin, as can be appreciated in Figure 3. More benchmarks that take into account memory usage and CPU load can be found in the blog post, while an even more detailed evaluation is located on GitHub¹. Regarding the data, loading behaves like a sliding window of stride one, where frame sequences of a previously fixed length are successively loaded and returned as a single tensor. We can also apply different transformations to these sequences: data type (float, half, or byte), width and height resizing and scaling, random cropping and flipping, normalization, color space (RGB, or Y: luminance, Cb: blue chrominance, Cr: red chrominance (YCbCr)), and frame index mapping. For performance, flexibility, and completeness reasons, we decided to use NVVL as our main tool to accelerate the TSN framework.
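A minimal sketch of how these pieces fit together, following the usage shown in the NVVL repository; the exact class and argument names (`VideoDataset`, `VideoLoader`, `ProcessDesc` and its fields) are taken from its README and may differ between versions, so treat them as assumptions.

```python
import nvvl

# MP4 files prepared as described in Section 3.3 (H.264, yuv420p).
video_files = ["videos/v_Diving_g01_c01.mp4",
               "videos/v_Typing_g02_c03.mp4"]

# ProcessDesc configures the per-sequence transformations listed above:
# output dtype, resize/crop, flipping, normalization, and color space.
processing = {"input": nvvl.ProcessDesc(type="float",
                                        width=224, height=224,
                                        random_crop=True,
                                        random_flip=True,
                                        normalized=True,
                                        color_space="RGB")}

dataset = nvvl.VideoDataset(video_files, sequence_length=7,
                            processing=processing)
loader = nvvl.VideoLoader(dataset, batch_size=8)

for batch in loader:
    frames = batch["input"]  # CUDA tensor, ready to feed the network
```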

2.3. Hidden Two-Stream Convolutional Networks for Action Recognition

Another angle on real-time action recognition can be found in (Zhu et al., 2017), where the use of a convolutional network for automatically computing optical flow is presented.

In more detail, in a first phase a CNN denoted MotionNet is trained in an unsupervised manner for the task of optical flow estimation. After obtaining acceptable results, comparable to the best traditional methods, the network is attached to a conventional CNN as the temporal stream of the whole model, with the spatial stream being similar in architecture. Then, the network (including MotionNet) is trained on the task of action recognition from frame sequences. The approach enables the optical flow generator to adapt to the characteristics of the task and thus find a suitable motion representation.

3. EXPERIMENTATION

In this project we have experimented with the approaches discussed above on the UCF101 dataset.

Figure 3. Average loading time (milliseconds) for 32-bit floating point PyTorch tensors to become available in the GPU. The experiment was run on an NVIDIA V100 GPU over one epoch with batches of size 8. Figure extracted from (Casper, Barker and Catanzaro, 2018).


3.1. Dataset (Soomro, Zamir and Shah, 2012)

Given the limited number of RGB action datasets that included realistic scenes (without actors or prepared environments) and a wide range of classes until 2012, the authors of this paper proposed a new large-scale dataset of user-uploaded (YouTube) videos. These present much more diverse challenges than previous datasets, since recordings can contain different lighting configurations, image quality degradation, clutter, camera movement, and occluded scenes.

Regarding the size of the dataset, 13320 videos are divided into 101 classes that cover five action groups: Human-Human Interaction, Sports, Playing Musical Instruments, Human-Object Interaction, and Body-Motion Only. The actions contained in the Human-Object Interaction and Body-Motion Only groups can be observed in Figure 4².

Furthermore, this dataset marked a milestone in what concerns large-scale action recognition datasets. It established a well-known testbed for improvement and benchmarking. Moreover, deep learning competitions were established around it, such as the different modalities of the THUMOS Challenge, which ran for three years in a row. After that, other large-scale datasets appeared, expanding on the characteristics of UCF101. The UCF101 dataset is thus also notable for marking the start of an ever-growing number of diverse large-scale action recognition datasets.

3.2. GPU Video Decoding Experiments

In order to test our GPU video decoding pipeline, we can compare the difference between the original frames and the ones loaded through NVVL. For this task, we use the SSIM index between two pictures, commonly used in the video industry for measuring the perceptible visual difference between frames of an original and a downsampled video. It ranges from 0 to 1, where 1 is given for two identical pictures and 0 for two completely different ones. For example, given the two frames obtained from the UCF101 dataset (Diving class) that can be observed in Figure 5, we can notice a green band on the right edge of the NVVL-loaded image.

Apart from this, we cannot perceive any other substantial degradation in quality. Indeed, the SSIM obtained is 0.992, indicating that this artifact is probably due to a bug rather than a low-quality video decoder. To confirm this, we can compute the SSIM heat map in order to locate other possibly missed artifacts (Figure 6).
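A minimal sketch of this comparison using scikit-image (the text does not name the tool actually used); `structural_similarity` with `full=True` returns both the global score and the per-pixel map behind a heat map like the one in Figure 6. The `channel_axis` argument assumes scikit-image >= 0.19.

```python
import numpy as np
from skimage.metrics import structural_similarity

def compare_frames(original: np.ndarray, decoded: np.ndarray):
    """Return the global SSIM score and the per-pixel SSIM map
    for two H x W x 3 RGB frames."""
    score, ssim_map = structural_similarity(
        original, decoded,
        channel_axis=2,  # color channels on the last axis
        full=True)       # also return the local SSIM map
    return score, ssim_map

# score ~= 0.992 for the Figure 5 pair; dark regions of ssim_map
# point at localized artifacts such as the green band.
```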

Thus, we can assume that there will be no harm in incorporating this tool into a neural network training pipeline.

Figure 4. Classes for the Human-Object Interaction (blue) and Body-Motion Only (red) action groups from the UCF101 dataset. Figure extracted from (Soomro, Zamir and Shah, 2012).


Now, we should measure the speedup we can obtain by replacing the image loading system of the TSN framework with an NVVL pipeline. For this, after adapting the frame-index generation functions and integrating the video loader, we proceed as follows:

1. Obtain a list of videos, and get the total number of videos and the mean number of frames per video.
2. Extract all the frames from the videos, and also convert the videos into the format required by NVVL.
3. Select the number of frames per video that are going to be loaded. For NVVL, all the frames have to be loaded.
4. Measure how much time it takes to extract the given number of frames (into the GPU) on each occasion. For NVVL this only needs to be done once.
5. Obtain the mean times and trend for the previous process and compare the results.

Figure 5. Original frame (left) and NVVL obtained frame (right). The frames pertain to a sample of the Diving class in the UCF101 dataset.

Figure 6. Heat map of the above frames; the lighter the color, the closer each pixel is to the original frame



Step1Wewillusethefirst450videosoftheUCFsplit-1trainlistobtainedwiththedatatoolsprovided

bytheTSNframework.Thislistisformattedwitharowforeachvideo.Ineachrow,thepathtothevideo,thenumberofframesthevideohasanditsclassindex.Duetothis,thetotalnumberofframescanbeobtainedjustbysummingthesecondelementofeachrowoverthewholelist.Theresultingnumberis87501.So,themeannumberofframesinavideoisapproximately194.
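A short sketch of that computation; the list file name is hypothetical, and the format is the three-column one just described.

```python
def split_list_stats(list_path: str, limit: int = 450):
    """Sum the frame counts ("path num_frames class_idx" rows) of the
    first `limit` videos and derive the mean frames per video."""
    with open(list_path) as f:
        rows = [line.split() for line in f][:limit]
    total = sum(int(r[1]) for r in rows)
    return len(rows), total, total / len(rows)

n, total, mean = split_list_stats("ucf101_train_split_1.txt")
print(n, total, round(mean))  # expected: 450 87501 194
```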

Step2Forcompletingthisstep,wecansimplyfollowtheinstructionsandcommandsprovidedinthe

repositoriesofeachproject.Wehavetotakeintoaccountthattheextractionprocesscantakeaquitegreateramountoftimethanthevideoconvertingprocess.

Step3Inthiscase,wearegoingtoloadevennumberofframes,startingfrom3andfinishingin25,a

totalof12differentinstances.ThishasbeenselectedsincetheauthorsofTSNtestthemodelwith3,5,7,and9framespervideo.

Step4Forobtaininganaccuratemeasurement,wearegoingtorepeateachexecution29times.For

computingthetimewehaveusedPythontime.timefunction.Also,inordertofreealltheresourcesineachrun,wearegoingtoloopinsideabashscriptinsteadofinsidethePythonexecutingscriptitself,thushavingtheprocesskilledautomatically.
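The same release-everything-per-run effect can be sketched from Python by spawning a fresh interpreter per repetition; `benchmark_one.py` is a hypothetical script that loads the requested number of frames and prints the elapsed `time.time` delta.

```python
import subprocess
import sys

# One fresh process per run, so the OS reclaims all resources when it
# exits (the role played by the bash loop in the text).
for n_frames in range(3, 26, 2):   # 3, 5, ..., 25 frames per video
    for _ in range(29):            # 29 repetitions per configuration
        subprocess.run(
            [sys.executable, "benchmark_one.py", str(n_frames)],
            check=True)
```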

Step5Inthisstepwecomputedthemeanvaluesforeachnumberofframes.Thetimetakenforloading

allthevideoswithNVVLisapproximately24.18seconds.Ontheotherhand,wecanplottheresultsobtainedfromloadingsoleframes:

We can notice that the trend follows a linear growth with respect to the number of frames loaded. Since we computed the equation defining the trend line (shown in the lower-right part of Figure 7), we can obtain a more precise approximation of the speedup achieved when using NVVL. Since the number of frames loaded with NVVL equals the mean number of frames obtained in Step 1, we just need to substitute it into the equation (as the X variable), obtaining a mean value of approximately 458.74 seconds, or 7.65 minutes. We have thus achieved an 18.97 times speedup in loading time when using NVVL.
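The speedup figure follows directly from the two measurements (the trend-line coefficients themselves are only legible in Figure 7, so they are not reproduced here):

```python
frame_loading_time = 458.74  # trend-line estimate at x = 194 frames (s)
nvvl_loading_time = 24.18    # measured NVVL time for all videos (s)
print(f"{frame_loading_time / nvvl_loading_time:.2f}x")  # -> 18.97x
```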

3.3. Training RGB TSN+HTS with NVVL

So far we have shown how useful incorporating NVVL into a video-consuming deep learning pipeline can be: it allows us to reduce both storage and data transfer costs while suffering no degradation in image quality. What remains is to incorporate this tool into a common action recognition scenario, where we train and test a network to categorize human actions.

That network is going to be TSN, since it has demonstrated superior performance in the task at hand. Moreover, we propose to make use of the converted HTS Caffe model and weights, in order to avoid pre-computing the optical flow and to be able to use NVVL in that stream as well, targeting the resulting pipeline at real-time applications. Due to the dataset we are going to use, the memory limitations detailed below, and time constraints, we focus the following experiment only on the RGB stream.

Before starting, we need to prepare the data in a format that is compatible with NVVL. As stated in the GitHub repository¹, we need videos with either the H.264 or HEVC (H.265) codec and the yuv420p pixel format; they can be in any container that the FFmpeg parser supports.


Moreover, we have to take into account the number of keyframes each video will have. That is, an encoded video stream only stores a subset of all the frames we see in a video: these are the keyframes. At decoding time, the rest of the frames are reconstructed algorithmically from the keyframes. For this reason, when loading sequences that can start and end at any frame (as we can do with NVVL), the system has to seek to the nearest keyframe, which can be far before or after the starting frame. This can result in underperforming execution, so when converting the videos we have to indicate the keyframe frequency we want (Figure 8).

The developers of the video loader suggest setting one keyframe per interval corresponding to the length of the sequences we are going to load. For example, if we are going to load sequences of length 7, then every 7 frames there will be a keyframe. They also provide the required commands to carry out this conversion with FFmpeg.
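A sketch of such a conversion invoked from Python; the exact command suggested in the NVVL repository may use additional options, and the file names are hypothetical. FFmpeg's `-g` option sets the keyframe (GOP) interval.

```python
import subprocess

def convert_for_nvvl(src: str, dst: str, keyint: int) -> None:
    """Re-encode a video into what NVVL expects (H.264, yuv420p)
    with one keyframe every `keyint` frames."""
    subprocess.run(
        ["ffmpeg", "-i", src,
         "-c:v", "libx264", "-pix_fmt", "yuv420p",
         "-g", str(keyint),           # keyframe (GOP) interval
         "-keyint_min", str(keyint),  # no closer keyframes than this
         dst],
        check=True)

# For sequences of length 7, place a keyframe every 7 frames:
convert_for_nvvl("v_Typing_g01_c01.avi", "v_Typing_g01_c01.mp4", 7)
```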

In our case, we are going to set every frame in the video to be a keyframe. This is because the PyTorch wrapper (the C++ API seems more flexible) is currently intended for loading multiple frame sequences per video with a sliding-window approach of a fixed length. Although this length could equal the number of frames in the video, thus loading only one sequence per video, that would only work if all the videos had the same length, since this parameter, the sequence length, is global for the whole dataset.

For iterating over the dataset, we use the data loader provided by the NVVL PyTorch wrapper, which loads a batch of frame sequences in each iteration. Since each frame is now a sequence of length one, we need to set the batch size to one as well. In this way we can easily know when the loader has fully output a video, add it to a list, and, when we have enough videos, group them into a batch of the desired size to feed the network. To accomplish this we also need to set the loader's shuffle option to false.
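A sketch of that regrouping logic under the stated settings (sequence length 1, batch size 1, no shuffling); `loader`, the per-video `frame_counts`, and `process_batch` are assumptions from the surrounding setup, not part of the NVVL API.

```python
import torch

BATCH_SIZE = 4
videos, current, video_idx = [], [], 0

for sample in loader:                        # one frame per iteration
    current.append(sample["input"].squeeze(0))
    if len(current) == frame_counts[video_idx]:
        videos.append(torch.stack(current))  # one complete video
        current, video_idx = [], video_idx + 1
    if len(videos) == BATCH_SIZE:
        process_batch(videos)                # hypothetical training step
        videos = []
```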

Figure 7. Mean loading time in seconds for each number of frames tested (blue), with the trend line fitted to the obtained data (red). The Y axis represents the loading time in seconds, while the X axis shows the number of frames used.

Although we are ready to train our network, an impediment arises at the time of writing this work. Whether because the videos have not been properly converted or because of a code issue, the data loader seems to get silently stuck when loading some videos. One workaround is to create a loader for each video instead of one for the whole dataset.

This works, but then GPU memory is not properly freed, limiting the size of our dataset to the space available on the graphics card at that moment. On our machine (Asimov), this amounts to around 240 videos for training and 160 for validation (only the Titan card supports NVVL).

Following the same motivation proposed at the beginning of the document, we select daily actions for the reduced dataset we can work with. Specifically, it is composed of eight classes from the Human-Object Interaction group of the UCF101 dataset: Apply Eye Makeup, Apply Lipstick, Blow Dry Hair, Brushing Teeth, Cutting In Kitchen, Mopping Floor, Shaving Beard, and Typing. The training set contains 30 videos for each action, while the validation set has 20.

Regarding the training hyper-parameters, we use the TSN defaults, with the only exceptions being the batch size and the number of epochs. The former is set to 4 due to the limited memory; for the latter, we perform 40 epochs, which is enough for the model to converge on this dataset. As metrics, we keep track of the loss and the top-1 and top-5 accuracies for both the training and validation sets.

Figure 8. Representation of how keyframes can be evenly inserted into a video stream


Once all the epochs have been completed, training finishes in a familiar situation: 100% top-1 (and top-5) accuracy and zero loss. Clearly, our network has overfitted. This is due to the scarce amount of data and its limited variability. Moreover, such deep networks (BN-Inception) are prone to overfitting, since they have more flexibility (a larger number of parameters) for adjusting to the data they consume during training. However, we obtain validation accuracies of 76.25% and 98.125% for top-1 and top-5 respectively, with a loss of approximately 1.52. Taking into account the data we are working with, these results are quite promising compared to what happened in the training phase.

We can get more insight by looking at how the training and validation metrics have evolved, shown in Figure 9:

Here we can note two facts. First, the top-5 accuracy converges much faster (at around iteration 200) than the top-1 accuracy (around iteration 800). This is to be expected, since it takes longer to learn the exact label of a video than to place it among the five most likely candidates.

Secondly, we see that the unsmoothed curve (shaded red) bounces between higher and lower values (of accuracy and loss) during the first iterations. This effect can be seen even better in the validation curves (Figure 10):

This happens as a consequence of the small batch size we set earlier. The smaller the batch size, the more weight updates we perform. If the batch size is too small, we may encounter the following:

• Instability: The frequent updates cause the metrics to wander, going continually up and down.

• Less meaningful updates: With fewer samples, each update contains less information about the error (negative gradient) direction, thus needing a greater number of epochs to converge to the same accuracy as a bigger batch size would. This can be summarized as longer training times.

• Getting stuck in a local minimum: Also known as a plateau, and commonly induced by the previous points, a small batch size can cause the network to get stuck in a poor local minimum of the loss function, yielding insufficient performance.

Figure 9. Training curves for 40 epochs, 60 iterations per epoch

As intuition, we can take a look at Figure 11, where on the left the loss curves for three batch-size regimes (descending to the minimum) are plotted. The blue one represents a batch the same size as the dataset, thus making only one update per epoch: a smoother curve with a much less noisy evolution. Although it seems the best approach, the catch is the time and space each weight update takes; since we have a large number of samples, we have to compute a vast number of operations per update. Moreover, it is usually impracticable for a complete dataset to fit into a modern GPU's memory.

The purple curve is for the case where we perform one-sample updates, which reflects, in the extreme, what happened during the training of our network. Finally, the green curve shows the everyday situation of most deep learning training runs, where the batch size is balanced against the number of updates per epoch. The updates are frequent, but not frequent enough to trigger divergence, while a reasonable amount of time is spent estimating the error.

To better visualize where the network guesses right or wrong, we can use the training and validation confusion matrices, where each cell shows the percentage of times the class in the cell's row is predicted as the class in the cell's column. For example, in the validation matrix, 35% of the time the network sees the class Brushing Teeth, it classifies it as Shaving Beard. Moreover, note that by taking the trace of such a confusion matrix (the sum over the diagonal) and dividing by the number of classes, we retrieve the final accuracy (Figure 12).
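A minimal numerical check of that relationship for a row-normalized confusion matrix (class counts are balanced here, so the mean of the diagonal equals the overall top-1 accuracy):

```python
import numpy as np

def accuracy_from_confusion(cm: np.ndarray) -> float:
    """Mean per-class true positive rate: trace / number of classes."""
    return np.trace(cm) / cm.shape[0]

cm = np.eye(8)                      # the overfitted training matrix:
print(accuracy_from_confusion(cm))  # all diagonal cells at 1.0 -> 1.0
```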

We can easily see what we determined before: the training set is overfitted, since all diagonal cells of the confusion matrix report a 100% value (normalized between 0 and 1). On the other hand, analyzing the validation matrix, we can see that the network mostly fails when the classes are very similar. For example:

• Apply Eye Makeup is confused 15% of the time with Apply Lipstick. Since both use some kind of hand-held stick and cover vertically adjacent zones of the face, it is logical to think that they are harder to differentiate.

• Apply Eye Makeup and Shaving Beard follow a similar error pattern, since both actions involve hand movement over the mouth area and arm movement around the whole face.

In other cases the contrary happens, when the action is easily differentiable from the others. This mostly occurs with two actions: Mopping Floor, which usually happens in a room, and Cutting In Kitchen, where the camera focuses on the knife and the cutting-board area (Figure 13).

Figure 10. Validation curves for 40 epochs, 60 iterations per epoch


4. CONCLUSION

In this work, we have focused our attention on different ways of accelerating the training and inference processes of a modern video-based action recognition pipeline. First, the use of the TSN framework, since it requires a small amount of data as input. Second, the use of MotionNet from the HTS work, in order to achieve real-time optical flow computation and adapt its representation for action recognition. Third, the use of the recent NVVL to reduce the cost of I/O operations, save storage space, and speed up the whole pipeline by decoding videos directly on the GPU.

Figure 11. Effects of batch sizes when training

Figure 12. Confusion matrices for the proposed dataset


Figure 13. Class labels and network predictions: the first line is the correct label, the second the predicted one (green if correct, red if not)


REFERENCES

Aliakbarian, M. S., Saleh, F. S., Salzmann, M., Fernando, B., Petersson, L., & Andersson, L. (2017, October). Encouraging LSTMs to anticipate actions very early. In IEEE International Conference on Computer Vision (ICCV). 10.1109/TPAMI.2018.2868668

Bellman, R., & Kalaba, R. (1959). On adaptive control processes. I.R.E. Transactions on Automatic Control, 4(2), 1–9. doi:10.1109/TAC.1959.1104847

Buch, S., Escorcia, V., Shen, C., Ghanem, B., & Niebles, J. C. (2017, July). SST: Single-stream temporal action proposals. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 6373-6382). IEEE. 10.1109/CVPR.2017.675

Canny, J. (1986). A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-8(6), 679–698. doi:10.1109/TPAMI.1986.4767851

Carreira, J., & Zisserman, A. (2017, July). Quo vadis, action recognition? A new model and the Kinetics dataset. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 4724-4733). IEEE.

Casper, J., Barker, J., & Catanzaro, B. (2018). NVVL: NVIDIA Video Loader.

Chéron, G., Laptev, I., & Schmid, C. (2015). P-CNN: Pose-based CNN features for action recognition. In Proceedings of the IEEE International Conference on Computer Vision (pp. 3218-3226). 10.1109/ICCV.2015.368

Dave, A., Russakovsky, O., & Ramanan, D. (2017, April). Predictive-corrective networks for action detection. In Proceedings of the Computer Vision and Pattern Recognition.

Duke, B. (2018). Lintel: Python video decoding.

Efros, A. A., Berg, A. C., Mori, G., & Malik, J. (2003, October). Recognizing action at a distance. In Proceedings of the IEEE International Conference on Computer Vision (p. 726). IEEE.

Feichtenhofer, C., Pinz, A., & Wildes, R. P. (2017, July). Spatiotemporal multiplier networks for video action recognition. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 7445-7454). IEEE. 10.1109/CVPR.2017.787

Gavrila, D. M. (1999). The visual analysis of human movement: A survey. Computer Vision and Image Understanding, 73(1), 82–98. doi:10.1006/cviu.1998.0716

Horn, B. K., & Schunck, B. G. (1981). Determining optical flow. Artificial Intelligence, 17(1-3), 185–203. doi:10.1016/0004-3702(81)90024-2

Ji, S., Xu, W., Yang, M., & Yu, K. (2013). 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1), 221–231. doi:10.1109/TPAMI.2012.59

Kong, Y., Tao, Z., & Fu, Y. (2017, July). Deep sequential context networks for action prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1473-1481). 10.1109/CVPR.2017.390

Lea, C., Flynn, M. D., Vidal, R., Reiter, A., & Hager, G. D. (2017, July). Temporal convolutional networks for action segmentation and detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 1003-1012). IEEE.

Mahmud, T., Hasan, M., & Roy-Chowdhury, A. K. (2017, October). Joint prediction of activity labels and starting times in untrimmed videos. In 2017 IEEE International Conference on Computer Vision (ICCV) (pp. 5784-5793). IEEE. 10.1109/ICCV.2017.616

Rabiner, L. R., & Juang, B. H. (1986). An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1), 4–16.

Rhinehart, N., & Kitani, K. M. (2017, October). First-person activity forecasting with online inverse reinforcement learning. In Proceedings of the IEEE International Conference on Computer Vision (pp. 3696-3705). 10.1109/ICCV.2017.399

Schüldt, C., Laptev, I., & Caputo, B. (2004). Recognizing human actions: A local SVM approach. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004) (Vol. 3, pp. 32-36). IEEE.



Sigurdsson, G. A., Divvala, S. K., Farhadi, A., & Gupta, A. (2017, July). Asynchronous temporal fields for action recognition. In CVPR (Vol. 6, p. 8).

Simonyan, K., & Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems (pp. 568-576).

Singh, G., Saha, S., Sapienza, M., Torr, P., & Cuzzolin, F. (2017, October). Online real-time multiple spatiotemporal action localisation and prediction. In 2017 IEEE International Conference on Computer Vision (ICCV) (pp. 3657-3666). IEEE. 10.1109/ICCV.2017.393

Soomro, K., Zamir, A. R., & Shah, M. (2012). UCF101: A dataset of 101 human action classes from videos in the wild. arXiv:1212.0402

Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., & Van Gool, L. (2018). Temporal segment networks for action recognition in videos. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Zeng, K. H., Shen, W. B., Huang, D. A., Sun, M., & Niebles, J. C. (2017, August). Visual forecasting by imitating dynamics in natural sequences. In IEEE International Conference on Computer Vision (ICCV) (Vol. 2). 10.1109/ICCV.2017.326

Zhu, Y., Lan, Z., Newsam, S., & Hauptmann, A. G. (2017). Hidden two-stream convolutional networks for action recognition. arXiv:1704.00389

ENDNOTES

1 https://github.com/NVIDIA/nvvl/tree/master/pytorch/test
2 http://crcv.ucf.edu/data/UCF101.php

John A. Castro-Vargas is a PhD student at the University of Alicante. His areas of interest are robotics, deep learning, gesture recognition, and action recognition. He has participated in the nationally funded projects “Multi-sensorial robotic system with dual manipulation for human-robot assistance tasks” and “COMBAHO: COMe BAck HOme system for enhancing autonomy of people with acquired brain injury and dependent on their integration into society”.