Weakly Supervised Correspondence Estimation

Zhiwei Jia

Weakly Supervised Learning

• 1. Not enough labeled data
• 2. Transfer learning
• 3. Helps to increase the performance of supervised learning
• 4. Provides good insights into solving certain learning problems

Learning to See by Moving

• 1. Biological background
• 2. Why use egomotion information as supervision?
  • Availability of "labeled data"
• 3. Overview:
  • Egomotion information as a form of self-supervision
• 4. Main result:
  • The learned visual representation compared favourably to that learnt using direct supervision on the tasks of scene recognition, object recognition, visual odometry and keypoint matching

Main Approach

• 1. Correlating visual stimuli with egomotion:
  • Egomotion <==> camera motion
  • Predicting the camera transformation from consecutive pairs of images.

• 2. Visual correspondence can help with visual tasks in general:
  • Pretraining for other tasks

Architecture Overview

1. Siamese-style CNN

2. Learning by minimizing the prediction error of egomotion information

3. The TCNN is only used in the training process
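The three points above can be sketched in a few lines. This is a minimal illustration, not the network from the talk: the single-layer `base_cnn` and `top_cnn` stand-ins, the layer sizes and the 7-bin target are all assumptions made for brevity. What it does show is the Siamese structure (one set of base weights applied to both images) and a classification loss on the binned egomotion.

```python
import numpy as np

rng = np.random.default_rng(0)

def base_cnn(image, w_base):
    """Stand-in for the shared base network: one linear layer plus ReLU."""
    return np.maximum(0.0, image.reshape(-1) @ w_base)

def top_cnn(feat_a, feat_b, w_top):
    """Stand-in for the TCNN: maps the concatenated features to bin logits."""
    return np.concatenate([feat_a, feat_b]) @ w_top

def cross_entropy(logits, target_bin):
    """Softmax cross-entropy for a single example."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[target_bin])

# Hypothetical sizes: 28x28 inputs, 16 features per stream, 7 egomotion bins.
n_pixels, n_feat, n_bins = 28 * 28, 16, 7
w_base = rng.normal(scale=0.01, size=(n_pixels, n_feat))  # shared by both streams
w_top = rng.normal(scale=0.01, size=(2 * n_feat, n_bins))

img_a = rng.random((28, 28))
img_b = rng.random((28, 28))

# Both images pass through the SAME base weights (the Siamese part).
logits = top_cnn(base_cnn(img_a, w_base), base_cnn(img_b, w_base), w_top)
loss = cross_entropy(logits, target_bin=3)
```

In a setup like this, only `base_cnn` and `w_base` would be carried over to downstream tasks; `top_cnn` exists purely to provide the training signal.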

Compared to SFA (Slow Feature Analysis) Training

• x_t1, x_t2 refer to the feature representations of frames observed at times t1, t2 respectively.

• D is a parameterized measure of distance.

• m is a predefined margin, and

• T is a predefined time threshold
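A contrastive slow-feature loss built from these four symbols is commonly written as D for temporally close frames and a hinge, max(0, m − D), for distant ones. The sketch below assumes that form and takes D to be the plain Euclidean distance; neither choice is stated on the slide, and the default values of `m` and `T` are placeholders.

```python
import numpy as np

def sfa_loss(x_t1, x_t2, t1, t2, m=1.0, T=1):
    """Contrastive slow-feature loss for one pair of frame features.

    Assumed form: pull features together when the frames are at most T
    steps apart, otherwise push them at least a margin m apart.
    """
    d = float(np.linalg.norm(np.asarray(x_t1) - np.asarray(x_t2)))  # D (assumed Euclidean)
    if abs(t1 - t2) <= T:
        return d
    return max(0.0, m - d)
```

For example, two features at distance 5 contribute 5 when their frames are adjacent, and contribute nothing when the frames are far apart and the margin is already satisfied.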

Training of this Network

• 1. Transformation parameters as ground truth.
• 2. Training data:
  • MNIST
  • KITTI
  • SF dataset
• 3. Trained networks used for further visual tasks
  • KITTI-Net
  • SF-Net

Samples from SF/KITTI Dataset

On MNIST

• Translation:
  • integer value in the range [-3, 3]
  • X, Y axes
  • binned into seven uniformly spaced bins

• Rotation:
  • lies within the range [-30°, 30°]
  • Z axis
  • binned into bins of size 3° each, resulting in a total of 20 bins

• SFA:
  • translation in the range [-1, 1], rotation within [-3°, 3°]

• 5 million image pairs
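The binning above turns egomotion regression into classification. A minimal sketch of the label computation follows; the function names are hypothetical, and assigning the upper boundary (exactly 30°) to the last bin is an assumption, since the slide does not say how boundaries are handled.

```python
import math

def translation_bin(t):
    """Map an integer translation in [-3, 3] to one of 7 class labels (0..6)."""
    if not -3 <= t <= 3:
        raise ValueError("translation outside [-3, 3]")
    return int(t) + 3

def rotation_bin(deg, lo=-30.0, width=3.0, n_bins=20):
    """Map a rotation in [-30, 30] degrees to one of 20 bins of 3 degrees each."""
    idx = math.floor((deg - lo) / width)
    # Clamp so that the upper boundary falls into the last bin (assumption).
    return min(max(idx, 0), n_bins - 1)
```

Each transformation dimension (X translation, Y translation, Z rotation) then gets its own classification target.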

On KITTI

• 1. Camera direction as the Z axis
• 2. Image plane as the XY plane.
• 3. Translations along the Z/X axes
• 4. Rotation about the Y axis (Euler angle)
• 5. Individually binned into 20 uniformly spaced bins each.

• The training image pairs come from frames that were at most ±7 frames apart

SF Dataset

• Constructed using Google Street View (≈130K images).
• Camera transformation along all six dimensions of transformation.
• Rotations between [-30°, 30°] were binned into 10 uniformly spaced bins, and two extra bins were used for rotations larger and smaller than this range.

• The three translations were individually binned into 10 uniformly spaced bins each.
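The "10 bins plus two extra" scheme for rotations maps naturally onto `numpy.digitize`, which returns 0 below the first edge and `len(edges)` at or above the last. The sketch below is one possible reading of the slide; the label ordering (0 = too small, 11 = too large) and the treatment of exactly 30° as overflow are assumptions.

```python
import numpy as np

# 11 edges delimit 10 uniform bins over [-30, 30] degrees.
ROTATION_EDGES = np.linspace(-30.0, 30.0, 11)

def sf_rotation_bin(deg):
    """Return a label in 0..11: 0 and 11 are the two overflow bins."""
    return int(np.digitize(deg, ROTATION_EDGES))
```

With this encoding the rotation classifier has 12 output classes per axis.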

Evaluation on MNIST

1. The learned Base-CNN served as pretraining for a ConvNet on MNIST classification

2. Evaluated with a small amount of labeled data.

3. The learned feature representation increases the performance of classification tasks.

Evaluation of KITTI-Net / SF-Net

• Measured in terms of performance on these further visual tasks:
  • 1. Scene classification
  • 2. Large-scale image classification
  • 3. Keypoint matching
  • 4. Visual odometry
    • estimating the camera transformation between image pairs.

Scene Classification on the SUN Dataset

• 397 indoor/outdoor scene categories
• Provides 10 standard splits of 5 and 20 training images per class and a standard test set of 50 images per class

• Compare KITTI-Net/SF-Net with:
  • 1. AlexNet pretrained on ImageNet
  • 2. GIST
  • 3. SPM

1. KITTI-Net outperforms SF-Net and is comparable to AlexNet-20K.

2. Layer 4 and 5 features of KITTI-Net outperform the layer 4 and 5 features of KITTI-SFA-Net.

Large-Scale Image Classification

• All layers of KITTI-Net, KITTI-SFA-Net and AlexNet-Scratch (i.e. a CNN with random weight initialization) were finetuned for image classification.

• Comparison of AlexNet using pretrained KITTI-Net vs. AlexNet trained from scratch

Keypoint Matching (intra-class)

• PASCAL-VOC 2012 dataset with ground-truth object bounding boxes (GT-BBOX)

• 1. Compute feature maps from layers 2-5
• 2. Matching score for all pairs of GT-BBOX in the same object class.
  • The features associated with keypoints in the first image were used to predict the location of the same keypoints in the second image.
• 3. Error measurement of matching:
  • The normalized pixel distance between the actual and predicted keypoint locations
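Steps 2 and 3 can be illustrated with a nearest-neighbour lookup in feature space. This is a sketch under assumptions: the slides do not say how features are compared or what the pixel distance is normalized by, so the Euclidean feature distance and the `norm_length` argument are hypothetical choices.

```python
import numpy as np

def match_keypoint(feat_a, feat_b, kp_a):
    """Predict where keypoint kp_a=(row, col) of image A lies in image B.

    feat_a, feat_b: (H, W, C) feature maps from the same network layer.
    Returns the (row, col) in feat_b with the closest feature vector.
    """
    query = feat_a[kp_a[0], kp_a[1]]
    dists = np.linalg.norm(feat_b - query, axis=-1)
    return np.unravel_index(np.argmin(dists), dists.shape)

def normalized_error(pred, actual, norm_length):
    """Pixel distance between predicted and actual location, divided by
    a reference length (hypothetical normalizer, e.g. a box dimension)."""
    diff = np.subtract(pred, actual).astype(float)
    return float(np.linalg.norm(diff) / norm_length)
```

Averaging `normalized_error` over all keypoints and box pairs gives a single matching score per feature layer.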

Details of Keypoint Matching

Comparison Result

1. KITTI-Net-20K was superior to AlexNet-20K and AlexNet-100K and inferior only to AlexNet-1M.
2. AlexNet-Rand surprisingly performed better than AlexNet-20K.

Visual Odometry

• All layers of KITTI-Net and AlexNet-1M were finetuned for 25K iterations using the training set of the SF dataset on the task of visual odometry.

Weakness, Limitation & Extension

• 1. Impressive performance on large datasets?
• 2. Instead of pretraining, combine learning with other visual tasks in an online form.


Learning Dense Correspondence via 3D-guided Cycle Consistency

• 1. Background:
  • Works on intra-class correspondence estimation via deep learning vs. works on computing correspondence across different object/scene instances.
  • Lack of data for dense correspondence
• 2. Naïve solution: train on rendered 3D models

Main Approach

• Utilize the concept of cycle consistency of correspondence flows:
  • The composition of flow fields for any circular path through the image set should have a zero combined flow.

• "Meta-supervision"
• An end-to-end trained deep network for dense cross-instance correspondence that uses the widely available 3D CAD models.

Consistency in the Sense of Flow Field

• Predict a dense flow (or correspondence) field F(a,b): R^2 → R^2 between pairs of images a and b.

• The flow field F(a,b)(p) = (qx − px, qy − py) computes the relative offset from each point p in image a to a corresponding point q in image b.
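To make the zero-combined-flow property concrete, the sketch below composes flow fields around a cycle a → b → c → a. It assumes the convention q = p + F(p), stores the x offset in channel 0 and the y offset in channel 1, and uses nearest-pixel lookup with border clamping; all of these are illustrative choices, and the constant test flows are made up.

```python
import numpy as np

def compose(flow_ab, flow_bc):
    """Flow a->c: follow flow_ab, then add flow_bc read at the landing point."""
    h, w, _ = flow_ab.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Landing position in image b, rounded and clamped to the pixel grid.
    qx = np.clip(np.rint(xs + flow_ab[..., 0]).astype(int), 0, w - 1)
    qy = np.clip(np.rint(ys + flow_ab[..., 1]).astype(int), 0, h - 1)
    return flow_ab + flow_bc[qy, qx]

def constant_flow(h, w, dx, dy):
    """Flow field that shifts every pixel by the same (dx, dy)."""
    return np.tile(np.array([dx, dy], dtype=float), (h, w, 1))

h, w = 8, 8
f_ab = constant_flow(h, w, 2, 1)
f_bc = constant_flow(h, w, -1, 3)
f_ca = constant_flow(h, w, -1, -4)   # chosen so the cycle closes

cycle = compose(compose(f_ab, f_bc), f_ca)
residual = float(np.abs(cycle).max())  # 0 for a consistent cycle
```

A non-zero `residual` is exactly the quantity a cycle-consistency objective penalizes.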

Consistency in the Sense of Matchability

• Why matchability?
• A matchability map M(a,b): R^2 → [0,1] predicts whether a correspondence exists, M(a,b)(p) = 1, or not, M(a,b)(p) = 0.

Consistency as Supervision

• While we do not know what the ground truth is, we know how it should behave.

• Specifically, for each pair of real training images r1 and r2, find a 3D CAD model of the same category, and render two synthetic views s1 and s2 from viewpoints similar to those of r1 and r2, respectively.

• Each training quartet: <s1, s2, r1, r2>

Aim to learn 2D image correspondences that potentially capture the 3D semantics.

Compared to Autoencoder

• Reconstruction vs. zero net flow
• Sparsity constraints vs. using the construction of the flow field from s1 to s2 as guidance

Loss Function for Learning Dense Correspondence

Loss Function for Learning Dense Matchability

Combined loss function is:
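A minimal sketch of how such a combined objective could look, assuming a truncated squared Euclidean flow term evaluated only where the rendered ground truth says a match exists, a per-pixel cross-entropy matchability term, and a hypothetical weight `lam`. None of these specifics (including the truncation value) are stated on the slides.

```python
import numpy as np

def flow_consistency_loss(flow_gt, flow_composed, match_gt, trunc=15.0):
    """Truncated squared error between the s1->s2 ground-truth flow and the
    flow composed through the real images, summed over matchable pixels."""
    sq_err = np.sum((flow_gt - flow_composed) ** 2, axis=-1)
    per_pixel = np.minimum(sq_err, trunc ** 2)
    return float(np.sum(per_pixel * match_gt))

def matchability_loss(match_gt, match_composed, eps=1e-7):
    """Per-pixel binary cross-entropy between ground-truth and composed maps."""
    p = np.clip(match_composed, eps, 1.0 - eps)
    ce = -(match_gt * np.log(p) + (1.0 - match_gt) * np.log(1.0 - p))
    return float(np.sum(ce))

def combined_loss(flow_gt, flow_composed, match_gt, match_composed, lam=1.0):
    """Weighted sum of the two consistency terms (lam is hypothetical)."""
    return (flow_consistency_loss(flow_gt, flow_composed, match_gt)
            + lam * matchability_loss(match_gt, match_composed))
```

Both terms compare a quantity known from rendering (the s1 to s2 flow and matchability) with the same quantity obtained by chaining predictions through r1 and r2.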

A Small Issue for Learning Matchability

• Multiplicative composition.
• Could fix M(s1,r1) = 1 and M(r2,s2) = 1, and only train the CNN to infer M(r1,r2)
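The effect of the multiplicative composition can be seen in a small sketch. It assumes the composed matchability at a pixel is the product of the three maps sampled along the path s1 → r1 → r2, with the convention q = p + F(p) and nearest-pixel lookup; these details are illustrative assumptions rather than the formulation from the talk.

```python
import numpy as np

def lookup(field, xs, ys):
    """Read a (H, W) or (H, W, C) field at rounded, border-clamped positions."""
    h, w = field.shape[:2]
    xi = np.clip(np.rint(xs).astype(int), 0, w - 1)
    yi = np.clip(np.rint(ys).astype(int), 0, h - 1)
    return field[yi, xi]

def composed_matchability(m_s1r1, m_r1r2, m_r2s2, flow_s1r1, flow_r1r2):
    """Product of the three matchability maps along the path s1 -> r1 -> r2."""
    h, w = m_s1r1.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    # Position reached in r1.
    x1 = xs + flow_s1r1[..., 0]
    y1 = ys + flow_s1r1[..., 1]
    # Position reached in r2.
    step = lookup(flow_r1r2, x1, y1)
    x2 = x1 + step[..., 0]
    y2 = y1 + step[..., 1]
    return m_s1r1 * lookup(m_r1r2, x1, y1) * lookup(m_r2s2, x2, y2)
```

Because the terms multiply, a single map stuck near zero silences the other two. Fixing the outer maps to 1 leaves the product equal to M(r1,r2) sampled along the path, so the training signal reaches only the map of interest.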

End-to-end Differentiable by Continuous Approximation

• Bilinear interpolation over the CNN predictions on discrete pixel locations.
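Bilinear interpolation is what lets a flow that lands between pixels read a well-defined value whose dependence on position is piecewise linear. A minimal single-point sketch follows; the function name and the border clamping are assumptions for illustration.

```python
import numpy as np

def bilinear_sample(field, x, y):
    """Sample a (H, W) or (H, W, C) array at continuous location (x, y).

    x indexes columns and y indexes rows; locations outside the grid are
    clamped to the border.
    """
    h, w = field.shape[:2]
    x = float(np.clip(x, 0, w - 1))
    y = float(np.clip(y, 0, h - 1))
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0  # fractional parts act as blending weights
    top = (1 - ax) * field[y0, x0] + ax * field[y0, x1]
    bottom = (1 - ax) * field[y1, x0] + ax * field[y1, x1]
    return (1 - ay) * top + ay * bottom
```

For the grid [[0, 1], [2, 3]], sampling at the centre (0.5, 0.5) blends all four values equally.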


Overall Architecture

Training Process

• Data:
  • The 3D CAD models used for constructing training quartets come from the ShapeNet database, while the real images are from the PASCAL3D+ dataset.

• 1. First initialize the network (partly) to mimic SIFT flow:
  • Minimize the Euclidean loss between the network prediction and the SIFT flow output on the sampled pair.

• 2. Then fine-tune the whole network end-to-end to minimize the combined consistency loss

Evaluation of Learning Performance

• 1. Feature visualization
• 2. Keypoint transfer
• 3. Matchability prediction
• 4. Shape-to-image segmentation transfer

Feature Visualization

• Extract conv-9 features from the entire set of car instances in the PASCAL3D+ dataset, and embed them in 2-D with the t-SNE algorithm.

• The result indicates that viewpoint is an important signal for similarity in the learned network

Keypoint Transfer

• Compute the percentage of correct keypoint transfer (PCK) over all image pairs as the metric for measuring the performance.

• Evaluate the quality of the correspondence output using the keypoint transfer task on the 12 categories from PASCAL3D+
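PCK counts a transferred keypoint as correct when it lands close enough to its annotated location. The sketch below assumes the common threshold of `alpha * max(H, W)` with `alpha = 0.1`; the slide does not give the threshold, so treat both as placeholders.

```python
import numpy as np

def pck(pred, gt, img_h, img_w, alpha=0.1):
    """Fraction of predicted keypoints within alpha * max(H, W) of the truth.

    pred, gt: (N, 2) arrays of keypoint coordinates in pixels.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    dist = np.linalg.norm(pred - gt, axis=1)
    return float(np.mean(dist <= alpha * max(img_h, img_w)))
```

In practice the score would be averaged over all image pairs of a category before comparing methods.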


Matchability Prediction

• Evaluate the performance of matchability prediction using the PASCAL-Part dataset, which provides human-annotated labeling.


Shape-to-image Segmentation Transfer

• Shape-to-image correspondence for transferring per-pixel labels (e.g. surface normals, segmentation masks, etc.) from shapes to real images.

• 1. Construct a shape database of about 200 shapes per category, with each shape being rendered in 8 canonical viewpoints.

• 2. Given a query real image, apply the network to predict the correspondence between the query and each rendered view of the same category, and warp the query image according to the predicted flow field.

• 3. Compare the HOG Euclidean distance between the warped query and the rendered views, and retrieve the rendered view with minimum distance.
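Step 3 reduces to a nearest-neighbour search once descriptors are available. The sketch below assumes the HOG descriptors have already been computed elsewhere and are passed in as plain arrays, with one warped-query descriptor per rendered view (since the query is warped separately toward each view); the function name is hypothetical.

```python
import numpy as np

def retrieve_nearest_view(warped_query_descs, view_descs):
    """Return (index, distance) of the best-matching rendered view.

    warped_query_descs[i]: descriptor of the query warped toward view i.
    view_descs[i]:         descriptor of rendered view i.
    Both arrays have shape (num_views, descriptor_dim).
    """
    warped = np.asarray(warped_query_descs, dtype=float)
    views = np.asarray(view_descs, dtype=float)
    dists = np.linalg.norm(warped - views, axis=1)
    best = int(np.argmin(dists))
    return best, float(dists[best])
```

The per-pixel labels of the retrieved view can then be carried back to the query through the same predicted correspondence.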

Limitation & Extension

• vs. SIFT Flow
• Other visual problems solved by using 3D models
