Weakly Supervised Correspondence Estimation Zhiwei Jia
Weakly Supervised Learning
• 1. Not enough labeled data
• 2. Transfer learning
• 3. Help to increase the performance of supervised learning
• 4. To provide good insights on solving certain learning problems
Learning to See by Moving
• 1. Biological background
• 2. Why use egomotion information as supervision?
  • Availability of “labeled data”
• 3. Overview:
  • Egomotion information as a form of self-supervision
• 4. Main result:
  • Learned visual representation compared favourably to that learnt using direct supervision on the tasks of scene recognition, object recognition, visual odometry and keypoint matching
Main Approach
• 1. Correlating visual stimuli with egomotion:
  • Egomotion <==> camera motion
  • Predicting the camera transformation from consecutive pairs of images.
• 2. Visual correspondence can help with visual tasks in general:
  • Pretraining for other tasks
Architecture Overview
1. Siamese-style CNN
2. Learning by minimizing the prediction error of egomotion information
3. TCNN only used in the training process
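The two-stream setup can be sketched as follows. This is a minimal PyTorch sketch with illustrative layer sizes and names, not the paper's exact architecture: a shared ("Siamese") convolutional base encodes both frames, and per-dimension classifier heads over the concatenated features predict the binned camera transformation; the heads play the role of the TCNN that is discarded after training.

```python
import torch
import torch.nn as nn

class SiameseEgomotionNet(nn.Module):
    """Two-stream net: a shared conv base encodes each frame; top layers
    (used only during training) predict the binned egomotion between them."""

    def __init__(self, n_bins_per_dim=20, n_dims=3):
        super().__init__()
        # Shared convolutional base (hyperparameters are illustrative).
        self.base = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
        )
        # One classifier head per transformation dimension
        # (e.g. X/Z translation and Y rotation on KITTI).
        self.heads = nn.ModuleList(
            [nn.Linear(2 * 64 * 4 * 4, n_bins_per_dim) for _ in range(n_dims)]
        )

    def forward(self, frame1, frame2):
        f1, f2 = self.base(frame1), self.base(frame2)  # weights are shared
        joint = torch.cat([f1, f2], dim=1)
        return [head(joint) for head in self.heads]    # one logit vector per dim

net = SiameseEgomotionNet()
logits = net(torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64))
```

Training minimizes a cross-entropy loss on each head; afterwards only the shared base is kept as the learned representation.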
Compared to SFA Training
• x_t1, x_t2 refer to feature representations of frames observed at times t1, t2 respectively.
• D is a measure of distance with parameters.
• m is a predefined margin, and
• T is a predefined time threshold.
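The slide's loss formula itself did not survive extraction; from the symbol definitions above, the SFA contrastive objective has the standard form (a reconstruction consistent with these definitions, not copied from the paper): temporally nearby frames are pulled together, and frames farther apart than T are pushed at least the margin m apart:

```latex
L(x_{t_1}, x_{t_2}) =
\begin{cases}
D(x_{t_1}, x_{t_2}) & \text{if } |t_1 - t_2| \le T,\\[2pt]
\max\bigl(0,\; m - D(x_{t_1}, x_{t_2})\bigr) & \text{if } |t_1 - t_2| > T.
\end{cases}
```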
Training of this Network
• 1. Transformation parameters as ground truth.
• 2. Training data:
  • MNIST
  • KITTI
  • SF dataset
• 3. Trained network used for further visual tasks
  • KITTI-Net
  • SF-Net
Samples from SF/KITTI Dataset
On MNIST
• Translation:
  • integer value in the range [-3, 3]
  • X, Y axes
  • binned into seven uniformly spaced bins
• Rotation:
  • lies within the range [-30°, 30°]
  • Z axis
  • binned into bins of size 3° each, resulting in a total of 20 bins
• SFA:
  • translation in the range [-1, 1], rotation within [-3°, 3°]
• 5 million image pairs
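Turning continuous transformation parameters into classification targets amounts to uniform binning; a minimal sketch (the function name and the clamping of out-of-range values are illustrative, not from the paper):

```python
def bin_label(value, lo, hi, n_bins):
    """Map a continuous transformation parameter (e.g. a rotation in
    degrees) to the index of one of `n_bins` uniformly spaced bins on
    [lo, hi], turning egomotion regression into classification."""
    width = (hi - lo) / n_bins
    idx = int((value - lo) // width)
    return min(max(idx, 0), n_bins - 1)  # clamp boundary/outlier values

# e.g. a rotation of 0° on [-30°, 30°] with 20 bins falls in bin 10
label = bin_label(0.0, -30.0, 30.0, 20)
```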
On KITTI
• 1. Camera direction as the Z axis
• 2. Image plane as the XY plane.
• 3. Translations along the Z/X axes
• 4. Rotation about the Y axis (Euler angle)
• 5. Individually binned into 20 uniformly spaced bins each.
• The training image pairs come from frames that were at most ±7 frames apart.
SF Dataset
• Constructed using Google Street View (≈130K images).
• Camera transformation along all six dimensions of transformation.
• Rotations between [-30°, 30°] were binned into 10 uniformly spaced bins, with two extra bins for rotations larger and smaller than this range.
• Three translations were individually binned into 10 uniformly spaced bins each.
Evaluation on MNIST
1. Learned Base-CNN served as a pretraining method for a ConvNet on classification of MNIST
2. with a small amount of data.
3. Learned feature representation increases the performance of classification tasks.
Evaluation of KITTI-/SF-Net
• Measured in terms of further performing these visual tasks:
  • 1. Scene classification
  • 2. Large-scale image classification
  • 3. Keypoint matching
  • 4. Visual odometry
    • estimating the camera transformation between image pairs.
Scene Classification on SUN dataset
• 397 indoor/outdoor scene categories
• provides 10 standard splits of 5 and 20 training images per class and a standard test set of 50 images per class
• Compare KITTI-/SF-Net with:
  • 1. AlexNet pretrained on ImageNet
  • 2. GIST
  • 3. SPM
1. KITTI-Net outperforms SF-Net and is comparable to AlexNet-20K.
2. Features from layers 4 and 5 of KITTI-Net outperform the corresponding features of KITTI-SFA-Net.
Large-Scale Image Classification
• All layers of KITTI-Net, KITTI-SFA-Net and AlexNet-Scratch (i.e. a CNN with random weight initialization) were finetuned for image classification.
• Comparison of AlexNet using the pretrained KITTI-Net vs. AlexNet trained from scratch
Keypoint Matching (intra-class)
• PASCAL-VOC 2012 dataset with ground-truth object bounding boxes (GT-BBOX)
• 1. Compute feature maps from layers 2-5
• 2. Matching score for all pairs of GT-BBOX in the same object class.
  • the features associated with keypoints in the first image were used to predict the locations of the same keypoints in the second image.
• 3. Error measurement of matching:
  • the normalized pixel distance between the actual and predicted keypoint locations
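The error metric can be sketched as follows; the choice of normalizer (e.g. the GT-BBOX diagonal) is an assumption, since the slide only says the pixel distance is normalized:

```python
import numpy as np

def matching_error(pred_pts, true_pts, norm):
    """Mean normalized pixel distance between predicted and actual
    keypoint locations. `norm` is the normalizer (e.g. the GT-BBOX
    diagonal -- an assumption; the slide does not specify it)."""
    d = np.linalg.norm(
        np.asarray(pred_pts, float) - np.asarray(true_pts, float), axis=1
    )
    return float(np.mean(d / norm))

# A prediction off by (3, 4) pixels with normalizer 5 gives error 1.0.
err = matching_error([[3.0, 4.0]], [[0.0, 0.0]], 5.0)
```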
Details of Keypoint Matching
Comparison Result
1. KITTI-Net-20K was superior to AlexNet-20K and AlexNet-100K, and inferior only to AlexNet-1M.
2. AlexNet-Rand surprisingly performed better than AlexNet-20K.
Visual Odometry
• All layers of KITTI-Net and AlexNet-1M were finetuned for 25K iterations using the training set of the SF dataset on the task of visual odometry.
Weakness, Limitation & Extension
• 1. Impressive performance on large datasets?
• 2. Instead of pretraining, combine learning with other visual tasks in an online form.
Learning Dense Correspondence via 3D-guided Cycle Consistency
• 1. Background:
  • Works on intra-class correspondence estimation via deep learning vs. works on computing correspondence across different object/scene instances.
  • Lack of data for dense correspondence
• 2. Naïve solution: train on rendered 3D models
Main Approach
• Utilize the concept of cycle consistency of correspondence flows:
  • the composition of flow fields along any circular path through the image set should have zero combined flow.
• “meta-supervision”
• An end-to-end trained deep network for dense cross-instance correspondence that uses widely available 3D CAD models.
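Flow-field composition, the operation underlying the cycle, can be sketched in NumPy. This sketch follows the sign convention given on the flow-field slide (q = p − F(p)) and uses nearest-neighbor lookup for simplicity, where the method itself uses bilinear interpolation:

```python
import numpy as np

def compose_flows(flow_ab, flow_bc):
    """Compose two dense flow fields of shape (H, W, 2).

    With the convention F(a,b)(p) = p - q, the point in image b matched
    to pixel p is q = p - F(a,b)(p); the composed flow a -> c is then
    F(a,b)(p) + F(b,c)(q). Nearest-neighbor lookup of F(b,c) at q is
    used here for simplicity."""
    h, w = flow_ab.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    qx = np.clip(np.round(xs - flow_ab[..., 0]).astype(int), 0, w - 1)
    qy = np.clip(np.round(ys - flow_ab[..., 1]).astype(int), 0, h - 1)
    return flow_ab + flow_bc[qy, qx]

# Cycle consistency: a flow composed with its inverse gives ~zero net flow.
shift = np.full((4, 4, 2), 1.0)          # constant flow field
net_flow = compose_flows(shift, -shift)  # forward then backward
```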
Consistency in the Sense of Flow Field
• Predict a dense flow (or correspondence) field F(a,b): R^2 → R^2 between pairs of images a and b.
• The flow field F(a,b)(p) = (px − qx, py − qy) computes the relative offset from each point p in image a to a corresponding point q in image b.
Consistency in the Sense of Matchability
• Why matchability?
• A matchability map M(a,b): R^2 → [0,1] predicts whether a correspondence exists, M(a,b)(p) = 1, or not, M(a,b)(p) = 0.
Consistency as Supervision
• While we do not know what the ground truth is, we know how it should behave.
• Specifically, for each pair of real training images r1 and r2, find a 3D CAD model of the same category, and render two synthetic views s1 and s2 in similar viewpoints to r1 and r2, respectively.
• Each training quartet: <s1, s2, r1, r2>
Aim to learn 2D image correspondences that potentially capture the 3D semantics.
Compared to Autoencoder
• Reconstruction vs. zero net flow
• Sparsity constraints vs. using the flow field from s1 to s2 as guidance
Loss Function for Learning Dense Correspondence
Loss Function for Learning Dense Matchability
The combined loss function is:
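The formulas on these three slides were figures that did not survive extraction. From the cycle-consistency setup, the flow loss plausibly penalizes the mismatch between the flow composed along the cycle s1 → r1 → r2 → s2 and the ground-truth synthetic flow from s1 to s2 known from the 3D model (a hedged reconstruction, not the paper's exact notation; ⊕ denotes flow composition and λ is a weighting assumed here):

```latex
\mathcal{L}_{\text{flow}} = \sum_{p}
  \bigl\| \bigl(F(s_1,r_1) \oplus F(r_1,r_2) \oplus F(r_2,s_2)\bigr)(p)
  - F^{\text{gt}}(s_1,s_2)(p) \bigr\|_2^2,
\qquad
\mathcal{L} = \mathcal{L}_{\text{flow}} + \lambda\,\mathcal{L}_{\text{match}}
```

with an analogous consistency term for the matchability maps composed along the same cycle.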
A Small Issue for Learning Matchability
• Multiplicative composition.
• Could fix M(s1,r1) = 1 and M(r2,s2) = 1, and only train the CNN to infer M(r1,r2)
End-to-end Differentiable by Continuous Approximation
• Bilinear interpolation over the CNN predictions on discrete pixel locations.
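Bilinear interpolation makes sampling at continuous (flow-warped) locations piecewise-differentiable; a minimal sketch for a single-channel map (function name and edge clamping are illustrative):

```python
import numpy as np

def bilinear_sample(grid, x, y):
    """Sample a 2-D map at a continuous location (x, y) as a weighted
    average of the four surrounding discrete pixels -- the continuous
    approximation that makes flow composition differentiable."""
    h, w = grid.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)  # clamp at the border
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * grid[y0, x0]
            + wx * (1 - wy) * grid[y0, x1]
            + (1 - wx) * wy * grid[y1, x0]
            + wx * wy * grid[y1, x1])

# Sampling the center of a 2x2 grid averages all four values.
grid = np.array([[0.0, 1.0], [2.0, 3.0]])
center = bilinear_sample(grid, 0.5, 0.5)
```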
Overall Architecture
Training Process
• Data:
  • The 3D CAD models used for constructing training quartets come from the ShapeNet database, while the real images are from the PASCAL3D+ dataset.
• 1. First initialize the network (partly) to mimic SIFT flow:
  • minimize the Euclidean loss between the network prediction and the SIFT flow output on the sampled pair.
• 2. Then fine-tune the whole network end-to-end to minimize the combined consistency loss.
Evaluation of Learning Performance
• 1. Feature visualization
• 2. Keypoint transfer
• 3. Matchability prediction
• 4. Shape-to-image segmentation transfer
Feature Visualization
• Extract conv-9 features from the entire set of car instances in the PASCAL3D+ dataset, and embed them in 2-D with the t-SNE algorithm.
• The result indicates that viewpoint is an important signal for similarity in the learned network.
Keypoint Transfer
• Compute the percentage of correct keypoint transfer (PCK) over all image pairs as the metric for measuring performance.
• Evaluate the quality of the correspondence output using the keypoint transfer task on the 12 categories from PASCAL3D+.
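The PCK metric can be sketched as follows; the exact normalizer varies across papers (image size, bounding-box size), so the image-diagonal normalizer used here is an assumption:

```python
import numpy as np

def pck(pred_pts, true_pts, norm, alpha=0.1):
    """Percentage of Correct Keypoints: a transferred keypoint counts as
    correct if it lands within alpha * norm of the ground-truth location
    (norm, e.g. the image diagonal, is an assumption here)."""
    d = np.linalg.norm(
        np.asarray(pred_pts, float) - np.asarray(true_pts, float), axis=1
    )
    return float(np.mean(d <= alpha * norm))

# One of two keypoints lands within 0.1 * 50 = 5 pixels -> PCK 0.5.
score = pck([[0.0, 0.0], [10.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]], 50.0)
```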
Matchability Prediction
• Evaluate the performance of matchability prediction using the PASCAL-Part dataset, which provides human-annotated labeling.
Shape-to-image Segmentation Transfer
• Shape-to-image correspondence for transferring per-pixel labels (e.g. surface normals, segmentation masks, etc.) from shapes to real images.
• 1. Construct a shape database of about 200 shapes per category, with each shape rendered in 8 canonical viewpoints.
• 2. Given a query real image, apply the network to predict the correspondence between the query and each rendered view of the same category, and warp the query image according to the predicted flow field.
• 3. Compare the HOG Euclidean distance between the warped query and the rendered views, and retrieve the rendered view with minimum distance.
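Step 3 above is a nearest-neighbor retrieval over descriptors; a minimal sketch (function name and descriptor format are illustrative, and HOG extraction itself is assumed done elsewhere):

```python
import numpy as np

def retrieve_nearest_view(query_desc, view_descs):
    """Given a descriptor (e.g. HOG) of the warped query image and
    descriptors of the rendered views, return the index of the view
    with minimum Euclidean distance."""
    dists = np.linalg.norm(
        np.asarray(view_descs, float) - np.asarray(query_desc, float), axis=1
    )
    return int(np.argmin(dists))

# The second view is closest to the query descriptor [1, 1].
best = retrieve_nearest_view([1.0, 1.0], [[0.0, 0.0], [1.0, 2.0], [5.0, 5.0]])
```

The per-pixel labels of the retrieved view are then transferred back to the query through the predicted correspondence.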
Limitation & Extension
• vs. SIFT Flow
• Other visual problems solved by using 3D models