
Unconstrained Matching of 2D and 3D Descriptors for 6-DOF Pose Estimation

Uzair Nadeem1, Mohammed Bennamoun1, Roberto Togneri2, Ferdous Sohel3

1 Department of Computer Science and Software Engineering, The University of Western Australia

2 Department of Electrical, Electronics and Computer Engineering, The University of Western Australia

3 College of Science, Health, Engineering and Education, Murdoch University

{uzair.nadeem@research., mohammed.bennamoun@, roberto.togneri@}uwa.edu.au, [email protected]

Abstract

This paper proposes a novel concept to directly match feature descriptors extracted from 2D images with feature descriptors extracted from 3D point clouds. We use this concept to directly localize images in a 3D point cloud. We generate a dataset of matching 2D and 3D points and their corresponding feature descriptors, which is used to learn a Descriptor-Matcher classifier. To localize the pose of an image at test time, we extract keypoints and feature descriptors from the query image. The trained Descriptor-Matcher is then used to match the features from the image and the point cloud. The locations of the matched features are used in a robust pose estimation algorithm to predict the location and orientation of the query image. We carried out an extensive evaluation of the proposed method for indoor and outdoor scenarios and with different types of point clouds to verify the feasibility of our approach. Experimental results demonstrate that direct matching of feature descriptors from images and point clouds is not only a viable idea but can also be reliably used to estimate the 6-DOF poses of query cameras in any type of 3D point cloud in an unconstrained manner with high precision.

Keywords: 3D to 2D Matching, Multi-Domain Descriptor Matching, 6-DOF Pose Estimation, Image Localization.

1 Introduction

The numerous applications of 3D vision, such as augmented reality, autonomous driving and robotic locomotion, aided by the ever increasing computational power of modern machines, have caused a steep increase of interest in 3D vision and its applications. Research in the field has also been predominantly motivated by the wide-scale access to good quality and low cost 3D scanners, which has opened many new avenues. However, despite all the recent advancements of algorithms for 3D meshes and point clouds, there is still room for enhancements to match the performance of the systems designed for 2D images on similar tasks.


(a) 3D point cloud (b) 2D query image (c) Localization results of (b) in (a)

Figure 1: (a) A section of the 3D point cloud from the Shop Facade dataset [Kendall et al., 2015]. (b) An RGB query image to be localized in the 3D point cloud. (c) Visualization of the area of the 3D point cloud, identified by our technique as the location of the query image.

Nearly all the state-of-the-art methods in the various fields of computer vision are designed for 2D images, and there are also immense differences in the sizes of training datasets for 2D images compared to any of their 3D counterparts. This points to a need for the development of bridging techniques, which can make efficient use of both 2D images and 3D point clouds or meshes. A technique that can fuse information from both the 2D and 3D domains can benefit from the matured 2D vision as well as the contemporary advances in 3D vision. The complementary nature of 2D and 3D data can lead to improved performance and efficiency in many potential applications, e.g., face recognition and verification, identification of objects or different regions of interest from coloured images in 3D maps, use of the 3D model of an object to locate it in an image in the presence of severe perspective distortion [Laga et al., 2018], as well as localization of 2D images in a 3D point cloud.

To move towards the goal of multi-domain information fusion from the 2D and 3D domains, this paper proposes a novel approach to learn a framework to directly match feature descriptors extracted from 3D point clouds with those extracted from 2D images. This concept is used to localize images in point clouds directly generated from 3D scanners. To localize images, the developed framework is used to match 3D points of the point clouds to their corresponding pixels in 2D images with the help of multi-domain descriptor matching. The matched points between the images and the point clouds are then used to estimate the six degrees-of-freedom (DOF) position and orientation (pose) of images in the 3D world.

6-DOF pose estimation is an important research topic due to its numerous applications such as augmented reality, place recognition, robotic grasping, navigation, robotic pose estimation, as well as simultaneous localization and mapping (SLAM). Current techniques for camera pose estimation can be classified into two major categories: (i) regression networks based methods and (ii) features based methods.

The regression networks based methods (e.g., Kendall et al. [2015]; Kendall and Cipolla [2017]; Walch et al. [2017]) use deep neural networks to estimate the pose of the camera.


Consequently, they have high requirements for computational resources, such as powerful GPUs, and require a lot of training data [Xin et al., 2019] from different viewpoints to ensure that the poses of query cameras are sufficiently close to the training ones [Sattler et al., 2019]. Moreover, they scale poorly with the increase in the size of 3D models and usually run into the problem of non-convergence during end-to-end training [Brachmann and Rother, 2018].

Features based methods use hand-crafted approaches or deep learning methods to extract local or global features from images. An essential step in the state-of-the-art techniques in this category is the use of the Structure from Motion (SfM) pipeline [Schonberger and Frahm, 2016; Han et al., 2019] to create sparse 3D models from images (e.g. [Li et al., 2010; Sattler et al., 2015, 2017]). Structure from Motion pipelines provide a one-to-one correspondence between the points in the generated sparse point cloud and the pixels of the 2D images that were used to create the model. Several works use this information to localize images with respect to the 3D point cloud generated by SfM. However, model creation with SfM is a computationally expensive process which may be very time consuming depending on the number of images used and the quality of the point cloud to be generated. Models generated with SfM have especially poor quality for, or miss out entirely on, texture-less regions. Moreover, dependency on SfM generated point clouds renders such techniques futile for scenarios where point clouds have been obtained from 3D scanners. Nowadays, high quality and user friendly 3D scanners (e.g. LIDAR, Microsoft Kinect, Matterport scanners and Faro 3D scanners) are available which can effectively render dense point clouds of large areas without the use of SfM. These point clouds are of better quality, not only because of their higher point density but also because they can effectively capture bland surfaces that SfM based techniques tend to miss.

To be able to directly localize 2D images in point clouds generated from any 3D scanner, we propose a novel concept to directly match feature descriptors extracted from 3D point clouds with descriptors extracted from 2D images. The matched descriptors can then be used for 6-DOF camera localization of the image in the 3D point cloud. Figure 1 shows a section of the dense point cloud of the Shop Facade dataset [Kendall et al., 2015] along with a query image and the localization results of the proposed technique. To match the feature descriptors from the 2D and 3D domains, we generate a dataset of corresponding 3D and 2D descriptors to train a two-stage classifier called ‘Descriptor-Matcher’. For localization of a 2D image in a point cloud, we first use image feature extraction techniques such as the Scale Invariant Feature Transform (SIFT) [Lowe, 2004] to extract keypoints and their descriptors from the 2D image. Similarly, we use techniques designed for point clouds [Guo et al., 2014], such as 3D-SIFT keypoints [Lowe, 2004; Rusu and Cousins, 2011], 3D-Harris keypoints [Harris et al., 1988; Laga et al., 2018] and Rotation Invariant Features Transform (RIFT) descriptors [Lazebnik et al., 2005], to extract 3D keypoints and descriptors from the point cloud. The Descriptor-Matcher is then used to find the matching pairs of 3D and 2D descriptors and their corresponding keypoints. This results in a list of coordinates in the point cloud and their corresponding pixels in the query image. The resulting one-to-one matches are then used in a robust algorithm for 6-DOF pose estimation to find the location and orientation of the query camera in the 3D point cloud. Figure 2 shows the steps involved in our technique to estimate the camera pose for a query image.

A preliminary version of this work appeared in Nadeem et al. [2019]. To the best of our knowledge, Nadeem et al. [2019] was the first work: (i) to directly match 3D descriptors extracted from dense point clouds with the 2D descriptors from RGB images, and (ii) to use direct matching of 2D and 3D descriptors to localize camera pose with 6-DOF in dense 3D point clouds.


[Figure 2 block diagram: the 3D point cloud passes through 3D key-point extraction and 3D descriptor extraction, while the 2D RGB query image passes through 2D key-point extraction and 2D descriptor extraction; both descriptor sets feed the two-stage Descriptor-Matcher (coarse Descriptor-Matcher followed by fine Descriptor-Matcher), which outputs pairs of corresponding 2D and 3D points used for 6-DOF pose estimation.]

Figure 2: A block diagram of the test pipeline of the proposed technique. We extract 3D key-points and descriptors from the dense 3D point cloud. 2D key-points and their corresponding descriptors are extracted from the 2D RGB query image. Then our proposed ‘Descriptor-Matcher’ algorithm directly matches the 2D descriptors with the 3D descriptors to generate correspondences between points in the 2D image and the 3D point cloud. These are then used with a robust pose estimation algorithm to estimate the 6-DOF pose of the query image in the 3D point cloud.

This work extends Nadeem et al. [2019] by improving all the elements of the proposed technique, including dataset generation, the Descriptor-Matcher and pose estimation. Additionally, we propose a reliable method to create a large collection of 2D and 3D points with matching locations in images and point clouds, respectively. We also present more details of the proposed method and evaluate the proposed technique on more challenging datasets compared to Nadeem et al. [2019].

The rest of this paper is organized as follows. Section 2 discusses the various categories of the techniques in the literature for camera localization or geo-registration of images. The details of the proposed technique are presented in Section 3. Section 4 reports the experimental setup and a detailed evaluation of our results. Finally, the paper is concluded in Section 5.

2 Related Work

The numerous applications of camera pose estimation and image localization render it an interesting and active field of research. Traditionally, there are two distinct approaches to estimate the position and orientation of a camera [Xin et al., 2019]: (i) features based methods and (ii) network based pose regression methods. The proposed method forms a new category of possible approaches: (iii) direct 2D-3D descriptor matching based methods.

2.1 Features based methods

Features based methods extract convolutional or hand-crafted features from 2D images and use them in different manners to localize the images. However, many of these methods only estimate approximate locations and use the number of inliers found with RANSAC [Fischler and Bolles, 1981] as a criterion for image registration, e.g., a query image is considered as registered if the RANSAC stage of the method can find 12 inliers among the features of the query image.


This is partly due to the fact that some datasets for image localization do not provide the ground truth position and orientation information for the query cameras. However, the inlier count is not a reliable criterion and it does not represent the actual performance of any given method. Features based methods can be further classified into two types: image retrieval based methods and SfM based methods.

2.1.1 Image retrieval based methods

Image retrieval based methods involve the use of a large geo-tagged database of images. To localize a 2D query image, these methods use different types of features to retrieve images similar to the query image. The average of the retrieved database images can be used as the predicted location of the query image. Alternatively, the poses of the retrieved images can be used to triangulate the pose of the query camera [Chen et al., 2011; Zamir and Shah, 2010]. These methods cannot be used for the localization of 2D images in point clouds. Also, many times the images in the dataset are not sufficiently close to the query image, which results in significant errors in pose estimation.

2.1.2 SfM-based methods

SfM-based methods produce better pose estimates than image retrieval based methods [Sattler et al., 2017]. These methods use the SfM pipeline [Schonberger and Frahm, 2016]. SfM first extracts and matches features, such as SIFT, SURF or ORB, from the set of training images. Then the matched features between the different 2D images are used to create a 3D model on an arbitrary scale. Each point in the 3D model is created by triangulation of points from multiple images.

Irschara et al. [2009] used image retrieval to extract 2D images that were similar to the query 2D image from the training database. The extracted images were used with the SfM model to improve the accuracy of the estimated camera poses. Li et al. [2010] compared the features extracted from the query 2D images with the 2D features corresponding to the points in the SfM model to localize the images. Sattler et al. [2015] improved the localization process with a visual vocabulary of 16 million words created from the features of the database images and their locations in the SfM point cloud. Later, Sattler et al. [2017] further extended their work with the help of a prioritized matching system to improve the estimated poses for the query images.

However, these methods can only work with point clouds generated with SfM based pipelines [Piasco et al., 2018]. This is mainly because the sparse point clouds generated with SfM store the correspondence information between their points and the features of the 2D images that were used to create the sparse SfM point cloud. SfM based methods rely on this information for pose estimation. Also, some works use the element-wise mean (or any other suitable function) of the feature descriptors of the points that were used to create a 3D point in the SfM point cloud as a 3D descriptor of that 3D point. Such an approximation of 3D features is dependent on the information inherent in the point cloud generated with SfM, which is not available if the point cloud has been generated by a 3D scanner.

Structure from Motion is a computationally expensive and time consuming process. The point clouds generated with SfM are sparse and very noisy [Feng et al., 2019]. SfM models have especially poor quality at bland or texture-less regions and may miss out such areas altogether. Moreover, SfM requires multiple images of the same area from many different angles for good results, which is a practically difficult requirement, especially for large areas or buildings.


Also, the generated models are created on an arbitrary scale and it is not possible to determine the exact size of the model without extra information from other sources [Feng et al., 2019]. The availability of high quality 3D scanners has made it possible to capture large scale point clouds in an efficient manner without the need to capture thousands of images for the SfM pipeline. Moreover, LIDAR and other 3D scanners are now becoming an essential part of robots, particularly for locomotion and grasping. Therefore, it is essential to develop methods that can estimate 6-DOF poses for cameras in point clouds captured from any scanner.

2.2 Network-based pose regression methods

These methods use deep neural networks to estimate the position and orientation of query images through pose regression. However, Sattler et al. [2019] showed that these methods can only produce good results when the poses of the query images are sufficiently close to the training images. Kendall et al. [2015] proposed a convolutional neural network, called PoseNet, which was trained to regress the 6-DOF camera pose of the query image. Kendall and Cipolla [2017] later improved the loss function in PoseNet, while Walch et al. [2017] improved the network architecture to reduce the errors in pose estimation. Brachmann et al. [2017] introduced the concept of differentiable RANSAC. Instead of direct pose regression through a neural network, they created a differentiable version of RANSAC [Fischler and Bolles, 1981] to estimate the location and orientation of the query cameras. Specifically, the model of Brachmann et al. [2017] was composed of two CNNs, one to predict scene coordinates and the other to select a camera pose from a pool of hypotheses generated from the output of the first CNN. However, their method was prone to over-fitting and did not always converge during end-to-end optimization, especially for outdoor scenes [Brachmann and Rother, 2018].

Radwan et al. [2018] used a deep learning based architecture to simultaneously carry out camera pose estimation, semantic segmentation and odometry estimation by exploiting the inter-dependencies of these tasks to assist each other.

Brachmann and Rother [2018] modified [Brachmann et al., 2017] with a fully convolutional network for scene coordinate regression as the only learnable component in their system. A soft inlier count was used to test the pose hypotheses. Although this improved the localization accuracy, the system still failed to produce results for large-scale datasets.

2.3 Direct 2D-3D descriptor matching based methods

Our preliminary work [Nadeem et al., 2019] was the first technique to estimate the 6-DOF pose of query cameras by directly matching features extracted from 2D images and 3D point clouds. Feng et al. [2019] trained a deep convolutional network with a triplet loss to estimate descriptors for patches extracted from images and point clouds. The estimated descriptors were used in an exhaustive feature matching algorithm for pose estimation. However, their system achieved only a limited localization capability.

This paper improves the concept of [Nadeem et al., 2019] with an unconstrained method to extract pairs of corresponding 2D and 3D points for training, and improves the structure of the Descriptor-Matcher. It also introduces a better pose estimation strategy. Our technique directly extracts 3D key-points and feature descriptors from point clouds which, in contrast to SfM based approaches, enables us to estimate poses in point clouds generated from any 3D scanner. Moreover, the capability of our method to work with dense point clouds allows it to use better quality feature descriptors, which additionally helps in the pose estimation of the query images.

3 Proposed Technique

The direct matching of 2D and 3D descriptors in our technique provides a way to localize the position and orientation of 2D query images in point clouds which is not constrained by the method of point cloud generation, the type of 3D scanner used, or the indoor or outdoor nature of the environment.

At the core of our technique is a classifier, called ‘Descriptor-Matcher’, which is used to match the feature descriptors from the 2D and 3D domains. The Descriptor-Matcher is composed of two stages: coarse and fine. To train the Descriptor-Matcher, we need a dataset of 3D points and their corresponding pixel locations in the training images. It is impractical to manually create a dataset of matching 2D and 3D points, especially at large scale. To overcome this issue, we propose a method to automatically collect a large number of matching 3D and 2D points and their corresponding descriptors from a point cloud and a given set of 2D training images. The trained Descriptor-Matcher can then be used to localize a 2D query image in the point cloud.

3.1 Extraction of Key-points and Feature Descriptors

Let us assume a dataset with $N$ training images with known ground truth poses. Let $PC_d$ be the point cloud in which the query images need to be localized:

$PC_d = \bigcup_{i=1}^{N_p} (x_i, y_i, z_i)$    (1)

where $\bigcup$ is the union operator, $N_p$ is the number of points in the point cloud and $(x_i, y_i, z_i)$ are the Cartesian coordinates of the $i$-th point of the point cloud.

Nowadays, most 3D sensors can produce RGB images with ground truth camera positions relative to the generated point clouds. In case such information is not available, Multi View Stereo [Schonberger et al., 2016] can be used to generate ground truth camera poses for the training images, as explained in [Nadeem et al., 2019].

To create the dataset for training, we first use methods designed for point clouds to extract 3D key-points and feature descriptors from the point cloud $PC_d$ [Rusu and Cousins, 2011; Guo et al., 2014]. This results in a set of 3D keypoints:

$keys_{3D} = \bigcup_{j=1}^{N_3} (x_j, y_j, z_j)$    (2)

and their corresponding 3D descriptors:

$desc_{3D} = \bigcup_{j=1}^{N_3} (u_j^1, u_j^2, u_j^3, \ldots, u_j^p)$    (3)

where $(u_j^1, u_j^2, u_j^3, \ldots, u_j^p)$ is the $p$-dimensional 3D descriptor of the $j$-th 3D keypoint with Cartesian coordinates $(x_j, y_j, z_j)$, and $N_3$ is the number of detected keypoints in the dense point cloud.

Similarly, we use keypoint and descriptor extraction methods specific to images to get a set of 2D keypoints $keys_{2D}$ and feature descriptors $desc_{2D}$ for each training image:

$keys_{2D_n} = \bigcup_{k=1}^{N_{2n}} (x_{k,n}, y_{k,n})$    (4)

$desc_{2D_n} = \bigcup_{k=1}^{N_{2n}} (v_{k,n}^1, v_{k,n}^2, v_{k,n}^3, \ldots, v_{k,n}^q)$    (5)

where $N_{2n}$ is the number of detected keypoints in the $n$-th image, $x_{k,n}$ and $y_{k,n}$ are the horizontal and vertical pixel coordinates, respectively, of the $k$-th keypoint in the $n$-th training image, and $(v_{k,n}^1, v_{k,n}^2, v_{k,n}^3, \ldots, v_{k,n}^q)$ is the $q$-dimensional 2D descriptor of that keypoint.
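As an illustration of this extraction step, the sketch below shows one possible implementation. The 2D side uses OpenCV's SIFT, as in our experiments (Section 4); on the 3D side, Open3D's ISS keypoint detector and FPFH descriptors are used only as readily available stand-ins for the 3D-SIFT keypoints and RIFT descriptors, which are not part of common Python libraries. All function names and parameter values are illustrative, not part of the released method.

```python
import cv2
import numpy as np
import open3d as o3d

def extract_2d_features(image_path):
    """2D keypoints (pixel coordinates) and q-dimensional descriptors, Eqs. (4)-(5)."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(gray, None)
    keys2d = np.array([kp.pt for kp in keypoints])     # shape (N2n, 2)
    return keys2d, descriptors                          # descriptors: shape (N2n, 128)

def extract_3d_features(ply_path, voxel=0.05):
    """3D keypoints and p-dimensional descriptors, Eqs. (2)-(3).
    ISS + FPFH are stand-ins for the 3D-SIFT keypoints and RIFT descriptors
    used in the paper; `voxel` is an assumed scale parameter."""
    pcd = o3d.io.read_point_cloud(ply_path)
    keypts = o3d.geometry.keypoint.compute_iss_keypoints(pcd)
    keypts.estimate_normals(
        o3d.geometry.KDTreeSearchParamHybrid(radius=4 * voxel, max_nn=30))
    fpfh = o3d.pipelines.registration.compute_fpfh_feature(
        keypts, o3d.geometry.KDTreeSearchParamHybrid(radius=10 * voxel, max_nn=100))
    keys3d = np.asarray(keypts.points)                  # shape (N3, 3)
    return keys3d, fpfh.data.T                          # descriptors: shape (N3, 33)
```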


3.2 Dataset Generation

3.2.1 Back ray Tracing

To calculate the distance between 2D and 3D keypoints, we need to either project the 3D keypoints onto the 2D image or the 2D keypoints onto the 3D point cloud. Projecting the 2D keypoints would involve the voxelization of the point cloud, which is a computationally expensive process compared to its alternative (i.e., back ray tracing) for the task at hand. Therefore, for each image in the training set, we use the intrinsic and extrinsic matrices of the camera to trace back rays from the 3D keypoints $keys_{3D}$ onto the image. Mathematically, back ray tracing can be represented by the following equation:

$[s_{j,n}\,x_{j,n},\ s_{j,n}\,y_{j,n},\ s_{j,n}]^{tr} = [K_n]\,[R_n\,|\,T_n]\,[x_j, y_j, z_j, 1]^{tr} \quad \forall\ j = 1, 2, 3, \ldots, N_3$    (6)

where $K_n$ is the intrinsic matrix of the camera, while $R_n$ is the rotation matrix and $T_n$ the translation vector for the conversion of points from world coordinates to camera coordinates for the $n$-th image. The $tr$ superscript denotes the transpose of a matrix. $x_{j,n}$ and $y_{j,n}$ are the horizontal and vertical pixel coordinates, respectively, of the projected 3D keypoint on the $n$-th image and $s_{j,n}$ is the depth of the 3D keypoint from the camera. We keep only those projected keypoints that are within the boundaries of the image and have a positive depth value, so that they are in front of the camera:

$\Theta_n = \bigcup\,(x_{j,n}, y_{j,n}, s_{j,n}) \ \ \text{s.t.}\ \ (0 < x_{j,n} < img\_w_n)\ \text{and}\ (0 < y_{j,n} < img\_h_n)\ \text{and}\ (s_{j,n} > 0) \quad \forall\ j = 1, 2, 3, \ldots, N_3$    (7)

where $img\_w_n$ and $img\_h_n$ are the width and height of the $n$-th image, respectively. However, as keypoints are extremely sparse, it is possible that points from areas of the point cloud that are not visible in the image get projected onto the image. For example, if the camera is looking at a wall at a right angle and there is another wall behind the first one, then it is possible that the keypoints detected on the back wall get projected onto the image. These false projections can create problems during the training of the Descriptor-Matcher, as they result in the matching of locations between images and point clouds which are not the same and increase the number of false positives. Figure 3 shows examples of false projections of 3D keypoints from locations in the point cloud which are not visible in the 2D image.
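A compact NumPy sketch of this projection and filtering step (Eqs. 6-7) is given below; the function name and array conventions are our own.

```python
import numpy as np

def back_ray_trace(keys3d, K, R, T, img_w, img_h):
    """Project 3D keypoints into the n-th image (Eq. 6) and keep only the
    projections that fall inside the image with positive depth (Eq. 7).

    keys3d : (N3, 3) world coordinates of the 3D keypoints.
    K      : (3, 3) intrinsic matrix; R : (3, 3) rotation; T : (3,) translation.
    Returns the indices of the surviving keypoints, their pixel coordinates
    and their depths.
    """
    cam_pts = keys3d @ R.T + T          # world -> camera coordinates
    proj = cam_pts @ K.T                # rows are [s*x, s*y, s]
    in_front = proj[:, 2] > 0           # keep positive depth only
    s = proj[in_front, 2]
    x = proj[in_front, 0] / s
    y = proj[in_front, 1] / s
    inside = (x > 0) & (x < img_w) & (y > 0) & (y < img_h)
    idx = np.flatnonzero(in_front)[inside]
    return idx, np.stack([x[inside], y[inside]], axis=1), s[inside]
```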

3.2.2 Depth Block Filtering

To overcome this problem, we devised a strategy based on the 3D points of the point cloud $PC_d$, called Depth Block Filtering. Similar to the 3D keypoints, we back trace the points in $PC_d$ to the image with the help of the intrinsic matrix of the camera, $K_n$, and the augmented matrix created by appending the rotation matrix $R_n$ with the translation vector $T_n$, to get the pixel coordinates of the 3D points:

$[s_{i,n}\,x_{i,n},\ s_{i,n}\,y_{i,n},\ s_{i,n}]^{tr} = [K_n]\,[R_n\,|\,T_n]\,[x_i, y_i, z_i, 1]^{tr} \quad \forall\ i = 1, 2, 3, \ldots, N_p$    (8)

where $x_{i,n}$ and $y_{i,n}$ are the horizontal and vertical pixel coordinates, respectively, of the projected points of the 3D point cloud on the $n$-th image and $s_{i,n}$ is the depth of the projected point from the camera. We filter out all those points that are outside the boundaries of the image or behind the camera, i.e., those with a negative depth value $s_{i,n}$:


$\Gamma_n = \bigcup\,(x_{i,n}, y_{i,n}, s_{i,n}) \ \ \text{s.t.}\ \ (0 < x_{i,n} < img\_w_n)\ \text{and}\ (0 < y_{i,n} < img\_h_n)\ \text{and}\ (s_{i,n} > 0) \quad \forall\ i = 1, 2, 3, \ldots, N_p$    (9)

We convert the projected pixel values $x_{i,n}$ and $y_{i,n}$ in $\Gamma_n$ to whole numbers. Any duplicates in the resulting set that have the same values for both $x_{i,n}$ and $y_{i,n}$ are removed by keeping only those elements of $\Gamma_n$ which are closest to the position of the camera, i.e., among the duplicate pixel locations, we keep the triple with the minimum value of $s_{i,n}$. The depth values $s_{i,n}$ in $\Gamma_n$ are then used to create a depth map for the training image. Next, we use the generated depth map to create blocks of size $\tau \times \tau$ pixels around the locations of the triples in the set of projected 3D keypoints $\Theta_n$. Finally, we filter out all the triples in $\Theta_n$ whose depth values $s_{j,n}$ exceed a threshold $\phi$ within their respective depth block, where $\phi$ and $\tau$ are constants. This strategy caters not only for wrong projections but also for any holes in the point cloud. Figure 3 shows examples of the results of depth block filtering.
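One possible NumPy realisation of Depth Block Filtering is sketched below. The values of $\tau$ and $\phi$ and the exact occlusion test (comparing a keypoint's depth against the minimum depth in its block) are our reading of the description above, not values or rules prescribed by the paper.

```python
import numpy as np

def depth_block_filter(kp_xy, kp_depth, cloud_xy, cloud_depth,
                       img_w, img_h, tau=8, phi=0.1):
    """Depth Block Filtering sketch (Section 3.2.2).

    kp_xy, kp_depth       : projected 3D keypoints (pixels) and depths (set Theta_n).
    cloud_xy, cloud_depth : projections of the full point cloud and depths (set Gamma_n).
    tau, phi              : block size in pixels and depth tolerance (illustrative values).
    Returns the indices of keypoints kept as visible.
    """
    # Depth map: keep the smallest depth per integer pixel (closest surface wins).
    depth_map = np.full((img_h, img_w), np.inf)
    px = np.clip(np.round(cloud_xy[:, 0]).astype(int), 0, img_w - 1)
    py = np.clip(np.round(cloud_xy[:, 1]).astype(int), 0, img_h - 1)
    for x, y, d in zip(px, py, cloud_depth):
        if d < depth_map[y, x]:
            depth_map[y, x] = d

    keep, half = [], tau // 2
    for i, ((x, y), d) in enumerate(zip(kp_xy, kp_depth)):
        xi, yi = int(round(x)), int(round(y))
        block = depth_map[max(0, yi - half): yi + half + 1,
                          max(0, xi - half): xi + half + 1]
        finite = np.isfinite(block)
        # Discard keypoints lying behind the visible surface within their block.
        if finite.any() and d <= block[finite].min() + phi:
            keep.append(i)
    return np.array(keep, dtype=int)
```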

3.2.3 Creating 2D-3D Correspondences

We then calculate the Euclidean distances between the remaining projected points $(x_{j,n}, y_{j,n})$ in the set of projections of 3D keypoints $\Theta_n$ and the 2D keypoints $keys_{2D_n}$ of the image. The pairs of 2D and 3D keypoints whose projections are within a specified distance $\alpha$, measured in pixels, are considered as matching pairs. In the case that there is more than one 2D keypoint within the specified distance of the projection of a 3D keypoint, or multiple 3D keypoints' projections are close to a 2D keypoint, only the pair with the smallest distance is considered for the dataset, i.e., we treat the matching of keypoints as a one-to-one correspondence for each image. Let $\zeta_n$ be the set of pairs of indices, corresponding to the points of $N_3$ and $N_{2n}$, that are within an error threshold $\alpha$:

$\zeta_n = \bigcup\,(j, k) \ \ \forall\ \ \left\| \left(x_{j,n} - x_{k,n},\ y_{j,n} - y_{k,n}\right) \right\| < \alpha, \quad \text{where } j = 1, 2, 3, \ldots, N_3,\ \ k = 1, 2, 3, \ldots, N_{2n}$    (10)

We use $\zeta_n$ to create a matching set of 3D and 2D points by retrieving the keypoints from $keys_{3D}$ and $keys_{2D_n}$ according to the indices. This process is repeated for each training image and the resulting sets of matching keypoints are concatenated to create a dataset of matching 3D and 2D points for the whole training set.

To obtain a dataset of corresponding 3D and 2D descriptors for the matching points, we retrieve the corresponding descriptors of the 3D key-points from $desc_{3D}$ and the 2D descriptors from $desc_{2D_n}$. It is to be noted that, in the resulting dataset, one 3D descriptor can correspond to multiple 2D descriptors, as one 3D keypoint can appear in multiple images of the same place taken from different poses.
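A small sketch of this pairing step follows. The greedy closest-first strategy used to enforce the one-to-one constraint is one straightforward choice, and the SciPy k-d tree is used purely for the nearest-neighbour search; neither is mandated by the method.

```python
import numpy as np
from scipy.spatial import cKDTree

def match_projected_keypoints(proj_xy, keys2d, alpha=5.0):
    """One-to-one pairing of projected 3D keypoints with 2D keypoints (Eq. 10).

    proj_xy : (M, 2) image projections of the 3D keypoints surviving DBF.
    keys2d  : (N2n, 2) 2D keypoint pixel coordinates of the same image.
    alpha   : maximum pixel distance for a pair (5 pixels in Section 4).
    Returns a list of (3D index, 2D index) pairs.
    """
    tree = cKDTree(keys2d)
    dists, nearest = tree.query(proj_xy, k=1)
    candidates = sorted((d, j, int(k)) for j, (d, k) in enumerate(zip(dists, nearest))
                        if d < alpha)                 # closest pairs first
    used_3d, used_2d, pairs = set(), set(), []
    for d, j, k in candidates:
        if j not in used_3d and k not in used_2d:     # enforce one-to-one matching
            pairs.append((j, k))
            used_3d.add(j)
            used_2d.add(k)
    return pairs
```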

3.3 Training

The generated dataset of corresponding 2D and 3D features is used to train the Descriptor-Matcher. The Descriptor-Matcher is composed of two stages: a coarse matcher and a fine matcher, connected in series (see Figure 2). The two stages allow the Descriptor-Matcher to maintain both high precision and computational efficiency at the same time, as explained in Section 3.4. We evaluated several different classifiers for the coarse and fine stages, including multi-layer fully connected neural networks [Khan et al., 2018], Support Vector Machines with different kernels, Nearest Neighbour classifiers with dictionary learning, and Discriminant Analysis.



Figure 3: Examples of results from Depth Block Filtering (DBF). Due to the sparse nature of 3D keypoints, even the points occluded from the camera get projected by back ray tracing, e.g., keypoints occluded by the fridge in (a) and (b) and points from around the shop in (c) and (d). Depth Block Filtering effectively removes the occluded keypoints and only retains the keypoints in the direct line of sight of the camera. Left column: snapshots of sections from point clouds with 3D keypoints (red + green) that get projected on a training image. Right column: image with the projected locations of 3D keypoints. Red points: keypoints removed by DBF as occluded points. Green points: keypoints retained by DBF. See Section 3.2.2 for details of DBF.


However, we empirically found that a Classification Tree [Breiman et al., 1984] performs best for the coarse stage of the matcher, while for the fine stage only the Random Forest classifier [Breiman, 2001] produces suitable results. Moreover, these classifiers have better robustness, speed, generalization capability and ability to handle over-fitting for the task at hand.

We treat the problem of matching the descriptors from images and point clouds as a binary classification problem. At the input of the Descriptor-Matcher, we provide 2D and 3D descriptors concatenated in the form of a $(p+q) \times 1$ vector. A positive result at the output indicates that the concatenated descriptors are from the same location in the image and the point cloud, while a negative result indicates a mismatch. To split nodes in the classification tree of the coarse matcher, as well as in the trees of the Random Forest, we used Gini's Diversity Index to measure the impurity of nodes. The impurity criterion for our case is defined as:

$\text{Gini's index} = \dfrac{2 \times r_+ \times r_-}{(r_+ + r_-)^2}$    (11)

where $r_+$ and $r_-$ are the numbers of positive and negative samples, respectively, at any given node of a tree. Consequently, an impure node will have a positive value for the splitting criterion, while a node with only positive or only negative samples will have a Gini's diversity index equal to zero.

To train the classifiers, the corresponding descriptors of the matching 3D and 2D keypoints in the generated dataset (Section 3.2) were concatenated to create positive training samples. For the negative samples, we concatenated the 2D and 3D descriptors of the non-matching points in the dataset. We define non-matching points as those pairs whose corresponding locations in the point cloud are at least a distance $\beta$ apart. However, with the increase in the scale of the point cloud, it is possible that there are regions with similar appearance at different locations in the point cloud. This is particularly a concern for indoor scenarios, where there are comparatively more bland regions and repeated structures. To overcome this problem, we also ensured that the Euclidean distance between the descriptors of non-matching points is greater than a threshold $\gamma$. We only used a randomly selected subset of the generated negative samples for training, due to the large number of possible one-to-one correspondences between non-matching pairs. We optimized the fine matcher for maximum precision to minimize the number of false positives in the final matches.
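The scikit-learn sketch below illustrates this two-stage training: descriptor pairs are concatenated into (p+q)-dimensional samples, a single Gini-split classification tree acts as the coarse matcher and a random forest as the fine matcher. The hyper-parameter values are placeholders; the actual values are tuned by grid search (Section 4).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

def build_training_set(desc2d, desc3d, pos_pairs, neg_pairs):
    """Concatenate q-dim 2D and p-dim 3D descriptors into (p+q)-dim samples.
    pos_pairs / neg_pairs are lists of (2D index, 3D index) tuples produced by
    the dataset generation stage (Section 3.2)."""
    def stack(pairs):
        return np.hstack([desc2d[[i for i, _ in pairs]],
                          desc3d[[j for _, j in pairs]]])
    X = np.vstack([stack(pos_pairs), stack(neg_pairs)])
    y = np.concatenate([np.ones(len(pos_pairs)), np.zeros(len(neg_pairs))])
    return X, y

def train_descriptor_matcher(X, y, n_trees=100):
    """Coarse matcher: one classification tree with Gini splitting (Eq. 11).
    Fine matcher: a Random Forest, also using Gini splitting."""
    coarse = DecisionTreeClassifier(criterion="gini")
    fine = RandomForestClassifier(n_estimators=n_trees, criterion="gini")
    coarse.fit(X, y)
    fine.fit(X, y)
    return coarse, fine
```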

3.4 Testing

At test time, we first extract key-points and descriptors from the 2D query image that needs to be localized in the point cloud. Similarly, 3D keypoints and descriptors are extracted from the point cloud, if not already available. Then we concatenate the 2D descriptors of the query image and the 3D descriptors of the point cloud in a one-to-one fashion and use the two-stage Descriptor-Matcher to find the matching and non-matching pairs. The pairs of descriptors are first tested by the coarse matcher and only the pairs positively identified as matches are passed to the fine matcher. As the coarse matcher is composed of a single classification tree, it greatly reduces the number of pairs which need to be tested by the fine matcher, at only a small computational cost, thus greatly decreasing the prediction time of the algorithm. Then, we retrieve the corresponding keypoints for the descriptor pairs positively matched by the fine matcher and use them as matching locations in the image and the point cloud for pose estimation.

However, just like any classifier, the output of the Descriptor-Matcher contains some false positives, and the matched 3D and 2D points are not completely free of wrong matches. We use two conditions to improve the descriptor matching.


First, we use the probability scores for a positive match (which are in the range 0-1) as the confidence value that a 2D descriptor and a 3D descriptor represent the same location in the query image and the point cloud, respectively. For a positive match, the predicted confidence value must be greater than 0.5. Second, we also use two-way matching to improve the reliability of the matches. Specifically, based on the highest prediction confidence, we find the closest matching 3D descriptor for each 2D descriptor, and the closest matching 2D descriptor for each 3D descriptor. For a 2D descriptor to be treated as a corresponding pair of a 3D descriptor, the confidence value of the predicted match must be greater than the confidence value of that specific 2D descriptor matching with any other 3D descriptor, and vice versa.

To further filter out the false positives, we use the MLESAC [Torr and Zisserman, 2000] algorithm along with the P3P [Gao et al., 2003] algorithm to find the best set of points for the estimation of the position and orientation of the camera for the query image. MLESAC is an improved and more generalized algorithm compared to the traditional RANSAC [Fischler and Bolles, 1981]. MLESAC generates the tentative solutions in the same manner as RANSAC. However, it produces a better final solution as, in addition to maximizing the number of inliers in the solution, it uses maximum likelihood estimation based on a Gaussian distribution of noise to minimize the re-projection error between the 2D image and 3D points [Torr and Zisserman, 2000]. Figure 2 shows a block diagram of the steps involved to localize a 2D query image in a 3D point cloud. The predicted pose of the camera can further be refined with the application of the R1PPnP pose estimation algorithm [Zhou et al., 2018] on the points identified as inliers by the MLESAC algorithm to estimate the final position and viewing direction of the query camera in the point cloud. The predicted pose information can be used to extract the section of the point cloud that corresponds to the query image for visualization purposes (Figure 1).
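As an illustration of the test-time procedure, the sketch below runs the coarse stage as a cheap filter, keeps only mutual-best matches with confidence above 0.5, and then estimates the pose. OpenCV's RANSAC-based solvePnPRansac is used here as a readily available substitute for the MLESAC + P3P combination (and the optional R1PPnP refinement) used in the paper; the function and variable names are our own.

```python
import numpy as np
import cv2

def localize_query_image(desc2d, keys2d, desc3d, keys3d, coarse, fine, K):
    """Two-stage matching (Section 3.4) followed by robust pose estimation."""
    n2, n3 = len(desc2d), len(desc3d)
    conf = np.zeros((n2, n3))                     # match confidence for every 2D-3D pair
    for i in range(n2):
        pairs = np.hstack([np.tile(desc2d[i], (n3, 1)), desc3d])
        passed = np.flatnonzero(coarse.predict(pairs) == 1)   # coarse stage as a filter
        if passed.size:
            conf[i, passed] = fine.predict_proba(pairs[passed])[:, 1]

    # Two-way matching: keep pairs that are each other's highest-confidence match.
    pts2d, pts3d = [], []
    for i in range(n2):
        j = int(conf[i].argmax())
        if conf[i, j] > 0.5 and int(conf[:, j].argmax()) == i:
            pts2d.append(keys2d[i])
            pts3d.append(keys3d[j])

    # Robust PnP on the putative correspondences (stand-in for MLESAC + P3P).
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.float32(pts3d), np.float32(pts2d), np.float32(K), None)
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec, inliers
```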

4 Experiments and Analysis

To evaluate the performance of our technique, we carried out extensive experiments on a number of publicly available datasets. These include the Shop Facade [Kendall et al., 2015], Old Hospital [Kendall et al., 2015], Trinity Great Court [Kendall and Cipolla, 2017], King's College [Kendall et al., 2015] and St. Mary Church [Kendall et al., 2015] datasets from the Cambridge Landmarks Database for outdoor scenarios. To test the localization capability of the proposed technique for indoor cases, we used the Kitchen [Valentin et al., 2016] and Living Room [Valentin et al., 2016] datasets from the Stanford RGB Localization Database and the Baidu Indoor Localization Dataset [Sun et al., 2017].

In our experiments, SIFT key-points and descriptors [Lowe, 2004] were extracted from the 2D query images. For the point clouds, we used 3D SIFT key-points [Rusu and Cousins, 2011] and 3D RIFT descriptors [Lazebnik et al., 2005] to extract key-points and feature descriptors, respectively. 3D SIFT is a key-point extraction method designed for point clouds, which is inspired by the 2D SIFT key-point extraction algorithm [Lowe, 2004] for 2D images. The original 2D SIFT algorithm was adapted for 3D point clouds by substituting the intensity of pixels in an image with the principal curvature of points in a point cloud [Rusu and Cousins, 2011].

We set the maximum distance between the image keypoints and the projected 3D keypoints to α = 5 pixels for the generation of positive samples. For the formation of one-to-one pairs for negative samples, we set the minimum distance between the 3D keypoints, β, to 0.5.


The minimum Euclidean distance between the 3D descriptors, γ, was set to 0.3. To optimize the parameters for the coarse and fine matchers, we divided the generated dataset of corresponding 2D and 3D descriptors (see Section 3.2) into training and validation sets in the ratio of 8:2. Based on the performance on the validation set, we used grid search [Lerman, 1980] to fine tune the classification cost and the maximum number of splits in a tree for both the coarse and fine matchers, as well as the number of trees in the Random Forest. Finally, the complete dataset of matching 2D and 3D features was used to train the Descriptor-Matcher with the optimized parameters.
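A hedged sketch of this tuning step for the fine matcher, using scikit-learn's GridSearchCV, is given below. The use of precision as the scoring criterion follows the description above, but the listed parameter ranges are our assumptions; the paper does not state the exact search grid.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

def tune_fine_matcher(X_train, y_train):
    """Grid search over illustrative hyper-parameter ranges for the fine matcher."""
    param_grid = {
        "n_estimators": [50, 100, 200],        # number of trees in the forest
        "max_leaf_nodes": [64, 256, 1024],     # proxy for the maximum number of splits
        "class_weight": [None, "balanced"],    # proxy for the classification cost
    }
    search = GridSearchCV(RandomForestClassifier(criterion="gini"),
                          param_grid, scoring="precision", cv=3)
    search.fit(X_train, y_train)
    return search.best_estimator_, search.best_params_
```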

4.1 Evaluation Metrics

We use the positional and rotational errors in the predicted poses of the query images to evaluate the performance of our technique. The positional error is calculated as the Euclidean distance between the ground truth and the predicted positions of the query camera in the 3D point cloud:

$\text{position error} = \left\| \left(x_g - x_e,\ y_g - y_e,\ z_g - z_e\right) \right\|$    (12)

where $(x_g, y_g, z_g)$ and $(x_e, y_e, z_e)$ are the ground truth and estimated positions of the query image's camera in the 3D point cloud, respectively.

For the estimation of the rotational error, we calculate the minimum angle between the viewing directions of the predicted and the ground truth cameras for the 2D query image. If the predicted rotation matrix is represented by $R_e$ and the ground truth rotation matrix is $R_g$, then the rotational error $\phi$ in degrees can be calculated as follows:

$\phi = \dfrac{180}{\pi} \times \cos^{-1}\!\left(\dfrac{\mathrm{trace}(R_g \times R_e^{tr}) - 1}{2}\right)$    (13)
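For reference, the two error metrics of Eqs. (12)-(13) can be computed as in the following minimal NumPy sketch:

```python
import numpy as np

def pose_errors(t_gt, t_est, R_gt, R_est):
    """Positional error (Eq. 12) and rotational error in degrees (Eq. 13)."""
    position_error = np.linalg.norm(np.asarray(t_gt) - np.asarray(t_est))
    cos_angle = (np.trace(R_gt @ R_est.T) - 1.0) / 2.0
    cos_angle = np.clip(cos_angle, -1.0, 1.0)   # guard against numerical drift
    angle_error = np.degrees(np.arccos(cos_angle))
    return position_error, angle_error
```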

4.2 Outdoor Datasets

We used the datasets from the Cambridge Landmarks Database [Kendall et al., 2015; Kendall and Cipolla, 2017] to test the localization performance of the proposed method in outdoor scenarios. The images in these datasets were captured at different times under different lighting and weather conditions and contain a lot of urban clutter, which increases the challenge of precise localization of camera poses. As the Cambridge Landmarks Database does not contain dense point clouds, we used COLMAP's Multi View Stereo pipeline [Schonberger et al., 2016] to generate dense point clouds for these datasets. Table 1 reports the median errors for the estimated positions and orientations of the cameras for the query images of the outdoor datasets. We also report the percentile errors for 25%, 50%, 75% and 90% of the query images in these datasets for the estimated camera poses.

4.2.1 Shop Facade Dataset

The Shop Facade Dataset [Kendall et al., 2015] from the Cambridge Landmarks Database is composed of images of the intersection of two streets in Cambridge. The images mainly focus on the shops at the intersection. The dataset covers an area of more than 900 m². It contains a total of 334 images, with 103 images in the query set. We used the standard train-test split as defined by the authors of the dataset. Our proposed technique was able to localize all the images in the query set.

4.2.2 Old Hospital Dataset

The Old Hospital dataset [Kendall et al., 2015] contains 1077 images. There are 182 query images in the train-test split defined by the dataset's authors. The dataset covers an area of 2000 m².


Table 1: Median and percentile localization errors for the outdoor datasets: Shop Facade, Old Hospital, St. Mary Church, King's College and Great Court Datasets. P stands for percentile, e.g., P25% means the maximum error for 25% of the data when errors are sorted in ascending order. m stands for metres.

Outdoor Datasets | Errors                | Median    | P25%      | P50%      | P75%      | P90%
Shop Facade      | Position Error (m)    | 0.0860 m  | 0.0307 m  | 0.0860 m  | 0.4258 m  | 3.3890 m
Shop Facade      | Angle Error (degrees) | 0.7792°   | 0.2776°   | 0.7792°   | 4.2890°   | 33.6595°
Old Hospital     | Position Error (m)    | 0.1295 m  | 0.0649 m  | 0.1295 m  | 0.2561 m  | 3.2744 m
Old Hospital     | Angle Error (degrees) | 0.2210°   | 0.1312°   | 0.2210°   | 0.6153°   | 5.6263°
St. Mary Church  | Position Error (m)    | 0.1479 m  | 0.0689 m  | 0.1479 m  | 0.9518 m  | 13.3455 m
St. Mary Church  | Angle Error (degrees) | 0.4671°   | 0.1814°   | 0.4671°   | 3.5267°   | 35.3109°
King's College   | Position Error (m)    | 0.0877 m  | 0.0533 m  | 0.0877 m  | 0.1332 m  | 0.2108 m
King's College   | Angle Error (degrees) | 0.1476°   | 0.0880°   | 0.1476°   | 0.2613°   | 0.4694°
Great Court      | Position Error (m)    | 0.5098 m  | 0.2520 m  | 0.5098 m  | 1.5401 m  | 14.7130 m
Great Court      | Angle Error (degrees) | 0.3526°   | 0.1489°   | 0.3526°   | 1.3117°   | 12.8685°


This dataset suffers particularly from the challenges of repetitive patterns, as well as high symmetry due to similar constructions on both sides of the centre of the building. The localization results are shown in Table 1.

4.2.3 St. Mary Church Dataset

The St. Mary Church Dataset [Kendall et al., 2015] is composed of 2017 images of the Great St. Mary Church in Cambridge. It encompasses an area of 4800 m². Many of the images contain occlusions caused by pedestrians and other urban clutter, which makes it challenging to localize images in the 3D point cloud. We used the query set defined by the authors of the dataset, which contains 530 images, for the evaluation of our technique, while the remaining 2D images were used to train the Descriptor-Matcher. Our technique successfully localized all the images in the query set.

4.2.4 King’s College Dataset

This dataset covers the location and buildings of King's College, which is one of the constituent colleges of the University of Cambridge [Kendall et al., 2015]. It consists of 1563 images captured with the camera of a smart phone. King's College covers an area of more than 5600 m². We used the train-query split of images defined by the authors of the dataset. There are 343 images in the query set. Table 1 shows the results of our technique on this dataset.

4.2.5 Trinity Great Court Dataset

Trinity Great Court is the main court (courtyard) of Trinity College, Cambridge. It is one of the largest enclosed courtyards in Europe, with an area of 8000 m². The train and test sets consist of 1532 and 760 images, respectively [Kendall and Cipolla, 2017]. The proposed technique successfully localized all the images in the dataset.

4.3 Indoor Datasets

Most of the pose estimation techniques in the literature have either been evaluated only for outdoor scenarios or for very small indoor scenes (e.g., tested on a maximum volume of 6 m³ on the Seven Scenes Database [Shotton et al., 2013]). Indoor localization of images is a more challenging problem compared to outdoor settings [Sun et al., 2017]. Due to similar items (e.g., furniture) and the same construction patterns (e.g., cubicle shape of rooms, similar patterns on floors or ceilings, stairs), it is possible that different regions of a building look very similar, which greatly increases the possibility of wrong matches and incorrect camera pose estimation. On the other hand, indoor localization is a more useful application of image localization, as many of the traditional localization methods, such as GPS based localization, do not work properly in indoor settings. To evaluate the performance of our technique on practical indoor localization scenarios, we used the Kitchen and Living Room Datasets from the Stanford RGB Localization Database [Valentin et al., 2016] and the Baidu Indoor Localization Dataset [Sun et al., 2017]. Table 2 shows the median positional and rotational errors along with the percentile errors for the intervals of 25%, 50%, 75% and 90% of the data on the indoor datasets.

4.3.1 Kitchen Dataset

This dataset contains RGB images and a 3D model of a Kitchen in an apartment. It is part of the Stanford RGB Localization Database [Valentin et al., 2016]. It covers a total volume of 33 m³. The 2D images and the 3D point cloud were created with a Structure.io¹ 3D sensor coupled with an iPad. Both sensors were calibrated and temporally synced.

¹ https://structure.io/structure-sensor


We randomly selected 20% of the images for the query set, while the remaining images were used for training the Descriptor-Matcher. Our technique successfully localized all the query images in the 3D model with high accuracy. We were able to estimate the 6-DOF poses of more than 90% of the images with errors of less than 4 cm and within a degree, as shown in Table 2.

4.3.2 Living Room Dataset

The Living Room Dataset is part of the Stanford RGB Localization Database [Valentin et al., 2016] and comprises the 3D model and 2D images of a living area in an apartment with a total volume of 30 m³. This dataset was also captured with a combination of a Structure.io sensor and an iPad camera. For the quantitative evaluation on this dataset, 20% of the images were randomly held out for the query set, while the remaining images were used for the training set. We were able to localize all the images in the query set, with more than 90% of the localizations within a 5 cm error (Table 2).

4.3.3 Baidu Indoor Localization Dataset

The Baidu Indoor Localization Dataset [Sun et al., 2017] is composed of a 3D point cloud of a multi-storey shopping mall, created with the scans from a LIDAR scanner, and a database of images with ground truth information for camera positions and orientations. The 3D point cloud contains more than 67 million points. Figure 4 shows different views of the point cloud of the mall. The images in the dataset were captured with different cameras and smart phones and at different times, which increases the complexity of the dataset. Moreover, many of the images contain occlusions due to the people shopping in the mall. We randomly selected 10% of the images for the query set, while the remaining images were used for the training set. Our technique successfully localized 78.2% of the images in the query set. Despite the challenges of the dataset, we achieved a median pose estimation error of 0.69 m and 2.3 degrees, as shown in Table 2.

4.4 Comparison with Other Approaches

The proposed technique for image localization in point clouds is based on a novel concept of directly matching the descriptors extracted from images and point clouds. SfM based techniques rely on the matching of 2D features and require the point cloud to be generated from an SfM pipeline. On the other hand, our technique can work with point clouds created with any 3D scanner or generated with any method, and has no dependency on any information from SfM.

Network based regression methods train directly to regress the poses of the images. However, this causes the networks to over-fit to the ground truth poses. Therefore, at test time, they only produce good results for images with poses close to the training ones [Sattler et al., 2019]. In our technique, the training of the Descriptor-Matcher is carried out on the feature descriptors with no information about the poses or the locations of the points. This ensures that the Descriptor-Matcher does not over-fit to the training poses; rather, it learns to find a mapping between the 2D and 3D descriptors. The final pose estimation is based on the principles of geometry, which produce better pose estimates compared to end-to-end trained methods [Sattler et al., 2019]. Therefore, the proposed method benefits from the non-reliance on an SfM pipeline, like the network based methods, as well as the ability to use geometry based pose estimation, similar to the SfM based methods.

For quantitative analysis, we provide a comparison of our results with the state-of-the-art methods for camera pose localization on the datasets common to the various approaches.


Table 2: Median and percentile localization errors for the Kitchen, Living Room and Baidu Indoor Localization Datasets. P stands for percentile, e.g., P25% means the maximum error for 25% of the data when errors are sorted in ascending order. m stands for metres.

Indoor Datasets            | Errors                | Median   | P25%     | P50%     | P75%     | P90%
Kitchen                    | Position Error (m)    | 0.0117 m | 0.0066 m | 0.0117 m | 0.0223 m | 0.0395 m
Kitchen                    | Angle Error (degrees) | 0.2128°  | 0.1448°  | 0.2128°  | 0.3845°  | 0.6399°
Living Room                | Position Error (m)    | 0.0101 m | 0.0059 m | 0.0101 m | 0.0180 m | 0.0470 m
Living Room                | Angle Error (degrees) | 0.3254°  | 0.2011°  | 0.3254°  | 0.4877°  | 1.6218°
Baidu Indoor Localization  | Position Error (m)    | 0.6930 m | 0.2020 m | 0.6930 m | 10.44 m  | 24.55 m
Baidu Indoor Localization  | Angle Error (degrees) | 2.30°    | 0.6105°  | 2.30°    | 13.72°   | 22.05°


(a) Top view

(b) Side view

Figure 4: Different views of the point cloud of the shopping mall from the Baidu Indoor Localization Dataset [Sun et al., 2017]. The large scale and repetitive structures in the dataset make it extremely challenging to localize 2D images in the point cloud compared to the other indoor datasets.


Table 3: Median errors for position and orientation estimation of our technique compared to other approaches on the Cambridge Landmarks outdoor datasets. Pos stands for median positional error in metres and Ang stands for rotational error in degrees. NA: results not available.

Compared methods and their types:
- PoseNet [Kendall et al., 2015], ICCV'15 (Network-based)
- Geom. Loss Net [Kendall and Cipolla, 2017], CVPR'17 (Network-based)
- VLocNet [Valada et al., 2018], ICRA'18 (Network-based)
- DSAC [Brachmann et al., 2017], CVPR'17 (Network + RANSAC)
- Active Search [Sattler et al., 2017], TPAMI'17 (SfM-based)
- DSAC++ [Brachmann and Rother, 2018], CVPR'18 (Network + RANSAC)
- Ours (2D to 3D descriptor matching)

Datasets (Pos (m) / Ang) | PoseNet      | Geom. Loss Net | VLocNet        | DSAC        | Active Search | DSAC++      | Ours
Shop Facade              | 1.46 / 4.04° | 0.88 / 3.78°   | 0.593 / 3.529° | 0.09 / 0.4° | 0.12 / 0.4°   | 0.09 / 0.4° | 0.086 / 0.779°
Old Hospital             | 2.31 / 2.69° | 3.2 / 3.29°    | 1.075 / 2.411° | 0.33 / 0.6° | 0.44 / 1°     | 0.24 / 0.5° | 0.129 / 0.221°
King's College           | 1.92 / 2.70° | 0.88 / 1.04°   | 0.836 / 1.419° | 0.30 / 0.5° | 0.42 / 0.6°   | 0.23 / 0.4° | 0.087 / 0.147°
Great Court              | NA / NA      | 6.83 / 3.47°   | NA / NA        | 2.80 / 1.5° | NA / NA       | 0.66 / 0.4° | 0.509 / 0.352°
St. Mary Church          | 2.65 / 4.24° | 1.57 / 3.32°   | 0.631 / 3.906° | 0.55 / 1.6° | 0.19 / 0.5°   | 0.20 / 0.7° | 0.147 / 0.467°


We compare the median values of the errors in the estimation of camera position and rotation on the outdoor datasets of the Cambridge Landmarks Database [Kendall et al., 2015], as mostly only the median errors are reported by the other methods. The results of the compared methods are not available for the Baidu Indoor Localization Dataset or the Stanford RGB Localization Database. Our proposed method achieved competitive or superior localization accuracy compared to the state-of-the-art methods for 6-DOF pose estimation on all the datasets. Table 3 shows the median values of the position and rotational errors of our technique compared to other prominent methods.
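For reference, the position and angular errors reported in Tables 2 and 3 can be computed as in the following minimal sketch; it assumes that the estimated and ground truth poses are available as 3x3 rotation matrices and camera position vectors, and it is not the paper's exact evaluation script.

```python
# Hedged sketch of the evaluation metrics: positional error in metres and
# angular error in degrees, summarised by their median and percentile values.
# The pose representation and function names are assumptions for illustration.
import numpy as np


def pose_errors(r_est, t_est, r_gt, t_gt):
    """Return (position error in metres, rotation error in degrees)."""
    pos_err = float(np.linalg.norm(np.asarray(t_est) - np.asarray(t_gt)))
    # Angle of the relative rotation R_gt^T * R_est.
    cos_angle = (np.trace(r_gt.T @ r_est) - 1.0) / 2.0
    ang_err = float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))
    return pos_err, ang_err


def summarise(errors, percentiles=(25, 50, 75, 90)):
    """Median and percentile summary of a list of errors."""
    errors = np.asarray(errors, dtype=np.float64)
    stats = {"median": float(np.median(errors))}
    for p in percentiles:
        stats["P{}%".format(p)] = float(np.percentile(errors, p))
    return stats
```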

5 Conclusion

This paper proposed a novel method to directly match descriptors from 2D images with those from 3D point clouds, where the descriptors are extracted using 2D and 3D feature extraction techniques, respectively. We have shown that direct matching of 2D and 3D feature descriptors is an unconstrained and reliable method for 6-DOF pose estimation in point clouds. Our approach results in a pipeline which is much simpler compared to SfM or network-based methods. Extensive quantitative evaluation has demonstrated that the proposed method achieves competitive performance with the state-of-the-art methods in the field. Moreover, unlike SfM-based techniques, the proposed method can be used with point clouds generated with any type of 3D scanner and can work in both indoor and outdoor scenarios.

Acknowledgements

This work was supported by the SIRF scholarship from the University of Western Australia (UWA) and by the Australian Research Council under Grant DP150100294.

References

E. Brachmann and C. Rother. Learning less is more - 6d camera localization via 3d surface regression. In Conference on Computer Vision and Pattern Recognition, pages 4654–4662. IEEE, 2018.

E. Brachmann, A. Krull, S. Nowozin, J. Shotton, F. Michel, S. Gumhold, and C. Rother. Dsac - differentiable ransac for camera localization. In Conference on Computer Vision and Pattern Recognition, pages 6684–6692. IEEE, 2017.

L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen. Classification and regression trees. CRC Press, 1984.

D. M. Chen, G. Baatz, K. Koser, S. S. Tsai, R. Vedantham, T. Pylvanainen, K. Roimela, X. Chen, J. Bach, M. Pollefeys, et al. City-scale landmark identification on mobile devices. In Conference on Computer Vision and Pattern Recognition, pages 737–744. IEEE, 2011.

M. Feng, S. Hu, M. H. Ang, and G. H. Lee. 2d3d-matchnet: learning to match keypoints across 2d image and 3d point cloud. In 2019 International Conference on Robotics and Automation (ICRA), pages 4790–4796. IEEE, 2019.

M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.


X.-S. Gao, X.-R. Hou, J. Tang, and H.-F. Cheng. Complete solution classification for the perspective-three-point problem. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(8):930–943, 2003.

Y. Guo, M. Bennamoun, F. Sohel, M. Lu, and J. Wan. 3d object recognition in cluttered scenes with local surface features: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(11):2270–2287, 2014.

X. Han, H. Laga, and M. Bennamoun. Image-based 3d object reconstruction: State-of-the-art and trends in the deep learning era. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.

C. G. Harris, M. Stephens, et al. A combined corner and edge detector. In Alvey Vision Conference, volume 15, pages 10–5244. Citeseer, 1988.

A. Irschara, C. Zach, J.-M. Frahm, and H. Bischof. From structure-from-motion point clouds to fast location recognition. In Conference on Computer Vision and Pattern Recognition, pages 2599–2606. IEEE, 2009.

A. Kendall and R. Cipolla. Geometric loss functions for camera pose regression with deep learning. In Conference on Computer Vision and Pattern Recognition, pages 5974–5983. IEEE, 2017.

A. Kendall, M. Grimes, and R. Cipolla. Posenet: A convolutional network for real-time 6-dof camera relocalization. In International Conference on Computer Vision, pages 2938–2946. IEEE, 2015.

S. Khan, H. Rahmani, S. A. A. Shah, and M. Bennamoun. A guide to convolutional neural networks for computer vision. Synthesis Lectures on Computer Vision, 8(1):1–207, 2018.

H. Laga, Y. Guo, H. Tabia, R. B. Fisher, and M. Bennamoun. 3D shape analysis: fundamentals, theory, and applications. John Wiley & Sons, 2018.

S. Lazebnik, C. Schmid, and J. Ponce. A sparse texture representation using local affine regions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1265–1278, 2005.

P. Lerman. Fitting segmented regression models by grid search. Journal of the Royal Statistical Society: Series C (Applied Statistics), 29(1):77–84, 1980.

Y. Li, N. Snavely, and D. P. Huttenlocher. Location recognition using prioritized feature matching. In European Conference on Computer Vision, pages 791–804. Springer, 2010.

D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.

U. Nadeem, M. A. Jalwana, M. Bennamoun, R. Togneri, and F. Sohel. Direct image to point cloud descriptors matching for 6-dof camera localization in dense 3d point clouds. In International Conference on Neural Information Processing, pages 222–234. Springer, 2019.

N. Piasco, D. Sidibe, C. Demonceaux, and V. Gouet-Brunet. A survey on visual-based localization: On the benefit of heterogeneous data. Pattern Recognition, 74:90–109, 2018.

N. Radwan, A. Valada, and W. Burgard. Vlocnet++: Deep multitask learning for semantic visual localization and odometry. IEEE Robotics and Automation Letters, 3(4):4407–4414, 2018.

R. B. Rusu and S. Cousins. Point cloud library (pcl). In International Conference on Robotics and Automation, pages 1–4. IEEE, 2011.


T. Sattler, M. Havlena, F. Radenovic, K. Schindler, and M. Pollefeys. Hyperpoints and fine vocabularies for large-scale location recognition. In International Conference on Computer Vision, pages 2102–2110. IEEE, 2015.

T. Sattler, B. Leibe, and L. Kobbelt. Efficient & effective prioritized matching for large-scale image-based localization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9):1744–1756, 2017.

T. Sattler, Q. Zhou, M. Pollefeys, and L. Leal-Taixe. Understanding the limitations of cnn-based absolute camera pose regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3302–3312, 2019.

J. L. Schonberger and J.-M. Frahm. Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition. IEEE, 2016.

J. L. Schonberger, E. Zheng, M. Pollefeys, and J.-M. Frahm. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision. Springer, 2016.

J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon. Scene coordinate regression forests for camera relocalization in rgb-d images. In Conference on Computer Vision and Pattern Recognition, pages 2930–2937. IEEE, 2013.

X. Sun, Y. Xie, P. Luo, and L. Wang. A dataset for benchmarking image-based localization. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

P. H. Torr and A. Zisserman. Mlesac: A new robust estimator with application to estimating image geometry. Computer Vision and Image Understanding, 78(1):138–156, 2000.

A. Valada, N. Radwan, and W. Burgard. Deep auxiliary learning for visual localization and odometry. In International Conference on Robotics and Automation, pages 6939–6946. IEEE, 2018.

J. Valentin, A. Dai, M. Nießner, P. Kohli, P. Torr, S. Izadi, and C. Keskin. Learning to navigate the energy landscape. In 2016 Fourth International Conference on 3D Vision (3DV), pages 323–332. IEEE, 2016.

F. Walch, C. Hazirbas, L. Leal-Taixe, T. Sattler, S. Hilsenbeck, and D. Cremers. Image-based localization using lstms for structured feature correlation. In International Conference on Computer Vision, pages 627–637. IEEE, 2017.

X. Xin, J. Jiang, and Y. Zou. A review of visual-based localization. In Proceedings of the 2019 International Conference on Robotics, Intelligent Control and Artificial Intelligence, pages 94–105, 2019.

A. R. Zamir and M. Shah. Accurate image localization based on google maps street view. In European Conference on Computer Vision, pages 255–268. Springer, 2010.

H. Zhou, T. Zhang, and J. Jagadeesan. Re-weighting and 1-point ransac-based pnp solution to handle outliers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(12):3022–3033, 2018.
