
Reconstruction of Wall Geometry for Augmented Reality

Master Thesis of Artificial Intelligence
The University of Amsterdam

in cooperation with

Biorobotics Lab
Delft University of Technology

42 EC

Author: ing. T.J. Hijzen (6332331), University of Amsterdam

Supervisors:
prof.dr.ir. P.P. Jonker, Delft University of Technology
dr.ir. L. Dorst, University of Amsterdam
R.G. Prevel MSc, MSc & DIC, Delft University of Technology

October 10, 2013


Abstract

Augmented Reality is concerned with merging virtual objects into the real world. The problem of pose estimation has been largely solved in this field. However, the robust registration of virtual objects without preparing the environment is still an open problem. This thesis proposes to approach this problem by first detecting walls in an indoor scene, using the sparse point cloud created by the stereo tracking and mapping algorithm. To this end, a comparison is made of region growing, planar homography estimation, the Hough transform and RANSAC adaptations for plane detection. This thesis concludes that RANSAC is the preferred method when the data is sparse and noisy, and improves upon the robustness of standard RANSAC by modifying the score to detect walls. Two new algorithms, backlier scaling factor adjusted and binomially adjusted RANSAC, are proposed and evaluated with a novel evaluation method based on the sparse point cloud. A new modular software architecture together with a stereo camera setup including an inertial measurement unit is introduced to test and evaluate the novel algorithms. This thesis shows that the proposed algorithms improve on the state of the art and provide an initial step into scene reconstruction for Augmented Reality.


Contents

1 Introduction
  1.1 Augmented Reality
    1.1.1 Combines real and virtual
    1.1.2 Interactive in real time
    1.1.3 Problem of registration
  1.2 Main Challenge
  1.3 Outline

2 Scene Geometry
  2.1 Tracking and mapping
    2.1.1 Camera model
    2.1.2 Sparse mapping
    2.1.3 Tracking
  2.2 Scene Modeling
    2.2.1 Representation of a plane
    2.2.2 Plane detection methods
    2.2.3 State of the art in plane detection

3 Test Setup
  3.1 Hardware
  3.2 Software architecture of the Augmented Reality application
  3.3 Scene reconstruction
    3.3.1 Detection of a single wall
  3.4 Method for evaluation

4 Results
  4.1 Evaluation on artificial data
    4.1.1 Robustness of BSF-RANSAC
    4.1.2 Robustness of B-RANSAC
  4.2 Real data
    4.2.1 Selecting the threshold
    4.2.2 Selecting the backlier scaling factor
    4.2.3 Binomially adapted RANSAC
  4.3 Including the IMU

5 Conclusion and Discussion
  5.1 Conclusion
  5.2 Discussion

Bibliography

A Devices for Augmented Reality


Chapter 1

Introduction

When performing car maintenance one can use a manual for instructions on how to change the oil, tighten bolts and do other tasks. Following the instructions is a challenge, as one needs to translate the actions described in the manual to the actual actions in the car. Combining this information virtually with the real world would alleviate the need for the user to do the translation himself, thereby allowing the user to focus on the task at hand. For instance, one could virtually color the bolts one has to unscrew bright yellow.

Tasks where a user has to translate information to the world are abundant. When navigating with a map, one translates the information on the map to the streets around oneself. When shopping for clothes, a person has to imagine how the clothes would look on them. When a surgeon is performing surgery and is using a three-dimensional scan of a patient to guide the operation, the surgeon has to couple the information on the scan to the patient. In all of these cases users would benefit from having the virtual information displayed directly in the real world.

1.1 Augmented Reality

The combination of the real world and this virtual information is called Augmented Reality. Augmented Reality merges the real world with one or more virtual objects. An example of this is the virtual billboards added to the video image of a soccer game, shown in figure 1.1. People in the stadium do not see the billboards, but the people watching the game on their television do.

Figure 1.1: Video frame of a soccer match with digitally added billboards from ‘Kips’.

There have been various attempts to define Augmented Reality. Milgram et al. [26] classify Augmented Reality as part of the reality-virtuality continuum. Figure 1.2 shows the reality-virtuality continuum with the real world on the far left and the entirely virtual environment on the far right. In between there is a mixed reality, combining reality and virtuality. The ratio of virtuality to reality corresponds to a location on the reality-virtuality continuum. When the ratio shifts to predominantly real this is called Augmented Reality. When the ratio shifts to predominantly virtual we call this Augmented Virtuality.

Benford et al. [3] added a dimension of presence to this analysis. In this analysis the main difference between Augmented Reality and Virtual Reality is the degree of transportation: Augmented Reality is local, whereas Virtual Reality is transported, as shown in figure 1.3.


Figure 1.2: Milgram and Kishino’s Reality-Virtuality Continuum [26].

(Diagram: Physical Reality, Augmented Reality, Tele-Presence and Virtual Reality positioned along a Dimension of Transportation, from local to remote, and a Dimension of Artificiality, from physical to synthetic.)

Figure 1.3: Augmented Reality as classified according to transportation and artificiality [3].

A more complete definition was proposed by Azuma [2], who summarized three requirements in Augmented Reality applications as follows:

1. it combines real and virtual

2. it is interactive in real time

3. it is registered in 3-D

This definition does not restrict Augmented Reality to a certain display technology. Augmented Reality also includes other senses besides vision, for example audio and touch (haptics). This thesis will however focus on vision only. The next sections elaborate on the three requirements in more detail.

1.1.1 Combines real and virtual

There are numerous ways of creating an Augmented Reality and displaying it to the user. The examples given will be limited to mobile solutions¹ as the user should remain mobile in order to perform certain tasks. For a more complete description of the various techniques we direct the reader to the survey by Krevelen and Poelman [38].

In general there are three Augmented Reality systems that can realize a combination of real and virtual objects while remaining mobile:

• Monitor-based displays

• Projection-based displays

• Head-mounted displays

The smart phone is a typical example of a mobile monitor-based display. The phone uses a video stream and the tracked pose of the phone. Virtual objects can then be merged with the real environment and displayed to the user on the screen of the smart phone. A graphical representation of this setup can be found in appendix A, figure A.1.

Projection-based Augmented Reality uses a projector to change the appearance of real world objects. An example is given in figure 1.4, which shows a building with an image projected over it. This picture immediately shows one downside of this approach: projection-based Augmented Reality can never fully occlude the world, especially not in daylight. Another complication is that in order to allow the user to remain mobile either a powerful projector would have to be used to augment a large area, or the user would have to carry a large battery.

Figure 1.4: An example of projection based Augmented Reality (static setup) created by LD systems, for the full video see www.youtube.com/watch?v=8IICGkOtJ9E.

The third form of mobile Augmented Reality is a head-mounted display (HMD). The advantage of such a display is that the user has his hands free, which can be very important in applications. A second advantage is that the user is more immersed in the augmented world. The HMD most commonly uses a video² or an optical see-through method³ (see figures A.2 and A.3 in appendix A for a graphical representation of this setup) with some approach for tracking the pose of the headset. The video see-through works like the monitor-based display, taking a video stream, augmenting that and displaying the augmented scene to the user on a display in front of the eyes. This method is currently the cheapest HMD to implement but has some disadvantages. One is that the images displayed to the user have considerable latency, because of processing and data transfer, causing a noticeable delay in the perceived world. The user also commonly experiences a disorientation due to parallax (eye-offset), which is caused by the cameras being offset from the user's eye position. Though we have found that once users of such a headset have some indication of the parallax (e.g. the user sees their own hand) the disorientation immediately disappears. An optical see-through headset partially solves these problems. The optical see-through headset uses a prism allowing the user to see the real world and a display at the same time. An advantage here is that there is no latency or parallax in the image of the real world, reducing motion sickness of the user. A disadvantage is that it reduces the brightness and contrast of the real world and has a limited field-of-view due to the necessary optics. Video see-through does not have a full field-of-view either, but this is due to the cameras not capturing the full environment and the displays in front of the eyes not being large enough.

Another HMD being investigated is the retinal scanning display [18]. This device projects an image directly onto the retina of the user using a low-power laser. It does not have the brightness, contrast and field-of-view restrictions of the optical see-through display. This technology is still under development and has the potential to be the HMD of choice in the future. The main problem here lies in the optics that must allow the user to move his eyes and still see sharply [10]. Concluding, at the current state of technology the video and optical see-through headsets are still considered the preferred choice.

1.1.2 Interactive in real time

An example of a combination of real and virtual can be found in Hollywood movies that use video editing (for example the eagles in the film The Lord of the Rings). Still, this cannot be considered Augmented Reality for the simple fact that the user is not able to interact with such a movie. The large processing time allows creators of movies to create complex images. Unfortunately this amount of processing time is not available for Augmented Reality. Since the user is moving, Augmented Reality should be real-time and adaptable to new environments.

To properly display the virtual objects, the pose of the user's head needs to be known in real time. In that way, when a user looks from a different position at a static virtual object, the object's appearance will change accordingly and the object will appear to stand still. In the pioneering work by Sutherland [33] this was solved by measuring the position of the user's head using mechanics or ultrasonics. Both solutions

¹ Static solutions like holographic projection are not discussed.
² For example the Marty head-set (www.arlab.nl/media/open-source-headset-marty-now-downloadable).
³ For example the Epson Moverio (www.epson.com/moverio).


restrict the user’s movements and require an extensive preparation of the environment. Thus attention has shifted to more mobile solutions. In the work done by Caarls [5] markers were used for tracking. An example of such a marker can be seen in figure 1.5. The size and positions of these markers are known. By detecting the markers in a 2-D camera image and extracting their 3-D position in the environment, the 3-D position of the camera in the environment can be estimated.

Figure 1.5: An example of a marker used for tracking [5].

The problem with this approach is that the environment still has to be prepared with markers. The need for markers can be alleviated by modeling objects. This, however, still requires the environment to be known before the user enters (see the work of Uchiyama and Marchand [37] for a survey of approaches of this type). It would be preferable to have an Augmented Reality setup that works in an unknown environment.

To this end Klein and Murray [17] developed an algorithm for parallel tracking and mapping (PTAM) of natural features. The tracking algorithm proposed by Klein and Murray can estimate the user’s orientation and location (pose) in an arbitrary environment as long as there are enough features present. One disadvantage of this method is that the scaling of the feature locations is unknown⁴. Akman [1] developed an approach for natural feature tracking using a stereo camera which solves this problem. The natural feature tracking methods are further discussed in chapter 2.

Augmented Reality still poses many interesting challenges for human computer interaction (HCI). The use of a mouse and keyboard would restrict the user. That is why new methods for interaction are now being investigated, using techniques such as gesture and speech recognition. An example of gesture recognition is the multi-cue hand detection and tracking for HCI by Akman [1].

1.1.3 Problem of registration

When registering a virtual object into the scene, the virtual object is connected to some entity in the real world. The problem of registration is to provide this connection between the virtual and real world. For example, when tracking the position of the cameras we can have a virtual object, say a painting, floating in front of the user. It would seem to hang still regardless of where the user is looking from. But to achieve true Augmented Reality this virtual object should appear as a part of the real world. For instance the virtual painting should hang on a wall and not be visible when something occludes it, just as the virtual billboards shown in figure 1.1 are standing on the field and are occluded by the players.

Not all virtual objects placed in the real world are static like the painting or the billboards. Some objects can move and interact with the real environment. For example a virtual ball needs to bounce off a wall or floor instead of flying directly through it. Thus some representation of the environment is needed to enable this type of interaction.

The current way to solve the problem of registration is to either directly attach the information to a marker (e.g. ARToolKit⁵) or to gather prior information on the environment (e.g. Layar⁶). The problem with these methods is that the environment needs to be prepared or known in some way. If the environment is unknown then there is no way to register the virtual information into the real world. An attempt to solve this problem is presented in the work of Klein and Murray [17]: the most dominant plane is detected and used to register the Augmented Reality in the real world.

⁴ It can be estimated, however, not very accurately.
⁵ www.hitl.washington.edu/artoolkit/
⁶ www.layar.com


In the thesis by Akman [1] some effort has gone into modeling the environment. Using the disparity image from a set of stereo cameras, the scene is reconstructed as a 3-D point cloud. This operates at 2-6 frames per second depending on the algorithm used for stereo matching (i.e. BM [29], SGBM [15], ELAS [12]). This representation still lacks vital semantic information like plane locations and object information.

When considering a room, the dense reconstruction by Akman [1] lacks vital information like the locations, orientation and extent of, for example, a wall or a desk. The method used by Klein and Murray [17] fails to produce consistent registration (the virtual object is placed in the world differently almost every time) and often returns a plane not present in the world. It also lacks vital semantic information that allows one to distinguish between planes.

1.2 Main Challenge

As discussed in the previous sections, the problem of registration in an unknown environment is still a largely unsolved problem in the field of Augmented Reality. Current techniques lack semantically rich and robust information. This brings us to the main challenge addressed in this thesis:

The main challenge in the problem of registration is to produce a semantically meaningful representation of an unknown environment in nearly real-time.

This work will focus on the first step in solving this problem, by producing a reconstruction of the geometry of the scene of a room. Current techniques in this direction include, for example, multi-view stereo algorithms (see Seitz et al. [31] for an evaluation and comparison of these methods). In these and similar works the objective of the algorithm is to obtain a dense model of the scene, e.g. a point cloud, polygon mesh or grid of voxels. These representations lack semantic information and are often computationally demanding. There is also work on modeling rooms and buildings (e.g. Henry et al. [14], Xiao and Furukawa [40]) by the use of geometric entities (i.e. lines and planes). The disadvantage of these algorithms is that they use off-line methods or require an accurate depth map (or both).

This thesis addresses the problem of reconstructing scene geometry from sparse and noisy data in an office environment. To do so this thesis provides the following contributions:

1. A novel wall detection algorithm for use in Augmented Reality.

2. A new method for evaluation of the wall detection algorithm.

To evaluate the algorithm this thesis introduces a novel setup:

1. A new stereo camera setup that includes an inertial measurement unit.

2. A novel software architecture framework for Augmented Reality.

1.3 Outline

The report is organized as follows. In chapter 2 current methods for modeling the scene geometry are discussed. First the underlying algorithm is described, followed by the theory of geometric reconstruction of the environment. Then an overview of the new system is given in chapter 3, along with the novel wall detection algorithm and method for evaluation. Finally a quantitative analysis of the performance of the algorithms is given in chapter 4, followed by a conclusion and discussion in chapter 5.


Chapter 2

Scene Geometry

At the core of scene reconstruction, knowledge about the camera position in the world is required. This is largely solved in the work by Akman [1], which is discussed in section 2.1. This method produces a sparse representation of the scene in the form of a point cloud. However, no further reconstruction or modeling is performed. In section 2.2 methods for building a more semantically meaningful representation of the world are discussed.

2.1 Tracking and mapping

This section briefly explains the pose estimation algorithm as implemented by Akman [1], since this thesis expands on this work. More importantly, the construction of the sparse map is discussed in more detail as it will be used for further processing.

Figure 2.1: Process flow with independent threads: camera tracking and sparse mapping [1].

The algorithm consists of a tracking and a mapping process that run in parallel, see figure 2.1. This is enabled by using a key-frame based approach, as used by Klein and Murray [17]. The tracker performs feature detection, pose estimation and provides the key-frames. These are used by the mapping process to create a sparse representation (map) of the scene. This is subsequently used by the tracker to do pose estimation. Both processes use a camera model explained in section 2.1.1. The mapping algorithm is explained in section 2.1.2 and the tracking algorithm is briefly summarized in section 2.1.3.


2.1.1 Camera model

Tracking and mapping algorithms essentially do pose estimation of a camera in subsequent video images. To do this one needs to know the projection of a point in the scene, Xw, to a point in the camera image, x. Assuming a pinhole camera model this projection is decomposed into the intrinsic, K, and extrinsic, G, parameters of the camera as follows:

λx̃ = KGX̃w. (2.1)

Here x̃ and X̃ are homogeneous vectors and λ is the (projective) depth of the point Xw. The intrinsic camera matrix K transforms a point in the camera frame, Xc, to the image plane as depicted in figure 2.2. The intrinsic camera matrix is

        ⎡ fx  s   x0 ⎤
    K = ⎢ 0   fy  y0 ⎥ .          (2.2)
        ⎣ 0   0   1  ⎦

Here s is a skewness factor, fx = f mx, fy = f my, x0 = mx px and y0 = my py, where f is the focal length, m is the magnification and p is the principal point (where the z-axis intersects the image plane).

Figure 2.2: A projection of point X to the image plane in the camera frame C [1].

The imaging process is not linear, as assumed by the pinhole camera model, because of lens distortions. To account for this, the lens distortion is modeled. The interested reader can look in the work of Akman [1] to see how this is done. The intrinsic camera matrix and the distortion model parameters can be found by calibrating the cameras using a checkerboard pattern [42].

The extrinsic parameters detail the transformation of the world to the camera pose (aligning the z-axis to the viewing direction of the camera). The extrinsic matrix is defined as

    G = [R  T].          (2.3)

Here R is the rotation matrix and T is a vector that gives the translation. Finding this matrix is the main goal of tracking and mapping algorithms.
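As a concrete illustration of equation 2.1, the following sketch projects a world point to pixel coordinates. The intrinsic values, the identity rotation and the translation are assumptions chosen purely for illustration, not values from the thesis setup:

```python
# Sketch of the pinhole projection lambda * x~ = K G X~w (eq. 2.1),
# with illustrative (assumed) intrinsics and an almost-trivial pose.

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# Intrinsic matrix K (eq. 2.2): focal lengths fx, fy, zero skew s,
# principal point (x0, y0) -- all values here are assumed.
fx, fy, s, x0, y0 = 500.0, 500.0, 0.0, 320.0, 240.0
K = [[fx, s,  x0],
     [0., fy, y0],
     [0., 0., 1.]]

# Extrinsic matrix G = [R T] (eq. 2.3): identity rotation, small translation.
G = [[1., 0., 0., 0.1],
     [0., 1., 0., 0.0],
     [0., 0., 1., 0.0]]

# Homogeneous world point X~w (column vector), 2 m in front of the camera.
Xw = [[0.2], [0.1], [2.0], [1.0]]

# lambda * x~ = K G X~w; dividing by the third component (the projective
# depth lambda) yields pixel coordinates.
lx = matmul(K, matmul(G, Xw))
lam = lx[2][0]
u, v = lx[0][0] / lam, lx[1][0] / lam
print(u, v)  # prints: 395.0 265.0
```

Note how the translation in G shifts the point in the camera frame before K maps it onto the image plane.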

2.1.2 Sparse mapping

For tracking, a sparse map of the environment is created and expanded in real-time. The map is a set of salient points that describe the scene, which is expanded with each new key-frame¹. A key-frame is supplied whenever the camera pose is no longer close to any of the current key-frames, and at the initialization of the algorithm. The map is sparse so that localization can be done in real-time. The first step in creating this map is feature detection and matching. The matching pipeline is depicted in figure 2.3. The region detector should detect the same regions under different rotations, scaling and lighting conditions. The description and matching should also be invariant to these changes. As such, corners, saddle-points and blobs are good features in images since these features are well localized. An example of a pipeline uses the Harris corner detector [25] for region detection in combination with the scale invariant feature transform (SIFT [22]) for description and matching. These methods are accurate and produce a high repeatability of features. However, the methods mentioned are

¹ A key-frame is supplied whenever the position of the camera deviates enough from any of the other key-frames.


Region detection → Description → Matching

Figure 2.3: The matching pipeline.

slow and not suitable for real-time applications. The Adaptive and Generic Accelerated Segment Test (AGAST [23]) feature point detector is faster than the aforementioned algorithms at only a small decrease in robustness. The adaptive thresholding of the AGAST feature detector ensures that features are also found in dark and/or low textured areas of the image. To further improve the feature set, some of the weak features are removed by using non-maximum suppression. In order to achieve higher stability and efficiency a bucketing technique, similar to the one used by Zhang et al. [43], is used. The image is divided into a number of buckets as shown in figure 2.4. In each bucket only a maximum number of features are added. Bucketing ensures a more uniform

Figure 2.4: Example of bucketing where the image is divided into buckets and the number of features,black dots in the image, is limited per bucket [1].

distribution of the features and so eliminates redundant features. The features are sorted based on their Shi-Tomasi [32] score. Then the features are assigned to buckets [43] until either the buckets are full or the maximum number of features has been added. Now that a set of uniformly distributed features is found, the features are matched in the closest key-frame to see if the points have been added to the map already. Otherwise the points are matched in a stereo pair, triangulated and added to the map (i.e. a point cloud of features).
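The bucketing step described above can be sketched as follows. The grid size, per-bucket limit and feature format are assumptions for illustration; the thesis uses the Shi-Tomasi score as the ranking criterion:

```python
# Sketch of the bucketing step: the image is divided into a grid of buckets
# and only the strongest features (by corner score) are kept in each bucket,
# giving a more uniform spatial distribution.

def bucket_features(features, img_w, img_h, nx=4, ny=3, max_per_bucket=2):
    """features: list of (x, y, score) tuples. Returns a pruned subset."""
    buckets = {}
    for x, y, score in features:
        # Assign the feature to a grid cell.
        bx = min(int(x * nx / img_w), nx - 1)
        by = min(int(y * ny / img_h), ny - 1)
        buckets.setdefault((bx, by), []).append((x, y, score))
    kept = []
    for pts in buckets.values():
        # Sort by score (e.g. Shi-Tomasi) and keep only the best few.
        pts.sort(key=lambda f: f[2], reverse=True)
        kept.extend(pts[:max_per_bucket])
    return kept

# Ten features crowded into one corner plus one lone feature elsewhere:
feats = [(10 + i, 10 + i, float(i)) for i in range(10)] + [(600, 400, 0.5)]
pruned = bucket_features(feats, img_w=640, img_h=480)
print(len(pruned))  # prints: 3  (crowded corner cut to 2, lone feature kept)
```

The weak lone feature survives because it is the best in its own bucket, which is exactly the uniformity the text describes.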

To further improve the estimates of the feature and camera locations, bundle adjustment is performed [36]. This step minimizes the re-projection error, i.e. the projection error of the feature points in the map to the key-frames, using a non-linear optimization method (i.e. Levenberg-Marquardt iteration). This is done locally (over only the last few key-frames) for each added key-frame, and globally when there are no new key-frames.

2.1.3 Tracking

Now that a sparse map has been created, the position of the camera in this map is estimated. The camera transformation is estimated that best matches the translation of the features in the images. This is done in a two-stage tracking step. The first stage gives an accurate estimate of the displacement, whereas the second stage eliminates drift in the pose estimation and thus also in the mapping. The first stage projects all map points into the stereo pair and uses the visible projections to calculate the optical flow. After eliminating outliers using an epipolar constraint, the pose is updated. The second stage uses the closest key-frame. Using the pose found in the first stage, the image patches corresponding to the visible map points are matched between frames. Outliers are eliminated by using the epipolar constraint and Delaunay triangulation. Lastly the final pose is estimated using a combination of the random sample consensus (RANSAC) and the Efficient Perspective-n-Point (EPnP) algorithm [20]. See the work of Akman [1] for a more detailed explanation.


2.2 Scene Modeling

In order to register virtual objects in the real world a model of the scene is required. This could be done with a volumetric representation such as a grid of voxels or a polygon mesh. However, such a representation lacks semantic meaning. A developer of an Augmented Reality application would for example want to be able to show information on a wall, place a virtual object on a table or make a virtual object bounce off the ground. It would thus be beneficial to describe the scene in a simpler and more intuitive way. A possible way this can be done is by using geometric primitives (e.g. lines, planes and spheres). In man-made environments these entities are abundant, especially planes. This section focuses on detecting planes, as planes are one of the most primitive geometric structures and can be readily used for Augmented Reality applications. To this end, section 2.2.1 describes the mathematical representation of a plane. The sections thereafter describe the methods used for detecting planes and their optimizations.

2.2.1 Representation of a plane

A plane can be described in two ways: with an implicit equation or with a parametric equation. In the implicit representation every point p on the plane satisfies²

p · n− d = 0. (2.4)

Here d is the distance to the origin and n is a vector normal to the plane (as depicted in figure 2.5).

Figure 2.5: Representation of a plane using a point vector p and normal vector n.

This equation is very useful for determining whether a point is on or close to a plane, since p · n − d is proportional to the distance to the plane. From a set of three points (p1, p2, p3) the implicit plane equation can be parameterized. A vector perpendicular to the plane, n′, can be calculated with

n′ = (p1 − p2)× (p1 − p3).

The normal vector is then found to be

n = n′ / |n′|.

Next, equation 2.4 can be solved for d using any one of the three points.

A second representation of a plane is the parametric equation. A point on the plane is then defined in terms of coordinates s1 and s2 as

p = p1 + s1v + s2w. (2.5)

Here v and w are vectors that lie in the plane and v ≠ λw, ∀λ ∈ ℝ. In this representation it is harder to determine whether a point is on or near the plane. From a set of three points in the plane this equation can also be parametrized as follows:

p = p1 + s1(p1 − p2) + s2(p1 − p3). (2.6)
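As a small sketch (NumPy-based illustration, not code from the thesis), the three-point construction of equations 2.4-2.6 can be written as:

```python
import numpy as np

def plane_from_points(p1, p2, p3):
    """Implicit plane (eq. 2.4) from three points: returns a unit normal n
    and origin distance d such that p . n - d = 0 for points on the plane."""
    n_prime = np.cross(p1 - p2, p1 - p3)    # perpendicular to the plane
    n = n_prime / np.linalg.norm(n_prime)   # normalize
    d = float(np.dot(p1, n))                # solve eq. 2.4 with any point
    return n, d

# Three points on the plane z = 1 give n = (0, 0, 1) and d = 1.
n, d = plane_from_points(np.array([0.0, 0.0, 1.0]),
                         np.array([1.0, 0.0, 1.0]),
                         np.array([0.0, 1.0, 1.0]))
```

The same three points, via p1 − p2 and p1 − p3, also serve as the direction vectors of the parametric form of equation 2.6.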

Fitting a plane to more than three points can be done in different ways. The basic idea is that a set of points can be modeled as a point cloud X with a certain location and variance. The direction of minimal variance is the direction normal to the plane. To find it, the eigenvectors of the covariance matrix of the set of points are determined. The eigenvector corresponding to the lowest eigenvalue is the vector normal to the plane; the other two eigenvectors lie in the plane. This solution is least-squares optimal.

² The implicit representation is also known as the Hessian normal form.
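The eigenvector-based least-squares fit just described can be sketched as follows (an illustration with NumPy, not the thesis implementation):

```python
import numpy as np

def fit_plane_lsq(X):
    """Least-squares plane through an (N, 3) point cloud: the normal is the
    eigenvector of the covariance matrix with the smallest eigenvalue."""
    centroid = X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov((X - centroid).T))
    n = eigvecs[:, 0]              # eigh sorts eigenvalues ascending
    d = float(np.dot(centroid, n))
    return n, d

# Six points on the plane z = 2 recover n = (0, 0, +-1) and |d| = 2.
pts = np.array([[0, 0, 2], [1, 0, 2], [0, 1, 2],
                [1, 1, 2], [2, 3, 2], [0.5, 2, 2]], dtype=float)
n, d = fit_plane_lsq(pts)
```

The sign of the normal is ambiguous in this formulation; for wall detection a convention (e.g. normal pointing toward the camera) must be imposed separately.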

2.2.2 Plane detection methods

When it comes to plane detection, most approaches operate either on a dense or on a sparse point cloud. Most work in plane detection has focused on extracting planes from dense point clouds for use in robotics. There the point clouds are usually obtained using one of three methods:

• Time-of-flight sensor

• Kinect sensor

• Stereo matched image

A time-of-flight sensor uses a signal (e.g. light, sonar) to measure the distance to an object. These sensors are expensive and usually bulky, and thus not an ideal option for a head-mounted device. The Kinect is often chosen as a cheaper alternative but is sensitive to varying lighting conditions (since it uses infrared it is sensitive to daylight). Using the stereo cameras already employed for tracking introduces no new lighting problems and only requires additional software, decreasing the cost and complexity of the device. A challenge is that matching the stereo image requires a considerable amount of computation time and does not provide the accuracy that the Kinect would provide.

In the tracking algorithm discussed in the previous section (2.1), a sparse map of features with their three-dimensional locations is created. This sparse point cloud can also be used for plane detection. Eventually a plane must be fitted to the set of points, be it sparse or dense. The three main methods to do this are region growing, a three-dimensional Hough transform and the random sample consensus (RANSAC)³. When a stereo camera setup is used it is also possible to detect a planar homography (discussed at the end of this section).

Region growing

The region growing method works from a seed and expands the plane estimate to include more points. The seed is chosen randomly and consists of enough information to fit a plane, e.g. three points or a point with a corresponding normal. Points neighboring the region are then added when they are consistent with the plane estimate. This is repeated until no more points can be found; the algorithm then stops and adds the plane if it contains enough points. Finally the points are removed from the point set and a new seed is selected. The algorithm is outlined in Algorithm 1; it assumes that few points lie on the intersection of planes, so that most points belong to a single region. The complexity of the algorithm is highly dependent on the implementation. The implementation by Poppinga et al. [28] for instance achieves a complexity of

O(n · log(n)),

with n being the number of points, under the assumptions that most points belong to exactly one region and that the points are ordered into a grid.

Hough transform

The Hough transform is a technique that transforms data to parameter space; it is normally used in image processing to detect geometric primitives like lines and curves [11]. It can be used to detect planes in 3-D, as explained in the work of Borrmann et al. [4]. The space of all possible planes (the Hough space) is defined in terms of three parameters (ψ, θ, d): two angles and an offset. Every point votes for the set of possible planes through that point in the discretization of the Hough space. The algorithm then finds the maxima in the Hough space, which represent the detected planes. The standard Hough transform (SHT) is outlined in Algorithm 2.

The straightforward implementation of this algorithm is computationally intensive and memory expensive, as the entire Hough space must be updated for every point. Incrementing the cells in the accumulator (steps 2-5 of Algorithm 2) takes

O(|X| · Dψ · Dθ),

³ There is also an approach that uses expectation maximization [19].


Algorithm 1 Region growing method for plane fitting as implemented by Poppinga et al. [28]
1: procedure RegionGrowing(X)
2:   R = ∅, ¬R = ∅                ▷ sets of points in a plane and not in a plane
3:   while points can be selected do
4:     S = two random neighboring points (P1, P2) ∈ X \ (R ∪ ¬R)
5:     while a nearest neighbor Pn within a certain distance can be selected do
6:       if adding Pn to S does not change the plane estimate above a threshold then
7:         add Pn to S
8:         update plane estimate
9:     if size of S > some threshold then
10:      add plane estimate to set of planes
11:      add S to R
12:    else
13:      add S to ¬R
14:  return set of planes
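A much-simplified, runnable sketch of this growing loop follows (it refits the plane from scratch instead of using the incremental update of Poppinga et al. [28], and the thresholds are made-up values):

```python
import numpy as np

def grow_region(X, seed_idx, dist_thresh=0.05, radius=1.5):
    """Greedily grow a planar region from seed indices into X (an (N, 3)
    array). A point joins if it is near the region and near its plane."""
    region = set(seed_idx)
    changed = True
    while changed:
        changed = False
        P = X[list(region)]
        c = P.mean(axis=0)
        # plane normal = eigenvector of the smallest covariance eigenvalue
        n = np.linalg.eigh(np.cov((P - c).T))[1][:, 0]
        for i in range(len(X)):
            if i in region:
                continue
            near = np.min(np.linalg.norm(X[list(region)] - X[i], axis=1)) < radius
            if near and abs(np.dot(X[i] - c, n)) < dist_thresh:
                region.add(i)
                changed = True
    return region

# A 3x3 grid on z = 0 plus one outlier at z = 5: the outlier is rejected.
grid = np.array([[i, j, 0.0] for i in range(3) for j in range(3)])
cloud = np.vstack([grid, [[1.0, 1.0, 5.0]]])
region = grow_region(cloud, [0, 1, 3])
```

Refitting the full plane on every pass is exactly the cost that the incremental formulation of Poppinga et al. avoids.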

Algorithm 2 Standard Hough transform for plane fitting
1: procedure Hough(X)
2:   for P ∈ X do
3:     for all cells (ψ, θ, d) in accumulator A do
4:       if P is in the plane defined by (ψ, θ, d) then
5:         increment cell A(ψ, θ, d)
6:   search accumulator for peaks that correspond to planes
7:   return set of planes

where D is the discretization of the accumulator and |X| is the number of points. Searching theaccumulator (step 6 of Algorithm 2) takes

O(Dψ ·Dθ ·Dd),

which is usually much smaller than the time taken for filling the accumulator, since the discretization of d is typically much smaller than the number of points.
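The voting loop of Algorithm 2 can be sketched as follows (an illustrative NumPy version; the accumulator sizes, parameter ranges and offset binning are made-up simplifications, and the peak search is left out):

```python
import numpy as np

def hough_planes(X, n_psi=18, n_theta=18, n_d=50, d_max=5.0):
    """Standard 3-D Hough transform sketch: every point votes for all
    discretized normal directions (psi, theta); the offset d of the plane
    through the point is computed rather than enumerated."""
    A = np.zeros((n_psi, n_theta, n_d), dtype=int)      # the accumulator
    psis = np.linspace(0.0, np.pi, n_psi, endpoint=False)
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    for p in X:
        for i, psi in enumerate(psis):
            for j, theta in enumerate(thetas):
                n = np.array([np.cos(theta) * np.sin(psi),
                              np.sin(theta) * np.sin(psi),
                              np.cos(psi)])              # unit normal
                d = float(np.dot(p, n))                  # plane through p
                k = int((d + d_max) * n_d / (2 * d_max))
                if 0 <= k < n_d:
                    A[i, j, k] += 1
    return A   # peaks in A correspond to detected planes

# Points on the plane z = 1 concentrate their votes at psi = 0.
pts = np.array([[i, j, 1.0] for i in range(3) for j in range(3)])
A = hough_planes(pts)
```

Even this toy version shows the cost structure discussed above: the triple loop is |X| · Dψ · Dθ votes, and the accumulator array is the memory bottleneck.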

RANSAC

The random sample consensus (RANSAC) is useful for finding planes in data with a large number of outliers. The basic outline of the algorithm is given in Algorithm 3. The algorithm picks three random points and fits a plane through them. The support of this candidate plane is then measured as the number of inliers (Ninliers): points that lie within a threshold set to approximately the noise level of the plane. This process is repeated for Nrounds rounds, after which the plane with the most support, and therefore the fewest outliers, is returned as the detected plane.

Algorithm 3 Basic RANSAC for plane fitting
1: procedure RANSAC(X)
2:   for all Nrounds do
3:     select three random points (P1, P2, P3) ∈ X
4:     fit a plane through the points
5:     set Ninliers = 0
6:     for all points do
7:       if the point's distance to the plane is within a threshold then
8:         increment Ninliers
9:     if Ninliers is larger than for the best plane then
10:      update plane estimate using all points
11:      update best plane
12:  return best plane
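Algorithm 3 can be sketched in a few lines (illustrative NumPy code; the threshold and round count are made-up values, and the final refit over all inliers is omitted):

```python
import numpy as np

def ransac_plane(X, n_rounds=200, thresh=0.05, seed=0):
    """Basic RANSAC plane fitting: keep the three-point hypothesis
    with the most inliers."""
    rng = np.random.default_rng(seed)
    best_count, best_plane = 0, None
    for _ in range(n_rounds):
        p1, p2, p3 = X[rng.choice(len(X), 3, replace=False)]
        n = np.cross(p1 - p2, p1 - p3)
        norm = np.linalg.norm(n)
        if norm < 1e-12:                 # degenerate (collinear) sample
            continue
        n = n / norm
        d = float(np.dot(p1, n))
        count = int(np.sum(np.abs(X @ n - d) < thresh))
        if count > best_count:           # keep the hypothesis with most support
            best_count, best_plane = count, (n, d)
    return best_plane, best_count

# 30 points on z = 0 plus 5 gross outliers at z = 5.
rng = np.random.default_rng(1)
inl = np.column_stack([rng.uniform(0, 2, (30, 2)), np.zeros(30)])
out = np.column_stack([rng.uniform(0, 2, (5, 2)), np.full(5, 5.0)])
(n, d), count = ransac_plane(np.vstack([inl, out]))
```

With 200 rounds the probability of never drawing an all-inlier triple is vanishingly small, so the gross outliers have no influence on the recovered plane.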

The memory requirements of RANSAC are very low, as only the best plane estimate needs to be stored at any time. The computational complexity of RANSAC is determined by the number of rounds, Nrounds, needed to obtain a solution with a certain probability, p. This also depends strongly on the inlier ratio, ρi (the fraction of points that are inliers). If m is the number of points needed to form a hypothesis⁴, then the probability that a sample of m points contains at least one outlier is 1 − ρi^m. Thus the probability that an all-inlier sample is never selected is

1 − p = (1 − ρi^m)^Nrounds.

Rewriting this, the number of rounds needed to obtain a solution with probability p is

Nrounds,m = log(1 − p) / log(1 − ρi^m). (2.7)
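Equation 2.7 is straightforward to evaluate; as a small helper (the parameter values are illustrative):

```python
import math

def ransac_rounds(p=0.99, inlier_ratio=0.5, m=3):
    """Equation 2.7: rounds needed to draw at least one all-inlier sample
    with probability p (m = 3 points for a plane, 2 for a line)."""
    return math.ceil(math.log(1 - p) / math.log(1 - inlier_ratio ** m))

ransac_rounds(0.99, 0.5, 3)   # -> 35 rounds
```

The number of rounds grows rapidly as the inlier ratio drops, which is exactly why sparse, outlier-heavy point clouds push RANSAC toward more rounds.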

The number of rounds needed is also highly dependent on the number of points needed to obtain a hypothesis. The ratio between two methods that require different numbers of points simplifies to

Nrounds,ma / Nrounds,mb = log(1 − ρi^mb) / log(1 − ρi^ma). (2.8)

The amount of time a single round takes depends on the time required to obtain a hypothesis from the candidate points, th, and the time required to evaluate this hypothesis per point, te. Thus the total time, ttotal, that RANSAC takes can be calculated as

ttotal = Nrounds(th + |X| · te). (2.9)

The time required for fitting a plane to three points is negligible with respect to the evaluation of thepoint cloud. The amount of time required for the plane detection adaptation of RANSAC can thusbe simplified to

ttotal ≈ Nrounds · |X| · te. (2.10)

Planar homography

Since obtaining a dense point cloud from a stereo image can be expensive, it might be beneficial to use only a sparse set of points for plane detection. Methods taking this approach use the same matching pipeline as described in section 2.1.2 (see figure 2.3). After triangulating the points, one of the above-mentioned algorithms can be used. This step can however be skipped by directly computing a homography, as introduced by Vincent and Laganière [39]. This method directly calculates a homography between images: a projective transformation represented as a 3 × 3 matrix H. The homography relates two points belonging to the same plane between two images, defined as

xb = Hab xa, (2.11)

where xa and xb are homogeneous vectors describing the same point⁵ in two different images. Since H has 8 degrees of freedom (DOF), 4 point matches determine H. RANSAC is employed to find the homography, counting as inliers the point pairs that agree with it; in essence, every point pair that satisfies

‖Hab xa − xb‖ < ε,

with ε being some threshold. The main advantage of this method is that no triangulation or stereo matching step needs to be performed. A large disadvantage is that the method assumes that planes are dominant in the image. Because of this, and since the points are already triangulated (as done in the tracking and mapping algorithm explained in section 2.1), detecting a planar homography for plane detection is not considered in the rest of this thesis. The other methods are investigated further in the next section.
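For completeness, the homography inlier test can be sketched as follows (a NumPy illustration that dehomogenizes before taking the norm, a common variant of the test above; eps is a made-up pixel threshold):

```python
import numpy as np

def homography_inliers(H, pts_a, pts_b, eps=2.0):
    """Mark point pairs consistent with x_b = H_ab x_a (eq. 2.11).
    pts_a, pts_b are (N, 2) pixel coordinates."""
    xa = np.hstack([pts_a, np.ones((len(pts_a), 1))])   # homogeneous coords
    xb_hat = (H @ xa.T).T
    xb_hat = xb_hat[:, :2] / xb_hat[:, 2:3]             # back to pixels
    return np.linalg.norm(xb_hat - pts_b, axis=1) < eps

# With H = I, only the unmoved pair counts as an inlier.
H = np.eye(3)
pts_a = np.array([[0.0, 0.0], [10.0, 5.0]])
pts_b = np.array([[0.0, 0.0], [15.0, 5.0]])
mask = homography_inliers(H, pts_a, pts_b)
```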

2.2.3 State of the art in plane detection

This section describes the state of the art in plane detection in point cloud data.⁶ More importantly, however, it considers how these algorithms would perform on the sparse point cloud produced by the tracking and mapping algorithm discussed in section 2.1. The main difference compared with a dense point cloud is that the sparse point cloud contains more noise and outliers, which will be considered for each of the algorithms discussed.

⁴ Three points for a plane and two points for a line.
⁵ Some expand this by finding lines as well [21].
⁶ Again, not considering expectation maximization [19].


Region Growing

The region growing method for plane detection was first proposed by Hänel et al. [13] as a way of retrieving a compact 3-D model of the environment. The goal of this method is to create low-complexity models that can be rendered in real-time. The first part of the algorithm uses region growing for the detection of planar surfaces. The algorithm relies on dense, globally consistent but locally noisy data, the result of using a laser range finder. The implementation by Hänel et al. [13] simplified a point cloud of more than 100 000 points in the order of minutes, which is not suitable for near real-time processing.

In the work of Poppinga et al. [28] the mathematical formulation of the problem is revised into an incremental version for the calculation of both the plane fit and the mean square error of this fit. The algorithm was designed especially for range images, which resulted in a computation time of only 0.2 seconds per snapshot. This work uses a time-of-flight camera, as it has a better update frequency (25 Hertz). As such, a compact model of the environment was created. Xiao et al. [41] expand this work even further by introducing a new seed selection procedure and a grid-based region growing algorithm. This method, however, assumes that most of the data is planar.

One of the main problems with region growing for our purposes is that it requires dense data. Also, even though Poppinga et al. [28] claim good robustness to noise, this is only relative to 3-D laser scanners.⁷ For the complexity and proper behavior of the algorithm, the assumption is also made that most points do not lie on the intersection of multiple planes; when using features, however, most key points are expected to lie on such intersections. As such, region growing is not the method of choice for plane detection in sparse and noisy data.

Hough transform

The two main problems addressed in the 3-D Hough transform are the computational cost of filling the accumulator (i.e. incrementing its cells) and the representation of the accumulator. Proposed solutions to both problems are discussed in the work by Tarsha-Kurdi et al. [34]. Filling the accumulator can be sped up by using only a subset of the point cloud, as is done in the probabilistic Hough transform (PHT). The adaptive probabilistic Hough transform (APHT) improves on this by monitoring the potential maximum cells and terminating once the list of maxima remains stable. The progressive probabilistic Hough transform (PPHT) improves the PHT by filtering out accumulation that is due to noise, removing points belonging to detected planes. An advantage of this algorithm is that its results can be obtained at any time. Lastly, the randomized Hough transform (RHT) picks three points from the input space and maps them to one point in the Hough space. If the points selected by the RHT are very far apart, the algorithm assumes they are most likely not part of the same plane and discards the candidate. Table 2.1 summarizes where the methods differ.

                         SHT   PHT   PPHT   APHT   RHT
deterministic             +     -     -      -      -
has stopping rule         -     -     +      +      +
adaptive stopping rule    -     -     +      ++     -
deletes detected planes   -     -     +      -      +
ease of implementation    +     +     +      -      +

Table 2.1: Distinctive characteristics of Hough transform methods [34]

In their comparison of the different methods, Tarsha-Kurdi et al. found that the RHT benefits from the simplicity of the data they used. As a consequence the RHT outperforms the other methods in terms of runtime. The PPHT yields the most accurately estimated planes. When evaluated against the region growing algorithm of Poppinga [28], the RHT performs better with respect to runtime. They conclude that the RHT performs better at finding large planes, while the region growing method finds these planes more accurately at the cost of detecting several smaller planes.

The second improvement to the Hough transform is in its accumulator design. The cells in the standard accumulator, as shown in figure 2.6a, are smaller near the poles and larger near the equator. As such, voting favors cells along the equator. Censi and Carpin [6] propose to project the unit sphere onto a cube, as shown in figure 2.6b. A major drawback of this approach is the irregularity between patches on the unit sphere, which makes the algorithm favor planes represented by larger cells. Borrmann et al. [4] propose to solve this by varying the resolution in terms of polar coordinates dependent on the position on the sphere, as illustrated in figure 2.6c. The drawback of this representation is that it has problems with detecting planes at the poles (planes parallel to the xy-plane in figure 2.6c). Once a plane has been detected, each accumulator design produces the same accuracy for the plane fit.

⁷ 3-D laser scanners can provide a very accurate depth measurement but have a very low refresh rate.

Figure 2.6: Mapping of the accumulator designs onto the unit sphere: (a) array, (b) cube, (c) ball.

The advantage of this method is that the information of the points is stored in parameter space, so newly found points can easily be added to the estimate. This is especially useful in the tracking and mapping algorithm, as the point cloud constantly grows and a stored Hough transform alleviates the need to recompute on previously mapped points. The ability of the algorithm to handle noise depends on the coarseness of the discretization (less noise allows a finer discretization). Even with the alternative accumulator designs, this results in large memory usage. The fastest method is the RHT, at the cost of a slightly less accurate estimate than the PPHT.

RANSAC

The adaptations of RANSAC are discussed in detail in the work of Choi et al. [7]. Each adaptation improves RANSAC in one of the following areas: accuracy, speed or robustness. To optimize accuracy, an algorithm seeks to improve the fit produced by RANSAC with respect to the ground truth. The most common approach is to adjust the loss function (the weighting of inliers and outliers), i.e. line 8 in Algorithm 3. Regular RANSAC uses a discrete loss function with a constant loss outside the bound and zero loss within the bound. MSAC (M-estimator SAC) and MLESAC (maximum likelihood SAC), proposed by Torr and Zisserman [35], use a more complex loss function, as shown in figure 2.7, which improves the accuracy of the fit. Another method to improve the fit is local optimization, as explained by Chum et al. [9]: the fit is recomputed using gradient descent after the algorithm has found the optimal plane.

Figure 2.7: Loss functions.

Speed is improved by either guided sampling or partial evaluation. With guided sampling, candidate points are selected according to some rule so that there is a higher probability of selecting inliers. The progressive sample consensus (PROSAC), introduced by Chum and Matas [8], does this by using a prior on the points used for selection. With partial evaluation, a hypothesis is first evaluated on a small subset of the data before it is considered for full evaluation, as in Randomized RANSAC [24].

For plane detection, Tarsha-Kurdi et al. [34] use an adaptation of RANSAC. They argue that for their application RANSAC is the method of choice and propose a couple of adaptations to improve the algorithm. One of these adaptations uses the standard deviation to improve the plane score, which improves the success rate of the algorithm.

The largest advantage of RANSAC is its robustness to outliers, which is especially important in noisy data. A disadvantage is that ρi (which is needed to estimate the number of rounds, as explained in section 2.2.2) is usually unknown; still, it can be estimated to give an indication of the required number of rounds. RANSAC also extends to more complex geometric structures, as shown by Schnabel et al. [30]. The only relevant difference between the RHT and RANSAC is that the RHT stores the plane estimates in parameter space, which is not very beneficial because it increases memory usage. Thus this thesis concludes that RANSAC is the most appropriate method for plane detection in a noisy and sparse point cloud. Furthermore, an adaptation of the algorithm should take advantage of the specific problem; this is discussed further in the next chapter, in section 3.3.


Chapter 3

Test Setup

This chapter focuses on the setup needed for testing the algorithms. The hardware comprises two setups: a video see-through headset and a new setup that includes an inertial measurement unit, introduced in section 3.1. A novel, modular software architecture that can use this hardware is outlined in section 3.2. Within the new architecture, scene reconstruction is added as a new module. The scene reconstruction includes two novel wall detection algorithms, described in section 3.3, that detect a wall by adapting the scoring of RANSAC. The last section explains the methods for evaluating these algorithms, which is done in the next chapter.

3.1 Hardware

Marty, a video see-through headset, is used in this work for visualization and tracking (see figure 3.1). Marty is a Sony-optics based Augmented Reality headset. It includes two Logitech C905 web-cams on the front that capture a video image of 640 × 480 pixels at a frame rate of 30 Hz. The world is shown to the user via a display and a set of lenses that can be adjusted to the user. The two web-cam feeds are displayed on the screen, giving the user a stereoscopic view of the real world. The images are displayed on a 1048 × 720 pixel display, where each eye sees the half of the display corresponding to one of the web-cam feeds. The setup is calibrated with a standard checkerboard pattern using OpenCV¹ library functions. The cameras are connected via USB and the display via an HDMI cable to a computer, which does the processing and visualization. For this setup a laptop was used (Asus K53SV with an Intel quad-core i7-2670QM at 2.2 GHz) on which the software, discussed in section 3.2, is executed. This headset has a relatively low cost compared to other headsets, partly because it does not include any sensors besides the two web-cams. The disadvantage of having only cameras is that tracking depends fully on them, which makes it sensitive to fast rotations of the user. The implementation's coordinate frame also lacks a reference to the world frame, in which the axes align with the direction of gravity and preferably one of the cardinal directions (i.e. north, west, south or east).

¹ www.opencv.org

Figure 3.1: The video see-through headset called Marty.


These problems can be solved by using an inertial measurement unit (IMU). An IMU measures orientation, velocity and gravitational forces using a combination of accelerometers, gyroscopes and magnetometers. By adding an IMU, a reference to the world frame is obtained. This can then be used in a reconstruction algorithm, as further explained in section 3.3.

Since Marty does not have an IMU, a new test setup was built that includes one, shown in figure 3.2. Since this setup is just for experimentation, no visualization was added. The orientation of the IMU with respect to the cameras was measured by hand. The translation is not used (IMUs are notorious for their drift) and thus not measured.

Figure 3.2: Stereo camera setup with an IMU.

Another benefit of the IMU is that its information can be combined with the tracking.² This would improve the robustness of the tracking algorithm to fast rotations. This is however left for future work and not discussed further in this thesis.

3.2 Software architecture of the Augmented Reality application

The core of Augmented Reality research lies in tracking and mapping the environment. An application,however, requires more. The minimum requirements of an application include:

• Camera input

• Tracking and Mapping

• Visualization

• User input handling

In research most applications, including the work of Akman [1], are built to demonstrate the performance of the tracking and mapping algorithm. As a consequence, most systems tightly couple the tracking and mapping with other components of the system. This, however, makes it difficult to introduce and test new methods, and working with such a system requires a deep understanding of all of its components. To alleviate this, a new modular system is introduced in this thesis that is much simpler to work with and allows different modules to be coupled and uncoupled as needed.

Figure 3.3: Overview of a minimal system, with the module name in bold and variables below.

² This could be done using an implementation of the Kalman filter.


The novel software architecture for Augmented Reality consists of a set of modules that run on separate threads. To this end, the visualization, camera capturing and user input handling were removed from the Augmented Reality system of Akman [1], which was rewritten to fit into the new framework as a tracking module. A new camera capture module and a new module for visualization and user input were written and added to the system. An overview of this minimal video see-through system is given in figure 3.3.

The main benefit of this system is that it is easily adaptable to different types of Augmented Reality systems. The specific setup used in this thesis is built for video see-through, but could be adapted to any other Augmented Reality application by changing the modules. A second important improvement on the old system is its modularity, which simplifies understanding it: the tracking and mapping module, for instance, does not need to be fully understood by an expert working on the visualization module. Another advantage of this architecture is that modules are interchangeable. If, for instance, a new tracking and mapping algorithm becomes available, it can easily be incorporated by changing only the tracking module; the same holds for the other existing modules. The system also supports the addition of new modules. For example, a module for multi-cue hand detection as described by Akman [1] could be added to allow the user to interact more intuitively with the environment. Not only are these modules separated conceptually, they are also separated on the processor itself.³ This enables the modules to produce their information asynchronously, which is especially important in an Augmented Reality application, as the visualization should be real-time (see section 1.1.2).

The system shown in figure 3.3 works as follows. The master thread initializes all other threads and couples the shared variables. It then monitors these threads until the visualization thread is finished, at which point the master thread interrupts all other threads and stops execution of the program. While the program is running, the camera module handles one or more camera inputs. These images are then used by the tracking and mapping module to perform the tracking. The visualization module handles the merging of camera images and graphics, using the position and rotation information from the tracking module to merge the two properly. The visualization module also handles input from the user, which can be used to interact with the visualization and to send commands to other modules (e.g. resetting the tracking).
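The master/module pattern just described can be sketched as follows (a minimal Python threading illustration with hypothetical class names; the actual system is written differently and exchanges real images and poses):

```python
import queue
import threading
import time

class Module(threading.Thread):
    """A module running on its own thread; the master stops it via stop_event.
    Daemon threads are reaped when the master exits (sketch only)."""
    def __init__(self, inbox=None, outbox=None):
        super().__init__(daemon=True)
        self.inbox, self.outbox = inbox, outbox
        self.stop_event = threading.Event()

    def run(self):
        while not self.stop_event.is_set():
            self.step()

    def step(self):
        raise NotImplementedError    # each concrete module overrides this

class CameraFeed(Module):
    def step(self):
        time.sleep(1 / 30)                  # mimic a 30 Hz camera
        self.outbox.put("stereo pair")      # stand-in for a pair of images

class TrackingAndMapping(Module):
    def step(self):
        images = self.inbox.get()           # consume images, publish a pose
        self.outbox.put(("pose", images))

# Master: create the shared variables (queues), start the modules,
# read one result and shut everything down.
images, poses = queue.Queue(), queue.Queue()
modules = [CameraFeed(outbox=images),
           TrackingAndMapping(inbox=images, outbox=poses)]
for m in modules:
    m.start()
kind, _ = poses.get(timeout=2)
for m in modules:
    m.stop_event.set()
```

The queues play the role of the shared variables in figure 3.3: each module produces its output asynchronously, so a slow producer never blocks the visualization.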

Figure 3.4: Overview of the system including the new reconstruction module and IMU module, with the module name in bold and variables below.

Since the main topic of this thesis is scene reconstruction, the system is extended with new modules that handle this, as seen in figure 3.4. The IMU module simply handles the IMU input. The reconstruction module uses the information created by the tracking and the IMU to update a model of the environment. This model can then be used by the visualization module to better couple the graphics to the environment.

³ As long as enough threads are available.


3.3 Scene reconstruction

As discussed in section 2.2.3, the RANSAC algorithm is chosen for detecting planes in a sparse point set with noise. However, merely detecting planes does not provide much extra information. Instead, a novel wall detection algorithm is devised, as described in this section.

3.3.1 Detection of a single wall

This section introduces a novel wall detection method that uses the sparse map created by the tracking and mapping algorithm explained in section 2.1. The assumption is made that the user of the Augmented Reality system is located in a room with perpendicular walls. A wall is defined as a bound of the room, where each wall has a position, orientation, extent (the surface of the wall) and a front and back side.

To distinguish between planar objects and walls, the assumption is made that features are not found on the other side of a wall, but only in front of it.⁴ This assumption divides the set of outliers in two. First, normal outliers: points correctly located in front of, but not part of, the wall. Second, backliers: points incorrectly matched (or matched due to reflections or windows) that end up behind the wall. This thesis introduces two new scoring methods that take these backliers into account. The first method subtracts from the score the number of backliers, Nbackliers, scaled by a factor that accounts for the probability that backliers are expected. In essence, the side of the loss function described in section 2.2.3 corresponding to the back of the wall is elevated. The new calculation of the score is given as

score = Ninliers − α · Nbackliers, (3.1)

where each inlier adds to the score, whereas each backlier diminishes the score by the backlier scaling factor (BSF), α, which accounts for the amount of noise in the data. Because the score is only used to rank planes relative to each other, its scale is arbitrary; the inlier term therefore does not need its own weight. Thus, if too many features lie on the other side of a candidate wall, the candidate is discarded. The BSF adjusted RANSAC (BSF-RANSAC) algorithm is outlined in Algorithm 4.

Algorithm 4 Adapted RANSAC for wall detection
 1: procedure FindWall(X)
 2:   X = the point cloud
 3:   for all rounds do
 4:     select three random points (P1, P2, P3) ∈ X
 5:     fit a plane through the points
 6:     if the plane goes behind the camera then
 7:       continue to next round
 8:     for all points in X do
 9:       if point distance to the plane < threshold then
10:         increment number of inliers
11:       if point lies behind the plane then
12:         increment number of backliers
13:     compute the score using equation 3.1 or 3.4
14:     if score > best score then
15:       update plane estimate using all points
16:       store plane as best wall
17:   return Wall
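As an illustration, the scoring loop of Algorithm 4 with the score of equation 3.1 can be sketched in Python. This is a minimal sketch, not the thesis implementation: the camera is assumed to sit at the origin looking towards the wall, the behind-camera heuristic and the final refit of step 15 are omitted, and all names are illustrative.

```python
import random
import numpy as np

def fit_plane(p1, p2, p3):
    """Plane through three points as (unit normal n, support point p0)."""
    n = np.cross(p2 - p1, p3 - p1)
    norm = np.linalg.norm(n)
    if norm == 0.0:
        return None  # degenerate (collinear) sample
    return n / norm, p1

def bsf_ransac(points, rounds=1000, threshold=0.1, alpha=-1.0):
    """BSF-RANSAC scoring (equation 3.1): score = N_inliers + alpha * N_backliers.

    A point counts as a backlier when it lies on the far side of the
    candidate plane, as seen from the camera at the origin."""
    best_score, best = -np.inf, None
    for _ in range(rounds):
        sample = random.sample(range(len(points)), 3)
        plane = fit_plane(*points[sample])
        if plane is None:
            continue
        n, p0 = plane
        # Orient the normal towards the camera (origin).
        if np.dot(n, -p0) < 0:
            n = -n
        # Signed distance of every point; positive = camera side of the plane.
        d = (points - p0) @ n
        inliers = np.abs(d) < threshold
        backliers = d <= -threshold  # behind the candidate wall
        score = inliers.sum() + alpha * backliers.sum()
        if score > best_score:
            best_score, best = score, (n, p0, inliers)
    return best
```

With α = 0 this degenerates to standard RANSAC; with a negative α an occluding plane, which leaves the true wall points behind it as backliers, is penalized and the wall wins.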

The second method proposed for adapting the scoring measure is to use the binomial distribution to describe the probability, P, of observing a certain number of backliers. Assuming that the probability of getting a backlier, ρ_b, is proportional to the number of inliers, the probability of getting a certain number of backliers can be calculated using the probability mass function of the binomial distribution,

P(N_backliers | N_inliers, ρ_b) = B(N_backliers, N_inliers + N_backliers, ρ_b), (3.2)

⁴ This can result in problems when large windows or reflective objects are part of the scene.




where the probability mass function of the binomial distribution, B, is given as:

B(k, n, ρ) = C(n, k) ρ^k (1 − ρ)^(n−k), (3.3)

with C(n, k) the binomial coefficient.

The new score is proportional to the number of inliers and to the probability of obtaining a certain number of backliers. Assuming these variables are independent, a novel score can be formulated as their product:

score = N_inliers · B(N_backliers, N_inliers + N_backliers, ρ_b). (3.4)

The only unknown is the probability of getting a backlier, ρb. This can be tuned depending on theamount of noise in the data.
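As an illustration, equations 3.3 and 3.4 can be computed directly; the sketch below is a minimal Python version in which the helper names are illustrative, not taken from the thesis software.

```python
from math import comb

def binomial_pmf(k, n, rho):
    """Probability mass function B(k, n, rho) of equation 3.3."""
    return comb(n, k) * rho**k * (1.0 - rho)**(n - k)

def b_ransac_score(n_inliers, n_backliers, rho_b):
    """B-RANSAC score (equation 3.4): the number of inliers weighted by
    the probability of observing exactly this many backliers, given the
    assumed backlier probability rho_b."""
    return n_inliers * binomial_pmf(n_backliers,
                                    n_inliers + n_backliers, rho_b)
```

For example, with ρ_b = 0.2 a candidate with 100 inliers and 25 backliers (close to the expected count) outscores a candidate with 120 inliers but 60 backliers, since the latter backlier count is far in the tail of the distribution.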

The binomially adapted RANSAC (B-RANSAC) algorithm using this new scoring method is also outlined in Algorithm 4. Both algorithms follow the same approach as RANSAC, except that they use a different scoring (equation 3.1 or 3.4). They also include a heuristic that discards candidate planes when they go behind the camera, as the user is assumed to be oriented at least in the general direction of the wall (±90°).

To increase the accuracy and speed of the algorithm, an IMU can be used. The IMU is used to rotate the coordinate frame to the world frame, so that one of the axes aligns with the direction of gravity. The fact that a wall is almost always parallel to the direction of gravity can then be exploited: the point cloud can be projected onto a two-dimensional ground plane, reducing the dimensionality of the problem from three to two. The biggest improvement is obtained when the ratio of inliers to outliers is lowest. The exact relation is shown in figure 3.5, as calculated using equation 2.8.
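The rotation and projection step can be sketched as follows. This is a minimal illustration assuming the IMU supplies a rotation matrix from the camera frame to a gravity-aligned world frame whose z-axis points along gravity; the function name is illustrative.

```python
import numpy as np

def project_to_ground_plane(points, R_world_from_cam):
    """Rotate camera-frame points (N x 3) into a gravity-aligned world
    frame and drop the vertical axis, reducing the wall search from 3-D
    planes to 2-D lines. The world z-axis is assumed to align with gravity."""
    world = points @ R_world_from_cam.T   # rotate each point row-wise
    return world[:, :2]                   # keep the horizontal (x, y) part
```

After this projection, a wall appears as a line in the 2-D point set, so each RANSAC hypothesis needs only two sample points instead of three, which is the source of the reduction in rounds shown in figure 3.5.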

Figure 3.5: Improvement in the number of required rounds, shown on a log scale as the ratio N2/N3, as a function of the probability of finding an inlier, ρi.

3.4 Method for evaluation

In chapter 4 the algorithms are evaluated on accuracy and robustness. The accuracy of an algorithm is measured with respect to the ground truth. This ground truth is obtained by tracking a marker using the ARToolKit.⁵ The accuracy of estimating the ground truth therefore depends on the accuracy of the ARToolKit, which was measured in the work of Kato and Billinghurst [16], as shown in figure 3.6. Figure 3.6a shows the errors in position estimation of the marker for different distances and rotations (slants). Figure 3.6b shows the accuracy of the rotation (slant) estimation for different distances. The marker used in those experiments has a size of 80 × 80 millimeters and was filmed at a camera resolution of 263 × 234 pixels. The marker used for evaluation in this thesis has a size of 177 × 177 millimeters and was filmed at a camera resolution of 640 × 480 pixels. Both the size of the marker and the resolution (measured in pixels per surface element) are roughly twice as large, which means that the distance to the marker can be four times as large while staying within the same error bounds. Thus, a pose of 45 degrees and a maximal distance of 1 meter were chosen, using the information in figure 3.6. This ensures a maximal position estimation error of 3 ± 1 millimeters and a slant estimation error within 0.5 ± 0.5 degrees.

⁵ www.hitl.washington.edu/artoolkit/




(a) Error in position estimation. (b) Error in slant estimation.

Figure 3.6: Error in position and slant estimation of the marker (80 × 80 millimeters) in an image with resolution 263 × 234 pixels [16].

The solution of the algorithm needs to be compared to the ground truth. This is difficult when it comes to planes, as there are multiple distance measures between two planes. Planes differ in both rotation and position (i.e. distance from the origin), where one is independent of the other, as can be seen in equation 2.4. For purely mathematical planes there is no single clear distance between two planes.

However, a wall has a bounded surface that can be used to devise a distance measure. One method for making the wall local is to define a bound for the location of the wall. The error measure could then be a function of the area and the distance to the ground truth,

error = (1/A) ∫_S f(x) dS. (3.5)

Here A is the area, S the surface and f(x) a function of the distance between both planes. The problem with this approach is that the choice of the bound is quite arbitrary: should the entire wall be used, or only the part that is visible? Instead, the method of choice is to measure the error using the inliers, p, selected by the RANSAC-based algorithm and their distance to the plane. This gives a reasonable localization of the plane, which is defined by a normal vector, n_e, and a point on the plane, p_0,e. The variance of the inliers perpendicular to the estimated plane should not be taken into account; therefore the inliers are first projected onto the estimated plane using

p′ = p − n_e((p − p_0,e) · n_e),

where p′ is the projected point on the plane. Using these projected inliers, the performance is measured as the root-mean-square error (RMSE) of their distances to the true plane:

RMSE = sqrt( (1/N_inliers) Σ_{p′ ∈ I} ((p_0,g − p′) · n_g)² ). (3.6)

In this equation, I is the set of projected points, n_g is the normal of the ground truth plane and p_0,g is a point on this plane.
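The projection and RMSE computation of equation 3.6 can be sketched as below. This is a minimal Python illustration assuming both normals are unit vectors; the function name is not from the thesis software.

```python
import numpy as np

def plane_rmse(inliers, n_e, p0_e, n_g, p0_g):
    """Evaluation metric of equation 3.6.

    The inliers (N x 3) are first projected onto the estimated plane
    (n_e, p0_e); the RMSE of the signed distances of these projected
    points to the ground-truth plane (n_g, p0_g) is then returned."""
    # Signed distances of the inliers to the estimated plane.
    d_e = (inliers - p0_e) @ n_e
    # Project each inlier onto the estimated plane: p' = p - n_e * d.
    projected = inliers - np.outer(d_e, n_e)
    # Distances of the projected points to the ground-truth plane.
    d_g = (projected - p0_g) @ n_g
    return np.sqrt(np.mean(d_g**2))
```

Projecting first removes the spread of the inliers perpendicular to the estimated plane, so the metric measures only the mismatch between the two planes over the region where inliers were found.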

The second measure on which the algorithms are evaluated is robustness. This is measured as the number of successes, N_success, divided by the number of trials, N_trials:

robustness = N_success / N_trials. (3.7)

The next chapter uses these measures to evaluate the algorithms described in section 3.3 on artificially created data and on data obtained using the test setup explained in section 3.1.



Chapter 4

Results

In this chapter the novel algorithms proposed in section 3.3, BSF-RANSAC and B-RANSAC, are evaluated in order to show their benefits with respect to the state of the art. To assess the robustness of the algorithms, tests are first performed on artificially generated test data. Secondly, the algorithms are tested on real data obtained by extracting the map from the tracking and mapping software, explained in section 2.1, using the novel framework from section 3.2. Finally, this thesis explores the benefits of adding an IMU to the system. The algorithms are evaluated using the proposed evaluation methods from section 3.4.

4.1 Evaluation on artificial data

The robustness of the proposed algorithms is evaluated rather than their speed or accuracy. Since the proposed algorithms only adapt the scoring of the RANSAC algorithm, the inliers selected for a given candidate plane are unaffected. Thus, if a candidate plane is selected by the algorithm, the fit is identical for both RANSAC and the proposed algorithms, and the accuracy will not differ. The effect on speed is minimal, as the extra computation consists of the adapted score and the backlier count, both of which are negligible: the score is computed only once per round, and the distance to the plane, already calculated for the inlier test, can be reused to determine whether a point is a backlier. The robustness, however, is affected, as the concept of the best plane changes. For RANSAC the optimal plane is the one with the highest number of inliers. The proposed algorithms instead use the assumption that no points are located behind a wall; they will therefore score plane candidates differently and find a different optimal plane, most probably the wall.

The probability that a detected plane is a wall is influenced by the number of backliers. This is modeled by the algorithm's parameter: the backlier scaling factor, BSF (equation 3.1), in the case of BSF-RANSAC, and the backlier probability (equation 3.4) in the case of B-RANSAC. The backlier probability, introduced in section 3.3.1, is used as a measure of the number of backliers, as it is assumed to be independent of the size of the point cloud. The number of backliers cannot be known in advance; however, an estimate can be made. The following tests examine how accurate this estimate has to be to obtain a robust algorithm. In essence, the relation between the backlier probability of the data and the parameter of the algorithm is determined. Because it is difficult to obtain real test sets with different distributions, this thesis proposes to generate the data artificially. This allows the exact backlier probability to be controlled and has the advantage that a large set of test data can be generated.

The artificial data contains a wall, a plane, outliers and backliers. The wall is placed 2 meters from the origin, its normal is set in the z-axis direction and it has a square surface of 4 by 4 meters. A plane in front of the wall is added at 1 meter from the origin, with the same normal and a square surface of 1 by 1 meter. Points on the wall and the plane are uniformly distributed over the surface and given a uniformly distributed error of 0.1 meter perpendicular to the plane; the inlier threshold of the RANSAC algorithm is therefore also set to 0.1. A set number of backliers is added uniformly behind the wall. Lastly, extra noise in the form of uniformly distributed outliers is added in front of the wall. An example with a backlier probability of 0.2 is shown in figure 4.1, where the four types of points are plotted as an illustration of a typical distribution.
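A generator for such a point cloud might look as follows. This is a minimal sketch of the construction described above; the exact point counts and sampling used in the thesis are not specified, so the defaults here are illustrative.

```python
import numpy as np

def make_test_cloud(n_wall=200, n_plane=300, n_out=100, n_back=50,
                    noise=0.1, seed=0):
    """Artificial point cloud as described in section 4.1: a 4 x 4 m wall
    at z = 2 m, a 1 x 1 m occluding plane at z = 1 m, uniform outliers in
    front of the wall and backliers behind it."""
    rng = np.random.default_rng(seed)

    def patch(half, n, z):
        # Uniform points on a square patch with perpendicular noise.
        return np.column_stack([rng.uniform(-half, half, n),
                                rng.uniform(-half, half, n),
                                z + rng.uniform(-noise, noise, n)])

    wall = patch(2.0, n_wall, 2.0)      # 4 x 4 m wall, 2 m from the origin
    plane = patch(0.5, n_plane, 1.0)    # 1 x 1 m plane in front of it
    outliers = np.column_stack([rng.uniform(-2, 2, n_out),
                                rng.uniform(-2, 2, n_out),
                                rng.uniform(0.0, 2.0 - noise, n_out)])
    backliers = np.column_stack([rng.uniform(-2, 2, n_back),
                                 rng.uniform(-2, 2, n_back),
                                 rng.uniform(2.0 + noise, 3.5, n_back)])
    return wall, plane, outliers, backliers
```

Varying `n_back` relative to the other counts controls the backlier probability of the generated set.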

The test data is constructed in such a way that standard RANSAC will fail. This is done by adding more




(a) Three-dimensional plot. (b) Side view.

Figure 4.1: The artificially generated point cloud data with a set backlier probability, ρ_b = 0.2. The four point types (wall, plane, outliers, backliers) are plotted separately.

inliers in the plane in front of the wall than in the wall itself. By constructing the point cloud in this manner it can be shown when the proposed algorithms improve on standard RANSAC. By adapting the parameters of the scoring methods, either the wall or the plane will be selected, i.e. the algorithm will succeed or fail. The next two sections show and discuss the results of testing BSF-RANSAC and B-RANSAC on the artificially created test data, for different values of each algorithm's parameter and of the backlier probability of the dataset.

4.1.1 Robustness of BSF-RANSAC

The backlier scaling factor, α, used in the scoring formula 3.1, is evaluated against different values of the backlier probability, ρ_b, of the artificial data. The algorithm is run for enough iterations to obtain a certainty of 99% that three inliers of the optimal plane are selected at least once. The number of rounds needed is calculated using equation 2.7.¹


Figure 4.2: Test on artificially generated data. Successes are plotted as white and failures as black, for different backlier probabilities, ρ_b, of the data, as a function of the backlier scaling factor, α, set in the algorithm.

The results shown in figure 4.2 consist of white and black squares, where each square corresponds to a success or a failure to detect the wall, respectively. Each square also corresponds to a specific BSF set in the algorithm and a specific backlier probability of the data.

When the BSF of the algorithm approaches zero, its scoring approaches that of standard RANSAC and it fails to detect the wall, as can be seen on the right of figure 4.2. This shows that with a low BSF the algorithm barely takes the backliers into account and performs almost the same as RANSAC.

¹ Since the ratio of inliers is known, the equation can be solved.




When a sufficiently large BSF is chosen, the score of the wall is higher than the score of the plane, because the backliers are taken into account. The exact value of the BSF at which this happens depends on the number of true inliers of the wall, the number of points between the two planes, N_noise, and the difference in the number of true inliers between the planes, as follows:

α = (N_inliers,wall − N_inliers,plane) / (N_noise + N_inliers,wall).

A large (negative) BSF also corresponds to a low probability of finding backliers, and thus a plane with many backliers scores lower. When the magnitude of the BSF is too large, however, backliers are penalized too heavily and outweigh the inliers. This is seen in figure 4.2 as failures to detect the wall for large values of the BSF.

The optimal choice for the BSF in this case is a value of about −0.5. This can be seen in figure 4.2, where the algorithm finds the wall even for large backlier probabilities, i.e. even when the number of backliers is larger than the number of points on the plane (ρ_b > 0.5). From these results it can be concluded that the BSF should be chosen with a magnitude large enough that occluding planes are reliably discarded, while small enough that the backliers do not outweigh the inliers, i.e. the score does not approach zero.

4.1.2 Robustness of B-RANSAC

The BSF is unintuitive to choose, as it does not correlate directly with the distribution of the point cloud. Therefore B-RANSAC was proposed, in which the backlier probability, ρ_b, can be set directly using equation 3.4. The robustness of B-RANSAC is thus also evaluated against the backlier probability of the artificial data, in order to see whether both values are directly correlated and how accurate the estimate has to be to obtain a correct scoring.


Figure 4.3: Test on artificially generated data. Successes are plotted as white and failures as black, for different actual backlier probabilities, ρ_b, as a function of the backlier probability set in the algorithm.

The results in figure 4.3 show the expected linear correlation between the estimated and actual backlier probability. Figure 4.3 also shows that the algorithm still succeeds in detecting the wall when the backlier probability is not estimated exactly.

Along the line of this correlation there is a band in which the algorithm detects the wall. The larger this band, the better, as it means that the algorithm tolerates a larger error in the estimate of the backlier probability. As can be seen from figure 4.3, this tolerance band decreases as the backlier probability increases. Hence, the lower the amount of noise in the data, the less accurate the backlier probability estimate needs to be.




4.2 Real data

The results in the previous section showed that the proposed algorithms can, in theory, improve the robustness of RANSAC. The artificial data, however, is only an approximation of the real world. This section attempts to validate the previous results using real-world data. Since the optimal threshold of RANSAC, used to differentiate between inliers and outliers, is unknown for real-world data, the threshold is first estimated from real data in section 4.2.1. This threshold is then used in subsequent experiments that evaluate the robustness of BSF-RANSAC and B-RANSAC for different settings of the BSF and the backlier probability estimate, respectively. Finally, the benefit of adding an IMU to the system is investigated in section 4.3.

Both BSF-RANSAC and B-RANSAC are evaluated on two different point clouds: test set 1, containing many planar structures in the image and relatively few features, and thus few points on the plane, and test set 2, mostly consisting of points on a wall, with only some other planar structures. The real-world data was created by extracting the sparse map from the tracking and mapping algorithm explained in section 2.1. This was done in the novel framework from section 3.2, using the stereo camera setup with an IMU introduced in section 3.1. To obtain a measure of the accuracy, a ground truth is obtained by detecting a marker placed on the wall, as explained in section 3.4. When the marker is detected, the initial map is constructed and the initial IMU position is stored. This ensures that the marker, the IMU and the sparse map all use the same coordinate system. After the map is initialized, it is expanded by moving the stereo setup. The tracking and mapping algorithm performs bundle adjustment on the fly to improve the pose estimation of the map.

The accuracy of the algorithm is calculated as the root-mean-square error, RMSE (equation 3.6), using the distances to the ground truth plane of the feature points projected onto the plane estimate. It was found that the marker position estimate is biased by a certain factor, due to errors in the scale estimation of the calibrations. This was compensated for by manually aligning the corners of the marker with the corners found by the tracking algorithm.

4.2.1 Selecting the threshold

Selecting a threshold for RANSAC is important: when the threshold is too high, the algorithm cannot properly distinguish between inliers and outliers, whereas when it is too low, it does not select enough inliers and the algorithm needs more rounds to find the optimal solution, as seen from equation 2.7. To find the optimal threshold, the choice of threshold is evaluated against the root-mean-square error (RMSE), calculated using equation 3.6. A point cloud consisting of 654 points was created using the stereo camera setup with the Augmented Reality system. It was ensured that only one dominant plane, the wall with the marker, is present in the point cloud. Assuming at least 18 points² on the plane and requiring a 90% probability of finding the plane, equation 2.7 can be used to calculate the required number of rounds, which evaluates to roughly 100 000. The standard RANSAC algorithm is therefore executed for this many rounds, with thresholds varying from 0.1 to 20 millimeters.
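The round counts quoted in this chapter can be reproduced with the standard RANSAC round-count formula, which is assumed here to be the form of equation 2.7 (the function name is illustrative):

```python
from math import ceil, log

def required_rounds(p_success, inlier_ratio, sample_size=3):
    """Number of RANSAC rounds N so that at least one all-inlier sample of
    size k is drawn with probability p: N = log(1 - p) / log(1 - w^k),
    with w the inlier ratio (the assumed form of equation 2.7)."""
    return ceil(log(1.0 - p_success) / log(1.0 - inlier_ratio**sample_size))
```

For example, with 18 inliers out of 654 points and a 90% success probability this gives roughly 110 000 rounds, consistent with the "roughly 100 000" quoted above; the 63 000 rounds of section 4.2.2 and the 10 000 two-dimensional rounds of section 4.3 follow the same way.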

(a) The RMSE of the points on the plane. (b) The number of points on the plane.

Figure 4.4: Evaluation of the threshold parameter of RANSAC.

² This was estimated from some preliminary runs.




The results are shown in figure 4.4. In figure 4.4a the RMSE is plotted as a function of the threshold. The error bounds on the RMSE are calculated using the maximum slant estimation error, whereas the position estimation error is assumed to be negligible, since the data is scaled to optimally fit the marker center. The minimum and maximum error are calculated by uniformly re-sampling the normal of the ground truth plane within a cone of 0.5 degrees around the vector, recalculating the RMSE, and taking the minimum and maximum values as the respective error bounds. Figure 4.4b shows the number of inliers corresponding to the test shown in figure 4.4a. The increase in the number of inliers is roughly linear up to 4 millimeters, after which the RMSE increases. The optimal threshold in this case is thus 4 millimeters, as it ensures a high-accuracy estimate while retaining a sufficiently high probability of retrieving the optimal plane; this value is used in the experiments hereafter.

4.2.2 Selecting the backlier scaling factor

For optimal performance of BSF-RANSAC, the BSF, α, has to be set to a value that optimizes the robustness of the algorithm. To this end the algorithm is evaluated for different values of α on both test sets. Preliminary tests on test set 1 estimated the number of points on the plane to be at least 66, out of a point cloud of 1986 points. To find the plane with a probability of 90%, about 63 000 rounds of RANSAC are then needed, according to equation 2.7.

(a) Test with occluded wall. (b) Test without occluded wall.

Figure 4.5: The root-mean-square error of the points on the plane as a function of α.

The results shown in figure 4.5a show that the algorithm performs optimally with a BSF in the range of -1.6 to -0.9. When the BSF approaches zero, the same result as with RANSAC is found, namely the dominant plane instead of the wall. For BSF values of -1.7 and beyond, the RMSE increases, since a different plane is selected in place of the wall, as can be seen from the number of inliers. For even larger magnitudes of the BSF the score of the algorithm falls below 3; in that case three points at the perimeter of the point cloud would achieve a higher score than the best plane so far, depicted as a red cross in figure 4.5a.

To determine the optimal choice of the BSF, another test is done on the second dataset, in which the wall is the dominant plane. Preliminary tests on test set 2 estimated the number of points on the plane to be at least 70, out of a point cloud of 1943 points. According to equation 2.7, 100 000 rounds are needed to find the plane with a probability of 99%. The results, shown in figure 4.5b, show that the algorithm obtains approximately the same RMSE for all values of the BSF. The remaining variation in the error can be explained by the random manner in which points on the plane are selected and by the absence of post-processing to improve the fit. The results on test sets 1 and 2, seen in figure 4.5, show a good result in the range of -1.6 to -0.9, which suggests that this is a good choice for the BSF. However, more tests have to be done to validate this result.

4.2.3 Binomially adapted RANSAC

B-RANSAC was proposed as an improvement on BSF-RANSAC. This algorithm also depends on a parameter that must be set to a data-specific value. To investigate this, B-RANSAC is first evaluated on test set 1, with the occluded wall. Each test was performed with 63 000 rounds, to match the test settings of the previous section.




(a) Test set 1. (b) Test set 2.

Figure 4.6: The RMSE of the points on the plane as a function of the set backlier probability, ρ_b, for test sets 1 and 2.

The results in figure 4.6a show that an optimal fit is found for estimates of the backlier probability between 0.2 and 0.4. Outside of this range the error increases, as the set backlier probability no longer corresponds to that of the dataset: the larger the estimation error, the larger the resulting error.

The next test is done on test set 2, where the wall is the most prominent plane. The results, shown in figure 4.6b, show that the algorithm returns a good plane estimate over a wide range of 0.05 to 0.55. This indicates that the selection of the backlier probability does not significantly influence the detection of a dominant plane. Both tests suggest that a rough estimate of the backlier probability suffices.

(a) Test set 1. (b) Test set 2.

Figure 4.7: The calculated scores of the algorithm.

Interestingly, the scores of the algorithm appear to correlate with the planes in the scene: the peaks in figures 4.7a and 4.7b correspond to the different planes in the image.

4.3 Including the IMU

Using the rotation obtained from the IMU and the rotation of the IMU relative to the left camera,³ the points in the point cloud are rotated to match the world frame. The point cloud is then projected onto the ground plane to reduce the dimensionality of the problem. The rotation and projection of the point cloud introduce some error; this section investigates the extent of this error.

The optimal threshold is first determined in a manner comparable to the case of standard RANSAC. From preliminary tests, the number of inliers was found to be at least 14, out of a point cloud of 654 points. To find the plane with a probability of 99%, about 10 000 rounds of RANSAC are then needed, according to equation 2.7. Note that this number is much lower than what

³ The left camera was used as the base of the coordinate system in the previous sections.




Figure 4.8: Evaluation of the threshold of RANSAC with an IMU.

was needed in the earlier tests. This is to be expected due to the reduction of dimensionality, as discussed in section 3.3.1. The results shown in figure 4.8 show that the plane is fitted best in the range of 4 to 6 millimeters. The RMSE values are higher here than in the previous section, due to the added noise caused by errors in the rotation estimate. For further tests a threshold of 4 millimeters is used.

Again the performance on test sets 1 and 2 is evaluated. Figure 4.9 shows the results for the BSF-RANSAC algorithm on both test sets. In both test sets the wall is now the dominant plane. This is caused by the projection of the points onto the two-dimensional ground plane: planes that are not parallel to the direction of gravity are not projected onto a line and will not be found as walls. This greatly improves the robustness of the algorithm.

(a) Test set 1. (b) Test set 2.

Figure 4.9: The evaluation of BSF-RANSAC with an IMU.

In figure 4.10 the results for B-RANSAC are shown. B-RANSAC was found to perform well in the range of 0.45 to 0.7 for the first test set, and 0.55 to 0.85 for the second. Interestingly, these ranges are higher than what was found for the three-dimensional case. This is due to the error introduced by projecting the points onto the ground plane.

The wall detection algorithm assumed that the user is facing in the general direction of the wall. When the point cloud is rotated to the world frame, this assumption does not have to be made. This shows that the use of an IMU can provide a strong clue for the detection of a plane.

Table 4.1 shows that the smallest RMSE of the plane detected with the IMU is higher for both test sets and for both of the proposed algorithms. This, again, is caused by the projection of the points onto the ground plane.

In conclusion, the proposed wall detection algorithms were found to improve on RANSAC in the case where the dominant plane is not the wall. Further tests showed that an estimate can be made for the parameters of the proposed algorithms. The accuracy of the methods was found to be the same as that of RANSAC. Lastly, the addition of an IMU improved the algorithms' robustness at the cost of accuracy.




(a) Test set 1. (b) Test set 2.

Figure 4.10: The evaluation of B-RANSAC with an IMU.

Table 4.1: The minimal RMSE (in millimeters) for the different algorithms.

                 Without IMU              With IMU
                 Test set 1  Test set 2   Test set 1  Test set 2
B-RANSAC             4          11            7          19
BSF-RANSAC           4           9            7          18



Chapter 5

Conclusion and Discussion

5.1 Conclusion

Recent advances in tracking algorithms for Augmented Reality have made it possible to add virtual information to an unknown environment in real time. The tracking and mapping algorithm of Akman [1], in combination with a stereo camera setup, was used as the base for this. This setup, however, lacks a robust registration of virtual objects in the real world. To this end, this thesis proposed to extract the geometry of the scene from the map generated by the tracking and mapping algorithm, to obtain a more semantically rich representation. Wall detection is investigated in detail in this thesis, since the bounds of a room, i.e. the walls, floor and ceiling, offer a first step in the reconstruction of a room.

A wall is, most commonly, a plane, and therefore plane detection methods were researched. The candidate methods include planar homography estimation, the Hough transform, region growing and RANSAC. Planar homography estimation was not used because it assumes a dominant plane in the image. The region growing method was not used as it assumes a dense point cloud with less noise, and also assumes that planes are abundant in the scene. The Hough transform was found to be slow, and its best adaptation generally performs the same computation as RANSAC, yet stores the results in parameter space and thus requires a large amount of memory. In conclusion, RANSAC was chosen as the algorithm of choice, considering that it does not require abundant planes in the data, is robust to a high amount of noise and is the fastest of these algorithms.

A novel software architecture was introduced that is modular and multi-threaded. The novelty of this design lies in the interchangeability of the components of the Augmented Reality system and in the possibility to add new modules, such as a scene geometry reconstruction module. The tracking and mapping software from Akman [1] was rewritten to fit this framework, and new modules for visualization, camera capturing, IMU capturing, user input and scene reconstruction were added, where the latter is the main contribution of this thesis.

Next, two novel scoring methods for the RANSAC algorithm were proposed in order to adapt the algorithm for detecting walls. The first algorithm, BSF-RANSAC, adapts the loss function by increasing the loss for points behind the plane, whereas the other algorithm, B-RANSAC, adapts the scoring using a binomial distribution. Tests on artificial data showed that both algorithms work in theory and that the backlier probability estimate of B-RANSAC corresponds to the true backlier probability. For tests on real data, a novel hardware setup and the new software architecture were used. First, a threshold for RANSAC was estimated from tests on a point cloud, followed by an analysis of the behavior of the algorithms for different parameter values on two different point clouds. Lastly, the addition of an IMU was investigated. A ground truth for the evaluation of the methods is obtained by detecting a marker in the scene. The detected planes are compared against this ground truth using a new evaluation method based on the root-mean-square error and the points on the estimated plane.
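To illustrate the backlier idea, a RANSAC candidate score with a backlier penalty could be sketched as follows. This is a minimal sketch under our own assumptions, not the thesis implementation; the names `bsf_score`, `threshold` and `backlier_scale` and the plane representation (unit normal n, offset d) are ours:

```python
import numpy as np

def signed_distances(points, plane):
    """Signed distance of each 3-D point to the plane (n, d), with unit
    normal n; positive distances lie behind the candidate wall."""
    n, d = plane
    return points @ n + d

def bsf_score(points, plane, threshold=0.05, backlier_scale=5.0):
    """Sketch of a backlier-scaled score: count inliers within the
    threshold, but penalise backliers (points behind the plane) more
    heavily than ordinary outliers."""
    dist = signed_distances(points, plane)
    inliers = np.abs(dist) <= threshold
    backliers = dist > threshold  # points behind the candidate wall
    return inliers.sum() - backlier_scale * backliers.sum()
```

In a RANSAC loop this score would simply replace the plain inlier count when comparing candidate planes.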

This thesis found that the proposed algorithms offer an improvement over standard RANSAC when the dominant plane in the point cloud is not the wall. In another setting, where standard RANSAC did succeed, an indication was found that the plane is detected using the same backlier scaling factor, suggesting that the same value of the backlier scaling factor would work for both a visible and an occluded wall. More research is still required to validate this result. No clear difference has yet been found between the two adaptations of RANSAC. Therefore the algorithm using the backlier scaling factor is preferred, as it is the simplest to understand and fits in the framework of a loss function. Incorporating the data from the IMU improves the complexity of the algorithm by reducing the minimal number of points required to hypothesize a plane from 3 to 2. In theory, this dramatically improves the speed of the algorithm for data with a low ratio of inliers to outliers. However, since there is an error in the estimation of the rotation, the accuracy of the algorithm is worse. Still, the IMU eliminates the need for the assumption that a person is generally facing a wall. The addition also improves the robustness of the algorithm, as planes oriented differently than a wall (assumed to be parallel to the direction of gravity) can be discarded.
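The reduction from 3 to 2 sampled points follows from constraining the wall normal to be perpendicular to gravity: two points plus the gravity direction fix the plane. A sketch of the resulting minimal fit, under that assumption (the function name and plane representation are ours, not the thesis code):

```python
import numpy as np

def wall_from_two_points(p1, p2, gravity):
    """Hypothesize a wall plane from two sampled points and an IMU
    gravity direction, assuming the wall is parallel to gravity.
    Returns (unit normal n, offset d) with n . x + d = 0 on the plane."""
    g = gravity / np.linalg.norm(gravity)
    n = np.cross(g, p2 - p1)  # perpendicular to gravity and to the chord
    norm = np.linalg.norm(n)
    if norm < 1e-9:
        raise ValueError("sampled points are (nearly) aligned with gravity")
    n /= norm
    return n, -np.dot(n, p1)
```

The degenerate case, two points stacked vertically, must be rejected, just as collinear triples are rejected in the 3-point fit.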

In conclusion, this work presented a novel hardware configuration and a novel modular software architecture for Augmented Reality. For scene reconstruction, walls were detected with two adaptations of the RANSAC algorithm, BSF-RANSAC and B-RANSAC. These adaptations were tested and found to outperform the straightforward implementation of RANSAC.

5.2 Discussion

More tests are needed to examine how the backlier scaling factor and the backlier probability estimate behave in different circumstances. For example, the algorithms might behave differently in cluttered and uncluttered scenes, or when windows are part of the scene. It is still unknown how much the backlier probability varies with the data acquired. A single estimate, as was done for the threshold, could be sufficient if the variance is low. However, if the backlier probability is high, the score would either have to be refined further or the backlier probability would have to be estimated from the data, for instance using the covariance. Another possible adjustment of the algorithms is to redefine a backlier in terms of its distance to the wall and the relation of that distance to the extent of the wall.

The proposed algorithms are as fast as RANSAC, which proved to be the fastest plane detection algorithm in our setup. However, to prove near real-time performance of the algorithms, a more extensive analysis needs to be performed. The RANSAC algorithm has been researched extensively, and therefore many adaptations are available that could improve the proposed algorithms. One such improvement is to use progressive sample consensus (PROSAC) [8] to improve the sampling of the algorithm and, with that, its speed. A second improvement concerns the accuracy of the algorithm, for example by refitting the inliers with a least-squares fit or by using local optimization as explained by Chum et al. [9].
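The least-squares refit mentioned above is commonly computed from an SVD of the centred inlier points: the plane normal is the right singular vector with the smallest singular value. A minimal sketch of this standard technique (the function name is ours):

```python
import numpy as np

def refit_plane(inliers):
    """Least-squares plane through a set of inlier points.  The normal is
    the direction of least variance of the centred points, i.e. the last
    right singular vector of the SVD."""
    pts = np.asarray(inliers, dtype=float)
    centroid = pts.mean(axis=0)
    _, _, vt = np.linalg.svd(pts - centroid)
    n = vt[-1]  # singular vector with the smallest singular value
    return n, -np.dot(n, centroid)
```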

From a stereo setup a dense point cloud can also be obtained, which could be used for scene reconstruction. It is unknown whether this, in combination with the algorithms developed, provides a more accurate estimation of the scene. However, when using RANSAC the speed is lower, as the dense point cloud first has to be estimated using dense matching, and the size of the point cloud requires a candidate plane to be evaluated on more points. The number of rounds required by the RANSAC algorithm does not change, as it only depends on the proportion of inliers in the data. Dense matching is less accurate in estimating point locations on walls, since walls are often smooth and poorly textured. As a consequence the proportion of inliers is lower, and the algorithm thus requires more iterations to find the optimal solution, which might not be the wall.
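The dependence of the number of rounds on the inlier proportion is made explicit by the standard RANSAC bound k = ceil(log(1 - p) / log(1 - w^n)), for inlier ratio w, minimal sample size n and success confidence p. The following generic helper (not specific to this thesis) computes it, and also shows why the IMU's reduction of the sample size from 3 to 2 pays off at low inlier ratios:

```python
import math

def ransac_rounds(inlier_ratio, sample_size, confidence=0.99):
    """Number of RANSAC rounds needed to draw at least one all-inlier
    minimal sample with the given confidence."""
    return math.ceil(math.log(1.0 - confidence) /
                     math.log(1.0 - inlier_ratio ** sample_size))

# At 30% inliers, a 2-point sample (IMU available) needs far fewer
# rounds than a 3-point sample:
#   ransac_rounds(0.3, 2)  <  ransac_rounds(0.3, 3)
```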

Another direction for enriching the model of the world is the use of object detection algorithms, which would conveniently fit in the framework as another module running on a separate thread. One could, for instance, detect a chair or a cup and add it to the model of the room.

By changing the general orientation assumed for the wall, the plane detection algorithm can also be used for the detection of ceilings and floors. The floor and ceiling can even be detected more robustly, as the algorithms assume a bounding planar surface that is not reflective and does not contain large windows, and this assumption holds more strongly in these cases. By subsequently removing the points detected on one of these surfaces, the complexity of the problem decreases and the room model is expanded. This model could then be used to improve the accuracy of the map by re-projecting the points found on each plane. Concluding, this thesis provides an initial step into the reconstruction of scene geometry for Augmented Reality.
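The subsequent removal of detected points can be sketched as a simple sequential loop around any single-plane detector, for example one of the RANSAC variants. The `detect_plane` callback below is hypothetical; the sketch only illustrates the remove-and-repeat idea:

```python
import numpy as np

def extract_planes(points, detect_plane, max_planes=3, threshold=0.05):
    """Sequential plane extraction sketch: detect one plane, remove its
    inliers, and repeat on the remaining points.  detect_plane is any
    single-plane detector returning (unit normal n, offset d)."""
    planes = []
    remaining = np.asarray(points, dtype=float)
    for _ in range(max_planes):
        if len(remaining) < 3:
            break
        n, d = detect_plane(remaining)
        dist = np.abs(remaining @ n + d)
        planes.append((n, d))
        remaining = remaining[dist > threshold]  # drop this plane's points
    return planes, remaining
```

Each removed surface shrinks the problem for the next detection, which is exactly the complexity reduction described above.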


Bibliography

[1] O. Akman, “Robust Augmented Reality,” Ph.D. dissertation, Technical University Delft, 2012.

[2] R. T. Azuma, “A survey of augmented reality,” Presence: Teleoperators and Virtual Environments, vol. 6, pp. 355–385, August 1997.

[3] S. Benford, C. Greenhalgh, G. Reynard, C. Brown, and B. Koleva, “Understanding and constructing shared spaces with mixed reality boundaries,” ACM Transactions on Computer-Human Interaction, vol. 5, pp. 185–223, September 1998.

[4] D. Borrmann, J. Elseberg, K. Lingemann, and A. Nüchter, “The 3D Hough transform for plane detection in point clouds: A review and a new accumulator design,” 3D Research, vol. 2, no. 2, pp. 32:1–32:13, Mar. 2011.

[5] J. Caarls, “Pose estimation for mobile devices and Augmented Reality,” Ph.D. dissertation, Technical University Delft, 2009.

[6] A. Censi and S. Carpin, “HSM3D: feature-less global 6DOF scan-matching in the Hough/Radon domain,” in Proceedings of the 2009 IEEE International Conference on Robotics and Automation, ser. ICRA’09. Piscataway, NJ, USA: IEEE Press, 2009, pp. 1585–1592.

[7] S. Choi, T. Kim, and W. Yu, “Performance evaluation of RANSAC family,” in Proceedings of the British Machine Vision Conference. BMVA Press, 2009, pp. 81.1–81.12, doi:10.5244/C.23.81.

[8] O. Chum and J. Matas, “Matching with PROSAC – progressive sample consensus,” in Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05) – Volume 1, ser. CVPR ’05. Washington, DC, USA: IEEE Computer Society, 2005, pp. 220–226.

[9] O. Chum, J. Matas, and Š. Obdržálek, “Enhancing RANSAC by generalized model optimization,” in Proc. of the Asian Conference on Computer Vision (ACCV), K.-S. Hong and Z. Zhang, Eds., vol. 2. Seoul, South Korea: Asian Federation of Computer Vision Societies, January 2004, pp. 812–817.

[10] G. C. de Wit, “A retinal scanning display for Virtual Reality,” Ph.D. dissertation, Technical University Delft, 1997.

[11] R. O. Duda and P. E. Hart, “Use of the Hough transformation to detect lines and curves in pictures,” Communications of the ACM, vol. 15, no. 1, pp. 11–15, 1972.

[12] A. Geiger, M. Roser, and R. Urtasun, “Efficient large-scale stereo matching,” in Proceedings of the 10th Asian Conference on Computer Vision – Volume Part I, ser. ACCV’10. Berlin, Heidelberg: Springer-Verlag, 2011, pp. 25–38.

[13] D. Hähnel, W. Burgard, and S. Thrun, “Learning compact 3D models of indoor and outdoor environments with a mobile robot,” Robotics and Autonomous Systems, vol. 44, no. 1, pp. 15–27, 2003.

[14] P. Henry, M. Krainin, E. Herbst, X. Ren, and D. Fox, “RGB-D mapping: Using Kinect-style depth cameras for dense 3D modeling of indoor environments,” The International Journal of Robotics Research, vol. 31, no. 5, pp. 647–663, 2012.

[15] H. Hirschmüller, “Stereo processing by semiglobal matching and mutual information,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 2, pp. 328–341, 2008.


[16] H. Kato and M. Billinghurst, “Marker tracking and HMD calibration for a video-based Augmented Reality conferencing system,” in Proceedings of the 2nd IEEE and ACM International Workshop on Augmented Reality, ser. IWAR ’99. Washington, DC, USA: IEEE Computer Society, 1999, pp. 85–94.

[17] G. Klein and D. Murray, “Parallel tracking and mapping for small AR workspaces,” in Proc. Sixth IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR’07), November 2007.

[18] J. S. Kollin, “A retinal display for virtual-environment applications,” Proceedings of the Society for Information Display, vol. XXIV, p. 827, 1993.

[19] R. Lakaemper and L. J. Latecki, “Extended EM for planar approximation of 3D data,” in Proceedings of the 2006 IEEE International Conference on Robotics and Automation, 2006.

[20] V. Lepetit, F. Moreno-Noguer, and P. Fua, “EPnP: An accurate O(n) solution to the PnP problem,” International Journal of Computer Vision, vol. 81, no. 2, pp. 155–166, 2009.

[21] M. I. A. Lourakis, A. A. Argyros, and S. C. Orphanoudakis, “Detecting planes in an uncalibrated image pair,” in Proceedings of the British Machine Vision Conference 2002. British Machine Vision Association, 2002, pp. 587–596.

[22] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.

[23] E. Mair, G. D. Hager, D. Burschka, M. Suppa, and G. Hirzinger, “Adaptive and generic corner detection based on the accelerated segment test,” in Proceedings of the 11th European Conference on Computer Vision: Part II, ser. ECCV’10. Berlin, Heidelberg: Springer-Verlag, 2010, pp. 183–196.

[24] J. Matas and O. Chum, “Randomized RANSAC with sequential probability ratio test,” in Proceedings of the Tenth IEEE International Conference on Computer Vision – Volume 2, ser. ICCV ’05. Washington, DC, USA: IEEE Computer Society, 2005, pp. 1727–1732.

[25] K. Mikolajczyk and C. Schmid, “Scale & affine invariant interest point detectors,” International Journal of Computer Vision, vol. 60, no. 1, pp. 63–86, Oct. 2004.

[26] P. Milgram, H. Takemura, A. Utsumi, and F. Kishino, “Augmented Reality: A class of displays on the reality-virtuality continuum,” in Telemanipulator and Telepresence Technologies, 1994, pp. 282–292.

[27] S.-F. Persa, “Sensor fusion in head pose tracking for Augmented Reality,” Ph.D. dissertation, Technical University Delft, 2006.

[28] J. Poppinga, N. Vaskevicius, A. Birk, and K. Pathak, “Fast plane detection and polygonalization in noisy 3D range images,” in 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems, September 2008, pp. 3378–3383.

[29] D. Scharstein and R. Szeliski, “A taxonomy and evaluation of dense two-frame stereo correspondence algorithms,” International Journal of Computer Vision, vol. 47, no. 1–3, pp. 7–42, 2002.

[30] R. Schnabel, R. Wessel, R. Wahl, and R. Klein, “Shape recognition in 3D point-clouds,” in The 16th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision 2008, V. Skala, Ed. UNION Agency-Science Press, Feb. 2008.

[31] S. M. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski, “A comparison and evaluation of multi-view stereo reconstruction algorithms,” in Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition – Volume 1, ser. CVPR ’06. Washington, DC, USA: IEEE Computer Society, 2006, pp. 519–528.

[32] J. Shi and C. Tomasi, “Good features to track,” Cornell University, Ithaca, NY, USA, Tech. Rep., 1993.


[33] I. E. Sutherland, “A head-mounted three dimensional display,” in Proceedings of the Fall Joint Computer Conference, 1968, pp. 757–764.

[34] F. Tarsha-Kurdi, T. Landes, and P. Grussenmeyer, “Hough-transform and extended RANSAC algorithms for automatic detection of 3D building roof planes from lidar data,” International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. 3, pp. 407–412, 2007.

[35] P. H. S. Torr and A. Zisserman, “MLESAC: a new robust estimator with application to estimating image geometry,” Computer Vision and Image Understanding, vol. 78, no. 1, pp. 138–156, Apr. 2000.

[36] B. Triggs, P. McLauchlan, R. Hartley, and A. Fitzgibbon, “Bundle adjustment – a modern synthesis,” in Vision Algorithms: Theory and Practice, LNCS. Springer Verlag, 2000, pp. 298–375.

[37] H. Uchiyama and E. Marchand, “Object detection and pose tracking for Augmented Reality: Recent approaches,” in 18th Korea-Japan Joint Workshop on Frontiers of Computer Vision (FCV), Kawasaki, Japan, Feb. 2012.

[38] D. van Krevelen and R. Poelman, “A survey of augmented reality technologies, applications and limitations,” The International Journal of Virtual Reality, vol. 9, pp. 1–20, June 2010.

[39] E. Vincent and R. Laganière, “Detecting planar homographies in an image pair,” in Image and Signal Processing and Analysis, 2001. ISPA 2001. Proceedings of the 2nd International Symposium on, June 2001, pp. 182–187.

[40] J. Xiao and Y. Furukawa, “Reconstructing the world’s museums,” in Proceedings of the 12th European Conference on Computer Vision, ser. ECCV ’12, 2012.

[41] J. Xiao, J. Zhang, J. Zhang, H. Zhang, and H. Hildre, “Fast plane detection for SLAM from noisy range images in both structured and unstructured environments,” in Mechatronics and Automation (ICMA), 2011 International Conference on, Aug. 2011, pp. 1768–1773.

[42] Z. Zhang, “A flexible new technique for camera calibration,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, pp. 1330–1334, 1998.

[43] Z. Zhang, R. Deriche, O. Faugeras, and Q.-T. Luong, “A robust technique for matching two uncalibrated images through the recovery of the unknown epipolar geometry,” The International Journal of Robotics Research, vol. 78, no. 1–2, pp. 87–119, Oct. 1995.


Appendix A

Devices for Augmented Reality

Figure A.1: Monitor based display [27].

Figure A.2: Video see-through display [27].

Figure A.3: Optical see-through display [27].
