University Klagenfurt
Department of Computer Science
ISYS
Master’s thesis
HYBRID TRACKING FOR AUGMENTED REALITY
Tymoteusz Sielach
Supervisor
prof. Martin Hitz
Klagenfurt, 2009
Contents
1 Introduction 1
2 Analysis 4
2.1 Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.1 Visual Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Marker-based tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Tracking without markers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Software libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.2 Inertial Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Accelerometers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Gyroscopes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
6-DOF inertial trackers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.1.3 Other technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
GPS and DGPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Electronic compass . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Mechanical tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Gravity sensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Ultra sound tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Ultra-Wideband . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.1.4 Hybrid tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2 Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.1 Ultra mobile PC and TabletPC . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.2 Mobile phone . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Windows Mobile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Android . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
iPhone . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Symbian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2.3 AR Telescopes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3 Propositions of solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.3.1 Hybrid tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.3.2 Tracking from pre-acquired data . . . . . . . . . . . . . . . . . . . . . . . . 31
3 Development of the prototype 33
3.1 Description of prototype . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Technical specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.1.2 Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.1.3 Camera pose estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.1.4 Point coordinates estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.1.5 Features - finding, describing, matching . . . . . . . . . . . . . . . . . . . . 42
SURF based feature tracking . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Pyramid Lucas-Kanade algorithm . . . . . . . . . . . . . . . . . . . . . . . 43
3.1.6 Marker based tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.1.7 Initial camera calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.1.8 Inertial tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.2 Possible improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2.1 Descriptor based tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2.2 Kalman filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.2.3 Weighted tracking error function . . . . . . . . . . . . . . . . . . . . . . . . 47
3.2.4 Speeding-up and better robustness . . . . . . . . . . . . . . . . . . . . . . . 47
3.2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3 History of development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Natural feature tracker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Pose estimation and reconstruction procedures . . . . . . . . . . . . . . . . 50
Improvement of architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Improvements of tracking quality . . . . . . . . . . . . . . . . . . . . . . . . 50
Change of feature tracker . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Speeding up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4 Conclusion 55
4.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.1.1 Outdoor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.1.2 Indoor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.1.3 Verification of requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.2 Hardware recommendation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.3 Tracking method recommendation . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3.1 Pose estimation algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.3.2 Map management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Tracking with extensible map . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Tracking from pre-acquired data . . . . . . . . . . . . . . . . . . . . . . . . 61
4.3.3 Feature tracking and detection method . . . . . . . . . . . . . . . . . . . . . 61
4.3.4 Inertial subsystem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.4 Initialisation and marker subsystem . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.5 Problems unsolved . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.6 Technical problems during development . . . . . . . . . . . . . . . . . . . . . . . . 63
Bibliography 67
Chapter 1
Introduction
There is a historical project called “Burgbau zu Friesach”, which takes place in Friesach (Carinthia). The project consists in building a medieval castle from the ground up, using only medieval methods and technologies. The whole process will take 30 to 40 years. Currently the construction site has been fixed and the designers are working in their office in Friesach. The project is already in progress, but the first stones will not stand on the construction site for a few years. The problems are:
• How to make the construction site attractive for visitors, especially in the early phase of construction.
• How to document the whole process.
• How to convey the knowledge about the construction to visitors in an attractive way.
The idea is to solve these problems using modern information technology. The documentation problem can be solved with cameras which automatically take pictures of the construction, store them in a database and process them later. The problems of conveying knowledge and presenting the building can be solved by creating an augmented reality system, which overlays multimedia content and a 3D model of the arising castle onto the construction site. Augmented reality is a very good approach for this purpose, because it does not neglect the things which happen in the real world (the building itself) and which are very important. For example, a 3D presentation of the castle and its construction watched on a desktop PC at home does not give nearly as much information or as strong an impression. AR presents information in an unobtrusive way. Hence, the construction site can look as it did in medieval times, without information boards and arrows guiding visitors. It will give the impression of a really living construction site, not just a museum.
I would like to introduce the vision of the system through user scenarios. There will be two types of users:
• Visitor - someone who comes to sightsee the construction site and is focused on getting information about the construction process and history.
• Designer - a person who wants to see how the castle will look in the future, compare some variants of the design, and adjust some dimensions of the building.
Visitor
A family has come to see the construction of the castle. They bought tickets and got a handheld device with a 7-inch screen on one side and a camera on the other. After turning it on, the camera image appeared on the screen. The instructions recommend looking through the device at two colored stakes stuck into the ground near the entrance to the construction site. After a second of looking at the stakes, the caption ’ready’ popped up on the screen. From that moment all movements of the device are tracked by the system. The father of the family pointed the lens at the empty place where the castle is to be built over the next 30 years, and on the screen it appeared immediately. In the right corner of the screen a date appeared, indicating a moment in time about 30 years ahead. The main functionality of the system is to render the castle in the concrete state of construction indicated by that date. The family discovered that the application contains a time slider which allows them to choose the moment in time for which the castle is displayed. It appears after pushing the ’time line’ button and covers the whole screen. It contains dates and some milestones of the construction process.
After changing the selected time stamp (simply by sliding a finger on the touch screen), the time line disappears and the augmented castle in the selected construction phase becomes visible. The family moved closer to the building to see the details of the 3D model better. They noticed a rectangle in the middle of the screen (similar to the one in a camera’s viewfinder). When a Romanesque window became visible inside that rectangle, its color changed. When the father pushed the ’shutter’ button, a few windows with text and multimedia content about that detail appeared. They stayed on screen as long as he was watching that detail through the device and disappeared after he changed the viewing direction. Later they learned that, to keep such windows from disappearing, they can be held with a finger on the touch screen. They also noticed that apart from the castle there are other 3D objects. When they came closer, they saw virtual posters containing information about the castle and its current construction phase. They were easy to read through the handheld device, and it was easy to switch the view between the castle and an information poster. When they changed the date on the time line, the content of the posters changed along with the castle’s state. After sightseeing the castle, the family went to other parts of the construction site. They came to the place where stones are prepared. Virtual posters appeared around the area they entered, while the posters near the castle vanished. The father pointed the device at a man dressed like a medieval mason. Multimedia content related to the mason appeared on the screen, so he could read about his work and his role in the construction of the castle.
After sightseeing the virtual castle the family came back home and decided to visit the project’s website. There was a text field on the website where they entered their ticket ID number. After sending the form, they could see their whole trip through the construction site and a list of the things they saw, read the descriptions again and view the multimedia content. The son said that it would be very helpful for writing an essay for his history class.
Designer
The designer is sitting in his office in the center of Friesach. He draws a few versions of the castle and some of its more detailed parts using CAD software. He tries to imagine how they will look in the real world and to choose the best one. He also wonders how the castle will fit the shape of the terrain and which dimensions would be best. Answers to all of those questions can be obtained using augmented reality. The designer saves the models of the castle in the BZF file format (BZF is an extended 3DStudio file [*.3ds] used throughout the platform for storing 3D objects) using a plug-in for his design software. Then he loads that file onto a handheld device (the same one used by the family in the previous scenario) and goes to the construction site. He calibrates the device using the stake markers on the construction site (the same way as in the visitor’s scenario) and starts to view his project. First he looks at the castle from a few places at a large distance, then he approaches the walls and looks at some details. After that he switches the system to another version of the project. He walks further away to examine it from a distance. Then he realizes that the tower is too low.
He quickly switches to the dimension-adjusting mode and starts to tune the tower’s height. All of his changes are visible immediately, which is very helpful for finding the ideal height. After that he corrects a few other dimensions in the same way. He also realizes that a defensive wall does not fit the other parts of the castle very well, but he had prepared another variant beforehand, so he quickly switches to that version. When he has the ideal shape of the castle on the screen, he pushes the ’shutter’ button (also used in the visitor’s scenario). In designer’s mode this button has a different function: it stores a screenshot on the hard drive. He then sends this image to his co-workers by email, using standard software (integrated into the UMPC’s OS). In order to save the adjusted dimensions he chooses the ’save’ option from the menu (available in the dimension-adjusting mode). The changes are stored in the BZF file and can be propagated back to the CAD software.
Designing such a system is not very common and causes some problems. Initially we can divide them into two groups:
• Interface design problems. I suspect that the system can display too much multimedia content at one time, which can leave the user bewildered and overloaded with information. Rendering the castle from the inside can also be a problem. Let us imagine that someone wants to see the courtyard, but its floor is quite high and not yet built. The user stands under it and does not see anything. A similar problem can occur if he walks into the interior of some solid (for example a defensive wall). It is an open problem whether the castle should be rendered from the inside and how to present it to the user.
• Technical problems. The design of augmented reality applications is hard and still full of atypical problems; each system is a new challenge. On a larger scale, however, there are some problems common to all augmented reality applications. Augmented reality relies on rendering 3D models and other data onto the image of the real world taken by a camera. In order to do this, the system must have information about the exact camera position and orientation. The problem of acquiring this data is called ’tracking’. There are many tracking methods, which exploit many different physical phenomena. All of them have advantages and disadvantages, which makes them very suitable for some purposes and unsuitable for others. A very common approach is to combine a few methods in one system. The accuracy of tracking is also crucial for AR purposes. The second problem is 3D rendering. It relies on placing a 3D model in the scene so that it does not overlap objects which lie in front of the virtual object; it should look as if it were part of the real world. The computational power needed for rendering (and for tracking) is also a problem: many mobile phones and PDAs do not support hardware floating-point operations.
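To make the tracking requirement concrete: once the camera position and orientation are known, a virtual point can be drawn at the correct pixel using the standard pinhole camera model. The sketch below (all numeric values are illustrative, not taken from the project hardware) projects a 3D world point through an extrinsic pose [R|t] and an intrinsic matrix K:

```python
import numpy as np

# Intrinsic matrix K: focal length and principal point (illustrative values).
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])

# Extrinsic pose: identity rotation, camera 5 m away from the origin along Z.
R = np.eye(3)
t = np.array([[0.0], [0.0], [5.0]])

def project(point_3d):
    """Project a 3D world point to 2D pixel coordinates."""
    p_cam = R @ point_3d.reshape(3, 1) + t   # world -> camera frame
    p_img = K @ p_cam                        # camera frame -> image plane
    return (p_img[:2] / p_img[2]).ravel()    # perspective divide

corner = np.array([1.0, 0.0, 0.0])  # e.g. a virtual castle corner, 1 m to the right
print(project(corner))              # pixel where the overlay must be drawn
```

If the estimated pose (R, t) is wrong, every overlaid pixel is wrong, which is why tracking accuracy dominates the perceived quality of an AR system.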
In this thesis I will focus on solving the tracking problem in an outdoor environment.
Chapter 2
Analysis
The aim of this work is to design an augmented reality system which:
• will make sightseeing at the construction site of the ’Burgbau zu Friesach’ project more attractive,
• will be able to render big buildings,
• will work robustly in a limited outdoor area (the area can be prepared),
• will have a short response time,
• can render at least 25 fps,
• will be easy to use - intuitive interface, lightweight,
• will be able to display multimedia content of several types (3D models, videos, images, text),
• will be able to play sounds.
Designing an augmented reality system raises three types of problems:
• Rendering - the subsystem which draws the various objects on the screen. It must be fast enough to render the image fluently.
• Tracking - to properly render an augmented scene, the system must know the exact trajectory and position of the camera. This is called the geometrical registration problem.
• Hardware platform - in the contemporary world no one constructs a computer from scratch. There are many ready-made micro computers on the market; mainly they use Intel-architecture or ARM processors. The problem is to choose the one which is the fastest and has enough built-in sensors for tracking.
The first section considers the tracking problem. The next section describes hardware platforms. Rendering is outside the scope of this thesis.
2.1 Tracking
In this section we will consider a variety of tracking techniques. They are ordered by how frequently they appear in contemporary augmented reality systems. In the implementation of tracking we can distinguish two approaches:
• Outside-in tracking - the sensing devices are placed at fixed locations in the environment, and the moving objects are tracked.
• Inside-out tracking - the sensors are mounted on the moving object which is to be tracked.
The first approach works in specially prepared environments, which are relatively small. The second approach can work in an unprepared and unbounded environment.
2.1.1 Visual Tracking
Visual tracking is an approach, which uses one or more cameras. Special algorithms are looking for
features in the raster image, describe and classify them. Consecutive frames are analyzed and the
spatial correspondence between them is calculated. The features can be known for the software and
even specially put in the environment - this approach is called “marker-based tracking”. Markers
don’t have to be visible for the human. We have visual tracking systems which rely for example on
infrared LEDs[CN98]. There are also tracking systems which rely only on natural features (edges,
bulbs, T - junctions, colors, etc.) They can work in the unprepared environments - this fact greatly
increases the area of applications of Augmented Reality. The camera can be attached to the tracked
device (for example: Tablet, PDA or HMD) - inside-out approach. Or the system can consist of
many cameras mounted in the environment - outside-in approach. In the next subsections we
will discuss marker-based tracking, tracking without markers and software libraries, which can be
helpful in implementation of visual tracking.
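The frame-to-frame correspondence mentioned above can be illustrated with a deliberately simple sketch: patches around a sparse set of points are matched between two synthetic frames by exhaustive sum-of-squared-differences search, and the median displacement gives the 2D image motion. All names and parameters are my own; a real tracker would use a proper feature detector and descriptor instead of dense patch search.

```python
import numpy as np

rng = np.random.default_rng(42)

def best_offset(prev, curr, y, x, half=4, search=8):
    """Find the displacement of the patch around (y, x) between two frames
    by exhaustive sum-of-squared-differences search (a toy stand-in for
    real descriptor matching)."""
    ref = prev[y - half:y + half + 1, x - half:x + half + 1]
    best, best_d = None, np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            cand = curr[y + dy - half:y + dy + half + 1,
                        x + dx - half:x + dx + half + 1]
            d = np.sum((ref - cand) ** 2)
            if d < best_d:
                best_d, best = d, (dy, dx)
    return best

# Synthetic textured frame and a copy shifted by (3, 5) pixels,
# simulating a small camera pan between consecutive frames.
prev = rng.random((80, 80))
curr = np.roll(prev, shift=(3, 5), axis=(0, 1))

# "Features": a sparse grid of interior points (a real system would use
# a corner detector such as FAST instead).
points = [(y, x) for y in range(20, 61, 20) for x in range(20, 61, 20)]
offsets = [best_offset(prev, curr, y, x) for (y, x) in points]

# Robust frame-to-frame motion estimate: the median displacement.
motion = tuple(int(v) for v in np.median(np.array(offsets), axis=0))
print(motion)  # -> (3, 5)
```

Taking the median over many feature displacements already hints at the outlier-removal problem discussed later: a few bad matches must not corrupt the motion estimate.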
Marker-based tracking
Marker-based tracking is one of the most developed topics. During the last 15 years a great many types of fiducial markers have been developed. The features which differentiate the various marker systems are:
• Robustness, speed, and accuracy of tracking.
• Type of markers. At the simplest level we can distinguish color markers and shape markers.
• The range of distances from which the marker can be recognized. For example, a very big marker which is noticeable from a great distance cannot be recognized when the camera is next to it and captures only a part of it.
• How many bits of information can be coded on the marker. Some marker standards assume the use of checksum bits.
• How big the markers are, and how obtrusive. How large is the area they cover? The Studierstube Tracker library [WLS08] contains some concepts for unobtrusive markers.
• Whether the marker system is scalable. Is it possible to use it over a large area?
• How many markers must be in one frame to calculate the 6-DOF pose.
Here I present the list of marker systems which I studied:
ISO/IEC16022 standard [Wikb]. The marker contains a 2D data matrix and two solid edges, which help to find the marker. ISO/IEC16022 markers can be combined together to make a large matrix consisting of segments delimited by solid edges. The segments have the shape of a square or rectangle. There are many standard dimensions for such markers - from 8x8 up to 144x144. The largest version can contain 2355 bits of data. Each marker has an ECC200 checksum. ISO/IEC16022 was designed to replace old-fashioned barcodes. There are many free libraries for reading and generating ISO/IEC16022 codes. The Studierstube Tracker library can also estimate the 6-DOF pose of the camera using such a marker. A big advantage of this system is that it can contain a large amount of data; we can store a whole URL or even a simple 3D model there. See 2.1(a).

Figure 2.1: Types of markers - (a) ISO/IEC 16022 marker, (b) ARTag marker, (c) Frame marker, (d) Split marker, (e) Dot marker, (f) Circular markers, (g) Nested marker, (h) Stake marker.
There are a few other standards using data matrices, for example ARTag (2.1(b)) [Fia05]. An ARTag marker contains 36 bits of information: 10 bits contain ID data (1024 different markers), and the other 26 bits provide redundancy to decrease the chance of false identification. ARTag is designed to be identified and tracked very robustly.
ARToolkit markers [Kat] - a system of square-shaped markers identified by a graphic template. The template can be any black-and-white image, but the more detailed the symbol, the worse the quality of identification. Symbols consisting of large black-and-white areas yield the best robustness and enlarge the distance from which markers are recognizable.
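The first stage of detecting such square markers can be sketched as follows. This is a deliberately simplified, axis-aligned toy of my own: a real detector such as ARToolkit fits a general quadrilateral, rectifies its interior, and compares it against the stored templates.

```python
import numpy as np

def find_marker(gray, thresh=0.5):
    """Toy first stage of square-marker detection: binarize the image and
    locate the dark region; the four corners would then be used for pose
    estimation and the interior compared against the templates.
    (Axis-aligned bounding box only -- a real detector fits a general
    quadrilateral under perspective distortion.)"""
    dark = gray < thresh
    ys, xs = np.nonzero(dark)
    if len(ys) == 0:
        return None
    y0, y1 = int(ys.min()), int(ys.max())
    x0, x1 = int(xs.min()), int(xs.max())
    # Corners in clockwise order, starting at the top-left.
    return [(y0, x0), (y0, x1), (y1, x1), (y1, x0)]

# Synthetic frame: white background with a black 40x40 marker at (30, 50).
frame = np.ones((120, 160))
frame[30:70, 50:90] = 0.0
print(find_marker(frame))  # -> [(30, 50), (30, 89), (69, 89), (69, 50)]
```

The four detected corners of a marker of known physical size are exactly the 2D-3D correspondences that a pose-estimation routine needs.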
Frame markers [WLS08] - rectangular markers which consist only of a frame (anything can be inside the marker). The ID information (9 bits) is encoded on the interior side of the frame, making it look like a decoration. The code contains a checksum and is arranged so that the orientation can be determined. See 2.1(c).
Split markers [WLS08] - a variation of the frame marker which consists only of two parallel sides of the frame. Both sides contain barcodes with the same ID information (6 bits) encoded. The two barcodes differ only in one bit, which stores the orientation. See 2.1(d).
Dot markers [WLS08]. Each marker is a black dot with a white ring. The markers form a two-dimensional grid, which is applied to a flat surface. Each 4 dots form a grid cell, which is matched against a precomputed template. This allows the position and orientation of the camera to be determined. The solution is scalable but limited to flat surfaces. The dots cover only a very small part of the surface. See 2.1(e).
Nested markers [TKK+06] - a way to improve the scalability and accuracy of tracking for black-and-white rectangular markers. See 2.1(g). The concept is recursive: there is one high-level marker which consists of several smaller (lower-level) markers, and lower-level markers can in turn contain nested smaller markers. Each marker contains a visual code and can be unambiguously identified. When the camera is far from the marker, the system uses the top-level marker; if the camera is close, the lower-level markers are used for geometrical registration. The system can also use markers of many levels simultaneously to increase the tracking accuracy.
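The level-selection idea behind nested markers can be illustrated with a small sketch. The apparent-size proxy (physical size divided by distance) and the thresholds below are my own illustration, not the scheme from [TKK+06]:

```python
def usable_levels(marker_sizes_m, distance_m, min_app=0.05, max_app=0.5):
    """Return the nested-marker levels whose apparent size (physical size
    divided by camera distance, a crude proxy for angular size) is large
    enough to decode but still small enough to fit in the field of view.
    Thresholds are illustrative only."""
    return [lvl for lvl, size in enumerate(marker_sizes_m)
            if min_app <= size / distance_m <= max_app]

# Level 0 is the outermost 1 m marker; each nested level is half the size.
sizes = [1.0, 0.5, 0.25, 0.125]
print(usable_levels(sizes, 3.0))   # far away: the large outer markers qualify
print(usable_levels(sizes, 0.5))   # up close: the small nested markers take over
```

As the camera approaches, the set of usable levels slides toward the smaller nested markers, which is exactly how the scheme keeps registration accurate at short range.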
Multi - ring color markers [CN98] - is another approach to improve scalability of marker-
based tracking. They are quite similar to nested markers, because this concept also relies on
embedding one marker in the another. The author of the article [CN98] proposed two solutions:
“Proportional width ring markers” and “Constant width ring markers”. The first concept increases
the tracking range. Marker consist of a few concentric rings, the width of each ring is 2 times bigger
than the width of its internal ring. There exists markers of many levels : the level 1 markers consist
of a centre and one ring, level 2 markers have 2 rings and so on. To determine the 6-DOF pose the
camera must see three or more markers. Therefore 1 level markers must be arranged dense, bigger
markers will be arranged sparser because system uses them if distance to the lens is large. Farther,
rings have one of 6 colors, which helps with identification. The constant width ring markers has
also constant number of rings. Hence the tracking range of each marker is also constant.
Invisible markers embedded in images [Hwa07] - standard black-and-white square markers can be applied to static or moving images as noise which is unnoticeable to humans. The encoded information can be extracted using a standard camera and a Wiener filter. If we subtract the original image from the acquired one after noise reduction, we obtain an image of the marker. It then has to be normalized and converted to a binary image using a precomputed global threshold. The resulting picture can be robustly processed with marker tracking software.
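The subtract-normalize-threshold pipeline can be sketched as follows. This is a simplified version of the scheme from [Hwa07]: the Wiener denoising step is assumed to have been applied to the captured image already, and the embedding amplitude is my own choice.

```python
import numpy as np

def extract_marker(original, captured, thresh=0.5):
    """Recover an embedded marker: subtract the original image from the
    (already noise-reduced) captured one, normalize the absolute
    difference to [0, 1], and binarize with a global threshold."""
    diff = np.abs(captured - original)
    span = diff.max() - diff.min()
    if span == 0:
        return np.zeros_like(diff, dtype=bool)  # no marker present
    norm = (diff - diff.min()) / span
    return norm > thresh

# Synthetic demo: embed a faint square "marker" as low-amplitude noise.
base = np.full((50, 50), 0.5)
marker = np.zeros((50, 50))
marker[10:40, 10:40] = 1.0
captured = base + 0.02 * marker        # far too faint for the eye to notice
binary = extract_marker(base, captured)
print(binary.sum())  # -> 900 marker pixels recovered (the 30x30 square)
```

The binary image produced this way is what the downstream marker tracking software would consume in place of a camera frame.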
IR markers [PP04] - the proposed system uses ARToolkit-style markers drawn with invisible IR ink. The application uses two cameras aligned through a half mirror: one of them is equipped with an IR filter for capturing the markers, the second with a visible-light cutoff filter for capturing the real scene. The signal from the IR camera is processed with the ARToolkit library and 3D models are overlaid on the image from the scene camera. To the user, the application looks like a robust markerless visual tracking system.
Stake markers - my own idea for the “Burgbau zu Friesach” project. The marker is a stake with color bars on the upper end; the other end is sharpened so it can be stuck vertically into the earth (see figure 2.1(h)). The color bars form a visual code which allows the markers to be distinguished. The width of a bar is constant and big enough to be recognized by the camera from a large distance. In order to calculate the position and orientation, two markers must be in the field of view. The color code should be similar to the one used in multi-ring color markers. The number of colors used should be low, to avoid recognition errors under different lighting conditions. In addition, the colors can be divided into two groups according to their wave frequency (for example: (red, yellow, orange) and (blue, violet, green)); colors from the two groups could be used alternately, also to avoid recognition errors. For each stake the system must remember its 2D position on the ground and its height. The heights of all markers must be normalized to some common imaginary level, and I assume that all stakes are perpendicular to that level. Stake markers are designed to be stuck around the area where the tracking system should work. Figure 2.2 presents two placement methods (in the context of the ’Burgbau zu Friesach’ project). The gray color indicates the area where spectators should not stay; the red squares represent stakes. We must also assume that the virtual objects are placed in the middle of the environment (in the circle labelled ’castle’). In figure 2.2(a) the markers are placed in the middle, so they are always visible when looking at the castle. They also should not disturb the viewing, because they will be covered by the rendered 3D model. However, when the user stands very close to the stakes or between them, tracking can stop working, because the arising castle will cover some markers on the opposite side of the circle. In figure 2.2(b) the stakes are placed around the area. The big advantage of this method is that spectators can be anywhere in the area, but the danger of markers being covered by the building is larger.
The effectiveness of the proposed marker system depends on where the user points the camera. I assume that in most cases the camera will look along an axis parallel to the surface of the earth (small tilts are possible). In order to support 6-DOF tracking at every location in the tracking area, stake markers must be integrated with some other technique.
During tracking the system constantly looks for the points where neighbouring color bars meet. The length of a color bar is constant, so the 2D and 3D coordinates of those points are easy to calculate. If the camera sees 2 stakes, 2 points from each of them can be used to estimate the camera pose. The more points or stakes are visible, the better the quality that can be achieved.
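Building the 3D side of those correspondences is straightforward, since each stake's surveyed ground position and normalized height are known and the bar length is constant. The sketch below is my own illustration (the bar length, heights and function names are assumptions, not measured project values); the resulting 2D-3D pairs would then be fed to a 6-DOF pose solver such as an iterative PnP routine (e.g. OpenCV's solvePnP).

```python
import numpy as np

BAR_LEN = 0.10  # color-bar height in metres (illustrative)

def stake_points(base_xy, top_height, n_bars):
    """3D coordinates of the points where neighbouring color bars meet,
    for one vertical stake. `base_xy` is the surveyed 2D position on the
    normalized ground plane, `top_height` the height of the stake tip;
    bars are stacked downwards from the tip."""
    x, y = base_xy
    return np.array([[x, y, top_height - i * BAR_LEN]
                     for i in range(n_bars + 1)])

# Two stakes in view: each contributes its bar-junction points, paired
# with their detected 2D image positions, as input for pose estimation.
pts_a = stake_points((0.0, 0.0), 1.5, 4)
pts_b = stake_points((2.0, 0.0), 1.5, 4)
print(len(pts_a) + len(pts_b))  # -> 10 candidate 3D points
```

With four or more such points visible (two per stake, two stakes), the camera pose is fully determined; every additional visible junction over-determines the solution and improves robustness.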
Figure 2.2: Methods of placement of stake markers - (a) in the middle, (b) around.
All the systems presented in this section can be used with both inside-out and outside-in tracking. However, for augmented reality applications inside-out tracking is more desirable. This approach relies on tracking markers in the environment: if the environment grows or changes so that some markers become covered, we can simply add some cheap markers. In an outside-in system the network of cameras would have to be extended, which is more expensive and more complicated.
Tracking without markers
As it was already told, visual tracking relies on finding features in consecutive frames and calcu-
lating the frame-to-frame correspondence. In this section we will consider a case, where natural
features (not known before) are utilized. We can distribute the problem into parts:
Table 2.1: Table of marker-based solutions

Name                      | Type                     | Data    | MFT | SDK                                                                     | TR
--------------------------|--------------------------|---------|-----|-------------------------------------------------------------------------|---
ISO/IEC16022              | square, B&W              | 2355    | 1   | Studierstube Tracker for pose estimation and many others (only reading) | S
ARTag                     | square, B&W              | 10      | 1   | ARToolkitPlus                                                           | S
ARToolkit markers         | square, B&W, any pattern | N/A     | 1   | ARToolkit and Studierstube Tracker                                      | S
Frame markers             | square, frame            | 9       | 1   | Studierstube Tracker                                                    | S
Split markers             | barcode                  | 6       | 2   | Studierstube Tracker                                                    | S
Dot markers               | circle                   | N/A     | 3   | Studierstube Tracker                                                    | S
Nested markers            | square, nested           | N/A     | 1   | N/A                                                                     | I
Multi-ring color markers  | circle, nested           | N/D (1) | 3   | N/A                                                                     | I
Invisible markers         | square, invisible        | N/A     | 1   | ARToolkit with filtering                                                | S
IR markers                | square, invisible        | N/A     | 1   | ARToolkit with hardware                                                 | S

Explanation: Data - amount of data expressed in bits; MFT - minimal number of markers in view needed to calculate the 6-DOF pose of the camera; SDK - implementation available on the internet; TR - tracking range (S - standard, I - the design of the marker increases the tracking range).
(1) Each bit has 6 possible values; the number of bits is variable.
• Finding features. The features can be edges, blobs, T-junctions, colors or even the horizon silhouette. There are many well-investigated methods for finding natural features, such as the Laplace and Sobel operators and their modifications. The two most important parameters of a detection method are repeatability (the same features are found when the object is seen from different angles and under different lighting conditions) and computation time. One of the best natural-feature detectors is the FAST algorithm [RD05], which was mentioned in the article [WRM+08].
• Finding two corresponding features in two different camera frames. The system must know whether feature A in one frame and feature B in the second frame are the same point in space. This problem also has solutions, but in many cases they are computationally expensive and do not work on-line, or are robust only under specific conditions. Solutions to this problem rely on describing the features and comparing the descriptions frame-to-frame, or on comparing raw image patches. There is no guarantee that there are not two or more features with the same description. Hence, methods have been investigated for evaluating the certainty of a match; they rely on predicting the camera position, or on the assumption that the velocity of camera movement is limited.
• Outlier removal. Not all matches with the previous frame are true (the matching algorithm can simply fail on repetitive texture). Also, not all features found by the system belong to static objects in the scene; sometimes a feature belongs to a moving object (such as a walking person). The system cannot use such features for tracking.
• Calculation of the geometrical pose. Methods of finding the 2D-3D correspondence from a 2D projection have already been investigated. The problem of finding the camera pose is similar to problems appearing in stereo-vision, where two cameras look at the same scene (their fields of view partially overlap) and the geometrical relation between them is usually, but not always, known. In vision-based tracking we have two consecutive frames, which partially overlap, and the geometrical relation between the camera positions is also unknown. The mathematical model of the camera is similar to the one used in 3D graphics, enriched by the modelling of radial distortion and slight tangential distortion. The mathematical model of the lens has a few coefficients which can be acquired by analysing pictures from the camera; it is important to find in the picture the projections of points whose 3D coordinates are known. The model can be decomposed into the intrinsic and extrinsic camera matrices. The intrinsic matrix models the parameters of the lens and translates 3D coordinates into 2D coordinates; it does not change while the camera works (if we do not use zoom). The extrinsic matrix models the geometrical relation between the camera and the world (rotation and translation coefficients).
• Initial calibration. If the system tracks the camera pose using unknown features, it can only estimate the camera pose in a coordinate system that is not connected with the coordinate system of the real world (this is called incremental tracking). To create that connection, the system must find some features whose real-world coordinates are already known (for example a calibrated marker or a set of previously learned features).
Feature selection and tracking As I said, finding corresponding features between two images is crucial. We can divide feature tracking methods into two groups according to the value they compare.
• Raw patches. In the simplest case raw image patches are compared; the search area for each feature must of course be limited because of time consumption. Such an approach would also require that the searched features have the same size in the compared frames, which rules it out for many augmented reality scenarios. But there are feature tracking algorithms based on raw patches which are robust and quite fast. An example of such a solution is the pyramidal implementation of the Lucas-Kanade algorithm described in [Bou02]; that version is implemented in the OpenCV library and is very easy to use. This class of algorithms is called optical flow techniques. They are robust only when the differences between images are small. The pyramidal implementation relaxes that limitation, but is more computationally expensive than the standard version. In augmented reality applications, to keep the differences small, frames must be captured very frequently, or some initial guesses of the feature positions must be provided.
• Descriptor-based methods are more robust than the previously described techniques. To check the differences between features, their descriptors are compared. Most feature descriptors have the property of affine and/or scale invariance: when features in two images are scaled with respect to each other (for example one image is a part of the other) or warped, they can still be robustly matched. Another advantage is that, unlike the optical flow methods, they do not require small differences between images - although if that limitation is assumed, much computational effort can be saved. The typical applications of feature descriptors are image stitching and pattern recognition. A survey and comparison of feature descriptors is presented in [MS05]. An example of a descriptor used in augmented reality applications is SIFT [Low04] (scale-invariant feature transform); an example of an AR application utilizing SIFT is described in [WRM+08]. The successor of SIFT is SURF [BTGL06].
However, the quality and reliability of tracking depends strongly on the features themselves, not only on the algorithm. Important issues of that problem are described in the paper 'Good Features to Track' by Shi and Tomasi [ST94]. An obvious property of a feature is the difference between its intensity and the intensity of its surroundings: distinctive edges and corners are easier to find and their position can be determined more accurately. Another problem is the orientation of features. Let us assume that a tracked edge is parallel to the direction of movement; in this case tracking cannot be accurate. Shi and Tomasi proposed a measure which evaluates the usefulness of a feature for tracking. It relies on an analysis of the eigenvalues of the gradient matrix of an image patch. The outcome of Shi and Tomasi is that both eigenvalues of a 'good' feature must be large. Large eigenvalues indicate corners or 'salt and pepper' textures, which can be reliably tracked. Another issue of tracking is the feature displacement model. The simplest case assumes that a feature can be translated by a 2D vector. A more sophisticated model assumes feature warping, which better reflects reality; the warp is represented by a 2x2 affine transformation matrix (the 2D displacement vector must also be included). This model has 6 parameters, hence the equations are harder to solve. The convergence of this model is better when features are 'good' in terms of the eigenvalue analysis. The affine feature displacement model has been used in the AR application described in [KM07], but in that case the affine transformation matrix was calculated from knowledge about the camera position and orientation.
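The eigenvalue criterion of Shi and Tomasi can be sketched as follows. The hand-made gradient lists stand in for the image gradients of a real patch; the structure matrix and its closed-form 2x2 eigenvalues are the actual quantities their measure uses:

```python
# Sketch of the Shi-Tomasi criterion: build the gradient (structure)
# matrix of a patch and inspect its smaller eigenvalue. A corner patch
# yields two large eigenvalues, an edge only one, a flat patch none.
def min_eigenvalue(grads):
    gxx = sum(gx * gx for gx, gy in grads)
    gyy = sum(gy * gy for gx, gy in grads)
    gxy = sum(gx * gy for gx, gy in grads)
    # closed-form eigenvalues of the 2x2 symmetric structure matrix
    tr, det = gxx + gyy, gxx * gyy - gxy * gxy
    disc = (tr * tr / 4.0 - det) ** 0.5
    return tr / 2.0 - disc

corner = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (-1.0, 1.0)]  # gradients in many directions
edge = [(1.0, 0.0), (1.0, 0.0), (1.0, 0.0), (1.0, 0.0)]      # gradients all along x
flat = [(0.0, 0.0)] * 4                                       # no gradients at all

print(min_eigenvalue(corner), min_eigenvalue(edge), min_eigenvalue(flat))
```

A tracker keeps only patches whose smaller eigenvalue exceeds a threshold - exactly the 'good feature' test described above.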
Natural feature tracking using pre-acquired data Natural features are not always unknown in advance. The Archeoguide project [SK] uses 'reference images': first, the area where the tracking system should work is photographed from many sides and under different angles (the camera pose for each snapshot is known). When the tracking system runs, each live frame is compared with the best matching frame from the database and a frame-to-frame correspondence is calculated (the frames in the database are calibrated). The system works similarly to marker-based tracking (we can treat the reference images as markers), but the matching operation is more complicated and requires the techniques described above. The advantage of this solution is that some information can be pre-processed, reducing the complexity of the on-line calculations. The disadvantage is that the database of reference images must be kept up to date. A similar solution is proposed in the article [CCP02].
Issues of camera pose estimation - camera models
Here I present the mathematical model of camera projection, which is common to all camera registration techniques.
\[
\begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix}
= R \begin{bmatrix} x_w \\ y_w \\ z_w \end{bmatrix}
+ \begin{bmatrix} t_x \\ t_y \\ t_z \end{bmatrix}
\tag{2.1}
\]

\[
\begin{bmatrix} u \\ v \end{bmatrix}
= \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \end{bmatrix}
\begin{bmatrix} x_c/z_c \\ y_c/z_c \\ 1 \end{bmatrix}
\tag{2.2}
\]
where R is a 3 × 3 rotation matrix and tx, ty, tz are the coordinates of the translation vector. The 2 × 3 matrix is the intrinsic camera matrix: fx and fy are the focal lengths, and cx and cy are the coordinates of the camera central point, expressed in pixels counted from the upper corner of the image. There are two focal lengths in order to model cameras with non-square pixels. This equation does not contain a distortion model; below I present the camera model with distortion used in OpenCV.
Figure 2.3: The camera model
\[
x' = x/z, \qquad y' = y/z
\]
\[
x'' = x'(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + 2 p_1 x' y' + p_2 (r^2 + 2 x'^2)
\]
\[
y'' = y'(1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + p_1 (r^2 + 2 y'^2) + 2 p_2 x' y'
\]
\[
r^2 = x'^2 + y'^2
\]
\[
\begin{bmatrix} u \\ v \end{bmatrix}
= \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \end{bmatrix}
\begin{bmatrix} x'' \\ y'' \\ 1 \end{bmatrix}
\tag{2.3}
\]
where k1, k2, k3 are radial distortion coefficients and p1, p2 are tangential distortion coefficients. They do not change when the camera resolution changes, but the focal lengths and the central point must be scaled.
In order to calculate the intrinsic and extrinsic parameters, a vector of 3D points and their projections must be given. This data can easily be acquired by photographing a calibrated fiducial marker. It is also important to calculate the image point coordinates with subpixel accuracy. OpenCV contains algorithms which can estimate the intrinsic camera parameters, but only from coplanar points. The more control points there are in the image, and the more images are taken, the better the estimates will be.
The intrinsic camera parameters must be estimated only once, but the extrinsic parameters for each frame. There are a few methods of estimating the camera pose. Most of them rely on solving the equations presented above, which are non-linear because of the division by z. But the camera geometry is also three-dimensional and can be presented that way (see figure 2.3). In order to solve equations based on the geometry of the figure, the units in the world coordinate frame and the camera coordinate frame must be uniform. There is a scalar ratio which translates pixel units to world units and vice versa; that scalar is 'hidden' in fx and fy in equation 2.2.
Solving the camera registration problem
Here I present a short list of methods for solving the camera registration problem. The list of course does not contain all algorithms, but it gives an overview of already investigated approaches. We can distinguish two classes: analytic methods, based on modelling rays from the object to the camera in three-dimensional space, and iterative methods, based on minimizing the re-projection error. Some methods combine both approaches and exploit the advantages of each.
• Least-squares minimisation of the re-projection error. This is a family of iterative methods. The basis is a least-squares iteration scheme (for example the Gauss-Newton method) which minimises the difference between the real points in the image and the projected points. The estimated parameters are the elements of the rotation matrix and the translation vector. The parameters must be initialized with some guessed values; for visual tracking the best initial values are the extrinsic camera parameters from the previous frame. Many implementations of least-squares solvers require a matrix of first derivatives with respect to the parameters, but it can be calculated using the difference quotient. Such a method is briefly described in the paper "Pose Tracking from Natural Features on Mobile Phones" [WRM+08].
• DLT (direct linear transformation). The method relies on solving linear equations based on the perspective camera geometry. It can estimate the extrinsic parameters and the scaling factor between coordinate frames. A detailed description can be found in [dlt] [Qin96].
• The POSIT algorithm [DD95] combines analytic and iterative methods. First the orientation and translation are calculated by solving linear equations, then the resulting values are refined iteratively. The algorithm converges very quickly and is therefore good for real-time calculations.
• Lowe's algorithm relies on iterative least-squares minimisation. The object is described in units of the camera coordinate frame, which makes the mathematical model simple enough that closed-form first derivatives with respect to the position and rotation parameters could be found. A detailed description can be found in [ACB98].
• SCAAT. An incremental tracking approach based on an iterative Kalman filter. It is able to estimate the pose even when information is incomplete (for example, too few points have been found). The algorithm is presented in [WB97]; another visual tracking algorithm based on the ideas of SCAAT is described in [JNY00].
The presented algorithms are full of good ideas which can be applied without adopting the whole solution. For example, the first method (least-squares minimisation) can be enriched by a Kalman filter, or any analytic method can be improved by iterative refinement. The choice of method must be adjusted to the specifics of the problem and to the initial knowledge about the camera movement.
3D reconstruction for visual tracking
As I said at the beginning, all these algorithms require knowledge of the 3D coordinates of world points. In the case of calibrated fiducial markers this is not a problem, but if tracking has to rely on initially unknown natural features, the coordinates must be obtained somehow: techniques of 3D reconstruction must be applied. In order to find the 3D coordinates of a point, two different calibrated camera shots are needed. The coordinates of the point are found by calculating the intersection of two rays going from the centre points of the cameras (two views) through the image planes to the world point; the algorithm is described in detail in [TV98]. The coordinates found this way are not very accurate, because of camera calibration errors and inaccurate feature finding, and the accuracy also depends on the difference of the camera poses in the two considered views. The biggest error appears along the Z-axis of the camera coordinate system (the line from N to O in figure 2.3), but it should not disturb the solution of the camera registration problem. The requirement of two views means that a point must be found three times before it can be utilized for tracking; sometimes that is too late, when the camera moves fast or the point is tracked unstably. Correct calculation of a point's position from only 2 views is also an optimistic case.
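The two-view reconstruction can be sketched as taking the midpoint of the shortest segment between the two viewing rays (with noisy features real rays rarely intersect exactly; the data below is made up):

```python
# Sketch of triangulating a 3D point from two calibrated views: each
# view gives a ray from the camera centre through the image plane, and
# the point is the midpoint of the shortest segment between the rays.
def sub(a, b): return [a[i] - b[i] for i in range(3)]
def add(a, b): return [a[i] + b[i] for i in range(3)]
def dot(a, b): return sum(a[i] * b[i] for i in range(3))
def scale(a, s): return [a[i] * s for i in range(3)]

def triangulate(c1, d1, c2, d2):
    # closest points on rays c1 + s*d1 and c2 + t*d2 (standard formula)
    w = sub(c1, c2)
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b                 # near zero means parallel rays
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1, p2 = add(c1, scale(d1, s)), add(c2, scale(d2, t))
    return scale(add(p1, p2), 0.5)        # midpoint of the shortest segment

# two camera centres, both looking at the world point (1, 1, 5)
target = [1.0, 1.0, 5.0]
c1, c2 = [0.0, 0.0, 0.0], [2.0, 0.0, 0.0]
d1, d2 = sub(target, c1), sub(target, c2)
print(triangulate(c1, d1, c2, d2))
```

The depth error described above corresponds to the midpoint sliding along the rays when they are nearly parallel, i.e. when the two camera poses differ only slightly.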
In my opinion it is possible to obtain the 3D coordinates of a point from a single view, if some additional information is available - for example, information about the plane on which the point lies. If the tracking system had a rough 3D model of the environment and the initial camera pose, it could calculate on which plane a point lies, and hence its 3D coordinates. A point on a plane has two degrees of freedom, and the camera also delivers two variables (the u and v coordinates), so the equation system has one unambiguous solution.
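This single-view idea reduces to a ray-plane intersection. A minimal sketch, under the assumption that the plane (here a floor at y = 0) and the camera pose are known; all values are invented:

```python
# Sketch of single-view depth from a known plane: intersect the
# viewing ray of the observed pixel with the plane n.p = h.
def ray_plane(c, d, n, h):
    # solve n.(c + s*d) = h for s, then evaluate the ray
    s = (h - sum(n[i] * c[i] for i in range(3))) / sum(n[i] * d[i] for i in range(3))
    return [c[i] + s * d[i] for i in range(3)]

cam = [0.0, 1.5, 0.0]                      # camera 1.5 units above the floor
ray = [0.2, -1.0, 1.0]                     # viewing ray of the observed pixel
floor_n, floor_h = [0.0, 1.0, 0.0], 0.0    # floor plane: y = 0
print(ray_plane(cam, ray, floor_n, floor_h))
```

One view plus one plane constraint fixes the point completely, matching the degree-of-freedom count in the paragraph above.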
There is another approach, which finds the coordinates of points from 2 views together with the camera pose. It is called the 'five point algorithm', because it needs at least five corresponding points. This algorithm can calculate all of those quantities only up to a scale factor. If the real distance between two points in the image is known, all coordinates can be expressed in metric units. This limitation is not a problem: data about the metric distance can be provided using fiducial markers, or it can simply be assumed. Such an approach is used in the system described in the paper [KM07]. That application is initialised as follows: the user directs the camera at the scene, pushes a button, then moves the camera 10 cm to the side (the program believes that it was 10 cm) and pushes the button again. In that way 2 views are indicated and the five point algorithm is applied. The main product of the five point algorithm is the 3 × 3 essential matrix, from which the relative rotation and translation can be derived; the procedure and the whole algorithm with an implementation are described in [Nis03a] and [SEN]. If the relative geometric correspondence between each pair of frames is accumulated (the initial camera pose must be known), we obtain the actual camera pose, and hence a tracking system. This approach has one advantage: not all points have to be tracked from frame to frame - only a few are needed to keep the scaling factor. But the big disadvantage is drift: if any measurement of relative movement is inaccurate, that error influences the estimated camera pose forever. There is no way to correct the drift, because the system does not remember the previous state or frame. This approach is called 'Structure from Motion'. In order to improve the quality of the essential matrix estimation, all outliers must be removed. The most sophisticated method of outlier removal and pose estimation is preemptive RANSAC, described in the paper [Nis03b]. That system runs live, but it is not the only solution; a second, competitive technology is described below.
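The outlier-removal principle behind RANSAC can be shown on a toy problem - fitting a line despite gross outliers. Preemptive RANSAC scores camera-pose hypotheses analogously, just with five-point samples and an essential-matrix model instead of two-point samples and a line:

```python
# Toy RANSAC sketch for outlier removal: repeatedly fit a line to a
# minimal random sample and keep the hypothesis with the most inliers.
import random

def ransac_line(points, trials=200, tol=0.1):
    best, best_inliers = None, []
    rng = random.Random(0)                    # fixed seed for repeatability
    for _ in range(trials):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue                          # degenerate minimal sample
        a = (y2 - y1) / (x2 - x1)             # hypothesis from 2 points
        b = y1 - a * x1
        inliers = [(x, y) for x, y in points if abs(y - (a * x + b)) < tol]
        if len(inliers) > len(best_inliers):
            best, best_inliers = (a, b), inliers
    return best, best_inliers

# points on y = 2x + 1 plus two gross outliers
pts = [(x, 2 * x + 1) for x in range(10)] + [(3, 40), (7, -5)]
(a, b), inliers = ransac_line(pts)
print(round(a, 3), round(b, 3), len(inliers))
```

The two outliers never collect enough support, so the consensus set recovers the true model despite their gross errors - the same reason wrong feature matches do not corrupt the essential matrix estimate.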
Simultaneous Localisation and Mapping (SLAM) is a group of technologies which originates from robotics, where it was developed to build a map of an unknown environment and use it to track the position of a robot. One can argue that the movement of a robot is different and more predictable than the movement of a handheld camera, but successful implementations of AR based on SLAM [KM07] are the best evidence that the difference is not so large. Creating the map while immediately tracking from that map is a 'chicken and egg' problem: errors made during the map creation phase influence tracking, and tracking influences the calculation of the positions of new points and the correction of previous ones. The idea widely used to solve that problem is the Kalman filter [Kal], a stochastic model of a process which is able to predict the future state of that process. Generally two operations can be performed on the filter: prediction - calculation of a future process state - and correction - entering an actual measurement of the state, which updates the coefficients of the stochastic model. The Kalman filter has a modification called the Extended Kalman filter (EKF), which supports non-linear models. KFs and EKFs are widely used in automation.
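The two operations can be sketched with a minimal one-dimensional Kalman filter (the noise values are illustrative, not tuned for any real sensor):

```python
# Minimal 1D Kalman filter sketch showing the two operations named
# above: predict (constant-state model with process noise q) and
# correct (blend in a noisy measurement, weighted by the Kalman gain).
def predict(x, p, q):
    return x, p + q                   # state unchanged, uncertainty grows

def correct(x, p, z, r):
    k = p / (p + r)                   # Kalman gain
    return x + k * (z - x), (1 - k) * p

x, p = 0.0, 1.0                       # initial state estimate and variance
q, r = 0.01, 0.5                      # process and measurement noise
for z in [1.2, 0.9, 1.1, 1.0, 1.05]: # noisy readings of a value near 1.0
    x, p = predict(x, p, q)
    x, p = correct(x, p, z, r)
print(x, p)
```

The estimate converges toward the measured value while the variance p shrinks - the same predict/correct cycle a SLAM system runs per frame, only with a much larger state vector.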
In augmented reality the Kalman filter has a few applications. It can be used to predict the camera position and orientation; the predicted value (if it is close to the truth) can be used for finding the 2D positions of environment points in the camera frame, or as an initial guess of the camera pose for an iterative pose refinement procedure. Kalman filters should also reduce the drift appearing in visual tracking systems. They also have applications in the interpretation of sensor data, where they smooth the noise coming from a sensor (such as an accelerometer).
Here I would like to describe how a typical SLAM system works, on the basis of the solution from [Dav03]. In such a system the camera movements and the positions of points have probabilistic models. The state of the system is modeled as a vector which can be divided into two parts: the first describes the actual camera position and orientation, the second has an entry for each feature point. When a new frame is acquired, the system predicts the camera pose. After that, all feature points are reprojected and, according to their pose uncertainty, an elliptic search area is created for each of them; the size of the ellipse depends on the uncertainty of the point and of the camera pose. The images of the points are searched for within those areas, and the covariance matrix and state vector are updated according to their real positions. In such a system the covariance matrix is quite large - (7 + 3x) × (7 + 3x), where x is the number of feature points - but its maintenance (including the non-diagonal entries) is very important, because it models correlations between points. For example, if some points lie on the same surface and near to each other, their absolute positions can be uncertain while the knowledge about their relative positions is very certain. Initialization of a new point is also done in a top-down way. A new point is added to the database using information from one view only and is modeled as an infinite straight line. The depth of a point cannot be calculated from one view, so some assumptions are made according to the initial knowledge about the environment: the depth is modeled as a 1D range on that straight line, filled with a large number of regularly distributed particles (see figure 2.4(b)). These particles create a discrete model of the probability density function. At the beginning all particles are equal; in subsequent frames observations of the point are applied to them. After some number of frames the particles should form a single peak, as in figure 2.4(a). If this does not happen, the point is discarded; otherwise the point is used for tracking, with the depth estimate at the position of the peak. The quality of the position estimate is very high, but it takes much time and requires the point to be visible for some period of time. Moreover, after the invested effort, the point should be used for tracking for a long time; the design of the system assumes that the tracking viewpoint will be revisited.
(a) Probability density of depth (b) Graphic interpretation of particles
Figure 2.4: Initialisation of a point in probabilistic SLAM
Survey of visual markerless tracking
Figure 2.5 presents a survey of all natural feature tracking techniques discussed in this section, divided into the processing stages of the algorithm. A red pentagon with the letter K indicates techniques utilizing Kalman filters; the possible types of Kalman filter are presented in the bottom left corner of the image.
Figure 2.5: Summary of visual tracking from natural features
Software libraries
Computer vision systems are not standardized, but there are some software libraries which gather implementations of algorithms useful for creating augmented reality systems. This section is divided into two parts: the first describes frameworks - implementations of whole functionalities such as marker-based or markerless tracking - and the second describes libraries which implement only individual algorithms (without connections between them).
Frameworks
Table 2.2: AR Frameworks
Name | LC | NFT | Markers | PE | License | Platform
ARToolkit | Yes | No | ARToolkit | Yes | GPL | C/C++ and Java (wrapper); Linux, Windows, macOS
ARToolkitPlus | Yes | No | ARToolkit, ARTag | Yes | GPL | C/C++; Windows, Linux, WinCE
Studierstube Tracker | Yes | No | ARToolkit, ARTag, Frame, Split, Dot | Yes | N/A | C/C++; Windows, Linux, WinCE, Symbian, iPhone
Studierstube ES | Yes | Yes | ARToolkit, ARTag, Frame, Split, Dot | Yes | N/A | C/C++; Windows, Linux, WinCE, Symbian
SceneLib | No | Yes | No | Yes | LGPL | Linux
LC - lens calibration, PE - pose estimation, NFT - natural feature tracking
In table 2.2, SceneLib deserves special attention: it is a framework for designers of SLAM systems for robotics. I did not explore that library very deeply, but there are AR solutions based on it, for example [Dav03].
Algorithms and other functionality
OpenCV - a large library for computer vision and image processing. Windows and Linux implementations are available. The library is divided into 4 parts:
• CXCORE - data structures used in CV; simple drawing and rendering; time measurement; mathematical operations (matrices).
• HighGUI - functions for displaying windows and refreshing their graphics, and for capturing images from devices and files; allows rapid prototyping and portability.
• CV - typical computer vision algorithms: SURF, filters, corner detection, camera calibration, stereo-vision.
• Machine learning.
OpenCV offers very broad functionality - sufficient for many computer vision applications - and allows rapid development. It also contains an interesting type system: arrays, matrices and images are aggregated under one super-type, and functions recognize the input type at runtime.
OpenSURF [Eva09] - an open source implementation of SURF. I compiled the library and ran the program, but it worked very slowly. I replaced the Fast-Hessian feature detector (standard for SURF) with FAST [RD05], but the program was still too slow. The SURF implementation in OpenCV performs much better.
Integrating Vision Toolkit (http://ivt.sourceforge.net/) - a library similar to OpenCV (a set of algorithms). Here I want to point out the differences with respect to OpenCV:
• Contains a SIFT implementation
• Its Harris corner detector is faster than OpenCV's
• Very good object-oriented architecture
2.1.2 Inertial Tracking
This approach relies on capturing translations and rotations of an object using the phenomenon of inertia. The movement of a body in space can be represented as a sequence of translations and rotations. The inertial sensor must be mounted on the tracked object (inside-out approach). Inertial tracking is sourceless - it does not need any reference object in the environment. Its most considerable problem is drift. Devices for measuring translation are called accelerometers; devices for measuring rotation are called gyroscopes.
Accelerometers
Accelerometers are devices which measure the linear acceleration of a body along one axis. In the simplest case they consist of a mass and a subsystem which measures the force affecting the mass (for example a mass mounted on a piezo-electric crystal [RJD+01]). The acceleration of the body is proportional to the inertial force affecting the mass. Double integration of the acceleration value over time gives the position. However, the calculated position is not error-free:
• measurement errors
Table 2.3: 3D accelerometers

Name | Range | Noise | Update rate | Price | OS | Notes
Phidget Accelerometer² | ±3 G | 6 milli-G | 60 Hz | £114.95 | Windows, Windows CE | USB; designed for quick prototyping of AR solutions; price includes shipment
MOD-MMA7260Q³ | ±1.5 G / ±6 G | N/A | N/A | €38.95 | N/A | board with an ARM processor and a 3D accelerometer; must be programmed by the user
USB1600-PC⁴ | ±1.5 G / ±6 G | N/A | >60 Hz | $275 | Windows, Windows CE | similar to the Phidget Accelerometer; comes with drivers, software and code samples
• discretization errors
• numerical integration errors
• accumulation of previous errors
Because of these factors, accelerometers can precisely track the position only over a short period of time. Hence a tracking system cannot rely solely on accelerometers; it should be supported by a second system which does not drift (for example visual tracking). A single accelerometer has 1 degree of freedom. A 3D accelerometer (3-DOF) consists of 3 single accelerometers measuring acceleration along 3 perpendicular axes.
Fortunately, accelerometers are readily available on the market. There are analog electronic circuits with built-in 3D accelerometers; they can be connected to a one-chip microcomputer (with analog inputs), which transmits the data to a PC over USB or RS-232. Devices which can simply be connected to a PC (without soldering and low-level programming) are also available, and some products include an SDK for C++ and other programming languages. A few handheld devices (for example smartphones) have built-in 3D accelerometers.
Gyroscopes
Gyroscopes are devices which measure orientation in space (insensitive to translation, in contrast to accelerometers). Most old-fashioned gyroscopes use a quickly rotating wheel as a reference; the movement of the object is sensed by rotation encoders which measure the angles between the tracked body and the wheel. The main problem of gyroscopes is drift, caused by the small friction between the axis of the wheel and its bearing. This error can be minimized by calculating the orientation iteratively, but that results in the accumulation of numerical errors. Hence, the gyroscope must be periodically re-calibrated to remain accurate over time. Electronic gyroscopes can also be implemented as micro-electro-mechanical systems (MEMS), which contain micro-miniature vibrating elements instead of spinning wheels. As an example of a PC gyroscope based on MEMS, see the InertiaCube2+ in table 2.4 and [FHA98]. Another approach to building gyroscopes (and also accelerometers) is silicon micromachining (iMEMS) [LKPB02]; the principle of operation of an iMEMS gyroscope is described in [GK]. Silicon micromachined sensors are less accurate than MEMS, but their dimensions, power consumption and cost are very low.
2 http://www.trossenrobotics.com/store/p/5160-PhidgetAccelerometer-3-Axis.aspx, http://www.active-robots.com/products/phidgets/three-axis-accelerometer.shtml
3 http://www.olimex.com/dev/mod-mma7260q.html
4 http://www.embeddedsys.com/subpages/products/usb1600.shtml
Table 2.4: MEMS and iMEMS gyroscopes

Name: InertiaCube 2+
Technology: MEMS
Maximum angular speed: 1200°/s
Angular resolution: 0.01°
Update rate: 180 Hz
Interface: RS-232/USB adapter
Price: about €2000
http://www.intersense.com/uploadedFiles/Products/IC2+_datasheet_0908.pdf

Name: Gyro Breakout Board
Technology: iMEMS
Maximum angular speed: 150°/s
Sensitivity: 12.5 mV/°/s
Update rate: 80 Hz
Interface: N/A
Price: about €69
Note: the board contains only a one-chip 1-DOF analog gyroscope and requires a microcomputer with ADCs.
http://www.watterott.com/Gyro-Breakout-Board-ADXRS150-150-degree-sec_1

Name: IMU Combo Board
Technology: iMEMS
Maximum angular speed: 75°/s
Sensitivity: 15 mV/°/s
Update rate: 40 Hz
Interface: N/A
Price: about €69
Note: the board contains only a one-chip 3-DOF analog gyroscope and requires a microcomputer with ADCs.
http://www.sparkfun.com/commerce/product_info.php?products_id=842
The lack of accuracy can be compensated for by a visual tracking system. These features make silicon micromachined sensors very common in mobile devices.
6-DOF inertial trackers
Accelerometers and gyroscopes deliver different types of information. In order to robustly calculate the 6-DOF pose, the system must integrate both accelerometers and gyroscopes. A few such solutions are available on the market, for example the InertiaCube [FHA98], which contains 3 accelerometers, 3 vibrating gyroscopes and 3 magnetometers. The idea of an inertial tracking system based on silicon micromachines is proposed in the article [LKPB02] and implemented, for example, in the Wii Remote [Wii] game controller. This device contains a 3D accelerometer and can be enriched with a gyroscope connected to its expansion slot; the Wii Remote also uses a visual tracking system consisting of IR diodes and an IR camera on the controller. Table 2.5 introduces a cheap 6-DOF inertial tracker, consisting of 3 iMEMS accelerometers, 3 gyroscopes and a microcontroller with A/D converters for data acquisition and communication with the PC.
Each accelerometer is able to measure dynamic (vibration) and static (gravity or tilt) acceleration. If the system should measure tilt, high-sensitivity devices are recommended. It is possible to measure the tilt around the 2 axes parallel to the earth, but rotation around the third (perpendicular) axis is not sensed. It could be sensed if the rotation axis were quite far (say 10 cm) from the sensor, so that the linear acceleration caused by the angular speed affected the sensor.

Table 2.5: 6-DOF inertial sensors

Name: Atomic IMU - 6 Degrees of Freedom
Gyroscope range: ±300°/s
Gyroscope sensitivity: 3.3 mV/°/s
Gyroscope update rate: 88 Hz
Accelerometer range: ±1.5 G / ±6 G
Price: €115.00
Interface: UART
Notes: the device consists of the sensors and an Atmel ATMega168 microcomputer (must be programmed by the user).

Even then, the motion along a circle and the motion along a straight line are indistinguishable when the system sees only the data from a single accelerometer. But if the rigid body whose position is measured is equipped with a second 3D accelerometer, the rotation becomes distinguishable.
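The static tilt computation described above can be sketched as follows (readings are made-up values in units of g, and the axis convention - z up when the device lies flat - is an assumption):

```python
# Sketch of static tilt sensing: at rest the accelerometer measures
# only gravity, so pitch and roll follow from the direction of that
# vector. Yaw (rotation around the gravity axis) leaves the reading
# unchanged, which is the blind spot described in the text.
import math

def tilt(ax, ay, az):
    pitch = math.degrees(math.atan2(-ax, math.sqrt(ay * ay + az * az)))
    roll = math.degrees(math.atan2(ay, az))
    return pitch, roll

print(tilt(0.0, 0.0, 1.0))               # device lying flat
print(tilt(-0.5, 0.0, math.sqrt(0.75)))  # tilted forward by 30 degrees
```

Rotating the inputs around the z (gravity) axis changes neither formula's result, which is why a second, offset accelerometer or a gyroscope is needed for yaw.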
6-DOF tracking with 2 accelerometers. I think that estimating a 6-DOF pose is possible using two 3D accelerometers mounted on the two opposite ends of the handheld device. In this case there exists only one axis (the one passing through both accelerometers) around which rotation cannot be sensed. The article [acc] mentions that such a system can measure roll, pitch and yaw as long as the common axis of the accelerometers does not point along the gravity vector; the signals begin to disappear as the common axis approaches the acceleration vector. Only sharp movements are measured accurately. I suppose that the accuracy of such a system would be much worse than that of a silicon micromachined gyroscope.
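The geometric reason why a second accelerometer makes rotation observable can be sketched numerically. For a rigid body, the reading difference between two accelerometers mounted a vector r apart is α × r + ω × (ω × r), so it vanishes for pure translation but not for rotation. The following fragment is an illustrative sketch of this principle only, not code from any of the systems cited here:

```python
import numpy as np

def differential_acceleration(alpha, omega, r):
    """Reading difference a2 - a1 between two accelerometers rigidly
    mounted a vector r apart on one body: alpha x r + omega x (omega x r).
    alpha is the angular acceleration, omega the angular rate."""
    return np.cross(alpha, r) + np.cross(omega, np.cross(omega, r))

# Pure translation: both sensors read the same value, the difference is zero.
translating = differential_acceleration(np.zeros(3), np.zeros(3),
                                        np.array([0.1, 0.0, 0.0]))

# Steady 2 rad/s spin about z with a 10 cm baseline along x: the sensors
# disagree by the centripetal term |omega|^2 * |r| = 0.4 m/s^2.
spinning = differential_acceleration(np.zeros(3), np.array([0.0, 0.0, 2.0]),
                                     np.array([0.1, 0.0, 0.0]))
```

With a 10 cm baseline and a 2 rad/s spin the sensors disagree by 0.4 m/s², a comfortably measurable signal; the difficulty lies in slow rotations, where this term shrinks quadratically with the rate.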
2.1.3 Other technologies
GPS and DGPS
Global positioning system. GPS provides worldwide 3-DOF position tracking. Its accuracy is about 10-20 meters (after Selective Availability was switched off in 2000) [Roy]. The system consists of 28 satellites orbiting the earth. Each satellite carries 3-4 atomic clocks to ensure precise time measurement and broadcasts the current time together with its own position. A special receiver calculates the distance to each visible satellite and from that its own position. If it sees 4 satellites, position and altitude can be estimated; if the receiver sees fewer than 4 satellites, a 2D position can be calculated using an altitude entered by the user.
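The position calculation behind this is essentially trilateration from ranges. The following minimal Gauss-Newton sketch is illustrative only: it uses synthetic noise-free ranges and ignores the receiver clock bias, which is the very reason a real receiver needs the fourth satellite.

```python
import numpy as np

def trilaterate(sat_positions, ranges, guess, iterations=10):
    """Gauss-Newton estimate of a receiver position from ranges to
    satellites at known positions. A real receiver also solves for its
    clock bias, which is why a fourth satellite is required in practice."""
    x = np.asarray(guess, dtype=float)
    for _ in range(iterations):
        diffs = x - sat_positions                 # satellite -> receiver
        dists = np.linalg.norm(diffs, axis=1)     # predicted ranges
        jacobian = diffs / dists[:, None]         # unit line-of-sight rows
        correction, *_ = np.linalg.lstsq(jacobian, ranges - dists, rcond=None)
        x = x + correction
    return x

# Synthetic scenario: four satellites, noise-free ranges (units arbitrary).
sats = np.array([[15600.0,  7540.0, 20140.0],
                 [18760.0,  2750.0, 18610.0],
                 [17610.0, 14630.0, 13480.0],
                 [19170.0,   610.0, 18390.0]])
truth = np.array([1000.0, 2000.0, 3000.0])
estimate = trilaterate(sats, np.linalg.norm(sats - truth, axis=1), np.zeros(3))
```

Because the satellites are far away compared to the receiver displacement, the problem is nearly linear and the iteration converges in a few steps.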
DGPS - differential GPS. DGPS is a supporting system for GPS. It consists of earth-bound stations whose positions are known very precisely. These stations receive the satellite signals, calculate corrections and transmit them to special DGPS receivers. Such a receiver analyses both the satellite signals and the signals from the earth-bound stations and calculates its position with 1-3 m precision. The HA-NDGPS system, which is currently under development, is expected to reach an accuracy of 0.1-0.15 m [DGP]. There are many DGPS standards and many stations on the earth [Gal]; anyone can set up a station of his own if there is a need. DGPS is used in some augmented reality applications, for example LIFEPLUS [Vla04]. The accuracy of DGPS is not good enough to base camera registration on it, but it can be used as a component of hybrid large-area AR systems.
GPS and DGPS have become very common in the last few years. They are parts of handheld devices (phones, PDAs) and are also available as separate devices with a Bluetooth or USB interface. Most GPS receivers use the NMEA protocol to communicate with the outside world. One problem of GPS technology is the low update rate (about 1 Hz). Another is that the signal disappears when, for example, a vehicle passes under a wide overpass. To bridge such situations, the GPS receiver can be complemented by a low-cost accelerometer [Dav08]. GPS receivers can also integrate other sensors such as magnetometers (electronic compasses).
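To illustrate the NMEA protocol, the following sketch (a hypothetical helper, not part of any cited system) extracts the position fields from a GGA sentence, one of the most common NMEA message types:

```python
def parse_gga(sentence):
    """Extract position data from an NMEA GGA sentence.
    Latitude/longitude are converted from NMEA's (d)ddmm.mmmm format
    to signed decimal degrees."""
    fields = sentence.split(',')
    if not fields[0].endswith('GGA'):
        raise ValueError('not a GGA sentence')

    def to_decimal(value, hemisphere):
        degrees, minutes = divmod(float(value), 100.0)
        decimal = degrees + minutes / 60.0
        return -decimal if hemisphere in ('S', 'W') else decimal

    return {
        'lat': to_decimal(fields[2], fields[3]),
        'lon': to_decimal(fields[4], fields[5]),
        'fix_quality': int(fields[6]),
        'satellites': int(fields[7]),
        'altitude_m': float(fields[9]),
    }

fix = parse_gga('$GPGGA,123519,4807.038,N,01131.000,E,1,08,0.9,545.4,M,46.9,M,,*47')
```

A production parser would additionally verify the trailing checksum and handle empty fields, which receivers emit before the first fix.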
Electronic compass
An electronic compass is a very sensitive magnetometer which measures the magnetic field of the earth. A magnetometer consists of one or more coils. If the magnetic field around a coil changes, a variable current arises in it; that current is a function of the distance from the magnetic field source and of the relative orientation between the emitting and the receiving coil. Magnetic trackers are widely used in VR because of their low price. Electronic compasses use magneto-inductive elements instead of coils, usually several of them. The measurement of such a system can obviously be disturbed by other sources of magnetic field and also by the tilt of the device. Electronic compasses therefore often integrate tilt sensors (for example inclinometers) to improve the quality of the measurement. The accuracy of an electronic compass is about 0.5 degree. One remaining problem is that the magnetic field of the earth is not homogeneous. The electronic compass is a very good complement for GPS because it delivers orientation in the horizontal plane. An interesting AR solution which exploits a compass is Wikitude [Wika].
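The tilt compensation mentioned above can be sketched as follows: pitch and roll are derived from the accelerometer, the magnetometer vector is rotated back into the horizontal plane, and the heading is the angle of the remaining horizontal component. This is a simplified illustration assuming an x-forward, y-right, z-down axis convention and a device at rest:

```python
import math

def tilt_compensated_heading(mag, acc):
    """Heading in degrees clockwise from magnetic north, computed from a
    3-axis magnetometer reading de-rotated by the pitch and roll that a
    3-axis accelerometer senses at rest (x forward, y right, z down)."""
    ax, ay, az = acc
    roll = math.atan2(ay, az)
    pitch = math.atan2(-ax, math.hypot(ay, az))
    mx, my, mz = mag
    # rotate the magnetic vector back into the horizontal plane
    xh = (mx * math.cos(pitch)
          + my * math.sin(roll) * math.sin(pitch)
          + mz * math.cos(roll) * math.sin(pitch))
    yh = my * math.cos(roll) - mz * math.sin(roll)
    return math.degrees(math.atan2(-yh, xh)) % 360.0

# Level device facing north (field has a downward dip component):
north = tilt_compensated_heading((1.0, 0.0, 0.5), (0.0, 0.0, 1.0))
# Level device facing east:
east = tilt_compensated_heading((0.0, -1.0, 0.5), (0.0, 0.0, 1.0))
# Device pitched 30 degrees up, still facing north:
pitched = tilt_compensated_heading((0.866, 0.0, 0.5), (-0.5, 0.0, 0.866))
```

Without the de-rotation step, the pitched reading would be interpreted as a heading change, which is exactly the error the integrated tilt sensors are meant to remove.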
Mechanical tracking
This type of tracking system uses mechanical linkages between the reference and the tracked object. There are two types of mechanical trackers:
• The reference and the target are connected by a chain of linkages. The position is computed from the angles between the linkages, which are measured using potentiometers or incremental encoders.
• The reference and the target are connected by a system of wires. These wires are rolled on coils and tensed by a spring system in order to measure distances accurately.
The number of degrees of freedom depends on the construction. Most systems support 6 DOF, but only a limited range of motion is possible; the tracking range is about 1.8 m. Such a system can be used in an immersive human interface when the user does not need to walk very far. Mechanical linkages have found successful application in force-feedback systems, which are also useful in user interfaces. Mechanical tracking has also been utilized for augmented reality in a group of systems called AR telescopes. These track only the camera orientation (yaw and pitch) within a very limited range of movement; a change of position is impossible. They were eventually replaced by visual tracking because of their lack of accuracy: electronic rotation encoders are very sensitive to temperature.
Gravity sensors
Inclinometers are devices which measure their orientation with respect to the gravitational field. In most cases they consist of a closed vessel filled with a fluid and sensors which measure the fluid level; the orientation of the vessel is obtained by measuring the pressure or the level of the liquid. There are also solutions which use an electrolytic fluid or opto-electric sensors. The main limitations of this technology are the long response time caused by the viscosity of the liquid and the sensitivity to shocks and acceleration. The second problem can be solved by integrating inclinometers with accelerometers for shock measurement. Like inertial sensors and compasses, inclinometers do not need any external reference.
Ultra sound tracking
Ultrasound tracking is based on measuring the time of flight of a sound wave. The frequency of the pulse signal is between 20 kHz and 40 kHz, so the user cannot hear it. In order to track the position and orientation of an object, 3 or more emitters must be placed on it, and to measure the 3D position of each emitter, 3 receivers must be mounted on the reference. The emitters are small and lightweight and can easily be carried by a person or mounted on any object. They either send their signals sequentially or each uses a different frequency. The tracking accuracy is very good (0.5 mm - 6 mm). The limitations of such systems are the range (from 25 cm to 4.5 m) and the sensitivity to temperature, pressure, humidity and occlusion. All factors which affect the speed of the sound wave decrease the accuracy or make tracking impossible, which limits the application of such systems to indoor rooms.
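The temperature sensitivity follows directly from the time-of-flight equation, since the speed of sound in air grows by roughly 0.6 m/s per degree Celsius. A small numeric sketch (illustrative only, using the standard linear approximation for dry air):

```python
def speed_of_sound(temp_c):
    """Approximate speed of sound in dry air, in m/s; the linear
    temperature term dominates the environmental error budget."""
    return 331.3 + 0.606 * temp_c

def distance_from_tof(tof_s, temp_c=20.0):
    """One-way distance for a measured time of flight."""
    return speed_of_sound(temp_c) * tof_s

# A 10 ms flight at 20 C is about 3.43 m; assuming 0 C instead would
# shift the same measurement by roughly 12 cm.
d_warm = distance_from_tof(0.010, 20.0)
d_cold = distance_from_tof(0.010, 0.0)
```

A 20 °C error in the assumed temperature thus already exceeds the 0.5-6 mm accuracy of the trackers by two orders of magnitude, which is why commercial systems measure the air temperature.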
Ultra-Wideband
Ultra-Wideband is a radio technology similar to Bluetooth or WiFi. It operates with very low energy and uses a very wide frequency spectrum. The Ubisense company developed a tracking system based on this technology [uwb]. The system consists of emitters placed on the tracked objects and receivers mounted on the reference; the density of the receiver network is similar to the density of access points in a WLAN. The tracking accuracy is about 10-15 cm. It was designed to track people in large buildings. In general the system has 3 degrees of freedom, and extending it to 6 DOF would be hard or impossible because of the low accuracy.
2.1.4 Hybrid tracking
As we have seen, the technologies described in the previous sections all have advantages and disadvantages, and none of them is ideal for tracking in an outdoor environment. Intuition therefore suggests that a good system should combine two or more technologies. This approach is called hybrid tracking: the weakness of one technology is compensated by another. For example, a system can integrate GPS with a 3D accelerometer which tracks the position over short periods when the GPS signal is lost. In order to show how hybrid tracking works, I will present a few examples. The example projects are also summarized in table 2.6, which gives an overview of the techniques integrated in each of them.
Examples
System II (2003)
System II (described in paper [CMC03]) is very interesting because it uses a camera with a wide-angle 'fish-eye' lens (190°). Additionally the system is equipped with a 3D gyroscope, whose readings serve as a prediction for the feature tracker. The camera pose is estimated using the typical 'structure from motion' approach: the essential matrix is calculated, and the rotation matrix and translation vector are derived from it. With a fish-eye lens the spherical distortions are very strong and have to be compensated. The wide view has many advantages: the field of view covers a very large area, so even during large rotations some part of the field of view stays common to two successive frames, and the lens also allows movements along the focal axis of the camera to be estimated properly. But it also has disadvantages: the image resolution in front of the camera is very low, which generates estimation errors for movements perpendicular to the camera axis such as panning or tilting.
The system also provides 3D reconstruction, implemented as triangulation from 2 views and carried out during the essential matrix calculation. Note that the coordinates of these features are not utilized directly for tracking. The authors state that this is not a mature solution; moreover, they report that the system drifts.
System I (2004)
B. Jiang, U. Neumann and S. You presented a hybrid tracking system in their paper "A Robust Hybrid Tracking System for Outdoor Augmented Reality" [JNY04]. I call it 'System I' because it does not have a name of its own. It integrates a digital 3D gyroscope with visual tracking. The designers assumed that the change in view caused by a small rotation is much bigger than the change resulting from a small linear movement; hence the system contains a gyroscope but no accelerometer, which decreases the cost. The system is designed for urban environments. Visual registration is based on the detection of unknown lines, which are tracked from frame to frame. The global orientation is in general measured by the gyroscope, and the measurements are updated by the visual tracking system from time to time to prevent drift. Each camera pose estimate from the visual subsystem is evaluated as reliable or not, and only reliable estimates are used to correct the drift. If a measurement is pronounced unreliable, the visual subsystem goes back to the previous reliable state and estimates the movement to the current camera image. The technologies used here supplement each other: when the rotation of the camera is too fast for visual tracking, only the inertial subsystem is used, but when the user stops for a moment or starts to move slowly, the information from the camera corrects the drift. This system could not be used in a natural environment, because straight lines hardly exist outside the man-made world. However, the way of data processing proposed in this system is a very good basis for every system which integrates visual and inertial techniques.
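The drift-correction scheme can be illustrated by a minimal complementary filter. This is my own simplified sketch of the principle, not the authors' algorithm: the gyroscope is integrated every frame, and whenever the visual estimate is judged reliable the state is pulled slightly towards it.

```python
def fuse_yaw(prev_yaw, gyro_rate, dt, vision_yaw, vision_reliable, k=0.02):
    """One filter step: dead-reckon with the gyroscope, then blend in a
    fraction k of the visual estimate when it is judged reliable."""
    yaw = prev_yaw + gyro_rate * dt              # inertial prediction
    if vision_reliable:
        yaw = (1.0 - k) * yaw + k * vision_yaw   # slow drift correction
    return yaw

# Simulate a stationary camera seen by a gyro with a 0.5 deg/s bias,
# sampled at 100 Hz for 10 s; vision reports yaw = 0 every 5th frame.
yaw_gyro_only, yaw_fused = 0.0, 0.0
for step in range(1000):
    yaw_gyro_only = fuse_yaw(yaw_gyro_only, 0.5, 0.01, 0.0, False)
    yaw_fused = fuse_yaw(yaw_fused, 0.5, 0.01, 0.0, step % 5 == 0)
```

In this simulation the uncorrected yaw drifts by 5° over 10 s, while the occasionally corrected estimate stays bounded near 1.2°; the same qualitative behaviour is what the reliability gating in System I achieves.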
LIFEPLUS (2004)
The LIFEPLUS system [Vla04] was designed to support sightseeing at cultural heritage sites. It integrates visual tracking, DGPS and a compass. The system is large and complicated: besides tracking, it contains a network infrastructure based on GPRS and WLAN which allows access to a remote database of multimedia content. The system is designed to work over a very large area. The information from the camera and the compass is sufficient for tracking, but the GPS data provides the initial calibration and allows the other parts of the tracking system to work more reliably.
Vidente(2008)
The Vidente system [SMK+08], developed at the Graz University of Technology, has been designed to visualize pipes and cables under the surface of the earth. The tracking system integrates a GPS receiver supporting EGNOS and a very accurate inertial orientation sensor (InertiaCube3). The GPS delivers the position with an accuracy of a few meters. The InertiaCube3 is a very accurate sensor (less than one degree) and costs €2000 (data from http://www.cybermind.nl/Info/EURO_PriceList.htm#ISense). These two sensors deliver tracking accurate enough to render pipes and cables whose positions are taken from a GIS database. Such a tracking system is quite easy to implement and does not require much computational power, but it is too expensive.
WikiTude(2009)
WikiTude [Wika] has already been mentioned in this work. This system integrates GPS and a compass in order to show descriptions of interesting places on the earth. It is not hard to notice that such a tracking system is not very accurate, but rendering labels on big buildings or other large objects does not require high precision. This system shows how the weakness of one technique can be compensated by another: the GPS delivers the 3-DOF position and the compass 1 DOF of orientation. I suspect that the two other orientation parameters are sensed by accelerometers, but I did not find any extensive documentation of the software.
Table 2.6: Hybrid augmented reality systems

Name       Visual  Inertial  GPS  Compass
System I     •        •       ◦      ◦
WikiTude     ◦        •       •      •
Vidente      ◦        •       •      ◦
LIFEPLUS     •        ◦       •      •
System II    •        •       ◦      ◦
These few examples show that hybrid systems are a good direction of development. If one technology is too weak to support all possible conditions, why not add a second one which works better in specific situations? Combining technologies brings another profit: in visual tracking, for example, the computations are expensive and time-consuming, but hints from inertial sensors can simplify them.
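One concrete example of such a hint: a gyroscope measurement predicts how far a feature will move between two frames, so the visual tracker can search a small window around the predicted position instead of a window sized for the worst-case rotation. A rough small-angle sketch with illustrative numbers only:

```python
import math

def predicted_shift_px(yaw_rate_deg_s, dt_s, focal_px):
    """Small-angle estimate of the horizontal image shift caused by a
    pure yaw rotation between two frames: du ~ f * omega * dt."""
    return focal_px * math.radians(yaw_rate_deg_s) * dt_s

# At 30 deg/s and 30 fps with a 500 px focal length a feature moves about
# 8.7 px per frame, so without prediction the matcher must search at least
# that radius; centred on the gyro prediction, a few pixels suffice.
shift = predicted_shift_px(30.0, 1.0 / 30.0, 500.0)
```

Since matching cost grows with the area of the search window, shrinking its radius from ~9 px to ~3 px cuts the per-feature work by roughly an order of magnitude.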
2.2 Hardware
In this section I would like to discuss hardware platforms for augmented reality systems. A few years ago AR was implemented on wearable computers and the user watched the world through an HMD. These times are fortunately gone. Contemporary AR systems are implemented on tablet PCs or mobile phones equipped with a camera; we can say that the user looks through the handheld device, as the augmented camera image is displayed on the screen in real time. In the next sections I will consider 3 hardware solutions.
2.2.1 Ultra mobile PC and TabletPC
Ultra Mobile PCs (UMPCs) are handheld devices with a big display (7-10 inches). Their processors are compatible with the PC architecture, hence they run PC operating systems like Windows or Linux. The size of the RAM is also like in a normal PC. They have communication facilities such as USB ports, WiFi and Bluetooth, and some of them are equipped with a keyboard. Here I briefly analyse the usefulness of UMPCs in the 'Burgbau zu Friesach' project. Table 2.7 lists examples of UMPCs and TabletPCs.
Advantages:
• Big display (7-10 inches). A large display allows the user to see more. The age of the visitors of the exhibition will be diverse, so the devices have to be adjusted to the needs of all users; I am thinking here of elderly people, who often have problems with their eyesight. If the display is bigger, they feel more comfortable and do not have to use glasses. Bigger displays also have a higher resolution, which helps with rendering fonts, and they can fit a larger amount of text or graphics, which helps to avoid scrolling. The quality of the display is also important because of the lighting conditions: it must have a very high contrast to be readable in sunny weather.
• High portability. In the introduction I described a user story where a family walks through the construction site with a mobile device. A UMPC fits that scenario very well: it is lightweight and can work about 5 hours on battery. I am not sure whether it would last that long running an AR application; probably not, because of the high processor usage, so an additional battery may be necessary.
• Allows connecting additional devices (cameras, sensors). Many UMPCs available on the market have built-in cameras of quite high resolution, but these may not be sufficient for outdoor augmented reality; very often the camera is mounted on the wrong side, which prevents its use for AR purposes. Fortunately, UMPCs have USB ports (some of them even IEEE 1394), and most cameras and other sensors use these interfaces, so the possibilities for expansion are large. However, a UMPC equipped with an additional camera and sensors requires a casing which keeps all parts together; an example of a system based on a UMPC with housing and grip is presented in [SMK+08].
• Large computing power, much memory. The computing power of a UMPC is sufficient for augmented reality; the main difference from laptops is the single-core processor. UMPCs also have 2D/3D hardware graphics accelerators. The PC-like processor contains a floating point unit, which is useful for the calculations connected with tracking and rendering.
• Easy implementation of software. Software for such a device can be developed on a standard PC and simply run on the UMPC; no emulator or special compiler is needed, and the operating systems do not differ. The main difference lies in the hardware user interface: UMPCs have touch screens and sometimes QWERTY keyboards.
Disadvantages:
• The exhibitor must own the devices and rent them to the spectators, which increases the risk of theft or damage by the visitors.
• The UMPC can be too big. I know this contradicts what I said before, but the weight of a UMPC is about 800 g, and for some people walking with something so heavy in their hands can be uncomfortable.
Table 2.7: UMPCs and TabletPCs available on the market

Name                      Processor               Speed    RAM     Price  Cam/CB/WiFi/BT USB/COM/IEEE1394/GPU  Weight
Samsung Q1EX-71G          VIA Nano                1.2 GHz  2 GB    $749   1/1/1 2/0/0/1                        640 g
Samsung Q1 Ultra          Intel UltraMobile A110  800 MHz  1 GB    €1029  1/1/1 2/0/0/1                        860 g
ELV Touchscreen-Panel-PC  VIA Nano                1.0 GHz  1 GB    €1049  0/0/0 3/2/0/1                        3600 g
Gigabyte M704             VIA C7M ULV             1.2 GHz  768 MB  €745   1/0/1/1 2/0/1/1                      780 g

CB - camera on the back side
2.2.2 Mobile phone
Implementation of Augmented reality on the mobile phones is also possible but much harder than
on the PC. I am speaking here about phones which allow to run native code on their processors
(Symbian, WinCE, iPhone). Implementations of AR applications on J2ME phones are not known.
In comparison to contemporary PCs the computational power and amount of RAM is much lower.
Some mobile phones also do not support floating point operations. FPU is not essential, but
numeric algorithms behave more stable when they are working on FPU. I decided to discuss 4
mobile platforms:Symbian, WinCE, iPhone and Android. The last one is being programmed in
Figure 2.6: Examples of ultra mobile PCs: (a) Samsung Q1 Ultra, (b) Gigabyte M704
Java, but it does not contain the standard Java VM known from PCs or J2ME phones; hence it is faster and allows running native code on the processor. As crucial criteria I consider:
• Presence and quality of the camera.
• Presence of inertial sensors. More and more new mobile phones are equipped with accelerometers and gyroscopes.
• Ability to run native code implemented in C/C++. Most computer vision libraries are implemented in C/C++, and mathematical libraries as well. C/C++ guarantees the best trade-off between code speed and programming comfort. However, porting software from a PC to a mobile phone requires a complete re-engineering of the code (described in [WS09]). But there