Building and Evaluation of a Mosaic of Images using
Aerial Photographs
Tiago André Simões Coito
Dissertação para obtenção do Grau de Mestre em
Engenharia Mecânica
Júri
Presidente: Prof. Mário Manuel Gonçalves da Costa
Orientador: Prof. João Rogério Caldas Pinto
Coorientador: Prof. José Raul Carreira Azinheira
Vogal: Doutor Lourenço da Penha e Costa Bandeira
Novembro de 2012
“We cannot solve problems by using the same kind
of thinking we used when we created them.”
Albert Einstein
Acknowledgements
I would like to thank my family, because without them this work would not have been possible.
I would also like to thank my advisors: Professor João Caldas Pinto, for the help, dedication, availability and especially the patience to deal with my stubbornness and laziness, and Professor José Azinheira, in particular for his constructive criticism on the way I was writing the dissertation.
I am also grateful to my colleagues, especially to Pedro Frazão, for the excellent example of what it is to be dedicated and hardworking and for showing me that there are only advantages in doing everything in advance, and to José Azevedo, for his friendship and for always being available to help.
Resumo
To solve the image mosaicing problem, this work first uses the Harris-Laplace algorithm to find invariant interest points. By weighting the image intensities with a square-window Gaussian filter, it is possible to compute a fingerprint (descriptor) for each interest point using the method described in the SIFT algorithm. Ransac, together with the DLT algorithm, is used to compute the projective transformations between two images from a set of putative correspondences. To find the putative correspondences, the ratio of the Euclidean distances of the first over the second nearest neighbor of each interest point is used. The computed homographies are then used to build the image mosaic.
To study the robustness of the method, a simulator was developed to take aerial photographs of an image representing the Earth's surface. Adding small variations to the parameters involved in obtaining each of the photographs, error measures, namely ones based on methods for decomposing the homography matrices, were used to compare the results between the estimated and the exact mosaics.
The results demonstrate the importance of minimizing the tilt angles of the quadrotor, as well as the need for a higher overlap percentage between images. The results also show that the images should be taken at high altitudes, to avoid parallax as much as possible. Nevertheless, the method studied proved to be invariant to translations, rotations, scale variations, brightness and contrast changes, perspective transformations, and even to the presence of some noise.
Keywords: Image mosaicing; aerial photography; UAV; SIFT; Ransac
Table of Contents
Acknowledgements
Abstract
Resumo
Table of Contents
List of Figures
List of Tables
List of Acronyms
List of Symbols
2. State of the Art
2.1. Research and application fields of image mosaicing
2.2. Correct geometric deformations using data and/or camera models
2.3. Image registration using data and/or camera models
2.3.1. Feature based methods
3.2.2. Nadir view
Appendix A
Appendix B
Appendix C
Appendix D
Appendix E
List of Figures
Fig. 1.1 – Uavision quadrotor (small scale UAV) used in this project.
Fig. 1.2 – Global scheme of the proposed algorithm.
Fig. 3.1 – Common geometric transformations. The first three cases are typical examples of affine transformations. The remaining two are the common cases where perspective and polynomial transformations are used, respectively. [21]
Fig. 3.3 – Diagram showing the relationship between the zenith, the nadir, and different types of horizon. The zenith is opposite the nadir. [87]
Fig. 3.4 – Perspective distortion resulting from a pitch rotation.
Fig. 3.5 – Perspective views of an A4 sheet.
Fig. 3.6 – From left to right: a) Nadir view; b) Small pitch rotation; c) High pitch rotation; d) Scheme of the three views. It works similarly for the roll axis. [88]
Fig. 3.7 – Orthographic views project at a right angle to the datum plane. Perspective views project from the surface onto the datum plane from a fixed location. [91]
Fig. 3.8 – A simplified illustration of the parallax of an object against a distant background due to a perspective shift. When viewed from "Viewpoint A", the object appears to be in front of the blue square. When the viewpoint is changed to "Viewpoint B", the object appears to have moved in front of the red square. [94]
Fig. 3.9 – Parallax as a consequence of ground relief for high pitch angles.
Fig. 3.10 – Parallax on nadir photos: the bigger the angle of view of the camera, the worse it gets.
Fig. 3.11 – For successively higher altitudes (nadir views) the distortions diminish for the same photographed area; however, the resolution diminishes too.
Fig. 3.12 – The same camera tilts at successively higher altitudes give worse results.
Fig. 3.13 – a) Distortion-free image; b) Barrel distortion (fisheye lens); c) Pincushion distortion; d) Example of a complex distortion. [95]
Fig. 3.14 – Observed object from different viewpoints.
Fig. 3.15 – Contours of constant response are shown by the fine lines.
Fig. 3.16 – Schematic of the D. Lowe scale space.
Fig. 3.17 – Scale levels for each image of each octave. [103]
Fig. 6.4 – On the left are the inliers (the same as in Fig. 6.3), but now with a rectangle around each one defining the neighborhood of that point used to find its descriptor. The size of the square is set according to its characteristic scale. On the right is a zoomed-in view where we can see that each interest point has its own canonical orientation.
Fig. 6.5 – Ratio between the overlap percentages. Concept.
Fig. 6.6 – Scheme used for the pitch-overlap study.
Fig. 6.7 – Scheme used for the roll-overlap study.
Fig. 6.8 – Scheme used for the yaw-overlap study.
Fig. 6.9 – Scheme used for the scale-overlap study.
Fig. 6.10 – Contrast variations (cases a-c).
Fig. 6.11 – Brightness variations (cases a-c).
Fig. 6.12 – Noise (cases a-b).
Fig. 6.13 – The red asterisks represent the coordinates where each photograph was taken. a) Photographs taken. b) Desired positions and trajectory without errors.
Fig. 6.14 – a) Exact mosaic. b) Estimated mosaic with the "maximum" strategy between the overlapping regions (see Fig. 5.4). c) Estimated mosaic with the "first in stays in" strategy between the overlapping regions (see Fig. 5.5).
Fig. 6.15 – Two mosaics obtained from two pairs of images. 80 inliers were involved on the left and 16 on the right.
Fig. 6.16 – Mosaic from three images. 59 inliers between the first and the second image and 48 between the second and the third.
Fig. 6.17 – Results from the flight. a) Trajectory; b) Flight altitude.
To identify interest points in scale space, Lindeberg [100] has shown that, under some rather general assumptions on scale invariance, the Gaussian kernel and its derivatives are the only possible smoothing kernels for scale space analysis. Therefore, the scale space of an image is defined as a function:

L(x, y, σ) = G(x, y, σ) ∗ I(x, y)    (3.29)

Where G(x, y, σ) is the Gaussian kernel used in the convolution (eq. (3.17)) and I(x, y) is the image.
To detect stable interest points, D. Lowe proposed to use scale space extrema of the Difference of Gaussian (DoG) function D(x, y, σ) convolved with the image I(x, y), which can be computed from the difference of two nearby scales separated by a constant multiplicative factor k:

D(x, y, σ) = (G(x, y, kσ) − G(x, y, σ)) ∗ I(x, y) = L(x, y, kσ) − L(x, y, σ)    (3.30)
The Laplacian of Gaussian, ∇²G, gives the second image derivative, which is good for locating interest points such as corners and edges. Lindeberg [100] showed that true scale invariance is achieved by normalizing the Laplacian, ∇²G, with the factor σ², and Mikolajczyk [101] found that the maxima and minima of σ²∇²G produced the most stable image features compared to a range of other possible image functions. However, it is computationally heavy to calculate Laplacians of Gaussian. According to D. Lowe, the difference-of-Gaussian function provides a close approximation to the scale-normalized Laplacian with less computational effort:

G(x, y, kσ) − G(x, y, σ) ≈ (k − 1) σ²∇²G    (3.31)
Where the factor (k − 1) is constant over all scales and so has no influence on extrema location. k = √2 was found to be a good scale space factor with almost no influence on the stability of extrema detection and localization.
Constructing the scale space from DoGs

The philosophy is to use a set of octaves, each composed of a set of scale levels. The author proposed to use four octaves, each with five scale levels [30][102], but other refinements can be used. To get an image in one scale level, a blurring step must be applied to the previous image of the same octave. The first image of each octave is obtained by downsampling the third scale level image of the previous octave. The first image of the first octave is obtained from the initial image. Finally, as we said, DoGs are obtained by performing the difference of two nearby scales (see Fig. 3.16).
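To make the construction concrete, the following Matlab sketch builds such a pyramid under the four-octave, five-level configuration mentioned above. It is only an illustration: the incremental blur schedule and the parameter names (nOctaves, nLevels, sigma0) are assumptions, not D. Lowe's exact implementation.

    % Sketch: Gaussian and DoG stacks (4 octaves, 5 blurred levels each).
    % I is a grayscale image of class double; the blur schedule below is a
    % simplification of the ones discussed in the text.
    nOctaves = 4; nLevels = 5; sigma0 = 1.6; k = 2^(1/(nLevels - 3));
    G = cell(nOctaves, nLevels);      % blurred images
    D = cell(nOctaves, nLevels - 1);  % difference-of-Gaussian images
    I = im2double(imread('cameraman.tif'));   % any test image
    for o = 1:nOctaves
        for s = 1:nLevels
            if o == 1 && s == 1
                G{o, s} = I;                              % base of the pyramid
            elseif s == 1
                G{o, s} = G{o-1, 3}(1:2:end, 1:2:end);    % downsample the 3rd level
            else
                sigma = sigma0 * k^(s - 2);               % (simplified) extra blur
                h = fspecial('gaussian', 2*ceil(3*sigma) + 1, sigma);
                G{o, s} = imfilter(G{o, s-1}, h, 'replicate');
            end
        end
        for s = 1:nLevels - 1
            D{o, s} = G{o, s+1} - G{o, s};                % DoG, eq. (3.30)
        end
    end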
A close analysis shows that D. Lowe used multiple octaves but just one DoG per octave in his first article on SIFT [52]. Later, in his second article [30], he presented a generalization of his method, introducing multiple levels per octave. Despite the fact that his first article seems clear, several of the initial constants used were changed and some of the modifications introduced were not clarified. Due to such changes, the scale space step became the reason why there are several slightly different versions of the same algorithm available on the Internet [103][102][104].
We will present here a schematic to ease the reasoning of how to construct a scale space from DoGs, confronting the previous interpretations [103][102][104] with D. Lowe's articles:
Fig. 3.16 – Schematic of the D. Lowe scale space.
D. Lowe concluded that blurring the initial image led to an improvement of the results; however, if we smooth the initial image before extrema detection we are effectively discarding the highest spatial frequencies [102][30]. To solve this, he doubled the size of the input image (increasing the number of stable keypoints by almost a factor of 4) using linear interpolation prior to building the first level of the pyramid. He assumed that the original image had a blur of at least σ = 0.5 (the minimum needed to prevent significant aliasing), and that therefore the doubled image would have a blur of σ = 1.0 relative to its new pixel spacing.
In D. Lowe's first paper [52], as we said, just the first DoG of each octave was used; both smoothing steps used σ = √2 (so k = √2). A pre-blur was also applied before the first level of each octave (not represented).
In the second paper [30], D. Lowe changed the amount of prior smoothing, σ, to 1.6, which provides close to optimal repeatability according to his results. The downsampling was set to a factor of 2, by simply taking every second pixel in each row and column. Despite the fact that this resampling should in principle be interpolated, D. Lowe claims that there is no real change in the accuracy of the results, while the computation time is greatly reduced (no need for bilinear interpolations).
In the implementation [102], the constants were set to the same values used in [30]. A. Vedaldi's implementation [103], on the other hand, presents a different solution: instead of blurring the initial image with σ = 0.5 and then with the prior smoothing, he just uses a single equivalent blur right after doubling the initial image. He also uses a slightly different base scale σ₀, while the numbers of octaves and scale levels remain 4 and 5, respectively.
Finally, an important aspect to mention is that D. Lowe used different constant values for k. According to his papers, if s is the number of intervals into which each octave of scale space is divided [30], then k is given by:

k = 2^(1/s)    (3.32)

For this case we were naturally considering s = 2, which produced s + 3 = 5 images in the stack of blurred images for each octave.
Characteristic scale of a point that lies within a specific DoG level

The scale of each point, D. Lowe says, is simply that of the smaller Gaussian used in the difference-of-Gaussian function of eq. (3.30). In other words, the scale of a point found within a DoG image is the amount of blur, σ, used on the smaller of the two Gaussian images used to calculate that DoG.
A. Vedaldi defined the scale for each of his levels as σ(o, s) = σ₀·2^(o + s/S), where o is the octave number, o = −1, 0, 1, …, and s is the scale level, s = 0, …, S − 1. He simply began with o = −1 because he considered that when D. Lowe initially doubled the size of the image he was taking a step backwards in scale space, and so the scale for the first level of the first octave should be less than 1 (σ < 1). Utkarsh [102], on the other hand, used simply the relation:

σ(o, s) = σ₀·2^o·k^s    (3.33)
Substituting we have:
Fig. 3.17 – Scale levels for each image of each octave. [102]
Observing these results, one sees that the first level of each octave (excluding the first) corresponds to the third level of the previous octave [30][102].
Interest point location
Interest point locations are simply given by the local maxima and minima of the DoG functions.
Each sample point is compared to its eight neighbors in the current image and nine neighbors in the
scale above and below. It is selected only if it is larger than all of these neighbors or smaller than all of
them.
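A direct (unoptimized) Matlab sketch of this 26-neighbor test, reusing the DoG stack D from the earlier sketch, could look like this; the variable names are assumptions:

    % Sketch: local extrema of the DoG stack of one octave (26-neighbour test).
    o = 1;                           % octave to search
    dog = cat(3, D{o, :});           % stack that octave's DoG images
    [nr, nc, ns] = size(dog);
    points = [];                     % [row, col, level] of candidate extrema
    for s = 2:ns-1
        for r = 2:nr-1
            for c = 2:nc-1
                cube = dog(r-1:r+1, c-1:c+1, s-1:s+1);  % 27 samples, centre included
                v = dog(r, c, s);
                if v == max(cube(:)) || v == min(cube(:))
                    points(end+1, :) = [r, c, s];       %#ok<AGROW>
                end
            end
        end
    end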
3.7.1. Assignment of a canonical orientation to each interest point
The idea of assigning this orientation is to provide rotation invariance to all the keypoints. The method is very simple: the scale of the keypoint is used to select the Gaussian smoothed image, L (eq. (3.29)), with the closest scale, so that all computations are performed in a scale-invariant manner. For each image sample, L(x, y), at this scale, the gradient magnitude, m(x, y), and orientation, θ(x, y), are computed using pixel differences for all the pixels within the image [52][30]:

m(x, y) = √((L(x+1, y) − L(x−1, y))² + (L(x, y+1) − L(x, y−1))²)    (3.34)

θ(x, y) = tan⁻¹((L(x, y+1) − L(x, y−1)) / (L(x+1, y) − L(x−1, y)))    (3.35)
Next, an orientation histogram is formed from the gradient orientations of sample points within a region around the keypoint (this region is proportional to the scale of the keypoint). The orientation histogram has 36 bins of 10 degrees each (covering the 360 degrees). Each sample added to the histogram is weighted by its gradient magnitude and by a Gaussian-weighted circular window with a σ of 1.5 times the scale of the keypoint8.
Finally, the highest peak in the orientation histogram is selected for that keypoint. Any other peak that falls within 80-100% of the highest peak is also selected, and a new point is assigned (with the same scale) for that orientation (Fig. 3.18). A parabola was used by D. Lowe to improve the accuracy of the peak position, always considering the three closest histogram values of each interest point for the interpolation.
8 Using Gaussian kernels the amount added also depends on the distance from the keypoint. So gradients
that are far away from the keypoint will add smaller values to the histogram.
Fig. 3.18 – Orientation histogram [102].
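A minimal Matlab sketch of this orientation assignment for a single keypoint follows; the smoothed image L, the keypoint position (r, c) and its scale sigma are assumed inputs (random data is used so the snippet runs on its own):

    % Sketch: canonical orientation of one keypoint (36 bins of 10 degrees).
    L = rand(100, 100);              % stand-in for the Gaussian-smoothed image
    r = 50; c = 60; sigma = 1.6;     % keypoint position and scale (assumed)
    radius = round(3 * 1.5 * sigma); % neighbourhood proportional to the scale
    hist36 = zeros(1, 36);
    for r2 = max(2, r-radius):min(size(L,1)-1, r+radius)
        for c2 = max(2, c-radius):min(size(L,2)-1, c+radius)
            dx = L(r2, c2+1) - L(r2, c2-1);          % pixel differences,
            dy = L(r2+1, c2) - L(r2-1, c2);          % eqs. (3.34)-(3.35)
            m = sqrt(dx^2 + dy^2);
            theta = mod(atan2(dy, dx), 2*pi);
            w = exp(-((r2-r)^2 + (c2-c)^2) / (2*(1.5*sigma)^2)); % circular window
            b = min(36, floor(theta / (2*pi/36)) + 1);
            hist36(b) = hist36(b) + m * w;           % magnitude-weighted vote
        end
    end
    [~, peakBin] = max(hist36);      % canonical orientation (highest peak)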
3.7.2. Interest point description
Following [52][30], the first step is to use, as before in section 3.7.1, the scale of the keypoint to select the Gaussian smoothed image, L, with the closest scale, and thus compute gradient magnitudes, eq. (3.34), and orientations, eq. (3.35), for that keypoint in a scale-invariant manner.
Next, the descriptor coordinates and the gradient orientations are rotated relative to the
canonical orientation of each keypoint which was obtained in section 3.7.1. This will guarantee rotation
invariance.
In the next step we weight the gradient magnitudes around each keypoint with a circular Gaussian window with a σ of one half the width of the descriptor window. Its purpose will be described soon.
To find the descriptor, D. Lowe began by considering a 16 × 16 window around each keypoint, as represented in Fig. 3.19:
Fig. 3.19 – Interest point description [102].
From Fig. 3.19 one may see that the keypoint lies "in between": it does not lie exactly on one of the entries of the window, and indeed it does not need to. The window takes orientations and magnitudes of the image "in between" pixels, so we need to interpolate the image to generate orientation and magnitude data "in between" pixels [102][30].
As we can see from Fig. 3.19, we divide each 16 × 16 window into sixteen 4 × 4 small windows. Now, assigning the previously calculated Gaussian-weighted gradient magnitudes of each 4 × 4 window into orientation histograms of 8 bins each (Fig. 3.20), we obtain a 4 × 4 × 8 = 128 dimensional vector which summarizes the content of all the small windows, from the top left to the bottom right (Fig. 3.19).
Fig. 3.20 – Orientation histograms. Adapted from [102].
Finally, to reduce the effects of illumination changes, the descriptor vector, let us call it v, is normalized to unit length by dividing the vector by the square root of the sum of the squares of all its entries:

v ← v / √(Σₖ vₖ²)    (3.36)
A change in image contrast in which each pixel value is multiplied by a constant factor will
multiply gradients by the same constant, so this contrast change will be canceled by vector
normalization. A brightness change in which a constant is added to each image pixel will not affect the
gradient values, as they are computed from pixel differences [30].
To reduce the effects of non-linear illumination changes (such as camera saturation or simply solar reflections), which can cause large changes in the relative magnitudes of some gradients but are less likely to affect the gradient orientations, D. Lowe reduced the influence of large gradient magnitudes by thresholding the values in the normalized vector to each be no larger than 0.2 (experimentally determined), and then renormalizing to unit length.
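In Matlab, the normalization and clamping just described amount to three lines (v stands for the raw 128-dimensional descriptor; random data is used for illustration):

    % Sketch of eq. (3.36) plus the 0.2 clamp.
    v = rand(128, 1);         % stand-in for the raw descriptor vector
    v = v / norm(v);          % normalise to unit length (contrast invariance)
    v = min(v, 0.2);          % damp large magnitudes (non-linear illumination)
    v = v / norm(v);          % renormalise to unit length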
3.8. Putative correspondences
Given two sets of interest points, where each interest point is described by a vector, the aim of this step is not perfect matching (it is difficult to guarantee correct matches), but to provide an initial point correspondence set (putative correspondences) in order to later use another method (e.g. Ransac) to eliminate the mismatches [62].
According to D. Lowe [30], the best candidate match for each keypoint is found by identifying its nearest neighbor, defined as the keypoint with the minimum Euclidean distance for the invariant descriptor described in section 3.7.2.
The Euclidean distance, d(aᵢ, bⱼ), between the vector aᵢ describing interest point i from image 1 and the vector bⱼ describing interest point j from image 2 is given by:

d(aᵢ, bⱼ) = √(Σₖ (aᵢₖ − bⱼₖ)²)    (3.37)

Thus, what we are looking for is the minimum value of each column/row9 of the matrix of distances.
Briefly, a more effective measure is obtained by comparing the distance of the closest neighbor to that of the second-closest neighbor. According to D. Lowe, if we reject all matches in which the distance ratio is greater than 0.8 (eq. (3.38)), we should get rid of 90% of the false matches while discarding less than 5% of the correct matches:

d₁ / d₂ < 0.8, where d₁ and d₂ are the distances to the first and second nearest neighbors    (3.38)
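A sketch of this matching step, with descriptor sets A and B (one descriptor per row, illustrative random data):

    % Sketch: putative correspondences by the 1st/2nd nearest-neighbour ratio.
    A = rand(10, 128); B = rand(15, 128);   % descriptors of images 1 and 2
    matches = [];                           % rows: [index in A, index in B]
    for i = 1:size(A, 1)
        d = sqrt(sum(bsxfun(@minus, B, A(i, :)).^2, 2));  % eq. (3.37)
        [ds, idx] = sort(d);
        if ds(1) / ds(2) < 0.8                            % ratio test, eq. (3.38)
            matches(end+1, :) = [i, idx(1)];              %#ok<AGROW>
        end
    end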
3.9. Robust estimation using Ransac
Until now, all efforts were made in order to find a set of corresponding points between two images. However, errors in the previous steps must be considered. There are two main sources of error: the measurement of point positions, which is assumed to follow a Gaussian distribution, and mismatched points (outliers) resulting from the previous subchapter (section 3.8). As we said previously (section 1.4), our final goal is to find projective transformations or homographies that describe the relations between pairs of images; therefore, our objective in this subchapter is to find a set of at least four correctly matched points (inliers) so that homographies may be estimated in an optimal manner. Robust estimation comes from the fact that the methods used are robust (tolerant) to outliers (measurements following, possibly, an unmodelled error distribution). This subchapter will follow Richard Hartley and Andrew Zisserman's book [62].
9 In fact, we may obtain two different results if we just use the minimum values provided by the columns or by the rows.
Ransac
RANdom SAmple Consensus (Ransac) [61] is a commonly used method for robust estimation, since it is able to cope with a large proportion of outliers. The idea behind it is very simple. Consider a set of 2D points lying on a plane, and suppose we want our model to be the line that best fits these points. We first select two points randomly; these points define a line. The support for this line is measured by the number of points that lie within a distance threshold (for example t = 1.96σ, where σ is the standard deviation of the perpendicular distance of all the points to the mentioned line). This procedure is repeated N times (so each repetition picks two random points) and the line with the most support is deemed the robust fit; the points within the distance threshold are the inliers and the others outliers. Naturally, outliers will lead to lines with very little support.
Now considering our case, we want our model to be a planar homography, so instead of two points we will need a minimal set of four point correspondences. Next, an automatic estimation of the homography between two images using the Ransac robust estimation algorithm, adapted from Richard Hartley and Andrew Zisserman's book [62], is summarized:
Objective: Compute the 2D homography between two images.
Given: set of n putative correspondences between the interest points of each image.
Algorithm:
1. Select a random sample of 4 putative correspondences and compute the homography H;
2. Calculate the distance d for each putative correspondence;
3. Compute the number of inliers consistent with H as the number of correspondences for which d < t pixels (t is derived from σ, the standard deviation of the measurement error; see below);
4. If the number of inliers is above some threshold value T, go to step 7;
5. If not, repeat steps 1 to 4 for N samples (N, as we will see later, is the number of trials);
6. Since no number of inliers was bigger than T, choose the H with the highest support (largest consensus set); in the case of ties, choose the solution that has the lowest standard deviation of inliers;
7. Re-estimate H using all the correspondences classified as inliers.
A Matlab sketch of this loop is presented next.
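The helper dlt below is a minimal, hypothetical stand-in for the DLT of section 3.5 (no data normalisation), N is fixed instead of adaptive, and the symmetric transfer error of eq. (3.39) is used; points are assumed to be 3 × n homogeneous with third coordinate 1.

    function [H, inliers] = ransacHomography(x1, x2, sigma)
    % Sketch of the Ransac algorithm above. x1, x2: 3 x n homogeneous putative
    % correspondences (third row equal to 1); sigma: assumed measurement noise.
    n  = size(x1, 2);
    t2 = 5.99 * sigma^2;                     % squared threshold, eq. (3.42)
    N  = 500;                                % fixed number of trials (simplified)
    bestInliers = [];
    for trial = 1:N
        sel = randperm(n, 4);                % step 1: minimal sample
        H = dlt(x1(:, sel), x2(:, sel));
        fw = H * x2;   fw = bsxfun(@rdivide, fw, fw(3, :));  % H*x2 -> image 1
        bw = H \ x1;   bw = bsxfun(@rdivide, bw, bw(3, :));  % inv(H)*x1 -> image 2
        d2 = sum((x1(1:2,:) - fw(1:2,:)).^2) + ...           % symmetric transfer
             sum((x2(1:2,:) - bw(1:2,:)).^2);                % error, eq. (3.39)
        inliers = find(d2 < t2);             % step 3: consensus set
        if numel(inliers) > numel(bestInliers), bestInliers = inliers; end
    end
    inliers = bestInliers;
    H = dlt(x1(:, inliers), x2(:, inliers)); % step 7: re-estimate on all inliers
    end

    function H = dlt(p1, p2)
    % Minimal DLT (section 3.5): finds H such that p1 ~ H * p2.
    n = size(p1, 2); A = zeros(2*n, 9);
    for i = 1:n
        X = p2(:, i)'; x = p1(1, i); y = p1(2, i); w = p1(3, i);
        A(2*i-1, :) = [zeros(1, 3), -w*X,  y*X];
        A(2*i,   :) = [ w*X, zeros(1, 3), -x*X];
    end
    [~, ~, V] = svd(A);                      % null vector = smallest singular
    H = reshape(V(:, 9), 3, 3)';             % vector, reshaped row-wise
    end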
Consider the calculations involved in the previous algorithm. We have already explained how to compute a homography from four 2D-to-2D point correspondences, as well as for over-determined solutions (section 3.5); therefore, only the remaining distance d and the thresholds t, N and T are discussed next.
Distance measure d:
The simplest method of assessing the error of a correspondence from a homography H is to use the symmetric transfer error:

d² = d(x, H⁻¹x′)² + d(x′, Hx)²    (3.39)

Where x ↔ x′ is the point correspondence. In this method, measurement errors occur in both images. Naturally, d(·, ·)² gives the square of the point distance, which is the sum of the squared differences in the x and y measurements.
The reprojection error is an example of a better, though more expensive, distance measure:

d² = d(x, x̂)² + d(x′, x̂′)², subject to x̂′ = H x̂    (3.40)

Where x̂ ↔ x̂′ is the perfect correspondence. In the symmetric transfer error we had the measured coordinate values x and x′; however, for the reprojection error the points x̂ and x̂′ need to be estimated in such a way that the reprojection error is minimum (a cost function is involved). For readers who wish to know more about the reprojection error, the following bibliographic reference is suggested: [62].
Fig. 3.21 – A comparison between symmetric transfer error (upper) and reprojection error (lower) [62].
Distance threshold t:
If we assume the measurement error of each point to be Gaussian with zero mean and standard deviation σ, then the square of the point distance, d², will be a sum of squared Gaussian variables and follows a χ²ₘ distribution with m degrees of freedom (m, the codimension of the model, equals two for homographies).
The probability that the value of a χ²ₘ random variable is less than k² is given by the cumulative chi-squared distribution:

Fₘ(k²) = ∫₀^k² χ²ₘ(ξ) dξ    (3.41)
Usually α is chosen as 0.95, so that there is a 95% probability that the point is an inlier:

t² = Fₘ⁻¹(α) σ², with F₂⁻¹(0.95) = 5.99 for homographies    (3.42)
Number of inliers T:
To calculate this value, a conservative estimate of the proportion of outliers, ε, must be made. We know the total number, n, of putative correspondences received by the Ransac algorithm, so we then just need to compute:

T = (1 − ε) n    (3.43)

This threshold value is set aside most of the time, since we do not know a priori the fraction of the data consisting of outliers; however, when we do, it can be computationally friendly. This value will be considered again in chapter 5.
Number of trials N:
Naturally, the usage of random samples of s putative correspondences has the sole objective of reducing the computational effort, so the question that arises is: how many samples do we need so that, with probability p, at least one of the samples of s correspondences is free from outliers? To answer this question, let w be the probability that any selected data point (correspondence) is an inlier, and thus ε = 1 − w the probability that it is an outlier. Then at least N selections of s correspondences are required:

N = log(1 − p) / log(1 − wˢ)    (3.44)
Usually p is chosen as 0.99; however, the same problem as before for T appears: we do not know the value of ε. An adaptive method to find N is described next.
First we need initial values. The worst case guess is to consider ε = 0.5, since for higher values the algorithm is likely to fail. Considering s = 4, the initial value for N is, according to eq. (3.44), 72. Whenever Ransac finds a consensus set containing more than 50% of the data points, we then know that there is at least that proportion of inliers. Basically, each time a sample gives a higher percentage of inliers than all the previous results, we just have to save the best score so far and update N to the new, lower value. N decreases as ε decreases; thus, when the repetitions are over, our best score is determined. It may also occur that the updated N is smaller than the number of samples that have already been performed; in such cases the algorithm terminates.
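The adaptive update can be sketched as follows; the consensus-set size is simulated here (randi) only so that the snippet runs stand-alone, whereas in the real algorithm it comes from steps 1-3 above:

    % Sketch of the adaptive computation of N (p = 0.99, s = 4).
    p = 0.99; s = 4; n = 200;            % n putative correspondences (assumed)
    N = 72;                              % worst-case initial value (eps = 0.5)
    trial = 0; bestCount = 0;
    while trial < N
        trial = trial + 1;
        inlierCount = randi([50 180]);   % stand-in for the consensus-set size
        if inlierCount > bestCount       % best score so far improved
            bestCount = inlierCount;
            w = bestCount / n;           % estimated inlier probability
            N = ceil(log(1 - p) / log(1 - w^s));   % eq. (3.44), updated downwards
        end
    end
    fprintf('stopped after %d trials, final N = %d\n', trial, N);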
3.10. Homography decomposition
As we will see later this section is of extreme importance in the error analysis of the mosaicing
method developed, it will allow to compare the exact homographies obtained from chapter 4, where
we know the exact translations and rotations used for the quadrotor when obtaining each of the
photographs, to the ones obtained through mosaicing methods.
The problem of Euclidean homography decomposition, also called Euclidean reconstruction
from homography, is that of retrieving the elements , and from the matrix [105]:
(3.45)
Soon each of these variables will be introduced.
Before eq. (3.45), the homography obtained from either the exact or the estimated method must be corrected with the calibration matrix:

H = K⁻¹ H_I K    (3.46)

The calibration matrix, K, will be defined later in eq. (4.20), but without the minus signs in the entries (1, 1) and (2, 2) (see section 4.5). Next, we need to normalize H in such a way that its median singular value equals one. We saw before, in note 5, that this is achieved through the equation:

H ← H / σ₂(H)    (3.47)
Now that we are in a position to continue our previous reasoning from eq. (3.45), let us begin by considering two different camera frames, the current and the desired, F and F* in the figure respectively:
Fig. 3.22 – Desired and current camera frames, and the notation involved [105].
The homogeneous transformation matrix converting point coordinates from the desired frame to the current frame is:

T = [R t; 0 0 0 1]    (3.48)

Where R and t are the rotation matrix and translation vector, respectively. In the figure, the distances from the object plane to the corresponding camera frames are denoted as d and d*, and n* is the normal to the plane.
Briefly, and not to enter into too much detail, we obtain:

H = R + t_d n*ᵀ    (3.49)

Where t_d = t / d* is the translation normalized with respect to the plane depth d*.
A very good self-explanatory presentation of the equations involved can be found in [106] for interested readers. The concept is to use the Singular Value Decomposition (SVD) of the homography matrix:

H = U D Vᵀ, with D = diag(d₁, d₂, d₃)    (3.50)

and then use the singular values in D, as well as U and V, to find R, t_d and n*. For this, the main equations will be presented here. First, using this decomposition, we obtain the new equation:

D = R′ + t′ n′ᵀ    (3.51)

Where R′, t′ and n′ are related to R, t_d and n* by:

R = U R′ Vᵀ,  t_d = U t′,  n* = V n′    (3.52)

n′ can be calculated from:

n′ = (x₁, 0, x₃)ᵀ    (3.53)

Wherein:

x₁ = ε₁ √((d₁² − d₂²) / (d₁² − d₃²)),  x₃ = ε₃ √((d₂² − d₃²) / (d₁² − d₃²)),  ε₁, ε₃ = ±1    (3.54)
The values d₁ ≥ d₂ ≥ d₃ are the singular values of H (their squares are the eigenvalues of HᵀH).
t′ is obtained from:

t′ = (d₁ − d₃) (x₁, 0, −x₃)ᵀ    (3.55)

To finish, we can compute the matrix R′ as:

R′ = [cos θ  0  −sin θ; 0  1  0; sin θ  0  cos θ]    (3.56)

Where:

sin θ = (d₁ − d₃) x₁ x₃ / d₂,  cos θ = (d₁ x₃² + d₃ x₁²) / d₂    (3.57)

And the vector t_d as:

t_d = U t′    (3.58)
Finally, we can decompose R into the known Euler angles and extract the roll, pitch and yaw angles. The equations that give these angles will be presented only in chapter 5, due to some practical aspects we have to take into account.
4. Simulation of taking aerial photographs with UAVs
The major purpose of this chapter is to determine the transformations that allow us to go from the coordinates of a photographed point (X_W) in the world reference frame (GPS) to a pixel point (x_I) in the picture reference frame (eq. (4.1)). Naturally, for this purpose it is necessary to know the location of the optical center of the camera in the world reference frame, as well as the pitch, roll and yaw rotations of the UAV at the precise moment when the photo is taken. Camera parameters, such as the focal length, must also be known.

λ x_I = C T_CW X_W    (4.1)
Eq. (4.1) divides the problem in two parts. The first relates the transformation from the world reference frame to the camera reference frame (eq. (4.2))10:

T_CW = T_CU T_UW    (4.2)
The second goes from the camera reference frame to the image plane. For this, the pinhole camera model will be discussed later, together with the calculation of the calibration matrix K, a 3 × 3 matrix that contains the intrinsic parameters of the camera. The camera matrix C, 3 × 4, will be given by:

C = T_I Π    (4.3)

Later, T_I (the pixel-frame transformation of section 4.5) and the projection Π (section 4.4) are discussed.
4.1. World reference frame
In the real application of this problem the origin of the world reference frame and the
implemented coordinate system would respect the Global Positioning System (GPS). However in this
simulation instead of the World Geodesic System of 1984 (WGS84) used by the GPS, the system
adopted was in cartesian coordinates of the form: . To facilitate interaction with Matlab, the
origin of the world reference frame was made to coincide with the upper left corner of the virtual image
used to simulate the world (Fig. 4.1). It was from this image that the virtual photos were taken.
10 T_CW has an inverse because it is a multiplication of rotation and translation matrices, and each of these is invertible [55].
Fig. 4.1 – Adopted virtual world image (8486 by 12000 pixels) and its coordinate system. For this simulation each pixel was considered to measure 1 m [107].
In this simplification it was naturally assumed that the ground is planar, since flat satellite images were used to simulate the World. This made it possible, in a first approach, to take aerial photographs of this virtual "image of the World" with exact knowledge of all the variables involved (in a real situation such values, with their respective errors, would be made available by the sensors). Later, with the obtained images, it becomes possible to measure the accuracy of the mosaicing techniques developed by comparing the resulting mosaics with the original virtual image of the World.
To avoid possible misunderstandings, the axes of all the frames used in this dissertation, and not only those of the world reference frame, respect the right-hand rule – clockwise reference frame (Fig. 4.2).
Fig. 4.2 – Clockwise reference frame adopted.
Regarding the image reading and displaying functions in Matlab (imread and imshow, which index as (row, column)), they operate according to the red reference frame represented in Fig. 4.1 and Fig. 4.3 (direct reference frame). However, let us note here that some features of Matlab are ruled by a counterclockwise reference frame; this includes the editor used to select the black dot seen in Fig. 4.3, as well as the well-known Matlab plot function that allows, in the current situation, points to be represented on images. So, instead of having plot(x, y) as plot(column, row), in this work plot was used in the modified form plot(row, column), or plot(y, x), for a consistent use/display of the results.
Fig. 4.3 – Care needed in the representation of the results in Matlab.
4.2. UAV reference frame
After the definition of the world reference frame comes the need for the UAV reference frame.
According to [62] (p. 579) axes x, y and z are used respectively for the roll, pitch and yaw angles of an
UAV (see Fig. 3.2). It is also known that z positive is pointed to the ground when the roll, pitch and
yaw take zero value (blue reference frame in the Fig. 4.4). According to what was said earlier the
clocker wise reference frame was used.
Fig. 4.4 – UAV reference frame (blue) related to the world reference frame (red) when ψ = θ = φ = 0.11
11 The origin of the UAV reference frame relative to the origin of the world (X₀, Y₀, Z₀) in this and in the next figures was of course arbitrary, since the intention was only to demonstrate the procedure adopted.
The passage from the world reference frame to the UAV reference frame (T_UW) is given in five steps (five coordinate transformations):

T_UW = T_φ T_θ T_ψ T_180 T_t    (4.4)

Firstly, according to Fig. 4.4, a translation is made from the origin of the world reference frame (red) to the origin of the UAV reference frame (green):

T_t = [1 0 0 −X₀; 0 1 0 −Y₀; 0 0 1 −Z₀; 0 0 0 1]    (4.5)
Then comes a rotation which, according to Fig. 4.4, goes from the green reference frame to the blue reference frame by rotating 180° around the x axis (assuming pitch, roll and yaw equal to zero):

T_180 = [1 0 0 0; 0 −1 0 0; 0 0 −1 0; 0 0 0 1]    (4.6)
Finally come the three coordinate transformations resulting from the roll, pitch and yaw angles of the UAV. According to [108], and because matrix multiplication is not commutative, the first transformation to take place concerns the yaw angle12 (around the z axis, which points to the ground in Fig. 4.4). The yaw rotation therefore also occurs about the z axis of the world frame, not only that of the body frame of the UAV. If an arbitrary roll or pitch angle, different from zero, were applied before the yaw rotation, z would no longer be perpendicular to the ground plane or, put another way, the image plane would no longer be parallel to the ground plane (Fig. 4.5). So, this transformation is given by:

T_ψ = [cos ψ  sin ψ  0  0; −sin ψ  cos ψ  0  0; 0 0 1 0; 0 0 0 1]    (4.7)
12 The reader should note that this analysis was done from the world reference frame step by step to the UAV reference frame. However, when we have the coordinates of a point in the UAV reference frame, we naturally follow the chain in the opposite direction, and so the first angle to be corrected would be the roll, then the pitch, and finally the yaw. The thing is that, in eq. (4.4), each time a new rotation is introduced on the left, it has no concern for the original body frame on the right.
Fig. 4.5 – The pink reference frame corresponds to a UAV frame with ψ = θ = φ = 0. The blue frame is the UAV frame with a single yaw rotation of 20 degrees.
The second angle to be applied is the pitch angle – around y (the axis through the right wing of the plane) – which makes the nose of the UAV go up and down. This axis must necessarily be parallel to the ground plane [108]. If the roll angle were applied before the pitch, the previous statement would not be true (see Fig. 4.6) and we would get a different final transformation. This transformation is described in the form:

T_θ = [cos θ  0  −sin θ  0; 0 1 0 0; sin θ  0  cos θ  0; 0 0 0 1]    (4.8)
Fig. 4.6 – The pink reference frame corresponds to a UAV frame with ψ = θ = φ = 0. The blue frame is the UAV frame with a single pitch rotation of 20 degrees.
Finally comes the roll angle – around the x axis (the longitudinal axis of the UAV) – which allows the UAV to tilt to the left or right (Fig. 4.7). This transformation is described in the form:

T_φ = [1 0 0 0; 0 cos φ  sin φ  0; 0 −sin φ  cos φ  0; 0 0 0 1]    (4.9)

Fig. 4.7 – The pink reference frame corresponds to a UAV frame with ψ = θ = φ = 0. The blue frame is the UAV frame with a single roll rotation of 20 degrees.
4.3. Camera reference frame
The transformation that characterizes the passage from the camera reference frame to the UAV
reference frame is debatable. In real situation the position of the camera on the UAV turns out to be
limited by physical circumstances, so that in this simulation it will be assumed that both reference
frame origins are coincident ( , , ) = ( , , ) and that it's possible to align ( )
with the flight direction of the UAV ( ) allowing this way the alignment of the camera and the
pictures taken with the UAV flight direction. Fig. 4.8 shows the camera reference frame in light blue
after a rotation of -90 degrees about the axis.
(4.10)
Fig. 4.8 – Representation of the camera reference frame compared to the world and UAV reference
frames.
4.4. Pinhole camera model
A pinhole camera is simply a camera that has no lens [109]. In its place is an extremely small aperture through which exterior light is projected onto sensitive film or paper. Effectively, it is a light-proof box with a single small hole in one side. Light from a scene passes through this point and projects an inverted image on the opposite side of the box (Fig. 4.9).
Up to a certain point, the smaller the hole the sharper the image, but also the dimmer the projected image. Optimally, the size of the aperture should be 1/100 or less of the distance between it and the projected image [109].
Fig. 4.9 – Pinhole Model [110].
However, in the literature it is common to consider the simplification that the image plane sits between the pinhole and the photographed object [62] (Fig. 4.10). By adopting it, this simulation avoids the need for a final correction related to the inversion of the photograph, as Fig. 4.9 would otherwise require. Nowadays, the images we deal with are already corrected for this inversion.
Fig. 4.10 – Real model (left) and simplified model (right) of a pinhole camera in the XZ plane.
From Fig. 4.10 it is possible to derive these elementary geometric equations:

h/H = w/W = f/D    (4.11)

D represents the distance between the object and the pinhole; f is the focal length; H and W are the dimensions of the photographed area on the ground; h and w are the photograph dimensions on the image plane.
Our goal then is to use equations (4.11) to find the transformation that projects 3D points from the object plane to points on the 2D image plane (see Fig. 4.11 of section 4.6.1 for a fully detailed scheme).
Projections can be modeled in three steps (eq. (4.12)), as we will see: first, a perspective projective transform that relates the two point positions (P); second, a frame transformation (T_f) in order to fix the Z coordinate of the projected point as zero; and finally a space dimension reduction, S:

Π = S T_f P    (4.12)
Perspective projection
Perspective projection (P) corresponds to the well-known pinhole camera model. In this model, as has been seen in Fig. 4.9, a point in the object plane generates a ray through the optical center, intersecting the image plane at a given position. Using homogeneous transformations, if X_C = (X, Y, Z, 1)ᵀ represents the original point defined in the camera frame then, with respect to the previous eq. (4.11), its projection on the plane Z = f will have coordinates (fX/Z, fY/Z, f), and the derived homographic transform is given by [62]:

P = [1 0 0 0; 0 1 0 0; 0 0 1 0; 0 0 1/f 0]    (4.13)
Frame transformation
The passage from the camera frame C to C′ is just a pure translation along the Z axis by the focal length f named before:

T_f = [1 0 0 0; 0 1 0 0; 0 0 1 −f; 0 0 0 1]    (4.14)

C′ represents the 3D frame whose origin lies in the middle of the image plane.
Space dimension reduction
Finally, because we are only interested in points on the projection plane, we can use the remaining coordinates to define 2D axes on this plane. Denoting this coordinate system as S (yellow in Fig. 4.11), these new axes are obtained from the old ones through the transform:

S = [1 0 0 0; 0 1 0 0; 0 0 0 1]    (4.15)
Recovering the previous equations (4.3) and (4.12), the projection is now obtained and represented in eq. (4.16):

λ x_S = S T_f P X_C    (4.16)

However, the use of homogeneous transformations introduces a scale factor (λ) that needs to be posteriorly corrected (see the notes in section 4.7).
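As a quick numeric check of this chain (using the reconstructed matrices above and an arbitrary, assumed focal length), the following Matlab sketch projects a camera-frame point and removes the scale factor:

    % Sketch of eqs. (4.12)-(4.16): project a camera-frame point onto the
    % image plane. The focal length value is illustrative.
    f  = 0.004;                                   % focal length [m] (assumed)
    P  = [1 0 0 0; 0 1 0 0; 0 0 1 0; 0 0 1/f 0];  % perspective projection (4.13)
    Tf = [1 0 0 0; 0 1 0 0; 0 0 1 -f; 0 0 0 1];   % translate origin by f (4.14)
    S  = [1 0 0 0; 0 1 0 0; 0 0 0 1];             % drop the Z coordinate (4.15)
    Xc = [2; -1; 100; 1];                         % point in the camera frame [m]
    xs = S * Tf * P * Xc;                         % homogeneous result, eq. (4.16)
    xs = xs / xs(3)                               % remove the scale factor lambda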
4.5. Camera parameters and calibration
Remembering eq. (4.3) is the last transformation matrix that needs to be calculated to
completely define our problem. This transformation is used to place the center of the image
coordinates on the top left corner of an image instead of in the middle. We need to do so, because
Matlab by default consider the top left corner as the origin of both matrices and images. It is
recommended you to see the yellow reference frame ( ) and red reference frame ( ) shown in
Fig. 4.11 (section 4.6.1) for further detail.
To find , the problem was divided in three steps: axes rescale, translation, rotation.
Axes rescale
The S reference frame is in meters, while I is given in pixels. Naturally, each pixel was considered to have a very small rectangular area (not necessarily square). Dividing the image size in pixels (e.g. 320 × 240) respectively by the w and h values given in Fig. 4.10, which represent the photograph dimensions on the image plane, gives us the number of pixels per measuring unit in both directions of the image: kᵤ and k_v respectively. These are the values used to rescale the axes.
Translation
The translation between the origin of S and the origin of I is (u₀, v₀), where u₀ and v₀ are respectively half the total image size in each direction, in pixel dimensions (e.g. u₀ = 160 and v₀ = 120)13.
Rotation
The rotation (see Fig. 4.11 in section 4.6.1), which swaps the axes to match Matlab's (row, column) convention, is simply given by:

R_I = [0 1 0; 1 0 0; 0 0 1]    (4.17)
Camera calibration
Finally, T_I is given by:

T_I = [1 0 u₀; 0 1 v₀; 0 0 1] [kᵤ 0 0; 0 k_v 0; 0 0 1] R_I    (4.18)

Recovering eq. (4.1), C is given by:

C = T_I S T_f P    (4.19)

And the calibration matrix14:

K = [−f kᵤ  s  u₀; 0  −f k_v  v₀; 0  0  1]    (4.20)
13 In a real case application, the values f, kᵤ, k_v, u₀ and v₀ are calculated based on a calibration procedure. Thus u₀ and v₀ are not necessarily half the image sizes. In section 4.7 ("Intrinsic parameters") a bit more information is given about this.
14 The skew, s, of the camera calibration matrix (between the u and v axes) is often set to zero. The skew is in the entry K(1, 2) [55].
4.6. Summary
In this section a summary of all that was said above is presented, to ease the reasoning. First, a detailed scheme with all the reference frames involved is presented (section 4.6.1); next, the same is done for the equations (section 4.6.2).
4.6.1. Detailed scheme of the reference frames
Fig. 4.11 shows the detailed scheme relating all the reference frames involved. The camera model, as well as the image and object planes, are represented with respect to the UAV reference frame and the world reference frame.
Fig. 4.11 – From world reference frame to image plane, with the notation involved. It is recalled that the camera is aligned with the flight direction of the UAV.
4.6.2. Scheme of the equations involved
Next, a brief summary of how to obtain the transformation that relates 2D points in the image frame with 3D points in the world frame is presented.
Fig. 4.12 – Graph of the transformations involved.
4.7. Final notes
Scale factor
In order to solve the problem introduced by the scale factor (eq. (4.16)) we recover eq. (4.1):

λ [u; v; 1] = C T_CW [X; Y; 0; 1]    (4.21)

The thing is that λu and λv are given in pixels·meters, due to the multiplication of the calibration matrix (given in pixels) by the world coordinates X and Y (given in meters). This way, it is easy to see that we just need to divide x_I by its last entry, λ (which turns out to be in meters, as we can see from eq. (4.16)).
Z is naturally set to zero because we are working on the ground plane. The last entry of X_W is 1 due to the use of homogeneous coordinates.
Intrinsic parameters
The focal length (f), kᵤ, k_v, u₀ and v₀ are given by the manufacturer of each camera. However, in this dissertation the Camera Calibration Toolbox for Matlab [111] was used in order to find the camera parameters of a Logitech Labtec WebCam:

(4.22)
Instead of the theoretical calibration matrix, the matrix of eq. (4.22) ended up being used to take these virtual photos.
Discussion concerning the virtual photo
In the beginning, to find the virtual photo, eq. (4.21) was used directly, with (X, Y) replaced successively by the pixel coordinates of Fig. 4.1. Then, only the world coordinates (X, Y) that led to values inside the image dimensions (e.g. 320 and 240) were selected. Finally, each of these selected world pixel coordinates was used to obtain the virtual photo.
This method, however, presented two problems. First, it was computationally very inefficient, since it needed to evaluate eq. (4.21) more than 100 million times (8486 × 12000), or eventually fewer if an initial guess was provided (the center of the camera coordinates). Second, when a pixel coordinate (u, v) is provided, in practice the (X, Y) coordinates calculated are not integer numbers, which actually turns this method into an approximation, unless some interpolation method is used to calculate the correct color value for each pixel. Nevertheless, despite the time consumption, it gave good results.
Later, for a quicker calculation, the goal was to introduce (u, v) coordinates and obtain (X, Y) directly.
For this, the inverted transformation matrix was considered:

X_W = T_CW⁻¹ C⁻¹ λ x_I    (4.23)

As was said before, T_CW has an inverse. Since an inverse for C, which is a 3 × 4 matrix, can be obtained by adding back a row ((0, 0, 1, 0)) to the dimension-reduction matrix S [62] (p. 590), C then also has an inverse. A back-projection of points to rays can also be used [62] (p. 161-162) to find the set of points in space that map to a certain image point (u, v), using the pseudo-inverse (or right inverse [112]) of C:

X(μ) = C⁺ x_I + μ c,  with C⁺ = Cᵀ (C Cᵀ)⁻¹    (4.24)

where c is the camera centre. However, a detailed analysis shows that the scale factor λ (eqs. (4.16) and (4.21)), by which (u, v) supposedly needs to be multiplied in order to obtain (X, Y), actually depends on (X, Y), which leads to an indeterminate problem.
Finally, using symbolic calculation with Matlab, and considering the fact that Z = 0, it was possible to obtain:

(4.25)

and in this way expressions for both X and Y dependent on all the variables in study15.
Using homography theory, and again in order to avoid pixel-by-pixel calculations, the four extreme (corner) points of the image were simply used, together with their corresponding inhomogeneous world coordinates (X, Y), to find the projective transformation or homography16 that gives the world coordinates for each image point in a much quicker way:
Fig. 4.13 – Strategy adopted to find the world coordinates associated to each image coordinate.
Later, to improve our results, a bilinear interpolation was used (Fig. 4.14 and Fig. 4.15).
Fig. 4.14 – The use of bilinear interpolation for accurate results.
Fig. 4.15 – Example of bilinear interpolation. For more information we recommend to consult: [62].
15 These equations are too big to be presented here, or even in appendix, due to their dimensions (more than 25000 characters each).
16 We can use homographies for this because the world coordinates are being considered to be 2D coordinates, and so we can treat this problem as a normal 2D-to-2D correspondence problem.
[Fig. 4.13 flow: select the 4 corners of the image plane; compute X and Y for those 4 corners; find the homography between the image plane and the world plane. Fig. 4.14 flow: for each image point (u, v), use the homography to reach the world plane (X, Y) and then apply bilinear interpolation.]
Consider, for example, that the point given by the coordinates (u, v) gives us the coordinates (X, Y) in the world image W. So, the value I(u, v) would be weighted, according to Fig. 4.15, in the form:

I(u, v) = (1 − a)(1 − b) W(X₀, Y₀) + (1 − a) b W(X₀, Y₀ + 1) + a (1 − b) W(X₀ + 1, Y₀) + a b W(X₀ + 1, Y₀ + 1)    (4.26)

With respect to the equations:

X₀ = ⌊X⌋,  Y₀ = ⌊Y⌋,  a = X − X₀,  b = Y − Y₀    (4.27)

Where ⌊·⌋ is used to round the value to the nearest lower integer.
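In Matlab, eqs. (4.26)-(4.27) become (with a stand-in world image so the snippet runs on its own):

    % Sketch of eqs. (4.26)-(4.27): bilinear interpolation of the world
    % image W at the non-integer coordinates (X, Y) given by the homography.
    W = rand(500, 500);                   % stand-in for the world image
    X = 123.4; Y = 56.7;                  % non-integer world coordinates
    X0 = floor(X); Y0 = floor(Y);         % eq. (4.27)
    a = X - X0;    b = Y - Y0;
    val = (1-a)*(1-b)*W(X0,   Y0)   + (1-a)*b*W(X0,   Y0+1) + ...
          a*(1-b)*W(X0+1, Y0)       + a*b*W(X0+1, Y0+1);    % eq. (4.26)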
Other application of this procedure
The origin of the UAV reference frame relative to the origin of the world (X₀, Y₀, Z₀), as well as the pitch, roll and yaw angles, would be measured and given by the UAV sensors. This way it is possible, using the previous transformations, to create an image mosaic just with the information provided by the sensors. Naturally, this approach is only valid while we are treating the world as a flat surface.
4.8. Examples of results given by Matlab
This topic presents the script developed in Matlab based on the previous equations, in order to demonstrate their accuracy. To ease the reader's reasoning, an initial position of the UAV with a non-zero yaw angle and zero roll and pitch angles was considered. Introducing all the parameters, we may obtain the set of figures shown next.
Fig. 4.16 – Matlab representation of the model developed. The red rectangle shows the image plane, while the green rectangle shows the object plane. Camera position ( , 1000, 600) meters.
Fig. 4.17 – Representation of the ground area covered by the photo taken in Fig. 4.16. The image on the right is a detailed view. The red asterisk, in the middle, shows the position of the camera in world coordinates (Fig. 4.1) projected onto the ground plane. A projection of the camera reference frame is also shown in green.
Fig. 4.18 – Virtual photograph taken (e.g. 320 by 240) according to Fig. 4.16 and Fig. 4.17.
Next, to end this topic, two other brief examples are presented.
Fig. 4.19 – Camera position ( , 1000, 600) meters. a) Model developed; b) Representation of the ground area; c) Image taken.
Fig. 4.20 – Camera position ( , 1100, 800) meters. a) Model developed; b) Representation of the ground area; c) Image taken.
5. Implementation of the Image Mosaicing Method
This chapter is not intended to exhaustively expose the lines of the Matlab code. The aim is to present the algorithms and strategies used to solve the problems and to explain the similarities/differences between the methods that were implemented and the ones addressed in the mathematical problem formulation chapter. Several practical aspects, like constant values previously skipped, will also be taken into account.
First, it is explained how, from two images with some overlapping region, we can get a stitched image mosaic (section 5.1). A detailed scheme to help in understanding the image mosaicing method that was implemented is presented in Fig. 1.2. In section 5.2 this method is generalized for more than two images. Section 5.3 is related to the error measures used to evaluate the accuracy of the method.
5.1. Image mosaicing method for two images
This section is divided in two parts. The first explains how to obtain a homography between two
images. Next, in section 5.1.2, the stitching operation for these images is addressed.
5.1.1. Find homography
This subsection presents the basic algorithm of the proposed method (Fig. 5.1), which follows the strategy used in [113]. A simplified version of this scheme was previously presented in Fig. 1.2 (section 1.4). All the considerations/practical issues related to each and every step of this basic algorithm are explained in Appendix A.
Fig. 5.1 – Basic algorithm used to find a homography between two images:
1. Use the Harris-Laplace algorithm (section 3.6) to find interest points in both images;
2. Assign a characteristic scale to each interest point following the Harris-Laplace algorithm (section 3.6);
3. Assign a canonical orientation to each interest point according to the SIFT method (section 3.7.1);
4. Describe the interest points according to the SIFT method (section 3.7.2);
5. Find putative correspondences using the ratio of nearest neighbors (section 3.8);
6. Find consistent correspondences using Ransac (section 3.9);
7. Estimate the homography using the DLT algorithm (section 3.5).
5.1.2. Stitch two images together
This section is divided in four steps: Transformation that brings both images into the same
reference frame; Image mosaic boundaries; Application of the transformation to each image; and
Overlapping regions.
Transformation that brings both images into the same reference frame
Up to this point, it is assumed we received two images linked by a homography H. If we set the first image as our reference (identity matrix), then the coordinates x₂ of the second image can be reassigned into the first reference frame according to:

x₁ = H x₂    (5.1)
Image mosaic boundaries
Now that we have the required transformations, we need to define the boundaries, in pixel size,
of the resultant image mosaic. For this, let us assume each of our images has pixels. Then
our first image alone (identity matrix) will initially set the size of our mosaic as . However, if we add the second image to the mosaic, we need to calculate where its boundaries will lead us:
(5.2)
Normalizing the left matrix by its scale factors we get:
(5.3)
Next we have to find the maximum and minimum values for  and , respectively from the sets  and . These four new values will define the new boundaries of our image mosaic (built from the two images). See Fig. 5.2.
Fig. 5.2 – Image mosaic within the calculated boundaries.
- 65 -
The black regions of Fig. 5.2 were simply set to “NaN” (Not a Number).
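A minimal Matlab sketch of the boundary computation follows, assuming H maps homogeneous pixel coordinates of the second image into the reference frame of the first; the homography and the image size are illustrative values.

% Map the four corners of the second image and take the extrema.
H = [1 0 150; 0 1 20; 0 0 1];             % hypothetical homography
rows = 240; cols = 320;                   % image size in pixels (example)
corners = [1 cols cols 1;                 % x coordinates of the corners
           1 1    rows rows;              % y coordinates of the corners
           1 1    1    1];                % homogeneous scale
t = H * corners;
t = t ./ repmat(t(3, :), 3, 1);           % normalize by the scale factors
xMin = min([1, t(1, :)]); xMax = max([cols, t(1, :)]);
yMin = min([1, t(2, :)]); yMax = max([rows, t(2, :)]);
% The mosaic then spans about ceil(xMax-xMin+1) by ceil(yMax-yMin+1) pixels.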
Application of the transformation to each image
The transformation of the first image is trivial because it is the reference. However, for the second image we simply use a concept similar to the one used in eq. (5.2), but this time the input coordinates are all the pixel coordinates of the final mosaic (similar to what we saw in equation (4.21)):
(5.4)
where  and  are the mosaic dimensions in pixels.
Now, through bilinear interpolation ( function in Matlab) it is possible to know the true values for the pixels of the mosaic  and . Fig. 5.3 shows the two images individually placed within the mosaic, before the overlapping operation. To avoid misunderstandings, these new images will be called frames.
Fig. 5.3 – Frames associated with the first two images.
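A minimal sketch of this inverse warping, using interp2 as one possible Matlab choice for the bilinear interpolation (names, sizes and the homography are illustrative):

% Back-project every mosaic pixel into the second image and interpolate.
I2 = rand(240, 320);                      % hypothetical grayscale image
H = [1 0 150; 0 1 20; 0 0 1];             % hypothetical homography
mosaicRows = 300; mosaicCols = 500;       % mosaic dimensions in pixels
[xm, ym] = meshgrid(1:mosaicCols, 1:mosaicRows);
p = inv(H) * [xm(:)'; ym(:)'; ones(1, numel(xm))];
xs = reshape(p(1, :) ./ p(3, :), mosaicRows, mosaicCols);
ys = reshape(p(2, :) ./ p(3, :), mosaicRows, mosaicCols);
frame2 = interp2(I2, xs, ys, 'linear');   % NaN outside the image support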
Overlapping regions
The stitching operation on the overlapping regions was done in two different ways. The first simply assigns to each pixel coordinate of the final image the maximum value obtained from all the frames for that specific coordinate17:
Frame 1:
NaN   NaN   NaN   NaN   NaN   NaN   NaN
0,76  0,05  0,79  0,26  0,08  NaN   NaN
0,75  0,53  0,31  0,65  0,23  NaN   NaN
0,57  0,93  0,17  0,75  0,15  NaN   NaN
+
Frame 2:
NaN   NaN   0,87  0,43  0,14  0,08  0,42
NaN   NaN   0,40  0,18  0,58  0,12  0,90
NaN   NaN   0,26  0,26  0,55  0,18  0,94
NaN   NaN   NaN   NaN   NaN   NaN   NaN
=
Mosaic (final image):
NaN   NaN   0,87  0,43  0,14  0,08  0,42
0,76  0,05  0,79  0,26  0,58  0,12  0,90
0,75  0,53  0,31  0,65  0,55  0,18  0,94
0,57  0,93  0,17  0,75  0,15  NaN   NaN
Fig. 5.4 – Example of the “maximum” strategy for two images.
17 This was done individually for each RGB channel.
Or, instead of the maximum, we can consider that the more transformations are involved to find a frame, the more inaccurate our results are18. This is the same as saying that pixels added sooner (successively from the first to the last frame) have priority over the others, since they are more likely to give the best results. See Fig. 5.5 to better understand the idea.
Frame 1:
NaN   NaN   NaN   NaN   NaN   NaN   NaN
0,76  0,05  0,79  0,26  0,08  NaN   NaN
0,75  0,53  0,31  0,65  0,23  NaN   NaN
0,57  0,93  0,17  0,75  0,15  NaN   NaN
+
Frame 2:
NaN   NaN   0,87  0,43  0,14  0,08  0,42
NaN   NaN   0,40  0,18  0,58  0,12  0,90
NaN   NaN   0,26  0,26  0,55  0,18  0,94
NaN   NaN   NaN   NaN   NaN   NaN   NaN
=
Mosaic (final image):
NaN   NaN   0,87  0,43  0,14  0,08  0,42
0,76  0,05  0,79  0,26  0,08  0,12  0,90
0,75  0,53  0,31  0,65  0,23  0,18  0,94
0,57  0,93  0,17  0,75  0,15  NaN   NaN
Fig. 5.5 – Example of the “first in stays in” strategy for two images.
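The two strategies can be sketched in Matlab as follows (toy frames; Matlab's max ignores NaN operands, which matches the behavior shown in Fig. 5.4):

% frame1 and frame2 are same-size frames with NaN outside their support.
frame1 = [NaN NaN; 0.8 0.2];              % toy example values
frame2 = [0.5 NaN; 0.4 0.9];
% "Maximum" strategy: pixel-wise maximum (max(NaN, v) returns v).
mosaicMax = max(frame1, frame2);
% "First in stays in" strategy: earlier frames keep priority.
mosaicFirst = frame1;
holes = isnan(mosaicFirst);
mosaicFirst(holes) = frame2(holes);       % fill only where frame1 is empty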
5.2. Image mosaicing method for more than two images
Regarding this point, the implementation developed in this dissertation did not resort to great sophistication. What was considered was that each new image added to the mosaic was directly associated with the image right before it (Fig. 5.6).
Fig. 5.6 – Image chain assumed on our implementation.
The configuration of Fig. 5.6 was adopted because the quadrotor images were obtained and named according to the trajectories followed. So, for now, there was no need to deal with mixed images without any explicit order.
The homographies between each pair of images were obtained as before in section 5.1.1. The
stitching operation was just a generalization of what was done in section 5.1.2. If we set the first image
as our reference (identity matrix), then the following images can be transformed into this global
reference frame according to:
(5.5)
18 Generalizing for cases with more than two frames involved.
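A minimal sketch of this chaining, assuming each pairwise homography maps image n+1 into image n (the matrices are illustrative toy values):

% Accumulate pairwise homographies into global transformations.
Hpair = {[1 0 100; 0 1 0; 0 0 1], [1 0 90; 0 1 10; 0 0 1]};  % toy H12, H23
G = cell(1, numel(Hpair) + 1);            % global transformations
G{1} = eye(3);                            % first image is the reference
for n = 2:numel(G)
    G{n} = G{n-1} * Hpair{n-1};           % image n into the global frame
end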
5.3. Error measures implemented
This section starts with the definition of some equations concerning the error measures. Next, the two adopted error measures are described: “Exact vs. Estimated Transformation” (section 5.3.1) and “Homography Decomposition Based” (section 5.3.2).
Flight altitude and covered area
Here it is explained how to calculate the flight altitude necessary to cover a given area, , and how to calculate the area covered at a given flight altitude, . For these calculations we will assume that we obtained, a priori, experimental data for the camera calibration matrix. In this example, photographs have  pixels and the calibration matrix is given by eq. (4.22).
From Fig. 4.10 we can derive the relations:
(5.6)
For accurate results we can now use the data obtained from the Camera Calibration Toolbox
(eq. (4.22)) and set:
(5.7)
As we mentioned, considering the image pixel size as being, e.g., ( , ), and recalling what was said in the “Axes rescale” paragraph of section 4.5, we can write eq. (5.8)19:
(5.8)
where we use  and , instead of simply  and , because these values are half the image sizes (Fig. 4.10).
Finally, combining these three equations (5.6), (5.7) and (5.8), respectively for  or , we have the photographed area, , for a given desired flight altitude:
(5.9)
And the flight altitude, , necessary to cover a given area can be computed from:
(5.10)
19  and  are constants and do not depend on  or ; however, this is a calibration procedure based on experimental data.
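Since the exact expressions are not legible in this transcript, the following is a sketch under the usual pinhole relations, with illustrative focal lengths (not the calibration data of eq. (4.22)):

% Photographed area at a given altitude, and altitude for a desired area.
fx = 700; fy = 700;                       % assumed focal lengths [pixels]
Nx = 320; Ny = 240;                       % image size [pixels]
h  = 600;                                 % flight altitude [m]
area = (h * Nx / fx) * (h * Ny / fy);     % covered area [m^2], cf. eq. (5.9)
aDesired = 1e5;                           % desired area [m^2] (example)
hNeeded = sqrt(aDesired * fx * fy / (Nx * Ny));  % cf. eq. (5.10)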
Overlap percentage
From the previous equations, one image taken at a height of  meters covers  square meters of area; thus, if two images have an overlap of , both together cover a total area given by:
(5.11)
Also, to achieve this overlap of  between two consecutive images, we will consider in this dissertation just two situations: a pure translation along  or along , respectively given by:
(5.12)
(5.13)
where  and  were mentioned previously in equations (5.6) to (5.9):
(5.14)
(5.15)
is aligned with the ( ) according to Fig. 4.11.
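Under the same pinhole assumptions as the previous sketch, the translations of eqs. (5.12)-(5.13) can be illustrated as:

% Camera translation needed for a desired overlap p between nadir images.
fx = 700; fy = 700; Nx = 320; Ny = 240; h = 600;  % as in the previous sketch
p  = 0.5;                                 % desired overlap (50%)
Lx = h * Nx / fx;                         % ground footprint along x [m]
Ly = h * Ny / fy;                         % ground footprint along y [m]
tx = (1 - p) * Lx;                        % translation along x
ty = (1 - p) * Ly;                        % translation along y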
Exact homography between two images
From chapter 4 we saw that it is possible to know the exact transformation between the world
plane and the picture plane just knowing the calibration matrix, the coordinates of the camera with
respect to the world frame, and the pitch, roll and yaw angles of the UAV.
Now, if we take two (controlled) pictures of our world image (Fig. 4.1), and if they have some
overlapping region between each other, it is possible to compute the exact homography between such
images. Let us consider two pictures obtained using the transformations described previously in eq. (4.23),  and ; then the exact homography that will place a point  of the second image into the reference frame of the first image is:
(5.16)
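A minimal sketch of this composition, assuming T1 and T2 map world-plane coordinates into the first and second picture planes (names and values are illustrative):

% Exact homography from image 2 to image 1 as a composition of the two
% exact world-to-image transformations.
T1 = [1 0 0; 0 1 0; 0 0 1];               % hypothetical world -> image 1
T2 = [1 0 -100; 0 1 0; 0 0 1];            % hypothetical world -> image 2
H12exact = T1 / T2;                       % i.e., T1 * inv(T2)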
5.3.1. Exact vs. estimated transformation
Let us consider a mosaic composed of two images with some predetermined overlap percentage between them (equations (5.12) and (5.13)). The first image is the reference and the second image is the object of our analysis. The transformation , as described previously in eq. (5.16), is the exact transformation between the second image and the world reference frame. The estimated transformation is obtained by combining the transformation between the first image and
the world frame ( ) with the estimated homography obtained according to section 5.1.1 ( ):
(5.17)
Now, knowing that:
(5.18)
(5.19)
where, e.g.,  and .
We can compute the error measure:
(5.20)
If we use the exact and the estimated transformations to calculate all the world coordinates associated with each pixel of the exact and the estimated image, respectively, then the Euclidean distance gives the error, in meters, between each of these estimated and exact positions. Fig. 5.7 illustrates this measure.
Fig. 5.7 – Euclidean distance.
From the Euclidean distance matrix we can now compute:
(5.21)
(5.22)
(5.23)
(5.24)
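A minimal sketch of this measure, assuming the transformations map world coordinates to pixels (so their inverses back-project pixels into the world); the statistics computed at the end stand in for eqs. (5.21)-(5.24), whose exact forms are not legible in this transcript:

% Per-pixel Euclidean distance between exact and estimated world positions.
Hexact = [1 0 -100; 0 1 0; 0 0 1];        % hypothetical exact transformation
Hest   = [1 0.01 -99; 0 1 0.2; 0 0 1];    % hypothetical estimated one
[xp, yp] = meshgrid(1:320, 1:240);        % pixel grid of the second image
pix = [xp(:)'; yp(:)'; ones(1, numel(xp))];
we = Hexact \ pix;  we = we(1:2, :) ./ repmat(we(3, :), 2, 1);
wh = Hest   \ pix;  wh = wh(1:2, :) ./ repmat(wh(3, :), 2, 1);
d = sqrt(sum((we - wh).^2, 1));           % Euclidean distances [m]
stats = [max(d), min(d), mean(d), std(d)];% summary statistics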
5.3.2. Homography decomposition based
As before in section 5.3.1, let us consider a mosaic composed of two images with some predetermined overlap percentage between them (equations (5.12) and (5.13)). The error measure described here is a straightforward application of the equations mentioned in section 3.10. The idea behind it is very simple: we need to decompose the estimated and the exact homographies (see section 5.1.1 and equation (5.16), respectively), and then calculate the difference between the exact and the estimated results. The results are the roll, pitch and yaw angles and the translation coordinates that can be obtained from the translation vector :  and  (equation (3.48)). Next, some considerations are presented with respect to the involved equations.
The homography decomposition method gives two solutions. For the correct solution, the values  from equation (3.53) can be initially guessed from the direction of the -axis of the UAV: .
The three Euler equations reported for this section are:
(5.25)
(5.26)
(5.27)
However, due to transformations such as the ones given by equations (4.6) - , (4.10) -  and (4.17), the values of the roll, pitch and yaw can come rotated by 90 or 180 degrees. To correct this, the rotation matrix of equation (3.52) (here renamed ) needed to be corrected before applying equations (5.25) to (5.27):
(5.28)
Also, after recalculating the ,  and  angles with equations (5.25) to (5.27), the new true angles had to be corrected according to:
(5.29)
The translation (equation (3.48)) was also adjusted by a factor of .
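Since eqs. (5.25)-(5.27) are not reproduced here, the following sketch assumes the standard aerospace yaw-pitch-roll (ZYX) convention for extracting the Euler angles from a decomposed rotation matrix R:

% Euler angles from a rotation matrix (ZYX convention, assumed).
R = eye(3);                               % hypothetical decomposed rotation
pitch = asin(-R(3, 1));                   % theta
roll  = atan2(R(3, 2), R(3, 3));          % phi
yaw   = atan2(R(2, 1), R(1, 1));          % psi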
6. Evaluation of the Mosaicing Method. Results and Discussion
This chapter is divided into four sections. In section 6.1 a robustness analysis is performed on the proposed mosaicing algorithm in order to find the best overlap percentage between two images. An example of a mosaic built from a set of images is presented in section 6.2. Section 6.3 presents a comparison between the proposed image mosaicing method and the SIFT method; D. Lowe has an executable with the SIFT implementation available for download on the internet (here: [114]). Three other methods will also be considered in section 6.3. Finally, in section 6.4, considerations are made on applications with real data.
6.1. Robustness analysis to the implemented algorithm
This section starts with an explanation of how the robustness analysis was done. Section 6.1.1 introduces the importance of having a high number of inliers. In section 6.1.2 a consideration is made on the computational cost of the mosaicing method. The influence of the random selection step of 4 putative correspondences in the Ransac algorithm is discussed in section 6.1.3. Finally, the most adequate overlap percentage is studied for each of these parameters: pitch, roll, yaw and scale factor (section 6.1.4). Changes of brightness, changes of contrast and noise are also addressed.
For the robustness analysis, this dissertation used the Matlab script developed in chapter 4. Knowing the variables involved in obtaining each image, two photographs were taken of the world image (Fig. 4.1). The first image was set to have roll, pitch and yaw equal to zero. This first image was deemed our world reference. The second image was obtained from the conditions in which the first was taken, adding a translation to the UAV position and/or adding small shifts to the roll, pitch or yaw angles (see Fig. 6.1).
Fig. 6.1 – Taking two photographs with the Matlab script developed in chapter 4. The bottom image of c) is the reference image, while the top one was obtained by applying a small shift to the first.
The bottom image of Fig. 6.1 is an example of an image obtained with the position of the UAV set to  meters, and  ( just for display purposes). For the second image the following was used:  meters; ; and . The translation along the  axis was obtained from eqs. (5.12) and (5.14), considering an overlap percentage, , of :  meters.
As was explained before in chapter 5, using the exact and the estimated homography, it is
possible to obtain the exact and the estimated mosaics:
Fig. 6.2 – a) Estimated mosaic, b) Exact mosaic.
The controlled environment, and the good overlap percentage between the two images used to obtain the mosaics of Fig. 6.2, explain why, visually, the estimated homography seems to be an almost perfect estimation.
Next, Fig. 6.3 and Fig. 6.4 represent the putative matches and the inliers out of the  interest points of the first image and the  interest points of the second image.
Fig. 6.3 – Above we have the 84 putative correspondences between the two images, obtained as in section 3.8. Below are the 62 correspondences (inliers) that remain after the robust estimation (section 3.9).
Fig. 6.4 – On the left we have the inliers (the same as in Fig. 6.3), but now with a rectangle around each one defining the neighborhood of that point used to find its descriptor. The size of the square is set according to its characteristic scale. On the right is a zoomed-in view where we can see that each interest point has its own canonical orientation.
Error measures used to quantify the quality of the mosaic
For this, only the errors of section 5.3.1 (Exact vs. estimated transformation) and 5.3.2
(Homography decomposition based) were used.
Other errors were studied, namely a ratio between the overlap percentages of the estimated and the exact mosaics:
Fig. 6.5 – Ratio between the overlap percentages. Concept.
(6.1)
The overlap percentage is the percentage of the second image (top right image of Fig. 6.1) that
lies within the reference image.
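A minimal sketch of how this ratio can be computed, assuming each frame uses NaN outside its support (toy frames shown):

% Count, in each mosaic, the pixels where the two frames overlap.
frame1Exact = [0.1 0.2 NaN; 0.3 0.4 NaN]; % toy frames (NaN = no support)
frame2Exact = [NaN 0.5 0.6; NaN 0.7 0.8];
frame1Est   = frame1Exact;
frame2Est   = [NaN NaN 0.6; NaN 0.7 0.8];
overlapEst   = nnz(~isnan(frame1Est)   & ~isnan(frame2Est));
overlapExact = nnz(~isnan(frame1Exact) & ~isnan(frame2Exact));
ratio = overlapEst / overlapExact;        % ideally close to 1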
Also, a possible error measure consisting of the difference between the Frobenius norms of the exact and the estimated homographies was studied [115][116].
Other criteria, such as the number of inliers, were also studied; however, most had the same problem: they defined necessary but not sufficient conditions, and so were not used.
6.1.1. Minimum number of inliers required
Naturally, the number of inliers should be as high as possible. However, an experimental study on fast 2D homography estimation for face recognition [117] refers that a good estimate should be achievable with near the minimum required number of correspondences (  is the minimum) or, at least, as a rule of thumb, with fewer than  point correspondences.
In the absence of parallax displacement (controlled environment), this study shows that the DLT algorithm gives relatively good results for at least  to  inliers. But again, we must say that a high number of inliers does not necessarily mean a good homography estimation. Point correspondences should be evenly distributed within the overlap region, avoiding problems such as collinearity.
6.1.2. Runtime
This study was not meant for real-time applications. Next, some offline runtimes are presented, obtained using a laptop with a  processor and  of RAM (this system was used for all the tests in this dissertation). Recall that the code was written and executed in Matlab.
Tests showed that the time taken from reading two  .png images with  of overlap until the display of our final image mosaic was about  seconds (  seconds were just for the final display). This includes the calculation of the descriptor vectors –  entries each – for  to  interest points per image, resulting in a final average of  inliers. If we include the time taken to obtain each of the two images with the simulator of chapter 4, it increases to around  seconds.
6.1.3. Influence of the random factor introduced by Ransac
We saw before in section 3.9 that, from the entire set of putative correspondences between two images, the Ransac algorithm randomly selects just 4 putative correspondences at a time. This artifice, used only to speed up the algorithm, has a consequence: each time we run the algorithm it gives different results, because a different set of inliers is chosen.
Using the two-image strategy described in the beginning of this section (6.1), a small test was made to assess the variability of the results.
The reference image was set to: ; .
The second image was obtained with: since
the ; ; .
Now, running the mosaic algorithm 10000 times (for statistical independence) over these two images, the following table was obtained:
Tab. 6.1 – Variability of the number of inliers.
10000 repetitions    Standard Deviation   Minimum value   Mean value   Maximum value
Number of Inliers:   14,85                114,00          166,12       189,00
The numbers of keypoints found in the first and in the second image used in this test were, respectively,  and . The number of putative correspondences was .
From Tab. 6.1 we can see just a small glimpse of the problems that may arise from this variability. In this particular situation, the minimum number of inliers represents just about 60% of the maximum value that we could have. One may ask if this difference has meaning in the final results. For this situation it does not, but if our minimum value were  inliers and the maximum possible were , it would be significant. In some situations the minimum value even goes below  inliers, and in such situations the mosaicing problem has no solution.
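The variability test itself can be sketched as below; ransacHomography is a hypothetical wrapper around the method of section 5.1.1 returning the inlier set, and im1 and im2 stand for the two test images:

% Repeat the estimation and collect inlier-count statistics (cf. Tab. 6.1).
nRuns = 10000;
nInliers = zeros(nRuns, 1);
for k = 1:nRuns
    [~, inliers] = ransacHomography(im1, im2);  % hypothetical call
    nInliers(k) = numel(inliers);
end
fprintf('std %.2f  min %d  mean %.2f  max %d\n', ...
        std(nInliers), min(nInliers), mean(nInliers), max(nInliers));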
6.1.4. Percentage of overlap between two images
Naturally, the bigger the overlap percentage between images, the smaller the errors resulting from the proposed method; however, the number of images required to cover the same given area increases. Despite the increased computational cost needed to stitch more images, the mosaicing algorithm, as was said, was planned to run offline, so the computational cost (time consumption) was not a problem.
Taking photographs at higher frequency rates will result in bigger overlap percentages; however, problems related to the UAV capabilities, such as the data transfer speed achieved with the USB interface of the quadrotor, are not of our concern in this dissertation.
The objective of this section is to give feedback on what the method implemented in this dissertation can do, by suggesting the most appropriate overlap percentage between two consecutive images taking into account the physical limitations of the problem. Reported tests showed that the controller board used on the quadrotor allows photographs to be taken with pitch and yaw angles below  degrees20. The information that arrived from the GPS sensors was inaccurate and of little use, because we were getting errors of nearly  meters in the quadrotor position.
20 This is the same as saying that our errors are below  degrees, because naturally we wanted them to be zero degrees.
Again, using the two-image strategy described in the beginning of this section (6.1), a series of tables was computed (see Appendix). In each table, two parameters were set to change: the overlap percentage and the pitch/roll/yaw or scale (the  coordinate will change for the scale evaluation) of the second image. Small considerations on contrast variations, brightness changes and noise will also be addressed. Each table presents a different error measure, as we will see later. For statistical independence, each entry of these tables was computed from five different pairs of images, repeating the Ransac step  times for each. In total, each entry of each table was obtained from  results.
The overlap percentages that were used in this analysis can be seen in Tab. 6.2, as well as the individual  and  displacements that were added to the camera position (from the reference image to the second image) to achieve such overlap percentages. These results were obtained from equations (5.12) to (5.15).
The errors are in absolute values, and naturally it is in our interest that they be as near zero as possible. The geometric distance error (Euclidean distance: section 5.3.1), in this particular simulation, is in meters, since each pixel was set to represent one meter.
Pitch
As was mentioned before, we should expect maximum errors for the pitch of around  degrees. Thus, if one image is taken with a pitch of  and the next right after with , this is roughly equivalent to having a pitch of  for the first image and a pitch of  for the second image. One may ask: why not instead just consider a simulation where the first image had  and the second ? That is because the  is not yet very accurate. If, in the future, the controller of the quadrotor gets better and the maximum error turns out to be , there is no need to repeat the results, because we just have to check these tables for results around a pitch equal to .
A close look at Tab. 6.3 shows that we have better results for negative angles than for positive ones. This is simply because a negative pitch increases the overlap percentage when the quadrotor is moving straight ahead (Fig. 6.6). The displacement can be computed from Tab. 6.2.
Fig. 6.6 – Scheme used for the pitch-overlap study.
The tables regarding the pitch can be seen in Appendix B. If we consider the maximum pitch to be , then it is recommended to have overlaps of near . Despite the fact that the maximum pitch error obtained for this case was , the mean was just  with a standard deviation of  and a minimum of . Tables regarding “standard deviations”21 and “best results” were always computed for all the situations in this dissertation, but are not included to avoid an enormous, unnecessary appendix. A conservative point of view was adopted in this and in the following analyses.
To finish this topic: we saw that this mosaicing method is planned to run offline, and that it is possible to obtain very good results if we have a bit of “luck” in the random selection of the four putative correspondences with Ransac. So, what is proposed is, in offline applications, to allow the Ransac algorithm to test as many random samples as possible and only then choose the consensus set with the most inliers22.
Roll
The displacement used for the roll study was along the , since it gave worse results than along the .
Fig. 6.7 – Scheme used for the roll-overlap study.
Analyzing the tables in Appendix C for maximum rolls of , we see that the results are significantly better than the ones obtained with the pitch. An overlap percentage of  gave a mean error value of just  and a maximum of . We can also see that the biggest Euclidean distance was  meters, but the mean was just  meters, which represents less than  pixels in the final mosaic.
One may see that there are maximum errors for the roll angle near  (Tab. 7.13). There is no real explanation for that, because it would mean that the second image was pointing towards the sky, and so we should not have had a solution; yet the method found one. We assumed these to be mathematical solutions (given by the method described in section 3.10) with no physical meaning.
21 The standard deviation was obtained by averaging all the standard deviations ( ). Since the sample sizes are equal for all the situations (no need for weights), the square root of the averaged variances was simply used:  and then  with , because we are using  pairs of images. For this, and for information regarding the variance of combined sets of data – which is different from “averaging” – the reference [132] is recommended.
22 As we saw in section 6.1.1, empirically we know that a higher number of inliers has more chances of giving better results (Ransac itself uses this principle).
Yaw
At the time, there was no precise information available regarding the capacity of the controller to control the quadrotor direction; thus the displacement used for the yaw study was along the . The results of Appendix D show that we cannot have errors around  and .
Fig. 6.8 – Scheme used for the yaw-overlap study.
For overlaps of  we got maximum yaw errors below  in the range  to , which is very good when compared to the previous pitch and roll angles. Rotations of between  and  in two consecutive photographs are very unlikely; yet, the maximum yaw error found was .
Scale
Despite the inaccurate information received from the GPS sensors, those responsible for the quadrotor reported that it is possible to take photographs at a relatively constant altitude. The displacement used was along , for no special reason.
Fig. 6.9 – Scheme used for the scale-overlap study.
Each image has  pixels within ( ). From eq. (5.9), for a height of  meters, we have a covered area of  square meters (  pixels of the world image)23. The reason why a height of  meters was used is simply that, at this height, we could shift our altitude to near half its value (  meters   pixels) without compromising the accuracy of the photographs, which we know to be obtained from the world image through bilinear interpolation.
The results in Appendix E show that, for overlaps of , a shift in the height of the second image to values between  and  results in mosaics with maximum errors (section 5.3.2) of  meters and a mean below  meters.
23 With
Contrast variations
For the change of contrast, all the pixels of the second image (two-image strategy) were multiplied by a constant value. Naturally, pixel values above  or below  (range:  to ) were saturated. The displacement used was along the .
Fig. 6.10 – Contrast variations. a) ; b) ; c) ;
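A minimal sketch of the contrast change, assuming an 8-bit 0-255 pixel range (the exact range used is not legible in this transcript):

% Multiply all pixels by a constant factor and saturate the result.
I = uint8(255 * rand(240, 320));          % hypothetical second image
c = 1.25;                                 % contrast factor (cf. Tab. 6.4)
I2 = uint8(min(max(double(I) * c, 0), 255));  % scale and clip to [0, 255]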
Tab. 6.4 – Minimum number of inliers on contrast evaluation (each entry obtained from 5 image pairs x 100 Ransac runs; in the original, cell colors group the values into the bands ≥250, 200–250, 150–200, 120–150, 90–120, 60–90, 40–60, 20–40, 10–20, 7–10 and <7).
Contrast Change                        Overlap Percentage
(constant factor)    20    30    35    40    45    50    55    60    65    70    80
0,35                  4     5     9    14    16    17    20    24    24    25    32
0,50                  5    10    16    24    31    38    40    52    52    58    82
0,75                 11    22    37    54    67    77    97   124   138   152   208
0,90                 16    44    69    95   114   128   151   179   201   218   282
1,00                 21    55    86   121   142   158   183   215   242   269   336
1,10                 17    45    73   107   124   137   163   192   218   243   312
1,25                  5    23    44    62    81    93    84   101   110   122   170
1,40                  4     6    13    17    18    19    22    27    28    32    57
Tab. 6.5 – Maximum geometric distance error on contrast evaluation.
Maximum Geometric Distance Error (section 5.3.1) – in pixels