8/10/2019 Final Document Print
1/78
3D Reconstruction Based on Image Pyramid and Block Matching
Dept. of ECE, MRITS 1
CHAPTER 1
INTRODUCTION TO 3D
1.1 3D RECONSTRUCTION FROM SINGLE 2D IMAGES
Given a single image of an everyday object, a sculptor can recreate its 3D shape (i.e.,
produce a statue of the object), even if that particular object has never been seen before.
Presumably, it is familiarity with the shapes of similar 3D objects (i.e., objects from the
same class) and how they appear in images that enables the artist to estimate the shape.
This might not be the exact shape of the object, but it is often a good enough estimate for
many purposes.
In general, the problem of 3D reconstruction from a single 2D image is ill posed,
since different shapes may give rise to the same intensity patterns. To solve this,
additional constraints are required. Here, we constrain the reconstruction process by
assuming that similarly looking objects from the same class (e.g., faces, fish), have
similar shapes. We maintain a set of 3D objects, selected as examples of a specific class.
We use these objects to produce a database of images of the objects in the class (e.g., by
standard rendering techniques), along with their respective depth maps. These provide
examples of feasible mappings from intensities to shapes and are used to estimate the
shapes of objects in query images.
Methods for single image reconstruction commonly use cues such as shading,
silhouette shapes, texture, and vanishing points. These methods restrict the allowable
reconstructions by placing constraints on the properties of reconstructed objects (e.g.,
reflectance properties, viewing conditions, and symmetry). A few approaches explicitly
use examples to guide the reconstruction process. One approach reconstructs outdoor
scenes assuming they can be labelled as ground, sky, and vertical billboards.
The target of the system is the geometric model of the scene. We therefore consider
geometric reconstruction and not photometric (or image-based) reconstruction, which
directly generates new views of a scene without (completely) reconstructing the 3D
structure. With the purpose and application context stated, we set the limits as:
Static scenes: There are no moving objects, or the movement of objects is negligible.
Un-calibrated cameras: The input data is captured by an un-calibrated camera, i.e. the
camera's intrinsic parameters, such as focal length, are unknown.
Varying intrinsic camera parameters: The camera's intrinsic parameters (e.g. focal
length) can vary freely. Together with the previous assumption, this adds flexibility to the
system.
1.2 3D RECONSTRUCTION FROM VIDEO SEQUENCES
The application-oriented description of 3D reconstruction from video sequences
(shortly called 3D reconstruction) is as follows:
1. The process starts with the data capturing step, in which a person moves around
and captures a static scene using a hand-held camera.
2. The recorded video sequence is then pre-processed (e.g. selecting frames,
removing noise, normalizing illumination).
3. After that, the video sequence is processed to produce a 3D model of the scene.
4. Finally, the 3D model can be rendered, or exported for editing using 3D modeling
tools.
Fig.1.1: Main tasks of 3D reconstruction
The 3D reconstruction (step 3) can be divided into 4 main tasks, as follows:
1. Feature detection and matching: The objective of this step is to find the
same features in different images and match them.
2. Structure and motion recovery: This step recovers the structure and motion of
the scene (i.e. 3D coordinates of the detected features; position, orientation and intrinsic
parameters of the camera at the capturing positions).
3. Stereo mapping: This step creates a dense matching map. In conjunction with the
structure recovered in the previous step, this enables building a dense depth map.
4. Modeling: This step includes procedures needed to make a real model of the
scene (e.g. building mesh models, mapping textures).
Some works define the input as an image sequence, but fig.1.1 defines it as a video
sequence, since our practical objective is a system that does reconstruction from video. By
defining it this way, we clearly state that the intermediate step to go from video to
image sequences (i.e. frame selection) is a part of the reconstruction process.
1.2.1 Feature Detection and Matching
This process creates the correspondences used by the next step, structure and motion
recovery, by detecting and matching features in different images. To date, the features
used in structure recovery processes are points and lines, so here features are understood
as points or lines.
Fig.1.2: Pollefeys 3D modelling framework
Detectors: Given an image, a feature detector is a process that detects features in the
image. The most important information a detector gives is the location of features, but other
characteristics, such as scale, can also be detected. Two characteristics that a good
detector needs are repeatability and reliability. Repeatability means that the same feature can
be detected in different images. Reliability means that a detected point should be
distinctive enough that the number of its matching candidates is small.
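As an illustration of the detection idea, the Harris corner response (a standard detector, not necessarily the one used in this work) scores each pixel by how strongly the local gradients vary in two directions. The sketch below is a minimal NumPy version, without the Gaussian weighting and non-maximum suppression a real detector would add.

```python
import numpy as np

def harris_response(img, k=0.04):
    """Harris corner score: high where gradients vary in two directions."""
    img = img.astype(float)
    Ix = np.gradient(img, axis=1)          # horizontal gradient
    Iy = np.gradient(img, axis=0)          # vertical gradient
    def box3(a):                           # 3x3 box sum (edge-replicated)
        p = np.pad(a, 1, mode="edge")
        return sum(p[i:i + a.shape[0], j:j + a.shape[1]]
                   for i in range(3) for j in range(3))
    Sxx, Syy, Sxy = box3(Ix * Ix), box3(Iy * Iy), box3(Ix * Iy)
    det = Sxx * Syy - Sxy ** 2             # second-moment matrix determinant
    trace = Sxx + Syy
    return det - k * trace ** 2

# A white square on black: corners score positive, edges negative, flat areas zero.
img = np.zeros((20, 20))
img[5:15, 5:15] = 1.0
R = harris_response(img)
print(R[5, 5] > 0, R[5, 10] < 0, R[10, 10] == 0)
```

A corner of the square gives a positive response (gradients in two directions), an edge midpoint a negative one, and the flat interior zero, which is exactly the distinctiveness property described above.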
Classification: Point descriptors are classified into the following categories:
Distribution-based descriptors: Histograms are used to represent the characteristics of
the region. The characteristics could be pixel intensity, distance from the centre point, or
relative ordering of intensity or gradient.
Spatial-frequency descriptors: These techniques are used in the domain of texture
classification and description. Texture description using Gabor filters is standardized in
MPEG-7.
Differential descriptors: The descriptor used to evaluate detector reliability is an
example of a differential descriptor, in which a set of local derivatives (the local jet) is used
to describe an interest region.
Moments: Moments can also be used to describe a region. The central moment of a
region, in combination with the moment's order and degree, forms the invariant.
1.3 LINES
Two-view projective reconstruction can only use point correspondences, but in
three- or more-view structure recovery it is possible to use line correspondences as well.
1.3.1 Line detection
Line detection usually includes edge detection, followed by line extraction.
1.3.2 Edge detection
The key to solving the problem is the intensity change, which is shown via the
gradient of the image. Edge detectors usually follow the same routine: smoothing,
applying edge enhancement filters, applying a threshold, and edge tracing.
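The routine above can be sketched with a Sobel-based detector. This is a minimal illustration only; a detector such as Canny adds non-maximum suppression and hysteresis-based edge tracing on top of these steps.

```python
import numpy as np

def sobel_edges(img, thresh=0.5):
    """Edge detection routine: smooth -> gradient filters -> threshold.
    (Minimal sketch; Canny adds non-maximum suppression and edge tracing.)"""
    img = img.astype(float)
    H, W = img.shape
    # 1. Smooth with a 3x3 box filter to suppress noise.
    p = np.pad(img, 1, mode="edge")
    sm = sum(p[i:i + H, j:j + W] for i in range(3) for j in range(3)) / 9.0
    # 2. Sobel gradient filters, applied via shifted views of the smoothed image.
    q = np.pad(sm, 1, mode="edge")
    def sh(di, dj):
        return q[1 + di:1 + di + H, 1 + dj:1 + dj + W]
    gx = (sh(-1, 1) + 2 * sh(0, 1) + sh(1, 1)) - (sh(-1, -1) + 2 * sh(0, -1) + sh(1, -1))
    gy = (sh(1, -1) + 2 * sh(1, 0) + sh(1, 1)) - (sh(-1, -1) + 2 * sh(-1, 0) + sh(-1, 1))
    mag = np.hypot(gx, gy)
    # 3. Threshold relative to the strongest response.
    return mag > thresh * mag.max()

# A vertical step edge: detected pixels should cluster around columns 9-10.
img = np.zeros((16, 20))
img[:, 10:] = 1.0
edges = sobel_edges(img)
cols = sorted(set(np.nonzero(edges)[1].tolist()))
print(cols)
```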
Evaluations of edge detectors are inconsistent and do not converge, for reasons such
as unclear objectives and varying parameters. One series of evaluations tests algorithms in
different tasks, with the application acting as a black box; one of these tasks is structure
from motion. That evaluation shows that, overall, the Canny detector is the most suitable
because of its performance: the fastest speed and low sensitivity to parameter variation.
However, the structure from motion algorithm used there is not a three-view one and uses
line segments rather
than lines; also, the intermediate processing (line extraction and correspondence) that
would affect the final result is fixed. Thus the result is not conclusive.
1.3.3 Line Extraction
Lines can be extracted in several ways. The Hough transform is well known for
curve fitting; despite its long history, the Hough transform and its extensions are still
widely used. A simpler approach connects line segments under a limit on angle changes
and then uses the least-median-of-squares method to fit the connected paths into lines. As
with edge detection, no complete and concrete evaluation of line extraction is available.
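The robust fitting idea can be illustrated with a small least-median-of-squares line fit. This is a generic sketch of the principle, not the exact procedure of the approach cited above.

```python
import numpy as np

def lmeds_line(points, trials=50, seed=0):
    """Least-median-of-squares line fit: among candidate lines through two
    sampled points, keep the one with the smallest median point distance."""
    rng = np.random.default_rng(seed)
    pts = np.asarray(points, dtype=float)
    best = None
    for _ in range(trials):
        i, j = rng.choice(len(pts), size=2, replace=False)
        d = pts[j] - pts[i]
        n = np.array([-d[1], d[0]])                  # normal of the candidate line
        if np.linalg.norm(n) == 0:
            continue
        n = n / np.linalg.norm(n)
        med = np.median(np.abs((pts - pts[i]) @ n))  # median point-line distance
        if best is None or med < best[0]:
            best = (med, pts[i], n)
    return best   # (median residual, point on line, unit normal)

# Points on y = 2x + 1 plus one gross outlier; the robust fit ignores the outlier,
# whereas an ordinary least-squares fit would be pulled toward it.
xs = np.arange(10.0)
pts = np.vstack([np.stack([xs, 2 * xs + 1], axis=1), [[5.0, 40.0]]])
med, p0, n = lmeds_line(pts)
print(med < 1e-9)
```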
1.3.4 Line matching
Lines can be matched based on attributes such as orientation, length, or extent of
overlap. Matching strategies such as nearest line, or additional-view verification,
can be used to increase speed and accuracy. Optical flow can be employed in the case
of a short baseline. Matching groups of lines (graph matching) is more accurate than
matching individual lines. Beardsley et al. use geometric constraints in both the two-view
and three-view cases to match lines; the constraints are found by a robust method with
corresponding points.
Lines are highly structured features and give stronger constraints. Lines are
numerous and easy to extract in scenes dominated by artificial objects, e.g. urban
architecture. However, the fact that evaluations of line extraction and matching for
structure recovery are neither complete nor concrete is probably the reason why, although
the theory of three-view reconstruction with lines has been available for a long time,
structure recovery methods usually use point correspondences. One of the few works that
uses line correspondences and trifocal tensors is that of Breads, but even there lines are
not used directly; point correspondences are still used first to recover the geometry
information.
1.4 STRUCTURE AND MOTION RECOVERY
The second task, structure and motion recovery, recovers the structure of the scene
and the motion information of the camera. The motion information is the position, orientation,
and intrinsic parameters of the camera at the captured views. The structure information is
the 3D coordinates of the features.
Given feature correspondences, the geometric constraints among views can be
established. The projection matrices that represent the motion information can then be
recovered. Finally, the 3D coordinates of the features, i.e. the structure information, can
be computed via triangulation.
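The triangulation step can be illustrated with the standard linear (DLT) method. The two cameras below are hypothetical, chosen only to make the example self-contained.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation: each image point contributes two rows of
    A X = 0; the 3D point is the smallest right singular vector of A."""
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]                       # de-homogenize

def project(P, X):
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Two hypothetical cameras: unit focal length, second camera 1 unit to the right.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, 0.2, 4.0])
X_hat = triangulate(P1, P2, project(P1, X_true), project(P2, X_true))
print(np.round(X_hat, 6))
```

With noise-free projections the method recovers the 3D point exactly; with noisy measurements it returns the least-squares intersection of the two rays.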
Fig.1.5: Structure and motion recovery process
1.5 ADVANTAGES AND PROBLEMS OF USING VIDEO
SEQUENCES
It is possible to do 3D reconstruction from individual images. But in practice it is
more natural to use video sequences, since this eases the capturing process and provides
more complete data. However, problems also arise. The following describes the advantages
and the problems of using video sequences as input.
1.5.1 Advantages
The most important advantage of using video sequences as input is the higher
quality one can obtain. Both geometric accuracy and visual quality can be improved by
exploiting the redundancy of the data. Intuitively, more back-projecting rays of a point's
projections limit the possible 3D coordinates of the point. The best texture, found by
selecting the best view or by super-resolution, can be used to get better visualization quality.
Image sequences also enable some techniques to deal with shadow, shading and
highlights.
Other advantages are automation and flexibility. Capturing data with a
handheld camera is more comfortable, since a person does not have to worry about
missing information or consider whether the captured information is enough for
reconstruction. And regarding processing time, instead of manually selecting some images
from a video, it is better to have a system that can do everything automatically.
1.5.2 Problems
To take advantage of video sequences we have to deal with some
problems, ranging from pre-processing (frame selection, sequence segmentation), through
the processing itself as seen in the previous sub-sections, to post-processing (bundle
adjustment, structure fusion).
Frame selection: Among a number of frames, selecting good frames will improve the
reconstruction result. Good frames are ones that have proper geometric attributes and
good photometric quality. The problem is related to the estimation of views' position and
orientation and photometric quality evaluation.
Sequence segmentation: Reconstruction algorithms assume that a sequence is
continuously captured. When this does not hold, the sequence should be broken into proper
scene parts, which are reconstructed separately and fused later.
Structure fusion: Results of processing different video segments (generated either by
different captures or by segmentation) must be fused together to create a final unique
result.
Bundle adjustment: The reconstruction process includes local updates (e.g. feature
matching, structure updates) and biased assumptions (e.g. use of the first view's coordinate
system). Those lead to inconsistency and accumulated errors in the global result. There
should be a global optimization step to produce a unique, consistent result.
1.6 CRITICAL CASES
A critical case happens when it is impossible to make a metric reconstruction
from the input data, either because of the characteristics of the scene or because of the
capturing positions.
In practice, metric reconstruction from video sequences captured by a person
using a hand-held camera hardly falls into an absolute critical case. However, nearly
critical cases are common in practice, e.g. a camera moving along a wall or on an elliptic
orbit around the object. That is why studying critical configurations and detecting those
cases is extremely important to create a robust reconstruction method, or select the most
suitable method for the case.
There are two kinds of critical cases: (i) critical surfaces or critical configurations
and (ii) critical motion sequences (of the camera). The first class depends on the observed
points. The latter depends only on the camera motion, i.e. it can happen with any scene. A
"brute force" approach is to run several algorithms and select the best result; this,
however, only helps to reject a critical case, not to choose the proper method for it. Some
important notes about critical cases are:
Normal cases under some conditions, e.g. calibrated or fixed intrinsic parameters, can
turn into critical ones when the conditions change. The less calibrated the camera, i.e. the
fewer camera parameters that are known, the more ambiguous the reconstruction will be.
1.7 IMAGE-BASED 3D RECONSTRUCTION
Image-based 3D reconstruction is an active field of research in photogrammetry
and computer vision. The need for detailed 3D models for mapping and navigation,
inspection, cultural heritage conservation or photorealistic image-based rendering for the
entertainment industry has led to the development of several techniques to recover the shape
of objects. To achieve precise and highly detailed reconstructions, active laser scanning is
often employed, providing 2.5D range images and the respective 3D point cloud at metric scale.
On the other hand, laser-based methods are complex to handle for large scale
outdoor scenes, especially for aerial data acquisition. In contrast to that, passive image-
based methods that utilize multiple overlapping views are easily deployable and low
cost compared to laser scanners, but require some post-processing effort to derive depth
information. In this work we investigate how redundancy and baseline influence the depth
accuracy of multiple-view matching methods.
In particular, we perform synthetic experiments on a typical aerial camera
network that corresponds to a 2D flight pattern with 80% forward-overlap and 60% side-lap,
as shown in fig.1.6. By covariance analysis of triangulated scene points, the theoretical
bound of depth accuracy is determined according to the triangulation angle and the
number of measurements (i.e. the redundancy).
One of the main findings is that true multi-view matching/triangulation
outperforms fused two-view stereo results by at least one order of magnitude in terms of
depth accuracy. Furthermore, we present a fast, accurate and robust matching and
reconstruction technique, suitable for high resolution images of large scale scenes, that is
able to compete by leveraging the redundancy of many views. The solution to multi-view
reconstruction is based on pair-wise stereo, employing an efficient and robust optical
flow that is restricted to the epipolar geometry. Unlike standard aerial
Fig.1.6(a): The view network, a sparse reconstruction and uncertainties
(magnified by 1000 for better visibility) for selected 3D points on a regularly
sampled grid on the ground plane
Fig.1.6(b): Reconstructed dense point cloud of an urban scene from our
multi-view method.
matching approaches that rely on 2.5D data fusion of pair-wise stereo depth maps, we use
a correspondence chaining (i.e. measurement linking) and triangulation approach that
takes full advantage of the achievable baseline (i.e. triangulation angles). In contrast to
voxel-based approaches, polygonal meshes and local patches, we focus on algorithms
representing the geometry as a set of depth maps. This eliminates the need for resampling
the geometry in the three-dimensional domain and can be easily parallelized. We evaluate
the approach on a multi-view benchmark data set that provides accurate ground truth and
on large scale aerial images.
1.8 UNCERTAINTY OF SCENE POINTS
The depth uncertainty of a rectified stereo pair can be directly determined from
the disparity error,

Δz = (z² / (f · b)) · Δd ... (1)

where z is the point depth, f the focal length, b the image baseline and Δd the disparity
error. Hence the depth precision is mainly a function of the ray intersection angle. In
contrast, for multi-view image matching and triangulation, the redundancy not only
implies more measurements but additionally constrains the 3D point location through
multiple ray intersections. These entities are not independent but are coupled, since they
rely on the network geometric
configuration that determines image overlap (i.e. redundancy) and baseline
simultaneously. Given a photogrammetric network of cameras and correspondences with
known error distribution, the precision of triangulated points can be determined from the
3D confidence ellipsoid (i.e. covariance matrix CX), as shown in fig. 1.7. An empirical
estimate of the covariance ellipsoid corresponding to multi-view triangulation can be
computed by statistical simulation. For the moment we assume that camera orientations
and 3D structure are fixed and known.
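Equation (1) can be illustrated numerically; the focal length, baseline and disparity error below are hypothetical values chosen for the example, not the aerial configuration of this section.

```python
def depth_uncertainty(z, f, b, dd):
    """Equation (1): dz = z^2 / (f * b) * dd; error grows quadratically with depth."""
    return (z ** 2) / (f * b) * dd

f = 0.05    # focal length in metres (hypothetical)
b = 0.5     # stereo baseline in metres (hypothetical)
dd = 5e-6   # disparity error on the sensor in metres (roughly one small pixel)
for z in (10.0, 20.0, 40.0):
    print(z, depth_uncertainty(z, f, b, dd))
```

Doubling the depth quadruples the uncertainty, which is why wide baselines (large triangulation angles) matter so much for depth precision.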
The cameras are distributed along a 2D grid (corresponding to flight paths) in
order to achieve 80% forward overlap and 60% side-lap, as shown in fig.1.7. Following a
large-format digital aerial camera (e.g. the UltraCam-D from Microsoft), the image
resolution is set to 7500 × 11500 pixels with a field of view α = 54°. Furthermore, the 3D
points are evenly distributed on a 2D plane that corresponds to the bald earth surface
observed from a flying height of 900 m. Therefore, an average Ground Sampling Distance
(GSD) of 8 cm/pixel is achieved.
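The stated GSD can be sanity-checked from the flying height, field of view and pixel count, assuming the 54° field of view spans the 11500-pixel image side (the text does not state which side it spans, so this assignment is an assumption).

```python
import math

h = 900.0                     # flying height, m
fov = math.radians(54.0)      # field of view
npix = 11500                  # pixels across that field of view (assumed side)
# Ground footprint across the field of view, then metres per pixel.
ground_width = 2 * h * math.tan(fov / 2)
gsd = ground_width / npix     # ground sampling distance, m/pixel
print(round(ground_width, 1), round(gsd * 100, 1))
```

This gives a footprint of roughly 917 m and a GSD close to the 8 cm/pixel quoted above.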
Given the cameras P_i, i = 1…N (i.e. calibration and poses) and 3D points X_j,
j = 1…M, the respective ground-truth projections are produced as x_ij = P_i X_j. Therefore,
for every 3D point a set of point-tracks (i.e. 2D measurements) is generated,
m = (&lt;x_1, y_1&gt;, &lt;x_2, y_2&gt;, …, &lt;x_k, y_k&gt;). Next, the 2D projections are perturbed by
zero-mean Gaussian isotropic noise, x̂ = x + N(0, Σ),

Σ = diag(σx², σy²) ... (2)

with standard deviation σx = σy = 1 pixel (i.e. ≈ 8 cm GSD). Given the set of perturbed
point-tracks m̂ = (&lt;x̂_1, ŷ_1&gt;, &lt;x̂_2, ŷ_2&gt;, …, &lt;x̂_k, ŷ_k&gt;) and the ground-truth projection
matrices P_i, i = 1…N, the 3D position of the respective point in space is determined. This
process requires the intersection of at least two known rays in space. Hence, we use a
linear triangulation method to determine the 3D position of point tracks. This method
generalizes easily to the intersection of multiple rays, providing a least squares solution.
Optionally, a non-linear optimizer based on the Levenberg-Marquardt algorithm is used to
refine the 3D point by minimizing the projection error. Through Monte Carlo simulation
on the perturbed measurement vectors m̂ we obtain a distribution of 3D points X_i
around a mean position X̂. From the Law of Large Numbers it follows that, for a large
number N of simulations, one can approximate the mean 3D position by

X̂ = E_N[X_i] = (1/N) Σ_i X_i ... (3)

and its respective covariance matrix by

CX = E_N[(X_i − E_N[X_i]) (X_i − E_N[X_i])^T] ... (4)

Using the singular value decomposition, the covariance matrix can then be diagonalized,

CX = U diag(σ1², σ2², σ3²) V^T ... (5)

where the columns of U represent the main axes of the covariance ellipsoid and the σi are
the respective standard deviations. The decomposition of the covariance matrix in
equation (5) into its main axes directly relates to the uncertainty in the x-y and z
directions. Under the assumption of fronto-parallel image acquisition, the largest singular
value σ1² corresponds to the uncertainty in depth, and σ2² and σ3² to the uncertainty in
the x-y directions, respectively.
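The simulation of equations (3)-(5) can be sketched on a toy two-camera network (hypothetical numbers, far smaller than the aerial block above): perturb the projections, re-triangulate, and decompose the empirical covariance.

```python
import numpy as np

rng = np.random.default_rng(1)

def triangulate(Ps, xs):
    """Linear multi-ray triangulation: least-squares null vector of A X = 0."""
    A = []
    for P, x in zip(Ps, xs):
        A.append(x[0] * P[2] - P[0])
        A.append(x[1] * P[2] - P[1])
    _, _, Vt = np.linalg.svd(np.array(A))
    X = Vt[-1]
    return X[:3] / X[3]

def project(P, X):
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Toy two-camera network: unit focal length, baseline 1 along x (hypothetical).
Ps = [np.hstack([np.eye(3), np.zeros((3, 1))]),
      np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])]
X_true = np.array([0.3, -0.2, 5.0])
sigma = 1e-3                         # isotropic measurement noise, eq. (2)

# Monte Carlo: perturb measurements, re-triangulate, collect 3D samples.
samples = np.array([
    triangulate(Ps, [project(P, X_true) + rng.normal(0.0, sigma, 2) for P in Ps])
    for _ in range(2000)
])
CX = np.cov(samples.T)               # empirical covariance, eq. (4)
axis_std = np.sqrt(np.linalg.svd(CX, compute_uv=False))   # axis lengths, eq. (5)
print(axis_std)                      # the largest axis lies along the depth direction
```

Even in this toy configuration the depth axis of the ellipsoid is several times longer than the lateral axes, matching the discussion above.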
1.9 LITERATURE SURVEY
With the advent of the multimedia age and the spread of the Internet, video storage on
CD/DVD and video transmission have been gaining a lot of popularity. The ISO Moving
Picture Experts Group (MPEG) video coding standards pertain to compressed video
storage on physical media like CD/DVD, whereas the International Telecommunication
Union (ITU) standards address real-time point-to-point or multi-point communication over
a network. The former has the advantage of higher bandwidth for data transmission.
In either standard, the basic flow of the entire compression-decompression process
is largely the same, and is shown in fig. 1.7. The encoding side estimates the motion in the
current frame with respect to a previous frame. A motion-compensated image for the
current frame is then created, built from blocks of the previous frame. The
motion vectors for the blocks used in motion estimation are transmitted, as well as the
difference between the compensated image and the current frame, which is JPEG encoded
and sent. The encoded image that is sent is then decoded at the encoder and used as a
reference frame for the subsequent frames. The decoder reverses the process and creates a
full frame. The whole idea behind motion-estimation-based video compression is to save
on bits by sending JPEG-encoded difference images, which inherently have less energy
and can be highly compressed, as compared to sending a full JPEG-encoded frame.
Motion JPEG, where all frames are JPEG encoded, achieves anything between 10:1 and
15:1 compression, whereas MPEG can achieve a compression ratio of 30:1 and is also
usable at a 100:1 ratio. It should be noted that the first frame is always sent in full, and so
are some other frames that might occur at some regular interval (like every 6th frame).
The standards do not specify this, and it might change with every video being sent, based
on the dynamics of the video.
The most computationally expensive and resource hungry operation in the entire
compression process is motion estimation. Hence, this field has seen the highest activity
and research interest in the past two decades. The algorithms that have been implemented
are Exhaustive Search (ES), Three Step Search (TSS), New Three Step Search (NTSS),
Simple and Efficient TSS (SES), Four Step Search (4SS), Diamond Search (DS), and
Adaptive Rood Pattern Search (ARPS).
Fig.1.7: MPEG / H.26x video compression process flow
1.10 BLOCK MATCHING ALGORITHMS
The underlying supposition behind motion estimation is that the patterns
corresponding to objects and background in a frame of a video sequence move within the
frame to form corresponding objects in the subsequent frame. The idea behind block
matching is to divide the current frame into a matrix of macro blocks, each of which is then
compared with the corresponding block and its adjacent neighbours in the previous frame to
create a vector that stipulates the movement of a macro block from one location to
MSE = (1/N²) Σ_{i=0}^{N−1} Σ_{j=0}^{N−1} (C_ij − R_ij)² ... (7)

where N is the side of the macro block, and C_ij and R_ij are the pixels being compared in
the current macro block and the reference macro block, respectively. The Peak
Signal-to-Noise Ratio (PSNR), given by equation (8), characterizes the motion-compensated
image that is created using the motion vectors and macro blocks from the reference frame:

PSNR = 10 log10 [255² / MSE] ... (8)

(a) Exhaustive Search (ES)
This algorithm, also known as Full Search, is the most computationally expensive
block matching algorithm of all. It calculates the cost function at each possible location
in the search window, and as a result it finds the best possible match and gives the highest
PSNR of any block matching algorithm. Fast block matching algorithms try to achieve
the same PSNR with as little computation as possible. The disadvantage of ES is that the
larger the search window gets, the more computations it requires.
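The block cost and the full search can be sketched as follows. The MSE and PSNR functions follow the standard definitions, and the synthetic frames are only for illustration; the motion vector returned is the offset from the current block's position to its best match in the reference frame.

```python
import numpy as np

def mse(a, b):
    """Equation (7): mean squared error between two macro blocks."""
    d = a.astype(float) - b.astype(float)
    return np.mean(d * d)

def psnr(mse_val, peak=255.0):
    """PSNR of the motion-compensated image (infinite for a perfect match)."""
    return 10 * np.log10(peak ** 2 / mse_val) if mse_val > 0 else float("inf")

def exhaustive_search(cur, ref, bx, by, N=8, p=7):
    """Full search: test every (dx, dy) in the +/-p window around the block."""
    best_mv, best_cost = (0, 0), float("inf")
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            y, x = by + dy, bx + dx
            if 0 <= y and 0 <= x and y + N <= ref.shape[0] and x + N <= ref.shape[1]:
                c = mse(cur[by:by + N, bx:bx + N], ref[y:y + N, x:x + N])
                if c < best_cost:
                    best_mv, best_cost = (dx, dy), c
    return best_mv, best_cost

# Synthetic frames: the current frame is the reference shifted down 2, right 3.
rng = np.random.default_rng(0)
ref = rng.integers(0, 256, (32, 32))
cur = np.roll(ref, shift=(2, 3), axis=(0, 1))
mv, cost = exhaustive_search(cur, ref, bx=12, by=12)
print(mv, cost)   # the block is found back at offset (-3, -2) with zero error
```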
(b)Three Step Search (TSS)
The general idea is represented in fig.1.9. It starts with the search location at the
centre and sets the step size S = 4, for a usual search parameter value of p = 7. It then
searches at eight locations +/- S pixels around location (0,0). From these nine locations
searched so far, it picks the one giving the least cost and makes it the new search origin. It
then sets the new step size S = S/2, and repeats a similar search for two more iterations
until S = 1. At that point it finds the location with the least cost function, and the macro
block at that location is the best match. The calculated motion vector is then saved for
transmission. TSS gives a flat reduction in computation by a factor of 9: for p = 7, ES
computes the cost for 225 macro blocks whereas TSS computes the cost for 25 macro
blocks. The idea behind TSS is that the error surface due to motion in every macro block
is unimodal. A unimodal surface is a bowl-shaped surface such that the weights generated
by the cost function increase monotonically from the global minimum.
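The three-step procedure can be sketched as below. The smooth synthetic frame is chosen so that the error surface is (approximately) unimodal, which is exactly the assumption TSS relies on; on a non-unimodal surface the greedy descent can miss the true minimum.

```python
import numpy as np

def block_cost(cur, ref, bx, by, dx, dy, N):
    """MSE between the current block and the reference block at offset (dx, dy)."""
    y, x = by + dy, bx + dx
    if y < 0 or x < 0 or y + N > ref.shape[0] or x + N > ref.shape[1]:
        return float("inf")                  # candidate outside the frame
    d = cur[by:by + N, bx:bx + N] - ref[y:y + N, x:x + N]
    return float(np.mean(d * d))

def three_step_search(cur, ref, bx, by, N=8, S=4):
    """TSS: probe 8 points at step S around the origin, move to the cheapest,
    halve S, and repeat until S = 1."""
    dx = dy = 0
    best = block_cost(cur, ref, bx, by, 0, 0, N)
    while S >= 1:
        cx, cy = dx, dy                      # fixed origin for this step
        for oy in (-S, 0, S):
            for ox in (-S, 0, S):
                c = block_cost(cur, ref, bx, by, cx + ox, cy + oy, N)
                if c < best:
                    best, dx, dy = c, cx + ox, cy + oy
        S //= 2
    return (dx, dy), best

# A smooth quadratic frame gives a unimodal error surface, as TSS assumes.
ys, xs = np.mgrid[0:48, 0:48]
ref = (ys - 24.0) ** 2 + (xs - 24.0) ** 2
cur = np.roll(ref, shift=(2, -3), axis=(0, 1))   # content moves down 2, left 3
mv, cost = three_step_search(cur, ref, bx=20, by=20)
print(mv, cost)
```

Each of the three steps evaluates at most 8 new points, giving the 25-point total mentioned above for p = 7.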
Fig.1.9: Three Step Search procedure. The motion vector is (5, -3).
(c)New Three Step Search (NTSS)
NTSS improves on TSS results by providing a centre-biased searching scheme
and having provisions for a halfway stop to reduce computational cost. It was one of the
first widely accepted fast algorithms and was frequently used for implementing earlier
standards like MPEG-1 and H.261. TSS uses a uniformly allocated checking pattern
for motion detection and is prone to missing small motions. The NTSS process is
illustrated graphically in fig.1.10. In the first step, 16 points are checked in addition to the
search origin for the lowest weight using a cost function. Of these additional search
locations, 8 are at a distance of S = 4 (similar to TSS) and the other 8 are at S = 1 away
from the search origin. If the lowest cost is at the origin, then the search is stopped
right there and the motion vector is set to (0, 0). If the lowest weight is at any one of the 8
locations at S = 1, then we change the origin of the search to that point and check the
weights adjacent to it.
Fig.1.10: New Three Step Search block matching.
Big circles are the checking points in the first step of TSS, and the squares are the
extra 8 points added in the first step of NTSS. Triangles and diamonds are the second step
of NTSS, showing 3 points and 5 points being checked when the least weight in the first
step is at one of the 8 neighbours of the window centre.
Fig.1.11: Search patterns corresponding to each selected quadrant: (a) Shows all
quadrants (b) quadrant I is selected (c) quadrant II is selected (d) quadrant III is
selected (e) quadrant IV is selected.
Depending on which point it is, we end up checking 5 points or 3 points (fig. 1.11(b)
&amp; (c)). The location that gives the lowest weight is the closest match, and the motion vector
is set to that location. On the other hand, if the lowest weight after the first step was at one
of the 8 locations at S = 4, then we follow the normal TSS procedure. Hence, although this
process might need a minimum of 17 points to check for every macro block, it also has a
worst-case scenario of 33 locations to check.
(d)Simple and Efficient Search (SES)
SES is another extension of TSS and exploits the assumption of a unimodal error
surface. The main idea behind the algorithm is that for a unimodal surface there cannot be
two minima in opposite directions, and hence the 8-point fixed pattern search of TSS
can be changed to incorporate this and save on computations. The algorithm still has
three steps like TSS, but the innovation is that each step has a further two phases. The
search area is divided into four quadrants, and the algorithm checks three locations A, B
and C as shown in fig.1.11. A is at the origin, and B and C are S = 4 locations away from
A in orthogonal directions. Depending on the weight distribution amongst the three, the
second phase selects additional points as shown in fig.1.11. The rules for determining the
search quadrant for the second phase are as follows:
If MAD(A) ≥ MAD(B) and MAD(A) ≥ MAD(C), select (b);
If MAD(A) ≥ MAD(B) and MAD(A) &lt; MAD(C), select (c);
If MAD(A) &lt; MAD(B) and MAD(A) &lt; MAD(C), select (d);
If MAD(A) &lt; MAD(B) and MAD(A) ≥ MAD(C), select (e).
Once the points to check in the second phase are selected, we find the location with
the lowest weight and set it as the origin. We then change the step size, similarly to TSS, and
repeat the above SES procedure until we reach S = 1. The location with the lowest
weight is then noted down in terms of motion vectors and transmitted. An example
process is illustrated in fig.1.12.
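The quadrant-selection rules can be written as a small function. Note this is one common formulation of the SES rules (with B taken a step along +x and C a step along +y), since the exact inequalities vary between presentations of the algorithm.

```python
def ses_quadrant(mad_a, mad_b, mad_c):
    """Pick the second-phase search pattern from the costs at A (the origin),
    B (step S along +x) and C (step S along +y); labels follow fig.1.11."""
    if mad_a >= mad_b and mad_a >= mad_c:
        return "b"   # cost drops toward both B and C
    if mad_a >= mad_b and mad_a < mad_c:
        return "c"   # toward B, away from C
    if mad_a < mad_b and mad_a < mad_c:
        return "d"   # away from both B and C
    return "e"       # toward C, away from B

print(ses_quadrant(10, 5, 5),    # origin worst in both directions
      ses_quadrant(10, 5, 20),   # improves toward B only
      ses_quadrant(1, 5, 5),     # origin already the cheapest
      ses_quadrant(10, 20, 5))   # improves toward C only
```

Because only the signs of the two comparisons matter, the second phase never needs to probe more than one quadrant, which is where the computational saving over TSS comes from.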
Fig.1.12: The SES procedure. The motion vector is (3, 7) in this example.
Although this algorithm saves a lot on computation as compared to TSS, it was
not widely accepted for two reasons. Firstly, in reality the error surfaces are not strictly
unimodal and hence the PSNR achieved is poor compared to TSS. Secondly, there was
another algorithm, Four Step Search, published a year earlier, that presented a lower
computational cost than TSS and gave significantly better PSNR.
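The second-phase quadrant selection described above can be sketched in code. This is a minimal illustration only; the "≥" comparisons are assumed where the printed comparison symbols were unclear.

```python
def select_quadrant(mad_a, mad_b, mad_c):
    """Second-phase pattern selection of SES.

    A is the search origin; B and C lie S pixels away from A in orthogonal
    directions. Returns the pattern label (b) to (e) used by the rules above.
    The ">=" comparisons are assumed where the printed symbols were unclear.
    """
    if mad_a >= mad_b and mad_a >= mad_c:
        return "b"
    if mad_a >= mad_b and mad_a < mad_c:
        return "c"
    if mad_a < mad_b and mad_a < mad_c:
        return "d"
    return "e"  # mad_a < mad_b and mad_a >= mad_c
```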
(e)Four Step Search (4SS)
Similar to NTSS, 4SS also employs center-biased searching and has a halfway
stop provision. 4SS sets a fixed pattern size of S = 2 for the first step, no matter what the
search parameter p value is; thus it looks at 9 locations in a 5x5 window. If the least
weight is found at the centre of the search window, the search jumps to the fourth step. If
the least weight is at one of the eight locations other than the centre, then we make it the
search origin and move to the second step. The search window is still maintained as 5x5
pixels wide. Depending on where the least weight location was, we might end up
checking weights at 3 or 5 new locations. The patterns are shown in Fig 1.13.
Fig.1.13: Search patterns of the FSS. (a) First step (b) Second/Third step
(c) Second/Third step (d) Fourth step.
Once again, if the least weight location is at the center of the 5x5 search window
we jump to the fourth step; otherwise we move on to the third step. The third step is
exactly the same as the second step. In the fourth step the window size is dropped to 3x3,
i.e. S = 1. The
location with the least weight is the best matching macro block and the motion vector is
set to point to that location. A sample procedure is shown in Fig 1.14. This search
algorithm has a best case of 17 checking points and a worst case of 27 checking points.
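The 4SS procedure above can be sketched as follows. This is a simplified illustration using a mean absolute difference (MAD) cost: for brevity it re-evaluates the full pattern of S = 2 points at every step, rather than only the 3 or 5 new points of Fig 1.13, so its checking-point count differs from the 17 to 27 range quoted above.

```python
import numpy as np

def mad(cur, ref, bx, by, dx, dy, n=16):
    """Mean absolute difference between the current macro block at (bx, by)
    and the reference block displaced by the candidate vector (dx, dy)."""
    h, w = ref.shape
    if not (0 <= by + dy and by + dy + n <= h and 0 <= bx + dx and bx + dx + n <= w):
        return np.inf  # candidate block falls outside the frame
    a = cur[by:by+n, bx:bx+n].astype(float)
    b = ref[by+dy:by+dy+n, bx+dx:bx+dx+n].astype(float)
    return float(np.abs(a - b).mean())

def four_step_search(cur, ref, bx, by, n=16):
    """Up to three steps with S = 2 (a 5x5 window), then one step with S = 1."""
    cx, cy = 0, 0  # motion vector accumulated so far
    for _ in range(3):
        cands = [(dx, dy) for dx in (-2, 0, 2) for dy in (-2, 0, 2)]
        best = min(cands, key=lambda d: mad(cur, ref, bx, by, cx + d[0], cy + d[1], n))
        cx, cy = cx + best[0], cy + best[1]
        if best == (0, 0):  # least weight at the centre: jump to the last step
            break
    cands = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]
    best = min(cands, key=lambda d: mad(cur, ref, bx, by, cx + d[0], cy + d[1], n))
    return cx + best[0], cy + best[1]
```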
(f)Diamond Search (DS)
The DS algorithm is exactly the same as 4SS, but the search point pattern is
changed from a square to a diamond, and there is no limit on the number of steps that the
algorithm can take. DS uses two different types of fixed patterns: the Large Diamond
Search Pattern (LDSP) and the Small Diamond Search Pattern (SDSP). These two
patterns and the DS procedure are illustrated in Fig.1.14. Just like in FSS, the first step
uses the LDSP, and if the least weight is at the center location we jump to the last step.
The subsequent steps, except the last one, are also similar and use the LDSP, but the
number of points where the cost function is checked is either 3 or 5, as illustrated in the
second and third steps of the procedure shown in Fig.1.14.
Fig. 1.14: Diamond Search procedure.
This figure shows the large diamond search pattern and the small diamond search
pattern. It also shows an example path to motion vector (-4, -2) in five search steps: four
applications of the LDSP and one of the SDSP.
The last step uses SDSP around the new search origin and the location with the
least weight is the best match. As the search pattern is neither too small nor too big and
the fact that there is no limit to the number of steps, this algorithm can find global
minimum very accurately. The end result should see a PSNR close to that of ES while
computational expense should be significantly less.
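The DS loop described above can be sketched as follows. This is illustrative only: a SAD cost is assumed, and previously checked points are re-evaluated rather than cached.

```python
import numpy as np

# Large and small diamond search patterns as (dx, dy) offsets.
LDSP = [(0, 0), (0, -2), (0, 2), (-2, 0), (2, 0), (-1, -1), (-1, 1), (1, -1), (1, 1)]
SDSP = [(0, 0), (0, -1), (0, 1), (-1, 0), (1, 0)]

def block_cost(cur, ref, bx, by, dx, dy, n=16):
    """Sum of absolute differences between the current macro block and a
    displaced reference block; infinite if the candidate leaves the frame."""
    h, w = ref.shape
    if not (0 <= by + dy and by + dy + n <= h and 0 <= bx + dx and bx + dx + n <= w):
        return np.inf
    return float(np.abs(cur[by:by+n, bx:bx+n].astype(float)
                        - ref[by+dy:by+dy+n, bx+dx:bx+dx+n].astype(float)).sum())

def diamond_search(cur, ref, bx, by, n=16):
    """Repeat the LDSP, re-centred on the best point, until the minimum stays
    at the centre; one final SDSP then refines the motion vector."""
    cx, cy = 0, 0
    while True:
        best = min(LDSP, key=lambda d: block_cost(cur, ref, bx, by, cx + d[0], cy + d[1], n))
        if best == (0, 0):
            break
        cx, cy = cx + best[0], cy + best[1]
    best = min(SDSP, key=lambda d: block_cost(cur, ref, bx, by, cx + d[0], cy + d[1], n))
    return cx + best[0], cy + best[1]
```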
(g)Adaptive Rood Pattern Search (ARPS)
The ARPS algorithm makes use of the fact that the general motion in a frame is
usually coherent, i.e. if the macro blocks around the current macro block moved in a
particular direction, then there is a high probability that the current macro block will also
have a similar motion vector. This algorithm uses the motion vector of the macro block to
its immediate left to predict its own motion vector. An example is shown in fig.1.15.
Fig.1.15: Adaptive Rood Pattern: The predicted motion vector is (3, -2), and the step
size S = Max (|3|, |-2|) = 3.
The predicted motion vector points to (3, -2). In addition to checking the location
pointed to by the predicted motion vector, the algorithm also checks the points of a rood
pattern distributed around the origin, as shown in fig.1.15, at a step size of S = Max (|X|,
|Y|), where X and Y are the x-coordinate and y-coordinate of the predicted motion vector.
This rood pattern search is always the first step. It directly puts the search in an area
where there is a high probability of finding a good matching block.
The point that has the least weight becomes the origin for subsequent search
steps, and the search pattern is changed to the SDSP. The procedure keeps on doing
SDSP until the least weighted point is found to be at the center of the SDSP. A further
small improvement to the algorithm is to check for Zero Motion Prejudgment, using
which the search is stopped halfway if the least weighted point is already at the center of
the rood pattern.
Fig. 1.16: Search points per macro block while computing the PSNR performance of
Fast Block Matching Algorithms
The main advantage of this algorithm over DS is that if the predicted motion
vector is (0, 0), it does not waste computational time doing the LDSP; rather, it directly
starts using
the SDSP. Furthermore, if the predicted motion vector is far away from the center, then
again ARPS saves on computations by directly jumping to that vicinity and using the
SDSP, whereas DS takes its time doing the LDSP.
Care has to be taken not to repeat the computations at points that were checked
earlier. Care also needs to be taken when the predicted motion vector turns out to match
one of the rood pattern locations, so we have to avoid double computation at that point.
For macro blocks in the first column of the frame, the rood pattern step size is fixed at 2
pixels.
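The ARPS steps above can be sketched as follows. This is an illustrative simplification: the Zero Motion Prejudgment check and the duplicate-point bookkeeping mentioned above are omitted, and a SAD cost is assumed.

```python
import numpy as np

def sad(cur, ref, bx, by, dx, dy, n=16):
    """Sum of absolute differences for a candidate displacement (dx, dy)."""
    h, w = ref.shape
    if not (0 <= by + dy and by + dy + n <= h and 0 <= bx + dx and bx + dx + n <= w):
        return np.inf
    return float(np.abs(cur[by:by+n, bx:bx+n].astype(float)
                        - ref[by+dy:by+dy+n, bx+dx:bx+dx+n].astype(float)).sum())

def arps(cur, ref, bx, by, pred, n=16):
    """pred is the motion vector of the macro block to the immediate left;
    a zero prediction falls back to a rood arm of 2 pixels, as for blocks
    in the first column of the frame."""
    px, py = pred
    s = max(abs(px), abs(py)) or 2  # rood arm length S = Max(|X|, |Y|)
    # First step: rood pattern around the origin plus the predicted vector.
    cands = {(0, 0), (s, 0), (-s, 0), (0, s), (0, -s), (px, py)}
    cx, cy = min(cands, key=lambda d: sad(cur, ref, bx, by, d[0], d[1], n))
    # Refine with the SDSP until the minimum sits at the centre.
    sdsp = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
    while True:
        best = min(sdsp, key=lambda d: sad(cur, ref, bx, by, cx + d[0], cy + d[1], n))
        if best == (0, 0):
            return cx, cy
        cx, cy = cx + best[0], cy + best[1]
```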
Fig.1.17: PSNR performance of Fast Block Matching Algorithms.
1.11 THESIS OUTLINE
Chapter 1 deals with the general concept of 3D reconstruction and the block
matching methods used in reconstructing a 3D image from 2D images. The methods
covered are Exhaustive Search (ES), Three Step Search (TSS), New Three Step Search
(NTSS), Simple and Efficient Search (SES), Four Step Search (4SS), Diamond Search
(DS) and Adaptive Rood Pattern Search (ARPS).
Chapter 2 deals with the stereo vision algorithms. Chapter 3 deals with the
implementation of algorithms. Chapter 4 deals with the simulation results. Chapter 5
describes the conclusion.
CHAPTER 2
STEREO VISION ALGORITHMS
2.1 INTRODUCTION TO STEREO VISION
The stereo correspondence problem has historically been, and continues to be,
one of the most investigated topics in computer vision, and a large body of literature on it
has been published. The correspondence problem in computer vision concerns the
matching of points, or other kinds of primitives, in two or more images such that the
matched elements are the projections of the same physical element in the 3D scene; the
resulting displacement of a projected point in one image with respect to the other is
termed the disparity. Similarity is the guiding principle for solving the correspondence
problem; however, stereo correspondence is an ill-posed task, and in order to make it
tractable it is usually necessary to exploit some additional information or constraints.
The most popular constraint is the epipolar constraint, which reduces the search
to one dimension rather than two. Other commonly used constraints are the disparity
uniqueness constraint and the continuity constraint.
The origin of the word stereo is the Greek word stereos, which means firm or
solid; with stereo vision, objects are seen solid in three dimensions, with range. In stereo
vision, the same scene is captured using two sensors from two different angles. The two
captured images have a lot of similarities and a smaller number of differences. In human
vision, the brain combines the two captured images by matching the similarities and
integrating the differences to get a three-dimensional model of the seen objects.
In machine vision, the three-dimensional model of the captured objects is
obtained by finding the similarities between the stereo images and using projective
geometry to process these matches. The main difficulty of reconstruction using stereo is
finding matching correspondences between the stereo pair.
Latest trends in the field mainly pursue real-time execution speeds as well as
decent accuracy. As indicated by recent surveys, the algorithms' theoretical matching
cores are quite well established, leading researchers towards innovations resulting in
more efficient hardware implementations.
Detecting conjugate pairs in stereo images is a challenging research problem
known as the correspondence problem, i.e., finding for each point in the left image the
corresponding point in the right one. To determine these two points from a conjugate
pair, it is necessary to measure the similarity of the points. The point to be matched
without any ambiguity should be distinctly different from its surrounding pixels. Several
algorithms have been proposed in order to address this problem. However, every
algorithm makes use of a matching cost function so as to establish correspondence
between two pixels.
The most common ones are the absolute intensity differences (AD), the squared
intensity differences (SD) and the normalized cross correlation (NCC); evaluations of
various matching costs can be found in the literature. Usually, the matching costs are
aggregated over support regions. Those support regions, often referred to as support or
aggregating windows, can be square or rectangular, fixed-size or adaptive. The
aggregation of the aforementioned cost functions leads to the core of most of the stereo
vision methods, which can be expressed mathematically as follows. For the case of the
sum of absolute differences (SAD):

SAD(x, y, d) = Σ_W |I_l(x, y) - I_r(x, y - d)| ..........(1)

For the case of the sum of squared differences (SSD):

SSD(x, y, d) = Σ_W (I_l(x, y) - I_r(x, y - d))^2 ..........(2)

And for the case of the NCC:

NCC(x, y, d) = Σ_W (I_l(x, y) · I_r(x, y - d)) / sqrt(Σ_W I_l^2(x, y) · Σ_W I_r^2(x, y - d)) ..........(3)

where I_l and I_r are the intensity values in the left and right image, (x, y) are the pixel
coordinates, d is the disparity value under consideration and W is the aggregated support
region. The selection of the appropriate disparity value for each pixel is performed
afterwards. The simpler algorithms make use of winner-takes-all (WTA) method of
disparity selection.
D(x, y) = arg min_d SAD(x, y, d) ..........(4)

i.e., for every pixel (x, y), the disparity d with the minimum cost is selected. Equation (4)
refers to the SAD method, but any other cost could be used instead. However, in many
cases disparity selection is an iterative process, since each pixel's disparity depends on
its neighbouring pixels' disparities. As a result, more than one iteration is needed in
order to find the best set of disparities. This stage differentiates the local from the global
algorithms, which will be analyzed below. An additional disparity refinement step is
frequently used.
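Equations (1) to (4) can be sketched as follows. This is a minimal illustration: it follows the usual convention that the disparity shifts the right image along the scan line, and it assumes the window stays inside both images.

```python
import numpy as np

def match_costs(left, right, x, y, d, n=2):
    """Aggregate SAD (1), SSD (2) and NCC (3) over a (2n+1)x(2n+1) support
    window W centred on (x, y) for a candidate disparity d."""
    wl = left[y-n:y+n+1, x-n:x+n+1].astype(float)
    wr = right[y-n:y+n+1, x-d-n:x-d+n+1].astype(float)
    sad = np.abs(wl - wr).sum()
    ssd = ((wl - wr) ** 2).sum()
    ncc = (wl * wr).sum() / np.sqrt((wl ** 2).sum() * (wr ** 2).sum())
    return sad, ssd, ncc

def wta_disparity(left, right, x, y, d_max, n=2):
    """Winner-takes-all selection of Equation (4): the d with minimum SAD."""
    return min(range(d_max + 1),
               key=lambda d: match_costs(left, right, x, y, d, n)[0])
```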
2.2 GOAL OF STEREO VISION
The goal of stereo vision is the recovery of the 3D structure of a scene using two
or more images of the scene, each acquired from a different viewpoint in space. The
images can be obtained using multiple cameras or one moving camera. The term
binocular vision is used when two cameras are employed.
Fig2.1: General setup of cameras
2.2.1 Stereo setup and terminology
Fixation point: the point of intersection of the optical axes.
Baseline: the distance between the centres of projection.
Epipolar plane: the plane passing through the centres of projection and the point in the
scene.
Epipolar line: the intersection of the epipolar plane with the image plane.
Conjugate pair: any point in the scene that is visible in both cameras is projected to a
pair of image points in the two images.
Disparity: the distance between corresponding points when the two images are
superimposed.
Disparity map: the disparities of all image points (can be displayed as an image).
Fig2.2: Internal projection of camera in stereo vision
Figure2.3: Two cameras in arbitrary position and orientation
2.2.2 Triangulation - the principle underlying stereo vision
The 3D location of any visible object point in space is restricted to the straight
line that passes through the center of projection and the projection of the object point.
Binocular stereo vision determines the position of a point in space by finding the
intersection of the two lines passing through the centers of projection and the projections
of the point in each image.
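For the special case of a rectified, parallel-axis setup (an assumption; the figures show the general case), the triangulation above reduces to the single formula Z = f · B / d:

```python
def depth_from_disparity(f_px, baseline, d_px):
    """Depth for a rectified, parallel-axis stereo pair: Z = f * B / d,
    where f is the focal length in pixels, B the baseline (distance between
    the centres of projection) and d the horizontal disparity in pixels.
    Z is returned in the units of B."""
    if d_px <= 0:
        raise ValueError("a zero disparity corresponds to a point at infinity")
    return f_px * baseline / d_px
```

Note the inverse relation: doubling the disparity halves the depth, which is why nearby objects show large disparities.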
Fig2.4: Positions of binocular cameras
2.2.3 The problems of stereo
The correspondence problem.
The reconstruction problem.
2.2.4 The correspondence problem
Finding pairs of matched points such that each point in the pair is the projection
of the same 3D point.
Triangulation depends crucially on the solution of the correspondence problem.
Ambiguous correspondence between points in the two images may lead to several
different consistent interpretations of the scene.
Fig2.5: Correspondence problem in stereo vision
2.2.5The reconstruction problem
Given the corresponding points, we can compute the disparity map.
The disparity map can be converted to a 3D map of the scene (i.e., the 3D
structure can be recovered) if the stereo geometry is known.
Fig2.6: Reconstruction problem in stereo vision
2.3 STEREO CORRESPONDENCE
As discussed in Section 2.1, the correspondence problem concerns the matching
of points, or other kinds of primitives, in two or more images such that the matched
elements are the projections of the same physical element in the 3D scene; the resulting
displacement of a projected point in one image with respect to the other is the disparity.
Similarity is the guiding principle for solving this ill-posed problem, and additional
constraints, such as the epipolar, disparity uniqueness and continuity constraints, are
exploited to make it tractable.
The existing techniques for general two-view stereo correspondence roughly fall
into two categories: local methods and global methods. Local methods use only small
neighborhoods surrounding the pixels, while global methods optimize some global
(energy) function. Local methods, such as block matching, gradient-based optimization,
and feature matching can be very efficient, but they are sensitive to locally ambiguous
regions in images (e.g., occlusion regions or regions with uniform texture).
Global methods, such as dynamic programming, intrinsic curves, graph cuts, and
belief propagation can be less sensitive to these problems since global constraints provide
additional support for regions difficult to match locally. However, these methods are
more expensive in their computational cost.
Stereo correspondence algorithms can be grouped into those producing sparse
output and those giving a dense result. Feature based methods stem from human vision
studies and are based on matching segments or edges between two images, thus resulting
in a sparse output. This disadvantage, serious for many purposes, is counterbalanced by
the accuracy and speed obtained. However, contemporary applications demand
increasingly dense output.
In order to categorize and evaluate them, a taxonomy has been proposed.
According to this, dense matching algorithms are classified into local and global ones.
Local methods trade accuracy for speed. They are also referred to as window-based
methods because
disparity computation at a given point depends only on intensity values within a finite
support window. Global methods (energy-based) on the other hand are time consuming
but very accurate.
Their goal is to minimize a global cost function, which combines data and
smoothness terms, taking into account the whole image. Of course, there are many other
methods that are not strictly included in either of these two broad classes. The issue of
stereo matching has attracted a variety of computational tools. Advanced computational
intelligence techniques are not uncommon and present interesting and promising
results.
While the aforementioned categorization involves stereo matching algorithms in
general, in practice it is valuable for software implemented algorithms only. Software
implementations make use of general purpose personal computers (PC) and usually result
in considerably long running times. However, this is not an option when the objective is
the development of autonomous robotic platforms, simultaneous localization and
mapping (SLAM) or virtual reality (VR) systems.
Such tasks require real-time, efficient performance and demand dedicated
hardware and consequently specially developed and optimized algorithms. Only a small
subset of the already proposed algorithms is suitable for hardware implementation.
Hardware implemented algorithms are characterized by their theoretical algorithm as
well as the implementation itself. There are two broad classes of hardware
implementations: the field-programmable gate array (FPGA) and the application-
specific integrated circuit (ASIC) based ones. Figure 2 depicts an ASIC chip (a) and an
FPGA development board (b). Each one can execute stereo vision algorithms without the
necessity of a PC, saving volume, weight and consumed energy. The evolution of
FPGAs has made them an appealing choice due to their small prototyping times, their
flexibility and their good performance.
2.4 STEREO MATCHING ALGORITHMS
The issue of stereo correspondence is of great importance in the field of machine
vision, computer vision, depth measurements and environment reconstruction as well as
in many other aspects of production, security, defense, exploration, and entertainment.
Calculating the distance of various points or any other primitive in a scene relative to the
position of a camera is one of the important tasks of a computer vision system.
The most common method for extracting depth information from intensity images
is by means of a pair of synchronized camera-signals, acquired by a stereo rig. The point-
by-point matching between the two images from the stereo setup derives the depth
images, or the so called disparity maps. This matching can be done as a one dimensional
search if accurately rectified stereo pairs in which horizontal scan lines reside on thesame epipolar line are assumed, as shown in Figure 2.7. A point P1 in one image plane
may have arisen from any of points in the line C1P1, and may appear in the alternate
image plane at any point on the so-called epipolar line.
Thus, the search is theoretically reduced to a scan line, since corresponding pairs
of points reside on the same epipolar line. The difference in the horizontal coordinates
of these points is the disparity. The disparity map consists of all disparity values of the
image. Having extracted the disparity map, problems such as 3D reconstruction,
positioning, mobile robot navigation, obstacle avoidance, etc., can be dealt with in a
more efficient way.
Fig 2.7: Geometry of epipolar lines, where C1 and C2 are the left and right camera
lens centers, respectively. Point P1 in one image plane may have arisen from any of
the points on the line C1P1, and may appear in the alternate image plane at any
point on the epipolar line E2.
As numerous methods have been proposed since then, this section aspires to
review the most recent ones. Most of the results presented in the rest of this chapter are
based on the image sets and tests provided there.
The most common image sets are presented in Figure 2.8. Table 2.1 summarizes
their size as well as the number of disparity levels. Experimental results based on these
image sets are given where available. The preferred metric adopted in this chapter, in
order to depict the quality of the resulting disparity maps, is the percentage of pixels
whose absolute disparity error is greater than 1 in the non-occluded areas of the image.
This metric, considered the most representative of the results' quality, was used so as to
make comparison easier. Other metrics, like error rate and root mean square error are also
employed.
Fig.2.8:Left image of the stereo pair (left) and ground truth (right) for the Tsukuba
(a), Sawtooth (b), Map (c), Venus (d), Cones (e) and Teddy (f) stereo pair.
The speed with which the algorithms process input image pairs is expressed in
frames per second (fps). This metric of course depends heavily on the computational
platform used and the kind of implementation. Inevitably, speed results are not directly
comparable.
                 Tsukuba   Map       Sawtooth  Venus     Cones     Teddy
Size in pixels   384x288   284x216   434x380   434x383   450x375   450x375
Disparity levels 16        30        20        20        60        60
Table 2.1: Characteristics of the most common image sets
2.5 DENSE DISPARITY ALGORITHMS
Methods that produce dense disparity maps gain popularity as the computational
power grows. Moreover, contemporary applications benefit from, and consequently
demand, dense depth information. Therefore, during the latest years, efforts in this
direction have been reported much more frequently than efforts towards sparse
results.
Dense disparity stereo matching algorithms can be divided into two general classes,
according to the way they assign disparities to pixels. Firstly, there are algorithms that
decide the disparity of each pixel according to the information provided by its local,
neighboring pixels. There are, however, other algorithms which assign disparity values to
each pixel depending on information derived from the whole image. Consequently, the
former ones are called local methods while the latter ones global.
2.5.1. Local methods.
Local methods are usually fast and can at the same time produce decent results.
Several new methods have been presented. In Figure 2.9, a Venn diagram presents the
main characteristics of the local methods presented below. Under the term color usage
we have grouped the methods that take advantage of the chromatic information of the
image pair. Any algorithm can process color images, but not every one can use it in a
more beneficial way. Furthermore, in Figure 2.9, NCC stands for the use of normalized
cross correlation and SAD for the use of the sum of absolute differences as the matching
cost function.
As expected, the use of SAD as the matching cost is far more widespread than
any other. A method has been presented that uses the sum of absolute differences (SAD)
correlation measure for RGB color images. It achieves high speed and reasonable
quality. It makes use of the left-to-right consistency and uniqueness constraints and
applies a fast median filter to the results.
It can achieve 20 fps for a 160x120 pixels image size, making this method
suitable for real-time applications. The PC platform is Linux on a dual processor
800MHz Pentium III system with 512 MB of RAM.
Fig.2.9: Diagrammatic representation of the local methods categorization.
Another fast area-based stereo matching algorithm, which uses the SAD as the
error function, has also been presented. Based on the uniqueness constraint, it rejects
previous matches as soon as better ones are detected. In contrast to bidirectional
matching algorithms, this one performs only one matching phase, while having similar
results. The results obtained are tested for reliability and sub-pixel refined. It produces
dense disparity maps in real time using an Intel Pentium III processor running at
800MHz. The algorithm achieves 39.59 fps for 320x240 pixels and 16 disparity levels,
and the root mean square error for the standard Tsukuba pair is 5.77%.
The objective is to achieve minimal segmentation. The experimental results
indicate 1.77%, 0.61%, 3.00%, and 7.63% error percentages. The execution speed of the
algorithm varies from 1 to 0.2 fps on a 2.4GHz processor.
Another method that presents almost real-time performance makes use of a
refined implementation of the SAD method and a left-right consistency check. The
errors in the problematic regions are reduced using different-sized correlation windows.
Finally, a median filter is used in order to interpolate the results. The algorithm is able to
process
7 fps for 320x240 pixels images and 32 disparity levels. These results are obtained using
an Intel Pentium 4 processor at 2.66GHz.
A window-based method for correspondence search has been presented that uses
varying support-weights. The support-weights of the pixels in a given support window
are adjusted based on color similarity and geometric proximity to reduce image
ambiguity. The difference between pixel colors is measured in the CIE Lab color space,
because the distance of two points in this space is analogous to the stimulus perceived by
the human eye. The running time for an image pair with a 35x35 pixels support window
is about 0.016 fps on an AMD 2700 processor. The error ratio is 1.29%, 0.97%, 0.99%,
and 1.13% for the Tsukuba, Sawtooth, Venus, and Map image sets, respectively. These
figures can be further improved through a left-right consistency check.
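The support-weight idea can be sketched as follows. This is illustrative only: the weighting form w = exp(-(dc/gamma_c + dg/gamma_p)) matches the description above, but the gamma constants are assumed values, not those of the cited method.

```python
import numpy as np

def support_weight(p_lab, q_lab, p_xy, q_xy, gamma_c=7.0, gamma_p=36.0):
    """Weight of pixel q within the support window of pixel p, combining
    color similarity (Euclidean distance in CIE Lab) with geometric
    proximity: w = exp(-(dc / gamma_c + dg / gamma_p)).
    The gamma constants are illustrative, not the published values."""
    dc = np.linalg.norm(np.asarray(p_lab, float) - np.asarray(q_lab, float))
    dg = np.linalg.norm(np.asarray(p_xy, float) - np.asarray(q_xy, float))
    return float(np.exp(-(dc / gamma_c + dg / gamma_p)))
```

A pixel identical in color and position to the window centre gets weight 1, and the weight decays as either the color distance or the spatial distance grows.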
For given input images, specular-free two-band images are generated. The
similarity between pixels of these input-image representations can be measured using
various correspondence search methods such as the simple SAD-based method, the
adaptive support-weights method and the dynamic programming (DP) method. This
preprocessing step can be performed in real time and compensates satisfactorily for
specular reflections.
On the other hand, another method uses the zero mean normalized cross
correlation (ZNCC) as the matching cost. This method integrates a neural network (NN)
model, which uses the least-mean-square delta rule for training. The NN decides on the
proper window shape and size for each support region. The results obtained are
satisfactory, but the 0.024 fps running speed reported for the common image sets, on a
Windows platform with a 300MHz processor, renders this method unsuitable for real-
time applications.
Based on the same matching cost function, a more complex area-based method
has been proposed, in which a perceptual organization framework considering both
binocular and monocular cues is utilized. An initial matching is performed by a
combination of normalized cross correlation techniques. The correct matches are
selected for each pixel using tensor voting. Matches are then grouped into smooth
surfaces. Disparities for the unmatched pixels are assigned so as to ensure smoothness in
terms of both surface
orientation and color. The percentage of non-occluded pixels whose absolute disparity
error is greater than 1 is 3.79, 1.23, 9.76, and 4.38 for the image sets. The execution
speed reported is about 0.002 fps for an image pair with 20 disparity levels running on
an Intel Pentium 4 processor at 2.8GHz.
There are, of course, more hardware-oriented proposals as well. Many of them
take advantage of the contemporary powerful graphics machines to achieve enhanced
results in terms of processing time and data volume. A hierarchical disparity estimation
algorithm implemented on a programmable 3D graphics processing unit (GPU) has been
reported; this method can process either rectified or uncalibrated image pairs. Bidirectional
matching is utilized in conjunction with a locally aggregated sum of absolute intensity
differences.
Moreover, an architecture based on Cellular Automata (CA) has been presented
for the real-time extraction of disparity maps. It is capable of processing 1 Mpixel image
pairs at more than 40 fps. The core of the algorithm relies on matching the pixels of each
scan line using a one-dimensional window and the SAD matching cost. This method
involves a pre-processing mean filtering step and a post-processing CA-based filtering
one.
CA are models of physical systems where space and time are discrete and
interactions are local. They can easily handle complicated boundary and initial
conditions. In CA analysis, physical processes and systems are described by a cell array
and a local rule, which defines the new state of a cell depending on the states of its
neighbors. All cells can work in parallel, due to the fact that each cell can independently
update its own state. Therefore, the proposed CA algorithm is massively parallel and an
ideal candidate for hardware implementation.
2.5.2 Global methods
Contrary to local methods, global ones produce very accurate results. Their goal is
to find the optimum disparity function d = d (x, y) which minimizes a global cost
function E, which combines data and smoothness terms.
E(d) = E_data(d) + k · E_smooth(d) ..........(5)

where E_data takes into consideration the (x, y) pixel values throughout the image,
E_smooth provides the algorithm's smoothing assumptions and k is a weight factor.
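Equation (5) can be sketched for a candidate disparity map. This is a minimal illustration assuming an absolute-difference data term and an absolute-difference smoothness term between horizontal and vertical neighbours; real global methods then minimize this energy over all disparity maps.

```python
import numpy as np

def global_energy(d, left, right, k=2.0):
    """Evaluate E(d) = E_data(d) + k * E_smooth(d) for a candidate disparity
    map d. E_data sums absolute left/right intensity differences at the
    matched positions; E_smooth sums absolute disparity differences between
    horizontal and vertical neighbours."""
    h, w = d.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xr = np.clip(xs - d, 0, w - 1)  # column matched in the right image
    e_data = np.abs(left.astype(float) - right[ys, xr].astype(float)).sum()
    e_smooth = np.abs(np.diff(d, axis=0)).sum() + np.abs(np.diff(d, axis=1)).sum()
    return float(e_data + k * e_smooth)
```

A disparity map close to the true one yields a lower energy than a perturbed one, which is what the iterative refinement of global methods exploits.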
The main disadvantage of the global methods is that they are more time
consuming and computationally demanding. The source of these characteristics is the
iterative refinement approach that they employ. They can be roughly divided into those
performing a global energy minimization and those pursuing the minimum for
independent scan lines using DP.
In Figure 2.10, the main characteristics of the global algorithms discussed below
are presented. It is clear that the recently published works utilize global optimization
rather than DP. This observation is not surprising, taking into consideration the fact that
under the term global optimization there are actually quite a few different methods.
Additionally, DP tends to produce inferior, thus less impressive, results. Therefore,
applications that do not have running speed constraints preferably utilize global
optimization methods.
2.6 REVIEW OF STEREO VISION ALGORITHMS
Fig.2.10: Diagrammatic representation of the global methods categorization
2.6.1. Global optimization
The algorithms that perform global optimization take into consideration the whole
image in order to determine the disparity of every single pixel. An increasing portion of
the global optimization methodologies involves segmentation of the input images
according to their colors.
The algorithm presented uses color segmentation. Each segment is described by a
planar model and assigned to a layer using a mean shift based clustering algorithm. A
global cost function is used that takes into account the summed up absolute differences,
the discontinuities between segments and the occlusions. The assignment of segments to
layers is iteratively updated until the cost function improves no more. The experimental
results indicate that the percentage of non-occluded pixels whose absolute disparity
error is greater than 1 is 1.53, 0.16, and 0.22 for the image sets, respectively.
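The error figures quoted throughout this section follow the standard criterion of the percentage of pixels whose absolute disparity error exceeds 1. A minimal Python sketch of this metric (the function name and mask convention are illustrative):

```python
import numpy as np

def bad_pixel_percentage(d_est, d_gt, mask=None, threshold=1.0):
    """Percentage of pixels whose absolute disparity error exceeds
    `threshold` (the criterion commonly used on standard test sets).
    `mask` optionally restricts the evaluation, e.g. to the
    non-occluded pixels only."""
    err = np.abs(d_est.astype(float) - d_gt.astype(float)) > threshold
    if mask is not None:
        err = err[mask]
    return 100.0 * err.mean()
```

Restricting the mask to non-occluded regions gives the "bad pixels in non-occluded areas" numbers reported for the algorithms below.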
The stereo matching algorithm proposed in another work makes use of color
segmentation in conjunction with the graph cuts method. The reference image is divided
into non-overlapping segments using the mean shift color segmentation algorithm. Thus, a
set of planes in the disparity space is generated. The goal of minimizing an energy
function is faced in the segment rather than the pixel domain. A disparity plane is fitted
to each segment using the graph cuts method. This algorithm presents good performance
in the textureless and occluded regions as well as at disparity discontinuities. The
running speed reported is 0.33 fps for a 384×288 pixel image pair when tested on a
2.4GHz Pentium 4 PC. The percentage of badly matched pixels is found to be
1.23, 0.30, 0.08, and 1.49 for the image sets, ending with Map, respectively.
The ultimate goal of the work described is to render dynamic scenes with
interactive viewpoint control produced by a few cameras. A suitable color segmentation-
based algorithm is developed and implemented on a programmable ATI 9800 PRO GPU.
Disparities within segments must vary smoothly, each image is treated equally,
occlusions are modelled explicitly, and consistency between disparity maps is enforced,
resulting in higher quality depth maps. The results for each pixel are refined in
conjunction with the others.
Another method that uses the concept of image color segmentation has been
reported; an initial disparity map is calculated using an adapting-window technique. The
segments are iteratively combined into larger layers. The assignment of segments to
layers is optimized using a global cost function. The quality of the disparity map is
measured by warping the reference image to the second view, comparing it with the real
image, and calculating the color dissimilarity.
For the 384×288 pixel and the 434×383 pixel Venus test sets, the algorithm
produces results at a 0.05 fps rate. For the 450×375 pixel Teddy image pair, the running
speed decreased to 0.01 fps due to the increased scene complexity. Running speeds refer
to an Intel Pentium 4 2.0GHz processor. The root mean square error obtained is 0.73 for
the first, 0.31 for the Venus, and 1.07 for the last image pair.
Moreover, Sun and his colleagues presented a method which treats the two
images of a stereo pair symmetrically within an energy minimization framework that can
also embody color segmentation as a soft constraint. This method enforces that the
occlusions in the reference image are consistent with the disparities found for the other
image. Belief propagation iteratively refines the results. Moreover, results for the version
of the algorithm that incorporates segmentation are better.
The percentage of pixels with disparity error larger than 1 is 0.97, 0.19, 0.16, and
0.16 for the image sets, ending with Map, respectively. The running speed for the
aforementioned data sets is about 0.02 fps, tested on a 2.8GHz Pentium 4 processor.
Color segmentation is utilized in yet another work as well. The matching cost
used here is a self-adapting dissimilarity measure that takes into account the sum of
absolute intensity differences as well as a gradient-based measure. Disparity planes are
extracted using a technique that is insensitive to outliers. Disparity-plane labelling is
performed using belief propagation. Execution speed varies between 0.07 and 0.04 fps
on a 2.21GHz AMD Athlon 64 processor. The results indicate 1.13, 0.10, 4.22, and 2.48
percent of badly matched pixels in non-occluded areas for the image sets, respectively.
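A dissimilarity of the kind just described, combining absolute intensity differences with a gradient-based term, might be sketched as follows. The fixed weight `alpha` and the simple horizontal-gradient approximation are simplifications of the self-adapting scheme, introduced only for illustration.

```python
import numpy as np

def combined_cost(left, right, d, x, y, alpha=0.5):
    """Dissimilarity for pixel (x, y) of the left image matched at
    disparity d: a weighted sum of the absolute intensity difference
    and the absolute difference of horizontal gradients.
    Assumes x - d - 1 >= 0 so all indices are valid."""
    c_sad = abs(float(left[y, x]) - float(right[y, x - d]))
    g_left = float(left[y, x]) - float(left[y, x - 1])        # horizontal gradient, left
    g_right = float(right[y, x - d]) - float(right[y, x - d - 1])
    c_grad = abs(g_left - g_right)
    return alpha * c_sad + (1.0 - alpha) * c_grad
```

The gradient term makes the measure less sensitive to radiometric differences between the two cameras than raw intensity differences alone, which is the motivation for combining the two terms.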
Finally, one more algorithm that utilizes energy minimization, color
segmentation, plane fitting, and repeated application of hierarchical belief propagation
has been presented. This algorithm takes into account a color-weighted correlation
measure. Discontinuities and occlusions are properly handled. The percentage of pixels
with disparity error larger than 1 is 0.88, 0.14, 3.55, and 2.90 for the Tsukuba, Venus,
Teddy, and Cones image sets, respectively.
Two new symmetric cost functions for global stereo methods have also been
proposed: a symmetric data cost function for the likelihood, as well as a symmetric
discontinuity cost function for the prior in the MRF model for stereo. Both the reference
image and the target image are taken into account to improve performance, without
modelling half-occluded pixels explicitly and without using color segmentation. The use
of both proposed symmetric cost functions in conjunction with a belief propagation
based stereo method is evaluated.
Experimental results for standard test-bed images show that the performance of
the belief propagation based stereo method is greatly improved by the combined use of
the proposed symmetric cost functions. The percentage of badly matched pixels in the
non-occluded areas was found to be 1.07, 0.69, 0.64, and 1.06 for the image sets,
respectively. The incorporation of Markov random fields (MRF) as a computational tool
is also a popular approach.
A method based on Bayesian estimation theory with a prior MRF model for
the assigned disparities has also been described; the continuity, coherence, and occlusion
constraints, as well as the adjacency principle, are taken into account. The optimal
estimator is computed using a Gauss-Markov random field model for the corresponding
posterior marginal, which results in a diffusion process in the probability space. The
results are accurate, but the algorithm is not suitable for real-time applications, since it
needs a few minutes to process a 256×255 stereo pair with up to 32 disparity levels on an
Intel Pentium III running at 450MHz.
On the other hand, another approach treats every pixel of the input images as
generated either by an inlier process, responsible for the pixels visible from the
reference camera and which obey
the constant brightness assumption, or by an outlier process, responsible for the pixels
that cannot be corresponded. Depth and visibility are jointly modelled as a hidden MRF,
and the spatial correlations of both are explicitly accounted for by defining a suitable
Gibbs prior distribution. An expectation maximization (EM) algorithm keeps track of
which points of the scene are visible in which images, and accounts for visibility
configurations. The percentages of pixels with disparity error larger than 1 are 2.57, 1.72,
6.86, and 4.64 for the image sets, respectively.
Moreover, a stereo method specifically designed for image-based rendering has
been described; this algorithm uses over-segmentation of the input images and computes
matching values over entire segments rather than single pixels. Color-based segmentation
preserves object boundaries. The depths of the segments for each image are computed
using loopy belief propagation within an MRF framework. Occlusions are also considered.
The percentage of badly matched pixels in the non-occluded regions is 1.69, 0.50, 6.74,
and 3.19 for the Tsukuba, Venus, Teddy, and Cones image sets, respectively. The
aforementioned results refer to a 2.8GHz PC platform.
An algorithm based on a hierarchical calculation of a mutual-information-based
matching cost has also been proposed. Its goal is to minimize a proper global energy
function, not by iterative refinements but by aggregating matching costs for each pixel
from all directions. The final disparity map is sub-pixel accurate, and occlusions are
detected. The processing speed for the image set is 0.77 fps. The error in non-occluded
regions is found to be less than 3% for all the standard image sets. Calculations are made
on an Intel Xeon processor running at 2.8GHz.
Mutual information is once again used as a cost function in another work. The
extensions applied to it result in intensity-consistent disparity selection for untextured
areas and discontinuity-preserving interpolation for filling holes in the disparity maps. It
successfully treats complex shapes and uses planar models for untextured areas.
Bidirectional consistency checking, sub-pixel estimation, as well as invalid-disparity
interpolation are performed.
The experimental results indicate that the percentages of bad matching pixels in
non-occluded regions are 2.61, 0.25, 5.14, and 2.77 for the Tsukuba, Venus, Teddy, and
Cones image sets, respectively, with 64 disparity levels searched each time. However,
the reported running speed on a 2.8GHz PC is less than 1 fps.
The dense disparity estimation has also been accomplished by a region-dividing
technique that uses a Canny edge detector and a simple SAD function. The results are
refined by regularizing the vector fields through minimization of an energy function. The
root mean square errors obtained from this method are 0.9278 and 0.9094 for the image
pairs. The running speeds are 0.15 fps and 0.105 fps, respectively, on a Pentium 4 PC
running Windows XP.
An uncommon measure is used in a work that describes an algorithm focused on
achieving contrast-invariant stereo matching. It relies on multiple spatial frequency
channels for local matching. The measure for this stage is the deviation of the phase
difference from zero. The global solution is found by a fast, non-iterative left-right
diffusion process. Occlusions are found by enforcing the uniqueness constraint. The
algorithm is able to handle significant changes in contrast between the two images and
can handle noise in one of the frequency channels.
Another algorithm that generates high quality results in real time has been
reported; it is based on the minimization of a global energy function comprising a data
and a smoothness term. Hierarchical belief propagation iteratively optimizes the
smoothness term, but it achieves fast convergence by removing the redundant
computations involved. In order to accomplish real-time operation, the authors take
advantage of the parallelism of graphics hardware (GPU).
Experimental results indicate 16 fps processing speed for 320×240 pixel self-
recorded images with 16 disparity levels. The percentages of bad matching pixels in
non-occluded regions for the image sets are found to be 1.49, 0.77, 8.72, and 4.61. The
computer used is a 3GHz PC and the GPU is an NVIDIA 7900 GTX graphics card with
512MB of video memory.
The work indicates that the computational cost of the graph cuts stereo
correspondence technique can be efficiently decreased using the results of a simple local
stereo algorithm to limit the disparity search range. The idea is to analyze and exploit the
failures of local correspondence algorithms. This method can accelerate the processing by
a factor of 2.8, compared to the sole use of graph cuts, while the resulting energy is worse
only by an average of 1.7%. These results proceed from an analysis done on a large
dataset of 32 stereo pairs using a Pentium 4 PC at 2.6GHz.
2.6.2. Dynamic programming
Many researchers develop stereo correspondence algorithms based on DP. This
methodology is a fair trade-off between the complexity of the computations needed and
the quality of the results obtained. In every aspect, DP stands between the local
algorithms and the global optimization ones. However, its computational complexity still
renders it a less preferable option for hardware implementation.
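A minimal single-scan-line DP sketch illustrates where DP sits between local and global methods: it optimizes each line globally but treats lines independently. The linear smoothness penalty below is an assumption for the example; real DP stereo systems additionally model occlusion costs.

```python
import numpy as np

def dp_scanline(costs, p_smooth=1.0):
    """Optimal disparity labelling of one scan line by dynamic
    programming. costs[x, d] is the matching cost of pixel x at
    disparity d; neighbouring pixels pay p_smooth * |d - d'| for
    changing disparity. Returns the minimum-energy disparity path."""
    n, levels = costs.shape
    acc = costs.copy().astype(float)        # accumulated path costs
    back = np.zeros((n, levels), dtype=int) # back-pointers
    for x in range(1, n):
        for d in range(levels):
            trans = acc[x - 1] + p_smooth * np.abs(np.arange(levels) - d)
            back[x, d] = int(np.argmin(trans))
            acc[x, d] += trans[back[x, d]]
    # backtrack the cheapest path from the last pixel
    path = np.empty(n, dtype=int)
    path[-1] = int(np.argmin(acc[-1]))
    for x in range(n - 1, 0, -1):
        path[x - 1] = back[x, path[x]]
    return path
```

Because each scan line is solved in isolation, nothing couples vertically adjacent pixels, which is the origin of the streaking artifacts mentioned below; the tree-based variants address exactly this limitation.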
The work presents a unified framework that allows the fusion of any partial
knowledge about disparities, such as matched features and known surfaces within the
scene. It combines the results from corner, edge and dense stereo matching algorithms to
impose constraints that act as guide points to the standard DP method. The result is a
fully automatic dense stereo system with up to four times faster running speed and greater
accuracy compared to results obtained by the sole use of DP.
One or more disparity candidates for the true disparity of each pixel are assigned
by local matching using oriented spatial filters. Afterwards, a two-pass DP technique is
performed that optimizes both along and between the scan lines. The result is a reduction
of false matches as well as of the typical inter-scan-line inconsistency problem.
In another approach, the per-pixel matching costs are aggregated in the vertical
direction only, resulting in improved inter-scan-line consistency and sharp object
boundaries. This work exploits the color and distance proximity based weight assignment
for the pixels inside a fixed support window, as reported elsewhere. The real-time
performance is achieved due to the parallel use of the CPU and the GPU of a computer.
This implementation can process
320×240 pixel images with 16 disparity levels at 43.5 fps and 640×480 pixel images with
16 disparity levels at 9.9 fps.
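The color and distance proximity based weighting mentioned above can be sketched as follows. The exponential form and the two gamma constants are typical of adaptive support-weight aggregation, but here they are illustrative assumptions rather than the parameters of the cited implementation.

```python
import numpy as np

def support_weight(p_color, q_color, p_pos, q_pos,
                   gamma_c=10.0, gamma_p=7.0):
    """Weight of support pixel q for centre pixel p inside a support
    window: pixels that are similar in color AND spatially close to
    the centre contribute more to the aggregated matching cost."""
    dc = np.linalg.norm(np.asarray(p_color, float) - np.asarray(q_color, float))
    dp = np.linalg.norm(np.asarray(p_pos, float) - np.asarray(q_pos, float))
    # separable exponential fall-off in color and spatial distance
    return float(np.exp(-dc / gamma_c - dp / gamma_p))
```

Aggregating raw per-pixel costs with these weights lets a fixed window behave like an adaptive one: support pixels likely to lie on a different surface (different color, far from the centre) are effectively suppressed.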
On the contrary, another proposed algorithm applies the DP method not across
individual scan lines but to a tree structure. Thus the minimization procedure accounts
for all the pixels of the image, compensating for the known streaking effect without
being an iterative one. The reported running speed is a couple of frames per second for
the tested image pairs, so real-time implementations are feasible. Moreover, the results
obtained are comparable to those of the time-consuming global methods.
In a subsequent work, the pixel-tree approach of the previous method is replaced
by a region-tree one. First of all, the image is color-segmented using the mean-shift
algorithm. During the stereo matching, a corresponding energy function defined on such
a region-tree structure is optimized using the DP technique. Occlusions are handled by
compensating for border occlusions and by applying cross checking. The obtained results
indicate that the percentage of badly matched pixels in non-occluded regions is 1.39,
0.22, 7.42, and 6.31 for the Tsukuba, Venus, Teddy, and Cones image sets. The running
speed, on a 1.4GHz Intel Pentium M processor, ranges from 0.1 fps with 16 disparity
levels to 0.04 fps for the Cones dataset with 60 disparity levels.
2.6.3 Other methods
There are of course other methods, producing dense disparity maps, which can be
placed in neither of the previous categories. The methods discussed below use either
wavelet-based techniques or combinations of various techniques. One such method,
based on the continuous wavelet transform (CWT), makes use of the redundant
information that results from the CWT. Using 1D orthogonal and bi-orthogonal wavelets
as well as a 2D orthogonal wavelet, the maximum matching rate obtained is 88.22% for
the image pair. Up-sampling the pixels in the horizontal direction by a factor of two,
through zero insertion, further decreases the noise, and a matching rate of 84.91% is
obtained.
Another work presents an algorithm based on non-uniform rational B-spline
(NURBS) curves. The curves replace the edges extracted with a wavelet-based method.
The NURBS are projective invariant and so they reduce false matches due to distortion
and image noise. Stereo matching is then obtained by estimating the similarity between
projections of curves of one image and curves of the other image. A 96.5% matching rate
for a self-recorded image pair is reported for this method.
Finally, a different way of confronting the stereo matching issue is proposed in a
work that investigates the possibility of fusing the results from spatially differentiated
(stereo vision) scenery images with those from temporally differentiated (structure from
motion) ones. This method takes advantage of both methods' merits, improving the
performance.
2.7 SPARSE DISPARITY ALGORITHMS
Algorithms resulting in sparse, or semi-dense, disparity maps tend to be less
attractive, as most contemporary applications require dense disparity information.
However, they are very useful when fast depth estimation is required while detail across
the whole picture is not so important. This type of algorithm tends to focus on the main
features of the images, leaving occluded and poorly textured areas unmatched.
Consequently, high processing speeds and accurate results, but with limited density, are
achieved. Very interesting ideas flourish in this direction, but since contemporary interest
is directed towards dense disparity maps, only a few indicative algorithms are discussed
here.
One algorithm detects and matches dense features between the left and right
images of a stereo pair, producing a semi-dense disparity map. A dense feature is a
conn