Title: Mutual information-based depth estimation and 3D reconstruction for image-based rendering systems

Advisor(s): Chan, SC; Chang, C

Author(s): Zhu, Zhenyu

Citation

Issued Date: 2012

URL: http://hdl.handle.net/10722/173910

Rights: The author retains all proprietary rights (such as patent rights) and the right to use in future works.


    Mutual Information-based Depth Estimation

and 3D Reconstruction for Image-based Rendering Systems

     by

    ZHU Zhenyu (朱 )

    B.Eng.

    Ph. D. Thesis

    A thesis submitted in partial fulfillment of the requirements for

    the Degree of Doctor of Philosophy

    at the University of Hong Kong

    July 2012


    Abstract of thesis entitled

    Mutual Information-based Depth Estimation

    and 3D Reconstruction for Image-based

    Rendering Systems

    Submitted by

    ZHU Zhenyu

    for the degree of Doctor of Philosophy

    at the University of Hong Kong

    in July 2012

    Image-based rendering (IBR) is an emerging technology for

    rendering photo-realistic views of scenes from a collection of densely

    sampled images or videos. It provides a framework for developing

    revolutionary virtual reality and immersive viewing systems. There has

     been considerable progress recently in the capturing, storage and

    transmission of image-based representations. This thesis proposes two

    image-based rendering (IBR) systems for improving the viewing

    freedom and environmental modeling capability of conventional static

    IBR systems. The first system consists of a circular array with 13 still

    cameras (Canon 550D) for capturing ancient Chinese artifacts at high

    resolution. The second one is constructed by mounting a linear array of 8

    video cameras (Sony HDR-TGIE) on an electrically controllable wheel


    chair with its motion being controllable manually or remotely through

    wireless local area network (LAN) by means of additional hardware

    circuitry.

    Both systems support object-based rendering and 3D reconstruction

    capability and consist of two main components. 1) A novel view

    synthesis algorithm using a new segmentation and mutual information

    (MI)-based algorithm for dense depth map estimation, which relies on

    segmentation, local polynomial regression (LPR)-based depth map

smoothing and an MI-based matching algorithm to iteratively estimate the

    depth map. The method is very flexible and both semi-automatic and

    automatic segmentation methods can be employed. They rank fourth and

    sixth, respectively, in the Middlebury comparison of existing depth

    estimation methods. This allows high quality renderings of outdoor and

    indoor scenes with improved mobility/freedom to be obtained. This

    algorithm can also be extended to object tracking. Experimental results

    also show that the proposed MI-based algorithms are applicable to

    robust registration in noisy dynamic ultrasound images. 2) A new 3D

reconstruction algorithm which utilizes the sequential-structure-from-motion (S-SFM) technique and the dense depth maps estimated previously. It

    relies on a new iterative point cloud refinement algorithm based on

    Kalman filter (KF) for outlier removal and the segmentation-MI-based

    algorithm to further refine the correspondences and the projection

matrices. The mobility of our system allows us to more conveniently recover 3D models of static objects from the improved point cloud

    using a new robust radial basis function (RBF)-based modeling

    algorithm to further suppress possible outliers and generate smooth 3D

    meshes of objects. Moreover, a new rendering technique named view

    dependent texture mapping is used to enhance the final rendering effect.


    Experimental results show that the proposed 3D reconstruction

    algorithm significantly reduces the adverse effect of the outliers and

produces high quality renderings using view dependent texture mapping and the reconstructed model.

Overall, this study provides a framework for designing IBR systems with improved viewing freedom and the ability to cope with moving and static objects in indoor and outdoor environments.

    An abstract of exactly 439 words


    Declaration

    I hereby declare that this dissertation, submitted in partial

fulfillment of the requirements for the degree of Doctor of Philosophy and entitled

    “Mutual Information-based Depth Estimation and 3D Reconstruction for

    Image-based Rendering Systems” represents my own work except where

    due acknowledgement is made, and has not been previously included in

    a thesis, dissertation, or report submitted to this or any other institution

    for a degree, diploma or other qualification.

    Zhu Zhenyu

    August 2012


    Acknowledgement

    First of all, I would like to extend my sincere gratitude to my

supervisors, Dr. S. C. Chan and Dr. C. Q. Chang, for their instructive advice and useful suggestions on my thesis. Without their consistent and illuminating instruction, this thesis would not have been possible.

Besides, I highly appreciate all the postgraduate students and staff in

    the Digital Signal Processing (DSP) Laboratory for their helpful

discussion and support. They are: Dr. K. T. Ng, Dr. Z. G. Zhang, Dr. K.

    M. Tsui, Mr. James Koo, Mr. B. Liao, Mr. C. Wang, Mr. S. Zhang, Mr.

    H. C. Wu and Miss Y. J. Chu.

Last but not least, my thanks go to my beloved family for their patience and love all through these years.


Contents

DECLARATION .......... IV
ACKNOWLEDGEMENT .......... V
CONTENTS .......... VI
LIST OF FIGURES .......... X
LIST OF TABLES .......... XV
LIST OF ABBREVIATIONS .......... XVI

CHAPTER 1 INTRODUCTION .......... 1
1.1 BACKGROUND .......... 1
1.2 THESIS OUTLINE .......... 4

CHAPTER 2 REVIEW OF BASIC TOPICS IN IMAGE-BASED RENDERING .......... 9
2.1 INTRODUCTION .......... 9
2.2 REVIEW OF PLENOPTIC FUNCTION .......... 9
2.2.1 Basic Theory .......... 10
2.3 REVIEW OF LIGHT FIELD .......... 13
2.3.1 Creating/capturing light field .......... 13
2.4 REVIEW OF RENDERING TECHNIQUES .......... 14
2.5 SUMMARY .......... 18

CHAPTER 3 THE PROPOSED IMAGE-BASED RENDERING SYSTEMS .......... 20
3.1 INTRODUCTION .......... 20
3.2 CONSTRUCTION OF THE PROPOSED IBR SYSTEMS .......... 23
3.2.1 Still Camera System .......... 23
3.2.2 Moveable Camera System .......... 26
3.3 PRE-PROCESSING .......... 30
3.3.1 Still Camera System .......... 30
3.3.1.1 Camera Calibration .......... 30
3.3.1.2 Color-Tensor-based Segmentation and Matting .......... 33
3.3.2 Moveable Camera System .......... 36
3.3.2.1 Video Stabilization .......... 36
3.4 SUMMARY .......... 44

CHAPTER 4 A NEW COMBINED SEGMENTATION-MUTUAL-INFORMATION (MI)-BASED ALGORITHM FOR DENSE DEPTH MAP ESTIMATION .......... 45
4.1 INTRODUCTION .......... 45
4.2 COMBINED SEGMENTATION-MI-BASED DEPTH ESTIMATION .......... 46
4.2.1 Object Segmentation Using Level-Set Method .......... 47
4.2.2 Mutual Information Matching .......... 49
4.3 DEPTH MAP REFINEMENT .......... 54
4.3.1 Occlusion Detection and Inpainting .......... 56
4.3.2 Smoothing of Depth Maps .......... 56
4.4 MUTUAL INFORMATION (MI)-BASED OBJECT TRACKING .......... 63
4.5 MORE RESULTS AND COMPARISON .......... 65
4.6 SUMMARY .......... 69

CHAPTER 5 3D RECONSTRUCTION AND MODELING .......... 71
5.1 INTRODUCTION .......... 71
5.2 HOMOGENEOUS GEOMETRY .......... 74
5.3 POINT MATCHING IN THE STILL IBR SYSTEM .......... 76
5.3.1 Epipolar Geometry .......... 76
5.3.2 Finding Correspondent Points .......... 77
5.4 VIEW DEPENDENT TEXTURE MAPPING .......... 82
5.5 POINT MATCHING AND REFINEMENT IN THE MOVEABLE IBR SYSTEM .......... 88
5.5.1 Structure-from-motion .......... 88
5.5.2 Point Cloud Generation and Refinement: KF-based Outlier Detection and Point Cloud Fusion .......... 90
5.6 RBF MODELING AND MESH GENERATION .......... 98
5.7 SUMMARY .......... 102

CHAPTER 6 CONCLUSION AND FUTURE RESEARCH .......... 103
6.1 CONCLUSION .......... 103
6.2 FUTURE RESEARCH .......... 105

APPENDIX I PUBLICATIONS .......... 107

REFERENCES .......... 109


List of Figures

Figure 1-1  Spectrum of IBR representations .......... 2
Figure 2-1  Light field describes the amount of light in radiance along light rays traveling in every direction through every point in empty space [Ikeu 2012] .......... 10
Figure 2-2  Forward mapping .......... 16
Figure 2-3  Example renderings using (a) forward mapping in point rendering [Chan 2005], (b) layered representation (with two layers – dancer and background) [Chan 2009], (c) monolithic rendering using 3D polygonal mesh (left) and rendering results (right) [Zhu 2010] .......... 19
Figure 3-1  Plenoptic videos: multiple linear camera arrays of a 4D simplified dynamic light field with viewpoints constrained along line segments. The camera arrays were developed in [Chan 2009]. Each consists of 6 JVC video cameras .......... 21
Figure 3-2  Circular camera array constructed .......... 24
Figure 3-3  Snapshots: (a) Buddha, (b) Dragon Vase .......... 25
Figure 3-4  Block diagram of the proposed IBR system .......... 26
Figure 3-5  The proposed moveable image-based rendering system .......... 27
Figure 3-6  Snapshots of the plenoptic videos at a given time instance: (a) is the “Podium” outdoor video from camera 1 to camera 4 and (b) is the “Presentation” indoor video from camera 1 to camera … .......... 29
Figure 3-7  Block diagram of the proposed M-IBR system constructed .......... 30
Figure 3-8  Relationship between the world coordinate and the camera coordinate .......... 31
Figure 3-9  Planar pattern .......... 33
Figure 3-10  (a) Extraction results using the color-tensor-based method. Left: original, middle: hard segmentation, right: after matting. (b) Close-up of the segmentations in (a). Left: hard segmentation, right: after matting .......... 36
Figure 3-11  Motion smoothing results for the horizontal (Translation-x) and vertical (Translation-y) directions. The original motion path and the smoothed motion paths with different methods are shown. In (a)-(b), the blue dotted lines correspond to the shaky original motion path. Green and black lines correspond to the smoothed motion paths using the method in [Mats 2005] with a small and a large kernel size, respectively .......... 43
Figure 3-12  Video stabilization results. The first row shows the original images captured by our system; the second row shows the stabilized images without video completion; the third row shows the completed results .......... 43
Figure 4-1  Segmentation results using the level-set-based tracking method. (a) is the initial segmentation obtained by lazy snapping, (b) is the initial segmentation obtained by the graph cut method .......... 48
Figure 4-2  A regular grid for local transformation .......... 51
Figure 4-3  (a) is an example depth map obtained by using MI matching without segmentation information; (b) shows the depth map obtained by using automatic segmentation MI matching; (c) shows the depth map obtained by using semi-automatic segmentation MI matching. Green areas in (c) are the occlusion areas detected by our algorithm. (d)-(e) show the refined depth maps of (c) by inpainting and smoothing (c) using SK-LPR-R-ICI and a 25×25 ideal low-pass filter, respectively .......... 55
Figure 4-4  (a) and (b) show the renderings obtained by Figs. 4-2(d) and (b); (c) and (d) are the enlargements of the red boxes .......... 59
Figure 4-5  Rendering results obtained by the proposed algorithm. (a) shows the depth maps corresponding to the images in (b). The highlighted images in (b) show the rendered views from the adjacent views in (b) using the depth maps in (a). (c) shows depth maps at other positions .......... 62
Figure 4-6  Example rendering results. The first row shows the original images captured by our M-IBR system. The second and third rows show renderings with a step-in ratio of about 1.15 to 1.25 times .......... 63
Figure 4-7  Object tracking at different time instances .......... 65
Figure 4-8  Teddy test images [Scha 2002] and depth maps for comparison. (a) LEFT image; (b) RIGHT image; (c) ground-truth depth map; (d) depth map calculated by semi-automatic segmentation-based MI matching; (e) depth map calculated by automatic segmentation-based MI matching .......... 67
Figure 4-9  Results for the “conference” sequence. (a) and (c) are two sample frames. (b) and (d) are the depth maps of (a) and (c), respectively .......... 68
Figure 4-10  Ultrasound images of the RF muscle under relaxed condition and at 50% maximal voluntary contraction (MVC) level, and the corresponding images with outlined boundary contours. The tracked boundaries are highlighted in green .......... 69
Figure 5-1  Epipolar geometry .......... 76
Figure 5-2  Feature point detection. The red points in (a) and (b) are the feature points. (a) is from the first camera; (b) is from the second camera .......... 80
Figure 5-3  Epipolar line .......... 81
Figure 5-4  Rectified images. (a) is the rectified left image, (b) is the rectified right image. (c) is part of (a), (d) is part of (b) .......... 81
Figure 5-5  An initial point cloud extracted with noise and outliers .......... 82
Figure 5-6  View-dependent texture mapping .......... 83
Figure 5-7  View-dependent texture. Left: blurred texture. Right: texture after the proposed view-dependent texture mapping .......... 85
Figure 5-8  3D models of ancient Chinese artifacts. (a) Dragon Vase, (b) Buddha, (c) Green Bottle, (d) Bowl, (e) Brush Pot, (f) Tri-Pot, (g) Wine Glass .......... 87
Figure 5-9  Rendering results of an ancient Chinese artifact .......... 88
Figure 5-10  Iterative refinement of the point cloud: (a) initial point cloud, (b) point cloud after outlier detection and Kalman filtering, (c) point cloud after the proposed iteration method .......... 91
Figure 5-11  (a)-(b) show the 3D-to-2D re-projection at frame 20 and frame 21, respectively. Blue points are inliers. Green points are outliers detected by the segmentation consistency check. Red points are outliers detected by the intensity and location consistency checks. (c) shows the enlargement of the highlighted area in (a). The point cloud is down-sampled for better visualization .......... 97
Figure 5-12  Convergence behavior of the root mean square distance (RMSD) versus the number of iterations for the proposed iterative 3D reconstruction algorithm. The blue line shows the RMSD values with KF-based outlier detection. The red line shows the RMSD values without KF-based outlier detection .......... 97
Figure 5-13  3D reconstruction results (a) without using RBF, (b) using RBF without outlier detection and (c) using RBF with outlier removal .......... 100
Figure 5-14  Object-based rendering results of the “Podium” sequence using the estimated 3D model and shadow field under different lighting conditions .......... 101
Figure 5-15  Object-based rendering results of the “conference” sequence. (a) and (b) are the 3D reconstruction results at two time instances. (c) and (d) are the rendering results of (a) and (b). Note that only partial geometry of the dynamic object is recovered, since it is only partially observable .......... 101


List of Tables

Table 2-1  A taxonomy of plenoptic functions .......... 12
Table 4-1  Comparison of the ranks using the standard threshold of 1 pixel on the Middlebury test stereo images .......... 67


List of Abbreviations

BRDF  Bidirectional Reflectance Distribution Function
BP  Belief Propagation
CLF  Circular Light Field
CPU  Central Processing Unit
CSA  Cross-Sectional Area
DCP  Disparity Compensation Prediction
DCT  Discrete Cosine Transform
DSCs  Digital Still Cameras
DSP  Digital Signal Processing
Fig.  Figure
fps  frames per second
GC  Graph Cut
GPU  Graphic Processing Unit
HD  High Definition
IBR  Image-Based Rendering
i.i.d.  independent identically distributed
ISKR  Iterative Steering Kernel Regression
JVT  Joint View Triangulation
KF  Kalman Filter
LAN  Local Area Network
L-BFGS  Limited-memory Broyden-Fletcher-Goldfarb-Shanno
LDIs  Layered Depth Images
LPR  Local Polynomial Regression
LS  Least Square
MCU  Micro-Controller Unit
MI  Mutual Information
M-IBR  Moveable Image-Based Rendering
MRF  Markov Random Field
MVC  Maximal Voluntary Contraction
PCA  Principal Component Analysis
pdf  probability density function
PRT  Pre-computed Radiance Transfer
QPP  Quadratic Programming Problem
RANSAC  RANdom SAmple Consensus
RBF  Radial Basis Function
RF  Rectus Femoris
R-ICI  Refined Intersection of Confidence Intervals
RMSD  Root-Mean Squared Distance
SCLF  Simplified Circular Light Field
SFM  Structure-From-Motion
SPIHT  Set Partitioning In Hierarchical Trees
S-SFM  Sequential-Structure-From-Motion


    Chapter 1 Introduction 

    1.1  Background

    Image-based rendering/representation (IBR) [Chen 1995], [Debe

1996], [Gort 1996], [Levo 1996], [McMi 1995], [Pele 1997], [Szel 1997],

    [Shad 1998], [Shum 1999] is a promising technology for rendering new

    views of scenes from a collection of densely sampled images or videos.

    It has potential applications in virtual reality, immersive television and

visualization systems. Central to IBR is the plenoptic function [Adel 1991], which describes the intensity of each light ray in the world as a function of visual angle, wavelength, time, and viewing position. The plenoptic function is thus a 7-dimensional function of the viewing position $(V_x, V_y, V_z)$, the azimuth and elevation angles $(\theta, \phi)$, time $\tau$, and wavelength $\lambda$. Traditional images and videos are just 2D and 3D special cases of the plenoptic function. In principle, one can reconstruct any view in space and time if a sufficient number of samples of the

     plenoptic function is available. The rendering of novel views can

    therefore be viewed as the reconstruction of the plenoptic function from

    its samples. Image-based representations are usually densely sampled

    high dimensional data with large data sizes, but their samples are highly

    correlated. Because of the multidimensional nature of image-based

    representations and scene geometry, much research has been devoted to

    the efficient capturing, sampling, rendering and compression of IBR.
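To make the remark about special cases concrete, one illustrative way of writing it (the parameterization of the image plane by the viewing direction is an assumption made here only for illustration) is

$$I(\theta,\phi)=l(V_x^0,V_y^0,V_z^0,\theta,\phi,\lambda_{R,G,B},\tau_0), \qquad I(\theta,\phi,\tau)=l(V_x^0,V_y^0,V_z^0,\theta,\phi,\lambda_{R,G,B},\tau),$$

i.e., a conventional image fixes the viewing position and the capture time and restricts the wavelength to the three color channels, leaving a 2D function of the viewing direction, while a conventional video additionally retains the time axis.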

Depending on the functionality required, there is a spectrum of IBR representations, as shown in Fig. 1-1. They differ from each other in the amount of

    geometry information of the scenes/objects being used. At one end of the

    spectrum, like traditional texture mapping, we have very accurate


geometric models of the scenes and objects, say generated by animation techniques, but only a few images are required to generate the textures. Given the 3-D models and the lighting conditions, novel views can be

    rendered using conventional graphic techniques. Moreover, interactive

    rendering with moveable objects and light sources can be supported

    using advanced graphic hardware.

    Figure 1-1 Spectrum of IBR representations [Chan 2010].

    At the other extreme, light field or lumigraph rendering relies on

dense sampling (by capturing more images/videos) with no or very little geometry information for rendering, without recovering the exact 3-D models. An important advantage of the latter is its superior image quality compared with 3-D model building for complicated real-world scenes.

    Another important advantage is that it requires much less computational

    resources for rendering regardless of the scene complexity, because most

    of the quantities involved are pre-computed or recorded. This has

    attracted considerable attention in the computer graphic community

    recently in developing fast and efficient rendering algorithms for real-

    time relighting and soft-shadow generation [Agra 2000], [Ng 2004],

    [Sloa 2002], [Zhou 2005].

    Broadly speaking, image-based representations can be classified

    according to the geometry information used into three main categories: 1)


    representations with no geometry, 2) representations with implicit

    geometry and 3) representations with explicit geometry. 2-D Panoramas,

    McMillan and Bisho p’s plenoptic modeling [McMi  1995], 3-D

    concentric mosaics and light field/lumigraph belong to the first category

    and they can be viewed as the direct interpolation of the plenoptic

    function. Layered-based, object-based representations [Chan 2009], pop-

    up light [Shum 2004] using depth maps fall into the second. Finally,

    conventional 3-D computer graphic models and other more sophisticated

    representations [Deve 1998], [Wang 2005] belong to the last category.

    Although these representations also sample the plenoptic function,

    further processing of the plenoptic function has been performed to infer

    the scene geometry or surface property such as bidirectional reflectance

distribution function (BRDF) of objects. Such an image-based modeling approach has emerged as a promising way to enrich the photorealism and user interactivity of IBR. Moreover, when 3-D models of the scenes are unavailable, conventional image-based representations are limited to changes of viewpoint and sometimes a limited amount of relighting. Recently, it was found that real-time relighting and soft-

    shadow computation are feasible using the IBR concepts and the

    associated 3-D models using pre-computed radiance transfer (PRT)

    [Sloa 2002] and precomputed shadow fields [Zhou 2005].

    For multiple camera arrays, the huge amount of data and vast

    amount of viewpoints to be provided present one of the major challenges

    to IBR. Advanced algorithms for processing and manipulation of the

    high dimensional representation to achieve such functions as

    segmentation, depth estimation, object tracking, 3D reconstruction, etc.

    are all major challenges to be addressed. Finally, the efficient

    transmission, compression and display of dynamic IBR and models are


also urgent issues awaiting satisfactory solutions in order for IBR to establish itself as an essential medium for communication and presentation.

All of these motivate us to study the design and construction of image-based rendering systems based on plenoptic videos. Such systems can potentially provide improved viewing freedom to users and the ability to cope with moving and static objects for 3D reconstruction.

    1.2  Thesis Outline

This thesis is devoted to the design of image-based rendering systems and their associated algorithms so as to provide improved viewing freedom and modeling of stationary and moveable objects in outdoor and indoor environments. The major contributions of

    this thesis are summarized as follows:

    1)  The construction of a high resolution IBR system for capturing and

    rendering of ancient Chinese artifacts and a moveable IBR system

    for capturing and rendering indoor and outdoor objects.

    2)  Development of a novel mutual information (MI)-based algorithm

    combined with segmentation for dense depth map estimation and

    object tracking.

    3)  A 3D reconstruction algorithm for objects, which employs the

    estimated dense depth maps to obtain dense point correspondences

    from multiple views for 3D reconstruction.


    Details of these contributions are briefly described below:

    1)  The first prototype system uses a multiple still camera array to

    capture ancient Chinese artifacts. Because of the high resolution of

    the still camera (Canon 550D), we can obtain excellent rendering

    quality. This system can be used for digital preservation and

    dissemination of cultural artifacts with high digital quality. To avoid

     possible damage to the artifacts and speed up the capturing process,

    we propose to employ the image-based approach instead of using

    traditional 3D laser scanners. A circular array consisting of multiple

    digital still cameras was therefore constructed in this work. Using

    this circular camera array, we developed novel techniques for

    rendering new views of the artifacts from the images captured using

    the object-based approach. The multiple views so synthesized enable

    the ancient artifacts to be displayed in modern multi-view displays.

    A number of ancient Chinese artifacts from the University Museum

    and Art Gallery at the University of Hong Kong were captured and

    excellent rendering results were obtained. The second prototype

    system uses a linear camera array consisting of 8 video cameras

    (Sony HDR-TGIE) mounted on an electrically controllable wheel

    chair. Its motion can be controlled manually or remotely by means of

additional hardware circuitry. Unlike previous multiple-camera systems, which are not designed to be moveable, so that their viewpoints are somewhat limited and they usually cannot cope with moving objects or perform 3D reconstruction of objects in open environments, our moveable image-based rendering system can be used to render large environments and moving objects.


    2)  A new combined segmentation-mutual-information (MI)-based

    algorithm for dense depth map estimation is presented. It relies on

    segmentation, local polynomial regression (LPR)-based depth map

    smoothing and MI-based matching algorithm to iteratively estimate

    the depth map. The method is very flexible and both semi-automatic

and automatic segmentations can be used. The semi-automatic and automatic versions rank fourth and sixth, respectively, in the Middlebury comparison of existing depth estimation methods. Using the depth maps estimated and the object-based approach, high quality renderings of outdoor scenes along the trajectory can be obtained, which considerably improves the viewing freedom. The mutual information-based matching algorithm is also extended to object tracking; it can be used to track the boundary of an object in a video sequence. Experimental results show that its performance is reliable even for noisy videos such as dynamic ultrasound images (a minimal sketch of the MI computation is given after this list).

    3)  Using the IBR systems, correspondences from different views can be

    integrated together for 3D reconstruction. For both of the systems,

camera calibration is first used to determine the values of the internal and external parameters of the cameras. For the still camera array, a major technique for finding corresponding points is epipolar geometry, which constrains the corresponding points to lie on conjugate epipolar lines. By combining the epipolar constraint with scale-invariant feature transform (SIFT) [Lowe 2004] feature detection, accurate sparse correspondences can be located. Then a Gabor filter, which is rather insensitive to noise, is used to obtain dense correspondences. For the moveable image-based rendering

    (M-IBR) system, the sequential-structure-from-motion (S-SFM)

    technique is adopted to estimate the locations of the M-IBR system


so as to obtain an initial and fairly reliable 3D point cloud from the

    2D correspondences. New iterative Kalman filter (KF)-based and

    segmentation-MI-based algorithms are proposed to fuse the

    correspondences from different views and remove possible outliers

    to obtain an improved point cloud. More precisely, the proposed

    algorithm relies on the KF to track the correspondences across

    different views so as to suppress possible outliers while fusing

    correspondences from different views. With these reliable matched

     points, the camera parameters and hence the image correspondences

    can be further refined by re-projecting the updated correspondences

    to successive views to serve as prior features/correspondences for

    MI-based matching. By iterating these processes, an improved point

    cloud with reliable correspondences can be recovered. Simulation

    results show that the proposed algorithm significantly reduces the

    adverse effect of the outliers and generates a more reliable point

cloud. To recover the 3D model from the improved point cloud, a new robust RBF-based modeling algorithm is proposed to further

    suppress possible outliers and generate smooth 3D surfaces from the

    raw 3D point cloud. Compared with the conventional RBF-based

smoothing, it is more robust and reliable. Finally, view-dependent texture mapping is incorporated to enhance the final rendering effect (an illustrative sketch of the KF-based outlier gating also follows this list).
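The following are minimal, illustrative sketches in Python/NumPy of two of the ingredients referred to above; they are simplified stand-ins written for this summary, not the implementations developed in Chapters 4 and 5. The first estimates the mutual information between two equally sized grayscale patches from their joint intensity histogram, which is the kind of matching score maximized over candidate correspondences; the bin count is an arbitrary choice, and the segmentation prior and LPR-based smoothing are omitted.

import numpy as np

def mutual_information(patch_a, patch_b, bins=32):
    """Estimate MI between two equally sized grayscale patches (values in
    [0, 255]) from their joint intensity histogram. Illustrative only."""
    joint, _, _ = np.histogram2d(patch_a.ravel(), patch_b.ravel(),
                                 bins=bins, range=[[0, 256], [0, 256]])
    p_ab = joint / joint.sum()               # joint pdf of intensity pairs
    p_a = p_ab.sum(axis=1, keepdims=True)    # marginal pdf of patch_a
    p_b = p_ab.sum(axis=0, keepdims=True)    # marginal pdf of patch_b
    nz = p_ab > 0                            # avoid log(0)
    return float(np.sum(p_ab[nz] * np.log(p_ab[nz] / (p_a @ p_b)[nz])))

The second sketch illustrates the idea behind the KF-based outlier suppression for a single tracked 3D point: a constant-position state model is assumed, a newly triangulated observation of the same point from another view is fused, and the observation is rejected when its innovation is too large relative to the predicted covariance. The noise levels and the gating threshold are placeholders.

def kf_fuse_point(x, P, z, meas_var=1e-2, proc_var=1e-4, gate=9.0):
    """One Kalman filter update for a static 3D point x (3-vector) with
    covariance P, given a new observation z of the same point. Returns
    (x, P, accepted); z is gated out as an outlier when the squared
    Mahalanobis distance of the innovation exceeds `gate`."""
    P = P + proc_var * np.eye(3)             # predict (constant-position model)
    y = z - x                                # innovation (identity measurement model)
    S = P + meas_var * np.eye(3)             # innovation covariance
    d2 = float(y @ np.linalg.solve(S, y))    # squared Mahalanobis distance
    if d2 > gate:
        return x, P, False                   # outlier: keep the prediction
    K = P @ np.linalg.inv(S)                 # Kalman gain
    x = x + K @ y
    P = (np.eye(3) - K) @ P
    return x, P, True

Accepted observations refine the fused point; rejected ones are discarded before the point cloud is re-projected to guide the next round of MI-based matching, as described above.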

    This thesis is divided into 6 chapters. In Chapter 2, some background

    materials on IBR are briefly reviewed. They include plenoptic

    function, light field and rendering techniques. In Chapter 3, the

    design and construction of two IBR systems are presented. Some

     pre-processing techniques for capturing the ancient Chinese artifacts


    including camera calibration and color-tensor-based segmentation

    are also introduced. Chapter 4 is devoted to a new combined

    segmentation MI-based depth estimation algorithm. The 3D

reconstruction and modeling algorithms will be presented in Chapter 5. Two different point matching algorithms are studied first, and then an RBF modeling algorithm is proposed for mesh generation. A view-dependent texture mapping method for improving the rendering quality will also be presented. Finally, conclusions and future research topics are given in Chapter 6.


    Chapter 2 Review of Basic Topics in Image-Based

    Rendering

    2.1  Introduction

    In this chapter, the fundamental topics in image-based rendering are

reviewed briefly. In Section 2.2, the plenoptic function and its history are introduced. The theory of the light field is discussed in Section 2.3. Section 2.4 is devoted to the rendering techniques in IBR.

    2.2  Review of Plenoptic Function

The plenoptic function was proposed by Adelson and Bergen [Adel 1991]. It is a function of visual angle, wavelength, time

    and viewing position to describe the intensity of each light ray in the

    world. All the information captured by an optical sensor can be depicted

by this function. The plenoptic function is a 7-dimensional (7D) function consisting of the 3-dimensional (3D) viewing position, the 2-dimensional visual angle, wavelength and time.

Sampling and processing of the plenoptic function were main research topics in early computer vision studies. For example, object motion can be described by the derivatives of the plenoptic function with respect to position and time. Because the wavelength is usually represented by the red, green and blue channels in digital image processing, the plenoptic function of images and videos can be simplified into two-dimensional and three-dimensional special cases. Theoretically, if the sampling rate is high enough, novel views at intermediate positions can be recovered from the samples. The algorithms that attempt to solve this problem are usually called image-based rendering.



     Figure 2-1: Light field describes the amount of light in radiance along light rays

    traveling in every direction through every point in empty space [Ikeu 2012].

    Because the plenoptic function also describes the geometry and

surface properties, many algorithms have been proposed to integrate geometry and surface information into image-based rendering to improve user interaction and reduce the number of samples required.

    The capturing, sampling, rendering and processing of the plenoptic

    function are all important research topics in IBR and related applications

    such as computational photography, 3D/multiview videos and displays,

    etc.

    2.2.1  Basic Theory

The 7D plenoptic function is usually defined as $l(V_x, V_y, V_z, \theta, \phi, \lambda, \tau)$, where $(V_x, V_y, V_z)$ is the viewing position, $\theta$ and $\phi$ are the elevation and azimuth angles, respectively, as shown in Fig. 2-1, and $\lambda$ and $\tau$ denote the wavelength and time, respectively. By employing different parameterizations and simplifications, different image-based rendering algorithms can be derived from the plenoptic function.


There are several camera systems that are commonly used for capturing in image-based rendering. For a static scene, one camera can be rotated around the camera centre at a given position $V$ with different elevation and azimuth angles. The plenoptic function is then simplified to a panorama $l_V(\theta, \phi)$. A spherical camera array can provide another panoramic representation, because the captured images can be projected onto a cylinder. If a multiple video camera array is employed instead, a panoramic video can be obtained, and the plenoptic function is simplified to a 3D panorama $l_V(\theta, \phi, \tau)$ for dynamic scenes. The close relationship between the plenoptic function and image-based rendering was due to McMillan and Bishop [McMi 1995], who proposed plenoptic modeling using the 5D plenoptic function for static scenes, $l(V_x, V_y, V_z, \theta, \phi)$.

For a static scene, the radiance along rays is constant. Therefore the plenoptic function can be rewritten as a 4D function, which is called the light field [Levo 1996] or lumigraph [Gort 1996] in computer graphics. The set of light rays in a 4D static light field can be parameterized in many ways; for example, the two-plane parameterization is usually used. By adding time to the static light field, a 5D plenoptic function can be obtained. The lumigraph incorporates depth maps into image-based rendering to improve the rendering quality, which can produce more accurate representations. In [Shum 1999], an outward-facing camera moving on a circle was used to capture a series of densely sampled images.

    A commonly used parameterization is the two-plane

     parameterization, where a light ray in the light field is parameterized as

    its intersections or coordinates with two parallel planes. These rays can


be captured by taking a series of pictures on a 2D rectangular plane, which results in an array of images. The light field concept can be similarly extended to time-varying or dynamic scenes, which results in a 5D function. The lumigraph is different from the light field because geometry, in the form of depth maps, is used to improve the rendering quality, which can produce more sophisticated representations in image-based modeling. In [Shum 1999], a set of densely sampled images is captured by an outward-facing camera moving on a circle, which is called a concentric mosaic. This system can render new views inside the circle. Some simplified systems have then been proposed to reduce the complexity, such as restricting the camera locations to lines or line segments [Zitn 2004], [Chan 2005], [Chan 2009]. For time-varying or dynamic scenes, similar representations can be used. Because the light may change at different viewing locations, the light has to be captured continuously, which can be done by video camera arrays. For static scenes, the light directions can be recorded first, and then one can relight the rendering

with arbitrary lightings. A brief summary of these plenoptic function representations is given in Table 2-1 [Ikeu 2012].

Table 2-1: A taxonomy of plenoptic functions.

Dimension   Year   View space        Name
7           1991   Free              Plenoptic function
5           1995   Free              Plenoptic modeling
4           1996   Bounding box      Light field / Lumigraph
3           1999   Bounding circle   Concentric mosaics
2           1994   Fixed point       Cylindrical/Spherical panorama
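In the notation of Section 2.2.1, the taxonomy in Table 2-1 corresponds to the following chain of simplifications (shown here only for illustration; the 4D light field is written in its two-plane parameterization $(u,v,s,t)$):

$$l(V_x,V_y,V_z,\theta,\phi,\lambda,\tau)\;\longrightarrow\;l(V_x,V_y,V_z,\theta,\phi)\;\longrightarrow\;l(u,v,s,t)\;\longrightarrow\;\text{concentric mosaics (3D)}\;\longrightarrow\;l_V(\theta,\phi),$$

where the first arrow drops the wavelength (represented by the RGB channels) and time, the second exploits the constancy of radiance along rays in a static scene to obtain the two-plane light field/lumigraph, and the last two arrows restrict the viewpoint to a circle (concentric mosaics) and to a fixed point (panorama), respectively.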

    2.3  Review of Light Field

The light field was first introduced in a paper by A. Gershun [Gers 1939] for studying surface illumination by artificial lighting. A similar

    concept was introduced to the computer graphics community as the light

    field in [Levo 1996] and lumigraph in [Gort 1996]. The motivation is to

    render new views or images of objects or scenes from densely sampled

    images previously taken to avoid building or capturing complicated 3D

    models. Light field or lumigraph rendering is a special representation of

    image-based rendering and they require either no geometry [Levo 1996]

or limited geometry in terms of depth maps [Gort 1996]. The light field and lumigraph are four-dimensional (4D) simplifications of the plenoptic function for static scenes.

    2.3.1  Creating/capturing light field

A light field can be created by rendering 3D models with computer graphics or by capturing real objects with camera arrays. For real and static scenes, the light field can be captured by one still camera controlled by a mechanical arm, as in lumigraph rendering [Bueh 2001]. In [Adel 1991], [Ng 2005], a lenticular lens array was used to capture the light field. In [Veer 2007], [Lian 2008], a coded aperture, which can map rays from different directions to nearby pixels in the sensor array, was used to record images. Each of these images consists of a set of pixels recording light from different directions. Novel views can be estimated by combining these 4D samples of the light field. In [Ng 2005], a microlens array was placed in front of a handheld digital camera. The images


    captured in this way can be refocused after they have been taken. The

    light field video can be obtained in a similar way.

    Multiple camera systems are usually used to achieve large disparity

    in dynamic scenes. Much research effort has been devoted to the

    construction of 2D camera arrays. To simplify the capturing hardware,

light fields captured along line segments and circular arcs have also been reported, as mentioned in Section 2.2.

    2.4  Review of Rendering Techniques

Rendering is the process of creating a new view from several images and other auxiliary information in the representation. Early image-based rendering did not employ any geometry information: image blending in panoramas [Chen 1995] and ray-space interpolation in light fields [Levo 1996] were used for rendering. In ray-space interpolation, each ray that goes through a target pixel is mapped to nearby sampled rays. Since some sophisticated representations

    use more geometry information such as layered depth images [Shad

    1998], surface light field [Wood 2000], and pop-up light field [Shum

    2004], graphics hardware has been exploited to accelerate the rendering

process. The geometry information can either be implicit, relying on positional correspondences, or explicit, in the form of depth along known lines-of-sight or 3D coordinates. Representations of the former

    usually involve weakly calibrated cameras and rely on image

    correspondences to render new views, say by triangulating two reference

    images into patches according to the correspondences as in joint view

    triangulation (JVT) [Lhui 2003]. These include view interpolation, view

    morphing, JVT and transfer methods with fundamental matrices and

    trifocal tensors. Representations employing explicit geometry include


    sprites, relief textures, Layered Depth Images (LDIs), view-dependent

    texture, surface light field, pop-up light field, shadow light field, etc.

    In general, the rendering methods can be broadly classified into

    three categories: 1) point-based, 2) layer-based, and 3) monolithic.

    Point-based rendering  works on 3D point clouds or point

    correspondences and typically each point is rendered independently.

    Points are mapped to the target image plane through forward mapping.

For the 3D point $X$ in Fig. 2-2, the mapping can be written as

$$X = C_r + \lambda_r P_r^{-1} x_r = C_t + \lambda_t P_t^{-1} x_t , \qquad (2\text{-}4\text{-}1)$$

where $x_t$ and $x_r$ are the homogeneous coordinates of the projections of $X$ on the target screen and the reference image, respectively, $C$ and $P$ are the camera center and projection matrix, respectively, and $\lambda$ is a scale factor. Since $C_t$, $P_t$ and the focal length $f_t$ are known for the target image, $\lambda_t$ can be computed using the depth of $X$. Given $x_r$ and $\lambda_r$, one can compute the exact position of $x_t$ on the target screen and transfer the color

accordingly. Gaps or holes may exist due to magnification. Disocclusion and splatting techniques have been proposed to solve this problem. The painter’s algorithm is frequently used to avoid the problem that multiple pixels from the reference view are mapped to the same pixel in the target image.

Figure 2-2: Forward mapping.
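A minimal NumPy sketch of this forward mapping, under the assumptions used in (2-4-1) above (known camera centres, an invertible 3x3 projection matrix per camera, and a known scale factor for the reference pixel); splatting, disocclusion handling and the painter's algorithm are omitted:

import numpy as np

def forward_map(x_r, lam_r, P_r, C_r, P_t, C_t):
    """Map a reference pixel to the target view (illustrative sketch).
    x_r   : homogeneous pixel coordinates in the reference image, shape (3,)
    lam_r : scale factor (depth along the reference ray)
    P_r, P_t : 3x3 projection matrices; C_r, C_t : camera centres, shape (3,)"""
    # Back-project the reference pixel to the 3D point X, as in (2-4-1).
    X = C_r + lam_r * np.linalg.solve(P_r, x_r)
    # Project X into the target view and dehomogenize.
    x_t = P_t @ (X - C_t)
    return x_t[:2] / x_t[2]

In a full renderer every reference pixel with a known depth would be mapped this way, with splatting filling the gaps and a depth test or painter's-style ordering resolving pixels that land on the same target location.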

    Layered techniques usually separate the scene into a group of

     planar layers consisting of a 3D plane with texture and optionally a

    transparency map. The layers can be thought of as a continuous set of

     polygonal models, which are amenable to conventional texture mapping

    and view-dependent texture mapping. Usually, each layer is rendered

    using either point-based or polygon meshes as in monolithic rendering



    techniques before being composed in the back-to-front order using the

     painter’s algorithm to produce the final view. Layer-based rendering can

     be implemented easily using graphic processing unit (GPU). Since the

    rendering of IBR requires very low complexity, it is even possible to

     perform the calculation using central processing unit (CPU) by working

    on individual layer or object [Chan 2009].
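The back-to-front composition step can be sketched in a few lines (an illustrative "over" compositing pass, assuming the layers have already been warped to the target view and sorted from farthest to nearest, each with an RGB texture and an alpha matte):

import numpy as np

def composite_back_to_front(layers):
    """Painter's algorithm: composite pre-sorted layers far-to-near.
    layers: list of (rgb, alpha) pairs, rgb of shape (H, W, 3) in [0, 1]
    and alpha of shape (H, W) in [0, 1], ordered from farthest to nearest."""
    rgb0, a0 = layers[0]
    out = rgb0 * a0[..., None]               # start with the farthest layer
    for rgb, alpha in layers[1:]:
        a = alpha[..., None]
        out = rgb * a + out * (1.0 - a)      # paint the nearer layer over
    return out

Nearer layers are painted over farther ones, which is exactly the back-to-front ordering described above.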

    Monolithic rendering usually represents the geometry as continuous

     polygon meshes with textures, which can be readily rendered using



    graphics hardware. The 3D model normally consists of vertices, normals

    of vertices, faces, and texture mapping coordinates. The data can be

    stored in a variety of data formats. The most popular formats

    are .obj, .3ds, .max, .stl, .ply, .wrl, .dxf, etc.
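As a minimal illustration of such a monolithic representation in memory (a simplified layout for illustration only, independent of the file formats listed above):

import numpy as np
from dataclasses import dataclass

@dataclass
class TriangleMesh:
    """Minimal monolithic 3D model: geometry plus texture mapping data."""
    vertices: np.ndarray   # (N, 3) vertex positions
    normals: np.ndarray    # (N, 3) per-vertex normals
    faces: np.ndarray      # (M, 3) vertex indices forming triangles
    uvs: np.ndarray        # (N, 2) texture mapping coordinates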

    Relighting, shadow generation and interactivity have played an

    increasingly important role in 3D interactive rendering. The most

     popular algorithms are shadow mapping, shadow volume, ray-tracing,

     pre-computed radiance transfer, pre-computed shadow field, etc. Some

    of them have better rendering quality, while others are more efficient for

real-time rendering. Thanks to the development of GPUs, basic lighting and shading algorithms like shadow mapping and shadow volumes have been realized on the fly. Modern GPUs can even offer programmable rendering pipelines for customized rendering effects, and a “shader” is a set of software instructions running on these GPUs to control the

     pipelines. Using shader programming, high quality shadow rendering

    algorithms like precomputed shadow field can be done in real time. Fig.

2-3 shows example renderings of the three techniques.

    Though there has been substantial progress in capturing,

    representing, rendering and modeling scenes, the ability to handle

    general complex scenes remains challenging for IBR. A lot of work is

still required to ensure robustness in handling reflection, translucency,

    highlights, depth estimation, capturing complexity, object manipulation,

    etc. Interacting with IBR representations remains challenging because

    IBR uses images for rendering. Recent approaches have been focused on

    using advanced computer vision techniques, such as stereo/multiview

    vision and photometric stereo, and depth sensing devices to extract more

    geometry information from the scene so as to enhance the functionalities


    of IBR representations. While there has been considerable progress in

    relighting and interactive rendering of individual real static objects, such

    operations are still difficult for real and complicated scenes. For

    dynamic scenes, the huge amount of data and vast amount of viewpoints

    to be provided present one of the major challenges to IBR. Advanced

    algorithms for processing and manipulation of the high dimensional

    representation to achieve such functions as object extraction, model

    completion, scene inpainting, etc. are all major challenges to be

    addressed. Finally, the efficient transmission, compression and display

of dynamic IBR and models are also urgent issues awaiting satisfactory solutions in order for IBR to establish itself as an essential medium for communication and presentation. All of these motivate us to

    study the design and construction of new image-based rendering systems

based on plenoptic videos. Such systems can potentially provide improved viewing freedom to users and the ability to cope with moving and static

    objects and perform 3D reconstruction.

    2.5  Summary

    In this chapter, the basic topics in image-based rendering have been

reviewed. The plenoptic function, which serves as an important concept for describing visual information in our world, was introduced. Then a brief review of the light field was given. How to achieve high quality rendering and display of light fields with a wide range of viewing positions in large-scale environments will be studied in Chapters 3 and 4. Finally,

    some rendering techniques including point-based, layer-based, and

    monolithic methods are discussed. An extension of these rendering

    techniques will be further studied in Chapter 5.


    Figure 2-3: Example renderings using (a) forward mapping in point rendering [Chan

    2005], (b) layered representation (with two layers  –  dancer and background) [Chan

    2009], (c) monolithic rendering using 3D polygonal mesh (left) and rendering results

    (right) [Zhu 2010].


    Chapter 3 The Proposed Image-based Rendering

    Systems

    3.1  Introduction

As mentioned earlier, two IBR systems are constructed and studied in this thesis, one for capturing and rendering ancient Chinese artifacts and the other for environmental modeling. Both systems are based on the simplified light field and belong to the general class of

    image-based representations. Since capturing 3D models in real-time is

    still a very difficult problem, light field- or lumigraph-based dynamic

IBR representations with a small amount of geometry information have

    received considerable attention in immersive TV (also called 3D or

    multi-view TVs) applications. Because of the multidimensional nature of

    the plenoptic function and the scene geometry, much research has been

    devoted to the efficient capturing, sampling, rendering and compression

of IBR. There has been considerable progress in these areas since the pioneering work on the lumigraph by Gortler et al. [Gort 1996] and the light field by

    Levoy and Hanrahan [Levo 1996]. Other IBR representations include the

    2D panorama [Szel 1997, Pele 1997], Chen and Williams’ view

    interpolation [Chen 1993], McMillan and Bishop’s plenoptic modeling

[McMi 1995], layered depth images [Shad 1998] and the 3D concentric mosaics [Shum 1999], etc. Motivated by the light field and lumigraph, the

predecessors in the author’s lab have developed a real-time system for

    capturing and rendering a simplified dynamic light field called the

    “plenoptic videos”  [Chan 2003], [Chan 2004], [Chan 2005], [Chan

2009], [Gan 2005] with four dimensions. It is a simplified dynamic light

    field, where videos are taken along line segments as shown in Fig. 3-1,


    instead of a 2D plane, to simplify the capturing hardware for dynamic

    scenes.

    Figure 3-1: Plenoptic videos: Multiple linear camera array of 4D simplified dynamic

    light field with viewpoints constrained along line segments. The camera arrays

    developed at [Chan 2009]. Each consists of 6 JVC video cameras.
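In the notation of Chapter 2, and writing $s$ for the (1D) viewpoint position along a line segment of the array (an illustrative parameterization; the exact notation is not fixed in the text), the plenoptic video can be viewed as the 4D function

$$l_{\mathrm{PV}}(s,\theta,\phi,\tau),$$

i.e., a dynamic light field whose viewpoints are restricted from a 2D plane to line segments, which is what simplifies the capturing hardware.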

Pioneering projects in cultural heritage preservation of large-scale structures and sculptures include the Digital Michelangelo Project [Levo 2002], the 3D facial reconstruction and visualization of ancient Egyptian mummies [Atta 1999] and the Great Buddha Project [Ikeu 2003], to name just a few. To avoid possible damage to the ancient artifacts and to speed up the capturing process, we propose to employ the image-based

    approach instead of using 3D laser scanners. A circular array consisting

    of multiple digital still cameras (DSCs) was therefore constructed in this

    thesis to capture the simplified light field of the ancient artifacts along

    circular arcs, which we shall call the simplified circular light field

    (SCLF) or circular light field (CLF) in short. The circular array is chosen

to provide users with a better visual experience, because it supports a fly-over effect and close-ups of the artifacts uniformly in the angular domain.

    We also developed novel techniques for rendering new views of the

    ancient artifacts from the images captured using the object-based

    approach. The details will be discussed later in Chapters 4 and 5. A

    number of ancient Chinese artifacts from the University Museum and


    Art Gallery at The University of Hong Kong were captured and

    excellent rendering results in ordinary as well as 3D/multiview displays

    were achieved. The proposed IBR system and associated algorithms

serve as a framework for the cultural preservation of medium-sized ancient artifacts.

While a considerable number of IBR systems have been proposed previously, few of them are moveable. Therefore, another objective of this thesis is to design a moveable IBR system for modeling objects in outdoor environments. The proposed moveable IBR system uses a linear camera array consisting of 8 video cameras mounted on an electrically controllable wheel chair. Its motion can be controlled manually or remotely by means of additional hardware circuitry. Unlike previous multiple-camera systems, which are not designed to be moveable so that their viewpoints are somewhat limited and they usually cannot cope with moving objects or perform 3D reconstruction of objects in open environments, our moveable image-based rendering system can be used to render large environments and moving objects. In particular, the

    system supports object-based rendering and 3D reconstruction capability

    and consists of two main components. 1) A novel view synthesis

    algorithm using a new segmentation and mutual-information (MI)-based

    algorithm for dense depth map estimation, which relies on segmentation,

    LPR-based depth map smoothing and MI-based matching algorithm to

    iteratively estimate the depth map. The method is very flexible and both

    semi-automatic and automatic segmentation methods can be employed.

    They rank fourth and sixth, respectively, in the Middlebury comparison

    of existing depth estimation methods. This allows high quality

    renderings of outdoor scenes with improved mobility/freedom to be

    obtained. 2) A new 3D reconstruction algorithm which utilizes


    sequential-structure-from-motion (S-SFM) technique and the dense

    depth maps estimated previously. It relies on a new iterative point cloud

    refinement algorithm based on Kalman filter (KF) for outlier removal

    and the segmentation-MI-based algorithm to further refine the

    correspondences and the projection matrices. The mobility of our system

allows us to recover the 3D models of static objects more conveniently from the improved point cloud using a new robust radial basis function (RBF)-based modeling algorithm to further suppress possible outliers

    and generate smooth 3D meshes of objects. Experimental results show

    that the proposed 3D reconstruction algorithm significantly reduces the

    adverse effect of the outliers and produces high quality renderings using

    shadow light field and the model reconstructed. The details will be

    discussed later in Chapters 4 and 5.

    The rest of this chapter is devoted to the general design and

    construction of the systems. More precisely, Section 3.2 is devoted to the

    design and configuration of the IBR systems. Section 3.3 presents some

     pre-processing including camera calibration, color-tensor-based

    segmentation and matting. Finally, conclusions are drawn in Section 3.4.

    3.2  Construction of the Proposed IBR systems

    3.2.1  Still Camera System

    As mentioned previously in Section 3.1, the first prototype system

    consists of an array of 13 Canon 550D cameras mounted on a camera

stand. The images/videos are captured, processed, and then viewed on multiview TVs. A circular array is chosen to provide users with a better visual experience, because it emulates “fly over” and “rotate” special effects. Fig. 3-2 shows the proposed capturing

    system. Fig. 3-3 shows some snapshots captured by this system called


     Buddha and Dragon Vase. The resolution of these images is 3465×2304.

    The operation flow is illustrated in Fig. 3-4. Firstly, the objects are

captured by this system from different angles. Then the objects are segmented using the color tensor, which is insensitive to shadow and shading. Natural matting can be adopted to improve the rendering quality when objects are composited onto other backgrounds. From the segmented objects, approximate geometry information for each object can be estimated by point-based matching for rendering and 3D reconstruction. Finally, other rendering techniques such as shadow field re-lighting and view-dependent texture mapping are added in the rendering. The details of these algorithms will be discussed in the rest of the current chapter and the next chapter.

    Figure 3-2: Circular camera array constructed.


    (a)

    (b)

Figure 3-3: Snapshots: (a) Buddha, (b) Dragon Vase.


    Figure 3-4: Block diagram of the proposed IBR system.

    3.2.2  Moveable Camera System

    The second moveable IBR (M-IBR) system consists of a linear

    array of cameras mounted on an electrically controllable wheel chair so

    as to cope with moving objects in large environment and hence improve

    the viewing freedom of users. Fig. 3-5 shows the moveable IBR system

    that we have constructed. It consists of a linear array of 8 Sony HDR-

    TGIE high definition (HD) video cameras which is mounted on a

    FS122LGC wheel chair.

    The motion of the wheel chair is originally controlled manually

through a VR2 joystick and power controller modules from PG Drives Technology [PGDT]. To make it electronically controllable, we

    examined the output of the joystick and generated the (x-,y-) motion

    control voltages to the power controller using a Devasys USB-I2C/IO

    [USBI] micro-controller unit (MCU). By appropriately controlling these

    voltages, we can control the motion of the wheel chair electronically.
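As a rough illustration of this voltage-based control, the following Python sketch maps a normalized motion command to a pair of joystick-emulating voltages. The helper write_dac_voltage, the channel numbers, and the neutral/swing voltage values are hypothetical placeholders, not the actual interface of the USB-I2C/IO board used in our system.

```python
# Minimal sketch: map a normalized motion command (vx, vy) in [-1, 1]
# to joystick-emulating control voltages. The neutral voltage and swing
# below are assumed values; the real system derives them from the
# measured output of the VR2 joystick.

V_NEUTRAL = 2.5   # assumed joystick rest voltage (V)
V_SWING = 1.0     # assumed maximum deviation from neutral (V)

def command_to_voltages(vx: float, vy: float) -> tuple[float, float]:
    """Clamp the command to [-1, 1] and convert it to (x, y) voltages."""
    vx = max(-1.0, min(1.0, vx))
    vy = max(-1.0, min(1.0, vy))
    return (V_NEUTRAL + V_SWING * vx, V_NEUTRAL + V_SWING * vy)

def drive(mcu, vx: float, vy: float) -> None:
    """Send one motion command through the MCU.

    mcu.write_dac_voltage(channel, volts) is a hypothetical wrapper
    around the board's output commands, used here only for illustration.
    """
    x_volts, y_volts = command_to_voltages(vx, vy)
    mcu.write_dac_voltage(channel=0, volts=x_volts)  # x-axis control voltage
    mcu.write_dac_voltage(channel=1, volts=y_volts)  # y-axis control voltage
```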


    Figure 3-5: The proposed moveable image-based rendering system.

    Moreover, by using the wireless LAN of a portable notebook mounted

    on the wheel chair, its motion can be controlled remotely. By improving

    the mobility of the IBR capturing system, we are able to cope with

    moving objects in large environment.

The HD videos are captured in real time onto the storage cards of the camcorders. They can be downloaded to a PC for further processing such

    as calibration, depth estimation, and rendering using the object-based

    approach. For real-time transmission, the cam-corders are equipped with

    a composite video output which can be further compressed and

    transmitted. To illustrate the concept of multiview conferencing, a

    ThinkSmart IVS-MV02 Intelligent Video surveillance system [IVS] was

    used to compress the (320x240) 30 frames/sec videos online, which can

     be retrieved remotely through the wireless LAN for viewing or further


processing. The system is built around an Analog Devices DSP and performs real-time compression at a bit rate of 400 kbps.

    Before the cameras can be used for depth estimation, they must be

    calibrated to determine the intrinsic parameters as well as their extrinsic

     parameters, i.e. their relative positions and poses. This can be

accomplished by using a sufficiently large checkerboard calibration

     pattern. We follow the plane-based calibration method [Zhan 2000] to

    determine the projective matrix of each camera, which connects the

    world coordinate and the image coordinate. The projection matrix of a

camera allows a 3D point in the world coordinate to be mapped to

    the corresponding 2D coordinate in the image captured by that camera.

This will facilitate depth estimation. Fig. 3-6 shows snapshots of the outdoor and indoor videos captured by the proposed system, called “Podium” and “Presentation”, respectively. The resolution of these real-scene videos is 1920×1080i at 25 frames per second (fps) in 24-bit RGB format. The system flow of the proposed moveable IBR system is

    summarized in Fig. 3-7. Firstly we need to stabilize the video to reduce

    the shaky motion frequently encountered in typical moveable IBR

    systems. Then, a novel view synthesis algorithm using a new

    segmentation and mutual-information (MI)-based algorithm for dense

    depth map estimation is used to iteratively estimate the depth map.

    Finally we need to reconstruct the 3D model using a new 3D

    reconstruction algorithm which utilizes sequential-structure-from-motion

    (S-SFM) technique and the dense depth maps estimated previously. A

    new robust radial basis function (RBF)-based modeling algorithm is

    used to further suppress possible outliers and generate smooth 3D

    meshes of objects.


(a)
(b)
Figure 3-6: Snapshots of the plenoptic videos at a given time instance: (a) the “Podium” outdoor video from camera 1 to camera 4 and (b) the “Presentation” indoor video from camera 1 to camera 4.


Figure 3-7: Block diagram of the proposed M-IBR system constructed (video stabilization → segmentation-MI-based depth estimation → depth map refinement → image-based rendering and 3D reconstruction).

    3.3  Pre-Processing

    3.3.1  Still Camera System

In order to speed up the whole processing procedure of the proposed still camera system, some pre-processing needs to be done at the start. First, all the cameras need to be calibrated. Because this system is static, the intrinsic and extrinsic parameters of the cameras can be obtained precisely by following the plane-based calibration method. The proposed still camera system focuses only on the objects of interest; the objects are segmented out of the images to reduce interference from the background.

    3.3.1.1 Camera Calibration

    In computer vision, the link between the 3D real world points and

    image pixels is the camera parameters. The camera parameters contain

    the extrinsic parameters and intrinsic parameters. Estimation of the

    extrinsic and intrinsic parameters is called camera calibration [Truc


1998]. The extrinsic parameters define the transformation between the

    camera reference frame and the world reference frame. A 3D translation

    vector T  and a 3 × 3 rotation matrix R  are used to represent the extrinsic

parameters [Truc 1998]. The relationship (see Fig. 3-8) between a point in the world frame and the camera frame is

$$\mathbf{P}_c = \mathbf{R}\,(\mathbf{P}_w - \mathbf{T}). \qquad (3\text{-}3\text{-}1)$$

Figure 3-8: Relationship between the world coordinate and the camera coordinate.

The intrinsic parameters are defined in the form of the camera matrix $\mathbf{C}$:

$$\mathbf{C} = \begin{pmatrix} f_x & d & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix}, \qquad (3\text{-}3\text{-}2)$$

where $f_x$ and $f_y$ represent the focal length of the camera in the $x$ and $y$ directions, $c_x$ and $c_y$ are the coordinates of the principal point, and $d$ is the skew parameter, which is zero for ideal pinhole cameras. Since the camera type is not known with certainty, the skew parameter is retained. By combining the extrinsic parameters and intrinsic parameters, the perspective projection equation becomes:


$$\begin{pmatrix} x_c \\ y_c \\ z_c \end{pmatrix} = \mathbf{C}\,\mathbf{R}\,[\,\mathbf{I} \mid -\mathbf{T}\,]\begin{pmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{pmatrix}, \qquad (3\text{-}3\text{-}3)$$

where $(x_c, y_c, z_c)$ is the point in the image coordinate system and $(X_w, Y_w, Z_w, 1)$ is the point in the world coordinate system; both coordinate systems are expressed in homogeneous coordinates. By defining the projective matrix $\mathbf{P}$ as

$$\mathbf{P} = \mathbf{C}\,\mathbf{R}\,[\,\mathbf{I} \mid -\mathbf{T}\,], \qquad (3\text{-}3\text{-}4)$$

where $\mathbf{P}$ is a $3 \times 4$ matrix, equation (3-3-3) can be rewritten as

$$\begin{pmatrix} x_c \\ y_c \\ z_c \end{pmatrix} = \mathbf{P}\begin{pmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{pmatrix}. \qquad (3\text{-}3\text{-}5)$$

Camera calibration is to estimate the matrix $\mathbf{P}$. Zhang has proposed an algorithm for camera calibration using planar patterns [Zhan 1999]. The planar pattern is usually chosen as a chessboard-like plane as shown in Fig. 3-9, which is used in our system.


Figure 3-9: Planar pattern.

The basic procedure of Zhang’s algorithm is:
1. Take a few images of the test pattern in different orientations.
2. Detect the feature points in the test images (often the corners).
3. Estimate the five intrinsic parameters (no skew parameter) and the extrinsic parameters using the closed-form solutions.
4. Estimate the radial distortion by solving a linear least-squares problem.
5. Refine all the parameters by minimizing an error function.
In this work, the plane-based algorithm is changed slightly to fit our situation: the skew parameter is added, the distortion is not estimated at first, and both the radial and the tangential distortions are subsequently estimated.
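As an illustration of the plane-based procedure above, the following is a minimal calibration sketch using OpenCV. It is not the exact implementation used in this thesis: the 9×6 pattern size, the 25 mm square size and the image path are assumed values, and OpenCV's default intrinsic model omits the skew parameter while estimating both radial and tangential distortion.

```python
import glob
import cv2
import numpy as np

# Assumed checkerboard geometry: 9x6 inner corners, 25 mm squares.
PATTERN = (9, 6)
SQUARE_MM = 25.0

# 3D coordinates of the corners on the planar calibration target (Z = 0).
objp = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2) * SQUARE_MM

obj_points, img_points = [], []
for fname in glob.glob("calib/*.jpg"):          # images of the board in different poses
    gray = cv2.imread(fname, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, PATTERN)
    if not found:
        continue
    corners = cv2.cornerSubPix(
        gray, corners, (11, 11), (-1, -1),
        (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
    obj_points.append(objp)
    img_points.append(corners)

# The intrinsic matrix, the distortion coefficients (radial + tangential)
# and the per-view extrinsics (rvecs, tvecs) are refined jointly by
# minimizing the reprojection error, as in Zhang's method.
rms, C, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("RMS reprojection error:", rms)
```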

    3.3.1.2 Color-Tensor-based Segmentation and Matting

    The first step to process these images is to segment the objects out

    of the images. In the still camera system, we employ the photometric

    invariant features [Weij 2006] to extract the foreground from the


    monochromatic screen background. More precisely, the color tensor

    describes the local orientation of a color vector f ( x, y) as:

$$\mathbf{T}(x, y) = \begin{pmatrix} \mathbf{f}_x \cdot \mathbf{f}_x & \mathbf{f}_x \cdot \mathbf{f}_y \\ \mathbf{f}_y \cdot \mathbf{f}_x & \mathbf{f}_y \cdot \mathbf{f}_y \end{pmatrix}, \qquad (3\text{-}3\text{-}6)$$

where $\mathbf{f}(x, y)$ is a vector which contains the color component values at position $(x, y)$, and the subscripts $x$ and $y$ in $\mathbf{f}_x(x, y)$ and $\mathbf{f}_y(x, y)$ denote respectively the derivative of $\mathbf{f}(x, y)$ with respect to $x$ and $y$, the image coordinates. According to [Weij 2006], the color vector can be seen as a weighted sum of two component vectors: $\mathbf{f} = [R, G, B]^T = e\,(m_b \mathbf{c}_b + m_i \mathbf{c}_i)$, where $\mathbf{c}_b$ is the color vector of the body reflectance, $\mathbf{c}_i$ is the color vector of the interface reflectance (i.e. specularities or highlights), $m_b$ and $m_i$ are scalars representing the corresponding magnitudes of reflection, and $e$ is the intensity of the light source. Thus

$$\mathbf{f}_x = [R_x, G_x, B_x]^T = e\, m_b\, (\mathbf{c}_b)_x + \big(e_x m_b + e\,(m_b)_x\big)\,\mathbf{c}_b + \big(e_x m_i + e\,(m_i)_x\big)\,\mathbf{c}_i, \qquad (3\text{-}3\text{-}7)$$

    which suggests that the spatial derivative is a sum of three weighted

    vectors, successively caused by body reflectance, shading-shadow and

    specular changes. For matte surfaces, the intensity of interface

reflectance is zero (i.e. $m_i = 0$) and the projection of the spatial derivative $\mathbf{f}_x$ on the shadow-shading axis is the shadow-shading variant containing all the energy which can be explained by changes due to shadow and shading. The shadow-shading axis direction is $\mathbf{c}_b$, which is parallel to $\mathbf{f} = e\, m_b \mathbf{c}_b$ for matte surfaces. So the projection $\mathbf{s}_1$ of the spatial derivative $\mathbf{f}_x$ on the shadow-shading axis is


$$\mathbf{s}_1 = \Big(\mathbf{f}_x^{T}\,\frac{\mathbf{f}}{\|\mathbf{f}\|}\Big)\frac{\mathbf{f}}{\|\mathbf{f}\|}. \qquad (3\text{-}3\text{-}8)$$

Subtraction of the shadow-shading variant $\mathbf{s}_1$ from the total derivative $\mathbf{f}_x$ results in the shadow-shading quasi-invariant $\mathbf{s}_2 = \mathbf{f}_x - \mathbf{s}_1$. In summary, the spatial derivative of the color vector can be separated into a shadow-shading variant part $\mathbf{s}_1$ and a shadow-shading invariant part $\mathbf{s}_2$.

    The shadow-shading invariant part does not contain the derivative

    energy caused by shadows and shading. To construct a shadow-shading-

    specular quasi-invariant, this part is combined with the hue direction,

which is perpendicular to the light source direction $\mathbf{c}_i$ and the shadow and shading direction $\mathbf{c}_b$. Therefore the hue direction is

$$\mathbf{h} = \frac{\mathbf{c}_b \times \mathbf{c}_i}{\|\mathbf{c}_b \times \mathbf{c}_i\|}. \qquad (3\text{-}3\text{-}9)$$

    The projection of the derivative on the hue direction is the desired

    shadow-shading-specular-quasi-invariant part:

$$\mathbf{H} = \Big(\mathbf{f}_x^{T}\,\frac{\mathbf{h}}{\|\mathbf{h}\|}\Big)\frac{\mathbf{h}}{\|\mathbf{h}\|}. \qquad (3\text{-}3\text{-}10)$$

By replacing $\mathbf{f}_x$ in the color tensor equation (3-3-6) by $\mathbf{s}_2$ or $\mathbf{H}$, we can get the shadow-shading quasi-invariant color tensor and the shadow-shading-specular quasi-invariant color tensor, respectively. By setting a suitable threshold value for the color tensor, we can detect the boundary of the object. Fig. 3-10 shows some segmentation results that were obtained

    using the color tensor method, followed by Bayesian matting for

    extracting a foreground from the background. After segmentation, the

    hard boundary of the object will be obtained. Matting can then be

    applied to obtain soft segmentation information, called the matte, of the

    object. The matte, which is an image containing the portion of

    foreground with respect to the background (from 0 to 1) at a particular


    location, greatly improves the visual quality of mixing the objects onto

    other backgrounds.

    (a)

    (b)

Figure 3-10: (a) Extraction results using the color-tensor-based method. Left: original,

    middle: hard segmentation, right: after matting. (b) Close up of segmentations in

    (a). Left: hard segmentation, Right: after matting.
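As a rough illustration of (3-3-6) and (3-3-8), the following NumPy sketch computes the spatial color derivatives, the shadow-shading variant s_1 and the quasi-invariant s_2, and forms the entries of the corresponding quasi-invariant color tensor. It is a simplified sketch (no Gaussian smoothing of the tensor entries, and the threshold is an arbitrary placeholder), not the exact implementation used to produce Fig. 3-10.

```python
import numpy as np

def quasi_invariant_tensor(img, eps=1e-8):
    """img: H x W x 3 float RGB image.
    Returns the entries of the shadow-shading quasi-invariant color tensor."""
    # Spatial derivatives of the color vector f(x, y), per channel.
    fx = np.gradient(img, axis=1)                       # H x W x 3
    fy = np.gradient(img, axis=0)

    # Unit vector along the shadow-shading axis f / ||f|| (used in eq. 3-3-8).
    f_norm = np.linalg.norm(img, axis=2, keepdims=True) + eps
    f_hat = img / f_norm

    # Projection onto the shadow-shading axis (s_1) and quasi-invariant part.
    s1x = np.sum(fx * f_hat, axis=2, keepdims=True) * f_hat
    s1y = np.sum(fy * f_hat, axis=2, keepdims=True) * f_hat
    s2x, s2y = fx - s1x, fy - s1y                       # s_2 = f_x - s_1

    # Entries of the color tensor (3-3-6) with f_x replaced by s_2.
    txx = np.sum(s2x * s2x, axis=2)
    txy = np.sum(s2x * s2y, axis=2)
    tyy = np.sum(s2y * s2y, axis=2)
    return txx, txy, tyy

# Example: threshold the tensor "energy" to locate object boundaries.
# (0.01 is an arbitrary placeholder threshold.)
# txx, txy, tyy = quasi_invariant_tensor(img.astype(np.float64) / 255.0)
# edges = (txx + tyy) > 0.01
```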

    3.3.2  Moveable Camera System

    Unlike the static camera system described before, the moveable

    camera system will experience shaky motion during movement and

    hence video stabilization has to be performed.

    3.3.2.1 Video Stabilization

    To ensure good tracking of objects and to obtain more image

    samples for high quality rendering, the wheel chair is usually driven

    steadily during capturing. However, one problem with M-IBR system is


    that the ground surfaces may not be smooth and the whole mechanical

    structure can vibrate considerably during movement. In our M-IBR

system, the shaky motion of the camera array in the outdoor

    environment seems to come from the roughness of the ground surfaces

    and the vibration of the mechanical structure during the movement.

    Besides, the video captured may also appear shaky when the system is

    moving and about to settle down in indoor environment. To reduce these

    annoying effects, video stabilization [Hu 2007], [Mats 2005], [Mats

    2006], [Rata 1998] is frequently employed to eliminate the undesired

    motion fluctuation in the captured videos.

    As mentioned above, our M-IBR system was driven steadily during

    capturing. Therefore, the undesired motion fluctuation will usually

    appear as high frequency components compared to the intentional

    motion. As a result, the problem of video stabilization can also be

    viewed as the removal of high frequency components in the estimated

    velocity. To this end, one needs to estimate the global motion of the

camera, say by means of optical flow on the video sequence, so that this

    annoying high frequency local motion can be removed to stabilize the

    videos.

    The proposed video stabilization algorithm is divided into three

major steps as follows. 1) Global motion estimation: firstly, the geometric transformation between a location $\mathbf{x} = [x_1, x_2]^T$ in a frame and the corresponding location $\mathbf{x}'$ in an adjacent frame is modeled by an affine transformation $\mathbf{x}' = T[\mathbf{x}] = \mathbf{A}\mathbf{x} + \mathbf{t}$, where $\mathbf{t} = [t_1, t_2]^T$ is the translational component and the affine rotation, scaling, and stretch are represented by the matrix $\mathbf{A} = \begin{pmatrix} a_1 & a_2 \\ a_3 & a_4 \end{pmatrix}$. In homogeneous coordinates, $\mathbf{x}_h = [x_1, x_2, 1]^T$, and $T$ can be


conveniently represented by a matrix multiplication $\mathbf{T}_h \mathbf{x}_h$, where $\mathbf{T}_h = \begin{pmatrix} \mathbf{A} & \mathbf{t} \\ \mathbf{0} & 1 \end{pmatrix}$. $\mathbf{T}$ is estimated from the tracked features in adjacent video

frames using the scale-invariant feature transform (SIFT) [Lowe

    2004], instead of the Lucas-Kanade tracker in [Chan 2010]. 2) Local

    smoothing of motion: the intentional motion, which is assumed to be

    slow and smooth, is then obtained by smoothing the global motion

    estimated using local polynomial regression (LPR) with adaptive

     bandwidth selection [Zhan 2009]. Unlike conventional methods, the

     bandwidth or window size for smoothing can be automatically

    determined. This will be further discussed below. 3) Video Completion:

    the uncovered areas are filled using motion inpainting [Mats 2005],

    [Mats 2006].
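As a rough illustration of step 1) above, the following OpenCV sketch estimates the inter-frame affine transform [A | t] from matched SIFT features; the ratio-test threshold and RANSAC settings are illustrative and may differ from those used in our pipeline.

```python
import cv2
import numpy as np

def interframe_affine(prev_gray, curr_gray):
    """Estimate the 2x3 affine transform [A | t] mapping prev -> curr
    from matched SIFT features (a sketch of the global motion estimation
    step; the thesis pipeline may use different matcher settings)."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(prev_gray, None)
    kp2, des2 = sift.detectAndCompute(curr_gray, None)

    # Ratio-test matching of the SIFT descriptors.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = matcher.knnMatch(des1, des2, k=2)
    good = []
    for pair in matches:
        if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
            good.append(pair[0])

    src = np.float32([kp1[m.queryIdx].pt for m in good])
    dst = np.float32([kp2[m.trainIdx].pt for m in good])

    # Robust affine estimation (RANSAC) rejects outlying matches.
    M, inliers = cv2.estimateAffine2D(src, dst, method=cv2.RANSAC,
                                      ransacReprojThreshold=3.0)
    return M   # M = [A | t], a 2x3 matrix
```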

We now describe each step in more detail. Let $\{I_t(\mathbf{x}) \mid t = 0, \ldots, N\}$, where $\mathbf{x} = [x_1, x_2]^T$, $1 \le x_1 \le \Gamma_1$, $1 \le x_2 \le \Gamma_2$, be a video sequence consisting of $N$ video frames with resolution $\Gamma_1 \times \Gamma_2$ captured by our M-IBR system. Consider the global motion transformations up to time instant $t$, $\{\mathbf{T}_0^1, \ldots, \mathbf{T}_{t-1}^{t}\}$, where $\mathbf{T}_i^{\,i+1}$ is the coordinate transformation from the $i$-th to the $(i+1)$-th frame. If each $\mathbf{T}_i^{\,i+1}$ is smoothed separately, a smoothed transformation chain $\{\hat{\mathbf{T}}_0^1, \ldots, \hat{\mathbf{T}}_{t-1}^{t}\}$ is obtained and the $t$-th compensated image frame $I'_t$ can be obtained as:

$$I'_t(\mathbf{x}) = I_t\Big(\Big[\prod_{i=0}^{t-1} \hat{\mathbf{T}}_i^{\,i+1}\, \mathbf{T}_{i+1}^{\,i}\Big]\,\mathbf{x}\Big), \qquad (3\text{-}3\text{-}11)$$

where $\mathbf{T}_{i+1}^{\,i}$ and $\hat{\mathbf{T}}_i^{\,i+1}$ denote respectively the transformation from frame $i+1$ to $i$ and the smoothed transformation from frame $i$ to $i+1$. In order to avoid error accumulation due to the cascade of original and


smoothed transformation chains, [Mats 2005] proposed to compute directly the transformation $\tilde{\mathbf{T}}_t$ from the current frame $I_t(\mathbf{x})$ to the corresponding motion-compensated frame $I'_t(\mathbf{x})$ using only the neighboring transformation matrices as

$$\tilde{\mathbf{T}}_t = \sum_{i \in \mathcal{N}_t} \mathbf{T}_t^{\,i} * G(i - t),$$

where $\mathcal{N}_t$ denotes the indices of the neighboring frames of frame $t$, $G(x) = (2\pi\sigma^2)^{-1/2}\, e^{-x^2/(2\sigma^2)}$ is a Gaussian kernel whose standard deviation $\sigma$ determines the support of $\mathcal{N}_t$ or window size, and $*$ denotes the element-wise convolution operation.
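To make this fixed-kernel baseline concrete, the following NumPy sketch performs the element-wise Gaussian-weighted smoothing of a chain of 2×3 affine parameter matrices; the truncation radius and boundary handling are illustrative choices, and the subsequent warping of each frame by its correction transform, as in (3-3-11), is not shown.

```python
import numpy as np

def smooth_transform_chain(T_list, sigma=10.0):
    """Element-wise Gaussian smoothing of a list of 2x3 affine matrices.

    T_list : list of np.ndarray with shape (2, 3), one per frame pair.
    sigma  : kernel size controlling the degree of smoothing.
    Returns an array of the same length containing the smoothed matrices.
    """
    T = np.stack(T_list).astype(np.float64)        # shape (N, 2, 3)
    n = T.shape[0]
    radius = max(1, int(3 * sigma))                # truncate the Gaussian support
    smoothed = np.empty_like(T)
    for t in range(n):
        idx = np.arange(max(0, t - radius), min(n, t + radius + 1))
        w = np.exp(-((idx - t) ** 2) / (2.0 * sigma ** 2))
        w /= w.sum()                               # normalize the truncated kernel
        # Weighted average of neighboring transforms, element by element.
        smoothed[t] = np.tensordot(w, T[idx], axes=1)
    return smoothed
```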

    It can be seen that the selection of the kernel size affects the degree

    of smoothing. A large kernel size will lead to the problem of over-

    smoothing, while a small kernel size may not be able to remove the high

frequency undesirable motion. The green and black lines in Fig. 3-11 illustrate the effect of using a small kernel size of $\sigma = 3$ and a large kernel size of $\sigma = 20$, respectively, using the method in [Mats 2005].

To address this issue, we propose a new method for choosing adaptively the kernel size using local polynomial regression (LPR) with

    adaptive bandwidth selection. The close relationship between curve

    fitting and video stabilization has been recognized for example in [Hu

    2007], where a local parabolic fitting is used to compute the smoothed

    motion path. However, the kernel size is also fixed. The advantage of

    our method is that the kernel size can be adaptively selected from the

    data.

    LPR is a very flexible and efficient nonparametric regression

    method in statistics, and it has been widely applied in many research

    areas such as data smoothing, density estimation, and nonlinear

    modeling. Given a set of noisy samples of a signal, the data points are

    fitted locally by a polynomial using the least-squares (LS) criterion with


    a kernel function having certain bandwidth parameters. Since signals

    may vary considerably over time, it is crucial to choose a proper kernel

size or local bandwidth to achieve the best bias-variance tradeoff. In this thesis, we use the refined intersection of confidence intervals (R-ICI) method to perform bandwidth selection. Here, we follow the

    homoscedastic data model of the time series:

$$Y_i = m(X_i) + \sigma(X_i)\,\varepsilon_i, \qquad (3\text{-}3\text{-}12)$$

where $\{(Y_i, X_i) \mid i = 1, 2, \ldots, n\}$ is a set of univariate observations, $m(X_i)$ is a smooth function specifying the conditional mean of $Y_i$ given $X_i$, and $\varepsilon_i$ is an independent identically distributed (i.i.d.) additive white Gaussian noise. The problem is to estimate $m(X_i)$ and its $k$-th derivative $m^{(k)}(X_i)$ from the noisy samples $Y_i$ so as to achieve smoothing. Since $m(X_i)$ is a smooth function, we can approximate it locally as a general degree-$p$ polynomial at a given point $x_0$:

$$m(x) \approx m(x_0) + m'(x_0)(x - x_0) + \frac{m''(x_0)}{2!}(x - x_0)^2 + \cdots + \frac{m^{(p)}(x_0)}{p!}(x - x_0)^p = \beta_0 + \beta_1 (x - x_0) + \cdots + \beta_p (x - x_0)^p, \qquad (3\text{-}3\text{-}13)$$

where $x$ is in the neighborhood of $x_0$ and $\beta_k$ ($k = 0, 1, \ldots, p$) is the $k$-th polynomial coefficient. The coefficient vector $\boldsymbol{\beta} = [\beta_0, \beta_1, \ldots, \beta_p]^T$ at location $x_0$ can be obtained by solving the following weighted least-squares (WLS) regression problem:


$$\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} K_h(X_i - x_0)\Big[Y_i - \sum_{k=0}^{p} \beta_k (X_i - x_0)^k\Big]^2, \qquad (3\text{-}3\text{-}14)$$

where $K_h(X_i - x_0) = K\big((X_i - x_0)/h\big)/h$, and $K(\cdot)$ is a kernel function with bandwidth parameter $h$, which emphasizes the influence of neighboring observations around $x_0$ in the estimation. The parameter $h$ is adaptively chosen at different locations $x_0$ so as to adapt to the local characteristics of the signal (i.e. the intentional motion path). Differentiating the objective function in (3-3-14) with respect to $\boldsymbol{\beta}$ and setting the derivative to zero, we get the following LS solution in matrix form:

$$\hat{\boldsymbol{\beta}}(x_0, h) = (\mathbf{X}^T \mathbf{W} \mathbf{X})^{-1} \mathbf{X}^T \mathbf{W} \mathbf{y}, \qquad (3\text{-}3\text{-}15)$$

where

$$\mathbf{X} = \begin{pmatrix} 1 & (X_1 - x_0) & \cdots & (X_1 - x_0)^p \\ 1 & (X_2 - x_0) & \cdots & (X_2 - x_0)^p \\ \vdots & \vdots & & \vdots \\ 1 & (X_n - x_0) & \cdots & (X_n - x_0)^p \end{pmatrix}, \qquad \mathbf{y} = [Y_1, Y_2, \ldots, Y_n]^T,$$

and $\mathbf{W} = \mathrm{diag}\{K_h(X_i - x_0)\}$ is the weighting matrix.
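As an illustration of (3-3-13)–(3-3-15), the following sketch evaluates the local polynomial fit at a single point x_0 for a fixed bandwidth h using the Epanechnikov kernel; the adaptive R-ICI bandwidth selection of [Zhan 2008] is not included, and the default degree p = 2 is an assumed choice.

```python
import numpy as np

def epanechnikov(u):
    """K(u) = 0.75 * (1 - u^2) for |u| <= 1, zero otherwise."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)

def lpr_fit(X, Y, x0, h, p=2):
    """Weighted least-squares solution (3-3-15) of the degree-p local
    polynomial model at x0 with bandwidth h. Returns beta_hat, whose
    first entry is the smoothed estimate m_hat(x0)."""
    d = X - x0
    W = np.diag(epanechnikov(d / h) / h)             # K_h(X_i - x0)
    Xmat = np.vander(d, N=p + 1, increasing=True)    # columns: 1, d, ..., d^p
    beta = np.linalg.solve(Xmat.T @ W @ Xmat, Xmat.T @ W @ Y)
    return beta

# Example: smooth a noisy 1D motion-parameter trajectory frame by frame.
# t = np.arange(len(param), dtype=float)
# smoothed = [lpr_fit(t, param, t0, h=8.0)[0] for t0 in t]
```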

By estimating $\hat{\boldsymbol{\beta}}(x_0, h)$ with an optimized bandwidth $h$ at different $x_0$, we obtain a smoothed representation of the data from the noisy

    observations. In the context of video stabilization, a key problem of

    applying LPR is thus to select an optimal bandwidth parameter h  to

    achieve the best bias-variance tradeoff in estimation. Here, we use the R-

    ICI bandwidth selection algorithm [Zhan 2008] to select the optimal

     bandwidth. The basic idea of the R-ICI adaptive bandwidth selection

    method is to calculate a set of smoothing results with different

     bandwidths and then to examine a sequence of confidence intervals of

    these smoothing results to determine and refine the optimal bandwidth.

In this thesis, the kernel $K(u)$ is chosen as the Epanechnikov kernel $K(u) = \frac{3}{4}(1 - u^2)$ for $|u| \le 1$, and the bandwidth parameter set for the R-ICI method is $\{h_j \mid h_j = a^{j}/N,\ j = 1, \ldots, 10\}$ with $a = 1.2$, where $N$ is the total number of frames. The details of the algorithm are omitted and interested readers are

    of frames. The details of the algorithm are omitted and interested readersare