THE UNIVERSITY OF BRITISH COLUMBIA
Random Forests-Based 2D-to-3D Video Conversion
Presenter: Mahsa Pourazad
M. Pourazad, P. Nasiopoulos, and A. Bashashati
Dec 29, 2015
Outline
- Introduction to 3D TV & 3D content
- Motivation for 2D-to-3D video conversion
- Proposed 2D-to-3D video conversion scheme
- Conclusions
Introduction to 3D TV & 3D content:
Two routes to stereo content:
- A stereoscopic dual camera captures stereo video directly.
- A 3D depth-range camera captures 2D video plus a depth map; an image-based rendering technique then synthesizes the stereo video.
Motivation for 2D-to-3D Video Conversion:
- Industry is investing in 3D TV and broadcasting.
- Hollywood is already investing in 3D technology.
- Are we ready for this? No! One of the issues is the lack of 3D content.
- Converting existing 2D content to 3D lets studios resell existing content (movies, TV series, etc.).
2D-to-3D Video Conversion:
2D-to-3D conversion estimates a depth map from the 2D video using monocular depth cues (motion parallax, sharpness, occlusion, etc.).
Properly integrating more monocular depth cues yields a more accurate depth-map estimate, imitating the human visual system.
Motion-based 2D-to-3D video conversion*:
Pipeline: 2D video → motion vectors (MVs) → motion correction (camera-motion correction and object-based motion correction) → object-based motion estimation → non-linear transforming model → estimated depth map.
The cue is motion parallax: near objects move faster across the retina than farther objects do, so the estimated depth is proportional to the motion-vector magnitude:

D(x, y) ≈ C · |MV(x, y)|

Main issue: estimating depth information for static objects.
*Pourazad, M.T., Nasiopoulos, P., and Ward, R.K. (2009) "An H.264-based scheme for 2D to 3D video conversion," IEEE Transactions on Consumer Electronics, vol. 55, no. 2, pp. 742-748.
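As a sketch, the depth-from-motion mapping above can be written in a few lines (the scale constant C, the normalization to an 8-bit depth range, and the sample motion field are illustrative assumptions, not the paper's exact model):

```python
import numpy as np

def depth_from_motion(mv_x, mv_y, C=1.0):
    """Estimate per-block depth as D(x, y) = C * |MV(x, y)|,
    i.e. depth proportional to motion-vector magnitude (motion parallax)."""
    magnitude = np.hypot(mv_x, mv_y)          # |MV| per block
    depth = C * magnitude
    # Normalize to the 8-bit range commonly used for depth maps (assumed).
    if depth.max() > 0:
        depth = depth / depth.max() * 255.0
    return depth

# Toy 2x2 motion field: the fast-moving blocks get the largest depth value.
mv_x = np.array([[1.0, 0.0], [3.0, 4.0]])
mv_y = np.array([[0.0, 0.0], [4.0, 3.0]])
print(depth_from_motion(mv_x, mv_y))
```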
Our Suggested Scheme (integrating multiple monocular depth cues):
Features representing monocular depth cues are extracted from 4x4 blocks of the 2D video.
Motion parallax cue: implement a block-matching technique between consecutive frames to obtain per-block motion.
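A minimal full-search block-matching sketch (the 4x4 block size follows the slides; the ±2-pixel search range and the SAD cost are illustrative assumptions):

```python
import numpy as np

def block_matching(prev, curr, block=4, search=2):
    """Full-search block matching: for each 4x4 block of the current
    frame, find the displacement within +/-search pixels that minimizes
    the sum of absolute differences (SAD) against the previous frame."""
    h, w = curr.shape
    mvs = np.zeros((h // block, w // block, 2), dtype=int)
    for by in range(0, h - block + 1, block):
        for bx in range(0, w - block + 1, block):
            cur_blk = curr[by:by + block, bx:bx + block]
            best = (np.inf, 0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    y, x = by + dy, bx + dx
                    if 0 <= y <= h - block and 0 <= x <= w - block:
                        sad = np.abs(prev[y:y + block, x:x + block].astype(int)
                                     - cur_blk.astype(int)).sum()
                        if sad < best[0]:
                            best = (sad, dy, dx)
            mvs[by // block, bx // block] = (best[1], best[2])
    return mvs
```

In practice the motion vectors would come from the H.264 bitstream or a faster search, but the exhaustive version keeps the idea explicit.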
Texture variation cue: the surface texture of a textured material is more apparent when it is closer to the viewer.
Texture Variation Depth Cue:
Apply Laws' texture energy masks to each 4x4 block's luma information. The nine 3x3 masks are the outer products of the 1D level, edge, and spot kernels L3 = [1 2 1], E3 = [-1 0 1], and S3 = [-1 2 -1]:

L3L3:           L3E3:           L3S3:
 1  2  1        -1  0  1        -1  2 -1
 2  4  2        -2  0  2        -2  4 -2
 1  2  1        -1  0  1        -1  2 -1

E3L3:           E3E3:           E3S3:
-1 -2 -1         1  0 -1         1 -2  1
 0  0  0         0  0  0         0  0  0
 1  2  1        -1  0  1        -1  2 -1

S3L3:           S3E3:           S3S3:
-1 -2 -1         1  0 -1         1 -2  1
 2  4  2        -2  0  2        -2  4 -2
-1 -2 -1         1  0 -1         1 -2  1

For each mask F_i, two texture energies are computed per block:

E_k(i) = Σ_{(x,y) ∈ Block} |I(x,y) * F_i(x,y)|^k,   k ∈ {1, 2}

where I is the luma information (Y) of each 4x4 block and F_i is the i-th Laws mask. The resulting feature set with 18 components (9 masks × 2 energies) represents the texture variation depth cue for each 4x4 block.
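The 18-component texture feature can be sketched as follows, assuming k ∈ {1, 2} denotes sum-of-absolute and sum-of-squared filter responses and `same`-mode 2D convolution (both assumptions on our part):

```python
import numpy as np
from scipy.signal import convolve2d

# 1D Laws kernels: level, edge, spot.
L3 = np.array([1, 2, 1])
E3 = np.array([-1, 0, 1])
S3 = np.array([-1, 2, -1])
# The nine 3x3 masks are outer products of the 1D kernels.
MASKS = [np.outer(a, b) for a in (L3, E3, S3) for b in (L3, E3, S3)]

def texture_features(y_block):
    """18-component Laws texture-energy feature for one 4x4 luma block:
    for each of the 9 masks, the sum of absolute responses (k=1) and the
    sum of squared responses (k=2)."""
    feats = []
    for F in MASKS:
        resp = convolve2d(y_block.astype(float), F, mode='same')
        feats.append(np.abs(resp).sum())      # k = 1
        feats.append((resp ** 2).sum())       # k = 2
    return np.array(feats)

print(texture_features(np.ones((4, 4))).shape)  # (18,)
```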
Haze cue: distant objects appear less distinct and more bluish than nearby objects, due to atmospheric haze.
Haze Depth Cue:
Haze is reflected in the low-frequency information of the chroma channels (U and V), so the L3L3 Laws mask (a local-averaging mask) is applied to each 4x4 block's chroma information:

L3L3:
 1  2  1
 2  4  2
 1  2  1

E_k(i) = Σ_{(x,y) ∈ Block} |C_i(x,y) * F(x,y)|^k,   k ∈ {1, 2}

where C_i is the chroma information (U or V) of each 4x4 block and F is the L3L3 mask. The resulting feature set with 4 components (2 chroma channels × 2 energies) represents the haze depth cue for each 4x4 block.
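A corresponding sketch for the 4-component haze feature (the separate U/V inputs and the k ∈ {1, 2} energy pair are assumptions mirroring the texture feature):

```python
import numpy as np
from scipy.signal import convolve2d

L3L3 = np.outer([1, 2, 1], [1, 2, 1])  # local-averaging Laws mask

def haze_features(u_block, v_block):
    """4-component haze feature for one 4x4 block: for each chroma
    channel (U, V), the absolute (k=1) and squared (k=2) L3L3 energy."""
    feats = []
    for C in (u_block, v_block):
        resp = convolve2d(C.astype(float), L3L3, mode='same')
        feats.extend([np.abs(resp).sum(), (resp ** 2).sum()])
    return np.array(feats)
```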
Perspective cue: the more parallel lines converge, the farther away they appear to be.
The Radon transform is applied to the luma information of each block at angles θ ∈ {0°, 30°, 60°, 90°, 120°, 150°}, and the amplitude and phase of the most dominant edge are selected, giving a feature set with 2 components.
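One way to sketch this cue without a full Radon implementation is to approximate each Radon projection by rotating the block and summing columns (the rotation-based approximation, the interpolation order, and the peak-based definition of "most dominant edge" are all assumptions here):

```python
import numpy as np
from scipy.ndimage import rotate

def perspective_features(y_block):
    """2-component perspective feature: approximate Radon projections of
    the block at {0, 30, ..., 150} degrees, then keep the amplitude of
    the strongest projection peak and its angle (phase)."""
    best_amp, best_angle = 0.0, 0
    for angle in range(0, 180, 30):
        rotated = rotate(y_block.astype(float), angle,
                         reshape=False, order=1)
        projection = rotated.sum(axis=0)   # line integrals at this angle
        amp = np.abs(projection).max()
        if amp > best_amp:
            best_amp, best_angle = amp, angle
    return np.array([best_amp, best_angle])
```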
Vertical coordinate cue: in general, objects closer to the bottom border of the image are closer to the viewer.
The feature set includes the vertical spatial coordinate of each 4x4 block (as a percentage of the frame's height).
Sharpness cue: closer objects appear sharper.
The sharpness of each 4x4 block is measured with the diagonal Laplacian method*.
*A. Thelen, S. Frey, S. Hirsch, and P. Hering, "Improvements in shape-from-focus for holographic reconstructions with regard to focus operators, neighborhood-size, and height value interpolation," IEEE Trans. on Image Processing, vol. 18, no. 1, pp. 151-157, 2009.
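A sketch of a diagonal-Laplacian focus measure (the exact operator of Thelen et al. may differ; the neighbor weighting here is an assumption):

```python
import numpy as np

def diagonal_laplacian_sharpness(y_block):
    """Sharpness (focus) measure for one block: sum of absolute second
    derivatives in the horizontal, vertical, and both diagonal
    directions. Flat regions score 0; high-frequency detail scores high."""
    I = y_block.astype(float)
    h, w = I.shape
    s = 0.0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            s += abs(2 * I[y, x] - I[y, x - 1] - I[y, x + 1])            # horizontal
            s += abs(2 * I[y, x] - I[y - 1, x] - I[y + 1, x])            # vertical
            s += abs(2 * I[y, x] - I[y - 1, x - 1] - I[y + 1, x + 1]) / np.sqrt(2)
            s += abs(2 * I[y, x] - I[y - 1, x + 1] - I[y + 1, x - 1]) / np.sqrt(2)
    return s
```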
Occlusion cue: an object that overlaps or partly obscures our view of another object is closer.
To capture occlusion and make the features globally accountable, all feature sets are extracted for each 4x4 patch at three image-resolution levels (1, 1/2, and 1/4).
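The multi-resolution extraction can be sketched as follows for a single block position; `extract` stands in for any of the per-block cue extractors above, and the average-pooling downsampler and top-left block choice are illustrative assumptions:

```python
import numpy as np

def downsample(img, factor):
    """Naive average-pooling downsample by an integer factor."""
    h, w = img.shape
    h, w = h - h % factor, w - w % factor
    return img[:h, :w].reshape(h // factor, factor,
                               w // factor, factor).mean(axis=(1, 3))

def multiscale_features(frame, extract, block=4):
    """Concatenate per-block features computed at resolution levels
    1, 1/2, and 1/4 of the frame, so each block also 'sees' its wider
    surroundings (helping capture occlusion)."""
    feats = []
    for factor in (1, 2, 4):
        level = downsample(frame, factor) if factor > 1 else frame
        blk = level[:block, :block]   # block at position (0, 0), per level
        feats.append(extract(blk))
    return np.concatenate(feats)
```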
Depth-Map Model Estimation:
The 81-dimensional feature vectors of the 4x4 blocks are fed to a Random Forests (RF) machine-learning model.
RF is a classification and regression technique consisting of a collection of individual decision trees (DTs)*, each built on a random selection of the input feature vectors. It suits applications where individual DTs do not perform well on unseen test data, but the combined contribution of many DTs does.
*L. Breiman, "Random forests," Machine Learning, vol. 45, pp. 5-32, 2001.
Training set — input: feature vectors of 4x4 blocks of key frames whose pixels mostly belong to a common object; output: known depth values.
Test set: 4x4 blocks of an unseen video.
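With scikit-learn, the RF training/prediction step might look like this sketch (the synthetic features, depth targets, and forest size are placeholders, not the paper's data or settings):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy stand-ins for the real data: 81-dimensional feature vectors of
# 4x4 blocks (training: key frames with known depth; test: unseen video).
rng = np.random.default_rng(0)
X_train = rng.random((200, 81))
y_train = X_train[:, 0] * 255          # synthetic "known depth values"
X_test = rng.random((10, 81))

# A forest of decision trees, each fit on a random subset of the data;
# individually weak trees generalize well in aggregate.
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
depth_estimates = rf.predict(X_test)   # per-block depth estimates
print(depth_estimates.shape)           # (10,)
```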
Final Depth-Map Generation:
The RF-estimated depth map is refined using mean-shift image segmentation* of the 2D video: the per-block depth estimates are converted into object-based depth information, yielding the final depth map.
*D. Comaniciu and P. Meer, "Mean shift: a robust approach toward feature space analysis," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 5, pp. 603-619, 2002.
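A sketch of the object-based refinement, using scikit-learn's MeanShift on pixel colors (the color-only feature space, the bandwidth, and the per-segment median are assumptions, not the paper's exact procedure):

```python
import numpy as np
from sklearn.cluster import MeanShift

def object_based_depth(frame_rgb, block_depth, block=4):
    """Mean-shift segment the frame by color, then give every pixel in a
    segment the median of the RF depth estimates of the 4x4 blocks that
    fall inside that segment."""
    h, w, _ = frame_rgb.shape
    pixels = frame_rgb.reshape(-1, 3).astype(float)
    labels = MeanShift(bandwidth=30).fit_predict(pixels).reshape(h, w)
    # Upsample per-block depth to pixel resolution.
    depth = np.kron(block_depth, np.ones((block, block)))
    out = np.zeros_like(depth)
    for seg in np.unique(labels):
        mask = labels == seg
        out[mask] = np.median(depth[mask])
    return out
```

Assigning one depth per segment removes block-level noise and keeps depth discontinuities aligned with object boundaries.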
Experiments:

Training sequences:
| Video Sequence | Frame Size | Stream Type | Test View | Source |
| Orbi | 720x576 | 2D+Depth | NA | Heinrich Hertz Institute (HHI) |
| Book Arrival | 1024x768 | Multiview | View 8 | Heinrich Hertz Institute (HHI) |
| Breakdancer | 1024x768 | Multiview | View 3 | Microsoft Research (MSR) |
| Rena | 640x480 | Multiview | View 45 | Nagoya University |
| Akko & Kayo | 640x480 | Multiview | View 28 | Nagoya University |
| Pantomime | 1280x960 | Multiview | View 37 | Nagoya University |
| Champagne | 1280x960 | Multiview | View 39 | Nagoya University |

Test sequences:
| Video Sequence | Frame Size | Stream Type | Test View | Source |
| Interview | 720x576 | 2D+Depth | NA | Heinrich Hertz Institute (HHI) |
| Ballet | 1024x768 | Multiview | View 3 | Microsoft Research (MSR) |
Results:

[Bar charts: average depth-cue importance (%) per depth cue — Texture, Motion, Haze, Vertical Coordinate, Edge, and Sharpness — on a 0-35% scale.]

[Figure: 2D video, the available depth map, the existing motion-based technique, and our proposed technique.]

Subjective test (ITU-R BT.500-11): 18 people graded the stereo videos from 1 to 10.
| Sequence | Original | Existing Method | Our Method |
| Interview | 7.5 | 7.1 | 6.5 |
| Ballet | 7 | 6.8 | 6.3 |
Conclusions:
- A new and efficient 2D-to-3D video conversion method was presented.
- The method uses Random Forest regression to estimate the depth-map model from multiple monocular depth cues.
- Performance evaluations show that our approach outperforms a state-of-the-art motion-based method.
- The subjective visual quality of the generated 3D streams was confirmed by viewing them on a stereoscopic display.
- Our method runs in real time and can be implemented at the receiver side without placing any burden on the network.