Unconstrained 2D to Stereoscopic 3D Image and Video Conversion using Semi-Automatic Energy Minimization Techniques
Raymond Phan, Richard Rzeszutek and Dimitrios Androutsos
Dept. of Electrical & Computer Engineering, Ryerson University – Toronto, Ontario, Canada
Thursday, October 24th, 2012 – Chinese 6 Theatre, Hollywood, California, USA
Conversion Framework – Images • Random Walks: an energy minimization scheme
– For each unlabeled pixel in the image, what is the likelihood that a random walker starting there first reaches a given user-defined label?
– Goal: classify every pixel as belonging to one of K labels – a pixel gets the label generating the highest likelihood
• Modify Random Walks to create depth maps – likelihoods are probabilities, spanning the set [0,1] – user-defined depths and solved depths span the same set – Goal is to solve for a single label: the depth!
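The idea of reading solved Random Walks probabilities directly as depths can be sketched in a few lines. This is a minimal toy, not the paper's implementation: it assumes a 1-D "image", Grady-style Gaussian edge weights, and a simple Gauss-Seidel solve of the harmonic (random-walk) system, with two user seeds at depths 0 and 1.

```python
import math

# Toy 1-D "image": 6 pixel intensities with an edge between index 2 and 3.
intensities = [0.1, 0.12, 0.15, 0.85, 0.9, 0.88]

# Grady-style edge weights: w_ij = exp(-beta * (I_i - I_j)^2).
beta = 50.0
w = [math.exp(-beta * (intensities[i] - intensities[i + 1]) ** 2)
     for i in range(len(intensities) - 1)]

# Seeds: the user brushes depth 0.0 on pixel 0 and depth 1.0 on pixel 5.
depth = [0.0, 0.5, 0.5, 0.5, 0.5, 1.0]
seeded = {0, 5}

# Random Walks solution = harmonic function: each unlabeled pixel is the
# weight-averaged depth of its neighbours (Gauss-Seidel iteration).
for _ in range(500):
    for i in range(1, 5):
        if i in seeded:
            continue
        num = w[i - 1] * depth[i - 1] + w[i] * depth[i + 1]
        den = w[i - 1] + w[i]
        depth[i] = num / den

# Solved depths stay in [0, 1] and snap sharply across the intensity edge.
print([round(d, 3) for d in depth])
```

Because the solved values are probabilities in [0,1], they can be used as depths with no rescaling, which is exactly the property the method exploits.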
Conversion Framework – Images – (2) – Use Scale-Space Random Walks in our framework
• Pyramidal sampling scheme, with Random Walks applied at each resolution; results merged via the geometric mean
– User chooses values from [0,1] and brushes over the image – 0: dark intensity/color, 1: light intensity/color – Resulting solved probabilities are used directly as depths
• Is this valid? – Yes! A psychophysical study at Tel Aviv University supports it – as long as the user is perceptually consistent in marking
– Minimizes the cardboard-cutout effect – RW generates depths not originally defined by the user – but we need to respect object boundaries
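The per-pixel geometric-mean merge of the pyramid levels can be illustrated with a toy example (the three "levels" below are illustrative probability values for the same pixels after upsampling back to full resolution, not real solver output):

```python
import math

# Solved probability maps for the same pixels at three pyramid levels,
# each already upsampled back to full resolution (values are illustrative).
levels = [
    [0.9, 0.2, 0.7],
    [0.8, 0.1, 0.6],
    [0.95, 0.15, 0.65],
]

# Per-pixel geometric mean across levels; a small epsilon guards log(0).
eps = 1e-12
merged = []
for px in zip(*levels):
    g = math.exp(sum(math.log(p + eps) for p in px) / len(px))
    merged.append(g)

print([round(m, 3) for m in merged])
```

The geometric mean keeps the result in [0,1] and is dominated by low probabilities, so a pixel must look foreground-like at every scale to receive a high value.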
• Idea: Combine Random Walks with Graph Cuts – Graph Cuts = hard segmentation
• Only produces results with the depths/labels provided by the user – GC solves the MAP-MRF problem with user labels – Consider the image as a weighted, connected graph
• Solution: solve the max-flow/min-cut of this graph
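The max-flow/min-cut step can be sketched with a tiny Edmonds-Karp solver on a 4-node graph (node 0 = source/FG terminal, node 3 = sink/BG terminal). Real Graph Cuts implementations use specialized solvers such as Boykov-Kolmogorov on pixel grids; this is only a minimal illustration of the principle:

```python
from collections import deque

def max_flow(cap, s, t):
    """Edmonds-Karp: repeatedly push flow along shortest augmenting paths."""
    n = len(cap)
    flow = [[0] * n for _ in range(n)]
    total = 0
    while True:
        # BFS for an augmenting path in the residual graph.
        parent = [-1] * n
        parent[s] = s
        q = deque([s])
        while q and parent[t] == -1:
            u = q.popleft()
            for v in range(n):
                if parent[v] == -1 and cap[u][v] - flow[u][v] > 0:
                    parent[v] = u
                    q.append(v)
        if parent[t] == -1:          # no path left: flow is maximal, and
            return total             # the saturated edges form the min cut
        # Find the bottleneck residual capacity along the path, then augment.
        bottleneck = float("inf")
        v = t
        while v != s:
            u = parent[v]
            bottleneck = min(bottleneck, cap[u][v] - flow[u][v])
            v = u
        v = t
        while v != s:
            u = parent[v]
            flow[u][v] += bottleneck
            flow[v][u] -= bottleneck
            v = u
        total += bottleneck

# Tiny graph: 0 = source (FG terminal), 3 = sink (BG terminal).
cap = [[0, 3, 2, 0],
       [0, 0, 1, 2],
       [0, 0, 0, 3],
       [0, 0, 0, 0]]
print(max_flow(cap, 0, 3))   # → 5
```

The saturated edges after termination separate source-side from sink-side pixels, which is the hard FG/BG segmentation the slide refers to.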
Conversion Framework – Images – (5) • NB: Making depth maps = a segmentation problem
– Specifically, a multi-label segmentation problem – But Graph Cuts is a binary segmentation (FG/BG) method – Graph Cuts also uses an integer labeling, not values from [0,1]
• Must modify the above to become multi-label – Each unique user-defined label is assigned an integer label – Binary segmentation is performed for each label
• FG: the label in question; BG: all of the other labels – The remaining pixels are those we wish to label
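The one-binary-pass-per-label scheme can be sketched as below. The `binary_score` function is a hypothetical stand-in: in the actual method each pass would run a full Graph Cuts FG/BG segmentation with one label as foreground and the rest as background.

```python
# Stand-in binary scorer: in the real method each pass would run Graph Cuts
# with one label as foreground and all other labels as background.
def binary_score(pixel_feature, label_feature):
    # Higher score = more likely foreground; here, simple similarity.
    return -abs(pixel_feature - label_feature)

pixel_features = [0.1, 0.15, 0.5, 0.82, 0.9]   # e.g. pixel intensities
label_features = {1: 0.1, 2: 0.5, 3: 0.9}      # user-defined integer labels

labeling = []
for f in pixel_features:
    # One binary FG/BG pass per label; keep the winning label per pixel.
    best = max(label_features, key=lambda L: binary_score(f, label_features[L]))
    labeling.append(best)

print(labeling)   # → [1, 1, 2, 3, 3]
```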
Conversion Framework – Images – (7) • But we can make use of this!
– Merge Random Walks & Graph Cuts together – Create a depth prior: an initial depth map estimate – this is essentially Graph Cuts! – We merge by feeding the depth prior as additional information into RW
• Before we merge… – Depth maps for RW and GC must be compatible with each other
• RW has depths in [0,1]; GC has integer labels
– Map the user-defined labels from RW's [0,1] set to an integer set – Perform Graph Cuts using this integer set, then map the integer set back to [0,1] – Use a lookup table to do this
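The lookup-table mapping between the two label spaces is straightforward; a minimal sketch (with illustrative brush depths) looks like this:

```python
# User-defined depths painted with the brush, in [0, 1].
user_depths = [0.0, 0.25, 0.6, 1.0]

# Build the lookup table both ways: depth -> integer label and back.
depth_to_int = {d: i for i, d in enumerate(sorted(set(user_depths)))}
int_to_depth = {i: d for d, i in depth_to_int.items()}

# Graph Cuts runs on the integer labels...
int_labels = [depth_to_int[d] for d in user_depths]
# ...and its integer result maps back to [0, 1] for the RW depth prior.
recovered = [int_to_depth[i] for i in int_labels]

print(int_labels, recovered)   # → [0, 1, 2, 3] [0.0, 0.25, 0.6, 1.0]
```

The round trip is lossless because every unique user depth gets its own integer, so the GC result can be fed back into RW in the same [0,1] depth space.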
Conversion Framework – Images – (8) • Summary of Method:
– Place user-defined strokes on the image & create the depth prior – Feed the depth prior, with the same strokes, into RW and solve – To modify the depth map, add strokes & re-run the algorithm
Conversion Framework – Video – (2) • How do we solve?
– Use block processing to preserve temporal coherency – Process blocks of frames without exhausting memory – Overlapping frames between blocks are left unused
• Each block is independent of the others
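Splitting a frame sequence into overlapping, independently solved blocks can be sketched as follows (block size and overlap are illustrative parameters, not the paper's settings):

```python
def frame_blocks(num_frames, block_size, overlap):
    """Split a frame range into overlapping blocks for block processing."""
    step = block_size - overlap
    blocks = []
    start = 0
    while start < num_frames:
        blocks.append(range(start, min(start + block_size, num_frames)))
        if start + block_size >= num_frames:
            break
        start += step
    return blocks

# 10 frames, blocks of 4 with a 1-frame overlap. Each block is solved
# independently; per the method, the overlapping frames are left unused
# in the output, but they keep the blocks temporally aligned.
blocks = frame_blocks(10, 4, 1)
print([list(b) for b in blocks])   # → [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```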
• Back to labeling: How many frames do we label? – We allow the user the option of manually choosing which ones to label – However, labeling only a small set of frames results in depth artifacts
• Labeling all frames is better – For frames having no labels, moving points quickly “fade” in depth – But labeling all frames can be very time consuming!
Conversion Framework – Video – (5) • Labeling all frames is better – Part II
– Instead of manually labeling all frames, label the first frame and use a tracking algorithm to propagate the labels
– Adjust depths of labels when object depths change • Label Tracking?
– Would be very taxing to track all points in a stroke – Decompose a stroke at a particular depth into N points – Track each of these points separately – Reconstruct the stroke using spline interpolation
– Long-term tracker for unconstrained video by Kalal et al. (TLD) – Simply draw a bounding box around the object in the first frame – Afterwards, the trajectory is determined for the rest of the frames, accounting for size and illumination changes • How do we use this?
– For each point in each decomposed stroke, surround it with a bounding box and track that region
– Reconstruct each stroke using the tracked points
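The decompose-track-reconstruct loop above can be sketched as below. This is a toy: the tracked positions are simulated with a fixed shift rather than produced by the TLD tracker, and linear interpolation stands in for the spline interpolation used in the actual method.

```python
# A user stroke at one depth, decomposed into N control points (x, y).
stroke_points = [(0.0, 0.0), (2.0, 1.0), (4.0, 0.5), (6.0, 2.0)]

# Simulate each point being tracked into the next frame with a small
# shift, as the per-point bounding-box tracker would report.
tracked = [(x + 0.5, y + 0.2) for (x, y) in stroke_points]

def reconstruct(points, samples_per_segment=4):
    """Densify the tracked control points back into a stroke.

    Linear interpolation here stands in for the spline
    interpolation used in the actual method."""
    out = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        for k in range(samples_per_segment):
            t = k / samples_per_segment
            out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    out.append(points[-1])
    return out

stroke = reconstruct(tracked)
print(len(stroke), stroke[0], stroke[-1])
```

Tracking only N control points instead of every stroke pixel is what keeps the propagation tractable; the dense stroke is recovered afterwards by interpolation.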
Conclusions • Made a semi-automatic method for 2D-to-3D conversion
– Automatic methods: need error correction & pre/post-processing – Manual methods: time-consuming and expensive – Ours is a happy medium between the two – Allows the user to correct errors instantly and re-run quickly
• Works for both images and video – Merged two segmentation algorithms together
• Combines the merits of both methods for better accuracy – Video: modified a robust tracking algorithm to track user-defined labels as well as dynamically adjust depths