Carsten Rother Microsoft Research Cambridge


~140 employees (~100 Researchers, ~30 RSDEs, ~10 Admin)

Six different groups:

Computer-Mediated Living

Machine Learning & Perception

Cambridge Innovation Development

Computational Science

Programming Principles & Tools

Systems & Networking

• Computer Vision group: medical vision, recognition, reconstruction, image editing, …

• Machine learning group: Infer.Net, Online Services and Advertisement, Xbox Ranking

• Constrained Reasoning group: Planning and Optimization

• Socio-Digital Systems: Understanding human needs for future technology

• Sensors and Devices: SenseCam, Gadgeteer, …

• Interactive 3D Technologies group

I3D mission: new user experiences (Machine Learning, Hardware design, Human studies, Graphics, Computer Vision)

Intersection workshop (May 2012, Cambridge): http://research.microsoft.com/en-us/events/intersection12/

• All factors in the graph are trees

• Discriminative training of millions of parameters

• We can handle many loss functions

Decision/Regression Trees + Random Fields

Discrete labelling tasks:

Noisy input Ours [Zoran, Weiss, ICCV ‘11]

Continuous labelling tasks:

Test input Ground Truth

Trees Trees & Field

• PatchMatch stereo, BMVC ’11

• PatchMatchBP stereo, BMVC ’12

• SurfaceStereo, ObjectStereo, CVPR ’10, ’11 – a review

• SceneStereo, ECCV ’12

• Learning interactive image segmentation, IJCV ‘12



Left view Right view

Depth map


Local stereo matching: check photo-consistency over a rectangular region (patch)
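A minimal sketch of the patch-based photo-consistency check, assuming grayscale images stored as lists of rows and a sum-of-absolute-differences cost. The names `patch_cost` and `best_disparity`, the cost function, and the synthetic image pair are illustrative assumptions, not taken from the slides.

```python
import random

def patch_cost(left, right, x, y, d, half=1):
    """Sum of absolute differences between the patch centred at (x, y) in the
    left image and the patch shifted by disparity d in the right image."""
    cost = 0.0
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            cost += abs(left[y + dy][x + dx] - right[y + dy][x + dx - d])
    return cost

def best_disparity(left, right, x, y, max_d, half=1):
    """Winner-take-all: try every integer disparity and keep the cheapest."""
    costs = [patch_cost(left, right, x, y, d, half) for d in range(max_d + 1)]
    return min(range(max_d + 1), key=lambda d: costs[d])

# Synthetic pair (assumption): the right view is the left view shifted by 3 pixels.
rng = random.Random(0)
left = [[rng.random() for _ in range(20)] for _ in range(9)]
right = [[row[(x + 3) % 20] for x in range(20)] for row in left]

print(best_disparity(left, right, x=10, y=4, max_d=5))
```

The exhaustive loop over integer disparities is what makes this baseline slow and restricts it to discrete, fronto-parallel depth labels.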


Fails at discontinuities

Fails at non-fronto-parallel planes

No continuous depth label

Slow

Adaptive support weights [Yoon, CVPR ‘05]




Slow

3 continuous parameters (depth + normal) for each pixel
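One way to read the three continuous parameters is as a disparity plane per pixel. The sketch below assumes the common parameterisation d = a·x + b·y + c, built from a point and a normal vector; the function names and this exact parameterisation are assumptions for illustration and are not stated on the slide.

```python
def plane_from_point_and_normal(x0, y0, d0, normal):
    """Convert a point (x0, y0, d0) and a normal (nx, ny, nz), nz != 0,
    into plane coefficients (a, b, c) with d = a*x + b*y + c."""
    nx, ny, nz = normal
    a = -nx / nz
    b = -ny / nz
    c = (nx * x0 + ny * y0 + nz * d0) / nz
    return a, b, c

def plane_disparity(plane, x, y):
    """Disparity that the plane assigns to pixel (x, y)."""
    a, b, c = plane
    return a * x + b * y + c

# A fronto-parallel plane gives the same disparity everywhere ...
flat = plane_from_point_and_normal(2, 3, 5.0, (0.0, 0.0, 1.0))
# ... while a slanted plane changes disparity across the patch.
slanted = plane_from_point_and_normal(2, 3, 5.0, (1.0, 0.0, 1.0))
print(plane_disparity(flat, 7, 1), plane_disparity(slanted, 3, 3))
```

With a slanted plane, every pixel inside the matching patch gets its own sub-pixel disparity, which is what addresses the non-fronto-parallel failure case.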




Slow

Depth map


A red pixel means that a better solution exists in its 4-neighbourhood






1. Random initialization
2. Go through pixels in sequential order:
   2a. consider solution from left/top neighbour
   2b. sample around current solution
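The steps above can be sketched on a toy 1D problem with one continuous label in [0, 1] per pixel. Everything concrete here is an assumption for illustration: the function name `patchmatch_1d`, the absolute-difference cost, the halving search radius, and the reversed sweep on odd iterations (borrowed from the general PatchMatch scheme; the slide only mentions the left/top neighbour).

```python
import random

def patchmatch_1d(cost, n, iterations=4, seed=0):
    """cost(i, u) is the matching cost of label u in [0, 1] at pixel i."""
    rng = random.Random(seed)
    u = [rng.random() for _ in range(n)]           # 1. random initialization
    for it in range(iterations):
        forward = it % 2 == 0
        order = range(n) if forward else range(n - 1, -1, -1)
        step = -1 if forward else 1                # neighbour visited just before
        for i in order:                            # 2. sequential order
            j = i + step
            # 2a. propagation: adopt the neighbour's solution if it is cheaper
            if 0 <= j < n and cost(i, u[j]) < cost(i, u[i]):
                u[i] = u[j]
            # 2b. refinement: sample around the current solution
            radius = 0.5
            while radius > 1e-3:
                cand = min(1.0, max(0.0, u[i] + rng.uniform(-radius, radius)))
                if cost(i, cand) < cost(i, u[i]):
                    u[i] = cand
                radius /= 2
    return u

# Toy signal (assumption): two constant segments, like two fronto-parallel planes.
truth = [0.25] * 25 + [0.75] * 25
u = patchmatch_1d(lambda i, v: abs(v - truth[i]), len(truth))
mean_error = sum(abs(a - b) for a, b in zip(u, truth)) / len(truth)
```

A candidate is only accepted when it lowers the cost, so the per-pixel cost never increases, and one good random guess per segment is enough for propagation to spread it.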

Left image – Reindeer (Middlebury)

Left and right disparity maps (intermediate step of iteration 1)

Left image – Sawtooth (Middlebury)

Image consists of 3 planes: ~80,000 guesses for the yellow plane

Ground truth disparities

Randomization is in our favour

No cost volume needed: well suited for large images and large depth range


PatchMatch Stereo result

Unary term (photo-consistency)

Pairwise term (local curvature)

Add a Markov Random Field:

Continuous, 3-dimensional label space

Cost ≠ 0: local curvature or discontinuity

Cost = 0: both planes are aligned in 3D

So far, we have been running with λ = 0
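The unary and pairwise terms above can be combined into one energy. This is a sketch under assumed notation: the symbols $\psi_s$, $\psi_{st}$, $\mathcal{V}$ (pixels) and $\mathcal{N}$ (neighbouring pixel pairs) do not appear on the slides; only $u$ and $\lambda$ do.

```latex
E(u) \;=\; \sum_{s \in \mathcal{V}} \underbrace{\psi_s(u_s)}_{\text{unary: photo-consistency}}
\;+\; \lambda \sum_{(s,t) \in \mathcal{N}} \underbrace{\psi_{st}(u_s, u_t)}_{\text{pairwise: local curvature}}
```

With $\lambda = 0$ the pairwise sum vanishes and every pixel can be minimised independently, which is the setting used up to this point.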

For non-zero λ, with super high-dimensional u:

Gradient descent

Gradient descent + Fusion move

Relaxation + Gradient descent

Simulated Annealing

Continuous Belief Propagation

Operation 1: compute the neg-log belief Bs(us) at node s

Operation 2: re-compute the message between nodes t and s

Sequential schedule (messages M1->2, M2->3, M1->4 in the figure)

Final output: us* = argmin_us Bs(us)
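The two operations can be sketched for the discrete case as min-sum (neg-log max-product) BP on a chain, where one forward and one backward pass are exact. The chain topology, the function names, and the toy unary and pairwise costs are assumptions for illustration; the slides concern grids with continuous labels.

```python
def min_sum_chain(unary, pairwise, lam=1.0):
    """unary[s][a]: cost of label a at node s; pairwise(a, b): smoothness cost."""
    n, k = len(unary), len(unary[0])
    fwd = [[0.0] * k for _ in range(n)]   # fwd[s]: message from node s-1 into s
    bwd = [[0.0] * k for _ in range(n)]   # bwd[s]: message from node s+1 into s
    # Operation 2: (re-)compute messages in a sequential schedule
    for s in range(1, n):
        t = s - 1
        fwd[s] = [min(unary[t][b] + fwd[t][b] + lam * pairwise(b, a)
                      for b in range(k)) for a in range(k)]
    for s in range(n - 2, -1, -1):
        t = s + 1
        bwd[s] = [min(unary[t][b] + bwd[t][b] + lam * pairwise(b, a)
                      for b in range(k)) for a in range(k)]
    # Operation 1: neg-log belief = unary cost + all incoming messages
    beliefs = [[unary[s][a] + fwd[s][a] + bwd[s][a] for a in range(k)]
               for s in range(n)]
    # Final output: per-node argmin of the belief
    labels = [min(range(k), key=lambda a, b=b: b[a]) for b in beliefs]
    return labels, beliefs

def chain_energy(labels, unary, pairwise, lam=1.0):
    e = sum(unary[s][a] for s, a in enumerate(labels))
    return e + lam * sum(pairwise(labels[s], labels[s + 1])
                         for s in range(len(labels) - 1))

# Toy problem (assumption): 4 nodes, 4 labels, quadratic data and smoothness terms.
obs = [0, 3, 0, 0]
unary = [[float((a - o) ** 2) for a in range(4)] for o in obs]
quad = lambda a, b: float((a - b) ** 2)
labels, beliefs = min_sum_chain(unary, quad, lam=1.0)
```

On a chain, the minimum of each node's belief equals the minimum of the full energy, which gives a simple way to check the message updates against brute force.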

Figure: target; source (shifted 4.0 + noise); Ground Truth

Error: 0.618 (unary only)

Error: 0.251

12x12 discrete labels

Figure: target; source (shifted 4.2 + noise); GT

Error: 0.66

Error: 1.9 (unary only)

Error: 5.68

Error: 3.46 (unary only)

Each pixel has a different set of particles (figure: particles of nodes s and t on the interval from 0 to 1; sequential schedule with messages M2->3, M1->2, M1->4)

Comment: we do max-product, hence we may not want to approximate true continuous distributions

Figure: neg-log beliefs Bs(us) and Bt(ut) of nodes s and t on the interval from 0 to 1, with pairwise term (us-ut)^2; sequential schedule with messages M2->3, M1->2, M1->4

Sample around current particles (figure: particles us of node s on the interval from 0 to 1; sequential schedule with messages M2->3, M1->2, M1->4)

Final output: us* = argmin_us Bs(us)

GT

Discrete: Error: 5.68

Random init: Energy: 47308, Error: 0.9713

Best unary init (144 discrete): Energy: 42628, Error: 0.8259

The message Mt->s has high values (low cost) where us = ut, since the smoothness term is (us-ut)^2

PatchMatch (PM) idea: also sample at your neighbours' solutions!

We call this variant of Particle BP "PatchMatch BP" (PMBP)
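A heavily simplified sketch of the idea on a 1D chain: each node keeps a few particles, and candidates come from its own particles, its neighbours' particles (the PatchMatch step), and random perturbations. The function names, the quadratic unary term, the stored-message approximation, and all parameter values are assumptions for illustration, not the published PMBP algorithm.

```python
import random

def energy(u, obs, lam):
    data = sum((a - b) ** 2 for a, b in zip(u, obs))
    smooth = sum((u[i] - u[i + 1]) ** 2 for i in range(len(u) - 1))
    return data + lam * smooth

def pmbp_chain(obs, lam=0.5, n_particles=3, iterations=5, seed=0):
    rng = random.Random(seed)
    n = len(obs)
    nbrs = [[t for t in (s - 1, s + 1) if 0 <= t < n] for s in range(n)]
    # particles[s]: list of (value, {t: stored message M_{t->s}(value)})
    particles = [[(rng.random(), {t: 0.0 for t in nbrs[s]})
                  for _ in range(n_particles)] for s in range(n)]

    def message(t, s, us):
        # min over t's particles of: unary + smoothness + messages into t, except from s
        return min((ut - obs[t]) ** 2 + lam * (us - ut) ** 2
                   + sum(m for r, m in msgs.items() if r != s)
                   for ut, msgs in particles[t])

    def belief(s, us):
        msgs = {t: message(t, s, us) for t in nbrs[s]}
        return (us - obs[s]) ** 2 + sum(msgs.values()), msgs

    for it in range(iterations):
        radius = 0.5 / 2 ** it
        for s in range(n):                          # sequential schedule
            own = [u for u, _ in particles[s]]
            cands = list(own)
            cands += [u for t in nbrs[s] for u, _ in particles[t]]  # neighbours' solutions
            cands += [min(1.0, max(0.0, u + rng.uniform(-radius, radius)))
                      for u in own]                 # sample around current particles
            scored = []
            for u in dict.fromkeys(cands):          # de-duplicate, keep order
                b, msgs = belief(s, u)
                scored.append((b, u, msgs))
            scored.sort(key=lambda item: item[0])
            particles[s] = [(u, msgs) for _, u, msgs in scored[:n_particles]]
    # final output: per-node particle with the lowest neg-log belief
    return [min(particles[s], key=lambda p: belief(s, p[0])[0])[0] for s in range(n)]

# Toy observations (assumption): a step edge.
obs = [0.2] * 6 + [0.8] * 6
result = pmbp_chain(obs)
```

Setting `lam=0` and `n_particles=1` removes the messages and leaves only propagation and random refinement on the unary term, matching the description of PatchMatch as a special case of Particle BP.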


GT

Best unary init: Energy: 42628, Error: 0.8259

Random init, 50 particles: Energy: 21959, Error: 0.4159

Random init, 1 particle: Energy: 22593, Error: 0.3864

PatchMatch is a special form of Particle BP:

• λ = 0

• 1 particle per node

• Sample from neighbour nodes

Iterate two steps (in a nutshell):

1) Run full BP until convergence (convex version which solves the LP relaxation)

2) Sample all nodes individually

Highly ranked in Middlebury Table

• PatchMatch stereo, BMVC ‘11 • PatchMatchBP stereo, BMVC ‘12

• SurfaceStereo, ObjectStereo, CVPR ‘10,’11 • SceneStereo, ECCV ‘12


Ultimate Goal:
• Recover: geometry, light, material
• Recognise: object instances, attributes
… and do that jointly

Theoretical Challenges:
• Statistical models of the world and the captured images
• Combine statistical priors and physical constraints

Practical Challenges:
• Robustness
• Real-time inference
• Task-driven, e.g. robotics

To achieve this: latest machine learning, latest optimization techniques

Assignment of pixels to surfaces

Simple explanation: describe the scene by a few low-degree surfaces (splines, planes)

Goal: depth estimation improves

Without prior With prior

Simple explanation: describe the scene by a few objects:
- compact in 3D
- connected in 3D
- each object has a color model

Goal: depth estimation improves

Objects o

Depth d

Simple explanation: describe the scene by a few objects:
- compact in 3D (use bbox)
- each object has a color model
- physical constraints

Goal:
1) depth estimation improves
2) object extraction improves

1) Create proposal pool

2) Rank proposal pool

3) Combine best objects and recognize

Use stereo images

boat

sky

water

Goals:

• Reason in 3D with physical constraints

• Improve depth estimation

Left input image

Object labelling proposal 1

Object labelling proposal 2

Output:
- Object labelling
- Depth labelling
- Object 3D bounding boxes
- Object colour distribution

Stereo: photo-consistency

Objects: colour model

Prior on number of objects

Left input image PatchMatch Stereo Result

Object mask

Depth map

Physical properties:

Bounding Box tightness

Bounding Box intersection

Bounding Box Gravity
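As an illustration of the bounding-box intersection property, the sketch below measures the overlap volume of two axis-aligned 3D boxes, which a model could penalise because solid objects should not interpenetrate. The axis-aligned representation, the function names, and using volume as the penalty are assumptions; the slides do not specify how the term is computed.

```python
def box_volume(box):
    """box = ((x0, y0, z0), (x1, y1, z1)) with min corner first; empty boxes give 0."""
    (x0, y0, z0), (x1, y1, z1) = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0) * max(0.0, z1 - z0)

def intersection_volume(a, b):
    """Volume shared by two axis-aligned boxes (0 if they do not overlap)."""
    lo = tuple(max(p, q) for p, q in zip(a[0], b[0]))
    hi = tuple(min(p, q) for p, q in zip(a[1], b[1]))
    return box_volume((lo, hi))

box_a = ((0, 0, 0), (2, 2, 2))
box_b = ((1, 1, 1), (3, 3, 3))
box_c = ((5, 5, 5), (6, 6, 6))
print(intersection_volume(box_a, box_b), intersection_volume(box_a, box_c))
```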

Merging (simulated annealing, patchmatch)

Exploration (mean-shift, patchmatch)

Object maps

Multiple Scene Proposals by varying the prior on number of objects

Good rank in Middlebury table

Green: this term is useful

All Terms are useful

Figure panels: Images; Ground truth (GT); Our labelling; comparison of 2D, Object stereo, and Ours

Large Scale Train and Test

Real-time

Do full 3D reconstruction (KinectFusion)

Model all physical properties: Light, Material

Use a graphics engine for training and testing (“analysis by synthesis”)

• PatchMatch stereo, BMVC ‘11 PatchMatchBP stereo, BMVC ‘12

• SurfaceStereo, ObjectStereo, CVPR ‘10,’11 SceneStereo, ECCV ‘12


Weights w

Training Time

How much user input shall we use for learning?

predictions

Testing Time

prediction

prediction

Static brush

Static trimap

Training Time Testing Time

Goal: User should reach a satisfying result in as few interactions as possible

Define: “interaction” and “satisfying”

Human (averaged over 6 users)

Computer (simulated brush strokes)

Algorithmic State

Suggested action

Ground Truth

Current Solution

What type of user? (novice user, advanced user)

Adjusting weights with the learning curve of the user

Other interactive systems

• PatchMatch stereo, BMVC ‘11 PatchMatchBP stereo, BMVC ‘12

• SurfaceStereo, ObjectStereo, CVPR ‘10,’11 SceneStereo, ECCV ‘12

