© 2003 by Davi Geiger. Computer Vision, October 2003, L1.1. Structure-from-EgoMotion (based on notes from David Jacobs, CS-Maryland).


Structure-from-EgoMotion (based on notes from David Jacobs, CS-Maryland)

• Determining the 3-D structure of the world, and/or the motion of a camera using a sequence of images taken by a moving camera.

– Equivalently, we can think of the world as moving and the camera as fixed.

• Like stereo, but the position of the camera isn’t known (and it’s more natural to use many images with little motion between them, not just two with a lot of motion).

– We may or may not assume we know the parameters of the camera, such as its focal length.


• As with stereo, we can divide the problem:

– Correspondence.

– Reconstruction.

• We will focus on the reconstruction.

– So we assume that each image contains some points, and we know which points match which.

Structure-from-EgoMotion

…


Representation

• We’ll talk about a fixed camera and a moving object.

• We use scaled orthographic projection (weak perspective).

– We remove the z coordinate and scale all x and y coordinates by the same amount.

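• Key point:

Points:

$$P = \begin{pmatrix} x_1 & x_2 & \cdots & x_n \\ y_1 & y_2 & \cdots & y_n \\ z_1 & z_2 & \cdots & z_n \\ 1 & 1 & \cdots & 1 \end{pmatrix}$$

Some matrix:

$$S = \begin{pmatrix} r_{1,1} & r_{1,2} & r_{1,3} & t_x \\ r_{2,1} & r_{2,2} & r_{2,3} & t_y \end{pmatrix}$$

The image:

$$I = \begin{pmatrix} X_1 & X_2 & \cdots & X_n \\ Y_1 & Y_2 & \cdots & Y_n \end{pmatrix}$$

Then: $I = SP$.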
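As a small illustration of the key point, the following sketch builds a point matrix P and a 2 x 4 matrix S and multiplies them. The four points and the entries of S are made-up values chosen only for illustration; they do not come from the slides.

```python
import numpy as np

# Four hypothetical 3D points, one per column, with a final row of ones,
# matching the layout of P in the text.
P = np.array([
    [0.0, 1.0, 0.0, 2.0],   # x_1 ... x_n
    [0.0, 0.0, 1.0, 1.0],   # y_1 ... y_n
    [0.0, 0.0, 0.0, 3.0],   # z_1 ... z_n
    [1.0, 1.0, 1.0, 1.0],   # row of ones
])

# An example 2 x 4 matrix S: identity rotation rows plus a translation column.
S = np.array([
    [1.0, 0.0, 0.0, 5.0],    # r_11 r_12 r_13 t_x
    [0.0, 1.0, 0.0, -2.0],   # r_21 r_22 r_23 t_y
])

# The image is a 2 x n matrix: row 0 holds X coordinates, row 1 holds Y.
I = S @ P
print(I)
```

With this particular S the z coordinate is simply dropped and every point is shifted by (5, -2), which is the simplest case of the projection described above.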


Rotation

$$R\,P = \begin{pmatrix} r_{1,1} & r_{1,2} & r_{1,3} \\ r_{2,1} & r_{2,2} & r_{2,3} \\ r_{3,1} & r_{3,2} & r_{3,3} \end{pmatrix} P$$

R can represent a 3D rotation of the points in P. What are the constraints on R?

First, look at 2D rotation (easier)

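$$\begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x_1 & x_2 & \cdots & x_n \\ y_1 & y_2 & \cdots & y_n \end{pmatrix}, \qquad R = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$$

• $R R^T$ = Identity. $R^T$ is also a rotation matrix, in the opposite direction to R.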


Full 3D Rotation

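$$R = \begin{pmatrix} \cos\theta_z & -\sin\theta_z & 0 \\ \sin\theta_z & \cos\theta_z & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} \cos\theta_y & 0 & \sin\theta_y \\ 0 & 1 & 0 \\ -\sin\theta_y & 0 & \cos\theta_y \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\theta_x & -\sin\theta_x \\ 0 & \sin\theta_x & \cos\theta_x \end{pmatrix}$$

Rotation about z axis (the first factor): rotates the x, y coordinates and leaves the z coordinates fixed.

• Any rotation can be expressed as a combination of three rotations about three axes.

$$R R^T = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$

• Rows (and columns) of R are orthonormal vectors.

• R has determinant 1 (not -1).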
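The listed properties of R can be checked numerically. The sketch below is only an illustration: the helper names `rot_x`, `rot_y`, `rot_z`, the counterclockwise sign convention, and the three angles are assumptions made for this example.

```python
import numpy as np

def rot_x(a):
    """Rotation by angle a (radians) about the x axis."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def rot_y(b):
    """Rotation by angle b (radians) about the y axis."""
    c, s = np.cos(b), np.sin(b)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def rot_z(g):
    """Rotation by angle g (radians) about the z axis."""
    c, s = np.cos(g), np.sin(g)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Compose three axis rotations (arbitrary example angles).
R = rot_z(0.3) @ rot_y(-1.1) @ rot_x(0.7)

print(np.allclose(R @ R.T, np.eye(3)))      # rows and columns are orthonormal
print(np.isclose(np.linalg.det(R), 1.0))    # determinant is +1
p = np.array([1.0, 2.0, 3.0])
print(rot_z(0.3) @ p)                       # z coordinate of p is unchanged
```

The transpose of each factor is the rotation by the negated angle, which is the "opposite direction" property stated for the 2D case.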


S: Putting it Together

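$$I = \underbrace{s}_{\text{Scale}} \underbrace{\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}}_{\text{Projection}} \underbrace{\begin{pmatrix} 1 & 0 & 0 & t_x \\ 0 & 1 & 0 & t_y \\ 0 & 0 & 1 & t_z \end{pmatrix}}_{\text{3D Translation}} \underbrace{\begin{pmatrix} r_{1,1} & r_{1,2} & r_{1,3} & 0 \\ r_{2,1} & r_{2,2} & r_{2,3} & 0 \\ r_{3,1} & r_{3,2} & r_{3,3} & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}}_{\text{3D Rotation}} P$$

$$= \begin{pmatrix} s_{1,1} & s_{1,2} & s_{1,3} & s t_x \\ s_{2,1} & s_{2,2} & s_{2,3} & s t_y \end{pmatrix} P$$

where $s_{i,j} = s\, r_{i,j}$, so that

$$(s_{1,1}, s_{1,2}, s_{1,3}) \cdot (s_{2,1}, s_{2,2}, s_{2,3}) = 0$$

$$\|(s_{1,1}, s_{1,2}, s_{1,3})\| = \|(s_{2,1}, s_{2,2}, s_{2,3})\|$$

We can just write $s t_x$ as $t_x$ and $s t_y$ as $t_y$.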
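The composition of scale, projection, translation and rotation can be sketched as follows. The function name `weak_perspective_matrix` and the example values of s, R and t are hypothetical, chosen only to show that the resulting 2 x 4 matrix has two orthogonal rows of equal length in its first three columns.

```python
import numpy as np

def weak_perspective_matrix(s, R, t):
    """Return the 2 x 4 matrix  s * Projection * Translation * Rotation."""
    projection = np.array([[1.0, 0.0, 0.0],
                           [0.0, 1.0, 0.0]])                      # drops z
    translation = np.hstack([np.eye(3), np.reshape(t, (3, 1))])   # 3 x 4
    rotation = np.eye(4)
    rotation[:3, :3] = R                                          # 4 x 4
    return s * projection @ translation @ rotation

# Example values (assumptions): a rotation about z, scale 2, some translation.
theta = 0.4
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
S = weak_perspective_matrix(2.0, R, [1.0, -3.0, 10.0])

row1, row2 = S[0, :3], S[1, :3]
print(row1 @ row2)                                    # close to 0
print(np.linalg.norm(row1), np.linalg.norm(row2))     # both equal the scale
print(S[:, 3])                                        # (s*t_x, s*t_y); t_z is lost
```

Note that t_z never reaches the image, which is why only t_x and t_y appear in the final 2 x 4 matrix.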


Affine Structure from Motion

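$$I = \begin{pmatrix} s_{1,1} & s_{1,2} & s_{1,3} & t_x \\ s_{2,1} & s_{2,2} & s_{2,3} & t_y \end{pmatrix} P$$

The weak-perspective constraints

$$(s_{1,1}, s_{1,2}, s_{1,3}) \cdot (s_{2,1}, s_{2,2}, s_{2,3}) = 0$$

$$\|(s_{1,1}, s_{1,2}, s_{1,3})\| = \|(s_{2,1}, s_{2,2}, s_{2,3})\|$$

are not enforced in the affine setting: the entries $s_{i,j}$ are treated as unconstrained.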


Affine Structure-from-Motion: Two Frames (1)

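With superscripts indexing the frame and subscripts indexing the point:

$$\begin{pmatrix} u^1_1 & u^1_2 & \cdots & u^1_n \\ v^1_1 & v^1_2 & \cdots & v^1_n \\ u^2_1 & u^2_2 & \cdots & u^2_n \\ v^2_1 & v^2_2 & \cdots & v^2_n \end{pmatrix} = \begin{pmatrix} s^1_{1,1} & s^1_{1,2} & s^1_{1,3} & t^1_x \\ s^1_{2,1} & s^1_{2,2} & s^1_{2,3} & t^1_y \\ s^2_{1,1} & s^2_{1,2} & s^2_{1,3} & t^2_x \\ s^2_{2,1} & s^2_{2,2} & s^2_{2,3} & t^2_y \end{pmatrix} \begin{pmatrix} x_1 & x_2 & \cdots & x_n \\ y_1 & y_2 & \cdots & y_n \\ z_1 & z_2 & \cdots & z_n \\ 1 & 1 & \cdots & 1 \end{pmatrix}$$

To simplify, suppose for the first four points:

$$\begin{pmatrix} x_1 & x_2 & x_3 & x_4 \\ y_1 & y_2 & y_3 & y_4 \\ z_1 & z_2 & z_3 & z_4 \\ 1 & 1 & 1 & 1 \end{pmatrix} = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 1 & 1 & 1 \end{pmatrix}$$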



Looking at the first four points, we get:

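$$\begin{pmatrix} u^1_1 & u^1_2 & u^1_3 & u^1_4 \\ v^1_1 & v^1_2 & v^1_3 & v^1_4 \\ u^2_1 & u^2_2 & u^2_3 & u^2_4 \\ v^2_1 & v^2_2 & v^2_3 & v^2_4 \end{pmatrix} = \begin{pmatrix} s^1_{1,1} & s^1_{1,2} & s^1_{1,3} & t^1_x \\ s^1_{2,1} & s^1_{2,2} & s^1_{2,3} & t^1_y \\ s^2_{1,1} & s^2_{1,2} & s^2_{1,3} & t^2_x \\ s^2_{2,1} & s^2_{2,2} & s^2_{2,3} & t^2_y \end{pmatrix} \begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 1 & 1 & 1 \end{pmatrix}$$

We can solve for the motion by inverting the matrix of points.

Or, explicitly, we see that the first column on the left (the images of the first point) gives the translations. After solving for these, we can solve for each column of the s components of the motion using the images of each remaining point, in turn.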
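Both routes to the motion can be sketched numerically. In the example below, the motion matrix `S_true` is random made-up data, and the variable names are assumptions for illustration; the point matrix is the simple configuration assumed in the text (first point at the origin, the other three at unit distance along the axes).

```python
import numpy as np

# The four "simple" points, one per column, with a row of ones.
P4 = np.array([[0.0, 1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0, 0.0],
               [0.0, 0.0, 0.0, 1.0],
               [1.0, 1.0, 1.0, 1.0]])

# A made-up two-frame motion: 4 rows (u1, v1, u2, v2), last column = translations.
rng = np.random.default_rng(0)
S_true = rng.normal(size=(4, 4))
I4 = S_true @ P4                      # images of the four points

# Route 1: invert the matrix of points.
S_by_inverse = I4 @ np.linalg.inv(P4)

# Route 2: read the motion off column by column.
t = I4[:, 0]                          # image of the first point = translations
S_explicit = np.column_stack([I4[:, 1] - t,   # first column of s components
                              I4[:, 2] - t,   # second column
                              I4[:, 3] - t,   # third column
                              t])

print(np.allclose(S_by_inverse, S_true), np.allclose(S_explicit, S_true))
```

The explicit route makes it visible why four points are needed: one fixes the translation and each of the other three fixes one column of the s components.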



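But, what if the first four points aren’t so simple?

Then we define A, an affine transformation, so that:

$$A \begin{pmatrix} x_1 & x_2 & x_3 & x_4 \\ y_1 & y_2 & y_3 & y_4 \\ z_1 & z_2 & z_3 & z_4 \\ 1 & 1 & 1 & 1 \end{pmatrix} = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 1 & 1 & 1 \end{pmatrix}$$

This is always possible as long as the points aren’t coplanar.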

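Then, given:

$$\begin{pmatrix} u^1_1 & u^1_2 & \cdots & u^1_n \\ v^1_1 & v^1_2 & \cdots & v^1_n \\ u^2_1 & u^2_2 & \cdots & u^2_n \\ v^2_1 & v^2_2 & \cdots & v^2_n \end{pmatrix} = \begin{pmatrix} s^1_{1,1} & s^1_{1,2} & s^1_{1,3} & t^1_x \\ s^1_{2,1} & s^1_{2,2} & s^1_{2,3} & t^1_y \\ s^2_{1,1} & s^2_{1,2} & s^2_{1,3} & t^2_x \\ s^2_{2,1} & s^2_{2,2} & s^2_{2,3} & t^2_y \end{pmatrix} \begin{pmatrix} x_1 & x_2 & \cdots & x_n \\ y_1 & y_2 & \cdots & y_n \\ z_1 & z_2 & \cdots & z_n \\ 1 & 1 & \cdots & 1 \end{pmatrix}$$

Note that

$$A = \begin{pmatrix} a_{1,1} & a_{1,2} & a_{1,3} & a_{1,4} \\ a_{2,1} & a_{2,2} & a_{2,3} & a_{2,4} \\ a_{3,1} & a_{3,2} & a_{3,3} & a_{3,4} \\ 0 & 0 & 0 & 1 \end{pmatrix}$$

corresponds to translation of the points, plus a linear transformation.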



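We have:

$$\underbrace{\begin{pmatrix} u^1_1 & u^1_2 & \cdots & u^1_n \\ v^1_1 & v^1_2 & \cdots & v^1_n \\ u^2_1 & u^2_2 & \cdots & u^2_n \\ v^2_1 & v^2_2 & \cdots & v^2_n \end{pmatrix}}_{I} = \underbrace{\begin{pmatrix} s^1_{1,1} & s^1_{1,2} & s^1_{1,3} & t^1_x \\ s^1_{2,1} & s^1_{2,2} & s^1_{2,3} & t^1_y \\ s^2_{1,1} & s^2_{1,2} & s^2_{1,3} & t^2_x \\ s^2_{2,1} & s^2_{2,2} & s^2_{2,3} & t^2_y \end{pmatrix}}_{S} A^{-1} A \underbrace{\begin{pmatrix} x_1 & x_2 & \cdots & x_n \\ y_1 & y_2 & \cdots & y_n \\ z_1 & z_2 & \cdots & z_n \\ 1 & 1 & \cdots & 1 \end{pmatrix}}_{P}$$

And:

$$I = S A^{-1} \begin{pmatrix} 0 & 1 & 0 & 0 & \cdots \\ 0 & 0 & 1 & 0 & \cdots \\ 0 & 0 & 0 & 1 & \cdots \\ 1 & 1 & 1 & 1 & \cdots \end{pmatrix}$$

where the remaining columns are $A\,(x_i, y_i, z_i, 1)^T$ for the other points, up to $i = n$.

Then:

$$\begin{pmatrix} s^1_{1,1} & s^1_{1,2} & s^1_{1,3} & t^1_x \\ s^1_{2,1} & s^1_{2,2} & s^1_{2,3} & t^1_y \\ s^2_{1,1} & s^2_{1,2} & s^2_{1,3} & t^2_x \\ s^2_{2,1} & s^2_{2,2} & s^2_{2,3} & t^2_y \end{pmatrix} A^{-1}$$

is our motion. Thus, we can never determine the exact 3D structure of the scene. We can only determine it up to some transformation, A.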
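The ambiguity can be made concrete with a small numerical sketch. The six points and the motion matrix below are made-up example data (the first four points were chosen to be non-coplanar), and the names `canonical`, `S_alt` and `P_alt` are hypothetical. The sketch constructs A from the first four points and checks that the transformed motion and structure explain exactly the same images.

```python
import numpy as np

# Six example 3D points (columns) with a row of ones; first four are non-coplanar.
P = np.array([[1.0, 3.0, 0.0, 2.0, 1.0, 4.0],
              [2.0, 1.0, 4.0, 2.0, 0.0, 1.0],
              [0.0, 1.0, 2.0, 5.0, 3.0, 1.0],
              [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]])

rng = np.random.default_rng(1)
S = rng.normal(size=(4, 4))            # made-up two-frame affine motion
I = S @ P                              # the observed images

# A maps the first four points to the simple configuration.
canonical = np.array([[0.0, 1.0, 0.0, 0.0],
                      [0.0, 0.0, 1.0, 0.0],
                      [0.0, 0.0, 0.0, 1.0],
                      [1.0, 1.0, 1.0, 1.0]])
A = canonical @ np.linalg.inv(P[:, :4])

S_alt = S @ np.linalg.inv(A)           # the motion we would recover
P_alt = A @ P                          # the structure we would recover

print(np.round(A[3], 12))              # last row of A is (0, 0, 0, 1)
print(np.allclose(S_alt @ P_alt, I))   # same images from different structure
```

Since `P_alt` differs from `P` while producing identical images, the images alone cannot distinguish the two reconstructions.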


Affine Structure-from-Motion: Many frames (1)

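With m frames and n points:

$$\underbrace{\begin{pmatrix} u^1_1 & u^1_2 & \cdots & u^1_n \\ v^1_1 & v^1_2 & \cdots & v^1_n \\ u^2_1 & u^2_2 & \cdots & u^2_n \\ v^2_1 & v^2_2 & \cdots & v^2_n \\ \vdots & \vdots & & \vdots \\ u^m_1 & u^m_2 & \cdots & u^m_n \\ v^m_1 & v^m_2 & \cdots & v^m_n \end{pmatrix}}_{I} = \underbrace{\begin{pmatrix} s^1_{1,1} & s^1_{1,2} & s^1_{1,3} & t^1_x \\ s^1_{2,1} & s^1_{2,2} & s^1_{2,3} & t^1_y \\ s^2_{1,1} & s^2_{1,2} & s^2_{1,3} & t^2_x \\ s^2_{2,1} & s^2_{2,2} & s^2_{2,3} & t^2_y \\ \vdots & \vdots & \vdots & \vdots \\ s^m_{1,1} & s^m_{1,2} & s^m_{1,3} & t^m_x \\ s^m_{2,1} & s^m_{2,2} & s^m_{2,3} & t^m_y \end{pmatrix}}_{S} \underbrace{\begin{pmatrix} x_1 & x_2 & \cdots & x_n \\ y_1 & y_2 & \cdots & y_n \\ z_1 & z_2 & \cdots & z_n \\ 1 & 1 & \cdots & 1 \end{pmatrix}}_{P}$$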


First Step: Solve for Translation (1)

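• We pick the center of mass as the origin, i.e., the average of all 3D points. Averaging over all points also averages out noise in their locations.

That is, the points are rewritten as

$$p_i = (x_i - \bar{x},\; y_i - \bar{y},\; z_i - \bar{z}), \qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \quad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i, \quad \bar{z} = \frac{1}{n}\sum_{i=1}^{n} z_i$$

so that, in the new coordinates, $\sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i = \sum_{i=1}^{n} z_i = 0$.

Rotation doesn’t move the origin, which is now the center of mass. Neither does scaled orthographic projection.

$$\sum_{i=1}^{n} R \begin{pmatrix} x_i \\ y_i \\ z_i \end{pmatrix} = R \sum_{i=1}^{n} \begin{pmatrix} x_i \\ y_i \\ z_i \end{pmatrix} = R \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}$$

So the mean image point in frame k is the image of the center of mass, which is just the translation $(t^k_x, t^k_y)$. Subtracting it gives

$$\tilde{u}^k_i = u^k_i - \frac{1}{n}\sum_{j=1}^{n} u^k_j, \qquad \tilde{v}^k_i = v^k_i - \frac{1}{n}\sum_{j=1}^{n} v^k_j$$

$$\tilde{I} = \begin{pmatrix} \tilde{u}^1_1 & \tilde{u}^1_2 & \cdots & \tilde{u}^1_n \\ \tilde{v}^1_1 & \tilde{v}^1_2 & \cdots & \tilde{v}^1_n \\ \vdots & \vdots & & \vdots \end{pmatrix}$$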

First Step: Solve for Translation (2)

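$$\begin{pmatrix} \tilde{u}^1_1 & \tilde{u}^1_2 & \cdots & \tilde{u}^1_n \\ \tilde{v}^1_1 & \tilde{v}^1_2 & \cdots & \tilde{v}^1_n \\ \tilde{u}^2_1 & \tilde{u}^2_2 & \cdots & \tilde{u}^2_n \\ \tilde{v}^2_1 & \tilde{v}^2_2 & \cdots & \tilde{v}^2_n \\ \vdots & \vdots & & \vdots \\ \tilde{u}^m_1 & \tilde{u}^m_2 & \cdots & \tilde{u}^m_n \\ \tilde{v}^m_1 & \tilde{v}^m_2 & \cdots & \tilde{v}^m_n \end{pmatrix} = \begin{pmatrix} s^1_{1,1} & s^1_{1,2} & s^1_{1,3} \\ s^1_{2,1} & s^1_{2,2} & s^1_{2,3} \\ s^2_{1,1} & s^2_{1,2} & s^2_{1,3} \\ s^2_{2,1} & s^2_{2,2} & s^2_{2,3} \\ \vdots & \vdots & \vdots \\ s^m_{1,1} & s^m_{1,2} & s^m_{1,3} \\ s^m_{2,1} & s^m_{2,2} & s^m_{2,3} \end{pmatrix} \begin{pmatrix} x_1 & x_2 & \cdots & x_n \\ y_1 & y_2 & \cdots & y_n \\ z_1 & z_2 & \cdots & z_n \end{pmatrix}$$

Thus, translation can be eliminated.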
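Eliminating translation amounts to subtracting each row's mean from the image matrix. The sketch below uses random made-up structure and motion (3 frames, 8 points; all variable names are assumptions) to check that the row means recover the translations and that the centered images depend only on the s components.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 3, 8                                  # frames, points (example sizes)

X = rng.normal(size=(3, n))
X -= X.mean(axis=1, keepdims=True)           # put the center of mass at the origin
P = np.vstack([X, np.ones((1, n))])          # 4 x n points with a row of ones

S = rng.normal(size=(2 * m, 4))              # made-up motion; last column = translations
I = S @ P                                    # 2m x n image matrix

t_est = I.mean(axis=1)                       # mean image point per row
I_tilde = I - t_est[:, None]                 # centered image matrix

print(np.allclose(t_est, S[:, 3]))           # row means are the translations
print(np.allclose(I_tilde, S[:, :3] @ X))    # translation-free factorization
```

With real data the true center of mass is unknown, but centering the image rows has the same effect as choosing it as the origin.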


Rank Theorem

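$$\underbrace{\begin{pmatrix} \tilde{u}^1_1 & \tilde{u}^1_2 & \cdots & \tilde{u}^1_n \\ \tilde{v}^1_1 & \tilde{v}^1_2 & \cdots & \tilde{v}^1_n \\ \vdots & \vdots & & \vdots \\ \tilde{u}^m_1 & \tilde{u}^m_2 & \cdots & \tilde{u}^m_n \\ \tilde{v}^m_1 & \tilde{v}^m_2 & \cdots & \tilde{v}^m_n \end{pmatrix}}_{\tilde{I}} = \underbrace{\begin{pmatrix} s^1_{1,1} & s^1_{1,2} & s^1_{1,3} \\ s^1_{2,1} & s^1_{2,2} & s^1_{2,3} \\ \vdots & \vdots & \vdots \\ s^m_{1,1} & s^m_{1,2} & s^m_{1,3} \\ s^m_{2,1} & s^m_{2,2} & s^m_{2,3} \end{pmatrix}}_{S} \underbrace{\begin{pmatrix} x_1 & x_2 & \cdots & x_n \\ y_1 & y_2 & \cdots & y_n \\ z_1 & z_2 & \cdots & z_n \end{pmatrix}}_{P}$$

$\tilde{I}$ has rank 3 (at most, since it is the product of a 2m x 3 and a 3 x n matrix).

This means there are 3 vectors such that every row of $\tilde{I}$ is a linear combination of these vectors. These vectors are the rows of P.

• SVD is made to do this.

$\tilde{I} = U D V$. D is diagonal with non-increasing values; select the first (top) three values, i.e., make D a 3 x 3 matrix. U then has orthonormal columns and V has orthonormal rows, with sizes 2m x 3 and 3 x n respectively.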
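A minimal sketch of the rank-3 factorization with NumPy follows. The function name `factor_rank3` and the synthetic data are assumptions for illustration. Note that `np.linalg.svd` returns the third factor already as a matrix with orthonormal rows, so it plays the role of V in the text, and that assigning D to the structure factor rather than the motion factor is an arbitrary choice here.

```python
import numpy as np

def factor_rank3(I_tilde):
    """Split a centered 2m x n image matrix into a 2m x 3 and a 3 x n factor."""
    U, d, Vt = np.linalg.svd(I_tilde, full_matrices=False)
    S_hat = U[:, :3]                        # 2m x 3, orthonormal columns
    P_hat = np.diag(d[:3]) @ Vt[:3, :]      # 3 x n, D folded into the structure
    return S_hat, P_hat, d

# Synthetic, noise-free example: 4 frames, 10 points (made-up data).
rng = np.random.default_rng(3)
m, n = 4, 10
S_true = rng.normal(size=(2 * m, 3))
P_true = rng.normal(size=(3, n))
I_tilde = S_true @ P_true

S_hat, P_hat, d = factor_rank3(I_tilde)
print(d)                                    # only the first three values are non-negligible
print(np.allclose(S_hat @ P_hat, I_tilde))  # the rank-3 factors reproduce the images
```

The recovered factors generally differ from `S_true` and `P_true` by an invertible 3 x 3 matrix, which is the linear ambiguity discussed next in the slides.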


Linear Ambiguity (as before)

$\tilde{I}$ = U(:,1:3) * D(1:3,1:3) * V(1:3,:)

= (U(:,1:3) * A) * (inv(A) * D(1:3,1:3) * V(1:3,:))


Noise

• With noise, $\tilde{I}$ has full rank.

• Best solution is to estimate I that’s as near to $\tilde{I}$ as possible, with the estimate of I having rank 3.

• Our current method does this.
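The effect of noise, and the sense in which truncating the SVD gives the nearest rank-3 matrix (in the Frobenius norm), can be sketched as follows. The noise level 0.01 and the matrix sizes are arbitrary assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(4)
I_clean = rng.normal(size=(8, 3)) @ rng.normal(size=(3, 10))   # exactly rank 3
I_noisy = I_clean + 0.01 * rng.normal(size=I_clean.shape)      # noise makes it full rank

U, d, Vt = np.linalg.svd(I_noisy, full_matrices=False)
I_hat = U[:, :3] @ np.diag(d[:3]) @ Vt[:3, :]                  # keep the top three values

residual = np.linalg.norm(I_noisy - I_hat)                     # Frobenius distance
print(np.linalg.matrix_rank(I_noisy, tol=1e-8))                # full rank
print(np.linalg.matrix_rank(I_hat, tol=1e-8))                  # rank 3
print(residual, np.sqrt(np.sum(d[3:] ** 2)))                   # the two numbers agree
```

The residual equals the root of the sum of the squared discarded singular values, and no other rank-3 matrix (including the noise-free one) is closer to the noisy data in this norm.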


Weak Perspective Motion

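$$\underbrace{\begin{pmatrix} \tilde{u}^1_1 & \tilde{u}^1_2 & \cdots & \tilde{u}^1_n \\ \tilde{v}^1_1 & \tilde{v}^1_2 & \cdots & \tilde{v}^1_n \\ \vdots & \vdots & & \vdots \\ \tilde{u}^m_1 & \tilde{u}^m_2 & \cdots & \tilde{u}^m_n \\ \tilde{v}^m_1 & \tilde{v}^m_2 & \cdots & \tilde{v}^m_n \end{pmatrix}}_{\tilde{I}} = \underbrace{\begin{pmatrix} s^1_{1,1} & s^1_{1,2} & s^1_{1,3} \\ s^1_{2,1} & s^1_{2,2} & s^1_{2,3} \\ \vdots & \vdots & \vdots \\ s^m_{1,1} & s^m_{1,2} & s^m_{1,3} \\ s^m_{2,1} & s^m_{2,2} & s^m_{2,3} \end{pmatrix}}_{S} \underbrace{\begin{pmatrix} x_1 & x_2 & \cdots & x_n \\ y_1 & y_2 & \cdots & y_n \\ z_1 & z_2 & \cdots & z_n \end{pmatrix}}_{P}$$

The two rows of S that belong to the same frame (rows 2k and 2k+1, counting rows from 0) should be orthogonal. All rows should be unit vectors.

(Push all scale into P).

$\tilde{I}$ = (U(:,1:3)*A)*(inv(A)*D(1:3,1:3)*V(1:3,:))

Choose A so (U(:,1:3) * A) satisfies these conditions.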
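The slides do not say how A is chosen. One common approach, sketched here as an assumption rather than as the method intended by the notes, is to observe that the conditions are linear in the symmetric matrix Q = A Aᵀ, solve for Q by least squares, and take A from a Cholesky factorization of Q. The function names `quad_row` and `metric_upgrade` and the synthetic test data are hypothetical.

```python
import numpy as np

def quad_row(a, b):
    """Coefficients of a^T Q b in the 6 unknowns of a symmetric 3 x 3 Q."""
    return np.array([a[0] * b[0],
                     a[0] * b[1] + a[1] * b[0],
                     a[0] * b[2] + a[2] * b[0],
                     a[1] * b[1],
                     a[1] * b[2] + a[2] * b[1],
                     a[2] * b[2]])

def metric_upgrade(S_hat):
    """Find A so rows of S_hat @ A are unit length and orthogonal per frame."""
    rows, rhs = [], []
    for k in range(S_hat.shape[0] // 2):
        a, b = S_hat[2 * k], S_hat[2 * k + 1]          # the two rows of frame k
        rows += [quad_row(a, a), quad_row(b, b), quad_row(a, b)]
        rhs += [1.0, 1.0, 0.0]                         # unit, unit, orthogonal
    q, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    Q = np.array([[q[0], q[1], q[2]],
                  [q[1], q[3], q[4]],
                  [q[2], q[4], q[5]]])
    return np.linalg.cholesky(Q)                       # needs Q positive definite

# Synthetic check: 5 frames whose true motion rows are orthonormal pairs.
rng = np.random.default_rng(5)
m, n = 5, 12
frames = []
for k in range(m):
    Qm, _ = np.linalg.qr(rng.normal(size=(3, 3)))      # random orthogonal matrix
    frames.append(Qm[:2])                              # keep two orthonormal rows
S_true = np.vstack(frames)
P_true = rng.normal(size=(3, n))

U, d, Vt = np.linalg.svd(S_true @ P_true, full_matrices=False)
S_hat = U[:, :3]
A = metric_upgrade(S_hat)
S_metric = S_hat @ A
G = S_metric @ S_metric.T                              # Gram matrix of the rows
print(np.round(np.diag(G), 6))                         # every row has unit length
```

With noisy data Q may fail to be positive definite, in which case the Cholesky step raises an error and a more careful method is needed; the result is also only determined up to a rotation (and reflection) of the whole scene.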


Related problems we won’t cover

• Missing data.

• Points with different, known noise.

• Multiple moving objects.


Final Messages

• Structure-from-egomotion for points can be reduced to linear algebra.

• Epipolar constraint reemerges.

• SVD useful.

• Rank Theorem says the images a scene produces aren’t complicated (also important for recognition).
