Computer Vision October 2003 L1.1© 2003 by Davi Geiger
Structure-from-EgoMotion (based on notes from David Jacobs, CS-Maryland)
• Determining the 3-D structure of the world, and/or the motion of a camera using a sequence of images taken by a moving camera.
– Equivalently, we can think of the world as moving and the camera as fixed.
• Like stereo, but the position of the camera isn’t known (and it’s more natural to use many images with little motion between them, not just two with a lot of motion).
– We may or may not assume we know the parameters of the camera, such as its focal length.
Structure-from-EgoMotion
• As with stereo, we can divide the problem:
– Correspondence.
– Reconstruction.
• We will focus on the reconstruction.
– So we assume that each image contains some points, and we know which points match which.
…
Representation
• We’ll talk about a fixed camera and a moving object.
• We use scaled orthographic projection (weak perspective).
– We remove the z coordinate and scale all x and y coordinates by the same amount.
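• Key point:

\[
P = \begin{pmatrix}
x_1 & x_2 & \cdots & x_n \\
y_1 & y_2 & \cdots & y_n \\
z_1 & z_2 & \cdots & z_n \\
1 & 1 & \cdots & 1
\end{pmatrix} \ \text{(points)}, \qquad
S = \begin{pmatrix}
r_{1,1} & r_{1,2} & r_{1,3} & t_x \\
r_{2,1} & r_{2,2} & r_{2,3} & t_y
\end{pmatrix} \ \text{(some matrix)}, \qquad
I = \begin{pmatrix}
X_1 & X_2 & \cdots & X_n \\
Y_1 & Y_2 & \cdots & Y_n
\end{pmatrix} \ \text{(the image)}
\]

Then:

\[
I = S P
\]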
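As a small numerical illustration of I = SP, the sketch below projects four made-up points with a made-up 2 x 4 matrix S. The point coordinates and the entries of S are assumptions chosen only for the example; they do not come from the slides.

```python
import numpy as np

# Hypothetical points, one per column: (x, y, z, 1).
P = np.array([[0.0, 1.0, 0.0, 2.0],
              [0.0, 0.0, 1.0, 1.0],
              [0.0, 0.0, 0.0, 3.0],
              [1.0, 1.0, 1.0, 1.0]])

# Hypothetical 2 x 4 matrix S: two rows of a rotation (here the identity)
# followed by an image-plane translation (t_x, t_y) = (5, -2).
S = np.array([[1.0, 0.0, 0.0, 5.0],
              [0.0, 1.0, 0.0, -2.0]])

# The image: row 0 holds the X coordinates, row 1 the Y coordinates.
I = S @ P
print(I)
```

Because the third column of this S is zero, the z coordinates of the points have no effect on the image, which is exactly the "remove the z coordinate" step of the projection.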
Rotation
\[
R\,P = \begin{pmatrix}
r_{1,1} & r_{1,2} & r_{1,3} \\
r_{2,1} & r_{2,2} & r_{2,3} \\
r_{3,1} & r_{3,2} & r_{3,3}
\end{pmatrix} P
\]

R can represent a 3D rotation of the points in P. What are the constraints on R?
First, look at 2D rotation (easier)
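\[
\begin{pmatrix}
\cos\theta & -\sin\theta \\
\sin\theta & \cos\theta
\end{pmatrix}
\begin{pmatrix}
x_1 & x_2 & \cdots & x_n \\
y_1 & y_2 & \cdots & y_n
\end{pmatrix},
\qquad
R = \begin{pmatrix}
\cos\theta & -\sin\theta \\
\sin\theta & \cos\theta
\end{pmatrix}
\]

• \(R R^T\) = Identity. \(R^T\) is also a rotation matrix, in the opposite direction to R.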
Full 3D Rotation
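\[
R =
\begin{pmatrix}
\cos\alpha & -\sin\alpha & 0 \\
\sin\alpha & \cos\alpha & 0 \\
0 & 0 & 1
\end{pmatrix}
\begin{pmatrix}
\cos\beta & 0 & \sin\beta \\
0 & 1 & 0 \\
-\sin\beta & 0 & \cos\beta
\end{pmatrix}
\begin{pmatrix}
1 & 0 & 0 \\
0 & \cos\gamma & -\sin\gamma \\
0 & \sin\gamma & \cos\gamma
\end{pmatrix}
\]

The first factor is a rotation about the z axis: it rotates the x, y coordinates and leaves the z coordinates fixed.

• Any rotation can be expressed as a combination of three rotations about three axes.

\[
R R^T = \begin{pmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1
\end{pmatrix}
\]

• Rows (and columns) of R are orthonormal vectors.
• R has determinant 1 (not -1).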
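The constraints on R listed above can be checked numerically. The sketch below builds the three axis rotations with the usual right-handed sign convention (an assumption, since the slide's signs and angle names did not survive extraction), composes them, and verifies orthonormality, the unit determinant, and that a z-axis rotation leaves z untouched. The helper names `rot_x`, `rot_y`, `rot_z` and the angles are made up for the example.

```python
import numpy as np

def rot_x(a):
    """Rotation by angle a (radians) about the x axis."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def rot_y(a):
    """Rotation by angle a (radians) about the y axis."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def rot_z(a):
    """Rotation by angle a (radians) about the z axis."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Compose three axis rotations (arbitrary example angles).
R = rot_z(0.3) @ rot_y(-0.7) @ rot_x(1.1)

orthonormal = np.allclose(R @ R.T, np.eye(3))  # rows/columns orthonormal
det_R = np.linalg.det(R)                       # should be +1, not -1

# A rotation about z changes x and y but not z.
p = np.array([1.0, 2.0, 3.0])
z_after = (rot_z(0.3) @ p)[2]

print(orthonormal, det_R, z_after)
```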
S: Putting it Together
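\[
S P =
\underbrace{s}_{\text{Scale}}\,
\underbrace{\begin{pmatrix}
1 & 0 & 0 \\
0 & 1 & 0
\end{pmatrix}}_{\text{Projection}}
\underbrace{\begin{pmatrix}
1 & 0 & 0 & t_x \\
0 & 1 & 0 & t_y \\
0 & 0 & 1 & t_z
\end{pmatrix}}_{\text{3D Translation}}
\underbrace{\begin{pmatrix}
r_{1,1} & r_{1,2} & r_{1,3} & 0 \\
r_{2,1} & r_{2,2} & r_{2,3} & 0 \\
r_{3,1} & r_{3,2} & r_{3,3} & 0 \\
0 & 0 & 0 & 1
\end{pmatrix}}_{\text{3D Rotation}}
P
=
\begin{pmatrix}
s_{1,1} & s_{1,2} & s_{1,3} & s t_x \\
s_{2,1} & s_{2,2} & s_{2,3} & s t_y
\end{pmatrix} P
\]

where \(s_{i,j} = s\, r_{i,j}\), so that

\[
(s_{1,1}, s_{1,2}, s_{1,3}) \cdot (s_{2,1}, s_{2,2}, s_{2,3}) = 0,
\qquad
\lVert (s_{1,1}, s_{1,2}, s_{1,3}) \rVert = \lVert (s_{2,1}, s_{2,2}, s_{2,3}) \rVert .
\]

We can just write \(s t_x\) as \(t_x\) and \(s t_y\) as \(t_y\).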
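To make the composition concrete, here is a minimal sketch that multiplies the four factors (scale, projection, 3D translation, 3D rotation) and checks the two constraints on the resulting 2 x 4 matrix. The function name `weak_perspective_matrix` and the example values of the rotation angle, translation and scale are assumptions for illustration only.

```python
import numpy as np

def weak_perspective_matrix(R, t, s):
    """Compose S = s * Projection * Translation * Rotation (a 2 x 4 matrix)."""
    projection = np.array([[1.0, 0.0, 0.0],
                           [0.0, 1.0, 0.0]])                  # 2 x 3: drop z
    translation = np.hstack(
        [np.eye(3), np.asarray(t, dtype=float).reshape(3, 1)])  # 3 x 4: [I | t]
    rotation = np.eye(4)                                      # 4 x 4: [R 0; 0 1]
    rotation[:3, :3] = R
    return s * (projection @ translation @ rotation)

# Example inputs: a rotation about z, a translation and a scale.
theta = 0.4
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
S = weak_perspective_matrix(R, t=[1.0, 2.0, 3.0], s=2.0)

# The first three entries of each row are s times a row of R, so the two
# rows are orthogonal and have the same length (the scale s).
row1, row2 = S[0, :3], S[1, :3]
print(row1 @ row2, np.linalg.norm(row1), np.linalg.norm(row2))
print(S[:, 3])  # last column is (s * t_x, s * t_y); t_z has dropped out
```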
Affine Structure from Motion
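\[
I = S P = \begin{pmatrix}
s_{1,1} & s_{1,2} & s_{1,3} & t_x \\
s_{2,1} & s_{2,2} & s_{2,3} & t_y
\end{pmatrix} P
\]

where the weak-perspective constraints are

\[
(s_{1,1}, s_{1,2}, s_{1,3}) \cdot (s_{2,1}, s_{2,2}, s_{2,3}) = 0,
\qquad
\lVert (s_{1,1}, s_{1,2}, s_{1,3}) \rVert = \lVert (s_{2,1}, s_{2,2}, s_{2,3}) \rVert .
\]

Affine structure from motion drops these two constraints and treats the entries of the 2 x 4 matrix as free parameters.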
Affine Structure-from-Motion: Two Frames (1)
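\[
\begin{pmatrix}
u^1_1 & u^1_2 & \cdots & u^1_n \\
v^1_1 & v^1_2 & \cdots & v^1_n \\
u^2_1 & u^2_2 & \cdots & u^2_n \\
v^2_1 & v^2_2 & \cdots & v^2_n
\end{pmatrix}
=
\begin{pmatrix}
s^1_{1,1} & s^1_{1,2} & s^1_{1,3} & t^1_x \\
s^1_{2,1} & s^1_{2,2} & s^1_{2,3} & t^1_y \\
s^2_{1,1} & s^2_{1,2} & s^2_{1,3} & t^2_x \\
s^2_{2,1} & s^2_{2,2} & s^2_{2,3} & t^2_y
\end{pmatrix}
\begin{pmatrix}
x_1 & x_2 & \cdots & x_n \\
y_1 & y_2 & \cdots & y_n \\
z_1 & z_2 & \cdots & z_n \\
1 & 1 & \cdots & 1
\end{pmatrix}
\]

To simplify, suppose for the first four points:

\[
\begin{pmatrix}
x_1 & x_2 & x_3 & x_4 \\
y_1 & y_2 & y_3 & y_4 \\
z_1 & z_2 & z_3 & z_4 \\
1 & 1 & 1 & 1
\end{pmatrix}
=
\begin{pmatrix}
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 \\
1 & 1 & 1 & 1
\end{pmatrix}
\]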
Affine Structure-from-Motion: Two Frames (2)
Looking at the first four points, we get:
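\[
\begin{pmatrix}
u^1_1 & u^1_2 & u^1_3 & u^1_4 \\
v^1_1 & v^1_2 & v^1_3 & v^1_4 \\
u^2_1 & u^2_2 & u^2_3 & u^2_4 \\
v^2_1 & v^2_2 & v^2_3 & v^2_4
\end{pmatrix}
=
\begin{pmatrix}
s^1_{1,1} & s^1_{1,2} & s^1_{1,3} & t^1_x \\
s^1_{2,1} & s^1_{2,2} & s^1_{2,3} & t^1_y \\
s^2_{1,1} & s^2_{1,2} & s^2_{1,3} & t^2_x \\
s^2_{2,1} & s^2_{2,2} & s^2_{2,3} & t^2_y
\end{pmatrix}
\begin{pmatrix}
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 \\
1 & 1 & 1 & 1
\end{pmatrix}
\]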
We can solve for the motion by inverting the matrix of points.
Or, explicitly, we see that the first column on the left (the images of the first point) gives the translations. After solving for these, we can solve for each column of the s components of the motion in turn, using the images of the second, third and fourth points.
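Both routes to the motion can be sketched in a few lines. The example below assumes the four points are exactly the canonical ones (the origin and the three unit axis points) and uses a randomly generated 4 x 4 motion matrix as ground truth; the variable names and the random data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-frame affine motion: rows are (frame 1: u, v; frame 2: u, v),
# columns are the three s components followed by the translation.
S_true = rng.normal(size=(4, 4))

# The four canonical points, one per column: origin, then unit x, y, z.
P4 = np.array([[0.0, 1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0, 0.0],
               [0.0, 0.0, 0.0, 1.0],
               [1.0, 1.0, 1.0, 1.0]])

I4 = S_true @ P4  # images of the four points in both frames

# Route 1: invert the matrix of points.
S_by_inverse = I4 @ np.linalg.inv(P4)

# Route 2: read the answer off column by column.
t = I4[:, 0]                      # image of the origin is the translation
S_linear = I4[:, 1:] - t[:, None] # image of each axis point minus translation
S_explicit = np.column_stack([S_linear, t])

print(np.allclose(S_by_inverse, S_true), np.allclose(S_explicit, S_true))
```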
Affine Structure-from-Motion: Two Frames (5)
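But what if the first four points aren’t so simple?
Then we define A, an affine transformation, so that:

\[
A \begin{pmatrix}
x_1 & x_2 & x_3 & x_4 \\
y_1 & y_2 & y_3 & y_4 \\
z_1 & z_2 & z_3 & z_4 \\
1 & 1 & 1 & 1
\end{pmatrix}
=
\begin{pmatrix}
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 \\
1 & 1 & 1 & 1
\end{pmatrix}
\]

This is always possible as long as the points aren’t coplanar.

Note that

\[
A = \begin{pmatrix}
a_{1,1} & a_{1,2} & a_{1,3} & a_{1,4} \\
a_{2,1} & a_{2,2} & a_{2,3} & a_{2,4} \\
a_{3,1} & a_{3,2} & a_{3,3} & a_{3,4} \\
0 & 0 & 0 & 1
\end{pmatrix}
\]

corresponds to a translation of the points plus a linear transformation.

Then, given:

\[
\begin{pmatrix}
u^1_1 & u^1_2 & \cdots & u^1_n \\
v^1_1 & v^1_2 & \cdots & v^1_n \\
u^2_1 & u^2_2 & \cdots & u^2_n \\
v^2_1 & v^2_2 & \cdots & v^2_n
\end{pmatrix}
=
\begin{pmatrix}
s^1_{1,1} & s^1_{1,2} & s^1_{1,3} & t^1_x \\
s^1_{2,1} & s^1_{2,2} & s^1_{2,3} & t^1_y \\
s^2_{1,1} & s^2_{1,2} & s^2_{1,3} & t^2_x \\
s^2_{2,1} & s^2_{2,2} & s^2_{2,3} & t^2_y
\end{pmatrix}
\begin{pmatrix}
x_1 & x_2 & \cdots & x_n \\
y_1 & y_2 & \cdots & y_n \\
z_1 & z_2 & \cdots & z_n \\
1 & 1 & \cdots & 1
\end{pmatrix}
\]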
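One way to obtain such an A, sketched here under the assumption that the four points are given in homogeneous coordinates, is to multiply the canonical matrix by the inverse of the 4 x 4 point matrix. The specific point coordinates below are made up; they were chosen to be non-coplanar so that the inverse exists.

```python
import numpy as np

# Four hypothetical non-coplanar points, one per column: (x, y, z, 1).
P4 = np.array([[1.0, 2.0, 0.5, 1.0],
               [0.0, 1.0, 3.0, 1.0],
               [2.0, 0.0, 1.0, 4.0],
               [1.0, 1.0, 1.0, 1.0]])

# Canonical configuration: origin plus the three unit axis points.
B = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0],
              [1.0, 1.0, 1.0, 1.0]])

# A maps the four points onto the canonical ones: A @ P4 == B.
# P4 is invertible exactly when the four points are not coplanar.
A = B @ np.linalg.inv(P4)

print(np.round(A, 3))
print(np.allclose(A @ P4, B))
```

The last row of A comes out as (0, 0, 0, 1), because the bottom rows of both point matrices are all ones, which is what makes A an affine transformation (linear part plus translation).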
Affine Structure-from-Motion: Two Frames (6)
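We have:

\[
\begin{pmatrix}
u^1_1 & u^1_2 & \cdots & u^1_n \\
v^1_1 & v^1_2 & \cdots & v^1_n \\
u^2_1 & u^2_2 & \cdots & u^2_n \\
v^2_1 & v^2_2 & \cdots & v^2_n
\end{pmatrix}
=
\begin{pmatrix}
s^1_{1,1} & s^1_{1,2} & s^1_{1,3} & t^1_x \\
s^1_{2,1} & s^1_{2,2} & s^1_{2,3} & t^1_y \\
s^2_{1,1} & s^2_{1,2} & s^2_{1,3} & t^2_x \\
s^2_{2,1} & s^2_{2,2} & s^2_{2,3} & t^2_y
\end{pmatrix}
A^{-1} A
\begin{pmatrix}
x_1 & x_2 & \cdots & x_n \\
y_1 & y_2 & \cdots & y_n \\
z_1 & z_2 & \cdots & z_n \\
1 & 1 & \cdots & 1
\end{pmatrix}
\]

And:

\[
\begin{pmatrix}
u^1_1 & u^1_2 & \cdots & u^1_n \\
v^1_1 & v^1_2 & \cdots & v^1_n \\
u^2_1 & u^2_2 & \cdots & u^2_n \\
v^2_1 & v^2_2 & \cdots & v^2_n
\end{pmatrix}
=
\begin{pmatrix}
s^1_{1,1} & s^1_{1,2} & s^1_{1,3} & t^1_x \\
s^1_{2,1} & s^1_{2,2} & s^1_{2,3} & t^1_y \\
s^2_{1,1} & s^2_{1,2} & s^2_{1,3} & t^2_x \\
s^2_{2,1} & s^2_{2,2} & s^2_{2,3} & t^2_y
\end{pmatrix}
A^{-1}
\begin{pmatrix}
0 & 1 & 0 & 0 & \cdots & x'_n \\
0 & 0 & 1 & 0 & \cdots & y'_n \\
0 & 0 & 0 & 1 & \cdots & z'_n \\
1 & 1 & 1 & 1 & \cdots & 1
\end{pmatrix}
\]

where primes denote the point coordinates after applying A.

Then:

\[
\begin{pmatrix}
s^1_{1,1} & s^1_{1,2} & s^1_{1,3} & t^1_x \\
s^1_{2,1} & s^1_{2,2} & s^1_{2,3} & t^1_y \\
s^2_{1,1} & s^2_{1,2} & s^2_{1,3} & t^2_x \\
s^2_{2,1} & s^2_{2,2} & s^2_{2,3} & t^2_y
\end{pmatrix}
A^{-1}
\]

is our motion. Thus, we can never determine the exact 3D structure of the scene. We can only determine it up to some transformation, A.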
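The ambiguity can be seen directly in a numerical sketch: the pair (S, P) and the pair (S A^{-1}, A P) produce identical images even though the two structures differ. The motion, the points and the affine matrix A below are arbitrary assumptions made up for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical two-frame motion (4 x 4) and six points in homogeneous form.
S = rng.normal(size=(4, 4))
P = np.vstack([rng.normal(size=(3, 6)), np.ones((1, 6))])

# An arbitrary invertible affine transformation: linear part plus translation,
# with last row (0, 0, 0, 1). The linear part has determinant 7.
A = np.eye(4)
A[:3, :3] = [[2.0, 1.0, 0.0],
             [0.0, 1.0, 1.0],
             [1.0, 0.0, 3.0]]
A[:3, 3] = [1.0, -2.0, 0.5]

I_original = S @ P
I_transformed = (S @ np.linalg.inv(A)) @ (A @ P)

# Same images, different structure: the images alone cannot tell them apart.
print(np.allclose(I_original, I_transformed))
print(np.allclose(P, A @ P))
```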
Affine Structure-from-Motion: Many frames (1)
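\[
\underbrace{\begin{pmatrix}
u^1_1 & u^1_2 & \cdots & u^1_n \\
v^1_1 & v^1_2 & \cdots & v^1_n \\
u^2_1 & u^2_2 & \cdots & u^2_n \\
v^2_1 & v^2_2 & \cdots & v^2_n \\
\vdots & \vdots & & \vdots \\
u^m_1 & u^m_2 & \cdots & u^m_n \\
v^m_1 & v^m_2 & \cdots & v^m_n
\end{pmatrix}}_{I}
=
\underbrace{\begin{pmatrix}
s^1_{1,1} & s^1_{1,2} & s^1_{1,3} & t^1_x \\
s^1_{2,1} & s^1_{2,2} & s^1_{2,3} & t^1_y \\
s^2_{1,1} & s^2_{1,2} & s^2_{1,3} & t^2_x \\
s^2_{2,1} & s^2_{2,2} & s^2_{2,3} & t^2_y \\
\vdots & \vdots & \vdots & \vdots \\
s^m_{1,1} & s^m_{1,2} & s^m_{1,3} & t^m_x \\
s^m_{2,1} & s^m_{2,2} & s^m_{2,3} & t^m_y
\end{pmatrix}}_{S}
\underbrace{\begin{pmatrix}
x_1 & x_2 & \cdots & x_n \\
y_1 & y_2 & \cdots & y_n \\
z_1 & z_2 & \cdots & z_n \\
1 & 1 & \cdots & 1
\end{pmatrix}}_{P}
\]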
First Step: Solve for Translation (1)
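• We pick the center of mass, i.e., the average of all 3D points, as the origin. Averaging over all points also averages out noise in their locations.

\[
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i, \qquad
\bar{z} = \frac{1}{n}\sum_{i=1}^{n} z_i,
\]

i.e., points are rewritten as

\[
p_i = (x_i - \bar{x},\; y_i - \bar{y},\; z_i - \bar{z}),
\]

so that, in the new coordinates, \(\sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i = \sum_{i=1}^{n} z_i = 0\).

Rotation doesn’t move the origin, which is now the center of mass. Neither does scaled orthographic projection:

\[
\sum_{i=1}^{n} R \begin{pmatrix} x_i \\ y_i \\ z_i \end{pmatrix}
= R \sum_{i=1}^{n} \begin{pmatrix} x_i \\ y_i \\ z_i \end{pmatrix}
= R \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}
= \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}
\]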
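First Step: Solve for Translation (2)

\[
\bar{u}^k = \frac{1}{n}\sum_{i=1}^{n} u^k_i, \qquad
\bar{v}^k = \frac{1}{n}\sum_{i=1}^{n} v^k_i
\]

\[
\tilde{I} = \begin{pmatrix}
u^1_1 - \bar{u}^1 & u^1_2 - \bar{u}^1 & \cdots & u^1_n - \bar{u}^1 \\
v^1_1 - \bar{v}^1 & v^1_2 - \bar{v}^1 & \cdots & v^1_n - \bar{v}^1 \\
\vdots & \vdots & & \vdots
\end{pmatrix}
\]

Because the 3D points are centered at the origin, the mean of each image row is exactly that row's translation: \(\bar{u}^k = t^k_x\) and \(\bar{v}^k = t^k_y\). Writing \(\tilde{u}^k_i = u^k_i - \bar{u}^k\) and \(\tilde{v}^k_i = v^k_i - \bar{v}^k\):

\[
\begin{pmatrix}
\tilde{u}^1_1 & \tilde{u}^1_2 & \cdots & \tilde{u}^1_n \\
\tilde{v}^1_1 & \tilde{v}^1_2 & \cdots & \tilde{v}^1_n \\
\tilde{u}^2_1 & \tilde{u}^2_2 & \cdots & \tilde{u}^2_n \\
\tilde{v}^2_1 & \tilde{v}^2_2 & \cdots & \tilde{v}^2_n \\
\vdots & \vdots & & \vdots \\
\tilde{u}^m_1 & \tilde{u}^m_2 & \cdots & \tilde{u}^m_n \\
\tilde{v}^m_1 & \tilde{v}^m_2 & \cdots & \tilde{v}^m_n
\end{pmatrix}
=
\begin{pmatrix}
s^1_{1,1} & s^1_{1,2} & s^1_{1,3} \\
s^1_{2,1} & s^1_{2,2} & s^1_{2,3} \\
s^2_{1,1} & s^2_{1,2} & s^2_{1,3} \\
s^2_{2,1} & s^2_{2,2} & s^2_{2,3} \\
\vdots & \vdots & \vdots \\
s^m_{1,1} & s^m_{1,2} & s^m_{1,3} \\
s^m_{2,1} & s^m_{2,2} & s^m_{2,3}
\end{pmatrix}
\begin{pmatrix}
x_1 & x_2 & \cdots & x_n \\
y_1 & y_2 & \cdots & y_n \\
z_1 & z_2 & \cdots & z_n
\end{pmatrix}
\]

Thus, translation can be eliminated.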
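The elimination of translation can be checked with a short sketch: with the 3D points centered at their center of mass, the mean of each image row equals that row's translation, so subtracting row means leaves only the linear part. The sizes m and n and the random motion, structure and translation are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 3, 8  # hypothetical number of frames and points

# Random 3D points, re-centered so their center of mass is the origin.
P3 = rng.normal(size=(3, n))
P3 -= P3.mean(axis=1, keepdims=True)

# Random affine motion: a 2m x 3 linear part and one translation per image row.
S3 = rng.normal(size=(2 * m, 3))
t = rng.normal(size=(2 * m, 1))

I = S3 @ P3 + t  # 2m x n measurement matrix

# Row means recover the translations; subtracting them removes translation.
t_est = I.mean(axis=1, keepdims=True)
I_tilde = I - t_est

print(np.allclose(t_est, t), np.allclose(I_tilde, S3 @ P3))
```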
Rank Theorem
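\[
\underbrace{\begin{pmatrix}
\tilde{u}^1_1 & \tilde{u}^1_2 & \cdots & \tilde{u}^1_n \\
\tilde{v}^1_1 & \tilde{v}^1_2 & \cdots & \tilde{v}^1_n \\
\tilde{u}^2_1 & \tilde{u}^2_2 & \cdots & \tilde{u}^2_n \\
\tilde{v}^2_1 & \tilde{v}^2_2 & \cdots & \tilde{v}^2_n \\
\vdots & \vdots & & \vdots \\
\tilde{u}^m_1 & \tilde{u}^m_2 & \cdots & \tilde{u}^m_n \\
\tilde{v}^m_1 & \tilde{v}^m_2 & \cdots & \tilde{v}^m_n
\end{pmatrix}}_{\tilde{I}}
=
\underbrace{\begin{pmatrix}
s^1_{1,1} & s^1_{1,2} & s^1_{1,3} \\
s^1_{2,1} & s^1_{2,2} & s^1_{2,3} \\
s^2_{1,1} & s^2_{1,2} & s^2_{1,3} \\
s^2_{2,1} & s^2_{2,2} & s^2_{2,3} \\
\vdots & \vdots & \vdots \\
s^m_{1,1} & s^m_{1,2} & s^m_{1,3} \\
s^m_{2,1} & s^m_{2,2} & s^m_{2,3}
\end{pmatrix}}_{S}
\underbrace{\begin{pmatrix}
x_1 & x_2 & \cdots & x_n \\
y_1 & y_2 & \cdots & y_n \\
z_1 & z_2 & \cdots & z_n
\end{pmatrix}}_{P}
\]

\(\tilde{I}\) has rank 3.
This means there are 3 vectors such that every row of \(\tilde{I}\) is a linear combination of these vectors. These vectors are the rows of P.
• The SVD is made to do this.
\(\tilde{I} = U D V\). D is diagonal with non-increasing values; select the first (top) three values, i.e., make D a 3 x 3 matrix. U is then 2m x 3 with orthonormal columns, and V is 3 x n with orthonormal rows.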
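A minimal sketch of the rank theorem and the SVD step, using NumPy's `numpy.linalg.svd` on synthetic, noise-free, translation-free data. The sizes and random matrices are assumptions; note that NumPy returns the third factor already transposed (here called `Vt`), so its rows play the role of V in the slide.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 4, 10  # hypothetical number of frames and points

# Synthetic centered measurements: (2m x 3) motion times (3 x n) structure.
S = rng.normal(size=(2 * m, 3))
P = rng.normal(size=(3, n))
I_tilde = S @ P

# Rank theorem: the 2m x n matrix has rank 3 (explicit tolerance for rounding).
rank = np.linalg.matrix_rank(I_tilde, tol=1e-9)

# Thin SVD, then keep the top three singular values and vectors.
U, d, Vt = np.linalg.svd(I_tilde, full_matrices=False)
U3 = U[:, :3]          # 2m x 3, orthonormal columns
D3 = np.diag(d[:3])    # 3 x 3, non-increasing diagonal
V3 = Vt[:3, :]         # 3 x n, orthonormal rows

reconstruction = U3 @ D3 @ V3
print(rank, np.allclose(reconstruction, I_tilde))
```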
Linear Ambiguity (as before)
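\(\tilde{I}\) = U(:,1:3) * D(1:3,1:3) * V(1:3,:)
\(\tilde{I}\) = (U(:,1:3) * A) * (inv(A) * D(1:3,1:3) * V(1:3,:))

Noise
• With noise, \(\tilde{I}\) has full rank.
• The best solution is to estimate a rank-3 matrix that is as near to \(\tilde{I}\) as possible.
• Our current method does this: keeping the top three singular values gives the nearest rank-3 matrix in the least-squares (Frobenius norm) sense.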
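The effect of noise, and the way truncating the SVD handles it, can be illustrated as follows. This is a sketch on synthetic data with an assumed noise level of 0.01; the comparison at the end relies on the standard fact that the truncated SVD is the closest rank-3 matrix in Frobenius norm, so it is at least as close to the noisy data as the clean rank-3 matrix is.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 4, 10  # hypothetical number of frames and points

# Clean rank-3 measurements plus a small amount of Gaussian noise.
I_clean = rng.normal(size=(2 * m, 3)) @ rng.normal(size=(3, n))
I_noisy = I_clean + 0.01 * rng.normal(size=I_clean.shape)

# The noisy matrix is full rank, so no exact rank-3 factorization exists.
rank_noisy = np.linalg.matrix_rank(I_noisy)

# Rank-3 estimate: keep only the three largest singular values.
U, d, Vt = np.linalg.svd(I_noisy, full_matrices=False)
I_hat = U[:, :3] @ np.diag(d[:3]) @ Vt[:3, :]

# Frobenius distances from the noisy data.
err_hat = np.linalg.norm(I_noisy - I_hat)
err_clean = np.linalg.norm(I_noisy - I_clean)
print(rank_noisy, err_hat, err_clean)
```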
Weak Perspective Motion
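\[
\underbrace{\begin{pmatrix}
\tilde{u}^1_1 & \tilde{u}^1_2 & \cdots & \tilde{u}^1_n \\
\tilde{v}^1_1 & \tilde{v}^1_2 & \cdots & \tilde{v}^1_n \\
\tilde{u}^2_1 & \tilde{u}^2_2 & \cdots & \tilde{u}^2_n \\
\tilde{v}^2_1 & \tilde{v}^2_2 & \cdots & \tilde{v}^2_n \\
\vdots & \vdots & & \vdots \\
\tilde{u}^m_1 & \tilde{u}^m_2 & \cdots & \tilde{u}^m_n \\
\tilde{v}^m_1 & \tilde{v}^m_2 & \cdots & \tilde{v}^m_n
\end{pmatrix}}_{\tilde{I}}
=
\underbrace{\begin{pmatrix}
s^1_{1,1} & s^1_{1,2} & s^1_{1,3} \\
s^1_{2,1} & s^1_{2,2} & s^1_{2,3} \\
s^2_{1,1} & s^2_{1,2} & s^2_{1,3} \\
s^2_{2,1} & s^2_{2,2} & s^2_{2,3} \\
\vdots & \vdots & \vdots \\
s^m_{1,1} & s^m_{1,2} & s^m_{1,3} \\
s^m_{2,1} & s^m_{2,2} & s^m_{2,3}
\end{pmatrix}}_{S}
\underbrace{\begin{pmatrix}
x_1 & x_2 & \cdots & x_n \\
y_1 & y_2 & \cdots & y_n \\
z_1 & z_2 & \cdots & z_n
\end{pmatrix}}_{P}
\]

The two rows of S that belong to the same frame (rows 2k-1 and 2k for frame k) should be orthogonal. All rows should be unit vectors.
(Push all scale into P).

\(\tilde{I}\) = (U(:,1:3)*A)*(inv(A)*D(1:3,1:3)*V(1:3,:))

Choose A so that (U(:,1:3) * A) satisfies these conditions.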
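One common way to choose A, sketched here as an assumption rather than as the method the slides prescribe, is to note that the conditions are linear in the symmetric matrix Q = A A^T: for each frame with rows a and b of U(:,1:3), we need a^T Q a = 1, b^T Q b = 1 and a^T Q b = 0. Solving for the six entries of Q by least squares and taking a Cholesky factor gives A. The helper names `sym_row` and `metric_upgrade` and the synthetic data are hypothetical, and the sketch assumes Q comes out positive definite (true for noise-free data with enough frames in general position).

```python
import numpy as np

def sym_row(a, b):
    """Coefficients of (q11, q12, q13, q22, q23, q33) in a^T Q b, Q symmetric."""
    return np.array([a[0] * b[0],
                     a[0] * b[1] + a[1] * b[0],
                     a[0] * b[2] + a[2] * b[0],
                     a[1] * b[1],
                     a[1] * b[2] + a[2] * b[1],
                     a[2] * b[2]])

def metric_upgrade(U3):
    """Find A such that each frame's two rows of U3 @ A are orthonormal."""
    rows, rhs = [], []
    for k in range(U3.shape[0] // 2):
        a, b = U3[2 * k], U3[2 * k + 1]  # the two rows of frame k (0-based)
        rows += [sym_row(a, a), sym_row(b, b), sym_row(a, b)]
        rhs += [1.0, 1.0, 0.0]
    q, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    Q = np.array([[q[0], q[1], q[2]],
                  [q[1], q[3], q[4]],
                  [q[2], q[4], q[5]]])
    return np.linalg.cholesky(Q)  # lower-triangular A with A @ A.T == Q

# Synthetic test data: each frame contributes two orthonormal rows.
rng = np.random.default_rng(4)
m, n = 5, 12
S_true = np.vstack([np.linalg.qr(rng.normal(size=(3, 3)))[0][:2]
                    for _ in range(m)])
P_true = rng.normal(size=(3, n))
I_tilde = S_true @ P_true

U, d, Vt = np.linalg.svd(I_tilde, full_matrices=False)
U3 = U[:, :3]
A = metric_upgrade(U3)
S_hat = U3 @ A                                      # recovered motion
P_hat = np.linalg.inv(A) @ np.diag(d[:3]) @ Vt[:3]  # recovered structure
gram = S_hat @ S_hat.T
print(np.round(gram[:2, :2], 6))
```

The recovered S_hat and P_hat still differ from the true motion and structure by an unknown 3D rotation (and possibly a reflection), since replacing A by A times any orthogonal matrix leaves Q unchanged.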
Related problems we won’t cover
• Missing data.
• Points with different, known noise.
• Multiple moving objects.
Final Messages
• Structure-from-egomotion for points can be reduced to linear algebra.
• Epipolar constraint reemerges.
• SVD useful.
• Rank Theorem says the images a scene produces aren’t complicated (also important for recognition).