SAMP: Shape and Motion Priors for 4D Vehicle Reconstruction
Francis Engelmann, Jörg Stückler, Bastian Leibe

Computer Vision Group, Visual Computing Institute, RWTH Aachen University

M.Sc. Francis Engelmann
Phone: +49 241 80 20760
E-Mail: [email protected]
Website: http://www.vision.rwth-aachen.de/publication/00146/

Computer Vision Group, Visual Computing Institute
RWTH Aachen University
Mies-van-der-Rohe Str. 15, D-52074 Aachen, Germany
http://www.vision.rwth-aachen.de

References

[1] X. Chen, K. Kundu, Y. Zhu, A. Berneshawi, H. Ma, S. Fidler, and R. Urtasun. 3D object proposals for accurate object class detection. In Proc. of Neural Information Processing Systems (NIPS), 2015.

[2] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proc. of the IEEE Int. Conference on Computer Vision and Pattern Recognition (CVPR), 2012.

[3] A. Geiger, M. Roser, and R. Urtasun. Efficient large-scale stereo matching. In Proc. of the Asian Conference on Computer Vision (ACCV), 2010.

[4] K. Yamaguchi, D. McAllester, and R. Urtasun. Efficient joint segmentation, occlusion labeling, stereo and flow estimation. In Proc. of the European Conference on Computer Vision (ECCV), 2014.

[5] F. Engelmann, J. Stückler, and B. Leibe. Joint object pose estimation and shape reconstruction in urban street scenes using 3D shape priors. In Proc. of the German Conference on Pattern Recognition (GCPR), 2016.

[Figure: Shape prior. Input 3D CAD models (examples shown) are converted to 3D signed distance functions, negative (< 0) inside and positive (> 0) outside the surface; reconstructed examples are shown alongside.]

[Plots: Mean and median absolute rotation error [deg] and mean and median translation error [m] vs. distance to camera [m] (10–35 m). Legend: Ours (ELAS), Ours (SPS-St.), 3DOP, [5] (ELAS), [5] (SPS-St.)]


[Plots: Shape accuracy, shape completeness, and F1-score [%] vs. threshold [m] (0.1–1.0 m), for libELAS and SPS-Stereo depth inputs. Legend: libELAS, [5] (libELAS), Ours (libELAS), SPS-Stereo, [5] (SPS-Stereo), Ours (SPS-Stereo)]


Quantitative Results

Dataset: We evaluate our method on the KITTI Stereo 2015 dataset [2] in terms of shape reconstruction and pose estimation accuracy. The dataset includes dense depth annotations and ground-truth pose annotations.
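The plots above report shape accuracy, completeness, and F1-score as a function of a distance threshold. The following is a minimal sketch of how such point-to-point scores are typically computed; the metric definitions (nearest-neighbor distances between reconstructed and ground-truth points) are our assumption of the standard protocol, not code from the authors.

```python
import numpy as np
from scipy.spatial import cKDTree

def shape_scores(recon_pts, gt_pts, threshold):
    """recon_pts: (N, 3) reconstructed points; gt_pts: (M, 3) ground-truth
    points; threshold in meters (the poster sweeps 0.1 ... 1.0 m)."""
    d_recon_to_gt = cKDTree(gt_pts).query(recon_pts)[0]    # accuracy side
    d_gt_to_recon = cKDTree(recon_pts).query(gt_pts)[0]    # completeness side
    accuracy = np.mean(d_recon_to_gt <= threshold)         # precision-like
    completeness = np.mean(d_gt_to_recon <= threshold)     # recall-like
    f1 = 2 * accuracy * completeness / max(accuracy + completeness, 1e-12)
    return accuracy, completeness, f1
```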

[Figure: Graphical model. Input: per-frame depth observations X_t and poses ξ_t, t = 1…T, from cameras Cam_{t=1…T}; output: a common shape z and optimized poses. The observed depth X_t, the hidden poses ξ_t, and the hidden shape z are linked by data terms φ(z, ξ_t, X_t), and consecutive poses by motion terms μ(ξ_{t-1}, ξ_t).]

Abstract

Inferring the pose and shape of vehicles in 3D from a movable platform remains a challenging task due to the projective sensing principle of cameras, difficult surface properties such as reflections or transparency, and illumination changes between images. In this work, we propose to use 3D shape and motion priors to regularize the estimation of the trajectory and the shape of vehicles in sequences of stereo images. We represent shapes by 3D signed distance functions and embed them in a low-dimensional manifold. Our optimization method allows for imposing a common shape across all image observations along an object track. We employ a motion model to regularize the trajectory to plausible object motions.

Problem Statement

Input. Required preprocessing steps:
· Depth estimation from stereo pairs.
· Detections and associations (tracking).
· Egomotion of the observing camera.

Output. Our method then estimates:
· A complete shape for each tracked vehicle.
· A precise 3D pose for each detection.
· Improved tracking results.

Shape Completion. Reconstruction of previously unobserved surfaces. Left: initial stereo reconstruction from [4]. Right: our full shape reconstruction, shown as a wireframe.

Pose Optimization Results. Top row: all initial poses of a track overlaid. Bottom row: all optimized poses of the same track.

Shape Prior

· The shape prior is learned from a collection of 3D CAD models.
· Each model is converted into its discrete signed distance function (SDF) representation: we sample points from the surface of a vehicle and, for each voxel, store the distance to the closest point.
· A linear low-dimensional embedding is obtained with PCA, as sketched below.
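A minimal sketch of this construction, assuming each CAD model is given as a point sample of its surface; the grid resolution, extent, and number of PCA components are illustrative choices, and the inside/outside sign computation is omitted.

```python
import numpy as np
from scipy.spatial import cKDTree

def unsigned_distance_grid(surface_pts, res=32, extent=2.5):
    """For every voxel center, store the distance to the closest surface
    point (the inside/outside sign test is omitted in this sketch)."""
    axis = np.linspace(-extent, extent, res)
    centers = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)
    dists, _ = cKDTree(surface_pts).query(centers.reshape(-1, 3))
    return dists  # flattened (res**3,) vector, one value per voxel

def pca_embedding(sdf_matrix, n_components=5):
    """Linear low-dimensional shape space from stacked per-model SDFs.
    sdf_matrix: (num_models, res**3); requires n_components <= num_models."""
    mean = sdf_matrix.mean(axis=0)
    _, _, Vt = np.linalg.svd(sdf_matrix - mean, full_matrices=False)
    basis = Vt[:n_components]                  # principal directions
    codes = (sdf_matrix - mean) @ basis.T      # latent code z per model
    return mean, basis, codes

# A shape is then decoded as a flattened SDF:  phi_z = mean + codes[i] @ basis
```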

Quantitative Results. Top: mean absolute rotation and translation error. Bottom: reconstructed shape accuracy, comparing different input depths.

Probabilistic Interpretation

$$p(z, \xi \mid \mathcal{X}) = \eta \, p(\mathcal{X} \mid z, \xi) \, p(z) \, p(\xi)$$

$$\{z, \xi\}^{*} = \operatorname*{arg\,max}_{z,\,\xi} \; p(z, \xi \mid \mathcal{X})$$

Qualitative Results

KITTI Stereo 2015. Top: input stereo image. Bottom: 3D shape and trajectory.

Qualitative Results. Top row: input consisting of stereo reconstruction and tracked detections. Bottom row: full shape reconstruction and improved pose estimates.

Our Model

· Given: a set of detections associated over time into tracks.
· Find: the common shape z of the tracked vehicle and its poses ξ = {ξ_t}, given the 3D depth observations X_t of the detection at time t.

We maximize the posterior p(z, ξ | X) by minimizing the energy function obtained by applying the negative logarithm:

$$E(z, \xi) = \frac{1}{T} \sum_{t} \Big[ \underbrace{\varphi(z, \xi_t, \mathcal{X}_t)}_{\text{data term}} + \underbrace{\mu(\xi_t, \xi_{t-1})}_{\text{motion term}} \Big] + \underbrace{\psi(z)}_{\text{shape regul.}}$$
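For completeness, taking the negative logarithm of the factored posterior above identifies the energy term by term (a one-step sketch; the poster does not spell this step out):

```latex
% -log applied to  p(z, xi | X) = eta * p(X | z, xi) * p(z) * p(xi);
% -log(eta) is constant in (z, xi) and drops out of the optimization.
-\log p(z, \xi \mid \mathcal{X})
  = \underbrace{-\log p(\mathcal{X} \mid z, \xi)}_{\textstyle \sum_t \varphi(z, \xi_t, \mathcal{X}_t)}
  \; \underbrace{-\log p(\xi)}_{\textstyle \sum_t \mu(\xi_t, \xi_{t-1})}
  \; \underbrace{-\log p(z)}_{\textstyle \psi(z)}
  \; + \; \mathrm{const}
```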

Data Term

The data term adjusts the shape and the pose of the vehicle so that they correspond to the observed depth:

$$\varphi(z, \xi, \mathcal{X}) = \frac{1}{N} \sum_{x \in \mathcal{X}} \rho\!\left( \frac{\phi_z(T_{\xi}\, x)}{\sigma_{d_x}} \right)$$

· φ_z : Signed distance function parametrized by z.
· T_ξ : Transformation matrix induced by pose ξ.
· σ_{d_x} : Depth uncertainty.
· ρ : Huber loss.
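A minimal sketch of evaluating the data term, assuming the SDF φ_z is stored on a regular voxel grid; the helper names, grid layout, and trilinear lookup via scipy are our illustrative choices, not the authors' implementation.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def huber(r, delta=1.0):
    """Huber loss rho(.): quadratic near zero, linear in the tails."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * a**2, delta * (a - 0.5 * delta))

def data_term(sdf_grid, T_xi, points, sigma, voxel_size, origin):
    """phi(z, xi, X) = 1/N * sum_x rho( phi_z(T_xi x) / sigma_x ).
    sdf_grid: (res, res, res) values of phi_z; T_xi: 4x4 pose matrix;
    points: (N, 3) observed 3D points; sigma: scalar or (N,) depth std."""
    # Transform observed points into the object/SDF frame.
    hom = np.c_[points, np.ones(len(points))]
    obj = (T_xi @ hom.T).T[:, :3]
    # Continuous voxel coordinates, then trilinear SDF lookup.
    coords = ((obj - origin) / voxel_size).T       # shape (3, N)
    sdf_vals = map_coordinates(sdf_grid, coords, order=1, mode="nearest")
    return np.mean(huber(sdf_vals / sigma))
```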

Motion Term

The motion term enforces consistent poses along the trajectory under an estimated motion model:

$$\mu(\xi_t, \xi_{t-1}) = \lVert \xi_t - g(\xi_{t-1}) \rVert^{2}_{\Sigma} + \underbrace{\lVert \xi^{y}_{t} - \xi^{y}_{gp} \rVert^{2}_{\Sigma_{gp}}}_{\text{ground plane prior}}$$

· g : Motion model depending on the vehicle's dynamic behavior.

Our approach incorporates three separate motion models (see the sketch after this list):
1. Static: the vehicle is not moving.
2. Line: the vehicle drives forward on a straight line.
3. Turn: the vehicle is currently taking a turn.
The motion model is chosen based on a vehicle's initial track.
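A toy sketch of the three motion models as prediction functions g(ξ) on a simplified 2D pose (x, y, heading θ), with a heuristic model selector; the real system works on full 3D poses, and its selection rule is not detailed on the poster.

```python
import numpy as np

def g_static(xi, dt=0.1):                  # 1. Static: pose unchanged
    return np.asarray(xi, dtype=float)

def g_line(xi, v=5.0, dt=0.1):             # 2. Line: straight forward motion
    x, y, th = xi
    return np.array([x + v * dt * np.cos(th), y + v * dt * np.sin(th), th])

def g_turn(xi, v=5.0, omega=0.3, dt=0.1):  # 3. Turn: constant yaw rate
    x, y, th = xi
    return np.array([x + v * dt * np.cos(th), y + v * dt * np.sin(th),
                     th + omega * dt])

def select_motion_model(initial_track, eps_move=0.2, eps_turn=0.05):
    """Pick a model from the vehicle's initial track (heuristic sketch)."""
    track = np.asarray(initial_track)      # rows: (x, y, theta) per frame
    displacement = np.linalg.norm(track[-1, :2] - track[0, :2])
    heading_change = abs(track[-1, 2] - track[0, 2])
    if displacement < eps_move:
        return g_static
    return g_turn if heading_change > eps_turn else g_line
```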


This project has received funding from the European Research Council (ERC) CV-SUPER (ERC-2012-StG-307432).


[Figure: Pipeline. Preprocessing steps: vehicle detection (Chen et al., NIPS'15 [1]), stereo reconstruction (libELAS [3], SPS-Stereo [4]), and tracking (Geiger et al., PAMI'14) provide the depth observations X and initial poses ξ⁰; our optimization then estimates the shape z and the refined poses ξ.]