May 03, 2020

Supplementary Material: Monocular 3D Object Detection for Autonomous Driving

Xiaozhi Chen¹, Kaustav Kundu², Ziyu Zhang², Huimin Ma¹, Sanja Fidler², Raquel Urtasun²

¹ Department of Electronic Engineering, Tsinghua University
² Department of Computer Science, University of Toronto

{chenxz12@mails., mhmpub@}tsinghua.edu.cn, {kkundu, zzhang, fidler, urtasun}@cs.toronto.edu

Abstract

In this supplementary material we include a larger set of additional experiments and visualizations. We start with comparisons to the state of the art on the KITTI test set, followed by an in-depth analysis of 2D and 3D bounding box recall as a function of the number of proposals as well as the distance to the obstacle. The latter is a very important metric in the context of autonomous driving. We then show an ablation study of the features employed, followed by some additional visualizations.

1. Object Detection and Orientation Estimation Performance

Fig. 1 and Fig. 2 show a comparison to all published monocular methods on the KITTI benchmark. Our approach achieves the highest AP and AOS scores across all categories and difficulty levels. Note that we also provide a video visualizing our results in 2D and 3D.
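The text does not restate how AOS is computed. The sketch below assumes the standard KITTI definition: at each recall level the orientation similarity is the mean of (1 + cos Δθ) / 2 over all detections, with false positives contributing zero, and AOS is the 11-point interpolated average of that similarity over recall. The function names are illustrative and are not taken from the KITTI development kit.

```python
import math

def orientation_similarity(delta_thetas, is_true_positive):
    """Similarity at one recall level: mean of (1 + cos(delta)) / 2 over all
    detections, where false positives contribute zero (assumed KITTI rule)."""
    if not delta_thetas:
        return 0.0
    total = sum(
        (1.0 + math.cos(d)) / 2.0 if tp else 0.0
        for d, tp in zip(delta_thetas, is_true_positive)
    )
    return total / len(delta_thetas)

def eleven_point_average(recalls, values):
    """Average over r in {0, 0.1, ..., 1} of the best value at recall >= r."""
    total = 0.0
    for i in range(11):
        r = i / 10.0
        candidates = [v for rec, v in zip(recalls, values) if rec >= r - 1e-9]
        total += max(candidates) if candidates else 0.0
    return total / 11.0
```

Passing a precision curve to `eleven_point_average` gives an interpolated AP in the same way, so the two scores share one averaging routine and differ only in the per-recall quantity.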

2. Proposal Recall

In this section we report recall of our proposals in several regimes.

2D Bounding Box Recall: We show recall versus the number of proposals in Fig. 3, and recall versus IoU overlap threshold in Fig. 4, Fig. 5, and Fig. 6, with the number of proposals fixed to 500, 1000, and 2000, respectively. We also report average recall (AR) as a function of the number of proposals in Fig. 7. Our approach outperforms all monocular methods by a large margin, while being competitive with 3DOP [1]. Note that 3DOP uses stereo imagery, so the comparison with it is not a fair one.
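As a minimal sketch of the quantity plotted in these figures, the code below computes 2D IoU and the recall of the top-K proposals at a given IoU threshold for a single image. The box format (x1, y1, x2, y2), the function names, and the assumption that proposals are already sorted by descending score are illustrative choices, not details given in the text.

```python
def iou_2d(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def recall_at(gt_boxes, proposals, num_proposals, iou_threshold):
    """Fraction of ground-truth boxes covered by at least one of the
    top-`num_proposals` proposals (assumed sorted by descending score)."""
    if not gt_boxes:
        return 0.0
    top = proposals[:num_proposals]
    hits = sum(
        1 for g in gt_boxes
        if any(iou_2d(g, p) >= iou_threshold for p in top)
    )
    return hits / len(gt_boxes)
```

Sweeping `num_proposals` at a fixed threshold gives a curve of the kind shown in Fig. 3, while sweeping `iou_threshold` at a fixed proposal budget gives curves of the kind shown in Fig. 4 to Fig. 6. A full evaluation would accumulate hits and ground-truth counts over all images.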

Recall vs Distance: We report recall as a function of the distance from the ego-car in Fig. 8. It can be seen that our approach achieves very high recall even when the distance is quite large (beyond 40 m). This shows the advantage of our approach in the setting of autonomous driving.
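One way to obtain such a curve is to group ground-truth objects by their distance from the ego-car and compute recall per group. The text does not state whether the plot uses disjoint bins or a cumulative range, so the binning below is an assumption, as are the function name and the input layout.

```python
def recall_by_distance(distances, covered, bin_edges):
    """Per-bin recall. `distances[i]` is the distance in meters of
    ground-truth object i, and `covered[i]` says whether some proposal
    matched it. Bins are [edge_k, edge_{k+1}); empty bins yield None."""
    num_bins = len(bin_edges) - 1
    hits = [0] * num_bins
    counts = [0] * num_bins
    for d, c in zip(distances, covered):
        for k in range(num_bins):
            if bin_edges[k] <= d < bin_edges[k + 1]:
                counts[k] += 1
                hits[k] += 1 if c else 0
                break
    return [h / n if n else None for h, n in zip(hits, counts)]
```

The `covered` flags would come from a matching step at the class-specific IoU threshold (0.7 for Car, 0.5 for Pedestrian and Cyclist, as in the Fig. 8 caption).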

3D Bounding Box Recall: We also compare the 3D bounding box recall of our monocular approach with that of 3DOP [1], which, however, exploits stereo imagery. Fig. 9 shows 3D box recall as a function of the number of proposals. We set the 3D IoU overlap threshold to 0.25 for all categories. Although we do not exploit any depth features, our approach achieves 3D recall similar to that of 3DOP on Car. For small objects, i.e., Pedestrian and Cyclist, our results are also promising.
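To illustrate the 3D overlap criterion, the sketch below computes IoU for axis-aligned 3D boxes. This is a deliberate simplification: boxes in KITTI carry a yaw angle, and the text does not describe how the 3D IoU is evaluated, so the code shows only the volume-ratio idea and not the evaluation actually used. The box layout (x1, y1, z1, x2, y2, z2) is an assumption.

```python
def iou_3d_axis_aligned(a, b):
    """IoU of two axis-aligned 3D boxes given as (x1, y1, z1, x2, y2, z2).
    Simplified: ignores box rotation, unlike oriented KITTI boxes."""
    inter = 1.0
    for lo, hi in ((0, 3), (1, 4), (2, 5)):
        overlap = min(a[hi], b[hi]) - max(a[lo], b[lo])
        if overlap <= 0:
            return 0.0  # boxes are disjoint along this axis
        inter *= overlap
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    return inter / (vol_a + vol_b - inter)
```

For example, two 2 x 2 x 2 boxes offset by half their length along one axis have an IoU of 1/3, which passes the 0.25 threshold used here but would fail a 0.5 threshold.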

3. Ablation Study of Features

We conduct a detailed analysis of different types of features on Car proposals and show recall plots in Fig. 10, Fig. 11, and Fig. 12. Note that each feature helps improve performance. We also study their effect on car detection and orientation estimation. As shown in Table 1, each type of feature improves AP, and AOS follows the same trend up to differences of about 0.1 points, mirroring the behavior on proposal recall.


Table 1: Ablation study of features on Object Detection and Orientation Estimation: AP and AOS for Car on the validation set of KITTI.

Approach    AP Easy   AP Moderate   AP Hard   AOS Easy   AOS Moderate   AOS Hard
Loc         83.59     77.50         69.41     81.56      75.09          66.39
+ClsSeg     87.96     86.76         78.16     86.40      84.66          75.72
+Context    93.17     88.23         79.34     91.59      86.15          76.94
+Shape      93.52     88.51         79.62     91.49      86.38          77.15
+InstSeg    93.89     88.67         79.68     91.90      86.28          77.09
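To make the per-feature contribution in Table 1 explicit, the following sketch computes the AP change from each row to the next. The AP values are copied from the table. The dictionary layout and the function name are illustrative.

```python
# AP for Car (Easy, Moderate, Hard), copied from Table 1.
AP = {
    "Loc": (83.59, 77.50, 69.41),
    "+ClsSeg": (87.96, 86.76, 78.16),
    "+Context": (93.17, 88.23, 79.34),
    "+Shape": (93.52, 88.51, 79.62),
    "+InstSeg": (93.89, 88.67, 79.68),
}

def incremental_gains(table):
    """Difference between each row and the previous one, per difficulty."""
    names = list(table)  # dicts preserve insertion order (Python 3.7+)
    return {
        names[i]: tuple(
            round(cur - prev, 2)
            for prev, cur in zip(table[names[i - 1]], table[names[i]])
        )
        for i in range(1, len(names))
    }

gains = incremental_gains(AP)
```

Read this way, class segmentation contributes the largest single step (about 9.3 AP points on Moderate), context adds most on Easy (about 5.2 points), and shape and instance segmentation each add less than half a point.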

4. Visualization

We visualize some qualitative results in Fig. 13 and Fig. 14. We show the top 50 proposals, 2D detections, and 3D detections for each example. We also show some failure examples in Fig. 15. Most failures are propagated from errors in class semantic segmentation. Road segmentation particularly affects the results, as our approach infers the extent of the 3D space from the road region.

5. Video

We refer the reader to the attached video for more visualizations of results. We note that no temporal information is used to create the video, and all results are obtained from a single monocular image.

References

[1] X. Chen, K. Kundu, Y. Zhu, A. Berneshawi, H. Ma, S. Fidler, and R. Urtasun. 3D object proposals for accurate object class detection. In NIPS, 2015.


[Figure 1 plots omitted: Precision (y-axis, 0 to 1) vs Recall (x-axis, 0 to 1) curves, one panel per class (Car, Pedestrian, Cyclist) and difficulty (Easy, Moderate, Hard). The AP values from the plot legends are listed below.]

Car
Method          Easy    Moderate   Hard
ACF             55.89   54.74      42.98
LSVM-MDPM-us    66.53   55.42      41.04
LSVM-MDPM-sv    68.02   56.48      44.18
ACF-SC          69.11   58.66      45.95
DPM-C8B1        74.33   60.99      47.16
MDPM-un-BB      71.19   62.16      48.43
DPM-VOC+VP      74.95   64.71      48.76
OC-DPM          74.94   65.95      53.86
MV-RGBD-RF      76.40   69.92      57.47
SubCat          84.14   75.46      59.71
3DVP            87.46   75.77      65.38
AOG             84.80   75.94      60.70
Regionlets      84.75   76.45      59.70
spLBP           87.19   77.40      60.60
Faster R-CNN    86.71   81.84      71.12
Ours            92.33   88.66      78.96

Pedestrian
Method          Easy    Moderate   Hard
ACF             44.49   39.81      37.21
SubCat          54.67   42.34      37.95
SquaresICF      57.33   44.42      40.08
ACF-SC          51.53   44.49      40.38
DPM-VOC+VP      59.48   44.86      40.37
HA-SSVM         56.36   45.51      41.08
ACF-MR          58.82   46.23      42.10
Fusion-DPM      59.51   46.67      42.05
R-CNN           61.61   50.13      44.79
pAUCEnsT        65.26   54.49      48.60
MV-RGBD-RF      73.30   56.59      49.63
FilteredICF     67.65   56.75      51.12
DeepParts       70.49   58.67      52.78
CompACT-Deep    70.69   58.74      52.71
Regionlets      73.14   61.15      55.21
Faster R-CNN    78.86   65.90      61.18
Ours            80.35   66.68      63.44

Cyclist
Method          Easy    Moderate   Hard
LSVM-MDPM-sv    35.04   27.50      26.21
DPM-C8B1        43.49   29.04      26.20
LSVM-MDPM-us    38.84   29.88      27.31
DPM-VOC+VP      42.43   31.08      28.23
Vote3D          41.43   31.24      28.60
pAUCEnsT        51.62   38.03      33.38
MV-RGBD-RF      52.97   42.61      37.42
Regionlets      70.41   58.72      51.83
Faster R-CNN    72.26   63.35      55.90
Ours            76.04   66.36      58.87

Figure 1: Precision vs Recall curves on KITTI test set. The number next to the label indicates the Average Precision (AP).


[Figure 2 plots omitted: Orientation Similarity (y-axis, 0 to 1) vs Recall (x-axis, 0 to 1) curves, one panel per class (Car, Pedestrian, Cyclist) and difficulty (Easy, Moderate, Hard). The AOS values from the plot legends are listed below.]

Car
Method          Easy    Moderate   Hard
DPM-C8B1        59.51   50.32      39.22
LSVM-MDPM-sv    67.27   55.77      43.59
DPM-VOC+VP      72.28   61.84      46.54
OC-DPM          73.50   64.42      52.40
SubCat          83.41   74.42      58.83
3DVP            86.92   74.59      64.11
Ours            91.44   86.10      76.52

Pedestrian
Method          Easy    Moderate   Hard
ACF-MR          29.33   23.18      21.00
DPM-C8B1        31.08   23.37      20.72
SubCat          44.32   34.18      30.76
LSVM-MDPM-sv    43.58   35.49      32.42
DPM-VOC+VP      53.55   39.83      35.73
Ours            72.94   59.80      57.03

Cyclist
Method          Easy    Moderate   Hard
DPM-C8B1        27.25   19.25      17.95
LSVM-MDPM-sv    27.54   22.07      21.45
DPM-VOC+VP      30.52   23.17      21.58
Ours            70.13   58.68      52.35

Figure 2: Orientation Similarity vs Recall curves on KITTI test set. The number next to the label indicates the Average Orientation Similarity (AOS).


[Figure 3 plots omitted: recall at a fixed IoU threshold (y-axis, 0 to 1) vs # candidates (x-axis, log scale, 10^1 to 10^4) for BING, SS, EB, MCG, MCG-D, 3DOP, and Ours. Panels: Car, Pedestrian, and Cyclist, each for Easy, Moderate, and Hard.]

Figure 3: 2D bounding box Recall vs #Candidates. We use an overlap threshold of 0.7 for Car, and 0.5 for Pedestrian and Cyclist. From left to right: easy, moderate, and hard objects.


[Figure 4 plots omitted: recall (y-axis, 0 to 1) vs IoU overlap threshold (x-axis, 0.5 to 1), one panel per class and difficulty. The AR values from the plot legends are listed below.]

Panel                   BING   SS     EB     MCG    MCG-D   3DOP   Ours
Car (Easy)              12.3   26.7   37.4   45.1   49.6    65.8   66.5
Car (Moderate)          7.7    18     26.4   36.1   38.8    57.5   59.9
Car (Hard)              7      16.5   23     31.3   32.8    57     57
Pedestrian (Easy)       7.6    5.4    9.2    15     19.6    49.3   40.6
Pedestrian (Moderate)   6.6    5.1    7.7    13.2   16.1    43.6   36.8
Pedestrian (Hard)       6.1    5      6.9    12.2   14      38.6   34.6
Cyclist (Easy)          6.1    7.4    5      10.9   10.8    52.7   36.2
Cyclist (Moderate)      4.1    6      4.4    8      10.2    37.6   32.5
Cyclist (Hard)          4.3    6.1    4.4    8.2    10.7    37.7   32.1

Figure 4: 2D bounding box Recall vs IoU for 500 proposals. The number next to the label indicates the average recall (AR).
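The document does not define AR. A common definition in the proposal-evaluation literature, assumed here, is the recall averaged over IoU thresholds between 0.5 and 1, i.e. the area under the recall-vs-IoU curve divided by the interval length 0.5. The sketch below approximates this with the trapezoidal rule; the function name, the input (the best IoU achieved for each ground-truth box), and the grid size are illustrative.

```python
def average_recall(best_ious, num_steps=50):
    """AR over IoU thresholds in [0.5, 1]. `best_ious[i]` is the highest
    IoU any proposal achieves with ground-truth box i."""
    if not best_ious:
        return 0.0
    thresholds = [0.5 + 0.5 * k / num_steps for k in range(num_steps + 1)]
    recalls = [
        sum(1 for v in best_ious if v >= t) / len(best_ious)
        for t in thresholds
    ]
    # Trapezoidal area under recall(threshold), normalised by the
    # interval length 0.5 so that perfect recall gives AR = 1.
    step = 0.5 / num_steps
    area = sum(
        (recalls[k] + recalls[k + 1]) / 2.0 * step for k in range(num_steps)
    )
    return area / 0.5
```

Under this definition AR rewards tight localization: a proposal set that only just reaches IoU 0.5 on every object scores near 0, while one that matches every object exactly scores 1. The legend values appear to be reported as percentages.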


[Figure 5 plots omitted: recall (y-axis, 0 to 1) vs IoU overlap threshold (x-axis, 0.5 to 1), one panel per class and difficulty. The AR values from the plot legends are listed below.]

Panel                   BING   SS     EB     MCG    MCG-D   3DOP   Ours
Car (Easy)              13.4   37.7   46.9   52.8   58.3    66.6   67
Car (Moderate)          8.6    27     35.5   43.5   46.6    61.2   62.6
Car (Hard)              8.5    24.7   31.7   38.5   40.3    61.2   61.4
Pedestrian (Easy)       10.1   10.5   17.9   18.5   26.1    57.5   44.8
Pedestrian (Moderate)   8.9    9.6    15.2   16.6   21.6    52     42.1
Pedestrian (Hard)       8.2    9.1    13.5   15.3   19      46.9   40.3
Cyclist (Easy)          9.2    13     12.7   14.8   17.9    57.9   37.6
Cyclist (Moderate)      5.9    10.1   9.3    11     15.1    44.3   35.9
Cyclist (Hard)          6.2    10.2   9.4    11.1   15.6    44.2   36

Figure 5: 2D bounding box Recall vs IoU for 1000 proposals. The number next to the label indicates the average recall (AR).


[Figure 6 plots omitted: recall (y-axis, 0 to 1) vs IoU overlap threshold (x-axis, 0.5 to 1), one panel per class and difficulty. The AR values from the plot legends are listed below.]

Panel                   BING   SS     EB     MCG    MCG-D   3DOP   Ours
Car (Easy)              14     48.2   54.4   59.6   61.8    67.4   67.2
Car (Moderate)          9.1    37.4   44.2   51.7   50.9    63.9   64.2
Car (Hard)              9.3    34.5   40.5   46.8   45      64.1   63.9
Pedestrian (Easy)       10.6   16.5   28.4   23.2   32.5    62.3   47.6
Pedestrian (Moderate)   9.3    14.8   24.3   21.1   26.9    57.7   46.2
Pedestrian (Hard)       8.6    13.9   21.9   19.3   23.9    53.6   44.9
Cyclist (Easy)          9.4    20.2   23.2   18.8   24.1    59.3   39
Cyclist (Moderate)      6      16     16.7   14.8   19.9    50.1   40
Cyclist (Hard)          6.4    16     16.5   15.2   20.3    50.3   40.1

Figure 6: 2D bounding box Recall vs IoU for 2000 proposals. The number next to the label indicates the average recall (AR).


[Figure 7 plots omitted: average recall (y-axis, 0 to 1) vs # candidates (x-axis, log scale, 10^1 to 10^4) for BING, SS, EB, MCG, MCG-D, 3DOP, and Ours. Panels: Car, Pedestrian, and Cyclist, each for Easy, Moderate, and Hard.]

Figure 7: Average Recall (AR) vs #Candidates for 2D bounding boxes. Note that the comparison to 3DOP and MCG-D is unfair as we use a monocular image while they exploit depth information.


[Figure 8 plots omitted: recall at a fixed IoU threshold (y-axis, 0 to 1) vs distance from ego-car in meters (x-axis) for BING, SS, EB, MCG, MCG-D, 3DOP, and Ours. Panels: Car, Pedestrian, and Cyclist, each for Easy, Moderate, and Hard.]

Figure 8: 2D bounding box Recall vs Distance using 2000 proposals. We use an overlap threshold of 0.7 for Car, and 0.5 for Pedestrian and Cyclist.


[Figure 9 plots omitted: recall at IoU threshold 0.25 (y-axis, 0 to 1) vs # candidates (x-axis, log scale, 10^1 to 10^4) for 3DOP and Ours. Panels: Car, Pedestrian, and Cyclist, each for Easy, Moderate, and Hard.]

Figure 9: 3D bounding box Recall vs #Candidates at IoU threshold of 0.25. Note that our monocular approach achieves 3D box recall on Car similar to that of 3DOP, which exploits stereo imagery.


[Figure 10 plots omitted: average recall (y-axis, 0 to 1) vs # candidates (x-axis, log scale, 10^1 to 10^4) for Loc, +ClsSeg, +Context, +Shape, and +InstSeg. Panels: Car (Easy), Car (Moderate), Car (Hard).]

Figure 10: Ablation study of features on car proposals: Average Recall (AR) vs #Candidates for 2D bounding boxes. The basic model (Loc) uses only the location prior feature. We then gradually add other types of features: class segmentation, context, shape, and instance segmentation.

[Figure 11 plots omitted: recall at IoU threshold 0.7 (y-axis, 0 to 1) vs # candidates (x-axis, log scale, 10^1 to 10^4) for Loc, +ClsSeg, +Context, +Shape, and +InstSeg. Panels: Car (Easy), Car (Moderate), Car (Hard).]

Figure 11: Ablation study of features on car proposals: 2D bounding box Recall vs #Candidates at IoU threshold of 0.7. The basic model (Loc) uses only the location prior feature. We then gradually add other types of features: class segmentation, context, shape, and instance segmentation.

[Figure 12 plots omitted: recall (y-axis, 0 to 1) vs IoU overlap threshold (x-axis, 0.5 to 1). The AR values from the plot legends are listed below.]

Panel            Loc    +ClsSeg   +Context   +Shape   +InstSeg
Car (Easy)       54.5   61.7      64         67.2     67.2
Car (Moderate)   49.8   58.5      60.8       64.1     64.2
Car (Hard)       50.3   58.9      60.1       63.4     63.9

Figure 12: Ablation study of features on car proposals: 2D bounding box Recall vs IoU for 2000 proposals. The basic model (Loc) uses only the location prior feature. We then gradually add other types of features: class segmentation, context, shape, and instance segmentation.


Figure 13: Qualitative examples of detection results for Cars: (left) top 50 scoring proposals (color from blue to red indicates increasing score), (middle) 2D detections, (right) 3D detections.


Figure 14: Qualitative examples of detection results for Pedestrians and Cyclists: (left) top 50 scoring proposals (color from blue to red indicates increasing score), (middle) 2D detections, (right) 3D detections.


Figure 15: Qualitative examples of failure cases: (left) input images, (middle) semantic segmentation for Car (red), Pedestrian (green), Cyclist (blue), and Road (yellow), and (right) best 3D proposals among 2K candidates. The correct detections are indicated in blue and the missed detections are in red. Most failure cases are due to class or road segmentation errors.
