
Monocular 3D Object Detection with Pseudo-LiDAR Point Cloud
Xinshuo Weng, Kris Kitani

Carnegie Mellon University
{xinshuow, kkitani}@cs.cmu.edu

Background & Motivation

• Goal (Monocular 3D Object Detection): estimate the object size (width, height, length), heading angle, and center location (x, y, z) in 3D space from a single input image.

• Modern-day methods for 3D object detection require the use of a 3D sensor (e.g., LiDAR). On the other hand, single-image-based methods have significantly worse performance.

• To bridge the performance gap between 3D sensing and 2D sensing for 3D object detection, we introduce an intermediate 3D point cloud representation of the data, referred to as "pseudo-LiDAR", obtained by lifting image pixels into 3D space based on the estimated depth (a code sketch of this lifting follows the list).

• To handle the large amount of noise in the generated pseudo-LiDAR caused by inaccurate depth estimation, we propose two innovations: (1) use the instance mask instead of the bounding box as the representation of 2D proposals; (2) use a 2D-3D bounding box consistency (BBC) constraint.
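As a concrete illustration of the lifting referenced above and of innovation (1), here is a minimal NumPy sketch that back-projects a depth map into pseudo-LiDAR and optionally keeps only the pixels inside an instance mask. It assumes a pinhole camera with intrinsics fx, fy, cx, cy; the function name, interface, and example intrinsics are illustrative, not the authors' code.

```python
import numpy as np

def depth_to_pseudo_lidar(depth, fx, fy, cx, cy, mask=None):
    """Back-project a depth map into a camera-frame point cloud.

    depth : (H, W) array of metric depths along the camera z-axis.
    fx, fy, cx, cy : pinhole intrinsics (focal lengths, principal point).
    mask : optional (H, W) boolean instance mask; if given, only masked
           pixels are lifted, so background pixels never enter the frustum.
    Returns an (N, 3) array of 3D points.
    """
    h, w = depth.shape
    # Pixel grid: u indexes columns (image x), v indexes rows (image y).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx  # pinhole back-projection
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    if mask is not None:
        points = points[mask.reshape(-1)]  # instance mask proposal filter
    return points

# Illustrative call with KITTI-like intrinsics (placeholder values):
# frustum = depth_to_pseudo_lidar(depth_map, 721.5, 721.5, 609.6, 172.9,
#                                 mask=car_mask)
```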

Proposed Pipeline & Results

[Pipeline figure. (a) Pseudo-LiDAR Generation: a monocular depth estimation network predicts a depth map for the input image, which the camera matrix lifts into a pseudo-LiDAR point cloud. (b) 2D Instance Mask Proposal Detection: an instance segmentation network outputs instance mask proposals (2D proposal loss Lpp2d) that carve point cloud frustums out of the pseudo-LiDAR. (c) Amodal 3D Object Detection with 2D-3D Bounding Box Consistency: 3D point cloud segmentation (loss Lseg3d) yields a segmented point cloud; a 3D box estimation module predicts an initial 3D bounding box, i.e., center (x, y, z), size (h, w, l), and heading angle θ, under the 3D bounding box loss Lbox3d; a 3D box correction module then predicts a correction (Δx, Δy, Δz, Δh, Δw, Δl, Δθ) that is added to the prediction, and the final estimate is projected through the camera matrix to compute the bounding box consistency loss (BBCL) Lbbc.]

(a) Every pixel of the input image is lifted to 3D coordinates given the estimated depth to generate pseudo-LiDAR; (b) instance mask proposals are detected for extracting point cloud frustums; (c) a 3D bounding box (blue) is estimated for each point cloud frustum and made consistent with the corresponding 2D proposal.
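The figure names four training losses (Lpp2d, Lseg3d, Lbox3d, Lbbc) but does not spell out how they are combined. A plausible form, assumed here rather than taken from the poster (including the weight λ), is a weighted sum in which the BBC term supervises the correction module through the 2D projection of the final estimate:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{pp2d}} + \mathcal{L}_{\text{seg3d}} + \mathcal{L}_{\text{box3d}} + \lambda\,\mathcal{L}_{\text{bbc}}$$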

Qualitative Results

Quantitative Results

Method          |  AP_BEV / AP_3D (%), IoU = 0.5          |  AP_BEV / AP_3D (%), IoU = 0.7
                |  Easy         Moderate     Hard         |  Easy         Moderate     Hard
----------------+-----------------------------------------+----------------------------------------
ROI-10D [1]     |  46.9 / 37.6  34.1 / 25.1  30.5 / 21.8  |  14.5 / 9.6   9.9 / 6.6    8.7 / 6.3
MonoGRNet [2]   |     - / 50.5     - / 37.0     - / 30.8  |     - / 13.9     - / 10.2     - / 7.6
MLF-MONO [4]    |  55.0 / 47.9  36.7 / 29.5  31.3 / 26.4  |  22.0 / 10.5  13.6 / 5.7   11.6 / 5.4
PL-MONO [3]     |  70.8 / 66.3  49.4 / 42.3  42.7 / 38.5  |  40.6 / 28.2  26.3 / 18.5  22.9 / 16.4
Ours            |  72.1 / 68.4  53.1 / 48.3  44.6 / 43.0  |  41.9 / 31.5  28.3 / 21.0  24.5 / 17.5

Analysis

Effectiveness of Instance Mask Proposal

• Conclusion: lifting only the pixels within the instance mask proposal (right) significantly removes the points not enclosed by the ground truth box.

Effect of Bounding Box Consistency

[Two scatter plots, both axes spanning 0.0 to 1.0, illustrating the effect of the box correction on IoU.]

• Conclusion: by adjusting the 3D bounding box estimate in 3D space so that its 2D projection attains a higher 2D IoU with the corresponding 2D proposal, the 3D IoU of the 3D bounding box estimate with its ground truth is also increased.
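To make the consistency constraint concrete, the sketch below projects the eight corners of a 3D box through the camera matrix, takes the tight 2D box around them, and scores it against the 2D proposal with 1 - IoU. It assumes a camera-frame box parameterized by its centroid, size (h, w, l), and heading angle about the y-axis; names and conventions are illustrative, and the poster's Lbbc would need a differentiable IoU surrogate for training, which this sketch omits.

```python
import numpy as np

def box3d_corners(center, size, ry):
    """8 corners of a 3D box: centroid `center`, size (h, w, l),
    heading angle `ry` about the camera y-axis."""
    h, w, l = size
    # Corners of an axis-aligned box around the origin ...
    x_c = np.array([ l,  l, -l, -l,  l,  l, -l, -l]) / 2.0
    y_c = np.array([ h,  h,  h,  h, -h, -h, -h, -h]) / 2.0
    z_c = np.array([ w, -w, -w,  w,  w, -w, -w,  w]) / 2.0
    rot = np.array([[ np.cos(ry), 0.0, np.sin(ry)],
                    [ 0.0,        1.0, 0.0       ],
                    [-np.sin(ry), 0.0, np.cos(ry)]])
    # ... rotated by the heading and translated to the centroid.
    return (rot @ np.stack([x_c, y_c, z_c])).T + np.asarray(center)

def bbc_loss(center, size, ry, proposal_2d, K):
    """1 - IoU between the projected 3D box and the 2D proposal.

    proposal_2d : (x1, y1, x2, y2) tight box around the instance mask.
    K : 3x3 camera intrinsic matrix; the box is assumed to lie in
        front of the camera (all corner depths positive).
    """
    corners = box3d_corners(center, size, ry)   # (8, 3)
    uvw = (K @ corners.T).T                     # perspective projection
    uv = uvw[:, :2] / uvw[:, 2:3]
    x1, y1 = uv.min(axis=0)                     # tight 2D box around
    x2, y2 = uv.max(axis=0)                     # the projected corners
    px1, py1, px2, py2 = proposal_2d
    iw = max(0.0, min(x2, px2) - max(x1, px1))  # intersection width
    ih = max(0.0, min(y2, py2) - max(y1, py1))  # intersection height
    inter = iw * ih
    union = (x2 - x1) * (y2 - y1) + (px2 - px1) * (py2 - py1) - inter
    return 1.0 - inter / max(union, 1e-9)       # low when boxes agree
```

Lowering this score by adjusting the correction module's residuals is the behavior the plots above summarize: as the 2D IoU rises, the 3D IoU with the ground truth tends to rise with it.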

Take-Home Message

• With the proposed instance mask proposal and bounding box consistency, monocular 3D detection using the pseudo-LiDAR representation can achieve much higher performance than direct regression on images.

[1] F. Manhardt, W. Kehl, and A. Gaidon. ROI-10D: Monocular Lifting of 2D Detection to 6D Pose and Metric Shape. CVPR, 2019.

[2] Z. Qin, J. Wang, and Y. Lu. MonoGRNet: A Geometric Reasoning Network for Monocular 3D Object Localization. AAAI, 2019.

[3] Y. Wang, W.-L. Chao, D. Garg, B. Hariharan, M. Campbell, and K. Weinberger. Pseudo-LiDAR from Visual Depth Estimation: Bridging the Gap in 3D Object Detection for Autonomous Driving. CVPR, 2019.

[4] B. Xu and Z. Chen. Multi-Level Fusion based 3D Object Detection from Monocular Images. CVPR, 2018.

IEEE International Conference on Computer Vision (ICCV) Workshops, 2019.