Deep Learning in Object Detection
Wanli Ouyang
Department of Electronic Engineering, The Chinese University of Hong Kong
Deep learning for: object detection, pedestrian detection, face alignment, human pose estimation
Outline
• General object detection
• Pedestrian detection
• Human part localization
Two “How to”s
• How to effectively train a deep model
  - Data augmentation
  - Label more data
  - Pre-train on large-scale related data (R-CNN)
  - Layerwise pre-training + fine-tuning (Multi-stage)
• How to formulate a vision problem with deep learning
  - Tune hyper-parameters, e.g. number of hidden nodes, filter size, number of layers, activation function, dropout …
  - Make use of experience and insights obtained in CV research
  - Sequential design/learning vs. joint learning
  - Contextual information (Multi-stage, face, human pose)
  - Background clutter removal (SDN)
  - Short- and long-range temporal relationships (action recognition)
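Data augmentation, the first training item listed above, can be sketched in a few lines. The crop size and number of crops below are illustrative, not the values used in any of the cited systems.

```python
import numpy as np

def augment(image, crop=24, n_crops=4, seed=0):
    """Generate extra training samples from one image: the original,
    a horizontal flip, and several random crops."""
    rng = np.random.default_rng(seed)
    samples = [image, image[:, ::-1]]            # original + horizontal flip
    H, W = image.shape
    for _ in range(n_crops):
        y = rng.integers(0, H - crop + 1)
        x = rng.integers(0, W - crop + 1)
        samples.append(image[y:y + crop, x:x + crop])
    return samples

out = augment(np.zeros((32, 32)))                # 1 image -> 6 training samples
```

In practice the crops (and their flips) multiply the effective training set without any extra labeling.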
Object detection
• Pascal VOC: ~20 object classes; training: ~5,700 images; testing: ~10,000 images
• ImageNet ILSVRC: ~200 object classes; training: ~395,000 images; testing: ~40,000 images
• SIFT, HOG, LBP, DPM … [Regionlets. Wang et al. ICCV’13] [SegDPM. Fidler et al. CVPR’13]
• With CNN features
R-CNN: regions + CNN features
• Pipeline: input image → extract region proposals (~2k/image) → compute CNN features → 2-class linear SVM
• Region proposals: Selective Search [van de Sande, Uijlings et al.]; 91.6%/98% recall rate on ImageNet/PASCAL
• CNN features: Krizhevsky, Sutskever & Hinton, NIPS 2012 (also called “AlexNet”)
• SVM: Liblinear
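The pipeline above can be sketched end to end. Every stage below is a stand-in (random boxes for Selective Search, a flattened patch for AlexNet features, a hand-set weight vector for the Liblinear SVM), so only the data flow matches R-CNN.

```python
import numpy as np

def propose_regions(image, n=5, seed=0):
    """Stand-in for Selective Search: random (y, x, h, w) boxes."""
    rng = np.random.default_rng(seed)
    H, W = image.shape[:2]
    return [(rng.integers(0, H - 4), rng.integers(0, W - 4), 4, 4)
            for _ in range(n)]

def cnn_features(image, box):
    """Stand-in for warped-crop AlexNet features: the flattened patch."""
    y, x, h, w = box
    return image[y:y + h, x:x + w].reshape(-1)

def svm_score(feature, w, b):
    """A 2-class linear SVM decision value."""
    return float(feature @ w + b)

img = np.zeros((16, 16))
img[6:10, 6:10] = 1.0                            # a bright "object"
w = np.ones(16) / 16.0                           # toy classifier: brightness = object
detections = [(svm_score(cnn_features(img, b), w, -0.5), b)
              for b in propose_regions(img)]
detections.sort(reverse=True)                    # highest-scoring proposals first
```

The real system then applies non-maximum suppression over the scored boxes; that step is omitted here.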
CNN training
• Train a SuperVision CNN* for the 1000-way ILSVRC image classification task (1.2 million images)
• Fine-tune the CNN for detection: transfer the representation learned from ILSVRC classification to PASCAL (or ImageNet) detection
• ImageNet-Cls pre-train → Det fine-tune
* Network from Krizhevsky, Sutskever & Hinton, NIPS 2012; also called “AlexNet”
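A pure-Python sketch of the transfer recipe above: keep the ImageNet-pretrained layers and replace only the classification head before fine-tuning. Layer names and sizes are hypothetical stand-ins for AlexNet's.

```python
import random

def build_pretrained_weights():
    """Stand-in for AlexNet weights after 1000-way ImageNet-Cls training."""
    return {
        "conv1": [random.random() for _ in range(4)],
        "conv2": [random.random() for _ in range(4)],
        "fc7": [random.random() for _ in range(4)],
        "fc8": [random.random() for _ in range(1000)],  # 1000-way classifier head
    }

def fine_tune_init(pretrained, num_det_classes):
    """Transfer every layer except the head; the head is re-sized for the
    detection classes and learned from scratch during fine-tuning."""
    weights = {k: list(v) for k, v in pretrained.items() if k != "fc8"}
    weights["fc8"] = [0.0] * num_det_classes
    return weights

pre = build_pretrained_weights()
det = fine_tune_init(pre, num_det_classes=21)  # 20 PASCAL classes + background
```

Fine-tuning then continues training all layers on the detection data, typically at a reduced learning rate so the transferred representation is not destroyed.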
Overfeat [Sermanet et al. 2014]
• More considerations on deep model design: multi-resolution, dense pooling
• Sliding window (not region proposals)
• Does not use ImageNet-Cls for pre-training
• Experimental results on ILSVRC 2013
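The sliding-window alternative to region proposals can be sketched as follows; window size, stride, and the scoring function are all illustrative placeholders for the network.

```python
import numpy as np

def sliding_window_detect(image, win=(4, 4), stride=2, score_fn=None):
    """Score every fixed-size window at a fixed stride and keep the best.
    score_fn is a stand-in for a CNN classifier applied to each window."""
    score_fn = score_fn or (lambda patch: patch.mean())
    best = (-np.inf, None)
    H, W = image.shape
    for y in range(0, H - win[0] + 1, stride):
        for x in range(0, W - win[1] + 1, stride):
            s = score_fn(image[y:y + win[0], x:x + win[1]])
            if s > best[0]:
                best = (s, (y, x))
    return best  # (best score, top-left corner of the best window)

img = np.zeros((10, 10))
img[4:8, 4:8] = 1.0                     # a bright "object" at (4, 4)
score, (y, x) = sliding_window_detect(img)
```

Overfeat makes this efficient by computing the convolutional features once and sharing them across overlapping windows, rather than re-running the network per window as this sketch does.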
Pedestrian Detection
Improved the state-of-the-art average miss rate on the largest Caltech dataset from 63% to 38%
(CVPR’12, CVPR’13, ICCV’13, CVPR’14)
Joint Deep Learning
What if we treat an existing deep model as a black box in pedestrian detection?
• ConvNet-U-MS: Sermanet, Kavukcuoglu, Chintala, and LeCun, “Pedestrian Detection with Unsupervised Multi-Stage Feature Learning,” CVPR 2013
(Results on Caltech Test and ETHZ)
• N. Dalal and B. Triggs. Histograms of Oriented Gradients for Human Detection. CVPR, 2005. (6,000 citations)
• P. Felzenszwalb, D. McAllester, and D. Ramanan. A Discriminatively Trained, Multiscale, Deformable Part Model. CVPR, 2008. (2,000 citations)
• W. Ouyang and X. Wang. A Discriminative Deep Model for Pedestrian Detection with Occlusion Handling. CVPR, 2012.
Our Joint Deep Learning Model
Modeling Part Detectors
• Design the filters in the second convolutional layer with variable sizes
• Part models learned from HOG vs. filters learned at the second convolutional layer
Deformation Layer
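A numpy sketch of a deformation layer in this spirit: a part's score is the maximum, over placements, of the part detection map minus a deformation penalty around an anchor position. The quadratic penalty and its coefficients here are illustrative; in the joint deep learning model the deformation parameters are learned.

```python
import numpy as np

def deformation_layer(part_score_map, anchor, c=(0.1, 0.1)):
    """Score one part: max over placements of (detection-map score minus a
    quadratic deformation penalty around the anchor position)."""
    h, w = part_score_map.shape
    ys, xs = np.mgrid[0:h, 0:w]
    penalty = c[0] * (ys - anchor[0]) ** 2 + c[1] * (xs - anchor[1]) ** 2
    deformed = part_score_map - penalty
    idx = np.unravel_index(np.argmax(deformed), deformed.shape)
    return deformed[idx], idx  # part score and chosen placement
```

A strong detection slightly off the anchor can still win, because its map score outweighs the small penalty for moving away from the anchor.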
Visibility Reasoning with Deep Belief Net
Experimental Results
• Caltech – Test dataset (largest, most widely used)
(Plot: average miss rate, 2000–2014: 95% → 68% → 63% (state of the art) → 53% → 39% (best performing); an improvement of ~20%)
W. Ouyang, X. Zeng and X. Wang, “Modeling Mutual Visibility Relationship in Pedestrian Detection,” CVPR 2013.
W. Ouyang and X. Wang, “Single-Pedestrian Detection Aided by Multi-Pedestrian Detection,” CVPR 2013.
X. Zeng, W. Ouyang and X. Wang, “A Cascaded Deep Learning Architecture for Pedestrian Detection,” ICCV 2013.
W. Ouyang and X. Wang, “A Discriminative Deep Model for Pedestrian Detection with Occlusion Handling,” CVPR 2012.
W. Ouyang and X. Wang, “Joint Deep Learning for Pedestrian Detection,” ICCV 2013.
Results on Caltech Test and ETHZ
(Compared variants: DN-HOG, UDN-HOG, UDN-HOGCSS, UDN-CNNFeat, UDN-DefLayer)
Multi-Stage Contextual Deep Learning
Motivated by cascaded classifiers and contextual boost:
• The classifier at each stage deals with a specific set of samples
• The score map output by one classifier can serve as contextual information for the next classifier
• Conventional cascaded classifiers for detection pass only one detection score to the next stage, and the classifiers are trained sequentially
• Our deep model keeps the score map output by the current classifier, and it serves as contextual information to support the decision at the next stage
• Cascaded classifiers are jointly optimized instead of being trained sequentially
• To avoid overfitting, a stage-wise pre-training scheme is proposed to regularize optimization
• Simulate the cascaded classifiers by mining hard samples to train the network stage by stage
Training Strategies
• Unsupervised pre-train Wh,i+1 layer by layer, setting Ws,i+1 = 0, Fi+1 = 0
• Fine-tune all the Wh,i+1 with supervised BP
• Train Fi+1 and Ws,i+1 with BP stage by stage
• A correctly classified sample at the previous stage does not influence the update of parameters
• Stage-by-stage training can be considered as adding regularization constraints to parameters, i.e. some parameters are constrained to be zeros in the early training stages
Log error function and gradients for updating parameters: (equations omitted)
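A toy sketch (synthetic 2-D data and logistic stages, not the paper's network) of the stage-by-stage idea above: each new stage is trained only on the samples the accumulated score still gets wrong, so correctly classified samples stop influencing the parameter update.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)   # linearly separable labels

def train_stage(X, y, lr=0.1, steps=200):
    """Fit one logistic stage by gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

total = np.zeros(len(y))
hard = np.ones(len(y), dtype=bool)                # stage 1 sees everything
for stage in range(3):
    if not hard.any():
        break
    w = train_stage(X[hard], y[hard])             # later stages: hard samples only
    total += X @ w                                # scores accumulate across stages
    pred = (total > 0).astype(float)
    hard = pred != y                              # mine the remaining hard samples
```

Restricting each stage to the mined hard samples plays the role of the zero-constrained parameters in the early training stages: easy samples, already handled, no longer move the weights.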
Experimental Results
• Caltech and ETHZ (compared variants: DeepNet, NoneFilter, …)
Comparison of Different Training Strategies
• Network-BP: use back-propagation to update all the parameters, without pre-training
• PretrainTransferMatrix-BP: the transfer matrices are unsupervised pre-trained, and then all the parameters are fine-tuned
• Multi-stage: our multi-stage training strategy
Switchable Deep Network for Pedestrian Detection
• Challenge: background clutter and large variations of pedestrian appearance
• Proposed solution: a Switchable Deep Network (SDN) for learning the foreground map
Switchable Deep Network for Pedestrian Detection
• Switchable Restricted Boltzmann Machine (switching between foreground and background)
(a) Performance on Caltech Test (b) Performance on ETH
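A loose sketch of the “switchable” idea: each hidden unit carries one weight vector per component (here foreground vs. background) and a switch keeps the component that best explains the input. The sigmoid activation and two-component setup are illustrative, not the paper's exact model.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def switchable_hidden(v, W_fg, W_bg, b):
    """Per-unit switch: keep whichever component gives the larger
    pre-activation, then squash. Returns activations and chosen switches
    (0 = foreground component, 1 = background component)."""
    pre = np.stack([W_fg @ v, W_bg @ v]) + b   # shape (2, n_hidden)
    switch = pre.argmax(axis=0)
    return sigmoid(pre.max(axis=0)), switch
```

The switches give each hidden unit a crude foreground/background decision, which is the intuition behind learning a foreground map from the hidden layer.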
Human part localization
• Facial keypoint detection (CVPR’13)
• Human pose estimation (CVPR’14)
Facial Keypoint Detection
• Y. Sun, X. Wang and X. Tang, “Deep Convolutional Network Cascade for Facial Point Detection,” CVPR 2013
Comparison with Belhumeur et al. [4] and Cao et al. [5] on LFPW test images.
1. http://www.luxand.com/facesdk/
2. http://research.microsoft.com/en-us/projects/facesdk/
3. O. Jesorsky, K. J. Kirchberg, and R. Frischholz. Robust face detection using the Hausdorff distance. In Proc. AVBPA, 2001.
4. P. N. Belhumeur, D. W. Jacobs, D. J. Kriegman, and N. Kumar. Localizing parts of faces using a consensus of exemplars. In Proc. CVPR, 2011.
5. X. Cao, Y. Wei, F. Wen, and J. Sun. Face alignment by explicit shape regression. In Proc. CVPR, 2012.
6. L. Liang, R. Xiao, F. Wen, and J. Sun. Face alignment via component-based discriminative search. In Proc. ECCV, 2008.
7. M. Valstar, B. Martinez, X. Binefa, and M. Pantic. Facial point detection using boosted regression and graph models. In Proc. CVPR, 2010.
(Results on the Validation, BioID, and LFPW sets.)
Benefits of Using a Deep Model
• Takes the full face as input, making full use of texture context information over the entire face to locate each keypoint
• The first network, which takes the whole face as input, needs deep structures to extract high-level features
• Since the networks are trained to predict all the keypoints simultaneously, the geometric constraints among keypoints are implicitly encoded
• Global geometric constraints among keypoints can also be explicitly encoded by the deep model
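A toy, fully synthetic illustration of the simultaneous-prediction point: when one predictor is trained to output all keypoints at once, correlations among them (here, a shared "pose" factor that moves every keypoint together) are captured by the shared mapping.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
# Hypothetical setup: a 1-D "pose" factor shifts all 5 keypoints together.
pose = rng.normal(size=(n, 1))
base = np.array([[1.0, 2.0, 3.0, 4.0, 5.0]])          # mean keypoint positions
keypoints = base + pose @ np.ones((1, 5)) + 0.05 * rng.normal(size=(n, 5))
# Input: the pose factor plus three nuisance dimensions.
faces = np.hstack([pose, rng.normal(size=(n, 3))])

# One least-squares model predicts all keypoints jointly from the same input,
# so the shared pose structure is learned once and applied to every keypoint.
W, *_ = np.linalg.lstsq(faces, keypoints - base, rcond=None)
pred = faces @ W + base
```

Because all outputs are driven by the same learned mapping, an estimate for one keypoint is consistent with the others; predicting each keypoint independently would have to rediscover the shared structure five times.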
Human pose estimation
• W. Ouyang, X. Chu, and X. Wang, “Multi-source Deep Learning for Human Pose Estimation,” CVPR 2014.
Multiple information sources
• Appearance
• Appearance mixture type
• Deformation
Multi-source deep model
Experimental results

PARSE
Method             Torso  U.leg  L.leg  U.arm  L.arm  Head  Total
Yang&Ramanan [58]  82.9   68.8   60.5   63.4   42.4   82.4  63.6
Ours               89.3   78.0   72.0   67.8   47.8   89.3  71.0

UIUC People
Method             Torso  U.leg  L.leg  U.arm  L.arm  Head  Total
Yang&Ramanan [58]  81.8   65.0   55.1   46.8   37.7   79.8  57.0
Ours               89.1   72.9   62.4   56.3   47.6   89.1  65.6

LSP
Method             Torso  U.leg  L.leg  U.arm  L.arm  Head  Total
Yang&Ramanan [58]  82.9   70.3   67.0   56.0   39.8   79.3  62.8
Ours               85.8   76.5   72.2   63.3   46.6   83.1  68.6

Up to 8.6 percent accuracy improvement with global geometric constraints
Experimental results (qualitative): Ours vs. Yang&Ramanan, with left and right limbs distinguished
Conclusion – Two “How to”s
• How to effectively train a deep model
  - Data augmentation
  - Label more data
  - Pre-train on large-scale related data (R-CNN)
  - Layerwise pre-training + fine-tuning (Multi-stage)
• How to formulate a vision problem with deep learning?
  - Tune hyper-parameters, e.g. number of hidden nodes, number of layers, activation function, dropout
  - Make use of experience and insights obtained in CV research
  - Sequential design/learning vs. joint learning
  - Contextual information (Multi-stage, face, human pose)
  - Background clutter removal (SDN)
  - Short- and long-range temporal relationships (action recognition)
Reference
• R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation,” CVPR, 2014.
• P. Sermanet et al., “Overfeat: Integrated Recognition, Localization and Detection Using Convolutional Networks,” arXiv preprint arXiv:1312.6229, 2013.
• W. Ouyang and X. Wang, “Joint Deep Learning for Pedestrian Detection,” ICCV, 2013.
• X. Zeng, W. Ouyang and X. Wang, “Multi-Stage Contextual Deep Learning for Pedestrian Detection,” ICCV, 2013.
• P. Luo, Y. Tian, X. Wang, and X. Tang, “Switchable Deep Network for Pedestrian Detection,” CVPR, 2014.
• W. Ouyang, X. Zeng and X. Wang, “Modeling Mutual Visibility Relationship with a Deep Model in Pedestrian Detection,” CVPR, pp. 3222-3229, 2013.
• W. Ouyang and X. Wang, “A Discriminative Deep Model for Pedestrian Detection with Occlusion Handling,” CVPR, pp. 3258-3265, 2012.
• Y. Sun, X. Wang and X. Tang, “Deep Convolutional Network Cascade for Facial Point Detection,” CVPR, pp. 3476-3483, 2013.
• W. Ouyang, X. Chu, and X. Wang, “Multi-source Deep Learning for Human Pose Estimation,” CVPR, 2014.
Q&A
mmlab.ie.cuhk.edu.hk/
www.ee.cuhk.edu.hk/~xgwang/
www.ee.cuhk.edu.hk/~wlouyang/