Deep Learning in Object Detection
Wanli Ouyang
Department of Electronic Engineering, The Chinese University of Hong Kong
Deep learning for: object detection, pedestrian detection, face alignment, human pose estimation
Outline
• General object detection
• Pedestrian detection
• Human part localization
Two “How to”s
• How to effectively train a deep model
  - Data augmentation
  - Label more data
  - Pre-train on large-scale related data (R-CNN)
  - Layerwise pre-training + fine-tuning (Multi-stage)
• How to formulate a vision problem with deep learning
  - Tune hyper-parameters, e.g. number of hidden nodes, filter size, number of layers, activation function, dropout …
  - Make use of experience and insights obtained in CV research
  - Sequential design/learning vs. joint learning
  - Contextual information (Multi-stage, face, human pose)
  - Background clutter removal (SDN)
  - Short- and long-range temporal relationships (action recognition)
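Data augmentation, the first training item listed above, can be sketched in a few lines. The crop size and number of crops below are illustrative, not the values used in any of the cited systems.

```python
import numpy as np

def augment(image, crop=24, n_crops=4, seed=0):
    """Generate extra training samples from one image: the original,
    a horizontal flip, and several random crops."""
    rng = np.random.default_rng(seed)
    samples = [image, image[:, ::-1]]            # original + horizontal flip
    H, W = image.shape
    for _ in range(n_crops):
        y = rng.integers(0, H - crop + 1)
        x = rng.integers(0, W - crop + 1)
        samples.append(image[y:y + crop, x:x + crop])
    return samples

out = augment(np.zeros((32, 32)))                # 1 image -> 6 training samples
```

In practice the crops (and their flips) multiply the effective training set without any extra labeling.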
Object detection
• Pascal VOC: ~20 object classes; training: ~5,700 images; testing: ~10,000 images
• ImageNet ILSVRC: ~200 object classes; training: ~395,000 images; testing: ~40,000 images
• SIFT, HOG, LBP, DPM … [Regionlets. Wang et al. ICCV’13] [SegDPM. Fidler et al. CVPR’13]
• With CNN features
R-CNN: regions + CNN features
• Pipeline: input image → extract region proposals (~2k/image) → compute CNN features → 2-class linear SVM
• Region proposals: Selective Search [van de Sande, Uijlings et al.]; 91.6%/98% recall rate on ImageNet/PASCAL
• CNN features: Krizhevsky, Sutskever & Hinton, NIPS 2012 (also called “AlexNet”)
• SVM: Liblinear
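The pipeline above can be sketched end to end. Every stage below is a stand-in (random boxes for Selective Search, a flattened patch for AlexNet features, a hand-set weight vector for the Liblinear SVM), so only the data flow matches R-CNN.

```python
import numpy as np

def propose_regions(image, n=5, seed=0):
    """Stand-in for Selective Search: random (y, x, h, w) boxes."""
    rng = np.random.default_rng(seed)
    H, W = image.shape[:2]
    return [(rng.integers(0, H - 4), rng.integers(0, W - 4), 4, 4)
            for _ in range(n)]

def cnn_features(image, box):
    """Stand-in for warped-crop AlexNet features: the flattened patch."""
    y, x, h, w = box
    return image[y:y + h, x:x + w].reshape(-1)

def svm_score(feature, w, b):
    """A 2-class linear SVM decision value."""
    return float(feature @ w + b)

img = np.zeros((16, 16))
img[6:10, 6:10] = 1.0                            # a bright "object"
w = np.ones(16) / 16.0                           # toy classifier: brightness = object
detections = [(svm_score(cnn_features(img, b), w, -0.5), b)
              for b in propose_regions(img)]
detections.sort(reverse=True)                    # highest-scoring proposals first
```

The real system then applies non-maximum suppression over the scored boxes; that step is omitted here.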
CNN training
• Train a SuperVision CNN* for the 1000-way ILSVRC image classification task (1.2 million images)
• Fine-tune the CNN for detection: transfer the representation learned from ILSVRC classification to PASCAL (or ImageNet) detection
• ImageNet-Cls pre-train → Det fine-tune
* Network from Krizhevsky, Sutskever & Hinton, NIPS 2012; also called “AlexNet”
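A pure-Python sketch of the transfer recipe above: keep the ImageNet-pretrained layers and replace only the classification head before fine-tuning. Layer names and sizes are hypothetical stand-ins for AlexNet's.

```python
import random

def build_pretrained_weights():
    """Stand-in for AlexNet weights after 1000-way ImageNet-Cls training."""
    return {
        "conv1": [random.random() for _ in range(4)],
        "conv2": [random.random() for _ in range(4)],
        "fc7": [random.random() for _ in range(4)],
        "fc8": [random.random() for _ in range(1000)],  # 1000-way classifier head
    }

def fine_tune_init(pretrained, num_det_classes):
    """Transfer every layer except the head; the head is re-sized for the
    detection classes and learned from scratch during fine-tuning."""
    weights = {k: list(v) for k, v in pretrained.items() if k != "fc8"}
    weights["fc8"] = [0.0] * num_det_classes
    return weights

pre = build_pretrained_weights()
det = fine_tune_init(pre, num_det_classes=21)  # 20 PASCAL classes + background
```

Fine-tuning then continues training all layers on the detection data, typically at a reduced learning rate so the transferred representation is not destroyed.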
Overfeat [Sermanet et al. 2014]
• More considerations on deep model design: multi-resolution, dense pooling
• Sliding window (not region proposals)
• Does not use ImageNet-Cls for pre-training
• Experimental results on ILSVRC 2013
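The sliding-window alternative to region proposals can be sketched as follows; window size, stride, and the scoring function are all illustrative placeholders for the network.

```python
import numpy as np

def sliding_window_detect(image, win=(4, 4), stride=2, score_fn=None):
    """Score every fixed-size window at a fixed stride and keep the best.
    score_fn is a stand-in for a CNN classifier applied to each window."""
    score_fn = score_fn or (lambda patch: patch.mean())
    best = (-np.inf, None)
    H, W = image.shape
    for y in range(0, H - win[0] + 1, stride):
        for x in range(0, W - win[1] + 1, stride):
            s = score_fn(image[y:y + win[0], x:x + win[1]])
            if s > best[0]:
                best = (s, (y, x))
    return best  # (best score, top-left corner of the best window)

img = np.zeros((10, 10))
img[4:8, 4:8] = 1.0                     # a bright "object" at (4, 4)
score, (y, x) = sliding_window_detect(img)
```

Overfeat makes this efficient by computing the convolutional features once and sharing them across overlapping windows, rather than re-running the network per window as this sketch does.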
Pedestrian Detection
Improved the state-of-the-art average miss rate on the largest Caltech dataset from 63% to 38%
(CVPR’12, CVPR’13, ICCV’13, CVPR’14)
Joint Deep Learning
What if we treat an existing deep model as a black box in pedestrian detection?
• ConvNet-U-MS: Sermanet, Kavukcuoglu, Chintala, and LeCun, “Pedestrian Detection with Unsupervised Multi-Stage Feature Learning,” CVPR 2013
(Results on Caltech Test and ETHZ)
• N. Dalal and B. Triggs. Histograms of Oriented Gradients for Human Detection. CVPR, 2005. (6,000 citations)
• P. Felzenszwalb, D. McAllester, and D. Ramanan. A Discriminatively Trained, Multiscale, Deformable Part Model. CVPR, 2008. (2,000 citations)
• W. Ouyang and X. Wang. A Discriminative Deep Model for Pedestrian Detection with Occlusion Handling. CVPR, 2012.
Our Joint Deep Learning Model
Modeling Part Detectors
• Design the filters in the second convolutional layer with variable sizes
• Part models learned from HOG vs. filters learned at the second convolutional layer
Deformation Layer
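A numpy sketch of a deformation layer in this spirit: a part's score is the maximum, over placements, of the part detection map minus a deformation penalty around an anchor position. The quadratic penalty and its coefficients here are illustrative; in the joint deep learning model the deformation parameters are learned.

```python
import numpy as np

def deformation_layer(part_score_map, anchor, c=(0.1, 0.1)):
    """Score one part: max over placements of (detection-map score minus a
    quadratic deformation penalty around the anchor position)."""
    h, w = part_score_map.shape
    ys, xs = np.mgrid[0:h, 0:w]
    penalty = c[0] * (ys - anchor[0]) ** 2 + c[1] * (xs - anchor[1]) ** 2
    deformed = part_score_map - penalty
    idx = np.unravel_index(np.argmax(deformed), deformed.shape)
    return deformed[idx], idx  # part score and chosen placement
```

A strong detection slightly off the anchor can still win, because its map score outweighs the small penalty for moving away from the anchor.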
Visibility Reasoning with Deep Belief Net
Experimental Results
• Caltech – Test dataset (largest, most widely used)
(Plot: average miss rate, 2000–2014: 95% → 68% → 63% (state of the art) → 53% → 39% (best performing); an improvement of ~20%)
W. Ouyang, X. Zeng and X. Wang, “Modeling Mutual Visibility Relationship in Pedestrian Detection,” CVPR 2013.
W. Ouyang and X. Wang, “Single-Pedestrian Detection Aided by Multi-Pedestrian Detection,” CVPR 2013.
X. Zeng, W. Ouyang and X. Wang, “A Cascaded Deep Learning Architecture for Pedestrian Detection,” ICCV 2013.
W. Ouyang and X. Wang, “A Discriminative Deep Model for Pedestrian Detection with Occlusion Handling,” CVPR 2012.
W. Ouyang and X. Wang, “Joint Deep Learning for Pedestrian Detection,” ICCV 2013.
Results on Caltech Test and ETHZ
(Compared variants: DN-HOG, UDN-HOG, UDN-HOGCSS, UDN-CNNFeat, UDN-DefLayer)
Multi-Stage Contextual Deep Learning
Motivated by cascaded classifiers and contextual boost:
• The classifier at each stage deals with a specific set of samples
• The score map output by one classifier can serve as contextual information for the next classifier
• Conventional cascaded classifiers for detection pass only one detection score to the next stage, and the classifiers are trained sequentially
• Our deep model keeps the score map output by the current classifier, and it serves as contextual information to support the decision at the next stage
• Cascaded classifiers are jointly optimized instead of being trained sequentially
• To avoid overfitting, a stage-wise pre-training scheme is proposed to regularize optimization
• Simulate the cascaded classifiers by mining hard samples to train the network stage by stage
Training Strategies
• Unsupervised pre-train Wh,i+1 layer by layer, setting Ws,i+1 = 0, Fi+1 = 0
• Fine-tune all the Wh,i+1 with supervised BP
• Train Fi+1 and Ws,i+1 with BP stage by stage
• A correctly classified sample at the previous stage does not influence the update of parameters
• Stage-by-stage training can be considered as adding regularization constraints to parameters, i.e. some parameters are constrained to be zeros in the early training stages
Log error function and gradients for updating parameters: (equations omitted)
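A toy sketch (synthetic 2-D data and logistic stages, not the paper's network) of the stage-by-stage idea above: each new stage is trained only on the samples the accumulated score still gets wrong, so correctly classified samples stop influencing the parameter update.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)   # linearly separable labels

def train_stage(X, y, lr=0.1, steps=200):
    """Fit one logistic stage by gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

total = np.zeros(len(y))
hard = np.ones(len(y), dtype=bool)                # stage 1 sees everything
for stage in range(3):
    if not hard.any():
        break
    w = train_stage(X[hard], y[hard])             # later stages: hard samples only
    total += X @ w                                # scores accumulate across stages
    pred = (total > 0).astype(float)
    hard = pred != y                              # mine the remaining hard samples
```

Restricting each stage to the mined hard samples plays the role of the zero-constrained parameters in the early training stages: easy samples, already handled, no longer move the weights.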
Experimental Results
• Caltech and ETHZ (compared variants: DeepNet, NoneFilter, …)
Comparison of Different Training Strategies
• Network-BP: use back-propagation to update all the parameters, without pre-training
• PretrainTransferMatrix-BP: the transfer matrices are unsupervised pre-trained, and then all the parameters are fine-tuned
• Multi-stage: our multi-stage training strategy
Switchable Deep Network for Pedestrian Detection
• Challenge: background clutter and large variations of pedestrian appearance
• Proposed solution: a Switchable Deep Network (SDN) for learning the foreground map
Switchable Deep Network for Pedestrian Detection
• Switchable Restricted Boltzmann Machine (switching between foreground and background)
(a) Performance on Caltech Test (b) Performance on ETH
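A loose sketch of the “switchable” idea: each hidden unit carries one weight vector per component (here foreground vs. background) and a switch keeps the component that best explains the input. The sigmoid activation and two-component setup are illustrative, not the paper's exact model.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def switchable_hidden(v, W_fg, W_bg, b):
    """Per-unit switch: keep whichever component gives the larger
    pre-activation, then squash. Returns activations and chosen switches
    (0 = foreground component, 1 = background component)."""
    pre = np.stack([W_fg @ v, W_bg @ v]) + b   # shape (2, n_hidden)
    switch = pre.argmax(axis=0)
    return sigmoid(pre.max(axis=0)), switch
```

The switches give each hidden unit a crude foreground/background decision, which is the intuition behind learning a foreground map from the hidden layer.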
Human part localization
• Facial keypoint detection (CVPR’13)
• Human pose estimation (CVPR’14)
Facial Keypoint Detection
• Y. Sun, X. Wang and X. Tang, “Deep Convolutional Network Cascade for Facial Point Detection,” CVPR 2013
Comparison with Belhumeur et al. [4] and Cao et al. [5] on LFPW test images.
1. http://www.luxand.com/facesdk/
2. http://research.microsoft.com/en-us/projects/facesdk/
3. O. Jesorsky, K. J. Kirchberg, and R. Frischholz. Robust face detection using the Hausdorff distance. In Proc. AVBPA, 2001.
4. P. N. Belhumeur, D. W. Jacobs, D. J. Kriegman, and N. Kumar. Localizing parts of faces using a consensus of exemplars. In Proc. CVPR, 2011.
5. X. Cao, Y. Wei, F. Wen, and J. Sun. Face alignment by explicit shape regression. In Proc. CVPR, 2012.
6. L. Liang, R. Xiao, F. Wen, and J. Sun. Face alignment via component-based discriminative search. In Proc. ECCV, 2008.
7. M. Valstar, B. Martinez, X. Binefa, and M. Pantic. Facial point detection using boosted regression and graph models. In Proc. CVPR, 2010.
(Results on the Validation, BioID, and LFPW sets.)
Benefits of Using a Deep Model
• Takes the full face as input, making full use of texture context information over the entire face to locate each keypoint
• The first network, which takes the whole face as input, needs deep structures to extract high-level features
• Since the networks are trained to predict all the keypoints simultaneously, the geometric constraints among keypoints are implicitly encoded
• Global geometric constraints among keypoints can also be explicitly encoded by the deep model
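A toy, fully synthetic illustration of the simultaneous-prediction point: when one predictor is trained to output all keypoints at once, correlations among them (here, a shared "pose" factor that moves every keypoint together) are captured by the shared mapping.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
# Hypothetical setup: a 1-D "pose" factor shifts all 5 keypoints together.
pose = rng.normal(size=(n, 1))
base = np.array([[1.0, 2.0, 3.0, 4.0, 5.0]])          # mean keypoint positions
keypoints = base + pose @ np.ones((1, 5)) + 0.05 * rng.normal(size=(n, 5))
# Input: the pose factor plus three nuisance dimensions.
faces = np.hstack([pose, rng.normal(size=(n, 3))])

# One least-squares model predicts all keypoints jointly from the same input,
# so the shared pose structure is learned once and applied to every keypoint.
W, *_ = np.linalg.lstsq(faces, keypoints - base, rcond=None)
pred = faces @ W + base
```

Because all outputs are driven by the same learned mapping, an estimate for one keypoint is consistent with the others; predicting each keypoint independently would have to rediscover the shared structure five times.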
Human pose estimation
• W. Ouyang, X. Chu, and X. Wang, “Multi-source Deep Learning for Human Pose Estimation,” CVPR 2014.
Multiple information sources
• Appearance
• Appearance mixture type
• Deformation
Multi-source deep model
Experimental results

PARSE
Method             Torso  U.leg  L.leg  U.arm  L.arm  Head  Total
Yang&Ramanan [58]  82.9   68.8   60.5   63.4   42.4   82.4  63.6
Ours               89.3   78.0   72.0   67.8   47.8   89.3  71.0

UIUC People
Method             Torso  U.leg  L.leg  U.arm  L.arm  Head  Total
Yang&Ramanan [58]  81.8   65.0   55.1   46.8   37.7   79.8  57.0
Ours               89.1   72.9   62.4   56.3   47.6   89.1  65.6

LSP
Method             Torso  U.leg  L.leg  U.arm  L.arm  Head  Total
Yang&Ramanan [58]  82.9   70.3   67.0   56.0   39.8   79.3  62.8
Ours               85.8   76.5   72.2   63.3   46.6   83.1  68.6

Up to 8.6 percent accuracy improvement with global geometric constraints
Experimental results (qualitative): Ours vs. Yang&Ramanan, with left and right limbs distinguished
Conclusion – Two “How to”s
• How to effectively train a deep model
  - Data augmentation
  - Label more data
  - Pre-train on large-scale related data (R-CNN)
  - Layerwise pre-training + fine-tuning (Multi-stage)
• How to formulate a vision problem with deep learning?
  - Tune hyper-parameters, e.g. number of hidden nodes, number of layers, activation function, dropout
  - Make use of experience and insights obtained in CV research
  - Sequential design/learning vs. joint learning
  - Contextual information (Multi-stage, face, human pose)
  - Background clutter removal (SDN)
  - Short- and long-range temporal relationships (action recognition)
Reference
• R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation,” CVPR, 2014.
• P. Sermanet et al., “Overfeat: Integrated Recognition, Localization and Detection Using Convolutional Networks,” arXiv preprint arXiv:1312.6229, 2013.
• W. Ouyang and X. Wang, “Joint Deep Learning for Pedestrian Detection,” ICCV, 2013.
• X. Zeng, W. Ouyang and X. Wang, “Multi-Stage Contextual Deep Learning for Pedestrian Detection,” ICCV, 2013.
• P. Luo, Y. Tian, X. Wang, and X. Tang, “Switchable Deep Network for Pedestrian Detection,” CVPR, 2014.
• W. Ouyang, X. Zeng and X. Wang, “Modeling Mutual Visibility Relationship with a Deep Model in Pedestrian Detection,” CVPR, pp. 3222-3229, 2013.
• W. Ouyang and X. Wang, “A Discriminative Deep Model for Pedestrian Detection with Occlusion Handling,” CVPR, pp. 3258-3265, 2012.
• Y. Sun, X. Wang and X. Tang, “Deep Convolutional Network Cascade for Facial Point Detection,” CVPR, pp. 3476-3483, 2013.
• W. Ouyang, X. Chu, and X. Wang, “Multi-source Deep Learning for Human Pose Estimation,” CVPR, 2014.
Q&A
mmlab.ie.cuhk.edu.hk/
www.ee.cuhk.edu.hk/~xgwang/
www.ee.cuhk.edu.hk/~wlouyang/