CSCE 496/896 Lecture 9: Object Detection
Source: cse.unl.edu/~sscott/teach/Classes/cse496S19/slides/09-ObjectDetection.pdf

CSCE 496/896
Lecture 9: Object Detection
Stephen Scott

Introduction
Performance Measures
R-CNN
SPP-net
Fast R-CNN
YOLO

CSCE 496/896 Lecture 9: Object Detection

Stephen Scott

[email protected]

Introduction

- We know that CNNs are useful in image classification
- Now consider object detection
  - Given an input image, identify what objects (plural) are in it and where they are
  - Output bounding box of each object

Outline

- Performance measures
- R-CNN
- SPP-net
- Fast R-CNN
- YOLO

Performance Measures: Mean Average Precision

- Mean average precision (mAP) to measure how well objects are identified
- Recall from Lecture 3:
  - Precision is the fraction of those labeled positive that are positive
  - Recall is the fraction of the actual positives that are labeled positive
  - Precision-recall curve plots precision vs. recall
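
The two definitions above can be sketched in a few lines of Python. This is only an illustration: the function name `precision_recall` and the 0/1 list encoding of labels and predictions are assumptions, not something the slides prescribe.

```python
def precision_recall(labels, predictions):
    """Return (precision, recall) for parallel lists of 0/1 values.

    labels[i] is the true class of item i, predictions[i] is the
    predicted class (1 = positive). Hypothetical helper for illustration.
    """
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    # Fraction of those labeled positive that are positive
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Fraction of the actual positives that are labeled positive
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

Sweeping a confidence threshold and recording the pair at each setting would trace out the precision-recall curve.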

Performance Measures: Mean Average Precision (2)

- Given a ranking (by confidence values) of n items, average precision at n (AP@n) is the average of precision values at each position in the ranking:

$$\mathrm{AP} = \sum_{k=1}^{n} P(k)\,\Delta r(k),$$

  where P(k) is precision at position k and Δr(k) is the change in recall: r(k) − r(k − 1) (= 0 if instance k is negative, = 1/N_p if k is one of N_p positives)
- E.g., if ranking = ⟨+,+,−,+,−⟩, AP@5 = (1)(1/3) + (1)(1/3) + (2/3)(0) + (3/4)(1/3) + (3/5)(0) = 11/12
  - Larger as more positives are ranked above negatives
- mAP is the mean of average precision across all classes
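
As a rough sketch, the AP sum can be evaluated directly from a ranking. The names `average_precision` and `mean_average_precision`, and the encoding of a ranking as a list of booleans (True = positive, ordered by decreasing confidence), are assumptions made for this example.

```python
def average_precision(ranking):
    """AP = sum over k of P(k) * delta_r(k) for a ranked list of booleans."""
    n_pos = sum(ranking)
    if n_pos == 0:
        return 0.0
    ap = 0.0
    hits = 0
    for k, is_positive in enumerate(ranking, start=1):
        if is_positive:
            hits += 1
            precision_at_k = hits / k
            # Recall changes by 1/N_p only at positive positions;
            # negative positions contribute 0 to the sum.
            ap += precision_at_k * (1.0 / n_pos)
    return ap


def mean_average_precision(rankings_by_class):
    """Mean of per-class AP; rankings_by_class maps class name -> ranking."""
    aps = [average_precision(r) for r in rankings_by_class.values()]
    return sum(aps) / len(aps)
```

For the ranking ⟨+,+,−,+,−⟩, `average_precision([True, True, False, True, False])` evaluates to 11/12, and moving the negative at position 3 further down raises the value to 1.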

Performance Measures: Intersection Over Union

- Intersection over union (IoU) to measure quality of bounding boxes
- Divide the size of the two boxes' intersection by the size of their union
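
A minimal sketch of the IoU computation, assuming axis-aligned boxes given as (x1, y1, x2, y2) corner tuples with x1 < x2 and y1 < y2. The box format and the function name `iou` are choices made for this example.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle (zero if disjoint)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0
```

Two 2 × 2 boxes offset by one unit in each direction overlap in a 1 × 1 square, so their IoU is 1 / (4 + 4 − 1) = 1/7.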

Basic Idea of Object Detection

- Split input image into regions and classify each region with a CNN and other machinery
  - Region boundary is bounding box
  - Object detected in region is object in BB
- Issues:
  - Limited to bounding boxes of fixed sizes and locations
  - An object could span regions
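
To make the two issues concrete, here is a small sketch that tiles an image into a fixed grid and checks how many regions a single object overlaps. The helper names `grid_regions` and `regions_touched` and the (x1, y1, x2, y2) box format are assumptions for illustration; the classifier itself is left out.

```python
def grid_regions(img_w, img_h, cell_w, cell_h):
    """Tile the image with non-overlapping cells; each cell is a candidate box."""
    boxes = []
    for y in range(0, img_h - cell_h + 1, cell_h):
        for x in range(0, img_w - cell_w + 1, cell_w):
            boxes.append((x, y, x + cell_w, y + cell_h))
    return boxes


def regions_touched(obj_box, boxes):
    """Return the grid regions that an object's true box overlaps."""
    ox1, oy1, ox2, oy2 = obj_box
    return [b for b in boxes
            if ox1 < b[2] and b[0] < ox2 and oy1 < b[3] and b[1] < oy2]
```

On a 4 × 4 image with 2 × 2 cells, an object occupying (1, 1, 3, 3) touches all four regions, and none of the fixed boxes matches it.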

Region CNN (Girshick et al. 2014)

- R-CNN proposes collection of 2000 regions in image
- Warps each region to match input dimensions (227 × 227 × 3) of CNN to get 4096-dimensional embedded representation
- Classifies each embedded vector with class-specific binary SVMs
- Applies class-specific regressors to fine-tune bounding boxes
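
The warping step can be sketched as a crop followed by a resize to a fixed size. The version below uses nearest-neighbor sampling on a nested-list image purely for illustration; that interpolation choice, the function name `warp_region`, and the (x1, y1, x2, y2) box format are assumptions, not details taken from the slides.

```python
def warp_region(image, box, out_h, out_w):
    """Crop box = (x1, y1, x2, y2) from image (a list of rows) and
    resize it to out_h x out_w with nearest-neighbor sampling.

    The aspect ratio is not preserved: the crop is stretched to fit.
    """
    x1, y1, x2, y2 = box
    crop_h, crop_w = y2 - y1, x2 - x1
    warped = []
    for i in range(out_h):
        src_y = y1 + (i * crop_h) // out_h  # nearest source row
        row = []
        for j in range(out_w):
            src_x = x1 + (j * crop_w) // out_w  # nearest source column
            row.append(image[src_y][src_x])
        warped.append(row)
    return warped
```

With `out_h = out_w = 227`, every proposed region ends up at the same size, so all of them can go through the same CNN.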

Region CNN (Girshick et al. 2014): Example from Girshick (2015)

Region CNN (Girshick et al. 2014): Selective Search

Popular method to propose RoIs: selective search
1. Segment the image
2. Compute bounding boxes of segments
3. Iteratively merge adjacent segments based on similarity
   - Linear combination of similarities of: color, texture, size, shape
4. Go to 2
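
The merge loop can be sketched as follows. This is a heavily simplified stand-in, not the actual selective search algorithm: each segment is assumed to be a dict with a `box`, a scalar mean `color` in [0, 1], and a pixel `size`, and the similarity uses only color and size terms with equal weights. All names here are hypothetical.

```python
def boxes_adjacent(a, b):
    """True if two (x1, y1, x2, y2) boxes touch or overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def similarity(s, t, image_area):
    """Equal-weight sum of a color term and a size term (assumed weights)."""
    color_sim = 1.0 - abs(s["color"] - t["color"])
    size_sim = 1.0 - (s["size"] + t["size"]) / image_area  # favors small segments
    return color_sim + size_sim


def merge(s, t):
    """Merge two segments: union box, summed size, size-weighted mean color."""
    size = s["size"] + t["size"]
    box = (min(s["box"][0], t["box"][0]), min(s["box"][1], t["box"][1]),
           max(s["box"][2], t["box"][2]), max(s["box"][3], t["box"][3]))
    color = (s["color"] * s["size"] + t["color"] * t["size"]) / size
    return {"box": box, "color": color, "size": size}


def propose_regions(segments, image_area):
    """Greedily merge the most similar adjacent pair, recording every box."""
    segments = list(segments)
    proposals = [s["box"] for s in segments]
    while len(segments) > 1:
        pairs = [(similarity(a, b, image_area), i, j)
                 for i, a in enumerate(segments)
                 for j, b in enumerate(segments)
                 if i < j and boxes_adjacent(a["box"], b["box"])]
        if not pairs:
            break
        _, i, j = max(pairs)
        merged = merge(segments[i], segments[j])
        segments = [s for k, s in enumerate(segments) if k not in (i, j)]
        segments.append(merged)
        proposals.append(merged["box"])
    return proposals
```

Every intermediate merge contributes a box, so the proposals cover several scales, from the initial small segments up to the whole image.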

Region CNN (Girshick et al. 2014): Issues

- Training and detection are slow
  - Detection: 13s/image on GPU, 53s/image on CPU
  - Due to large number of regions proposed, each run through CNN and classified

Spatial Pyramid Pooling (He et al. 2015)

- Part of R-CNN's slowdown at test time is running each RoI through ConvNet separately
- To speed up test time, instead put entire image through single ConvNet
- Choose RoIs from ConvNet output and run through spatial pyramid pooling (SPP) layer
  - Max/avg pooling with fixed number of bins
  - Produces fixed-length vector regardless of input size
- Fixed-length vectors feed to fully connected layers, then SVMs
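
A minimal sketch of the pooling idea on a single-channel feature map stored as a nested list. The pyramid levels (1, 2, 4) and the bin-boundary rule (floor for the start, ceiling for the end) are assumptions chosen for the example; `spp_max` is a hypothetical name.

```python
import math


def spp_max(feature_map, levels=(1, 2, 4)):
    """Max-pool a 2D feature map into n x n bins for each pyramid level n.

    The output length is the sum of n * n over the levels, whatever the
    height and width of the input.
    """
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            y0 = (i * h) // n
            y1 = math.ceil((i + 1) * h / n)
            for j in range(n):
                x0 = (j * w) // n
                x1 = math.ceil((j + 1) * w / n)
                pooled.append(max(feature_map[y][x]
                                  for y in range(y0, y1)
                                  for x in range(x0, x1)))
    return pooled
```

With levels (1, 2, 4), a 5 × 7 RoI and a 13 × 9 RoI both map to 1 + 4 + 16 = 21 values per channel, which is what lets RoIs of any size feed the same fully connected layers.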

Spatial Pyramid Pooling (He et al. 2015): Example from Girshick (2015)

Spatial Pyramid Pooling (He et al. 2015): Drawbacks

- While training is faster than R-CNN, it is still slow and disk-intensive
- Cannot efficiently update ConvNet parameters, so they are kept frozen
  - Each RoI's receptive field covers most of the entire image, so the forward pass is expensive across all images of a mini-batch

Fast R-CNN (Girshick 2015): Hierarchical Sampling

- Similar architecture to SPP-net
- Mini-batches constructed via hierarchical sampling: sample a similar number of RoIs over a smaller number of images
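
Hierarchical sampling can be sketched as a two-stage draw: first pick a few images, then pick RoIs within each. The function name, the dict-of-lists input, and the default values (2 images, 128 RoIs per mini-batch) are illustrative assumptions; the sketch also ignores any balancing between foreground and background RoIs.

```python
import random


def hierarchical_minibatch(rois_by_image, num_images=2, num_rois=128, rng=None):
    """Sample num_images images, then num_rois / num_images RoIs from each.

    rois_by_image maps an image id to its list of candidate RoIs.
    Returns a list of (image_id, roi) pairs.
    """
    rng = rng or random.Random()
    image_ids = rng.sample(sorted(rois_by_image), num_images)
    per_image = num_rois // num_images
    batch = []
    for image_id in image_ids:
        rois = rois_by_image[image_id]
        chosen = rng.sample(rois, min(per_image, len(rois)))
        batch.extend((image_id, roi) for roi in chosen)
    return batch
```

Because all RoIs from one image share a single forward pass through the ConvNet, a batch of 128 RoIs from 2 images needs 2 forward passes instead of up to 128.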

Fast R-CNN (Girshick 2015): Example from Girshick (2015)

You Only Look Once (Redmon et al. 2016)

- A single, unified network
- Can process 45 frames per second on a GPU (155 fps for Fast YOLO)
- Lower mAP than some R-CNN variants, but much faster
- Highest mAP of real-time detectors (≥ 30 fps)

You Only Look Once (Redmon et al. 2016): Idea

- Divides image into S × S grid
- Each grid cell predicts B bounding boxes, each as (x, y, w, h) (coordinates, width, height), and a confidence (five total predictions)
  - x, y, w, h ∈ [0, 1] (relative to image dimensions and grid cell location)
- Each cell also predicts C class probabilities
- Output is S × S × (5B + C) tensor
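
A short sketch of the output layout. It assumes (x, y) is the box center as an offset within its grid cell and (w, h) is a fraction of the whole image; the function names and the example values S = 7, B = 2, C = 20 on a 448 × 448 image are illustrative choices.

```python
def yolo_output_shape(s, b, c):
    """Shape of the prediction tensor: S x S x (5B + C)."""
    return (s, s, 5 * b + c)


def decode_box(row, col, x, y, w, h, s, img_w, img_h):
    """Convert one normalized prediction to pixels.

    Assumes (x, y) is the center offset within grid cell (row, col) and
    (w, h) is relative to the full image. Returns (center_x, center_y,
    width, height) in pixels.
    """
    center_x = (col + x) / s * img_w
    center_y = (row + y) / s * img_h
    return center_x, center_y, w * img_w, h * img_h
```

With S = 7, B = 2, and C = 20 the tensor is 7 × 7 × 30, and a box centered in the middle cell (row 3, column 3) decodes to the center of the image.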

You Only Look Once (Redmon et al. 2016): Architecture

Leaky ReLU for all layers except output, which is linear
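
For reference, a leaky ReLU passes positive inputs unchanged and scales negative inputs by a small slope. The sketch below uses a slope of 0.1 as its default; treat both the default value and the function names as assumptions of this example.

```python
def leaky_relu(x, negative_slope=0.1):
    """Identity for positive inputs, a small linear slope for negative ones."""
    return x if x > 0 else negative_slope * x


def linear(x):
    """Linear (identity) activation, as used on the output layer."""
    return x
```

Unlike a plain ReLU, the negative side keeps a nonzero gradient, so units with negative pre-activations still receive updates.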

You Only Look Once (Redmon et al. 2016): Training

- Pretrained 20 convolutional layers on ImageNet (1000 classes)
- Added 4 convolutional layers and 2 fully connected layers
- Trained to optimize weighted square loss function
  - λ_coord = 5: five times more weight on (x, y, w, h) predictions
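
The effect of the coordinate weight can be seen in a simplified squared-error loss for a single predicted box that contains an object. This is a sketch only, not the full YOLO loss: it leaves out how cells without objects are handled and any special treatment of width and height, and the function name and argument layout are hypothetical.

```python
def weighted_square_loss(pred_box, true_box, pred_conf, true_conf,
                         pred_classes, true_classes, lambda_coord=5.0):
    """Squared error with lambda_coord times more weight on (x, y, w, h).

    Simplified: one box, object present, all other terms weighted 1.
    """
    coord_err = sum((p - t) ** 2 for p, t in zip(pred_box, true_box))
    conf_err = (pred_conf - true_conf) ** 2
    class_err = sum((p - t) ** 2 for p, t in zip(pred_classes, true_classes))
    return lambda_coord * coord_err + conf_err + class_err
```

With the default weight, an error of a given size in a box coordinate costs five times as much as the same error in a confidence or class probability.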
