Performance Evaluation
of
Video Interest Point Detectors
A Thesis submitted in partial fulfillment of the requirements
for the award of the degree
Master of Technology
by
Tammina Siva Naga Srinivasu
Roll No: 124102011
under supervision of
Dr. Prabin Kumar Bora and Dr. Tony Jacob
Department of Electronics and Electrical Engineering
Indian Institute of Technology Guwahati
July 2014
D E C L A R A T I O N
I hereby declare that the work presented in the thesis entitled "Performance Evaluation of Video Interest Point Detectors", in partial fulfillment of the requirements for the award of the degree of Master of Technology at the Indian Institute of Technology Guwahati, is an authentic record of my own work, carried out under the supervision of Dr. Prabin Kumar Bora and Dr. Tony Jacob, and that other researchers' works are duly listed in the reference section. The contents of this thesis, in full or in part, have not been submitted to any other Institute or University for the award of any degree or diploma.
Tammina Siva Naga Srinivasu
July 2014
Dept. of Electronics & Electrical Engineering,
IIT Guwahati,
Guwahati, Assam,
India - 781039.
C E R T I F I C A T E
This is to certify that the work contained in the thesis entitled Performance Evaluation of Video Interest Point Detectors is a bona fide work of Tammina Siva Naga Srinivasu, Roll No. 124102011, which has been carried out in the Department of Electronics and Electrical Engineering, Indian Institute of Technology Guwahati under our supervision, and this work has not been submitted elsewhere for a degree.
Dr. Prabin Kumar Bora and Dr. Tony Jacob
July 2014 Dept. of Electronics & Electrical Engineering,
Require: Video f, spatial scales σ²_l = {σ²_{l,1}, σ²_{l,2}, ..., σ²_{l,n}} and temporal scales τ²_l = {τ²_{l,1}, τ²_{l,2}, ..., τ²_{l,n}}
1: Construct the scale space for the initial spatial and temporal scales and the integration scales.
2: Find interest points p_j = (x_j, y_j, t_j, σ²_{l,j}, τ²_{l,j}), for j = 1, 2, ..., N, over the scale space using the maxima of Equation 3.4.
3: for each interest point j = 1 to N do
4:    Compute ∇²_norm L using Equation 3.5 at the present scales (σ²_{l,j}, τ²_{l,j}) and at the scales below and above the present scales.
5:    Choose the scales (σ̃²_{l,j}, τ̃²_{l,j}) that maximize ∇²_norm L.
6:    if σ̃²_{l,j} ≠ σ²_{l,j} or τ̃²_{l,j} ≠ τ²_{l,j} then
7:        Redetect the interest point p̃_j = (x̃_j, ỹ_j, t̃_j, σ̃²_{l,j}, τ̃²_{l,j}), where σ̃²_{l,j}, τ̃²_{l,j} are the adapted scales and (x̃_j, ỹ_j, t̃_j) is the new position of the interest point closest to (x_j, y_j, t_j).
8:        Set p_j = p̃_j and go to step 4.
9:    end if
10: end for
Figure 3.6: STIP Example: Frames with interest points marked
Chapter 3. Video Interest Point Detectors 29
There are cases where spatial corners are also detected as spatio-temporal interest points even though there is no motion around the point. This can be attributed to intensity variations between frames: a spatial interest point with high variation in both spatial directions can become a spatio-temporal interest point with a high temporal gradient resulting from brightness variation or noise rather than from any motion or event of interest around it. Points surrounding a corner are also detected due to Gaussian blurring at different scales.
3.4 ESURF
Speeded Up Robust Features (SURF) [11] was extended to detect spatio-temporal interest points by Willems et al. in [3]. It uses the determinant of the 3D Hessian as the saliency measure. The box filter and integral image concepts of SURF are extended to an integral video to obtain better execution performance.
3.4.1 Hessian based Interest Point Detection and Scale Selection
Spatio-temporal interest points are detected using the 3D Hessian matrix:

H(·; σ², τ²) =
[ Lxx  Lxy  Lxt ]
[ Lyx  Lyy  Lyt ]
[ Ltx  Lty  Ltt ]

If det(H) > 0, the point is treated as an interest point. This does not imply that all eigenvalues are positive, as it would in the 2D case, so in addition to blobs, saddle points are also detected.
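As a minimal pure-Python sketch (the function names are ours, and the derivative values would come from the box-filter responses described later; the Hessian is symmetric, so Lyx = Lxy and so on), the saliency test can be written as:

```python
# Sketch of the 3D Hessian saliency test; inputs are hypothetical
# second-derivative responses at one point and scale pair.

def det3(H):
    """Determinant of a 3x3 matrix given as a list of rows."""
    a, b, c = H[0]
    d, e, f = H[1]
    g, h, i = H[2]
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def is_interest_point(Lxx, Lyy, Ltt, Lxy, Lxt, Lyt):
    """A point is kept when det(H) > 0. Note this does not force all
    eigenvalues to be positive, so saddle points pass the test too."""
    H = [[Lxx, Lxy, Lxt],
         [Lxy, Lyy, Lyt],
         [Lxt, Lyt, Ltt]]
    return det3(H) > 0
```

For example, a diagonal Hessian with entries (-1, -1, 1) has two negative eigenvalues yet a positive determinant, illustrating why saddle points are also detected.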
Scale Selection
We use γ-normalization to find the correct scales σ₀ and τ₀ at the center of a Gaussian blob. At the center of the blob, only the first term of the determinant survives and the remaining terms vanish. Thus

Lxx^γnorm Lyy^γnorm Ltt^γnorm = σ^5 τ^{5/2} Lxx Lyy Ltt    (3.6)

For scale selection using Equation 3.6, we need to optimize the two parameters σ and τ. An iterative method is used to find the scales at which det(H) attains its maximum over the scales.
3.4.2 Implementation Details
Integral Video  SURF uses the integral image, which simplifies the computation of the sum of values in a rectangular region to the addition of 4 terms. Similarly, ESURF uses the integral video concept to compute the sum of values within a rectangular volume using 8 additions. For a video of size M×N×T, the integral video at location (x, y, t) is the sum of all pixel values over the rectangular region spanned by (0, 0) and (x, y), summed over all frames [0, t]. For a video V, the integral video IV is defined in Equation 3.7.
IV(x, y, t) = Σ_{i=1..t} Σ_{j=1..y} Σ_{k=1..x} V(i, j, k)    (3.7)
ESURF uses the determinant of the 3D Hessian matrix as the saliency measure, which contains the second-order derivatives in the spatio-temporal domain: Dxx, Dyy, Dtt, Dxt, Dxy, Dyt. They are computed with rotated versions of the box filters shown in Figure 3.7(a) and Figure 3.7(b), each requiring 8 additions using the integral video.
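A minimal pure-Python sketch of Equation 3.7 and the 8-term box-volume sum, assuming the video is a hypothetical nested list V[t][y][x] of intensities (function names are ours, not from [3]):

```python
def integral_video(V):
    """IV[t][y][x] = sum of V over [0..t] x [0..y] x [0..x] (inclusive)."""
    T, Y, X = len(V), len(V[0]), len(V[0][0])
    IV = [[[0] * X for _ in range(Y)] for _ in range(T)]
    for t in range(T):
        for y in range(Y):
            row_sum = 0  # running sum along x within frame t, row y
            for x in range(X):
                row_sum += V[t][y][x]
                IV[t][y][x] = row_sum
                if y > 0:
                    IV[t][y][x] += IV[t][y - 1][x]
                if t > 0:
                    IV[t][y][x] += IV[t - 1][y][x]
                if t > 0 and y > 0:
                    IV[t][y][x] -= IV[t - 1][y - 1][x]  # double-counted region
    return IV

def box_sum(IV, t0, t1, y0, y1, x0, x1):
    """Sum of V over the inclusive volume [t0..t1] x [y0..y1] x [x0..x1],
    via 8-term inclusion-exclusion on the integral video."""
    def iv(t, y, x):
        return IV[t][y][x] if t >= 0 and y >= 0 and x >= 0 else 0
    return (iv(t1, y1, x1)
            - iv(t0 - 1, y1, x1) - iv(t1, y0 - 1, x1) - iv(t1, y1, x0 - 1)
            + iv(t0 - 1, y0 - 1, x1) + iv(t0 - 1, y1, x0 - 1)
            + iv(t1, y0 - 1, x0 - 1)
            - iv(t0 - 1, y0 - 1, x0 - 1))
```

The key property is that `box_sum` costs a constant 8 lookups regardless of the volume size, which is what makes the box-filter responses cheap at every scale.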
Figure 3.7: The two types of box filter approximations for the 2+1D second-order partial derivatives in one direction and in two directions [3]
We build a pyramid of 3 octaves, each octave consisting of 5 scales, with a scaling ratio of 1.2 between consecutive scales. Because of the box filters and the integral video, each scale is independent of the others, so this task can be parallelized. The determinant of the Hessian matrix is computed at each scale in the temporal and spatial domains. The combination of spatial and temporal octaves o_σ and o_τ results in pairs of scales (σ_i, τ_i), giving a cube structure filled with Hessian determinants. Once the cube is filled, non-maximum suppression is applied to obtain extrema in the five-dimensional spatio-temporal scale space (x, y, t, σ, τ). The points at which extrema occur are the locations of the interest points, and the scale of an interest point is the scale at which it is detected.
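The non-maximum suppression step can be sketched as follows. Here the filled cube is modeled as a hypothetical dict from five-dimensional (x, y, t, σ-index, τ-index) keys to det(H) values; a real implementation would scan a dense array and typically interpolate the extremum position.

```python
import itertools

def is_local_maximum(D, key):
    """True when D[key] is strictly larger than every existing neighbor
    in the 3^5 neighborhood of the five-dimensional index."""
    center = D[key]
    for offset in itertools.product((-1, 0, 1), repeat=5):
        if offset == (0, 0, 0, 0, 0):
            continue  # skip the point itself
        neighbor = tuple(k + o for k, o in zip(key, offset))
        if neighbor in D and D[neighbor] >= center:
            return False
    return True
```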
Example 3.4. We applied the ESURF algorithm on a sample walking sequence. As shown in Figure 3.8, the algorithm detects points around the head and jacket where there is motion between frames, along with some points in the background.
Figure 3.8: ESURF Example: Frames with interest points marked

ESURF detects blobs that are both in motion and in the background, which gives redundant information for some applications. The execution time of this algorithm is low due to the integral video concept and the approximation of derivatives with box filters. Since each octave is independent of the others, its execution can be parallelized, which is not possible for the other detectors, which work in a hierarchical way.
3.5 ELIFT
Local Invariant Feature Tracks (LIFT) [12] was proposed by Mezaris et al. Feature tracks are sets of visually similar interest points in a video that have temporal continuity; that is, a track is a set of points that are visually similar and move similarly. We extend this notion of tracks to obtain interest points in fixed-camera video by considering the magnitude and direction of motion.
3.5.1 Feature Track Extraction
Let S be a video of T frames, S = {I_t}, t = 1, ..., T. We apply SIFT [6] on each frame I_t to get feature points and feature descriptors Φ = {φ_m}, m = 1, ..., M, where M is the number of detected features and the descriptor φ_m is defined as φ_m = [φ_m^x, φ_m^y, φ_m^d]. φ_m^x and φ_m^y represent the location of the interest point, and φ_m^d describes the interest point with a 128-bin vector. An interest point φ_m from I_t can have a temporal correspondence with a point in I_{t-1}. It is found by a local search in the neighborhood of the interest point, taking a square patch of dimension 2δ + 1. A point φ_n ∈ Φ_{t-1} is said to be tracked in I_t as φ_m if it satisfies Equation 3.8:

|φ_m^x − φ_n^x| ≤ δ,
|φ_m^y − φ_n^y| ≤ δ,
d(φ_m^d, φ_n^d) ≤ d_min    (3.8)
The values δ = 7 and d_min = 0.6 are used for evaluation. If more than one point matches the current point, we take the one with the smallest distance. When such an interest point φ_n exists, the interest point φ_m ∈ Φ_t is appended to the track in which φ_n is present; otherwise φ_m is taken as the first element of a new feature track. A set of feature tracks Ψ = {ψ_k}, k = 1, ..., K, is formed by finding the temporal correspondences of all interest points in all frames of S. The feature track ψ_k is defined as ψ_k = [ψ_k^x, ψ_k^y, ψ_k^d], where ψ_k^d is the average descriptor of the feature track, obtained by element-wise averaging of the descriptors of the elements of the feature track, and ψ_k^x = [ψ_k^{x,t_k1}, ψ_k^{x,t_k1+1}, ..., ψ_k^{x,t_k2}], t_k1 < t_k2. ψ_k^y is defined in a similar way to ψ_k^x. The trajectory of the feature is ξ_k = [ψ_k^x, ψ_k^y]. If the length of the trajectory, t_k2 − t_k1, is more than 5 frames, it is considered valid; otherwise it is discarded.
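The matching rule of Equation 3.8 can be sketched in pure Python as follows, with δ and d_min as in the text. The point records and the short descriptors below are hypothetical (SIFT descriptors have 128 bins; shorter vectors are used here only for illustration), and the function names are ours.

```python
import math

DELTA = 7     # half-width of the (2*DELTA + 1) search window
D_MIN = 0.6   # maximum allowed descriptor distance

def descriptor_distance(d1, d2):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(d1, d2)))

def match_in_previous_frame(phi_m, prev_points):
    """Return the point phi_n from the previous frame that lies within the
    search window and within D_MIN in descriptor distance, or None.
    Multiple candidates are resolved by the smallest descriptor distance."""
    best, best_d = None, D_MIN
    for phi_n in prev_points:
        if (abs(phi_m["x"] - phi_n["x"]) <= DELTA
                and abs(phi_m["y"] - phi_n["y"]) <= DELTA):
            d = descriptor_distance(phi_m["d"], phi_n["d"])
            if d <= best_d:
                best, best_d = phi_n, d
    return best
```

When `match_in_previous_frame` returns a point, the current point is appended to that point's track; when it returns None, a new track is started.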
3.5.2 Interest Points on Feature Track
Interest points are the points on a feature track with special properties, such as points where the track changes its direction or motion. Select a feature track ψ_k with trajectory ξ_k. Algorithm 3 shows the steps followed to detect interest points on a feature track. Only tracks with average displacement greater than a threshold are considered for detection, because stationary background points have no motion and are not truly spatio-temporal interest points, as shown in Figure 3.9(a). Points that have non-uniform displacement are interest points on tracks, and they appear as local maxima or minima in the track. The points where a track changes its direction, or where it deviates from its normal path by more than a threshold, are taken as interest points, as shown in Figure 3.9(b).
Example 3.5. We applied the ELIFT algorithm on a sample walking sequence. As shown in Figure 3.10(b), the algorithm detects points around the head and jacket where there is motion between frames, along with some points in the background due to round-off errors. The points that were tracked are shown in Figure 3.10(a). There are 1782 feature tracks with 49577 tracked points, which were reduced to 2046 interest points.
Figure 3.9: (a) Track with no motion; (b) track with motion and deviation in path
Algorithm 3 Detection of Interest Points on Feature Tracks
Require: Feature tracks Ψ
Ensure: Ψ ≠ ∅
1: Interest points P = ∅
2: for each feature track ψ_k, k = 1 to K do
3:    Calculate the displacements in the X and Y directions, xd and yd
4:    Calculate the average displacements xd_avg and yd_avg
5:    if xd_avg > 1 && yd_avg > 1 then
6:        Calculate the direction of displacement d = tan⁻¹(yd/xd)
7:        for every point p_i ∈ ψ_k do
8:            if p_i is a local maximum or minimum in x or y then
9:                P = P ∪ p_i
10:               break
11:           end if
12:           if direction d_i is opposite to d_{i−1} then
13:               P = P ∪ p_i
14:               break
15:           end if
16:           if |d_i − d_{i−1}| > 40 then
17:               P = P ∪ p_i
18:           end if
19:       end for
20:   end if
21: end for
22: RETURN P
The main advantage of this algorithm is that it detects points with motion without computing optical flow, which is computationally expensive. Some background points still persist. This is due to features jumping between tracks, which arises when the spatial distance between them is very small: they have almost the same 16×16 patch considered for descriptor computation in SIFT, so the distance between their descriptors is very small. This problem can be addressed by increasing d_min in Equation 3.8, but then actual interest points are also filtered out, so there has to be a trade-off.

Figure 3.10: (a) the traced features (LIFT) and (b) the modified interest points (ELIFT)
3.6 Evaluation
3.6.1 Evaluation Criteria
We considered repeatability and execution time, as in [14], for measuring performance.
Repeatability
Repeatability determines what percentage of spatio-temporal interest points is retained in spite of geometric transforms and photometric deformations. The higher this rate, the more robust the detector is to the transformations. It is defined as the ratio of the number of matched points to the mean of the numbers of interest points detected in the two videos, given by Equation 3.9:

r_{1,2} = c(V1, V2) / mean(n1, n2)    (3.9)

where c(V1, V2) is the number of matched pairs of points, and n1 and n2 represent the numbers of interest points detected in V1 and V2, respectively.
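Equation 3.9 reduces to a one-line computation; in practice, c, n1 and n2 would come from a point matcher and the two detector runs (the values in the test below are hypothetical).

```python
def repeatability(matched, n1, n2):
    """r = c(V1, V2) / mean(n1, n2), returned as a percentage."""
    if n1 + n2 == 0:
        return 0.0  # no points detected in either video
    return 100.0 * matched / ((n1 + n2) / 2.0)
```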
Execution Time
Execution time is the time taken by a detector to detect the interest points in a video. It measures the computational complexity of the detector. We also measure the number of points detected and determine whether they are sparse or dense.
3.6.2 Experimental Results
All programs were executed on an Intel Core 2 Duo computer with 16 GB RAM, running the Windows 7 Professional operating system. Table 3.1 shows the execution time and the number of interest points detected. A video of 1000 frames at 1280×720 (YouTube 720p), with a frame rate of 25 fps, is used for evaluation.
Brightness change
The video is subjected to a uniform change in intensity, I_new = I_frame + d, where d is a constant varying from −40 to 40; image intensity is limited to [0, 255]. The corresponding performance characteristics of the detectors are shown in Figure 3.11. n-SIFT, ESURF and STIP performed better than MoSIFT and ELIFT. The numbers of interest points are smaller for ELIFT and MoSIFT, and saturation of intensity at the lower or upper limit may remove some of the points, so they are sensitive to brightness change.
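The brightness deformation used here is simple to reproduce; a sketch, assuming a frame is a hypothetical list of rows of 8-bit intensities:

```python
def shift_brightness(frame, d):
    """Add a constant d to every pixel and clip to the 8-bit range [0, 255],
    as in I_new = I_frame + d. Saturation at 0 or 255 is what removes some
    interest points at the extreme values of d."""
    return [[min(255, max(0, p + d)) for p in row] for row in frame]
```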
Figure 3.11: Brightness change vs. repeatability
Figure 3.12: Scale change vs. repeatability
Scale changes
The videos and their spatially scaled versions, from 50% to 200%, are used for evaluating the robustness of the detectors to scaling. The result is shown in Figure 3.12. The performance of all the algorithms is quite poor with downscaling of the video, but with upscaling there is some improvement. This does not augur well for matching or aligning videos of different scales.
Compression
The video is compressed in MPEG format at quality levels in the range [1, 50], and the results are presented in Figure 3.13. Since the compression is lossy, it changes the pixel values and hence the gradients, so every detector is expected to show a decrease in repeatability as compression increases. n-SIFT and STIP have shown better repeatability, while ESURF, which uses box filters and the integral video concept, accumulates additional losses of information and thus shows low repeatability.

Figure 3.13: Compression change vs. repeatability
Execution Time
A comparison of execution times and numbers of interest points detected is given in Table 3.1. It is observed that ESURF has the best execution time, which derives from its use of the integral video. MoSIFT and n-SIFT require more execution time. MoSIFT and ELIFT detect fewer interest points, with fewer background points. n-SIFT and ESURF detect a large number of points, which are dense. n-SIFT, STIP and ESURF detect background points along with points with motion.
Detector    Time of Execution (sec)    No. of points detected
n-SIFT      1404.954570                31288
STIP        1127.339653                130261
ESURF       113.326833                 261960
MoSIFT      1829.52802                 72723
ELIFT       737.707349                 12046

Table 3.1: Execution Time Comparison
Chapter 4
Conclusions and Future Scope
4.1 Conclusion
This thesis mainly dealt with spatio-temporal interest point detectors for video and their stability under geometric and photometric deformations. The SUSAN detector, the Harris corner detector, the Harris-Laplace detector and the SIFT detector for spatial interest points have been studied, and their performance has been evaluated. SIFT exhibits better repeatability than the other detectors under image deformation. SIFT is also a better package, providing robust interest point detection and a descriptor along with suitable matching techniques. The spatio-temporal extensions of these image interest point detectors have been studied, and their performance has been evaluated in terms of repeatability and execution time. n-SIFT has better repeatability than the others under video deformations. The proposed ELIFT detects the fewest interest points and fewer points without motion. MoSIFT detects the second-fewest points, while ESURF detects the most interest points. ESURF has the lowest execution time, followed by ELIFT. While ESURF has problems in detecting spatio-temporal interest points with motion, n-SIFT detects spatial interest points both with and without motion.
4.2 Future Scope
The probable extensions of the work done in this thesis are the following:
• Further study is required to analyze stability under temporal scaling, i.e., different frame rates.
• The performance of the algorithms under non-uniform brightness variation needs to be evaluated.
• Detector performance, along with the corresponding descriptors, needs to be analyzed for different applications such as video classification, video alignment and action recognition.
• ELIFT needs to be extended to moving-camera video applications.
• Camera motion estimation and evaluation of the video detectors on video sequences with a moving camera, along with alignment of such sequences, present another challenging research topic.
Bibliography
[1] W. Cheung and G. Hamarneh. N-SIFT: N-dimensional scale invariant feature transform for matching medical images. In Biomedical Imaging: From Nano to Macro, 2007. ISBI 2007. 4th IEEE International Symposium on, pages 720-723, April 2007. doi: 10.1109/ISBI.2007.356953.
[2] Ming-yu Chen and Alexander Hauptmann. MoSIFT: Recognizing human actions in surveillance videos. 2009.
[3] Geert Willems, Tinne Tuytelaars, and Luc Van Gool. An efficient dense and scale-invariant spatio-temporal interest point detector. In Computer Vision - ECCV 2008, pages 650-663. Springer, 2008.
[4] Tinne Tuytelaars and Krystian Mikolajczyk. Local invariant feature detectors: a