Research Article
Video Scene Detection Using Compact Bag of Visual Word Models
Muhammad Haroon,1 Junaid Baber,1 Ihsan Ullah,1 Sher Muhammad Daudpota,2 Maheen Bakhtyar,1 and Varsha Devi3
1Department of Computer Science & IT, University of Balochistan, Pakistan
2Department of Computer Science, Sukkur IBA University, Pakistan
3Department of Computer Science, Sardar Bahadur Khan Women's University, Pakistan
Correspondence should be addressed to Muhammad Haroon haroonsdbagmailcom
Received 18 May 2018; Revised 14 August 2018; Accepted 3 October 2018; Published 8 November 2018
Academic Editor: Deepu Rajan
Copyright © 2018 Muhammad Haroon et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Video segmentation into shots is the first step for video indexing and searching. Video shots are mostly very short in duration and do not give meaningful insight into the visual content. However, grouping shots based on similar visual content gives a better understanding of the video scene; this grouping of similar shots is known as scene boundary detection or video segmentation into scenes. In this paper, we propose a model for video segmentation into visual scenes using the bag of visual words (BoVW) model. Initially, the video is divided into shots, which are then represented by a set of key frames. Key frames are further represented by BoVW feature vectors, which are quite short and compact compared to classical BoVW model implementations. Two variations of the BoVW model are used: (1) the classical BoVW model and (2) the Vector of Locally Aggregated Descriptors (VLAD), which is an extension of the classical BoVW model. The similarity of shots is computed from the distances between their key frame feature vectors within a sliding window of length L, rather than by comparing each shot with a very long list of shots as previously practiced; the value of L is 4. Experiments on cinematic and drama videos show the effectiveness of the proposed framework. In the proposed model, the BoVW vector is 25000-dimensional and the VLAD vector is only 2048-dimensional. BoVW achieves 0.90 segmentation accuracy, whereas VLAD achieves 0.83.
1. Introduction
The size of video databases is increasing exponentially due to the emergence of cheap and fast Internet access, and the indexing and retrieval of videos are becoming more difficult. User expectations are high due to advancements in technology. Giant video portals such as YouTube, Dailymotion, and Google are investing heavily in efficient and smart indexing and retrieval so that their portals remain attractive and addictive to users.
To process videos for indexing and searching, the first task is to segment the videos into shots and extract representative frames, known as key frames, from each shot. These key frames are later used for searching, efficient indexing, scene generation, and video classification. The main idea behind selecting key frames is to reduce the computational cost, as a video is a collection of frames stored in temporal order; for example, videos uploaded to YouTube typically have 30 frames per second or more. The more frames per second, the better the visual effect. Even with very sophisticated hardware, not all frames can be processed in real-time applications such as event detection from CCTV streams. Processing one frame to identify possible objects takes 0.5 to 1.5 seconds (a cascade object detector is used to identify possible text boards in the frame using Matlab).
In video scene segmentation, the video is divided into shots, and similar shots are combined to form scenes. Shots are uninterrupted, continuous sequences of video frames in which there is no change in theme or camera [1]. Generally, shot boundaries can be categorized into two types: abrupt and gradual. An abrupt shot boundary is a sudden change in the scene, such as a change of speaker during a TV interview, whereas gradual shot boundaries take several frames to change the shot, as in fades and dissolves. In videos, many shots repeat within a very short interval of time; when those shots are combined, the resulting collections of shots are called scenes. For example, if two actors are talking, the camera keeps switching between the actors with very little change in background, and a two-minute conversation clip sometimes contains 25-30 shots. Scene detection, also known as scene boundary detection or video scene segmentation, is the task of merging similar or repeating shots into one clip, or of dividing videos into clips that are semantically or visually related or similar.
Hindawi, Advances in Multimedia, Volume 2018, Article ID 2564963, 9 pages, https://doi.org/10.1155/2018/2564963
Manual segmentation of videos for websites and DVDs is very time consuming and not feasible when dealing with large datasets. Recently, automatic video segmentation into shots and scenes has gained wide attention from industry and researchers [2–5].
In the proposed methodology, videos are segmented at abrupt shot boundaries, and the resulting shots are grouped on the basis of similarity to construct scenes. The proposed methodology is inspired by the BoVW model for scene detection [2]; the abstract flow diagram is shown in Figure 1. In the bag of visual words model, local key point descriptors extracted from the key frames of the shots are represented by histograms of visual words. These key frames are matched based on their bag of visual words histograms in a sliding window of length $L$ [3]. It has been shown that matching shots in sliding windows is more efficient [2, 3]. The classical BoVW model and VLAD are used with compact vocabularies without compromising accuracy.
The rest of the paper is organized as follows. Section 2 presents the related work, divided into three subsections: (1) shot boundary detection, (2) key frame extraction, and (3) scene boundary detection. Section 3 presents the proposed methodology, Section 4 reports the experiments and results, and Section 5 concludes the paper.
2. Related Work
In this section, a brief literature review and state-of-the-art methodologies are presented for the main steps of video segmentation, which include shot boundary detection, key frame extraction, and scene boundary detection.
2.1. Shot Boundary Detection. In video indexing and searching, the first and foremost step is shot boundary detection. As mentioned earlier, there are two types of shot boundaries: abrupt and gradual. An abrupt shot boundary is a sudden change in the stream: if the dissimilarity between two consecutive frames is very large, then either of the adjacent frames is considered the boundary. A gradual shot boundary is a gradual change in the video, produced by effects such as fade-in, fade-out, and dissolves.
Let $F = \{f_1, f_2, \ldots, f_n\}$ be the frames of a video; $f_i$ is said to be an abrupt shot boundary if and only if the difference between $f_i$ and $f_{i+1}$ is greater than a threshold $\tau$. In our experiments, we do not take gradual shot detection into consideration, as the ratio of gradual shots in any cinematic video is very small: more than 90% of shot boundaries are abrupt boundaries [2, 3].
Figure 1: Abstract flow diagram of the proposed framework (video shot boundary detection, key frame selection, feature extraction and quantization, similarity of key frames in a sliding window, and grouping of shots into scenes).
There is a long list of methodologies for shot boundary detection. Early work used the pixel-to-pixel difference between consecutive video frames, that is, $f_i$ and $f_{i+1}$, to segment the video [6]. In this technique, if the sum of the pixel differences is greater than some threshold, the frame is considered an abrupt shot boundary.
Later, other researchers proposed using pixel intensity histograms of successive frames, instead of the pixel-to-pixel difference, to detect abrupt shot boundaries [7, 8]. These techniques work well except that they are sensitive to the motion of objects and the camera [9].
Moreover, a later approach [10] detects shot boundaries based on the mutual information and joint entropy between consecutive frames. A sports dataset was used to evaluate the shot boundary detection. The joint entropy technique is useful for fades and other gradual boundaries. The entropy is high for an extended time period during fade-in because the visual intensity gradually increases, and the entropy is low during fade-out as the intensity slowly decreases.
Video segmentation based on pixel differences between frames and on pixel intensity histograms has been presented in [11, 12]. A frame is marked as an abrupt boundary when the amount of pixel change between two frames exceeds some threshold.
Chavez et al. [13] proposed a different technique that uses supervised learning with a support vector machine (SVM) to separate abrupt boundaries from gradual boundaries. The authors calculate a dissimilarity vector that combines a set of different features, including Fourier-Mellin moments, Zernike moments, and color histograms (RGB and HSV), to capture information such as illumination changes and rapid motion. This vector is then used in the SVM for the detection of shot boundaries. The authors also used illumination changes for detecting gradual shot boundaries.
Furthermore, [14] proposed a learning-based technique that has three main steps:
(1) First, frames that have smooth changes are removed.
(2) Second, three types of feature differences are extracted: the intensity difference, the difference in vertical and horizontal edge histograms, and the difference between the HSV color histograms are calculated from the shot boundaries.
(3) Last, the authors detect the gradual boundaries in the video using a technique named temporal multiresolution analysis.
Several other works use different methodologies for different kinds of shots. One such work [15] used different techniques for abrupt and gradual shots by utilizing SIFT and SVM. Their methodology comprises a few main steps, which are given below:
(1) In the first step, they select the shot boundary frames from the video using the difference between the color histograms of two consecutive frames.
(2) In the second step, they extract the SIFT features from the frames selected as shot boundaries.
(3) Last, they use different approaches for abrupt and gradual shot boundaries by using SIFT and SVM. SIFT is considered to be the most efficient, effective, and widely used feature in state-of-the-art techniques.
Although SIFT is considered the most widely used feature extraction technique, it still has some downsides compared to SURF. The SIFT feature vector is high dimensional, that is, 128-D, whereas the SURF vector is only 64-D. SIFT is also slower than SURF. Moreover, Baber et al. [3] proposed a technique for shot boundary detection using two different feature extraction methods, SURF and entropy. Their method consists of several steps in which they detect shot boundaries (abrupt and gradual) and differentiate gradual boundaries from abrupt shot boundaries. The steps are as follows:
(1) In the first step, the fade boundaries are detected by analyzing the entropy pattern during fade effects.
(2) After the detection of fade shot boundaries, the other kind of shot boundary, the abrupt shot boundary, is detected by using the entropy difference between two consecutive frames. If the difference between two consecutive frames is higher than the threshold $\tau$, then it is considered an abrupt shot boundary.
(3) SURF is used for removing the false negative boundaries.
2.2. Key Frame Extraction. Most researchers use shot boundary detection as an important step in extracting representative key frames from videos. Representative key frames are the particular frames that describe the whole content of a particular scene in the video. Each video may contain one or more key frames, depending on the scenes or content in the video.
Shot boundary detection is one of the most crucial steps in our problem for finding representative key frames from the videos, as scene detection is based entirely on these representative key frames. Baber et al. [5] used the entropy difference between two consecutive frames to find shot boundaries. If the contents of two consecutive frames, say $f_i$ and $f_{i+1}$, are different and their entropy difference is greater than a specified threshold, then $f_i$ is said to be a shot boundary and is considered a representative key frame.
In our methodology, we first calculate the entropy of each video frame and then record the difference between each pair of consecutive frames. A frame whose difference is greater than the threshold $\tau$ is considered a representative key frame. Entropy is a statistical measure of randomness that can be used to characterize the texture of the input image. Mathematically, entropy is defined as
$$E_x = -\sum_{y} H_y \log(H_y), \tag{1}$$
where $H$ is the normalized histogram of the grayscale image's pixel intensities.
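As a minimal sketch of equation (1), the entropy of one frame could be computed as below. It assumes 8-bit grayscale frames stored as NumPy arrays and a base-2 logarithm (the text does not fix the base); the function name `frame_entropy` is illustrative and does not come from the original work.

```python
import numpy as np


def frame_entropy(gray: np.ndarray) -> float:
    """Entropy of an 8-bit grayscale frame, following equation (1).

    Assumption: `gray` has dtype uint8 and the logarithm is base 2.
    """
    # Count how often each of the 256 possible gray levels occurs.
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    h = hist / hist.sum()   # normalized histogram H
    h = h[h > 0]            # empty bins contribute 0 (0 * log 0 is taken as 0)
    return float(-np.sum(h * np.log2(h)))
```

A constant frame has entropy 0, and a frame split evenly between two gray levels has entropy 1 bit under the base-2 assumption.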
2.3. Scene Detection. In the first stage, the video is segmented into shots, and semantically similar shots are merged to form scenes. Scenes are categorized into various classes, such as conversation, indoor, and outdoor scenes. Much important research has been published on video segmentation into scenes using different types of videos, for example, cinematic videos, drama serials (indoor and outdoor), video lectures, and documentaries. Although a lot of work has been reported on the segmentation of video into scenes, there is still a gap in addressing the challenges of cinematic videos. Commonly, two types of features are extracted from videos for segmentation, that is, audio and visual. We focus on the visual features in our research.
Yeung et al. [16] proposed a technique that uses the scene transition graph (STG) to segment the video. Each node in the graph represents a shot, and the edges are defined by temporal relationship and visual similarity. The graph is then divided into subgraphs, and these subgraphs, which are based on the color similarity of the shots, are considered scenes.
Rasheed et al. [17] proposed an effective technique for scene detection in Hollywood movies and TV shows. As features, they used the motion, color, and length of the shots. In the initial step, they cluster the shots by using Backward Shot Coherence (BSC). Next, by calculating the color similarity between shots, they detect potential scene boundaries, and after that they remove the false negatives from the potential scene boundaries by scene dynamics, which is based on the motion and length of the shot.
Many recent authors have worked on video scene segmentation and proposed new techniques for this problem. Some researchers used multimodal fusion of optimally grouped features with a dynamic programming scheme [18–20]. Their methodology includes a few steps, the first of which is to divide the video into shots and then cluster the shots using a clustering technique. In [19], the authors proposed a technique known as intermediate fusion, which uses all the information from different modalities. They formulated the problem as an optimization problem and solved it via dynamic programming [19]. In earlier work [18], the same authors proposed dividing the video into scenes using its sequential structure. In this technique, they decided on candidate locations for video segmentation and inspected only the corresponding partitioning possibilities. The video is represented by sets of features, with a distance metric defined between them. The segmentation depends purely on the input features and the distance metric [18].
Furthermore, a different technique was proposed that uses spectral clustering with automatic selection of the number of clusters and extracts the normalized histogram of each shot. Bhattacharyya distance and temporal distance are used as the distance metric. The authors report that the clustering is not consistent and that adjacent shots can belong to different clusters [20].
Sakarya et al. [21] used a graph construction technique for the segmentation of video into scenes. They construct a graph weighted by a temporal and spatial similarity function. From this graph, the dominant shots are detected, and, as a temporal consistency constraint, the edges of the scene are determined via the mean and standard deviation of the shot positions. This process continues until the whole video is allocated to scenes. Lin et al. [22] used color histograms for shot boundary detection and then formed scenes by merging similar shots, identifying the local minima and maxima to determine the scene transitions.
Baraldi et al. [4] used another approach for shot and scene detection, based on color histograms and clustering, respectively. The authors first detect the shots using the color histogram; they then cluster the shots using a hierarchical K-means clustering technique, creating N clusters for N shots. Each shot is assigned to its own cluster; the least dissimilar shots are then found using the distance metric, and the two clusters with the least distance are merged. This process continues until all the scenes are detected and the whole video is covered.
Chen et al. [23] proposed a new approach for scene detection from H.264 video sequences. They define a scene change factor, which is used to reserve bits for each frame. Their methodology has a reduced rate error and was found to be better when compared with the JVT-G012 algorithm. The work of [24] proposed a novel technique for scene change detection, especially for H.264/AVC encoded video sequences, taking into consideration the design and performance evaluation of the system. They further worked with a dynamic threshold, which adapts to and tracks different descriptors, and increased the accuracy of the system by locating true scenes in the videos.
3. Proposed Methodology for Scene Detection
The proposed framework comprises shot boundary detection, key frame extraction, local key point descriptor extraction from key frames, feature quantization, and scene boundary detection.
Figure 2: Sensitivity of $\tau_A$ on the movie Pink Panther (2006). x-axis: entropy difference (%); y-axis: precision, recall, and F-score.
3.1. Shot Boundary Detection. Shot boundary detection is the primary step for any kind of video operation. There are a number of frameworks for shot boundary detection. We use shot boundary detection based on entropy differences [5, 26]. The entropy is computed for each frame, and the differences between adjacent frames are computed. The frame $f_i$ is considered a shot boundary, specifically an abrupt shot boundary, if the entropy difference between $f_i$ and $f_{i+1}$ is greater than the predefined threshold $\tau_A$ [2, 3, 5]. It can be written as
$$\mathcal{B}(f_i) = \begin{cases} \text{True} & \text{if } \mathcal{D}(f_i, f_{i+1}) > \tau_A, \\ \text{False} & \text{otherwise.} \end{cases} \tag{2}$$
$\mathcal{B}(\cdot)$ decides whether the given frame $f_i$ is a shot boundary or not, and $\mathcal{D}$ computes the dissimilarity or difference between adjacent frames. A high value of $\tau_A$ gives better precision with poor recall, and a low value gives better recall with poor precision, as shown in Figure 2. In our experiments, $\tau_A$ is set experimentally to the value that gives a high F-score.
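Equation (2) can be illustrated with a short sketch. It assumes that $\mathcal{D}$ is the absolute difference between the entropies of adjacent frames, which is one reading of "entropy difference" and is not confirmed by the text; the function name `abrupt_shot_boundaries` and the numbers in the usage note are hypothetical.

```python
import numpy as np


def abrupt_shot_boundaries(entropies, tau_a):
    """Return the indices i for which D(f_i, f_{i+1}) > tau_a, as in equation (2).

    Assumption: D is the absolute difference of per-frame entropies.
    """
    e = np.asarray(entropies, dtype=np.float64)
    diffs = np.abs(np.diff(e))   # one difference per pair of adjacent frames
    return np.flatnonzero(diffs > tau_a).tolist()
```

For per-frame entropies `[5.0, 5.1, 5.05, 6.4, 6.5, 6.45, 4.0]` and `tau_a = 0.5`, the sketch marks frames 2 and 5 as abrupt boundaries. Raising `tau_a` removes weaker candidates (better precision, poorer recall), and lowering it does the opposite, which matches the trade-off described for Figure 2.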
3.2. Key Frame and Local Key Point Descriptors Extraction. Let $\mathcal{S} = \{s_1, s_2, \ldots, s_n\}$ be the set of all shots delimited by the detected shot boundaries. One key frame or a set of key frames is selected from each shot. There are a number of ways to select representative frames, also known as key frames, from each shot. Since the entropies are already computed in the shot boundary detection process, an entropy-based key frame selection criterion is used [3].
For any given shot $s_i \in \mathcal{S}$, the frame with maximum entropy is selected as the key frame. It has been shown experimentally that the larger the entropy, the denser the contents of the frame, so such a frame represents the shot precisely. The shots are now represented by key frames, denoted by $\mathcal{F} = \{f_{s_1}, f_{s_2}, \ldots, f_{s_n}\}$, where $f_{s_i}$ denotes the key frame of shot $s_i$.
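The maximum-entropy selection rule can be sketched as follows. The sketch assumes that a boundary index `b` closes a shot that ends at frame `b`, so the next shot starts at frame `b + 1`; this convention and the name `select_key_frames` are illustrative assumptions.

```python
import numpy as np


def select_key_frames(entropies, boundaries):
    """Pick, for every shot, the index of the frame with maximum entropy.

    Assumption: each boundary index b is the last frame of a shot.
    """
    e = np.asarray(entropies, dtype=np.float64)
    key_frames = []
    start = 0
    # The last frame of the video always closes the final shot.
    for b in list(boundaries) + [len(e) - 1]:
        if b < start:
            continue  # the video already ended on a boundary
        key_frames.append(start + int(np.argmax(e[start:b + 1])))
        start = b + 1
    return key_frames
```

Because the entropies are reused from shot boundary detection, key frame selection adds only one pass over the per-frame values.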
Two images can be matched if they are similar according to some similarity criterion. Similarity is computed between the features of the images. SIFT [27] is widely used as an image feature in various applications of computer vision and video processing. For any given image, key points are detected, and those key points are represented by descriptors such as SIFT. On average, there are 2-3 thousand key points in a single image, which makes matching very expensive and exhaustive, as a single image is represented by 2-3 thousand feature vectors. Matching two images of size 800 × 600 takes 2 seconds on average on commodity hardware. If one image has to be matched with several hundred or several thousand images, it is not practical to use SIFT or any other raw descriptors. Quantization is used to reduce the feature space.
3.3. Quantization: BoVW Model. The bag of visual words model is widely used for feature quantization. Every key point descriptor $x_j \in \mathbb{R}^d$ is quantized to one of a finite number of centroids, indexed from 1 to $k$, where $k$ denotes the total number of centroids, also known as visual words, denoted by $\mathcal{V} = \{v_1, v_2, \ldots, v_k\}$ with each $v_i \in \mathbb{R}^d$. Let a frame $f$ be represented by local key point descriptors $f_X = \{x_1, x_2, \ldots, x_m\}$, where $x_i \in \mathbb{R}^d$. In the BoVW model, a function $\mathcal{G}$ is defined as
$$\mathcal{G} : \mathbb{R}^d \longmapsto [1, k], \qquad x_i \longmapsto \mathcal{G}(x_i). \tag{3}$$
$\mathcal{G}$ maps a descriptor $x_i \in \mathbb{R}^d$ to an integer index. For a given frame $f$, the bag of visual words $\mathcal{I} = \{\mu_1, \mu_2, \ldots, \mu_k\}$ is computed, where $\mu_i$ indicates the number of times $v_i$ appears in frame $f$; $\mathcal{I}$ is unit normalized at the end. Mostly, $k$-means or hierarchical $k$-means clustering is applied to obtain the centroids (visual words) $\mathcal{V}$. The value of $k$ is kept very large for image matching or retrieval applications; the suggested value of $k$ is 1 million. The accuracy of quantization depends mainly on the value of $k$: if the value is small, two different key point descriptors will be quantized to the same visual word, which decreases distinctiveness, and if the value is very large, two similar key point descriptors that are slightly distorted can be assigned to different visual words, which decreases robustness [28].
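The quantization in equation (3) and the resulting histogram $\mathcal{I}$ can be sketched with NumPy. The sketch assumes nearest-centroid assignment under Euclidean distance and L2 unit normalization (the text says only "unit normalized"); the brute-force distance matrix is suitable for small examples only, and the function names are illustrative.

```python
import numpy as np


def assign_words(descriptors, vocabulary):
    """Map each descriptor (row) to the index of its nearest visual word."""
    # Squared Euclidean distances, shape (m descriptors, k words).
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def bovw_histogram(descriptors, vocabulary):
    """Unit-normalized histogram of visual word occurrences for one frame."""
    words = assign_words(descriptors, vocabulary)
    hist = np.bincount(words, minlength=len(vocabulary)).astype(np.float64)
    norm = np.linalg.norm(hist)   # assumption: L2 normalization
    return hist / norm if norm > 0 else hist
```

With a vocabulary of $k$ words, the returned vector has $k$ entries regardless of how many key points the frame contains, which is what makes frame-to-frame comparison cheap.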
In video segmentation, the scenario differs from searching or matching one image against a very large database whose images exhibit severe transformations, such as changes in illumination, scale, and viewpoint, and scenes captured at different times. In video segmentation, an image is matched with only a few other images, 4 to 7, in a sliding window, and these images have only slightly different contents. Each image in the sliding window is a key frame that represents a shot; an example of sliding window matching is shown in Figure 3.
In the proposed framework, the value of $k$ is kept far smaller than the value suggested in the literature [2], without compromising segmentation accuracy. In our experiments, $k = 25000$ gives approximately the same accuracy as $k = 500000$, which was used in our previous work [2]. For this experiment, the value of $k$ was gradually increased from 5000 to 30000 in steps of 1000, and $k = 25000$ was found to give approximately the same accuracy as our previous work [2].
3.4. Quantization: VLAD Model. VLAD is an emerging quantization framework for local key point descriptors [29]. Instead of computing the histogram of visual words, it computes the sum of the residuals between the descriptors and their assigned visual words and concatenates the sums into a single vector of dimension $d \times k$. Let $\mathcal{G}_V$ be the VLAD quantization function [30]:
$$\mathcal{G}_V : \mathbb{R}^d \longmapsto v_j \in \mathcal{V}, \qquad x_i \longmapsto \mathcal{G}_V(x_i) = \arg\min_{v \in \mathcal{V}} \lVert x_i - v \rVert^2. \tag{4}$$
The VLAD is computed in three steps:
(1) offline, the visual words $\mathcal{V}$ are obtained;
(2) all the key point descriptors obtained from the given frame, $f_X$, are quantized using (4);
(3) VLAD is computed for the given frame, $\mathcal{J}_f = \{j_1, j_2, \ldots, j_k\}$, where each $j_q$ is a $d$-dimensional vector obtained as follows:
$$j_q = \sum_{X : \mathcal{G}_V(X) = v_q} (X - v_q). \tag{5}$$
$\mathcal{J}_f$ is a $d \times k$ dimensional feature. In the case of SIFT, $d = 128$, and the recommended values are $k \in \{64, 128, 256\}$ [29]. As stated above, video segmentation does not require very large values of $k$. In our experiments, the value of $k$ for VLAD is 16, so $\mathcal{J}$ using SIFT is $128 \times 16 = 2048$ dimensional. $\mathcal{J}$ is unit normalized at the end. The vector is very compact without loss of accuracy, as shown in the experiments.
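The three VLAD steps can be sketched as below, under the assumptions that the vocabulary is already available as a $k \times d$ array, that assignment uses squared Euclidean distance as in (4), and that the final normalization is L2. The function name `vlad` is illustrative, and the sketch omits refinements that some VLAD implementations add.

```python
import numpy as np


def vlad(descriptors, vocabulary):
    """VLAD vector of one frame: per-word residual sums, concatenated.

    descriptors: (m, d) array; vocabulary: (k, d) array of visual words.
    """
    k, d = vocabulary.shape
    # Step (2): quantize every descriptor to its nearest visual word.
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    # Step (3): sum the residuals of the descriptors assigned to each word.
    blocks = np.zeros((k, d), dtype=np.float64)
    for q in range(k):
        assigned = descriptors[nearest == q]
        if len(assigned):
            blocks[q] = (assigned - vocabulary[q]).sum(axis=0)
    v = blocks.ravel()            # concatenation, d * k values
    norm = np.linalg.norm(v)      # assumption: L2 unit normalization
    return v / norm if norm > 0 else v
```

With $d = 128$ and $k = 16$, the returned vector has $128 \times 16 = 2048$ entries, matching the dimensionality stated above.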
3.5. Scene Boundary Detection. Algorithm 1 is used to find the scene boundaries [2]. $\mathcal{H}$ denotes the feature vectors of the key frames; the feature vectors are either the VLAD or the BoVW vectors explained in the previous sections. The similarity between two key frames is decided by the function $\mathcal{D}$, which is computed as follows:
$$\mathcal{D}(H_i, H_j) = \sum_{q=1}^{N} \min(h_{iq}, h_{jq}), \tag{6}$$
where $h_{iq}$ denotes the $q$th component of $H_i$ and $N$ is the number of components. Two key frames are treated as similar if $\mathcal{D}(\cdot) > \tau_s$. The value of $\tau_s$ is set to the average of the minimum and maximum similarities of similar shots on a subset of the videos used in the experiments. The average similarity score is widely used as the value of $\tau_s$; in our experiments, however, the average of the similarity scores gives low segmentation accuracy, that is, 0.713.
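Equation (6) and the choice of $\tau_s$ can be sketched as follows. The sketch assumes nonnegative feature vectors of equal length, and the helper `midrange_threshold` is a hypothetical name for "the average of the minimum and maximum similarities"; the scores in the usage note are made up for illustration.

```python
import numpy as np


def histogram_intersection(h_i, h_j):
    """Equation (6): sum of the component-wise minima of two feature vectors."""
    return float(np.minimum(h_i, h_j).sum())


def midrange_threshold(similar_pair_scores):
    """Average of the smallest and largest similarity among known similar shots."""
    s = np.asarray(similar_pair_scores, dtype=np.float64)
    return float((s.min() + s.max()) / 2.0)
```

For example, scores of 0.4, 0.9, and 0.6 for known similar pairs give a midrange threshold of 0.65, whereas their mean is about 0.63; the two choices generally differ, which is why they can lead to different segmentation accuracy.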
4. Experiments and Results
Cinematic and drama videos are used for scene boundary detection; the list of movies and dramas is given in Table 1. The F-score is used as the performance metric for scene boundary detection. There is no benchmark dataset. Two strategies have been used to obtain the ground truth: first party and third party ground truth. First party ground truth is generated by the authors, and third party ground truth is collected from experts who have adequate knowledge of shots and scene boundaries [2, 3]. To make the ground truth unbiased, the third party approach is used in our experiments [3, 5, 26].
Figure 3: Example of key frame matching in the sliding window of length $L = 3$. Each frame represents a shot, and there are 9 consecutive shots, $f_{s_1}, \ldots, f_{s_9}$. Each key frame $f_{s_i}$ is matched with its next three neighbors.
Require: $\mathcal{H} = \{H_1, H_2, \ldots, H_n\}$
(1) $A[1] \leftarrow 1$
(2) $u \leftarrow 1$
(3) index $\leftarrow 2$
(4) for each $H_i \in \mathcal{H}$, $i = [1, n - 1]$ do
(5)   isSimilar $\leftarrow$ false
(6)   for $j = i + 1$ to $i + L$ do
(7)     if not Contains($A$, 1, $j + 1$) and $\mathcal{D}(H_i, H_j) > T_s$ then
(8)       $A[\text{index}] \leftarrow j$
(9)       isSimilar $\leftarrow$ true
(10)    end if
(11)  end for
(12)  if not isSimilar and ($i \geq A[\text{index}]$) then
(13)    add ($u$, $A[\text{index}]$) to $\mathcal{Z}$
(14)    $u \leftarrow A[\text{index}] + 1$
(15)    index $\leftarrow$ index $+ 1$
(16)  end if
(17) end for
(18) Merge the short scenes
(19) return $\mathcal{Z}$
Algorithm 1: Scene detection algorithm.
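The sliding-window grouping can be illustrated with a simplified sketch in the spirit of Algorithm 1. It is not a line-by-line transcription: it tracks the farthest shot linked to the current scene instead of the array $A$, it uses histogram intersection as the similarity, and it omits the final "merge the short scenes" step. The function name and the 0-based indexing are assumptions.

```python
import numpy as np


def group_shots_into_scenes(features, window, tau_s):
    """Group consecutive shots into scenes; returns inclusive 0-based index pairs.

    features: one feature vector per shot key frame, in temporal order.
    window:   sliding window length L (number of following shots compared).
    tau_s:    similarity threshold; similarity is histogram intersection.
    """
    n = len(features)
    scenes = []
    start = 0   # first shot of the scene being built
    end = 0     # farthest shot currently linked into that scene
    for i in range(n):
        # Compare shot i with its next `window` neighbours.
        for j in range(i + 1, min(i + window, n - 1) + 1):
            sim = float(np.minimum(features[i], features[j]).sum())
            if sim > tau_s:
                end = max(end, j)
        # No shot of the current scene links beyond i: the scene ends here.
        if i == end:
            scenes.append((start, end))
            start = end + 1
            end = start
    return scenes
```

For a shot sequence of the form A, B, A, B, C, C, D (for instance, a two-person conversation followed by two other settings) with a window of 3, the sketch returns the scenes (0, 3), (4, 5), and (6, 6).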
The accuracy of the proposed system is reported in Table 1. Our dataset has two groups of completely different videos. The first group consists of cinematic movies with widely varying environments, challenging effects, and complex motion. The second group consists of indoor drama serials, which are easier to segment than cinematic movies because their scenes are simple and have no challenging effects; that is why the length of the sliding window $L$ differs between the two groups. The sensitivity of $L$ is shown in Figure 4 [2]. In cinematic videos, the scenes are longer and the shots are shorter; within just a few seconds there are sometimes more than 20 shots due to different effects and actions. The value of $L$ is therefore marginally larger for cinematic videos than for drama videos, though a single value can also be used for all types.
The values of $k$ for VLAD and BoVW in the proposed experiments are smaller than the recommended values, which increases the efficiency of similarity computation: the cost of computing similarity by (6), or by any other distance, is at least $O(n)$, where $n$ denotes the dimensionality of the feature. The computation of similarity is therefore faster when $n$ is smaller, as shown in Figure 5. VLAD is faster than BoVW because it has fewer dimensions. The recommended value of $k$ for BoVW is 1000000, as discussed in the previous section, whereas in our experiments the value of $k$ is 25000.
5. Conclusion
Video segmentation is a primary step for video indexing and searching. Shot boundary detection divides videos into small units. These small units do not give meaningful insight into the story or theme of the video. However, grouping similar shots gives better insight into the video, and each such group can be treated as a video scene.
Table 1: Performance of BoVW and VLAD on cinematic and drama videos.
Figure 4: Sensitivity of $L$ on different types of videos.
In this paper, we propose a framework for scene boundary detection that uses state-of-the-art searching techniques, namely BoVW and VLAD, which are widely used for image and video retrieval. Images or video frames are typically represented by BoVW and VLAD as very high dimensional feature vectors. We experimentally show that, in scene boundary detection, competitive accuracy can be achieved while keeping the dimensions of BoVW and VLAD very small. The recommended dimensionality for BoVW is 1 million; in our experiments it is reduced to 25000. The recommended dimensionality of VLAD is 32768; in our experiments it is reduced to 2048. We exploit the sliding window for scene boundary detection. Within a very small sliding window, the contents of the video shots do not change drastically, which makes it possible to represent shots with reduced dimensions of BoVW and VLAD.
Figure 5: Timing plot of query image matching with all the images in the database (x-axis: data size, ×10^4; y-axis: time in seconds; curves: VLAD and BoVW). VLAD always has fewer dimensions than BoVW, which makes VLAD faster than BoVW.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
We are thankful to Shin'ichi Satoh from the National Institute of Informatics, Japan, Nitin Afzulpurkar from the Asian Institute of Technology, Thailand, and Chadaporn Keatmanee from the Thai-Nichi Institute of Technology, Thailand, for their expertise, which greatly assisted this research.
References
[1] S. Lefevre and N. Vincent, "Efficient and robust shot change detection," Journal of Real-Time Image Processing, vol. 2, no. 1, pp. 23–34, 2007.
[2] J. Baber, S. Satoh, N. Afzulpurkar, and C. Keatmanee, "Bag of visual words model for videos segmentation into scenes," in Proceedings of the Fifth International Conference on Internet Multimedia Computing and Service, pp. 191–194, ACM, 2013.
[3] J. Baber, N. Afzulpurkar, and S. Satoh, "A framework for video segmentation using global and local features," International Journal of Pattern Recognition and Artificial Intelligence, vol. 27, no. 5, Article ID 1355007, 2013.
[4] L. Baraldi, C. Grana, and R. Cucchiara, "Shot and scene detection via hierarchical clustering for re-using broadcast video," in Proceedings of the International Conference on Computer Analysis of Images and Patterns, pp. 801–811, Springer, 2015.
[5] J. Baber, N. Afzulpurkar, and M. Bakhtyar, "Video segmentation into scenes using entropy and SURF," in Proceedings of the 2011 7th International Conference on Emerging Technologies (ICET '11), pp. 1–6, IEEE, 2011.
[6] T. Kikukawa and S. Kawafuchi, "Development of an automatic summary editing system for the audio visual resources," Transactions of the Institute of Electronics, Information and Communication Engineers, vol. 75, no. 2, pp. 398–402, 1992.
[7] A. Nagasaka and Y. Tanaka, Visual Database Systems II, 1992.
[8] H. Zhang, A. Kankanhalli, and S. W. Smoliar, "Automatic partitioning of full-motion video," Multimedia Systems, vol. 1, no. 1, pp. 10–28, 1993.
[9] I. Koprinska and S. Carrato, "Temporal video segmentation: a survey," Signal Processing: Image Communication, vol. 16, no. 5, pp. 477–500, 2001.
[10] Z. Cernekova, I. Pitas, and C. Nikou, "Information theory-based shot cut/fade detection and video summarization," IEEE Transactions on Circuits and Systems for Video Technology, vol. 16, no. 1, pp. 82–91, 2006.
[11] T. Kikukawa and S. Kawafuchi, "Development of an automatic summary editing system for the audio-visual resources," Transactions on Electronics and Information, J75-A, pp. 204–212, 1992.
[12] A. Nagasaka, "Automatic video indexing and full-video search for object appearances," in Proceedings of the IFIP 2nd Working Conference on Visual Database Systems, 1992.
[13] G. C. Chavez, F. Precioso, M. Cord, S. Philipp-Foliguet, and A. D. A. Araujo, "Shot boundary detection at TRECVID 2006," in Proceedings of the TREC Video Retrieval Eval, vol. 15, 2006.
[14] X. Ling, O. Yuanxin, L. Huan, and X. Zhang, "A Method for Fast Shot Boundary Detection Based on SVM," in Proceedings of the 2008 Congress on Image and Signal Processing, vol. 2, pp. 445–449, IEEE, 2008.
[15] J. Li, Y. Ding, Y. Shi, and W. Li, "A divide-and-rule scheme for shot boundary detection based on SIFT," International Journal of Digital Content Technology and Its Applications, vol. 4, no. 3, pp. 202–214, 2010.
[16] M. Yeung, B.-L. Yeo, and B. Liu, "Segmentation of Video by Clustering and Graph Analysis," Computer Vision and Image Understanding, vol. 71, no. 1, pp. 94–109, 1998.
[17] Z. Rasheed and M. Shah, "Scene detection in Hollywood movies and TV shows," in Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 343–348, IEEE, 2003.
[18] D. Rotman, D. Porat, and G. Ashour, "Robust and efficient video scene detection using optimal sequential grouping," in Proceedings of the 18th IEEE International Symposium on Multimedia, ISM '16, pp. 275–280, IEEE, 2016.
[19] D. Rotman, D. Porat, and G. Ashour, "Robust video scene detection using multimodal fusion of optimally grouped features," in Proceedings of the 19th IEEE International Workshop on Multimedia Signal Processing, MMSP '17, pp. 1–6, 2017.
[20] L. Baraldi, C. Grana, and R. Cucchiara, "Analysis and re-use of videos in educational digital libraries with automatic scene detection," in Proceedings of the Italian Research Conference on Digital Libraries, pp. 155–164, Springer, 2015.
[21] U. Sakarya and Z. Telatar, "Video scene detection using dominant sets," in Proceedings of the 2008 15th IEEE International Conference on Image Processing - ICIP '08, pp. 73–76, IEEE, 2008.
[22] T. Lin, H. Zhang, and Q.-Y. Shi, "Video scene extraction by force competition," in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '01), pp. 753–756, 2001.
[23] X. Chen and F. Lu, "Adaptive rate control algorithm for H.264/AVC considering scene change," Mathematical Problems in Engineering, vol. 2013, Article ID 373689, 6 pages, 2013.
[24] G. Rascioni, S. Spinsante, and E. Gambi, "An optimized dynamic scene change detection algorithm for H.264/AVC encoded video sequences," International Journal of Digital Multimedia Broadcasting, vol. 2010, Article ID 864123, 9 pages, 2010.
[25] Z. Rasheed and M. Shah, "Scene detection in Hollywood movies and TV shows," in Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003.
[26] J. Baber, N. Afzulpurkar, M. N. Dailey, and M. Bakhtyar, "Shot boundary detection from videos using entropy and local descriptor," in Proceedings of the 2011 17th International Conference on Digital Signal Processing (DSP '11), pp. 1–6, IEEE, 2011.
[27] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[28] J. Baber, M. N. Dailey, S. Satoh, N. Afzulpurkar, and M. Bakhtyar, "BIG-OH: binarization of gradient orientation histograms," Image and Vision Computing, vol. 32, no. 11, pp. 940–953, 2014.
[29] H. Jegou, M. Douze, C. Schmid, and P. Perez, "Aggregating local descriptors into a compact image representation," in Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '10), pp. 3304–3311, 2010.
[30] J. Delhumeau, P.-H. Gosselin, H. Jegou, and P. Perez, "Revisiting the VLAD image representation," in Proceedings of the 21st ACM International Conference on Multimedia, pp. 653–656, 2013.
boundary is the sudden change in the scene, such as a change of speaker during TV interviews, whereas gradual shot boundaries take several frames to change the shot, as in fades and dissolves. In videos, many shots repeat within a very short interval of time; when such shots are combined, the resulting collections of shots are called scenes. For example, if two actors are talking, the camera keeps switching between the two actors with very little change in background, and a two-minute conversation clip sometimes contains 25-30 shots. Scene detection, also known as scene boundary detection or video scene segmentation, is the task of merging similar or repeating shots into one clip, or of dividing a video into clips that are semantically or visually related.
Manual segmentation of videos for websites and DVDs is very time consuming and is not feasible when dealing with large datasets. Recently, automatic video segmentation into shots and scenes has gained wide attention in industry and among researchers [2–5].
In the proposed methodology, videos are segmented at abrupt shot boundaries into shots, which are then grouped on the basis of similarity to construct the scenes. The proposed methodology is inspired by the BoVW model for scene detection [2]; the abstract flow diagram is shown in Figure 1. In the bag of visual words model, local key point descriptors, which are extracted from the key frames of the shots, are represented by histograms of visual words. These key frames are matched based on their bag of visual words histograms in a sliding window of length $\mathcal{L}$ [3]. It has been shown that matching shots in sliding windows is more efficient [2, 3]. The classical BoVW model and VLAD are used with compact vocabularies without compromising accuracy.
The rest of the paper is organized as follows. Section 2 presents the related work, divided into three subsections: (1) shot boundary detection, (2) key frame extraction, and (3) scene boundary detection. Section 3 presents the proposed methodology, Section 4 presents the experimental protocols and results, and finally Section 5 concludes the paper.
2. Related Work
In this section, a brief literature review and the state-of-the-art methodologies are presented for all the main steps of video segmentation, which include shot boundary detection, key frame extraction, and scene boundary detection.
2.1. Shot Boundary Detection. In video indexing and searching, the first and foremost step is shot boundary detection. As mentioned earlier, there are two types of shot boundary: abrupt and gradual. An abrupt shot boundary is a sudden change in the stream: if the dissimilarity between two consecutive frames is very large, then either of the adjacent frames is considered the boundary. A gradual shot boundary is a gradual change in the video, produced by effects such as fade-in, fade-out, and dissolve.
Let $F = \{f_1, f_2, \ldots, f_n\}$ be the frames of a video. Frame $f_i$ is said to be an abrupt shot boundary if and only if the difference between $f_i$ and $f_{i+1}$ is greater than a threshold $\tau$. In our experiments we do not take gradual shot detection into consideration, as the ratio of gradual shots in any cinematic video is very small: more than 90% of shot boundaries are abrupt boundaries [2, 3].
Figure 1: Abstract flow diagram of the proposed framework (video, shot boundary detection, key frame selection, feature extraction and quantization, similarity of key frames in sliding window, grouping of shots and scene creation).
There is a long list of methodologies. The earliest rely on the pixel-to-pixel difference between consecutive video frames, i.e., $f_i$ and $f_{i+1}$, to segment the video [6]. In this technique, if the sum of the pixel differences is greater than some threshold, an abrupt shot boundary is declared.
Later work used pixel intensity histograms of successive frames, instead of the pixel-to-pixel difference, to detect abrupt shot boundaries [7, 8]. These techniques work well, except that they are sensitive to the motion of objects and of the camera [9].
A later approach [10] detects shot boundaries based on the mutual information and joint entropy between consecutive frames, and was evaluated on a sports dataset. Joint entropy is particularly useful for fades and other gradual boundaries: the entropy is high for an extended period during fade-in, because the visual intensity gradually increases, and low during fade-out, as the intensity slowly decreases.
Video fragmentation based on pixel variance between frames and on pixel intensity histograms has been presented in [11, 12]. A frame is indexed as an abrupt boundary when the number of changed pixels between two frames exceeds some threshold.
Chavez et al. [13] proposed a different technique, in which supervised learning with a support vector machine (SVM) is used to separate abrupt boundaries from gradual boundaries. The authors compute a dissimilarity vector that combines a set of different features, including Fourier-Mellin moments, Zernike moments, and color histograms (RGB and HSV), to capture information such as illumination changes and rapid motion. This vector is then passed to the SVM for detection of shot boundaries. The authors also used illumination changes for detecting gradual shot boundaries.
Furthermore, [14] proposed a learning-based technique that has three main steps:
(1) First, frames that have smooth changes are removed.
(2) Second, three types of feature differences are extracted at the shot boundaries: the intensity difference, the difference in vertical and horizontal edge histograms, and the difference between the HSV color histograms.
(3) Last, the authors detect the gradual boundaries in the video using a technique named temporal multiresolution analysis.
Several other works use different methodologies for different kinds of shots. One example [15] uses different techniques for abrupt and gradual shots by utilizing SIFT and SVM. Their methodology comprises the following main steps:
(1) In the first step, they select the shot boundary frames from the video using the difference between the color histograms of two consecutive frames.
(2) In the second step, they extract the SIFT features from the frames selected as shot boundaries.
(3) Last, they use different approaches for abrupt and gradual shot boundaries by using SIFT and SVM. SIFT is considered to be among the most efficient, effective, and widely used features in state-of-the-art techniques.
Although SIFT is considered to be the most widely used feature extraction technique, it still has some downsides compared to SURF. The SIFT descriptor is a high dimensional feature vector, i.e., 128-D, whereas the SURF descriptor is only a 64-D vector. SIFT is also slower than SURF on complex and detailed images. Moreover, Baber et al. [3] proposed a technique for shot boundary detection using two different feature extraction methods, SURF and entropy. Their method detects both abrupt and gradual shot boundaries and differentiates between them in the following steps:
(1) In the first step, the fade boundaries are detected by analysing the entropy pattern during fade effects.
(2) After the detection of fade boundaries, abrupt shot boundaries are detected using the entropy difference between two consecutive frames. If the difference between two consecutive frames is higher than the threshold $\tau$, it is considered to be an abrupt shot boundary.
(3) SURF is used for removing the false negative boundaries.
2.2. Key Frame Extraction. Most researchers use shot boundary detection as an important step for extracting representative key frames from videos. Representative key frames are the particular frames that describe the whole content of a particular scene in the video. Each video may consist of one or more key frames, depending on the scenes or content in the video.
Shot boundary detection is one of the most crucial steps in our problem for finding representative key frames from the videos, as scene detection is completely based on these representative key frames. Baber et al. [5] used the entropy difference between two consecutive frames to find the shot boundaries. If the contents of two consecutive frames, say $f_i$ and $f_{i+1}$, are different and their entropy difference is greater than the specified threshold, then $f_i$ is said to be a shot boundary and is considered a representative key frame.
In our methodology, we first calculate the entropy of each video frame and then record the difference between each pair of consecutive frames. A frame whose difference is greater than the threshold $\tau$ is considered to be a representative key frame. Entropy is a statistical measure of randomness that can be used to characterize the texture of the input image. Mathematically, entropy is defined as
$$E_x = -\sum_{y} H_y \log(H_y), \tag{1}$$
where $H$ is the normalized histogram of the pixel intensities of the gray scale image.
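As a minimal sketch of Eq. (1), the entropy of one gray scale frame can be computed from its normalized intensity histogram with NumPy. The function name `frame_entropy`, the 256-bin histogram over 8-bit intensities, and the base-2 logarithm are assumptions made for illustration; the text does not fix the base of the logarithm.

```python
import numpy as np

def frame_entropy(gray, bins=256):
    """Entropy of a gray scale frame, following Eq. (1).

    `gray` is assumed to be an array of 8-bit intensities (0..255).
    """
    # Histogram of pixel intensities (the array is flattened internally).
    hist, _ = np.histogram(gray, bins=bins, range=(0, 256))
    h = hist / hist.sum()          # normalized histogram H
    h = h[h > 0]                   # empty bins contribute 0 (0 * log 0 := 0)
    return float(-np.sum(h * np.log2(h)))
```

A constant frame has entropy 0, while a frame split evenly between two intensities has entropy 1 bit under the base-2 assumption.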
2.3. Scene Detection. In the first stage, the video is segmented into shots, and semantically similar shots are then merged to form scenes. Scenes are categorized into various classes, such as conversation, indoor, and outdoor scenes. Many important studies on video segmentation into scenes have been published, using different types of videos, for example, cinematic videos, drama serials (indoor and outdoor), video lectures, and documentaries. Although a lot of work has been reported on segmentation of video into scenes, cinematic videos remain a challenge. Commonly, two types of features are extracted from the videos for segmentation, i.e., audio and visual. We focus on the visual features in our research.
Yeung et al. [16] proposed a technique that uses the scene transition graph (STG) to segment the video. Each node in the graph represents a shot, and the edges are defined by the visual similarity and the temporal relationship between shots. The graph is then divided into subgraphs, and these subgraphs, which are based on the color similarity of the shots, are considered as scenes.
Rasheed et al. [17] proposed an effective technique for scene detection in Hollywood movies and TV shows. As features, they used the motion, color, and length of the shots. In the initial step, they cluster the shots using Backward Shot Coherence (BSC). Next, by calculating the color similarity between shots, they detect potential scene boundaries, and they then remove the false negatives from the potential scene boundaries using scene dynamics, which are based on the motion and length of the shots.
Many recent authors have worked on video scene segmentation and proposed new techniques for this problem. Some researchers used multimodal fusion of optimally grouped features with a dynamic programming scheme [18–20]. Their methodology first divides the video into shots and then clusters the shots. In [19], the authors proposed a technique known as intermediate fusion, which uses all the information from the different modalities. They formulated the problem as an optimization problem and solved it via dynamic programming [19]. In earlier work [18], the same authors proposed dividing the video into scenes using its sequential structure. In this technique, they decide on locations for video segmentation and inspect only the resulting partitioning possibilities. The video is represented by a set of features together with a distance metric between them, and the segmentation depends purely on the input features and the distance metric [18].
Furthermore, a different technique was proposed in [20], which uses spectral clustering with automatic selection of the number of clusters and extracts the normalized histogram of each shot. The Bhattacharyya distance and the temporal distance are used as distance metrics. The authors note that clustering is not temporally consistent, so adjacent shots may belong to different clusters [20].
Sakarya et al. [21] used a graph construction technique for the segmentation of video into scenes. They construct a graph weighted by temporal and spatial similarity functions. From this graph the dominant shots are detected, and the edges of the scene are determined under a temporal consistency constraint, using the mean and standard deviation of the shot positions. This process continues until the whole video has been allocated to scenes. Lin et al. [22] used color histograms for shot boundary detection and then formed the scenes by merging similar shots, identifying the local minima and maxima to determine the scene transitions.
Baraldi et al. [4] used another approach for shot and scene detection, based on color histograms and clustering, respectively. The authors first detect the shots using the color histogram; they then cluster the shots using the hierarchical K-means clustering technique, creating $N$ clusters for $N$ shots. Each shot is assigned to a particular cluster; the least dissimilar shots are found using a distance metric, and the two clusters with the least distance are merged. This process continues until all the scenes are detected and the whole video has been processed.
Chen et al. [23] proposed an approach for scene detection in H.264 video sequences. They define a scene change factor, which is used to reserve bits for each frame. Their methodology reduced the rate error and performed better than the JVT-G012 algorithm. The work of [24] proposed a technique for scene change detection specifically for H.264/AVC encoded video sequences, taking into consideration the design and performance evaluation of the system. They further worked with a dynamic threshold that adapts to and tracks different descriptors, and increased the accuracy of the system by locating true scenes in the videos.
3. Proposed Methodology for Scene Detection
The proposed framework comprises shot boundary detection, key frame extraction, local key point descriptor extraction from key frames, feature quantization, and scene boundary detection.
Figure 2: Sensitivity of $\tau_A$ on the movie Pink Panther (2006) (x-axis: entropy difference; y-axis: precision, recall, and F-score).
3.1. Shot Boundary Detection. Shot boundary detection is the primary step for any kind of video operation. There are a number of frameworks for shot boundary detection. We use a technique based on entropy differences [5, 26]. The entropy is computed for each frame, and the differences between adjacent frames are computed. The frame $f_i$ is considered to be a shot boundary, specifically an abrupt shot boundary, if the entropy difference between $f_i$ and $f_{i+1}$ is greater than the predefined threshold $\tau_A$ [2, 3, 5]. This can be written as
$$\mathcal{B}(f_i) = \begin{cases} \text{True} & \text{if } \mathcal{D}(f_i, f_{i+1}) > \tau_A, \\ \text{False} & \text{otherwise.} \end{cases} \tag{2}$$
$\mathcal{B}(\cdot)$ decides whether the given frame $f_i$ is a shot boundary or not, and $\mathcal{D}$ computes the dissimilarity, or difference, between adjacent frames. A high value of $\tau_A$ gives better precision with poor recall, and a low value gives better recall with poor precision, as shown in Figure 2. In our experiments, the value of $\tau_A$ is set empirically to a value that gives a high F-score.
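The decision rule of Eq. (2) can be sketched as follows, given one precomputed entropy value per frame. This is an illustrative sketch only: the function name `abrupt_boundaries` is hypothetical, and taking the dissimilarity between adjacent frames to be the absolute entropy difference is an assumption, since the text does not spell out the exact form of the difference.

```python
import numpy as np

def abrupt_boundaries(entropies, tau_a):
    """Return the 0-based indices i for which frame f_i is an abrupt boundary.

    Assumption: the dissimilarity between f_i and f_{i+1} is the absolute
    difference of their entropies.
    """
    e = np.asarray(entropies, dtype=float)
    diffs = np.abs(np.diff(e))                  # one value per adjacent pair
    return np.flatnonzero(diffs > tau_a).tolist()
```

For example, `abrupt_boundaries([5.0, 5.1, 7.0, 7.05, 3.0], tau_a=1.0)` flags frames 1 and 3, the last frames before the two large entropy jumps.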
3.2. Key Frame and Local Key Point Descriptors Extraction. Let $\mathcal{S} = \{s_1, s_2, \ldots, s_n\}$ be the set of all shots delimited by the detected shot boundaries. One key frame, or a set of key frames, is selected from each shot. There are a number of ways to select representative frames, also known as key frames, from each shot. Since the entropies are already computed during shot boundary detection, an entropy based key frame selection criterion is used [3].
For any given shot $s_i \in \mathcal{S}$, the frame with maximum entropy is selected as the key frame. It has been shown experimentally that the larger the entropy, the denser the contents of the frame, so such a frame represents the shot precisely. The shots are now represented by key frames, denoted by $\mathcal{F} = \{f_{s_1}, f_{s_2}, \ldots, f_{s_n}\}$, where $f_{s_i}$ denotes the key frame of shot $s_i$.
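The maximum-entropy selection described above can be sketched as below. The function name `key_frames` is hypothetical, and the sketch assumes that a boundary frame $f_i$ is the last frame of its shot, so that the next shot starts at frame $i + 1$; the text does not state which side of the boundary the frame belongs to.

```python
import numpy as np

def key_frames(entropies, boundaries):
    """Pick, for every shot, the index of the frame with maximum entropy.

    `boundaries` holds sorted 0-based indices of boundary frames; a boundary
    frame is assumed to close its shot.
    """
    e = np.asarray(entropies, dtype=float)
    starts = [0] + [b + 1 for b in boundaries]
    ends = [b + 1 for b in boundaries] + [len(e)]   # exclusive end of each shot
    return [s + int(np.argmax(e[s:t])) for s, t in zip(starts, ends) if t > s]
```

With entropies `[5.0, 5.1, 7.0, 7.05, 3.0]` and boundaries `[1, 3]`, the three shots cover frames 0-1, 2-3, and 4, and the selected key frames are 1, 3, and 4.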
Two images can be matched if they are similar according to some similarity criterion. Similarity is computed between the features of the images. SIFT [27] is widely used as an image feature in various applications of computer vision and video processing. For any given image, key points are detected, and those key points are represented by descriptors such as SIFT. On average, there are 2-3 thousand key points in a single image, which makes matching very expensive and exhaustive, as a single image is represented by 2-3 thousand feature vectors. Matching two images of size 800 × 600 takes 2 seconds on average on commodity hardware. If one image has to be matched with several hundred or several thousand images, then it is not practical to use SIFT or any other raw descriptors. Quantization is used to reduce the feature space.
3.3. Quantization: BoVW Model. The bag of visual words model is widely used for feature quantization. Every key point descriptor $x_j \in \mathbb{R}^d$ is quantized to one of a finite number of centroids, indexed from 1 to $k$, where $k$ denotes the total number of centroids, also known as visual words, denoted by $\mathcal{V} = \{v_1, v_2, \ldots, v_k\}$ with each $v_i \in \mathbb{R}^d$. Let a frame $f$ be represented by its local key point descriptors $f_X = \{x_1, x_2, \ldots, x_m\}$, where $x_i \in \mathbb{R}^d$. In the BoVW model, a function $\mathcal{G}$ is defined as
$$\mathcal{G}: \mathbb{R}^d \to [1, k], \qquad x_i \mapsto \mathcal{G}(x_i). \tag{3}$$
$\mathcal{G}$ maps a descriptor $x_i \in \mathbb{R}^d$ to an integer index. For a given frame $f$, the bag of visual words $\mathcal{I} = \{\mu_1, \mu_2, \ldots, \mu_k\}$ is computed, where $\mu_i$ indicates the number of times $v_i$ appears in frame $f$, and $\mathcal{I}$ is unit normalized at the end. Mostly, $k$-means or hierarchical $k$-means clustering is applied to obtain the centroids (visual words) $\mathcal{V}$. The value of $k$ is kept very large for image matching or retrieval applications; the suggested value of $k$ is 1 million. The accuracy of quantization mainly depends on the value of $k$: if the value is small, two different key point descriptors will be quantized to the same visual word, which decreases distinctiveness; if the value is very large, two similar key point descriptors that are slightly distorted can be assigned to different visual words, which decreases robustness [28].
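A minimal NumPy sketch of the BoVW quantization is given below. It assumes a vocabulary that has already been learned offline, and it makes two assumptions that the text does not fix: descriptors are assigned to the nearest visual word in Euclidean distance, and "unit normalized" is read as L2 normalization. The function name `bovw_histogram` is hypothetical.

```python
import numpy as np

def bovw_histogram(descriptors, vocabulary):
    """Quantize descriptors (m x d) against visual words (k x d).

    Returns a k-dimensional, L2-normalized histogram of visual word counts.
    """
    X = np.asarray(descriptors, dtype=float)
    V = np.asarray(vocabulary, dtype=float)
    # Squared Euclidean distances between every descriptor and every word.
    d2 = (X ** 2).sum(axis=1)[:, None] - 2.0 * X @ V.T + (V ** 2).sum(axis=1)[None, :]
    words = d2.argmin(axis=1)                       # index of nearest word per descriptor
    hist = np.bincount(words, minlength=len(V)).astype(float)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist
```

With the two-word vocabulary `[[0, 0], [10, 10]]` and descriptors `[[1, 0], [9, 9], [0, 1]]`, two descriptors fall on the first word and one on the second, so the histogram is `[2, 1]` scaled to unit length.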
In video segmentation, the scenario is different from searching for or matching one image against a very large database in which images undergo severe transformations, such as changes in illumination, scale, and viewpoint, or capture of the scene at a different time. In video segmentation, an image is matched with a few other images, 4 to 7, in a sliding window, and these images contain only slightly different contents. Each image in the sliding window is a key frame that represents a shot; an example of sliding window matching is shown in Figure 3.
In the proposed framework, the value of $k$ is kept far smaller than the value suggested in the literature [2], without compromising the segmentation accuracy. In our experiments, $k = 25000$ gives approximately the same accuracy as the value 500000 used in our previous work [2]. For this experiment, the value of $k$ was gradually increased from 5000 to 30000 in steps of 1000, and it was found that $k = 25000$ gives approximately the same accuracy as our previous work [2].
3.4. Quantization: VLAD Model. VLAD is an emerging quantization framework for local key point descriptors [29]. Instead of computing a histogram of visual words, it accumulates, for each visual word, the residuals between the descriptors and the visual word they are assigned to, and concatenates these sums into a single vector of dimension $d \times k$. Let $\mathcal{G}_V$ be the VLAD quantization function [30]:
$$\mathcal{G}_V: \mathbb{R}^d \to \mathcal{V}, \qquad x_i \mapsto \mathcal{G}_V(x_i) = \arg\min_{v \in \mathcal{V}} \|x_i - v\|_2. \tag{4}$$
The VLAD is computed in three steps:
(1) offline, the visual words $\mathcal{V}$ are obtained;
(2) all the key point descriptors obtained from the given frame, $f_X$, are quantized using (4);
(3) the VLAD is computed for the given frame as $\mathcal{J}_f = \{j_1, j_2, \ldots, j_k\}$, where each $j_q$ is a $d$-dimensional vector obtained as follows:
$$j_q = \sum_{x \in f_X :\, \mathcal{G}_V(x) = v_q} (x - v_q). \tag{5}$$
$\mathcal{J}_f$ is a $d \times k$ dimensional feature. In the case of SIFT, $d = 128$, and the recommended values are $k \in \{64, 128, 256\}$ [29]. As stated above, video segmentation does not require very large values of $k$. In our experiments, the value of $k$ for VLAD is 16, so $\mathcal{J}$ using SIFT is $128 \times 16 = 2048$ dimensional. $\mathcal{J}$ is unit normalized at the end. The vector is very compact without loss of accuracy, as shown in the experiments.
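The three VLAD steps can be sketched in NumPy as follows. The vocabulary is assumed to be given (step 1 is offline), the function name `vlad` is hypothetical, and reading "unit normalized" as L2 normalization of the concatenated vector is an assumption.

```python
import numpy as np

def vlad(descriptors, vocabulary):
    """VLAD vector of a frame: descriptors are (m x d), vocabulary is (k x d).

    Returns a (k * d)-dimensional, L2-normalized vector.
    """
    X = np.asarray(descriptors, dtype=float)
    V = np.asarray(vocabulary, dtype=float)
    k, d = V.shape
    # Step 2: assign every descriptor to its nearest visual word, as in Eq. (4).
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    # Step 3: sum the residuals x - v_q per visual word, as in Eq. (5).
    J = np.zeros((k, d))
    for q in range(k):
        members = X[nearest == q]
        if len(members):
            J[q] = (members - V[q]).sum(axis=0)
    J = J.ravel()                                   # concatenate j_1 ... j_k
    norm = np.linalg.norm(J)
    return J / norm if norm > 0 else J
```

For 128-dimensional descriptors and a vocabulary of 16 visual words, the returned vector has 128 × 16 = 2048 components, matching the dimensionality quoted above.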
3.5. Scene Boundary Detection. Algorithm 1 is used to find the scene boundaries [2]. $\mathcal{H}$ denotes the feature vectors of the key frames; the feature vectors are either the VLAD or the BoVW vectors explained in the previous sections. The similarity between two key frames is decided by the function $\mathcal{D}$, which here acts as a similarity measure and is computed as follows:
$$\mathcal{D}(H_i, H_j) = \sum_{q=1}^{N} \min(h_{iq}, h_{jq}), \tag{6}$$
where $N$ is the dimensionality of the feature vectors and $h_{iq}$ is the $q$th component of $H_i$. Two key frames are treated as similar if $\mathcal{D}(\cdot) > \tau_s$. The value of $\tau_s$ is the average of the minimum and the maximum similarities of the similar shots on a subset of the videos used in the experiments. The average of the similarity scores is widely used as the value of $\tau_s$; in our experiments, however, the average of the similarity scores gives low segmentation accuracy, i.e., 0.713.
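Eq. (6) is the component-wise minimum of two feature vectors summed over all dimensions. A minimal sketch, with the hypothetical helper names `intersection_similarity` and `is_similar`, is shown below; the threshold passed as `tau_s` is whatever value is chosen as described above.

```python
import numpy as np

def intersection_similarity(h_i, h_j):
    """Sum of component-wise minima of two feature vectors, as in Eq. (6)."""
    return float(np.minimum(h_i, h_j).sum())

def is_similar(h_i, h_j, tau_s):
    """Two key frames are treated as similar when the score exceeds tau_s."""
    return intersection_similarity(h_i, h_j) > tau_s
```

For two histograms that overlap on half of their mass, such as `[0.5, 0.5, 0.0]` and `[0.25, 0.25, 0.5]`, the score is 0.5, so the pair is similar for any threshold below 0.5.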
4. Experiments and Results
Cinematic and drama videos are used for scene boundary detection; the list of movies and dramas is given in Table 1. F-score is used as the performance metric for scene boundary detection. There is no benchmark dataset. Two strategies have been used to obtain the ground-truth: first party and third party ground-truth. First party ground-truth is generated by the authors, and third party ground-truth is collected from experts who have adequate knowledge of shot and scene boundaries [2, 3]. To make the ground-truth unbiased, the third party approach is used in our experiments [3, 5, 26].
Figure 3: Example of key frame matching in the sliding window of length $\mathcal{L} = 3$. Each frame represents a shot, and there are 9 consecutive shots $f_{s_1}, \ldots, f_{s_9}$. Each key frame $f_{s_i}$ is matched with its next three neighbors.
Require: $\mathcal{H} = \{H_1, H_2, \ldots, H_n\}$
A[1] ← 1
u ← 1
index ← 2
for each $H_i \in \mathcal{H}$, i = [1, n − 1] do
    isSimilar ← false
    for j = i + 1 to i + $\mathcal{L}$ do
        if not Contains(A, 1, j + 1) and $\mathcal{D}(H_i, H_j) > \tau_s$ then
            A[index] ← j
            isSimilar ← true
        end if
    end for
    if not isSimilar and (i ≥ A[index]) then
        add (u, A[index]) to $\mathcal{Z}$
        u ← A[index] + 1
        index ← index + 1
    end if
end for
Merge the short scenes
return $\mathcal{Z}$
Algorithm 1: Scene detection algorithm.
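The sliding-window idea behind Algorithm 1 can be illustrated with the simplified Python sketch below. It is not a literal transcription of Algorithm 1: it keeps only the core rule (a scene stays open while some shot in it is similar to a shot up to `window` positions ahead), it omits the final "merge the short scenes" step, and it uses 0-based shot indices. The function name `group_scenes` is hypothetical, and the similarity is the sum of component-wise minima from Eq. (6).

```python
import numpy as np

def group_scenes(features, window, tau_s):
    """Group consecutive shots into scenes with a forward sliding window.

    `features` is a sequence of key frame feature vectors, one per shot.
    Returns a list of (first_shot, last_shot) pairs, 0-based and inclusive.
    """
    n = len(features)
    scenes, start, reach = [], 0, 0
    for i in range(n):
        # Compare shot i with its next `window` neighbours.
        for j in range(i + 1, min(i + window, n - 1) + 1):
            if np.minimum(features[i], features[j]).sum() > tau_s:
                reach = max(reach, j)       # the scene extends at least to shot j
        if i >= reach:                      # no link crosses position i: close the scene
            scenes.append((start, i))
            start = i + 1
            reach = i + 1
    return scenes
```

As a usage example, six shots with the alternating pattern A, B, A, B followed by C, C, a window of 2, and a threshold of 0.5 produce two scenes, `(0, 3)` and `(4, 5)`: the alternating shots are chained together through the window, while nothing links them to the final pair.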
The accuracy of the proposed system can be seen in Table 1. Our dataset has two different groups with completely different videos. One group consists of cinematic movies, with entirely different environments, challenging effects, and complex motion within scenes. The second group consists of indoor drama serials, which are easier to segment than cinematic movies because of their simple scenes with no challenging effects; that is why the length of the sliding window $\mathcal{L}$ is different for the two groups. The sensitivity of $\mathcal{L}$ can be seen in Figure 4 [2]. In cinematic videos, the scenes are longer and the shots are shorter: within just a few seconds there are sometimes more than 20 shots, due to different effects and actions. The value of $\mathcal{L}$ is therefore marginally larger for cinematic videos than for drama videos, although a single value can also be used for all types.
The values of $k$ for VLAD and BoVW in the proposed experiments are smaller than the recommended values, which increases the efficiency of similarity computation: the similarity computation by (6), or by any other distance, is at least $O(n)$, where $n$ denotes the dimensionality of the feature. The computation of similarity is faster if the value of $n$ is smaller, as shown in Figure 5. It can be seen that VLAD is faster than BoVW, because VLAD has fewer dimensions than BoVW. The recommended value of $k$ for BoVW is 1000000, as discussed in the previous section, whereas in our experiments the value of $k$ is 25000.
5. Conclusion
Video segmentation is a primary step for video indexing and searching. Shot boundary detection divides videos into small units. These small units do not give meaningful insight into the video story or theme. However, grouping similar shots gives better insight into the video; each such group of similar shots is called a scene.
Table 1: Performance of BoVW and VLAD on cinematic and drama videos.
Figure 4: Sensitivity of $\mathcal{L}$ on different types of videos.
In this paper we propose framework which uses state-of-the-art searching techniques such as BoVW and VLADwhich is widely used for image and video retrieval for sceneboundary detection Images or video frames are representedby BoVW and VLAD which are very high dimensionalfeature vectors We experimentally show that in the fieldof scene boundary detection competitive accuracy can beachieved by keeping the dimensions of BoVW and VLADto very small The recommended dimensions for BoVWare 1 million in our experiments we just tuned it to be25000 The recommended dimensions of VLAD are 32768
0 2 4 6 8Data Size
0
05
1
15
2
25
3
35
Tim
e (se
c)
VLADBoVW
times104
Figure 5 Timing plot of query image matching with all the imagesin database VLAD always has less dimensions compared to theBoVWwhich makes VLAD faster than BoVW
in our experiments it is tuned to 2048 We exploit thesliding window for shot boundary detection In very smallsliding window the contents of the video shots do notchange drastically which helps to represent shots by reduceddimensions of BoVW and VLAD
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
We are thankful to Shin’ichi Satoh from National Institute of Informatics, Japan, Nitin Afzulpurkar from Asian Institute of Technology, Thailand, and Chadaporn Keatmanee from Thai-Nichi Institute of Technology, Thailand, for their expertise that greatly assisted this research.
References
[1] S. Lefevre and N. Vincent, “Efficient and robust shot change detection,” Journal of Real-Time Image Processing, vol. 2, no. 1, pp. 23–34, 2007.
[2] J. Baber, S. Satoh, N. Afzulpurkar, and C. Keatmanee, “Bag of visual words model for videos segmentation into scenes,” in Proceedings of the Fifth International Conference on Internet Multimedia Computing and Service, pp. 191–194, ACM, 2013.
[3] J. Baber, N. Afzulpurkar, and S. Satoh, “A framework for video segmentation using global and local features,” International Journal of Pattern Recognition and Artificial Intelligence, vol. 27, no. 5, Article ID 1355007, 2013.
[4] L. Baraldi, C. Grana, and R. Cucchiara, “Shot and scene detection via hierarchical clustering for re-using broadcast video,” in Proceedings of the International Conference on Computer Analysis of Images and Patterns, pp. 801–811, Springer, 2015.
[5] J. Baber, N. Afzulpurkar, and M. Bakhtyar, “Video segmentation into scenes using entropy and SURF,” in Proceedings of the 2011 7th International Conference on Emerging Technologies (ICET ’11), pp. 1–6, IEEE, 2011.
[6] T. Kikukawa and S. Kawafuchi, “Development of an automatic summary editing system for the audio visual resources,” Transactions of the Institute of Electronics, Information and Communication Engineers, vol. 75, no. 2, pp. 398–402, 1992.
[7] A. Nagasaka and Y. Tanaka, Visual Database Systems II, 1992.
[8] H. Zhang, A. Kankanhalli, and S. W. Smoliar, “Automatic partitioning of full-motion video,” Multimedia Systems, vol. 1, no. 1, pp. 10–28, 1993.
[9] I. Koprinska and S. Carrato, “Temporal video segmentation: a survey,” Signal Processing: Image Communication, vol. 16, no. 5, pp. 477–500, 2001.
[10] Z. Cernekova, I. Pitas, and C. Nikou, “Information theory-based shot cut/fade detection and video summarization,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 16, no. 1, pp. 82–91, 2006.
[11] T. Kikukawa and S. Kawafuchi, “Development of an automatic summary editing system for the audio-visual resources,” Transactions on Electronics and Information, J75-A, pp. 204–212, 1992.
[12] A. Nagasaka, “Automatic video indexing and full-video search for object appearances,” in Proceedings of the IFIP 2nd Working Conference on Visual Database Systems, 1992.
[13] G. C. Chavez, F. Precioso, M. Cord, S. Philipp-Foliguet, and A. D. A. Araujo, “Shot boundary detection at TRECVID 2006,” in Proceedings of the TREC Video Retrieval Eval., vol. 15, 2006.
[14] X. Ling, O. Yuanxin, L. Huan, and X. Zhang, “A Method for Fast Shot Boundary Detection Based on SVM,” in Proceedings of the 2008 Congress on Image and Signal Processing, vol. 2, pp. 445–449, IEEE, 2008.
[15] J. Li, Y. Ding, Y. Shi, and W. Li, “A divide-and-rule scheme for shot boundary detection based on SIFT,” International Journal of Digital Content Technology and Its Applications, vol. 4, no. 3, pp. 202–214, 2010.
[16] M. Yeung, B.-L. Yeo, and B. Liu, “Segmentation of Video by Clustering and Graph Analysis,” Computer Vision and Image Understanding, vol. 71, no. 1, pp. 94–109, 1998.
[17] Z. Rasheed and M. Shah, “Scene detection in Hollywood movies and TV shows,” in Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 343–348, IEEE, 2003.
[18] D. Rotman, D. Porat, and G. Ashour, “Robust and efficient video scene detection using optimal sequential grouping,” in Proceedings of the 18th IEEE International Symposium on Multimedia, ISM ’16, pp. 275–280, IEEE, 2016.
[19] D. Rotman, D. Porat, and G. Ashour, “Robust video scene detection using multimodal fusion of optimally grouped features,” in Proceedings of the 19th IEEE International Workshop on Multimedia Signal Processing, MMSP ’17, pp. 1–6, 2017.
[20] L. Baraldi, C. Grana, and R. Cucchiara, “Analysis and re-use of videos in educational digital libraries with automatic scene detection,” in Proceedings of the Italian Research Conference on Digital Libraries, pp. 155–164, Springer, 2015.
[21] U. Sakarya and Z. Telatar, “Video scene detection using dominant sets,” in Proceedings of the 2008 15th IEEE International Conference on Image Processing - ICIP ’08, pp. 73–76, IEEE, 2008.
[22] T. Lin, H. Zhang, and Q.-Y. Shi, “Video scene extraction by force competition,” in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME ’01), pp. 753–756, 2001.
[23] X. Chen and F. Lu, “Adaptive rate control algorithm for H.264/AVC considering scene change,” Mathematical Problems in Engineering, vol. 2013, Article ID 373689, 6 pages, 2013.
[24] G. Rascioni, S. Spinsante, and E. Gambi, “An optimized dynamic scene change detection algorithm for H.264/AVC encoded video sequences,” International Journal of Digital Multimedia Broadcasting, vol. 2010, Article ID 864123, 9 pages, 2010.
[25] Z. Rasheed and M. Shah, “Scene detection in Hollywood movies and TV shows,” in Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003.
[26] J. Baber, N. Afzulpurkar, M. N. Dailey, and M. Bakhtyar, “Shot boundary detection from videos using entropy and local descriptor,” in Proceedings of the 2011 17th International Conference on Digital Signal Processing (DSP ’11), pp. 1–6, IEEE, 2011.
[27] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[28] J. Baber, M. N. Dailey, S. Satoh, N. Afzulpurkar, and M. Bakhtyar, “BIG-OH: binarization of gradient orientation histograms,” Image and Vision Computing, vol. 32, no. 11, pp. 940–953, 2014.
[29] H. Jegou, M. Douze, C. Schmid, and P. Perez, “Aggregating local descriptors into a compact image representation,” in Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR ’10), pp. 3304–3311, 2010.
[30] J. Delhumeau, P.-H. Gosselin, H. Jegou, and P. Perez, “Revisiting the VLAD image representation,” in Proceedings of the 21st ACM International Conference on Multimedia, pp. 653–656, 2013.
Furthermore, the authors of [14] proposed a learning-based technique that has three main steps:
(1) First, frames that have smooth changes are removed.
(2) Second, three types of feature differences are extracted: the intensity difference, the difference in vertical and horizontal edge histograms, and the difference between the HSV color histograms are calculated from the shot boundaries.
(3) Lastly, the authors detect the gradual boundaries in the video using a technique called temporal multiresolution analysis.
Several other works use different methodologies for different kinds of shot boundaries. For example, the work in [15] used different techniques for abrupt shots and gradual shots by utilizing SIFT and SVM. Their methodology also comprises a few main steps, which are given below:
(1) In the first step, they select the shot boundary frames from the video using the difference between the color histograms of two consecutive frames.
(2) In the second step, they extract SIFT features from the frames selected as shot boundaries.
(3) Lastly, they use different approaches for abrupt and gradual shot boundaries based on SIFT and SVM. SIFT is considered to be efficient and effective and is widely used in state-of-the-art techniques.
Although SIFT is considered to be the most widely used feature extraction technique, it still has some downsides compared to SURF. The SIFT descriptor is a high-dimensional feature vector, i.e., 128-D, whereas the SURF descriptor is only a 64-D vector. SIFT is also slower than SURF, particularly on complex and detailed images. Moreover, Baber et al. [3] proposed a technique for shot boundary detection using two different features, SURF and entropy. Their method detects shot boundaries (abrupt and gradual) and differentiates gradual boundaries from abrupt shot boundaries. The steps are as follows:
(1) In the first step, the fade boundaries are detected by analyzing the entropy pattern during fade effects.
(2) After the detection of fade shot boundaries, the other kind of shot boundary, the abrupt shot boundary, is detected using the entropy difference between two consecutive frames. If the difference between two consecutive frames is higher than the threshold $\tau$, the frame is considered to be an abrupt shot boundary.
(3) SURF is used to remove falsely detected boundaries.
2.2. Key Frame Extraction. Most researchers use shot boundary detection as an important step for extracting representative key frames from videos. Representative key frames are the particular frames that describe the whole content of a particular scene in the video. Each video may have one or more key frames, depending on the scenes or content in the video.
Shot boundary detection is one of the most crucial steps in our problem for finding representative key frames from videos, as scene detection is based entirely on these key frames. Baber et al. [5] used the entropy difference between two consecutive frames to find shot boundaries. If the contents of two consecutive frames, say $f_i$ and $f_{i+1}$, are different and their entropy difference is greater than the specified threshold, then $f_i$ is said to be a shot boundary and is considered a representative key frame.
In our methodology, we first calculate the entropy of each video frame and then record the difference between every two consecutive frames. A frame whose difference is greater than the threshold $\tau$ is considered to be a representative key frame. Entropy is a statistical measure of randomness that can be used to characterize the texture of the input image. Mathematically, entropy is defined as
$$E_x = -\sum_{y} H_y \log(H_y) \tag{1}$$

where $H$ is the normalized histogram of the pixel intensities of the grayscale image.
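As a rough illustration of (1), the entropy of a single frame can be computed from its normalized intensity histogram. The sketch below is not the authors' code: the function name `frame_entropy`, the flat list of gray levels used as input, and the base-2 logarithm are assumptions made for the example (the text does not state the logarithm base).

```python
import math
from collections import Counter


def frame_entropy(pixels):
    """Entropy of a grayscale frame given as a flat iterable of intensities.

    Assumption: base-2 logarithm, so the result is in bits.
    """
    counts = Counter(pixels)          # unnormalized intensity histogram
    total = sum(counts.values())
    entropy = 0.0
    for count in counts.values():
        h_y = count / total           # H_y: normalized histogram value
        entropy -= h_y * math.log2(h_y)
    return entropy


# A frame split evenly between two gray levels carries one bit of entropy.
print(frame_entropy([0, 0, 255, 255]))
```

Intensities that never occur are simply absent from the histogram, which matches the usual convention that empty bins contribute nothing to the sum.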
2.3. Scene Detection. In the first stage, the video is segmented into shots, and semantically similar shots are then merged to form scenes. Scenes are categorized into various classes, such as conversation, indoor, and outdoor scenes. Much important research has been published on video segmentation into scenes using different types of videos, for example, cinematic videos, drama serials (indoor and outdoor), video lectures, and documentaries. Although a lot of work has been reported on the segmentation of video into scenes, there is still a gap to address for cinematic videos. Commonly, two types of features are extracted from videos for segmentation, i.e., audio and visual. We focus on visual features in our research.
Yeung et al. [16] proposed a technique that uses the scene transition graph (STG) to segment the video. Each node in the graph represents a shot, and the edges are defined by the temporal relationship and visual similarity between shots. The graph is then divided into subgraphs, and these subgraphs, which are based on the color similarity of the shots, are considered as scenes.
Rasheed and Shah [17] proposed an effective technique for scene detection in Hollywood movies and TV shows. As features, they used the motion, color, and length of the shots. In the initial step, they cluster the shots using Backward Shot Coherence (BSC). Next, by calculating the color similarity between shots, they detect potential scene boundaries, and after that they remove the false boundaries from the potential scene boundaries using scene dynamics, which is based on the motion and length of the shots.
Many recent authors have worked on video scene segmentation and proposed new techniques for this problem. Some researchers used a multimodal fusion technique of optimally grouped features using a dynamic programming scheme [18–20]. Their methodology includes a few steps, in which the first step is to divide the video into shots, and then the shots are clustered using a clustering technique. The authors of [19] proposed a technique known as intermediate fusion, which uses all the information from different modalities. They treated the task as an optimization problem and solved it via dynamic programming [19]. In earlier research [18], the same authors proposed a technique for dividing the video into scenes using its sequential structure. In this technique, they decide on locations for video segmentation and inspect only the partitioning possibilities. The video is represented by a set of features, and each feature set comes with a distance metric. The segmentation depends purely on the input features and the distance metric [18].
Furthermore, a different technique was proposed in [20], which uses spectral clustering with automatic selection of the number of clusters and extracts the normalized histogram of each shot. Bhattacharyya distance and temporal distance are used as the distance metric. The authors note that the clustering is not temporally consistent, so adjacent shots may belong to different clusters [20].
Sakarya and Telatar [21] used a graph construction technique for the segmentation of video into scenes. They construct a graph whose weights combine temporal and spatial similarity functions. From this graph, the dominant shots are detected, and for the temporal consistency constraint, the edges of the scene are determined via the mean and standard deviation of the shot positions. This process continues until the whole video is allocated to scenes. Lin et al. [22] used color histograms for shot boundary detection and then formed scenes by merging similar shots, identifying the local minima and maxima to determine the scene transitions.
Baraldi et al. [4] used another approach for shot and scene detection, based on color histograms and clustering, respectively. The authors first detect the shots using color histograms and then cluster the shots using hierarchical K-means clustering, creating N clusters for N shots. Each shot is assigned to its own cluster; the least dissimilar shots are then found using a distance metric, and the two clusters with the smallest distance are merged. This process continues until all the scenes are detected and the whole video is processed.
Chen and Lu [23] proposed an approach for scene detection from H.264 video sequences. They define a scene change factor, which is used to reserve bits for each frame. Their method reduces the rate error and was found to perform better than the JVT-G012 algorithm. The work of [24] proposed a technique for scene change detection specifically for H.264/AVC encoded video sequences, taking into consideration the design and performance evaluation of the system. They further use a dynamic threshold that adapts to and tracks different descriptors, which increases the accuracy of the system in locating true scenes in the videos.
3. Proposed Methodology for Scene Detection
The proposed framework comprises shot boundary detection, key frame extraction, local key point descriptor extraction from key frames, feature quantization, and scene boundary detection.

Figure 2: Sensitivity of $\tau_A$ on the movie Pink Panther (2006) (x-axis: entropy difference; y-axis: precision, recall, and F-score, from 0 to 1).
3.1. Shot Boundary Detection. Shot boundary detection is the primary step for any kind of video operation. There are a number of frameworks for shot boundary detection. We use a technique based on entropy differences [5, 26]. The entropy is computed for each frame, and the differences between adjacent frames are computed. The frame $f_i$ is considered to be a shot boundary, particularly an abrupt shot boundary, if the entropy difference between $f_i$ and $f_{i+1}$ is greater than the predefined threshold $\tau_A$ [2, 3, 5]. This can be written as

$$\mathcal{B}(f_i) = \begin{cases} \text{True} & \text{if } \mathcal{D}(f_i, f_{i+1}) > \tau_A \\ \text{False} & \text{otherwise} \end{cases} \tag{2}$$

$\mathcal{B}(\cdot)$ decides whether the given frame $f_i$ is a shot boundary or not, and $\mathcal{D}$ computes the dissimilarity or difference between adjacent frames. A high value of $\tau_A$ gives better precision with poor recall, and a low value gives better recall with poor precision, as shown in Figure 2. In our experiments, the value of $\tau_A$ is set empirically to the value that gives a high F-score.
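The decision rule in (2) can be sketched in a few lines. This is only an illustration under stated assumptions: the per-frame entropies are taken as already computed, the difference between adjacent frames is taken to be the absolute entropy difference, and the names `shot_boundaries` and `tau_a` are hypothetical.

```python
def shot_boundaries(entropies, tau_a):
    """Indices i for which frame f_i is flagged as an abrupt shot boundary.

    Assumption: the difference between adjacent frames is the absolute
    difference of their entropies.
    """
    return [
        i
        for i in range(len(entropies) - 1)
        if abs(entropies[i] - entropies[i + 1]) > tau_a
    ]


# Frames 0-1 and 2-3 have similar entropy; the jump lies between 1 and 2.
print(shot_boundaries([5.0, 5.1, 7.0, 7.1], tau_a=1.0))
```

Raising `tau_a` flags fewer frames and lowering it flags more, which is the precision and recall trade-off described above.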
3.2. Key Frame and Local Key Point Descriptors Extraction. Let $\mathcal{S} = \{s_1, s_2, \ldots, s_n\}$ be the set of all shots obtained from the detected shot boundaries. One key frame, or a set of key frames, is selected from each shot. There are a number of ways to select representative frames, also known as key frames, from each shot. Since the entropies are already computed during shot boundary detection, an entropy-based key frame selection criterion is used [3].
For any given shot $s_i \in \mathcal{S}$, the frame with maximum entropy is selected as the key frame. It has been shown experimentally that frames with larger entropy have denser contents and therefore represent the shot more precisely. The shots are now represented by key frames, denoted by $\mathcal{F} = \{f_{s_1}, f_{s_2}, \ldots, f_{s_n}\}$, where $f_{s_i}$ denotes the key frame of shot $s_i$.
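The maximum-entropy selection rule can be sketched as follows. The function name `select_key_frames` is hypothetical, and the sketch assumes that a boundary index $i$ marks $f_i$ as the last frame of its shot, which is one reading of the boundary definition rather than something the text states explicitly.

```python
def select_key_frames(entropies, boundaries):
    """Pick, for every shot, the index of the frame with maximum entropy.

    Assumption: each boundary index is the last frame of a shot, and the
    final shot runs to the end of the video.
    """
    if not entropies:
        return []
    last = len(entropies) - 1
    ends = sorted(set(boundaries) | {last})
    key_frames = []
    start = 0
    for end in ends:
        shot = range(start, end + 1)
        key_frames.append(max(shot, key=lambda i: entropies[i]))
        start = end + 1
    return key_frames


# Two shots: frames 0-1 and frames 2-4.
print(select_key_frames([5.0, 5.1, 7.0, 7.3, 7.1], boundaries=[1]))
```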
Two images can be matched if they are similar according to some similarity criterion. Similarity is computed between the features of the images. SIFT [27] is widely used as an image feature in various applications of computer vision and video processing. For any given image, key points are detected, and those key points are represented by descriptors such as SIFT. On average, there are 2-3 thousand key points in a single image, which makes matching very expensive and exhaustive, as a single image is represented by 2-3 thousand feature vectors. Matching two images of size $800 \times 600$ takes 2 seconds on average on commodity hardware. If one image has to be matched with several hundred or several thousand images, it is not practical to use SIFT or any other raw descriptors. Quantization is used to reduce the feature space.
3.3. Quantization: BoVW Model. The bag of visual words model is widely used for feature quantization. Every key point descriptor $x_j \in \mathbb{R}^d$ is quantized to one of a finite number of centroids, indexed from 1 to $k$, where $k$ denotes the total number of centroids, also known as visual words, denoted by $\mathcal{V} = \{v_1, v_2, \ldots, v_k\}$ with each $v_i \in \mathbb{R}^d$. Let a frame $f$ be represented by a set of local key point descriptors $f_X = \{x_1, x_2, \ldots, x_m\}$, where $x_i \in \mathbb{R}^d$. In the BoVW model, a function $\mathcal{G}$ is defined as

$$\mathcal{G} : \mathbb{R}^d \to [1, k], \qquad x_i \mapsto \mathcal{G}(x_i) \tag{3}$$

$\mathcal{G}$ maps a descriptor $x_i \in \mathbb{R}^d$ to an integer index. For a given frame $f$, the bag of visual words $\mathcal{I} = \{\mu_1, \mu_2, \ldots, \mu_k\}$ is computed, where $\mu_i$ indicates the number of times $v_i$ appears in frame $f$; $\mathcal{I}$ is unit normalized at the end. Usually, $k$-means or hierarchical $k$-means clustering is applied to obtain the centroids (visual words) $\mathcal{V}$. The value of $k$ is kept very large for image matching or retrieval applications; the suggested value of $k$ is 1 million. The accuracy of quantization depends mainly on the value of $k$: if the value is small, two different key point descriptors may be quantized to the same visual word, which decreases distinctiveness; if the value is very large, two similar key point descriptors that are slightly distorted may be assigned to different visual words, which decreases robustness [28].
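A toy version of the quantization in (3) and the resulting histogram may make the construction concrete. The sketch below assumes a vocabulary that has already been learned, uses squared Euclidean distance for the nearest-word assignment, and uses L2 normalization for the final step; the text only says "unit normalized", so the choice of norm is an assumption, as are the names `nearest_word` and `bovw_vector`. Indices are 0-based here, whereas the text counts from 1 to $k$.

```python
import math


def nearest_word(descriptor, vocabulary):
    """Index of the visual word closest to the descriptor."""
    return min(
        range(len(vocabulary)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(descriptor, vocabulary[i])),
    )


def bovw_vector(descriptors, vocabulary):
    """Histogram of visual word occurrences, L2-normalized (assumed norm)."""
    counts = [0.0] * len(vocabulary)
    for x in descriptors:
        counts[nearest_word(x, vocabulary)] += 1.0
    norm = math.sqrt(sum(c * c for c in counts))
    return [c / norm for c in counts] if norm > 0 else counts


# Two visual words in 2-D; two descriptors fall near the first, one near the second.
vocabulary = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
print(bovw_vector(descriptors, vocabulary))
```

The linear scan over the vocabulary is only for readability; with $k$ in the tens of thousands, an approximate nearest-neighbor index would normally replace it.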
In the case of video segmentation, the scenario is different from searching for or matching one image against a very large database whose images undergo severe transformations, such as changes in illumination, scale, and viewpoint, or scenes captured at different times. In video segmentation, an image is matched with only a few other images, 4 to 7, in a sliding window, and these contain slightly different contents. Each image in the sliding window is a key frame that represents a shot; an example of sliding window matching is shown in Figure 3.
In the proposed framework, the value of $k$ is kept far smaller than the value suggested in the literature [2] without compromising the segmentation accuracy. In our experiments, $k = 25000$ gives approximately the same accuracy as the value 500000 used in our previous work [2]. For this experiment, the value of $k$ was gradually increased from 5000 to 30000 in steps of 1000, and it was found that $k = 25000$ gives approximately the same accuracy as our previous work [2].
3.4. Quantization: VLAD Model. VLAD is an emerging quantization framework for local key point descriptors [29]. Instead of computing a histogram of visual words, it computes the sum of the residuals between the descriptors and their assigned visual words and concatenates them into a single vector of dimension $d \times k$. Let $\mathcal{G}_V$ be the VLAD quantization function [30]:

$$\mathcal{G}_V : \mathbb{R}^d \to \mathcal{V}, \qquad x_i \mapsto \mathcal{G}_V(x_i) = \arg\min_{v \in \mathcal{V}} \lVert x_i - v \rVert^2 \tag{4}$$

The VLAD is computed in three steps:
(1) Offline, the visual words $\mathcal{V}$ are obtained.
(2) All the key point descriptors $f_X$ obtained from the given frame are quantized using (4).
(3) VLAD is computed for the given frame as $\mathcal{J}_f = \{j_1, j_2, \ldots, j_k\}$, where each $j_q$ is a $d$-dimensional vector obtained as follows:

$$j_q = \sum_{x \in f_X \,:\, \mathcal{G}_V(x) = v_q} (x - v_q) \tag{5}$$

$\mathcal{J}_f$ is a $d \times k$ dimensional feature. In the case of SIFT, $d = 128$, and the recommended values are $k \in \{64, 128, 256\}$ [29]. As stated above, video segmentation does not require very large values of $k$. In our experiments, the value of $k$ for VLAD is 16, so $\mathcal{J}$ using SIFT is $128 \times 16 = 2048$ dimensional. $\mathcal{J}$ is unit normalized at the end. The vector is very compact without loss of accuracy, as shown in the experiments.
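The three steps can be sketched on toy data as follows. The vocabulary is assumed to be given (step 1 happens offline), the final normalization is assumed to be L2, and the names `nearest_word` and `vlad_vector` are hypothetical; this is an illustration of the aggregation in (4) and (5), not the implementation used in the experiments.

```python
import math


def nearest_word(descriptor, vocabulary):
    """Index of the visual word minimizing the squared Euclidean distance."""
    return min(
        range(len(vocabulary)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(descriptor, vocabulary[i])),
    )


def vlad_vector(descriptors, vocabulary):
    """Concatenated per-word residual sums, L2-normalized (assumed norm)."""
    d = len(vocabulary[0])
    residuals = [[0.0] * d for _ in vocabulary]
    for x in descriptors:
        q = nearest_word(x, vocabulary)
        for t in range(d):
            residuals[q][t] += x[t] - vocabulary[q][t]
    flat = [value for block in residuals for value in block]  # length d * k
    norm = math.sqrt(sum(value * value for value in flat))
    return [value / norm for value in flat] if norm > 0 else flat


# k = 2 visual words in d = 2 dimensions give a 4-dimensional VLAD vector.
vocabulary = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
print(vlad_vector(descriptors, vocabulary))
```

With $d = 128$ and $k = 16$, the same construction yields the 2048-dimensional vector mentioned above.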
3.5. Scene Boundary Detection. Algorithm 1 is used to find the scene boundaries [2]. $\mathcal{H}$ denotes the feature vectors of the key frames; the feature vectors are either the VLAD or the BoVW vectors explained in the previous sections. The similarity between two key frames is decided by the function $\mathcal{D}$, which is computed as follows:

$$\mathcal{D}(H_i, H_j) = \sum_{q=1}^{N} \min(h_{iq}, h_{jq}) \tag{6}$$

where $h_{iq}$ is the $q$th component of $H_i$ and $N$ is the dimensionality of the feature vectors. In this form, $\mathcal{D}$ is the histogram intersection, so larger values indicate more similar key frames. Two key frames are treated as similar if $\mathcal{D}(\cdot) > \tau_s$. The value of $\tau_s$ is the average of the minimum and maximum similarities of the similar shots on a subset of the videos used in the experiments. The average of the similarity scores is widely used as the value of $\tau_s$; in our experiments, however, the average of the similarity scores gives low segmentation accuracy, i.e., 0.713.
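As a small illustration of (6) and of the threshold rule described above, the following sketch computes the sum of element-wise minima and the midpoint between the smallest and largest similarity scores. The function names and the toy numbers are hypothetical; the scores used for the threshold would come from similar shots in a subset of the videos.

```python
def histogram_intersection(h_i, h_j):
    """Sum of element-wise minima of two feature vectors, as in (6)."""
    return sum(min(a, b) for a, b in zip(h_i, h_j))


def midrange_threshold(similarities):
    """Average of the minimum and the maximum similarity scores."""
    return (min(similarities) + max(similarities)) / 2.0


def is_similar(h_i, h_j, tau_s):
    """Two key frames are treated as similar if their score exceeds tau_s."""
    return histogram_intersection(h_i, h_j) > tau_s


tau_s = midrange_threshold([0.4, 0.6, 0.9])
print(tau_s, is_similar([0.5, 0.5, 0.0], [0.5, 0.25, 0.25], tau_s))
```

Each comparison touches every component once, which is why the cost grows linearly with the dimensionality of the vectors.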
4. Experiments and Results
Cinematic and drama videos are used for scene boundary detection; the list of movies and dramas is given in Table 1. The F-score is used as the performance metric for scene boundary detection. There is no benchmark dataset. Two strategies have been used to obtain the ground truth: first-party and third-party ground truth. First-party ground truth is generated by the authors, and third-party ground truth is collected from experts who have adequate knowledge of shot and scene boundaries [2, 3]. To make the ground truth unbiased, the third-party approach is used in our experiments [3, 5, 26].

Figure 3: Example of key frame matching in the sliding window of length $\mathcal{L} = 3$. Each frame represents a shot, and there are 9 consecutive shots $f_{s_1}, \ldots, f_{s_9}$. Each key frame $f_{s_i}$ is matched with its next three neighbors.

Algorithm 1: Scene detection algorithm.
Require: $\mathcal{H} = \{H_1, H_2, \ldots, H_n\}$
    A[1] ← 1
    u ← 1
    index ← 2
    for each H_i ∈ H, i = [1, n − 1] do
        isSimilar ← false
        for j = i + 1 to i + L do
            if not Contains(A, 1, j + 1) and D(H_i, H_j) > T_s then
                A[index] ← j
                isSimilar ← true
            end if
        end for
        if not isSimilar and (i ≥ A[index]) then
            add (u, A[index]) to Z
            u ← A[index] + 1
            index ← index + 1
        end if
    end for
    Merge the short scenes
    return Z
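The sliding-window idea behind Algorithm 1 can be sketched in simplified form. The code below is not a line-by-line translation of Algorithm 1: it keeps only the core idea (a scene stays open while some shot in it is similar to a shot within the next few positions), omits the final step that merges short scenes, uses 0-based indices, and assumes histogram intersection as the similarity. The names `detect_scenes`, `window`, and `tau_s` are hypothetical.

```python
def histogram_intersection(h_i, h_j):
    """Similarity between two feature vectors (sum of element-wise minima)."""
    return sum(min(a, b) for a, b in zip(h_i, h_j))


def detect_scenes(features, window, tau_s):
    """Group consecutive shots into scenes as (first_shot, last_shot) pairs."""
    n = len(features)
    scenes = []
    start = 0   # first shot of the scene currently being built
    reach = 0   # furthest shot known to belong to the current scene
    for i in range(n):
        # Compare shot i with the next `window` shots.
        for j in range(i + 1, min(i + window, n - 1) + 1):
            if histogram_intersection(features[i], features[j]) > tau_s:
                reach = max(reach, j)
        # No later shot is linked to the current scene: close it here.
        if i >= reach:
            scenes.append((start, i))
            start = i + 1
            reach = i + 1
    return scenes


a, b, c = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
# An alternating A/B dialogue followed by a new setting C.
print(detect_scenes([a, b, a, b, c, c], window=2, tau_s=0.5))
```

In the toy input, the alternating shots are linked through the window even though adjacent shots differ, so they end up in one scene, while the last two shots form a second scene.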
The accuracy of the proposed system can be seen in Table 1. Our dataset has two groups of completely different videos. One group consists of cinematic movies with entirely different environments, challenging effects, and complex motion. The second group consists of indoor drama serials, which are easier to segment than cinematic movies because of their simple scenes with no challenging effects; that is why the length of the sliding window $\mathcal{L}$ is different for the two groups. The sensitivity of $\mathcal{L}$ can be seen in Figure 4 [2]. In cinematic videos, the scenes are longer and the shots are shorter. In just a few seconds, there are sometimes more than 20 shots due to different effects and actions. The value of $\mathcal{L}$ is therefore marginally larger for cinematic videos than for drama videos, although a single value can also be used for all types.
Advances in Multimedia 9
[30] J Delhumeau P-HGosselinH Jegou andP Perez ldquoRevisitingthe VLAD image representationrdquo inProceedings of the 21st ACMInternational Conference on Multimedia pp 653ndash656 2013
Chemical EngineeringInternational Journal of Antennas and
Propagation
International Journal of
Hindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom Volume 2018
Navigation and Observation
International Journal of
Hindawi
wwwhindawicom Volume 2018
Advances in
Multimedia
Submit your manuscripts atwwwhindawicom
shots, and then they cluster the shots using a clustering technique. The authors of [19] proposed a technique known as intermediate fusion, which uses all the information from different modalities. They formulated scene detection as an optimization problem and solved it via dynamic programming [19]. In earlier work [18], the same authors proposed dividing a video into scenes using its sequential structure. In this technique, they decide on locations for video segmentation and inspect only the possible partitionings. The video is represented by a set of features, each with an associated distance metric. The segmentation depends purely on the input features and the distance metric [18].
Furthermore, a different technique was proposed in [20], which uses spectral clustering with automatic selection of the number of clusters and extracts a normalized histogram for each shot. The Bhattacharyya distance and the temporal distance are used as distance metrics. The authors note that clustering is not consistent, since adjacent shots may belong to different clusters [20].
Sakarya et al. [21] used a graph construction technique for the segmentation of video into scenes. They construct a graph weighted by temporal and spatial similarity functions. From this graph the dominant shots are detected, and, to enforce a temporal consistency constraint, the edges of the scene are determined via the mean and standard deviation of the shot positions. This process continues until the whole video is allocated to scenes. Lin et al. [22] used color histograms for shot boundary detection and then formed scenes by merging similar shots, identifying local minima and maxima to determine the scene transitions.
Baraldi et al. [4] used another approach for shot and scene detection, based on color histograms and clustering, respectively. The authors first detect the shots using color histograms and then cluster the shots using hierarchical K-means clustering, creating N clusters for N shots. Each shot is initially assigned to its own cluster; the two least dissimilar clusters under the distance metric are then found and merged. This process continues until all the scenes are detected and the whole video is covered.
Chen et al. [23] proposed an approach for scene detection in H.264 video sequences. They define a scene change factor that is used to reserve bits for each frame. Their method reduces the rate error and performs better than the JVT-G012 algorithm. The work in [24] proposed a technique for scene change detection specifically for H.264/AVC encoded video sequences, taking into consideration the design and performance evaluation of the system. They further used a dynamic threshold that adapts to and tracks different descriptors, which increased the accuracy of the system in locating true scenes in the videos.
3. Proposed Methodology for Scene Detection
The proposed framework comprises shot boundary detection, key frame extraction, local key point descriptor extraction from key frames, feature quantization, and scene boundary detection.

Figure 2: Sensitivity of $\tau_A$ on the movie Pink Panther (2006). Precision, recall, and F-score are plotted against the entropy difference (%).
3.1. Shot Boundary Detection. Shot boundary detection is the primary step for any kind of video operation. There are a number of frameworks for shot boundary detection. We use a technique based on entropy differences [5, 26]. The entropy is computed for each frame, and the differences between adjacent frames are computed. The frame $f_i$ is considered to be a shot boundary, in particular an abrupt shot boundary, if the entropy difference between $f_i$ and $f_{i+1}$ is greater than the predefined threshold $\tau_A$ [2, 3, 5]. This can be written as
$$
\mathcal{B}(f_i) =
\begin{cases}
\text{True} & \text{if } \mathcal{D}(f_i, f_{i+1}) > \tau_A, \\
\text{False} & \text{otherwise.}
\end{cases}
\tag{2}
$$
$\mathcal{B}(\cdot)$ decides whether the given frame $f_i$ is a shot boundary or not, and $\mathcal{D}$ computes the dissimilarity, or difference, between adjacent frames. A high value of $\tau_A$ gives better precision with poor recall, and a low value gives better recall with poor precision, as shown in Figure 2. In our experiments, $\tau_A$ is set empirically to the value that gives a high F-score.
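As a rough illustration of the rule in (2), the following minimal sketch computes a Shannon entropy per frame and flags a boundary when adjacent entropies differ by more than a threshold. The function names, the representation of a frame as a flat list of gray levels, and the use of the absolute entropy difference for $\mathcal{D}$ are assumptions made for this example; the exact entropy and difference definitions are those of [5, 26].

```python
import math
from collections import Counter


def frame_entropy(pixels):
    """Shannon entropy (in bits) of a flat sequence of gray-level values."""
    counts = Counter(pixels)
    total = len(pixels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def shot_boundaries(frames, tau_a):
    """Return every index i for which B(f_i) is True.

    Assumption: D(f_i, f_{i+1}) is taken as the absolute difference of the
    two frame entropies; tau_a plays the role of the threshold tau_A.
    """
    entropies = [frame_entropy(f) for f in frames]
    return [
        i
        for i in range(len(entropies) - 1)
        if abs(entropies[i] - entropies[i + 1]) > tau_a
    ]


# Toy example: two flat frames followed by two textured frames.
frames = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 1, 2, 3], [0, 1, 2, 3]]
boundaries = shot_boundaries(frames, tau_a=1.0)
```

Raising `tau_a` in this sketch removes weak detections (fewer false alarms, more misses), which mirrors the precision and recall trade-off described above.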
3.2. Key Frame and Local Key Point Descriptors Extraction. Let $\mathcal{S} = \{s_1, s_2, \ldots, s_n\}$ be the set of all shots delimited by the detected shot boundaries. One key frame, or a set of key frames, is selected from each shot. There are a number of ways to select representative frames, a.k.a. key frames, from each shot. Since the entropies are already computed during shot boundary detection, an entropy-based key frame selection criterion is used [3].
For any given shot $s_i \in \mathcal{S}$, the frame with maximum entropy is selected as the key frame. It has been shown experimentally that the larger the entropy, the denser the contents of the frame, so such a frame represents the shot precisely. The shots are now represented by key frames, denoted by $\mathcal{F} = \{f_{s_1}, f_{s_2}, \ldots, f_{s_n}\}$, where $f_{s_i}$ denotes the key frame of shot $s_i$.
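The maximum-entropy selection rule can be sketched as follows. This is an illustrative example only: the name `select_key_frames` and the convention that a boundary index marks the last frame of a shot are assumptions, not part of the original framework.

```python
def select_key_frames(entropies, boundaries):
    """Return, for each shot, the index of its maximum-entropy frame.

    entropies:  per-frame entropy values, already available from shot
                boundary detection.
    boundaries: indices i such that frame i is the last frame of a shot
                (an assumed convention for this sketch).
    """
    if not entropies:
        return []
    # The final shot always ends at the last frame of the video.
    ends = sorted(set(boundaries) | {len(entropies) - 1})
    key_frames = []
    start = 0
    for end in ends:
        shot = range(start, end + 1)
        key_frames.append(max(shot, key=lambda i: entropies[i]))
        start = end + 1
    return key_frames


# Two shots: frames 0..2 and frames 3..4.
keys = select_key_frames([1.0, 3.0, 2.0, 5.0, 4.0], boundaries=[2])
```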
Two images can be matched if they are similar according to some similarity criterion. Similarity is computed between the features of the images. SIFT [27] is widely used as an image feature in various applications of computer vision and video processing. For any given image, key points are detected, and those key points are represented by descriptors such as SIFT. On average there are 2-3 thousand key points in a single image, which makes matching very expensive and exhaustive, as a single image is represented by 2-3 thousand feature vectors. Matching two images of size 800 × 600 takes 2 seconds on average on commodity hardware. If one image has to be matched with several hundred or several thousand images, it is not practical to use SIFT or any other raw descriptors. Quantization is used to reduce the feature space.
3.3. Quantization: BoVW Model. The bag of visual words model is widely used for feature quantization. Every key point descriptor $x_j \in \mathbb{R}^d$ is quantized to one of a finite number of centroids, indexed from 1 to $k$, where $k$ denotes the total number of centroids, a.k.a. visual words, denoted by $\mathcal{V} = \{v_1, v_2, \ldots, v_k\}$ with each $v_i \in \mathbb{R}^d$. Let a frame $f$ be represented by the local key point descriptors $f_X = \{x_1, x_2, \ldots, x_m\}$, where $x_i \in \mathbb{R}^d$. In the BoVW model, a function $\mathcal{G}$ is defined as
$$
\begin{aligned}
\mathcal{G} &: \mathbb{R}^d \longmapsto [1, k], \\
x_i &\longmapsto \mathcal{G}(x_i).
\end{aligned}
\tag{3}
$$
$\mathcal{G}$ maps a descriptor $x_i \in \mathbb{R}^d$ to an integer index. For a given frame $f$, the bag of visual words $\mathcal{I} = \{\mu_1, \mu_2, \ldots, \mu_k\}$ is computed, where $\mu_i$ indicates the number of times $v_i$ appears in frame $f$, and $\mathcal{I}$ is unit normalized at the end. Mostly, $k$-means or hierarchical $k$-means clustering is applied to obtain the centroids (visual words) $\mathcal{V}$. The value of $k$ is kept very large for image matching or retrieval applications; the suggested value of $k$ is 1 million. The accuracy of quantization depends mainly on the value of $k$: if the value is small, two different key point descriptors will be quantized to the same visual word, which decreases distinctiveness; if the value is very large, two similar key point descriptors that are slightly distorted can be assigned to different visual words, which decreases robustness [28].
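The quantization step can be illustrated with a small sketch. It assumes that $\mathcal{G}$ assigns each descriptor to its nearest centroid (the usual choice with $k$-means vocabularies) and that "unit normalized" means L1 normalization; both are assumptions made for this example, as are the function names. Indices are 0-based here, whereas the text uses 1 to $k$.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(descriptor, vocabulary):
    """G: map a descriptor to the (0-based) index of its nearest visual word."""
    return min(
        range(len(vocabulary)),
        key=lambda i: squared_distance(descriptor, vocabulary[i]),
    )


def bovw_histogram(descriptors, vocabulary):
    """Count visual-word occurrences in a frame and L1-normalize the counts."""
    counts = [0] * len(vocabulary)
    for x in descriptors:
        counts[quantize(x, vocabulary)] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [c / total for c in counts]


# Toy vocabulary with k = 2 words in a 2-dimensional descriptor space.
vocabulary = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [0.0, 1.0], [9.0, 10.0], [1.0, 1.0]]
histogram = bovw_histogram(descriptors, vocabulary)
```

With real SIFT descriptors the vocabulary would come from clustering a training set, and an approximate nearest-neighbor search would normally replace the linear scan in `quantize`.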
Video segmentation differs from searching for or matching one image against a very large database whose images undergo severe transformations, such as changes in illumination, scale, and viewpoint, or scenes captured at different times. In video segmentation, an image is matched with a few other images, 4 to 7, in a sliding window, which contain slightly different contents. Each image in the sliding window is a key frame that represents a shot; an example of sliding window matching is shown in Figure 3.
In the proposed framework, the value of $k$ is kept far smaller than the value suggested in the literature [2], without compromising segmentation accuracy. In our experiments, the value of $k$ was gradually increased from 5000 to 30000 in steps of 1000, and $k = 25000$ was found to give approximately the same accuracy as the value 500000 used in our previous work [2].
3.4. Quantization: VLAD Model. VLAD is an emerging quantization framework for local key point descriptors [29]. Instead of computing a histogram of visual words, it computes the sum of the residuals between the descriptors and their assigned visual words and concatenates these sums into a single vector of dimension $d \times k$. Let $\mathcal{G}_V$ be the VLAD quantization function [30]:
$$
\begin{aligned}
\mathcal{G}_V &: \mathbb{R}^d \longmapsto v_j \in \mathcal{V}, \\
x_i &\longmapsto \mathcal{G}_V(x_i) = \arg\min_{v \in \mathcal{V}} \lVert x_i - v \rVert^{2}.
\end{aligned}
\tag{4}
$$
The VLAD is computed in three steps:
(1) offline, the visual words $\mathcal{V}$ are obtained;
(2) all the key point descriptors obtained from the given frame, $f_X$, are quantized using (4);
(3) the VLAD is computed for the given frame as $\mathcal{J}_f = \{j_1, j_2, \ldots, j_k\}$, where each $j_q$ is a $d$-dimensional vector obtained as follows:
$$
j_q = \sum_{x_i \in f_X \,:\, \mathcal{G}_V(x_i) = v_q} \left( x_i - v_q \right).
\tag{5}
$$
$\mathcal{J}_f$ is a $d \times k$ dimensional feature. In the case of SIFT, $d = 128$, and the recommended values are $k \in \{64, 128, 256\}$ [29]. As stated above, video segmentation does not require very large values of $k$. In our experiments, the value of $k$ for VLAD is 16, so $\mathcal{J}$ using SIFT is $128 \times 16 = 2048$ dimensional. $\mathcal{J}$ is unit normalized at the end. The vector is very compact, without loss of accuracy, as shown in the experiments.
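The three steps above can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: the function names are hypothetical, the vocabulary is given rather than learned, and "unit normalized" is taken to mean L2 normalization, which is the normalization used for VLAD in [29].

```python
import math


def nearest_word(descriptor, vocabulary):
    """G_V: index of the visual word closest to the descriptor, as in (4)."""
    return min(
        range(len(vocabulary)),
        key=lambda i: sum((x - v) ** 2 for x, v in zip(descriptor, vocabulary[i])),
    )


def vlad(descriptors, vocabulary):
    """Concatenate per-word residual sums into one d*k vector, then L2-normalize."""
    k, d = len(vocabulary), len(vocabulary[0])
    blocks = [[0.0] * d for _ in range(k)]
    for x in descriptors:
        q = nearest_word(x, vocabulary)
        for t in range(d):
            # Accumulate the residual x - v_q in block q, as in (5).
            blocks[q][t] += x[t] - vocabulary[q][t]
    flat = [value for block in blocks for value in block]
    norm = math.sqrt(sum(value * value for value in flat))
    return [value / norm for value in flat] if norm > 0 else flat


# Toy setting: k = 2 words, d = 2, so the VLAD vector has 4 components.
vocabulary = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [0.0, 1.0], [9.0, 10.0]]
vector = vlad(descriptors, vocabulary)
```

With $d = 128$ and $k = 16$ the same code would return a 2048-dimensional vector, matching the dimensionality reported above.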
3.5. Scene Boundary Detection. Algorithm 1 is used to find the scene boundaries [2]. $\mathcal{H}$ denotes the feature vectors of the key frames; the feature vectors are either the VLAD or the BoVW vectors explained in the previous sections. The similarity between two key frames is decided by the function $\mathcal{D}$ (histogram intersection), which is computed as follows:
$$
\mathcal{D}(H_i, H_j) = \sum_{q=1}^{N} \min\left( h_{iq}, h_{jq} \right).
\tag{6}
$$
Two key frames are treated as similar if $\mathcal{D}(\cdot) > \tau_s$. The value of $\tau_s$ is the average of the minimum and the maximum similarities of similar shots on a subset of the videos used in the experiments. The average similarity score is widely used as the value of $\tau_s$; in our experiments, however, the average similarity score gives low segmentation accuracy, i.e., 0.713.
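The similarity in (6) and the threshold rule can be written compactly. The sketch below is illustrative; the names `intersection_similarity` and `midrange_threshold` are assumptions, and the list of similarities passed to the threshold helper stands in for the scores measured on similar shots in a subset of the videos.

```python
def intersection_similarity(h_i, h_j):
    """Equation (6): sum of element-wise minima of two feature vectors."""
    return sum(min(a, b) for a, b in zip(h_i, h_j))


def midrange_threshold(similarities):
    """tau_s as the average of the minimum and the maximum similarity."""
    return (min(similarities) + max(similarities)) / 2.0


def is_similar(h_i, h_j, tau_s):
    """Two key frames are treated as similar when D exceeds tau_s."""
    return intersection_similarity(h_i, h_j) > tau_s


# Two L1-normalized toy histograms sharing half of their mass.
score = intersection_similarity([0.5, 0.5, 0.0], [0.25, 0.25, 0.5])
tau_s = midrange_threshold([0.2, 0.4, 0.9])
```

For L1-normalized histograms the score lies between 0 and 1, so a midrange threshold is insensitive to how many similar pairs cluster near either extreme, unlike the plain average.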
4. Experiments and Results
Cinematic and drama videos are used for scene boundary detection; the list of movies and dramas is given in Table 1. The F-score is used as the performance metric for scene boundary detection. There is no benchmark dataset. Two strategies have been used to obtain the ground truth: first party and third party. First party ground truth is generated by the authors, and third party ground truth is collected from experts who have adequate knowledge of shots and scene boundaries [2, 3]. To keep the ground truth unbiased, the third party approach is used in our experiments [3, 5, 26].

Figure 3: Example of key frame matching in a sliding window of length $\mathcal{L} = 3$. Each frame represents a shot, and there are 9 consecutive shots $f_{s_1}, \ldots, f_{s_9}$. Each key frame $f_{s_i}$ is matched with its next three neighbors.

Require: $\mathcal{H} = \{H_1, H_2, \ldots, H_n\}$
(1) $A[1] \leftarrow 1$
(2) $u \leftarrow 1$
(3) index $\leftarrow 2$
(4) for each $H_i \in \mathcal{H}$, $i = [1, n - 1]$ do
(5)     isSimilar $\leftarrow$ false
(6)     for $j = i + 1$ to $i + \mathcal{L}$ do
(7)         if not Contains($A$, 1, $j + 1$) and $\mathcal{D}(H_i, H_j) > \tau_s$ then
(8)             $A[\text{index}] \leftarrow j$
(9)             isSimilar $\leftarrow$ true
(10)        end if
(11)    end for
(12)    if not isSimilar and ($i \geq A[\text{index}]$) then
(13)        add ($u$, $A[\text{index}]$) to $\mathcal{Z}$
(14)        $u \leftarrow A[\text{index}] + 1$
(15)        index $\leftarrow$ index + 1
(16)    end if
(17) end for
(18) Merge the short scenes
(19) return $\mathcal{Z}$
Algorithm 1: Scene detection algorithm.
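The sliding-window idea behind Algorithm 1 can be sketched in Python as follows. This is a simplified interpretation rather than a line-by-line transcription: it tracks the furthest shot that any shot of the current scene links to, closes the scene once the current shot reaches that point, and omits the final "merge the short scenes" step. The names `detect_scenes`, `reach`, and `intersection` are assumptions introduced for the example, and shot indices are 0-based.

```python
def intersection(h_i, h_j):
    """Histogram intersection, as in (6)."""
    return sum(min(a, b) for a, b in zip(h_i, h_j))


def detect_scenes(features, window, tau_s, similarity=intersection):
    """Group consecutive shots into scenes using a forward sliding window.

    features: one feature vector (BoVW or VLAD) per key frame, in shot order.
    window:   sliding window length L (number of following shots compared).
    Returns a list of (first_shot, last_shot) pairs, 0-based and inclusive.
    """
    n = len(features)
    scenes = []
    start = 0  # first shot of the scene currently being built
    reach = 0  # furthest shot known to belong to the current scene
    for i in range(n):
        # Compare shot i with the next `window` shots.
        for j in range(i + 1, min(i + window, n - 1) + 1):
            if similarity(features[i], features[j]) > tau_s:
                reach = max(reach, j)
        # No shot of this scene links beyond shot i: close the scene here.
        if i >= reach:
            scenes.append((start, i))
            start = i + 1
            reach = i + 1
    return scenes


# Shots 0..2 share one visual word, shots 3..4 share another.
features = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1]]
scenes = detect_scenes(features, window=2, tau_s=0.5)
```

Because links are followed through the window, an alternating pattern such as A, B, A, B (a typical dialogue) stays in one scene as long as the window is long enough to bridge the alternation.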
The accuracy of the proposed system is reported in Table 1. Our dataset has two groups of completely different videos. One group consists of cinematic movies with widely varying environments, challenging effects, and complex motion. The second group consists of indoor drama serials, which are easier to segment than cinematic movies because of their simple scenes with no challenging effects; that is why the length of the sliding window $\mathcal{L}$ is different for the two groups. The sensitivity of $\mathcal{L}$ can be seen in Figure 4 [2]. In cinematic videos, the scenes are longer and the shots are shorter: within just a few seconds there are sometimes more than 20 shots, due to different effects and actions. The value of $\mathcal{L}$ is therefore marginally larger for cinematic videos than for drama videos, though a single value can also be used for all types.
The values of $k$ for VLAD and BoVW are smaller in the proposed experiments than the recommended values, which increases the efficiency of similarity computation. The similarity computation by (6), or by any other distance, is at least $O(n)$, where $n$ denotes the dimensionality of the feature. The computation of similarity is therefore faster when $n$ is smaller, as shown in Figure 5. It can be seen that VLAD is faster than BoVW because VLAD has fewer dimensions than BoVW. The recommended value of $k$ for BoVW is 1000000, as discussed in the previous section, whereas in our experiments the value of $k$ is 25000.
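The dimensionalities quoted in this section can be compared directly. The following is only arithmetic on the numbers stated above, assuming SIFT descriptors with $d = 128$ and dense feature vectors:

$$
\begin{aligned}
n_{\text{VLAD}} &= d \times k = 128 \times 16 = 2048, \\
n_{\text{BoVW}} &= k = 25000, \\
\frac{n_{\text{BoVW}}}{n_{\text{VLAD}}} &= \frac{25000}{2048} \approx 12.2, \\
\frac{10^{6}}{25000} &= 40.
\end{aligned}
$$

Under the $O(n)$ cost model, one comparison of two BoVW vectors therefore touches roughly 12 times as many components as one comparison of two VLAD vectors, and about 40 times fewer components than a comparison at the recommended vocabulary size of 1 million.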
5. Conclusion
Video segmentation is a primary step for video indexing and searching. Shot boundary detection divides videos into small units. These small units do not give meaningful insight into the story or theme of the video. However, grouping similar shots gives better insight into the video; each such group of similar shots is called a scene.
Table 1: Performance of BoVW and VLAD on cinematic and drama videos.
Figure 4: Sensitivity of $\mathcal{L}$ on different types of videos.
In this paper, we propose a framework that uses state-of-the-art searching techniques, BoVW and VLAD, which are widely used for image and video retrieval, for scene boundary detection. Images or video frames are represented by BoVW and VLAD, which are usually very high dimensional feature vectors. We show experimentally that, in scene boundary detection, competitive accuracy can be achieved while keeping the dimensions of BoVW and VLAD very small. The recommended dimension for BoVW is 1 million; in our experiments we tuned it to 25000. The recommended dimension of VLAD is 32768; in our experiments it is tuned to 2048. We exploit the sliding window for scene boundary detection. Within a very small sliding window, the contents of the video shots do not change drastically, which makes it possible to represent shots with reduced dimensions of BoVW and VLAD.

Figure 5: Timing plot of query image matching against all the images in the database (time in seconds versus data size, ×10^4, for VLAD and BoVW). VLAD always has fewer dimensions than BoVW, which makes VLAD faster than BoVW.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
We are thankful to Shin'ichi Satoh from the National Institute of Informatics, Japan, Nitin Afzulpurkar from the Asian Institute of Technology, Thailand, and Chadaporn Keatmanee from the Thai-Nichi Institute of Technology, Thailand, for their expertise, which greatly assisted this research.
References
[1] S. Lefevre and N. Vincent, "Efficient and robust shot change detection," Journal of Real-Time Image Processing, vol. 2, no. 1, pp. 23–34, 2007.
[2] J. Baber, S. Satoh, N. Afzulpurkar, and C. Keatmanee, "Bag of visual words model for videos segmentation into scenes," in Proceedings of the Fifth International Conference on Internet Multimedia Computing and Service, pp. 191–194, ACM, 2013.
[3] J. Baber, N. Afzulpurkar, and S. Satoh, "A framework for video segmentation using global and local features," International Journal of Pattern Recognition and Artificial Intelligence, vol. 27, no. 5, Article ID 1355007, 2013.
[4] L. Baraldi, C. Grana, and R. Cucchiara, "Shot and scene detection via hierarchical clustering for re-using broadcast video," in Proceedings of the International Conference on Computer Analysis of Images and Patterns, pp. 801–811, Springer, 2015.
[5] J. Baber, N. Afzulpurkar, and M. Bakhtyar, "Video segmentation into scenes using entropy and SURF," in Proceedings of the 2011 7th International Conference on Emerging Technologies (ICET '11), pp. 1–6, IEEE, 2011.
[6] T. Kikukawa and S. Kawafuchi, "Transaction of the institute of electronics development of an automatic summary editing system for the audio visual resources," Information and communication Engineers, vol. 75, no. 2, pp. 398–402, 1992.
[7] A. Nagasaka and Y. Tanaka, Visual database systems II, 1992.
[8] H. Zhang, A. Kankanhalli, and S. W. Smoliar, "Automatic partitioning of full-motion video," Multimedia Systems, vol. 1, no. 1, pp. 10–28, 1993.
[9] I. Koprinska and S. Carrato, "Temporal video segmentation: a survey," Signal Processing: Image Communication, vol. 16, no. 5, pp. 477–500, 2001.
[10] Z. Cernekova, I. Pitas, and C. Nikou, "Information theory-based shot cut/fade detection and video summarization," IEEE Transactions on Circuits and Systems for Video Technology, vol. 16, no. 1, pp. 82–91, 2006.
[11] T. Kikukawa and S. Kawafuchi, "Development of an automatic summary editing system for the audio-visual resources," Transactions on Electronics and Information, J75-A, pp. 204–212, 1992.
[12] A. Nagasaka, "Automatic video indexing and full-video search for object appearances," in Proceedings of the IFIP 2nd Working Conference on Visual Database Systems, 1992.
[13] G. C. Chavez, F. Precioso, M. Cord, S. Philipp-Foliguet, and A. D. A. Araujo, "Shot boundary detection at TRECVID 2006," in Proceedings of the TREC Video Retrieval Eval, vol. 15, 2006.
[14] X. Ling, O. Yuanxin, L. Huan, and X. Zhang, "A Method for Fast Shot Boundary Detection Based on SVM," in Proceedings of the 2008 Congress on Image and Signal Processing, vol. 2, pp. 445–449, IEEE, 2008.
[15] J. Li, Y. Ding, Y. Shi, and W. Li, "A divide-and-rule scheme for shot boundary detection based on SIFT," International Journal of Digital Content Technology and Its Applications, vol. 4, no. 3, pp. 202–214, 2010.
[16] M. Yeung, B.-L. Yeo, and B. Liu, "Segmentation of Video by Clustering and Graph Analysis," Computer Vision and Image Understanding, vol. 71, no. 1, pp. 94–109, 1998.
[17] Z. Rasheed and M. Shah, "Scene detection in Hollywood movies and TV shows," in Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 343–348, IEEE, 2003.
[18] D. Rotman, D. Porat, and G. Ashour, "Robust and efficient video scene detection using optimal sequential grouping," in Proceedings of the 18th IEEE International Symposium on Multimedia, ISM '16, pp. 275–280, IEEE, 2016.
[19] D. Rotman, D. Porat, and G. Ashour, "Robust video scene detection using multimodal fusion of optimally grouped features," in Proceedings of the 19th IEEE International Workshop on Multimedia Signal Processing, MMSP '17, pp. 1–6, 2017.
[20] L. Baraldi, C. Grana, and R. Cucchiara, "Analysis and re-use of videos in educational digital libraries with automatic scene detection," in Proceedings of the Italian Research Conference on Digital Libraries, pp. 155–164, Springer, 2015.
[21] U. Sakarya and Z. Telatar, "Video scene detection using dominant sets," in Proceedings of the 2008 15th IEEE International Conference on Image Processing - ICIP '08, pp. 73–76, IEEE, 2008.
[22] T. Lin, H. Zhang, and Q.-Y. Shi, "Video scene extraction by force competition," in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '01), pp. 753–756, 2001.
[23] X. Chen and F. Lu, "Adaptive rate control algorithm for H.264/AVC considering scene change," Mathematical Problems in Engineering, vol. 2013, Article ID 373689, 6 pages, 2013.
[24] G. Rascioni, S. Spinsante, and E. Gambi, "An optimized dynamic scene change detection algorithm for H.264/AVC encoded video sequences," International Journal of Digital Multimedia Broadcasting, vol. 2010, Article ID 864123, 9 pages, 2010.
[25] Z. Rasheed and M. Shah, "Scene detection in Hollywood movies and TV shows," in Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003.
[26] J. Baber, N. Afzulpurkar, M. N. Dailey, and M. Bakhtyar, "Shot boundary detection from videos using entropy and local descriptor," in Proceedings of the 2011 17th International Conference on Digital Signal Processing (DSP '11), pp. 1–6, IEEE, 2011.
[27] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[28] J. Baber, M. N. Dailey, S. Satoh, N. Afzulpurkar, and M. Bakhtyar, "BIG-OH: binarization of gradient orientation histograms," Image and Vision Computing, vol. 32, no. 11, pp. 940–953, 2014.
[29] H. Jegou, M. Douze, C. Schmid, and P. Perez, "Aggregating local descriptors into a compact image representation," in Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '10), pp. 3304–3311, 2010.
[30] J. Delhumeau, P.-H. Gosselin, H. Jegou, and P. Perez, "Revisiting the VLAD image representation," in Proceedings of the 21st ACM International Conference on Multimedia, pp. 653–656, 2013.
Chemical EngineeringInternational Journal of Antennas and
Propagation
International Journal of
Hindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom Volume 2018
Navigation and Observation
International Journal of
Hindawi
wwwhindawicom Volume 2018
Advances in
Multimedia
Submit your manuscripts atwwwhindawicom
Advances in Multimedia 5
shots are now represented by key frames and denoted byF = 1198911199041 1198911199042 119891119904119899 where 119891119904119894 denotes the key frame of shot119904119894
Two images can be matched if they are similar based onsome similarity criteria Similarity is computed between thefeatures of the images SIFT [27] is widely used as imagefeature for various applications of computer vision and videoprocessing For any given image key points are detected andthose key points are represented by some descriptors such asSIFT On average there are 2-3 thousand key points on singleimage whichmakes matching very expensive and exhaustiveas single image is represented by 2-3 thousand feature vectorsTomatch two images of size 800times600 each it takes 2 secondson commodity hardware on average If one image has to bematched with several hundreds or thousand images then it isnot practical to use SIFT or any rawdescriptors Quantizationis used to reduce the feature space
33 Quantization BoVW Model Bag of visual word modelis widely used for feature quantization Every key pointdescriptor 119909119895 sub R119889 is quantized into a finite number ofcentroids from 1 to 119896 where 119896 denotes the total number ofcentroids aka visual words denoted byV = V1 V2 V119896and each V119894 sub R119889 Let say a frame 119891 be represented bysome local key point descriptors119891119883 = 1199091 1199092 119909119898 where119909119894 sub R119889 In BoVWmodel a function G is defined as
G R119889 997891997888rarr [1 119896]
119909119894 997891997888rarr G (119909119894)(3)
G maps descriptor 119909119894 sub R119889 to an integer index Forgiven frame 119891 bag of visual word I = 1205831 1205832 120583119896 iscomputed 120583119894 indicates the number of times V119894 appeared inframe119891 andI is unit normalized at the endMostly 119896-meanor hierarchical 119896-mean clustering is applied and centroids(visual words) V are obtained The value of 119896 is keptvery large for image matching or retrieval applications thesuggested value of 119896 is 1 millionThe accuracy of quantizationmainly depends on the value of 119896 if the value is small thentwo different key point descriptors will be quantized to samevisual words which will decrease the distinctiveness or if thevalue is very large then two similar key point descriptorswhich are slightly distorted can be assigned different visualwords which will decrease the robustness [28]
In the case of the video segmentation the scenario isdifferent than the searching ormatching one imagewith set ofvery large database which have severe image transformationssuch as illumination scale viewpoint and scene capture atdifferent time In video segmentation image is matched withfew other images 4 to 7 in sliding window which containslightly different contents The each image in sliding windowis a key framewhich represents the shot an example of slidingwindow matching is shown in Figure 3
In proposed framework the value of 119896 is kept far smallerthan the value suggested in the literature [2] without compro-mising on the segmentation accuracy During experimentthe value of 119896 = 25000 gives approximately same accuracyas the value 500000 which is used in our previous work
[2] For the above-mentioned experiment the value of 119896 wasgradually increased from 5000 to 30000 by the factor of 1000and it was found that the value 119896 = 25000 gives approximatelysame accuracy as of our previous work [2]
34 Quantization VLAD Model VLAD is emerging quanti-zation framework for local key point descriptors [29] Insteadcomputing the histogram of visual words it computes thesum of the differences of residual descriptors with visualwords and concatenates into single vector of 119889 times 119896 Let G119881be VLAD quantization function [30]
G119881 R
119889 997891997888rarr V119895 isin V
119909119894 997891997888rarr G119881 (119909119894) = arg min
VisinV
1003817100381710038171003817119909119894 minus V10038171003817100381710038172 (4)
The VLAD is computed in three steps(1) offline visual words are obtained V(2) all the key point descriptors obtained from given
frame 119891119883 are quantized using (4)(3) VLAD is computed for given frame J119891 =
1198951 1198952 119895119896 where each 119895119902 is 119889-dimensional vectorobtained as follows
119895119902 = sum119883G119881(119883)=V119898
119883 minus V119898 (5)
J119891 is 119889 times 119896 dimensional feature In case of SIFT 119889 = 128and recommended value of 119896 isin 64 128 256 [29] As statedabove video segmentation does not require very large valuesof 119896 During experiments the value of 119896 for VLAD is 16 andJ using SIFT is 128 times 16 = 2048 dimensional J is unitnormalized at the end The vector is very compact withoutthe loss of accuracy as shown in experiments
35 Scene Boundary Detection Algorithm 1 is used to findthe scene boundaries [2] H denotes feature vectors for keyframes the feature vectors are either VLAD or BoVWvectorsexplained in the previous section The similarity between twokey frames is decided by dissimilarity function D which canbe computed as follows
D (119867119894 119867119895) =N
sum119902=1
min (ℎ119894119902 ℎ119895119902) (6)
Two key frames are treated as similar if their D() gt 120591119904 Thevalue of 120591119904 is the average of the minimum and the maximumsimilarities of the similar shots on a subset of the videos usedin the experiments The average of similarity score is widelyused as the value of 120591119904 In our experiment the average ofsimilarity scores gives low segmentation accuracy ie 0713
4 Experiments and Results
Cinematic and drama videos are used for scene boundarydetection list of movies and dramas is given in Table 1 F-score is used as performance metrics for scene boundarydetectionThere is no benchmark dataset Two strategies havebeen used to obtain the ground-truth first party and third
6 Advances in Multimedia
fsfs fs fs fs
fsfs fs fs
Figure 3 Example of key framesmatching in the sliding window of lengthL = 3 Each frame represents the shot and there are 9 consecutiveshots 1198911199041 1198911199049 Each key frame 119891119904119894 is matched with next three neighbors
Require H = 1198671 1198672 119867119899(1) 119860[1] larr997888 1(2) 119906 larr997888 1(3) index larr997888 2(4) for each 119867119894 isin H 119894 = [1 119899 minus 1] do(5) isSimilar larr997888 false(6) for 119895 = 119894 + 1 to 119894 + L do(7) if not Contains(119860 1 119895 + 1) and D(119867119894 119867119895) gt 119879119904 then(8) 119860[index] larr997888 119895(9) isSimilar larr997888true(10) end if(11) end for(12) if not isSimilar and(119894 ge 119860[index]) then(13) add (119906 119860[index]) toZ(14) 119906 larr997888 119860[index] + 1(15) index larr997888 index +1(16) end if(17) end for(18) Merge the short scenes(19) return Z
Algorithm 1 Scene detection algorithm
party ground-truth First party ground-truth is generated bythe authors and third party ground-truth is collected fromthe experts who have adequate knowledge of shots and sceneboundaries [2 3] To make ground-truth hinased third partyapproach is used in our experiments [3 5 26]
The accuracy of proposed system can be seen in Table 1Our dataset has two different groups with completely dif-ferent videos One group consists of cinematic movies withentirely different environment and challenging effects withcomplex motion of scenes On the other hand the secondgroup of data consists of indoor drama serials which areeasy to segment compared to cinematic movies becauseof their simple scene with no challenging effects that iswhy then length of the sliding window L is different forboth groups of dataset The sensitivity of L can be seen inFigure 4 [2] In cinematic videos the scenes are longer andshots are shorter In just few seconds there are sometimesmore than 20 shots due to different effects and actions Thevalue of L is marginally bigger compared on drama typesof videos Though single value can also be used for alltypes
The values of $k$ for VLAD and BoVW in the proposed experiments are smaller than the recommended values, which makes similarity computation more efficient. The similarity computation by (6), or by any other distance, is at least $O(n)$, where $n$ denotes the dimensionality of the feature. The computation is therefore faster when $n$ is smaller, as shown in Figure 5. VLAD is faster than BoVW because VLAD has fewer dimensions than BoVW. The recommended value of $k$ for BoVW is 1,000,000, as discussed in the previous section, whereas in our experiments the value of $k$ is 25,000.
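As a rough illustration of the $O(n)$ argument, the sketch below counts the multiply-add operations needed to compare one query vector against a database, using the dimensionalities reported in this paper. The database size `N` and the helper `comparison_cost` are hypothetical, and the count ignores constant factors, sparsity, and memory effects, so it indicates relative cost only and does not predict the timings in Figure 5.

```python
def comparison_cost(dim, database_size):
    """Multiply-add count for matching one query against a database
    with any distance that touches every dimension once, i.e. O(n) per pair."""
    return dim * database_size


BOVW_DIM = 25_000  # BoVW dimensionality used in the experiments
VLAD_DIM = 2_048   # VLAD dimensionality used in the experiments
N = 10_000         # hypothetical number of database images

bovw_ops = comparison_cost(BOVW_DIM, N)
vlad_ops = comparison_cost(VLAD_DIM, N)

# The ratio does not depend on N: it is just 25000 / 2048.
ratio = bovw_ops / vlad_ops

# Reduction relative to the recommended sizes quoted in the text.
bovw_reduction = 1_000_000 / BOVW_DIM  # 40.0
vlad_reduction = 32_768 / VLAD_DIM     # 16.0

print(bovw_ops, vlad_ops, round(ratio, 2), bovw_reduction, vlad_reduction)
```

Under these assumptions a BoVW comparison costs roughly 12 times as many operations as a VLAD comparison, which is consistent with VLAD being the faster of the two.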
5 Conclusion
Video segmentation is a primary step for video indexing and searching. Shot boundary detection divides videos into small units. These small units do not give meaningful insight into the video story or theme. However, grouping similar shots gives better insight into the video; such a group of similar shots is called a scene.
Advances in Multimedia 7
Table 1: Performance of BoVW and VLAD on cinematic and drama videos.
Figure 4: Sensitivity of $L$ on different types of videos.
In this paper, we propose a framework for scene boundary detection that uses state-of-the-art searching techniques, BoVW and VLAD, which are widely used for image and video retrieval. Images or video frames are represented by BoVW and VLAD, which are normally very high-dimensional feature vectors. We show experimentally that, for scene boundary detection, competitive accuracy can be achieved while keeping the dimensions of BoVW and VLAD very small. The recommended dimensionality for BoVW is 1 million; in our experiments it is reduced to 25,000. The recommended dimensionality of VLAD is 32,768; in our experiments it is reduced to 2048. We exploit the sliding window for scene boundary detection. Within a very small sliding window, the contents of the video shots do not change drastically, which makes it possible to represent shots with reduced-dimensional BoVW and VLAD vectors.
Figure 5: Timing plot of query image matching against all the images in the database (axes: data size versus time in seconds; curves: VLAD and BoVW). VLAD always has fewer dimensions than BoVW, which makes VLAD faster than BoVW.
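To make the difference in dimensionality concrete, the sketch below encodes a handful of toy local descriptors with both models. It is a simplified illustration under stated assumptions, not the pipeline used in the experiments: the codebook is given rather than learned by clustering, normalization is omitted, and the function names are hypothetical. BoVW yields one count per visual word ($k$ values), whereas VLAD concatenates one residual vector per visual word ($k \times d$ values for $d$-dimensional descriptors). If the descriptors were 128-dimensional, as with SIFT [27], a 2048-dimensional VLAD would correspond to $k = 16$; that pairing is an assumption and is not stated in this section.

```python
def nearest(centroids, x):
    """Index of the codebook centroid closest to descriptor x (squared L2)."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((a - b) ** 2 for a, b in zip(centroids[c], x)),
    )


def bovw_encode(descriptors, centroids):
    """Histogram of visual-word assignments: one count per centroid (length k)."""
    hist = [0] * len(centroids)
    for x in descriptors:
        hist[nearest(centroids, x)] += 1
    return hist


def vlad_encode(descriptors, centroids):
    """Sum of residuals (descriptor minus its nearest centroid) per centroid,
    flattened into a single vector of length k * d."""
    d = len(centroids[0])
    acc = [[0.0] * d for _ in centroids]
    for x in descriptors:
        c = nearest(centroids, x)
        for t in range(d):
            acc[c][t] += x[t] - centroids[c][t]
    return [v for row in acc for v in row]


# Toy codebook with k = 2 words in d = 2 dimensions, and three descriptors.
codebook = [(0.0, 0.0), (10.0, 10.0)]
local_descriptors = [(1.0, 0.0), (0.0, 2.0), (9.0, 10.0)]

print(bovw_encode(local_descriptors, codebook))  # [2, 1]
print(vlad_encode(local_descriptors, codebook))  # [1.0, 2.0, -1.0, 0.0]
```

Because VLAD stores residuals rather than counts, it can use a much smaller codebook than BoVW while still retaining descriptive detail, which is why its vectors can be far shorter.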
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
We are thankful to Shin'ichi Satoh from the National Institute of Informatics, Japan; Nitin Afzulpurkar from the Asian Institute of Technology, Thailand; and Chadaporn Keatmanee from the Thai-Nichi Institute of Technology, Thailand, for their expertise, which greatly assisted this research.
References
[1] S. Lefevre and N. Vincent, "Efficient and robust shot change detection," Journal of Real-Time Image Processing, vol. 2, no. 1, pp. 23–34, 2007.
[2] J. Baber, S. Satoh, N. Afzulpurkar, and C. Keatmanee, "Bag of visual words model for videos segmentation into scenes," in Proceedings of the Fifth International Conference on Internet Multimedia Computing and Service, pp. 191–194, ACM, 2013.
[3] J. Baber, N. Afzulpurkar, and S. Satoh, "A framework for video segmentation using global and local features," International Journal of Pattern Recognition and Artificial Intelligence, vol. 27, no. 5, Article ID 1355007, 2013.
[4] L. Baraldi, C. Grana, and R. Cucchiara, "Shot and scene detection via hierarchical clustering for re-using broadcast video," in Proceedings of the International Conference on Computer Analysis of Images and Patterns, pp. 801–811, Springer, 2015.
[5] J. Baber, N. Afzulpurkar, and M. Bakhtyar, "Video segmentation into scenes using entropy and SURF," in Proceedings of the 2011 7th International Conference on Emerging Technologies (ICET '11), pp. 1–6, IEEE, 2011.
[6] T. Kikukawa and S. Kawafuchi, "Transaction of the institute of electronics development of an automatic summary editing system for the audio visual resources," Information and Communication Engineers, vol. 75, no. 2, pp. 398–402, 1992.
[7] A. Nagasaka and Y. Tanaka, Visual database systems II, 1992.
[8] H. Zhang, A. Kankanhalli, and S. W. Smoliar, "Automatic partitioning of full-motion video," Multimedia Systems, vol. 1, no. 1, pp. 10–28, 1993.
[9] I. Koprinska and S. Carrato, "Temporal video segmentation: a survey," Signal Processing: Image Communication, vol. 16, no. 5, pp. 477–500, 2001.
[10] Z. Cernekova, I. Pitas, and C. Nikou, "Information theory-based shot cut/fade detection and video summarization," IEEE Transactions on Circuits and Systems for Video Technology, vol. 16, no. 1, pp. 82–91, 2006.
[11] T. Kikukawa and S. Kawafuchi, "Development of an automatic summary editing system for the audio-visual resources," Transactions on Electronics and Information, J75-A, pp. 204–212, 1992.
[12] A. Nagasaka, "Automatic video indexing and full-video search for object appearances," in Proceedings of the IFIP 2nd Working Conference on Visual Database Systems, 1992.
[13] G. C. Chavez, F. Precioso, M. Cord, S. Philipp-Foliguet, and A. D. A. Araujo, "Shot boundary detection at TRECVID 2006," in Proceedings of the TREC Video Retrieval Eval, vol. 15, 2006.
[14] X. Ling, O. Yuanxin, L. Huan, and X. Zhang, "A Method for Fast Shot Boundary Detection Based on SVM," in Proceedings of the 2008 Congress on Image and Signal Processing, vol. 2, pp. 445–449, IEEE, 2008.
[15] J. Li, Y. Ding, Y. Shi, and W. Li, "A divide-and-rule scheme for shot boundary detection based on SIFT," International Journal of Digital Content Technology and Its Applications, vol. 4, no. 3, pp. 202–214, 2010.
[16] M. Yeung, B.-L. Yeo, and B. Liu, "Segmentation of Video by Clustering and Graph Analysis," Computer Vision and Image Understanding, vol. 71, no. 1, pp. 94–109, 1998.
[17] Z. Rasheed and M. Shah, "Scene detection in Hollywood movies and TV shows," in Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 343–348, IEEE, 2003.
[18] D. Rotman, D. Porat, and G. Ashour, "Robust and efficient video scene detection using optimal sequential grouping," in Proceedings of the 18th IEEE International Symposium on Multimedia, ISM '16, pp. 275–280, IEEE, 2016.
[19] D. Rotman, D. Porat, and G. Ashour, "Robust video scene detection using multimodal fusion of optimally grouped features," in Proceedings of the 19th IEEE International Workshop on Multimedia Signal Processing, MMSP '17, pp. 1–6, 2017.
[20] L. Baraldi, C. Grana, and R. Cucchiara, "Analysis and re-use of videos in educational digital libraries with automatic scene detection," in Proceedings of the Italian Research Conference on Digital Libraries, pp. 155–164, Springer, 2015.
[21] U. Sakarya and Z. Telatar, "Video scene detection using dominant sets," in Proceedings of the 2008 15th IEEE International Conference on Image Processing - ICIP '08, pp. 73–76, IEEE, 2008.
[22] T. Lin, H. Zhang, and Q.-Y. Shi, "Video scene extraction by force competition," in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '01), pp. 753–756, 2001.
[23] X. Chen and F. Lu, "Adaptive rate control algorithm for H.264/AVC considering scene change," Mathematical Problems in Engineering, vol. 2013, Article ID 373689, 6 pages, 2013.
[24] G. Rascioni, S. Spinsante, and E. Gambi, "An optimized dynamic scene change detection algorithm for H.264/AVC encoded video sequences," International Journal of Digital Multimedia Broadcasting, vol. 2010, Article ID 864123, 9 pages, 2010.
[25] Z. Rasheed and M. Shah, "Scene detection in Hollywood movies and TV shows," in Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003.
[26] J. Baber, N. Afzulpurkar, M. N. Dailey, and M. Bakhtyar, "Shot boundary detection from videos using entropy and local descriptor," in Proceedings of the 2011 17th International Conference on Digital Signal Processing (DSP '11), pp. 1–6, IEEE, 2011.
[27] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[28] J. Baber, M. N. Dailey, S. Satoh, N. Afzulpurkar, and M. Bakhtyar, "BIG-OH: binarization of gradient orientation histograms," Image and Vision Computing, vol. 32, no. 11, pp. 940–953, 2014.
[29] H. Jegou, M. Douze, C. Schmid, and P. Perez, "Aggregating local descriptors into a compact image representation," in Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '10), pp. 3304–3311, 2010.
[30] J. Delhumeau, P.-H. Gosselin, H. Jegou, and P. Perez, "Revisiting the VLAD image representation," in Proceedings of the 21st ACM International Conference on Multimedia, pp. 653–656, 2013.
8 Advances in Multimedia
Data Availability
The data used to support the findings of this study areavailable from the corresponding author upon request
Conflicts of Interest
The authors declare that they have no conflicts of interest
Acknowledgments
We are thankful to Shinrsquoichi Satoh from National Institute ofInformatics Japan Nitin Afzulpurkar fromAsian Institute ofTechnologyThailand andChadapornKeatmanee fromThai-Nichi Institute of Technology Thailand for their expertisethat greatly assisted this research
References
[1] S Lefevre and N Vincent ldquoEfficient and robust shot changedetectionrdquo Journal of Real-Time Image Processing vol 2 no 1pp 23ndash34 2007
[2] J Baber S Satoh N Afzulpurkar and C Keatmanee ldquoBag ofvisual wordsmodel for videos segmentation into scenesrdquo inPro-ceedings of the Fifth International Conference on Internet Multi-media Computing and Service pp 191ndash194 ACM 2013
[3] J Baber N Afzulpurkar and S Satoh ldquoA framework for videosegmentation using global and local featuresrdquo InternationalJournal of Pattern Recognition and Artificial Intelligence vol 27no 5 Article ID 1355007 2013
[4] L Baraldi C Grana and R Cucchiara ldquoShot and scene detec-tion via hierarchical clustering for re-using broadcast videordquoin Proceedings of the International Conference on ComputerAnalysis of Images and Patterns pp 801ndash811 Springer 2015
[5] J Baber NAfzulpurkar andM Bakhtyar ldquoVideo segmentationinto scenes using entropy and SURFrdquo in Proceedings of the 20117th International Conference on Emerging Technologies (ICETrsquo11) pp 1ndash6 IEEE 2011
[6] T Kikukawa and S Kawafuchi ldquoTransaction of the instituteof electronics development of an automatic summary editingsystem for the audio visual resourcesrdquo Information and commu-nication Engineers vol 75 no 2 pp 398ndash402 1992
[7] A Nagasaka and Y Tanaka Visual database systems II 1992[8] H Zhang A Kankanhalli and S W Smoliar ldquoAutomatic parti-
tioning of full-motion videordquo Multimedia Systems vol 1 no 1pp 10ndash28 1993
[9] I Koprinska and S Carrato ldquoTemporal video segmentation asurveyrdquo Signal Processing Image Communication vol 16 no 5pp 477ndash500 2001
[10] Z Cernekova I Pitas and C Nikou ldquoInformation theory-based shot cutfade detection and video summarizationrdquo IEEETransactions on Circuits and Systems for Video Technology vol16 no 1 pp 82ndash91 2006
[11] T Kikukawa and S Kawafuchi ldquoDevelopment of an automaticsummary editing system for the audio-visual resourcesrdquo Trans-actions on Electronics and Information J75-A pp 204ndash212 1992
[12] A Nagasaka ldquoAutomatic video indexing and full-video searchfor object appearancesrdquo in Proceedings of the IFIP 2nd WorkingConference on Visual Database Systems 1992
[13] G C Chavez F Precioso M Cord S Philipp-Foliguet andA D A Araujo ldquoShot boundary detection at trecvid 2006rdquo inProceedings of the TREC Video Retrieval Eval vol 15 2006
[14] X Ling O Yuanxin L Huan and X Zhang ldquoAMethod for FastShot Boundary Detection Based on SVMrdquo in Proceedings of the2008 Congress on Image and Signal Processing vol 2 pp 445ndash449 IEEE 2008
[15] J Li Y Ding Y Shi and W Li ldquoA divide-and-rule scheme forshot boundary detection based on SIFTrdquo International Journalof Digital Content Technology and Its Applications vol 4 no 3pp 202ndash214 2010
[16] M Yeung B-L Yeo and B Liu ldquoSegmentation of Video byClustering and Graph Analysisrdquo Computer Vision and ImageUnderstanding vol 71 no 1 pp 94ndash109 1998
[17] Z Rasheed andM Shah ldquoScene detection inHollywoodmoviesand TV showsrdquo in Proceedings of the 2003 IEEE ComputerSociety Conference on Computer Vision and Pattern Recognitionvol 2 pp 343ndash348 IEEE 2003
[18] D Rotman D Porat and G Ashour ldquoRobust and efficientvideo scene detection using optimal sequential groupingrdquo inProceedings of the 18th IEEE International Symposium on Multi-media ISM rsquo16 pp 275ndash280 IEEE 2016
[19] D Rotman D Porat and G Ashour ldquoRobust video scenedetection using multimodal fusion of optimally grouped fea-turesrdquo in Proceedings of the 19th IEEE International Workshopon Multimedia Signal Processing MMSP rsquo17 pp 1ndash6 2017
[20] L Baraldi C Grana and R Cucchiara ldquoAnalysis and re-useof videos in educational digital libraries with automatic scenedetectionrdquo in Proceedings of the Italian Research Conference onDigital Libraries pp 155ndash164 Springer 2015
[21] U. Sakarya and Z. Telatar, “Video scene detection using dominant sets,” in Proceedings of the 2008 15th IEEE International Conference on Image Processing, ICIP ’08, pp. 73–76, IEEE, 2008.
[22] T. Lin, H. Zhang, and Q.-Y. Shi, “Video scene extraction by force competition,” in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME ’01), pp. 753–756, 2001.
[23] X. Chen and F. Lu, “Adaptive rate control algorithm for H.264/AVC considering scene change,” Mathematical Problems in Engineering, vol. 2013, Article ID 373689, 6 pages, 2013.
[24] G. Rascioni, S. Spinsante, and E. Gambi, “An optimized dynamic scene change detection algorithm for H.264/AVC encoded video sequences,” International Journal of Digital Multimedia Broadcasting, vol. 2010, Article ID 864123, 9 pages, 2010.
[25] Z. Rasheed and M. Shah, “Scene detection in Hollywood movies and TV shows,” in Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003.
[26] J. Baber, N. Afzulpurkar, M. N. Dailey, and M. Bakhtyar, “Shot boundary detection from videos using entropy and local descriptor,” in Proceedings of the 2011 17th International Conference on Digital Signal Processing (DSP ’11), pp. 1–6, IEEE, 2011.
[27] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[28] J. Baber, M. N. Dailey, S. Satoh, N. Afzulpurkar, and M. Bakhtyar, “BIG-OH: binarization of gradient orientation histograms,” Image and Vision Computing, vol. 32, no. 11, pp. 940–953, 2014.
[29] H. Jegou, M. Douze, C. Schmid, and P. Perez, “Aggregating local descriptors into a compact image representation,” in Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR ’10), pp. 3304–3311, 2010.
[30] J. Delhumeau, P.-H. Gosselin, H. Jegou, and P. Perez, “Revisiting the VLAD image representation,” in Proceedings of the 21st ACM International Conference on Multimedia, pp. 653–656, 2013.