A Motion based Scene Tree for Browsing and Retrieval of Compressed Videos

Haoran Yi, Deepu Rajan and Liang-Tien Chia
Center for Multimedia and Network Technology
School of Computer Engineering, Nanyang Technological University, Singapore 639798

{pg03763623, asdrajan, asltchia}@ntu.edu.sg

ABSTRACT
This paper describes a fully automatic content-based approach for browsing and retrieval of MPEG-2 compressed video. The first step of the approach is the detection of shot boundaries based on motion vectors available from the compressed video stream. The next step involves the construction of a scene tree from the shots obtained earlier. The scene tree is shown to capture some semantic information as well as to provide a construct for hierarchical browsing of compressed videos. Finally, we build a new model for video similarity based on global as well as local motion associated with each node in the scene tree. To this end, we propose new approaches to camera motion and object motion estimation. The experimental results demonstrate that the integration of the above techniques results in an efficient framework for browsing and searching large video databases.

Categories and Subject Descriptors
I.4.8 [Computing Methodologies]: Image Processing and Computer Vision—Scene Analysis

General Terms
Algorithms

Keywords
Shot boundary detection, video indexing, video browsing, video similarity, video retrieval.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
MMDB'04, November 13, 2004, Washington, DC, USA.
Copyright 2004 ACM 1-58113-975-6/04/0011 ...$5.00.

1. INTRODUCTION
State of the art video compression and communication technologies have enabled a large amount of digital video to become available online. Storage and transmission technologies have advanced to a stage where they can accommodate the demanding volume of video data. Encoding technologies such as MPEG, H.263 and H.264 [5, 4, 6, 7] provide for access to digital videos within the constraints of current communications infrastructure and technology. Even production of digital videos has become available to the masses with the introduction of high-performance, low-cost digital capture and recording devices. As a result, a huge volume of digital video content is available in digital archives on the World Wide Web, in broadcast data streams, and in personal and professional databases. Such a vast amount of content calls for effective and efficient techniques for finding, accessing, filtering and managing video data. While search engines and database management systems suffice for text documents, they simply cannot handle the relatively unstructured, albeit information-rich, video content. Hence, building a content-based video indexing system turns out to be a difficult problem. However, we can identify three tasks that are fundamental to building an efficient video management system: (i) The entire video sequence must be segmented into shots, where a shot is defined as a collection of frames recorded from a single camera operation. This is akin to a tuple, which is the basic structural element for retrieval in a conventional text-based database management system. (ii) Even though a shot determines a physical boundary in a video sequence, it does not convey any meaningful semantics within it. Hence, shots that are related to each other must be grouped together into a scene [8, 9, 19]. (iii) Finally, a robust retrieval method depends on a model that captures similarity in the semantics of video content.

In this paper, we address the above fundamental tasks to provide an integrated approach for managing compressed video content. Since video is mostly available in compressed form, there is a need to develop algorithms that process compressed video directly, without paying the overhead of decoding before processing. Tasks such as restoration, resolution enhancement and tracking in compressed videos have been reported recently [15, 3, 17, 20]. Our objective is to develop a fully automatic technique for content-based organization and management of MPEG compressed videos. To this end, the paper

1. describes a novel shot boundary detection algorithm that is capable of detecting both abrupt and gradual shot changes such as dissolve and fade-in/fade-out;

2. describes a scene tree that acts as an efficient structure to facilitate browsing;

3. presents a video similarity model that enables efficient indexing of video content based on motion and which is shown to be useful in video retrieval.

As noted earlier, these three tasks provide for an integrated approach to browsing and retrieval in large video databases.

The remainder of this paper is organized as follows. In Section 2, we describe a novel shot boundary detection algorithm in the compressed domain. The procedure for building an adaptive scene tree is described in Section 3. In Section 4, the motion-based indexing and retrieval techniques are discussed. The experimental results are presented in Section 5. Finally, we give concluding remarks in Section 6.

2. SHOT BOUNDARY DETECTION ALGORITHM

There is a rich literature of algorithms for detecting shot boundaries in video sequences. They can be broadly classified into methods that use one or more of the following features extracted from the video frames: (i) pixel differences, (ii) statistical differences, (iii) histograms, (iv) compression differences, (v) edge tracking and (vi) motion vectors [10]. In the initial phase of research in shot boundary detection (SBD), the data mostly consisted of uncompressed video [25, 24, 18]. However, with the emergence of the MPEG compression standard, it has become prudent to develop algorithms that parse MPEG video directly. Several methods that detect shot boundaries directly from compressed video have also been reported [2, 1, 21, 11]. Most of these methods use either the DC coefficients or the bit-rate of the different MPEG picture types (I, P, or B) in their algorithms; the methods using DC coefficients involve significant decoding of the MPEG-compressed video, while the methods using bit-rate do not yield satisfactory detection. Moreover, these methods are incapable of handling both hard cuts and gradual transitions (i.e. fade, dissolve, etc.) simultaneously. Lastly, most of them involve tight thresholds that are largely dependent on the video sequence.

We have developed a novel shot boundary detection algorithm for MPEG compressed videos [23]. The proposed algorithm is able to detect both abrupt and gradual shot changes, i.e. dissolve and fade in/out. The number of macroblock (MB) types in each of the I-, P- and B-frames is used to derive the SBD algorithm. For abrupt shot boundary detection, we calculate the Frame Dissimilarity Ratio (FDR) as shown in equation (1):

FDR_n =
\begin{cases}
\dfrac{Fw_{n-1}}{Bi_{n-1}} & \text{for Reference frame} \\[6pt]
\max\!\left(\dfrac{Fw_n}{Bi_n}, \dfrac{Bk_n}{Bi_n}\right) & \text{for B frame}
\end{cases}
\qquad (1)

where the Reference frame refers to either an I-frame or a P-frame, the index n denotes the frame number, and Fw, Bi, Bk denote the forward predicted, bi-directionally predicted and backward predicted MB types. The calculated FDR values are further filtered with the Dominant MB Change (DMBC) to get the Modified Frame Dissimilarity Ratio (MFDR). A sliding window technique is applied on the MFDR to detect abrupt shot boundaries. The local sum of DMBC is used to detect gradual shot changes. Due to lack of space, we do not show the details of the SBD algorithm; interested readers may refer to [23] for details.
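To make equation (1) concrete, the following is a minimal sketch of the FDR computation, assuming the macroblock-type counts have already been parsed from the MPEG-2 stream. The Frame class, its fields and detect_abrupt_cuts are illustrative names, and the simple sliding-window peak test stands in for the MFDR/DMBC filtering detailed in [23].

```python
from dataclasses import dataclass

@dataclass
class Frame:
    kind: str   # 'I', 'P' or 'B'
    fw: int     # count of forward-predicted macroblocks
    bk: int     # count of backward-predicted macroblocks
    bi: int     # count of bi-directionally predicted macroblocks

def fdr(frames, n):
    """Frame Dissimilarity Ratio for frame n, as in equation (1)."""
    eps = 1e-6                              # guard against empty Bi counts
    cur, prev = frames[n], frames[n - 1]
    if cur.kind in ('I', 'P'):              # reference frame
        return prev.fw / (prev.bi + eps)
    return max(cur.fw / (cur.bi + eps),     # B frame
               cur.bk / (cur.bi + eps))

def detect_abrupt_cuts(frames, window=5, ratio=3.0):
    """Toy sliding-window peak test on the raw FDR sequence (stand-in for MFDR)."""
    scores = [fdr(frames, n) for n in range(1, len(frames))]
    cuts = []
    for n in range(window, len(scores) - window):
        neighbours = scores[n - window:n] + scores[n + 1:n + 1 + window]
        if scores[n] > ratio * (sum(neighbours) / len(neighbours) + 1e-6):
            cuts.append(n + 1)              # +1: scores[0] corresponds to frame 1
    return cuts
```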

3. SCENE TREE BUILDING ALGORITHM
The objective of content-based indexing of videos is to facilitate easy browsing and retrieval. Ideally, a non-linear browsing capability is desirable, as opposed to standard techniques like fast forward or fast reverse. This can be achieved, preferably, using a structure that represents video information as a hierarchy of various semantic levels. Such a multi-layer abstraction not only makes it more convenient to reference video information but also simplifies video indexing and storage organization. Several multilevel structures have been proposed in the literature [16, 9, 14]. However, all of them use a fixed structure, i.e. shots and scenes, to describe the video content. The underlying theme is to group together frames that are 'similar' to each other, where similarity could be defined in such primitive terms as temporal adjacency [12] or in terms of video content. The latter results in the entity called a scene, which should convey semantic information in the video being viewed. Hence, our objective is to build a browsing hierarchy whose shape and size are determined only by the semantic complexity of the video. We call such a hierarchical structure an adaptive scene tree. The scene tree algorithm described in this section is motivated by the work of Oh and Hua [13], who build a scene tree for uncompressed videos. Our algorithm is an extension of and an improvement over [13] that works for compressed videos.

The main idea of the scene tree building algorithm is to sequentially compare the similarity of the key frame of the current shot to the key frames of the previous w shots. Based on a measure of similarity, the current shot is either appended to the parent node of the previous shot or appended to a newly created node. The result of the scene tree building algorithm is a browsing tree whose structure is adaptive, i.e. the number of levels of the tree is larger for complex content and smaller for simple content. The details of the algorithm, which takes a sequence of video shots as input and outputs the scene tree, are shown below; a short sketch in code follows the list.

1. Initialization: Create a scene node SN^0_i at the lowest level (i.e., level 0) of the scene tree for each shot_i. The subscript indicates the shot (or scene) from which the scene node is derived and the superscript denotes the level of the scene node in the scene tree.

2. Initialize i ← 3.

3. Check if shot_i is similar to shots shot_{i-1}, . . . , shot_{i-w} (in descending order) using a function isSimilar(), which will be described later. The comparisons stop when a similar shot, say shot_j, is found. If no related shot is found, a new empty node is created and connected to SN^0_i as its parent node; then proceed to step 5.

4. For scene nodes SN^0_{i-1} and SN^0_j:

   (a) If SN^0_{i-1} and SN^0_j do not currently have a parent node, we connect all scene nodes SN^0_i through SN^0_j to a new empty node as their parent node.

   (b) If SN^0_{i-1} and SN^0_j share an ancestor node, we connect SN^0_i to this ancestor node.

   (c) If SN^0_{i-1} and SN^0_j do not currently share an ancestor node, we connect SN^0_i to the current oldest ancestor of SN^0_{i-1}, and then create a new empty node and connect the oldest ancestors of all nodes from SN^0_j to SN^0_{i-1} to it as its children.

Figure 1: Background and foreground areas for computing the function isSimilar().

5. If there are more shots, we set i ← i + 1 and go to step 3. Otherwise, connect all the nodes currently without a parent to a new empty node as their parent.

6. For each scene node at the bottom of the scene tree (it represents a shot), we select the key frame as its representative frame by choosing the I frame in the shot whose DC value is closest to the average of the DC values of all the I frames in the shot. Since we do not wish to decode all the frames in the video sequence (recall that we would like to process in the compressed domain), we choose the DC value of the I frame to make the algorithm more efficient. We then traverse all the nodes in the scene tree from the bottom to the top. For each empty node visited, we identify the child node which contains the largest number of frames and assign its representative frame as the representative frame for this node.
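The following is a condensed sketch of steps 1-6 above, under stated assumptions: the tree is held as parent pointers, "oldest ancestor" is read as the current root of a node, the key-frame test of equation (2) is passed in as is_similar, and indexing is 0-based. The class and function names, and the handling of corner cases, are an illustrative reading rather than the authors' implementation.

```python
class SceneNode:
    def __init__(self, shot=None):
        self.shot, self.parent, self.children = shot, None, []

    def attach_to(self, parent):
        self.parent = parent
        parent.children.append(self)

    def root(self):
        node = self
        while node.parent is not None:   # "oldest ancestor" of a node
            node = node.parent
        return node

def build_scene_tree(shots, w, is_similar):
    leaves = [SceneNode(s) for s in shots]                  # step 1
    for i in range(1, len(leaves)):                         # steps 2-5 (0-based)
        j = next((k for k in range(i - 1, max(i - 1 - w, -1), -1)
                  if is_similar(shots[i], shots[k])), None)  # step 3: window of w shots
        if j is None:                                       # no related shot found
            leaves[i].attach_to(SceneNode())
            continue
        if leaves[i - 1].parent is None and leaves[j].parent is None:   # case 4(a)
            parent = SceneNode()
            for k in range(j, i + 1):
                if leaves[k].parent is None:                # attach only parentless nodes
                    leaves[k].attach_to(parent)
        elif leaves[i - 1].root() is leaves[j].root():                  # case 4(b)
            leaves[i].attach_to(leaves[i - 1].root())
        else:                                                           # case 4(c)
            leaves[i].attach_to(leaves[i - 1].root())
            grand = SceneNode()
            for r in {leaves[k].root() for k in range(j, i)}:
                r.attach_to(grand)
    top = SceneNode()                                       # step 5: final common root
    for r in {leaf.root() for leaf in leaves}:
        r.attach_to(top)
    return top
```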

We now return to the function isSimilar() used in step 3 to compute the similarity between key frames. The first step is to divide each frame into a fixed background area and a fixed object area, as shown in Figure 1. The fixed background area is defined as the left, right and top margins of the frame, whose width is chosen as 10% of the frame width. The remaining area of the frame is called the fixed object area. Such a partitioning of the frame was proposed in [13], but the authors used it to develop an SBD algorithm. Operating in the HSV color space, we compute the color histograms of the object and background areas of two key frames k1 and k2. H and S are quantized into 8 bins and 4 bins, respectively, while we ignore V since it corresponds to luminance. The similarity measure is defined as

Sim = w_1 × hist_b(k_1, k_2) + w_2 × hist_o(k_1, k_2) \qquad (2)

where hist_b(·) and hist_o(·) are the Euclidean distances between the histograms of the background and object areas, respectively, of frames k_1 and k_2, and w_1, w_2 are weightings for the background and object areas, respectively. We choose w_1 = 0.7 and w_2 = 0.3 so as to give more weight to the background, since a change in the background area is a stronger indication of a change in the scene. The value of Sim in equation (2) is compared with a threshold to determine if two key frames are similar.
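A minimal sketch of isSimilar() follows, assuming the two key frames are available as decoded H x W x 3 uint8 BGR arrays (e.g. via OpenCV) and using the margin width, bin counts and weights quoted above; the similarity threshold and all function names are illustrative assumptions, not values from the paper.

```python
import numpy as np
import cv2

W1, W2, SIM_THRESHOLD = 0.7, 0.3, 0.25   # threshold value is an assumption

def hs_histogram(region_bgr):
    """8x4 H/S histogram (V ignored), L1-normalised."""
    hsv = cv2.cvtColor(region_bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, [8, 4], [0, 180, 0, 256])
    return hist.flatten() / (hist.sum() + 1e-9)

def split_background_object(frame, margin_ratio=0.1):
    """Fixed background = left, right and top margins (10% of frame width)."""
    h, w = frame.shape[:2]
    m = int(round(margin_ratio * w))
    mask = np.zeros((h, w), dtype=bool)
    mask[:m, :] = True                           # top margin
    mask[:, :m] = True                           # left margin
    mask[:, w - m:] = True                       # right margin
    background = frame[mask].reshape(-1, 1, 3)
    foreground = frame[~mask].reshape(-1, 1, 3)
    return background, foreground

def is_similar(key1, key2):
    b1, o1 = split_background_object(key1)
    b2, o2 = split_background_object(key2)
    d_bg = np.linalg.norm(hs_histogram(b1) - hs_histogram(b2))
    d_obj = np.linalg.norm(hs_histogram(o1) - hs_histogram(o2))
    sim = W1 * d_bg + W2 * d_obj                 # equation (2): smaller = more similar
    return sim < SIM_THRESHOLD
```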

Although the above adaptive scene tree building algorithm is motivated by [13], there are some important differences, which we highlight now. As mentioned earlier, our algorithm works directly on MPEG-2 compressed video in the sense that we need only partially decode the video to extract the motion vectors and DC images. Moreover, full decoding is required only for the one frame per shot that serves as the key frame. Secondly, the similarity between shots is ascertained over a meaningful predefined window w, as opposed to all the shots from the beginning of the video. We now compare the computational complexity of our algorithm to [13]. The shot similarity determination can be done in O(w·s), where w is the length of the window and s is the number of shots. Generally, the number of frames in a shot, f, is much larger than s, and s is larger than w (f >> s > w). The scene tree construction algorithm involves traversal in steps 4 and 6. Therefore, the worst-case computational complexity of building the tree is O(s·log(s)). Thus, the total computational complexity of our algorithm is O(s·log(s)), while that of [13] is O(s·f^2).

Video retrieval and browsing can be envisioned as an integrated and interactive process. This process is analogous to browsing through a book with a Table of Contents at the front and an Index at the back. In order to search for a specific topic of interest, one may consult the index to locate the pages where it appears. Then, to learn more about the topic, one may refer to the Table of Contents to browse through chapters, sections and so on. Similarly, for video data, the feature vectors serve as the Index and the scene tree serves as the Table of Contents. The search for relevant content in the video is initiated by locating a video shot through feature matching/retrieval and then carried forward by browsing up and down the scene tree (the table of contents) to further explore the relevant content. Hence, the feature vectors and the scene tree are closely related tools for the management of video content.

4. MOTION BASED INDEXING AND RETRIEVAL

In this section, we develop a model to compute the similarity between two video sequences. The model is based on the motion content within the video sequences. The motion content itself is represented by three components, viz., the camera motion (CM), the object motion (OM), and the total motion (TM). Each of CM, OM and TM is a matrix, and it is these matrices that serve as indices for video retrieval.

4.1 Motion Extraction
Recall that all the videos considered in this paper are compressed according to the MPEG-2 format. Hence, the MV for frame n is obtained as

MV_n =
\begin{cases}
-Bk_{n-1} & \text{for I frame} \\
Fw_n - Fw_{n-1} & \text{for P frame} \\
(Fw_n - Fw_{n-1} + Bk_n - Bk_{n-1})/2 & \text{for B frame}
\end{cases}
\qquad (3)

Refer to Section 2 for the meaning of the notation. The MV fields (MVFs) obtained are then smoothened using a spatial (3 × 3 × 1) median filter followed by a temporal (1 × 1 × 3) median filter.

The smoothened MVFs are then used to calculate the camera motion and the object motion. The total motion component at each MB is just the smoothened motion vector at its center.
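A brief sketch of equation (3) and the median smoothing, assuming the forward (Fw) and backward (Bk) prediction fields have already been parsed from the MPEG-2 stream into arrays of shape (num_frames, H_mb, W_mb, 2); the function names and the use of scipy's median filter are assumptions made for illustration.

```python
from scipy.ndimage import median_filter

def frame_motion_field(kind, fw, bk, n):
    """Equation (3): motion vector field for frame n of type 'I', 'P' or 'B'."""
    if kind == 'I':
        return -bk[n - 1]
    if kind == 'P':
        return fw[n] - fw[n - 1]
    return (fw[n] - fw[n - 1] + bk[n] - bk[n - 1]) / 2.0    # B frame

def smooth_motion_fields(mvfs):
    """Spatial (3x3) then temporal (length-3) median filtering, per component."""
    spatial = median_filter(mvfs, size=(1, 3, 3, 1))         # within each frame
    return median_filter(spatial, size=(3, 1, 1, 1))         # across adjacent frames
```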

We use a six-parameter affine model to estimate the camera motion, which is modelled as

mv_x(i, j) = p_1 \cdot i + p_2 \cdot j + p_3
mv_y(i, j) = p_4 \cdot i + p_5 \cdot j + p_6 \qquad (4)

where mv_x(i, j) and mv_y(i, j) are the x and y components of the motion vector for a MB centered at (i, j), and the p's are the affine parameters. Consider a group of MBs G whose affine parameters P = \{p_1, \cdots, p_6\} need to be determined. We will explain the significance of G shortly, when we present the iterative algorithm for camera motion estimation. Using the method of least squares, from equation (4), P is obtained by minimizing

\sum_{G} (mv_x(i, j) - p_1 \cdot i - p_2 \cdot j - p_3)^2 + (mv_y(i, j) - p_4 \cdot i - p_5 \cdot j - p_6)^2 \qquad (5)

The two terms inside the summation can be minimized independently. Considering the first term only, differentiating it with respect to p_1, p_2 and p_3, and setting the resulting equations to zero, we get

\sum_{(i,j) \in G} (mv_x(i, j) - p_1 \cdot i - p_2 \cdot j - p_3)\, i = 0 \qquad (6)

\sum_{(i,j) \in G} (mv_x(i, j) - p_1 \cdot i - p_2 \cdot j - p_3)\, j = 0 \qquad (7)

\sum_{(i,j) \in G} (mv_x(i, j) - p_1 \cdot i - p_2 \cdot j - p_3) = 0 \qquad (8)

If the origin of the frame is taken to be at its center (instead of the top left corner as is conventionally done), i' = i - \frac{V+1}{2} and j' = j - \frac{U+1}{2}, where U \times V is the size of the motion vector field. Then \sum_G i' = 0 and \sum_G j' = 0, and the affine parameters can be easily shown to be

p_1 = \frac{IX \cdot YY - JX \cdot XY}{A}, \quad p_2 = \frac{JX \cdot XX - IX \cdot XY}{A}, \quad p_3 = \frac{\sum_G mv_x(i, j)}{g},

p_4 = \frac{IY \cdot YY - JY \cdot XY}{A}, \quad p_5 = \frac{JY \cdot XX - IY \cdot XY}{A}, \quad p_6 = \frac{\sum_G mv_y(i, j)}{g} \qquad (9)

where

IX = \sum_G i' \cdot mv_x(i', j'), \quad JX = \sum_G j' \cdot mv_x(i', j'),
IY = \sum_G i' \cdot mv_y(i', j'), \quad JY = \sum_G j' \cdot mv_y(i', j'),
XX = \sum_G i'^2, \quad YY = \sum_G j'^2, \quad XY = \sum_G i' \cdot j',
A = XY \cdot XY - XX \cdot YY,

and g is the number of macroblocks in the group G. Note that mv(i', j') = mv(i, j), since they refer to the same MB.

The algorithm to estimate the affine parameters starts by labelling all the MBs in a frame as 'inliers'. The parameters are then estimated for all the inliers using equation (9), and a new set of motion vectors is reconstructed for each inlier using the estimated parameters. If the magnitude of the residual motion vector Rmv, calculated as the difference between the original motion vector and the reconstructed one, is greater than an adaptive threshold T = max(median(Rmv), β) for a particular MB, then that MB and the one situated diagonally opposite it are marked as 'outliers'. The role of β is to prevent the rejection of a large number of MBs when the median of the residuals is very small; we choose β to be 1. Note that the diagonally opposite MB is also marked as an 'outlier' due to the shifting of the co-ordinate axes to the center of the frame. After each iteration, some MBs are marked as 'outliers'; these MBs correspond to areas in which the motion is associated with moving objects. The steps of the camera motion estimation are shown in Algorithm 1.

Algorithm 1 Camera Motion Estimation

1: Mark all the MBs as 'inliers'.
2: Estimate the affine motion parameters for the 'inliers' (equation (9)).
3: Reconstruct the global motion vector at each macroblock with the estimated affine parameters.
4: Calculate the residual motion vector (Rmv) as the difference between the original and reconstructed motion vectors.
5: If Rmv is greater than max(median(Rmv), β), then mark this MB and its diagonally opposite MB as 'outliers'. median(Rmv) is the median of all the Rmvs.
6: Go to step 2 until there are no more new 'outliers' or more than two thirds of the MBs are marked as 'outliers'.
7: If more than two thirds of the MBs are 'outliers', the affine parameters are set to zero.
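A sketch of Algorithm 1, built on the fit_affine and predict_affine helpers above; β = 1 follows the text, while the array bookkeeping (how the diagonally opposite MB is located and how the two-thirds test is counted) is an illustrative reading, not the authors' implementation.

```python
import numpy as np

BETA = 1.0

def estimate_camera_motion(mvx, mvy):
    inliers = np.ones(mvx.shape, dtype=bool)              # step 1: all MBs are inliers
    total = inliers.size
    while True:
        params = fit_affine(mvx, mvy, inliers)            # step 2
        gx, gy = predict_affine(params, mvx.shape)         # step 3
        residual = np.hypot(mvx - gx, mvy - gy)            # step 4: |Rmv|
        threshold = max(np.median(residual[inliers]), BETA)
        new_outliers = inliers & (residual > threshold)    # step 5
        # also mark the MB diagonally opposite (about the frame center)
        new_outliers = (new_outliers | new_outliers[::-1, ::-1]) & inliers
        if not new_outliers.any():                         # step 6: converged
            break
        inliers &= ~new_outliers
        if (total - inliers.sum()) > 2 * total / 3:        # steps 6-7: too many outliers
            return np.zeros(6)
    return params
```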

Since the computed camera motion vector should not be greater than the total motion vector at a MB,

|cm(i, j)| = \min(|mv(i, j)|, |cm(i, j)|) \qquad (10)

where cm(i, j) and mv(i, j) are the camera motion vector and the total motion vector, respectively, at the MB centered at (i, j). For the same reason,

|om(i, j)| = \max(0, |mv(i, j) - cm(i, j)|) \qquad (11)

where om(i, j) is the object motion vector at the MB centered at (i, j). The CM, OM and TM matrices are now formed for a shot by accumulating |cm(i, j)|, |om(i, j)| and |mv(i, j)|, respectively, over all the frames in the shot, i.e., CM(i, j) = \sum_{l=1}^{n} |cm_l(i, j)|, OM(i, j) = \sum_{l=1}^{n} |om_l(i, j)|, and TM(i, j) = \sum_{l=1}^{n} |mv_l(i, j)|, where n is the number of frames in the shot. These matrices serve as indices for retrieval, as explained in the next sub-section.

4.2 Video Similarity Measure
Each shot is characterized by the three matrices CM, OM and TM. In order to compute the similarity between two shots, we assume that the matrices belong to the Hilbert sequence space l^2 with the metric defined by

d(M_x, M_y) = \left[ \sum_{u=1}^{U} \sum_{v=1}^{V} |M_x(u, v) - M_y(u, v)|^2 \right]^{\frac{1}{2}} \qquad (12)

where M_x and M_y are the motion matrices of order U \times V. However, instead of comparing matrices, we reduce the dimension of the feature space by projecting the matrix elements along the rows and the columns to form one-dimensional feature vectors M^r_x(v) = \sum_{u=1}^{U} M_x(u, v) and M^c_x(u) = \sum_{v=1}^{V} M_x(u, v). The metric in equation (12) can then be rewritten as

d(M_x, M_y) = \left[ \sum_{v=1}^{V} \frac{1}{U} |M^r_x(v) - M^r_y(v)|^2 + \sum_{u=1}^{U} \frac{1}{V} |M^c_x(u) - M^c_y(u)|^2 \right]^{\frac{1}{2}} \qquad (13)

This leads us to the distance function between two shots s_x and s_y, defined as

\hat{d}(s_x, s_y) = \omega_C \cdot d(CM_x, CM_y) + \omega_O \cdot d(OM_x, OM_y) + \omega_T \cdot d(TM_x, TM_y) \qquad (14)

where the d(\cdot)'s are computed according to equation (13) and the \omega's are the weights for each motion component. The weights are assigned in such a way that there is an equal contribution from each of the components to the distance function. Metric-based indexing techniques such as the M-tree and R-tree can be applied to index the feature vectors.
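A small sketch of equations (13) and (14): each motion matrix is reduced to its row and column projections, and the weighted distance is formed from the three components. Uniform weights are used here only as a placeholder; the paper states that the weights are chosen so that each component contributes equally, without giving the exact values.

```python
import numpy as np

def projection_distance(mx, my):
    """Equation (13): distance between two U x V motion matrices via projections."""
    U, V = mx.shape
    row_x, row_y = mx.sum(axis=0), my.sum(axis=0)   # M^r(v), length V
    col_x, col_y = mx.sum(axis=1), my.sum(axis=1)   # M^c(u), length U
    return np.sqrt(np.sum((row_x - row_y) ** 2) / U +
                   np.sum((col_x - col_y) ** 2) / V)

def shot_distance(shot_x, shot_y, weights=(1.0, 1.0, 1.0)):
    """Equation (14): shot_x and shot_y are (CM, OM, TM) triples."""
    w_c, w_o, w_t = weights
    return (w_c * projection_distance(shot_x[0], shot_y[0]) +
            w_o * projection_distance(shot_x[1], shot_y[1]) +
            w_t * projection_distance(shot_x[2], shot_y[2]))
```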

5. EXPERIMENTAL RESULTS
Our experiments were designed to assess the performance of the proposed techniques for SBD, scene tree construction, motion estimation and motion-based indexing and retrieval of shots. Due to space limitations, the experimental results of the proposed SBD algorithm are not shown in this paper; results of the SBD on a wide variety of video sequences such as soccer, news, movies and cartoons can be found in [23]. In the following subsections, we present the experimental results for scene tree construction, motion estimation and motion-based indexing and retrieval of shots.

5.1 Adaptive Scene Tree Construction
To evaluate the scene tree building algorithm, we ran the algorithm described in Section 3 on various videos. Since we cannot quantify the effectiveness of the algorithm in producing the scene trees, we show two examples of scene trees in Figure 2 and describe the main events occurring in the video sequences from which the trees are constructed. The scene tree in Figure 2(a) is built from a three-minute video clip available from the MPEG-7 test video set (CD 21 Misc2.mpg). In this sequence, two women are watching TV and talking to each other (the first 8 shots, leaf nodes). Then one of the women leaves while the other continues watching TV (next 4 shots). After a while, this woman also leaves (around frame 3300). Then both women appear together; one of them puts on a coat and leaves while the other starts cleaning the table (the last 4 shots, leaf nodes). The scene tree in Figure 2(b) is built from a two-minute video clip from the TV program "Opening The New Era". This video sequence consists of three scenes. A man and a woman are taking a picture together. Then the man drives the woman to her home. The woman goes inside her home and reads the newspaper. If we traverse the scene trees from top to bottom, which in essence is a non-linear browsing of the video, we recover these stories. The scene trees give a hierarchical view of the scenes as well as provide a summary of the video sequence.

5.2 Motion Estimation
In this subsection, we evaluate the proposed algorithm for motion estimation. We consider video sequences that are compressed using the MPEG-2 standard. Figure 3 shows four examples of such videos. Each row consists of the key frame of a particular shot, the camera motion, the object motion and the total motion. The motion information contained in the CM, OM and TM matrices is represented as monochrome images for visualization purposes; brighter pixels indicate higher motion activity. The first row in Figure 3 shows a shot of a news anchor person in which the camera is stationary. This is reflected in the black CM image, while the motion of the object (the face of the person) is clear in the OM image. Thus the total motion consists essentially of the object motion only. Similarly, the second row shows a scene with a moving camera while the objects in the scene are stationary; the OM image is black (indicating no motion) while the CM image is uniformly bright. The third row illustrates the case of both object motion and camera motion contributing to the total motion, in a shot from a soccer game. We can clearly see the zooming in of the camera in the CM image, while object motion is indicated by some bright pixels in the OM image. Finally, we show a shot of the camera tracking a person and its corresponding motion images, which are all mostly bright, implying that the motion contains both camera and object motion. We see that the proposed motion estimation algorithm has been able to extract the motion information quite well.

5.3 Motion Based Retrieval
We noted earlier that the video retrieval process is implicitly contained within the process of browsing the adaptive scene tree. However, the retrieval method proposed in this paper can also be considered as a 'standalone' process. To evaluate the performance of our proposed method for motion indexing and retrieval, we build a database of video shots by applying the SBD algorithm to the 643 video sequences of the MPEG-7 test set. The lengths of the video clips range from 5 seconds to 30 seconds.

Using the CM, OM and TM matrices as indices and the video similarity measure developed in Section 4, we retrieve the top N video shots from the database. In Figure 4, we show the top 3 results for the query shown in the first column. We compare the retrieval results when only TM is used with the case when all three motion matrices, viz., CM, OM and TM, are used for retrieval. Figure 4(a) shows the retrieval results with the motion feature extracted from the TM matrix only, while Figure 4(b) shows the retrieval results when the CM, OM and TM matrices are considered. The first columns of Figure 4(a) and Figure 4(b) are the key frames of the query video shots. The second, third and fourth columns are the first, second and third retrieval results. From these examples, we observe that both models retrieve video clips which have similar motion content. However, the video similarity model with the CM, OM and TM matrices gives better retrieval results, in both the motion and the semantic sense, than the model that uses only the TM matrix.

We randomly chose 50 video shots from the video database as queries and retrieved the top 30 videos for each. Using precision as the performance measure, where precision is defined as the ratio of the number of relevant shots retrieved to the total number of shots retrieved, we plot precision as a function of the number of retrieved video clips in Figure 5.

Figure 2: (a) Scene tree of MPEG-7 test video, (b) scene tree of TV program “Opening The New Era”

Figure 3: Motion estimation. Each row consists of a key frame of the shot, the CM image, the OM image and the TM image for a news anchor person (first row), gallery (second row), soccer (third row) and moving person (fourth row).

Figure 5: Precision as a function of the number of video clips retrieved. The solid line corresponds to retrievals using TM, OM and CM, and the dashed line corresponds to retrievals using TM only.

The average precision (computed as the average of the precision with the number of top returned video clips varying from 1 to 30) is 91% when CM, OM and TM are considered, compared to 85% when only TM is used as an index.
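For reference, a minimal sketch of the precision measure as defined above, assuming per-query relevance judgements are available as sets of shot ids; the helper names are illustrative.

```python
def precision_at_n(ranked_ids, relevant_ids, n):
    """Relevant shots among the top n retrieved, divided by n."""
    top = ranked_ids[:n]
    return sum(1 for shot_id in top if shot_id in relevant_ids) / float(len(top))

def average_precision_over_n(ranked_ids, relevant_ids, max_n=30):
    """Average of precision@1 .. precision@max_n, as used for the 91%/85% figures."""
    return sum(precision_at_n(ranked_ids, relevant_ids, n)
               for n in range(1, max_n + 1)) / max_n
```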

6. CONCLUSION
In this paper, we have presented a fully automatic content-based approach to organizing and indexing compressed video data. The three main steps in our approach are (i) shot boundary detection using the motion prediction information of MPEG-2, (ii) development of a browsing hierarchy called the adaptive scene tree, and (iii) video indexing based on camera motion, object motion and total motion. The SBD algorithm can detect both abrupt cuts and gradual transitions. Unlike existing schemes for building browsing hierarchies, our technique builds a scene tree automatically from the visual content of the video. The size and shape of the tree reflect the semantic complexity of the video sequence. A video similarity measure based on the CM, OM and TM matrices is shown to give good retrieval results. The estimation of these matrices using the affine parameter model is also shown to perform well.

The hunt for highly discriminating features is a continuous one. From our study, we find that the CM, OM and TM matrices have excellent discriminatory characteristics. However, during the computation of video similarity, the conversion of the matrices to one-dimensional vectors might result in a loss of valuable information. A more elegant and robust use of these matrices is one of the future directions for this research. The proposed scene tree can be integrated into the MPEG-7 standard description scheme to generate MPEG-7 compliant XML metadata. In the MDS (Multimedia Description Schemes) of MPEG-7, there is a descriptor for describing a temporal segment of the video; we can use this segment descriptor to represent the scene tree, since each node in the scene tree is a video segment. To check the effectiveness of the scene tree building algorithm, it will be useful to conduct user studies in which the summarized video is presented to several users who are asked to narrate the story as they understand it. However, it remains to be seen if a meaningful scene tree can be built for sports videos.

Figure 4: Motion based video retrieval. (a) Five example query results using TM matrix. (b) The same fivequery results using CM, OM and TM matrices.


7. REFERENCES
[1] B. L. Yeo and B. Liu. Rapid scene analysis on compressed video. IEEE Transactions on Circuits and Systems for Video Technology, 5:533–544, Dec. 1995.
[2] J. Feng, K.-T. Lo, and M. H. Scene change detection algorithm for MPEG video sequence. In IEEE International Conference on Image Processing, volume 1, pages 821–824, 1996.
[3] B. Gunturk, Y. Altunbasak, and R. Mersereau. Super-resolution reconstruction of compressed video using transform-domain statistics. IEEE Transactions on Image Processing, 13(1):33–43, Jan. 2004.
[4] International Organization for Standardization. MPEG-21 Overview (Coding of Moving Pictures and Audio), ISO/IEC JTC1/SC29/WG11/N4318 edition, July 2001.
[5] International Organization for Standardization. Overview of the MPEG-7 Standard, ISO/IEC JTC1/SC29/WG11 N4031 edition, Mar. 2001.
[6] ITU-T. Video Coding for Low Bit Rate Communication, ITU-T Recommendation H.263 edition, Feb. 1998.
[7] ITU-T. Joint Final Committee Draft (JFCD) of Joint Video Specification (ITU-T Rec. H.264 | ISO/IEC 14496-10 AVC), H.264 edition, July 2002.
[8] E. Katz. The Film Encyclopedia, 2nd ed. Harper Collins, New York, 1994.
[9] J. R. Kender and B.-L. Yeo. Video scene segmentation via continuous video coherence. In IEEE International Conference on Computer Vision and Pattern Recognition, Santa Barbara, CA, June 1998.
[10] R. Lienhart. Comparison of automatic shot boundary detection algorithms. In Proceedings of the SPIE Conference on Storage and Retrieval for Image and Video Databases VII, volume 3656, pages 290–301, Jan. 1999.
[11] J. Meng, Y. Juan, and S.-F. Chang. Scene change detection in a MPEG video sequence. In Proceedings of the SPIE Conference on Multimedia Computing and Networking, volume 2417, pages 180–191, San Jose, CA, Feb. 1995.
[12] J. Nam and A. H. Tewfik. Combined audio and visual streams analysis for video sequence segmentation. In Proceedings of ICASSP-97, volume 4, pages 2665–2668, Munich, Germany, Apr. 1997.
[13] J. Oh and K. A. Hua. An efficient and cost-effective technique for browsing and indexing large video databases. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 415–426, Dallas, TX, May 2000.
[14] Z. Rasheed and M. Shah. Scene boundary detection in Hollywood movies and TV shows. In IEEE International Conference on Computer Vision and Pattern Recognition, Madison, WI, June 2003.
[15] M. A. Robertson and R. L. Stevenson. Restoration of compressed video using temporal information. In SPIE Conference on Visual Communications and Image Processing, volume 4310, pages 21–29, San Jose, CA, 2001.
[16] C. Saraceno and R. Leonardi. Identification of story units in audio-visual sequences by joint audio and video processing. In Proceedings of the International Conference on Image Processing, pages 358–362, Chicago, IL, USA, 1998.
[17] D. Schonfeld and D. Lelescu. VORTEX: Video retrieval and tracking from compressed multimedia databases: multiple object tracking from MPEG-2 bitstream. Journal of Visual Communications and Image Representation, Special Issue on Multimedia Database Management, 11:154–182, 2000.
[18] I. K. Sethi and N. V. Patel. A statistical approach to scene change detection. In Proceedings of the SPIE Conference on Storage and Retrieval for Image and Video Databases, volume 2420, pages 329–338, San Jose, CA, Feb. 1995.
[19] B. T. Truong, S. Venkatesh, and C. Dorai. Scene extraction in motion pictures. IEEE Transactions on Circuits and Systems for Video Technology, 13(1):5–15, Jan. 2003.
[20] H. Wang, A. Divakaran, A. Vetro, S.-F. Chang, and H. Sun. Survey of compressed-domain features used in audio-visual indexing and analysis. Journal of Visual Communication and Image Representation, 14(2):150–183, June 2003.
[21] B. L. Yeo and B. Liu. On the extraction of DC sequence from MPEG compressed video. In IEEE International Conference on Image Processing, Washington, DC, Oct. 1995.
[22] B. L. Yeo and B. Liu. Rapid scene analysis on compressed video. IEEE Transactions on Circuits and Systems for Video Technology, 5:533–544, Dec. 1995.
[23] H. Yi, D. Rajan, and L.-T. Chia. A unified approach to detection of shot boundaries and subshots in compressed video. In IEEE International Conference on Image Processing, Barcelona, Spain, Sept. 2003.
[24] R. Zabih, J. Miller, and K. Mai. A feature-based algorithm for detecting and classifying scene breaks. In Proceedings of ACM Multimedia 95, volume 1, pages 10–28, San Francisco, CA, Nov. 1995.
[25] H. Zhang, A. Kankanhalli, and S. Smoliar. Automatic partitioning of full-motion video. Multimedia Systems, 1:10–28, 1993.
