COMPRESSED DOMAIN H.264/AVC SHOT DETECTION

Hugo Santos Varandas

Dissertation submitted to obtain the degree of Master in Electrical and Computer Engineering

Jury
President: Prof. António Topa
Supervisor: Prof. Fernando Pereira
Member: Prof. Paulo Correia

October 2008
1.1 CONTEXT AND MOTIVATION ............................................................................................................................ 1
1.2 VIDEO SHOT TRANSITIONS ............................................................................................................................... 2
1.3 OBJECTIVE OF THIS THESIS ............................................................................................................................... 3
1.4 OUTLINE OF THIS THESIS .................................................................................................................................. 4
CHAPTER 2 SHORT OVERVIEW ON THE H.264/AVC VIDEO CODING STANDARD ........................................... 7
2.1 OBJECTIVES AND ARCHITECTURE ....................................................................................................................... 7
2.2 VIDEO CODING LAYER ..................................................................................................................................... 8
2.2.1 Intra Prediction .................................................................................................................................. 11
2.2.2 Inter Prediction .................................................................................................................................. 12
3.3 MAIN RELEVANT SHOT DETECTION TRANSITION SOLUTIONS ................................................................................. 22
3.3.1 Shot Transition Detection Using a Graph Partition Model ................................................................ 23
3.3.2 Shot Transition Detection Based on a Statistical Detector ................................................................ 28
3.3.3 Shot Detection in H.264/AVC Using Partition Features ..................................................................... 31
3.3.4 Shot Detection in H.264/AVC Hierarchical Bit Streams ..................................................................... 35
3.3.5 Shot Detection in H.264/AVC using Intra and Inter Prediction Features ........................................... 42
3.3.6 Summary ........................................................................................................................................... 45
CHAPTER 4 SYSTEM ARCHITECTURE AND FUNCTIONAL DESCRIPTION ....................................................... 47
4.1 SYSTEM ARCHITECTURE ................................................................................................................................. 47
6.2.1 Player ................................................................................................................................................. 78
6.2.2 Video Thumbnail ................................................................................................................................ 79
6.2.3 Algorithm and Charts Control ............................................................................................................ 80
6.2.4 Charts Tab Control ............................................................................................................................. 82
7.1 VIDEO COLLECTION ...................................................................................................................................... 85
8.1 SUMMARY AND CONCLUSIONS ..................................................................................................................... 101
8.2 FUTURE WORK .......................................................................................................................................... 103
FIGURE 3.3 – ARCHITECTURE OF THE GRAPH PARTITION MODEL BASED DETECTION ALGORITHM [29]. ........................................ 24
FIGURE 3.4 ‐ GRAPH WITH 13 NODES (LEFT) AND SIMILARITY MATRIX (RIGHT) WHERE BRIGHT MEANS HIGH SIMILARITY AS OPPOSED TO
DARK [2]. ................................................................................................................................................................. 25
FIGURE 3.5 ‐ SEGMENT OF CONTINUITY SIGNAL CONTAINING TWO HARD CUTS [2]. ................................................................. 26
FIGURE 3.6 ‐ SYSTEM ARCHITECTURE FOR THE STATISTICAL DETECTOR [3]. ............................................................................ 28
FIGURE 3.7 ‐ DETECTOR CASCADE FOR DETECTING VARIOUS TRANSITION TYPES [3]. ................................................................ 29
FIGURE 3.8 ‐ TYPICAL BEHAVIOR OF DISCONTINUITY VALUES WITHIN A SLIDING WINDOW OF LENGTH N FOR HARD CUTS (A) AND
FIGURE 5.4 – FRAME DESCRIPTIONS CORRESPONDING TO THE H.264/AVC HIGH PROFILE CODING FOR THE FRAMES IN FIGURE 5.1
CONSIDERING ALSO THE INTRA CHROMINANCE PREDICTION MODES. ..................................................................................... 58
FIGURE 5.5 – GOP DIFFERENCE SCORES FOR THE VIDEO SEQUENCES INTRODUCED IN FIGURE 5.1 USING THE INTRA LUMINANCE
PREDICTION MODES DESCRIPTOR WITH FRAME GRANULARITY AND (A) SUM OF ABSOLUTE DIFFERENCES AND (B) VARIANT OF
PEARSON’S TEST ........................................................................................................................................................ 60
FIGURE 5.6 – TWO FRAME DESCRIPTIONS TAKEN FROM TWO CONSECUTIVE P FRAMES BELONGING TO DIFFERENT SHOTS; IN EACH
FIGURE 5.7 – MOTION VECTOR PREDICTION FOR DIRECT BLOCKS IN E IS PERFORMED BY ANALYZING MOTION INFORMATION FROM
BLOCKS A, B AND C OR D. ........................................................................................................................................... 65
FIGURE 6.1 – DTD FOR THE GROUND TRUTH XML FILE. ................................................................................................... 77
FIGURE 6.2 – EXCERPT OF AN XML FILE CONTAINING THE GROUND TRUTH TRANSITION DESCRIPTIONS OF A VIDEO SEQUENCE. ........ 77
FIGURE 6.3 – GUI OF THE DEVELOPED APPLICATION. ........................................................................................................ 78
FIGURE 6.4 – PLAYER WINDOW AND CONTROLS. .............................................................................................................. 79
FIGURE 6.5 – SHOT TRANSITIONS IN THE VIDEO THUMBNAIL. .............................................................................................. 79
FIGURE 6.6 – SUSPECT GOP MODE IN THE VIDEO THUMBNAIL. ........................................................................................... 80
FIGURE 6.7 – TWO EXAMPLES OF THE VIDEO THUMBNAIL CONTROL COMPONENT. .................................................................. 80
FIGURE 6.8 – ALGORITHM AND CHART TAB CONTROL. ...................................................................................................... 81
FIGURE 6.9 – THE BATCH MODE TAB. ............................................................................................................................ 82
FIGURE 6.10 – CHARTS TAB CONTROL WITH A LINE CHART EXAMPLE. .................................................................................. 83
FIGURE 6.11 – CHARTS TAB CONTROL WITH A HISTOGRAM CHART EXAMPLE: IN THIS EXAMPLE, THE DESCRIPTORS FROM TWO FRAMES
CAN BE COMPARED. .................................................................................................................................................... 83
FIGURE 7.1‐ RECALL/PRECISION FOR THE LUM FEATURE USING A FIXED THRESHOLD. ............................................................. 91
FIGURE 7.2 ‐ RECALL/PRECISION FOR THE LUMCOL FEATURES USING A FIXED THRESHOLD. ..................................................... 91
FIGURE 7.3 – RECALL/PRECISION FOR THE LUM FEATURES USING A MEDIAN‐BASED THRESHOLD. ............................................. 92
FIGURE 7.4 – RECALL/PRECISION FOR LUMCOL TYPE FEATURES USING A MEDIAN‐BASED THRESHOLD. ...................................... 92
FIGURE 7.5 ‐ RECALL/PRECISION FOR THE LUM FEATURES USING AN AVERAGE‐BASED THRESHOLD. ................................................. 93
FIGURE 7.6 ‐ RECALL/PRECISION FOR THE LUMCOL FEATURES USING THE AVERAGE‐BASED THRESHOLD. .................................. 94
FIGURE 7.7 ‐ RECALL/PRECISION USING THE VARIOUS PROPOSED THRESHOLD APPROACHES FOR THE LUMCOL FEATURES. ............ 94
FIGURE 7.8 ‐ RECALL/PRECISION FOR ABRUPT TRANSITION DETECTION BY THE ALGORITHMS RELYING ON TEMPORAL DEPENDENCIES IN
FIGURE 7.13 ‐ RECALL/PRECISION FOR THE GRADUAL TRANSITION DETECTION BY THE IBR APPROACH WITH DIFFERENT PARAMETER
SETTINGS IN MAIN PROFILE. ......................................................................................................................................... 99
FIGURE 7.14 ‐ RECALL/PRECISION FOR OVERALL TRANSITION DETECTION BY THE IBR APPROACH WITH DIFFERENT PARAMETER
SETTINGS IN MAIN PROFILE. ......................................................................................................................................... 99
TABLE 3.4 ‐ BEST RESULTS OBTAINED BY THE IMBR/PTCD DETECTION APPROACH [34]. ......................................................... 34
TABLE 3.5 – PERFORMANCE RESULTS FOR THE ALGORITHM [16]. ........................................................................................ 42
TABLE 3.6 ‐ NUMBER OF STATES IN EACH MODEL [35]. ..................................................................................................... 44
TABLE 3.7 ‐ TEST RESULTS USING ONLY HMMS [35]. ....................................................................................................... 45
TABLE 3.8 ‐ TEST RESULTS USING THE CANDIDATE GOP DETECTION [35]. .............................................................................. 45
TABLE 3.9 ‐ NUMBER OF TOTAL GOPS AND POTENTIAL GOPS USING T=0.3 [35]. ................................................................. 45
TABLE 3.10 ‐ BRIEF SUMMARY OF THE SOLUTIONS PRESENTED IN SECTION 3.3. ..................................................................... 45
TABLE 4.1 ‐ SUMMARY OF THE ADVANTAGES AND DISADVANTAGES OF THE PROPOSED TWO‐PHASE HIERARCHICAL SYSTEM. .......... 48
TABLE 7.1 ‐ SOME PERFORMANCE RESULTS FOR THE DEVELOPED SYSTEM. ........................................................................... 100
Chapter 1 – Introduction

In this chapter, the context and motivation for this work are first presented; afterwards, the most common types of shot boundaries are presented due to their central role for the work reported; next, the objectives for the work are described and, finally, the structure of this document is introduced.
1.1 Context and Motivation
Nowadays, due to the major advances in video coding and the increased availability of computing and
network resources, the creation, manipulation, distribution and usage of digital video are widespread to the general user and not limited to professionals as before. In fact, these advances have led to a
rising number of applications using digital video, such as digital libraries, video-on-demand, digital
video broadcast and interactive TV, which generate and use large collections of video data. Another
factor contributing to the explosion of digital video data is the increasing popularity of user generated video content, like in online video-sharing services such as the popular YouTube [1].
This increased amount and usage of digital video material gives rise to the need of improving the
accessibility to video content by the users. In order to quickly and efficiently browse, search and
consume video content, content-based video retrieval and summarization applications are more and
more required. Since the manual annotation of the video content is mostly unfeasible due to the size
of the video collections, automatic approaches to analyze the video content in order to extract its
structure, semantics, etc. are gaining importance. A fundamental and initial step of such applications
is, naturally, to structure the videos into shorter elementary units, i.e., to perform a temporal structural
analysis of the video, the so-called temporal segmentation. Among the possible types of elementary
units, there is the shot which has been considered an appropriate elementary unit for this kind of
applications and has been used by a great majority of them; a shot consists of a series of interrelated
consecutive pictures taken contiguously by a single camera and representing a continuous action in
time and space. Due to the importance of shot transition detection in this application context, shot
transition detection tools have been an extensively researched and reported subject in the relevant
literature [2], [3], [4], [5].
However, digital video content is nowadays made available in a compressed format to reduce its
storage and transmission requirements. Over the years, various video coding standards have been
developed, successively providing higher compression factors to more efficiently use the available storage capacity and transmission bandwidth. This has generated the need for shot transition
detection systems which operate directly on the compressed domain, avoiding the time-consuming
decompression process. This has an especial importance for applications which require fast temporal
segmentations, even if, in some cases, this implicates lower detection performance levels. Nowadays,
the state-of-the-art on video compression is the H.264/Advanced Video Coding (AVC) standard [6]
and, therefore, the state-of-the-art shot transition detection compressed domain systems are those
which operate with H.264/AVC compressed videos.

1.2 Video Shot Transitions

There are many types of shot transitions in video content, notably depending on the content creator's creativity. In this document, video shot transitions will be defined by the following four parameters:
o Pre-frame – The last frame before the shot transition.
o Post-frame – The next frame after the shot transition.
o Type – The type of the shot transition.
o Length – The number of frames between the pre-frame and the post-frame of the shot transition.

Although there are several types of video shot transitions currently used in film editing to connect successive shots, they are usually grouped under two main classes:
o Abrupt or hard transitions – In this kind of transitions, one frame belongs to the disappearing shot and the next to the appearing shot; this is the most usual type of transition and it is also known as a cut. An example of such transitions is depicted in Figure 1.1.

Figure 1.1 – Cut transition example: a) Pre-frame and b) Post-frame.

o Gradual or soft transitions – In this kind of transitions, cinematic effects are added to combine the two shots using chromatic, spatial or spatial-chromatic effects which can gradually replace one shot by another. Since these effects last for several frames, this kind of transitions is more difficult to detect when compared with abrupt transitions. Another problem is that, due to the increased role of computer technology in video editing, this kind of transitions is very customizable, according to spatial, temporal and chromatic characteristics, which makes them difficult to model. The most common types of gradual transitions are:
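The four parameters above map naturally onto a small record type; a minimal Python sketch (the class name and the convention that an abrupt cut has length 0 are assumptions for illustration):

```python
from dataclasses import dataclass

@dataclass
class ShotTransition:
    """A shot transition described by the four parameters above."""
    pre_frame: int   # last frame before the transition
    post_frame: int  # first frame after the transition
    kind: str        # e.g. "cut", "dissolve", "fade"

    @property
    def length(self) -> int:
        # Number of frames strictly between pre-frame and post-frame;
        # an abrupt cut therefore has length 0.
        return self.post_frame - self.pre_frame - 1

cut = ShotTransition(pre_frame=120, post_frame=121, kind="cut")
dissolve = ShotTransition(pre_frame=300, post_frame=325, kind="dissolve")
```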
environments. Operating in the H.264/AVC compressed domain means that the algorithm must only perform some essential and low-complexity decoding tasks, like parsing the bit stream or doing some minor calculations, while avoiding all the time-consuming decoding tasks, e.g., motion vector inference or transform decoding.
To encourage research on information retrieval by providing a large test collection and uniform scoring procedures, the Text Retrieval Conference (TREC) series was initiated in 1992. In 2001, a video
"track" devoted to research on automatic segmentation, indexing and content-based retrieval of digital
video was initiated and, in 2003, an independent TREC Video Evaluation (TRECVID) conference
series [7] was formed. Between 2001 and 2007, the TREC and later the TRECVID initiatives provided
a common video database and common evaluation criteria with the associated ground truth, which
allowed evaluating several proposed shot transition detection systems under solid and fair conditions.
This contest environment had a major impact on the development of this technology.
Among the various metrics relevant for the evaluation of shot transition detection systems, the most
commonly used are:
o Recall – Ratio between the number of correctly detected shot transitions and the number of existing transitions in the video material (1).

Recall = (number of correctly detected transitions) / (number of existing transitions)   (1)

o Precision – Ratio between the number of correctly detected shot transitions and the number of detected transitions (2).

Precision = (number of correctly detected transitions) / (number of detected transitions)   (2)
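These two ratios translate directly into code; a minimal Python sketch (matching detections to ground truth by exact frame position is an assumption here — real evaluations such as TRECVID allow a small tolerance window):

```python
def recall_precision(detected, ground_truth):
    """Compute recall (1) and precision (2) for detected transition
    frame positions against the ground-truth positions."""
    detected, ground_truth = set(detected), set(ground_truth)
    correct = len(detected & ground_truth)
    recall = correct / len(ground_truth) if ground_truth else 0.0
    precision = correct / len(detected) if detected else 0.0
    return recall, precision

# 3 of the 4 true transitions found, with 1 false alarm among 4 detections
r, p = recall_precision([10, 50, 90, 200], [10, 50, 90, 150])
```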
These metrics will be also intensively used in this document to evaluate the performance of the
developed shot transition detection systems. In Figure 1.5 and Figure 1.6, the performance of the
participant teams in TRECVID 2007 is shown. These figures provide an idea on the precision and
recall values obtained nowadays with state-of-the-art shot transition technology. It is, however, important to note that most of these algorithms work in the uncompressed domain and only a few of
them operate in the MPEG-1 compressed domain. The algorithms to be studied, designed,
implemented and evaluated in this Thesis make one step further since they work in the compressed
domain of the most recent video coding standard, the H.264/AVC.
1.4 Outline of this Thesis
This Thesis is organized in seven chapters besides this introductory chapter, where, mostly, the
motivation and objectives are presented. In Chapter 2, a short overview of the H.264/AVC video coding
standard is presented. In Chapter 3, a review of the state of the art on shot transition detection
systems is presented; with this review in mind, a general framework, and a classification tree for these
systems are also proposed; finally, some of the most representative shot transition detection systems
in the literature are reviewed. In Chapter 4, the architecture and the functional modules of the
developed shot transition detection systems are introduced. Next, a detailed description of the shot
transition detection algorithms designed and implemented for the core architectural modules is
stream without the need of any information from any other slice. In special cases, e.g. to achieve better error resilience, Flexible Macroblock Ordering (FMO) may be used, in which case the macroblock order in the slice may differ. In the H.264/AVC standard, there are five types of slices: I, B, P, SI and SP. The SI and SP slices are new with regard to previous standards and target network transmission problems; for this reason, only the remaining three types will be considered here:
o I-Slice – In this type of slices, the samples have to be encoded using the intra mode defined in
Section 2.2.1.
o P-Slice – In this type of slices, the macroblocks may be encoded in intra mode or in inter mode
where each prediction block may use up to one motion vector and reference index.
o B-Slice – In this type of slices, the macroblocks may be encoded in intra or inter prediction
mode where each prediction block may be encoded using at most two motion vectors and two
reference indexes.
For each macroblock, the encoder decides which type of prediction should be used to maximize the coding efficiency. For this, it computes the prediction error, which is transformed and quantized; afterwards, it entropy codes the prediction error along with other information so that the decoder can recompute the prediction; the outcome of the entropy coder is the H.264/AVC bit stream.
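The per-macroblock loop just described can be sketched in a few lines; this toy Python fragment replaces the standard's rate-distortion mode decision and integer transform with a plain SAD cost and a uniform quantizer, purely for illustration:

```python
import numpy as np

def encode_macroblock(block, candidates, qstep=8):
    """Pick the candidate prediction with the smallest SAD residual,
    then quantize the residual (toy stand-in for the actual
    transform + quantization stage)."""
    residuals = {name: block - pred for name, pred in candidates.items()}
    best = min(residuals, key=lambda n: np.abs(residuals[n]).sum())
    quantized = np.round(residuals[best] / qstep).astype(int)
    return best, quantized

# flat 16x16 block; the inter candidate predicts it perfectly
block = np.full((16, 16), 100)
candidates = {"intra_dc": np.full((16, 16), 96),
              "inter":    np.full((16, 16), 100)}
mode, q = encode_macroblock(block, candidates)
```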
There are two types of entropy coders which can be used in H.264/AVC: i) Context-Adaptive Variable
Length Coding (CAVLC), and ii) Context-Adaptive Binary Arithmetic Coding (CABAC). The CABAC
solution yields more efficient coding, although at the cost of increased complexity.
2.2.1 Intra Prediction
In previous standards, macroblocks encoded in intra mode did not have any prediction; however, in this new standard, a prediction for an intra coded macroblock may be computed based on samples from already decoded neighboring macroblocks in the same slice. There are four such intra coding modes for luminance samples:
o Intra4x4 – Each block of 4 x 4 luminance samples in the macroblock is predicted using one of the 9 available prediction modes.
o Intra8x8 – In this intra mode, each 8 x 8 luminance block in the macroblock is predicted using
one of 9 prediction modes available which are similar to those in the Intra4x4 mode considering
8 x 8 blocks instead of 4 x 4 blocks.
o Intra16x16 – This mode performs macroblock predictions over the 16 x 16 samples macroblock
using one of the 4 prediction modes available and depicted in Figure 2.7.
Figure 2.7 - Intra16x16 prediction modes.
o PCM – This is a mode which is rarely used since it provides no compression when compared to
the previously introduced intra prediction modes; it is specified for the following purposes:
It allows the encoder to precisely represent the sample values.
It provides a way to accurately represent the values of anomalous picture content without
significant data expansion.
It enables placing a hard limit on the number of bits a decoder must handle for a
macroblock without harming the coding efficiency.
For chrominance samples, the macroblock is not divided and a prediction is made for all the 16x16 or
8x8 chrominance samples in the macroblock, depending on the chrominance sub-sampling format
used, e.g. 4:2:0 or 4:4:4. This prediction is made in the same fashion as Intra16x16, since
chrominance data is usually smooth over large areas.
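To make the directional prediction idea concrete, here is a minimal Python sketch of three of the Intra4x4 luminance modes (the mode names are the standard's, but the rounding and border handling are simplified, not conformant):

```python
import numpy as np

def intra4x4_predict(mode, top, left):
    """Three of the Intra4x4 prediction modes, from the 4 neighboring
    samples above (top) and to the left (left) of the block."""
    if mode == "vertical":      # copy the row above downwards
        return np.tile(top, (4, 1))
    if mode == "horizontal":    # copy the left column rightwards
        return np.tile(left.reshape(4, 1), (1, 4))
    if mode == "dc":            # mean of all 8 neighbors, with rounding
        return np.full((4, 4), (top.sum() + left.sum() + 4) // 8)
    raise ValueError(mode)

top = np.array([10, 20, 30, 40])
left = np.array([10, 10, 10, 10])
pred = intra4x4_predict("vertical", top, left)
```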
2.2.2 Inter Prediction
Using the inter prediction mode, the prediction for a macroblock can be based on samples from other
previously decoded frames. The available prediction types to encode a macroblock depend on the slice type. The available inter prediction modes are explained in the following:
o P mode – In this mode, both the motion and prediction error information are available in the bit
stream. To maximize the coding efficiency, the H.264/AVC standard specifies several
partitioning modes for an inter macroblock, as depicted in Figure 2.8. Each H.264/AVC partition
can have its own motion information (motion vectors and associated reference indexes); in the
case of sub-macroblocks, which is the name given to the 8x8 partitions of a P-mode macroblock, besides the partition motion information, each sub-macroblock partition can also have its own motion vector information. Depending on the slice type, this motion information can be of two
provide important information about the visual content being analyzed, therefore
Figure 3.1 - General framework for shot transition detection algorithms.
In the proposed general framework, shown in Figure 3.1, several modules can be identified:
o Feature extraction – In this first stage, the visual content, available in a compressed or
uncompressed format, is represented by means of feature descriptors which map each frame
into a feature space so that further processing may be simplified. The extracted features, and corresponding descriptors, should be sensitive enough to various content variations, thus allowing a shot transition to be detected; during a shot, they should be invariant, so that no false transitions are declared.
o Similarity score calculation – In the second module, descriptors are evaluated to measure the
similarity or dissimilarity (difference) between frames, thus generating continuity or discontinuity
scores. This may be achieved by simply analysis one or two frames, or by considering more
frames, thus incorporating contextual information into the process. Other scores may also begenerated in this module, which may aid the decision process by
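A minimal end-to-end example of these two modules, using a gray-level histogram as the feature descriptor and the sum of absolute differences as the discontinuity score (both are only one of the many choices discussed in this chapter):

```python
import numpy as np

def gray_histogram(frame, bins=64):
    """Feature descriptor: normalized gray-level histogram of a frame."""
    hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return hist / hist.sum()

def discontinuity_scores(frames):
    """Discontinuity score between each pair of consecutive frames,
    here the sum of absolute histogram differences (SAD)."""
    descs = [gray_histogram(f) for f in frames]
    return [np.abs(a - b).sum() for a, b in zip(descs, descs[1:])]

rng = np.random.default_rng(0)
shot_a = [rng.integers(0, 64, (32, 32)) for _ in range(3)]     # dark shot
shot_b = [rng.integers(192, 256, (32, 32)) for _ in range(3)]  # bright shot
scores = discontinuity_scores(shot_a + shot_b)
# the score peaks at the boundary between the two shots
```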
In the following, some further considerations are presented regarding the various classes of shot transition detectors resulting from the classification dimensions introduced above.

3.2.1 Generic Transition Detectors

In this class, the algorithms detect the transitions regardless of their type, e.g., abrupt and gradual. This approach is mainly used when a low complexity algorithm is required, since the alternative usually corresponds to cascading discriminative detectors, thus increasing the processing time. They are designed so that the general characteristic of a shot transition is detected, that is, a significant difference between a frame from one shot and the one belonging to the next shot. However, this very general technique is not much used because the detection performance achieved by such algorithms is usually lower than the one provided by discriminative detectors. A more usual approach to detect all transition types is to use a detector designed by cascading a cut detector with a gradual transition detector [2], [20], which, as explained earlier, is classified here as a discriminative solution.

3.2.1.1 Uncompressed Domain Detectors

Most of the literature presents algorithms based on features extracted from raw images, this means uncompressed data. The advantage of these algorithms is that they are not constrained to a specific encoding format nor to a specific encoder implementation, so these detectors have greater detection potential. However, if the content is coded, these algorithms will need the data to be decoded first, thus adding time-consuming computational complexity to the process.

Single Granularity Level

The uncompressed features can be obtained at various spatial granularity levels, notably:
o Pixel-based approach – Some shot transition algorithms exploit a feature descriptor representing each pixel. This type of mapping is usually very sensitive to shot transitions; however, it can also be extremely sensitive to motion, local changes and camera operation, and usually requires more computational resources, since the feature vectors generated might be very large. For that reason, it is usually used in combination with less sensitive feature descriptors, for instance those taken on a frame or block basis, or with some kind of motion compensation or filtering.
o Block-based approach – Another possibility is to segment each frame into blocks and extract features for each block. Features extracted in this way have the advantage of being more invariant to camera or object movement and local changes, without a significant loss in terms of feature sensitivity.
o Whole frame approach – Some algorithms use descriptors that describe whole frame features, therefore being even more robust to motion within a shot than block-based solutions. However, these approaches are usually less sensitive to shot changes since they might not consider the spatial differences between compared frames. An example of this type of detector is an algorithm creating a color histogram for each frame; the descriptor uses singular value decomposition over a feature matrix formed by several
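The block-based granularity can be sketched as follows; this Python fragment (the 2 x 2 grid and 32-bin histograms are arbitrary illustrative choices) concatenates one normalized histogram per block into a single frame descriptor:

```python
import numpy as np

def block_histograms(frame, grid=2, bins=32):
    """Block-based descriptor: one normalized gray-level histogram per
    block of a grid x grid partition of the frame, concatenated."""
    h, w = frame.shape
    feats = []
    for by in range(grid):
        for bx in range(grid):
            block = frame[by*h//grid:(by+1)*h//grid,
                          bx*w//grid:(bx+1)*w//grid]
            hist, _ = np.histogram(block, bins=bins, range=(0, 256))
            feats.append(hist / hist.sum())
    return np.concatenate(feats)

frame = np.zeros((64, 64))
frame[:32, :32] = 255          # only the top-left block is bright
desc = block_histograms(frame)
```

Note how, unlike a whole-frame histogram, the descriptor keeps the information that the bright samples sit in the top-left block.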
followed by an algorithm which works in the MPEG-1 video compressed domain; finally, three algorithms operating in the H.264/AVC compressed domain are presented.

3.3.1 Shot Transition Detection Using a Graph Partition Model

In [2], from 2007, Yuan et al. present a formal study of the shot transition detection problem, review several of the existing technical approaches and, afterwards, present a shot transition detection system based on a graph partition model (GPM). Finally, some experiments are conducted using the TRECVID [7] platform, comparing various parameter profiles. Under the classification proposed in Section 3.2, this system fits in the category of discriminative, uncompressed and single level (block-based) solutions. The authors submitted their detection performances to TRECVID 2005, 2006 and 2007, and obtained very good scores [7]. Some modifications introduced for TRECVID 2007 [29] will also be considered in the following description.

3.3.1.1 Objectives and Basic Approach

The main objective of this shot detection solution is to achieve a good performance in detecting any kind of transition using a unified shot transition detector based on a graph partition model, which is used to compute the similarity score signal.

An undirected weighted graph is used where the frames are treated as nodes while the weight of the edges expresses the similarity between the connected frames. At each time frame, a subset graph is divided into two sub-graphs by employing a min-max cut procedure with temporal constraints; the obtained score is used as the continuity value. The main advantage of using this procedure is to achieve invariance to local changes since the model incorporates significant contextual information. The continuity signal is then fed to a Support Vector Machine (SVM) which tries to detect certain characteristic transition patterns usually present in video content.

3.3.1.2 Architecture

The architecture of the solution presented in this section is shown in Figure 3.3; the highlighted blocks represent the modifications introduced in [29]. The detection is conducted by a hierarchical classification process considering the following steps described in the next section:
o Visual content representation;
o Fade out/in detection;
o Construction of continuity signal;
o Construction of feature vectors for cut detection;
o Detection of cut transitions;
o Construction of gradual transition feature vectors;
o Motion post-processing;
o Scale Invariant Feature Transform (SIFT) post-processing.
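The min-max cut step described above can be sketched on a toy similarity matrix; in this Python fragment the similarity values and the window handling are illustrative stand-ins for those used in [2]:

```python
import numpy as np

def min_max_cut_score(sim, split):
    """Min-max cut objective for a frame-similarity matrix `sim` split
    into sub-graphs A = [0, split) and B = [split, n): low values mean
    weak inter-group similarity, i.e. a likely shot boundary."""
    a, b = slice(0, split), slice(split, sim.shape[0])
    cut = sim[a, b].sum()   # similarity across the candidate boundary
    return cut / sim[a, a].sum() + cut / sim[b, b].sum()

# two 3-frame shots: high similarity inside each shot, low across them
sim = np.block([[np.full((3, 3), 0.9), np.full((3, 3), 0.1)],
                [np.full((3, 3), 0.1), np.full((3, 3), 0.9)]])
scores = [min_max_cut_score(sim, s) for s in range(1, 6)]
# the lowest score occurs at split = 3, the true boundary
```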
The solution presented in this section has been extensively tested. In [2], the authors carry out several
experiments to evaluate alternative solutions for each major module. In TRECVID 2007, the algorithm
has been ranked among the best.
The TRECVID 2007 video set consisted of seventeen videos, corresponding to 637,805 frames and 2,463 transitions: 2,236 cuts (90.8%), 134 dissolves (5.4%), 2 fade-out/ins (<0.1%) and 91 other special effects (3.7%). Ten runs, whose descriptions are available in Table 3.1, were submitted for evaluation in
TRECVID 2007, obtaining the results presented in Table 3.2.
Table 3.1 - Description of the ten runs evaluated in TRECVID 2007 [29].
Sysid Description
Thu01 Baseline system: RGB histogram using 2 x 2 blocks for cut and gradual transition detector, no motion detector, no SIFT post-processing, only using the development set of 2005 as training set.
Thu02 Same algorithm as thu01, but with 2 x 2 blocks for cut detector and 4 x 4 blocks for gradual transition detector
Thu03 Same algorithm as thu02, but with SIFT post-processing for cut detection
Thu04 Same algorithm as thu03, but with Motion detector for GT
Thu05 Same algorithm as thu04, but with SIFT post-processing for GT
Thu06 Same algorithm as thu05, but no SIFT processing for CUT
Thu09 Same algorithm as thu05, but with different parameters
Thu11 Same algorithm as thu05, but with different parameters
Thu12 Same algorithm as thu05, but with different parameters
Thu13 Same algorithm as thu05, but with different parameters
Thu14 Same algorithm and parameters as thu05, but trained with all the development data from 2003-2006
Table 3.2 – Evaluation results for the ten submissions to TRECVID 2007 [29].
There are two types of entities distinguishable in the decision rule (5):
o Likelihood functions – They express the probability that a certain discontinuity value has of belonging to each hypothesis. These functions are estimated using
several representative training sequences; they should not contain strong motion or strong
lighting changes because this might include discontinuity values which are out of their proper
range due to the effects of these extreme factors;
o P k (S) – It stands for the probability of validation of the hypotheses S at a frame k . This term (6)
reflects the influence of two kinds of information in the decision process:
( ) ( ) ( )( )k SPSPSPk
a
k k ψ |=
(6)
A priori information – Information that does not depend on any measurement on a discriminative video sequence; in this algorithm, the author models the probability of a shot transition occurring after a certain number of elapsed frames since the last detected transition, mainly to reduce false detections of shot boundaries immediately after a previous one.
Additional information – Information which depends not only on initial assumptions but
also on the observed data. With this purpose, Hanjalic suggests using some pattern
modeling functions (ψ(k)) to compare the measured pattern within the temporal vicinities
(using a sliding window of size N ) of the frame being evaluated with the typical pattern
previously formulated for each transition type. This allows providing the detector with some
contextual information which might confirm, or contradict, the guess made by only
evaluating the distance functions for the frame under processing. The patterns which the
detector tries to identify are a sharp peak in the discontinuity values for cuts, and a
triangular pattern in the discontinuity values combined with an analysis on the intensity
variance along the frames in the sliding window for dissolves. This assumption can be
made by observation of the discontinuity values in Figure 3.8.
The terms in (6) and the likelihood functions are calculated and then the decision rule (5) is evaluated
successively in each module in the cascade until a transition is detected or the end of the cascade is
reached, in a process depicted in Figure 3.7.
3.3.2.4 Performance Evaluation
The performance of this algorithm has been evaluated by Hanjalic [3] for five test sequences,
belonging to four program categories (movie, football match, news and commercial documentary),
using the same detection parameters. These sequences, not used in the training stage in which the likelihood functions and other detection parameters were obtained, contain several effects which usually cause detection errors, such as camera motion and zooming, fast object motion and editing effects. The performance evaluation results are shown in Table 3.3.
The best results achieved by the algorithm in the experiments reported by the authors in [34], are
shown in Table 3.4. In Figure 3.10, the Recall and Precision scores obtained for video I using the Nero
encoder for different threshold values are shown; in Figure 3.11, similar scores are shown for the
QuickTime encoder. An additional performance measure is also used by the authors: the average shot
detection time in relation to the decoding time. This is important since the algorithm works in the compressed domain and, therefore, a significant reduction in the algorithm execution time may also be a major requirement.
Table 3.4 - Best results obtained by the IMBR/PTCD detection approach [34].
Video      THP/THI    Detected cuts/graduals   False detections   Recall   Precision   Average shot detection time in relation to the decoding time
I (Nero)   0.60/0.60   135/25                    6                  94%      96%         9.54%
II (QT)    0.50/0.60   112/108                  14                  95%      94%         8.41%
III (QT)   0.50/0.50   228/3                    22                  95%      91%         9.17%
Figure 3.10 – Recall and Precision for the IMBR/PTCD detection approach for video I (Nero) [34].
From the presented results, the authors concluded that the algorithm performance does not vary
significantly for the various encoders and sequences used; moreover, the thresholds do not need
great adjustments to achieve the best results for the different sequences and encoders. Also the
algorithm execution time is below 10% of the time required to decompress the video sequences.
Figure 3.11 - Recall and Precision for the IMBR/PTCD detection approach for video I (QT) [34].
Other tests carried out by the authors, however, have shown that IMBR is highly dependent on the video encoding bit rate, generating more false alarms for videos coded at higher bit rates. Another possible problem reported concerns the behavior of the algorithm with other encoding profiles, since it has only been tested with sequences encoded with the Baseline profile, which uses fewer H.264/AVC tools.
3.3.4 Shot Detection in H.264/AVC Hierarchical Bit Streams
In [16], from 2008, De Bruyne et al. present a shot detection algorithm operating in the H.264/AVC compressed domain which detects both cuts and gradual transitions. Considering the classification presented in Section 3.2, this is a discriminative, compressed domain algorithm based on a combination of feature granularities (frame and block levels).
3.3.4.1 Objective and Basic Approach
This algorithm relies on several features, some of which, contrary to the previous solutions, are not available at the very first parsing level. While intra and inter prediction modes are used, as in the previous algorithms, this algorithm additionally resorts to motion information, which is not directly available in the bit stream.
The authors propose two algorithms: one for shot transition detection for traditional coding patterns
and another for hierarchical coding structures such as those which may be used in the H.264/AVC
standard. The same features and difference scores are considered in the two algorithms; thus, the
main difference between these two algorithms is that while the first algorithm compares consecutive
frames, the second algorithm efficiently exploits hierarchical (or pyramidal) coding structures to speed
up the process, considering primarily frames from the base layer and, only when a shot change is
suspected to happen, processing frames from higher layers.
3.3.4.2 Architecture
The architecture of this algorithm is presented in Figure 3.12.
3.3.4.3 Algorithm Description
This algorithm detects both abrupt and gradual transitions. In this article, the authors propose one procedure to detect abrupt transitions and another to detect gradual transitions by analyzing the frames in the video sequence. These procedures are described next and, afterwards, their usage in the context of hierarchical coding structures is described.
Detection of Abrupt Transitions Relying on Temporal Dependencies
To detect an abrupt transition between two consecutive frames (in terms of global visualization order
or considering only frames of the same layer), temporal dependences in those frames are evaluated.
In fact, since a frame from a shot usually does not share similarities with a frame from the next shot,
the H.264/AVC encoder reflects that fact in the reference frames it uses to generate predictions.
Therefore, when an abrupt shot change occurs, the pre-frame is usually encoded using forward predicted blocks, while the post-frame consists of intra or backward predicted blocks. In such a case, it is said that a gap in the temporal prediction chain has occurred, as illustrated in Figure 3.13, since no temporal prediction crosses the shot boundary.
Figure 3.12 - Architecture of the algorithm proposed in [16].
Figure 3.13 - Example of a video sequence consisting of three shots: the full arrows represent the use of reference frames while the dashed arrows indicate reference frames which are not being used [16].
With the purpose of detecting gaps in the prediction chain, frames are split into 8 x 8 blocks and, by evaluating the prediction types and the POC numbers of the used reference frames, the following ratios are derived:
o Intra prediction ratio (i(fi)) – The ratio between the number of 8 x 8 blocks encoded in intra mode and the number of 8 x 8 blocks in the current frame;
o Forward prediction ratio (φ(fi)) – The ratio between the number of 8 x 8 blocks whose reference frames have lower POCs than the current frame and the number of 8 x 8 blocks in the current frame;
o Bi-directional ratio (δ(fi)) – The ratio between the number of 8 x 8 blocks encoded using two reference frames, one with a lower POC and one with a higher POC when compared to the current frame, and the number of 8 x 8 blocks in the current frame;
o Backward prediction ratio (β(fi)) – The ratio between the number of 8 x 8 blocks whose reference frames have higher POCs than the current frame and the number of 8 x 8 blocks in the current frame.
Afterwards, the condition (7) is verified and, if it holds, an abrupt transition is declared.
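To make the ratio definitions concrete, the sketch below computes them from per-frame block counts; condition (7) is not reproduced in this excerpt, so the `prediction_gap` test and its threshold `t` are a hypothetical stand-in for it, not the authors' actual condition:

```python
from dataclasses import dataclass

@dataclass
class FrameStats:
    """Counts of 8x8 blocks in a frame, grouped by prediction type."""
    intra: int          # intra-coded blocks
    forward: int        # references with lower POCs only
    backward: int       # references with higher POCs only
    bidirectional: int  # one lower-POC and one higher-POC reference
    total: int          # all 8x8 blocks in the frame

def ratios(s: FrameStats):
    """The four ratios defined above: i(f), phi(f), delta(f), beta(f)."""
    return (s.intra / s.total, s.forward / s.total,
            s.bidirectional / s.total, s.backward / s.total)

def prediction_gap(pre: FrameStats, post: FrameStats, t: float = 0.7) -> bool:
    """Hypothetical stand-in for condition (7): declare a gap when the
    pre-frame is mostly forward predicted while the post-frame is mostly
    intra or backward predicted."""
    i_pre, phi_pre, delta_pre, beta_pre = ratios(pre)
    i_post, phi_post, delta_post, beta_post = ratios(post)
    return phi_pre >= t and (i_post + beta_post) >= t

# Synthetic shot boundary: pre-frame forward predicted, post-frame intra/backward.
pre = FrameStats(intra=2, forward=90, backward=2, bidirectional=2, total=96)
post = FrameStats(intra=60, forward=4, backward=30, bidirectional=2, total=96)
print(prediction_gap(pre, post))  # True
```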
Detection of Abrupt Transitions Relying on Spatial Dissimilarities
This procedure aims at verifying cut detections in which the detected gap is a consequence of the encoding pattern (like IPPP patterns) or of the presence of an I or IDR frame. The presence of IDR frames results in possibly falsely detected gaps in the prediction chain, since no frame succeeding an IDR frame can use as reference frames which precede the IDR frame, as depicted in Figure 3.14. Therefore, a different procedure is suggested for these cases, based on the spatial similarities of intra frames.
Figure 3.14 - The use of IDR frames results in a temporal prediction chain that is broken, as no subsequent frame in decoding order is allowed to use as reference frames prior to the IDR frame [16].
However, as the distance between successive I frames is usually large, a comparison between them is not recommended. Instead, intra-prediction maps (M1 and M2) are created for the frames where the gap was detected; these maps contain, for each macroblock position, the intra partitioning information of the last intra-coded macroblock. This procedure works as follows:
o For each frame, in decoding order, that directly or indirectly has a temporal dependence on the frame for which the prediction map is being computed, including the latter:
For each macroblock in that frame encoded in intra mode, the corresponding macroblock in the prediction map being calculated is updated with the new partitioning information.
For example, in the situation depicted in Figure 3.14, a gap in the prediction chain is found between P32 and B33 due to the presence of IDR40. To calculate M33, the iteration starts at IDR40, continues by analyzing frames B36 and B34 and finishes by analyzing frame B33; during this iteration, whenever an intra-coded macroblock is found, its partitioning information replaces the partitioning information at the corresponding position in the prediction map.
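The map construction just described can be sketched as follows, assuming the dependency chain is already available as a list of frames in decoding order (a simplification of the actual bit stream parsing):

```python
def intra_prediction_map(dependency_chain, num_macroblocks):
    """Build an intra-prediction map: for each macroblock position, keep the
    intra partitioning of the last intra-coded macroblock seen while walking
    the dependency chain in decoding order. Each frame is modelled as a dict
    mapping macroblock index -> partitioning, listing only its intra-coded
    macroblocks."""
    m = [None] * num_macroblocks
    for frame in dependency_chain:
        for mb_index, partitioning in frame.items():
            m[mb_index] = partitioning  # later frames overwrite earlier ones
    return m

# Toy version of the example above (4 macroblock positions): the chain for
# M33 is IDR40, B36, B34, B33 in decoding order.
idr40 = {0: 'Intra16x16', 1: 'Intra16x16', 2: 'Intra16x16', 3: 'Intra16x16'}
b36 = {1: 'Intra4x4'}   # only macroblock 1 is intra-coded
b34 = {}                # fully inter-coded
b33 = {2: 'Intra8x8'}
print(intra_prediction_map([idr40, b36, b34, b33], 4))
# ['Intra16x16', 'Intra4x4', 'Intra8x8', 'Intra16x16']
```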
After these maps are computed, the dissimilarity between the two maps is calculated by comparing the partitioning of corresponding macroblocks; more precisely, to compensate for camera or object motion, a window of 3 x 3 macroblocks is used for each macroblock and the collocated windows are compared considering the distribution of the partitionings used in these windows.
To do this, a histogram w is made for each macroblock m, consisting of T bins, T = {Intra4x4, Intra8x8, Intra16x16}, as depicted in (8). Afterwards, the dissimilarity (W) is computed for each corresponding macroblock window by calculating a normalized sum of absolute differences (9).
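Assuming (8) simply counts the partitionings found in each 3 x 3 macroblock window and (9) normalizes the sum of absolute differences by the total window mass (the exact expressions in [16] may differ), the comparison can be sketched as:

```python
T = ('Intra4x4', 'Intra8x8', 'Intra16x16')

def window_histogram(pred_map, width, height, x, y):
    """Histogram w over the bins in T for the 3x3 macroblock window centred
    on (x, y); pred_map is a row-major list of partitionings (None where no
    intra-coded macroblock was ever seen)."""
    bins = dict.fromkeys(T, 0)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            nx, ny = x + dx, y + dy
            if 0 <= nx < width and 0 <= ny < height:
                p = pred_map[ny * width + nx]
                if p is not None:
                    bins[p] += 1
    return bins

def window_dissimilarity(m1, m2, width, height, x, y):
    """Dissimilarity W between collocated windows: a normalized sum of
    absolute differences over the window histograms."""
    w1 = window_histogram(m1, width, height, x, y)
    w2 = window_histogram(m2, width, height, x, y)
    sad = sum(abs(w1[t] - w2[t]) for t in T)
    mass = max(sum(w1.values()) + sum(w2.values()), 1)
    return sad / mass

m1 = ['Intra4x4'] * 9     # 3x3 map, all 4x4 partitioning
m2 = ['Intra16x16'] * 9   # 3x3 map, all 16x16 partitioning
print(window_dissimilarity(m1, m2, 3, 3, 1, 1))  # 1.0 (totally different)
```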
Figure 3.16 - Recursive algorithm for detecting shot abrupt transitions in hierarchical structures [16].
For gradual transitions, the intra usage is calculated and evaluated considering base layer frames. If
the intra usage in a frame of the base layer is above T grad , the intra usage in the intermediate frame in
the next level is evaluated. If the intra usage in that frame is low, more precisely, if it is below a
predefined threshold (T nextLayer ), the motion intensity is calculated for that frame considering the
foreground and background estimated at the base layer. Otherwise, if the intra usage in that frame is high, motion information from that frame is not reliable; therefore, the procedure is repeated considering that frame as the base for extracting the foreground and background, and the motion information is evaluated at the next layer, unless this frame also uses too much intra prediction, in which case the procedure advances to the next layer, and so on. This is exemplified in Figure 3.17: i(f24) > Tgrad, which causes the previous frame in the above layer to be analyzed in terms of its motion prediction. As this frame still has many intra-coded blocks, i(f22) > TnextLayer, the motion analysis is performed on frames f21 and f23 instead, where i(f21) < TnextLayer and i(f23) < TnextLayer.
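The layer-climbing in this example can be sketched recursively; the `children` mapping, listing a frame's next-layer neighbours, is a hypothetical helper and not part of the authors' description:

```python
def frames_for_motion_analysis(frame, intra_usage, children, t_next_layer):
    """Find the frames whose motion information is reliable: if a frame uses
    too much intra prediction (i(f) > t_next_layer), recurse into its
    next-layer neighbours instead of using it directly."""
    if intra_usage[frame] <= t_next_layer:
        return [frame]
    result = []
    for child in children.get(frame, []):
        result += frames_for_motion_analysis(child, intra_usage, children,
                                             t_next_layer)
    return result

# Figure 3.17 example: i(f22) > T_nextLayer, so f21 and f23 are used instead.
intra_usage = {22: 0.8, 21: 0.10, 23: 0.15}
children = {22: [21, 23]}
print(frames_for_motion_analysis(22, intra_usage, children, 0.5))  # [21, 23]
```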
This hierarchical algorithm is summarized in Figure 3.18.
Figure 3.17 – Example of a gradual transition in a hierarchical coding structure. Intra-coded macroblocks are represented by their original color, whereas inter-coded macroblocks are blanched [16].
o the number of 4 x 4 blocks with forward, backward, and bidirectional prediction.
o the number of 4 x 4 blocks with skipped and direct modes.
o the number of 4 x 4 blocks with forward and backward multiple reference pictures.
The GOP structure used by the developed system has size 15 (which means one out of 15 frames is
intra coded) and 2 B frames between any two consecutive P or I frames. A frame coding structure,
called word in this context, and depicted in Figure 3.20, which consists of the current P frame and the
B frames between the preceding and the following P or I frames, represents the observation window in
consideration. Several words, shown in Table 3.6, are defined representing the possible patterns: one for no transition in the structure, one for a gradual transition and six representing each possible abrupt transition position. For each possible pattern, an HMM is built.
Figure 3.20 – Frame coding structure [35].
Table 3.6 - Number of states in each model [35].
Word              000001  000010  000100  001000  010000  100000  000000  111111
Number of States       3       4       3       3       4       3       2       2
For each candidate GOP, the observation window is centered on the first P frame; then, the likelihood of each possible model given the observation vector (composed of 5 feature vectors) is analyzed. The observation window then advances to the next P frame, until the end of the GOP under analysis is reached. At the end of the GOP, the algorithm analyzes the obtained likelihoods and selects the model with the highest likelihood.
3.3.5.4 Performance Evaluation
The algorithm has been evaluated by the authors using a test set composed of two sequences encoded with the H.264/AVC reference software, JM7.3:
o News - Spanish daily news from the MPEG-7 Content Set; CIF format; 10017 frames; 69 cuts;
4 dissolves.
o Advertisement - From CCTV broadcaster; 720x576 size; 29997 frames; 48 cuts; 9 dissolves.
The HMMs have been trained with a different data set. Two tests were carried out: one using only the
HMMs, assuming all GOPs as candidates, and another using candidate GOP detection; the obtained
results are shown in Table 3.7 and in Table 3.8.
Comparing the results for the two tested solutions, it is possible to observe that, using the intra prediction information, the algorithm retains the Recall and improves the Precision achieved using only the HMMs. The results presented in Table 3.9 indicate that the intra prediction information can also speed up detection, since fewer GOPs are analyzed using the HMMs.
In this chapter, the architecture of the developed system is first introduced and compared with that proposed in Section 3.1; afterwards, a functional description of each module in the architecture is presented.
4.1 System Architecture
In Section 3.1, a general architecture for shot transition detection algorithms was proposed. Fitting that general architecture, a more specific one is presented in Figure 4.1, which depicts the modules of the designed and implemented system and the relations between them.
In the developed system, a two-phase hierarchical procedure was adopted:
o 1st phase: Suspect GOP detection – This is the part of the processing chain which is executed first. It aims at classifying each GOP in the video sequence as a suspect or a non-suspect GOP, depending on whether a transition is likely to occur in the GOP under analysis or not. This is performed by analyzing only the first frame of each GOP.
o 2nd phase: Transition detection – In the second phase, the GOPs which were considered suspect of containing transitions are analyzed more thoroughly by considering all of their composing frames. In most shot detection systems, this second phase is the only one performed, which is equivalent, in this system, to considering all GOPs as suspect.
The modules in the architecture presented in Figure 4.1 are grouped into the four major modules which compose the proposed general framework. Besides this classification, those modules which belong only to either the first or the second phase are grouped according to the corresponding phase.
o Spatial granularity – The same descriptions may be made at two granularity levels:
Block – Each frame is divided into blocks and a description is generated for each block; these block descriptions together form the frame description.
Frame – The frame is not divided but is rather analyzed as a whole and only one description is generated.
The original algorithm in [35] uses the luminance prediction modes as features and frame granularity.
5.1.1.1 Features Algorithm 1: Luminance Prediction Modes
In [35], the authors claim that the intra prediction modes used to encode one frame reflect the visual content being encoded and, therefore, similar content should be encoded using similar intra prediction modes. Each intra prediction mode is basically characterized by two dimensions:
o Partition sizes – This usually reflects the granularity of the visual content; smaller partitions are used to encode more detailed areas whereas bigger partitions are used to encode smoother areas.
o Intra prediction direction – This usually depends on the content and textures being encoded rather than on the granularity.
Thus, the algorithm proposed in [35] requires the creation of a histogram describing the distribution of the luminance intra prediction modes used over a frame. Accordingly:
o Each frame is divided into 4 x 4 blocks;
o Each block is classified, according to the intra luminance prediction mode used, into 13 categories (9 representing the 4 x 4 intra prediction modes and 4 the 16 x 16 intra prediction modes);
o The histogram is normalized by dividing the value of each bin by the number of 4 x 4 blocks which form the frame.
In Figure 5.1, three sample frames from a high definition video are displayed, which will be used to exemplify some of the concepts in this and in the following sections.
Figure 5.1 – Three sample frames extracted from the “BBC Motion Gallery presents CCTV” video sequence downloaded from the Apple HD Gallery [44]. a) Frame 309, b) Frame 5078, c) Frame 5383.
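As a concrete sketch of this descriptor (with an assumed per-block labelling interface, not the actual decoder output), the 13-bin histogram can be computed as:

```python
INTRA4x4_MODES = 9    # the nine 4x4 luminance intra prediction modes
INTRA16x16_MODES = 4  # the four 16x16 luminance intra prediction modes

def luminance_mode_histogram(block_modes):
    """Normalized 13-bin histogram of luminance intra prediction modes.
    block_modes lists, for each 4x4 block, a pair (is_16x16, mode_index)."""
    hist = [0] * (INTRA4x4_MODES + INTRA16x16_MODES)
    for is_16x16, mode in block_modes:
        hist[INTRA4x4_MODES + mode if is_16x16 else mode] += 1
    n = max(len(block_modes), 1)
    return [h / n for h in hist]  # normalized by the number of 4x4 blocks

# Toy frame: 6 blocks in 4x4 mode 0, 2 in 4x4 mode 2, 8 in 16x16 mode 1.
blocks = [(False, 0)] * 6 + [(False, 2)] * 2 + [(True, 1)] * 8
hist = luminance_mode_histogram(blocks)
print(hist[0], hist[INTRA4x4_MODES + 1])  # 0.375 0.5
```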
In Figure 5.2, the luminance intra prediction histograms for the frames in Figure 5.1 are depicted. By
analyzing the visual content in the frames and the corresponding histograms, it is possible to verify the
assumptions mentioned earlier, notably:
o A comparison of description a) with either description b) or c) seems to confirm the idea that frames which contain more detail mainly use smaller partitions whereas frames with smoother content use bigger partitions.
o By comparing descriptions b) and c), it is possible to observe that differences in the visual content may also yield differences in the intra prediction direction even if the partition sizes used are similar.
Figure 5.4 – Frame descriptions corresponding to the H.264/AVC High profile coding for the frames in Figure 5.1, considering also the intra chrominance prediction modes.
5.1.1.3 Features Algorithm 3: Luminance Partition Types
Another method for description extraction for intra frames is presented in [16]. In this paper, luminance partition types are used as features when processing intra frames, generating a histogram composed of 3 bins, each representing the relative frequency of a partition type (16 x 16, 8 x 8 and 4 x 4) over the block. By observation of Figure 5.3, it can be seen that partition size changes can signal transitions; however, this does not seem as accurate as also considering the prediction modes.
dimension, the remaining macroblocks at the edges are discarded. Compared to the window approach, this is faster since the blocks do not overlap.
5.1.2 GOP Difference Score Computation
As addressed in the previous section, differences in the visual content may generate differences in the statistical distribution of the histograms which compose the descriptions. There are several ways of measuring such differences; in the developed system, the two metrics implemented are:
o Sum of absolute differences – The sum of absolute differences was the metric originally
proposed in [35]; in the current implementation, the only modification was the normalization of
the metric leading to (15).
(15)
o Variant of Pearson’s homogeneity test – A variant of Pearson’s homogeneity test was implemented; this metric (16) was the solution which performed best in a test carried out in [20] for luminance histograms; here, it is normalized and used for intra prediction mode histograms instead.
(16)
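A minimal sketch of the two metrics, assuming the histograms have been normalized to sum to one (the exact normalizations in (15) and (16) may differ), could be:

```python
def sad_score(h1, h2):
    """Normalized sum of absolute differences, a sketch of (15); halving
    keeps the score in [0, 1] for histograms that each sum to one."""
    return 0.5 * sum(abs(a - b) for a, b in zip(h1, h2))

def pearson_score(h1, h2):
    """Variant of Pearson's homogeneity (chi-square) test, a sketch of (16):
    per-bin squared differences weighted by the bin mass."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)

h1 = [0.7, 0.2, 0.1]
h2 = [0.1, 0.2, 0.7]
print(sad_score(h1, h2))   # ~0.6
print(pearson_score(h1, h2) > pearson_score(h1, h1))  # True: identical -> 0
```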
To generate the difference score between two frames for the features defined in the previous section, a metric has to be chosen to compare the block descriptions of the corresponding blocks in those frames (frame descriptions taken at frame granularity are considered as block descriptions with a single block); afterwards, the scores obtained for the blocks are summed to generate the frame difference score. In this sub-module, a difference score is generated for each GOP, computed by comparing the first frame of the current GOP (fa) with the first frame of the next one (fb), as in (17). In Figure 5.5, some examples of such difference scores are depicted.
(17)
5.1.3 GOP Classification
This last module of the suspect GOP detection phase aims at classifying each GOP in the H.264/AVC coded stream as suspect or not in terms of shot transitions. As referred in Section 3.1, there are several
methods to achieve this goal; in the developed system, two algorithms were implemented to classify
each GOP, notably:
o Fixed threshold – Each score is compared to a fixed threshold (Tf ) heuristically set before the
analysis as in (18); this is the procedure used in [35].
Figure 5.5 – GOP difference scores for the video sequence introduced in Figure 5.1, using the intra luminance prediction modes descriptor with frame granularity and (a) the Sum of Absolute Differences and (b) the Variant of Pearson’s Test.
o Adaptive threshold – An adaptive threshold is computed for each GOP taking into consideration the difference scores from surrounding GOPs, which form a window of difference scores. There are some alternatives regarding the difference scores to consider in this window: it has N samples which may be centered on the current GOP or contain only values obtained from previous GOPs, and the value of the current GOP may or may not be discarded (depending on the chosen option). Two basic approaches were implemented:
Average-based threshold - this threshold is computed using the expressions (19), (20) and
(21), where a and b are heuristically set coefficients and µ (average) and σ (standard
deviation) are calculated using the window of difference scores. The minimum and
maximum values in (20) and (21) are used to exclude extreme values which may happen,
for instance, at the beginning or at the end of a video sequence where the window might
not be completed. After this calculation, the similarity score is compared with the computed
adaptive threshold.
(19)
(20)
(21)
Median-based threshold - This threshold is computed using the expressions (22), (23) and (24), where a and b are heuristically set coefficients and the Median is calculated using the window of difference scores. The minimum and maximum values in (23) and (24) are used to exclude extreme values which may happen, for instance, at the beginning or at the end of a video sequence, where the window might not be complete. After this calculation, the similarity score is compared with the computed adaptive threshold.
Those GOPs for which the difference score is above the threshold Ta, as in (25), are considered suspect GOPs and are added to the set of suspect GOPs which will, at the end of this procedure, be provided to the next modules in the system, that is, to the second phase of transition detection. The last GOP of the video sequence is always considered suspect.
(25)
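Since expressions (19)-(24) are not reproduced in this excerpt, the sketch below assumes the common a·μ + b·σ and a·Median + b forms and omits the min/max clamping of extreme values mentioned above:

```python
import statistics

def average_based_threshold(window, a=1.0, b=3.0):
    """Average-based adaptive threshold, a sketch of (19)-(21): a and b are
    heuristically set coefficients; the window holds the difference scores
    of the surrounding GOPs."""
    mu = statistics.fmean(window)
    sigma = statistics.pstdev(window)
    return a * mu + b * sigma

def median_based_threshold(window, a=2.0, b=0.0):
    """Median-based adaptive threshold, a sketch of (22)-(24)."""
    return a * statistics.median(window) + b

scores = [0.05, 0.04, 0.06, 0.05]   # typical intra-shot difference scores
current = 0.40                      # a GOP spanning a likely transition
print(current > average_based_threshold(scores))  # True: flagged as suspect
print(current > median_based_threshold(scores))   # True
```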
5.2 Second Phase: Transition Detection
This phase targets the detection of the frames in which a transition occurs for the GOPs which have
previously been considered as suspect. For this phase, four algorithms were implemented:
o Algorithm 1 – The shot detection algorithm described in Section 3.3.3 and proposed in [34] and
in [33].
o Algorithm 2 – A shot detection algorithm inspired by Algorithm 1 but with some modifications
proposed by the author of this Thesis to improve its performance.
o Algorithm 3 – A shot detection algorithm based on the system proposed in [16] with some
modifications made by the author of this Thesis.
o Algorithm 4 - A shot detection algorithm using hierarchical detection based on the system
proposed in [16] with some modifications made by the author of this Thesis.
These four algorithms will be described in detail in the next sections. The description targets the functioning of the algorithms for constant GOP structures with N=15 and M=3, which will be used in the evaluation; despite that, the algorithms can easily be extended to support other GOP structures. Recall that, according to the architecture presented in Chapter 3, each transition detection algorithm considers three sub-modules: frame description generation, similarity score computation and decision.
5.2.1 Algorithm 1
This first algorithm has been proposed in [34] and [33] and was briefly described in Section 3.3.3. As explained there, the key idea of this algorithm is to detect transitions by analyzing changes in the partition sizes and partition types and in the usage of intra prediction modes in P and B frames. This algorithm was tested by its authors only using videos encoded with the Baseline profile, which does not allow B frames; this means it was never tested with B frames.
In this section, a detailed description of the algorithm used in each module will be provided.
5.2.1.1 Frame Description Generation
In this algorithm, only B and P frames are evaluated; each of these frames is described by two descriptors:
o Partition histogram (PH) – This descriptor accounts for the inter partition sizes and types used
in each frame.
o Intra block ratio (IBR) – This descriptor contains the ratio of intra coded macroblocks in the
current frame.
Partition Histogram
For the generation of this type of description, each frame is split into 4 x 4 blocks and each block is grouped, according to its prediction type (P for forward prediction; B for backward, interpolated or direct prediction; S for skipped prediction) and the size of the corresponding prediction partition, into 15 bins: P16x16, P16x8, P8x16, P8x8, P8x4, P4x8, P4x4, B16x16, B16x8, B8x16, B8x8, B8x4, B4x8, B4x4 and S16x16.
Intra prediction partitions are not considered because the authors argue this would produce too many false positives, since these prediction modes may also be used due to fast motion; instead, the usage of this prediction type is indirectly considered through the effect its rises and falls produce in the usage of the considered partition types.
Intra Block Ratio
As done for generating the PH descriptor, the frame is also split into 4 x 4 blocks; afterwards, the ratio of blocks belonging to intra prediction partitions is computed. As intra blocks are used for new content, a high usage of intra prediction modes may appear when a shot transition is taking place; however, this may also happen when encoding frames with fast motion.
These descriptors are exemplified in Figure 5.6. In this figure, it is possible to observe the differences
in the PH description and a significant increase in the IBR description value in consecutive frames
which belong to different shots.
Figure 5.6 – Two frame descriptions taken from two consecutive P frames belonging to different shots; in each figure, it is possible to observe the PH description at the 8 leftmost bins and the IBR description at the rightmost bin.
5.2.1.2 Difference Score Computation
In this transition detection phase, this score accounts for the discontinuity in the visual content at the frame being analyzed; higher values mean a higher probability of a shot change taking place and vice-versa. With this purpose, two scores are implemented in this algorithm, for each frame, notably:
o Partition histogram difference (PHD) – This metric evaluates the differences between frames by comparing the corresponding PH descriptions; the descriptions of the current and previous frames are compared according to (26), based on the sum of absolute differences, or to (27), based on the sum of non-absolute differences. According to the experiments carried out by the authors who proposed this algorithm, the latter performs better, yielding fewer false positives than the former [33], since there are cases where the partitioning changes not due to a real content change but due to compression efficiency decisions of the encoder, e.g., if an encoder starts to use skipped macroblocks instead of predicted macroblocks. However, this seems contradictory with the partition change detection, since, using non-absolute differences, the changes will be due only to intra ratio changes (rises and falls) and not to partition changes.
PHD(f(N-1), f(N)) = Σi | PH(f(N-1), i) − PH(f(N), i) |    (26)

PHD(f(N-1), f(N)) = Σi ( PH(f(N-1), i) − PH(f(N), i) )    (27)

where PH(f, i) denotes bin i of the PH description of frame f.
o Intra block ratio (IBR) – This score directly uses the IBR description of the current frame; for each frame, it equals the ratio of intra-coded macroblocks in that frame.
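The difference between (26) and (27) can be illustrated with a toy sketch: a pure shift of mass between partition bins is visible to the absolute variant but cancels in the signed one, which reacts only to changes in the total inter-coded mass (i.e., to intra ratio changes):

```python
def phd_abs(ph_prev, ph_cur):
    """PHD with absolute differences, a sketch of (26)."""
    return sum(abs(a - b) for a, b in zip(ph_prev, ph_cur))

def phd_signed(ph_prev, ph_cur):
    """PHD with non-absolute (signed) differences, a sketch of (27)."""
    return sum(a - b for a, b in zip(ph_prev, ph_cur))

prev = [0.5, 0.5, 0.0]  # mass moves from the first to the third bin...
cur = [0.0, 0.5, 0.5]   # ...but the total inter-coded mass is unchanged
print(phd_abs(prev, cur))     # 1.0 - the partitioning shift is seen
print(phd_signed(prev, cur))  # 0.0 - signed differences cancel
```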
5.2.1.3 Decision
In this last sub-module of the second phase, the similarity scores previously obtained are analyzed. As
it was previously referred, high difference scores stand for a high degree of dissimilarity in the frames
analyzed; therefore, by detecting those frames which correspond to high difference scores transitions
may be detected.
In the original algorithm [34] [33], Schöffmann et al. state that a frame should be considered a candidate for an abrupt transition if its PHD is equal to or above a predefined fixed threshold (TPHD), as in (28), or if its IBR is equal to or above another fixed predefined threshold (TIBR), as in (29).

PHD(f) ≥ TPHD    (28)

IBR(f) ≥ TIBR    (29)
These candidates are added to the respective PHD or IBR candidate set, which will be provided to a post-processing procedure that transforms this candidate set into a definitive transition set. This post-processing is a three-step procedure including:
o Gradual transition detection – This step is meant to group frame candidates that seem to belong to gradual transitions. In this step, frames in the candidate set which are less than Δ frames apart from each other, as in (30), are grouped; this is to tolerate “detection holes” which span over a maximum of Δ frames. If the group obeys the size constraints, as in (31), it is considered a valid group and is added to a gradual transition candidate set, and the corresponding original abrupt candidates are removed from the set; otherwise, the group is discarded and the original abrupt candidates remain in the abrupt candidate set. There are two sets of these three parameters: one for the PHD and the other for the IBR.

| f(i+1) − f(i) | ≤ Δ    (30)

(31)
o Consecutive cut removal – This rule (32) excludes from the candidate set abrupt candidates which are too close to each other, assuming that shots have to be more than μ frames long. This check starts at the last cut candidate, which is compared to the previous cut candidate, and the latter is excluded if it is too close; this is performed until the first candidate is reached.

| f(j) − f(i) | ≤ μ  ⇒  f(i) excluded from candidate set    (32)
o IBR/PHD combination – This last step aims at combining the IBR and PHD approaches in order to create the final detection set. In their experiments, Schöffmann et al. found that PHD alone works well for cut detection but performs poorly for gradual transitions; IBR, on the other hand, works better for gradual detection but yields many false positives in cut detection. Therefore, after the previous post-processing steps, the PHD candidate cuts are added to the detected transition set, while among the IBR candidates only the gradual transitions are added to that set.
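The complete decision stage just described can be sketched as follows, assuming the per-frame PHD and IBR difference scores are already available as lists indexed by frame number. All parameter values (the thresholds, Δ, the size bounds and μ) are illustrative defaults, not the values used by Schöffmann et al., and the backward scan in the consecutive cut removal follows one possible reading of rule (32).

```python
def candidates(scores, threshold):
    """Frames whose difference score reaches the fixed threshold (eqs. (28)/(29))."""
    return [f for f, s in enumerate(scores) if s >= threshold]

def group_graduals(cands, delta, min_len, max_len):
    """Group candidates at most `delta` frames apart (eq. (30)); groups obeying
    the size constraints (eq. (31)) become gradual transitions, while the
    remaining candidates stay as abrupt (cut) candidates."""
    groups, current = [], []
    for f in cands:
        if current and f - current[-1] > delta:
            groups.append(current)
            current = []
        current.append(f)
    if current:
        groups.append(current)
    cuts, graduals = [], []
    for g in groups:
        span = g[-1] - g[0] + 1
        if len(g) > 1 and min_len <= span <= max_len:
            graduals.append((g[0], g[-1]))   # keep the group's frame range
        else:
            cuts.extend(g)                   # discarded groups stay as cuts
    return cuts, graduals

def remove_close_cuts(cuts, mu):
    """One reading of rule (32): walk the cut list from the last candidate
    backwards and drop any cut within `mu` frames of the last kept one."""
    kept = []
    for f in reversed(cuts):
        if not kept or kept[-1] - f > mu:
            kept.append(f)
    return sorted(kept)

def decide(phd, ibr, t_phd=0.5, t_ibr=0.6, delta=3, min_len=5, max_len=60, mu=10):
    """IBR/PHD combination: PHD contributes the cuts, IBR the graduals."""
    phd_cuts, _ = group_graduals(candidates(phd, t_phd), delta, min_len, max_len)
    _, ibr_graduals = group_graduals(candidates(ibr, t_ibr), delta, min_len, max_len)
    return remove_close_cuts(phd_cuts, mu), ibr_graduals
```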
5.2.2 Algorithm 2
As mentioned before, Algorithm 1 was only tested by its authors using videos encoded with the Baseline Profile; in fact, the description of the algorithm’s operation when using sequences encoded with other profiles, provided in both [34] and [33], seems to lack functionality. Therefore, a second algorithm – Algorithm 2 – was designed by the author of this Thesis, still inspired by the ideas underpinning Algorithm 1, with the main purpose of improving its performance.
5.2.2.1 Frame Description Generation
Algorithm 2 uses the same type of descriptors as proposed for Algorithm 1. Comparing the descriptors in the two algorithms, the significant differences are in the PH descriptor; these modifications aim at enhancing the previous algorithm for B frames. With this purpose in mind, two major modifications to the definition of the descriptors are proposed:
o Modification 1: Partition classification – The first major modification proposed is to classify the partitions based on their size and prediction direction; since the objective of this algorithm is to use the partition approach adopted by Algorithm 1, the size still plays a major role in these descriptors. Therefore, the B prediction type is split into interpolated (I) and backward (B) prediction types, with the skipped partitions being considered as forward partitions (either P16x16 or P8x8, depending on the partition size); this extends the histogram to 21 bins (P16x16, P16x8, P8x16, P8x8, P8x4, P4x8, P4x4, B16x16, B16x8, B8x16, B8x8, B8x4, B4x8, B4x4, I16x16, I16x8, I8x16, I8x8, I8x4, I4x8 and I4x4). In this way, the prediction direction is given more importance than it had in the original algorithm, which is based only on the prediction types.
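Assuming each partition is available as a (direction, width, height, skipped) tuple (an input format invented for this sketch, not the structure used in the actual implementation), the 21-bin classification could look like:

```python
SIZES = ["16x16", "16x8", "8x16", "8x8", "8x4", "4x8", "4x4"]

def bin_name(direction, width, height, skipped=False):
    """Return the histogram bin for one partition; skipped partitions count
    as forward (P16x16 or P8x8, depending on the partition size)."""
    if skipped:
        return "P16x16" if width == 16 else "P8x8"
    prefix = {"forward": "P", "backward": "B", "interpolated": "I"}[direction]
    return f"{prefix}{width}x{height}"

def partition_histogram(partitions, n_partitions):
    """Build the normalized 21-bin partition histogram for one frame."""
    hist = {p + s: 0 for p in "PBI" for s in SIZES}
    for direction, width, height, skipped in partitions:
        hist[bin_name(direction, width, height, skipped)] += 1
    return {k: v / n_partitions for k, v in hist.items()}
```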
Two frame descriptors were adopted and implemented from [16] for this Algorithm 3, notably:
o Prediction direction – This is used to describe the temporal dependencies of the frame under analysis; with this purpose, each frame is partitioned into 8 x 8 blocks and each block is classified according to the prediction direction used: intra, forward, backward or interpolated. This gives rise to a 4 bin histogram which is normalized by dividing each bin by the number of 8 x 8 blocks which form the frame. As for the previous algorithm, the inference procedure for the prediction direction of direct coded partitions described in Section 5.2.2.1 was implemented to classify those partitions. This is related with the Inter procedure.
o Intra prediction map – This is used to describe the spatial characteristics of a certain frame. It is constructed for two frames in a GOP, the first and the last, and contains the intra prediction encoding information (as it is done in the suspect GOP detection phase). Each prediction map starts being constructed at the beginning of the GOP (I frame) and advances through its P frames until the frame for which the map is being constructed is reached; meanwhile, every time an intra coded macroblock is found in those frames, the corresponding macroblock prediction information in the prediction map is updated. After updating this prediction map with the current frame, intra frame descriptors are generated for that prediction map using the same algorithms presented in Section 5.1.1. This is related with the Intra procedure.
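The prediction direction descriptor above reduces to a small histogram computation; in this sketch the per-block direction labels are assumed to be already inferred (including for direct coded partitions):

```python
DIRECTIONS = ("intra", "forward", "backward", "interpolated")

def direction_histogram(block_labels):
    """4-bin histogram over the frame's 8x8 block direction labels,
    normalized by the number of 8x8 blocks in the frame."""
    n = len(block_labels)
    return {d: sum(1 for b in block_labels if b == d) / n for d in DIRECTIONS}
```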
In the original algorithm in [16], another descriptor is proposed: the motion intensity for the foreground and background areas of the picture, which is used in the gradual transition detection. However, motion extraction from the H.264/AVC bit stream is not a straightforward procedure since the motion vectors are not directly available from the bit stream; instead, only the differential motion vectors are available and can be parsed from it. To compute the motion vectors, a motion vector prediction has to be inferred from neighboring partitions, which is only done in late stages of the decoding process.
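In its common case, the prediction mentioned above is the component-wise median of the motion vectors of the left, top and top-right neighboring partitions; the sketch below shows only that common case and omits the standard's many special cases (unavailable neighbors, 16x8/8x16 partitions, skipped macroblocks):

```python
def median_mv(left, top, top_right):
    """Component-wise median of the three neighboring motion vectors."""
    return tuple(sorted(component)[1] for component in zip(left, top, top_right))

def reconstruct_mv(mvd, left, top, top_right):
    """Add the parsed differential motion vector to the median prediction."""
    px, py = median_mv(left, top, top_right)
    return (mvd[0] + px, mvd[1] + py)
```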
5.2.3.2 Similarity Scores Computation
In this algorithm, four scores are calculated with the purpose of expressing the continuity and discontinuity between frames:
o Sum of intra and forward predicted block ratios for the previous frame (s1) – This expresses continuity between the previous frame and the video content before it and is calculated for every inter frame. This is related with the Inter procedure.
o Sum of intra and backward predicted block ratios for the current frame (s2) – This expresses discontinuity in the video content between the previous and current frames and is calculated for every inter frame. This is related with the Inter procedure.
o Intra block ratio (IBR) – This is the IBR for the current frame; it is only calculated in P frames and is related with the Grad procedure.
o Intra frame difference (Dintra ) – Unlike the previous scores, this is not computed for all frames;
instead, it is used to calculate differences between an intra prediction map belonging to a P
o Profile – H.264/AVC compression tools allowed for the coding of the current sequence as
specified in [6].
o Level – Coding constraints on some characteristics used for the current coded sequence.
o List of random access points (RAP) – List of random access points available; a RAP is a
frame at which the decoding process may be started; it marks the beginning of GOPs in a video
sequence.
This information is available in some of the MP4 metadata boxes and it is accessed using already
implemented functions available in libgpac. The modifications made by the author to this module
mainly aimed at providing the means to aggregate the required information and to provide it to the
system.
Extraction of Parts of the H.264/AVC Bit Stream
This sub-module was implemented by modifying the MP4box source code presented above. In the original implementation, one of the supported operations on the MP4 container was the extraction of an indicated track from the MP4 container to a file. The purpose of the modification was to optimize the software according to the current system specifications; in this context, two major differences must be highlighted:
o Output – Instead of outputting the H.264/AVC stream to a file, the modified software uses program memory, which has the advantages of speeding up the process and providing a more seamless approach, since no auxiliary files are created;
o Frame selection – While the original software was designed to extract all frames from a
selected track, using the modified software version the system can request this module to be
provided with a window of coded frames, excluding from that window all frames which are not
RAPs. This is a very useful feature since it improves the computational performance due to: i) a
reduction in the amount of data read from the file since skipped frames are not read from the file
considering the bit stream random access feature provided by the MP4 container, and ii) a
reduction in the used memory since this module only provides the frames which are required for
the current processing.
The RAP filtering procedure is very useful for the suspect GOP detection, i.e., the first phase of the developed shot transition algorithms, since RAP frames are the only ones needed for this phase. As defined in MPEG-4 Part 15 [43], these can be identified in the bit stream, since IDR frames are the only ones which can be considered as random access points.
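The frame selection behaviour can be illustrated with the hypothetical helper below, which assumes the frame table parsed from the MP4 container is available as (frame number, is IDR) pairs; this is a sketch, not the actual MP4box-derived code:

```python
def extract_window(frame_table, first, last, raps_only):
    """Return the frame numbers of the requested window; when `raps_only`
    is set, keep only IDR frames, the only H.264/AVC random access points
    (as identified per MPEG-4 Part 15)."""
    window = [n for n, is_idr in frame_table if first <= n <= last]
    if raps_only:
        idrs = {n for n, is_idr in frame_table if is_idr}
        window = [n for n in window if n in idrs]
    return window
```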
6.1.2.2 H.264/AVC Reference Decoder: Low-level Features Extraction
This module is meant to extract some low-level features from an H.264/AVC bit stream. The
H.264/AVC reference software decoder, in which this module was based on, decompresses a
H.264/AVC file into a raw YUV decompressed video file. As for the MP4 handling module, this original
software was also modified to allow a better integration in the developed shot transition detectionsystem. With this purpose, some processes were enhanced, notably:
SMNxK – The equivalent of the earlier introduced PNxM mode; sub-macroblocks of this kind are encoded using inter prediction and divided into prediction blocks of NxK samples (N and K being 4 or 8);
IBLOCK – Sub-macroblock encoded using only intra prediction;
PSKIP – Sub-macroblock which uses skip or direct prediction.
Partition inter prediction direction list – This list contains the prediction direction of each of
the 16 x 16, 16 x 8, 8 x 16 or 8 x 8 partitions in a PNxM macroblock.
Intra chrominance prediction mode – For intra macroblocks, this feature stores the intra
prediction mode used to encode the chrominance in the current macroblock.
Intra luminance prediction mode list – This list contains the prediction types used for
predicting each partition block in the current intra macroblock.
As can easily be noticed, this is a general purpose structure, i.e., it does not depend on the frame type and the macroblock information does not depend on the macroblock type, which is not memory efficient, contrary to what happens in the main application. This is because passing structures from C to C#, and vice-versa, is not a straightforward procedure, due to their different nature, and therefore does not permit much flexibility. However, this is not a big issue since the life-time of this structure in C# memory is very limited (after receiving this structure, the main application creates a more efficient frame object, erasing the previous structure).
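The general purpose structure discussed above can be pictured as a single flat record carrying fields for every macroblock type, regardless of which ones apply; the field names in this sketch are illustrative, not those of the actual C structure:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class MacroblockInfo:
    # One flat record per macroblock; unused fields stay at their defaults.
    mb_type: str                                               # e.g. "PNxM", "IBLOCK", "PSKIP"
    inter_directions: List[str] = field(default_factory=list)  # one entry per inter partition
    intra_chroma_mode: Optional[int] = None                    # intra macroblocks only
    intra_luma_modes: List[int] = field(default_factory=list)  # one entry per intra block
```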
6.1.3 Application Structure
The developed application is composed of three main parts, entirely implemented by the author of this Thesis:
o Main form – This is the entry point of the application; it includes the main Windows Form that is
used for the GUI and some classes which are used to control that GUI.
o Player – This is a library which can be used to open a video window with a specific position and
dimensions.
o Core library – This is a library containing several classes which are used in the shot transition
detection system.
6.1.3.1 Main Form
This part of the application mainly covers the GUI. It defines some components and operations to interact with the user and comprises the form itself and two classes which can be instantiated to encapsulate ZedGraph charts (one for histograms and another for line charts).
6.1.3.2 Player
In the context of this work, a player was needed to display the video under analysis and the detection results. For this reason, an independent library was designed which displays a video player window. The main class of this library is CPlayer; this class can be instantiated to construct and encapsulate a filter graph to display a video at a certain position. This class has some public methods to command the player (Stop, Pause, Seek, Play, etc.). It can also export snapshots of the frames being displayed, which will be used by the Main Form.
As previously mentioned, the DirectShow library does not include filters for handling MPEG-4 streams; those required to support these streams have been developed by third parties and have to be separately installed by the user. Only one combination of filters (an MP4 file parser filter and an H.264/AVC decoder filter are needed) was found which ensures the good functioning of the player component, namely the support for accurate frame seeking; this combination is formed by the Haali Media Splitter [52] and the CoreAVC [53] codec.
6.1.3.3 Core Library
This component contains some classes which are needed by many shot transition detection operations and also some not directly related to the detection, such as the classes needed to read/write XML files in the TRECVID format containing the shot transition ground truth or the shot structure as detected by the application.
The XML files are used both to save the detection results and to load the corresponding ground truth for performance evaluation. The format for the XML files was adopted from TRECVID [7]; in this format, each transition is described according to its type (abrupt, dissolve, FOI or other), preFnum and postFnum. In Figure 6.1, the Document Type Definition (DTD) which defines the structure of such XML files is presented; Figure 6.2 shows an excerpt of a ground truth XML file.
Figure 6.1 – DTD for the ground truth XML file.
Figure 6.2 – Excerpt of an XML file containing the ground truth transition descriptions of a video sequence.
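Reading such a ground-truth file needs only the standard library; note that the exact element and attribute names used below (`trans`, `preFNum`, `postFNum`) are assumed from the TRECVID format and may differ in capitalization from the excerpt in Figure 6.2:

```python
import xml.etree.ElementTree as ET

def load_transitions(xml_text):
    """Return (type, preFNum, postFNum) triples from a ground-truth file."""
    root = ET.fromstring(xml_text)
    return [(t.get("type"), int(t.get("preFNum")), int(t.get("postFNum")))
            for t in root.iter("trans")]
```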
If the last analysis performed was a suspect GOP detection, this window lists the suspect/non-suspect
GOPs (and specifically their IDR frame) as displayed in Figure 6.6; if the last analysis was transition
detection, this window lists the transitions as depicted in Figure 6.5.
Figure 6.6 – Suspect GOP mode in the video thumbnail.
Each element in the list shown represents a frame; for each frame, the frame itself, the frame number and the frame type may be displayed (the frame type only when the frame is stored in the program memory; later this will be referred to as interactive mode).
The user can select some frames to trigger internal events which will update some items in the GUI, namely the player (which moves to that frame) and the charts (which can highlight the selected frames in the line chart or display their descriptions in the histogram chart).
To control the frames which appear in the video thumbnail, a set of checkboxes is provided. The two possible sets of checkboxes in this window are displayed in Figure 6.7.
Figure 6.7 – Two examples of the video thumbnail control component.
6.2.3 Algorithm and Charts Control
This window groups some controls which are useful to control the shot transition detection algorithm tasks and the visualization of the results. This window is shown in Figure 6.8; it is organized as follows:
o First Phase Processing – This window regards the first phase of the shot transition detection algorithm and it has two types of tabs:
Actions & Results – Here, the user may load the frames necessary for the first phase analysis and, afterwards, run the analysis. After the analysis, some statistics about the results are shown in this tab;
Parameter Definition – This tab is divided according to the three phases of the algorithm
processing, i.e., feature extraction, similarity score and decision; it can be used to control
source to be used since no post-processing is performed to the MPEG-1
o
r profiles were created to generate these bit streams which are
Baseline 512kbs – The options changed from the default values in the MeGUI tool are the
In this chapter, the performance evaluation of the developed system is presented. First, the video collection used for this evaluation is introduced; afterwards, the performance evaluation procedures are defined and, by the end of the chapter, the results are presented.
7.1 Video Collection
In this performance evaluation, the video collection from TRECVID 2007 was adopted [7]. This collection consists of 17 MPEG-1 encoded videos, yielding a cumulative length of 6 hours. The videos have a luminance resolution of 288x352 pixels, a frame rate of 25 fps and are encoded at 1157 kbps. As presented in Section 3.2.1.4, this video set contains 2,463 transitions: 2,236 cuts (90.8%), 134 dissolves (5.4%), 2 fade-out/-ins (<0.1%) and 91 other special effects (3.7%). The ground truth provided for TRECVID was also used without any modification.
For the purpose of the work presented in this Thesis, the test videos had to be recompressed using the H.264/AVC standard. For this re-encoding process, the MeGUI tool [54], [55] was used; this tool is basically a front-end for many media coding related tools. The re-encoding process using the MeGUI tool consisted of the following steps:
An Avisynth script [56] was created to be fed to the H.264/AVC encoder; this kind of script
specifies how the original file (MPEG-1 coded) is to be used. In this case, the created scripts
only specify the
decoded video.
The x264 encoder (version: 949 – Jarod’s patched build) was used to create the next
The GPAC’s mp4Box is, finally, used to encapsulate the created bit stream into an mp4 file.
Average Bitrate = 512 kbps;
Scene Change Sensitivity;
Allow P4x4 partitions;
Main 512kbs – Besides the changes made in the Baseline 512kbs profile defined above, the additional options changed from the default values are:
Number of B frames (between I/P frames) = 2;
Weighted Bidirectional Prediction = Enabled;
Bidirectional Motion Estimation = Enabled;
B Frame Mode = Auto (can use both temporal and spatial direct);
o
7.2 Performance Evaluation Procedures
Most of the shot detection algorithms in the literature evaluate their performance based on two main metrics, Precision and Recall; these metrics have been defined in Chapter 1. They are usually computed separately for abrupt and gradual transitions, since the detection difficulty and, usually, the algorithm used for the detection of each kind of transition are rather different. Therefore, presenting the results in this manner – separate precision and recall for abrupt and gradual transitions – provides a more meaningful and sequence independent assessment, since the overall recall and precision usually depend on the ratio of gradual and abrupt transitions present in each sequence.
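For reference, the two metrics reduce to the following computation over the per-class counts of correct detections (TP), false alarms (FP) and missed transitions (FN):

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN).
    Degenerate (all-zero) denominators yield 0.0 by convention."""
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall
```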
As the proposed system performs the shot detection following a two-layer hierarchical detection procedure and the nature of the detection results differs for these two phases, two different performance evaluation procedures will be used. In the following sections, these evaluation procedures will be described: first, the procedure used to perform the evaluation of the transition detection is presented and, afterwards, the procedure used for performing the evaluation of the suspect GOP detection is explained. Although this order may be unexpected, since the procedure for the second phase is presented before the procedure for the first phase, this sequence is justified since the original evaluation procedure was designed for transition detection (the second phase) and the second procedure (for the first phase) was designed by the author based on the first (for the second phase).
7.2.1 Transition Detection Evaluation Procedure
The first performance evaluation procedure presented here was adopted from TRECVID [7]. In the context of this work, this procedure will be used to evaluate the second phase transition detection algorithms and the overall shot transition detection system. Every system detected or ground truth transition is characterized by three attributes:
o type – The transition type (abrupt, dissolve, FOI or other);
o preFnum – The number of the last frame before the transition;
o postFnum – The number of the first frame after the transition.
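One simplified reading of the matching step (not the exact TRECVID rules) is to count a detected transition as correct when its [preFnum, postFnum] interval overlaps that of a still unmatched ground-truth transition:

```python
def match_transitions(detected, ground_truth):
    """Greedy one-to-one interval matching; each item is a
    (preFnum, postFnum) pair. Returns the (TP, FP, FN) counts."""
    unmatched = list(ground_truth)
    tp = 0
    for d_pre, d_post in detected:
        for g in unmatched:
            if d_pre <= g[1] and g[0] <= d_post:  # intervals overlap
                unmatched.remove(g)
                tp += 1
                break
    fp = len(detected) - tp
    fn = len(unmatched)
    return tp, fp, fn
```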
Figure 7.3 – Recall/Precision for the LUM features using a median-based threshold.
Figure 7.4 – Recall/Precision for the LUMCOL type features using a median-based threshold (series: LUMCOL-FRM-SAD, LUMCOL-FRM-VPT, LUMCOL-WIN3x3-SAD, LUMCOL-WIN3x3-VPT, LUMCOL-BLK3x3-SAD, LUMCOL-BLK3x3-VPT, LUMCOL-WIN1x1-SAD, LUMCOL-WIN1x1-VPT).
From the observation of the charts above, it is possible to conclude:
o The results for the LUMCOL features are better than those of the LUM feature;
o Contrary to what occurs using the fixed threshold, in this case FRM seems to perform worse than the block based approaches (WIN3x3, BLK3x3 and WIN1x1). The usage of the BLK3x3 granularity still yields a performance very similar to that obtained using WIN3x3, which is the granularity that yields the best performance;
o SAD and VPT yield very similar performances;
o The best solution using this median-based threshold is the combination LUMCOL + WIN3x3 + SAD.
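The median-based threshold evaluated above can be sketched as follows; the window length and the multiplier alpha are illustrative parameters, not the values defined in the corresponding equations:

```python
from statistics import median

def median_threshold_detect(scores, half_window=5, alpha=2.0):
    """Flag frame f when its difference score reaches alpha times the
    median of the scores in a sliding window of half_window frames on
    each side of f (truncated at the sequence boundaries)."""
    flagged = []
    for f, s in enumerate(scores):
        lo = max(0, f - half_window)
        hi = min(len(scores), f + half_window + 1)
        if s >= alpha * median(scores[lo:hi]):
            flagged.append(f)
    return flagged
```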
7.3.1.3 Average-based Threshold Detection
The third threshold type tested was based on the average of the difference scores over a sliding window, as depicted in equations (19), (20) and (21). As for the median-based threshold, this sliding
This chapter finalizes the report by presenting a summary of the addressed topics, the main conclusions of the work described in this document and some suggestions for future work on this subject.
8.1 Summary and Conclusions
Chapter 1 introduced the motivations for the problem addressed in this Thesis: mainly due to the increase in digital video availability, applications providing means to browse and consume large video collections, such as content-based video retrieval and summarization applications, are gaining relevance. As shot detection is one of the fundamental steps of these types of applications, it is a problem which needs to be addressed and solved. Moreover, as digital video is usually compressed, if shot transition detection is performed directly in the compressed domain, i.e., without having to decompress the video, a significant reduction of computational complexity can be achieved. Among the video coding standards, H.264/AVC is the latest standard and its popularity is growing. This standard achieves great compression efficiency at the cost of increased complexity, when compared to previous standards, which strengthens the need for processing the video directly in the compressed domain. A short overview on this standard is provided in Chapter 2.
Chapter 3 structured and presented the shot detection problem and the solutions found among
relevant literature. Some of the most relevant solutions found were also described in more detail.
In Chapter 4, the developed system for shot detection was first introduced. This chapter described the system architecture and provided a functional description of each of its modules. It also motivated the decision to adopt a hierarchical procedure, as suggested in [35], based on first detecting the GOPs suspected of containing transitions (suspect GOP detection) and, afterwards, analyzing those GOPs more thoroughly in order to find the exact placement of the transitions (transition detection).
Chapter 5 described in detail the various processing algorithms developed in each of the system’s modules to perform the shot transition detection. For the suspect GOP detection phase, the algorithm in [35] was implemented, along with some modifications proposed by the author to test different approaches. For the transition detection phase, four algorithms were designed. The first algorithm is that proposed in [34]; it compares successive inter frames using their partition sizes and types, and the second is an improvement of this first algorithm. The third algorithm was based on [16]; it inspects intra prediction modes and the direction of the used reference frames and it was designed to compare successive frames in a sequential way, as happens in algorithms 1 and 2. The fourth algorithm was based on the hierarchical detection also proposed in [16], with some modifications proposed by the author of this Thesis; it analyses frames using the same features as in algorithm 3, but it does so in a different order, exploiting the hierarchical reference usage. This is done to analyze fewer frames and to improve the detection accuracy.
Chapter 6 intended to provide the reader with some relevant implementation details of the developed system. The GUI developed for this system is also presented and explained.
Chapter 7 presented a comparison of the results obtained for the several implemented algorithms over a representative dataset adopted from TRECVID 2007.
This work aimed mainly at the design, implementation, evaluation and comparison of shot transition detection solutions in the H.264/AVC compressed domain and at the design and implementation of a shot transition detection application. With this purpose, the algorithms in [35], [34] and [16] were implemented, along with some modifications proposed by the author of this Thesis.
For the suspect GOP detection phase, the obtained results were below those expected and reported for the original algorithm [35]. Despite that fact, the introduction of this phase allowed several GOPs to be skipped from a detailed analysis in the second phase. Many modifications were proposed to the original algorithm which yielded improvements in the algorithm performance.
For the transition detection phase, four algorithms were implemented. Many conclusions can be drawn from the tests carried out and the presented results. Namely, inspecting inter partition sizes does not yield better detection performance when compared to the simpler analysis of the inter prediction direction. Also, the usage of hierarchical detection inside the GOP improves performance over analyzing successive frames. Finally, there are two main aspects which may need a better solution: first, gradual transitions are very difficult to detect based only on the ratio of intra prediction usage; second, the inference of the prediction direction to be used and, although a solution is proposed in [16], it is still a problem needing a better solution, since it is the main problem limiting the transition detection.