
International Journal of Digital Content Technology and its Applications Volume 4, Number 5, August, 2010

Watching, Thinking, Reacting:

A Human-Centered Framework for Movie Content Analysis

Anan Liu*, Zhaoxuan Yang *Corresponding Author School of Electronic Engineering, Tianjin University, Tianjin, China 300072

{liuanan,yangzhx}@tju.edu.cn doi: 10.4156/jdcta.vol4.issue5.3

Abstract

In this paper, we propose a human-centered framework, “Watching, Thinking, Reacting”, for movie content analysis. The framework consists of a hierarchy of three levels. The low level represents human perception of external stimuli, where a Weber-Fechner Law-based human attention model is constructed to extract movie highlights. The middle level simulates human cognition of semantics, where semantic descriptors are modeled for automatic semantic annotation. The high level imitates human action based on perception and cognition, where an integrated graph combining content and contextual information is proposed for movie highlight correlation and recommendation. Moreover, three recommendation strategies are presented. The promising results of subjective and objective evaluation indicate that the proposed framework can not only make computers intelligently understand movie content, but also provide a personalized movie highlight recommendation service that effectively leads audiences to preview new movies in an individualized manner.

Keywords: Human-centered, Movie, Personalized, Recommendation

1. Introduction

The movie industry is an active producer of video. Every year, about 4,500 movies are released around the world, spanning approximately 9,000 hours of video [1]. With such a massive amount of information, there is a great demand for technologies that enable viewers to access new movies conveniently and therefore facilitate film advertisements.

Previews or trailers are film advertisements for feature films that will be exhibited in the future at a cinema. Currently, there are two popular trailer web sites [2, 3]. Trailers consist of a series of highlights: selected exciting, funny, or otherwise noteworthy shots from the movie being advertised [4]. The purpose of the trailer is to enable audiences to conveniently browse the video content and to attract them to the movie. Therefore, an attractive trailer is essential to enhance the influence of a movie.

In order to reduce human involvement in trailer production, many researchers in the field of content-based video analysis have studied video summarization techniques for the automatic generation of trailers. Video summaries provide condensed and succinct representations of the content of a video stream through a combination of still images, video segments, graphical representations, and textual descriptors [5]. There are two different types of video summarization techniques: static video abstracts and dynamic video skimming [6]. A static abstract, also known as a storyboard, is composed of key frames arranged in temporal sequence. Many key frame extraction algorithms (e.g., [7-10]) have been proposed and can be used for static abstract generation. Many other researchers have worked on dynamic video skimming because it can be more informative to viewers. Much literature [12-15, 38] has addressed dynamic video skimming by integrating visual, audio, and textual information. Figure 1 shows an interactive system for movie navigation and editing that employs both techniques [11]. This is a two-layer system. Layer 1 of the system is a storyboard that displays a static video abstract, which enables viewers to conveniently browse through the movie content. If a viewer wants to know the specific content of a certain clip, they can double-click the corresponding key frame to launch a video player and play the clip. Layer 2 assists a user in selecting highlights by dragging and dropping the corresponding key frames in a video editor. These clips are concatenated from left to right to form a video summary. The resulting movie summaries are then provided for previewing the movie. However, previous video summarization techniques have two major limitations.


First, most previous work focused on video content analysis with multimodal information without considering the essence of human understanding. Second, the relatively fixed content and style of a summary may not suit the diverse tastes of different audiences and can thereby lose potential viewers.

Figure 1. An illustration of the interactive system for static and dynamic video summary generation for the movie, Fearless.

In this paper, we propose a human-centered framework for movie content analysis by imitating human behavior from Watching and Thinking to Reacting. The novelty of the framework lies in imitating the three stages of human understanding of movie content and providing adaptive movie highlight recommendation to facilitate previewing movie content in a personalized manner. The framework consists of the following three hierarchies. The low level represents human perception of external stimuli: a Weber-Fechner Law-based human attention model is constructed to extract movie highlights. The middle level simulates human cognition of semantics: semantic descriptors are modeled for automatic semantic annotation. The high level imitates human action based on perception and cognition: motivated by the key role recommendation plays in long tail mining [16], we employ an integrated graph with content and contextual information for movie highlight recommendation in place of traditional video summarization techniques. Furthermore, we adopt a recommendation scheme with three different strategies: global recommendation, related recommendation, and long tail recommendation. We evaluate the proposed framework using both objective and subjective criteria. The evaluation results indicate that: 1) with this framework, the proposed system can identify the salient parts and the “Who, What, Where, How” of what happens in a movie; and 2) the system can incorporate content-based and user-based sources of information to discover the latent relationships among highlights and arrange highlights in specific sequences to accommodate individual taste.

The remainder of the paper is organized as follows. In Section 2, we present the proposed framework for movie analysis in detail. Experimental results are illustrated in Section 3, followed by conclusions and future work in Section 4.

2. The human-centered framework

Humans usually experience a process from Watching and Thinking to Reacting when facing the unknown. Motivated by research in cognitive physiology, the proposed framework consists of three levels that imitate human behavior from perception and cognition to action. As shown in Figure 2, these three parts correspond to the flow of information through the brain: from the external world, to the higher regions that underlie cognition, and finally to the motor systems that control our actions [17]. With the proposed framework, we realize automatic movie content analysis that simulates the cognitive process of the human brain.


Figure 2. The proposed framework and its foundation in information processing in the human brain.

2.1. Low level hierarchy for human perception

As a neurobiological concept, perception means the awareness of the elements of an environment through physical sensation [18]. Human beings receive and process certain types of external stimuli and adjust the concentration of their mental attention on the storyline accordingly. Therefore, the low level hierarchy of the framework deals with representing the variation of human perception using the Weber-Fechner Law-based human attention model and applying it to movie highlight extraction.

2.1.1. The Weber-Fechner Law based Human Attention Model

The Weber–Fechner law [19] describes the relationship between the physical magnitude of a stimulus and its perceived intensity [20]. In essence, it states that human perception is not straightforwardly related to the external stimulus, I, but to its change, ΔI. Therefore, to construct a proper attention model, we must consider information in the temporal domain. The Weber-Fechner Law based human attention model for an entire movie (M) is formulated as follows:

M = k * log[HAM(t)] + C                                  (1)

where HAM(t) is the human attention model for a single shot at time t, and k and C are constants. The human attention model for a single shot is comprised of three sub-models, namely the visual, auditory, and textual sub-models, represented as follows:

VAM(t) = Σ_{i=1}^{l} w_i * VF_i(t)
AAM(t) = Σ_{j=1}^{m} w_j * AF_j(t)                       (2)
TAM(t) = Σ_{k=1}^{n} w_k * TF_k(t)
HAM(t) = Fusion[VAM(t), AAM(t), TAM(t)]

where VAM(t), AAM(t), and TAM(t) are the visual, aural, and textual sub-models, respectively; w_i, w_j, and w_k denote their weights; VF_i(t), AF_j(t), and TF_k(t) refer to the features in the visual, audio, and textual modalities, respectively; and Fusion[·] is the fusion operation. Here, the unit of t is the shot, the basic unit for movie content analysis, and the continuous change of t corresponds to consecutive shots in one video. Any extractable features in the aforementioned three modalities can be integrated into our framework because of its extensibility.

2.1.2. Computation Methodology


In this section, we first introduce the low level features in the visual, aural, and textual modalities that directly influence human perception. Then we discuss the quantitative representation of these features using a uniform metric of information quantity for the proposed human attention model.

(a) Selection of low level features

In the visual modality, drastic and complex motion of objects has a great impact on viewers' attention. Therefore, motion energy (ME) and motion complexity (MC) [21] are used as visual features.

In the aural modality, the accompanying sound is very important for expressing semantic meaning. Loud volume and a fast audio pace, in particular, convey strength and haste to viewers. Therefore, audio energy (AE) and audio pace (AP) [11] are calculated as auditory features.

In the textual modality, captions with more words usually convey richer semantic information [39]. To ensure the temporal correspondence between captions and the video content, we use the video caption extraction method in [22] and the Hanwang OCR SDK, a commercial software package for OCR. We use the text length (TE) to represent a caption's importance. In addition, we formulate text information (TI) as the ratio between the total number of nouns, verbs, and adjectives and the total number of words. Here, we use the lexical analysis system ICTCLAS, a free software application, to implement word segmentation for extracting the nouns, verbs, and adjectives that are more informative to viewers.
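As a rough illustration of how TE and TI can be computed once the captions have been extracted and segmented, the following sketch assumes the OCR and lexical analysis steps have already produced (word, part-of-speech) pairs; the tag names are only placeholders for whatever tag set the analyzer emits.

```python
def textual_features(tagged_words):
    """Compute (TE, TI) for one shot's caption.

    TE: text length, i.e., the number of words in the caption.
    TI: ratio of nouns, verbs, and adjectives to all words.
    tagged_words: list of (word, tag) pairs from a lexical analyzer;
    the tags "n", "v", "a" are assumed placeholders for noun/verb/adjective.
    """
    te = len(tagged_words)
    if te == 0:
        return 0, 0.0
    informative = sum(1 for _, tag in tagged_words if tag in ("n", "v", "a"))
    return te, informative / te

# Example with a hypothetical segmented caption.
print(textual_features([("he", "r"), ("wins", "v"), ("the", "u"), ("fight", "n")]))  # (4, 0.5)
```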

(b) Formulation

Quantitative representation of human perception on a scientific basis has long been an unsolved problem. However, cognitive informatics, a newly emerging subject, has established a theoretical foundation, on the basis of which Wang proposed that information quantity is a more proper measure of human perception [23]. Therefore, by calculating the information quantities in the visual, aural, and textual modalities, we can represent human perception quantitatively.

For the features mentioned above, we can see that ME, AE, and TE represent energy, while MC, AP, and TI reflect frequency. They belong to two categories and are not directly additive. However, they can be converted into information quantities and integrated as follows:

VAM(t) = ME(t) * [-log MC(t)]
AAM(t) = AE(t) * [-log AP(t)]                            (3)
TAM(t) = TE(t) * [-log TI(t)]

Here we consider the features representing energy as the weights of those representing information.
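The per-shot sub-models of Eq. (3) then reduce to a small amount of arithmetic. The sketch below assumes the six low level features have already been extracted for the shot and that the frequency-like features (MC, AP, TI) are normalized into (0, 1] so that the negative logarithm yields a non-negative information quantity.

```python
import math

def information_quantity(energy, frequency):
    # Eq. (3): attention value = energy * (-log frequency),
    # with frequency assumed to lie in (0, 1].
    return energy * (-math.log(frequency))

def sub_models(me, mc, ae, ap, te, ti):
    """Return (VAM, AAM, TAM) for one shot from its low level features."""
    vam = information_quantity(me, mc)  # visual: motion energy, motion complexity
    aam = information_quantity(ae, ap)  # aural: audio energy, audio pace
    tam = information_quantity(te, ti)  # textual: text length, text information
    return vam, aam, tam
```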

A linear fusion method can be used to formulate the Weber-Fechner Law based human attention model for an entire movie (M) as follows:

M = k * log[HAM(t)] + C
  = k * log[ w_1 * (VAM(t) - μ_V)/σ_V + w_2 * (AAM(t) - μ_A)/σ_A + w_3 * (TAM(t) - μ_T)/σ_T ] + C,
    with Σ_{i=1}^{3} w_i = 1                             (4)

where μ_V, μ_A, and μ_T are the averages of VAM(t), AAM(t), and TAM(t), respectively, and σ_V, σ_A, and σ_T are their standard deviations. The weight w_i denotes the importance of each attention sub-model and can be flexibly set depending on specific needs. k and C are constants used to scale M into [0, 1].

2.1.3. Movie Highlight Detection

In filmmaking terminology, the structure of a movie can be described by film grammar, which is "comprised of a body of 'rules' and conventions" that "are a product of experimentation, an accumulation of solutions found by everyday practice of the craft" [24], and which results from the fact that films are crafted, built, and shaped to convey a purpose. According to film grammar, a movie is comprised of the following elements:


• A frame is a single still image. It is analogous to a letter.
• A shot is a single continuous recording made by a camera. It is analogous to a word.
• A scene is a series of related shots. It is analogous to a sentence. The study of transitions between scenes is described by film punctuation.
• A sequence is a series of scenes which together tell a major part of an entire story, such as that contained in a complete movie. It is analogous to a paragraph.

Following film grammar, the director, the cinematographer, and the editor need to carefully structure the building blocks of the following four filming elements: 1) the plots, characters, and dialogues in a script; 2) the instruments, notes, and melody in music; 3) the volume, bass, treble, and sound effects; and 4) the basic visual components (visual effects) [25]. Therefore, movie highlights are the shots that contain sufficient information and remarkable changes in multiple modalities, and the detection of movie highlights should reflect human perception.

With the proposed human attention model, an importance score for each shot can be calculated. We then obtain the attention flow plot, which depicts the development of the storyline. Moreover, the attention flow plot is smoothed with a Gaussian filter for two reasons. First, there are no drastic changes in a continuously developing storyline. Second, human perception changes gradually because of memory retention. Based on the attention flow plot of each sub-model, we implement a predefined-threshold approach for highlight detection: (a) within one shot, we count the number, N, of peaks over the threshold Th1; (b) if N is larger than the threshold Th2, the shot is marked as a highlight.
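A minimal sketch of this detection step is given below. It assumes each shot is represented by a sequence of attention values (e.g., frame-level or sub-shot values of one sub-model); the Gaussian smoothing width and the exact peak definition are implementation choices not fixed by the text.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def is_highlight(attention, th1, th2, sigma=2.0):
    """Decide whether one shot is a highlight.

    attention: attention values within the shot (one sub-model's flow).
    th1: peak threshold; th2: minimum number of peaks above th1.
    sigma: width of the Gaussian smoothing filter (assumed value).
    """
    a = gaussian_filter1d(np.asarray(attention, dtype=float), sigma)
    # count local maxima that exceed th1
    peaks = sum(
        1 for i in range(1, len(a) - 1)
        if a[i] > th1 and a[i] >= a[i - 1] and a[i] >= a[i + 1]
    )
    return peaks > th2
```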

2.2. Middle level hierarchy for human cognition

Cognition is a mental process or ability of reasoning and judgment based on awareness and perception [23]. For a movie, audiences are usually impressed by the important semantics (Who, What, Where, and How) in the highlights through "thinking" as well as "watching". Therefore, the middle level hierarchy of the framework deals with semantic analysis and automatic annotation of these four elements of movie content.

2.2.1. Who

"Who" refers to the main characters in a movie. To detect the main characters, we propose a voting-based method. We apply face detection and recognition algorithms [26] to key frames and cluster the results; each cluster corresponds to one character. By majority voting, the most important characters can be found, and the occurrence of the main characters in each shot or scene can be annotated.
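As one possible reading of this voting scheme, the sketch below assumes the face detection, recognition, and clustering steps have already labeled each detected face with a cluster id; the function then votes for the main characters and records where they occur.

```python
from collections import Counter

def main_characters(face_occurrences, top_n=3):
    """Voting-based main-character detection.

    face_occurrences: list of (shot_id, cluster_id) pairs, where cluster_id
    labels one face cluster (assumed to correspond to one character).
    Returns the top_n character clusters and their per-shot occurrences.
    """
    votes = Counter(cluster_id for _, cluster_id in face_occurrences)
    main = [c for c, _ in votes.most_common(top_n)]
    occurrences = {}
    for shot_id, cluster_id in face_occurrences:
        if cluster_id in main:
            occurrences.setdefault(shot_id, set()).add(cluster_id)
    return main, occurrences
```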

2.2.2. What

"What" denotes the semantics of the movie story. Music, dialogue, and exciting scenes such as action and warfare usually account for most of a film. Therefore, we implement music, dialogue, action, and warfare detectors for semantic scene annotation over the entire movie.

(a) Music detector

We employ an audio classification method for music scene detection. The audio stream is first segmented into non-overlapping 20-ms short-time frames. Then five frame-level audio features (short-time energy, short-time zero-crossing rate, frequency energy, sub-band energy ratio, and Mel-frequency cepstral coefficients) are extracted to characterize each short-time frame. Finally, each audio clip is classified into one of four classes, silence, speech, music, and others, using the SVM-based approach in [27].
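The framing step and two of the five frame-level features can be sketched as follows; the remaining features (frequency energy, sub-band energy ratio, MFCCs) and the SVM classifier would come from a dedicated audio/ML library, so this is only an assumed skeleton rather than the exact pipeline of [27].

```python
import numpy as np

def frame_signal(samples, sample_rate, frame_ms=20):
    """Split a mono signal into non-overlapping 20-ms frames."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    return np.asarray(samples[:n_frames * frame_len]).reshape(n_frames, frame_len)

def frame_features(frame):
    """Two of the five frame-level features used for classification."""
    energy = float(np.mean(frame ** 2))                        # short-time energy
    zcr = float(np.mean(np.abs(np.diff(np.sign(frame)))) / 2)  # zero-crossing rate
    return [energy, zcr]

# A clip-level classifier could then be trained with, e.g., scikit-learn:
#   from sklearn.svm import SVC
#   clf = SVC().fit(train_features, train_labels)  # labels: silence/speech/music/other
```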

(b) Dialogue detector

Unlike previous methods that rely mainly on audio information, we integrate the aural and textual modalities for dialogue detection, because audio in movies is quite complicated and difficult to classify. First, we apply the audio classification method above to detect speech scenes. Then, we use the textual attention sub-model to filter these candidates and obtain dialogue scenes.

(c) Action detector

In an action scene, there usually exist drastic and complicated object motion together with loud and fast-paced audio. Therefore, we integrate the visual features, motion intensity and motion complexity, and the audio features, audio energy and audio pace, for action scene detection, as reported previously in [11].


(d) Warfare detector

A warfare scene contains gunfire in addition to characteristics similar to those of an action scene. Therefore, based on the action detector, we model an SVM-based gunfire descriptor with the method in [28] and combine the two detectors for warfare detection.

2.2.3. Where

"Where" refers to the spatial setting of a scenario. We use the SVM-based concept modeling framework [29] that we proposed previously to model the related concepts defined in LSCOM [30], including indoor, outdoor, river, ocean, hill, and forest. These classifiers are applied to the key frames of movies to determine where the scenarios take place.

2.2.4. How

"How" relates to affective understanding of movie content. Due to the seemingly inscrutable nature of emotions and the difficulty of bridging the affective gap [31], affective analysis of videos has become a challenging research area. We use the approach grounded in psychology and cinematography with multiple audiovisual cues in [32] for affective categorization into Neutral, Fear, Joyous, Surprise, Anger, and Sad.

2.3. High level hierarchy for human action

Based on the first two phases, movie highlight extraction and semantic annotation are accomplished, presenting novel content through machine intelligence. Thereafter, people usually have a reaction to the movie, at least "like or dislike", which is a useful reference for future audiences. Research in consumer psychology has shown that product adoption is synonymous with the will to acquire novel information and is influenced by preferences and the external environment [33]. Motivated by this theory, we consider both content and contextual information for personalized movie navigation. In the high level hierarchy, an integrated graph is proposed for personalized movie highlight recommendation. The proposed method consists of three steps: 1) constructing an integrated graph by correlating individual candidate highlights with both semantic annotation and user preference; 2) grouping the candidate highlights into communities to explore their relationships; and 3) designing a heuristic scheme for personalized recommendation.

2.3.1. Integrated Graph Construction

Both user preference and semantic annotation of highlights are utilized to generate bipartite networks for the integrated graph construction. A user's preference is collected by timing the interval between adjacent selections of highlights while the viewer previews the selected highlight. If the interval is larger than a preset threshold, the user and the highlight are linked, as shown in the left part of Figure 3(a), which means the user likes this highlight. In the same way, the semantic detectors are utilized to construct a bipartite network that models the relationship between semantic annotations and highlights, as shown in Figure 3(b).
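A simple sketch of this preference-collection step is shown below; the record format and the threshold value are assumptions, since the paper only states that a long dwell interval is interpreted as liking the highlight.

```python
def preference_links(view_log, min_interval_s):
    """Build user-highlight links for the preference bipartite network.

    view_log: (user_id, highlight_id, dwell_seconds) records, where
    dwell_seconds is the interval before the user selects the next highlight.
    A link is created when the dwell time exceeds the preset threshold.
    """
    return {
        (user_id, highlight_id)
        for user_id, highlight_id, dwell_seconds in view_log
        if dwell_seconds > min_interval_s
    }
```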

We utilize Resource-Allocation Dynamics (RAD) to project each bipartite network onto an undirected graph. RAD [34] allows resource to flow from one node to another over the bipartite network and thereby quantifies the correlation between highlights. As shown in the middle and right parts of Figure 3(a) and (b), two undirected graphs and the corresponding weight matrices W1 and W2 are generated. Because the correlation between two highlights depends on both user preference and semantic annotation, the latent relationship between them can be integrated as follows:

W = W1 .* W2                                             (5)

where '.*' denotes element-wise multiplication. The matrix W represents the relationships among individual highlights in the integrated graph and can be used for information discovery and recommendation.
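The projection and integration can be sketched as below. The resource-allocation step follows a simplified, symmetric reading of RAD [34], in which every user (or semantic label) spreads one unit of resource evenly among the highlights linked to it; the exact normalization in the original formulation may differ.

```python
import numpy as np

def rad_projection(biadjacency):
    """Project a highlight-by-context 0/1 matrix onto a highlight-highlight
    weight matrix via resource allocation through shared context nodes."""
    b = np.asarray(biadjacency, dtype=float)   # rows: highlights, cols: users or labels
    degree = b.sum(axis=0)
    degree[degree == 0] = 1.0                  # avoid division by zero
    w = (b / degree) @ b.T                     # resource received from shared contexts
    np.fill_diagonal(w, 0.0)
    return w

def integrate(w1, w2):
    """Eq. (5): element-wise product of the two projected weight matrices."""
    return np.asarray(w1) * np.asarray(w2)
```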


2.3.2. Highlight Community Discovery

Based on the integrated graph, highlights can be grouped for community discovery. However, the complicated linkage in the graph usually leads to highly overlapping cohesive groups of nodes because of people's multiple tastes. To uncover the modular structure of the complex graph, we adopt the method in [35], which analyzes the main statistical features of the interwoven sets of overlapping communities, to not only discover k-clique communities but also make the members of each community reachable through well-connected subsets of nodes.

Figure 3. (a) Graph construction with user preference; (b) Graph construction with semantic annotation.

An important parameter for community division is k, the clique size. A lower k value yields communities with lower link density and therefore larger communities, while a higher k value yields more cohesive communities with higher link density. For movie highlight correlation, k is empirically set to a relatively high value of 7, because the highlights are strongly correlated through similar semantics and user interests.
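Clique percolation of this kind is available in networkx; the sketch below binarizes the integrated weight matrix with a threshold before running it, which is an added assumption since the method in [35] is defined on unweighted graphs.

```python
import networkx as nx
from networkx.algorithms.community import k_clique_communities

def highlight_communities(w, threshold=0.0, k=7):
    """k-clique (clique percolation) communities over the integrated graph W."""
    g = nx.Graph()
    n = len(w)
    g.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if w[i][j] > threshold:   # keep only sufficiently strong correlations
                g.add_edge(i, j)
    return [set(c) for c in k_clique_communities(g, k)]
```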

2.3.3. Highlight Recommendation Scheme

After exploring the structure of the integrated graph, we propose a recommendation framework with three different strategies, including global recommendation, related recommendation and long tail recommendation, for different applications.

• Initialization: Global Recommendation

When facing a new movie without any user preference, the computer can only extract highlights and construct a semantics-based graph for recommendation. Therefore, we use global recommendation for initialization. In essence, this method is equivalent to semantics-based clustering. Given the community distribution of the movie, the global strategy works as follows: 1) rank the strength of all communities and choose the first N communities with the highest strength; 2) recommend the centers of these communities to users.
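A possible implementation of the global strategy is sketched below. Community "strength" and "center" are not defined precisely in the text, so total internal edge weight and largest weighted degree inside the community are used here as stand-ins.

```python
def global_recommendation(communities, w, n=5):
    """Recommend the centers of the n strongest communities."""
    def strength(c):                      # assumed: total internal edge weight
        return sum(w[i][j] for i in c for j in c if i < j)

    def center(c):                        # assumed: member with largest weighted degree
        return max(c, key=lambda i: sum(w[i][j] for j in c if j != i))

    strongest = sorted(communities, key=strength, reverse=True)[:n]
    return [center(c) for c in strongest]
```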

• Update: Related Recommendation & Long Tail Recommendation

Depending on an individual's selections, the highlights in one movie can be correlated in a personal manner, while their relationships can also be influenced by the selections of other users. Therefore, the integrated graph and the community discovery can be recalculated to update the relationships among highlights for dynamic and personalized recommendation. We propose the following methods for this goal:

(1) Related recommendation:


Assuming that users would like to preview highlights with similar semantics that are also of high interest to other users, we can conduct related recommendation. The members of each community are re-ranked whenever the integrated graph is updated. Once the user chooses a highlight from one community, we recommend the nearest members of that community to the user.
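Within a community, the related strategy then amounts to a nearest-neighbour lookup on the integrated weights; the sketch below assumes highlights are indexed consistently with the matrix W and the community list produced above.

```python
def related_recommendation(selected, communities, w, n=3):
    """After the user picks a highlight, recommend its nearest members
    (largest integrated weight) within the same community."""
    for community in communities:
        if selected in community:
            neighbours = sorted(
                (i for i in community if i != selected),
                key=lambda i: w[selected][i],
                reverse=True,
            )
            return neighbours[:n]
    return []
```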

(2) Long tail recommendation:

Assuming that users would like to preview highlights with diverse content that are of high interest to other users, we can conduct long tail recommendation. The neighboring communities are re-ranked whenever the integrated graph is updated, and the center of each community can be automatically recommended. This helps users jump out of the current community to satisfy their multiple tastes by previewing related highlights and discovering their favorites.

3. Experimental results

To demonstrate the effectiveness of the human-centered framework for movie analysis, we built a dataset of 10 popular movies from the Internet Movie Database (IMDb) [37], considering genre, decade, and gender. The details of each movie are presented in Table 1. To pre-process a movie, video structurization, including shot boundary detection, key frame extraction, and scene boundary detection as in [11], is performed. Then the movie content is automatically analyzed with the proposed framework.

To evaluate the proposed framework, we chose a user study for each hierarchy, as in [36], for the following two reasons: (1) to date, no literature focuses on movie content analysis by imitating human behavior, so there is no standard or optimal method for evaluating the performance; (2) the proposed framework addresses a strongly subjective task, and it is difficult for any objective comparison or simulation method to obtain accurate evaluations [6]. In the experiments, we carried out a user study with 20 participants (10 male and 10 female movie fans, aged 20 to 30, without research experience in movie analysis or recommendation) who gave subjective scores quantified into five levels, with 5 denoting best and 1 worst. In addition, we use objective evaluation to test the performance of the "What" detectors, which usually cover viewers' favorite content.

Table 1. Detailed information about each movie

Movie Title        a    b    c    d    e    f    g    h    i    j
Runtime (min)      110  120  103  177  113  145  131  134  183  170
Movie Genre        Action+Drama / Action+Sci-fi / Action+War
File Format        MPEG-1
Audio Format       16 bits/sample, mono, 22 kHz
Frame Rate (f/s)   25   30   25   30   30   30   30   25   30   25

(a: Fearless; b: Crouching Tiger, Hidden Dragon; c: Fist of Legend; d: Braveheart; e: The Matrix; f: Minority Report; g: Enemy at the Gates; h: Windtalkers; i: Pearl Harbor; j: The Thin Red Line)

Table 2. The subjective evaluation for low hierarchy

Movie Title    VAM   AAM   TAM   Average/Movie
b              4.8   4.6   5.0   4.80
a              4.6   4.3   5.0   4.63
e              4.2   4.5   5.0   4.57
g              4.5   4.6   4.6   4.57
c              4.5   4.2   4.8   4.50
f              4.3   4.1   5.0   4.47
h              4.1   4.3   5.0   4.47
d              4.0   4.2   5.0   4.40
i              4.6   4.1   4.5   4.40
j              3.8   4.2   5.0   4.33
Average/Model  4.34  4.31  4.89  ——


3.1. Evaluation for low hierarchy

Based on video structurization, movie highlights are extracted by the three human attention sub-models with empirical thresholds Th1 and Th2. Figure 4 shows an example for a video clip of Crouching Tiger, Hidden Dragon. The figure shows that, by mapping the low level features into the three attention sub-models, we can quantitatively represent the variation of human attention. Table 2 shows the subjective evaluation for the low hierarchy. The high Average/Model scores indicate that human attention in the different modalities matches the three attention sub-models well. Compared to the average score of the textual sub-model, the evaluation of the other two sub-models is relatively lower because the complexity of the visual and aural modalities makes human attention more difficult to imitate. However, the Average/Movie scores show that, with the three sub-models, the low hierarchy can effectively extract movie highlights.

Figure 4. Movie structure for one video clip of Crouching Tiger, Hidden Dragon and highlight detection with the Weber-Fechner Law based human attention sub-models; the horizontal lines in different colors denote the threshold Th1 for the corresponding sub-models, and the boxes in different colors mark the corresponding highlight detection results.

3.2. Evaluation for middle hierarchy

In the middle hierarchy, the semantic detectors are applied to automatically annotate the extracted highlights. Figure 5 shows an example for a video clip of Crouching Tiger, Hidden Dragon. In Figure 5(a), for convenient navigation of the movie, we place the video in a two-dimensional Cartesian coordinate system in the hierarchical browsing subwindow: the vertical dimension lists the key frames of the same scene, while the horizontal dimension is the linear temporal order of the scene sequence. In Figure 5(b)-(e), Who, What, Where, and How are separately annotated with the semantic detectors. Table 3 shows the subjective evaluation for the middle hierarchy. From each column, we can see that by modeling specific semantic descriptors, we realize the detection of characters, events, surroundings, and emotion. Thanks to mature face detection and recognition techniques and the clear definition of events, the first and second columns show satisfactory results. However, due to the diversity of the surroundings descriptor and the complicated nature of human emotion, the evaluation in the third and fourth columns is relatively low and needs to be improved. From each row, we can see that:


(a) for the movies in the first six rows, the subjective evaluation is high due to their relatively simple storylines, and the framework produces a good structure with semantic annotation for each movie; (b) for the last four movies, the scores are a little lower for two reasons: 1) the scenarios of these movies are very complicated, which makes the discriminative representation of semantics with low level features difficult; 2) advanced filming techniques, such as ceaseless changes in illumination, color, and motion, make it difficult to detect specific events, especially "Where" and "How".

Figure 5. An illustration of semantic annotation for one video clip of Crouching Tiger, Hidden Dragon.

Table 3. The subjective evaluation for middle hierarchy

Movie Title      Who   What  Where  How   Average/Movie
b                4.0   4.2   3.5    3.6   3.83
a                3.9   4.1   3.0    3.4   3.60
c                3.7   4.0   3.2    3.2   3.53
d                3.5   3.8   3.4    3.1   3.45
e                3.8   3.6   3.0    3.3   3.43
g                3.4   3.9   3.3    3.1   3.43
i                3.6   3.4   3.2    3.0   3.30
j                3.5   3.7   2.9    3.1   3.30
h                3.7   3.6   2.7    2.8   3.20
f                3.5   3.3   2.8    2.9   3.13
Average/Element  3.66  3.76  3.10   3.15  ——


In addition, we use objective evaluation for the "What" detectors of music, dialogue, action, and warfare, which usually cover audiences' favorite content. The ground truth is obtained from human judgments, and precision and recall are used to measure the results. The detailed experimental results are shown in Table 4. From Table 4, we can see that the recall values for the different genres of movies are all over 93%, which means that the semantic detectors represent the characteristics of the semantic events well. Comparatively, the precision values are a little lower; the main reason is that other semantics may exhibit similar characteristics and thereby produce false positives.

Table 4. Experimental results for semantic scene detection

Semantic   Action                  Dialogue                War               Music
Movie      b     c     e     i     b     c     h     i     g     h     j     d     a
TP         8     8     7     12    8     8     9     12    19    21    5     28    35
FP         2     1     0     2     2     1     2     2     5     3     1     3     3
FN         0     1     1     0     0     1     1     0     2     2     0     0     3
P (%)      80    87.5  100   85.7  80    87.5  80    85.7  77.3  86.4  83.3  90.3  91.4
R (%)      100   87.5  86    100   100   87.5  88.9  100   89.5  90.5  100   100   91.4
AP (%)     88.3                    83.3                    82.3              90.9
AR (%)     93.4                    94.1                    93.3              95.7

(TP: True Positive; FP: False Positive; FN: False Negative; P: Precision; R: Recall; AP: Average Precision; AR: Average Recall)

3.3. Evaluation for high hierarchy

In the high hierarchy, both semantic annotation and user preference are integrated to explore the latent relationships among highlights. Figure 6 shows an example of community discovery and highlight recommendation for the entire movie Crouching Tiger, Hidden Dragon. To demonstrate the superiority of the integrated graph, we conduct comparative experiments with user preference only, semantic annotation only, and both cues, shown in Figure 6(a)-(i). From Figure 6(a) we can see that the users' rating results cannot be used directly for recommendation because of insufficient data: people may exhibit multiple interests in a movie, which leads to overlapping cohesive groups of nodes, so there is only one community and no recommendation method can be applied to it. From Figure 6(b)-(d) we can see that although the semantics-based result for community discovery is better than the user-based one, it still cannot satisfy the needs of effective recommendation. Communities with too many elements mean that the recommendation lists are too long to view; although most of the highlights are properly classified, too many highlights with similar visual, audio, and textual stimuli make viewers lose patience with the redundant highlights and make them less attractive, especially those toward the back of the lists. However, with the integrated graph, we can realize personalized recommendation. From Figure 6(e)-(i) we can see that this method has two main advantages: (1) shorter recommendation lists save viewers' time on the same kind of highlights and give them more time to preview highlights with diverse content; (2) the integrated graph-based method can discover unexpected and interesting results. It is known to be difficult to precisely detect highlights containing both dialogue and music, and the results in Figure 6(b)-(d) do not reveal this kind of highlight. However, by cinematic convention, the two commonly coexist to attract viewers. The community corresponding to picture (h) is exactly the class of "dialogue + music". This promising result shows that by integrating user preference and machine intelligence, we can discover hidden communities with specific meanings. Therefore, the integrated graph-based method greatly benefits latent relationship discovery for recommendation.

To conduct an effective evaluation of personalized recommendation, we set six criteria for reference:
• I. Attractive: whether the display of the movie content attracts audiences;
• II. Content redundant: whether there are too many highlights with similar movie content;


• III. Time-consuming: whether users can quickly find their favorite highlights;
• IV. Acceptable: whether the method is friendly for human-computer interaction;
• V. Personalized: whether the method can satisfy individual interests;
• VI. Overall: the overall evaluation of the method considering the five criteria above.

As a comparative experiment, the static and dynamic video summarization systems from our previous work [11] were also provided to viewers for evaluation. Each participant rated the performance of static summarization (S), dynamic summarization (D), and the proposed framework (F) for the ten movies. The detailed evaluation is shown in Table 5.

Figure 6. Community Discovery and Highlights Recommendation for Crouching Tiger, Hidden Dragon. (The images above the graphs (e)-(i) are the key frames of the corresponding highlights; the numbers under the graphs (a)-(i) mean the numbers of elements in the corresponding communities)


Table 5. The subjective evaluation for high hierarchy

Evaluation  I    II   III  IV   V    VI
S           3.1  5.0  5.0  1.2  1.0  2.6
D           3.8  2.8  3.2  3.5  1.0  3.4
F           4.6  4.5  4.8  4.9  4.9  4.8

(Note: The numbers above are the average scores by 20 participants for the 10 movies)

From Table 5, we can make the following observations:
• Attraction: Compared to S and D, F is more attractive because it shows viewers personalized recommendation lists tailored to individual taste.

• Content redundancy: Because D always contains similar movie content, viewers feel somewhat bored when watching too much of it. Most viewers hope to see something new in the limited time, and therefore F gets a higher score. Note that the score for S is not a meaningful reference because it contains only the key frames of highlights, which makes the movie content difficult to understand.

• Time consumption: Because each community produced by F contains highlights with similar semantics and user preference, it is convenient for viewers to find their favorites. Although D covers diverse movie content, participants gave it a low score because a dynamic video summary can only be previewed in temporal order, which makes it hard for viewers to locate their favorites. Again, the score for S is not a meaningful reference because it contains only key frames, so viewers can skim the movie at a glance but without real content understanding.

• Acceptance: F gets outstanding scores for three reasons: (1) movie highlights aid the understanding of movie content; (2) the duration of a single highlight stays within viewers' patience; (3) viewers can choose highlights based on their interests through the recommendation methods. However, S gets a rather low score because viewers cannot understand the movie content from a static storyboard alone.

• Personalization: Viewers can readily perceive that F automatically recommends movie highlights based on their selections. By comparison, S and D can only give fixed results once the video summary is generated and consequently receive the lowest score, 1.

• Overall: Although the score of D is higher than that of S, D still cannot satisfy viewers' need to conveniently preview a new movie. By recommending highlights, F lets users easily navigate the movie content according to personal interest and quickly decide whether the entire movie is worth watching. Therefore, by integrating the three recommendation strategies, the system performs better for personalized movie highlight recommendation.

4. Conclusions and future work

In this paper, we propose a human-centered framework for movie content analysis. The framework consists of three hierarchies. From the low level to the high level, we respectively imitate human perception, cognition, and action, and realize movie highlight extraction, automatic semantic annotation, and personalized recommendation. With this framework, computers can perform "watching, thinking, and reacting" with human-like intelligence. The promising results of subjective and objective evaluation indicate that the proposed framework can not only make computers intelligently understand movie content, but also provide a personalized movie highlight recommendation service that effectively leads viewers to preview new movies in an individualized manner.

However, the existing technologies are not sufficient to handle complex storylines and their latent semantics. More research effort is needed on the following challenging aspects: (a) developing more descriptors with multimodal information for semantic representation; (b) developing algorithms for latent relationship discovery with richer content and contextual information for highlight categorization and recommendation.

5. Acknowledgements

The authors greatly appreciate the valuable contribution of Prof. Takeo Kanade and Prof. Jie Yang with School of Computer Science, Carnegie Mellon University.


This work was supported in part by the Doctoral Fund of the Ministry of Education of China (20090032110028) and the National Natural Science Foundation of China (60976001).

6. References

[1] H. D. Wactlar, "The Challenges of Continuous Capture, Contemporaneous Analysis and Customized Summarization of Video Content," Defining a Motion Imagery Research and Development Program Workshop, sponsored by NCSA/NIMA/NSF, Herndon, VA, November 28-30, 2001.

[2] http://www.apple.com/trailers/.
[3] http://www.traileraddict.com/.
[4] http://en.wikipedia.org/wiki/Film_trailer.
[5] A. G. Money and H. Agius, "Video summarisation: A conceptual framework and survey of the state of the art," Journal of Visual Communication and Image Representation, vol. 19, no. 2, pp. 121-143, February 2008.

[6] Y.F. Ma, L. Lu, H.J. Zhang, and M.J. Li, “A User Attention Model for Video Summarization,” Proc. of the tenth ACM international conference on Multimedia, 2002.

[7] H.J. Zhang, J. Wu, and D. Zhong, “An integrated system for content-based video retrieval and browsing,” Pattern Recognition, vol. 30, no.4, pp.643-658, 1997.

[8] W. Wolf, “Key frame selection by motion analysis,” Proc. of ICASSP’96, vol.2, pp.1228-1231, 1996.

[9] Y. Zhuang, Y. Rui, T.S. Huang, and S. Mehrotra, “Adaptive key frame extraction using unsupervised clustering,” Proc. of ICIP’98, vol. 1, pp. 866-870, 1998.

[10] F. Dufaux, "Key frame selection to represent a video," Proc. of ICME 2000, vol. 2, pp. 275-278, 2000.

[11] A.A. Liu, J.T. Li, Y.D. Zhang, S. Tang, Y. Song, and Z.X. Yang, "An Innovative Model of Tempo and Its Application in Action Scene Detection for Movie Analysis," Proc. of the Workshop on Applications of Computer Vision (WACV), USA, 2008.

[12] M.A. Smith and T. Kanade, "Video skimming and characterization through the combination of image and language understanding techniques," Proc. of CVPR'97, pp. 775-781, 1997.

[13] N. Jeho, and H.T. Ahmed, “Dynamic video summarization and visualization,” Proc. of ACM Multimedia, 1999.

[14] A. Stefanidis, P. Partsinevelos, P. Agouris, and P. Doucette, “Summarizing video datasets in the spatiotemporal domain,” Proc. of 11th Inter. Workshop on Database and Expert Systems Applications, pp.906-912, 2000.

[15] X. Orriols, and X. Binefa, “An EM algorithm for video summarization, generative model approach,” Proc. of ICCV, vol. 2, pp. 335-342, 2001.

[16] Chris Anderson, The Long Tail, BOOK-STUDIO Press, 2006.
[17] J.D. Bransford, A.L. Brown, and R.R. Cocking, How People Learn: Brain, Mind, Experience, and School, National Academy Press, Washington, D.C., 2000.
[18] J.Y. You, G.Z. Liu, L. Sun, and H.L. Li, "A Multiple Visual Models Based Perceptive Analysis Framework for Multilevel Video Summarization," IEEE Transactions on Circuits and Systems for Video Technology, vol. 17, no. 3, pp. 335-342, March 2007.

[19] S. Hecht, “The Visual Discrimination of Intensity and the Weber-Fechner Law,” Journal of General Physiology, vol. 7, pp. 235-267, 1924.

[20] J.H. Shen, “On the foundations of vision modeling I. Weber's law and Weberized TV (total variation) restoration,” Physica D: Nonlinear Phenomena , Vol:175, pp. 241-251, 2003.

[21] A.A. Liu, J.T. Li, Y.D. Zhang, T. Sheng, Yan. Song, and Z.X. Yang, “Human Attention Model for Action Movie Analysis,” Proc of 2nd International Conference on Pervasive Computing and Applications, 2007, UK.

[22] R. Lienhart, and A. Wernicke, “Localizing and Segmenting Text in Images and Videos,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, no. 4, April 2002.

[23] Y.X. Wang, “On Cognitive Informatics,” Proc. of the First IEEE International Conference on Cognitive Informatics, 2002.

[24] D. Arijon, Grammar of the Film Language, Los Angeles, CA: Silman James Press, 1976.


[25] Y. Li, S.H. Lee, S.H. Yeh, and C.-C.J. Kuo, "Techniques for Movie Content Analysis and Skimming," IEEE Signal Processing Magazine, vol. 23, pp. 79-89, March 2006.
[26] M. Zhao, C. Chen, S.Z. Li, and J.J. Bu, "Subspace analysis and optimization for AAM based face alignment," Proceedings of the Sixth IEEE International Conference on Automatic Face and Gesture Recognition, Seoul, South Korea, 2004.

[27] L. Bai, Y. Hu, S.Y. Lao, J.Y. Chen, and L.D. Wu, “Feature analysis and extraction for audio automatic classification,” Proc. of IEEE International Conference on Systems, Man and Cybernetics, vol.1, pp.767-772,2005.

[28] S. Moncrieff, C. Dorai, and S. Venkatesh, “Detecting Indexical Signs in Film Audio for Scene Interpretation,” Proc. of ICME, 2001.

[29] S. Tang, Y.D. Zhang, J.T. Li, M. Li, N. Cai, X. Zhang, K. Tao, L. Tan, S. Xu, and Y. Ran, “TRECVID 2007 High-Level Feature Extraction By MCG-ICT-CAS,” Proc. of TRECVID Workshop, USA, 2007.

[30] “LSCOM Lexicon Definitions and Annotations (Version 1.0),” Columbia University ADVENT Technical Report #217-2006-3, March 2006.

[31] A. Mittal, and L.F. Cheong, “Framework for synthesizing semantic level indexes,” Multimedia Tools Appl., vol. 20, no. 2, pp. 135–158, 2003.

[32] H.L. Wang, and L.F. Cheong, “Affective Understanding in Film,” IEEE Transactions on Circuits and Systems for Video Technology, Vol. 16, No. 6, June 2006.

[33] S. Boutemedjet, and D. Ziou, “A graphical Model for Context-Aware Visual Content Recommendation”, IEEE Transaction on Multimedia, vol. 10, no.1, Jan., 2008.

[34] Q. Ou, Y.D. Jin, T. Zhou, B.H. Wang, and B.Q. Yin, “Power-Law Strength-Degree Correlation from Resource-Allocation Dynamics on Weighted Networks,” Phys. Rev. E 75, 021102, 2007.

[35] G. Palla, I. Derényi, I. Farkas, and T. Vicsek, “Uncovering the overlapping community structure of complex networks in nature and society,” Nature 435, 814-818, 2005.

[36] H. Sundaram, L. Xie, and S. F. Chang, “A utility framework for the automatic generation of audio-visual skims,” in Proc. 10th ACM Int. Conf. Multimedia, pp. 189–198, 2002.

[37] http://www.imdb.com.
[38] Irfanullah, Nida Aslam, Kok-Keong Loo, and Roohullah, "Semantic Annotation Gap: Where to put Responsibility?," JDCTA: International Journal of Digital Content Technology and its Applications, vol. 3, no. 1, pp. 94-97, 2009.
[39] Irfanullah, Nida Aslam, Kok-Keong Loo, and Roohullah, "Semantic Multimedia Annotation: Text Analysis," JDCTA: International Journal of Digital Content Technology and its Applications, vol. 3, no. 2, pp. 152-156, 2009.
