
Using Both Audio and Visual Clues

Yao Wang, Zhu Liu, and Jin-Cheng Huang

Multimedia content analysis refers to the computerized understanding of the semantic meanings of a multimedia document, such as a video sequence with an accompanying audio track. As we enter the digital multimedia information era, tools that enable such automated analysis are becoming indispensable for efficiently accessing, digesting, and retrieving information. Information retrieval, as a field, has existed for some time. Until recently, however, the focus has been on understanding text information, e.g., how to extract key words from a document, how to categorize a document, and how to summarize a document, all based on written text. With a multimedia document, its semantics are embedded in multiple forms that are usually complementary of each other. For example, live coverage on TV about an earthquake conveys information that is far beyond what we hear from the reporter. We can see and feel the effects of the earthquake, while hearing the reporter talking about the statistics. Therefore, it is necessary to analyze all types of data: image frames, sound tracks, texts that can be extracted from image frames, and spoken words that can be deciphered from the audio track. This usually involves segmenting the document into semantically meaningful units, classifying each unit into a predefined scene type, and indexing and summarizing the document for efficient retrieval and browsing.

In this article, we review recent advances in using audio and visual information jointly for accomplishing the above tasks. We will describe audio and visual features that can effectively characterize scene content, present selected algorithms for segmentation and classification, and review some testbed systems for video archiving and retrieval. We will also briefly describe audio and visual descriptors and description schemes that are being considered by the MPEG-7 standard for multimedia content description.

What Does Multimedia Content Analysis Entail?

The first step in any multimedia content analysis task is the parsing or segmentation of a document. (From here onwards, we use the word video to refer to both the image frames and the audio waveform contained in a video.) For a video, this usually means segmenting the entire video into scenes so that each scene corresponds to a story unit. Sometimes it is also necessary to divide each scene into shots, so that the audio and/or visual characteristics of each shot are coherent. Depending on the application, different tasks follow the segmentation stage. One important task is the classification of a scene or shot into some predefined category, which can be very high level (an opera performance in the Metropolitan Opera House), mid level (a music performance), or low level (a scene in which audio is dominated by music). Such semantic-level classification is key to generating text-form indexes. Beyond such "labeled" indexes, some audio and visual descriptors may also be useful as low-level indexes, so that a user can retrieve a video clip that is aurally or visually similar to an example clip. Finally, video summarization is essential in building a video retrieval system to enable a user to quickly browse through a large set of returned items in response to a query. Beyond a text summary of the video content, some AV summaries will give the user a better grasp of the characters, the settings, and the style of the video.

Note that the above tasks are not mutually exclusive, but may share some basic elements or be interdependent. For example, both indexing and summarization may require the extraction of some key frames within each scene/shot that best reflect the visual content of the scene/shot. Likewise, scene segmentation and classification are dependent on each other, because segmentation criteria are determined by scene class definitions. A key to the success of all the above tasks is the extraction of appropriate audio and visual features. They are not only useful as low-level indexes, but also provide a basis for comparison between scenes/shots. Such a comparison is required for scene/shot segmentation and classification and for choosing sample frames/clips for summarization.

Earlier research in this field has focused on using visual features for segmentation, classification, and summarization. Recently, researchers have begun to realize that audio characteristics are equally, if not more, important when it comes to understanding the semantic content of a video. This applies not just to the speech information, which obviously provides semantic information, but also to generic acoustic properties. For example, we can tell whether a TV program is a news report, a commercial, or a sports game without actually watching the TV or understanding the words being spoken, because the background sound characteristics in these scenes are very different. Although it is also possible to differentiate these scenes based on the visual information, audio-based analysis requires significantly less computation. When audio alone can already give definitive answers regarding the scene content, more sophisticated visual processing can be saved. On the other hand, audio analysis results can always be used to guide additional visual processing. When either audio or visual information alone is not sufficient for determining the scene content, combining audio and visual cues may resolve the ambiguities in the individual modalities and thereby help to obtain more accurate answers.

AV Features for Characterizing Semantic Content

A key to the success of any multimedia content analysis algorithm is the type of AV features employed for the analysis. These features must be able to discriminate among different target scene classes. Many features have been proposed for this purpose. Some of them are designed for specific tasks, while others are more general and can be useful for a variety of applications. In this section, we review some of these features. We describe audio features in greater detail than visual features, as there have been several recent review papers covering visual features.

Audio Features

There are many features that can be used to characterize audio signals. Usually audio features are extracted at two levels: the short-term frame level and the long-term clip level. Here a frame is defined as a group of neighboring samples that lasts about 10 to 40 ms, within which we can assume that the audio signal is stationary and short-term features such as volume and Fourier transform coefficients can be extracted. The concept of an audio frame comes from traditional speech signal processing, where analysis over a very short time interval has been found to be most appropriate.


List of Abbreviations

4ME: 4-Hz modulation energy
AMDF: Average magnitude difference function
AV: Audiovisual
BW: Bandwidth
CC: Cepstral coefficient
CCV: Color coherence vector
D: Descriptor
DCH: Difference between color histograms
DS: Description scheme
ERSB1/2/3: Subband energy ratio at frequency bands 0-630 Hz, 630-1720 Hz, and 1720-4400 Hz, respectively
FC: Frequency centroid
GMM: Gaussian mixture model
GoF/GoP: Group of frames/pictures
HMM: Hidden Markov model
KLT: Karhunen-Loeve transform
LSMDC: Least square minimum distance classifier
MDA: Multiple discriminant analysis
ME: Motion energy
MFCC: Mel-frequency cepstral coefficient
MoCA: Movie content analysis
MPEG: Motion picture expert group
NPR: Nonpitch ratio
NSR: Nonsilence ratio
OCR: Optical character recognition
PCF: Phase correlation function
PSTD: Standard deviation of pitch
rms: Root mean square
SPR: Smooth pitch ratio
SPT: Spectral peak track
SVM: Support vector machine
VDR: Volume dynamic range
VQ: Vector quantizer
VSTD: Volume standard deviation
VU: Volume undulation
ZCR: Zero crossing rate
ZSTD: Standard deviation of ZCR


For a feature to reveal the semantic meaning of an audio signal, analysis over a much longer period is necessary, usually from one second to several tens of seconds. Here we call such an interval an audio clip (in the literature, the term "window" is sometimes used). A clip consists of a sequence of frames, and clip-level features usually characterize how frame-level features change over a clip. The clip boundaries may be the result of audio segmentation such that the frame features within each clip are similar. Alternatively, fixed-length clips, usually 1 to 2 seconds, may be used. Both frames and clips may overlap with their predecessors, and the overlapping lengths depend on the underlying application. Figure 1 illustrates the relation between frames and clips. In the following, we first describe frame-level features and then move on to clip-level features.

Frame-Level Features

Most of the frame-level features are inherited from traditional speech signal processing. Generally they can be separated into two categories: time-domain features, which are computed from the audio waveform directly, and frequency-domain features, which are derived from the Fourier transform of the samples over a frame. In the following, we use N to denote the frame length and s_n(i) to denote the ith sample in the nth audio frame.

Volume: The most widely used and easy-to-compute frame feature is volume. (Volume is also referred to as loudness, although strictly speaking, loudness is a subjective measure that depends on the frequency response of the human listener.) Volume is a reliable indicator for silence detection, which may help to segment an audio sequence and to determine clip boundaries. Normally, volume is approximated by the rms of the signal magnitude within each frame. Specifically, the volume of frame n is calculated by

$$ v(n) = \left[ \frac{1}{N} \sum_{i=0}^{N-1} s_n^2(i) \right]^{1/2} $$

(the rms volume is also referred to as energy). Note that the volume of an audio signal depends on the gain of the recording and digitizing devices. To eliminate the influence of such device-dependent conditions, we may normalize the volume of a frame by the maximum volume of some previous frames.
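To make the definition concrete, the following sketch computes the rms volume of each frame and applies the device-gain normalization described above. It is a minimal NumPy illustration; the frame length and history window are assumptions for the example, not values from the article.

```python
import numpy as np

def frame_volume(frame):
    """RMS volume of one frame: v(n) = sqrt((1/N) * sum s_n(i)^2)."""
    frame = np.asarray(frame, dtype=float)
    return np.sqrt(np.mean(frame ** 2))

def normalized_volumes(signal, frame_len=512, history=100):
    """Frame volumes, plus volumes normalized by the maximum over recent frames."""
    n_frames = len(signal) // frame_len
    v = np.array([frame_volume(signal[k * frame_len:(k + 1) * frame_len])
                  for k in range(n_frames)])
    v_norm = np.empty_like(v)
    for n in range(n_frames):
        recent_max = v[max(0, n - history):n + 1].max()
        v_norm[n] = v[n] / (recent_max + 1e-12)   # guard against an all-silent history
    return v, v_norm
```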

Zero Crossing Rate: Besides the volume, ZCR is another widely used temporal feature. To compute the ZCR of a frame, we count the number of times that the audio waveform crosses the zero axis. Formally,

$$ Z(n) = \frac{1}{2} \left[ \sum_{i=1}^{N-1} \left| \operatorname{sign}\bigl(s_n(i)\bigr) - \operatorname{sign}\bigl(s_n(i-1)\bigr) \right| \right] \frac{f_s}{N}, $$

where f_s represents the sampling rate. ZCR is one of the most indicative and robust measures to discern unvoiced speech. Typically, unvoiced speech has a low volume but a high ZCR. By using ZCR and volume together, one can prevent low-energy unvoiced speech frames from being classified as silent.
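A direct NumPy transcription of this definition is shown below; the joint volume/ZCR silence test is a sketch, and both thresholds are illustrative assumptions rather than values from the article.

```python
import numpy as np

def zero_crossing_rate(frame, fs):
    """ZCR in crossings per second: 0.5 * sum |sign(s(i)) - sign(s(i-1))| * fs / N."""
    s = np.sign(np.asarray(frame, dtype=float))
    return 0.5 * np.sum(np.abs(np.diff(s))) * fs / len(frame)

def is_silent(frame, fs, vol_thresh=0.01, zcr_thresh=1000.0):
    """Label a frame silent only if BOTH volume and ZCR are low, so that
    low-energy unvoiced speech (high ZCR) is not mistaken for silence."""
    volume = np.sqrt(np.mean(np.asarray(frame, dtype=float) ** 2))
    return volume < vol_thresh and zero_crossing_rate(frame, fs) < zcr_thresh
```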

Pitch: Pitch is the fundamental frequency of an audio waveform and is an important parameter in the analysis and synthesis of speech and music. Normally, only voiced speech and harmonic music have a well-defined pitch, but we can still use pitch as a low-level feature to characterize the fundamental frequency of any audio waveform. The typical pitch frequency for a human being is between 50 and 450 Hz, whereas the pitch range for music is much wider. It is not easy to robustly and reliably estimate the pitch of an audio signal. Depending on the required accuracy and complexity constraints, different methods for pitch estimation can be applied [1].

One can extract pitch information by using either temporal or frequency analysis. Temporal estimation methods rely on computation of the short-time autocorrelation function R_n(l) or the AMDF A_n(l), where

$$ R_n(l) = \sum_{i=0}^{N-1-l} s_n(i)\, s_n(i+l) \qquad \text{and} \qquad A_n(l) = \sum_{i=0}^{N-1-l} \bigl| s_n(i+l) - s_n(i) \bigr| . $$

▲ 1. Decomposition of an audio signal into clips and frames.

Figure 2 shows the autocorrelation function and AMDF for a typical voiced speech segment. We can see that there exist periodic peaks in the autocorrelation function. Similarly, there are periodic valleys in the AMDF. Here, peaks and valleys are defined as local extremes that satisfy additional constraints in terms of their values relative to the global minimum and their curvatures. For example, the AMDF in Fig. 2 has two valleys, and the pitch frequency is the reciprocal of the time period between the origin and the first valley. Such valleys exist in voiced and music frames and vanish in noise or unvoiced frames.
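The autocorrelation-based estimate can be sketched as follows. This is a simplified illustration: the lag search range follows the 50-450 Hz speech pitch range quoted above, and the 0.3 peak threshold used to reject unvoiced or noise frames is an arbitrary assumption.

```python
import numpy as np

def pitch_autocorrelation(frame, fs, fmin=50.0, fmax=450.0):
    """Estimate the pitch (Hz) of a frame from the peak of R_n(l); return 0 if
    no sufficiently strong peak exists (unvoiced or noise frame)."""
    s = np.asarray(frame, dtype=float)
    s = s - s.mean()
    r = np.correlate(s, s, mode="full")[len(s) - 1:]   # R_n(l) for l >= 0
    if r[0] <= 0:
        return 0.0
    r = r / r[0]                                       # normalize so that R_n(0) = 1
    lo, hi = int(fs / fmax), min(int(fs / fmin), len(r) - 1)
    if lo >= hi:
        return 0.0
    lag = lo + int(np.argmax(r[lo:hi]))
    return fs / lag if r[lag] > 0.3 else 0.0
```

An AMDF-based estimator is analogous, except that one searches for the first pronounced valley rather than a peak.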

In frequency-based approaches, pitch is determined from the periodic structure in the magnitude of the Fourier transform or the cepstral coefficients of a frame. For example, we can determine the pitch by finding the greatest common divisor of the frequencies of all the local peaks in the magnitude spectrum [2]. When the required accuracy is high, a large-size Fourier transform needs to be computed, which is time consuming.

Spectral Features: The spectrum of an audio frame refers to the Fourier transform of the samples in this frame. Figure 3 shows the waveforms of three audio clips digitized from TV broadcasts. The commercial clip contains male speech over a music background, the news clip includes clean male speech, and the sports clip is from a live broadcast of a basketball game. Figure 4 shows the spectrograms (magnitude spectra of successive overlapping frames) of these three clips. Obviously, the difference among these three clips is more noticeable in the frequency domain than in the waveform domain. Therefore, features computed from the spectrum are likely to help audio content analysis.

The difficulty with using the spectrum itself as a frame-level feature lies in its very high dimension. For practical applications, it is necessary to find a more succinct description. Let S_n(ω) denote the power spectrum (i.e., the magnitude square of the spectrum) of frame n. If we think of ω as a random variable and of S_n(ω), normalized by the total power, as its probability density function, we can define the mean and standard deviation of ω. It is easy to see that the mean measures the FC, whereas the standard deviation measures the BW of the signal. They are defined as [3]

$$ FC(n) = \frac{\int_0^\infty \omega\, S_n(\omega)\, d\omega}{\int_0^\infty S_n(\omega)\, d\omega} \qquad \text{and} \qquad BW^2(n) = \frac{\int_0^\infty \bigl(\omega - FC(n)\bigr)^2 S_n(\omega)\, d\omega}{\int_0^\infty S_n(\omega)\, d\omega}. $$

It has been found that FC is related to the human sensation of the brightness of the sound we hear [4].
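In discrete form, FC and BW can be computed from an FFT of the frame, treating the normalized power spectrum as a probability mass over the frequency bins. The sketch below is one straightforward NumPy realization of the definitions above.

```python
import numpy as np

def frequency_centroid_bandwidth(frame, fs):
    """Frequency centroid (Hz) and bandwidth (Hz) of one audio frame."""
    s = np.asarray(frame, dtype=float)
    power = np.abs(np.fft.rfft(s)) ** 2            # discrete power spectrum S_n(w)
    freqs = np.fft.rfftfreq(len(s), d=1.0 / fs)    # bin frequencies in Hz
    total = power.sum() + 1e-12
    fc = np.sum(freqs * power) / total
    bw = np.sqrt(np.sum((freqs - fc) ** 2 * power) / total)
    return fc, bw
```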

▲ 2. Autocorrelation and AMDF of a typical male voice segment.

▲ 3. Waveforms of commercial, news, and sports clips.

In addition to FC and BW, Liu et al. proposed to use the ratio of the energy in a frequency subband to the total energy as a frequency-domain feature [3], which is referred to as ERSB. Considering the perceptual properties of human ears, the entire frequency band is divided into four subbands, each consisting of the same number of critical bands, where the critical bands correspond to cochlear filters in the human auditory model [5]. Specifically, when the sampling rate is 22050 Hz, the frequency ranges for the four subbands are 0-630 Hz, 630-1720 Hz, 1720-4400 Hz, and 4400-11025 Hz. Because the four ERSBs always sum to one, only the first three ratios are used as audio features, referred to as ERSB1, ERSB2, and ERSB3, respectively.
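The subband energy ratios follow directly from the discrete power spectrum. The sketch below uses the band edges quoted above for a 22050-Hz sampling rate; for other rates the edges would have to be rederived from the critical-band grouping.

```python
import numpy as np

def ersb_features(frame, fs=22050, edges=(0.0, 630.0, 1720.0, 4400.0, 11025.0)):
    """ERSB1-ERSB3: energy ratios of the first three subbands to the total energy."""
    power = np.abs(np.fft.rfft(np.asarray(frame, dtype=float))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    total = power.sum() + 1e-12
    ratios = [power[(freqs >= lo) & (freqs < hi)].sum() / total
              for lo, hi in zip(edges[:-1], edges[1:])]
    return ratios[:3]   # the fourth ratio is redundant since all four sum to one
```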

Scheirer et al. used the spectral rolloff point as a frequency-domain feature [6], which is defined as the 95th percentile of the power spectrum. This is useful to distinguish voiced from unvoiced speech. It is a measure of the "skewness" of the spectral shape, with a right-skewed distribution having a higher value.

MFCCs or CCs [7] are widely used for speech recognition and speaker recognition. While both of them provide a smoothed representation of the original spectrum of an audio signal, MFCC further considers the nonlinear property of the human hearing system with respect to different frequencies. Based on the temporal change of MFCC, an audio sequence can be segmented into different segments, so that each segment contains music of the same style or speech from one person. Boreczky and Wilcox used 12 cepstral coefficients along with some color and motion features to segment video sequences [8].

Clip-Level Features

As described before, frame-level features are designed to capture the short-term characteristics of an audio signal. To extract the semantic content, we need to observe the temporal variation of frame features on a longer time scale. This consideration leads to the development of various clip-level features, which characterize how frame-level features change over a clip. Therefore, clip-level features can be grouped by the type of frame-level features on which they are based.

Volume Based: Figure 5 presents the volume contours of the three clips previously shown in Fig. 3. By comparing these three graphs, we see that the mean volume of a clip does not necessarily reflect the scene content, but the temporal variation of the volume in a clip does. To measure the variation of volume, Liu et al. proposed several clip-level features [3]. The VSTD is the standard deviation of the volume over a clip, normalized by the maximum volume in the clip. The VDR is defined as (max(v) − min(v))/max(v), where min(v) and max(v) are the minimum and maximum volume within an audio clip. Obviously these two features are correlated, but they do carry some independent information about the scene content. Another feature is VU, which is the accumulation of the differences between neighboring peaks and valleys of the volume contour within a clip.
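Given the volume contour of a clip (one value per frame), these three features can be computed as below. The treatment of VU, accumulating the absolute differences between successive local extremes of the contour, is one plausible reading of the definition above rather than the authors' exact implementation.

```python
import numpy as np

def clip_volume_features(volumes):
    """VSTD, VDR, and VU from the volume contour of one clip."""
    v = np.asarray(volumes, dtype=float)
    vmax = v.max() + 1e-12
    vstd = v.std() / vmax                         # volume standard deviation, normalized
    vdr = (v.max() - v.min()) / vmax              # volume dynamic range
    d = np.diff(v)
    extremes = [v[0]] + [v[i] for i in range(1, len(v) - 1)
                         if d[i - 1] * d[i] < 0] + [v[-1]]
    vu = float(np.sum(np.abs(np.diff(extremes))))   # volume undulation
    return vstd, vdr, vu
```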

Scheirer et al. proposed to use the percentage of "low-energy" frames [6], which is the proportion of frames with rms volume less than 50% of the mean volume within one clip. Liu et al. used the NSR [3], the ratio of the number of nonsilent frames to the total number of frames in a clip, where silence detection is based on both volume and ZCR.

The volume contour of a speech waveform typically peaks at 4 Hz. To discriminate speech from music, Scheirer et al. proposed a feature called 4ME [6], which is calculated based on the energy distribution in 40 subbands. Liu et al. proposed a different definition that can be computed directly from the volume contour. Specifically, it is defined as [3]

$$ \text{4ME} = \frac{\int_0^\infty W(\omega)\, |C(\omega)|^2\, d\omega}{\int_0^\infty |C(\omega)|^2\, d\omega}, $$

where C(ω) is the Fourier transform of the volume contour of a given clip and W(ω) is a triangular window function centered at 4 Hz. Speech clips usually have higher values of 4ME than music or noise clips.
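A discrete version of this feature can be computed from the per-frame volume sequence. The half-width of the triangular window (2 Hz) and the removal of the DC component of the contour are illustrative choices of this sketch, not specifics from the article.

```python
import numpy as np

def four_hz_modulation_energy(volumes, frames_per_second):
    """4ME: fraction of the volume-contour energy concentrated around 4 Hz."""
    v = np.asarray(volumes, dtype=float)
    v = v - v.mean()                                 # drop DC so the ratio reflects modulation
    energy = np.abs(np.fft.rfft(v)) ** 2             # |C(w)|^2
    freqs = np.fft.rfftfreq(len(v), d=1.0 / frames_per_second)
    w = np.clip(1.0 - np.abs(freqs - 4.0) / 2.0, 0.0, 1.0)   # triangular window at 4 Hz
    return float(np.sum(w * energy) / (np.sum(energy) + 1e-12))
```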

ZCR Based: Figure 6 shows the ZCR contours of the previous three clips. We can see that, for a speech signal, low and high ZCR periods are interlaced. This is because voiced and unvoiced sounds often occur alternately in speech. The commercial clip has a relatively smooth contour since it has a strong music background. The sports clip has a smoother contour than the news clip, due to its noisy background. Liu et al. used the ZSTD within a clip to classify different audio contents [3]. Saunders proposed to use four statistics of the ZCR as features [9]: i) the standard deviation of the first-order difference, ii) the third central moment about the mean, iii) the total number of zero crossings exceeding a threshold, and iv) the difference between the number of zero crossings above and below the mean value. Combined with the volume information, the proposed algorithm can discriminate speech and music at a high accuracy of 98%.

▲ 4. Spectrograms of commercial, news, and sports clips.

Pitch Based: Figure 7 shows the pitch contours of the three audio clips, obtained using the autocorrelation function method. In these graphs, frames that are silent or without detected pitch are assigned a zero pitch frequency. For the news clip, by comparing with its volume and ZCR contours, we know that the zero-pitch segments correspond to silence or unvoiced speech. Although the sports clip has many zero-pitch segments, they correspond not to silent periods but to periods with only background sounds. There are many discontinuous pitch segments in the commercial clip, within each of which the pitch value is almost constant. This pattern is due to the music background in the commercial clip. The pitch frequency in a speech signal is primarily influenced by the speaker (male or female), whereas the pitch of a music signal is dominated by the strongest note being played. It is not easy to derive the scene content directly from the pitch level of isolated frames, but the dynamics of the pitch contour over successive frames appear to reveal the scene content more.

Liu et al. utilized three clip-level features to capture the variation of pitch [3]: PSTD, SPR, and NPR. SPR is the percentage of frames in a clip that have a similar pitch as the previous frame. This feature measures the percentage of voiced or music frames within a clip, since only voiced speech and music have a smooth pitch. On the other hand, NPR is the percentage of frames without pitch. This feature measures how many frames within a clip are unvoiced speech or noise.
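Given a per-frame pitch track (with zero marking frames where no pitch was detected), the three features can be sketched as follows. The 5% tolerance used to decide that two consecutive frames have "similar" pitch is an assumption for illustration.

```python
import numpy as np

def pitch_clip_features(pitch_track, rel_tol=0.05):
    """PSTD, SPR, and NPR from one clip's pitch track (0 = no pitch detected)."""
    p = np.asarray(pitch_track, dtype=float)
    voiced = p > 0
    pstd = float(p[voiced].std()) if voiced.any() else 0.0    # standard deviation of pitch
    prev, curr = p[:-1], p[1:]
    similar = (prev > 0) & (curr > 0) & (np.abs(curr - prev) <= rel_tol * prev)
    spr = float(similar.mean()) if len(similar) else 0.0      # smooth pitch ratio
    npr = float((~voiced).mean())                             # nonpitch ratio
    return pstd, spr, npr
```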

Frequency Based: Given frame-level features that reflect the frequency distribution, such as FC, BW, and ERSB, one can compute their mean values over a clip to derive corresponding clip-level features. Since a frame with high energy has more influence on the sound perceived by the human ear, Liu et al. proposed using a weighted average of the corresponding frame-level features, where the weighting for a frame is proportional to the energy of the frame. This is especially useful when there are many silent frames in a clip, because the frequency features in silent frames are almost random. By using energy-based weighting, their detrimental effects can be removed.

Zhang and Kuo used SPTs in a spectrogram to classify audio signals [10]. First, SPT is used to detect music segments: if there are tracks that stay at about the same frequency level for a certain period of time, this period is considered a music segment. Then, SPT is used to further classify music segments into three subclasses: song, speech with music, and environmental sound with a music background. Song segments have one of three features: ripple-shaped harmonic peak tracks due to the voice, tracks with longer duration than speech, and tracks with fundamental frequency higher than 300 Hz. Segments of speech with a music background have SPTs concentrated in the lower to middle frequency bands and lengths within a certain range. Segments without these characteristics are classified as environmental sound with a music background.

There are other clip features that are very useful. Due to space limits, we cannot include all of them here. Interested readers are referred to [11]-[13].

Visual Features

Several excellent papers have appeared recently, summarizing and reviewing various visual features useful for image/video indexing [14], [15]. Therefore, we only briefly review some of the visual features in this article. Visual features can be categorized into four groups: color, texture, shape, and motion. We describe these separately.

Color: Color is an important attribute for image representation. The color histogram, which represents the color distribution in an image, is one of the most widely used color features. It is invariant to image rotation, translation, and viewing axis. The effectiveness of the color histogram feature depends on the color coordinates used and the quantization method. Wan and Kuo [16] studied the effect of different color quantization methods in different color spaces, including RGB, YUV, HSV, and CIE L*u*v*. When it is not feasible to use the complete color histogram, one can also specify the first few dominant colors (the color values and their percentages) in an image.
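As a simple illustration, a quantized RGB histogram and the L1 histogram difference commonly used to compare frames (the DCH of the abbreviation list) can be computed as follows; the 8-bins-per-channel quantization is an arbitrary choice for the sketch.

```python
import numpy as np

def color_histogram(image, bins=8):
    """Normalized color histogram of an H x W x 3 uint8 RGB image."""
    pixels = image.reshape(-1, 3).astype(float)
    hist, _ = np.histogramdd(pixels, bins=(bins, bins, bins),
                             range=((0, 256), (0, 256), (0, 256)))
    return hist.ravel() / (hist.sum() + 1e-12)

def histogram_difference(h1, h2):
    """L1 distance between two normalized color histograms."""
    return float(np.sum(np.abs(h1 - h2)))
```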

A problem with the color histogram is that it does not consider the spatial configuration of pixels with the same color. Therefore, images with similar histograms can have drastically different appearances. Several approaches have been proposed to circumvent this problem. In [17], Pass and Zabih proposed a histogram refinement algorithm based on the CCV, which partitions pixels based upon their spatial coherence. A pixel is considered coherent if it belongs to a sizable contiguous region with similar colors. A CCV is a collection of coherence pairs, i.e., the numbers of coherent and incoherent pixels for each quantized color. Similarly, Chen and Wong proposed an augmented image histogram [18], which includes, for each color, not only its probability but also the mean, variance, and entropy of the pairwise distances among pixels with this color.

▲ 5. Volume contours of commercial, news, and sports clips.

Texture: Texture is an important feature of a visible surface where repetition or quasi-repetition of a fundamental pattern occurs. There are two popular texture representations: the co-occurrence matrix representation and the Tamura representation. A co-occurrence matrix describes the orientation and distance between image pixels, from which meaningful statistics can be extracted; contrast, inverse difference moment, and entropy have been found to have the biggest discriminatory power [19]. The Tamura representation is motivated by psychological studies of human visual perception of texture and includes measures of coarseness, contrast, directionality, linelikeness, regularity, and roughness [20]. Tamura features are attractive in image retrieval because they are visually meaningful. Other representations include Markov random fields, Gabor transforms, and wavelet transforms.

Shape: Shape features can be represented using traditional shape analysis such as moment invariants, Fourier descriptors, autoregressive models, and geometry attributes. They can be classified into global and local features. Global features are the properties derived from the entire shape. Examples of global features are roundness or circularity, central moments, eccentricity, and major axis orientation. Local features are those derived by partial processing of a shape and do not depend on the entire shape. Examples of local features are the size and orientation of consecutive boundary segments, points of curvature, corners, and turning angle. The most popular global representations are the Fourier descriptor and moment invariants. The Fourier descriptor uses the Fourier transform of the boundary as the shape feature. Moment invariants use region-based moments, which are invariant to geometric transformations. Studies have shown that the combined representation of Fourier descriptor and moment invariants performs better than using Fourier descriptor or moment invariants alone [21]. Other works in shape representation include the finite element method [22], the turning function [23], and wavelet descriptors [24].

Motion: Motion is an important attribute of video. Motion information can be generated by block-matching or optical-flow techniques. Motion features such as moments of the motion field, motion histograms, or global motion parameters (e.g., affine or bilinear) can be extracted from motion vectors. High-level features that reflect camera motions such as panning, tilting, and zooming can also be extracted.

▲ 6. ZCR contours of commercial, news, and sports clips.

▲ 7. Pitch contours of commercial, news, and sports clips.

Another useful motion feature is the PCF between two frames [25]. When one frame is a translation of the other, the PCF has a single peak at a location corresponding to the translation vector. When there are multiple objects with different motions in the imaged scene, the PCF tends to have multiple peaks, each with a magnitude proportional to the number of pixels experiencing a particular motion. In this sense, the PCF reveals similar information as the motion histogram, but it can be computed from the image functions directly and is therefore not affected by motion estimation inaccuracy. Figure 8 shows the PCFs corresponding to several typical motions. For a motion field that contains primarily zero motion, the PCF has a single peak at (0, 0). For a motion field that contains a global translation, the peak occurs at a nonzero position. The peak spreads out gradually when the camera zooms, and the PCF is almost flat when a shot change occurs and the estimated motion field appears as a uniform random field. Instead of using the entire PCF, a parametric representation can be used as a motion feature.
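The PCF is the inverse Fourier transform of the phase-only cross power spectrum of two frames. A minimal sketch for grayscale frames is given below; the peak location is mapped back to a signed displacement to illustrate how a dominant translation is read off the PCF.

```python
import numpy as np

def phase_correlation(frame1, frame2):
    """PCF of two equal-size grayscale frames and the dominant displacement."""
    f1 = np.fft.fft2(np.asarray(frame1, dtype=float))
    f2 = np.fft.fft2(np.asarray(frame2, dtype=float))
    cross = f1 * np.conj(f2)
    cross /= np.abs(cross) + 1e-12                 # keep only the phase
    pcf = np.abs(np.fft.ifft2(cross))
    peak = np.unravel_index(int(np.argmax(pcf)), pcf.shape)
    # Convert wrap-around FFT indices into signed shifts.
    shift = tuple(p - s if p > s // 2 else p for p, s in zip(peak, pcf.shape))
    return pcf, shift
```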

Correlation between Audio and Visual Features and Feature Space Reduction

Given an almost endless list of audio and visual features that one can come up with, a natural question to ask is whether they provide independent information about the scene content, and, if not, how to derive a reduced set of features that can best serve the purpose. One way to measure the correlation among features within the same modality and across different modalities is by computing the covariance matrix

$$ \mathbf{C} = \frac{1}{N} \sum_{\mathbf{x} \in \chi} (\mathbf{x} - \mathbf{m})(\mathbf{x} - \mathbf{m})^T, \quad \text{with} \quad \mathbf{m} = \frac{1}{N} \sum_{\mathbf{x} \in \chi} \mathbf{x}, $$

where x = (x_1, x_2, ..., x_K)^T is a K-dimensional feature vector, χ is the set containing all feature vectors derived from training sequences, and N is the total number of feature vectors in χ. The normalized correlation between features i and j is defined by

▲ 8. The phase correlation functions for typical motion: (a) static background (news report), (b) camera pans [football game; peaks at (−2, 7) and (−1, 7)], (c) camera zooms into a baseball field, and (d) scene change.

$$ \tilde{C}(i,j) = \frac{C(i,j)}{\sqrt{C(i,i)\, C(j,j)}}, $$

where C(i, j) is the (i, j)th element of C.

Figure 9 shows the normalized correlation matrix, in absolute value, derived from a training set containing five types of TV programs: commercials, news, live basketball games, live football games, and weather forecasts. About ten minutes of each scene type are included in the training set. A total of 28 features are considered: 14 audio features, eight color features, and six motion features. The audio features consist of VSTD, VDR, VU, ZSTD, NSR, 4ME, PSTD, SPR, NPR, FC, BW, ERSB1, ERSB2, and ERSB3. The color features include the mean values of the three color components and the percentage of the most dominant color, followed by their variances; the motion features are the mean values of the two motion components and the percentage of the most dominant motion vector, followed by their variances. The color and motion features are collected over the same time duration associated with an audio clip, so that the mean or variance is calculated over all frames corresponding to an audio clip. As shown in Fig. 9, the correlation between features from different modalities (e.g., audio, color, and motion) is very low. Within the same modality, high correlation exists between some features, such as NSR and VSTD, VSTD and VDR, SPR and NPR, FC and BW, and ERSB1 and ERSB2 among the audio features, and between the means and variances of the three color components of the dominant color.
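In code, the covariance and normalized correlation follow directly from a feature matrix with one row per clip. The sketch below mirrors the two definitions above (it uses the 1/N convention of the covariance formula rather than NumPy's default 1/(N−1)); the article inspects the absolute values of the entries.

```python
import numpy as np

def normalized_correlation_matrix(X):
    """Covariance C and |normalized correlation| of X (N samples x K features)."""
    X = np.asarray(X, dtype=float)
    m = X.mean(axis=0)                       # mean feature vector
    C = (X - m).T @ (X - m) / X.shape[0]     # covariance, 1/N convention
    d = np.sqrt(np.diag(C)) + 1e-12
    C_tilde = C / np.outer(d, d)             # C(i,j) / sqrt(C(i,i) C(j,j))
    return C, np.abs(C_tilde)
```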

High correlation between certain features in the above example suggests that the feature dimension can be reduced through proper transformations. Two powerful feature space reduction techniques are the KLT and MDA, both of which use linear transforms. With the KLT, the transform is designed to decorrelate the features, and only those features with eigenvalues larger than a threshold are retained. With MDA, the transform is designed to maximize the ratio of the between-class scatter to the within-class scatter; the maximum dimension of the new feature space is the number of classes minus one. Figure 10 shows the distribution of feature points from five scene classes (denoted by different symbols and colors). The original feature space consists of the same 28 features used in Fig. 9. The left plot is based on two original features: FC and the mean of the most dominant color (red component). The middle plot is based on the first two features obtained after applying the KLT to the original feature vector. The right plot is based on the first two features obtained with MDA. We can easily see that the feature space obtained with MDA has the least amount of between-class overlap. This means that the two new features after MDA have the best scene discrimination capability.
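Both reductions can be written in a few lines of linear algebra. The sketch below is a generic illustration of the two transforms (KLT via eigendecomposition of the covariance, MDA via the generalized eigenproblem of between- and within-class scatter), not the authors' exact implementation; the small regularization added to the within-class scatter is an assumption to keep the solve well conditioned.

```python
import numpy as np
from scipy.linalg import eigh   # generalized symmetric eigensolver

def klt(X, num_components):
    """Project zero-mean features onto the leading eigenvectors of the covariance."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / X.shape[0]
    vals, vecs = np.linalg.eigh(cov)                    # ascending eigenvalues
    order = np.argsort(vals)[::-1][:num_components]
    return Xc @ vecs[:, order]

def mda(X, labels, num_components):
    """Maximize between-class over within-class scatter (at most #classes - 1 dims)."""
    K = X.shape[1]
    m = X.mean(axis=0)
    Sw, Sb = np.zeros((K, K)), np.zeros((K, K))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - m, mc - m)
    vals, vecs = eigh(Sb, Sw + 1e-6 * np.eye(K))        # solve Sb v = lambda Sw v
    order = np.argsort(vals)[::-1][:num_components]
    return (X - m) @ vecs[:, order]
```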

▲ 9. The normalized correlation matrix between features from different modalities. The first 14 features are audio features; the next eight are color features; and the last six are motion features.

Video Segmentation Using AV Features

Video segmentation is a fundamental step for analyzing the content of a video sequence and for the efficient accessing, retrieving, and browsing of large video databases. A video sequence usually consists of separate scenes, and each scene includes many shots. The ultimate goal of video segmentation is to automatically group shots into what a human being perceives as "scenes." In the literature, scenes are also referred to as story units, story segments, or video paragraphs. Using movie production terminology, a shot is "one uninterrupted image with a single static or mobile framing" [26], and a scene is "usually composed of a small number of inter-related shots that are unified by location or dramatic incident" [27]. Translating the definition of shots into technical terms, a shot should be a group of frames that have consistent visual (including color, texture, and motion) characteristics. Typically, the camera direction and view angle define a shot: when a camera looks at the same scene from different angles, or at different regions of a scene from the same angle, we see different (camera) shots. Because shots are characterized by the coherence of some low-level visual features, it is a relatively easy task to separate a video into shots. On the other hand, the clustering of "shots" into "scenes" depends on subjective judgment of semantic correlation. Strictly speaking, such clustering requires understanding the semantic content of the video. By joint analysis of audio and visual characteristics, however, it is sometimes possible to recognize shots that are related in locations or events, without actually invoking high-level analysis of semantic meanings.

Earlier works on video segmentation have primarily focused on using the visual information. They mostly rely on comparing color, motion, or other visual features. The resulting segments usually correspond to individual camera shots, which are often overly detailed for content-level analysis. Recognizing the importance of audio in video segmentation, more research efforts have recently been devoted to scene-level segmentation using joint AV analysis. We review one such work in this section. Other approaches that accomplish scene segmentation and classification jointly are reviewed later on. We also review an approach for shot segmentation using both audio and visual information.

Hierarchical Segmentation

In [28], a hierarchical segmentation approach was proposed that can detect scene breaks and shot breaks at different levels of the hierarchy. The algorithm is based on the observation that a scene change is usually associated with simultaneous changes of color, motion, and audio characteristics, whereas a shot break is only accompanied by visual changes. For example, a TV commercial often consists of many shots, but the audio in the same commercial follows the same rhythm and tone. The algorithm proceeds in two steps. First, significant changes in audio, color, and motion characteristics are detected separately. Then shot and scene breaks are identified depending on the coincidence of changes in audio, color, and motion.

To detect audio breaks, an audio feature vector consisting of 14 audio features is computed over each short audio clip. The audio features consist of VSTD, VDR, VU, ZSTD, NSR, 4ME, PSTD, SPR, NPR, FC, BW, ERSB1, ERSB2, and ERSB3. Then, at each time instant, the difference between the audio features in several previous clips and those in the following clips is calculated. This difference is further normalized by the variance of the audio features in these clips to yield a measure of the relative change in audio characteristics.
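One way to realize this relative-change measure is sketched below: for each candidate boundary, the mean feature vectors of the preceding and following context windows are compared and normalized by the local spread. The window length and the averaging across feature dimensions are assumptions of the sketch, not specifics from [28].

```python
import numpy as np

def relative_audio_change(clip_features, context=4):
    """Relative change in audio characteristics at each clip boundary.

    clip_features: array of shape (num_clips, num_features), one row per audio clip.
    """
    F = np.asarray(clip_features, dtype=float)
    change = np.zeros(len(F))
    for t in range(context, len(F) - context):
        before, after = F[t - context:t], F[t:t + context]
        diff = np.abs(before.mean(axis=0) - after.mean(axis=0))
        spread = F[t - context:t + context].std(axis=0) + 1e-12   # local normalization
        change[t] = float(np.mean(diff / spread))
    return change
```

Boundaries with large values of this measure are candidate audio breaks.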

▲ 10. Distribution of two features in the original feature vector, after KLT, and after MDA. Feature vectors are extracted from five scene classes (commercial, basketball, football, news, weather).

Color breaks are detected by comparing the color histograms of adjacent video frames. Directly thresholding the histogram difference can yield false breaks during a fast camera pan, where different parts of the same scene (differing in color) are captured in successive frames. Therefore, a relative change is examined, which normalizes the histogram difference by the mean difference of several previous and future frames.

Motion breaks are detected by comparing the PCFs of adjacent frames. For frames that experience camera acceleration or deceleration, a shift of the dominant motion location often occurs in the PCF. Hence, the correlation of two PCFs is used instead of a simple subtraction. Finally, a relative change measure is used to locate motion break boundaries.

Real shot breaks are usually accompanied by significant changes in color and discontinuities in motion. Using the color histogram alone may result in false detection under lighting changes (e.g., camera flash). This false detection can be avoided by examining the PCF, which is invariant to lighting changes. When no significant color change occurs during a shot break, the motion change is usually very significant. In the algorithm of [28], shot breaks are located by using three thresholds: $T_c$ for color and $T_m^{\text{high}}$ and $T_m^{\text{low}}$ for motion. A shot break is declared only if the change in motion is above $T_m^{\text{high}}$, or the change in motion is above $T_m^{\text{low}}$ and the change in color is above $T_c$.

To detect scene breaks, frames with both audio and visual breaks are located. The threshold for audio breaks is intentionally set low so that no scene changes are missed; the falsely detected breaks can be corrected by the visual information. For each detected audio break, visual breaks are searched for in frames neighboring the time instant associated with the audio break. If a visual break is present, then a scene change is declared.
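The decision logic of the two steps can be summarized in a short routine. This is a schematic rendering of the rules described above, assuming the audio, color, and motion change measures have been aligned on a common time axis; all thresholds and the search window are application-dependent parameters, not values from [28].

```python
def detect_shot_and_scene_breaks(audio_change, color_change, motion_change,
                                 t_audio, t_color, t_m_low, t_m_high, window=5):
    """Shot breaks from color/motion changes; scene breaks where an audio break
    coincides with a nearby visual (shot) break."""
    n = len(color_change)
    shot_breaks = [t for t in range(n)
                   if motion_change[t] > t_m_high
                   or (motion_change[t] > t_m_low and color_change[t] > t_color)]
    shot_set = set(shot_breaks)
    scene_breaks = []
    for t in range(n):
        if audio_change[t] > t_audio:                      # low threshold: favor recall
            near = range(max(0, t - window), min(n, t + window + 1))
            if any(s in shot_set for s in near):           # confirmed by a visual break
                scene_breaks.append(t)
    return shot_breaks, scene_breaks
```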

The above algorithm has been tested on several 3-minute-long sequences digitized from broadcast TV programs. Figure 11 shows the result for one test sequence, which comprises five segments with different semantic content. All the scene breaks are correctly identified. Most of the true shot breaks are detected correctly, with one falsely detected break and one missed break. The false detection happens when the camera suddenly stops tracking the basketball player, and the missed break is due to the imposed constraint that forbids more than one break within one second.

Video Shot Detection and Classification Using HMM

A common approach to detecting shot boundaries is to compute the difference between the luminance or color histograms of two adjacent frames and compare it to a preset threshold. A problem with this approach is that it is hard to select a threshold that will work for different shot transitions. To circumvent this problem, Boreczky and Wilcox proposed an alternative approach which uses an HMM (see "TV Program Categorization Using HMM" for more details on HMMs) to model a video sequence and accomplish shot segmentation and classification simultaneously [8]. By turning the problem into a classification problem, the need for thresholding is eliminated. Another advantage of the HMM framework is that it can integrate multimodal features easily.

The use of an HMM for modeling a video sequence is motivated by the fact that a video consists of different shots connected by different transition types and camera motions. As shown in Fig. 12, different states of the HMM are used to model different types of video segments: the shots themselves, the transitions between them (cuts, fades, and dissolves), and camera motions (pans and zooms). Transitions are allowed only between two states connected by an arc. For example, from the shot state it is possible to go to any of the transition states, but a transition state cannot go to a different transition state. The parameters of the HMM (including the feature distribution of each state and the transition probabilities between states) are learned using training data, i.e., video manually labeled with shots, transition types, and motion types. Note that, in fact, this model is not a "hidden" one, as the states are prelabeled and the probability distribution of each state is trained using training segments with the corresponding label. Once the HMM is trained, a given video is segmented into its component shots and transitions by applying the Viterbi algorithm to determine the most likely sequence of states.
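For reference, the Viterbi decoding step reduces to a short dynamic program once the per-frame state log-likelihoods and transition probabilities are available. The sketch below is generic rather than tied to the specific states or features of [8]; the state models producing log_likelihoods are assumed to have been trained as described above.

```python
import numpy as np

def viterbi(log_likelihoods, log_transitions, log_prior):
    """Most likely state sequence given per-frame observation log-likelihoods.

    log_likelihoods: (T, S) array of log p(feature_t | state s).
    log_transitions: (S, S) array of log transition probabilities.
    log_prior:       (S,)   array of log initial state probabilities.
    """
    T, S = log_likelihoods.shape
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0] = log_prior + log_likelihoods[0]
    for t in range(1, T):
        cand = score[t - 1][:, None] + log_transitions   # entry (i, j): come from i, go to j
        back[t] = np.argmax(cand, axis=0)
        score[t] = cand[back[t], np.arange(S)] + log_likelihoods[t]
    states = np.empty(T, dtype=int)
    states[-1] = int(np.argmax(score[-1]))
    for t in range(T - 2, -1, -1):                       # trace back the best path
        states[t] = back[t + 1, states[t + 1]]
    return states        # e.g., shot / cut / fade / dissolve / pan / zoom labels
```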

Both audio and visual features can be used in the input data for the HMM, and they are calculated at every new video frame time. The visual features are the gray-level histogram difference and two motion features that are intended to separate camera pans and zooms. The audio features are computed in two stages. First, a 12-dimensional cepstral vector is computed over short (20 ms) audio clips. Then all such vectors in a 2-second interval before the frame being considered and those in the interval after this frame are compared to yield a single audio feature, which measures the likelihood that the two intervals share the same sound type.

▲ 11. Scene and shot segmentation result for a test sequence using audio, color, and motion information. The short lines indicate detected shot breaks, whereas the tall lines indicate scene breaks.

This method has been applied to portions of a video database containing television shows, news, movies, commercials, and cartoons. The HMM method, with either the histogram feature alone or with additional audio or motion features, was compared to the thresholding method. For shot boundary detection, the HMM method achieved higher precision than the threshold method, while maintaining a similar recall rate. The HMM method has the added advantage that it can classify the motion and transition types. Using the histogram and audio features together yielded the highest classification accuracy.

Scene Content Classification

The goal of scene content classification is to label an AV segment as one of several predefined semantic classes. Scene classification can be done at different levels. At the very low level, an audio or visual segment can be categorized into some elementary classes, e.g., speech versus music, indoor versus outdoor, high action versus low action. At the next, mid level, some basic scene types may be identified, such as a dialog between two people, an indoor concert, or a typical scene on the beach. At the highest level, the actual story needs to be understood, e.g., a hurricane in Florida or a New Year's Eve celebration in New York's Times Square. Somewhat in parallel with mid-level scene classification is the problem of recognizing typical video categories (or genres), e.g., news report, commercial, sports game, cartoon, and, in the case of a movie, drama, comedy, or action.

Obviously, based on low-level audio and visual features alone, it is hard to understand the story content. On the other hand, there have been encouraging research efforts demonstrating successful low- to mid-level scene classification and categorization. In this section, we first review works on low-level audio content classification, and then we describe works in basic scene type detection and video program categorization.

Audio Content Classification

In a content-based audio indexing and retrieval system, the most important task is to identify the content of the audio automatically. Depending on the application, different categorizations can be applied. In [9], Saunders considered the discrimination of speech from music. Saraceno and Leonardi further classified audio into four groups: silence, speech, music, and noise [29]. The addition of the silence and noise categories is appropriate, since a long silence interval can be used as a segment boundary, and the characteristics of noise are much different from those of speech or music.

A more elaborate audio content categorization was proposed by Wold et al. [4], which divides audio content into ten groups: animal, bells, crowds, laughter, machine, instrument, male speech, female speech, telephone, and water. Furthermore, instrument sounds are classified into altotrombone, cellobowed, oboe, percussion, tubularbells, violinbowed, and violinpizz. To characterize the differences among these audio groups, the authors used the mean, variance, and autocorrelation of loudness, pitch, brightness (i.e., frequency centroid), and bandwidth as audio features. A nearest-neighbor classifier based on a weighted Euclidean distance measure was employed. The classification accuracy is about 81% over an audio database with 400 sound files.

Another interesting work related to general audio content classification is by Zhang and Kuo [30]. They explored five kinds of audio features: energy, ZCR, fundamental frequency, timbre, and rhythm. Based on these features, a hierarchical system for audio classification and retrieval was built. In the first step, audio data is classified into speech, music, environmental sounds, and silence using a rule-based heuristic procedure. In the second step, environmental sounds are further classified into applause, rain, birds' sound, etc., using an HMM classifier. These two steps provide the so-called coarse-level and fine-level classification. The coarse-level classification achieved 90% accuracy and the fine-level classification achieved 80% accuracy in a test involving ten sound classes.

▲ 12. Hidden Markov model for video shot segmentation/classification. From [8].

Besides audio classification in general video, several research groups have focused on the problem of music classification. This is important for automatically indexing large volumes of music resources and for providing the capability of music query by example. Matityaho et al. proposed to use a multilayer neural network classifier to separate classical and pop music [31]. The audio features used are the average amplitudes of the Fourier transform coefficients within different subbands. The subbands are defined by dividing the cochlea into several equal-sized bands and choosing the corresponding resonance frequencies along the cochlea at these positions. The neural network considers a window of successive frames simultaneously, and the final decision is made after the output of the neural network is integrated over a short period. An accuracy of 100% was reported with the best parameter setting over a database containing 24 music pieces, about half classical music and the other half rock music.

Lambrou et al. [32] attempted to classify music into rock, piano, and jazz. They collected eight first- and second-order statistical features in the temporal domain as well as in three different transform domains: the adaptive splitting wavelet transform, the logarithmic splitting wavelet transform, and the uniform splitting wavelet transform. For the features from each domain, four different classifiers were examined: a minimum distance classifier, a K-nearest neighbor distance classifier, the LSMDC, and a quadratic classifier. An accuracy of 91.67% was achieved under several combinations of feature set and classifier. The LSMDC was the best classifier for most feature types.

Basic Video Scene Type Detection

We present two approaches for segmenting a video into some basic scene types. In both approaches, shot-level segmentation is first accomplished, and then shots are grouped into a scene based on the scene definition. Therefore, scene segmentation and classification are accomplished simultaneously.

Scene Characterization Using AV Features Jointly

Saraceno and Leonardi [33] considered segmenting a video into the following basic scene types: dialogs, stories, actions, and generic scenes. This is accomplished by first dividing the video into audio and visual shots independently and then grouping video shots so that the audio and visual characteristics within each group follow some predefined patterns. First, the audio signal is segmented into audio shots, and each audio shot is classified as silence, speech, music, or noise. Simultaneously, the video signal is divided into video shots based on the color information. For each detected video shot, typical block patterns in the shot are determined using a vector quantization method. Finally, the scene detection and characterization unit groups a set of consecutive video shots into a scene if they match the visual and audio characteristics defined for a particular scene type.

To accomplish audio segmentation and classification, the silence segments are first detected based on an analysis of the signal energy. The nonsilence segments are further separated into speech, music, and noise by evaluating the degree of periodicity and the ZCR of the audio samples [29], [34].

Video shot breaks are determined based on the color histograms, using the twin-comparison technique proposed in [35]. For each shot, a codebook is designed, which contains the typical block patterns in all frames within this shot. Then successive shots are compared and labeled sequentially: a shot that is close to a previous shot is given the same label as that shot; otherwise, it is given a new label. To decide whether two shots are similar, the codebook for one shot is used to approximate the block patterns in the other shot. If each shot can be represented well using the codebook of the other shot, then these two shots are considered similar.
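A minimal sketch of this cross-codebook similarity test is given below, assuming each shot is summarized by a set of block feature vectors and a small VQ codebook; the distortion threshold is an assumed parameter.

```python
import numpy as np

def vq_distortion(blocks, codebook):
    """Average distortion when encoding `blocks` with `codebook`.

    blocks:   (N, d) array of block feature vectors from one shot.
    codebook: (K, d) array of codewords learned from another shot.
    """
    # Distance from every block to every codeword; keep the nearest codeword.
    d = np.linalg.norm(blocks[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()

def shots_similar(blocks_a, codebook_a, blocks_b, codebook_b, thr=0.1):
    """Two shots are similar if each is well represented by the other's codebook."""
    return (vq_distortion(blocks_a, codebook_b) < thr and
            vq_distortion(blocks_b, codebook_a) < thr)
```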

Scene detection and classification depend on the definition of scene types in terms of their audio and visual patterns. For example, in a dialog scene, the audio signal should consist of mostly speech, and the video shot labels should follow a pattern of the type ABABAB. On the other hand, in an action scene, the audio signal should mostly belong to one nonspeech type, and the visual information should exhibit a progressive pattern of the type ABCDEF. By examining successive video shots, to see whether they follow one of the predefined patterns (e.g., ABABAB) and whether the corresponding audio shots have the correct labels (e.g., mostly speech), the maximum set of consecutive video shots that satisfies a particular scene definition is identified and classified.
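As an illustration, the sketch below checks the two label patterns mentioned above over a window of shot labels; it is only a toy version of the rule matching in [33], with the window length and the speech fraction chosen arbitrarily.

```python
def is_dialog(shot_labels, audio_labels, min_len=4, speech_frac=0.8):
    """ABAB... visual alternation plus mostly-speech audio."""
    if len(shot_labels) < min_len:
        return False
    alternating = all(shot_labels[i] != shot_labels[i + 1] and
                      shot_labels[i] == shot_labels[i + 2]
                      for i in range(len(shot_labels) - 2))
    speech = sum(a == "speech" for a in audio_labels) / len(audio_labels)
    return alternating and speech >= speech_frac

def is_action(shot_labels, audio_labels, min_len=4):
    """Progressive ABCDEF... pattern with mostly one nonspeech audio type."""
    progressive = (len(shot_labels) >= min_len and
                   len(set(shot_labels)) == len(shot_labels))
    nonspeech = [a for a in audio_labels if a != "speech"]
    if not nonspeech:
        return False
    dominant_share = max(nonspeech.count(t) for t in set(nonspeech)) / len(audio_labels)
    return progressive and dominant_share > 0.5

print(is_dialog(list("ABABAB"), ["speech"] * 6))  # True
```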

The above technique was applied to 75 minutes of a movie and 30 minutes of news. It has been found that more accurate results can be obtained for news than for movies. This is as expected, because news sequences follow the somewhat simplified scene definitions more closely.

Scene Characterization Based on Separate Criteria

Lienhart et al. [12] proposed to use different criteria to segment a video: scenes with similar audio characteristics, scenes with similar settings, and dialogs. The scheme consists of four steps. First, video shot boundaries are detected by the algorithm in [36], which can detect and classify hard cuts, fades, and dissolves. Then, audio features, color features, orientation features, and faces are extracted. Next, distances between every two video shots are calculated, with respect to each feature modality, to form a distance table. Finally, based on the calculated shot distance tables, video shots are merged based on each feature separately. The authors also investigated how to merge the scene detection results obtained using different features. They argued that it is better to first perform scene segmentation/classification based on separate criteria and leave the merging task to a later stage that is application dependent.

To examine audio similarity, an audio feature vector is computed for each short audio clip, which includes the magnitude spectrum of the audio samples. A forecasting feature vector is also calculated at every instant using exponential smoothing of previous feature vectors. An audio shot cut is detected by comparing the calculated feature vector with the forecasting vector. The prediction process is reset after a detected shot cut. All feature vectors of an audio shot describe the audio content of that shot. The distance between two shots is defined as the minimal distance between two audio feature vectors of the respective video shots. A scene in which audio characteristics are similar is noted as an “audio sequence.” It consists of all video shots such that every two shots are separated by no more than a look-ahead number (three) of shots and the distance between these two shots is below a threshold.
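A minimal sketch of this forecast-and-compare cut detector is shown below, assuming a sequence of clip-level feature vectors; the smoothing factor and distance threshold are assumed values.

```python
import numpy as np

def audio_cuts(features, alpha=0.3, thr=1.0):
    """Detect audio shot cuts by comparing each feature vector with an
    exponentially smoothed forecast of the previous ones.

    features: iterable of 1-D numpy arrays (one clip-level vector per clip).
    Returns the indices of clips flagged as cut points.
    """
    cuts, forecast = [], None
    for i, x in enumerate(features):
        if forecast is not None and np.linalg.norm(x - forecast) > thr:
            cuts.append(i)
            forecast = None              # reset the prediction after a detected cut
        if forecast is None:
            forecast = x.astype(float)
        else:
            forecast = alpha * x + (1 - alpha) * forecast
    return cuts
```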

A setting is defined as a locale where the action takes place. The distributions of the color and the edge orientation are usually similar in frames under the same setting. To examine color similarity, the CCV is computed between every two frames. For edge orientation, an orientation correlogram is determined, which describes the correlation of orientation pairs separated by a certain spatial distance. The color (respectively, orientation) distance between two video shots is defined as the minimum distance between the CCVs (respectively, orientation correlograms) of all frames contained within the respective shots. A scene in which color or orientation characteristics are similar is noted as a “video setting.” It is determined in the same way as an “audio sequence,” but using the shot distance table calculated based on the color or orientation feature.

A dialog scene is detected by using face detection and matching techniques. Faces are detected by modifying the algorithm developed by Rowley et al. [37]. Similar faces are then grouped into face-based sets using the eigenface face recognition algorithm [38]. A consecutive set of video shots is identified as a dialog scene if alternating face-based sets occur in these shots.

The above scene determination scheme has been applied to two full-length feature movies. On average, the accuracy of determining the dialog scenes is much higher than that for audio sequences and settings. This is probably because the definition and extraction of dialog scenes conform more closely with the human perception of a dialog. The authors also attempted to combine the video shot clustering results obtained based on different criteria. The algorithm works by merging two clusters resulting from different criteria whenever they overlap. This has yielded much better results than those obtained based on audio, color, or orientation features separately.

Video Program Categorization

Film Genre Recognition

Fischer et al. investigated automatic recognition of film genres using both audio and visual cues [39]. In their work, film materials are actually extracted from TV programs, and film genres refer to the types of programs: news cast, sports, commercials, or cartoons. Video analysis is accomplished at three increasing levels of abstraction. At the first level, some syntactic properties of a video, which include the color histogram, motion energy (sum of frame differences), motion field, and waveform and spectrum of the audio, are extracted. Shot cut detection is also accomplished at this stage based on changes in both the color histogram and the motion energy. (In their paper, each shot is referred to as a scene.) Video style attributes are then derived from the syntactic properties at the second level. Typical style attributes are shot lengths, camera motion such as zooming and panning, shot transitions such as fading and morphing, object motion (only rigid objects with translational motions are considered), the presence of some predefined patterns such as TV station logos, and audio type (speech, music, silence, or noise, detected based on the spectrum and loudness). At the final level, the temporal variation pattern of each style attribute, called the style profile, is compared to the typical profiles of various video genres. Four classification modules are used to compare the profiles: one for shot length and transition styles, one for camera motion and object motion, one for the occurrence of recognized patterns, and one for audio spectrum and loudness. The classification is accomplished in two steps. In the first step, each classification module assigns a membership value to each particular genre. In the second step, the outputs of all classification modules are integrated by a weighted average to determine the genre of the film.
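The two-step fusion at the final level can be sketched as a weighted average of per-module membership values; the module outputs, genre names, and weights below are illustrative assumptions.

```python
import numpy as np

GENRES = ["news", "car race", "tennis", "cartoon", "commercial"]

def fuse_modules(memberships, weights):
    """Combine per-module genre memberships by a weighted average.

    memberships: (M, G) array, row m holds module m's membership value
                 for each of the G genres (e.g., in [0, 1]).
    weights:     (M,) array of module weights (e.g., reflecting reliability).
    Returns the index of the winning genre and the fused scores.
    """
    w = np.asarray(weights, dtype=float)
    fused = (w[:, None] * np.asarray(memberships, dtype=float)).sum(axis=0) / w.sum()
    return int(np.argmax(fused)), fused

# Example with four modules and equal weights.
m = np.random.rand(4, len(GENRES))
best, scores = fuse_modules(m, np.ones(4))
print(GENRES[best], scores.round(2))
```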

The above technique was applied to separate five film genres, namely news cast, car race, tennis, animated cartoon, and commercials. The test data consist of about two four-minute video segments for each genre. Promising results were reported.

TV Program Categorization Using HMM

The classification approaches reviewed so far are primarily rule based, in that they rely on some rules that each scene class/category must follow. As such, the resulting solution depends on the scene definition and the appropriateness of the rules. Here, we describe an HMM-based classifier for TV program categorization, which is driven by training data instead of manually devised rules [40], [41]. A similar framework can be applied to other classification/categorization tasks.



Our use of an HMM to model a video program is motivated by the fact that the feature values of different programs at any time instant can be similar. Their temporal behaviors, however, are quite different. For example, in a basketball game, a shooting action consists of a series of basic motions: the camera first points to the player (static) and then follows the ball after it leaves the player's hand (panning/tilting). These individual basic events can occur in a football game or a commercial as well, but the ordering of the basic events and the duration of each are quite unique to basketball games.

An HMM assumes that an input observation sequence of feature vectors follows a multistate distribution. An HMM is characterized by the initial state distribution (Π), the state transition probabilities (A), and the observation probability distribution in each state (B). An HMM classifier requires training, which estimates the parameter set λ = (A, B, Π) for each class based on certain training sequences from that class. A widely used algorithm is the Baum-Welch method [7], which is an iterative procedure based on expectation-maximization. The classification stage is accomplished using the maximum likelihood method. Figure 13 illustrates the process of a discrete HMM classifier. An input sequence of feature vectors is first discretized into a series of symbols using a VQ. The resulting discrete observation sequence is then fed into pretrained HMMs for the different classes, and the most likely sequence of states and the corresponding likelihood for each class are determined using the Viterbi algorithm. The class with the highest likelihood is then identified. For more detailed descriptions of HMMs, see the classical paper by Rabiner [42].

▲ 13. Illustration of a discrete HMM classifier.
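A minimal numpy sketch of the classification step is given below, assuming each class already has trained parameters (Π, A, B) for a discrete HMM. It scores an observation sequence with the scaled forward algorithm (the system described here uses Viterbi scoring; the forward recursion gives the exact likelihood and serves the same illustrative purpose) and picks the class with the largest log-likelihood.

```python
import numpy as np

def log_likelihood(obs, pi, A, B):
    """Scaled forward algorithm for a discrete HMM; returns log P(obs | model).

    obs: sequence of symbol indices; pi: (S,) initial distribution;
    A: (S, S) transition matrix with A[i, j] = P(j | i);
    B: (S, K) observation matrix with B[i, k] = P(symbol k | state i).
    """
    alpha = pi * B[:, obs[0]]                 # alpha_1(i) = pi_i * b_i(o_1)
    c = alpha.sum()
    alpha, log_p = alpha / c, np.log(c)       # rescale to avoid underflow
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]         # alpha_t(j) = sum_i alpha_{t-1}(i) a_ij b_j(o_t)
        c = alpha.sum()
        alpha, log_p = alpha / c, log_p + np.log(c)
    return log_p

def classify(obs, models):
    """models: dict {class label: (pi, A, B)}; pick the maximum-likelihood class."""
    return max(models, key=lambda c: log_likelihood(obs, *models[c]))

# Tiny example: two states, three symbols, one (toy) pretrained model per class.
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
models = {"news": (pi, A, B),
          "commercial": (np.array([0.3, 0.7]),
                         np.array([[0.5, 0.5], [0.2, 0.8]]),
                         B[::-1])}
print(classify([0, 1, 2, 2, 1], models))
```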

We considered the classification of five common TV programs: commercial, live basketball game, live football game, news report, and weather forecast. For each program type, 20 minutes of data are collected, with half used for training and the remaining half for testing. The classification is done for every short sequence of 11 s. Audio features are collected over 1-s-long clips, and each audio observation sequence consists of 20 feature vectors calculated over 20 overlapping clips. Visual (color and motion) features are calculated over 0.1-s-long frames, and each visual observation sequence consists of 110 feature vectors. The audio features are VSTD, VDR, VU, ZSTD, NSR, 4ME, PSTD, SPR, NPR, FC, BW, ERSB1, ERSB2, and ERSB3; the color features are the most dominant color with its percentage; and the motion features are the first four dominant motions with their percentages. For each scene class, a five-state ergodic HMM is used, and the feature vector is quantized into 256 observation symbols.

Tables 1-3 show the classification results using audio, color, and motion features separately. We can see that audio features alone can effectively separate video into three super classes, i.e., commercial, basketball and football games, and news and weather forecast. This is because each super class has distinct audio characteristics. On the other hand, visual information, color or motion, can distinguish the basketball game from the football game and the weather forecast from the news. Overall, audio features are more effective than the visual features in this classification task.

To resolve the ambiguities present in individual modalities, we explored different approaches for integrating multimodal features [41]. Here, we present the two more effective and generalizable approaches.

Direct Concatenation: Concatenating feature vectors from different modalities into one super vector is a straightforward way of combining audio and visual information. This approach has been employed for speech recognition using both speech and mouth-shape information [43], [44]. To combine audio and visual features into a super vector, the audio and visual features need to be synchronized. Hence, visual features are calculated for the same duration associated with each audio clip. In general, the concatenation approach can improve classification results. As the feature dimension increases, however, more data are needed for training.

Product HMM: With the product approach, we assume that the features from different modalities are independent of each other and use an HMM classifier to compute the likelihood values for all scene classes based on the features from each modality separately. The final likelihood for a scene class is the product of the results from all modules, as shown in Fig. 14. With this approach, HMM classifiers for different modalities can be trained separately. Also, features from different modalities can be calculated on different time scales. Another advantage of this approach is that it can easily accommodate a new modality if the features in the new modality are independent of the existing features.

▲ 14. The product HMM classifier: P_{l,a}, P_{l,c}, and P_{l,m} are the likelihoods that an input sequence belongs to class l based on the audio, color, and motion feature sets, respectively; P_l is the overall likelihood for class l based on all features.
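Under the independence assumption, multiplying per-modality likelihoods amounts to summing their log-likelihoods; the sketch below reuses the log_likelihood function from the earlier HMM example and treats the per-modality models and observation sequences as given.

```python
def classify_product(obs_by_modality, models_by_modality):
    """obs_by_modality:    dict modality -> observation symbol sequence.
    models_by_modality: dict modality -> {class label: (pi, A, B)}.
    Returns the class whose summed (audio + color + motion) log-likelihood
    is largest, i.e., whose product of per-modality likelihoods is largest.
    """
    classes = next(iter(models_by_modality.values())).keys()

    def score(c):
        return sum(log_likelihood(obs_by_modality[m], *models_by_modality[m][c])
                   for m in models_by_modality)

    return max(classes, key=score)
```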

Tables 4 and 5 show the results obtained using the direct concatenation and product HMM approaches, respectively. Both approaches achieved significantly better performance than that obtained with any single modality. Overall, the product approach is better than the concatenation approach. This is partly because model parameters can be trained more reliably with the product approach. Another reason is that, as previously shown in Fig. 9, the audio, color, and motion features employed for this classification task are not strongly correlated, and therefore, one does not lose much by assuming these features are independent. In our simulations, because of the limited training data, HMMs obtained for the concatenation approach tend to be unreliable, and the classification accuracy is not always as expected. With either approach, the classification accuracy for the news category is lowest. This is because the news category in our simulation includes both anchor-person reports in a studio and live reports in various outdoor environments. A much higher accuracy is expected if only in-studio reports are included.

Table 1. Classification Accuracy Based on Audio Features Only (Average Accuracy: 79.71%).

Input Class   Output Class
              ad       bskb     ftb      news     wth
ad            75.66     7.36     0.38    15.66     0.94
bskb           1.46    91.79     5.29     1.46     0.00
ftb            1.82    13.28    83.64     1.26     0.00
news           0.00     0.19     4.58    57.55    37.68
wth            0.00     0.00     0.00    10.08    89.92

ad—commercial; bskb—basketball game; ftb—football; wth—weather forecast.

Table 2. Classification Accuracy Based on Color Features Only (Average Accuracy: 72.79%).

Input Class   Output Class
              ad       bskb     ftb      news     wth
ad            84.25     1.79     0.00    10.19     3.77
bskb           9.76    78.10     0.00     9.67     2.46
ftb            0.79     0.00    98.10     1.11     0.00
news          56.67    11.88     0.00    20.64    10.81
wth            3.78     4.07     0.00     9.30    82.85

ad—commercial; bskb—basketball game; ftb—football; wth—weather forecast.

Table 3. Classification Accuracy Based on Motion Features Only (Average Accuracy: 78.02%).

Input Class   Output Class
              ad       bskb     ftb      news     wth
ad            64.45     3.68     5.94    21.51     1.42
bskb           1.55    88.59     1.00     8.85     0.00
ftb            0.47     9.41    80.55     9.57     0.00
news          17.43     9.15     7.79    65.63     0.00
wth            3.10     0.00     7.56     1.45    87.89

ad—commercial; bskb—basketball game; ftb—football; wth—weather forecast.

Testbeds for Video Archiving and Retrieval

Informedia: The CMU Digital Video Library

The Informedia project [45]-[47] at Carnegie Mellon University (CMU) is one of the six projects supported by the U.S. Digital Library Initiative [48]. It has created a terabyte digital video library (presently containing only news and documentary video) that allows users to retrieve/browse a video segment using both textual and visual means. A key feature of the system is that it combines speech recognition, natural language understanding, and image processing for multimedia content analysis. The following steps are applied to each video document: i) generate a speech transcript from the audio track using CMU's Sphinx-II speech recognition system [49]; ii) select keywords from the transcript using natural language understanding techniques to support text-based query and retrieval; iii) divide the video document into story units; iv) divide each story unit into shots and select a key frame for each shot; v) generate visual descriptors for each shot; and vi) provide a summary for each story unit in different forms. Figure 15 illustrates the analysis results from different processing modules for a sample video segment. We briefly review the techniques used for the last three tasks, with emphasis on how audio and visual information are combined. We will also describe a companion project, Name-It.

▲ 15. Component technologies applied to segment video data in the Informedia system (audio level, word relevance, text detection, face detection, camera motion, and scene changes). From [46].

Segmentation into Story Units: Each video document is partitioned into story units. This is done manually in the current running version of the system. However, various automatic segmentation schemes have been developed and evaluated. The approach described in [50] makes use of text, audio, and image information jointly. (In [50], “video paragraph” refers to a story unit, and “scene” actually refers to a shot.) When closed captions are available, text markers such as punctuation are used to identify story segments. Otherwise, silence periods are detected based on the audio sample energy. Either approach yields boundaries of “acoustic paragraphs.” Then, for each acoustic paragraph boundary, the nearest shot break is detected (see below), which is considered the boundary of a story unit.
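The snapping of acoustic paragraph boundaries to shot breaks can be sketched as a simple nearest-neighbor search; both boundary lists are assumed to be given in seconds.

```python
def snap_to_shot_breaks(acoustic_boundaries, shot_breaks):
    """For each acoustic paragraph boundary (in seconds), return the
    nearest detected shot break, which is taken as a story-unit boundary.
    """
    story_boundaries = []
    for t in acoustic_boundaries:
        nearest = min(shot_breaks, key=lambda s: abs(s - t))
        if nearest not in story_boundaries:      # avoid duplicate boundaries
            story_boundaries.append(nearest)
    return sorted(story_boundaries)

print(snap_to_shot_breaks([12.4, 57.0], [0.0, 11.8, 30.2, 58.5, 80.1]))
# -> [11.8, 58.5]
```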

Segmentation into Shots: To enable quick access to content within a segment and to support visual summaries and image-similarity matching, each story unit is further partitioned into shots, so that each shot has a continuous action or similar color distribution. Shot break detection is based on frame differences in both the color histogram and the motion field [50].

Key Frame Selection within Each Shot: Several heuristics were employed to choose key frames. One is based on camera motion analysis. When there is a continuous camera motion in a shot and this motion stops, the corresponding frame usually contains a person or an object of significance and hence can be chosen as a key frame. Another cue is the presence of faces and texts in the video. For this purpose, face detection [37] and text detection [51] are applied to the video frames. Text detection followed by OCR is also used to improve the accuracy of speech recognition based on the audio track.

Visual Descriptors: In addition to the keywords generated from the transcript, Informedia provides other information extracted from the video frames. First, to enable image-similarity-based retrieval, a human perceptual color clustering algorithm was developed [52], which quantizes colors using a perceptual similarity distance and forms regions with similar colors using an automatic clustering scheme. Each key frame is indexed by two sets of descriptors: a modified color histogram, which includes the color value, its percentage, and the spatial connectivity of pixels for each quantized color; and a list of regions, which describes the mean color and geometry of each region.

In addition to the above low-level color and shape features, the presence of faces and/or texts is also part of the index. The face presence descriptor also provides library users with a face-searching capability.

Video-Skims: Informedia supports three types of visual summary: thumbnail, filmstrip, and video-skim [53]. To generate the video-skim for a segment, keywords and phrases that are most relevant to the query text are first identified in the transcript. Then, time-corresponding video frames are examined for scene changes, entrance and exit of a human face, presence of overlaid texts, and desirable camera motion (e.g., a static frame before or after a continuous camera motion). Audio levels are also examined to detect transitions between speakers and topics, which often lead to low energy or silence in the sound track. Finally, the chosen audio and visual intervals are combined to yield a video-skim, individually extended or reduced in time as necessary to yield satisfactory AV synchronization. Such skims have demonstrated significant benefits in performance and user satisfaction compared to simple subsampled skims in empirical studies [54].

Name-It: Name-It is a companion project to Informedia. It is aimed at automatically associating faces detected from video frames with names detected from the closed captions in news videos [55]. It does not rely on any prestored face templates for selected names, which is both the challenge and the novelty of the system. Initial face detection is accomplished using CMU's neural net-based approach [37], and then the same face is tracked based on color similarity until a scene change occurs. To recognize the same face appearing in separated shots, the eigenface technique developed at MIT [56] is used. For name detection, it uses not only the closed-caption text, but also texts detected from video frames [51]. To avoid detecting all names appearing in the transcript, lexical and grammatical analysis as well as a priori knowledge about the structure of news coverage is used to identify names that may be of interest to the underlying news topic. To accomplish name and face association, a co-occurrence factor is calculated for each possible pairing of a detected name and a detected face, which is primarily determined by their time coincidence. Multiple co-occurrences of the same name-face pair lead to a higher score. Finding the correct association is a very difficult task, because the name detected in the closed caption is often not the name of the person appearing on the screen. This will be the case, for example, when an anchor or reporter is telling the story of someone else. The system reported in [55], although impressive, still has a relatively low success rate (lower than 50%), either for face-to-name retrieval or name-to-face retrieval.
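A minimal sketch of time-coincidence-based co-occurrence scoring follows, assuming names and faces come with the time intervals over which they were detected; the overlap measure and accumulation rule are illustrative and not the exact formulation of [55].

```python
from collections import defaultdict

def cooccurrence_scores(name_events, face_events):
    """name_events: list of (name, start, end) detections from closed captions.
    face_events: list of (face_id, start, end) face tracks from the video.
    Returns a dict {(name, face_id): accumulated temporal overlap in seconds}.
    """
    scores = defaultdict(float)
    for name, ns, ne in name_events:
        for face, fs, fe in face_events:
            overlap = min(ne, fe) - max(ns, fs)
            if overlap > 0:                      # the pair co-occurs in time
                scores[(name, face)] += overlap  # repeated co-occurrence adds up
    return scores

def best_face_for(name, scores):
    """Pick the face with the highest accumulated score for a given name."""
    cands = {f: s for (n, f), s in scores.items() if n == name}
    return max(cands, key=cands.get) if cands else None
```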

One possible approach to improve the performance of Name-It is by exploiting the sound characteristics of the speaker. For example, if one can tell a male speaker from a female speaker based on the sound characteristics and differentiate male and female names based on some language models, then one can reduce the false association of a female face with a male name or vice versa. One can also use speaker identification techniques to improve the performance of face clustering, by requiring similarity in both facial features and voice characteristics.

AT&T Pictorial Transcript System

Pictorial Transcript is an automated archiving and retrieval system for broadcast news programs, developed at AT&T Labs [57]. In the first generation of the system, Pictorial Transcript extracts the content structure by lexical and linguistic processing based on the text information in the closed caption. After text processing, a news program is decomposed into a hierarchical structure, including page, paragraph, and sentence, which helps to generate a hypermedia document for browsing and retrieval. Simultaneously, the video stream is segmented into shots by a content-based sampling algorithm [58], based on brightness and motion information. A key frame is then chosen for each shot. The key frames for the shots corresponding to a paragraph in the closed caption are used as the visual representation of that paragraph.

In the current development of the system, more sophisticated audio and visual analysis is being added to help the automatic generation of the content hierarchy. The multimedia data stream, composed of audio, video, and text information, lies at the lowest level of the hierarchy. On the next level, the two major content categories in a news program, news report and commercial, are separated. On the third level, the news report is further segmented into anchor-person speech and others, which include live reports. On the highest level, text processing is used to generate a table of contents based on the boundary information extracted in the lower levels and the corresponding closed-caption information. In the following, we describe how audio and visual cues are combined to separate news report and commercial [59] and to detect anchor-person segments [60].

Table 4. Classification Accuracy Based on Audio, Color, and Motion Features Using the Direct Concatenation Approach (Average Accuracy: 86.49%).

Input Class   Output Class
              ad       bskb     ftb      news     wth
ad            91.23     7.08     0.00     1.60     0.09
bskb           2.55    86.13     8.21     3.10     0.00
ftb            1.58     1.34    94.31     2.77     0.00
news           2.63     1.66     3.02    64.95    27.75
wth            0.00     0.00     0.00     4.17    95.83

ad—commercial; bskb—basketball game; ftb—football; wth—weather forecast.

Table 5. Classification Accuracy Based on Audio, Color, and Motion Features Using the Product Approach (Average Accuracy: 91.40%).

Input Class   Output Class
              ad       bskb     ftb      news     wth
ad            93.58     0.47     0.00     5.38     0.57
bskb           6.39    93.34     0.27     0.00     0.00
ftb            0.00     0.00   100.00     0.00     0.00
news           7.30     2.14     0.29    83.54     6.72
wth            0.39     0.00     0.00    13.08    86.53

ad—commercial; bskb—basketball game; ftb—football; wth—weather forecast.

Separation of News and Commercials: To accomplish this task, a news program is first segmented into clips based on audio volume, where each clip is about 2 s long. For each clip, 14 audio features and four visual features are extracted and integrated into an 18-dimensional clip-level feature vector. The audio features used are NSR, VSTD, ZSTD, VDR, VU, 4ME, PSTD, SPR, NPR, FC, BW, and ERSB1-3. To extract visual features, the DCH and ME of adjacent frames are first computed. The visual features are the means and standard deviations of DCH and ME within a clip. Then each clip is classified as either news or commercial. For this purpose, four different classification mechanisms have been tested, including a hard threshold classifier, a linear fuzzy classifier, a GMM classifier, and an SVM classifier. The SVM classifier was found to achieve the best performance, with an error rate of less than 7% over two hours of video data from NBC Nightly News. It was found that audio features play a more important role in this classification task compared to the visual counterpart [59].
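A minimal sketch of training such a clip-level SVM classifier with scikit-learn is shown below; the 18-dimensional clip features and news/commercial labels are placeholders (random here), and the kernel and parameters are illustrative choices rather than those of [59].

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# X: (n_clips, 18) array of clip-level audio + visual features,
# y: 0 for news, 1 for commercial (random placeholder data here).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 18)), rng.integers(0, 2, size=200)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X[:150], y[:150])                        # train on part of the clips
error_rate = 1.0 - clf.score(X[150:], y[150:])   # evaluate on held-out clips
print(f"held-out error rate: {error_rate:.2f}")
```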

Detection of Anchor Person: When the anchor person for a certain news program is known and fixed, an audio approach (e.g., speaker identification) or a visual approach (e.g., face recognition) alone may be sufficient. But it is not easy to reliably detect unknown anchor persons. Figure 16 shows an approach to adaptively detect unknown anchors using on-line trained audio and visual models [60]. First, audio cues are exploited to identify the theme music segment of the given news program. Based on the starting time of the theme music, an anchor frame is located, from which a feature block, the neck-down area with respect to the detected face region, is extracted to build an on-line visual model for the anchor. Such a feature block captures both the style and the color of the clothing, and it is independent of the background setting and the location of the anchor. Using this model, all other anchor frames are identified by matching against it.

▲ 16. Diagram of an integrated algorithm for anchor detection. From [60].

When the theme music cannot be detected reliably, face detection is applied to every key frame, and then feature blocks are identified for every detected face. Once the feature blocks are extracted, dissimilarity measures are computed among all possible pairs of detected persons, based on the color histogram difference and the block matching error. An agglomerative hierarchical clustering scheme is then applied to group faces into clusters that are associated with similar feature blocks (i.e., the same clothing). Given the nature of the anchor's function, it is clear that the largest cluster with the most scattered appearance time corresponds to the anchor class.
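A minimal sketch of this unsupervised anchor selection using SciPy's hierarchical clustering follows; the feature-block dissimilarity matrix is assumed to be precomputed, and the distance cutoff and the scatter measure (standard deviation of appearance times) are illustrative choices rather than the exact criteria of [60].

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def pick_anchor_cluster(dist_matrix, appearance_times, cutoff=0.5):
    """dist_matrix: (N, N) symmetric dissimilarities between feature blocks
    (zero diagonal). appearance_times: (N,) time stamps (s) of the key frames.
    Returns the indices of the frames assigned to the presumed anchor cluster.
    """
    Z = linkage(squareform(dist_matrix), method="average")   # agglomerative clustering
    labels = fcluster(Z, t=cutoff, criterion="distance")
    best, best_score = None, -1.0
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        # Favor large clusters whose appearances are spread over the program.
        score = len(idx) * np.std(appearance_times[idx])
        if score > best_score:
            best, best_score = idx, score
    return best
```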

To recover segments where the anchor speech is present but not the anchor appearance, an audio-based anchor detection scheme is also developed. The anchor frames detected from the video stream identify the locations of the anchor speech in the audio stream. Acoustic data at these locations are gathered as the training data to build an on-line speaker model for the anchor, which is then applied to the remaining audio stream to extract all other segments containing the anchor speech.

Movie Content Analysis (MoCA) Project at the University of Mannheim

MoCA is a project at the University of Mannheim, Germany, targeted mainly at understanding the semantic content of movies (including those made for television broadcasting) [61]. We have reviewed the techniques used for movie genre recognition previously, as an example of scene classification. Here we review the techniques developed for video abstracting [62], [63].


Video abstracting refers to the construction of a video sequence, similar to the trailer (or preview) of a movie, that will give the audience a good overview of the story content, the characters, and the style of the movie. Four steps are involved in producing a video abstract: i) segment the movie into scenes and shots; ii) identify clips containing special events; iii) select among the special event clips those to be included in the abstract and certain filler clips; and iv) assemble the selected clips into a video abstract by applying appropriate ordering and editing effects between clips. We have reviewed the segmentation technique previously. Not all the features of this scheme have been implemented in the current MoCA system. In the following, we describe steps ii)-iv) in more detail.

Special Event Detection: Special events are defined as those shots containing close-up views of main characters, dialogs between main characters, title texts, and shots with special sound characteristics such as explosions and gunfire. Dialog detection is part of scene segmentation, which has been described in "Scene Characterization Based on Separate Criteria." Detection of close-up views is accomplished as a post-processing of dialog detection. To detect title texts, the algorithm of [64] is used, which makes use of heuristics about the special appearance of titles in most movies. The detected texts are also subject to OCR to produce text-based indexes. To detect gunfire and explosion sounds, an audio feature vector consisting of loudness, frequencies, pitch, fundamental frequency, onset, offset, and frequency transition is computed for each short audio clip. This feature vector is then compared to those stored for typical gunfire and explosion sounds [2].

Selection of Clips for Inclusion: Several heuristic criteria are applied when choosing candidate clips for inclusion in the abstract. Obviously, the criteria would vary depending on the type of video: the preview for a feature movie is different from the summary for a documentary movie. For feature movies, the criteria used include: i) important objects and people (main actors/actresses); ii) shots containing action (gunfire and explosions); iii) scenes containing dialogs; and iv) frames containing title text and title music.

Generation of Abstract: Including all the scenes/shots that contain special events may generate too long an abstract. Also, simply staggering them together may not be visually or aurally appealing. In the MoCA project, it was determined that only 50% of the abstract should contain special events. The remaining part should be left for filler clips. The special event clips to be included are chosen uniformly and randomly from the different types of events. The selection of a short clip from a scene is subject to some additional criteria, such as the amount of action and the similarity to the overall color composition of the movie. Closeness to the desired AV characteristics of certain scene types is also considered. The filler clips are chosen so that they do not overlap with the content of the chosen special event clips, to ensure a good coverage of all parts of the movie.

MPEG-7 Standard for Multimedia Content Description Interface

MPEG-7 is an ongoing standardization effort for content description of AV documents. The readers are referred to [65]-[67] for a broader coverage of the standard. We will only review the audio and visual descriptors and description schemes that are currently being considered by MPEG-7. We first provide an overview of how MPEG-7 decomposes an AV document to arrive at both syntactic and semantic descriptions. We then describe audio and visual descriptors used in these descriptions.

Syntactic and Semantic Decomposition of AV Documents

With MPEG-7, an AV document is represented by a hierarchical structure both syntactically and semantically. Our description below follows [68] and [69].

Syntactic Decomposition: With syntactic decomposition, an AV document (e.g., a video program with audio tracks) is divided into a hierarchy of segments, known as a segment tree. For example, the top segment could be the entire document, the next layer includes segments corresponding to different story units, the subsegments included in each story segment then correspond to different scenes, and finally each scene segment may contain many subsegments corresponding to different camera shots. A segment is further divided into a video segment and an audio segment, corresponding to the video frames and the audio waveform, respectively. In addition to using a video segment that contains a set of complete video frames (which may not be contiguous in time), a still or moving region can also be extracted. A region can be recursively divided into subregions to form a region tree.

Each segment or region is described by a set of DSs and Ds. A DS is a combination of Ds specified with a designated format. There are some DSs that specify meta information, e.g., the source, creation, usage, and time duration. There are also Ds that specify the segment level in the segment tree and its relative importance. The spatial-temporal connectivity of a segment or region is defined by some spatial- and temporal-connectivity attributes. The actual AV characteristics are described by audio and visual Ds, which are explained in more detail below. The relation between the segments is specified by a segment relation graph, which describes the spatial-temporal relations between the segments. Similarity in certain AV attributes can also be specified (e.g., similar color, faster motion, etc.).

To facilitate fast and effective browsing of an AV document, MPEG-7 also defines different summary representations for an AV segment. For example, a hierarchical summary of a segment can be formed by selecting one key frame for each node in the tree decomposition of this segment. Instead of using a key frame, an audio, visual, or AV clip (a subsegment) can also be used at all or certain nodes of the tree.

Semantic Decomposition: In parallel with the syntactic decomposition, MPEG-7 also uses a semantic decomposition of an AV document to describe its semantic content. It identifies certain events and objects that occur in a document and attaches corresponding “semantic labels” to them. For example, the event type could be a news broadcast, a sports game, etc. The object type could be a person, a car, etc. An event is described by an event-type D and an annotation DS (who, what, where, when, and why, and a free-text description). An object is described by an object-type D and an annotation DS (who, what object, what action, where, when, and why, and a free-text description). An event can be further broken up into many subevents to form an event tree. Similarly, an object tree can be formed. An event-object relation graph describes the relation between events and objects.

Relation between Syntactic and Semantic Decompositions: An event is usually associated with a segment and an object with a region. Each event or object may occur multiple times in a document, and their actual locations (which segment or region) are described by a syntactic-semantic link. In this sense, the syntactic structure, represented by the segment tree and region tree, is like the table of contents at the beginning of a book, whereas the semantic structure, i.e., the event tree and object tree, is like the index at the end of the book.

An event or object can be characterized by a model, and any segment or region can be assigned to an event or object by a classifier. MPEG-7 defines a variety of model descriptions: probabilistic, analytic, correspondence, synthetic (for synthetic objects), and knowledge (for defining different semantic notions). For example, probabilistic models define the statistical distributions of the AV samples in a segment or region belonging to a particular event/class. Defined models include: Gaussian, high-order Gaussian (in MPEG-7, Gaussian refers to a joint Gaussian distribution where the individual variables are independent, so that only the mean and variance of the individual components need to be specified; high-order Gaussian refers to the more general case, through the definition of a covariance matrix), mixture of Gaussians, and HMM.
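To make the distinction between the two Gaussian model types concrete, the small sketch below evaluates the log-likelihood of a feature sample under a diagonal-covariance Gaussian and under a full-covariance Gaussian; the numbers are arbitrary.

```python
import numpy as np
from scipy.stats import multivariate_normal

x = np.array([0.3, 1.2, -0.5])          # one AV feature sample
mu = np.zeros(3)

# "Gaussian" in the MPEG-7 sense: independent components, so only the
# per-component means and variances are stored (a diagonal covariance).
var = np.array([1.0, 0.5, 2.0])
logp_diag = multivariate_normal.logpdf(x, mean=mu, cov=np.diag(var))

# "High-order Gaussian": a full covariance matrix is specified.
cov = np.array([[1.0, 0.3, 0.0],
                [0.3, 0.5, 0.1],
                [0.0, 0.1, 2.0]])
logp_full = multivariate_normal.logpdf(x, mean=mu, cov=cov)
print(logp_diag, logp_full)
```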

Visual Descriptors

For each segment or region at any level of the hierarchy, a set of audio and visual Ds and DSs is used to characterize the segment or region. The MPEG-7 video group has currently established the following features as visual Ds, which may still undergo changes in the future. The visual Ds can be categorized into four groups: color, shape, motion, and texture. Our description below follows [70].

Color: These descriptors describe the color distributions in a video segment, a moving region, or a still region.
▲ Color Space: Four color spaces are defined: RGB, YCrCb, HSV, and HMMD. Alternatively, one can specify an arbitrary linear transformation matrix from the RGB coordinates.
▲ Color Quantization: This D is used to specify the quantization method, which can be linear, nonlinear (in MPEG-7, uniform quantization is referred to as linear quantization and a nonuniform quantizer as nonlinear), or a lookup table, and the quantization parameters (e.g., the number of bins for linear quantization).


▲ Dominant Color: This D describes the dominant colors in the underlying segment, including the number of dominant colors, a confidence measure on the calculated dominant colors, and, for each dominant color, the value of each color component and its percentage.
▲ Color Histogram: Several types of histograms can be specified: i) the common color histogram, which includes the percentage of each quantized color among all pixels in a segment or region; ii) the GoF/GoP histogram, which can be the average, median, or intersection (minimum percentage for each color) of conventional histograms over a group of frames or pictures (see the sketch after this list); and iii) the color-structure histogram, which is intended to capture some spatial coherence of pixels with the same color. An example is to increase the counter for a color as long as there is at least one pixel with this color in a small neighborhood around each pixel.
▲ Compact Color Descriptor: Instead of specifying the entire color histogram, one can specify the first few coefficients of the Haar transform of the color histogram.
▲ Color Layout: This is used to describe at a coarse level the color pattern of an image. An image is reduced to 8×8 blocks, with each block represented by its dominant color. Each color component (Y/Cb/Cr) in the reduced image is then transformed using the discrete cosine transform, and the first few coefficients are specified.
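A minimal numpy sketch of the GoF/GoP aggregation modes mentioned in the color histogram item above, assuming the per-frame histograms have already been computed and normalized.

```python
import numpy as np

def gof_histogram(frame_histograms, mode="average"):
    """Aggregate per-frame color histograms into a GoF/GoP histogram.

    frame_histograms: (F, B) array, one B-bin normalized histogram per frame.
    mode: "average", "median", or "intersection" (bin-wise minimum).
    """
    h = np.asarray(frame_histograms, dtype=float)
    if mode == "average":
        return h.mean(axis=0)
    if mode == "median":
        return np.median(h, axis=0)
    if mode == "intersection":
        return h.min(axis=0)    # minimum percentage of each color over the group
    raise ValueError(f"unknown mode: {mode}")
```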

Shape: These descriptors are used to describe the spatial geometry of still and moving regions.
▲ Object Bounding Box: This D specifies the tightest rectangular box enclosing a two- or three-dimensional (2-D or 3-D) object. In addition to the size, center, and orientation of the box, the occupancy of the object in the box is also specified, by the ratio of the object area (volume) to the box area (volume).
▲ Contour-Based Descriptor: This D is applicable to a 2-D region with a closed boundary. MPEG-7 has chosen to use the peaks in the curvature scale space (CSS) representation [71], [72] to describe a boundary, which has been found to reflect human perception of shapes, i.e., similar shapes have similar parameters in this representation.
▲ Region-Based Shape Descriptor: This D can be used to describe the shape of any 2-D region, which may consist of several disconnected subregions. MPEG-7 has chosen to use Zernike moments [73] to describe the geometry of a region. The descriptor specifies the number of moments used and the value of each moment.

Texture: This category is used to describe the texture pattern of an image.
▲ Homogeneous Texture: This is used to specify the energy distribution in different orientations and frequency bands (scales). This can be obtained through a Gabor transform with six orientation zones and five scale bands.
▲ Texture Browsing: This D specifies the texture appearance in terms of regularity, coarseness, and directionality, which are more in line with the type of descriptions that a human may use in browsing/retrieving a texture pattern. In addition to regularity, up to two dominant directions and the coarseness along each direction can be specified.
▲ Edge Histogram: This D is used to describe the edge orientation distribution in an image. Three types of edge histograms can be specified, each with five entries, describing the percentages of directional edges in four possible orientations and of nondirectional edges. The global edge histogram is accumulated over every pixel in an image; the local histogram consists of 16 subhistograms, one for each block in an image; and the semi-global histogram consists of eight subhistograms, one for each group of rows or columns in an image.

Motion: These Ds describe the motion characteristics of a video segment or a moving region as well as global camera motion.
▲ Camera Motion: Seven possible camera motions are considered: panning, tracking (horizontal translation), tilting, booming (vertical translation), zooming, dollying (translation along the optical axis), and rolling (rotation around the optical axis). For each motion, two moving directions are possible. For each motion type and direction, the presence (i.e., duration), speed, and amount of motion are specified. The last term measures the area that is covered or uncovered due to a particular motion.
▲ Motion Trajectory: This D is used to specify the trajectory of a nonrigid moving object, in terms of the 2-D or 3-D coordinates of certain key points at selected sampling times. For each key point, the trajectory between two adjacent sampling times is interpolated by a specified interpolation function (either linear or parabolic).
▲ Parametric Object Motion: This D is used to specify the 2-D motion of a rigid moving object. Five types of motion models are included: translation, rotation/scaling, affine, planar perspective, and parabolic. In addition to the model type and model parameters, the coordinate origin and time duration need to be specified.
▲ Motion Activity: This D is used to describe the intensity and spread of activity over a video segment (typically at the shot level). Four attributes are associated with this D: i) intensity of activity, measured by the standard deviation of the motion vector magnitudes; ii) direction of activity, determined from the average of the motion vector directions; iii) spatial distribution of activity, derived from the run-lengths of blocks with motion magnitudes lower than the average magnitude; and iv) temporal distribution of activity, described by a histogram of quantized activity levels over the individual frames in the shot. A simple sketch of the first two attributes follows this list.
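A minimal numpy sketch of the first two motion-activity attributes, assuming the block motion vectors of a shot are pooled into one array; computing the direction from the mean motion vector (rather than averaging angles directly) is a simplification.

```python
import numpy as np

def motion_activity(motion_vectors):
    """Compute two motion-activity attributes for a shot.

    motion_vectors: (N, 2) array of (dx, dy) block motion vectors pooled
    over the frames of the shot.
    Returns (intensity, direction): intensity is the standard deviation of
    the motion-vector magnitudes; direction is an average motion angle in radians.
    """
    mv = np.asarray(motion_vectors, dtype=float)
    mag = np.hypot(mv[:, 0], mv[:, 1])
    intensity = mag.std()
    # Average direction via the mean motion vector (robust to angle wrap-around).
    direction = np.arctan2(mv[:, 1].mean(), mv[:, 0].mean())
    return intensity, direction
```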

Audio Descriptors

At the time of writing, the work on developing audio descriptors is still ongoing in the MPEG-7 audio group. It is planned, however, that Ds and DSs for four types of audio will be developed: music, pure speech, sound effects, and arbitrary sound tracks [74].


Concluding Remarks

In this article, we have reviewed several important aspects of multimedia content analysis using both audio and visual information. We hope that the readers will carry away two messages from this article: i) there is much to be gained by exploiting both audio and visual information, and ii) there are still plenty of unexplored territories in this new research field. We conclude the paper by outlining some of the interesting open issues.

Most of the approaches we reviewed are driven by application-dependent heuristics. One question that still lacks consensus among researchers in the field is whether it is feasible to derive some application-domain-independent approaches. We believe that at least at the low- and mid-level processing this is feasible. For example, feature extraction, shot-level segmentation, and some basic scene-level segmentation and classification are required in almost all multimedia applications requiring understanding of semantic content. The features and segmentation approaches that have been proven useful for one application (e.g., news analysis) are often appropriate for another application (e.g., movie analysis) as well. On the other hand, higher level processing (such as story segmentation and summarization) may have to be application specific, although even there, methodologies developed for one application can provide insights for another.

How to combine audio and visual information belongs to the general problem of multiple evidence fusion. Until now, most work in this area has been quite heuristic and application-domain dependent. A challenging task is to develop some theoretical framework for joint AV processing and, more generally, for multimodal processing. For example, one may look into theories and techniques developed for sensor fusion, such as the Dempster-Shafer theory of evidence, information theory regarding mutual information from multiple sources, and Bayesian theory.

Another potential direction is to explore the analogy between natural language understanding and multimedia content understanding. The ultimate goals of these two problems are quite similar: being able to meaningfully partition, index, summarize, and categorize a document. A major difference between a text medium and a multimedia document is that, for the same semantic concept, there are relatively few text expressions, whereas there could be infinitely many AV renditions. This makes the latter problem more complex and dynamic. On the other hand, the multiple cues present in a multimedia document may make it easier to derive the semantic content.

Acknowledgments

We would like to thank Howard Wactlar for providing information regarding CMU's Informedia project, Qian Huang for reviewing information regarding AT&T's Pictorial Transcript project, Rainer Lienhart for exchanging information regarding the MoCA project, Riccardo Leonardi for providing updated information regarding their research, and Philippe Salembier, Adam Lindsay, and B.S. Manjunath for providing information regarding MPEG-7 audio and visual descriptors and description schemes. This work was supported in part by the National Science Foundation through its STIMULATE program under Grant IRI-9619114.

Yao Wang (Senior Member) received the B.S. and M.S. degrees from Tsinghua University, Beijing, China, and the Ph.D. degree from the University of California at Santa Barbara, all in electrical engineering. Since 1990, she has been with Polytechnic University, Brooklyn, NY, where she is a Professor of Electrical Engineering. She is also a consultant with AT&T Bell Laboratories, now AT&T Labs–Research. She was on sabbatical leave at Princeton University in 1998 and was a visiting professor at the University of Erlangen, Germany. She was an Associate Editor for IEEE Transactions on Multimedia and IEEE Transactions on Circuits and Systems for Video Technology. She is a member of several IEEE technical committees. Her current research interests include image and video compression for unreliable networks, motion estimation and object-oriented video coding, signal processing using multimodal information, and image reconstruction problems in medical imaging. She was awarded the New York City Mayor's Award for Excellence in Science and Technology in the Young Investigator category in 2000.

Zhu Liu (Student Member) received the B.S. and M.S. degrees in electronic engineering from Tsinghua University, Beijing, China, in 1994 and 1996, respectively. Since September 1996, he has been a Ph.D. candidate in the Electrical Engineering Department at Polytechnic University. From May 1998 to August 1999, he was a consultant with AT&T Labs–Research; in 2000 he joined the company as a Senior Technical Staff Member. His research interests include audio/video signal processing, multimedia databases, pattern recognition, and neural networks. He is a member of Tau Beta Pi.

Jincheng Huang (Student Member) received the B.S. and M.S. degrees in electrical engineering from Polytechnic University, Brooklyn, NY, in 1994. Currently, he is pursuing the Ph.D. degree in electrical engineering at the same institution. Since 2000, he has been a Member of Technical Staff at Epson Palo Alto Laboratory, Palo Alto, CA. His current research interests include image halftoning, image and video compression, and multimedia content analysis. He is a member of Eta Kappa Nu.

References

[1] W. Hess, Pitch Determination of Speech Signals. New York: Springer-Verlag, 1983.
[2] S. Pfeiffer, S. Fischer, and W. Effelsberg, "Automatic audio content analysis," in Proc. 4th ACM Int. Conf. Multimedia, Boston, MA, Nov. 18-22, 1996, pp. 21-30.


[3] Z. Liu, Y. Wang, and T. Chen, “Audio feature extraction and analysis forscene segmentation and classification,” J. VLSI Signal Processing Syst. Signal,Image, Video Technol., vol. 20, pp. 61-79, Oct. 1998.

[4] E. Wold, T. Blum, D. Keislar, and J. Wheaton, “Content-based classifica-tion, search, and retrieval of audio,” IEEE Multimedia Mag., vol. 3, pp.27-36, Fall 1996.

[5] N. Jayant, J. Johnston, and R. Safranek, “Signal compression based onmodels of human perception,” Proc. IEEE, vol. 81, pp. 1385-1422, Oct.1993.

[6] E. Scheirer and M. Slaney, “Construction and evaluation of a robustmultifeatures speech/music discriminator,” in Int. Conf. Acoustic, Speech, andSignal Processing (ICASSP-97), vol. 2, Munich, Germany, Apr. 21-24,1997, pp. 1331-1334.

[7] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition.Englewood Cliffs, NJ: Prentice Hall, 1993.

[8] J.S. Boreczky and L.D. Wilcox, “A hidden Markov model framework forvideo segmentation using audio and image features,” in Proc. Int. Conf.Acoustics, Speech, and Signal Processing (ICASSP-98), vol. 6, Seattle, WA,May 12-15, 1998, pp. 3741-3744.

[9] J. Saunders, “Real-time discrimination of broadcast speech/music,” in Proc.Int. Conf. Acoustic, Speech, and Signal Processing (ICASSP-96), vol. 2, At-lanta, GA, May 7-10, 1996, pp. 993-996.

[10] T. Zhang and C.-C.J. Kuo, “Video content parsing based on combinedaudio and visual information,” in Proc. SPIE’s Conf. Multimedia Storage andArchiving Systems IV, Boston, MA, Sept. 1999, pp. 78-89.

[11] K. Minami, A. Akutsu, H. Hamada, and Y. Tonomura, “Video handlingwith music and speech detection,” IEEE Multimedia Mag., vol. 5, pp.17-25, July-Sept. 1998.

[12] R. Lienhart, S. Pfeiffer, and W. Effelsberg, “Scene determination based onvideo and audio features,” in Proc. IEEE Int. Conf. Multimedia Computingand Systems, vol. 1, Florence, Italy, June 7-11, 1999, pp. 685-690.

[13] Y. Chang, W. Zeng, I. Kamel, and R. Alonso, “Integrated image andspeech analysis for content-based video indexing,” in Proc. 3rd IEEE Int.Conf. Multimedia Computing and Systems, Hiroshima, Japan, June 17-23,1996, pp. 306-313.

[14] F. Idris and S. Panchanathan, “Review of image and video indexing tech-niques,” J. Visual Commun. Image Represent., vol. 8, pp. 146-166, June1997.

[15] Y. Rui, T. S. Huang, and S.-F. Chang, “Image retrieval: current technolo-gies, promising directions, and open issues,” J. Visual Commun. Image Rep-resent., vol. 10, pp. 39-62, Mar. 1999.

[16] X. Wan and C.-C. J. Kuo, “A new approach to image retrieval with hierar-chical color clustering,” IEEE Trans. Circuits Systems Video Technol., vol. 8,pp. 628-643, Sept. 1998.

[17] G. Pass, R. Zabih, and J. Miller, “Comparing images using color coher-ence vectors,” in Proc. 4th ACM Int. Conf. Multimedia, Boston, MA, Nov.5-9, 1996, pp. 65-73.

[18] Y. Chen and E.K. Wong, “Augmented image histogram for image andvideo similarity search,” in Proc. SPIE Conf. Storage and Retrieval for Imageand Video Database VII, San Jose, CA, Jan. 26-29, 1999, pp. 523-532.

[19] C.C. Gotlieb and H. E. Kreyszig, “Texture descriptors based onco-ocurrence matrices,” Comput. Vision, Graphics, Image Processing, vol. 51,pp. 70-86, July 1990.

[20] H. Tamura, S. Mori, and T. Yamawaki, “Texture features correspondingto visual perception,” IEEE Trans. System, Man, Cybernet., vol. SMC-8, no.6, 1978.

[21] B.M. Mehtre, M.S. Kankanhalli, and W.F. Lee, “Shape measures for content based image retrieval: A comparison,” Inform. Processing Manage., vol. 33, pp. 319-337, May 1997.

[22] A. Pentland, R.W. Picard, and S. Sclaroff, “Photobook: Content-based manipulation of image databases,” Int. J. Computer Vision, vol. 18, pp. 233-254, June 1996.

[23] E.M. Arkin, L. Chew, D. Huttenlocher, K. Kedem, and J. Mitchell, “An efficiently computable metric for comparing polygonal shapes,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 13, pp. 209-216, Mar. 1991.

[24] G.C.-H. Chuang and C.-C.J. Kuo, “Wavelet descriptor of planar curves: Theory and applications,” IEEE Trans. Image Processing, vol. 5, pp. 56-70, Jan. 1996.

[25] Y. Wang, J. Huang, Z. Liu, and T. Chen, “Multimedia content classification using motion and audio information,” in Proc. IEEE Int. Symp. Circuits and Systems (ISCAS-97), vol. 2, Hong Kong, June 9-12, 1997, pp. 1488-1491.

[26] D. Bordwell and K. Thompson, Film Art: An Introduction, 4th ed. New York: McGraw-Hill, 1993.

[27] F. Beaver, Dictionary of Film Terms. New York: Twayne, 1994.

[28] J. Huang, Z. Liu, and Y. Wang, “Integration of audio and visual information for content-based video segmentation,” in Proc. IEEE Int. Conf. Image Processing (ICIP-98), vol. 3, Chicago, IL, Oct. 4-7, 1998, pp. 526-530.

[29] C. Saraceno and R. Leonardi, “Audio as a support to scene change detection and characterization of video sequences,” in Proc. Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP-97), vol. 4, Munich, Germany, Apr. 21-24, 1997, pp. 2597-2600.

[30] T. Zhang and C.-C.J. Kuo, “Hierarchical classification of audio data for archiving and retrieving,” in Proc. Int. Conf. Acoustics, Speech, and Signal Processing, vol. 6, Phoenix, AZ, Mar. 15-19, 1999, pp. 3001-3004.

[31] B. Matityaho and M. Furst, “Neural network based model for classification of music type,” in Proc. 18th Conv. Electrical and Electronic Engineers in Israel, Tel Aviv, Israel, Mar. 7-8, 1995, pp. 4.3.4/1-5.

[32] T. Lambrou, P. Kudumakis, R. Speller, M. Sandler, and A. Linney, “Classification of audio signals using statistical features on time and wavelet transform domains,” in Proc. Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP-98), vol. 6, Seattle, WA, May 12-15, 1998, pp. 3621-3624.

[33] C. Saraceno and R. Leonardi, “Identification of story units in AV sequencies by joint audio and video processing,” in Proc. Int. Conf. Image Processing (ICIP-98), vol. 1, Chicago, IL, Oct. 4-7, 1998, pp. 363-367.

[34] C. Saraceno and R. Leonardi, “Indexing AV databases through a joint audio and video processing,” Int. J. Imaging Syst. Technol., vol. 9, pp. 320-331, Oct. 1998.

[35] H.J. Zhang, A. Kankanhalli, and S.W. Smoliar, “Automatic partitioning of video,” Multimedia Syst., vol. 1, no. 1, pp. 10-28, 1993.

[36] R. Lienhart, “Comparison of automatic shot boundary detection algorithms,” in Proc. SPIE Conf. Image and Video Processing VII, San Jose, CA, Jan. 26-29, 1999, pp. 290-301.

[37] H.A. Rowley, S. Baluja, and T. Kanade, “Neural network-based face detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, pp. 23-38, Jan. 1998.

[38] M. Turk and A. Pentland, “Eigenfaces for recognition,” J. Cognitive Neuroscience, vol. 3, no. 1, pp. 71-86, 1991.

[39] S. Fischer, R. Lienhart, and W. Effelsberg, “Automatic recognition of film genres,” in Proc. 3rd ACM Int. Conf. Multimedia, San Francisco, CA, Nov. 5-9, 1995, pp. 295-304.

[40] Z. Liu, J. Huang, and Y. Wang, “Classification of TV programs based on audio information using hidden Markov model,” in IEEE Workshop Multimedia Signal Processing (MMSP-98), Los Angeles, CA, Dec. 7-9, 1998, pp. 27-32.

[41] J. Huang, Z. Liu, Y. Wang, Y. Chen, and E.K. Wong, “Integration of multimodal features for video classification based on HMM,” in IEEE Workshop Multimedia Signal Processing (MMSP-99), Copenhagen, Denmark, Sept. 13-15, 1999, pp. 53-58.

[42] L.R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proc. IEEE, vol. 77, pp. 257-286, Feb. 1989.


[43] M.T. Chan, Y. Zhang, and T.S. Huang, “Real-time lip tracking and bimodal continuous speech recognition,” in IEEE Workshop Multimedia Signal Processing (MMSP-98), Los Angeles, CA, Dec. 7-9, 1998, pp. 65-70.

[44] G. Potamianos and H.P. Graf, “Discriminative training of HMM stream exponents for AV speech recognition,” in Proc. Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP-98), vol. 6, Seattle, WA, May 12-15, 1998, pp. 3733-3766.

[45] H.D. Wactlar, M.G. Christel, Y. Gong, and A.G. Hauptmann, “Lessons learned from building a terabyte digital video library,” IEEE Computer Mag., vol. 32, pp. 66-73, Feb. 1999.

[46] H.D. Wactlar, T. Kanade, M.A. Smith, and S.M. Stevens, “Intelligent access to digital video: Informedia project,” IEEE Computer Mag., vol. 29, pp. 46-52, May 1996.

[47] Available http://www.infomedia.cs.cmu.edu/html/main.html

[48] Available http://www.dli2.nsf.gov/index.html

[49] M. Hwang, R. Rosenfeld, E. Thayer, R. Mosur, L. Chase, R. Weide, X. Huang, and F. Alleva, “Improving speech recognition performance via phone-dependent VQ codebooks and adaptive language models in SPHINX-II,” in Proc. Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP-94), vol. 1, Adelaide, Australia, Apr. 19-22, 1994, pp. 549-552.

[50] A.G. Hauptmann and M.A. Smith, “Text, speech and vision for video segmentation: The Informedia project,” in Proc. AAAI Fall Symp. Computational Models for Integrating Language and Vision, Boston, MA, Nov. 10-12, 1995.

[51] T. Sato, T. Kanade, E.K. Hughes, M.A. Smith, and S. Satoh, “Video OCR: Indexing digital news libraries by recognition of superimposed caption,” ACM Multimedia Syst., vol. 7, no. 5, pp. 385-395, 1999.

[52] Y.H. Gong, G. Proietti, and C. Faloutsos, “Image indexing and retrieval based on human perceptual color clustering,” in Proc. Computer Vision and Pattern Recognition Conf., 1998, pp. 578-583.

[53] M. Smith and T. Kanade, “Video skimming and characterization through the combination of image and language understanding techniques,” in Proc. Computer Vision and Pattern Recognition Conf., San Juan, Puerto Rico, June 1997, pp. 775-781.

[54] M. Christel et al., “Evolving video skims into useful multimedia abstractions,” in Proc. CHI’98 Conf. Human Factors in Computing Systems, 1998, pp. 171-178.

[55] S. Satoh, Y. Nakamura, and T. Kanade, “Name-It: Naming and detecting faces in news videos,” IEEE Multimedia Mag., vol. 6, pp. 22-35, Jan.-Mar. 1999.

[56] M. Turk and A. Pentland, “Eigenfaces for recognition,” J. Cognitive Neuroscience, vol. 3, no. 1, pp. 71-86, 1991.

[57] B. Shahraray and D. Gibbon, “Pictorial transcripts: Multimedia processing applied to digital library creation,” in Proc. IEEE 1st Multimedia Signal Processing Workshop (MMSP-97), Princeton, NJ, June 23-25, 1997, pp. 581-586.

[58] B. Shahraray, “Scene change detection and content-based sampling of video sequences,” in Proc. SPIE Conf. Digital Video Compression: Algorithms and Technologies, 1995, pp. 2-13.

[59] Z. Liu and Q. Huang, “Detecting news reporting using AV information,” in Proc. IEEE Int. Conf. Image Processing (ICIP-99), Kobe, Japan, Oct. 1999, pp. 324-328.

[60] Z. Liu and Q. Huang, “Adaptive anchor detection using on-line trained AV model,” in Proc. SPIE Conf. Storage and Retrieval for Media Databases, San Jose, CA, Jan. 2000, pp. 156-167.

[61] Available http://www.informatik.uni-mannheim.de/informatik/pi4/projects/MoCA

[62] R. Lienhart, S. Pfeiffer, and W. Effelsberg, “Video abstracting,” Commun. ACM, vol. 40, pp. 55-62, Dec. 1997.

[63] S. Pfeiffer, R. Lienhart, S. Fischer, and W. Effelsberg, “Abstracting digital movies automatically,” J. Visual Commun. Image Represent., vol. 7, no. 4, pp. 345-353, 1996.

[64] R. Lienhart, “Automatic text recognition for video indexing,” in Proc. 4th ACM Int. Conf. Multimedia ’96, Boston, MA, Nov. 18-22, 1996, pp. 11-20.

[65] Available http://www.darmstadt.gmd.de/mobile/MPEG7

[66] F. Nack and A.T. Lindsay, “Everything you wanted to know about MPEG-7: Part I,” IEEE Multimedia Mag., vol. 6, pp. 65-77, July-Sept. 1999.

[67] F. Nack and A.T. Lindsay, “Everything you wanted to know about MPEG-7: Part II,” IEEE Multimedia Mag., vol. 6, pp. 64-73, Oct.-Dec. 1999.

[68] MPEG-7 Generic AV Description Schemes (v0.8), ISO/IEC JTC1/SC29/WG11 M5380, Dec. 1999.

[69] Supporting Information for MPEG-7 Description Schemes, ISO/IEC JTC1/SC29/WG11 N3114, Dec. 1999.

[70] MPEG-7 Visual Part of Experimentation Model (v4.0), ISO/IEC JTC1/SC29/WG11 N3068, Dec. 1999.

[71] F. Mokhtarian, S. Abbasi, and J. Kittler, “Robust and efficient shape indexing through curvature scale space,” in Proc. British Machine Vision Conf., Edinburgh, UK, 1996, pp. 53-62.

[72] Available http://www.ee.surrey.ac.uk/Research/VSSP/imagedb/demo.html

[73] A Rotation Invariant Geometric Shape Description Using Zernike Moments, ISO/IEC JTC1/SC29/WG11, P687, Feb. 1999.

[74] Framework for MPEG-7 Audio Descriptor, ISO/IEC JTC1/SC29/WG11 N3078, Dec. 1999.
