Multimedia Systems (2007) 12:439–457
DOI 10.1007/s00530-006-0066-5

REGULAR PAPER

Meeting browsing
State-of-the-art review

Matt-M. Bouamrane · Saturnino Luz

Published online: 24 October 2006
© Springer-Verlag 2006

Abstract Meeting, to discuss and share information, take decisions and allocate tasks, is a central aspect of human activity. Computer mediated communication offers enhanced possibilities for synchronous collaboration by allowing seamless capture of meetings, thus relieving participants from time-consuming documentation tasks. However, in order for meeting systems to be truly effective, they must allow users to efficiently navigate and retrieve information of interest from recorded meetings. In this article, we review the state of the art in multimedia segmentation, indexing and browsing techniques and show how existing meeting browser systems build on these techniques and integrate various modalities to meet their users’ information needs.

Keywords Multimedia segmentation · Indexing and retrieval · Multimodal meeting browsers

1 Introduction

The complexity of many projects performed in the workplace means that most tasks need to be carried out on a daily basis by teams involving people with various responsibilities and fields of expertise, sometimes residing in different places. Phases of individual work are punctuated by meetings of some sort to discuss progress, share ideas, take decisions, allocate tasks, etc.

M.-M. Bouamrane (B) · S. Luz
Department of Computer Science, Trinity College Dublin,
Dublin, Ireland
e-mail: [email protected]

S. Luz
e-mail: [email protected]

Meeting is thus a central aspect of professional activity. As computers have become ubiquitous tools for communication, possibilities for synchronous collaboration have been greatly enhanced and the complete capture of meetings could in principle free participants from distracting and time-consuming tasks such as note-taking and minute production. Indeed, there are many reasons why one would want to capture and archive meetings. An Internet survey carried out by [32] and involving more than 500 respondents sheds some light on the many reasons for storing and reviewing past meetings: keeping accurate records, checking the veracity and consistency of statements and descriptions, revisiting portions of the meeting which were misunderstood or not heard, re-examining past positions in the light of new information, obtaining proofs and recalling certain ideas are cited as the main reasons. However, recording meetings only solves part of the problem and, as the number of recorded meetings grows, so does the complexity of extracting meaningful information from such recordings. To provide access to multimodal recordings, one is faced with the challenge of structuring and integrating orthogonal (space and time based) media in a way that is intuitive for users. Continuous media, such as audio and video, are difficult to access for lack of natural reference points. Navigation in these media is time-consuming and can be confusing. Summarisation is a non-trivial process. A study of users’ browsing and searching strategies when accessing voicemail messages, sometimes of very short duration (30 s), showed that people had serious problems with local navigation of messages and difficulties remembering message content [75]. Many users performed time-consuming sequential listening of messages in order to find relevant information and often reported taking notes to remember content.

In contrast, users displayed improved browsing performance, playing less audio when speech recognition transcripts were available as audio indexes in the user interface [31]. Thus, to be truly effective, conferencing capture systems need to offer users efficient means of navigating recordings and accessing specific information.

There has been growing research interest in producing applications for visual mining of multimodal meeting data in order to support users’ meeting browsing requirements. Interaction modalities used in meetings, and thus the nature of the recorded media, will typically be dictated by the meetings’ physical setup (e.g. purpose-built meeting room [13,39] vs. Internet-based environments [15,19]) and meeting capture capabilities. In what follows, we review the state of the art in multimodal meeting browsers. We will use the taxonomy introduced by [65], where browsers are classified according to the focus of the browsing task, or primary mode used for the meeting data presentation. The three main browser categories are: audio browsers, video browsers and artefact browsers. The first two categories focus on communication modalities used in meetings and the contents they convey, while the third focuses on objects produced or manipulated during the meeting, such as notes, slides, drawings, plans, etc.

We first review the state of the art in the various existing segmentation, indexing and searching techniques for speech and video data. We then present some of the main existing multimedia and meeting browser applications, and discuss evaluation methods for such applications.

2 Speech browsing

Unlike space-based media, such as text and images, where one can quickly visually scan a page of text or a set of picture thumbnails to get a general impression of a document’s content, audio is a medium that does not lend itself well to visualisation. Audio recordings can be of very long duration, and multimedia databases may contain large numbers of such recordings. Accessing specific parts of audio documents is therefore particularly challenging because we often do not know what constitutes relevant information or where it is until we have actually heard it. Listening to entire audio recordings is, however, extremely time-consuming and in some cases simply not feasible. In what follows, we describe the state of the art in techniques for structured speech browsing. We define structured audio as the acoustic signal supplemented by an abstract representation which provides an overview of the recording, indications on the nature or importance of specific parts of the audio, and access to any location within the recording. Other comprehensive surveys of audio and speech access techniques can be found in [20,27,37].

2.1 Speaker segmentation

Visually representing audio in a meaningful manner is a particularly difficult task as there is no obvious or intuitive way of doing so. Graphically displaying an audio recording as a waveform would generally be inappropriate because, for most users, the audio signal spectrum offers no information about content. Some level of structuring can, however, be attained by common signal processing techniques. A frequently used strategy is to visually segment a meeting’s audio track according to participants’ contributions over time. This technique is known as speaker segmentation [30,44]. When the audio of various participants is recorded on a single track, speaker identification needs to be carried out prior to speaker segmentation. Speaker identification is the process of automatically distinguishing between participants’ voices in order to determine when the various talkers are active [77]. Audio browsers based on speaker segmentation will typically display a visual representation of talk spurts as horizontal bars over a timeline, identifying participants through thumbnail pictures, colours, etc. Clicking on a bar will play the corresponding audio segment. A user can choose to listen to neighbouring audio segments or specific contributions.

There are a number of limitations to speaker segmentation as a browsing modality. First of all, typical meetings will contain hundreds of speech exchanges, the majority of which have very short duration.

In order to visually distinguish between contributing speech sources, speaker segments will be represented on windows of short duration (e.g. a few minutes) whose timescales will be stretched in comparison to the overall meeting. Browsing through the audio file therefore implies scrolling across a large number of these audio segment windows. This can be confusing and might make it difficult for a user to develop a clear picture of the structure of the audio recording. The other limitation of speaker segmentation lies in the fact that individual contributions may be rendered meaningless without the context (other participants’ contributions) in which they were said. In accordance with the natural structure of discourse, it is reasonable to assume that audio segments in close time proximity are relevant to one another. This phenomenon can be interpreted in terms of the question–answer pair paradigm [70], where adjacent speech exchanges are considered more informative jointly than in isolation. Therefore, segmenting conversations by topic seems a more appropriate choice for a general audio browsing task.

However, speaker segmentation can be useful when additional information is available, such as a textual description of playback points, which can precisely identify the context of specific audio contributions. Roy and Malamud [56] have developed a system which maps text transcripts of proceedings of the United States House of Representatives to speaker transitions in the audio recordings. The transcripts are manually drafted in real time during the House’s sittings by a human transcriber and later edited. Participants’ precise contributions regarding a particular issue can be pinpointed by a text-to-audio alignment system which provides pointers to an audio database containing hundreds of hours of recordings. When selecting a portion of text from the transcripts, a user is presented with a list of audio contributors to which they can listen.

2.2 Speech skimming

Various time- and frequency-based signal processing techniques can be applied to an acoustic signal in order to alter the play-back rate of an audio recording [3]. Playing audio at a faster rate (time compression) will thus permit a user to listen to more in less time. There is of course a limit to play-back rate increase before audio becomes unintelligible. SpeechSkimmer [4] is a system which combines various speech processing techniques in order to navigate through audio recordings. The user can adjust the speed at which he listens to the recording: slower or faster than the normal rate. Playing the audio at an increased speed is achieved through time compression techniques involving sampling of the original signal, while the entire audio content itself is preserved. Skimming, on the other hand, involves playing only selected sections of the recording. Selection is based on acoustic cues of discourse such as pauses, voice pitch and speaker identification. The system offers the user several levels of skimming. The first level plays back the recording at the normal pace. In the second level, audio is segmented by speech detection: speech pauses (silence) under a certain duration threshold are removed, whereas longer ones are reduced to a set value (the duration threshold). The next levels attempt to take advantage of the natural structure and properties of discourse. Level 3 identifies salient points in the recording by considering longer pauses as juncture pauses, which would tend to indicate either a new topic or a new speech turn. Hence, the system will only play back (for a certain quantum of time) segments of speech which occurred after juncture pauses before jumping to the next one. Level 4 uses emphasis detection to identify salient segments of the recording. The emphasis detection algorithm is based on the speaker’s pitch, or voice fundamental frequency (F0). An adaptive threshold for each speaker is generated which identifies the frames of highest pitch within the recording. The system will then only play sentences containing these highest-pitch frames. These segmentation techniques are error-prone and will occasionally miss desired boundaries while mistakenly identifying others. The system compensates for these shortcomings by providing the user with additional navigation tools. These include a skimming-backward mechanism which plays back the audio normally but jumps to the previous segment. This functionality enables a user who has heard something of interest to pinpoint its precise location. The user can also jump forward to the next segment if he decides that the current segment is not relevant. Skimming audio through the highest levels of the system may be disorientating, as unrelated speech segments are played in fast succession. A usability study of SpeechSkimmer showed that users used the highest skimming levels to navigate through the audio in order to identify general topic locations, and then used lower skimming levels (normal play-back or pause compression and suppression) to listen to specific parts of the recording.
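The pause handling at the second level can be sketched as follows; this is a simplified illustration operating on lists of already-detected speech and silence segments rather than on the raw signal, and the threshold value and function names are our own, not taken from [4].

    def compress_pauses(segments, threshold=0.5):
        """segments: list of ('speech' | 'silence', duration_s) in playback order.
        Silences shorter than the threshold are dropped; longer silences are
        reduced to the threshold value, as in SpeechSkimmer's second level."""
        out = []
        for kind, duration in segments:
            if kind == "silence":
                if duration < threshold:
                    continue                  # drop short pauses entirely
                duration = threshold          # cap longer (juncture) pauses
            out.append((kind, duration))
        return out

    recording = [("speech", 3.2), ("silence", 0.3), ("speech", 1.1),
                 ("silence", 2.0), ("speech", 4.5)]
    print(compress_pauses(recording))
    # [('speech', 3.2), ('speech', 1.1), ('silence', 0.5), ('speech', 4.5)]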

2.3 Automatic speech recognition

The field of automatic speech recognition (ASR) has made significant progress in the last decade, evolving from single-speaker, discrete dictation systems with limited vocabulary for restricted domains to sophisticated systems that tackle speaker-independent, large vocabulary continuous speech recognition (LVCSR) tasks. Unconstrained LVCSR is a difficult task for a number of reasons, including speech disfluencies in spontaneous dialogues, lack of word or sentence boundaries, poor recording conditions, crosstalk, inappropriate language models, out-of-vocabulary items and variations in accent and pronunciation. These conditions combined can cause substantial decreases in recognition rates [21].

Speech recognition is the task of automatically identifying a sequence of spoken words from the speech signal [52,79,36]. In other words, recognition consists of finding the most likely word sequence w given the observed acoustic signal S.

A speech recognition process encompasses a number of successive steps based on a property of languages: the use of a limited number of phonemes (the smallest perceptual “building blocks” of words), of which typically 40 to 60 distinct phones (basic sounds) are identified. Phones can be modelled using a Hidden Markov Model (HMM) containing a number of states connected by transition arcs.

These models can be combined to form word models, which in turn can be combined into sentence models. The first step of the recognition task processes the audio signal and extracts a number of acoustic features over frames of a certain duration (typically 10 ms). Features are chosen for extraction according to their ability to discriminate between different phones. The observed acoustic features are subsequently translated into phone probabilities according to an acoustic model. The acoustic model consists of a pronunciation lexicon, where phones are usually divided into three states: beginning, middle and end. The triphone model further adds context to these states, whereby individual phones are influenced by the surrounding ones. The decoding stage outputs the most likely sequence of words according to the word pronunciation dictionary and a language model. The language model assigns prior probabilities to words according to some grammar inferred from a large corpus. A grammar defines allowable sequences of words and their probabilities. An example of such a grammar is the n-gram model, where the presence of a word is deemed to depend only on the n − 1 previous words. Probabilities of n-grams are thus computed by counting the number of occurrences of n successive word instances in a training corpus (word frequencies for a unigram model, word pair frequencies for a bigram model, etc.). As the underlying language model explicitly models inter-word relationships, a misrecognition will often lead to another.
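To make the n-gram estimation step concrete, the following minimal sketch counts bigram occurrences in a toy training corpus and derives conditional word probabilities; the corpus, sentence markers and function names are illustrative, not drawn from any of the surveyed systems.

    from collections import Counter

    def bigram_model(sentences):
        """Estimate P(w_i | w_{i-1}) by counting successive word pairs."""
        unigrams, bigrams = Counter(), Counter()
        for sentence in sentences:
            words = ["<s>"] + sentence.lower().split() + ["</s>"]
            unigrams.update(words[:-1])
            bigrams.update(zip(words[:-1], words[1:]))
        # Conditional probability: count(w1, w2) / count(w1)
        return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

    # Toy training corpus (hypothetical)
    corpus = ["the meeting starts now", "the meeting ends soon"]
    model = bigram_model(corpus)
    print(model[("the", "meeting")])    # 1.0: "meeting" always follows "the"
    print(model[("meeting", "starts")]) # 0.5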

Although LVCSR has produced very encouraging results for certain task-specific applications, serious challenges remain in recognising speaker-independent spontaneous speech in unconstrained domains. Current research focuses on building robust recognition systems by using automatic adaptation techniques, such as adaptation of acoustic models to speakers’ voices and speech rate fluctuations, language model adaptation and improved spontaneous speech modelling [22]. Despite the aforementioned shortcomings, ASR is a central component of many audio browsing systems. Typically, the ASR module is used to produce conversation transcripts for convenient user scanning, reading and other text-based information retrieval operations.

2.4 Word spotting

A keyword-based retrieval query offers an alternative paradigm to full LVCSR transcription. Word spotting consists of detecting the presence of a specific word or phrase in a speech corpus. This task is thus computationally far less expensive than generating full transcripts and may also be more appropriate for certain types of applications, such as querying a large audio database.

Two types of errors can occur with a word spotting system: a miss and a false alarm. A miss consists of not retrieving a particular keyword and a false alarm of wrongly recognising one. Tuning a system requires finding an acceptable trade-off between correct keyword detection (true-hit) and false-alarm rates. The receiver operating characteristic (ROC) is defined as the percentage of keyword detections as a function of the false-alarm rate (in fa/kw/h: false alarms per keyword per hour). A figure of merit (FOM) is calculated as the average value of the ROC curve between 0 and 10 fa/kw/h.
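As a worked illustration of the figure of merit, the FOM can be approximated by averaging the detection rate sampled across the 0 to 10 fa/kw/h band; the detection rates below are invented toy numbers, not results from the literature.

    # Detection rate (%) of a hypothetical word spotter sampled at
    # 1, 2, ..., 10 false alarms per keyword per hour (fa/kw/h).
    detection_at_fa = [42.0, 51.0, 57.0, 61.0, 64.0, 66.5, 68.5, 70.0, 71.0, 72.0]

    # Figure of merit: average of the ROC curve over the 0-10 fa/kw/h band,
    # approximated here by the mean of the sampled detection rates.
    fom = sum(detection_at_fa) / len(detection_at_fa)
    print(f"FOM = {fom:.1f}%")  # 62.3%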

The application of an HMM-based recognition system to keyword spotting will typically require building acoustic and language models for a pre-defined set of keywords and non-keywords, or fillers [54]. Spotting a keyword then consists of two phases: hypothesising when a keyword may occur in speech (putative hit) and subsequently assigning a score to the hypothesis. The hypothesis is accepted if the keyword score is above a rejection threshold. Thus, the output of a word spotter is a set of keywords and their time offsets, with everything in between considered as background words. Filler modelling is used to match arbitrary non-keywords present in speech and is crucial to the performance of the word spotter. Appropriate models will reduce the rate of false alarms, as shown by the comparative studies of filler models in [55]. Another decisive component in the performance of the word spotter is an appropriate scoring algorithm. DECIPHER [71] assigns a likelihood score to a hypothesised keyword by combining acoustic likelihood probability and language model probability, where the language model is trained by combining task-specific data (with high occurrences of the specific keywords) and task-independent data. The main disadvantage of LVCSR-based systems for word spotting is that they are computationally expensive and can only effectively recognise keywords if these are present in their lexicons.

To circumvent some of these shortcomings, an alternative approach to word spotting is off-line speech pre-processing to generate a phone lattice representation. The lattice representation consists of an output of multiple phone hypotheses at every speech frame, along with a likelihood score for each hypothesis [33]. The depth of the lattice can hence be set by preserving only the n best hypotheses. Thus, word spotting reverts to a keyword pronunciation match against each lattice. The main advantages of the phone lattice representation are that search is fast and that there are no restrictions on keywords. In [16], speech is initially converted off-line into a table of phone trigrams with acoustic scores. This is followed by a two-step search using the keyword’s phonetic transcription.

If the query term does not appear in a pronunciation dictionary, a spelling-to-sound database generates the likely phonetic representation of the word. The first step is a fast coarse match which identifies keyword locations according to the phone trigram index. In order to reduce the number of false alarms, this is followed by a detailed acoustic match.
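The coarse-match stage of such a trigram-indexed search can be sketched as follows. This is a simplified illustration under our own assumptions about the index layout and thresholds; it ignores acoustic scores and the detailed second pass described in [16].

    def phone_trigrams(phones):
        """Overlapping phone trigrams of a pronunciation, e.g. ['m','iy','t','ih','ng']."""
        return [tuple(phones[i:i + 3]) for i in range(len(phones) - 2)]

    def coarse_match(query_phones, trigram_index, window=50):
        """Return candidate frame regions where most of the query's trigrams occur.

        trigram_index maps a phone trigram to the list of frame offsets at which
        it was hypothesised during the off-line extraction pass."""
        query_tris = phone_trigrams(query_phones)
        hits = {}
        for tri in query_tris:
            for offset in trigram_index.get(tri, []):
                bucket = offset // window
                hits[bucket] = hits.get(bucket, 0) + 1
        # Keep regions matching more than half of the query trigrams (arbitrary cut-off);
        # a detailed acoustic match would then rescore these putative hits.
        needed = len(query_tris) // 2 + 1
        return sorted(b * window for b, n in hits.items() if n >= needed)

    # Hypothetical index and query pronunciation ("meeting" -> m iy t ih ng)
    index = {('m', 'iy', 't'): [120, 980], ('iy', 't', 'ih'): [121], ('t', 'ih', 'ng'): [122]}
    print(coarse_match(['m', 'iy', 't', 'ih', 'ng'], index))
    # [100]: candidate region around frame 120; the isolated hit near 980 is filtered out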

2.5 Topic segmentation

Automatic topic segmentation is the process of segmenting a (text or audio) document into regions of semantic relatedness. This is a difficult task for a number of reasons. First of all, as an abstract concept, a topic is difficult to define. Furthermore, it is a subjective notion, and topical annotation of documents by humans will often differ from annotator to annotator, particularly in the case of topic shifts. This is evidenced in [29], where seven readers who were asked to find the topical boundaries of a text document produced a variety of judgements. Research on topic detection and tracking (TDT) was originally targeted at newswire and news broadcasts and typically involves three distinct phases. The first is to segment data streams into self-contained coherent units. A second phase consists in detecting new (previously unseen) topics. This step can either be performed on-line (as the news is broadcast live) or retrospectively, on a corpus of samples. The final step consists of identifying whether incoming samples are related to a particular (target) topic. In the particular context of audio streams, all these operations are ideally performed on ASR transcripts for full automation. Therefore, the first audio segmentation step can be seen as a text segmentation task. The Dragon system [78] requires a topic model for the segmentation task. A topic is modelled with unigram statistics. A training set is clustered into different topics using a distance metric. If a sample’s distance to a given cluster is less than a certain threshold, the sample is included in the cluster and the cluster model is updated. If the distance is above the threshold, a new cluster is created. Once the topic model has been created, segmenting a stream is done by scoring stream frames against the topic model and detecting topic transitions. Another approach to segmentation, described in [2], measures shifts in the vocabulary. Each sentence of a text stream is run as a query against a local context analysis (LCA) thesaurus, which identifies and returns a number of semantically related words or concepts. Although some sentences of the original text have few or no common words, they may in fact share a number of concepts. The text is then indexed at the sentence level according to these features. A function of the feature offsets is then used as a heuristic measure of change in content. The chief advantage of this approach is that it is unsupervised. Its drawbacks are that it is computationally costly (a database query per sentence) and that LCA results for sentences with poor semantic value (a common feature of speech) are essentially random. Another approach is to model lexical features or “marker words” usually found at the start and end of topical segments in order to predict topic changes.
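A minimal sketch of this kind of threshold-based incremental clustering follows, assuming unigram topic models represented as word-frequency vectors and cosine distance; the metric, threshold value and toy samples are our own choices and do not reproduce the Dragon system's actual configuration.

    import math
    from collections import Counter

    def cosine_distance(a, b):
        """1 - cosine similarity between two word-count vectors."""
        dot = sum(a[w] * b.get(w, 0) for w in a)
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return 1.0 - (dot / norm if norm else 0.0)

    def cluster_topics(samples, threshold=0.6):
        """Incrementally assign each text sample to the nearest cluster,
        creating a new cluster when the distance exceeds the threshold."""
        clusters = []  # each cluster is a Counter of word frequencies (unigram model)
        for text in samples:
            vec = Counter(text.lower().split())
            best, best_dist = None, None
            for c in clusters:
                d = cosine_distance(vec, c)
                if best_dist is None or d < best_dist:
                    best, best_dist = c, d
            if best is not None and best_dist < threshold:
                best.update(vec)               # include sample and update cluster model
            else:
                clusters.append(Counter(vec))  # distance too large: new topic cluster
        return clusters

    docs = ["budget review quarterly figures", "quarterly budget figures approved",
            "new hiring plan for engineering"]
    print(len(cluster_topics(docs)))  # 2 topic clusters with these toy samples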

The approaches mentioned earlier make exclusive use of textual features while ignoring some of the specific characteristics of speech, such as prosody. Prosody (in linguistics) refers to phonological features of speech such as syllable length, intonation, stress and juncture, which convey structural and semantic information. In addition to lexical information obtained from speech recognition, Tur et al. [66] use prosodic features automatically extracted from speech for automatic topic segmentation. A distinctive advantage of using a prosodic model is that it is largely independent of the recognition task and therefore should be robust to recognition errors. The topic segmentation algorithm is implemented in two phases: the speech input is first segmented into sentences (speech units); then sentence boundaries are analysed to determine whether they coincide with a topical change. In effect, this approach reduces topic segmentation to a boundary classification problem, i.e. estimating the probability of a topic boundary given a word sequence and its set of prosodic features. To this end, a prosodic model needs to be created, built on the feasible extraction of prosodic features for a fully automated solution. A corpus with human-labelled topic boundaries was used in order to infer useful prosodic features. Features which were found to be important in identifying topic boundaries include pause duration at the boundary, pitch or fundamental frequency across the boundary, and the duration of the last phone before the boundary [59]. In addition, non-prosodic features which were available from the speech recogniser, such as speech turns and speaker gender, were also included in the model. The prosodic model was subsequently combined with a language model similar to the one used in [78]. The performance obtained was comparable to that of the best word-based systems.
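To illustrate the boundary-classification framing only, the schematic sketch below maps a few prosodic and non-prosodic cues to a boundary probability; the feature weights and the logistic combination are entirely made up and merely stand in for the decision-tree prosodic model and language model combination actually used in [66].

    import math

    def boundary_probability(pause_s, pitch_drop_hz, last_phone_ms, speaker_change):
        """Toy estimate of P(topic boundary) at a sentence boundary.
        Weights are illustrative only; see [66] for the real models."""
        score = (-3.0
                 + 2.5 * pause_s            # long pauses suggest juncture pauses
                 + 0.02 * pitch_drop_hz     # pitch reset across the boundary
                 + 0.005 * last_phone_ms    # pre-boundary phone lengthening
                 + 0.8 * speaker_change)    # speech-turn information from the recogniser
        return 1.0 / (1.0 + math.exp(-score))  # logistic squashing to [0, 1]

    print(round(boundary_probability(1.8, 60, 180, 1), 2))  # ~0.99: likely boundary
    print(round(boundary_probability(0.2, 5, 60, 0), 2))    # ~0.11: unlikely boundary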

2.6 Spoken language summarisation

Unlike automatic text summarisation, which has long been a subject of study, spoken language summarisation is a new research domain, with serious issues remaining to be solved. These include how to deal with speech disfluencies in spoken dialogue, sentence boundaries, information spanning several speakers and speech recognition errors. Speech disfluencies include non-lexicalised filled pauses (um, uh), lexicalised filled pauses (like), repetitions, substitutions and false starts.

DiaSumm [81] is a spoken language summarisation system comprising a number of stages. Audio recordings can theoretically be used as input; however, the results described below were obtained using human-generated transcripts with annotated topic boundaries [80]. The system first runs a part-of-speech (POS) tagger on the transcripts to identify disfluencies. Repetitions and discourse fillers are subsequently removed through a clean-up filter algorithm. The result of the POS tagger is then fed into the sentence boundary detection component. False starts are then detected and removed. Cross-speaker information detection consists of identifying question–answer pairs, which is done by first detecting questions and then the corresponding answers. Once all these steps are completed, the summarisation mechanism ranks sentences using a term frequency, inverse document frequency (TFIDF) based MMR ranking within topical segments. This algorithm is intended to extract salient parts of the document while avoiding redundancy. The TFIDF of a term tk with respect to a document Dj in a set of documents Tr is given by

tfidf(tk, Dj) = nt(tk, Dj) × log(|Tr| / nd(tk, Tr))

where nt(tk, Dj) is the number of times tk appears in Dj and nd(tk, Tr) is the number of documents from the set Tr with at least one occurrence of tk. TFIDF reflects the intuition that the more a term occurs in a document, the more representative of that document it is, and that the more a term occurs across various documents, the less discriminative it is. Maximum marginal relevance (MMR) [9] rewards “novelty” by allocating increased weight to a document if it is both relevant to the query and has little similarity with previously selected documents.

Valenza et al. [68] present a speech summarisation system which combines inverse term frequency with an acoustic confidence measure from the speech recogniser’s output. For a word to be included in a summary, it needs to have high probabilities of both relevance and correct recognition. The authors stress that, in order to produce useful summaries, a certain level of inaccuracy should be acceptable: giving too much weight to acoustic confidence risks omitting relevant information from the final summary. The acoustic confidence measure for a particular word is determined by the sum of phone probabilities for that word normalised by word duration. Summaries are generated on a per-minute basis to favour content spread over time rather than punctual information, which may be better adapted to the targeted audio (broadcast news). A summary can be a set of keywords (frequently occurring single words), n-grams (n words extracted from the audio transcript, with n determined by the user) or utterances (audio segments delimited by speaker or content change). The user interface provides a keyword list, a user-specified summary type as well as full text output. Selecting a keyword causes the relevant segments of the summary and full text to be highlighted. The user can also listen to the corresponding audio segment or to larger audio segments. The system thus provides audio indexing and summarisation.
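A schematic of this kind of combined scoring is shown below; the linear weighting, parameter values and toy numbers are our own and do not reproduce the exact combination formula of Valenza et al.

    def word_score(inverse_term_frequency, phone_probs, duration_s, alpha=0.5):
        """Score a recognised word for summary inclusion by combining a relevance
        term (an inverse-term-frequency weight) with an acoustic confidence:
        the sum of phone probabilities normalised by word duration."""
        acoustic_confidence = sum(phone_probs) / duration_s
        return alpha * inverse_term_frequency + (1 - alpha) * acoustic_confidence

    # A well-recognised, informative word vs. the same word poorly recognised (toy numbers).
    print(word_score(2.3, [0.9, 0.85, 0.8], 0.4))  # high relevance, high confidence
    print(word_score(2.3, [0.3, 0.2], 0.4))        # same relevance, low confidence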

An approach which, unlike the previous ones, does not rely on lexical recognition was introduced in [10]. It uses pitch and energy content to detect emphasis and creates summaries by selecting emphasised segments in temporal proximity.

3 Video browsing

A video document essentially consists of a succession of images over time, but will often contain additional modalities such as sound and text. Therefore, indexing a video document can be approached from the various modalities it may contain. Because we have covered techniques for audio document browsing in detail in the previous section, here we will describe visual and multimodal approaches to video indexing from a meeting recording point of view. Current approaches to video browsing for the most part employ techniques which rely on domain knowledge of video types and thus make a number of assumptions about features of the recording which limit their applicability to meeting browsing. The availability of closed captions in news broadcasts, for example, may be used to generate summaries using text-based techniques. Sports video indexing and highlight extraction techniques generally use heuristics valid only in the context of the rules, grammar and semantics of a specific sport (though more generic approaches to sports videos have also been investigated [28]). High motion, high pitch and increased audio volume may be used to identify action scenes in feature films but would be of limited use in recordings of typical meetings. In [62], techniques are reviewed which regard video from an author’s perspective, assuming a process of production and editing which defines documents with clear structure and semantics, where scenes and transitions can be identified. In [61], selection of static frames preceded by scenes of camera motion or zooming is among a number of heuristics used to choose frames of importance. Such assumptions, while valid in a production environment, are mostly inadequate in the case of automatic meeting recordings, which typically contain raw data captured from a number of unmanned fixed cameras. In [40], a survey of browsing behaviour for various types of video content concludes that, as information in conferences is essentially audio centric, visual features offer users only minimal cues on content.

In what follows, and unless otherwise stated, we present a number of techniques suitable for indexing, segmenting and browsing meeting recordings captured by unmanned static cameras in a conference room.

3.1 Visual indexing

A scene or shot can be defined as a succession of images which have been continuously filmed, and constitutes an intuitive fundamental unit of video. One common technique for automatic video segmentation is automatic shot boundary detection, which is achieved through the non-trivial task of measuring similarity (or rather dissimilarity) between successive frames over a certain number of features of the image (colour, texture, shapes, spatial features, motion, etc.). A number of reliable methods have been proposed to this end [1]. When a video document is created through a production process, changes between shots may not necessarily be clear cut but rather fading, dissolving or wiping transitions. In order to detect gradual transitions, algorithms for boundary detection generally include a dissimilarity accumulation function, with a boundary found if the function goes over a certain threshold. Once a video has been segmented into scenes, these can be characterised by a single image, chosen according to certain heuristics (e.g. choose the first frame, or a frame containing a face or a significant object). Time-varying spatial information can thus be translated into the spatial domain for convenient scanning through the use of keyframes, whereby each scene of a video can be represented with a single image, offering a visual summary of the recording. Although these methods can be valuable for feature films or video database indexing, their application to meeting recordings is certainly limited, as significant visual changes in a common meeting scenario are likely to be minor (e.g. drawing on the board); thus their discriminating power is weak and their semantics rather limited without additional (audio) information.
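A minimal sketch of threshold-based boundary detection over a per-frame dissimilarity signal, with an accumulation term for gradual transitions, is given below; it uses colour-histogram differences as one possible feature, and the threshold values are illustrative rather than taken from any of the methods surveyed in [1].

    def histogram_difference(h1, h2):
        """Sum of absolute bin differences between two normalised colour histograms."""
        return sum(abs(a - b) for a, b in zip(h1, h2))

    def detect_boundaries(histograms, cut_threshold=0.6, gradual_threshold=1.0):
        """Return frame indices of detected shot boundaries.

        A large single-step difference signals a hard cut; smaller differences are
        accumulated so that a slow fade or dissolve eventually crosses a second threshold."""
        boundaries, accumulated = [], 0.0
        for i in range(1, len(histograms)):
            d = histogram_difference(histograms[i - 1], histograms[i])
            if d >= cut_threshold:
                boundaries.append(i)          # abrupt change: hard cut
                accumulated = 0.0
            elif d > 0.1:                     # moderate change: possible gradual transition
                accumulated += d
                if accumulated >= gradual_threshold:
                    boundaries.append(i)
                    accumulated = 0.0
            else:
                accumulated = 0.0             # static scene: reset accumulation
        return boundaries

    # Three-bin toy histograms: a static shot, a slow dissolve, then a hard cut.
    frames = ([[1.0, 0.0, 0.0]] * 3 +
              [[0.8, 0.2, 0.0], [0.6, 0.4, 0.0], [0.4, 0.6, 0.0], [0.2, 0.8, 0.0], [0.0, 1.0, 0.0]] +
              [[0.0, 0.0, 1.0]])
    print(detect_boundaries(frames))  # [5, 8]: the dissolve is caught mid-way, the cut immediately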

Another promising research area in automatic meeting segmentation and indexing is concerned with identifying specific and significant meeting actions. McCowan et al. [49] propose using low-level audio and visual features (speech activity, energy, pitch and speech rate, face and hand blobs) to model meetings as a continuous sequence of high-level meeting group actions (monologue, presentation, discussion, etc.) using an HMM based on the interactions of individual participants.

3.2 Video summarisation and skimming

Similar to the familiar fast-forward feature of standard video players, time compression can be used to increase the speed of image play-back in digital video recordings. However, speed increase is inversely proportional to a viewer’s ability to understand the recording, and this technique will quickly result in a serious degradation of comprehension. It is also cumbersome for browsing lengthy recordings with no specific reference points. Another technique consists of displaying frames separated by a fixed time interval (e.g. 30 s), but this process is essentially random, might skip over crucial information, is generally confusing and remains time-consuming. In the study of video browsing behaviour carried out in [40], users navigated conference presentation recordings using essentially time compression and speech-based silence removal techniques. An interesting alternative to these fast-forwarding techniques is used in CueVideo [63]: sequences with low motion are sampled and played with fewer frames, and thus faster, than sequences with higher motion levels, in order to quickly skip over scenes with little content information and jump to significant ones. This technique seems particularly well adapted to meeting recordings, as they often contain long shots with little or no significant motion. The drawback of this technique is that faster video sequences cannot be synchronised with audio in an intelligible way. It remains, however, an efficient tool for navigation. The Informedia™ [61] project at Carnegie Mellon University offers search and retrieval of video documents from an online digital library. The system integrates image analysis and speech and language processing techniques to produce skims of video documents. Keywords are identified through audio transcripts and closed captions (where available) using TFIDF. A compressed audio track is then generated according to the location and duration of these keywords. The number of keywords retrieved therefore defines the duration of the skim. Once the audio track has been created, a corresponding video skim is generated. To avoid redundancy within close proximity, a keyword cannot be selected twice within a certain number of frames. A minimum (2 s) of matching video frames is played along with the keywords from the audio track for clarity, but the video segments are not necessarily time-aligned with the audio. Alternative frames of a corresponding scene are picked according to certain heuristics. These include prioritising introduction scenes, frames with human faces, static frames preceded by camera motion or zoom, etc. The compaction ratio is typically 10:1, but a ratio as high as 20:1 has been found to preserve essential information.
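A sketch of this motion-adaptive sampling idea follows; it is our own simplification (CueVideo's actual sampling policy is more elaborate), with frames from low-motion stretches dropped more aggressively than frames from high-motion stretches.

    def adaptive_sample(frames, motion, low_step=10, high_step=2, motion_threshold=0.3):
        """Keep every `high_step`-th frame where motion is high and every
        `low_step`-th frame where motion is low, so static stretches play faster."""
        kept, skip_until = [], 0
        for i, m in enumerate(motion):
            if i < skip_until:
                continue
            kept.append(frames[i])
            skip_until = i + (high_step if m >= motion_threshold else low_step)
        return kept

    # 60 toy frames: mostly static except a burst of motion around frames 30-40.
    motion = [0.05] * 30 + [0.8] * 10 + [0.05] * 20
    frames = list(range(60))
    print(adaptive_sample(frames, motion))
    # [0, 10, 20, 30, 32, 34, 36, 38, 40, 50]: denser sampling during the motion burst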

We have alluded to the fact that many domain knowledge assumptions used in broadcast news, feature film and sports video browsing may not fare well when used with unstructured meeting recordings. However, regular meetings in a given organisation may also follow a well-structured grammar, which can then be exploited for meeting summarisation. VidSum [57] uses regular patterns occurring in weekly staff forums (e.g. introduction by a first speaker, presentation by a second speaker, applause, questions and answers) to generate concise summaries of presentation recordings. Content analysis first extracts a number of visual features, which are then matched against a presentation library in order to find the most likely presentation structure. The next step is to populate a video template called a summary design pattern (SDP). Slots in the SDP are filled according to priority criteria (e.g. introduction, conclusion) until a pre-determined structure and the time constraints are met. The result is a concise, well-structured and pleasant-to-watch summary. However, evaluating the summarisation process remains difficult because, as remarked in [57], different templates may produce significantly different summaries.

4 Artefact browsing

This study is essentially concerned with reviewing automated solutions for accessing meeting recordings; therefore we are primarily interested in tools and techniques which require no additional effort from the participants during the actual meetings. However, there is another important category of meeting access tools, which we will refer to as meeting minutes systems. Minutes systems provide support for note-taking and meeting annotation, and thus for later access to meeting recordings, through active effort by the participants during the meeting process. Although a comprehensive review of all meeting support tools is beyond the scope of this article, we will describe a number of meeting minutes systems and analyse some implications of note-taking for browsing.

4.1 Meeting minutes systems

NoteLook [12] is a client-server system deployed in a media-enriched meeting room to support multimedia note-taking. Participants take notes during the meetings through the NoteLook client, which runs on wireless pen-based notebook computers. Presentation material displayed by a room projector, images of the whiteboard, video of the speaker standing at a presentation podium and room activity are some of the data which participants can incorporate into their personal minutes, either as still images or video streams (recorded by the server). Users can select which live video channel is displayed on the client, and still images can be incorporated into the notes either as thumbnails or as the page’s background image. For slide presentations, NoteLook provides an automatic note-taking option which captures any new slide transition and generates thumbnails of room activity at regular intervals for the duration of a slide. Images and pen strokes are timestamped and can therefore be used later to access the video recordings of the meetings. LiteMinutes [11] is an applet-based note-taking application running on a wireless personal computer. Meetings take place in a media-enriched conference room, and video, audio and slide images are among the multimedia items captured and stored on a server. Notes taken during the meeting are timestamped. They can be viewed in real time by other meeting participants (if a designated person acts as a scribe) and can also be revised later on. Notes taken on different laptops are handled separately. After the meeting, notes can be e-mailed to designated recipients and are also accessible through the capture server, which hyperlinks the notes to related media (slide, video) if these were active at the time of writing (smart-link). MinuteAid [39] is a meeting support system which enables participants to request and embed meeting multimedia items within a Word document during a meeting. Multimedia items which can be requested by the MinuteAid client running on a participant’s personal computer include projected slides, audio recordings, omni-directional video and whiteboard images. Slides can be obtained in real time, audio tracks require a 15 s delay, whereas video can only be obtained once the meeting has ended and the recording has been processed by the server. Once all data requests have been processed, participants can manipulate the minutes as a standard multimedia document.

4.2 Implications for browsing

In most meeting scenarios, participants will interact with artefacts of some sort to present and share information (slides), express and clarify ideas (whiteboard) or keep personal minutes (note-taking). Thus, actions associated with artefacts will generally coincide with significant meeting events and will convey strong semantic content. A number of researchers have investigated participants’ interactions with meeting artefacts as a means of segmenting, indexing and structuring meetings. Filochat [76] is a digital notebook which enables audio indexing of collocated meetings through note-taking. Time-indexed handwritten notes allow users to listen to concurrent segments of audio.

An important and unforeseen result of a usability study of the device was that some users made explicit indexing notes during meetings, when hearing subjects of potential interest, in order to revisit these specific points later on. Audio indexing according to note-taking activity is also implemented in the Audio Notebook [64], a paper notebook with a cordless pen coupled with a digital audio recorder. Audio indexing is complemented by speech skimming functionalities: speed control of audio play-back, phrase detection (which prevents audio from being played from the middle of a sentence) and topic shift detection, based on acoustic features of the audio recording (pitch, pauses and energy). Users of the Audio Notebook were able to use the functionalities provided by the system to successfully review recorded information, clarify ambiguous or misunderstood notes, and retrieve portions of audio which had been intentionally bookmarked. Classroom 2000 [8] is an educational system which aims to give students post-hoc access to the content of university lectures. The system provides access to audio and video recordings as well as additional information, such as web documents visited during the lecture and notes written on an electronic whiteboard. There are several levels of access to a lecture: slide transitions, which provide access to the audio for the duration of each slide; pen-stroke level, which provides access to audio for the writing duration; and word level. To facilitate navigation of recorded lectures, the system displays a timeline indexed with all significant events captured during the lecture.

5 Meeting browsers

Meeting browsers are systems that integrate some or all of the previously described technologies in order to provide information seekers with a unified interface to multimedia meeting archives. Such integration was usually based on ad hoc frameworks built around particular technologies, especially LVCSR. Although general models of multimedia storage have been proposed which explicitly tackle the integration issue, use of those models in meeting browsers has been somewhat limited. Multimedia modelling has, however, been an active research field, and today’s meeting browsers owe a great (although sometimes unacknowledged) deal to early work on media streams [26]. An elaboration of the concept of streams that combines object-oriented and relational database techniques into a unified model has been presented in [17]. Research by Jain and collaborators on image retrieval has been extended to define a unified semantics [58] which incorporates elements of interaction and context and thus lends itself well to the modelling of more general multimedia data, including meeting data [60]. A unifying effort along similar lines is presented in [50], and an approach that originated from meeting storage concerns is described in [43]. Models continue to be investigated and standards, such as MPEG-7 [45], are starting to emerge which target descriptive annotation and structuring of multimedia data. Nevertheless, of the systems reviewed below, only COMAP, HANMER and Meeting Miner can be said to be fully based on a general model. They employ the content mapping model proposed in [43]. As the area of meeting browser research matures, we expect models to play a more prominent role.

5.1 The Meeting Browser

The Meeting Browser [69] displays meeting transcripts time-aligned with the corresponding sound or video files. The browser comprises a number of components, including a speech transcription engine and an automatic summariser. The summariser attempts to identify salient parts of the audio and present the result to the user as a condensed script, or gist, of the meeting. The summariser takes a textual transcript as input, generated either manually or by a speech recognition run. The summarisation algorithm works as follows: identify the most common stems present in the transcript and then weight all speech turns accordingly. The turns with the highest weights are then included in the summary. These most common stems are then removed and the process is repeated over the turns not previously included, until the summary has reached a pre-defined size. Several experiments were designed in order to evaluate the summarisation system. The first task involved asking users to categorise 30 dialogues into a certain number of pre-defined categories according to a ten-turn summary of each dialogue. The authors report a precision of 92.8%. Another task asked users to answer a number of questions based on a summary of the dialogue. The dialogue transcript used for summarisation in this case was generated by speech recognition. The user could decide (and increase) the number of turns included in the summary. With the number of correct answers increasing with the number of speech turns included, the authors claimed that this demonstrated the potential of speech recognition output for summarisation while conveying the important points of a dialogue (Fig. 1).
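A rough sketch of this iterative stem-weighting procedure, as we read it from the description above, is given below; stemming is faked by truncation, and the tie-breaking and stem-removal details are our own choices rather than those of [69].

    from collections import Counter

    def stem(word):
        """Crude stand-in for a real stemmer: lower-case, strip punctuation, truncate."""
        return word.lower().strip(".,?!")[:5]

    def summarise(turns, size=2):
        """Repeatedly add the speech turn whose stems are currently most common,
        then discount those stems and continue until `size` turns are chosen."""
        remaining = list(range(len(turns)))
        stems = Counter(s for t in turns for s in map(stem, t.split()))
        summary = []
        while remaining and len(summary) < size:
            # Weight each remaining turn by the frequency of the stems it contains.
            best = max(remaining, key=lambda i: sum(stems[stem(w)] for w in turns[i].split()))
            summary.append(best)
            for w in turns[best].split():
                stems[stem(w)] = 0   # stems already covered no longer contribute
            remaining.remove(best)
        return [turns[i] for i in sorted(summary)]

    turns = ["we need to fix the budget figures",
             "the budget figures look wrong to me",
             "thanks everyone"]
    print(summarise(turns))  # the two budget-related turns are selected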

5.2 The SCAN system

The spoken content-based audio navigation (SCAN) system is “a system for retrieving and browsing speech documents from large audio corpora” [14,74].

Fig. 1 The Meeting Browser user interface

SCAN uses machine learning techniques over acoustic and prosodic features of 20 ms long audio segments to automatically detect intonational phrase boundaries. Intonational phrases are subsequently merged into intonational paragraphs, or paratones. The result of the intonational phrase segmentation is then fed into a speech recogniser (around 30% word error rate), whose automatic transcripts are used by a document retrieval system based on the vector space model of weighted terms. SCAN introduces several interesting mechanisms as an information retrieval system. The first one, called query expansion, adds related words (located within high-ranking documents) to users’ short queries. The second one, called document expansion, attempts to compensate for some of the errors due to speech recognition. It uses the best recognition output for a given audio document as a query on the audio database. The top 25% of words present in the original document’s word lattice (but not included in the final transcript) and in at least half of the highest-ranking documents retrieved by the query are subsequently added to the original document transcript. Both techniques improved the information retrieval task. SCAN’s user interface has three components: search, overview and transcript. The search component retrieves audio documents by matching users’ queries against the ASR transcripts of the documents contained in the database. The ten highest-ranking documents are displayed along with the number of hits (the number of query terms contained in the document transcript). The overview displays the audio document segmented along the paratones mentioned earlier, with their width proportional to their duration. Terms from the user query contained in the speech segments are represented by colour-coded rectangles, whose height is proportional to term frequency. The transcript view displays the paratones’ ASR transcripts. Clicking on them will play the corresponding audio segment (Fig. 2).
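A toy sketch of the hit counting behind the search display is shown below; this is our simplification (SCAN's actual retrieval uses a weighted vector space model rather than raw hit counts), and the document identifiers and transcripts are invented.

    def rank_by_hits(query, transcripts, top_n=10):
        """Rank audio documents by the number of query terms found in their
        ASR transcripts, returning (doc_id, hits) pairs for the top documents."""
        terms = set(query.lower().split())
        scored = []
        for doc_id, text in transcripts.items():
            words = set(text.lower().split())
            hits = len(terms & words)        # number of distinct query terms present
            if hits:
                scored.append((doc_id, hits))
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_n]

    transcripts = {"doc1": "the budget meeting covered travel costs",
                   "doc2": "travel plans were discussed briefly",
                   "doc3": "no relevant content here"}
    print(rank_by_hits("budget travel costs", transcripts))  # doc1 (3 hits) before doc2 (1 hit)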

5.3 Video Manga

Video Manga [5,67] is a video system which automatically creates pictorial summaries of video recordings. The system was primarily tested and evaluated on recordings of collocated meetings in a conference room, but it was found to also work well with other video genres (films, commercials). Although recordings were not edited after the meetings, an operator was in control of meeting capture and could pan and zoom as well as switch between a number of cameras and other displays. This would naturally tend to encourage the capture of highlights, which would not occur with unmanned fixed cameras. Video Manga generates summaries of a meeting as a chronologically ordered compact set of still images, similar to a comic strip, hence the name. Clicking on a specific keyframe will play the corresponding video segment. The keyframe extraction technique does not simply rely on shot boundary detection but on a colour histogram-based hierarchical clustering technique which identifies groups of similar frames, regardless of timing. Once video segments have been identified, an importance metric is used to reward segments if they are both long (a heuristic suited to the specific manned capture environment) and unusual.

Fig. 2 The SCAN user interface

Segments which score less than one-eighth (an empirical threshold) of the maximum-scoring segment are discarded (another option is to select precisely the number of segments included in the summary). For meeting recordings, this threshold led to discarding around 75% of the frames. In order to give higher visual importance to better-scoring segments, keyframe sizes in the final summary vary on a scale from one to three according to importance score. The selected frames are further reduced by removing consecutive frames from the same cluster and similar frames which are separated by only one single frame from another cluster (e.g. in dialogues). The frames’ importance scores can also be weighted according to the detection of humans, groups or slides in shots. Documents, such as slides, web pages and transparencies, displayed in the meeting room are captured every 5 s. The text from these documents is timestamped and can be used to label corresponding shots in the pictorial summary (Fig. 3).
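A sketch of how such importance scoring and thresholding might look is shown below. It is illustrative only: we use duration weighted by cluster rarity as the score and reuse the one-eighth cut-off quoted above, but the published metric and size mapping differ in detail.

    def summarise_segments(segments):
        """segments: list of (cluster_id, duration_s). Score each segment by its
        duration weighted by how unusual its cluster is, discard segments scoring
        below one-eighth of the maximum, and map scores to keyframe sizes 1-3."""
        cluster_counts = {}
        for cluster, _ in segments:
            cluster_counts[cluster] = cluster_counts.get(cluster, 0) + 1
        total = len(segments)
        scores = [dur * (1.0 - cluster_counts[c] / total) for c, dur in segments]
        cutoff = max(scores) / 8.0
        kept = []
        for (cluster, dur), score in zip(segments, scores):
            if score >= cutoff:
                size = 1 + round(2 * score / max(scores))   # keyframe size 1..3
                kept.append((cluster, dur, size))
        return kept

    # Toy segments: cluster id and duration; cluster "A" is common, "B" and "C" are rare.
    segments = [("A", 5), ("A", 4), ("B", 60), ("A", 6), ("C", 30), ("A", 3)]
    print(summarise_segments(segments))  # [('B', 60, 3), ('C', 30, 2)]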

5.4 The Portable Meeting Recorder: MuVie

The Portable Meeting Recorder [18,38] is a system that captures a panoramic view of meetings and detects speaker location in real time. Post-meeting processing of the recorded data generates a video skim which focuses on participants according to speech activity. As there would normally be minimal motion during a typical meeting, the authors argue that segments of higher motion potentially indicate significant events such as a participant joining the meeting or giving a presentation. Similarly, segments of higher speech volume may point to phases of intense discussion, particularly when coupled with information about speakers’ locations (a high number of exchanges). The MuVIE (Meeting VIEwer) user interface thus provides, among other information about the meeting (keyframes, transcripts), a visual representation over a timeline of audio and visual activity and speakers’ turns. A meeting summary can also be generated by playing back in time order the video segments containing the highest visual or audio activity and the highest-ranking keywords extracted from the meeting transcripts (Fig. 4).

5.5 The MeetingViewer

The MeetingViewer [24] is a client application for browsing meetings recorded with the TeamSpace [25,53] online conferencing system. The TeamSpace client provides low-bandwidth video for awareness feedback and supports the use of a number of artefacts, such as sharing and annotating slide presentations, creating and editing agendas and meeting action items, and inserting bookmarks. In addition to session events (joining or leaving the meeting), all interaction events performed on the client are automatically recorded and timestamped by the server. These events are subsequently used to index the meeting and are displayed on a timeline on the MeetingViewer interface to facilitate navigation.

Fig. 3 A meeting summary produced by Video Manga

The user can thus choose relevant sections of the meeting. Playback will play the corresponding segments of the audio and video recording along with all concurrent meeting events. Specific artefacts may be picked for viewing through the use of a tabbed pane (Fig. 5).

5.6 COMAP and HANMER

The COMAP (COntent MAPper) [46,43] is a system for browsing captured online speech and text meetings using the concepts of temporal neighbourhoods and contextual neighbourhoods. These concepts are based on viewing meeting data as a collection of discrete events, or segments. A temporal neighbourhood is defined as concurrent media events as well as segments related to these events. Segments are in a contextual neighbourhood if they share some content features (keywords). The system takes as input an audio recording along with an XML file containing detailed metadata about participants' edits and gesturing (telepointing) actions. These action metadata are automatically generated by RECOLED [47], a shared-text editor designed for this purpose. The user interface displays the textual outcome of the co-authoring task along with mosaic timeline views of the participants' speech and editing activities. To browse a meeting, a user can click on a portion of text, which will highlight audio segments in the temporal neighbourhood of that text segment. The user can listen to an audio segment by clicking on it, which in turn may highlight potential concurrent editing operations. An interleave factor (IF) metric [41] measures levels of concurrent media activity, with intervals of greater activity deemed to be of greatest significance. A summary view of a recording can be generated through IF ranking. HANMER (HANd held Meeting browsER) [48,42] provides the same functionality as COMAP but was designed for portable devices (Fig. 6).
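
A minimal sketch of the two neighbourhood relations, together with a crude stand-in for interleave-factor ranking (the actual IF definition is given in [41]), might look as follows; segments are assumed to carry 'start', 'end' and 'keywords' fields.

    def overlaps(a, b):
        """True if two timestamped segments are concurrent."""
        return a['start'] < b['end'] and b['start'] < a['end']

    def temporal_neighbourhood(seg, segments):
        """Media events concurrent with seg (related segments could be added too)."""
        return [s for s in segments if s is not seg and overlaps(s, seg)]

    def contextual_neighbourhood(seg, segments):
        """Segments sharing at least one content keyword with seg."""
        return [s for s in segments if s is not seg and seg['keywords'] & s['keywords']]

    def interleave_score(interval, segments):
        """Crude stand-in for the IF metric: the number of media segments active
        in the interval, so densely interleaved speech and edits rank highest."""
        start, end = interval
        return sum(1 for s in segments if s['start'] < end and start < s['end'])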

5.7 WorkspaceNavigator

WorkspaceNavigator [35] is designed to provide access to information on loosely structured collaborative design projects which lasted over a long period of time in a designated workplace.

Fig. 4 The MuVIE client user interface

Fig. 5 The MeetingViewer user interface

Fig. 6 Hanmer user interface

Unlike most of the other systems described here, the data recorded for meeting documentation do not include audio and video media streams but rather discrete events. This design choice is motivated by (1) the fact that, given the long duration of the design process, recording live streams of all activities would produce a prohibitive amount of data, and (2) the assumption that still images are often sufficient to jog participants' memories. Information on the design process is captured implicitly, but participants can also explicitly capture specific events, should they wish to do so, for later reference. Implicit data capture is performed every 30 s and includes an overview image of the activity in the workplace, motion events, computer screenshots, opened files and web resources, and shots of operations performed on the whiteboard. In addition, participants can choose to capture the state of the whiteboard and integrate images and annotations into the project's documentation at any time. A number of usability studies performed on WorkspaceNavigator demonstrated the usefulness of implicit discrete information capture for design process documentation, data recovery and retrieval of specific information items (Fig. 7).
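
A sketch of such an implicit capture cycle is given below, under the assumption that collectors for the individual information sources exist; the collector names and snapshot fields are hypothetical, not WorkspaceNavigator's actual components.

    import threading
    import time

    CAPTURE_INTERVAL = 30.0   # seconds, matching the implicit capture cycle described above

    def capture_snapshot():
        """Hypothetical collectors; the real system records an overview image,
        screenshots, open documents and URLs, motion events and whiteboard shots."""
        return {
            "timestamp": time.time(),
            "overview_image": None,     # placeholder for a camera grab
            "screenshots": [],
            "open_resources": [],
        }

    def capture_loop(store, stop_event):
        """Append one snapshot every CAPTURE_INTERVAL seconds until asked to stop."""
        while not stop_event.is_set():
            store.append(capture_snapshot())
            stop_event.wait(CAPTURE_INTERVAL)

    # Example wiring:
    # snapshots, stop = [], threading.Event()
    # threading.Thread(target=capture_loop, args=(snapshots, stop), daemon=True).start()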

5.8 The Ferret Media Browser

The Ferret Media Browser [72] is a client-server application for browsing recorded collocated multimodal meetings. Recorded data include video, audio, slides (on a computer projection screen) as well as whiteboard strokes and individual note-taking (digital pen strokes), which are timestamped. A tabletop microphone array permits speaker identification. Upon starting the Ferret browser, the user can pick a combination of any available media for display and synchronised playback. ASR transcripts, keyword search and speech segmented according to speaker identity are also available. The user can zoom in on particular parts of the meeting. Media streams can be dynamically added to or removed from the display during the browsing task. Other data sources can also be accessed through the Internet (Fig. 8).
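
The sketch below illustrates the general idea of driving a user-selected set of timestamped media streams from a single clock, with streams added or removed during browsing; it is a toy model, not Ferret's actual client-server architecture.

    class SyncPlayer:
        """Toy synchronised playback controller: each selected stream supplies a
        render(t) callable that seeks its own medium (video frame, audio chunk,
        slide, pen strokes) to time t, so all streams stay aligned."""

        def __init__(self):
            self.streams = {}                     # name -> render(t) callable

        def add(self, name, render):              # streams can be added mid-session
            self.streams[name] = render

        def remove(self, name):                   # ... or dropped from the display
            self.streams.pop(name, None)

        def play(self, start, duration, step=0.04):
            t = start
            while t < start + duration:
                for render in list(self.streams.values()):
                    render(t)                     # render every active stream at time t
                t += step

    # Example: player = SyncPlayer(); player.add("slides", lambda t: None); player.play(60, 10)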

5.9 The Meeting Miner

Meeting Miner [6] is a tool designed for navigating recordings of online text-and-speech collaborative meetings. Meetings are recorded through a lightweight collaborative writing environment which can easily be installed on a personal computer and which was specially designed to capture editing activities [7]. Temporal information from the logs of actions captured on self-contained information items (paragraphs of text) is used to uncover potential information links between these semantic data units. The audio recordings can be accessed by exploring a hierarchical tree structure generated for each paragraph through audio and activity linkage, by a navigation scheme based on participants' discrete space-based actions, by keyword indexing displayed above participants' speech exchanges on the timeline, and by keyword and topic search (Fig. 9).
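
A sketch of the underlying linkage step, under the assumption that each paragraph carries the timestamps of the edits logged on it, might link a paragraph to the speech turns that overlap those edits within a small margin; the margin value and function names are illustrative, not Meeting Miner's actual code.

    def audio_links(edit_times, speech_turns, margin=5.0):
        """edit_times: timestamps (s) of edits logged on one paragraph;
        speech_turns: list of (start, end, speaker) tuples from the audio recording.
        Returns the turns overlapping any edit, widened by `margin` seconds,
        as candidate audio links for that paragraph."""
        links = []
        for start, end, speaker in speech_turns:
            if any(start - margin <= t <= end + margin for t in edit_times):
                links.append((start, end, speaker))
        return links

    # Example: audio_links([130.0], [(100, 118, "A"), (118, 140, "B")]) -> [(118, 140, 'B')]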

6 Meeting browsers evaluation

Meeting browser systems are notoriously hard to evaluate. Unlike speech recognition and spoken document retrieval, for which the TREC 6–8 (Text REtrieval Conference) tracks [23] set precise evaluation tasks, with specific evaluation metrics, on well-defined corpus collections, the diversity of multimodal meeting recordings and browsing strategies makes defining evaluation metrics and comparing systems impractical.

Fig. 7 WorkspaceNavigator user interface

Fig. 8 The Ferret media browser

System comparisons have been confined to assessments of information retrieval performance on multimodal meeting browsers against a baseline system, typically one based on a tape-recorder interface metaphor, as in [74]. A more common, less constrained but inherently less comparable approach is evaluation by usability testing. Tasks employed in usability testing have included using the browser for identifying the topic of a conversation [81], classifying media items into pre-defined categories [69], answering specific questions about meetings (quiz) [18], locating specific information items [5], and producing meeting summaries. Evaluation focuses on user feedback, such as rankings of user interface features according to perceived usability [5] and overall impressions of system performance [68], which give a good indication of how fit a system is for general use. In [67], manual minutes generated by a scribe during the meetings are used as a benchmark.

Fig. 9 The Meeting Miner

Automatically generated meeting summaries were analysed to quantify the number of significant events they were able to convey. Information contained in the minutes which could not be inferred from the complete meeting recording (e.g. information external to the meeting or outside camera range) was not taken into account in the performance measure, as it could not possibly have been included in the video summary.
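
A minimal sketch of this kind of benchmark scoring is shown below; the set-based formulation and names are our own rather than taken from [67]. Only minute items that are in principle recoverable from the recording count towards the denominator.

    def summary_coverage(minute_items, summary_items, recoverable_items):
        """Fraction of scribe-minute items conveyed by an automatic summary,
        counting only items that could in principle be inferred from the
        recording (i.e. meeting-internal and within camera range)."""
        eligible = set(minute_items) & set(recoverable_items)
        covered = eligible & set(summary_items)
        return len(covered) / len(eligible) if eligible else 0.0

    # Example: 2 of 3 recoverable minute items appear in the summary -> 0.67
    # summary_coverage({"decision-1", "action-2", "demo"}, {"decision-1", "demo"},
    #                  {"decision-1", "action-2", "demo"})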

Recently, more systematic approaches to comparing meeting browser performance have started to emerge. A strategy is described in [73] which suggests using the number of observations of interest uncovered by system users in a certain period of time as an evaluation metric. A test is proposed which can be described as follows. Human observers first review recorded meetings and identify observations of interest, from which true–false questions are derived. Test subjects are then asked to answer as many of these questions as they can in a period corresponding to half the duration of the meeting. Although this metric is general enough to be used by most meeting browsers, it alone does not solve issues relating to the diversity of corpora and access modalities, and therefore does not suffice for performance comparison. Standard meeting corpora, such as the ICSI meeting corpus [34], have become available which might help alleviate this problem. A crucial issue relating to usefulness and structure in browsing tasks is that of how to locate and "salvage" (recover with the purpose of creating a summary of the meeting) observations of interest. An interesting study on this issue is presented in [51]. Although that investigation is set in the context of designing meeting capture support tools, it offers valuable insight into how meeting browsers can be evaluated.
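
A sketch of scoring one session of such a test is given below; the exact scoring rules used in [73] may differ, so the accuracy and speed measures here should be read as assumptions.

    def browser_test_score(answers, truth, meeting_duration, time_used):
        """answers: {question_id: bool} given by the subject; truth: reference
        {question_id: bool} derived from the observations of interest.
        Subjects work within half the meeting duration, as described above."""
        if time_used > meeting_duration / 2:
            raise ValueError("time budget exceeded")
        correct = sum(1 for q, a in answers.items() if truth.get(q) == a)
        attempted = len(answers)
        return {
            "correct": correct,
            "accuracy": correct / attempted if attempted else 0.0,
            "correct_per_minute": correct / (time_used / 60.0),  # speed measure (assumption)
        }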

7 Conclusion

We have presented an overview of existing methods for segmentation, indexing and searching of captured multimedia meeting data and introduced a number of browsing systems, underlining their individual approaches to integrating various modalities for navigation of meeting recordings. Multimodal meeting browsing is currently an active research area with many open issues. While it would have been impossible to review all contributions in this area, we hope this survey will prove useful in indicating general trends in multimedia search and retrieval, information visualisation, seamless integration of multiple modalities, meeting interaction modelling, and evaluation methods which are essential to today's meeting browsing systems.

Acknowledgments The authors would like to thank the Meeting Browser, ScanMail, Video Manga, Portable Meeting Recorder: MuVie, TeamSpace-MeetingViewer, WorkspaceNavigator and Ferret development teams for kindly giving us permission to reproduce screenshots of their systems. We would also like to thank the reviewers of the article for their valuable comments and suggestions. This work has been supported by Enterprise Ireland through a Basic Research Grant.

References

1. Aigrain, P., Zhang, H., Petkovic, D.: Content-based representation and retrieval of visual media: a state-of-the-art review. Multimed. Tools Appl. 3, 179–202 (1996)

2. Allan, J., Carbonell, J., Doddington, G., Yamron, J., Yang, Y.: Topic detection and tracking pilot study: final report. In: Proceedings of the DARPA broadcast news transcription and understanding workshop (1998)

3. Arons, B.: Techniques, perception, and applications of time-compressed speech. In: Proceedings of conference of American voice I/O society, pp. 169–177 (1992)

4. Arons, B.: Speechskimmer: a system for interactively skimming recorded speech. ACM Trans. Comput. Hum. Interact. 4(1), 3–38 (1997)

5. Boreczky, J., Girgensohn, A., Golovchinsky, G., Uchihashi, S.: An interactive comic book presentation for exploring video. In: Proceedings of CHI'00: human factors in computing systems, pp. 185–192. ACM Press (2000)

6. Bouamrane, M.M., Luz, S.: Navigating multimodal meeting recordings with the Meeting Miner. In: Proceedings of flexible query answering systems, FQAS'2006, LNCS, vol. 4027, pp. 356–367. Springer, Berlin Heidelberg New York (2006)

7. Bouamrane, M.M., Luz, S., Masoodian, M., King, D.: Supporting remote collaboration through structured activity logging. In: Hai Zhuge, G.C.F. (ed.) Proceedings of 4th international conference on grid and cooperative computing, GCC 2005, LNCS, vol. 3795, pp. 1096–1107. Springer, Berlin Heidelberg New York (2005)

8. Brotherton, J.A., Bhalodia, J.R., Abowd, G.D.: Automated capture, integration, and visualization of multiple media streams. In: Proceedings of the international conference on multimedia computing and systems, ICMCS '98, p. 54. IEEE Computer Society (1998)

9. Carbonell, J., Goldstein, J.: The use of MMR, diversity-based reranking for reordering documents and producing summaries. In: Proceedings of the 21st ACM SIGIR conference on research and development in information retrieval, SIGIR'98, pp. 335–336. ACM Press (1998)

10. Chen, F., Withgott, M.: The use of emphasis to automatically summarize a spoken discourse. In: Proceedings of IEEE conference on acoustics, speech, and signal processing, ICASSP'92, vol. 1, pp. 229–232 (1992)

11. Chiu, P., Boreczky, J., Girgensohn, A., Kimber, D.: LiteMinutes: an Internet-based system for multimedia meeting minutes. In: Proceedings of the 10th international conference on World Wide Web, WWW '01, pp. 140–149. ACM Press (2001)

12. Chiu, P., Kapuskar, A., Reitmeier, S., Wilcox, L.: NoteLook: taking notes in meetings with digital video and ink. In: Proceedings of the 7th ACM international conference on multimedia (Part 1), MULTIMEDIA '99, pp. 149–158. ACM Press (1999)

13. Chiu, P., Kapuskar, A., Wilcox, L., Reitmeier, S.: Meeting capture in a media enriched conference room. In: CoBuild '99: Proceedings of the 2nd international workshop on cooperative buildings, integrating information, organization, and architecture, pp. 79–88. Springer, Berlin Heidelberg New York (1999)

14. Choi, J., Hindle, D., Pereira, F., Singhal, A., Whittaker, S.: Spoken content-based audio navigation (SCAN). In: Proceedings of the ICPhS-99 (1999)

15. Cutler, R., Rui, Y., Gupta, A., Cadiz, J.J., Tashev, I., wei He, L., Colburn, A., Zhang, Z., Liu, Z., Silverberg, S.: Distributed meetings: a meeting capture and broadcasting system. In: ACM multimedia, pp. 503–512. ACM Press (2002)

16. Dharanipragada, S., Roukos, S.: A multistage algorithm for spotting new words in speech. IEEE Trans. Speech Audio Process. 10(8), 542–550 (2002)

17. Dionisio, J.D.N., Cardenas, A.F.: Unified data model for representing multimedia, timeline, and simulation data. IEEE Trans. Knowl. Data Eng. 10(5), 746–767 (1998)

18. Erol, B., Lee, D.S., Hull, J.J.: Multimodal summarization of meeting recordings. In: Proceedings of international conference on multimedia and expo, ICME '03, vol. 3, pp. 25–28 (2003)

19. Erol, B., Li, Y.: An overview of technologies for e-meeting and e-lecture. In: IEEE international conference on multimedia and expo, pp. 1000–1005 (2005)

20. Foote, J.: An overview of audio information retrieval. In: ACM multimedia systems, vol. 7, pp. 2–10 (1999)

21. Furui, S.: Automatic speech recognition and its application to information extraction. In: Proceedings of the 37th annual meeting of the association for computational linguistics, pp. 11–20. ACL (1999)

22. Furui, S.: Robust methods in automatic speech recognition and understanding. In: Proceedings EUROSPEECH, vol. III, pp. 1993–1998 (2003)

23. Garofolo, J.S., Voorhees, E.M., Auzanne, C.G., Stanford, V.M.: Spoken document retrieval: 1998 evaluation and investigation of new metrics. In: Proceedings of ESCA ETRW on accessing information in spoken audio, pp. 1–7 (1999)

24. Geyer, W., Richter, H., Abowd, G.D.: Making multimedia meeting records more meaningful. In: Proceedings of international conference on multimedia and expo, ICME '03, vol. 2, pp. 669–672 (2003)

25. Geyer, W., Richter, H., Fuchs, L., Frauenhofer, T., Daijavad, S., Poltrock, S.: A team collaboration space supporting capture and access of virtual meetings. In: Proceedings of the 2001 international conference on supporting group work, GROUP '01, pp. 188–196. ACM Press (2001)

26. Gibbs, S., Breiteneder, C., Tsichritzis, D.: Data modeling of time-based media. ACM SIGMOD Record 23(2), 91–102 (1994)

27. Goldman, J., Renals, S., Bird, S., de Jong, F., Federico, M., Fleischhauer, C., Kornbluh, M., Lamel, L., Oard, D., Stewart, C., Wright, R.: Accessing the spoken word. Int. J. Digit. Libr. 5(4), 287–298 (2005)

28. Hanjalic, A.: Generic approach to highlights extraction from a sport video. In: Proceedings of international conference on image processing, ICIP 2003, vol. 1, pp. 1–4. IEEE Press (2003)

29. Hearst, M.A.: Multi-paragraph segmentation of expository text. In: Proceedings of the 32nd annual meeting of the association for computational linguistics, pp. 9–16. ACL (1994)

30. Hindus, D., Schmandt, C.: Ubiquitous audio: capturing spontaneous collaboration. In: Proceedings of the 1992 ACM conference on computer-supported cooperative work, CSCW '92, pp. 210–217. ACM Press (1992)

31. Hirschberg, J., Whittaker, S., Hindle, D., Pereira, F., Singhal, A.: Finding information in audio: a new paradigm for audio browsing and retrieval. In: Mani, I., Maybury, M.T. (eds.) Proceedings of the ESCA workshop: accessing information in spoken audio, pp. 117–122 (1999)

32. Jaimes, A., Omura, K., Nagamine, T., Hirata, K.: Memory cues for meeting video retrieval. In: CARPE'04: Proceedings of the 1st ACM workshop on continuous archival and retrieval of personal experiences, pp. 74–85. ACM Press (2004)

33. James, D.A., Young, S.J.: A fast lattice-based approach to vocabulary independent wordspotting. In: Proceedings of international conference on acoustics, speech, and signal processing, ICASSP-94, vol. 1, pp. 377–380 (1994)

34. Janin, A., Ang, J., Bhagat, S., Dhillon, R., Edwards, J., Macias-Guarasa, J., Morgan, N., Peskin, B., Shriberg, E., Stolcke, A., Wooters, C., Wrede, B.: The ICSI meeting project: resources and research. In: NIST ICASSP meeting recognition workshop (2004)

35. Ju, W., Ionescu, A., Neeley, L., Winograd, T.: Where the wild things work: capturing shared physical design workspaces. In: CSCW '04: Proceedings of the 2004 ACM conference on computer supported cooperative work, pp. 533–541. ACM Press (2004)

36. Jurafsky, D., Martin, J.H.: Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition. Prentice-Hall, Englewood Cliffs (2000)

37. Koumpis, K., Renals, S.: Content-based access to spoken audio. IEEE Signal Proc. Mag. 22(5), 61–69 (2005)

38. Lee, D.S., Erol, B., Graham, J., Hull, J.J., Murata, N.: Portable meeting recorder. In: Proceedings of the 10th ACM international conference on multimedia, MULTIMEDIA '02, pp. 493–502. ACM Press (2002)

39. Lee, D.S., Hull, J., Erol, B., Graham, J.: Minuteaid: multimedia note-taking in an intelligent meeting room. In: IEEE international conference on multimedia and expo, vol. 3, pp. 1759–1762. IEEE Press (2004)

40. Li, F.C., Gupta, A., Sanocki, E., wei He, L., Rui, Y.: Browsing digital video. In: CHI '00: Proceedings of the SIGCHI conference on human factors in computing systems, pp. 169–176. ACM Press (2000)

41. Luz, S.: Interleave factor and multimedia information visualisation. In: Sharp, H., Chalk, P. (eds.) Proceedings of human computer interaction, vol. 2, pp. 142–146 (2002)

42. Luz, S., Masoodian, M.: A mobile system for non-linear access to time-based data. In: Proceedings of the working conference on advanced visual interfaces, AVI '04, pp. 454–457. ACM Press (2004)

43. Luz, S., Masoodian, M.: A model for meeting content storage and retrieval. In: Proceedings of the 11th international multimedia modelling conference, MMM'05, pp. 392–398 (2005)

44. Luz, S., Roy, D.: Meeting browser: a system for visualising and accessing audio in multicast meetings. In: Society, I.S.P. (ed.) Proceedings of the international workshop on multimedia signal processing (1999)

45. Martinez, J., Koenen, R., Pereira, F.: MPEG-7: the generic multimedia content description standard, part 1. IEEE Multimedia 9(1070-986X), 78–87 (2002)

46. Masoodian, M., Luz, S.: Comap: A content mapper for audio-mediated collaborative writing. In: Smith, M.J., Salvendy, G., Harris, D., Koubek, R.J. (eds.) Usability evaluation and interface design, vol. 1, pp. 208–212. Lawrence Erlbaum, Hillsdale (2001)

47. Masoodian, M., Luz, S., Bouamrane, M.M., King, D.: Recoled: A group-aware collaborative text editor for capturing document history. In: Proceedings of WWW/Internet 2005, vol. 1, pp. 323–330 (2005)

48. Masoodian, M., Luz, S., Weng, C.: Hanmer: A mobile tool for browsing recorded collaborative meeting contents. In: Kemp, E., Philip, C., Wong, W. (eds.) Proceedings of CHI-NZ '03, pp. 87–92. ACM Press (2003)

49. McCowan, I., Gatica-Perez, D., Bengio, S., Lathoud, G., Barnard, M., Zhang, D.: Automatic analysis of multimodal group actions in meetings. IEEE Trans. Pattern Anal. Mach. Intell. 27(3), 305–317 (2005)

50. Meghini, C., Sebastiani, F., Straccia, U.: A model of multimedia information retrieval. J. ACM 48(5), 909–970 (2001)

51. Moran, T.P., Palen, L., Harrison, S., Chiu, P., Kimber, D., Minneman, S., van Melle, W., Zellweger, P.: "I'll get that off the audio": a case study of salvaging multimedia meeting records. In: Proceedings of ACM conference on human factors in computing systems, CHI 97, vol. 1, pp. 202–209 (1997)

52. Rabiner, L.R., Juang, B.H.: Fundamentals of speech recognition. Prentice-Hall, Englewood Cliffs (1993)

53. Richter, H.A., Abowd, G.D., Geyer, W., Fuchs, L., Daijavad, S., Poltrock, S.E.: Integrating meeting capture within a collaborative team environment. In: Proceedings of UbiComp '01, pp. 123–138. Springer, Berlin Heidelberg New York (2001)

54. Rohlicek, J., Russell, W., Roukos, S., Gish, H.: Continuous hidden Markov modeling for speaker-independent word spotting. In: Proceedings of international conference on acoustics, speech, and signal processing, ICASSP-89, vol. 1, pp. 627–630 (1989)

55. Rose, R.C., Paul, D.B.: A hidden Markov model based keyword recognition system. In: Proceedings of international conference on acoustics, speech, and signal processing, ICASSP-90, vol. 1, pp. 129–132 (1990)

56. Roy, D., Malamud, C.: Speaker identification based text to audio alignment for an audio retrieval system. In: Proceedings of the 1997 IEEE international conference on acoustics, speech, and signal processing, ICASSP '97, vol. 2, pp. 1099–1102. IEEE Computer Society (1997)

57. Russell, D.M.: A design pattern-based video summarization technique: moving from low-level signals to high-level structure. In: HICSS '00: Proceedings of the 33rd Hawaii international conference on system sciences, vol. 3, p. 3048. IEEE Computer Society (2000)

58. Santini, S., Gupta, A., Jain, R.: Emergent semantics through interaction in image databases. IEEE Trans. Knowl. Data Eng. 13(3), 337–411 (2001)

59. Shriberg, E., Stolcke, A., Hakkani-Tur, D., Tur, G.: Prosody-based automatic segmentation of speech into sentences and topics. Speech Commun. 32(1–2), 127–154 (2000)

60. Singh, R., Li, Z., Kim, P., Pack, D., Jain, R.: Event-based modeling and processing of digital media. In: Proceedings of CVDB'04: computer vision meets databases, pp. 19–26. ACM Press (2004)

61. Smith, M.A., Kanade, T.: Video skimming and characterization through the combination of image and language understanding techniques. In: Proceedings of workshop on content-based access of image and video database, pp. 61–70. IEEE Computer Society (1998)

62. Snoek, C.G.M., Worring, M.: Multimodal video indexing: a review of the state-of-the-art. Multimed. Tools Appl. 25(1), 5–35 (2005)

63. Srinivasan, S., Ponceleon, D., Amir, A., Petkovic, D.: What is in that video anyway?: in search of better browsing. In: Proceedings of IEEE conference on multimedia computing and systems, vol. 1, pp. 388–393 (1999)

64. Stifelman, L., Arons, B., Schmandt, C.: The audio notebook: paper and pen interaction with structured speech. In: Proceedings of CHI'01: Human factors in computing systems, pp. 182–189. ACM Press (2001)

65. Tucker, S., Whittaker, S.: Accessing multimodal meeting data: systems, problems and possibilities. In: Bengio, S., Bourlard, H. (eds.) Machine learning for multimodal interaction: first international workshop, MLMI 2004, vol. 3361, pp. 1–11. Springer, Berlin Heidelberg New York (2005)

66. Tur, G., Hakkani-Tur, D., Stolcke, A., Shriberg, E.: Integrating prosodic and lexical cues for automatic topic segmentation. Comput. Linguist. 27(1), 31–57 (2001)

67. Uchihashi, S., Foote, J., Girgensohn, A., Boreczky, J.: Video manga: generating semantically meaningful video summaries. In: MULTIMEDIA '99: Proceedings of the 7th ACM international conference on multimedia (Part 1), pp. 383–392. ACM Press (1999)

68. Valenza, R., Robinson, T., Hickey, M., Tucker, R.: Summarisation of spoken audio through information extraction. In: Proceedings of the ESCA workshop: accessing information in spoken audio, pp. 111–115 (1999)

69. Waibel, A., Bett, M., Finke, M., Stiefelhagen, R.: Meeting browser: tracking and summarizing meetings. In: Penrose, D.E.M. (ed.) Proceedings of the broadcast news transcription and understanding workshop, pp. 281–286. Morgan Kaufmann (1998)

70. Waibel, A., Bett, M., Metze, F., Ries, K., Schaaf, T., Schultz, T., Soltau, H., Yu, H., Zechner, K.: Advances in automatic meeting record creation and access. In: Proceedings of the international conference on acoustics, speech and signal processing, pp. 597–600 (2001)

71. Weintraub, M.: Keyword-spotting using SRI's decipher large-vocabulary speech-recognition system. In: Proceedings of IEEE international conference on acoustics, speech, and signal processing, ICASSP-93, vol. 2, pp. 463–466 (1993)

72. Wellner, P., Flynn, M., Guillemot, M.: Browsing recorded meetings with Ferret. In: Bengio, S., Bourlard, H. (eds.) Proceedings of machine learning for multimodal interaction: first international workshop, MLMI 2004, vol. 3361, pp. 12–21. Springer, Berlin Heidelberg New York (2004)

73. Wellner, P., Flynn, M., Tucker, S., Whittaker, S.: A meeting browser evaluation test. In: CHI '05 extended abstracts on human factors in computing systems, pp. 2021–2024. ACM Press (2005)

74. Whittaker, S., Hirschberg, J., Choi, J., Hindle, D., Pereira, F., Singhal, A.: SCAN: designing and evaluating user interfaces to support retrieval from speech archives. In: Proceedings of the 22nd ACM SIGIR conference on research and development in information retrieval, SIGIR'99, pp. 26–33. ACM Press (1999)

75. Whittaker, S., Hirschberg, J., Nakatani, C.H.: Play it again: a study of the factors underlying speech browsing behavior. In: CHI '98: CHI 98 conference summary on human factors in computing systems, pp. 247–248. ACM Press (1998)

76. Whittaker, S., Hyland, P., Wiley, M.: Filochat: handwritten notes provide access to recorded conversations. In: Proceedings of the ACM conference on human factors in computing systems, pp. 24–28. ACM Press (1994)

77. Wilcox, L., Kimber, D., Chen, F.: Audio indexing using speaker identification. In: Proceedings of conference on automatic systems for the inspection and identification of humans, pp. 149–157 (1994)

78. Yamron, J., Carp, I., Gillick, L., Lowe, S., van Mulbregt, P.: Event tracking and text segmentation via hidden Markov models. In: Proceedings of IEEE workshop on automatic speech recognition and understanding, pp. 519–526 (1997)

79. Young, S.: Large vocabulary continuous speech recognition: a review. In: Proceedings of the IEEE workshop on automatic speech recognition and understanding, pp. 3–28 (1995)

80. Zechner, K.: Automatic generation of concise summaries of spoken dialogues in unrestricted domains. In: Proceedings of the conference on research and development in information retrieval, SIGIR'01, pp. 199–207. ACM Press (2001)

81. Zechner, K., Waibel, A.: DiaSumm: flexible summarization of spontaneous dialogues in unrestricted domains. In: Proceedings of the 18th conference on computational linguistics, pp. 968–974. ACL (2000)
