Multimodal Person Discovery in Broadcast TV Task at MediaEval 2015 Johann Poignant, Hervé Bredin, Claude Barras LIMSI – CNRS, Orsay, France [email protected] Motivation ➢ Indexing TV archives → find people appearing and speaking ➢ Biometric models are not always available → find people identities in the video (pronounced names, names written on screen) Definition of the task ➢ From a collection of TV broadcast pre-segmented into shots, participants are asked to provide: ➢ The names of people both speaking and appearing at the same time during the shot, with a confidence score ➢ An evidence (a unique shot) proving that the person holds the right name ➢ List of persons was not provided ➢ Person biometric models trained on externel data can not be used ➢ Participants should find their names in the audio (using ASR) or visual (using OCR) streams A B A Hello Mrs B Mr A blah blah shot #1 shot #2 shot #3 A B B blah blah shot #4 speaking face evidence A B blah blah A text overlay speech transcript INPUT OUTPUT LEGEND Shot#1 is an evidence for Mr A Shot#3 is an evidence for Mrs B Datasets ➢ Dev set: REPERE (2 French channel, 8 shows, 137 hours, 50 hours manually annotated) ➢ Test set: INA corpus (1 French channel, 172 editions of the evening broadcast news, 106 hours) ➢ Annotated a posteriori based on participants' submissions Baseline and metadata ➢ Available as open-source software at http://github.com/MediaEvalPersonDiscoveryTask Metrics ➢ For each queries q ϵ Q, rank all shots (according to their confidence) of the best hypothesis p (if Levenshtein ratio(p, q) > 0.95) ➢ Average Precision ➢ Mean Average Precision ➢ Correctness of evidences ➢ Evidence-weighted Mean Average k is the rank P(k) is the precision at cut-off k rel(k) =1 if the shot at rank k is relevant, 0 otherwise ∑ k=1 (P(k)) · rel(k)) # relevant shots n AP = Q MAP = ∑ q=1 AP(q) 1 Q C(q) = 1 if the Levenshtein ratio(p, q) > 0.95 and if the evidence proves q identity 0 otherwise Q EwMAP = ∑ q=1 C(q) · AP(q) 1 Q