METISS
Modélisation et Expérimentation pour le Traitement des Informations et des Signaux Sonores
(Modelling and Experimentation for the Processing of Sound Information and Signals)

Scientific leader: Frédéric BIMBOT

Audio & speech processing

Overview of activities 2002-2005

INRIA Evaluation, 17-18 May 2006

INRIA-Rennes


Introduction


Framework and foundations

Scientific foundations
- Probabilistic models and statistical estimation
- Redundant systems and adaptive representations

General framework
- analysis, processing, modelling, representation, description, decomposition, detection, classification, recognition
of
- audio, speech, music, multimedia, …
- signals, recordings, streams, tracks, …

Audio scene analysis, description and recognition


Scientific objectives

to design generic, robust, fast and flexible approaches to a variety of problems in speech and audio segmentation, detection and classification, operating in the probabilistic framework

to investigate the theoretical properties and practical applications of adaptive representations and sparseness criteria, for advanced processing and structured description of audio signals

to extend and adapt approaches classically used in the context of speech processing to other classes of signals and problems

to study convergence between statistical approaches and adaptive decomposition within a common framework embedding signal representations and classification


Application domain and focus

Applicative fields
- Security, verification, authentication, rights management
- Rich audio transcription, content-based indexing, multi-purpose navigation, information retrieval and summarization
- Advanced audio processing: segmentation, separation, spatialisation, sound object extraction, music modeling
- Audio and audio-visual authoring, production and repurposing
- Education and entertainment

Primary focuses
- Speaker characterisation
- Audio structuring and indexing
- Sparse representations: theory and applications
- Audio source separation (under-determined case)


Team composition

[Figure: team composition timeline, 2002-2005. Legend: permanent researchers (CR, CNRS or INRIA); non-permanent staff (engineers, ATER, post-docs); PhD students 100 % with METISS; PhD students ~ 50 % with METISS. Names shown: MAILHE, ARBERET, TENG, HUET, FORTHOFER, SALL, OZEROV, LESAGE, COLLET, BEN, BENAROYA, BLOUET, MC DONAGH, POREE, BETSER, KIJAK, KRSTULOVIC, GONON, MORARU, BIMBOT, GRAVIER, GRIBONVAL.]

+ Marie-Noëlle Georgeault administrative assistant (~ 25 %)


Probabilistic modeling of audio signals


Probabilistic modeling (1)

1 audio class or 1 sound object

a variety of observations

1 family of sounds
1 probabilistic model

1 probability density function: $P(Y_1^T \mid X)$
1 likelihood function: $\hat{P}(Y_1^T \mid X)$
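As a rough illustration of "one class, one model, one likelihood function", the sketch below fits one diagonal Gaussian per sound class and scores an observation sequence against each. The diagonal-Gaussian model, the independence of frames, the toy 2-D features and the class names are all assumptions made for this example; the slides do not specify a model family here.

```python
import math

def fit_gaussian(frames):
    """Estimate per-dimension mean and variance from training frames."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    # Floor the variance so that the log-likelihood stays finite.
    var = [max(sum((f[d] - mean[d]) ** 2 for f in frames) / n, 1e-6)
           for d in range(dim)]
    return mean, var

def log_likelihood(frames, model):
    """Log-likelihood of a frame sequence, assuming independent frames
    drawn from a diagonal Gaussian (an assumption of this sketch)."""
    mean, var = model
    total = 0.0
    for frame in frames:
        for x, m, v in zip(frame, mean, var):
            total += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return total

# Toy 2-D features for two hypothetical sound classes.
speech_train = [[0.0, 1.0], [0.2, 0.8], [-0.1, 1.1], [0.1, 0.9]]
music_train = [[3.0, -1.0], [2.8, -0.9], [3.1, -1.2], [3.2, -1.1]]
models = {"speech": fit_gaussian(speech_train),
          "music": fit_gaussian(music_train)}

# Score an unseen sequence against every class model and keep the best.
observed = [[0.05, 0.95], [0.15, 1.05]]
scores = {name: log_likelihood(observed, m) for name, m in models.items()}
best = max(scores, key=scores.get)
```

Swapping the Gaussian for a GMM or an HMM changes only `fit_gaussian` and `log_likelihood`; the decision rule stays the same.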


Probabilistic modeling (2)

- Probabilistic modeling
- Statistical estimation
- State-sequence decoding
- Bayesian decision
+ "know-how"

Detection, classification, verification, segmentation, …

Probabilistic models offer a well-understood generic inter-operable framework for the description and the classification of audio and speech signals

Dominant position of Hidden Markov Models (HMM) (and variants)

Highly competitive field in speech processing (research & industry)

More open in audio indexing (additional factors of complexity)
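State-sequence decoding with an HMM is usually done with the Viterbi algorithm. The sketch below is a minimal log-domain version, assuming a two-state model (state 0 and state 1 could stand for "speech" and "music"); the transition probabilities and per-frame likelihoods are made-up values for illustration only.

```python
import math

def viterbi(obs_loglik, log_init, log_trans):
    """Return the most likely state sequence.

    obs_loglik[t][j] is the log-likelihood of frame t under state j,
    log_init[j] the log prior of state j, and log_trans[i][j] the
    log probability of moving from state i to state j.
    """
    n_states = len(log_init)
    delta = [log_init[j] + obs_loglik[0][j] for j in range(n_states)]
    backptr = []
    for t in range(1, len(obs_loglik)):
        new_delta, ptr = [], []
        for j in range(n_states):
            # Best predecessor of state j at time t.
            best_i = max(range(n_states),
                         key=lambda i: delta[i] + log_trans[i][j])
            ptr.append(best_i)
            new_delta.append(delta[best_i] + log_trans[best_i][j]
                             + obs_loglik[t][j])
        delta = new_delta
        backptr.append(ptr)
    # Backtrack from the best final state.
    state = max(range(n_states), key=lambda j: delta[j])
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

# Hypothetical two-state example with "sticky" transitions.
log_init = [math.log(0.5), math.log(0.5)]
log_trans = [[math.log(0.8), math.log(0.2)],
             [math.log(0.2), math.log(0.8)]]
frame_lik = [[0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.1, 0.9], [0.2, 0.8]]
obs_loglik = [[math.log(p) for p in row] for row in frame_lik]
path = viterbi(obs_loglik, log_init, log_trans)
```

The decoded path gives a segmentation of the recording: each change of state marks a boundary between segments.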


Challenges and positioning

Robustness
- to unseen acoustic conditions
- to scarce training data
- to poorly representative samples
- to missing observations
- to …

Implementability
- size
- speed
- scalability
- distribution
- etc.

Generalisation to wider classes of signals with an audio component
- multiple scales
- multiple sources
- multiple structures
- multiple sensors
- multiple levels of underlying processes
- heterogeneous streams (audio-visual)
- external sources of knowledge

METISS positioning:
- robust training and test methods
- compact distributed algorithms
- versatility / migration of formalism
- methodology and evaluation

speaker verification, audio segmentation, broad sound-class indexing, (speech recognition)


Adaptive representations

Adaptive representations (1)

Audio signal:
- diversity of structures (time, frequency, statistics, …)
- superimposition of objects (notes, sources, tracks, …)

Redundant system (dictionary of atoms)
$D = \{ g_i(t) \}_{1 \le i \le N,\ 1 \le t \le T}$ with $N > T$

Large set of vectors with various:
- scales
- time structures
- frequency structures
- phases
- statistical properties
- …

Adaptive decomposition
$s = \{ s(t) \}_{1 \le t \le T}$, with $s(t) = \sum_{i=1}^{N} \alpha_i \, g_i(t)$

Selection of the "best" decomposition, according to a given criterion:
- sparsity
- perception criterion
- separability
- conditional entropy
- …



Adaptive representations (2)

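Decomposition
$s(t) = \sum_{i=1}^{N} \alpha_i \, g_i(t)$, with coefficients $\alpha = \{ \alpha_i \}_{1 \le i \le N}$

Sparsity criteria
$\hat{\alpha} = \arg\min_{\alpha} F(\alpha)$
Criterion: $F(\alpha) = \sum_{i=1}^{N} |\alpha_i|^{p}$
Constraint: $s(t) = \sum_{i=1}^{N} \alpha_i \, g_i(t)$

- $p = 2$: quadratic norm, maximizes dispersion
- $p = 0$: minimum number of non-zero coefficients, NP-hard
- $p = 1$: tractable "compromise"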

Pursuit algorithms (Matching Pursuit)
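A pursuit algorithm builds the decomposition greedily instead of solving the minimisation problem directly. The sketch below is a minimal plain Matching Pursuit over a small redundant dictionary, assuming unit-norm atoms; the dictionary (four Dirac atoms plus two constant-magnitude atoms, so N = 6 atoms in dimension T = 4) and the signal are illustrative choices, not taken from the slides.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def energy(v):
    """Squared Euclidean norm."""
    return dot(v, v)

def matching_pursuit(signal, atoms, n_iter):
    """Greedy decomposition of `signal` over unit-norm `atoms`.

    At each step the atom most correlated with the residual is selected,
    its contribution is added to the coefficients and removed from the
    residual.
    """
    residual = list(signal)
    coeffs = [0.0] * len(atoms)
    for _ in range(n_iter):
        corr = [dot(residual, g) for g in atoms]
        i = max(range(len(atoms)), key=lambda k: abs(corr[k]))
        coeffs[i] += corr[i]
        residual = [r - corr[i] * g for r, g in zip(residual, atoms[i])]
    return coeffs, residual

# Redundant dictionary: N = 6 unit-norm atoms in dimension T = 4.
atoms = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.5, 0.5, 0.5, 0.5],
    [0.5, -0.5, 0.5, -0.5],
]
signal = [2.0, 1.0, 1.0, 1.0]  # equals 2 * atoms[4] + 1 * atoms[0]
coeffs, residual = matching_pursuit(signal, atoms, n_iter=2)
```

Plain Matching Pursuit does not necessarily recover the generating coefficients exactly in a fixed number of steps, but the residual energy decreases at every iteration, and the signal energy splits into the squared selected correlations plus the residual energy.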



Recent fast-growing field

High applicative potential

Intense emerging competition

Ongoing scientific issues
- Optimality and convergence of adaptive decompositions
- Dictionary design (knowledge-based, data-driven, …)
- Deformable, stochastic, multi-dimensional, … atoms
- Efficient decomposition algorithms and implementations
- Application scope

METISS positioning:
- theoretical results
- concepts and methodologies
- decomposition algorithms

audio source separation (under-determined case)


Achievements 2002-2005 and selected results

- Speaker characterisation
- Audio structuring and indexing
- Sparse representations: theory and applications
- Audio source separation (under-determined case)



Speaker characterisation

CART trees for scalable and distributable speaker verification

Model-based metrics and normalisations for speaker verification

Structural adaptation of speaker models (hierarchical Bayesian networks)

Methodology and algorithms for optimizing the coverage of a speaker database

Relative speaker space and metrics for efficient speaker indexing and retrieval [ongoing]



CART based speaker verification

Direct score function assignment:
$S_X(y_t) = \log \frac{\hat{P}(y_t \mid X)}{\hat{P}(y_t \mid \bar{X})}$

[Figure: binary decision tree over the frame $y_t$; YES/NO tests lead to regions labelled 1a, 1b, 2a, 2b, 3b, and each leaf directly assigns a score (values shown: -0.8, 0.3, -0.5, 0.9, 0.7, -0.4).]

CART trees used as a family of approximating functions
+ Extension to oblique trees

Complexity down 200 x, error rate up 33 % only

Blouet, Bimbot, Gonon, et al.
EU-IST INSPIRED Project
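The idea of letting a tree assign a score directly to each frame can be sketched as follows. The tree structure, thresholds, feature indices and decision threshold are hypothetical; only the principle (a few comparisons per frame instead of evaluating two density models) comes from the slide.

```python
def tree_score(frame, node):
    """Walk a binary tree down to a leaf and return the leaf score.

    Internal nodes are tuples (feature_index, threshold, yes_branch,
    no_branch); leaves are plain floats.
    """
    while not isinstance(node, float):
        feature, threshold, yes_branch, no_branch = node
        node = yes_branch if frame[feature] >= threshold else no_branch
    return node

# Hypothetical tree for one claimed speaker (axis-aligned tests only).
TREE = (0, 1.0,
        (1, 2.0, 0.9, 0.3),     # frame[0] >= 1.0, then test frame[1] >= 2.0
        (1, 0.5, -0.5, -0.8))   # frame[0] <  1.0, then test frame[1] >= 0.5

def utterance_score(frames, tree):
    """Average the per-frame scores over the utterance."""
    return sum(tree_score(f, tree) for f in frames) / len(frames)

frames = [[1.2, 2.5], [1.5, 1.0], [0.2, 0.7]]
score = utterance_score(frames, TREE)
accept = score > 0.0  # hypothetical decision threshold
```

An oblique tree would replace the test `frame[feature] >= threshold` with a test on a linear combination of features, at the cost of one inner product per node.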



Speaker recognition in the model space (1)

Formal links between LLR and KL-divergence
+ mean-only adaptation training procedure

likelihood ratio test ≈ Euclidean distance in the model space

Ben, Bimbot et al.
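One way to see why mean-only adaptation leads to a Euclidean geometry: for two Gaussians sharing the same covariance, the KL divergence is exactly half the squared, variance-normalised distance between their means. For two GMMs that share weights and variances and differ only in their means, the weighted sum of these per-component terms is a standard upper bound on the KL divergence. The sketch below checks the single-Gaussian identity numerically; the function names and toy values are illustrative, and this is not claimed to be the exact derivation used by the authors.

```python
import math

def kl_diag_gauss(mu1, var1, mu2, var2):
    """KL( N(mu1, var1) || N(mu2, var2) ) for diagonal covariances."""
    return 0.5 * sum(math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0
                     for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2))

def model_space_sq_distance(means_a, means_b, weights, variances):
    """Squared Euclidean distance between two mean-adapted models,
    each mean being normalised by the shared variance and weighted by
    the shared component weight."""
    return sum(
        w * sum((ma - mb) ** 2 / v for ma, mb, v in zip(mu_a, mu_b, var))
        for w, mu_a, mu_b, var in zip(weights, means_a, means_b, variances)
    )

# Single shared-covariance component: KL equals half the squared distance.
mu_a, mu_b, var = [0.0, 0.0], [1.0, 2.0], [1.0, 4.0]
kl = kl_diag_gauss(mu_a, var, mu_b, var)
dist = model_space_sq_distance([mu_a], [mu_b], [1.0], [var])
```

Because the distance depends only on the stacked means, comparing two speaker models no longer requires scoring any frames.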



Speaker recognition in the model space (2)

Consequences:
- faster score computation procedure (cost reduced by at least 50%)
- simpler normalization schemes (M-Norm), with no need for additional development data
with no performance degradation

Tested successfully for speaker recognition in the NIST and ESTER campaigns

Ben, Bimbot et al.



Audio indexing

HMM-based audio and audio-visual structuring (applied to sports programmes)

Audio segmentation and tracking using probabilistic models and statistical tests

Detection of simultaneous events in audio tracks

Granular models of audio signals using deformable atoms

Comparison and evaluation of beam-search techniques and hypothesis rescoring using external sources of knowledge [ongoing]

Algebraic representations and statistical modeling of formal music [ongoing]



Multi-stream HMM modeling (1) of a tennis match

Multi-level state-sequence representation of a tennis match, inspired and adapted from the speech recognition paradigms

Multi-stream audio-visual HMM

Kijak et al. (with TMM)



Multi-stream HMM modeling (2)
Delakis, Gravier et al. (with TexMex)

- Video only, shot-based: C = 77%
- Video + audio, shot-based + segmental: C = 85%

Segmental models, relaxed synchrony constraints



Sparse representations

Mathematical test for the optimality of a sparse representation

Matching pursuit made tractable (1 hour of audio processed in 0.25 x real time)

Structured matching pursuit incorporating explicit signal family models

Adaptive computational strategies

Beyond sparsity : recovering structured representations…

Learning shift-invariant atoms (MoTIF algorithms) [ongoing]



Sparse solutions to inverse linear problems

In the under-determined case :

Gribonval et al.

BUT if :

If a sparse representation is sparse enough, then it is the sparsest one
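The exact condition shown on this slide did not survive extraction. As an illustration of the kind of statement involved, the sketch below uses the classical coherence-based condition from the sparse representation literature: for a dictionary of unit-norm atoms with coherence mu (the largest absolute inner product between distinct atoms), any representation with fewer than (1 + 1/mu) / 2 non-zero coefficients is the unique sparsest one. Whether this is the condition the slide displayed is an assumption; the dictionary and function names are illustrative.

```python
def coherence(atoms):
    """Largest absolute inner product between distinct unit-norm atoms."""
    mu = 0.0
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            inner = sum(a * b for a, b in zip(atoms[i], atoms[j]))
            mu = max(mu, abs(inner))
    return mu

def is_guaranteed_sparsest(coeffs, atoms, tol=1e-12):
    """True when the coherence-based condition certifies uniqueness.

    False only means the certificate does not apply, not that a sparser
    representation exists.
    """
    n_nonzero = sum(1 for c in coeffs if abs(c) > tol)
    return n_nonzero < 0.5 * (1.0 + 1.0 / coherence(atoms))

# Union of the Dirac basis and a normalised Hadamard basis in dimension 4.
diracs = [[1.0 if k == i else 0.0 for k in range(4)] for i in range(4)]
hadamard = [[0.5, 0.5, 0.5, 0.5],
            [0.5, -0.5, 0.5, -0.5],
            [0.5, 0.5, -0.5, -0.5],
            [0.5, -0.5, -0.5, 0.5]]
atoms = diracs + hadamard

mu = coherence(atoms)
one_atom = [0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0]
two_atoms = [1.0, 0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0]
certified_one = is_guaranteed_sparsest(one_atom, atoms)
certified_two = is_guaranteed_sparsest(two_atoms, atoms)
```

In this tiny example the bound only certifies one-atom representations; larger dimensions with lower coherence give room for more non-zero coefficients.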



Matching Pursuit made tractable
Gribonval, Krstulovic et al.

MPTK: C++ toolkit, GPL licence
- for a 1 hour audio signal, processing time reduced from 20 h to 0.25 h
- flexible operation, reproducible results
- usable in other fields: medical signals, seismology, etc.



Source separation (with primary focus on underdetermined problems)

Statistical schemes and adaptive training for single-channel separation

Source separation approaches using multi-channel Matching Pursuit in the underdetermined case

Contributions in evaluation methodology : task definition & performance measurements

Speech "denoising" using underdetermined source separation techniques

Dictionary design methods for source separation [ongoing]

DEMIX : a robust algorithm to estimate the number of sources using clustering techniques [ongoing]



Single sensor audio source separation

[Figure: block diagram. A voice GMM and a music GMM are combined into a factorial GMM; the observed signal (voice + music) passes through a Wiener filter, which outputs the estimated voice signal.]

Use of a factorial GMM to build a time-varying Wiener filter
- innovative scheme for underdetermined source separation
- compatibility with speech processing state-of-the-art
- strong links with sparse decomposition problems
- versatile and efficient for a range of audio description tasks

Benaroya, Bimbot, Gribonval, Ozerov (with FTR&D)
Article in IEEE Trans. SAP, 2006
+ new results to come
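The time-varying Wiener filter can be sketched for a single frame as follows. Each source model is reduced here to a short list of candidate power spectra (standing in for GMM component variances); the pair of components that best explains the observed power spectrum is selected, and the Wiener gain for the voice follows from the two selected spectra. The hard selection of a single pair, the three-bin spectra and all numerical values are simplifying assumptions of this sketch, not the published method.

```python
import math

def pair_loglik(power, var_a, var_b):
    """Log-likelihood (up to a constant) of an observed power spectrum
    when the mixture is zero-mean Gaussian with variance var_a + var_b
    in each frequency bin."""
    return sum(-math.log(a + b) - p / (a + b)
               for p, a, b in zip(power, var_a, var_b))

def wiener_voice_gain(power, voice_spectra, music_spectra):
    """Select the most likely (voice, music) component pair for this
    frame and return it with the per-bin Wiener gain for the voice."""
    pairs = [(i, j) for i in range(len(voice_spectra))
             for j in range(len(music_spectra))]
    best = max(pairs, key=lambda ij: pair_loglik(
        power, voice_spectra[ij[0]], music_spectra[ij[1]]))
    voice, music = voice_spectra[best[0]], music_spectra[best[1]]
    gain = [v / (v + m) for v, m in zip(voice, music)]
    return best, gain

# Hypothetical component spectra over three frequency bins.
voice_spectra = [[4.0, 1.0, 0.1], [0.1, 1.0, 4.0]]
music_spectra = [[1.0, 1.0, 1.0], [0.1, 0.1, 3.0]]

observed_power = [5.0, 2.0, 1.1]  # one frame of the voice + music mixture
best, gain = wiener_voice_gain(observed_power, voice_spectra, music_spectra)
# The estimated voice spectrum is the mixture spectrum scaled by the gain.
voice_power_estimate = [g * g * p for g, p in zip(gain, observed_power)]
```

Repeating the selection on every frame makes the gain change over time, which is what turns a fixed Wiener filter into a time-varying one.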



Underdetermined stereophonic source separation using sparse methods

[Figure labels: mixing matrix; separation (least squares, sparsity).]

Lesage, Gribonval et al.

Audio examples available
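The contrast between the least squares and sparsity criteria can be shown on one stereo sample with three sources and a known mixing matrix. The least squares (minimum l2 norm) solution spreads energy over all sources, whereas the minimum l1 norm solution uses at most two. With only two channels, an l1 minimiser can be found by trying every pair of mixing directions, since a linear program with two equality constraints has an optimal solution with at most two non-zero entries. The mixing matrix and the sample are illustrative assumptions.

```python
from itertools import combinations

def solve2(a, b, c, d, x0, x1):
    """Solve [[a, b], [c, d]] @ [u, v] = [x0, x1]; None if singular."""
    det = a * d - b * c
    if abs(det) < 1e-12:
        return None
    return ((d * x0 - b * x1) / det, (a * x1 - c * x0) / det)

def min_l2(A, x):
    """Minimum l2-norm solution s = A^T (A A^T)^-1 x for a 2 x N matrix."""
    g00 = sum(v * v for v in A[0])
    g01 = sum(p * q for p, q in zip(A[0], A[1]))
    g11 = sum(v * v for v in A[1])
    lam = solve2(g00, g01, g01, g11, x[0], x[1])
    return [A[0][k] * lam[0] + A[1][k] * lam[1] for k in range(len(A[0]))]

def min_l1(A, x):
    """Minimum l1-norm solution, found by enumerating pairs of columns."""
    n = len(A[0])
    best = None
    for i, j in combinations(range(n), 2):
        sol = solve2(A[0][i], A[0][j], A[1][i], A[1][j], x[0], x[1])
        if sol is None:
            continue
        cand = [0.0] * n
        cand[i], cand[j] = sol
        if best is None or sum(map(abs, cand)) < sum(map(abs, best)):
            best = cand
    return best

# Hypothetical mixing matrix: 2 channels (rows), 3 sources (columns).
A = [[1.0, 0.6, 0.0],
     [0.0, 0.8, 1.0]]
x = [1.2, 1.6]  # mixture sample produced by sources [0, 2, 0]

s_l2 = min_l2(A, x)
s_l1 = min_l1(A, x)
```

In practice such a step would be applied to each time-frequency coefficient of the stereo mixture, where the sources are sparser than in the time domain.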



Collaborations, Dissemination and Visibility

Privileged cooperation with the TEXMEX group at IRISA (+ VISTA)

Consistent network of academic and industrial partners outside IRISA

Regular participation in collaborative projects (EU-IST, RNRT, bilateral partnership, …)

Strong involvement in concerted research actions (ESTER, MathSTIC, GDR-ISIS, NIST evaluations, …)

Visible participation in and production of free software: ELISA platform, AudioSeg, MPTK, SIROCCO, BSS-EVAL

Sustained effort of publication and dissemination of the group research results

Additional visibility through responsibilities in scientific societies, workshop organisation and editorial boards



Summary 2002-2005

Strategy and perspectives 2006-2010



Achievements 2002-2005 (1)

solid contributions to the state of the art on several topics related to speaker and audio class modelling and recognition

key extension, experimentation and validation of the Hidden Markov Model framework for joint audio and video modelling and structuring

major theoretical and experimental progress in the field of sparse representations and adaptive decomposition

pioneering work in mono- and multi-channel source separation in the underdetermined case



Achievements 2002-2005 (2)

strategic improvement in the efficiency of pursuit algorithms both in terms of search strategy and implementation

development of a usable know-how in keyword spotting and speech recognition

sustained activities in assessment methodology, resource distribution and evaluation campaigns

scientific objective #4 needs consolidation



Strategy 2006-2010

To keep our position in our initial field of expertise: models, algorithms and tools for automatic processing of audio and speech signals

To push our advantage in the field of sparse representations, from both the theoretical and the applicative viewpoints

To extend our scope towards more powerful approaches for the representation and modeling of audio and multi-modal signals with an audio component

To step in and progress in the area of compressing large-scale high-dimensional multi-modal data



Scientific challenges

Probabilistic multi-level multi-stream dependency models for the representation of multiple sources and the integration of heterogeneous levels of knowledge in audio(-visual) streams (Bayesian networks)

Data-driven representations, model discovery and self-structuring of information in audio and audio-visual streams and contents (theoretical consolidation)

Experimental platforms and numerically efficient algorithms for large-scale data and near real-time processing (engineering work)

Deeper understanding of the links between theoretical concepts of adaptive representation, sparse decomposition, multi-scale analysis and practical implications in terms of robustness, separability and adaptability (potential links with SVM)

Compressing large-scale high-dimensional multimodal data for storage, description and classification (compressed sensing)



Questions
