
On the use of statistical tools for audio processing

Mathieu Lagrange and Juan José Burred

Analyse / Synthèse Team, IRCAM


École centrale de Nantes Filière ISBA

Mathieu Lagrange. Statistical Tools for Audio Processing.

Outline

1.  Introduction
    1.  Context and challenges
2.  Past and Present
    1.  Speech
        1.  Model
        2.  Applications (coding, speaker recognition, speech recognition)
    2.  Audio (Music)
        1.  Sound models
        2.  Applications (classification, similarity)
    3.  Limitations
3.  Sound source separation
    1.  Paradigms, tasks and applications
    2.  Mixing Models
    3.  Methods for the underdetermined case
4.  Clustering of Spectral Audio (CoSA)
    1.  Auditory Scene Analysis (ASA)
    2.  Clustering



Technological Context

« We are drowning in information and starving for knowledge »

R. D. Rogers

•  Needs:

o  Measurement

o  Transmission

o  Access

•  Aim of a numerical representation:

o  Precision

o  Efficiency

o  Relevance

•  Means

o  Mechanical biology

o  Psychoacoustics

o  Cognition



Challenges

« Forty-two! yelled Loonquawl. Is that all you've got to show for seven and a half million years' work? »

D. Adams

Music is great to study as it is both:

•  object: an arrangement of sounds and silences over time

•  function: more or less codified form of expression of :

o  Individual feelings (mood)

o  Collective feelings (party, singing, dance)



Audio Processing: Past and Present




Speech signal

•  The speech signal is produced when the air flow coming from the lungs goes through the vocal cords and the vocal tract.

o  The size and shape of the vocal tract, as well as the vocal cord excitation, change relatively slowly.

o  The speech signal can therefore be considered quasi-stationary over short periods of about 20 ms.

•  Type of speech production

o  Voiced: <a>, <e>, …

o  Unvoiced: <s>, <ch>,

o  Plosives: <pe>, <ke>


Source / Filter Model

•  In the case of an idealized voiced speech signal, the vocal cords produce a perfectly periodic harmonic signal.

•  The influence of the vocal tract can be modeled as a filter with a given frequency response, whose maxima are called formants.


Source / Filter Coding

•  Algorithm:

o  Voiced / Unvoiced detection;

o  Voiced case: the source signal is approximated with a Dirac comb:

o  A Dirac comb whose successive impulses are spaced by T has a spectrum that is also a Dirac comb, with impulses spaced by 1/T.

o  Parameters : T, gain

o  Unvoiced: the source signal is approximated by a stochastic signal:

o  Parameter : gain.

o  The Source signal is next filtered.

o  Parameters : filter coefficients.
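The Dirac comb property used for the voiced source can be checked numerically. The following is a minimal sketch in discrete time, assuming NumPy; the helper name `dirac_comb` and the values of `n_samples` and `period` are illustrative choices, not part of the original slides.

```python
import numpy as np

def dirac_comb(n_samples, period):
    """Discrete impulse train: 1 every `period` samples, 0 elsewhere."""
    comb = np.zeros(n_samples)
    comb[::period] = 1.0
    return comb

# Illustrative values (assumptions): 64 samples, one impulse every 8 samples.
n_samples, period = 64, 8
source = dirac_comb(n_samples, period)

# Magnitude spectrum of the impulse train.
spectrum = np.abs(np.fft.fft(source))

# Bins where the spectrum is non-zero: they are themselves regularly spaced,
# every n_samples / period bins, i.e. the spectrum is again a comb.
peak_bins = np.flatnonzero(spectrum > 1e-9)
print(peak_bins)
```

With an impulse every T samples, the non-zero bins are spaced by N/T, which is the discrete counterpart of the 1/T spacing stated above.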


« Code-Excited Linear Predictive » (CELP)

For each frame of 20 ms :

o  Auto-Regressive coefficients are computed such that the prediction error is minimized over the entire duration of the frame:

o  Quantized coefficients and an index encoding the error signal are transmitted.
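As an illustration of the linear prediction step only (not of a complete CELP coder), the sketch below estimates auto-regressive coefficients with the autocorrelation method and computes the prediction residual. It assumes NumPy; the function names and the choice of the autocorrelation (Yule-Walker) formulation are assumptions, since the slides do not specify how the minimization is carried out.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Estimate a_1..a_p such that s[n] is approximated by sum_k a_k * s[n-k]."""
    n = frame.size
    # Autocorrelation of the frame for lags 0..order.
    r = np.array([np.dot(frame[:n - lag], frame[lag:]) for lag in range(order + 1)])
    # Normal equations R a = r[1:], where R is the Toeplitz matrix built from r.
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])

def prediction_residual(frame, coeffs):
    """Prediction error e[n] = s[n] - sum_k a_k * s[n-k] (samples before the frame taken as 0)."""
    predicted = np.zeros_like(frame)
    for k, a in enumerate(coeffs, start=1):
        predicted[k:] += a * frame[:-k]
    return frame - predicted
```

In a coder of this family, the residual is what the transmitted index is meant to approximate, while the coefficients describe the filter.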


(Fig. « Code-Excited Linear Predictive » (CELP): signal, residual, AR coefficients, index)


Speaker Recognition

•  Classical pattern recognition problem

•  Specific problems:

o  Open Set / Closed Set: rejection problem

o  Identification / Verification

o  Text Dependency

•  Method

o  Feature extraction: describe each speech signal with Mel-Frequency Cepstral Coefficients (MFCCs) and their derivatives.

o  Classification

o  Text independent: Vector Quantization Codebooks or Gaussian Mixture Models (GMMs)

o  Text dependent: Dynamic Time Warping (DTW) or Hidden Markov Model (HMM)
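For the text-dependent case, Dynamic Time Warping compares two feature sequences of different lengths. A minimal sketch follows, assuming scalar features and an absolute-difference local cost for readability; in practice the sequences would be MFCC vectors and the local cost a vector distance. The function name `dtw_distance` is hypothetical.

```python
def dtw_distance(a, b):
    """Cumulative cost of the best monotonic alignment between sequences a and b."""
    inf = float("inf")
    # cost[i][j]: best cumulative cost aligning a[:i] with b[:j].
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Allowed moves: advance in a, advance in b, or advance in both.
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(a)][len(b)]

# The same pattern uttered at two different speeds aligns at zero cost.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```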



Speech recognition

•  An Automatic Speech Recognition System is typically decomposed into:

o  Feature Extraction: MFCCs

o  Acoustic Models: HMMs trained for a set of phones

o  Each phone is modelled with 3 states

o  Pronunciation dictionary: convert a series of phones into a word

o  Language Model: predict the likelihood of specific words occurring one after another with n-grams


(Fig. from HTK documentation)


MFCCs rules?

Mel Frequency Cepstral Coefficients are commonly derived as follows:

1.  Take the Fourier transform of (a windowed excerpt of) a signal.

2.  Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows.

3.  Take the logs of the powers at each of the mel frequencies.

4.  Take the discrete cosine transform (DCT) of the list of mel log powers, as if it were a signal.

5.  The MFCCs are the amplitudes of the resulting spectrum.
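The five steps above can be sketched for a single frame as follows. This is a minimal illustration assuming NumPy, a Hann window, a common analytic mel formula (2595 log10(1 + f/700)) and illustrative defaults (26 filters, 13 coefficients); none of these choices comes from the slides, and toolkits such as HTK differ in details.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular overlapping filters, equally spaced on the mel scale."""
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    hz_edges = mel_to_hz(mel_edges)
    freqs = np.arange(n_fft // 2 + 1) * sr / n_fft  # frequency of each FFT bin
    bank = np.zeros((n_filters, freqs.size))
    for i in range(n_filters):
        lo, mid, hi = hz_edges[i], hz_edges[i + 1], hz_edges[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def mfcc_frame(frame, sr, n_filters=26, n_coeffs=13):
    # 1. Fourier transform of a windowed excerpt.
    power = np.abs(np.fft.rfft(frame * np.hanning(frame.size))) ** 2
    # 2. Map the powers onto the mel scale with triangular windows.
    mel_power = mel_filterbank(n_filters, frame.size, sr) @ power
    # 3. Log of the mel powers (small offset avoids log(0)).
    log_mel = np.log(mel_power + 1e-10)
    # 4. DCT (type II) of the log mel powers.
    n = np.arange(n_filters)
    k = np.arange(n_coeffs)[:, None]
    dct_basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))
    # 5. The MFCCs are the amplitudes of the resulting "spectrum".
    return dct_basis @ log_mel
```

For example, `mfcc_frame(frame, 16000)` on a 512-sample frame returns a vector of 13 coefficients.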



Issues with MFCCs computation steps

•  The mel frequency warping:

o  highly criticized from a perceptual point of view (Greenwood)

o  conceptually: periodicity analysis over data that are not periodic anymore (Camacho)

•  The Cepstral Coefficients are COSINE coefficients:

o  cannot shift with speaker size to capture the shift in formant frequencies that occurs as children grow up and their vocal tracts get longer

•  Not a sound representation:

o  no way to provide enhancements such as speaker and channel adaptation, background noise suppression, source separation



Potentials of the DCT step

•  Pols observed that the principal components capture most of the variance using a few smooth basis functions, smoothing away the pitch ripples

•  Principal components of a collection of vowel spectra on a warped frequency scale aren't so far from the cosine basis functions

•  Decorrelates the features.

o  This is important because the MFCCs are in most cases modelled by Gaussians with diagonal covariance matrices



Sound Models

•  Major classes of sounds

1.  Transients (castanets, …)

2.  Pseudo periodic (flute, …)

3.  Stochastic (waves, …)

•  Models

1.  Impulsive noise

2.  Sum of sinusoids

3.  Wide band noise



Classification

•  Method: [Tzanetakis’02]

o  Agree on mutually exclusive set of tags (the ontology)

o  Extract features from audio (MFCCs and variations)

o  Train statistical models:

o  Due to the high dimensionality of the feature vectors, discriminative approaches are preferred (SVMs)

•  Segmentation

o  Smoothing decision using dynamic programming (DP)

[Tzanetakis’02] G. Tzanetakis and P. Cook. Musical Genre Classification of Audio Signals. IEEE Transactions on Speech and Audio Processing, 2002.

(Fig. from [Ramona07])


Multi-Class Discriminative Classification

•  Usually performed by combining binary classifiers

•  Two approaches:

o  One-vs-all: For each class build a classifier for that class versus the rest

o  Often very imbalanced classifiers (use asymmetric regularization)

o  All-vs-all: for each pair of classes, build a classifier

o  A priori a large number of classifiers to build, but each pairwise classifier is faster to train and the classes are balanced (easier to find the best regularization)
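The difference in the number of binary problems between the two approaches can be made concrete with a short sketch. The class names and helper functions below are hypothetical, and the majority vote is only one common way of combining pairwise decisions.

```python
from itertools import combinations

def one_vs_all_tasks(classes):
    """One binary task per class: (positive class, list of the remaining classes)."""
    return [(c, [other for other in classes if other != c]) for c in classes]

def all_vs_all_tasks(classes):
    """One binary task per unordered pair of classes."""
    return list(combinations(classes, 2))

def majority_vote(pairwise_winners):
    """Combine the winners returned by the pairwise classifiers."""
    counts = {}
    for winner in pairwise_winners:
        counts[winner] = counts.get(winner, 0) + 1
    return max(counts, key=counts.get)

classes = ["speech", "music", "noise", "silence"]  # hypothetical ontology
print(len(one_vs_all_tasks(classes)))  # K tasks, each 1 class against K-1
print(len(all_vs_all_tasks(classes)))  # K(K-1)/2 tasks, each 1 class against 1
```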



Multi-Label Discriminative Classification

•  Each object may be tagged using several labels

•  Computational approaches

o  Power Sets

o  Binary Relevance (equivalent to one-vs-all)

•  Multiple criteria:

o  « Flattening » the ontology

o  Research trend: considering the ontology structure to benefit from the co-occurrence of labels across different semantic criteria



Music Similarity

•  Question to solve: « Given a seed song, provide us with the entries of the database which are the most similar »

•  Annotation type: Artist / Album

•  Method: [Aucouturier’04]

o  Songs are modeled as GMM of MFCCs

o  The proximity between GMMs is used as a similarity measure:

o  Likelihood (requires access to the MFCCs)

o  Sampling
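The sampling option can be sketched as a Monte Carlo comparison of two models: draw points from each GMM and compare the likelihoods they obtain under both models. The code below is an assumption-laden illustration (diagonal covariances, NumPy, a symmetrised log-likelihood difference, hypothetical function names); it is not claimed to reproduce the exact measure of [Aucouturier’04].

```python
import numpy as np

def gmm_log_likelihood(X, weights, means, variances):
    """Mean log-likelihood of the rows of X under a diagonal-covariance GMM."""
    diff = X[:, None, :] - means[None, :, :]                      # (n, K, d)
    log_norm = -0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
    log_comp = log_norm[None, :] - 0.5 * np.sum(diff ** 2 / variances[None, :, :], axis=2)
    a = log_comp + np.log(weights)[None, :]
    # Log-sum-exp over the components, computed in a numerically stable way.
    m = a.max(axis=1, keepdims=True)
    log_px = m[:, 0] + np.log(np.sum(np.exp(a - m), axis=1))
    return log_px.mean()

def gmm_sample(n, weights, means, variances, rng):
    """Draw n points: pick a component per point, then sample its Gaussian."""
    comp = rng.choice(len(weights), size=n, p=weights)
    noise = rng.standard_normal((n, means.shape[1]))
    return means[comp] + np.sqrt(variances[comp]) * noise

def sampling_distance(gmm_a, gmm_b, n=2000, seed=0):
    """Symmetrised distance: close to 0 for identical models, larger when they differ."""
    rng = np.random.default_rng(seed)
    sa = gmm_sample(n, *gmm_a, rng)
    sb = gmm_sample(n, *gmm_b, rng)
    return (gmm_log_likelihood(sa, *gmm_a) + gmm_log_likelihood(sb, *gmm_b)
            - gmm_log_likelihood(sa, *gmm_b) - gmm_log_likelihood(sb, *gmm_a))
```

Here each model is a tuple `(weights, means, variances)`. Unlike the likelihood option, this only needs the model parameters, not the MFCCs of the songs.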


[Aucouturier’04] J.-J. Aucouturier and F. Pachet. Improving Timbre Similarity: How High is the Sky? Journal of Negative Results in Speech and Audio Sciences, 1 (1), 2004.


Cover Version Detection

•  Question to solve: « Given a seed song, provide us with the entries of the database which are cover versions »

•  Annotation: canonical song

•  Method: [Serra’08]

o  Songs are modeled as a time series of Chromas

o  Computation of the similarity matrix between the two time series

o  Similarity is measured using Dynamic Programming Local Alignment
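The last two steps can be illustrated with a generic local alignment (Smith-Waterman style) over a binary similarity matrix. In this sketch each frame is reduced to a single symbol and two frames are "similar" when the symbols are equal; the scoring values and the function name are assumptions and do not reproduce the exact scoring of [Serra’08].

```python
def local_alignment_score(seq_a, seq_b, match=1.0, mismatch=1.0, gap=1.0):
    """Best score of any locally aligned pair of subsequences."""
    rows, cols = len(seq_a), len(seq_b)
    # Binary similarity matrix between the two time series.
    similar = [[seq_a[i] == seq_b[j] for j in range(cols)] for i in range(rows)]
    # H[i][j]: best score of an alignment ending at frames i-1 and j-1.
    H = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    best = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            step = match if similar[i - 1][j - 1] else -mismatch
            H[i][j] = max(0.0,                      # restart the alignment here
                          H[i - 1][j - 1] + step,   # align the two frames
                          H[i - 1][j] - gap,        # skip a frame of seq_a
                          H[i][j - 1] - gap)        # skip a frame of seq_b
            best = max(best, H[i][j])
    return best

# Two sequences sharing a common excerpt (2, 3, 4) surrounded by unrelated frames.
print(local_alignment_score([1, 2, 3, 4, 5], [9, 2, 3, 4, 8]))
```

Because the score can restart at zero anywhere, a shared excerpt is detected even when the rest of the two songs differs.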

[Serra’08] Chroma Binary Similarity and Local Alignment Applied to Cover Song Identification. IEEE Transactions on Audio, Speech, and Language Processing, 2008.

(Fig. from [Serra’08])


Limitations

•  Description of audio and music

o  Polyphonic

o  Multiple shapes varying in various ways

•  Statistical Modeling

o  Curse of dimensionality

o  Sense of structure relevant at multiple levels of temporality




Sound Source Separation

•  “Cocktail party effect”

o  E. C. Cherry, 1953.

o  Ability to concentrate attention on a specific sound source from within a mixture.

o  Even when interfering energy is close to energy of desired source.

•  “Prince Shotoku Challenge”

o  Legendary Japanese prince Shotoku (6th Century AD) could listen and understand simultaneously the petitions by ten people.

o  Concentrate attention on several sources at the same time!

o  “Prince Shotoku Computer” (Okuno et al., 1997)

•  Both allegories imply an extra step of semantic understanding of the sources, beyond mere acoustical isolation.

[Cherry53] E. C. Cherry. Some Experiments on the Recognition of Speech, With One and Two Ears. Journal of the Acoustical Society of America, Vol. 25, 1953.

[Okuno97] H. G. Okuno, T. Nakatani and T. Kawabata. Understanding Three Simultaneous Speeches. Proc. Int. Joint Conference on Artificial Intelligence (IJCAI), Nagoya, Japan, 1997.


The paradigms of Musical Source Separation

(based on [Scheirer00])

•  Understanding without separation

o  E.g. music genre classification

o  “Glass ceiling” of traditional methods (MFCC+GMM) [Aucouturier&Pachet04]

•  Separation for understanding

o  First (partially) separate, then feature extraction

o  Source separation as a way to break the glass ceiling?

•  Separation without understanding

o  BSS: Blind Source Separation (ICA, ISA, NMF)

o  Blind means: only very general statistical assumptions taken.

•  Understanding for separation

o  Supervised source separation (based on a training database)

[Scheirer00] E. D. Scheirer. Music-Listening Systems. PhD thesis, Massachusetts Institute of Technology, 2000.

[Aucouturier&Pachet04] J.-J. Aucouturier and F. Pachet. Improving Timbre Similarity: How High is the Sky? Journal of Negative Results in Speech and Audio Sciences, 1 (1), 2004.


Required sound quality

•  Audio Quality Oriented (AQO)

o  Aimed at full unmixing at the highest possible quality.

o  Applications:

o  Unmixing, remixing, upmixing

o  Hearing aids

o  Post-production

•  Significance Oriented (SO)

o  Separation quality just enough for facilitating semantic analysis of complex signals.

o  Less demanding, more realistic.

o  Applications:

o  Music Information Retrieval

o  Polyphonic Transcription

o  Object-based audio coding


Musical Source Separation Tasks

•  Classification according to the nature of the mixtures:

•  Classification according to available a priori information:


Linear mixing model

•  Only amplitude scaling before mixing (summing)

•  Linear stereo recording setups: XY Stereo, MS Stereo, Close miking, Direct injection


Delayed mixing model

•  Amplitude scaling and delay before mixing

•  Delayed stereo recording setups: AB Stereo, Mixed Stereo, Close miking with delay, Direct injection with delay


Convolutive mixing model

•  Filtering between sources and sensors

•  Convolutive stereo recording setups: Reverberant environment, Binaural, Close miking with reverb, Direct injection with reverb
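The three mixing models (linear, delayed, convolutive) differ only in what happens to each source on its way to each sensor. The sketch below makes this explicit, assuming NumPy, sources stored as rows of an array, integer delays in samples and finite impulse responses; all function and parameter names are illustrative.

```python
import numpy as np

def linear_mix(sources, gains):
    """Linear model: each mixture is a gain-weighted sum of the sources."""
    return gains @ sources

def delayed_mix(sources, gains, delays):
    """Delayed model: each source is scaled and delayed before summing."""
    n_mix, n_src = gains.shape
    n_samples = sources.shape[1]
    mixtures = np.zeros((n_mix, n_samples))
    for m in range(n_mix):
        for n in range(n_src):
            d = int(delays[m, n])
            shifted = np.concatenate([np.zeros(d), sources[n, :n_samples - d]])
            mixtures[m] += gains[m, n] * shifted
    return mixtures

def convolutive_mix(sources, filters):
    """Convolutive model: filters[m][n] is the impulse response from source n to sensor m."""
    n_samples = sources.shape[1]
    mixtures = np.zeros((len(filters), n_samples))
    for m, row in enumerate(filters):
        for n, h in enumerate(row):
            mixtures[m] += np.convolve(sources[n], h)[:n_samples]
    return mixtures
```

The models are nested: a delayed mix with all delays at zero, or a convolutive mix with one-tap filters, reduces to the linear mix.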


Some terminology

•  System of linear equations: X = AS

o  Usual algebraic methods from high school: X known, A known, S unknown

o  But in source separation: unknown variables (S, sources) AND unknown coefficients (A, mixing matrix)

•  Algebra terminology is retained for source separation:

o  More equations (mixtures) than unknowns (sources): overdetermined

o  Same number of equations (mixtures) as unknowns (sources): determined (square A)

o  Fewer equations (mixtures) than unknowns (sources): underdetermined

•  The underdetermined case is the most demanding, but also the most important for music!

o  Music is (still) mostly in stereo, with usually more than 2 instruments

o  Overdetermined and determined situations are only of interest for arrays of sensors or arrays of microphones (localization, tracking)


Binaural Case (1)

•  Goal: find a mask M that retrieves one source when used to filter a given time-frequency representation.

•  DUET (Degenerate Unmixing Estimation Technique) [Yilmaz&Rickard04]

o  Histogram of Interchannel Intensity (IID) and Phase (IPD) Differences

o  Binary Mask created by selecting bins around histogram peaks.

•  Drawback of t-f masking: “musical noise” or “burbling” artifacts

∘ is the Hadamard (element-wise) product

(Fig. from [Vincent06])

(Fig. from [Yilmaz&Rickard04])

[Yilmaz&Rickard04] Ö. Yilmaz and S. Rickard. Blind Separation of Speech Mixtures via Time-Frequency Masking. IEEE Trans. on Signal Processing. Vol. 52(7), July 2004.


Binaural Case (2)

•  Human-assisted time-frequency masking [Vinyes06]

o  Human-assisted selection of the time-frequency bins out of the DUET-like histogram for creating the unmixing mask

o  Implementation as a VST plugin (“Audio Scanner”)

[Vinyes06] M. Vinyes, J. Bonada and A. Loscos. Demixing Commercial Music Productions via Human-Assisted Time-Frequency Masking. 120th AES convention, Paris, France, 2006.


Monaural Case

•  Classification according to a priori knowledge

o  Supervised

o  Based on training the model with a sound example database

o  Better quality and more demanding situations at the cost of less generality

o  Unsupervised

•  Classification according to model type

o  Adaptive basis decompositions (ISA, NMF, NSC)

o  Sinusoidal Modeling

•  Classification according to mixture type

o  Monaural systems

o  Hybrid systems combining advanced source models with spatial diversity


Independent Subspace Analysis

•  Application of ISA to audio: Casey and Westner, 2000.

•  Application of ICA to the spectrogram of a mono mixture.

•  Each independent component corresponds to an independent subspace of the spectrogram.

•  Component-to-source clustering

o  The extracted components usually do not directly correspond to the sources.

o  They must be clustered together according to some similarity criterion.

o  Casey&Westner use a matrix of Kullback-Leibler divergences called the ixegram.

(Fig. from [Casey&Westner00])

[Casey&Westner00] M. Casey and A. Westner. Separation of Mixed Audio Sources by Independent Subspace Analysis. Proc. Int. Computer Music Conference (ICMC), Berlin, Germany, 2000.


ICA for Audio


(Figs from Virtanen)


Nonnegative Matrix Factorization

•  Matrix factorization imposing non-negativity on the factors.

•  Needed when using magnitude or power spectrograms.

•  NMF does not aim at statistical independence, but:

o  It has been proven that, under some conditions, non-negativity is sufficient for separation.

o  NMF yields components that very closely correspond to the sources.

o  To date, there is no exact theoretical explanation of why this is so!

•  Use for transcription:

o  P. Smaragdis and J.C. Brown. Non-Negative Matrix Factorization for Polyphonic Music Transcription. Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, USA, 2003.

•  Use for separation:

o  B. Wang and M. D. Plumbley. Musical Audio Stream Separation by Non-Negative Matrix Factorization. Proc. UK Digital Music Research Network (DMRN) Summer Conf., 2005.
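A minimal sketch of NMF applied to a nonnegative matrix V (for instance a magnitude spectrogram) is given below. It assumes NumPy and uses the multiplicative updates of Lee and Seung for the Euclidean cost; the symbols V, W, H, the rank, the iteration count and the random initialization are assumptions, and the cited separation and transcription papers may use other costs or constraints.

```python
import numpy as np

def nmf(V, rank, n_iter=200, seed=0, eps=1e-9):
    """Approximate V (nonnegative) by W @ H with W >= 0 and H >= 0."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = V.shape
    W = rng.random((n_rows, rank)) + eps   # spectral templates (columns)
    H = rng.random((rank, n_cols)) + eps   # time-varying gains (rows)
    errors = []
    for _ in range(n_iter):
        # Multiplicative updates keep W and H nonnegative at every step.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(V - W @ H))
    return W, H, errors
```

When V is a spectrogram, each column of W can be read as a spectrum and each row of H as its activation over time, which is what makes the components interpretable as notes or sources.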


NMF for Audio


(Figs from Virtanen)


NMF for Vision

•  By representing signals as a purely additive sum of nonnegative sources, we get a parts-based representation [Lee’99]

[Lee’99] Lee and Seung, Learning the parts of objects by nonnegative matrix factorization, Nature, 1999, 41


Nonnegative Sparse Coding

•  Combination of non-negativity and sparsity constraints in the factorization.

•  [Virtanen03]: NSC is optimized with an additional criterion of temporal continuity.

o  Measured by the absolute value of the overall amplitude difference between consecutive frames.

•  [Virtanen04]: Convolutive Sparse Coding

o  Improved temporal accuracy by modeling the sources as the convolution of spectrograms with a vector of onsets.

(Figs. from [Virtanen03] and [Virtanen04]: mixture, component 1, component 2)

[Virtanen03] T. Virtanen. Sound Source Separation Using Sparse Coding with Temporal Continuity Objective. Proc. Int. Computer Music Conference (ICMC), Singapore, 2003.

[Virtanen04] T. Virtanen. Separation of Sound Sources by Convolutive Sparse Coding. Proc. ISCA Tutorial and Research Workshop on Statistical and Perceptual Audio Processing (SAPA), Jeju, Korea, 2004.


Sinusoidal Methods

•  Sinusoidal Modeling: detection and tracking of the sinusoidal partial peaks on the spectrogram.

•  Based on Auditory Scene Analysis (ASA) cues of good-continuation, common fate and smoothness of sinusoidal tracks.

•  Overall, very good reduction of interfering sources, but moderate timbral quality.

o  Appropriate for Significance-Oriented applications

•  [Virtanen&Klapuri02]: model of spectral smoothness of harmonic sounds

o  Based on basis decomposition of harmonic structures

o  Additive resynthesis of partial parameters

•  [Every&Szymanski06]

o  Spectral subtraction instead of additive resynthesis

(Fig. from [Every06]: mixture, separated sources)

[Virtanen&Klapuri02] T. Virtanen and A. Klapuri. Separation of Harmonic Sounds Using Linear Models for the Overtone Series. Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Orlando, USA, 2002.

[Every&Szymanski06] M. R. Every and J. E. Szymanski. Separation of Synchronous Pitched Notes by Spectral Filtering of Harmonics. IEEE Trans. on Audio, Speech and Signal Processing. Vol. 14(5), 2006.


Supervised Methods (1)

•  Use of a training database to create a set of source models, each one modeling a specific instrument.

o  Better separation at the cost of generality.

•  Supervised sinusoidal methods

o  [Burred&Sikora07]

o  The source models are compact descriptions of the spectral envelope and its temporal evolution.

o  The detailed temporal evolution makes it possible to ignore harmonicity constraints, and thus separation of chords and inharmonic sounds is possible.

(Figs.: separation of chords, inharmonic separation)

[Burred&Sikora07] J.J. Burred and T. Sikora. Monaural Source Separation from Musical Mixtures Based on Time-Frequency Timbre Models. Proc. Int. Conf. on Music Information Retrieval (ISMIR), Vienna, Austria, September 2007.


Supervised Methods (2)

•  Bayesian Networks

o  [Vincent06]

o  Multilayered model describing note probabilities (state layer), spectral decomposition (source layer) and spatial information (mixture layer).

o  Trained on a database of isolated notes.

o  Allows separation of sounds with reverb.

•  Learnt priors for Wiener-based separation

o  [Ozerov05]

o  Single-channel

o  GMM models of singing voice and accompaniment.


(Figs.: separated sources, mixture)

[Vincent06] E. Vincent. Musical Source Separation Using Time-Frequency Source Priors. IEEE Trans. on Audio, Speech and Language Processing, Vol. 14 (1), 2006.

[Ozerov05] A. Ozerov, O. Philippe, R. Gribonval and F. Bimbot. One Microphone Singing Voice Separation Using Source-Adapted Models. Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, USA, 2005.


Conclusions

•  Still far from a fully general, audio-quality-oriented system.

•  More realistic: significance oriented

o  Separation good enough to facilitate content analysis

•  Methods based on adaptive models, time-frequency masking:

o  More realistic mixtures, but more artifacts and interferences

•  Methods based on sinusoidal modeling:

o  More artificial timbre, but fewer interferences.

•  Current polyphony limitations:

o  Mono signals: up to 3, 4 instruments

o  Stereo signals: up to 5, 6 instruments




Binary Masking with Oracle

•  Binary masking is an effective way of performing the separation

•  Using an oracle makes it possible to assess the relevance of Fourier spectrograms as an atomic representation

•  The binary mask is set to 1 if the source of interest is dominant in the considered frequency bin, and 0 otherwise
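The oracle mask can be written in a few lines once the magnitude spectrograms of the isolated target and of the interference are available (which is what makes it an oracle). The sketch assumes NumPy arrays of identical shape; the STFT and inverse STFT steps are left out, and the function names are illustrative.

```python
import numpy as np

def oracle_binary_mask(target_mag, interference_mag):
    """1 where the source of interest dominates the bin, 0 otherwise."""
    return (target_mag > interference_mag).astype(float)

def apply_mask(mixture_stft, mask):
    """Element-wise masking of the (complex) mixture spectrogram."""
    return mask * mixture_stft

# Toy 2x2 "spectrograms" (assumed values, for illustration only).
target = np.array([[3.0, 0.0], [1.0, 2.0]])
interference = np.array([[1.0, 1.0], [2.0, 2.0]])
mask = oracle_binary_mask(target, interference)
separated = apply_mask((target + interference).astype(complex), mask)
print(mask)
```

The masked spectrogram would then be inverted (ISTFT) to obtain the separated signal.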

(Fig.: STFT, |STFT|, Mask, ISTFT)


Auditory Scene Analysis

•  Formalism proposed by psychoacousticians [Bregman’90]

•  Main principle :

o  The scene can be decomposed into a set of atoms

o  A first level of structuring clusters atoms into entities (notes)

o  A second one clusters entities into streams (voices)


[Bregman’90] A. Bregman, Auditory Scene Analysis: The Perceptual Organization of Sound, The MIT Press, 1990


Affinity Cues

•  Proximity

o  Time

o  Frequency

o  Amplitude

•  Dynamic

•  Harmonicity



Vector-based Clustering

•  Each object of the dataset is described as a set of features

•  For that purpose, “k-means” is widely used due to its efficiency

•  Aim: minimize the within-cluster sum of squares (WCSS)

•  Two-step iterative method:

o  Assignment step: Assign each observation to the cluster with the closest mean

o  Update step: Calculate the new means to be the centroid of the observations in the cluster
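The two alternating steps translate directly into code. The sketch below assumes NumPy and takes the initial means as an argument (how to initialize them is not discussed on the slide); the function name and the handling of empty clusters are illustrative choices.

```python
import numpy as np

def k_means(points, initial_means, n_iter=100):
    """Lloyd's iteration; returns labels, means and the final WCSS."""
    means = np.array(initial_means, dtype=float)
    labels = None
    for _ in range(n_iter):
        # Assignment step: each observation goes to the cluster with the closest mean.
        dists = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments are stable: converged
        labels = new_labels
        # Update step: each mean becomes the centroid of its observations.
        for k in range(len(means)):
            members = points[labels == k]
            if len(members) > 0:  # an empty cluster keeps its previous mean
                means[k] = members.mean(axis=0)
    wcss = float(np.sum((points - means[labels]) ** 2))
    return labels, means, wcss
```

For example, `k_means(points, points[[0, 3]])` starts from two of the observations themselves.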



Graph-based Clustering

•  Given a set of data points A, the similarity matrix may be defined as a matrix W where w(i, j) represents a measure of the similarity between points i and j.

•  More generic approach, as vector-based descriptions can be trivially converted into a similarity matrix

•  Many existing methods:

o  Agglomerative Hierarchical Clustering (AHC)

o  k-medoids

o  Spectral clustering



Spectral Clustering (1)

•  Spectral clustering techniques make use of

o  the spectrum of the similarity matrix of the data (eigenvectors)

o  to perform dimensionality reduction for clustering in fewer dimensions.



Spectral Clustering (2)

•  Method:

o  Compute the similarity between each pair of objects (W)

o  Compute the normalized Laplacian (L):

o  With D the degree matrix defined as the diagonal matrix with the degrees on the diagonal:

o  Select the eigenvectors corresponding to the k largest eigenvalues of the Laplacian and normalize them by rows

o  Use those eigenvectors for clustering with k-means
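The method can be sketched as follows. Since the formulas for L and D did not survive on the slide, the code assumes the common Ng-Jordan-Weiss form, L = D^(-1/2) W D^(-1/2) with D the diagonal matrix of row sums of W, which is consistent with selecting the k largest eigenvalues and normalizing by rows. NumPy is assumed, and the small k-means with farthest-first initialization is only there to keep the example self-contained and deterministic.

```python
import numpy as np

def spectral_embedding(W, k):
    """Rows of the top-k eigenvectors of D^(-1/2) W D^(-1/2), normalized to unit length."""
    degrees = W.sum(axis=1)                        # diagonal of the degree matrix D
    d_inv_sqrt = 1.0 / np.sqrt(degrees)
    L = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L)           # eigenvalues in ascending order
    top = eigvecs[:, -k:]                          # eigenvectors of the k largest
    return top / np.linalg.norm(top, axis=1, keepdims=True)

def k_means(points, k, n_iter=50):
    """Small k-means with farthest-first initialization (an illustrative choice)."""
    means = [points[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(points - m, axis=1) for m in means], axis=0)
        means.append(points[d.argmax()])
    means = np.array(means, dtype=float)
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                means[c] = points[labels == c].mean(axis=0)
    return labels

def spectral_clustering(W, k):
    return k_means(spectral_embedding(W, k), k)
```

The input is a symmetric similarity matrix with positive row sums; the clustering itself happens in the k-dimensional embedding rather than in the original feature space.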



Spectral Clustering (3)

•  Can be viewed as a relaxation of the Normalized Cuts problem:


Performance Criterion

•  Normalized Mutual Information

o  Given 2 sets of labels (the ground truth and the cluster results) estimate the degree of matching between the 2

•  The NMI is between 0 and 1, 1 being a perfect match.

$$\mathrm{NMI}(X, Y) = \frac{H(X) - H(X \mid Y)}{\left(H(X) + H(Y)\right) / 2}$$

$$H(X \mid Y) = -\sum_{x, y} p(x, y) \log p(x \mid y)$$
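The NMI definition above can be evaluated directly from two label sequences by estimating the probabilities with counts. The sketch below uses only the standard library; the function names are illustrative, and returning 1 when both labelings are trivial (zero entropy) is a convention assumed here, not stated on the slide.

```python
import math
from collections import Counter

def entropy(labels):
    """H(X) estimated from label frequencies."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def conditional_entropy(x, y):
    """H(X|Y) = -sum over (x, y) of p(x, y) * log p(x | y)."""
    n = len(x)
    joint = Counter(zip(x, y))
    marginal_y = Counter(y)
    return -sum((c / n) * math.log(c / marginal_y[y_value])
                for (_, y_value), c in joint.items())

def nmi(x, y):
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0.0:
        return 1.0  # assumed convention: two single-cluster labelings match
    return (hx - conditional_entropy(x, y)) / ((hx + hy) / 2.0)

# Cluster indices are arbitrary: a relabelled but identical partition is a perfect match.
print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))
```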


Clustering of Spectral Audio (CoSA)

•  Aim: apply a clustering method to generate the binary mask

•  Method:

o  Spectrogram as the atomic representation

o  Prune the representation to retain only high amplitude components

o  Compute the features related to each retained component

o  Split the spectrogram into contiguous texture windows

o  For each texture window

o  Apply clustering

o  Select clusters that are likely to belong to the source of interest

o  Apply spectrogram inversion



Performance Criteria

•  NMI

o  Does not consider the amplitude of the spectral components

•  Spectral domain Signal to Noise Ratio

o  Does not consider phase and framing issues

•  Time domain Signal to Noise Ratio

o  Is still not perfect as it is only vaguely related to perception

$$\mathrm{SSNR}(X, Y) = \sum_{n=1}^{N} \sum_{m=1}^{M} \frac{|X_{n,m}|}{|X_{n,m}| - |Y_{n,m}|}$$

$$\mathrm{TSNR}(x, y) = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T} \frac{x(nT + t)^2}{\left(x(nT + t) - y(nT + t)\right)^2}$$


Literature

•  Very few overview materials on Musical Source Separation

•  P. D. O'Grady, B. A. Pearlmutter and S. T. Rickard. Survey of sparse and non-sparse methods in source separation. International Journal of Imaging Systems and Technology, 15(1). 2005.

•  E. Vincent, M. G. Jafari, S. A. Abdallah, M. D. Plumbley and M. E. Davies. Model-based audio source separation. Technical Report C4DM-TR-05-01, Queen Mary University, London, UK, 2006.

•  T. Virtanen. Unsupervised Learning Methods for Source Separation in Monaural Music Signals. Chapter in A. Klapuri, M. Davy (Eds.), Signal Processing Methods for Music Transcription, Springer 2006.

•  Stereo Audio Source Separation Evaluation Campaign:

o  http://sassec.gforge.inria.fr

•  Tutorial (section 3 is an excerpt)

o  Juan Jose Burred. Musical Source Separation: Principles and State of the Art. 2nd Int. Workshop on Learning Semantics of Audio Signals (LSAS)
