Topic 14 - ece.rochester.edu

Topic 14

Sound Event Detection

An Example

ECE 477 - Computer Audition - Zhiyao Duan 2018

A busy street

What is it?

• Scene classification

– Characterize an acoustic scene from an audio recording with a semantic label

• Sound Event Detection

– Locate and recognize each occurrence of a specific event in an audio recording

– “when” and “what”


Let’s listen to some examples


Scene: Busy street

Events: Speech, Footstep, Motorcycle, Car horn

Scene: Forest

Events: Bird singing, Insects

Scene: Office

Events: Copy machine, Computer keyboard, Phone, Footstep

Why should we care?

• Intellectual merit

– Important problem in computer audition / artificial intelligence

• Broader impacts

– Security surveillance

– Environment/context aware computing

– Biological environment monitoring

– Robot navigation

– Healthcare (assisting deaf people)

– Multimedia indexing


Addressing SED at different levels

• Detecting specific sounds

– E.g., gunshots, vehicles, machines, birds

• Classifying isolated events

– Only answers “what” but not “when”

• Detecting non-overlapping events from continuous audio recordings (i.e., monophonic)

• Detecting overlapping events from continuous audio recordings (i.e., polyphonic)


Classical Approaches

• Classification

– Represent an audio recording as a bag-of-frame features (e.g., MFCC)

– Classify the audio according to statistics of these features

• Dictionary learning

– Decompose an audio recording using a dictionary of “acoustic atoms”, e.g., NMF
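The bag-of-frames idea above can be sketched in a few lines: per-frame feature vectors are summarized by order-independent statistics, and the resulting fixed-length vector is what a classifier would consume. This is a minimal illustration only; the function name `bag_of_frames`, the choice of mean and standard deviation as the statistics, and the toy numbers are assumptions, not something specified on the slides.

```python
from statistics import mean, pstdev

def bag_of_frames(frames):
    """Summarize per-frame feature vectors by per-dimension mean and std.

    Frame order is discarded, which is what makes this a "bag" of frames.
    """
    dims = list(zip(*frames))  # transpose: one tuple per feature dimension
    return [mean(d) for d in dims] + [pstdev(d) for d in dims]

# Toy 2-dimensional "MFCC" frames (made-up numbers, 3 frames)
frames = [[1.0, 0.0], [3.0, 0.0], [2.0, 3.0]]
clip_vector = bag_of_frames(frames)  # [mean_0, mean_1, std_0, std_1]
```

Because the statistics ignore frame order, reversing or shuffling `frames` gives the same `clip_vector`; temporal structure is lost, which is one reason this representation suits scene classification better than event detection.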


Datasets

• Google’s AudioSet

– An expanding ontology of 632 audio event classes

– 2,084,320 human-labeled 10 s sound clips from YouTube

– https://research.google.com/audioset/


[Gemmeke et al. ICASSP’17]

Datasets

• FSD

– Freesound content with crowdsourced labels from the AudioSet ontology

– https://datasets.freesound.org/


As of November 20, 2018

IEEE AASP Challenge…

• … on Detection and Classification of Acoustic Scenes and Events (D-CASE)

• 2013: 1st edition, 3 tasks

– http://c4dm.eecs.qmul.ac.uk/sceneseventschallenge/

• 2016: 2nd edition, 4 tasks, started a workshop, 23 papers

– http://www.cs.tut.fi/sgn/arg/dcase2016/

• 2017: 3rd edition, 4 tasks, 27 papers

– http://www.cs.tut.fi/sgn/arg/dcase2017/index

• 2018: 4th edition, 5 tasks, 44 papers

– http://dcase.community/challenge2018/


DCASE 2018

• Task 1: Acoustic scene classification

• Task 2: General-purpose audio tagging of Freesound content with AudioSet labels

• Task 3: Bird audio detection

• Task 4: Large-scale weakly labeled semi-supervised sound event detection in domestic environments

• Task 5: Monitoring of domestic activities based on multi-channel acoustics


Task 1 – Acoustic Scene Classification

• Dataset: TUT Urban Acoustic Scene 2018, ~24 hours

– 10 scenes: Airport, indoor shopping mall, metro station, pedestrian street, public square, street with medium level of traffic, traveling by a tram, traveling by a bus, traveling by an underground metro, urban park

• Devices

– Soundman OKM II Klassik/studio A3 + 3 consumer devices

• Subtask A

– Same device between training and testing

• Subtask B

– Different device between training and testing

• Subtask C

– Use of external data in training


Baseline System

• CNN approach

– Input: log-mel spectrogram (40 bands * 500 frames, 10s)

– 2 CNN layers with ReLU

– 1 dense layer with ReLU

– Output: softmax
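To make the input and output of this baseline concrete, the sketch below works out the input size and the implied time step per frame from the numbers on the slide (40 bands, 500 frames, 10 s), and shows a numerically stable softmax over the 10 scene classes. The scores are made-up values standing in for the network's last-layer outputs, and the 20 ms step is derived from 10 s / 500 frames rather than stated in the source.

```python
import math

def softmax(scores):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

N_BANDS, N_FRAMES, DURATION_S = 40, 500, 10.0
hop_s = DURATION_S / N_FRAMES   # time step per frame implied by the slide
n_inputs = N_BANDS * N_FRAMES   # values in one log-mel input

SCENES = [
    "airport", "indoor shopping mall", "metro station", "pedestrian street",
    "public square", "street with medium level of traffic",
    "traveling by a tram", "traveling by a bus",
    "traveling by an underground metro", "urban park",
]
# Hypothetical pre-softmax scores, one per scene (not real model output)
scores = [0.1, 0.3, 2.0, -1.0, 0.0, 0.5, 1.2, -0.4, 0.9, 0.2]
probs = softmax(scores)
predicted = SCENES[probs.index(max(probs))]
```

Softmax fits this task because each 10 s clip has exactly one scene label, so the class probabilities should sum to one.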


Subtask A comparison

Best Scoring System

• Ensemble of 9 CNNs trained on different spectrograms with data augmentation


Task 2 – Audio Tagging

• A subset of FSD dataset: Freesound content with AudioSet labels

• 41 labels with AudioSet ontology


• Tearing
• Bus
• Shatter
• Gunshot, gunfire
• Fireworks
• Writing
• Computer keyboard
• Scissors
• Microwave oven
• Keys jangling
• Drawer open or close
• Squeak
• Knock
• Telephone
• Saxophone
• Oboe
• Flute
• Clarinet
• Acoustic guitar
• Tambourine
• Glockenspiel
• Gong
• Snare drum
• Bass drum
• Hi-hat
• Electric piano
• Harmonica
• Trumpet
• Violin, fiddle
• Double bass
• Cello
• Chime
• Cough
• Laughter
• Applause
• Finger snapping
• Fart
• Burping, eructation
• Cowbell
• Bark
• Meow

Baseline System

• CNN approach

– Input: log-mel spectrogram (64 bands * 25 frames, 0.25s)

– 3 CNN layers with ReLU

– Final max-reduction to a single value with softmax


System comparison

Best Scoring System

• DenseNet blocks, mixup augmentation, batch-wise loss masking, ensemble of log-mel and waveform-based networks
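Mixup augmentation, mentioned above, builds a synthetic training example as a convex combination of two real examples and of their label vectors. The sketch below follows the standard formulation with a mixing weight drawn from a Beta(alpha, alpha) distribution; `alpha=0.2` and the toy inputs are assumptions for illustration, and the winning system's actual settings are not given on the slide.

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Mix two (features, one-hot label) pairs with a Beta-distributed weight."""
    lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam

# Toy example: two 3-value "spectrogram patches" with one-hot labels
rng = random.Random(0)  # seeded only so the sketch is repeatable
x_mix, y_mix, lam = mixup([1.0, 2.0, 3.0], [1.0, 0.0],
                          [0.0, 0.0, 0.0], [0.0, 1.0], rng=rng)
```

The mixed label `y_mix` is a soft target that still sums to one, so it can be used directly with a cross-entropy loss.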


Task 3 – Bird Audio Detection

• Detect the presence of bird sound of any kind

• Development datasets

– “freefield1010”: Field recordings (5.8 GB)

– “warblrb10k”: Crowdsourced dataset (4.3 GB)

– “BirdVox-DCASE-20k”: Remote monitoring flight calls (15.4 GB)

• Evaluation datasets

– “warblrb10k”: crowdsourced dataset (1.3 GB)

– “Chernobyl”: remote monitoring (5.3 GB)

– “PolandNFC”: remote monitoring night-flight calls (2.3 GB)


Baseline System

• CNN approach

– Input: log-mel spectrogram (80 bands * 1000 frames, 14s)

– 4 CNN layers with leaky ReLU

– 3 dense layers with sigmoid


Best Scoring System

• CNN pre-trained on ImageNet + data augmentation:

– Time domain

• Apply jitter to chunk duration

• Insert, delete, swap chunks

• Cyclic shifts

– Frequency domain

• Frequency shifting

• Different interpolation filters

• Apply color jitter

• Most effective augmentations found

– Adding noise/content from random files

– Piecewise time/frequency stretching

– Time interval dropout
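Two of the augmentations listed above are simple enough to sketch on a spectrogram stored as a list of frames. The interpretation here is an assumption: a cyclic shift rotates the frames in time, and time interval dropout overwrites a contiguous run of frames with a fill value. The function names, the zero fill value, and the toy spectrogram are hypothetical and not taken from the winning system.

```python
def cyclic_shift(frames, shift):
    """Rotate frames in time; frames pushed past the end wrap to the start."""
    shift %= len(frames)
    if shift == 0:
        return list(frames)
    return frames[-shift:] + frames[:-shift]

def time_interval_dropout(frames, start, length, fill=0.0):
    """Return a copy with frames[start:start+length] replaced by `fill`."""
    out = [list(f) for f in frames]
    for t in range(start, min(start + length, len(out))):
        out[t] = [fill] * len(out[t])
    return out

# Toy spectrogram: 4 frames x 2 bands (made-up values)
spec = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
shifted = cyclic_shift(spec, 1)
dropped = time_interval_dropout(spec, start=1, length=2)
```

Both functions return new lists and leave `spec` untouched, so they can be chained or applied at random during training.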


Task 4 – Sound Event Detection

• Detect event class and time boundaries

• 10 classes: speech, dog, cat, alarm/bell/ringing, dishes, frying, blender, running water, vacuum cleaner, electric shaver/toothbrush

• Training set: 2,244 10s clips

– Weakly labeled: only contains clip-level presence of events

• Test set: 906 10s clips
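The gap between weak and strong labels can be illustrated with per-frame probabilities for one class: a strong output needs onset and offset times, while a weak (clip-level) label only records whether the event occurs at all. The sketch below is one common way to derive both, by thresholding frames and taking the maximum over time; the threshold, the hop size, and the function names are assumptions, not the baseline's documented post-processing.

```python
def frames_to_events(probs, threshold=0.5, hop_s=0.02):
    """Turn per-frame probabilities for one class into (onset_s, offset_s) pairs."""
    events, onset = [], None
    for i, p in enumerate(probs):
        active = p >= threshold
        if active and onset is None:
            onset = i                                   # event starts here
        elif not active and onset is not None:
            events.append((onset * hop_s, i * hop_s))   # event ended
            onset = None
    if onset is not None:                               # still active at clip end
        events.append((onset * hop_s, len(probs) * hop_s))
    return events

def clip_level_label(probs, threshold=0.5):
    """Weak label: is the event present anywhere in the clip?"""
    return max(probs) >= threshold

# Made-up frame probabilities; hop_s=0.5 keeps the numbers easy to read
probs = [0.1, 0.2, 0.9, 0.8, 0.7, 0.1, 0.6, 0.9]
events = frames_to_events(probs, hop_s=0.5)
present = clip_level_label(probs)
```

Training only sees something like `present`, yet evaluation asks for something like `events`, which is what makes the weakly labeled setting hard.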


Baseline System

• Convolutional Recurrent Neural Network (CRNN)

– Input: log-mel spectrogram (64 bands * 500 frames, 10s)

• 3 convolution layers

• 1 recurrent layer (GRU)

• 1 dense layer


System comparison

Best Scoring System

• Mean-teacher model with context-gating CNN and RNN


The Mean-Teacher Method

• The teacher model's weights are an exponential moving average of the student model's weights, updated after each training step


A. Tarvainen, H. Valpola, “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results,” arXiv:1703.01780, 2017.
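The weight update at the core of the mean-teacher method is short enough to write out. In the sketch below, weights are plain lists of floats; the decay value and the toy weights are assumptions chosen for illustration, and a real implementation would apply the same rule to every parameter tensor of the network.

```python
def ema_update(teacher, student, decay=0.999):
    """teacher <- decay * teacher + (1 - decay) * student, element-wise."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]

# Toy weights: the teacher starts away from the student and drifts toward it
teacher = [0.0, 1.0]
student = [1.0, 1.0]
one_step = ema_update(teacher, student, decay=0.9)

# With the student held fixed, repeated updates converge to the student
for _ in range(200):
    teacher = ema_update(teacher, student, decay=0.9)
```

Only the student is trained by gradient descent; the teacher is never back-propagated through and simply tracks a smoothed version of the student.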

Task 5 – Multi-channel Domestic Activity Monitoring

• Activities: absence, cooking, dishwashing, eating, other, social activity (visit, phone call), vacuum cleaning, watching TV, working (typing, mouse click, …)

– 72,984 10s clips


Baseline System

• CNN approach

– Input: single-channel log-mel spectrogram (40 bands * 501 frames, 10s)

– 2 convolutional layers with ReLU

– 1 dense layer with ReLU

– 1 output layer with softmax


System comparison

Best Scoring System

• CNN that processes each channel independently and fuses the classification results

• Ensemble of 4 classifiers trained on different folds

• Data augmentation

– Shuffling and mixing
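Processing channels independently and then fusing the results is a form of late fusion. The slide does not say how the winning system fuses its per-channel outputs, so the sketch below assumes the simplest option, averaging the class probabilities across channels; the class subset, the two channels, and the numbers are made up.

```python
def fuse_channels(channel_probs):
    """Late fusion: average class probabilities across channels."""
    n = len(channel_probs)
    return [sum(p) / n for p in zip(*channel_probs)]

# Toy example: 2 channels x 3 activity classes (hypothetical classifier outputs)
ACTIVITIES = ["cooking", "eating", "watching TV"]
channel_probs = [
    [0.7, 0.2, 0.1],   # channel 1
    [0.4, 0.5, 0.1],   # channel 2
]
fused = fuse_channels(channel_probs)
predicted = ACTIVITIES[fused.index(max(fused))]
```

The same function could average the outputs of classifiers trained on different folds, since an ensemble combines predictions in the same way.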

