
Topic 14 - ece.rochester.edu

Oct 31, 2021


Topic 14

Sound Event Detection


An Example

ECE 477 - Computer Audition - Zhiyao Duan 2018

A busy street


What is it?

• Scene classification

– Characterize an acoustic scene from an audio recording with a semantic label

• Sound Event Detection

– Locate and recognize each occurrence of a specific event in an audio recording

– “when” and “what”


Let’s listen to some examples

• Scene: Busy street

– Events: Speech, Footstep, Motorcycle, Car horn

• Scene: Forest

– Events: Bird singing, Insects

• Scene: Office

– Events: Copy machine, Computer keyboard, Phone, Footstep


Why should we care?

• Intellectual merit

– Important problem in computer audition / artificial intelligence

• Broader impacts

– Security surveillance

– Environment/context aware computing

– Biological environment monitoring

– Robot navigation

– Healthcare (assisting deaf people)

– Multimedia indexing


Addressing SED at different levels

• Detecting specific sounds

– E.g., gunshots, vehicles, machines, birds

• Classifying isolated events

– Only answers “what” but not “when”

• Detecting non-overlapping events from continuous audio recordings (i.e., monophonic)

• Detecting overlapping events from continuous audio recordings (i.e., polyphonic)
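
The difference between the monophonic and polyphonic settings can be made concrete with a small sketch. It assumes each detected event is stored as a label ("what") plus onset and offset times in seconds ("when"); the names `SoundEvent` and `is_polyphonic` are illustrative and do not come from the slides.

```python
from collections import namedtuple

# Hypothetical event record: label answers "what", onset/offset answer "when".
SoundEvent = namedtuple("SoundEvent", ["label", "onset", "offset"])

def is_polyphonic(events):
    """Return True if any two events overlap in time."""
    latest_offset = float("-inf")
    for event in sorted(events, key=lambda e: e.onset):
        # An event that starts before an earlier one has ended overlaps it.
        if event.onset < latest_offset:
            return True
        latest_offset = max(latest_offset, event.offset)
    return False

street = [SoundEvent("speech", 0.0, 4.0), SoundEvent("car horn", 2.5, 3.0)]
office = [SoundEvent("phone", 0.0, 2.0), SoundEvent("footstep", 2.0, 5.0)]
```

Here `street` is polyphonic because the car horn sounds during speech, while `office` is monophonic because the footsteps start exactly when the phone stops.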


Classical Approaches

• Classification

– Represent an audio recording as a bag-of-frame features (e.g., MFCC)

– Classify the audio according to statistics of these features

• Dictionary learning

– Decompose an audio recording using a dictionary of “acoustic atoms”, e.g., NMF
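
As a rough illustration of the bag-of-frames idea, the sketch below summarizes a recording by per-dimension statistics of its frame features and discards the order of the frames. It assumes the per-frame features (e.g., MFCC vectors) have already been computed elsewhere; the function name `bag_of_frames` and the choice of mean and variance as the statistics are assumptions made for this example.

```python
from statistics import fmean, pvariance

def bag_of_frames(frames):
    """Summarize per-frame feature vectors by per-dimension mean and variance.

    `frames` is a list of equal-length feature vectors, one per frame.
    Frame order is ignored, which is what makes this a "bag" of frames.
    """
    dims = list(zip(*frames))  # regroup the values by feature dimension
    return [fmean(d) for d in dims] + [pvariance(d) for d in dims]

# Two frames with two (made-up) feature values each.
clip = [[1.0, 0.0], [3.0, 0.0]]
summary = bag_of_frames(clip)  # [mean_0, mean_1, var_0, var_1]
```

The resulting fixed-length vector can be passed to any standard classifier regardless of the clip's duration.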


Datasets

• Google’s AudioSet

– An expanding ontology of 632 audio event classes

– 2,084,320 human labeled 10s sound clips from YouTube

– https://research.google.com/audioset/


[Gemmeke et al. ICASSP’17]


Datasets

• FSD

– Freesound content, crowdsource-labeled with the AudioSet ontology

– https://datasets.freesound.org/


As of November 20, 2018


IEEE AASP Challenge…

• … on Detection and Classification of Acoustic Scenes and Events (D-CASE)

• 2013: 1st edition, 3 tasks

– http://c4dm.eecs.qmul.ac.uk/sceneseventschallenge/

• 2016: 2nd edition, 4 tasks, started a workshop, 23 papers

– http://www.cs.tut.fi/sgn/arg/dcase2016/

• 2017: 3rd edition, 4 tasks, 27 papers

– http://www.cs.tut.fi/sgn/arg/dcase2017/index

• 2018: 4th edition, 5 tasks, 44 papers

– http://dcase.community/challenge2018/


DCASE 2018

• Task 1: Acoustic scene classification

• Task 2: General-purpose audio tagging of Freesound content with AudioSet labels

• Task 3: Bird audio detection

• Task 4: Large-scale weakly labeled semi-supervised sound event detection in domestic environments

• Task 5: Monitoring of domestic activities based on multi-channel acoustics


Task 1 – Acoustic Scene Classification

• Dataset: TUT Urban Acoustic Scenes 2018, ~24 hours

– 10 scenes: Airport, indoor shopping mall, metro station, pedestrian street, public square, street with medium level of traffic, traveling by a tram, traveling by a bus, traveling by an underground metro, urban park

• Devices

– Soundman OKM II Klassik/studio A3 + 3 consumer devices

• Subtask A

– Same device between training and testing

• Subtask B

– Different device between training and testing

• Subtask C

– Use of external data in training


Baseline System

• CNN approach

– Input: log-mel spectrogram (40 bands * 500 frames, 10s)

– 2 CNN layers with ReLU

– 1 dense layer with ReLU

– Output: softmax
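
The input dimensions quoted above imply a frame spacing that is easy to check. The sketch assumes the 500 frames are evenly spaced across the 10 s clip and ignores window length and edge padding; the helper name `input_summary` is made up for this example.

```python
def input_summary(n_bands, n_frames, clip_seconds):
    """Return (values per input, average frame spacing in milliseconds)."""
    n_values = n_bands * n_frames
    frame_spacing_ms = 1000.0 * clip_seconds / n_frames
    return n_values, frame_spacing_ms

# Task 1 baseline input: 40 mel bands * 500 frames for a 10 s clip.
n_values, spacing_ms = input_summary(n_bands=40, n_frames=500, clip_seconds=10.0)
# n_values == 20000, spacing_ms == 20.0
```

So each 10 s clip reaches the network as 20,000 log-mel values, with roughly one frame every 20 ms.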


Subtask A comparison


Best Scoring System

• Ensemble of 9 CNNs trained on different spectrograms with data augmentation


Task 2 – Audio Tagging

• A subset of FSD dataset: Freesound content with AudioSet labels

• 41 labels from the AudioSet ontology

– Tearing; Bus; Shatter; Gunshot, gunfire; Fireworks; Writing; Computer keyboard; Scissors; Microwave oven; Keys jangling; Drawer open or close; Squeak; Knock; Telephone

– Saxophone; Oboe; Flute; Clarinet; Acoustic guitar; Tambourine; Glockenspiel; Gong; Snare drum; Bass drum; Hi-hat; Electric piano; Harmonica; Trumpet

– Violin, fiddle; Double bass; Cello; Chime; Cough; Laughter; Applause; Finger snapping; Fart; Burping, eructation; Cowbell; Bark; Meow


Baseline System

• CNN approach

– Input: log-mel spectrogram (64 bands * 25 frames, 0.25s)

– 3 CNN layers with ReLU

– Final max-reduction to a single value with softmax


System comparison


Best Scoring System

• DenseNet blocks, mixup augmentation, batch-wise loss masking, ensemble of log-mel and waveform-based networks
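
Mixup augmentation, in its commonly used form, blends two training examples and their label vectors with a weight drawn from a Beta distribution. The sketch below shows that general idea on plain Python lists. How the winning system applied mixup is not stated in the slides, so the function name, the `alpha` value, and the toy inputs are all assumptions.

```python
import random

def mixup(x_a, y_a, x_b, y_b, alpha=0.2, rng=random):
    """Blend two examples and their one-hot labels with weight lam ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam

rng = random.Random(0)  # seeded only to make the example repeatable
x_mix, y_mix, lam = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], rng=rng)
```

The mixed label `y_mix` is a soft target that still sums to one, so it can be used directly with a cross-entropy loss.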


Task 3 – Bird Audio Detection

• Detect the presence of bird sound of any kind

• Development datasets

– “freefield1010”: Field recordings (5.8 GB)

– “warblrb10k”: Crowdsourced dataset (4.3 GB)

– “BirdVox-DCASE-20k”: Remote monitoring flight calls (15.4 GB)

• Evaluation datasets

– “warblrb10k”: crowdsourced dataset (1.3 GB)

– “Chernobyl”: remote monitoring (5.3 GB)

– “PolandNFC”: remote monitoring night-flight calls (2.3 GB)


Baseline System

• CNN approach

– Input: log-mel spectrogram (80 bands * 1000 frames, 14s)

– 4 CNN layers with leaky ReLU

– 3 dense layers with sigmoid


Best Scoring System

• CNN pre-trained on ImageNet + data augmentation:

– Time domain

  • Apply jitter to chunk duration

  • Insert, delete, swap chunks

  • Cyclic shifts

– Frequency domain

  • Frequency shifting

  • Different interpolation filters

  • Apply color jitter

• Most effective augmentations found

– Adding noise/content from random files

– Piecewise time/frequency stretching

– Time interval dropout
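
Two of the augmentations named above, cyclic shifts and time interval dropout, are simple enough to sketch on a spectrogram stored as a list of frames. This is only an illustration of the general techniques: the function names, the constant fill value, and the toy data are assumptions, and the winning system's actual implementation is not described in the slides.

```python
def cyclic_shift(frames, shift):
    """Rotate the frame sequence in time; frames pushed past the end wrap to the start."""
    if not frames:
        return []
    shift %= len(frames)
    return frames[-shift:] + frames[:-shift]

def time_interval_dropout(frames, start, length, fill=0.0):
    """Return a copy with `length` consecutive frames from `start` set to `fill`."""
    out = [list(frame) for frame in frames]
    for t in range(start, min(start + length, len(out))):
        out[t] = [fill] * len(out[t])
    return out

# Toy "spectrogram": 4 frames with a single band each.
spec = [[1.0], [2.0], [3.0], [4.0]]
shifted = cyclic_shift(spec, 1)              # [[4.0], [1.0], [2.0], [3.0]]
dropped = time_interval_dropout(spec, 1, 2)  # [[1.0], [0.0], [0.0], [4.0]]
```

Both functions return new lists and leave the input untouched, so several augmented variants can be generated from one clip.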


Task 4 – Sound Event Detection

• Detect event class and time boundaries

• 10 classes: speech, dog, cat, alarm/bell/ringing, dishes, frying, blender, running water, vacuum cleaner, electric shaver/toothbrush

• Training set: 2,244 10s clips

– Weakly labeled: only contains clip-level presence of events

• Test set: 906 10s clips
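
To make the notion of weak labels concrete, the sketch below collapses a strong annotation (class plus time boundaries) into the clip-level presence vector that a weakly labeled training set provides. The tuple layout, the function name `to_weak_label`, and the example events are assumptions made for illustration.

```python
def to_weak_label(strong_events, classes):
    """Collapse (label, onset, offset) annotations into clip-level presence flags."""
    present = {label for label, _onset, _offset in strong_events}
    return [1 if name in present else 0 for name in classes]

classes = ["speech", "dog", "cat"]
strong = [("speech", 0.5, 3.2), ("dog", 4.0, 4.6), ("speech", 6.0, 7.0)]
weak = to_weak_label(strong, classes)  # [1, 1, 0]
```

The weak label keeps only which classes occur in the clip; the onsets, offsets, and number of occurrences are lost, and these are exactly what the detector must recover at test time.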


Baseline System

• Convolutional Recurrent Neural Network (CRNN)

– Input: log-mel spectrogram (64 bands * 500 frames, 10s)

– 3 convolution layers

– 1 recurrent layer (GRU)

– 1 dense layer


System comparison


Best Scoring System

• Mean-teacher model with context-gating CNN and RNN


The Mean-Teacher Method

• The teacher model's weights are an exponential moving average of the student model's weights, updated after each training step


A. Tarvainen, H. Valpola, “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results,” arXiv:1703.01780, 2017.
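
The weight update described above can be sketched in a few lines. After every student update, each teacher weight becomes alpha times its old value plus (1 - alpha) times the current student weight. Storing weights as a dictionary of lists, the name `ema_update`, and the value of `alpha` are assumptions made for this example.

```python
def ema_update(teacher, student, alpha=0.999):
    """Move each teacher weight toward the student weight (exponential moving average)."""
    for name, student_values in student.items():
        teacher[name] = [
            alpha * t + (1.0 - alpha) * s
            for t, s in zip(teacher[name], student_values)
        ]

teacher = {"w": [0.0, 1.0]}
student = {"w": [1.0, 1.0]}
ema_update(teacher, student, alpha=0.9)  # teacher["w"] is now about [0.1, 1.0]
```

Only the student is trained by gradient descent; the teacher changes solely through this averaging step, which smooths out step-to-step fluctuations in the student's weights.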


Task 5 – Multi-channel Domestic Activity Monitoring

• Activities: absence, cooking, dishwashing, eating, other, social activity (visit, phone call), vacuum cleaning, watching TV, working (typing, mouse click, …)

– 72,984 10s clips


Baseline System

• CNN approach

– Input: single-channel log-mel spectrogram (40 bands * 501 frames, 10s)

– 2 convolutional layers with ReLU

– 1 dense layer with ReLU

– 1 output layer with softmax


System comparison


Best Scoring System

• CNN that processes each channel independently and fuses the classification results

• Ensemble of 4 classifiers trained on different folds

• Data augmentation

– Shuffling and mixing
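
One simple way to fuse per-channel (or per-model) results is to average the class probabilities and take the most probable class. The slides do not say which fusion rule the winning system used, so averaging is an assumption here, as are the function names and the toy probabilities.

```python
def fuse_by_averaging(prob_vectors):
    """Average several class-probability vectors element-wise."""
    n = len(prob_vectors)
    return [sum(column) / n for column in zip(*prob_vectors)]

def predict(prob_vectors, classes):
    """Return the class with the highest averaged probability."""
    fused = fuse_by_averaging(prob_vectors)
    best = max(range(len(fused)), key=fused.__getitem__)
    return classes[best]

classes = ["cooking", "dishwashing", "eating"]
# One (made-up) probability vector per microphone channel.
per_channel = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.6, 0.3],
    [0.3, 0.4, 0.3],
]
label = predict(per_channel, classes)  # "dishwashing"
```

The same averaging could be applied a second time across the classifiers of an ensemble.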
