Topic 14 Sound Event Detection
What is it?
• Scene classification
– Characterize an acoustic scene from an audio recording with a semantic label
• Sound Event Detection
– Locate and recognize each occurrence of a specific event in an audio recording
– “when” and “what”
ECE 477 - Computer Audition - Zhiyao Duan 2018 3
Let’s listen to some examples
• Scene: Busy street
– Events: Speech, Footstep, Motorcycle, Car horn
• Scene: Forest
– Events: Bird singing, Insects
• Scene: Office
– Events: Copy machine, Computer keyboard, Phone, Footstep
Why should we care?
• Intellectual merit
– Important problem in computer audition / artificial intelligence
• Broader impacts
– Security surveillance
– Environment/context aware computing
– Biological environment monitoring
– Robot navigation
– Healthcare (assisting deaf people)
– Multimedia indexing
Addressing SED at different levels
• Detecting specific sounds
– E.g., gunshots, vehicles, machines, birds
• Classifying isolated events
– Only answers “what” but not “when”
• Detecting non-overlapping events from continuous audio recordings (i.e., monophonic)
• Detecting overlapping events from continuous audio recordings (i.e., polyphonic)
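The difference between monophonic and polyphonic detection can be illustrated with a frame-level "event roll". This is a minimal sketch: the event list, the class names, and the 50 frames-per-second resolution are made-up assumptions, and `events_to_roll` is a hypothetical helper rather than part of any toolkit.

```python
import numpy as np

def events_to_roll(events, classes, clip_dur, frame_rate=50):
    """Convert (onset_s, offset_s, label) tuples to a binary [frames x classes] matrix."""
    n_frames = int(round(clip_dur * frame_rate))
    roll = np.zeros((n_frames, len(classes)), dtype=np.int8)
    for onset, offset, label in events:
        start = int(np.floor(onset * frame_rate))
        stop = min(n_frames, int(np.ceil(offset * frame_rate)))
        roll[start:stop, classes.index(label)] = 1
    return roll

# Hypothetical annotations for a 10 s clip: "what" (label) and "when" (onset, offset).
classes = ["speech", "footstep", "car horn"]
events = [(0.0, 4.0, "speech"), (3.0, 6.0, "footstep"), (8.0, 8.5, "car horn")]

roll = events_to_roll(events, classes, clip_dur=10.0)
polyphony = roll.sum(axis=1)                 # number of active events per frame
is_polyphonic = bool(polyphony.max() > 1)    # True when any two events overlap
```

A monophonic system can output at most one active class per frame, so it could not represent the frames where speech and footsteps overlap in this example.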
Classical Approaches
• Classification
– Represent an audio recording as a bag of frame-level features (e.g., MFCCs)
– Classify the audio according to statistics of these features
• Dictionary learning
– Decompose an audio recording using a dictionary of “acoustic atoms”, e.g., NMF
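The dictionary-learning idea can be sketched with NMF using the standard multiplicative updates for the Euclidean cost. This is a minimal illustration under assumptions: the "spectrogram" `V` is synthetic, the number of atoms is picked by hand, and the function name `nmf` is illustrative only.

```python
import numpy as np

def nmf(V, n_atoms, n_iter=200, eps=1e-9, seed=0):
    """Factor a nonnegative matrix V (freq x frames) as W @ H.

    W holds the "acoustic atoms" (spectral templates) and H their
    activations over time. Uses multiplicative updates that keep
    both factors nonnegative.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, n_atoms)) + eps
    H = rng.random((n_atoms, n_frames)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Synthetic magnitude spectrogram built from two hidden atoms (an assumption).
rng = np.random.default_rng(1)
W_true = rng.random((64, 2))
H_true = rng.random((2, 100))
V = W_true @ H_true

W, H = nmf(V, n_atoms=2)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

In an event detection setting, the rows of `H` would typically be thresholded or fed to a classifier to decide when each atom is active.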
Datasets
• Google’s AudioSet
– An expanding ontology of 632 audio event classes
– 2,084,320 human labeled 10s sound clips from YouTube
– https://research.google.com/audioset/
[Gemmeke et al. ICASSP’17]
Datasets
• FSD
– Freesound content with crowdsourced labels from the AudioSet ontology
– https://datasets.freesound.org/
As of November 20, 2018
IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE)
• 2013: 1st edition, 3 tasks
– http://c4dm.eecs.qmul.ac.uk/sceneseventschallenge/
• 2016: 2nd edition, 4 tasks, started a workshop, 23 papers
– http://www.cs.tut.fi/sgn/arg/dcase2016/
• 2017: 3rd edition, 4 tasks, 27 papers
– http://www.cs.tut.fi/sgn/arg/dcase2017/index
• 2018: 4th edition, 5 tasks, 44 papers
– http://dcase.community/challenge2018/
DCASE 2018
• Task 1: Acoustic scene classification
• Task 2: General-purpose audio tagging of Freesound content with AudioSet labels
• Task 3: Bird audio detection
• Task 4: Large-scale weakly labeled semi-supervised sound event detection in domestic environments
• Task 5: Monitoring of domestic activities based on multi-channel acoustics
Task 1 – Acoustic Scene Classification
• Dataset: TUT Urban Acoustic Scene 2018, ~24 hours
– 10 scenes: Airport, indoor shopping mall, metro station, pedestrian street, public square, street with medium level of traffic, traveling by a tram, traveling by a bus, traveling by an underground metro, urban park
• Devices
– Soundman OKM II Klassik/studio A3 + 3 consumer devices
• Subtask A
– Same device between training and testing
• Subtask B
– Different device between training and testing
• Subtask C
– Use of external data in training
Baseline System
• CNN approach
– Input: log-mel spectrogram (40 bands * 500 frames, 10s)
– 2 CNN layers with ReLU
– 1 dense layer with ReLU
– Output: softmax
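The log-mel input described above can be sketched with NumPy alone. The 16 kHz sampling rate, the 40 ms Hann window, the 20 ms hop, and the zero-padding of the tail are assumptions chosen so that a 10 s clip yields 40 bands × 500 frames; the actual baseline front end may differ.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters equally spaced on the mel scale, shape (n_mels, n_fft//2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for m in range(n_mels):
        lo, ctr, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (fft_freqs - lo) / (ctr - lo)
        falling = (hi - fft_freqs) / (hi - ctr)
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def log_mel(y, sr, n_mels=40, win_s=0.04, hop_s=0.02):
    """Return a log-mel spectrogram of shape (n_mels, n_frames)."""
    win, hop = int(round(win_s * sr)), int(round(hop_s * sr))
    n_frames = int(np.ceil(len(y) / hop))
    pad = max(0, (n_frames - 1) * hop + win - len(y))
    y = np.pad(y, (0, pad))                      # zero-pad the tail
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = y[idx] * np.hanning(win)            # (n_frames, win)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = mel_filterbank(n_mels, win, sr) @ power.T
    return np.log(mel + 1e-10)

# Synthetic 10 s test tone standing in for a real recording.
sr = 16000
t = np.arange(10 * sr) / sr
features = log_mel(np.sin(2 * np.pi * 1000.0 * t), sr)
```

The resulting `features` array has one row per mel band and one column per frame, which is the 2-D "image" that the convolutional layers operate on.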
Subtask A comparison
Best Scoring System
• Ensemble of 9 CNNs trained on different spectrograms with data augmentation
Task 2 – Audio Tagging
• A subset of FSD dataset: Freesound content with AudioSet labels
• 41 labels with AudioSet ontology
– Labels: Tearing; Bus; Shatter; Gunshot, gunfire; Fireworks; Writing; Computer keyboard; Scissors; Microwave oven; Keys jangling; Drawer open or close; Squeak; Knock; Telephone; Saxophone; Oboe; Flute; Clarinet; Acoustic guitar; Tambourine; Glockenspiel; Gong; Snare drum; Bass drum; Hi-hat; Electric piano; Harmonica; Trumpet; Violin, fiddle; Double bass; Cello; Chime; Cough; Laughter; Applause; Finger snapping; Fart; Burping, eructation; Cowbell; Bark; Meow
Baseline System
• CNN approach
– Input: log-mel spectrogram (64 bands * 25 frames, 0.25s)
– 3 CNN layers with ReLU
– Final max-reduction to a single value with softmax
System comparison
Best Scoring System
• DenseNet blocks, mixup augmentation, batch-wise loss masking, ensemble of log-mel and waveform-based networks
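Of the ingredients listed, mixup augmentation is simple to sketch: two training examples and their labels are blended with a weight drawn from a Beta distribution. The value `alpha=0.2`, the batch shapes, and the function name `mixup_batch` are assumptions for illustration, not settings taken from the winning system.

```python
import numpy as np

def mixup_batch(x, y, alpha=0.2, rng=None):
    """Blend each example with a randomly chosen partner from the same batch.

    x: inputs, shape (batch, ...); y: one-hot labels, shape (batch, n_classes).
    Returns the mixed inputs, mixed (soft) labels, and the mixing weight.
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)              # mixing weight in [0, 1]
    perm = rng.permutation(len(x))            # partner index for each example
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y + (1.0 - lam) * y[perm]
    return x_mix, y_mix, lam

# Toy batch: 8 log-mel patches (64 bands x 25 frames) with 41 possible tags.
rng = np.random.default_rng(0)
x = rng.random((8, 64, 25))
y = np.eye(41)[rng.integers(0, 41, size=8)]
x_mix, y_mix, lam = mixup_batch(x, y, rng=rng)
```

Because the labels are mixed with the same weight as the inputs, the network is trained on soft targets, which tends to regularize it against noisy labels.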
Task 3 – Bird Audio Detection
• Detect the presence of bird sound of any kind
• Development datasets
– “freefield1010”: Field recordings (5.8 GB)
– “warblrb10k”: Crowdsourced dataset (4.3 GB)
– “BirdVox-DCASE-20k”: Remote monitoring flight calls (15.4 GB)
• Evaluation datasets
– “warblrb10k”: crowdsourced dataset (1.3 GB)
– “Chernobyl”: remote monitoring (5.3 GB)
– “PolandNFC”: remote monitoring night-flight calls (2.3 GB)
Baseline System
• CNN approach
– Input: log-mel spectrogram (80 bands * 1000 frames, 14s)
– 4 CNN layers with leaky ReLU
– 3 dense layers with sigmoid
Best Scoring System
• CNN pre-trained on ImageNet + data augmentation:
– Time domain: apply jitter to chunk duration; insert, delete, swap chunks; cyclic shifts
– Frequency domain: frequency shifting; different interpolation filters; apply color jitter
• Most effective augmentations found
– Adding noise/content from random files
– Piecewise time/frequency stretching
– Time interval dropout
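Three of the augmentations named above can be sketched on a spectrogram array. These are simplified stand-ins under assumptions: the maximum dropout width, the maximum frequency shift, the use of the spectrogram minimum as fill value, and all function names are illustrative choices, not the winning system's settings.

```python
import numpy as np

def cyclic_shift(spec, rng):
    """Roll the spectrogram along time by a random number of frames."""
    return np.roll(spec, int(rng.integers(spec.shape[1])), axis=1)

def time_interval_dropout(spec, rng, max_width=50):
    """Overwrite a random block of consecutive frames with the minimum value."""
    out = spec.copy()
    width = int(rng.integers(1, max_width + 1))
    start = int(rng.integers(0, spec.shape[1] - width + 1))
    out[:, start:start + width] = spec.min()
    return out

def frequency_shift(spec, rng, max_bins=4):
    """Shift all bands up or down by a few bins, padding with the minimum value."""
    shift = int(rng.integers(-max_bins, max_bins + 1))
    out = np.full_like(spec, spec.min())
    if shift > 0:
        out[shift:] = spec[:-shift]
    elif shift < 0:
        out[:shift] = spec[-shift:]
    else:
        out[:] = spec
    return out

# Random array standing in for an 80-band x 1000-frame log-mel spectrogram.
rng = np.random.default_rng(0)
spec = rng.random((80, 1000))
shifted = cyclic_shift(spec, rng)
dropped = time_interval_dropout(spec, rng)
freq_shifted = frequency_shift(spec, rng)
```

Each function returns a new array of the same shape, so the augmentations can be chained in any order before the spectrogram is passed to the network.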
Task 4 – Sound Event Detection
• Detect event class and time boundaries
• 10 classes: speech, dog, cat, alarm/bell/ringing, dishes, frying, blender, running water, vacuum cleaner, electric shaver/toothbrush
• Training set: 2,244 10s clips
– Weakly labeled: only contains clip-level presence of events
• Test set: 906 10s clips
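Detecting time boundaries usually means turning frame-level class probabilities into (onset, offset, label) segments. The sketch below uses a common recipe, median smoothing followed by thresholding; the threshold of 0.5, the 11-frame filter, the 20 ms hop, and the toy probabilities are all assumptions rather than details of any DCASE system.

```python
import numpy as np

def median_filter_1d(x, width):
    """Median filter with edge padding; width must be odd."""
    padded = np.pad(x, width // 2, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, width)
    return np.median(windows, axis=1)

def probs_to_events(probs, classes, hop_s=0.02, threshold=0.5, width=11):
    """Convert [frames x classes] probabilities to (onset_s, offset_s, label) tuples."""
    events = []
    for k, label in enumerate(classes):
        active = median_filter_1d(probs[:, k], width) > threshold
        # Rising (+1) and falling (-1) edges of the binary activity curve.
        edges = np.diff(np.concatenate(([0], active.astype(np.int8), [0])))
        onsets = np.flatnonzero(edges == 1)
        offsets = np.flatnonzero(edges == -1)
        events += [(float(on * hop_s), float(off * hop_s), label)
                   for on, off in zip(onsets, offsets)]
    return events

# Toy network output for a 10 s clip (500 frames, 2 classes).
classes = ["speech", "dog"]
probs = np.zeros((500, 2))
probs[100:200, 0] = 0.9      # speech between 2 s and 4 s
probs[300, 0] = 0.9          # single-frame spike, removed by the median filter
probs[50:450, 1] = 0.8       # dog between 1 s and 9 s
events = probs_to_events(probs, classes)
```

With weak labels only, the frame-level probabilities have to be learned indirectly, which is why this post-processing step matters for the final boundary quality.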
Baseline System
• Convolutional Recurrent Neural Network (CRNN)
– Input: log-mel spectrogram (64 bands * 500 frames, 10s)
• 3 convolution layers
• 1 recurrent layer (GRU)
• 1 dense layer
System comparison
Best Scoring System
• Mean-teacher model with context-gating CNN and RNN
The Mean-Teacher Method
• Teacher model weights are an exponential moving average of the student model weights, updated after each training step
A. Tarvainen and H. Valpola, “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results,” arXiv:1703.01780, 2017.
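The teacher update in the mean-teacher method can be sketched as a per-parameter exponential moving average. Here the weights are plain NumPy arrays in a dictionary, and the decay value and the name `ema_update` are assumptions for illustration; the consistency loss between student and teacher predictions is omitted.

```python
import numpy as np

def ema_update(teacher, student, decay=0.999):
    """teacher <- decay * teacher + (1 - decay) * student, for every parameter."""
    for name, w in student.items():
        teacher[name] = decay * teacher[name] + (1.0 - decay) * w
    return teacher

# Toy example: the student is held fixed so the teacher's drift is easy to follow.
student = {"w": np.ones(3)}
teacher = {"w": np.zeros(3)}
for _ in range(10):                      # one EMA step after each student update
    teacher = ema_update(teacher, student, decay=0.9)
```

With a fixed student at 1 and a teacher starting at 0, the teacher reaches 1 - 0.9^10 after ten steps, which shows how the decay controls how slowly the teacher follows the student.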
Task 5 – Multi-channel Domestic Activity Monitoring
• Activities: absence, cooking, dishwashing, eating, other, social activity (visit, phone call), vacuum cleaning, watching TV, working (typing, mouse click, …)
– 72,984 10s clips
Baseline System
• CNN approach
– Input: single-channel log-mel spectrogram (40 bands * 501 frames, 10s)
– 2 convolutional layers with ReLU
– 1 dense layer with ReLU
– 1 output layer with softmax
System comparison