URBAN SOUNDS CLASSIFICATION USING DEEP LEARNING
A Degree Thesis
Submitted to the Faculty of the
Escola Tècnica d'Enginyeria de Telecomunicació de Barcelona
Universitat Politècnica de Catalunya
by Martí Bolet Boixeda
In partial fulfilment of the requirements for the degree in
TELECOMMUNICATIONS ENGINEERING
Advisor: Elisa Sayrol
Barcelona, June 2019
Abstract
The main goal of this project is to develop a computationally efficient system able to classify audio clips from an urban sound context. Two types of Machine Learning algorithms are used for this purpose: Deep Learning with Convolutional Neural Networks, complemented with handcrafted audio features, and a Support Vector Machine that combines both sets of features to produce the final classification.
Different Deep Learning architectures are tested and tuned to obtain the context-aware deep learning features. Together with the handcrafted features and a tuned SVM model, the final performance of the whole system is very good.
Resum
This project set out to develop a computationally efficient system for audio
classification based on Machine Learning techniques. It proposes a system based on
Deep Learning with Convolutional Neural Networks, with the addition of parameters
extracted directly from the audio, which are combined with those of the CNN to
achieve a better result, using a Support Vector Machine (SVM) for the final
classification.
Different Deep Learning models have been tested and tuned in order to obtain the
context-aware deep learning features. Together with the handcrafted features and a
tuned SVM model, the results obtained are quite good for our case.
Resumen
In this project a computationally efficient system for audio classification based on
Machine Learning techniques has been developed. It proposes a joint system based on
Deep Learning with Convolutional Neural Networks together with parameters extracted
directly from the audio, combining the two results to achieve a better one, and using
a Support Vector Machine (SVM) for the final classification.
Different Deep Learning models have been tested and tuned in order to obtain the
context-aware deep learning features. Together with the handcrafted features and a
tuned SVM model, the final results of the system are quite good for our case.
Acknowledgements
First of all, I want to thank my advisor, Elisa Sayrol, for trusting me to carry out this
project and for helping and guiding me throughout it.
I also want to thank J. Adrian Rodriguez for providing me with initial code and for giving
me a hand with audio-related problems.
Finally, thanks to Albert Gil and Josep Pujal for helping and assisting me with any kind
of server computing problem on the GPI Development Platform of UPC.
1. Introduction
The project was carried out at the department of Teoria del Senyal i Comunicacions (TSC) of the Universitat Politècnica de Catalunya (UPC).
The main purpose of this project is to develop and improve a system based on Deep Learning, specifically on Convolutional Neural Networks (CNN), that is able to classify audio clips from an urban context into different classes. The original idea, however, was to distinguish the sound of breaking glass from that of other materials and to implement an embedded system that could be placed in glass containers. Because no database of this type was available, the project uses an urban sounds database instead. In line with the initial idea, the best CNN could later be integrated into an embedded system, so computational cost is a very important factor to keep in mind. The system is therefore first implemented, tested and validated; the following step would be to adapt the CNN to an embedded system, which requires a CNN with a low computational cost.
The main goals of the project are:
1. Learn about Deep Learning audio classification methods.
2. Review the state of the art in the literature in order to implement new CNN techniques.
3. Develop a deep learning system using a Deep Learning framework.
1.1. Requirements and specifications
The project requires some knowledge of Deep Learning and how it works. The system is developed in Python using the PyTorch framework, which provides two main advantages: tensor computing with strong acceleration via graphics processing units (GPUs), and deep neural networks built on a tape-based autodiff system. To perform calculations considerably faster than on a regular CPU, a server with GPUs and high computational resources is used. Training the system requires the UrbanSound8K (U8K) dataset, which comes with labelled classes [1]. This dataset contains 8732 labelled sound excerpts (of 4 seconds or less) of urban sounds from 10 different classes: air conditioner, car horn, children playing, dog bark, drilling, engine idling, gunshot, jackhammer, siren, and street music.
The specification for the system is a sound classifier based on deep neural networks and adapted to low-power hardware, so the computational cost has to be as low as possible while keeping similar results. The neural network should reach a reasonable accuracy with respect to the literature reviewed at the beginning of the project.
1.2. Project Background
This project starts from the idea of developing an automatic sound classification
embedded system for glass containers. The project is carried out as an academic
work. The project supervisor adapted the original idea to a different database, and
development started from scratch because no similar project existed in the academic
scope of the UPC.
1.3. Work Plan
Work Packages
Project: State of Art WP ref: 1
Major constituent: Research Sheet 1 of 5
Short description: Research and learn about Deep Learning
techniques and CNN architectures. Learn to program DL
structures with PyTorch framework.
Planned start date: 15/02/2019
Planned end date: 15/03/2019
Start event: 16/02/2019
End event: 15/03/2019
Internal task T1: Read and understand papers related to CNNs
Internal task T2: Stanford online course lectures about Deep
Learning [2].
Internal task T3: PyTorch framework approach and tutorials [3].
Internal task T4: Introduction to remote server computing in the
GPI.
Deliverables: Dates:
Project: Features Extraction WP ref: 2
Major constituent: SW Sheet 2 of 5
Short description: Split the dataset. To obtain the best results,
the spectrogram has to be extracted with the optimal parameters
to facilitate pattern extraction by the CNN.
Planned start date: 16/03/2019
Planned end date: 31/03/2019
Start event: 16/03/2019
End event: 12/04/2019
Internal task T1: Adaptation of the database and 'CSV' files.
Internal task T2: Develop the feature extraction method using
a Python library.
Internal task T3: Research the best parameters to extract
features.
Internal task T4: Extract the spectrograms from the database and
compress them for training.
Deliverables: Dates:
Project: Architecture design WP ref: 3
Major constituent: SW Sheet 3 of 5
Short description: Develop the Deep Learning system and the SVM
system. Test different parameters and models for the training
of the CNN to improve the accuracy. Do the same with the
SVM.
Planned start date: 01/04/2019
Planned end date: 26/05/2019
Start event: 13/04/2019
End event:
Internal task T1: Implementation of the system.
Internal task T2: Implementation of the model.
Internal task T3: Tests with different parameters.
Internal task T4: Evaluate results and restart from T2.
Deliverables: Dates:
Project: Computational cost WP ref: 4
Major constituent: SW Sheet 4 of 5
Short description: An important part of the project is to reduce
the computational cost and the memory used by the system.
Planned start date: 25/06/2019
Planned end date: 25/07/2019
Start event:
End event:
Internal task T1: Search for a method to compute the cost and
evaluate different cases
Internal task T2: Implement the reduction and evaluate it.
Deliverables: Dates:
Project: Documentation WP ref: 5
Major constituent: Sheet 5 of 5
Short description:
Documentation about the process of developing the project and
the final report of the whole project
Planned start date: 02/03/2019
Planned end date: 25/06/2019
Start event:
End event:
Internal task T1: Project Proposal and Work plan
Internal task T2: Critical Review
Internal task T3: Final Project
Deliverables:    Dates:
Task T1          10/03/2019
Task T2          07/05/2019
Task T3          28/06/2019
1.4. Incidences
One of the biggest delays of the project was due to an audio reading problem. The initial
method, which used Sound eXchange (SoX) to read the raw audio files and convert them all
to the same format, did not work. After changing the way the audio files were read, another
error occurred: a few files could not be opened because of an unsupported library. After
some unsuccessful research, the same code worked when run on the server platform.
Another delay came from working with the 'libsvm' library [4] for Support Vector Machines
across different versions of Python. The library was built for Python 2, whereas the whole
project was developed in Python 3. Because of that, some libraries had to be adapted to
work with different versions of Python.
2. State of the art of the technology used or applied in
this thesis:
The classification methodology is based on Machine Learning techniques such as
Convolutional Neural Networks (CNN), chosen for their high performance in image
applications. In this project the mel-spectrogram is used to transform the audio into a
2D image that gives a time-frequency representation. In addition, a Support Vector
Machine (SVM) is applied together with handcrafted audio features to support the CNN
results and achieve the best results.
2.1. Machine Learning
Machine learning (ML) is the scientific study of algorithms and statistical models that
computer systems use to effectively perform a specific task without using explicit
instructions, relying on patterns and inference instead. It is seen as a subset of artificial
intelligence. Machine learning algorithms are used in a wide variety of applications, such
as email filtering, and computer vision, where it is infeasible to develop an algorithm of
specific instructions for performing the task. In this project the system is developed using
Support Vector Machines (SVM) and Deep Learning, two types of ML algorithms.
The types of machine learning algorithms differ in their approach, the type of data they
input and output, and the type of task or problem that they are intended to solve. Some of
them are supervised learning, unsupervised learning, semi-supervised learning and feature
learning, among others.
Machine learning algorithms build a mathematical model based on sample data, known
as "training data", in order to make predictions or decisions without being explicitly
programmed to perform the task. The process consists of three main phases: train,
validation and test.
In the train phase, the mathematical model is adjusted by minimizing some loss function
on a training set of examples. Loss functions express the discrepancy between the
predictions of the model being trained and the actual problem instances. In our case
(classification), the system has to assign a label to each instance, so the models are
trained to correctly predict the pre-assigned labels of a set of examples.
2.2. Support Vector Machines
Support-vector machines (SVMs) are supervised learning models with associated
learning algorithms that analyze data used for classification and regression analysis.
Given a set of training examples, each marked as belonging to one or the other of two
categories, an SVM training algorithm builds a model that assigns new examples to one
category or the other, making it a non-probabilistic binary linear classifier. An SVM model
is a representation of the examples as points in space, mapped so that the examples of
the separate categories are divided by a gap that is as wide as possible (the margin)
around the decision boundary. New examples are then mapped into that same space and
predicted to belong to a category based on which side of the boundary they fall.
The parameters to adjust the decision boundary are the following:
2.2.1. Cost
SVM models use the Cost parameter (C) to control the trade-off between a smooth decision
boundary and classifying the training points correctly. It modifies the optimization
problem so that it optimizes the fit of the boundary to the data while penalizing the
samples that fall inside the margin, where C defines how much the samples inside the
margin contribute to the overall error. Consequently, C adjusts how hard or soft the
large-margin classification is.
Figure 2: Influence of Cost [5]
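To make the role of C concrete, the following minimal sketch evaluates the usual soft-margin objective, 0.5 * ||w||^2 + C * (sum of hinge losses), for a fixed linear boundary. The sample points, the weights and the function name are illustrative assumptions, not values taken from this project.

```python
def soft_margin_objective(w, b, samples, labels, C):
    """Soft-margin SVM objective: 0.5 * ||w||^2 + C * sum of hinge losses."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    hinge_total = 0.0
    for x, y in zip(samples, labels):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        # A sample contributes only if it is inside the margin or misclassified.
        hinge_total += max(0.0, 1.0 - y * score)
    return margin_term + C * hinge_total


# Illustrative data: the third point lies inside the margin of the boundary x1 = 0.
samples = [(2.0, 0.0), (-2.0, 0.0), (0.5, 0.0)]
labels = [1, -1, 1]
w, b = (1.0, 0.0), 0.0

low_c = soft_margin_objective(w, b, samples, labels, C=0.1)
high_c = soft_margin_objective(w, b, samples, labels, C=10.0)
print(low_c, high_c)
```

With a small C the point inside the margin barely affects the objective (soft margin); with a large C the same point dominates it, which pushes the optimizer towards a harder margin.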
2.2.2. Kernel
An SVM separates the data using hyperplanes, but when the data are not linearly separable
in their original space the kernel trick is very useful: it implicitly maps the data into
a higher-dimensional space where a separating hyperplane fits the data better. There are
many types of kernel; several of them, such as the RBF kernel, use a gamma parameter that
defines how far the influence of a single training example reaches, and so adapts the
decision boundary to each example.
Figure 3: Kernel trick [6]
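As a small illustration of what gamma controls, the sketch below computes the RBF (Gaussian) kernel, exp(-gamma * ||x - z||^2), between two points for two gamma values. The RBF kernel is chosen here as an assumption for illustration; the points and the function name are hypothetical.

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF kernel value exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


x, z = (0.0, 0.0), (1.0, 1.0)

# Large gamma: the influence of x decays very quickly with distance.
near_reach = rbf_kernel(x, z, gamma=10.0)
# Small gamma: x still has a strong influence on the distant point z.
far_reach = rbf_kernel(x, z, gamma=0.1)
print(near_reach, far_reach)
```

A large gamma gives each training example a short reach and a more wiggly boundary, while a small gamma gives a long reach and a smoother boundary.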
2.3. Deep Learning
Deep Learning is a class of machine learning algorithms that use Artificial Neural
Networks with multiple layers to progressively extract higher-level features, usually
from raw input. The layers are composed of neurons that individually apply linear or
non-linear transformations to the input signal to generate an output signal. Each neuron
can be connected to the neurons of other layers, forming the network.
The artificial neurons, or perceptrons, have multiple inputs (x) that are scaled by
weights (w) in order to raise or reduce the value of each input, representing its
importance. The output value of the perceptron depends on the values of the inputs and
on the activation function. The activation function is a non-linear function used to
transform the activation level of a unit into an output signal.
Some activation functions are sigmoid, hyperbolic tangent, ReLU and softmax. Nowadays
ReLU is the most used activation function, and softmax is normally used in the last
layer to obtain the output as a probability vector.
Figure 6: Activation functions [7]
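The neuron and the activation functions described above can be sketched in a few lines of plain Python. This is only an illustration: the input values, weights and bias are made up, and the helper names are hypothetical.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def relu(value):
    return max(0.0, value)


def softmax(values):
    # Subtract the maximum for numerical stability before exponentiating.
    shifted = [v - max(values) for v in values]
    exponentials = [math.exp(v) for v in shifted]
    total = sum(exponentials)
    return [e / total for e in exponentials]


def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    pre_activation = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(pre_activation)


# 1.0 * 0.5 + 2.0 * (-1.0) + 0.25 = -1.25, which ReLU clips to zero.
out = neuron([1.0, 2.0], [0.5, -1.0], 0.25, relu)
# Softmax turns arbitrary scores into a probability vector.
probs = softmax([2.0, 1.0, 0.1])
print(out, probs)
```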
Neural networks have to be adapted to the training data and to each application, so there
are parameters that have to be tuned to obtain the best results. The most important ones
for training the model are described below:
2.3.1. Loss
The loss function estimates the error between the prediction and the expected value.
Training aims to drive the loss towards zero, which would mean that the estimated and
expected values match perfectly. To get there, the weights of the neurons are adjusted
with an optimization algorithm until the predictions improve.
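As an example of a loss for classification, the sketch below computes the cross-entropy of a predicted probability vector against the true class. Cross-entropy is a common choice for this kind of problem, but it is an assumption here: this section does not state which loss the project used.

```python
import math


def cross_entropy(probabilities, true_class):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probabilities[true_class])


# The true class is index 0 in both cases.
confident_right = cross_entropy([0.9, 0.05, 0.05], true_class=0)
confident_wrong = cross_entropy([0.05, 0.9, 0.05], true_class=0)
print(confident_right, confident_wrong)
```

The loss is small when the model puts most of the probability on the correct class and grows quickly when it is confidently wrong.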
2.3.2. Optimizer
The optimizer is an optimization algorithm that minimizes the loss function by changing
and adapting the values of the weights and biases of the network. There are many
different types, such as Stochastic Gradient Descent, Adam, Adamax and RMSprop.
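The core idea shared by these optimizers is to move each parameter against the gradient of the loss. The sketch below shows plain gradient descent on a one-parameter toy loss, loss(w) = (w - 3)^2; it is a simplification for illustration and not Adam or any optimizer configuration used in the project.

```python
def gradient_descent(gradient, start, learning_rate, steps):
    """Repeatedly step against the gradient to reduce the loss."""
    value = start
    for _ in range(steps):
        value -= learning_rate * gradient(value)
    return value


# Toy loss (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3.
final_w = gradient_descent(
    gradient=lambda w: 2.0 * (w - 3.0),
    start=0.0,
    learning_rate=0.1,
    steps=100,
)
print(final_w)
```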
2.3.3. Epochs
An epoch is one pass of the entire dataset forward and backward through the neural
network. To train the model the number of epochs should be greater than one: as the
number of epochs increases, the weights of the network are updated more times and the
model moves from underfitting to an optimal fit, or even to overfitting.
2.3.4. Batch Size
Since the databases are large, they are split into batches. The number of training
examples in each split is the batch size. Each batch is the input to the neural network
in a single iteration. For every batch, a forward pass and a backward pass are performed
to optimize the predictions towards the true labels.
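The relation between epochs, batches and iterations can be sketched with a toy dataset. The dataset of 10 items and the batch size of 4 are illustrative assumptions; in a real training loop the commented line is where the forward pass, loss, backward pass and weight update would go.

```python
def iterate_batches(samples, batch_size):
    """Yield consecutive slices of the dataset; the last one may be smaller."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


dataset = list(range(10))
epochs = 2
batches_seen = 0
for epoch in range(epochs):
    for batch in iterate_batches(dataset, batch_size=4):
        # Forward pass, loss, backward pass and weight update would happen here.
        batches_seen += 1
print(batches_seen)
```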
2.3.5. Dropout
The term dropout refers to dropping out units (neurons) in a neural network. During the
training phase neurons are discarded at random with a certain probability, which is the
parameter that can be tuned. This strategy is used to prevent overfitting: it forces
the neural network to learn more robust features that are useful in conjunction with
different random subsets of the other neurons.
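A minimal sketch of dropout on a list of activations follows. It uses the common "inverted dropout" variant, which rescales the surviving activations by 1 / (1 - p) during training so that nothing needs to change at test time; the function name and the all-ones input are illustrative assumptions.

```python
import random


def dropout(activations, drop_probability, rng, training=True):
    """Zero each activation with the given probability and rescale the rest."""
    if not training or drop_probability == 0.0:
        return list(activations)
    keep_probability = 1.0 - drop_probability
    return [
        a / keep_probability if rng.random() < keep_probability else 0.0
        for a in activations
    ]


rng = random.Random(0)
activations = [1.0] * 1000
dropped = dropout(activations, drop_probability=0.5, rng=rng)
print(dropped.count(0.0))
```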
2.3.6. Convolutional Neural Networks
Convolutional Neural Networks (CNNs) are a category of neural networks that have
proven very effective in areas such as image recognition and image classification. In our
case, because the audio is converted into a spectrogram, a CNN is expected to give better
results than a plain fully connected neural network.
The CNN is able to capture the spatial dependencies in an image through the application
of relevant filters. The architecture fits image datasets better thanks to the reduction
in the number of parameters involved and the reuse of weights. The role of the CNN is to
reduce the images into a form that is easier to process while keeping the features that
matter most for a good prediction.
Figure 7: Convolutional Neural Network structure
The types of layers in a CNN are explained in the following sections.
2.3.6.1. Convolutional Layers
A convolutional layer applies a 2D convolution with a rectangular filter and can detect
visual features in images such as edges, lines, colour changes, etc. This is a very
interesting property because, once the layer has learned a pattern at a specific point
in the image, it can recognize it in any other part of it. Another important property is
that convolutional layers can learn spatial hierarchies of patterns by preserving spatial
relationships.
Figure 8: 2D convolution layer [8]
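The sliding-filter operation can be sketched with NumPy as follows. As in most deep learning frameworks, the filter is not flipped, so strictly speaking this is a cross-correlation; no padding and a stride of 1 are assumed, and the image and the edge filter are toy values chosen for illustration.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding, no flip)."""
    kernel_h, kernel_w = kernel.shape
    out_h = image.shape[0] - kernel_h + 1
    out_w = image.shape[1] - kernel_w + 1
    output = np.zeros((out_h, out_w))
    for row in range(out_h):
        for col in range(out_w):
            patch = image[row:row + kernel_h, col:col + kernel_w]
            output[row, col] = np.sum(patch * kernel)
    return output


# Toy image with a vertical edge between the second and third columns.
image = np.array([[0, 0, 1, 1]] * 4, dtype=float)
vertical_edge = np.array([[-1.0, 1.0], [-1.0, 1.0]])
response = conv2d_valid(image, vertical_edge)
print(response)
```

The same filter responds wherever the edge appears, which is the position-independence property described above.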
2.3.6.2. Pooling layer
The pooling layer is responsible for reducing the spatial size of the convolved feature
map. This decreases the computational power required to process the data through
dimensionality reduction. Furthermore, it is useful for extracting dominant features that
are rotation and position invariant, which keeps the training of the model effective.
There are two popular types: max pooling, which returns the maximum value of the portion
of the feature map covered by the kernel, and average pooling, which returns the average
of all the values in that portion.
Figure 9: Pooling layer [9]
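Both pooling types can be sketched with NumPy on a small feature map. Non-overlapping windows (stride equal to the window size) are assumed, and the 4x4 input is a toy example.

```python
import numpy as np


def pool2d(feature_map, size, mode="max"):
    """Non-overlapping pooling with a square window of the given size."""
    height, width = feature_map.shape
    # Drop trailing rows/columns that do not fill a complete window.
    trimmed = feature_map[:height - height % size, :width - width % size]
    blocks = trimmed.reshape(height // size, size, width // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))


feature_map = np.array(
    [[1, 2, 5, 6],
     [3, 4, 7, 8],
     [9, 10, 13, 14],
     [11, 12, 15, 16]],
    dtype=float,
)
max_pooled = pool2d(feature_map, size=2, mode="max")
avg_pooled = pool2d(feature_map, size=2, mode="average")
print(max_pooled)
print(avg_pooled)
```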
2.3.7. Fully Connected Layer
The fully connected (FC) layer in the CNN represents the feature vector of the input. This
feature vector holds the information that is vital to describe the input. Once the network
is trained, this feature vector is used for classification or regression, or as the input
to another model such as an SVM to translate it into another type of output. The
convolutional layers before the FC layer hold information about local features in the
input image such as edges, blobs, shapes, etc. Each convolutional layer holds several
filters, each representing one of the local features. The FC layer holds composite and
aggregated information from all the convolutional layers, keeping what matters most.
Figure 10: Fully-connected layer [10]
2.3.8. Architectures of Convolutional Neural Networks
Neural network structures, also called architectures, differ in the number of layers, the
types of layers, the activation functions, the layer sizes, etc. The structure is a very
important factor in our project, both to get the best results and to keep the number of
operations and parameters low to save computational cost.
Figure 11: CNN layer structure [15]
In the previous figure, the bottleneck is a type of convolutional block proposed for
MobileNetV2.
Some of the architectures used in this project are VGG (Very Deep Convolutional
Networks) [11], Resnet [12], Densenet [13], Squeezenet [14] and MobileNetV2 [15].
2.3.9. Techniques
2.3.9.1. Transfer Learning
When a model is created, the weights and biases are normally initialized randomly and
training starts from scratch. With transfer learning, the values of the weights and
biases are taken from a similar case, i.e. another classification problem, so training
starts from a model with a reasonable accuracy. Using pre-trained models as the starting
point is a popular approach in deep learning. The parameters of the first layers of
classification models have been shown to be very similar, so retraining only the last
layers can reduce the computational cost and save the time of the initial epochs.
2.3.9.2. Data augmentation
Data augmentation is a technique that consists in generating new synthetic samples for
training. These synthetic samples are generated by applying small modifications to the
original samples. With images, some transformations are cropping, rotating or adding
noise. With audio signals, transformations can consist in changing the pitch or the
dynamic range, adding white noise, or time stretching. The data transforms that will be
tried are explained in more detail in later sections. Data augmentation is useful to have
more samples to feed the model and to make the system more robust to samples that differ
from those seen in the training phase.
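As an illustration of one of the audio transformations mentioned above, the sketch below adds white Gaussian noise to a signal at a chosen signal-to-noise ratio. The 440 Hz test tone, the 22050 Hz sample rate, the 20 dB target and the function name are assumptions made for the example, not settings from this project.

```python
import numpy as np


def add_white_noise(signal, snr_db, rng):
    """Return a noisy copy of the signal with the requested SNR in dB."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise


rng = np.random.default_rng(0)
sample_rate = 22050  # assumed sample rate for the example
t = np.arange(sample_rate) / sample_rate
clean = np.sin(2 * np.pi * 440 * t)
noisy = add_white_noise(clean, snr_db=20.0, rng=rng)

measured_snr = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
print(measured_snr)
```

The augmented clip keeps the label of the original one, so each transform yields an additional training sample at no labelling cost.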
2.4. Deep context-aware and Handcrafted Features
The starting point of this project is to apply the technique explained in [16]. That paper proposes combining two types of audio features and two types of machine learning algorithms to classify urban sound clips. On the one hand there are the handcrafted features (HaF); on the other hand, the context-aware deep learning features (CadF), obtained through a CNN fed with the spectrogram. Finally, the two are combined in an SVM to obtain the final classification.
Figure 12: Structure of the proposed method [16]
2.4.1. Handcrafted Features
These features are obtained directly from the audio signal. The features are the following:
Table 2: Handcrafted Features [16]
The features are computed on short-term audio slices, and over all the resulting vectors
the mean and standard deviation of each feature are computed, to obtain a single vector of
length 68. To know more about audio signals and representations see [17].
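The aggregation step can be sketched with NumPy. A count of 34 short-term features per frame is an assumption inferred from the final length (68 = 2 x 34), and the random matrix is only a placeholder for real per-frame features.

```python
import numpy as np


def summarize_short_term_features(short_term):
    """Collapse a (n_features, n_frames) matrix into [means, standard deviations]."""
    return np.concatenate([short_term.mean(axis=1), short_term.std(axis=1)])


rng = np.random.default_rng(0)
# Placeholder: 34 assumed features computed on 80 short-term frames of one clip.
short_term = rng.normal(size=(34, 80))
haf = summarize_short_term_features(short_term)
print(haf.shape)
```

Whatever the duration of the clip, and therefore the number of frames, the resulting vector always has the same length, which is what a fixed-size classifier input requires.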
2.4.2. Context-aware deep learning features
Context-aware deep learning features (CadF) are obtained from the values of the last layer
of a CNN. The CNN is trained using mel-spectrograms:
2.4.2.1. Short Time Fourier Transform and Spectrogram
Short Time Fourier Transform (STFT) is a Fourier-related transform used to
determine the sinusoidal frequency and phase content of local sections of a signal as it
changes over time. In practice, the procedure for computing the STFT is to divide a longer
time signal into shorter segments of equal length and then compute the Fourier
transform separately on each shorter segment. This reveals the Fourier spectrum of
each shorter segment. Some parameters used to calculate the STFT are:
Window length: Length of the time segments.
Window stride: Step between the starts of consecutive segments. To reduce the effect of the
windowing in digital signals the
Window type: The window can be rectangular, triangular, sine, etc., but to
reduce the previously mentioned effect the best choice is usually Hamming.
A spectrogram is an image of the time-frequency domain, in which the frequency
components can be seen along the duration of the audio. There are many different types
of window, but the typical choice to reduce their effect in digital analysis is Hamming.
Another thing to take into account is the number of frequency bins, that is, the number
of frequency bands to be computed. These features are obtained using the librosa
library [18]: first the Short Time Fourier Transform is computed to go from the time
domain to the frequency domain, and then the resulting spectrogram is pre-processed by
normalizing it so that all the audio clips are standardized.
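The STFT procedure described above (segment, window, Fourier transform) can be sketched directly with NumPy. This is an illustrative sketch and not the librosa-based pipeline of the project: the 16 kHz sample rate, the 1 kHz test tone, the 20 ms window and the 10 ms stride are assumed values, and the mel filtering and normalization steps are left out.

```python
import numpy as np


def stft_magnitude(signal, sample_rate, window_ms=20.0, stride_ms=10.0):
    """Magnitude spectrogram with a Hamming window, shape (freq_bins, frames)."""
    window_length = int(sample_rate * window_ms / 1000)
    stride = int(sample_rate * stride_ms / 1000)
    window = np.hamming(window_length)
    n_frames = 1 + (len(signal) - window_length) // stride
    frames = np.stack([
        signal[i * stride:i * stride + window_length] * window
        for i in range(n_frames)
    ])
    # One real FFT per windowed segment, then transpose to frequency x time.
    return np.abs(np.fft.rfft(frames, axis=1)).T


sample_rate = 16000  # assumed for the example
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)  # one second of a 1 kHz tone
spectrogram = stft_magnitude(tone, sample_rate)

bin_width_hz = sample_rate / 320  # 320 samples per 20 ms window
peak_hz = np.argmax(spectrogram.mean(axis=1)) * bin_width_hz
print(spectrogram.shape, peak_hz)
```

With these values each window holds 320 samples, so the frequency bins are 50 Hz apart and the energy of the tone concentrates in the bin at 1000 Hz.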
Figure 13: Mel spectrogram from dog barking
The spectrogram is then processed by the neural network, and the output is a vector of
length 10 with one value for each of the 10 classes. The position of the maximum value is
the prediction. To use this in conjunction with the handcrafted features (HaF), the paper
cuts the input spectrogram into N slices of 200 ms, which are all fed to the neural
network to obtain N outputs of the last layer, each of size 512. The mean of the N
vectors is the CadF. Finally, the two feature vectors are concatenated and an SVM is
trained to classify into the 10 different classes.
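The two steps described above, picking the class with the maximum output and building the fused HaF + CadF vector, can be sketched with NumPy. The random arrays are placeholders for real network outputs and handcrafted features, N = 12 slices is an arbitrary choice, and the class order is assumed to follow the list given in the requirements section.

```python
import numpy as np

CLASSES = [
    "air conditioner", "car horn", "children playing", "dog bark", "drilling",
    "engine idling", "gunshot", "jackhammer", "siren", "street music",
]

# Prediction from the 10-value output vector: the position of the maximum.
class_scores = np.array([0.01, 0.02, 0.05, 0.70, 0.03, 0.04, 0.05, 0.04, 0.03, 0.03])
predicted = CLASSES[int(np.argmax(class_scores))]

# Feature fusion: mean of the N last-layer vectors (CadF) appended to the HaF.
rng = np.random.default_rng(0)
n_slices = 12  # assumed number of 200 ms slices in one clip
slice_embeddings = rng.normal(size=(n_slices, 512))  # placeholder CNN outputs
haf = rng.normal(size=68)  # placeholder handcrafted features

cadf = slice_embeddings.mean(axis=0)
fused = np.concatenate([haf, cadf])
print(predicted, fused.shape)
```

The fused vector of 68 + 512 = 580 values per clip is what the SVM would receive as input.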
2.5. Data Augmentation and CNN classification
Another paper consulted to improve the initial system [17] applies data augmentation to
create new samples for further training, in the same Urban Sounds classification case.
It proposes applying data augmentation directly to the audio, modifying the following
parameters:
Time Stretch (TS): The original audio is slowed down or sped up to change its duration
while keeping the pitch unchanged. The paper proposes 4 stretch factors, 0.81, 1.27,
Pitch Shift (PS): This transformation consists in changing the pitch of the audio
by some semitones up or down. This is one of the best transforms, as can be seen in the
figure below. PS1 corresponds to shifts of -2, -1, 1 and 2 semitones, and PS2 to shifts
of -3.5, -2.5, 2.5 and 3.5 semitones.
Dynamic Range Compression (DRC): The dynamic range is the difference between
the maximum value and the minimum value. The proposed types are music standard, film
standard, speech and radio.
Background Noise (BG): This consists in adding background noise to the
original sample. They add three urban street recordings that do not contain any of
the 10 classes.
The following figure shows the results in terms of improvement or loss for each
transform and class.
Figure 14: Increment of accuracy relative to each class and transform [17]
3. Methodology / project development:
The first step was to get up to date with the new techniques and methods for audio
classification using CNNs. At the start, classifying with a CNN alone was an option,
but we decided to test the idea of the baseline paper [16]. Other papers on
environmental sound classification used Boltzmann Machines with unsupervised learning,
and others explain the parameters used to obtain the spectrogram. In conclusion, all of
them use spectrograms with CNNs.
3.1. Development of the system
The system is developed with PyTorch [3], a Python framework with an easy API for
implementing DL systems and tensor computing on GPUs. Another important library
is libsvm [4], written in C++ but with a compiled version for Python. These two are the
most important for working with the ML algorithms.
Other important Python libraries used are:
Librosa: For audio reading and spectrogram extraction. [18]
pyAudioAnalysis: To extract the features from the time domain. [20]
Scikit-learn: To perform the PCA reduction. [21]
Many PyTorch tutorials were used to develop the system. Because the framework is so
widely used, there is a lot of information online and the official page is well
documented. The official forum can also be very helpful. PyTorch additionally provides
some useful DL models and their trained weights for a quick implementation. The
MobileNetV2 implementation was taken from [22].
The whole system was developed starting from an initial code base. It consists of a code
structure in which parameters can be changed easily and experiment results can be
saved.
With the whole system developed, testing and improvement can start. The metric used to
measure the performance of the system is accuracy. The main goal of this phase of the
project is to improve the accuracy of the system as much as possible. The results of the
first test give a starting point from which to keep improving. All the tests and
decisions made are explained in the next section. The tests consisted of changing the
values explained in previous sections, such as hyperparameters and spectrogram creation
parameters. The following block diagram shows the tests made for each part of the
project, specifying the parameters that were changed.
Figure 15: Methodology scheme
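Since accuracy is the metric used throughout the tests, a minimal sketch of how it is computed is given here. The function name and the example labels are illustrative.

```python
def accuracy(predicted, expected):
    """Fraction of samples whose predicted label matches the expected one."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


# Three of the four predictions match the expected labels.
score = accuracy([3, 1, 0, 3], [3, 1, 2, 3])
print(score)
```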
4. Results
The first results were obtained with VGG with 16 layers and random weight initialization. The spectrogram was extracted with the parameters used in the speech recognition case:
Window size   Window stride   Window type   N mels   Low freq   High freq   Max len
20 ms         10 ms           Hamming       32       20 Hz      22100 Hz    97
Table 2: Initial spectrogram parameters
Epochs   Learning rate   Optimizer   Batch size   Dropout
25       0.01            Adam        64           None
Table 3: Initial training parameters
The result was 40.8 % accuracy on the test set, with 54.7 % at the best validation epoch.
4.1.1. Fine Tuning
Applying transfer learning with the weights of the ImageNet classification problem and comparing with the initial VGG, a lot of time is saved because fewer training epochs are needed. The results are also better than with random weight initialization.
VGG                      Best Validation   Test Results
Random initialization    52.8 %            40.8 %
Fine tuning last layer   54.7 %            41.2 %
Fine tuning all model    56.3 %            43.5 %
Table 4: FT vs Random Initialization
Figure 16: Plot comparison of Fine Tuning vs Random Initialization
4.1.2. Architectures and hyper parameters
Next, different architectures were tested to obtain the best results: Resnet with 34 layers, VGG with 16 layers, Densenet with 121 layers, Squeezenet and Mobilenet.
Architecture   Best Validation   Test Results
VGG 16         56.3 %            43.5 %
Resnet 34      52.5 %            50.7 %
Densenet 121   56.4 %            49.8 %
Squeezenet     60.9 %            49.6 %
Mobilenet      65.5 %            54.3 %
Table 5: Different architectures comparison
Architecture   Param memory   FLOPS
VGG 16         528 MB         16 GFLOPS
Resnet 34      83 MB          4 GFLOPS
Densenet 121   31 MB          3 GFLOPS
Squeezenet     5 MB           360 MFLOPS
Mobilenet      16 MB          579 MFLOPS
Table 6: Number of parameters and FLOPS
The number of parameters is taken from the original papers and the FLOPS from
MatConvNet for an input of 224x224 [23].
Given the test results and this estimate of the computational cost, MobileNet is the
architecture chosen for further development.
4.1.3. Spectrogram Optimization
The following table shows the changes applied in each test.
To conclude, the results of the baseline paper are compared with our results:
Table 10: Results of baseline paper [16]
Method       U-8K
HaF          89.06 %
CadF         63.10 %
HaF + CadF   93.69 %
Table 11: Final results
4.1.7. Results Conclusions
To summarize, the developed system works as expected: combining the two types of
features, HaF and CadF, improves the accuracy noticeably.
Furthermore, the system reaches a higher accuracy than the baseline paper. The
increase in accuracy is explained by the differences between the proposed methodology
and the baseline paper: while they slice the spectrogram into chunks of 200 ms, our
system works with the whole spectrogram. In addition, the CNN is larger, which suggests
that its features are more relevant.
Surprisingly, the results with HaF alone are much better than expected, and the saving in
computational cost from avoiding the CNN is a very important factor to take into account.
5. Budget
The budget of the project is based on the hours worked on it. The whole project lasts 5 months, from February 2019 to June 2019. Considering 20 h/week of work by a junior engineer and assuming a mean full-time salary of 1.800 €/month, the total amounts to around 4.500 €.
In addition, the weekly meetings with the senior engineer over these 5 months add up to 20 meetings; at a price of 150 €/h this results in 3.000 €.
Finally, GPU server hosting on a platform like ‘LeaderGPU’, with 4 x 1080 GPUs and 64 GB of RAM, costs 484 €/month.
All the software, programs and libraries used are open-source and free.
A small office is rented in Barcelona near the UPC, with a monthly rental of 250 €.
Name              Cost/week   N. of weeks   Total
Junior Engineer   225 €       20            4.500 €
Senior Engineer   150 €       20            3.000 €
GPU Server        121 €       20            2.420 €
Office            250 €       20            5.000 €
Total                                       14.920 €
Table 5: Table of costs
6. Conclusions and future development:
Nowadays CNNs are widely used for image classification and image recognition, their
popularity and the research around them keep increasing, and their performance is
demonstrated every day in many different ways.
However, this thesis shows that an SVM can increase the accuracy of the CNN and that,
with the handcrafted features, the system can work on its own, drastically reducing the
computational cost. The main goal of this project was to use a CNN to achieve the best
results, but in this case the best option may be to do without it. For audio
classification, the spectrogram and the handcrafted features are two viable options; in
our case, where costs must be reduced as much as possible, one option is to use only
handcrafted features and an SVM, which reduces the number of models and operations.
Finally, for future development a full cost analysis should be done in order to obtain
the real figures for each methodology and compare them with the capacity of embedded
systems, so that the performance can be anticipated. Another important step will be to
compress the model size with the techniques explained in [X], where quantizing the
weights can reduce the size by a factor of ten. If computational cost were not a
concern, other methodologies could be applied to aim for better accuracy.
[11] K. Simonyan, A. Zisserman. “Very Deep Convolutional Networks for Large-Scale Image Recognition”. arXiv:1409.1556v6 [cs.CV], Apr 2015
[12] K. He, X. Zhang, S. Ren, J. Sun. “Deep Residual Learning for Image Recognition”. arXiv:1512.03385 [cs.CV], Dec 2015
[13] G. Huang, Z. Liu, L. van der Maaten, K. Q. Weinberger. “Densely Connected Convolutional Networks”. arXiv:1608.06993 [cs.CV], Jan 2018
[14] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, K. Keutzer. “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size”. arXiv:1602.07360 [cs.CV], Feb 2016
[15] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, L. Chen. “MobileNetV2: Inverted Residuals and Linear Bottlenecks”. arXiv:1801.04381 [cs.CV], Jan 2018
[16] T. Giannakopoulos, S. Perantonis. “Recognition of Urban Sound Events using Deep Context-Aware Feature Extractors and Handcrafted Features”. Preprints 2018, 2018110509 (doi: 10.20944/preprints201811.0509.v1)
[17] J. Salamon, J. P. Bello. “Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification”. arXiv:1608.04363 [cs.SD], Aug 2018
[19] R. Serizel, V. Bisot, S. Essid, G. Richard. “Acoustic Features for Environmental Sound Analysis”. In T. Virtanen, M. D. Plumbley, D. Ellis (eds.), Computational Analysis of Sound Scenes and Events, Springer, pp. 71-101, 2017, ISBN 978-3-319-63449-4 (doi: 10.1007/978-3-319-63450-0_4, hal-01575619)