Image Segmentation and Classification for Fin Whale Identification

Michael Ford    Kathleen Moriarty    Elijah Willie    Luke McQuaid

December 10, 2016

1 Problem Description

In many fields of zoology, the ability to identify individual animals is foundational to many research endeavors, and can allow scientists to discover aspects of animal biology such as social dynamics and population distribution. However, depending on the animal being studied, the process of performing an identification may be difficult or involve significant manual labour. A prime example is the identification of fin whales, a species of large cetacean that lives off the coast of British Columbia and whose individuals each bear unique identifying markings. There are over 700 individual fin whales, and when an animal is encountered its photo must be compared by hand against a reference catalogue. In this project we applied various machine learning approaches with the goal of automating this process of fin whale identification.

Figure 1: Example image of a fin whale indicating unique identification features.

1.1 Data set

In collaboration with the Cetacean Research Program, Fisheries and Oceans Canada, we received a catalogue of 79 individuals with 884 images. This represents a small sample of the entire stored catalogue of identification photos; however, the labels for the catalogue are recorded on hard copy, and due to the significant labour involved in manually labelling photos we were restricted to this data set.

1.2 Identifying Challenges

1.2.1 Data set size

Previous work in the area of automated whale identification has shown that this is a feasible problem. In particular, the Kaggle Right Whale Recognition Contest of August 2015 [kag()] had private submissions achieving a log-loss of 0.59600. However, the Kaggle data set contained 4237 photos of 427 individuals, a significant difference in scale from our data set. We recognized from the outset that the size of our data set would restrict our ability to train complex models due to the high risk of over-fitting.


1.2.2 Signal-to-Noise

As can be seen in the figures below, many input photos have significant noise in the background. Given this poor signal-to-noise ratio, and being unable to perform complex learning for feature extraction on our small data set, we decided to devote significant effort to pre-processing steps prior to doing any classification.

2 Pre-processing

2.1 Segmentation using Markov Random Fields

Markov Random Fields (MRFs) have, over the past few decades, become an established approach to denoising and segmenting a wide range of image types, so it was natural for the pre-processing of the whale images to include a pipeline built around MRFs. We used a previously implemented MRF model [Sharma()] for processing the images; for our purposes, it was used for image denoising and for segmentation by edge detection. The model has user-settable parameters, which is useful for optimization, since users' needs differ with the context of use. We needed to specify the maximum number of iterations, the number of neighbours used by the k-means clustering that classifies each pixel based on its neighbours, and a value for the potential function used in the model.

This pipeline was very memory-demanding, so great care was taken in picking parameters, since running all the samples through the pipeline was quite time-consuming. Figure 2 shows the result of the MRF applied to an image, processed with the parameters (maxIter = 5, k = 3, potential = 0.5). With these parameters the model took approximately 45 seconds per image; multiplying by the total number of images in our data set (838 images), applying the model to all of the images would take approximately 10.5 hours. These parameters proved to be the best compromise between computation time and quality of results, as other settings had longer computation times while yielding similar results.
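The [Sharma()] implementation has its own interface, but the idea behind this class of model can be sketched compactly. Below is a minimal illustrative sketch, not the code we ran: k-means on pixel intensities initializes the labels, then iterated conditional modes (ICM) refines them under a Potts-style smoothness potential. The function and parameter names (segment_mrf, max_iter, k, potential) are ours, chosen to mirror the parameters described above.

```python
import numpy as np

def segment_mrf(img, k=3, max_iter=5, potential=0.5, seed=0):
    """Illustrative MRF segmentation of a 2-D grayscale image:
    k-means initialization followed by ICM with a Potts prior."""
    rng = np.random.default_rng(seed)
    pix = img.reshape(-1).astype(float)

    # k-means on intensities to initialize class means and labels.
    means = rng.choice(pix, size=k, replace=False)
    for _ in range(10):
        labels = np.argmin(np.abs(pix[:, None] - means[None, :]), axis=1)
        for c in range(k):
            if np.any(labels == c):
                means[c] = pix[labels == c].mean()
    labels = labels.reshape(img.shape)

    # ICM: greedily relabel each pixel to minimize its data cost plus
    # a Potts penalty for disagreeing with its 4-neighbours.
    h, w = img.shape
    for _ in range(max_iter):
        for i in range(h):
            for j in range(w):
                best_c, best_e = labels[i, j], np.inf
                for c in range(k):
                    data = (img[i, j] - means[c]) ** 2
                    nbrs = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
                    smooth = sum(potential for (a, b) in nbrs
                                 if 0 <= a < h and 0 <= b < w
                                 and labels[a, b] != c)
                    if data + smooth < best_e:
                        best_c, best_e = c, data + smooth
                labels[i, j] = best_c
    return labels
```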

(a) Original image before MRF transformation
(b) Image after MRF transformation

Figure 2: MRF applied to a test image

2.2 Segmentation using Hidden Markov Random Field with Expectation Maximization (HMRF-EM)

In addition to using the regular MRF for image denoising and segmentation by edge detection, we used a second model, this time based on a Hidden Markov Random Field with the Expectation Maximization (EM) algorithm used to learn the most probable parameters. This algorithm is fully described in [Wang(2012)], which also provided an implementation that enabled us to tweak the parameters until an adequate amount of noise was removed from the original image. This model was just as memory-intensive and time-consuming as the regular MRF, but it has one fewer parameter: we had to specify only the number of iterations and the number of neighbours for the k-nearest-neighbour classification of each pixel. Great care was again taken in picking parameters that would leave as little noise as possible in the image compared to the original. Both models (MRF and HMRF-EM) were used for image denoising, segmentation, and generating extra features for downstream training of the final model.

Figure 3 shows the result of HMRF-EM applied to an image, processed with the parameters (maxIter = 5, k = 3). With these parameters the model took approximately 80 seconds per image; multiplying by the total number of images in the data set (838 images), applying the model to all of the images would take approximately 17 hours. Given this lengthy computation time, it is clear how necessary parameter optimization was for this model. These parameters nevertheless proved to be the best compromise between computation time and quality of results, as other settings had longer computation times while yielding similar results. Beyond having pre-implemented code available, these models were chosen because they are among the simplest statistical models for image denoising and segmentation that do not assume the variables of the system (here, the pixels of an input image) are independent. They allow us to model important inter-pixel dependencies and conditional dependencies that can be exploited for our purpose of pre-processing.
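The HMRF-EM procedure of [Wang(2012)] alternates a MAP estimate of the label field with EM updates of the Gaussian emission parameters. The following is a simplified sketch of that loop, not the released code: beta plays the role of the smoothness weight, and the soft E-step here omits the neighbourhood term that the full algorithm includes.

```python
import numpy as np

def hmrf_em(img, k=3, em_iter=5, map_iter=5, beta=1.0, seed=0):
    """Simplified HMRF-EM: Gaussian emissions, Potts prior, MAP via ICM."""
    rng = np.random.default_rng(seed)
    h, w = img.shape
    pix = img.reshape(-1).astype(float)

    mu = rng.choice(pix, size=k, replace=False)   # class means
    sigma2 = np.full(k, pix.var())                # class variances
    labels = np.argmin(np.abs(pix[:, None] - mu[None, :]), axis=1).reshape(h, w)

    def disagreement(i, j, c):
        nbrs = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
        return sum(1 for (a, b) in nbrs
                   if 0 <= a < h and 0 <= b < w and labels[a, b] != c)

    for _ in range(em_iter):
        # MAP step: ICM on the label field under the current parameters.
        for _ in range(map_iter):
            for i in range(h):
                for j in range(w):
                    energy = [0.5 * np.log(sigma2[c])
                              + (img[i, j] - mu[c]) ** 2 / (2 * sigma2[c])
                              + beta * disagreement(i, j, c)
                              for c in range(k)]
                    labels[i, j] = int(np.argmin(energy))
        # E-step: soft class responsibilities from the Gaussian likelihoods.
        ll = (-0.5 * np.log(sigma2)[None, :]
              - (pix[:, None] - mu[None, :]) ** 2 / (2 * sigma2[None, :]))
        post = np.exp(ll - ll.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)
        # M-step: re-estimate the Gaussian parameters.
        nk = post.sum(axis=0) + 1e-12
        mu = (post * pix[:, None]).sum(axis=0) / nk
        sigma2 = (post * (pix[:, None] - mu[None, :]) ** 2).sum(axis=0) / nk + 1e-12
    return labels, mu, sigma2
```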

(a) Original image before HMRF-EM transformation
(b) Image after HMRF-EM transformation

Figure 3: HMRF-EM applied to a test image

2.3 Cropping

2.3.1 Manual Cropping

While a model to crop images automatically and pipeline them into our CNN identifier was still under development, we needed cropped images in order to work on the CNN itself. A small program was therefore created to expedite the process of manually cropping all the photos. Images were first greatly reduced in size using an augmenting program, then cropped using code developed by [Rosebrock()] for cropping boxes out of an image. The cropped regions were then mapped back onto the full-size images to obtain a high-resolution version of each cropped photo.
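The core of the [Rosebrock()] technique is capturing mouse events with OpenCV. Below is a minimal sketch of a click-and-drag cropping tool in that style; the file names are hypothetical, and our actual tool additionally rescaled the recorded box coordinates back to the original image resolution.

```python
import cv2

ref_pt = []  # corners of the dragged selection box

def click_and_crop(event, x, y, flags, param):
    """Record the corners of a box dragged with the left mouse button."""
    if event == cv2.EVENT_LBUTTONDOWN:
        ref_pt[:] = [(x, y)]
    elif event == cv2.EVENT_LBUTTONUP:
        ref_pt.append((x, y))

image = cv2.imread("whale.jpg")  # hypothetical reduced-size input
clone = image.copy()
cv2.namedWindow("image")
cv2.setMouseCallback("image", click_and_crop)

while True:
    cv2.imshow("image", image)
    key = cv2.waitKey(1) & 0xFF
    if key == ord("r"):                         # reset the selection
        image = clone.copy()
    elif key == ord("c") and len(ref_pt) == 2:  # crop and quit
        (x0, y0), (x1, y1) = ref_pt
        roi = clone[min(y0, y1):max(y0, y1), min(x0, x1):max(x0, x1)]
        cv2.imwrite("whale_cropped.jpg", roi)
        break

cv2.destroyAllWindows()
```

Mapping back to full resolution then amounts to dividing the recorded corner coordinates by the factor used when the image was reduced in size.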

3 Whale Identification

3.1 Binary Classification using Pre-trained CNN

Our initial approach was to create a binary classifier using two identical, merged CNN structures. Siamese CNN structures have previously been applied with some success to facial recognition [Khalil-Hani(2014)] and to one-shot image recognition [Koch(2015)], suggesting that such a model would be suitable for whale identification with our small data set. However, due to the small data set size and limited available computational resources, we decided to implement the model using a VGG16 structure pre-trained on the ImageNet database [Chollet(a)]. We removed VGG16's final classification block, merged two identical models with learning turned off, and then tried several different classification structures on top, including adding one or two fully connected layers and a dropout layer before a single classification node with a sigmoid activation function. Training was performed using full-frame images resized to 224x224. However, we were unable to obtain any results that modeled more than the proportion of positive to negative training examples. This led us to conclude either that the signal-to-noise ratio was inadequate, or that the pre-trained VGG network was not extracting features representative of the inter-individual variation.
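A minimal sketch of this kind of merged binary classifier in Keras follows; the shared frozen VGG16 base, the size of the added fully connected layer, and the optimizer are illustrative stand-ins rather than our exact final configuration.

```python
from keras.applications.vgg16 import VGG16
from keras.layers import Input, Dense, Dropout, Flatten, concatenate
from keras.models import Model

# Shared VGG16 feature extractor: ImageNet weights, classification
# block removed (include_top=False), learning turned off.
base = VGG16(weights="imagenet", include_top=False,
             input_shape=(224, 224, 3))
for layer in base.layers:
    layer.trainable = False

inp_a = Input(shape=(224, 224, 3))   # first whale photo
inp_b = Input(shape=(224, 224, 3))   # second whale photo

feat_a = Flatten()(base(inp_a))      # same weights applied to both inputs
feat_b = Flatten()(base(inp_b))

x = concatenate([feat_a, feat_b])
x = Dense(256, activation="relu")(x)     # illustrative layer size
x = Dropout(0.5)(x)
out = Dense(1, activation="sigmoid")(x)  # probability of same individual

model = Model(inputs=[inp_a, inp_b], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```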

3.2 Minimal Compute Method to Establish Baseline Results

3.2.1 Method

A model of our minimal compute method, outlined in Stages 1 and 2 below, can be seen in Fig. 4.

Figure 4: Model of Minimal Compute Method Pipeline

Stage 1: Feature Extraction

Training a state-of-the-art convolutional neural network (CNN) for multiclass classification would require more compute power than we had at our disposal. Instead, a "CPU friendly" alternative method was used: whale images were fed through an Inception V3 CNN model pre-trained on the ImageNet data set (similar to Assignment 3 [Mori(2016)]). Outputs from the last average pooling layer were collected and saved as feature vectors for each whale image. Our hypothesis was that the Inception V3 model would have learned enough discriminative information about each whale image to make classification on top of these feature vectors possible.

Several variations of the original images were passed through the pre-trained CNN to obtain different sets of feature vectors, as depicted in Fig. 4. These sets were: original full images, cropped images, and cropped high-resolution images (which were not re-sized to the 299x299 V3 input dimensions).
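A sketch of this feature-extraction step in Keras follows; the function name extract_features and the file path are ours, and the pre-trained Inception V3 with average pooling yields a 2048-dimensional vector per image.

```python
import numpy as np
from keras.applications.inception_v3 import InceptionV3, preprocess_input
from keras.preprocessing import image

# Pre-trained Inception V3 truncated before its classifier;
# pooling="avg" exposes the last average-pooling output directly.
extractor = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

def extract_features(path, target_size=(299, 299)):
    """Load one whale photo and return its Inception V3 feature vector."""
    img = image.load_img(path, target_size=target_size)
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return extractor.predict(x)[0]   # shape: (2048,)
```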

Stage 2: Support Vector Machine for Classification

The feature vectors of each image, as discussed in Stage 1, were then passed into a support vector machine to classify each image as one of 78 whales.

Feature vectors were either left unnormalized or normalized by scaling each vector to its unit norm (i.e., L2 normalization). This has proven to be an effective pre-processing step in previous work using SVMs for classification problems [Simonyan and Zisserman(2015)]. With L2 normalization, each element $E_x$ of a feature vector $x$ becomes

$$E_{x_{\mathrm{norm}}} = \frac{E_x}{\lVert x \rVert_2} \qquad (1)$$
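A sketch of this classification stage with scikit-learn follows; the feature and label files are hypothetical stand-ins for our extracted Inception V3 vectors, and the SVM hyper-parameters shown are illustrative rather than our tuned values.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import normalize
from sklearn.svm import SVC

# Hypothetical arrays: X holds one 2048-d Inception V3 vector per image,
# y the corresponding whale identities.
X = np.load("features.npy")
y = np.load("labels.npy")

X_l2 = normalize(X, norm="l2")     # Eq. (1): scale each row to unit norm

svm = SVC(kernel="linear", C=1.0)  # illustrative hyper-parameters
scores = cross_val_score(svm, X_l2, y, cv=5)
print("mean cross-validation accuracy: %.3f" % scores.mean())
```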

However, after visualizing these normalized feature vectors (see Fig. 5), there did not appear to be enough variation between images in their V3-learned features.

Figure 5: Normalized Feature Vectors Learned from Inception V3

3.2.2 Results

Several hyper-parameters were tested during cross-validation; the results are shown in Fig. 6.

Figure 6: Cross-Validation Accuracy Reported Across Several Hyper-parameters

Surprisingly, despite L2 normalization outperforming in every other test, the choice of hyper-parameters that yielded the highest validation accuracy used unnormalized features. The highest validation accuracy was also achieved in conjunction with a new feature vector, created by averaging the feature vectors of the full images, cropped images, and high-resolution cropped images:

$$\mathit{Features}_{\mathrm{best}} = \frac{\sum F_{\mathrm{type}}}{\mathit{numF}_{\mathrm{types}}} \qquad (2)$$
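In code, Eq. (2) is just an element-wise mean over the per-image feature matrices; the array names below are hypothetical and assume the three sets are row-aligned by image.

```python
import numpy as np

# Hypothetical per-image feature matrices from Stage 1 (n_images x 2048).
full = np.load("features_full.npy")
cropped = np.load("features_cropped.npy")
cropped_hires = np.load("features_cropped_hires.npy")

# Eq. (2): element-wise mean over the feature-vector types.
features_best = (full + cropped + cropped_hires) / 3.0
```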

These hyper-parameters were then used to create our final "CPU-friendly" model, which achieved a classification accuracy of 7.61 percent on our test set (see Fig. 7).

Figure 7: Test Accuracy, Using Chosen "Best" Hyper-parameters

The final model was also tested on only those whales with more than 25 images in the data set, resulting in just 12 classes. On this simplified problem our final model achieved a test accuracy of 19 percent.

3.3 Dual Input Merged Model

In an attempt to create a model that took in as much signal as possible, we constructed a dual-input merged classification model whose inputs were both the cropped original image and the output of the MRF pre-processing applied to the cropped original image. Both inputs were re-sized to 224x224. The structure was based on two independent VGG16 networks pre-trained on ImageNet. Training was performed on the final convolution block as well as an added fully connected layer on top of the VGG networks.

Figure 8: Merged model structure. Blue indicates layers where learning was turned off, while yellow layers had learning turned on.

The model was trained using SGD with a slow learning rate of 0.00001 and momentum of 0.9, as suggested by "Building powerful image classification models using very little data" [Chollet(b)]. While a training accuracy of 99.85% was achieved, it was accompanied by zero testing accuracy using 20% of the data set as test data. This indicates that the model was too complex for the size of the data set provided.
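A sketch of this dual-input structure in Keras; the 79-class output and the SGD settings come from the text above, while the size of the added fully connected layer is an illustrative assumption.

```python
from keras.applications.vgg16 import VGG16
from keras.layers import Input, Dense, Dropout, Flatten, concatenate
from keras.models import Model
from keras.optimizers import SGD

N_CLASSES = 79  # individuals in our catalogue

def vgg_branch():
    """VGG16 pre-trained on ImageNet; only the final conv block trainable."""
    base = VGG16(weights="imagenet", include_top=False,
                 input_shape=(224, 224, 3))
    for layer in base.layers:
        layer.trainable = layer.name.startswith("block5")
    return base

inp_img = Input(shape=(224, 224, 3))  # cropped original image
inp_mrf = Input(shape=(224, 224, 3))  # MRF output for the same image

feat_img = Flatten()(vgg_branch()(inp_img))  # two independent VGG16 branches
feat_mrf = Flatten()(vgg_branch()(inp_mrf))

x = concatenate([feat_img, feat_mrf])
x = Dense(256, activation="relu")(x)         # added fully connected layer
x = Dropout(0.5)(x)
out = Dense(N_CLASSES, activation="softmax")(x)

model = Model(inputs=[inp_img, inp_mrf], outputs=out)
model.compile(optimizer=SGD(lr=1e-5, momentum=0.9),
              loss="categorical_crossentropy", metrics=["accuracy"])
```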

3.4 Simple Convolutional Neural Network

Due to the pre-trained VGG network's inability to extract relevant features, and the massive over-fitting that resulted when learning was applied to only a limited portion of the VGG, we attempted to learn a simple CNN from scratch. The network consisted of a single 2D convolution layer, a max pooling layer, a fully connected 64-node hidden layer, dropout of 0.5, and a classification layer. ReLU was used as the activation function for the hidden layers, while softmax was used for the classification layer. Results are shown in Fig. 9, with a sketch of the architecture following the table.

Input     Learning Rate   Momentum   Train Accuracy   Test Accuracy
HMRF      0.1             0.0        0.142            0.0074
HMRF      0.00001         0.5        0.128            0.147
HMRF      0.00001         0.9        0.399            0.0662
Cropped   0.00001         0.9        0.0456           0.441

Figure 9: Results from training using different input data sets and parameters. HMRF input refers to the HMRF pre-processing applied to the entire data set, while Cropped refers to the manually cropped data set. All inputs were re-sized to 224x224.
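A sketch of this small network in Keras; the text above specifies only the layer types, so the filter count and kernel size below are illustrative assumptions.

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from keras.optimizers import SGD

N_CLASSES = 79  # individuals in our catalogue

model = Sequential([
    # Single 2D convolution layer (filter count and kernel size assumed).
    Conv2D(32, (3, 3), activation="relu", input_shape=(224, 224, 3)),
    MaxPooling2D(pool_size=(2, 2)),
    Flatten(),
    Dense(64, activation="relu"),            # fully connected 64-node layer
    Dropout(0.5),
    Dense(N_CLASSES, activation="softmax"),  # classification layer
])
model.compile(optimizer=SGD(lr=1e-5, momentum=0.9),
              loss="categorical_crossentropy", metrics=["accuracy"])
```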

4 Discussion

4.1 Contributions

Michael initiated the project and collaborated with the team from Fisheries in order to obtain the data set and understand the problem from the scientists' perspective. Luke and Elijah teamed up to work on the pre-processing half of the project. While everyone met and dealt with issues together, we each had our own specific jobs. Luke developed a process for cropping images manually and focused on that for the first half of the project, so that those working on whale classification could test and refine their programs on better-cropped images. Once we had a full set of high-resolution cropped images, Luke continued by assisting where he could with whale classification and with Elijah's automated cropping tool. Kathleen was responsible for the Minimal Compute Method for Baseline Results and the Support Vector Machine for Classification, while Michael was responsible for the Binary Classification using a Pre-Trained CNN, the Dual-Input Merged Model, and the Simple Convolutional Neural Network.

References

[kag()] NOAA Right Whale Recognition, Kaggle. https://www.kaggle.com/c/noaa-right-whale-recognition.

[Sharma()] Kamal Kishor Sharma. Wound Image Segmentation by Markov Random Field. https://github.com/kamalkishor/Wound_Image_Segmentation_by_Markov_Random_Field.

[Wang(2012)] Quan Wang. HMRF-EM-image: Implementation of the Hidden Markov Random Field model and its Expectation-Maximization algorithm, 2012.

[Rosebrock()] Adrian Rosebrock. http://www.pyimagesearch.com/2015/03/09/capturing-mouse-clic.

[Khalil-Hani(2014)] M. Khalil-Hani and L.S. Sung. A Convolutional Neural Network Approach for Face Verification. 2014.

[Koch(2015)] Gregory Koch. Siamese Neural Networks for One-Shot Image Recognition. PhD thesis, University of Toronto, 2015.

[Chollet(a)] François Chollet. deep-learning-models. https://github.com/fchollet/deep-learning-models.

[Mori(2016)] Greg Mori. Assignment 3: Deep Learning. 2016.

[Simonyan and Zisserman(2015)] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR, 2015.

[Chollet(b)] François Chollet. Building Powerful Image Classification Models Using Very Little Data. https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html.
