
CLASSIFICATION OF MEDICAL IMAGES INTO ALZHEIMER’S DISEASE, MILD COGNITIVE IMPAIRMENT AND HEALTHY BRAIN

A THESIS SUBMITTED TO THE UNIVERSITY OF MANCHESTER

FOR THE DEGREE OF MASTER OF RESEARCH IN INFORMATICS

IN THE FACULTY OF SCIENCE AND ENGINEERING

2018

By Chao Fang

School of Computer Science


Contents

Abstract
Declaration
Copyright
Acknowledgements

1 Introduction
  1.1 Problem Statement
  1.2 Aim and Objectives
    1.2.1 Aim
    1.2.2 Objectives
  1.3 Organization of the Document

2 Background
  2.1 Alzheimer’s Disease
    2.1.1 Causes
    2.1.2 Stages and symptoms
    2.1.3 Brain changes caused by AD
    2.1.4 Diagnosis and treatment
      2.1.4.1 Diagnosis
      2.1.4.2 Treatment
  2.2 Machine learning
    2.2.1 Machine learning pipeline
      2.2.1.1 Supervised learning
      2.2.1.2 Unsupervised learning
      2.2.1.3 Reinforcement learning
    2.2.2 Loss function
      2.2.2.1 Cross-entropy loss
      2.2.2.2 L2 loss
    2.2.3 Neural network
    2.2.4 Optimization method
      2.2.4.1 Gradient descent
      2.2.4.2 Adaptive moment estimation
    2.2.5 Activation function
      2.2.5.1 Sigmoid
      2.2.5.2 Rectified linear unit
      2.2.5.3 Softmax function
    2.2.6 Regularization
      2.2.6.1 L2 regularization
      2.2.6.2 Dropout
    2.2.7 Convolutional neural network
      2.2.7.1 Convolution layer
      2.2.7.2 Pooling layer
    2.2.8 Data augmentation
    2.2.9 Model selection
      2.2.9.1 Cross-validation
      2.2.9.2 Receiver operator characteristics
  2.3 Current approaches to detect AD using machine learning methods

3 Experimental method
  3.1 Research methods
  3.2 Overall design of the system
  3.3 Software & Environment
    3.3.1 Hardware
    3.3.2 Software
  3.4 Data
    3.4.1 Dataset
    3.4.2 Montreal Neurological Institute average brain of 152 scans (MNI152)
    3.4.3 Harvard-Oxford cortical and subcortical structural atlases
  3.5 Method
    3.5.1 Brain extraction
    3.5.2 Linear registration
    3.5.3 Segmentation
    3.5.4 ROI extraction
      3.5.4.1 Production of ROI mask
      3.5.4.2 ROI extraction
    3.5.5 Classification
      3.5.5.1 Activation function
      3.5.5.2 Loss function
      3.5.5.3 Pooling
      3.5.5.4 Learning algorithm
      3.5.5.5 Data augmentation
      3.5.5.6 Model selection

4 Results and discussion
  4.1 The use of GPU
  4.2 Comparison of different models on accuracy
  4.3 Effect of data augmentation
  4.4 Comparison of results between ours and others

5 Conclusions and future work
  5.1 Conclusions
  5.2 Future work

Bibliography

A Full experiment results
  A.1 Full results

B Source code
  B.1 Pre-processing
  B.2 Classification
    B.2.1 Label generation
    B.2.2 Classification


List of Tables

2.1 Main stages of Alzheimer’s Disease and related symptoms
2.2 Metrics derived from the confusion matrix
3.1 Primary hardware configurations
3.2 List of third-party packages of Python in this project
3.3 Demographic data for the subjects from the ADNI database
3.4 Distribution of the instances
3.5 Hyper-parameters and values for grid search
4.1 Architecture and hyper-parameters in the experiment for GPU and CPU comparison
4.2 The best 5 and worst 5 results for three-way classification
4.3 The best 3 and worst 3 results for two-way classification
4.4 Effect of data augmentation technique
4.5 The results of our method and others
A.1 Full results for three-way classification
A.2 Full results for two-way classification


List of Figures

2.1 Comparison of cognitive decline between patients and the normal
2.2 Comparison between normal brain structure and the structure affected by the AD
2.3 Machine learning pipeline for supervised learning
2.4 Decision boundaries
2.5 Difference between classification and regression
2.6 Graph of log loss function
2.7 Demonstration of the perceptron
2.8 Feedforward neural network
2.9 A visual example of gradient descent
2.10 Plot of Sigmoid function
2.11 Graph of ReLU function
2.12 Demonstration of Dropout method
2.13 Difference between standard NN and CNN
2.14 Computation processes of the convolutional neural network
2.15 Demonstrations of max pooling and average pooling
2.16 Examples of data augmentation techniques
2.17 Examples of data augmentation methods
2.18 Random splitting method
2.19 Data splitting for cross-validation
2.20 Procedure of full cross-validation
2.21 Confusion matrix for the classification problem with two classes
2.22 A basic ROC graph
3.1 Research methods in CS
3.2 Architecture of the HOKUSAI system
3.3 ICBM 152 template without skull
3.4 Harvard-Oxford subcortical and cortical structural atlases
3.5 MRI scan before and after brain extraction
3.6 Demonstration of affine registration
3.7 Examples of the segmentation process
3.8 Creation of an image for single substructure
3.9 Creation of ROI mask
3.10 Demonstration of ROI extraction
3.11 Overall structure of the network
4.1 Time used when using GPU(s) or CPU
4.2 Error landscapes of training and testing for three-way classification
4.3 Error landscapes of training and testing for two-way classification
4.4 ROC graph for two-way classifiers


Abstract

Alzheimer’s disease is the most common form of dementia. This disease not only reduces the patients’ quality of life, but also puts a financial and emotional burden on their families and carers. Early diagnosis is key for this disease, since treatments have been found to slow down its progression. In this research, we design and implement a novel automatic computer-aided diagnostic system for the early diagnosis of this disease based on Magnetic Resonance Imaging (MRI) scans. In our work, several substructures of the brain related to Alzheimer’s disease are first extracted from the scans; a convolutional network is then trained to extract high-level features and classify the data into different groups. With a suitable network architecture, the accuracies of the system are 91.23% and 72.60% for two-way (normal control and Alzheimer’s disease) and three-way (normal control, mild cognitive impairment and Alzheimer’s disease) classification respectively. Our system outperforms most of the similar systems using traditional machine learning methods. In addition, our experiments show that the use of graphics cards can reduce the training time; however, the elapsed time is not inversely proportional to the number of graphics cards used in training. Moreover, a data augmentation technique, mirror flipping, is evaluated, and no improvement in the accuracy of our model is observed when using this technique. This research can also be applied to a wide range of applications which use magnetic resonance imaging scans as input, such as diagnostic systems for brain tumours or pancreatic cancer. We hope that this research can provide researchers in this field with a starting point and useful experience.

Keywords: Alzheimer’s disease, machine learning, convolutional neural network, deep learning, computer vision.


Declaration

No portion of the work referred to in this thesis has been submitted in support of an application for another degree or qualification of this or any other university or other institute of learning.


Copyright

i. The author of this thesis (including any appendices and/or schedules to this thesis) owns certain copyright or related rights in it (the “Copyright”) and s/he has given The University of Manchester certain rights to use such Copyright, including for administrative purposes.

ii. Copies of this thesis, either in full or in extracts and whether in hard or electronic copy, may be made only in accordance with the Copyright, Designs and Patents Act 1988 (as amended) and regulations issued under it or, where appropriate, in accordance with licensing agreements which the University has from time to time. This page must form part of any such copies made.

iii. The ownership of certain Copyright, patents, designs, trade marks and other intellectual property (the “Intellectual Property”) and any reproductions of copyright works in the thesis, for example graphs and tables (“Reproductions”), which may be described in this thesis, may not be owned by the author and may be owned by third parties. Such Intellectual Property and Reproductions cannot and must not be made available for use without the prior written permission of the owner(s) of the relevant Intellectual Property and/or Reproductions.

iv. Further information on the conditions under which disclosure, publication and commercialisation of this thesis, the Copyright and any Intellectual Property and/or Reproductions described in it may take place is available in the University IP Policy (see http://documents.manchester.ac.uk/DocuInfo.aspx?DocID=487), in any relevant Thesis restriction declarations deposited in the University Library, The University Library’s regulations (see http://www.manchester.ac.uk/library/aboutus/regulations) and in The University’s policy on presentation of Theses.


Acknowledgements

I would like to thank my family. It is their support that helps me complete this exciting and challenging adventure. I would like to thank Dr Fumie Costen for her patience and support. She advises me on the direction of this project and provides me with the best resources.


Chapter 1

Introduction

1.1 Problem Statement

Alzheimer’s Disease (AD) is a progressive and irreversible neurodegenerative disease which is the most common type of dementia [1]. This disease usually affects the elderly and results in memory loss, cognitive impairment and difficulties with language and self-care [2]. Moreover, according to [3], currently no treatment is able to cure AD or stop its progression. That means that once people are diagnosed with Alzheimer’s, their brains have already been damaged, and the symptoms will appear over time. Hence, this disease not only reduces patients’ quality of life but also places a heavy financial and emotional burden on their caregivers and families. There are two main stages, called Mild Cognitive Impairment (MCI) and AD respectively, during the course of this disease. MCI is considered the early stage of Alzheimer’s, and in this stage there are only mild symptoms or no symptoms at all. However, as the disease develops, patients in the early stage will progress to Alzheimer’s over time. The cause of AD is not fully understood, but one possible cause is the effect of two proteins, amyloid beta and tau, which results in the loss of connections between neurons and eventually leads to the death of brain tissue [3]. According to researchers from the University of Zurich [4], there is a way to limit the production of amyloid beta to slow down the development of AD. Therefore, it is critical to diagnose AD in the early stage and apply treatment to slow down the progression before it causes serious damage to the brain and various symptoms. Such treatment could improve the patients’ quality of life and alleviate their symptoms as well as ease the burden on their families. Current diagnosis of AD involves several tests such as medical history analysis, the Mini-Mental State Examination (MMSE) and other medical tests. Usually, this process is mainly based on the experience of specialists [1]. According to [5], it is difficult for a doctor to diagnose this disease in the early stage since there are many other diseases whose symptoms are similar to those of AD. In this case, patients with AD may miss the best time to receive treatment. Moreover, the accuracy of a diagnosis of this disease made using a machine learning method is better than that made by a doctor [6]. It is therefore rational and crucial to build a computer-aided diagnostic system for AD based on machine learning techniques, and such a system can help doctors make an accurate diagnosis of this disease in the early stage.

1.2 Aim and Objectives

1.2.1 Aim

The aim of this project is to develop an automatic computer-aided diagnostic (CAD) system for an accurate diagnosis of Alzheimer’s Disease by using machine learning methods to classify Magnetic Resonance Imaging (MRI) scans.

1.2.2 Objectives

• Apply data pre-processing tools to complete the skull stripping, registration and segmentation tasks to obtain the grey matter, which is affected by AD.

• Use morphometric techniques to process the scans and obtain substructures/regions of interest (ROIs) of the brain from the grey matter.

• Based on the ROIs, build and evaluate models (convolutional networks) with various architectures to obtain an accurate classification result.

• Test and evaluate a data augmentation technique which can be used to improve the performance of the model.

1.3 Organization of the Document

This document consists of 5 main components, and the organization of this document is as follows:


• Chapter 2 gives the background and theories for this project. Section 2.1 explains the details of Alzheimer’s Disease, including the causes, stages, symptoms, diagnosis and so on. After that, the background, techniques and theories related to machine learning, especially the neural network and the convolutional neural network, are presented in Section 2.2. Finally, existing similar computer-aided diagnosis systems for Alzheimer’s are described in detail and the limitations of these systems are identified as well.

• Chapter 3 first presents the research method used in this project. Based on the limitations of the existing similar systems, the idea and overall design of our system are explained in Section 3.2. Moreover, Section 3.3 describes the development and test environments, including the development language, packages, hardware and so on. Section 3.4 explains the data used in this project. The last section of this chapter presents the details of the desired system such as the pre-processing procedure, the architecture of the neural network and the choice of hyper-parameters.

• In Chapter 4, the results of the experiments are collected and illustrated. A variety of comparisons are also performed, such as the use of the Graphics Processing Unit (GPU) versus the Central Processing Unit (CPU) and the effect of data augmentation techniques. In this thesis, the terms GPU and graphics card have the same meaning, because usually there is one GPU on a graphics card. We also compare the performance of our work with that of other similar works. Furthermore, based on the results, discussion and analysis are presented in detail.

• Chapter 5 concludes the project and the experimental findings. Then, based on the conclusions, possible future investigations are discussed.


Chapter 2

Background

2.1 Alzheimer’s Disease

Alzheimer’s Disease (AD) is a progressive neurodegenerative disease first described and reported by Alois Alzheimer in 1907 [7]. This disease is characterized by problems with memory and thinking and eventually leads to the death of patients. Alzheimer’s disease is the most common cause of dementia, comprising 60% to 80% of the 46 million people with dementia around the world [8]. Dementia is a term used to define a series of symptoms related to memory loss and a decline in thinking skills. According to the prediction by [8], 1 in 85 people around the world will be affected by dementia, and the total number of people with dementia would be 113.15 million by 2050. Besides its effects on the human body, dementia and Alzheimer’s also place a heavy burden on the economy. The same report [8] points out that the cost of dementia in 2016 was around 818 billion U.S. dollars, and the cost will increase to over 1,000 billion this year. By 2030, the cost will double, reaching 2,000 billion U.S. dollars. In low-to-medium-income countries, 94% of those with dementia live and receive nursing care at home due to the lack of support from the local medical system [8]. Therefore, this disease also puts a heavy burden on patients’ families.

2.1.1 Causes

Currently, the cause of Alzheimer’s Disease is still unknown, and there are several hypotheses. The genetic factor is one of these hypotheses. According to [1], around 1% of AD patients have this disease due to genetic inheritance; this kind of case is called early-onset familial Alzheimer’s disease since these people usually develop AD before the age of 65. Another genetic risk factor is the ε4 allele of apolipoprotein E (APOE), which increases the risk of having AD by 3-15 times [7]. In 1991, research suggested that the build-up of amyloid beta in the brain is the root cause of AD [9]. This research holds that the connections between nerve cells in the brain are pruned by the amyloid-related protein, and eventually this process results in the death of the neurons. In the tau hypothesis, the tau protein is deemed the root cause of this disease: neurofibrillary tangles form due to the dysfunction of tau protein inside the nerve cells, the microtubules are then destroyed and the connections between the nerve cells are broken, which leads to the death of these cells. Though tangles and plaques exist in the cells of most people, they are found much more often, and in a specific pattern, in people with Alzheimer’s [8]. Other hypotheses include lifestyle, brain injuries, inflammation and so on. Although none of these hypotheses can explain this disease perfectly, the death of a tremendous number of nerve cells and the atrophy of brain tissue are confirmed consequences of getting AD.

2.1.2 Stages and symptoms

Table 2.1: Main stages of Alzheimer’s Disease and related symptoms [1].

Name of stage | Symptoms
Pre-dementia | Forgetting things; difficulties in abstract thinking; apathy
Mild Alzheimer’s disease | Difficulties in speech and in picking the right words and names; short-term memory loss; problems in the execution of movements
Moderate Alzheimer’s disease | Evident speech difficulties; confusion about words and the loss of reading and writing skills; long-term memory loss; moodiness; personality and behavioural changes
Advanced Alzheimer’s disease | Difficulties in daily activities and personal care; difficulty in communication; movement disorders; vulnerability to infections; loss of awareness of recent experiences


Figure 2.1: Comparison of cognitive decline between patients and the normal [10].

According to Alzheimer’s Society [3], although some common symptoms are identified, the symptoms of different patients with AD vary a lot. Four stages are defined by [1] during the AD course based on cognitive ability and functional impairment. The symptoms of every stage are summarized in Table 2.1. Pre-dementia, which is usually termed mild cognitive impairment, is considered the early stage of Alzheimer’s disease. The disease in this stage has no obvious effects on people’s daily life, and the mild symptoms are often mistakenly attributed to ageing. For patients in this stage, irreversible damage to the brain has already been occurring for a decade or more, although symptoms like cognitive and memory problems are not noticeable. Some of the people with MCI will progress to AD or another type of dementia. The second stage is called mild Alzheimer’s disease. Those who are classified into this stage and the following two stages are considered AD patients. Patients in this stage can manage daily life without assistance, although they may notice some changes like moderate memory loss. The next is the moderate stage. People with AD in this stage need care from others and may not be able to express thoughts clearly due to the loss of nerve cells. In the final stage of this disease, the symptoms become worse: individuals lose the ability of self-care and are hardly able to communicate with others.

2.1.3 Brain changes caused by AD

As for humans, there are over 100 billion neurons in the brain, each connected to many others [1]. The brain changes with age, and cognitive function declines due to the loss of neural cells in the brain. That is why some of the elderly have some kinds of thinking and memory problems. However, a great decline in memory and cognitive function could be a sign of the abnormal loss of neurons caused by dementia, including Alzheimer’s disease. Figure 2.1 shows the difference in cognitive function decline between normal ageing people and people with AD. The AD patients show a much greater decline in cognitive function than the normal elderly. According to the research [5], there is normally a 0.24-1.73 per cent reduction in the volume of the hippocampus, which is responsible for forming memories, for people over the age of 60. However, for people with AD, the shrinkage could be between 2.2 and 5.9 per cent. The loss of neurons in the brain first takes place in the hippocampus a decade before the appearance of noticeable symptoms. Then, as more neural cells die, the damage spreads to additional parts and eventually leads to the atrophy of the hippocampus. At the final stage of the AD course, the whole brain is affected, and the widespread damage results in the shrinkage of the cerebral cortex, as well as the enlarged ventricles shown in Figure 2.2. As we can see, the hippocampus of an AD patient is much smaller and the cerebral cortex thinner than those of a normal person, whereas the ventricles of an AD patient are significantly larger than normal.

2.1.4 Diagnosis and treatment

2.1.4.1 Diagnosis

According to the description in [1], currently the diagnosis of AD is based on a series of tests since no single test is able to provide enough evidence for a diagnosis. Although the tests can help doctors to make a decision, a definite diagnosis can only be made with a postmortem examination of brain tissue. Usually, at first, the doctor will ask the suspected patient about his/her medical history, symptoms and family factors related to AD in order to rule out other diseases with the same symptoms as AD. Then, some basic physical exams and diagnostic tests are taken to identify health issues resulting in the symptoms of dementia. In order to make a final decision, the suspected patient usually takes a brain scan. The two most widely used imaging techniques are Magnetic Resonance Imaging (MRI) and Computerized Tomography (CT). These kinds of scans can show the changes in the brain, and they help a doctor rule out other conditions like a tumor or stroke and directly observe the changes in the brain affected by AD, such as the atrophied hippocampus and surrounding brain tissue. That means the spatial structure and volume of the substructures of the brain can be used as biomarkers to detect AD.


Figure 2.2: Comparison between normal brain structure (upper) and structure affected by AD (lower).
https://upload.wikimedia.org/wikipedia/commons/a/a5/Alzheimer%27s_disease_brain_comparison.jpg


This principle is also the basic theory behind most CAD systems for the diagnosis of AD. These existing works are discussed in Section 2.3. The MMSE is another assessment that can help the doctor evaluate the person [3]. During the MMSE test, a professional asks a series of specially designed questions to test a person’s everyday mental skills. The MMSE score ranges from 0 to 30, and a score lower than 24 suggests that the person has dementia. For patients with AD, a decrease of two to four points, on average, in the MMSE score can be observed each year [1].

2.1.4.2 Treatment

On average, a person with Alzheimer’s lives 8 years after the occurrence of noticeable symptoms [3]. However, due to differences in health and care, the lifespan varies from 4 to 20 years. Currently, there is no cure for Alzheimer’s disease that eliminates the symptoms or stops its progression, and the damage caused by AD is irreversible. However, it is possible to relieve the symptoms and temporarily slow down the development of AD with drug and non-drug treatments. Several drugs, such as Aricept, Reminyl and Ebixa, are often used to help with memory problems, daily living and so on [1]. Products or methods, like electronic reminders and a weekly pill box, are also effective ways to help a patient live independently without a caregiver and deal with memory loss [1].

2.2 Machine learning

Machine learning is a multidisciplinary subject related to a wide range of fields like probability, statistics, optimization, algorithms and artificial intelligence [11]. The application of machine learning in the database domain for the analysis of data is called data mining. It is used to process massive amounts of data to build a simple but powerful model, also known as a classifier in a classification task, for a diversity of tasks like credit analysis, fault detection and control. Machine learning also plays an important role in artificial intelligence. In order to improve the adaptivity of a robot or machine [12], the system needs the ability to ’learn’ from the changing environment instead of following decisions exhaustively listed by the designers. Moreover, in the computer vision field, machine learning methods like the Fast Region-based Convolutional Network (Fast R-CNN) have outperformed traditional methods for object detection and facial recognition tasks [13].


Figure 2.3: Machine learning pipeline for supervised learning [14].

2.2.1 Machine learning pipeline

Usually, a machine learning pipeline, shown in Figure 2.3, includes three main components, namely data, learning algorithm and model [14]. The data are collected beforehand, and an interesting pattern may exist in them. According to the type of training data, machine learning algorithms are divided into two main groups, named supervised learning and unsupervised learning, which are discussed in Section 2.2.1.1. The learning algorithm is the major research area of machine learning [11], and it is used to build a model based on the data. The algorithms used in machine learning are different from the algorithms used to directly solve some target problems like sorting or shortest path problems. Machine learning algorithms are designed to find the patterns or regularities in the data and build models from them [12]. That means machine learning algorithms are expected to automatically extract and generate a solution from the training data for a specific task such as spam detection or decision making; there is no existing algorithm that can directly solve these problems. The result of a machine learning algorithm is usually a mathematical model with parameters. This model can be expressed in many forms like linear equations, trees, graphs and so on.


Figure 2.4: Decision boundaries drawn by the linear model equation (2.1). Left: in 2-dimensional (2D) space, data points are 2D, and the boundary is a line (1D). Right: in 3-dimensional (3D) space, data points are 3D, and the boundary is a 2D plane [14].

A simple example of such models is a linear model [14], defined as

w^T x - b = \sum_{j=1}^{d} w_j x_j - b \qquad (2.1)

where w and x are d-dimensional vectors representing the weight vector and input vector respectively, w_j and x_j are the jth elements of the weight vector and input vector, and b is a bias. In this case, the purpose of the learning algorithm, in the classification scenario, is to find an appropriate parameter vector w to yield a plane, called the decision boundary, that can separate data points based on their classes. Figure 2.4 shows the decision boundaries in 2-dimensional and 3-dimensional space.
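As an illustration of equation (2.1), the following minimal NumPy sketch (not code from this thesis; the weight vector and bias are made-up values) evaluates the linear model and assigns a class according to which side of the decision boundary a point falls on.

    import numpy as np

    def linear_decision(x, w, b):
        """Classify a point by the sign of w^T x - b, as in equation (2.1)."""
        score = np.dot(w, x) - b
        return 1 if score > 0 else -1

    # Illustrative 2D example: the decision boundary is the line x1 - 2*x2 - 0.5 = 0.
    w = np.array([1.0, -2.0])   # made-up weight vector
    b = 0.5                     # made-up bias
    print(linear_decision(np.array([3.0, 1.0]), w, b))   # one side of the boundary -> 1
    print(linear_decision(np.array([0.0, 1.0]), w, b))   # other side -> -1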

2.2.1.1 Supervised learning

Supervised learning is a kind of machine learning method in which each training example consists of input data and a desired label, also called a category [14]. The task of supervised learning is to infer a function that maps the input data to the desired label. As the label for each training instance is available, it is possible to adjust the parameters of the inferred function to better fit the input data according to the difference between the predicted and desired labels. That is why this kind of method is called supervised learning. The most common applications of supervised learning are regression and classification, shown in Figure 2.5. In machine learning, a classification task involves learning a mapping function to predict the category of a given instance.


Figure 2.5: Difference between classification and regression [14].

Usually, the categories are discrete. Support Vector Machines (SVM) [15] and decision trees [16] are some of the widely used classification algorithms. The output variables of regression tasks, on the contrary, are usually continuous, and the aim of a regression task is to construct a function that predicts a quantity from the input. Learning algorithms for regression include linear regression, regression trees [17] and so forth.

2.2.1.2 Unsupervised learning

When the label for each example is not available, the method used in the machine learning field is called unsupervised learning. The target is to find patterns or regularities in the input data without the guidance provided by labels. According to [11], the most common application of unsupervised learning is clustering, whose aim is to separate the samples in a dataset into several disjoint subsets, called clusters. The instances in the same cluster are expected to be similar to each other, and different from those in other clusters. This kind of algorithm includes K-means and the Gaussian Mixture Model (GMM) [18] and so on.

2.2.1.3 Reinforcement learning

In some kinds of applications, the output of a machine learning method is a series of actions, regardless of the availability of labels. In this case, we only pay attention to the result, or policy, comprising the series of actions that achieves the goal, not to a single action; there is no optimal single action. The machine learning method should be able to evaluate the existing policies and rewards, and learn from past series of actions in order to generate an optimal policy. Game playing and decision making are examples of applications of reinforcement learning. Algorithms for reinforcement learning include Q-learning [19], State-action-reward-state-action (SARSA) [20], deep Q-learning [21] and so on.

2.2.2 Loss function

In order to help machine learning algorithms find the correct pattern in the data, a precise objective, called a loss function, is set for the algorithm to pursue [14]. A loss function is used to measure the mistakes made by a model, and the ultimate target of a learning algorithm is to minimize these mistakes. The sum of the loss function over all data points is called the error function, which represents the overall mistakes the model makes on the whole dataset. Two of the most commonly used loss functions are cross-entropy and L2 loss.

2.2.2.1 Cross-entropy loss

Cross-entropy loss function [22], also known as log loss [14], is defined as

L = -\sum_{c=1}^{M} y_c \log(f(x)_c) \qquad (2.2)

where M is the number of classes, f(x)_c is the cth element of the model output, representing the probability of the instance belonging to class c, and y_c, the real label, equals one if this instance belongs to class c and zero otherwise. The cross-entropy loss measures the similarity between the real distribution and the predicted distribution of the dataset. When the real distribution of the dataset equals the predicted distribution, the cross-entropy loss function achieves its minimal value. Therefore, the process of minimizing the cross-entropy loss is also the process of forcing the predicted distribution to match the real distribution of the dataset. When there are only two classes, the function, known as the binary cross-entropy loss function, is a special case of cross-entropy and can be defined as [14]

L = -\left[ y \log f(x) + (1 - y) \log(1 - f(x)) \right] \qquad (2.3)

where y, the real label, is either 1 or 0, and f(x) is the output of the model. Figure 2.6 gives the graph of the binary cross-entropy loss function. As we can see, when the true label is one and the output of the model is close to one, the log loss approaches zero. Hence, according to the log loss function, the learning algorithm is encouraged to adjust the parameters of the model to produce an output toward the true label.


Figure 2.6: Graph of the log loss function.
http://ml-cheatsheet.readthedocs.io/en/latest/_images/cross_entropy.png

The sum of the cross-entropy loss function over all examples is called the cross-entropy error or the negative log-likelihood error.
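As a concrete illustration (not from the thesis; the predicted probabilities and labels below are made up), equations (2.2) and (2.3) can be sketched in NumPy as follows; a small constant is added inside the logarithm only to avoid log(0).

    import numpy as np

    def cross_entropy(y_onehot, probs, eps=1e-12):
        """Multi-class cross-entropy of equation (2.2): -sum_c y_c * log(f(x)_c)."""
        return -np.sum(y_onehot * np.log(probs + eps))

    def binary_cross_entropy(y, p, eps=1e-12):
        """Binary cross-entropy of equation (2.3)."""
        return -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

    # Hypothetical three-class prediction (e.g. NC / MCI / AD) where the true class is the third.
    probs = np.array([0.2, 0.1, 0.7])
    y = np.array([0, 0, 1])
    print(cross_entropy(y, probs))        # -log(0.7), roughly 0.357
    print(binary_cross_entropy(1, 0.9))   # -log(0.9), roughly 0.105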

2.2.2.2 L2 loss

Another widely used loss function is called the L2 loss, and its error function is known as the Mean Square Error (MSE). The loss function and error function are defined as [23]

L = (f(x) - y)^2 \qquad (2.4)

C = \frac{1}{2m} \sum_{i=1}^{m} (f(x_i) - y_i)^2. \qquad (2.5)

In equation (2.4), f(x) and y are the predicted label and the true label respectively. Equation (2.5) is the error function, where m is the total number of examples in the dataset and f(x_i) and y_i are the output and label of the ith example respectively. The sum of the squares of the differences between the predicted labels and the corresponding true labels over all data points composes the mean square error of the model. Mean square error has an intuitive geometric meaning: it represents the Euclidean distance between the real value and the prediction [11].


Figure 2.7: Demonstration of the perceptron.
https://www.researchgate.net/profile/Pradeep_Singh22/publication/283339701/figure/fig2/AS:421642309509122@1477538768781/Single-Layer-Perceptron-Network-Multilayer-Perceptron-Network-These-type-of-feed-forward.png

An algorithm that utilizes the mean square error as its optimization target is called the least squares method [11]. In linear regression, the least squares method tries to find a hyperplane that minimizes the sum of Euclidean distances between all data points and the hyperplane.
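A minimal sketch of equations (2.4) and (2.5), using made-up predictions and labels:

    import numpy as np

    def mse_cost(preds, labels):
        """MSE cost of equation (2.5): (1 / 2m) * sum_i (f(x_i) - y_i)^2."""
        m = len(labels)
        return np.sum((preds - labels) ** 2) / (2 * m)

    preds = np.array([0.9, 0.2, 0.4])    # hypothetical model outputs
    labels = np.array([1.0, 0.0, 0.0])   # corresponding true values
    print(mse_cost(preds, labels))       # (0.01 + 0.04 + 0.16) / 6 = 0.035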

2.2.3 Neural network

An Artificial Neural Network (ANN) is a kind of machine learning method. The ANN is inspired by biological nervous systems, such as the brain, which consists of a large number of interconnected neural cells, and by the way they process information [23]. The first artificial neural network was introduced decades ago; however, the lack of suitable devices and techniques limited its development. With the progress of computation power and related techniques, such as the use of GPUs, neural networks have started to show their powerful ability and have outperformed traditional machine learning methods in many areas such as decision making [21] and computer vision [24]. A neural network is a complex network with a huge number of interconnected neurons and a series of non-linear activation functions, discussed in Section 2.2.5. This structure enables the network to approximate almost any given mapping or function from input to output [23]. Therefore, a neural network has a remarkable ability to recognize and detect patterns in complex data. As in the biological nervous system, the basic unit in the artificial neural network is called a neuron, but in an ANN, neurons are grouped into layers.

A simple neural network, named the perceptron, which contains only one neuron, is shown in Figure 2.7 [25]. This network, containing n weighted inputs, an activation function and one output, is one of the simplest networks. Nevertheless, this structure enables it to learn a linearly separable function or represent all primitive Boolean functions except XOR. In a perceptron, usually, the input and output of the neuron are real numbers, and the neuron computes the weighted sum of the inputs. This weighted sum is then passed to an activation function, called the sign function, to produce the output of the network [25], which can be expressed as

f(x) = \begin{cases} 1, & \text{if } w^T x = \sum_{i=0}^{n} w_i x_i > 0 \\ -1, & \text{otherwise} \end{cases} \qquad (2.6)

In equation (2.6) and Figure 2.7, w_i and x_i are the ith elements of the weight vector w and the input vector x respectively; x_0 is a constant value 1 and w_0 is the bias. Although the perceptron is one of the simplest feedforward networks, it plays an important role as the basic unit of the neural network shown in Figure 2.8. This is an example of a feedforward neural network, in which the output of one layer is fed into the next layer and no cycle exists. This kind of fully connected neural network is also known as a multi-layer perceptron; networks can also contain more complex connection patterns or loops, as in the Convolutional Neural Network and the Recurrent Neural Network [26]. The input layer is where the input vector is fed to the network, so the number of neurons in the input layer equals the dimension of the input data. Hidden layers are the layers in the neural network other than the input and output layers. The output layer is the last layer of the neural network, responsible for producing the result of the network. The result is used for the further calculation of an error function. In an artificial neural network, activation functions are normally applied only to the neurons in the hidden layers and the output layer.
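The forward pass of the perceptron in equation (2.6) can be sketched as below; this is an illustration only, with arbitrary weights, and the bias is folded in as w_0 with the constant input x_0 = 1 described above.

    import numpy as np

    def perceptron_forward(x, w):
        """Perceptron output of equation (2.6): the sign of the weighted sum.

        x and w include the constant term: x[0] = 1 and w[0] is the bias.
        """
        return 1 if np.dot(w, x) > 0 else -1

    w = np.array([-0.5, 1.0, 1.0])    # made-up bias (w0) and weights
    x = np.array([1.0, 0.0, 1.0])     # x0 = 1, followed by two input features
    print(perceptron_forward(x, w))   # -0.5 + 0.0 + 1.0 = 0.5 > 0, so the output is 1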

2.2.4 Optimization method

2.2.4.1 Gradient descent

As discussed in Section 2.2.3, an artificial neural network includes a lot of weights and biases, and the values of these parameters are unknown and cannot be decided manually. In this case, a learning algorithm called Gradient Descent (GD) is applied to find the optimal weights and biases that reduce the error, so that the network can approximate the mapping from the input to the output [23].


Figure 2.8: Feedforward neural network [14].

Figure 2.9: A visual example of gradient descent [14].


Figure 2.9 gives an example of how gradient descent works. The basic idea of GD is to update the weights in the direction of the negative gradient of the cost function, which keeps the error decreasing [14]. Take the Mean Square Error (MSE), introduced in Section 2.2.2, as an example; it is one of the most commonly used cost functions for neural networks. The MSE cost function is defined in equation (2.5), and it is always non-negative since there is a square operation. ∆C is defined as the change in the cost function, and according to [23] it can be expressed as

\Delta C \approx \frac{\partial C}{\partial w_1} \Delta w_1 + \frac{\partial C}{\partial w_2} \Delta w_2 + \ldots + \frac{\partial C}{\partial w_n} \Delta w_n = \nabla C \cdot \Delta w \qquad (2.7)

In (2.7), C is the cost function, ∆w_i is the change made to the ith weight in the network, ∂C/∂w_n is the first partial derivative of the cost C with respect to the nth weight, ∇C is the gradient vector (∂C/∂w_1, ∂C/∂w_2, ..., ∂C/∂w_n)^T, and ∆w is the vector (∆w_1, ∆w_2, ..., ∆w_n)^T. In order to minimize the cost function, a method is needed to decrease the cost by keeping ∆C negative. After a number of iterations, it is possible to reach the minimal cost value [23]. The problem becomes finding a way to select ∆w_1, ∆w_2, ..., ∆w_n that keeps ∆C negative. If we set ∆w to −α∇C, where α is called the learning rate and is a small positive value between 0 and 1, equation (2.7) becomes [23]

\Delta C \approx -\alpha \nabla C \cdot \nabla C = -\alpha \| \nabla C \|^2. \qquad (2.8)

In this case, the change in C (∆C) is always a negative value since ‖∇C‖² is positive. Equation (2.8) ensures that the cost function C decreases or stays stable all the time [23]. The update rule for the ith weight can be expressed as [23]

w_i^{t+1} = w_i^t - \alpha \frac{\partial C}{\partial w_i}, \qquad (2.9)

where w_i is the ith weight and t is the number of iterations of the weight update. The above derivation is the principle of gradient descent. Gradient descent is a robust and efficient algorithm widely used in training neural networks. However, there is one limitation that makes the algorithm computationally expensive: from equation (2.5) we can see that, in order to calculate the cost, the outputs resulting from every instance have to be summed up before the weights can be updated.


If the size of the dataset is large, it takes a long time to make a single small step. To address this limitation, some improvements have been applied to gradient descent to create less computationally intensive algorithms than the original one. The two most successful algorithms based on gradient descent are mini-batch gradient descent and stochastic gradient descent [26]. The primary improvement of these two algorithms is to use a small batch of the dataset instead of the whole dataset to calculate the cost and the partial gradients for the weight update in each iteration. Therefore, the computation in each iteration is much smaller, and the weights are updated more frequently than in the original algorithm. The batch size defines the number of instances used for training in each iteration. When all the training examples have been used to train the neural network once, this is called one epoch. The difference between mini-batch gradient descent and stochastic gradient descent lies in the size of the batch: the batch size for mini-batch gradient descent is a hyper-parameter that needs to be set prior to training, whereas stochastic gradient descent uses only one sample in each iteration.
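The weight update of equation (2.9) with mini-batches can be sketched as below. This is a generic illustration using a hypothetical linear model with the MSE cost, not the training code used in this project (which appears in Appendix B).

    import numpy as np

    def minibatch_gd(X, y, w, lr=0.01, batch_size=8, epochs=10):
        """Mini-batch gradient descent for a linear model f(x) = X @ w with MSE cost."""
        n = len(y)
        for _ in range(epochs):
            idx = np.random.permutation(n)           # shuffle once per epoch
            for start in range(0, n, batch_size):
                batch = idx[start:start + batch_size]
                preds = X[batch] @ w
                # Gradient of the batch MSE with respect to w.
                grad = X[batch].T @ (preds - y[batch]) / len(batch)
                w = w - lr * grad                    # update rule of equation (2.9)
        return w

    # Made-up data: 100 samples with 3 features generated from known weights.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, -2.0, 0.5])
    print(minibatch_gd(X, y, w=np.zeros(3), lr=0.1, epochs=50))  # approaches [1, -2, 0.5]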

2.2.4.2 Adaptive moment estimation

As discussed in Section 2.2.4.1, another hyper-parameter, called the learning rate, needs to be set before training. This hyper-parameter defines the step size of the weight update, and it remains the same throughout the training process [14]. A big learning rate results in a big change to the weights, but it may make the change so large that the algorithm overshoots the minimum or even makes the training process diverge. In contrast, a small learning rate can make the algorithm take a long time to converge since it makes only a tiny change to the weights. So, the choice of learning rate is one of the most important decisions for training a neural network successfully. However, the selection of this hyper-parameter is difficult and empirical. In order to solve this issue, Kingma and Ba proposed a new learning algorithm [27], called Adaptive Moment Estimation (Adam), which adopts an adaptive learning rate; the learning rate in Adam decreases during the training procedure. The main update rules [27] are defined as

g_t = \nabla_w f_t(w_{t-1}) \qquad (2.10)

m_t = \beta_1 \times m_{t-1} + (1 - \beta_1) \times g_t \qquad (2.11)

v_t = \beta_2 \times v_{t-1} + (1 - \beta_2) \times g_t^2 \qquad (2.12)

\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \qquad (2.13)

\hat{v}_t = \frac{v_t}{1 - \beta_2^t} \qquad (2.14)

w_t = w_{t-1} - \alpha \times \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \qquad (2.15)

where t is the timestep, f_t(w_{t-1}) is the cost function at timestep t with the parameter w from timestep (t − 1), α is the learning rate, g_t is the gradient with respect to the parameter w, m_t and v_t are exponential moving averages of the gradient and the squared gradient at timestep t, and β_1 and β_2 are hyper-parameters, usually set to 0.9 and 0.999, used to control the effect of the previous gradients and the current gradient. As m_0 and v_0 are initialized to 0, according to equations (2.11) and (2.12), m_t and v_t are biased toward zero due to the initial values, especially in the first several timesteps. To address this issue, the bias-corrected estimates m̂_t and v̂_t are calculated. The basic idea of Adam is to use both the previous gradients and the current gradient with respect to a parameter for the update. In this case, it can be considered that each parameter has a unique learning rate based on its own condition. Equation (2.10) obtains the gradient of the cost function with respect to w. Equations (2.11) and (2.12) show the methods to calculate and update the exponential moving averages of the gradient and the squared gradient. Equations (2.13) and (2.14) give the rule to correct the bias of these exponential moving averages, which arises because they are initialized to zero. In equation (2.15), the update rule for the weights is defined based on the variables calculated in the previous steps. Adam is efficient and easy to implement. Moreover, in some studies, Adam shows better performance than other adaptive algorithms like AdaGrad and SGD with Nesterov momentum [27].
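A minimal sketch of one Adam update following equations (2.10)-(2.15) is given below; the toy cost function and its gradient are illustrative stand-ins, not part of the thesis.

    import numpy as np

    def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        """One Adam update following equations (2.11)-(2.15)."""
        m = beta1 * m + (1 - beta1) * grad           # (2.11) first moment estimate
        v = beta2 * v + (1 - beta2) * grad ** 2      # (2.12) second moment estimate
        m_hat = m / (1 - beta1 ** t)                 # (2.13) bias correction
        v_hat = v / (1 - beta2 ** t)                 # (2.14) bias correction
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # (2.15) parameter update
        return w, m, v

    # Toy problem: minimize f(w) = ||w||^2, whose gradient is 2w.
    w = np.array([1.0, -2.0])
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, 1001):
        grad = 2 * w                                 # (2.10) gradient of the cost
        w, m, v = adam_step(w, grad, m, v, t, lr=0.05)
    print(w)                                         # ends up close to the minimum at [0, 0]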

2.2.5 Activation function

In Section 2.2.3, we have seen an activation function, called the sign function, used in the perceptron. An activation function in a neural network processes the linear combination of the inputs and produces the output of the neuron [23].


Figure 2.10: Plot of Sigmoid function [14].

Some activation functions, like sigmoid and softmax, take inputs ranging from negative infinity to positive infinity and produce values in a small range like [0, 1]. It is reasonable to consider that the activation function produces the probability of a particular neuron being activated, and it works like a decision-making function used to decide whether a particular neuron should be active. Moreover, the use of such functions introduces non-linearity into the neural network. A neural network with two or more layers and activation functions is able to approximate any linear or non-linear function [23]. Otherwise, the approximation produced by a neural network without any activation function will be less effective when the desired function is non-linear.

2.2.5.1 Sigmoid

One of the most commonly used activation functions is the sigmoid function, also known as the logistic curve [14], which can be expressed as

\phi(x) = \frac{1}{1 + e^{-x}}.    (2.16)

Figure 2.10 shows the graph of this function. It takes an input x with a domain of all real numbers and produces an output ranging between 0 and 1. Moreover, it is a non-linear function, and its derivative is non-negative at any given point. These properties allow this function to be used as an activation function in a neural network. The output of a perceptron [14] with the sigmoid function can be expressed as


f(x) = \frac{1}{1 + e^{-(w^T x - b)}}    (2.17)

where x is the input vector, and w and b are the weight vector and the bias respectively. There is a limitation in using the sigmoid function: when the absolute value of the input increases, the gradient of the sigmoid function with respect to the input shrinks to a very small value. This situation is called the vanishing gradient problem, and it makes the learning procedure difficult to converge and reach the minimum.

2.2.5.2 Rectified linear unit

Another commonly used activation function is the Rectified Linear Unit (ReLU). The ReLU function can be expressed as

f(x) = max(0, x), (2.18)

where x is a real number [28]. The graph of ReLU is shown in Figure 2.11. The graph shows that if the input of ReLU is less than zero, the output is zero; otherwise, the output equals the input. For positive inputs, the gradient of ReLU with respect to its input is the constant 1, which makes it possible to avoid the vanishing gradient problem [23].

2.2.5.3 Softmax function

Another important activation function is the softmax function. This function normalizes an arbitrary real-valued K-dimensional vector into a K-dimensional vector where each element lies between 0 and 1 and all elements sum to 1 [11]. The output of the softmax function can be considered a categorical distribution representing the probability of the occurrence of each element in the vector. The equation of softmax [23] is defined as

\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}} \quad \text{for } j = 1, \ldots, K,    (2.19)

where z and σ(z) are K-dimensional vectors representing the input and output of this function respectively, z_j is the jth element of the input vector z, and σ(z)_j is the jth element of the output vector.


Figure 2.11: Graph of ReLU function (red) [28].

Usually, softmax is used as the activation function of the last layer in a neural network. For a classification task, K equals the number of classes in the dataset. Therefore, each element of the output vector represents one class, and the corresponding value is the probability that the input data belongs to that class.
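The three activation functions above can be sketched in a few lines of NumPy. The subtraction of the maximum inside the softmax is a common numerical-stability trick and does not change the result of equation (2.19); the input values are toy numbers.

    import numpy as np

    def sigmoid(x):
        # eq. (2.16): squashes any real input into (0, 1)
        return 1.0 / (1.0 + np.exp(-x))

    def relu(x):
        # eq. (2.18): zero for negative inputs, identity otherwise
        return np.maximum(0.0, x)

    def softmax(z):
        # eq. (2.19): subtracting the maximum keeps the exponentials well behaved
        e = np.exp(z - np.max(z))
        return e / e.sum()

    z = np.array([2.0, 1.0, 0.1])
    print(sigmoid(z))
    print(relu(np.array([-1.0, 3.0])))
    print(softmax(z), softmax(z).sum())   # the softmax outputs sum to 1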

2.2.6 Regularization

Regularization is a kind of modification made to the learning algorithm or error function to reduce the generalization error of a model [26]. In machine learning, the algorithm is designed to learn general patterns or regularities from the training data [11]; then, according to the learnt model, predictions are made on future data. However, sometimes the algorithm builds a model that fits the training data so closely that it learns patterns existing only in the training data [14]. This problem is called overfitting, and it usually leads to a poor generalization ability. It typically happens when an overly complicated model is used to explain a dataset with simple patterns. On the contrary, underfitting happens if the algorithm does not learn the general patterns well enough. Both overfitting and underfitting result in poor accuracy and generalization on a new dataset. Several techniques have been developed to prevent these issues.


2.2.6.1 L2 regularization

L2 regularization, also known as weight decay, is a technique used to improve the generalization of a model and to limit the values of the weights by adding an L2 regularization term to the cost function [23]. The total cost function with an L2 regularization term and the corresponding update rule [23] can be expressed as

C_{total} = C + \frac{\lambda}{2m} \|w\|_2^2    (2.20)

w_i^{t+1} = w_i^t - \alpha \frac{\partial C_{total}}{\partial w_i^t}
          = w_i^t - \alpha \left( \frac{\partial C}{\partial w_i^t} + \frac{\lambda}{2m} \frac{\partial \|w\|_2^2}{\partial w_i^t} \right)    (2.21)

The C in equations (2.20) and (2.21) is the cost function, λ is a hyper-parameter used to control the effect of the L2 term and is usually a small real number between 0 and 1, and w is the weight vector consisting of all m weights in the model. C_total is the total cost function, equal to the sum of the cost function and the L2 term, and ‖w‖_2 is the L2 norm of the weight vector w [26]. Equation (2.21) shows the rule to update the weight w_i^t, which is the ith weight in iteration t. As discussed in Section 2.2.4.1, the purpose of a learning algorithm is to find a vector w that minimizes the cost, which in this case is the total cost C_total. The additional cost, the L2 term, pushes the weight vector towards the origin (zero) in order to minimize the total cost. Studies [24] [23] have shown that the introduction of the L2 term improves a model's generalization.
There is another regularization method, closely related to L2 regularization, used in [24]. This method adds the weight decay term directly to the equation for updating the weights instead of adding it to the cost function. This method [24] can be expressed as

w_i^{t+1} = w_i^t - \alpha \left( \frac{\partial C}{\partial w_i^t} + \lambda w_i^t \right)    (2.22)

where w_i is the ith weight, t is the iteration number of the weight update, α and λ are the learning rate and the weight decay rate respectively, and ∂C/∂w_i^t is the gradient of the cost C with respect to w_i^t. Both methods have the same effect on the weights, apart from a small difference in the coefficient of the regularization term.
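As an illustration of the two formulations, the sketch below applies both update rules to a toy weight vector in NumPy. The gradient values are made up for demonstration; note that differentiating the λ/(2m)‖w‖² term of equation (2.20) gives the coefficient λ/m in equation (2.21), whereas equation (2.22) uses λ directly.

    import numpy as np

    def update_l2_in_cost(w, grad_C, alpha, lam, m):
        # eq. (2.21): the gradient of lambda/(2m) * ||w||^2 with respect to w is (lambda/m) * w
        return w - alpha * (grad_C + (lam / m) * w)

    def update_weight_decay(w, grad_C, alpha, lam):
        # eq. (2.22): the decay term is added directly in the update rule
        return w - alpha * (grad_C + lam * w)

    w = np.array([0.5, -1.5])
    grad_C = np.array([0.1, -0.2])      # hypothetical gradient of the data cost
    print(update_l2_in_cost(w, grad_C, alpha=0.01, lam=0.01, m=2))
    print(update_weight_decay(w, grad_C, alpha=0.01, lam=0.005))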


Figure 2.12: Demonstration of the dropout method. Source: http://cv-tricks.com/wp-content/uploads/2017/04/xdropout.jpeg.pagespeed.ic.73ygRwte2E.webp

2.2.6.2 Dropout

Dropout, another regularization method, is also designed to improve generalization performance. It is implemented by temporarily discarding a small portion of the neurons in a neural network in each training iteration [24]. That means the architecture of the network differs from one iteration to another. Figure 2.12 shows how the dropout method works: in each training iteration, some of the nodes in the network are dropped randomly, and the rest, with their corresponding weights, are used and updated in that iteration. The nodes marked with a cross in Figure 2.12 represent the dropped nodes. One explanation for why dropout works is that dropping neurons randomly during training is like training many subnetworks with different structures [26]. Each subnetwork overfits to the data in a different way, and at the same time the model becomes more robust to the loss of individual neurons. Therefore, the output of the model is like a mixture of the outputs of different networks, which helps to avoid overfitting [23]. This explanation is similar to the principle of the random forest algorithm, which builds a large group of decision trees using randomly selected data and features for each tree [14]. That is probably why the dropout method works well in neural networks.
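The following sketch shows one common implementation, so-called inverted dropout, in which the surviving activations are rescaled during training so that nothing needs to change at test time; it is an illustrative variant rather than the exact implementation of [24].

    import numpy as np

    def dropout(activations, p_drop=0.5, training=True):
        """Inverted dropout: each neuron is kept with probability 1 - p_drop and the
        surviving activations are rescaled by 1 / (1 - p_drop)."""
        if not training or p_drop == 0.0:
            return activations
        mask = np.random.rand(*activations.shape) >= p_drop   # random keep/drop mask
        return activations * mask / (1.0 - p_drop)

    h = np.ones((2, 8))               # hypothetical hidden-layer activations
    print(dropout(h, p_drop=0.5))     # roughly half the units are zeroed out each call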


Figure 2.13: Difference between a standard NN (left) and a CNN (right). Source: http://xilinx.eetrend.com/files-eetrend-xilinx/article/201612/10827-27690-06.jpg

2.2.7 Convolutional neural network

2.2.7.1 Convolution layer

A Convolutional Neural Network (CNN) is a specific kind of neural network. CNNs were first developed in the 1980s and have risen again in recent years due to improvements in computing capability, especially the use of powerful GPUs [11]. A CNN is very similar to a standard neural network. In a standard neural network, the nodes in neighbouring layers are fully connected to each other, whereas in a CNN the nodes are only partially connected [23]. Figure 2.13 shows the difference in the connection pattern between a CNN and a standard NN: the weights and connections in a CNN only exist between some of the nodes. In a CNN, the weights in the same layer are grouped into different sets called kernels or filters, and normally more than one kernel is used in each layer. Figure 2.14 demonstrates a CNN in a graphical and intuitive way. The first matrix is the input; usually it is an image whose values represent the intensities of the pixels. The second is a kernel whose values represent the weights. The third matrix is the result of the Hadamard product between the kernel and the corresponding area (blue) of the input [23]. The sum of the elements of the third matrix is one element of the fourth matrix, known as the feature map. As the kernel slides over the input step by step in an overlapping way, different sub-areas of the input are involved in the operation. At the end, the feature map is filled with the results, as shown in Figure 2.14(b). This procedure closely resembles the mathematical operation called convolution.


Figure 2.14: Computation processes of CNN. (a) The first element of the result. (b) The last element of the result.

This is the reason why this kind of neural network is called a convolutional neural network [23]. The use of convolutional layers and kernels is inspired by the kernels used in signal processing and computer vision for signal filtering, edge and corner detection, and so on [11]. Another difference between a CNN and a standard NN lies in the use of the weights. Each weight, except the bias, in a standard NN corresponds to one node only [23], whereas in a CNN the weights in the same kernel are shared during the convolution operation [26]. That means the weights used to produce the results in the same feature map remain the same, and therefore the number of feature maps is equal to the number of kernels. In a CNN, there is usually an activation function after each layer to introduce non-linearity into the network. A further difference between a CNN and a standard neural network is the use of the pooling layer: usually, there is no pooling layer in a standard neural network.
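A minimal NumPy sketch of the operation illustrated in Figure 2.14 is given below (stride 1, no padding). As is common in deep learning libraries, the kernel is not flipped, so strictly speaking this is cross-correlation; the input and kernel values are toy examples.

    import numpy as np

    def conv2d(image, kernel):
        """Valid 2-D convolution: the kernel slides over the input and each output
        element is the sum of an element-wise (Hadamard) product with the patch."""
        ih, iw = image.shape
        kh, kw = kernel.shape
        out = np.zeros((ih - kh + 1, iw - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"
    kernel = np.array([[1.0, 0.0], [0.0, -1.0]])       # toy 2x2 kernel of shared weights
    print(conv2d(image, kernel))                        # 4x4 feature map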


2.2.7.2 Pooling layer

The pooling layer is used to summarize the information in the output of the previous layer, usually the activation layer [26]. It combines the values in a small neighbourhood of the output into a single neuron, which becomes part of the input to the next layer. In some sense, the way a pooling layer works can be thought of as another kernel; however, the operation finds a representative value of a small area in a non-overlapping way instead of performing a convolution. Another function of the pooling layer is to reduce the computational complexity by reducing the number of neurons [29]. For a 2×2 max pooling or average pooling, only one neuron out of every four is left after the pooling operation, which means 75% of the data is eliminated. There are two commonly used pooling methods, max pooling and average pooling: max pooling uses the maximum value in the pooling area as the output, whereas average pooling uses the average value. Figure 2.15 depicts the max pooling and average pooling methods. The activation function in a CNN is usually placed before the pooling layer [26], although there is no strict rule; when max pooling and ReLU are used, the results remain the same regardless of whether the activation function is applied before or after the pooling layer.
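A small NumPy sketch of non-overlapping 2×2 pooling, covering both the max and the average variants, is shown below; the feature-map values are toy numbers used only to illustrate that three quarters of the values are discarded.

    import numpy as np

    def pool2d(feature_map, size=2, mode="max"):
        """Non-overlapping pooling: each size x size window is reduced to one value."""
        h, w = feature_map.shape
        out = np.zeros((h // size, w // size))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                window = feature_map[i * size:(i + 1) * size, j * size:(j + 1) * size]
                out[i, j] = window.max() if mode == "max" else window.mean()
        return out

    fm = np.array([[1., 3., 2., 0.],
                   [4., 6., 1., 1.],
                   [0., 2., 5., 7.],
                   [1., 1., 8., 6.]])
    print(pool2d(fm, mode="max"))       # [[6, 2], [2, 8]]
    print(pool2d(fm, mode="average"))   # mean of each 2x2 window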

2.2.8 Data augmentation

The amount of data is key to training a neural network: the more data that is available for training, the higher the accuracy that can be obtained in generalization [24]. However, sometimes the performance of a machine learning model is constrained by a small dataset. In the medical area, due to privacy concerns, usually only a limited number of instances is publicly available. Therefore, there is a need for methods that increase the amount of data based on the existing data; such methods are called data augmentation techniques. For image classification tasks, several such techniques have been developed and applied, and they result in good performance [24]. One kind of data augmentation technique changes the intensities of the original images, for example by adding noise or changing the lighting conditions; Figure 2.16 demonstrates these methods. Another kind applies deformations to the original images; the commonly used transformations are flips, rotations and scaling. Figure 2.17 gives a few examples of these methods.
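A minimal sketch of the geometric augmentations mentioned above (mirror flip and rotation by multiples of 90 degrees) is given below using NumPy; it operates on a 2-D slice for simplicity, but the same idea extends to 3-D volumes.

    import numpy as np

    def augment(image, rng):
        """Random mirror flip plus a random rotation by a multiple of 90 degrees."""
        if rng.rand() < 0.5:
            image = np.fliplr(image)          # horizontal (mirror) flip
        k = rng.randint(0, 4)                 # rotate by 0, 90, 180 or 270 degrees
        return np.rot90(image, k)

    rng = np.random.RandomState(0)
    slice_2d = np.arange(16, dtype=float).reshape(4, 4)   # stand-in for an image slice
    print(augment(slice_2d, rng))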


Figure 2.15: Demonstration of max pooling (left) and average pooling (right). Source: https://cdnpythonmachinelearning.azureedge.net/wp-content/uploads/2017/09/Pooling-Types.png?x31195

Figure 2.16: Examples of data augmentation techniques. Original image (left), the image with random noise (middle) and with a different lighting condition (right). Source: https://cdn-images-1.medium.com/max/1600/1*cx24OpSNOwgg7ULUHKiGnA.png


(a) Original image (left), flipped horizontally (middle), also known as mirror flip, and vertically (right). Source: https://cdn-images-1.medium.com/max/1600/1*-beH1nNqlm Wj-0PcWUKTw.jpeg

(b) Original image (first), rotated clockwise by 90 degrees (second), 180 degrees (third) and 270 degrees (fourth). Source: https://cdn-images-1.medium.com/max/1600/1*i F6aNKj3yggkcNXQxYA4A.jpeg

(c) Original image (left), scaled by 10% (middle) and 20% (right). Source: https://cdn-images-1.medium.com/max/1600/1*INLTn7GWM-m69GUwFzPOaQ.jpeg

Figure 2.17: Examples of data augmentation methods. (a) flip. (b) rotation. (c) scaling.


Figure 2.18: Random splitting method [14].

2.2.9 Model selection

2.2.9.1 Cross-validation

Usually, a given task can be solved using different learning algorithms, which can generate several models. In order to choose the best model, model evaluation and selection are needed. Model selection is the process of picking the optimal model from a group of candidate models based on their performance, and model evaluation gives an estimate of the future generalization error [14]. The best way to evaluate a model is to test it on an unseen dataset; it is unfair and unreasonable to use the training error to evaluate the model, since the model was trained on the training dataset. Three techniques can be used for model selection, namely random splitting, K-fold cross-validation and full cross-validation [14].
As Figure 2.18 shows, for supervised learning, randomly splitting the dataset into two parts, named training and testing data, is one method for model selection. In this method, the model is built on the training data first, and then the unseen testing data is used to evaluate the model; the testing error can be used as the criterion for model selection. However, this method has a limitation: because the dataset is split randomly, even the same model can produce different results in two trials purely by chance. Therefore, this method is not a good estimate of the future generalization error of the model [14].
Model selection via cross-validation is a better method than random dataset splitting. Cross-validation, also known as rotation estimation, is a group of techniques used to assess the generalization performance of models [14]. These techniques


Figure 2.19: Data splitting for cross-validation [14].

include K-fold cross-validation, Leave One Out Cross Validation (LOOCV) and full cross-validation. The basic idea is to split the samples into several complementary subsets, as shown in Figure 2.19. Each subset is called a fold, and K-fold means that there are K disjoint subsets in total. In K-fold cross-validation, the model is trained on all folds except one and tested on the held-out fold, in rotation [30]; each fold is used for testing in turn while the rest are used to train the model. Finally, the average of the testing errors over the folds is the overall cross-validation error, and this error is a good estimate of the model's generalization performance. LOOCV follows the same procedure, but each fold contains only one sample [30]; it is quite compute-intensive and usually only suitable for a small dataset. In full cross-validation, the dataset is first partitioned into two parts [14]. Then, K-fold cross-validation is performed on one part, the major part of the dataset, for parameter tuning and model selection. Finally, the test is performed separately on the other part. This process ensures that the testing dataset remains unseen during parameter tuning and is used for estimating the testing error only. Figure 2.20 shows the process of this method.
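A minimal sketch of K-fold cross-validation is given below; `train_and_score` stands for any user-supplied routine that trains a model on the training folds and returns the test error on the held-out fold, and is purely a placeholder.

    import numpy as np

    def kfold_indices(n_samples, k, seed=0):
        """Split sample indices into k disjoint folds after a random shuffle."""
        idx = np.random.RandomState(seed).permutation(n_samples)
        return np.array_split(idx, k)

    def cross_validate(train_and_score, X, y, k=5):
        """Train on k-1 folds, test on the held-out fold, and average the test errors."""
        folds = kfold_indices(len(X), k)
        errors = []
        for i in range(k):
            test_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            errors.append(train_and_score(X[train_idx], y[train_idx],
                                          X[test_idx], y[test_idx]))
        return float(np.mean(errors))   # the overall cross-validation error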

2.2.9.2 Receiver operator characteristics

Another technique, called the Receiver Operator Characteristics (ROC) graph, and its related metrics provide a more sophisticated way to evaluate the performance of a model [14]. As described in Section 2.2.1, performance is often estimated using the classification error or classification accuracy. However, classification error is a weak metric for measuring the performance of a model [31] when the class distribution is skewed or the costs of misclassification are unequal.


Figure 2.20: Procedure of full cross-validation [14].

Figure 2.21: Confusion matrix for the classification problem with two classes [31].


Table 2.2: Metrics derived from the confusion matrix [31].

Metrics | Equations | Descriptions
Recall | TP / P | Also known as the true positive rate and sensitivity; the percentage of all positive instances that are correctly identified.
False Positive rate (FP rate) | FP / N | Also known as the false alarm rate; the percentage of negative instances that are incorrectly identified.
Specificity | TN / (FP + TN) = TN / N | The percentage of negative instances that are correctly identified; equal to 1 - FP rate.
Precision | TP / (TP + FP) | The percentage of instances classified as positive that are actually positive.
Accuracy | (TP + TN) / (P + N) | The percentage of all instances that are classified correctly.
F-measure | (2 × precision × recall) / (precision + recall) | A higher recall could result in a lower precision by predicting all instances as positive, and vice versa; the F-measure considers both precision and recall and is a trade-off between them that avoids extreme cases.


Figure 2.22: A basic ROC graph [31].

Figure 2.21 gives the confusion matrix for a two-class classification problem. A two-by-two confusion matrix, also known as a contingency table, is built from the predicted and actual labels. This matrix serves as the basis for many other performance metrics, which are shown in Table 2.2. For example, for a rare disease, if a classifier predicts all instances as the negative class, it will still achieve an apparently acceptable classification accuracy, because the positive instances account for a much smaller portion of the whole dataset than the negative ones. Usually, the cost of failing to identify the disease is much higher than that of raising a false alarm. Therefore, in such cases, classification accuracy is a weak metric for measuring the performance of a model. The F-measure in Table 2.2, the harmonic mean of precision and recall, can be used to measure the performance of this kind of model [31]. It considers both the precision and the recall of the model and reduces the bias caused by an imbalanced class distribution.
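The metrics of Table 2.2 can be computed directly from the four cells of the confusion matrix, as in the sketch below; the toy numbers describe a skewed two-class problem in which the accuracy looks high while the F-measure reveals the weak detection of the rare positive class.

    def metrics_from_confusion(tp, fp, fn, tn):
        """Metrics of Table 2.2 computed from the confusion matrix cells."""
        p, n = tp + fn, fp + tn                      # actual positives and negatives
        recall = tp / p                              # true positive rate / sensitivity
        fp_rate = fp / n                             # false alarm rate
        specificity = tn / n                         # 1 - FP rate
        precision = tp / (tp + fp)
        accuracy = (tp + tn) / (p + n)
        f_measure = 2 * precision * recall / (precision + recall)
        return recall, fp_rate, specificity, precision, accuracy, f_measure

    # a skewed toy example: 10 positives, 990 negatives
    print(metrics_from_confusion(tp=5, fp=20, fn=5, tn=970))
    # accuracy is 0.975 although only half the positives are found (F-measure ~ 0.29)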


A basic ROC graph is a two-dimensional graph where the x-axis represents the False Positive (FP) rate and the y-axis represents the True Positive (TP) rate [31]. Figure 2.22 shows an example of a ROC graph. A model with a pair of metrics (TP rate, FP rate) corresponds to a data point in the graph. The data point D in the top-left corner has a higher TP rate and a lower FP rate than the other points (A, B, C, E). The performance of any point (classifier) on the diagonal is equal to random guessing, so such a classifier, like point C, provides no information at all; in general, the point (classifier) closest to the upper-left corner is selected. No classifier should appear in the area below the diagonal, since a classifier in this area performs worse than random guessing; a point below the diagonal is usually the result of mixing up the TP rate and FP rate. The choice between points A and B depends on the specific task: if the task pays more attention to the TP rate than the FP rate, point B is a better choice than point A. With the wide use of machine learning in many areas, especially in medical diagnosis, ROC analysis plays an increasingly important role and usually outperforms classification accuracy in cost-sensitive learning and learning with unbalanced classes [32].

2.3 Current approaches to detect AD using machine learning methods

Several computer-aided diagnostic systems have been built, and some of them have achieved remarkable results. Among these systems, there are mainly three types of methods: voxel-based methods, morphometry-based methods and deep learning methods.
[33] performed a study utilizing the voxel-based approach in 2008. The data they used was collected from different centres. There are 68 scans in total, including 34 subjects with post-mortem confirmation of AD and 34 normal controls (NC). In the pre-processing stage, they first segmented these images into white matter, grey matter and cerebrospinal fluid. Then, a population template was created using these scans, and each grey matter image was normalized to this template. Each normalized grey matter segment was treated as a high-dimensional data point. Finally, a linear support vector machine (SVM) was trained on these data for classification. The accuracy for the classification of AD and NC is between 87.5% and 96.4%, depending on which groups of data were used as the training and testing data. Besides SVM, other models have also been used for the classification of MRI images based on the voxel-based approach. [5] developed an AD diagnosis system using Principal Component Analysis (PCA) and a neural network. The data used in this work is from the Open Access Series of Imaging Studies (OASIS) MRI database 1, including 457 MRI images. Firstly,

1https://www.oasis-brains.org/


the images were registered into the Talairach atlas space, and the normalized centre slices of 230 images were selected as training data. Based on these 230 images, Principal Component Analysis (PCA) was performed, and the top 150 principal components were retained. Therefore, a 150-dimensional vector was used as the feature representation of each image. A six-layer neural network with 150 input nodes was trained for classification. The accuracy for the two-way classification (AD vs. normal control) achieved in this work is 89.22%. [34] performed a series of experiments using dimensionality reduction techniques and machine learning methods in 2015. The dataset used in this work is from the Alzheimer's Disease Neuroimaging Initiative (ADNI)2, and the design is divided into two phases. In the pre-processing stage, several dimensionality reduction techniques, PCA, histograms and downscaling, were applied to the MRI images separately. In the second stage, two machine learning methods, a decision tree and a neural network, were trained on these pre-processed images, which means a total of 6 results were produced by the experiment. For the three-way (normal, MCI and AD) classification, the best result they obtained was an accuracy of 60.2%, using downscaling and the decision tree. Moreover, they also conducted three two-way classifications, (NC+MCI) vs. AD, (NC+AD) vs. MCI and (AD+MCI) vs. NC; the accuracy ranges from 52% to 85.8% depending on the combination of methods used.
From these previous studies we can see that, although some of them achieved a high accuracy on their datasets, they have limitations. The voxel-based approach uses the voxel as the basic feature element, and the intensity of a voxel is the value of that feature element. The extracted features of the MRI images are treated as data in a high-dimensional space, and the machine learning methods are trained to separate the data into different groups. As discussed in Section 2.1.3, the obvious change in the brain resulting from AD is the shrinkage of various substructures consisting mainly of grey matter, and the shape, volumetric and structural changes caused by AD are used as biomarkers to diagnose AD in clinical practice. However, voxel-based methods neglect these changes in the brain caused by AD, which can result in low accuracy. Moreover, the intensity of a voxel tends to be affected by other factors such as noise and the scanning machines; therefore, the features used in this kind of approach are very noisy [35]. Furthermore, most of these studies introduce dimensionality reduction techniques, which usually result in a loss of information. Both noisy data

2http://adni.loni.usc.edu/


and dimensionality reduction techniques can cause the model to overfit to irrelevant noise and can reduce accuracy as well.
Another type of method is the morphometry-based method, which overcomes the limitation of the voxel-based method by considering morphometric changes in the brain. Using the ADNI dataset, one team proposed a morphometry-based analysis method [36]. They first corrected the intensities of the MRI images and segmented grey matter and white matter into different images. Then, these segmented images were non-linearly registered to the MNI space. Using the grey matter images, two z-score [37] maps, a grey matter volume z-score map and a cortical thickness z-score map, were created to represent grey matter volume and cortical thickness decline as classification features. They also calculated and collected the volumes of several brain substructures such as the hippocampus and entorhinal cortex. Based on these features, a support vector machine model was trained as a classifier. According to their results, the accuracy of the classification of AD and NC was 92.7%, and for the three-way classification the accuracy was 74.6%. Another morphometry-based study was designed and implemented by Henriquez in 2017 [38], using data from ADNI. In the pre-processing stage, the skulls were first removed from the MRI images. Then, the images were linearly registered into MNI space and segmented into grey matter, white matter and cerebrospinal fluid. Finally, through the use of non-linear registration and ROI masks, ten regions of interest were extracted from the brain structure of each image, and the volumes of these ROIs were calculated and used as features to train an SVM model. Eventually, they reached an accuracy of 88.5% for the classification of AD and NC.
As we can see from these previous studies, the morphometry-based method is able to take structural information such as volume into account, which conforms to clinical diagnosis and tends to be easily understood. However, the complex pre-processing stages make it difficult to implement, and the results depend heavily on the quality of the pre-processing. Moreover, domain knowledge is needed, since the extraction of the substructures from a brain image is conducted based on that knowledge. Another problem lies in the calculation of the volume or features of the extracted ROIs: the calculation depends heavily on parameters selected manually by the researcher, and a minor change in parameter selection may cause a large difference in the result.
With the rise of deep learning, a new type of method has appeared recently, which is able to eliminate the disadvantages of the voxel-based and morphometry-based methods. [29]


conducted a study that used a convolutional neural network to classify MRI scans into different groups. This study used the ADNI dataset, including 2265 MRI scans. In the pre-processing stage, the images were first registered to the International Consortium for Brain Mapping template. Then, the images were normalized using zero-mean normalization, which involves subtracting the population mean and dividing by the standard deviation. In the classification stage, these processed images were first used to train a 3-D neural network called an autoencoder, whose input and output are equal to each other. In this network, 150 3-D filters/kernels of size 5×5×5 were trained and obtained. The purpose of this step was to obtain the 150 kernels which would be used as the convolutional kernels in the subsequent convolutional network. After training the autoencoder, a network with one convolutional layer and two plain neural network layers was built as the classifier; the CNN kernels in this network were the 150 3-D kernels trained in the previous autoencoder. The accuracies of this system for the two-way (AD vs. NC) and three-way (AD vs. MCI vs. NC) predictions were 95.39% and 89.47% respectively. Another computer-aided diagnosis system was implemented by [35] using deep learning. The datasets used in this work are from Computer-Aided Diagnosis of Dementia 3 (CADDementia) and ADNI. There is no pre-processing in this work, which means the input to the system is the raw MRI image. First, a three-layer autoencoder network was built as in the previous work; however, the purpose of this autoencoder is not to train convolutional kernels but to extract features from each MRI scan. Eight 3-D feature maps of size 38×25×30 were trained and used as features. Then, a four-layer fully connected neural network was trained using the eight feature maps of each MRI scan as input. In this work, the autoencoder network acts as the feature extraction stage, and the fully connected network plays the role of the task-specific classifier. Moreover, the training data for the autoencoder and the plain neural network came from CADDementia and ADNI respectively, which suggests that this feature extraction method has some ability to generalize. This work represents the state of the art, reaching accuracies of 97.6% and 89.1% for the two-way (AD vs. NC) and three-way (AD vs. MCI vs. NC) predictions.
Both of these works are state-of-the-art methods. In their systems, they use deep learning methods, which include an autoencoder network for feature extraction and a fully connected neural network for classification. Although these two systems achieved the highest accuracy among all types of methods, there are

3 https://caddementia.grand-challenge.org/


still some problems with these systems. The memory required to run these two methods can be huge, especially when the dataset is large, since the raw MRI data is used as the input. Moreover, limited interpretability is another weakness. Automatic feature extraction is an attractive technique; however, it is difficult to explain whether the extracted features are related to AD rather than to another brain disease, since no AD-related information or labels were used in the feature extraction stage.


Chapter 3

Experimental method

3.1 Research methods

There are mainly three kinds of research methods in Computer Science (CS): theoretical CS, computer engineering and experimental CS [39]; their procedures are shown in Figure 3.1. According to the analysis of the existing work in Section 2.3, it is clear that this kind of research belongs to the computer engineering research method shown in Figure 3.1 (b) and cannot be proved theoretically or by the analysis of observations. In this case, a system needs to be designed and implemented, and then the system needs to be evaluated by carrying out experiments.

3.2 Overall design of the system

From the analysis of the existing Alzheimer's diagnostic systems discussed in Section 2.3, we can see that these systems have some limitations. By identifying and tackling these limitations, it is possible to propose a new method that achieves better performance in terms of accuracy. In our research, the morphometric method is chosen, since this kind of method considers the volumetric and structural changes in the brain caused by AD. We can use these biomarkers, which are also used in clinical practice, as input, which makes the features used for the classification task reliable and understandable. In terms of the classification task, the process of diagnosing AD using machine learning methods can be considered an image classification task, since an MRI scan is itself a 3-D image. For this kind of task, according to the studies [23] [24], a CNN is able to generate better results than many other machine learning methods. Hence, two phases, pre-processing and classification, are introduced in the proposed



Figure 3.1: Research methods in CS [39]. (a) Theoretical CS. (b) Computer engineering. (c) Experimental CS.


new CAD system. In the first stage, seven substructures/ROIs of the brain affected by AD will be extracted into one image as low-level features using the morphometric method. They are the hippocampus, amygdala, caudate, putamen, pallidum, thalamus and medial temporal lobe. According to [38], these ROIs are discriminative features closely related to AD, and they are able to yield satisfying results. Then, using these ROIs, a CNN will be built and trained to classify the MRI scans into the different groups (NC, MCI and AD). Instead of using raw MRI images, we feed the neural network with the extracted ROIs only; therefore, the size of the network input is reduced, and the memory requirement is lower than that of methods using raw MRI images as input. This new method takes morphometric information into account while avoiding the direct calculation of volumes or other hand-crafted features.
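As a rough illustration of the classification stage, the sketch below defines a small 3-D CNN in PyTorch (the framework used in this project) for the three-way NC/MCI/AD task. The number of layers, channel counts and the assumed 32x32x32 input size are placeholders for exposition, not the architecture finally used in this thesis.

    import torch.nn as nn
    import torch.nn.functional as F

    class ROICNN(nn.Module):
        """Illustrative 3-D CNN for three-way classification (NC / MCI / AD)."""
        def __init__(self, n_classes=3):
            super(ROICNN, self).__init__()
            self.conv1 = nn.Conv3d(1, 8, kernel_size=3)   # one input channel: the ROI image
            self.conv2 = nn.Conv3d(8, 16, kernel_size=3)
            self.pool = nn.MaxPool3d(2)                   # 2x2x2 max pooling
            self.fc1 = nn.Linear(16 * 6 * 6 * 6, 64)      # assumes a 32x32x32 input volume
            self.fc2 = nn.Linear(64, n_classes)

        def forward(self, x):
            x = self.pool(F.relu(self.conv1(x)))          # convolution + ReLU + pooling
            x = self.pool(F.relu(self.conv2(x)))
            x = x.view(x.size(0), -1)                     # flatten to a feature vector
            x = F.relu(self.fc1(x))
            return F.log_softmax(self.fc2(x), dim=1)      # class log-probabilities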

3.3 Software & Environment

According to the description of the system design in Section 3.2, the output of this project is a system used for the diagnosis of Alzheimer's disease. The hardware and software involved in this project, as well as their roles, are discussed in this section.

3.3.1 Hardware

Table 3.1: Primary hardware configurations of the development and test environments.

Hardware | Development Environment | Test Environment [40]
Central Processing Unit (CPU) | Intel(R) Xeon(R) CPU E5-2603 v2 @ 1.80GHz | Intel Xeon E5-2670 v3 2.30GHz
Random Access Memory (RAM) | 64GB | 64GB
Graphics Card | NVIDIA Quadro K600 × 1 | NVIDIA Tesla K20X × 4
Graphic Memory | 981MB | 5700MB × 4
Disk | 200GB | 600GB

This system is developed and evaluated on different computers, since this is a compute-intensive task. The GPU resources of our local computer are insufficient for the running


Figure 3.2: Architecture of the HOKUSAI system [40].


of this system; if the system runs on the CPU only, it takes weeks to train a model. Table 3.1 shows the hardware configurations of the development and test environments. This system is developed on a computer equipped with an Intel(R) Xeon(R) CPU E5-2603 v2, 64 GB RAM and one NVIDIA Quadro K600 graphics card. It is then tested and evaluated on the HOKUSAI system, a supercomputer developed by RIKEN [41], the largest research institution in Japan. This system consists of several main components, namely the Massively Parallel Computer, the Application Computing Server, front-end servers and storage, including online storage and hierarchical storage. Figure 3.2 shows the overall architecture of this system. In this project, the Application Computing Server with GPU (ACSG) is used for testing. ACSG consists of 30 nodes, and these nodes are interconnected, which enables massively parallel computing. Each node in the ACSG is equipped with an Intel Xeon E5-2670 v3 (2.30GHz) CPU, 64GB RAM, a 600 GB disk and four NVIDIA Tesla K20X graphics cards. During testing and evaluation, the developer logs into the front-end server and submits the computing tasks to the ACSG through the RIKEN network and the High-Performance Network (HPN). The results are kept and stored in the developer's home directory. Since this system does not use massively parallel computing techniques, our system is tested and evaluated on one node only.

3.3.2 Software

1. Linux CentOS 7
The operating system of the development and testing environments is Linux CentOS 7. This is an open source, stable and reproducible platform developed based on Red Hat Enterprise Linux (RHEL) [42].

2. Python 3.6
The development language used to build this system is Python 3.6. Python is an open source, interpreted, high-level programming language maintained by the non-profit Python Software Foundation [43]. The power of Python partially results from its flexibility and the great diversity of its packages. In this project, a variety of third-party packages are used for the development of this system, and these are listed in Table 3.2. The source code of this system can be found in Appendix B.

3. Functional Magnetic Resonance Imaging of the Brain Software Library (FSL)


Table 3.2: List of third-party packages of Python in this project.

Name | Description
Numpy 1.0 | A package for scientific computing [44]. It is used for MRI data processing, data augmentation and so on.
torch 0.3.0.post4 | A deep learning framework that enables developers to build a neural network quickly and supports the use of GPUs [45]. This package plays an important role in developing the neural network in this project.
Nibabel | A package for access to various neuroimaging files [46]. It is used to read and convert MRI data into numpy arrays in this project.

FSL is an open source and comprehensive library for the analysis and manipulation of a variety of brain imaging data [47]. It consists of many tools, each with a specific function for processing MRI images. In this project, FSL is mainly used for data pre-processing.

4. Robust Brain Extraction (ROBEX)
Robust Brain Extraction (ROBEX) is a command-line based tool for whole-brain extraction on T1-weighted MRI data developed by Iglesias in 2011 [48]. It is free of parameter settings and able to produce reliable results across datasets. This tool is used to strip the skull in the pre-processing stage.

5. CUDA and CUDA Deep Neural Network (cuDNN)
CUDA is a parallel computing architecture developed by NVIDIA for programming graphics cards so that computations can be performed on graphical processing units (GPUs) [49]. CUDA enables developers to take advantage of the power of GPUs to greatly speed up computing applications. The CUDA Deep Neural Network library (cuDNN) is a task-specific library for deep learning built on CUDA [50]. With its GPU-accelerated functionality, the process of training a deep neural network can be sped up drastically.

6. Git
Git [51] is a distributed version control system which is free of charge and open source. In this project, the source code and files are managed using git.


3.4 Data

3.4.1 Dataset

As discussed in Section 2.1.3, the changes in the brain caused by AD are atrophies of substructures of the brain. In order to diagnose AD, it is critical to check the atrophy of the brain without examining brain tissue directly. In this case, the MRI scan is an ideal medium, able to display the changes inside the brain in a visible way; therefore, most existing CAD systems for AD use MRI/CT scans as input. The data used in this research is from The Alzheimer's Disease Neuroimaging Initiative (ADNI) database (http://adni.loni.usc.edu), which was launched by Dr Michael W. Weiner in 2004 under a public-private partnership. ADNI recruits people, collects data, tracks disease progression and enables the data to be shared between scientists, which makes great contributions to AD research. All the data in this work is from the first phase of the ADNI study, called ADNI1, which was launched in 2004 with the help of 16 other institutes. The data in the ADNI1 study was collected by ADNI from 817 subjects, including different numbers of MCI, AD and Normal Control (NC) subjects [2]. There are 1073 MRI images in total in this dataset, which means some of the subjects have more than one image. The information for this dataset is shown in Tables 3.3 and 3.4.

Table 3.3: Demographic data for the subjects from ADNI database [2].

Class | Number | Gender (Male/Female) | Age (Mean ± std) | MMSE (Mean ± std)
AD | 191 | 100/91 | 75.27 ± 7.46 | 23.31 ± 2.04
MCI | 398 | 257/141 | 74.73 ± 7.39 | 27.03 ± 1.78
NC | 228 | 118/110 | 75.87 ± 5.03 | 29.1 ± 1.00

Table 3.4: Distribution of the instances [2].

Class | No. of Instances
Normal Control | 305
Mild Cognitive Impairment | 525
Alzheimer's Disease | 243


Figure 3.3: ICBM/MNI 152 template without skull [47].

3.4.2 Montreal Neurological Institute average brain of 152 scans (MNI 152)

The basic structures of the brain are the same among people; however, brains differ greatly in details, such as the depth and position of the sulci and in symmetry. In order to study and locate the substructures of the brain, a standard space, called Talairach space, was built by Jean Talairach, a French neuroanatomist. Talairach space was built based on one single female brain, so it is not representative of the population. In order to define a more representative model of the human brain, the Montreal Neurological Institute (MNI) created a new coordinate system and a series of templates, such as MNI 305 and MNI 152, by averaging many healthy MRI scans [52]. Usually, the MRI images collected from different people or generated by different machines differ from each other, because the position of the brain in the image, the scanning parameters and the voxel size are different. Therefore, it is impossible to compare these scans in their native spaces; moreover, features extracted from such scans are not valuable for statistical analysis or machine learning classification either. In this case, a standard template is necessary: by registering the raw MRI images to the standard template, the scans become comparable. The template used in this project is MNI 152, which is the average brain of 152 structural MRI scans from healthy young adults. This template is approximately matched to the Talairach space. It is one of the most commonly used templates and is maintained by the McGill University Health Centre. Moreover, MNI 152 has also been adopted by the International Consortium for Brain Mapping (ICBM), which is why this template is officially called ICBM 152. Figure 3.3 shows the ICBM 152 template viewed in the sagittal, coronal and axial sections.


Figure 3.4: Harvard-Oxford subcortical (left) and cortical (right) structural atlases. Source: https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/Atlases?action=AttachFile&do=get&target=atlas-hosub.png

3.4.3 Harvard-Oxford cortical and subcortical structural atlases

In this project, the structural atlases used to extract the substructures of the brain are the Harvard-Oxford cortical and subcortical structural atlases included in FSL [47]. These probabilistic atlases contain a total of 69 cortical and subcortical structural areas [53]. The atlases derive from T1-weighted structural MRI scans of 37 healthy subjects. In order to create the atlases, semi-automated tools were used by the Harvard Center for Morphometric Analysis (CMA) to segment the T1-weighted structural MRI scans one by one [53]. Then, affine transformations were applied to register these T1-weighted images to the MNI 152 template. Finally, population probability maps were created for each ROI by combining across subjects. The probabilistic atlases represent the probability, ranging from 0% to 100%, that a voxel belongs to a specific substructure. Figure 3.4 shows an example of the atlases in MNI space included in the FSL toolkit [53]. The use of probabilistic atlases is a kind of trade-off, since it is a great challenge to create a set of atlases representing the entire human population due to the variations in structure and function between different human brains. Therefore, probabilistic atlases are an effective way to represent the brain for human populations across age and gender.


3.5 Method

3.5.1 Brain extraction

In data pre-processing, skull stripping is first applied to the MRI images to extract the brain from the original images. Since this project focuses on the study of brain tissue, irrelevant parts such as the skull, neck and eyeballs need to be removed. Robust Brain Extraction (ROBEX) is a command-line based tool for whole-brain extraction on T1-weighted MRI data developed by Iglesias in 2011 [48]. It adopts a combination of generative and discriminative models to complete the segmentation task. It is an automatic skull stripping tool with no parameter settings, and it provides robust results across datasets. Another widely used brain extraction tool is the FSL Brain Extraction Tool (FSL-BET) [47]. FSL-BET is a module of the FSL toolkit, which is an analysis tool for MRI and Diffusion Tensor Imaging (DTI) brain images. This tool has both a Graphical User Interface (GUI) and a command-line interface, which makes it convenient to use. However, it is difficult to get accurate results, since several parameters have to be set, and expert skill and patience are needed [47].
As we can see, ROBEX is a parameter-free tool, and it can be used across datasets without the need for parameter tuning. FSL-BET is a powerful brain extraction tool; however, the setting of hyper-parameters and the skill required make it difficult to use. Moreover, according to [48], ROBEX outperforms FSL-BET on several datasets and is more robust than its rival: ROBEX produces highly accurate results across datasets, whereas FSL-BET cannot produce comparable results without parameter tuning. Therefore, ROBEX is selected to perform the skull stripping task in this project. Figure 3.5 shows an image before and after brain extraction.

3.5.2 Linear registration

The raw MRI scans are not comparable in their native spaces, since the voxel size and the position of the brain in the images differ. In order to generate meaningful features for machine learning or statistical analysis, further processing is needed to register all the images to a standard template, MNI 152 in this project. The process of registration transforms different MRI scans into the standard coordinate system. After registration, the voxel size, brain position and scale of these images are uniform, and the features extracted from the images are comparable and meaningful.


Figure 3.5: An MRI scan before (upper) and after (lower) brain extraction

FMRIB's Linear Image Registration Tool (FLIRT) [54] is a module of FSL for automatic structural MRI brain image registration. This tool performs linear registration using affine transformations such as rotation, scaling and translation, which keep the shape and proportions of a brain the same before and after the transformation. Affine transformations are able to register the global differences, such as overall size, between images. According to the recommendation from FSL, FLIRT with 12 degrees of freedom is the appropriate choice for the affine transformation [47]. Each MRI scan is normalized and registered to the target standard space of MNI 152; after registration, all scans are in the same space and comparable to each other. The options used in this process and the corresponding explanations [54] are listed as follows, and a minimal invocation sketch is given after the list:

• -in: an input, the MRI image.

• -ref: a reference, the path of the MNI152 template image in this project.

• -out: output volume, the name of the output file.

• -dof: 12 (degrees of freedom used in the affine process.)

• -cost: corratio (defines the intensity-based cost function used to quantify similarity.)

• -bins: 256 (specifies the number of bins in the intensity histogram.)
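A minimal sketch of how this registration step can be driven from Python is shown below; it assumes FSL is installed so that the flirt executable is on the PATH, and the file names are placeholders.

    import subprocess

    # Hypothetical file names; the options mirror the list above.
    cmd = [
        "flirt",
        "-in", "subject_brain.nii.gz",          # skull-stripped input scan
        "-ref", "MNI152_T1_1mm_brain.nii.gz",   # MNI 152 template used as the reference
        "-out", "subject_mni.nii.gz",           # registered output volume
        "-dof", "12",                           # 12 degrees of freedom (affine)
        "-cost", "corratio",                    # correlation-ratio cost function
        "-bins", "256",                         # histogram bins for the cost function
    ]
    subprocess.check_call(cmd)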

Figure 3.6 shows two MRI images before and after the affine registration. The raw images (a) are not in the same space, therefore it is impossible to make a comparison between them. After registration, the normalized images (b) are the same as the


MNI 152 standard template (c) in size, shape and pose, which enables the comparison between the normalized images.

3.5.3 Segmentation

In this project, segmentation is the process of dividing a 3-D MRI image of the brain into different parts according to tissue type, namely Grey Matter (GM), White Matter (WM) and Cerebrospinal Fluid (CSF). Since AD leads to atrophy of substructures composed of grey matter, it is possible to focus on the grey matter and consider the other parts as noise; therefore, it is reasonable to remove the white matter and CSF and study the grey matter only. Two widely used tools, namely Statistical Parametric Mapping (SPM) [55] and FSL, are available for this task.
FSL can segment an MRI image into GM, WM and CSF using its sub-module called FMRIB's Automated Segmentation Tool (FSL-FAST). FSL-FAST uses the Expectation Maximization algorithm and a hidden Markov random field model to perform the segmentation task [56]. It contains both a GUI and a command-line program, and the input to this program should be an image with the skull already stripped. SPM is a MATLAB software package for the analysis of several types of brain images such as fMRI, PET and MEG. In SPM, a modified Gaussian mixture model and the Bayesian rule are used to implement the segmentation function [57]. Using a Gaussian mixture model, the probability of a voxel belonging to a specific class can be calculated according to the intensity distribution of the image; then, a Bayesian rule is used to produce the joint posterior probability that determines the correct class of the voxel.
Both FSL-FAST and SPM are free of charge under the terms of the GNU License. However, since SPM is a suite for MATLAB, a licence for MATLAB is required, and that is costly. Moreover, according to [58], for grey matter segmentation the result produced by FSL-FAST is better than that produced by SPM. Therefore, FSL-FAST is selected as the tool for brain image segmentation. Figure 3.7 shows the segmentation results generated by FSL-FAST. The following list shows the parameters used in the FSL-FAST command fast <options> <input>.

• -t <t>: 1 (indicates the type of MRI scan used; t=1 for a T1-weighted image and t=2 for a T2-weighted image.)

• -n <n>: 3 (refers to the number of classes, GM, WM and CSF, for segmentation.)


Figure 3.6: Demonstration of affine registration. (a) Two MRI images after skull stripping. (b) The corresponding images after affine registration. (c) The standard MNI 152 template after skull stripping [47].


Figure 3.7: Examples of the segmentation process. (a) Grey matter. (b) White matter. (c) CSF.

• -H <n>: 0.1 (controls the spatial smoothness in the main segmentation phase; a higher value gives a spatially smoother result.)

• -l <n>: 20 (the bias field smoothing parameter.)

• -I <n>: 8 (the number of main-loop iterations.)

• -o <path>: the path and name of the output.

• -B: output the bias-corrected (restored) image.

• -b: output the estimated bias field.

• <input>: the path of the input image.
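
Putting these options together, a skull-stripped, registered scan would be segmented with a command of the following form; the input file name and output basename are placeholders:

fast -t 1 -n 3 -H 0.1 -I 8 -l 20 -B -b -o subject_seg subject_brain.nii.gz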


3.5.4 ROI extraction

The last stage of pre-processing is ROI extraction. In this stage, several substructures of each brain are extracted into one image. The stage consists of two main steps: the production of ROI masks and the ROI extraction itself. Both steps are performed using fslmaths, an FSL tool for brain image calculations such as addition, subtraction and multiplication; fslmaths can also perform various image manipulations such as binarization and thresholding.

3.5.4.1 Production of ROI mask

The ROI masks are produced based on the Harvard-Oxford cortical and subcortical structural atlases, which were discussed in Section 3.4.3. The separated structural images are included in FSL. The first step is to open fsleyes, an FSL tool for visualizing MRI images, in which users can select each desired substructure and save it to an image one by one, as shown in Figure 3.8. According to the design of this project, seven substructures of the brain are extracted, so seven substructure images in total are collected from fsleyes. These separated substructure images are then filtered, combined and binarized using the following fslmaths commands (a complete example is given at the end of this subsection):

1. fslmaths <input> -thr <threshold> <output>

This command thresholds the substructure images created from the Harvard-Oxford cortical and subcortical structural atlases. These atlases are probabilistic: each voxel contains the probability of belonging to a specific substructure of the brain. In order to obtain a precise mask, voxels with a low probability can be removed. [38] suggested that ROIs consisting of voxels with a probability between 0.3 and 1 yield a better result. Therefore, the threshold is set to 0.3 in this project, and any voxel with a probability lower than 0.3 is set to zero.

2. fslmaths <input1> -add <input2> <output>

This command combines <input1> and <input2> into one image stored under the name <output>. Using this command, the different substructure images are merged into a single image.

3. fslmaths <input> -bin <output>

The last step in producing the ROI mask is to binarize the image generated in the previous step.


Figure 3.8: Creation of an image for a single substructure


Figure 3.9: Creation of an ROI mask. (a) Example of a substructure image (left hippocampus). (b) Combination of different substructure images. (c) Binarization of the image.

The intensity of each voxel in the <input> image is set to either 1 or 0: if the intensity of a voxel is larger than zero, it is set to 1; otherwise, it is set to 0. The <output> is the ROI mask, which will be used to extract the ROIs from the MRI images.

Figure 3.9 shows the outputs produced by the above commands in the ROI-production phase. Figure 3.9(c) is the ROI mask, produced by binarizing Figure 3.9(b). The ROI mask appears larger than Figure 3.9(b) because the latter contains many faint, low-probability voxels.
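
As a complete illustration of the mask-production step, the three commands can be chained as follows for two of the substructures; the substructure file names are placeholders for the images exported from fsleyes, and the 0.3 threshold follows the choice described above:

fslmaths L_Hippocampus.nii.gz -thr 0.3 L_Hippocampus_thr.nii.gz
fslmaths R_Hippocampus.nii.gz -thr 0.3 R_Hippocampus_thr.nii.gz
fslmaths L_Hippocampus_thr.nii.gz -add R_Hippocampus_thr.nii.gz combined_ROI.nii.gz
fslmaths combined_ROI.nii.gz -bin ROI_mask.nii.gz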

3.5.4.2 ROI extraction

ROI extraction is the last operation in the data pre-processing and is performed using the fslmaths tool. In this step, the ROI of each MRI image is produced one by one from the MRI image and the ROI mask. The ROIs are generated by taking the Hadamard (element-wise) product of the ROI mask and the MRI scan. This operation is possible because the MRI images and the ROI mask have the same dimensions and lie in the same space.


Figure 3.10: Demonstration of ROI extraction. (a) An ROI mask. (b) An MRI image (grey matter) after linear transformation. (c) Extracted ROI.

In an ROI mask image, the intensities of the voxels belonging to the desired substructures are one, and the rest are zero. By multiplying the ROI mask and an MRI image, only the voxels belonging to the desired substructures in the MRI image remain unchanged, while the others become zero. Figure 3.10 shows an example of this procedure, and the command used to perform this operation is

• fslmaths <input1> -mul <input2> <output>

<input1> is usually the MRI image used for training or testing. <input2> can be a real number or data with the same dimensions as <input1>; in this project, <input2> is the ROI mask. The -mul option specifies the multiplication operation, which performs the Hadamard product between <input1> and <input2>. <output> indicates the name and path of the output.
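
For example, assuming a registered grey-matter image named subject_GM.nii.gz and the mask ROI_mask.nii.gz produced above (both file names are placeholders), the extraction is performed with:

fslmaths subject_GM.nii.gz -mul ROI_mask.nii.gz subject_ROI.nii.gz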


3.5.5 Classification

According to the design of this system, discussed in Section 3.2, the architecture of the proposed system includes a series of CNN layers used for high-level feature extraction and several fully connected neural network layers for the classification task. In order to build an optimal network, several decisions are needed to select its components. A good approach is to learn from existing algorithms or methods that have been confirmed to be effective for this problem and then derive an optimal architecture from experiments based on them.

3.5.5.1 Activation function

An activation function introduces non-linearity into the network. Several activation functions are available for this project, such as Sigmoid and ReLU, as discussed in Section 2.2.5, and the choice between them is one of the most important decisions in constructing the neural network. The sigmoid activation function can cause the vanishing gradient problem [23]: as the absolute value of the input grows, the gradient of the sigmoid with respect to its input shrinks towards zero, so the parameter updates become smaller and smaller. This makes the learning procedure slow to converge, since it takes a long time to reach the minimum. In contrast, the gradient of ReLU with respect to its input is constant for positive inputs, which avoids the vanishing gradient problem [23]. Moreover, ReLU reduces the amount of computation, since it only requires evaluating max(0, x). Furthermore, according to [24], ReLU usually outperforms Sigmoid in practice. Based on this analysis, ReLU is selected as the activation function in this project.
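
To make the contrast concrete, the two activations and their gradients can be written in a few lines of Python; this is an illustrative sketch only, not part of the project code:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)               # shrinks towards zero as |x| grows

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return np.where(x > 0, 1.0, 0.0)   # constant (1) for any positive input

print(sigmoid_grad(10.0))              # about 4.5e-05: a vanishing update
print(relu_grad(10.0))                 # 1.0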

3.5.5.2 Loss function

Mean square error (MSE) and cross-entropy, discussed in Section 2.2.2, are two of the most common cost functions in machine learning. Roughly speaking, for classification tasks the cross-entropy measures the similarity between the distributions of the real labels and the predicted labels, whereas the MSE considers the Euclidean distance between the predicted labels and the real labels, treating them as points in a space. Using MSE amounts to finding a plane as a decision boundary in a high-dimensional space.


However, this loss function does not explicitly punish misclassification, since the optimal target of MSE is simply to find a plane that minimizes the distances between all data points and the plane. This makes it suitable for regression but less so for classification. In contrast, cross-entropy tries to minimize the difference between the predicted distribution and the real distribution, and the error reaches its minimum when the two distributions are identical, so cross-entropy makes more sense than MSE for classification problems. Nevertheless, in some cases MSE performs as well as cross-entropy [23]. Therefore, both loss functions are tested in model selection in this project.
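
As a concrete illustration of how the two losses are computed, a short PyTorch sketch is shown below; the tensors and labels are placeholders, and this is not the project's training code:

import torch
import torch.nn as nn

logits = torch.randn(4, 3)               # raw network outputs for a batch of 4 scans
targets = torch.tensor([0, 2, 1, 0])     # placeholder class indices (e.g. NC, AD, MCI, NC)

ce = nn.CrossEntropyLoss()(logits, targets)    # softmax followed by cross-entropy

one_hot = nn.functional.one_hot(targets, num_classes=3).float()
mse = nn.MSELoss()(torch.softmax(logits, dim=1), one_hot)   # Euclidean-style distance to one-hot labels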

3.5.5.3 Pooling

Max pooling and average pooling are the two most widely used pooling methods in convolutional neural networks; they were discussed in Section 2.2.7.2. Max pooling tends to keep the most important features since it only considers the largest value, whereas average pooling extracts smooth features by averaging all the elements. However, average pooling suffers from several drawbacks. When the values in a feature map differ greatly, this operation can significantly weaken the strong activations: strong positive values are cancelled out by strong negative values after average pooling [59], so strong features (both positive and negative) are eliminated. Moreover, [60] shows that a much better result can be obtained with max pooling than with average pooling in linear classification. Based on this analysis, max pooling is selected for this project.
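
The cancellation effect can be seen on a toy 3D feature map; the following PyTorch sketch (illustrative only) pools a 2×2×2 block containing one strong positive and one strong negative activation:

import torch
import torch.nn as nn

# one feature map of shape (batch=1, channels=1, depth=2, height=2, width=2)
x = torch.tensor([[[[[ 5.0, -5.0],
                     [ 0.0,  0.0]],
                    [[ 0.0,  0.0],
                     [ 0.0,  0.0]]]]])

print(nn.MaxPool3d(kernel_size=2)(x))   # keeps the strong activation: 5.0
print(nn.AvgPool3d(kernel_size=2)(x))   # strong values cancel out: 0.0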

3.5.5.4 Learning algorithm

Two learning algorithms are available for this project: Adam and mini-batch gradient descent. The details of these two algorithms were discussed in Section 2.2.4. Mini-batch gradient descent uses a fixed learning rate for all parameters, whereas Adam applies an adaptive learning rate to each parameter. However, according to [61], although adaptive methods perform better during training than gradient descent with a fixed learning rate, fine-tuned gradient descent usually outperforms adaptive methods in generalization. Therefore, in this project we use the mini-batch gradient descent algorithm with a fixed learning rate to train the network.
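
In PyTorch, this choice amounts to constructing a plain SGD optimiser. The sketch below is illustrative only: the model, data and loss are small stand-ins, while the learning rate (0.01) is one of the grid values and the weight decay (0.001) follows the value stated in the model selection section:

import torch

model = torch.nn.Linear(10, 3)                      # stand-in for the 3D CNN described in this chapter
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=0.001)

# one dummy mini-batch of 4 examples stands in for the real data loader
train_loader = [(torch.randn(4, 10), torch.randint(0, 3, (4,)))]

for epoch in range(200):
    for scans, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(scans), labels)      # fixed learning rate, no adaptive per-parameter rule
        loss.backward()
        optimizer.step()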


3.5.5.5 Data augmentation

In this project, the dataset contains 1073 instances, which is only barely enough to train a neural network. In order to avoid overfitting due to the lack of data, data augmentation techniques can be used to enlarge the dataset. Image flipping, noise addition and lighting changes, discussed in Section 2.2.8, are available techniques for this purpose. However, flipping is more suitable for this project than the other techniques. One reason is that changing the lighting conditions or introducing noise alters the intensities of the voxels in the original images. The intensities of many voxels in the MRI images are zero, representing nothing, and the non-zero voxels represent the ROIs of the brain. This project uses machine learning to find differences in the volume or shape of the ROIs, so it is not sensible to alter the voxel intensities, since doing so may change the volume and shape of the ROIs and thereby ruin the features that this project is based on. Therefore, mirror flipping is used to enlarge the dataset in this project.
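
A mirror flip of a 3D volume simply reverses one spatial axis. The following NumPy sketch illustrates this; the volumes are tiny dummies, and the assumption that the left-right axis is the first array dimension may need adjusting for a particular image orientation:

import numpy as np

def mirror_flip(volume):
    # volume: 3D array (x, y, z); reverse the x axis (assumed here to be left-right)
    return np.flip(volume, axis=0).copy()

original_volumes = [np.random.rand(4, 4, 4) for _ in range(2)]             # dummy stand-ins for ROI images
augmented = original_volumes + [mirror_flip(v) for v in original_volumes]  # doubles the training set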

3.5.5.6 Model selection

Grid search is a method for finding the optimal hyper-parameters by exhaustively checking all combinations of manually specified values for a set of hyper-parameters [23]. The hyper-parameters and their possible values are specified manually by considering existing similar systems; Table 3.5 shows the hyper-parameters and value ranges used in this project. The models generated by the different combinations of these hyper-parameters need to be evaluated to determine which one performs best. Usually a model is evaluated using cross-validation or random splitting, which are discussed in Section 2.2.9. In this project, 10-fold cross-validation and several ROC metrics are selected to evaluate the models, because cross-validation produces a more objective result than random splitting. Moreover, most of the existing CAD systems discussed in Section 2.3 use 10-fold cross-validation, so this choice also makes our results comparable with theirs, although full cross-validation is more objective than k-fold cross-validation. In some earlier experiments, the classification errors of some models stopped decreasing after around 170 epochs while the training time remained acceptable, so the number of epochs is set to 200 in this project.


The weight decay rate is 0.001 throughout all experiments in this project.

Table 3.5: Hyper-parameters and values for grid search

Hyper-parameter       Type of value   Possible values
Learning rate         float           0.1, 0.01, 0.001
No. of CNN layers     int             3, 4
No. of feature maps   int             32, 64

The learning rate for the machine learning method is selected between 0.1 and 0.001. There is no definitive method for choosing the number of CNN layers; however, too few layers leads to failure in learning high-level features. We therefore try 3 and 4 layers, following existing similar systems and the limitation of our resources: if more layers were used, the requirements for computation, training time and memory would increase accordingly. According to previous research [62][63][29][35], the number of kernels in each CNN layer is usually between 8 and 512; given the limited memory and the large volume of the MRI data, it is reasonable to restrict this number to 32 or 64 in our case. Moreover, the number of kernels is kept the same in every layer of a model and the kernel size is fixed at 3×3×3; these choices follow previous similar systems [29][35], which all adopt a constant number of kernels throughout the network. Following the designs of [35][62], three fully connected layers with 1000, 100 and 3 (or 2) nodes respectively are placed after the CNN layers. Figure 3.11 shows the overall structure of the neural network used in this project.
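
The grid in Table 3.5 can be enumerated in a few lines of Python; the sketch below is illustrative, and evaluate_model is a placeholder for the 10-fold cross-validation routine described above:

import itertools, random

learning_rates = [0.1, 0.01, 0.001]
n_cnn_layers = [3, 4]
n_feature_maps = [32, 64]

def evaluate_model(lr, n_layers, n_maps):
    # placeholder: would train the network and return 10-fold cross-validation accuracy
    return random.random()

best = None
for lr, layers, maps in itertools.product(learning_rates, n_cnn_layers, n_feature_maps):
    score = evaluate_model(lr, layers, maps)
    if best is None or score > best[0]:
        best = (score, lr, layers, maps)

print(best)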


Figure 3.11: Overall structure of the network


Chapter 4

Results and discussion

As discussed in Section 3.1, the last stage of the research method in this project is the experimental evaluation. This chapter presents the results of the experiments used to evaluate the system, together with a discussion of those results.

4.1 The use of GPU

In this project, whether a GPU trains the network faster than a CPU, and by how much, are interesting questions. If there were no substantial difference between using a GPU and a CPU, there would be no reason to run this kind of research on GPUs, which are much more expensive than CPUs. Moreover, whether it is worth using multiple GPUs for training is also unknown. In theory, doubling the number of GPUs should halve the training time; in practice, communication and management overheads make the actual improvement difficult to anticipate. It is therefore worth investigating the relationship between the number of GPUs used and the training time.

The architecture, hyper-parameters and devices used in this experiment are shown in Table 4.1. This is a three-way classification task, and the network consists of four CNN layers with 8 kernels in each layer, followed by three fully connected neural network layers with 1000, 100 and 3 nodes respectively. The number of examples used in the experiment increases from 200 to 1000 in steps of 200, and the number of training epochs is 200. Figure 4.1 gives the time needed to train this network using a CPU, one GPU, two GPUs and four GPUs respectively.


The horizontal axis represents the number of instances used in the experiments, and the vertical axis shows the training time in minutes.

Table 4.1: Architecture and hyper-parameters in the experiment for GPU and CPU comparison.

Architecture

(CNN): Sequential(
  0: Conv3d(1, 8, kernel_size=3, stride=1), ReLU, MaxPool3d(kernel_size=2, stride=2)
  1: Conv3d(8, 8, kernel_size=3, stride=1), ReLU, MaxPool3d(kernel_size=2, stride=2)
  2: Conv3d(8, 8, kernel_size=3, stride=1), ReLU, MaxPool3d(kernel_size=2, stride=2)
  3: Conv3d(8, 8, kernel_size=3, stride=1), ReLU, MaxPool3d(kernel_size=2, stride=2))
(linear): Sequential(
  0: Linear(in_features=24576, out_features=1500, bias=True), ReLU()
  1: Linear(in_features=1500, out_features=100, bias=True), ReLU()
  2: Linear(in_features=100, out_features=3, bias=True))

Hyper-parameters & other conditions   Values
Learning rate                         0.01
Batch size                            4
Learning algorithm                    Mini-batch gradient descent
Cost function                         Cross-entropy
Epoch                                 200
CPU                                   Intel Xeon E5-2670 v3 (2.30GHz)
Graphics card                         NVIDIA Tesla K20X
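
For reference, the architecture in Table 4.1 corresponds to a PyTorch module of roughly the following form; this is an illustrative reconstruction rather than the original training script, and the flattened size of 24576 is taken directly from the table:

import torch.nn as nn

class AD3DCNN(nn.Module):
    def __init__(self, n_classes=3):
        super().__init__()
        def block(in_c, out_c):
            # Conv3d -> ReLU -> MaxPool3d, as listed in Table 4.1
            return nn.Sequential(
                nn.Conv3d(in_c, out_c, kernel_size=3, stride=1),
                nn.ReLU(),
                nn.MaxPool3d(kernel_size=2, stride=2))
        self.cnn = nn.Sequential(block(1, 8), block(8, 8), block(8, 8), block(8, 8))
        self.linear = nn.Sequential(
            nn.Linear(24576, 1500), nn.ReLU(),
            nn.Linear(1500, 100), nn.ReLU(),
            nn.Linear(100, n_classes))

    def forward(self, x):
        x = self.cnn(x)
        x = x.view(x.size(0), -1)   # flatten to (batch, 24576)
        return self.linear(x)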

As Figure 4.1 shows, whether GPU(s) or a CPU is used, the training time increases as the number of instances increases, because a larger training dataset requires more iterations and computation. However, the training time grows faster with the CPU than with the GPU(s). In this simple experiment, when 1000 MRI scans are used for training, it takes more than 6 hours to train the network using the CPU only. With one graphics card, the time drops drastically to about one hour and twenty minutes, and with two and four graphics cards it falls to 45 minutes and 29 minutes respectively. Training with GPU(s) is therefore much faster than training with the CPU only, and the more GPUs are used, the faster the training: the experiment using four graphics cards spends the least time on the training stage. However, the reduction in time is not proportional to the number of graphics cards used in the experiment.


Figure 4.1: Time used when using GPU(s) or CPU

The reason could be that when more GPUs are used, the computation is divided among them and the partial results must be sent back to update the parameters of the model, which incurs a high communication and management cost. On average, when the number of graphics cards doubles, the training time reduces by 38.2%. This result suggests that it is worth using GPUs in the training process, as they give much better training times than the CPU alone. The use of multiple GPUs gives a noticeable further improvement, but the elapsed time is not inversely proportional to the number of graphics cards used.
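
In PyTorch, this style of multi-GPU training can be expressed, for example, by wrapping the model in torch.nn.DataParallel, which scatters each mini-batch across the visible devices and gathers the results to update a single set of parameters; this is a sketch of one possible mechanism, not necessarily the exact setup used in these experiments:

import torch

model = AD3DCNN(n_classes=3)                 # the illustrative network defined after Table 4.1
if torch.cuda.device_count() > 1:
    model = torch.nn.DataParallel(model)     # split each mini-batch across the GPUs
if torch.cuda.is_available():
    model = model.cuda()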

4.2 Comparison of different models on accuracy

Table 4.2 shows the best 5 and worst 5 results for three-way classification, together with the corresponding architectures and hyper-parameters; the full experimental results can be found in Appendix A. These results were obtained from models using both the cross-entropy and the mean square error loss functions. Among the top 5 models, the learning rates of the models using MSE (No. 1, 2 and 4) are 0.1 and 0.01; among the worst 5, the learning rate of the models using MSE (No. 6 and 9) is 0.001.


Table 4.2: The best 5 and worst 5 results for three-way classification

Classifier   No. of feature maps   Learning rate   Loss function   Training accuracy   Testing accuracy
             in each layer

The best 5 results
1            (32,32,32,32)         0.1             MSE             99.98%              72.60%
2            (32,32,32,32)         0.01            MSE             99.30%              71.22%
3            (32,32,32)            0.001           Cross-entropy   100%                70.83%
4            (64,64,64,64)         0.01            MSE             100%                70.55%
5            (64,64,64,64)         0.01            Cross-entropy   100%                70.40%

The worst 5 results
6            (32,32,32,32)         0.001           MSE             88.04%              59.45%
7            (64,64,64,64)         0.1             Cross-entropy   49.05%              48.88%
8            (32,32,32,32)         0.1             Cross-entropy   48.92%              48.64%
9            (64,64,64,64)         0.001           MSE             48.92%              48.40%
10           (32,32,32)            0.1             Cross-entropy   48.92%              48.40%


Figure 4.2: Error landscapes of training and testing of classifiers No. 1 (upper) and 2 (lower)


These models did not converge at all, since both their training and testing accuracies are low, which means a learning rate of 0.001 is too small for the MSE models (No. 6 and 9) to converge within a limited number of epochs. A better result could have been obtained from these classifiers if more than the 200 epochs we used had been performed, but the training time would increase as well. In this project, suitable learning rates for the classifiers using cross-entropy (No. 3 and 5) are 0.001 and 0.01, whereas a learning rate of 0.1 is too large for classifiers No. 7, 8 and 10 to converge. From the best results, model No. 1, with the highest testing accuracy (72.60%), is selected as the model of this project for three-way classification.

Figure 4.2 shows the error landscapes of classifiers No. 1 and 2 for training and testing. The testing errors are collected during the training process, but the testing dataset is used for testing only and never for updating the weights. To reduce computation, the training error of a single mini-batch is used to represent the average training error, since we mainly focus on the testing error; the testing error is the average over all testing instances. These error landscapes show how the training and testing errors change during the last cross-validation fold. The trend of the testing error for classifier No. 1 suggests that the learning rate of 0.1 is slightly too large for this classifier, because the model reaches its minimal error within the first 100 epochs and the testing error decreases no further. Classifier No. 2 could have reached a lower error if more epochs had been performed, since its testing error still shows a decreasing trend at the end of training; due to the time limitation of this project, this is left as future work.

As for two-way classification, both MSE and cross-entropy (CE) are used as cost functions.

Table 4.3: The best 3 and worst 3 results for two-way classification.

Classifier   No. of feature maps   Learning rate   Cost function   Testing   Training   TP rate   Precision   FP rate   F-measure
             in each layer
A            (32,32,32)            0.001           CE              91.23%    100%       87.07%    92.96%      5.13%     0.8992
B            (32,32,32)            0.01            MSE             90.63%    100%       87.89%    90.87%      6.79%     0.8936
C            (32,32,32,32)         0.001           CE              88.92%    99.03%     86.85%    89.37%      8.14%     0.8809
D            (32,32,32)            0.001           MSE             84.07%    99.09%     72.63%    91.73%      6.24%     0.8107
E            (64,64,64,64)         0.001           MSE             69.20%    72.99%     37.28%    94.41%      3.09%     0.5345
F            (32,32,32,32)         0.001           MSE             54.40%    61.20%     21.95%    86.39%      15.50%    0.3501

However, when cross-entropy is applied, only the learning rate of 0.001 is used, due to the limitation of computational resources.


Figure 4.3: Training and testing error landscapes of classifiers E (upper) and F (lower)


The experiment for three-way classification has already shown that 0.001 is a suitable learning rate for models with the cross-entropy cost function. Table 4.3 shows the best 3 and worst 3 results for two-way classification (AD vs. NC); a full list can be found in Appendix A. In this experiment, classifier A reaches the highest testing accuracy at 91.23%. Two of the best three classifiers (A and C) use the cross-entropy cost function with a learning rate of 0.001. The worst three classifiers (D, E and F) all use the MSE cost function with a learning rate of 0.001; their training and testing accuracies are (99.09%, 84.07%), (72.99%, 69.20%) and (61.20%, 54.40%) respectively. It is reasonable to conclude that a learning rate of 0.001 is too small for these MSE classifiers to converge within a limited number of epochs, which is consistent with the findings of the three-way classification experiment. Figure 4.3 shows the training and testing error landscapes of classifiers E and F during the last cross-validation fold. These two models clearly have not converged, since both the training and testing errors are high and still decreasing at the end of the training stage; either a larger learning rate or more epochs than the current settings could help these models converge and obtain better results.

As for model selection, Figure 4.4 shows a basic ROC graph of all classifiers. Classifiers A, B and C lie closer to the top-left corner than D, E and F, which means they perform better. The aim of this project is to build a diagnostic system for the early diagnosis of Alzheimer's, so the TP rate and precision, discussed in Section 2.2.9.2, are important metrics for measuring the performance of these models. On the one hand, we want a high TP rate in order to identify as many positive instances as possible, because the cost of failing to identify a person with Alzheimer's is higher than that of raising a false alarm. On the other hand, we also want high precision, since the number of false alarms should be limited. The F-measure is therefore an ideal metric for assessing the models in this project. According to the F-measure values in Table 4.3, classifier A has the highest value (0.8992) and is therefore selected as the model of this project for two-way classification.

4.3 Effect of data augmentation

Data augmentation techniques are useful in many 2D image classification tasks [24][62][64]. Usually, the more data used in training, the better the result that can be obtained.


Figure 4.4: ROC graph for two-way classifiers

It is therefore reasonable to run this experiment, both to test whether this holds here and to try to improve the performance of our model. Three classifiers with the MSE cost function and a learning rate of 0.01 are used to test whether an improvement can be obtained with the mirror flip technique. Table 4.4 shows the results of these models for two-way classification with and without data augmentation. Classifier aaug has the same structure and hyper-parameters as classifier a, but mirror flipping is applied to enlarge the training dataset used to train aaug, so the size of its training set is doubled. According to the results of our experiment, the testing accuracies, TP rates and F-measure values of these models all decrease when mirror flipping is used to augment the dataset. The reason may lie in the spatial structure of our dataset. In the experiment without data augmentation, the ROIs of the brain in all images, both training and testing, are globally roughly the same in shape and position, and a classifier is trained to find the minor local differences among them. If mirror flipping is applied to the training dataset, the shape and position of the ROIs in the original images are globally different from those in the newly created, flipped images.


Apart from learning the minor local differences among the ROIs, the classifier may therefore also learn the global difference between the ROIs in the original images and those in the newly created ones. This newly learnt difference is not helpful for testing, and may even have a negative effect, since the testing dataset remains unflipped. For example, if a classifier is trained to find the minor differences among many left-hand images of different people, it could be meaningless to introduce right-hand images (mirror flips of the left-hand images) into the training set when the testing set consists of left-hand images only.

Table 4.4: Effect of data augmentation technique

Classifier   No. of feature maps   Testing   Training   TP rate   Precision   FP rate   F-measure
             in each layer
a            (32,32,32)            90.63%    100%       87.89%    90.87%      6.79%     0.8936
aaug         (32,32,32)            81.74%    99.91%     65.97%    91.40%      7.95%     0.7663
b            (64,64,64)            89.40%    99.64%     84.62%    91.02%      6.88%     0.8770
baug         (64,64,64)            80.4%     99.82%     64.39%    91.02%      6.96%     0.7542
c            (32,32,32,32)         88.70%    99.77%     86.09%    88.09%      9.01%     0.8708
caug         (32,32,32,32)         77.8%     99.85%     69.09%    85.22%      11.59%    0.7631

4.4 Comparison of results between ours and others

Table 4.5 shows the results of several other methods alongside ours; N.A. means the value is not available in the corresponding article. The best testing accuracies obtained in our research are 91.23% and 72.60% for 2-way (NC vs AD) and 3-way (NC vs MCI vs AD) classification respectively. Our results are better than those of most systems using traditional machine learning methods, but worse than some of the systems using neural networks. As Table 4.5 shows, the accuracies of the two state-of-the-art systems [29][35] are (95.39%, 89.47%) and (97.60%, 89.10%) for two-way and three-way classification respectively. Based on the current results, our system does not outperform the state-of-the-art systems in either 2-way or 3-way classification, and the difference is clearly observable. One possible reason for this is that, in order to simplify the pre-processing stage, we apply only linear registration to the MRI data.


Table 4.5: The results (Accuracy, %) of our method and others.

Research                   Model           Two-way classification (AD vs. NC)   Three-way classification
Arvesen [34]               Decision tree   85.80                                 60.20
Tong et al. [65]           SVM             88.80                                 N.A.
Henriquez [38]             SVM             88.50                                 N.A.
Mahmood et al. [5]         NN              89.22                                 N.A.
Liu et al. [66]            NN              91.40                                 53.80
Payan et al. [29]          CNN             95.39                                 89.47
Hosseini-Asl et al. [35]   CNN             97.60                                 89.10
Ours                       CNN             91.23                                 72.60

This is because we wanted to keep as much of the original information in the raw MRI scans as possible and expected the neural network to extract the discriminative features for classification automatically. We do not apply the more sophisticated non-linear registration, which would yield more precise substructures from the brain images than linear registration; more precise ROIs could produce a better result than those obtained with linear registration only, an effect that can be observed in [38]. However, non-linear registration requires more complicated operations and expert knowledge, which hinders the widespread adoption of such systems, since minimal pre-processing and ease of use and understanding are important. Another reason is that, in our research, we extracted the ROIs ourselves using automated third-party tools such as FSL and ROBEX; errors introduced by these tools could reduce the classification accuracy. The autoencoder-based automatic feature extraction used in [29][35] can avoid the errors introduced by third-party tools during the pre-processing stage, which is a possible reason why these state-of-the-art methods outperform the others in both 2-way and 3-way classification.


Chapter 5

Conclusions and future work

5.1 Conclusions

This research proposed a novel method for the diagnosis of Alzheimer's disease by classifying MRI scans. The published state-of-the-art works [35][29] train two different neural networks in two stages: a convolutional network for feature extraction and a standard neural network for classification. In our work, low-level features (ROIs) are first extracted from the MRI scans, and a convolutional network is then trained to extract the high-level features used for classification.

The hypothesis behind this method is that the CNN is able to extract high-level features related to the shapes or volumes of the substructures of the brain for classification. The method is divided into two main stages, namely pre-processing and classification. In the pre-processing stage, each MRI scan goes through skull stripping, registration, segmentation and ROI extraction, producing an image containing the desired ROIs that is used as input to the neural network. Deep learning is introduced in the classification stage: using the images from the pre-processing stage, a neural network containing four convolutional layers and three fully connected layers is trained and tested. This new system yields relatively good results, 91.23% and 72.60% for two-way and three-way classification respectively, compared with most systems using traditional machine learning methods. However, it does not reach the state-of-the-art outcome. The two best existing systems [29][35] also use convolutional networks; their results and ours together show the power of CNNs and deep learning. The reason why our results are worse than those of these two methods may lie in pre-processing and parameter tuning issues, as discussed in Section 4.4.


Moreover, the results related to the use of GPUs and of data augmentation techniques were also evaluated and analyzed. In terms of elapsed time, the results show that using a GPU improves the performance of this system drastically. The experiments also show that less training time is required when more GPUs are used, although the elapsed time is not inversely proportional to the number of GPUs. Data augmentation techniques are widely used in image classification, but their effect on MRI data classification is an open question. The experimental results in Section 4.3 show no improvement, and even a reduction in accuracy, when the mirror flip technique is applied to the training dataset; the reason lies in the changes to the global shape and position of the ROIs caused by the flip operation. Therefore, data augmentation techniques for MRI classification tasks should be chosen carefully. Our research can also be applied to wider applications, such as diagnostic systems for brain tumours or pancreatic cancer [67]: such systems use MRI scans as input, and our method can be used to extract ROIs and classify the scans for diagnosis.

5.2 Future work

Limitations have been identified and discussed in Section 4.4 based on the analysis of the results; these findings suggest future work that could improve the performance of the current research. Firstly, one way to obtain better results is to extract more discriminative features than the current method does: if the features are good enough, the accuracy can be expected to rise. Several techniques can be used to generate more discriminative features, such as non-linear registration for more precise matching, and motion and slice-timing correction to eliminate errors arising from scanning conditions. These methods can improve the quality of the raw MRI scans as well as the features extracted from them. Automatic feature extraction is another way to obtain better features: according to [35][29], such methods can lead to better classification results than manual methods. They use an auto-encoder network to extract features automatically rather than manually, and they represent the state of the art.


Theoretically, a deeper neural network can produce better results than a shallow one; in practice, however, a network usually performs worse when it becomes too deep. An architecture called ResNet [63] can be used to address this problem. Due to the large size of the MRI scans, the limited GPU memory and the massive computation involved, it is difficult to build a deep architecture such as ResNet or VGG [62], which require enormous amounts of computation and memory to store the data and the model. The graphics memory is often insufficient for this task, and if only the CPU and main memory are used for training, training the network takes too long, as discussed in Section 4.1. Therefore, another direction for future work is to find a way to reduce the size of the data in order to balance the requirements for computation, memory and time. Dimension reduction techniques, such as principal component analysis and auto-encoders, can be used to compress the data before feeding it into the network, reducing the computation, memory and time required in the training stage and allowing a deeper network than the current method to be built.


Bibliography

[1] Alzheimer’s Association. What is alzheimer’s? https://www.alz.org/alzheimers-dementia/what-is-alzheimers. Accessed on 2018/04/24.

[2] ADNI. Sharing alzheimer’s research data with the world.http://adni.loni.usc.edu/. http://adni.loni.usc.edu/. Accessed on 2018/05/22.

[3] Alzheimer’s Society. Alzheimer’s disease. https://www.alzheimers.org.uk/about-dementia/types-dementia/alzheimers-disease. Accessed on 2018/04/24.

[4] S. BenHalima, S. Mishra, K.M. Poopathi Raja, M. Willem, A. Baici, K. Simons,O. Brstle, P. Koch, C. Haass, A. Caflisch, and L. Rajendran. Specific Inhibitionof β-Secretase Processing of the Alzheimer Disease Amyloid Precursor Protein.Cell Reports, 14(9):2127 – 2141, 2016.

[5] R. Mahmood and B. Ghimire. Automatic detection and classification ofalzheimer’s disease from mri scans using principal component analysis and artifi-cial neural networks. In 2013 20th International Conference on Systems, Signals

and Image Processing (IWSSIP), pages 133–137, July 2013.

[6] A. Abdulkadir, J. Peter, O. Ronneberger, T. Brox, and S. Kloeppel. Voxel-basedmulti-class classification of ad, mci, and elderly controls. In Medical Image

Computing and Computer-Assisted Intervention (MICCAI) 2014 - CADDementia

Challenge, 2014.

[7] National Institute on Aging. Alzheimer’s disease fact sheet. https://www.nia.nih.gov/health/alzheimers-disease-fact-sheet#causes. Accessed on 2018/04/27.

[8] Alzheimer’s Disease International. World alzheimer report 2015. Technical re-port, Alzheimer’s Disease International, 2015. https://www.alz.co.uk/research/WorldAlzheimerReport2015.pdf.


[9] M. C. Chartier Harlin, F. Crawford, H. Houlden, A. Warren, D. Hughes, L. Fi-dani, A. Goate, M. Rossor, P. Roques, J. Hardy, and M. Mullan. Early-onsetalzheimer’s disease caused by mutations at codon 717 of the β-amyloid precur-sor protein gene. nature, 353:844–846, 1991.

[10] R. A. Sperling, P. S. Aisen, L. A. Beckett, D. A. Bennett, S. Craft, A. M. Fagan,T. Iwatsubo, C. R. Jack, J. Kaye, T. J. Montine, D. C. Park, E. M. Reiman, C. C.Rowe, E. Siemers, Y. Stern, K. Yaffe, M. C. Carrillo, B. Thies, M. Morrison-Bogorad, M. V. Wagster, and C. H. Phelps. Toward defining the preclinical stagesof alzheimer’s disease: Recommendations from the national institute on aging-alzheimer’s association workgroups on diagnostic guidelines for alzheimer’s dis-ease. Alzheimer’s & Dementia: The Journal of the Alzheimer’s Association,7(3):280–292, 2011.

[11] Z. Zhou. Machine Learning. Tsinghua University Press, Beijing, 2016.

[12] A. K. Singh. Lecture notes in machine learning cs165b. http://www.cs.ucsb.edu/∼ambuj/Courses/165B/Lectures/introduction.ppt, 2012. University of California,Santa Barbara, Accessed on 2018/04/25.

[13] R. B. Girshick. Fast r-cnn. 2015 IEEE International Conference on Computer

Vision (ICCV), pages 1440–1448, 2015.

[14] G. Brown. Lecture notes in machine learning fundation, September 2017. TheUniversity of Manchester.

[15] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimalmargin classifiers. In Proceedings of the Fifth Annual Workshop on Computa-

tional Learning Theory, COLT ’92, pages 144–152, New York, NY, USA, 1992.ACM.

[16] S. Sathyadevan and R. R. Nair. Comparative analysis of decision tree algorithms:Id3, c4.5 and random forest. In Lakhmi C. Jain, Himansu Sekhar Behera, Jyot-sna Kumar Mandal, and Durga Prasad Mohapatra, editors, Computational Intel-

ligence in Data Mining - Volume 1, pages 549–562, New Delhi, 2015. SpringerIndia.

[17] H. R. Bittencourt and R. T. Clarke. Use of classification and regression trees(cart) to classify remotely-sensed digital images. In IGARSS 2003. 2003 IEEE


International Geoscience and Remote Sensing Symposium. Proceedings (IEEE

Cat. No.03CH37477), volume 6, pages 3751–3753 vol.6, July 2003.

[18] C. Bishop. Pattern Recognition and Machine Learning. Springer Verlag, 2006.

[19] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3):279–292,May 1992.

[20] G. A. Rummery and M. Niranjan. On-line q-learning using connectionist sys-tems. 11 1994.

[21] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, andM. Riedmiller. Playing Atari with Deep Reinforcement Learning. ArXiv e-prints,December 2013.

[22] P Sadowski. Notes on backpropagation. https://www.ics.uci.edu/∼pjsadows/notes.pdf. Accessed on 2018/06/22.

[23] M. A. Nielsen. Neural Networks and Deep Learning. Determination Press, 2015.

[24] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deepconvolutional neural networks. In Proceedings of the 25th International Con-

ference on Neural Information Processing Systems - Volume 1, NIPS’12, pages1097–1105, USA, 2012. Curran Associates Inc.

[25] UNSW. Single layer perceptron. http://www.cse.unsw.edu.au/∼cs9417ml/MLP1/tutorial/singlelayer.htm. Accessed on 2018/06/22.

[26] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press.

[27] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR,abs/1412.6980, 2014.

[28] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. InGeoffrey Gordon, David Dunson, and Miroslav Dudk, editors, Proceedings of

the Fourteenth International Conference on Artificial Intelligence and Statistics,volume 15 of Proceedings of Machine Learning Research, pages 315–323, FortLauderdale, FL, USA, 11–13 Apr 2011. PMLR.

[29] A. Payan and G. Montana. Predicting alzheimer’s disease: a neuroimaging studywith 3d convolutional neural networks. CoRR, abs/1502.02506, 2015.


[30] Carnegie Mellon University. Cross validation. https://www.cs.cmu.edu/∼schneide/tut5/node42.html. Accessed on 2018/06/20.

[31] T. Fawcett. An introduction to roc analysis. Pattern Recogn. Lett., 27(8):861–874, June 2006.

[32] J. Keane and S. Sampaio. Lecture notes in data engineering, September 2017.The University of Manchester.

[33] S. Kloppel, C. Stonnington, C. Chu, B. Draganski, R. Scahill, J. Rohrer, N. Fox,C. Jack, J. Ashburner, and R. Frackowiak. Automatic classification of mr scansin alzheimers disease. 131, 03 2008.

[34] E. Arvesen. Automatic classification of alzheimer’s disease from structural mri,2015. MSc Thesis, stfold University College.

[35] E. Hosseini-Asl, R. Keynton, and A. El-Baz. Alzheimer’s disease diagnostics byadaptation of 3d convolutional network. CoRR, abs/1607.00455, 2016.

[36] X. Ma, Z. Li, B. Jing, H. Liu, D. Li, H. Li, and the ADNI. Identify the atro-phy of alzheimer’s disease, mild cognitive impairment and normal aging usingmorphometric mri analysis. Frontiers in Aging Neuroscience, 8:243, 2016.

[37] SPSS Tutorials. Z-scores what and why? https://www.spss-tutorials.com/z-scores-what-and-why/. Accessed on 2018/06/21.

[38] C. A. Henriquez. ”adfinder”: Mr image diagnosis tool to automatically detectalzheimer’s disease, 2017. MSc Thesis, The University of Manchester.

[39] S. Harper. Lecture notes in experimental design 1, November 2017. The Univer-sity of Manchester.

[40] Information Systems Division RIKEN. Hokusai system. http://accc.riken.jp/download/sites/2/HOKUSAI system overview en.pdf, 2017. Accessed on2018/05/02.

[41] RIKEN. About riken. http://www.riken.jp/en/. Accessed on 2018/05/01.

[42] CentOS. About centos. https://www.centos.org/about/. Accessed on 2018/08/02.

[43] python. About python. https://www.python.org/about/. Accessed on 2018/08/02.


[44] Numpy. Numpy. http://www.numpy.org/. Accessed on 2018/08/02.

[45] PyTorch. Pytorch. https://pytorch.org/. Accessed on 2018/08/02.

[46] NIPY. Neuroimaging in python. http://nipy.org/nibabel/. Accessed on2018/08/02.

[47] M. Jenkinson, C. F. Beckmann, T. E. Behrens, M. W. Woolrich, and S. M. Smith.Fsl. Neuroimage, 62(2):782–790, Aug 2012.

[48] J. E. Iglesias, C. Liu, P. M. Thompson, and Z. Tu. Robust brain extraction acrossdatasets and comparison with publicly available methods. IEEE Transactions on

Medical Imaging, 30(9):1617–1634, Sept 2011.

[49] NVIDIA. Cuda zone. https://developer.nvidia.com/cuda-zone. Accessed on2018/08/02.

[50] NVIDIA. Nvidia cndnn. https://developer.nvidia.com/cudnn. Accessed on2018/08/02.

[51] Git. About. https://git-scm.com/about/. Accessed on 2018/06/01.

[52] McGill. Mcconnell brain imaging centre. https://www.mcgill.ca/bic/software/tools-data-analysis/anatomical-mri/atlases/icbm152lin. Accessed on2018/07/31.

[53] FSL. Templates and atlases included with fsl. https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/Atlases. Accessed on 2018/05/14.

[54] M. Jenkinson, P. Bannister, M. Brady, and S. Smith. Improved optimization forthe robust and accurate linear registration and motion correction of brain images.NeuroImage, 17(2):825 – 841, 2002.

[55] SPM. Statistical parametric mapping. https://www.fil.ion.ucl.ac.uk/spm/. Ac-cessed on 2018/07/21.

[56] Y. Zhang, M. Brady, and S. Smith. Segmentation of brain mr images through ahidden markov random field model and the expectation-maximization algorithm.IEEE Transactions on Medical Imaging, 20(1):45–57, Jan 2001.

[57] Structural Brain Mapping Group. Segmentation. http://www.neuro.uni-jena.de/vbm/segmentation/. Accessed on 2018/07/31.


[58] A. P. Perez. Accurate segmentation of brain mr images. http://publications.lib.chalmers.se/records/fulltext/125983.pdf, 2010. MSc Thesis, Chalmers Universityof Technology. Accessed on 2018/07/02.

[59] M. D. Zeiler and R. Fergus. Stochastic pooling for regularization of deep convo-lutional neural networks. CoRR, abs/1301.3557, 2013.

[60] Y. Boureau, F. Bach, Y. LeCun, and J. Ponce. Learning mid-level features forrecognition. In 2010 IEEE Computer Society Conference on Computer Vision

and Pattern Recognition, pages 2559–2566, June 2010.

[61] A. C Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht. The marginal valueof adaptive gradient methods in machine learning. In I. Guyon, U. V. Luxburg,S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Ad-

vances in Neural Information Processing Systems 30, pages 4148–4158. CurranAssociates, Inc., 2017.

[62] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scaleimage recognition. In International Conference on Learning Representations,2015.

[63] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recogni-tion. CoRR, abs/1512.03385, 2015.

[64] L. Perez and J. Wang. The effectiveness of data augmentation in image classifi-cation using deep learning. CoRR, abs/1712.04621, 2017.

[65] T. Tong, R. Wolz, Q. Gao, R. Guerrero, J. V. Hajnal, and D. Rueckert. Multi-ple instance learning for classification of dementia in brain mri. Medical Image

Analysis, 18(5):808 – 818, 2014.

[66] S. Liu, S. Liu, W. Cai, H. Che, S. Pujol, R. Kikinis, D. Feng, M. J. Fulham, andADNI. Multimodal neuroimaging feature learning for multiclass diagnosis ofalzheimer’s disease. IEEE Transactions on Biomedical Engineering, 62(4):1132–1140, April 2015.

[67] Pancreatic Cancer UK. Tests for pancreatic cancer. https://www.pancreaticcancer.org.uk/information-and-support/facts-about-pancreatic-cancer/how-is-pancreatic-cancer-diagnosed/tests-for-pancreatic-cancer/. Accessed on2018/08/02.


Appendix A

Full experiment results

A.1 Full results


Table A.1: Full results for three-way classification

No. of feature maps   Learning rate   Loss function   Testing accuracy
in each layer
(32,32,32,32)         0.1             MSE             72.60%
(32,32,32,32)         0.01            MSE             71.22%
(32,32,32)            0.001           Cross-entropy   70.83%
(64,64,64,64)         0.01            MSE             70.55%
(64,64,64,64)         0.01            Cross-entropy   70.40%
(64,64,64)            0.1             Cross-entropy   69.66%
(64,64,64)            0.01            Cross-entropy   69.93%
(32,32,32)            0.01            Cross-entropy   69.41%
(64,64,64)            0.001           Cross-entropy   69.21%
(32,32,32,32)         0.01            Cross-entropy   69.05%
(64,64,64)            0.01            MSE             68.47%
(64,64,64,64)         0.001           MSE             67.86%
(32,32,32)            0.01            MSE             67.53%
(64,64,64,64)         0.001           Cross-entropy   66.81%
(64,64,64)            0.001           MSE             65.72%
(32,32,32)            0.001           MSE             65.33%
(64,64,64)            0.1             MSE             63.62%
(32,32,32,32)         0.001           Cross-entropy   60.27%
(32,32,32)            0.1             MSE             59.73%
(32,32,32,32)         0.001           MSE             59.45%
(64,64,64,64)         0.1             Cross-entropy   48.88%
(32,32,32,32)         0.1             Cross-entropy   48.64%
(64,64,64,64)         0.001           MSE             48.40%
(32,32,32)            0.1             Cross-entropy   48.40%


Table A.2: Full results for two-way classification.

No. of feature maps   Learning rate   Cost function   Testing   Training   TP rate   Precision   FP rate   F-measure
in each layer
(32,32,32)            0.001           CE              91.23%    100%       87.07%    92.96%      5.13%     0.8992
(32,32,32)            0.01            MSE             90.63%    100%       87.89%    90.87%      6.79%     0.8936
(32,32,32,32)         0.001           CE              88.92%    99.03%     86.85%    89.37%      8.14%     0.8809
(64,64,64)            0.01            MSE             89.40%    99.64%     84.62%    91.02%      6.88%     0.8770
(64,64,64)            0.001           CE              88.48%    99.90%     83.78%    90.91%      6.74%     0.8720
(32,32,32,32)         0.01            MSE             88.70%    99.77%     86.09%    88.09%      9.01%     0.8708
(32,32,32,32)         0.1             MSE             89.10%    99.83%     83.40%    90.52%      6.99%     0.8677
(64,64,64,64)         0.1             MSE             88.21%    98.74%     85.96%    87.34%      8.87%     0.8664
(64,64,64)            0.1             MSE             85.45%    99.83%     80.00%    92.31%      7.75%     0.8571
(32,32,32)            0.1             MSE             86.25%    99.89%     82.58%    87.46%      9.89%     0.8495
(64,64,64,64)         0.01            MSE             86.80%    99.68%     81.47%    88.17%      8.45%     0.8469
(64,64,64)            0.001           MSE             83.63%    99.81%     76.67%    92.00%      8.03%     0.8364
(32,32,32)            0.001           MSE             84.07%    99.09%     72.63%    91.73%      6.24%     0.8107
(64,64,64,64)         0.001           MSE             69.20%    72.99%     37.28%    94.41%      3.09%     0.5345
(32,32,32,32)         0.001           MSE             54.40%    61.20%     21.95%    86.39%      15.50%    0.3501


Appendix B

Source code

B.1 Pre-processing

The source code for the pre-processing stage partially derives from the source code of the research in [38]. Modifications were made for this project. The usage can be displayed with the command python3 preprocess.py -h.
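As an illustration only, using the example paths quoted in the script's own help text (these paths are specific to the machine used for the project, and the underscores are reconstructed from the listing), a run with two worker threads looks like:

python3 preprocess.py 2 /media/chao/claudio_local/ADNI_Screening/ /media/chao/claudio_local/sys/AtlasROIMask/harvardoxford_prob_Hippo_Amyg_Thal_Puta_Pall_Caud_MTL_ROI_Mask_1mm_test.nii.gz /media/chao/claudio_local/sys/output/o9/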

'''
/*****************************************************************************
* Title: adfinder.py
* Author: Claudio A. Henriquez
* Date: <2017>
*****************************************************************************/
'''
import sys, subprocess, argparse, threading, time, queue, os, re

parser = argparse.ArgumentParser(description='Script for ROI extraction, usage example: python3 preprocess.py 2 /media/chao/claudio_local/ADNI_Screening/ /media/chao/claudio_local/sys/AtlasROIMask/harvardoxford_prob_Hippo_Amyg_Thal_Puta_Pall_Caud_MTL_ROI_Mask_1mm_test.nii.gz /media/chao/claudio_local/sys/output/o9/')

parser.add_argument('nthreads', type=int, help='Specify the number of threads used to process data')
parser.add_argument('fileFolderPath', type=str, help='Folder path including MRI data. eg. /media/chao/claudio_local/ADNI_Screening/')
parser.add_argument('ROImaskFilename', type=str, help='MASK file path eg. /media/chao/claudio_local/sys/AtlasROIMask/harvardoxford_prob_Hippo_Amyg_Thal_Puta_Pall_Caud_MTL_ROI_Mask_1mm_test.nii.gz')
parser.add_argument('OutputDir', type=str, help='Output folder. eg. /media/chao/claudio_local/sys/output/o9/')


def roiExtraction(target, inputFileName, ROIFileName, OutputFilename):
    target.write("fslmaths " + inputFileName + " -mul " + ROIFileName + " " + OutputFilename + "\n")
    target.write(str(subprocess.check_output(["fslmaths", inputFileName, "-mul", ROIFileName, OutputFilename])))


def check_path(p):
    if not os.path.exists(p):
        os.mkdir(p)


def fslthread(threadname, filepath):
    global ROBEXFolder, MNI152brain_1mm, brainFolder, affineFolder, segFolder, ROIFolder, logFolder, ROImaskFilename
    imageID = ''
    studyID = ''
    subjectID = ''
    outputFolder = ''


    if filepath == '':
        # Error, there is no file in the input
        raise ValueError('There is no input: ', filepath, threadname)
    else:
        # p = re.compile('(\/?([a-zA-Z0-9_\.\-]+\/)*)(\w+\.nii)')
        p = re.compile('(\/?([a-zA-Z0-9_\.\-]+\/)*)((\w+|-)*\.nii)')
        m = p.match(filepath)
        if m == None:
            raise ValueError('Invalid Path: ', filepath, threadname)
        else:
            path = m.group(1)
            filename = m.group(3)
            print(threadname + " >> Path: " + path)
            print(threadname + " >> Filename: " + filename)
            p2 = re.compile('ADNI_(\d+\_S\_\d+)_(\w+|-)*\_S(\d+)\_I(\d+)\.nii')
            m2 = p2.match(filename)
            # Add condition for real files
            if m2 == None:
                subjectID = filepath
                studyID = filepath
                imageID = filepath
            else:
                subjectID = m2.group(1)
                studyID = m2.group(3)
                imageID = m2.group(4)

            print(threadname + " >> Subject ID: " + subjectID)
            print(threadname + " >> Study ID: " + studyID)
            print(threadname + " >> Image ID: " + imageID)

            # Output file names
            brainFilePath = brainFolder + subjectID + "_S" + studyID + "_I" + imageID + "_1_brain_1mm_test"
            affineFilePath = affineFolder + subjectID + "_S" + studyID + "_I" + imageID + "_2_brain_affine_1mm_test"
            segFilePath = segFolder + subjectID + "_S" + studyID + "_I" + imageID + "_2_brain_affine_seg_1mm_test"
            segGMFilePath = segFolder + subjectID + "_S" + studyID + "_I" + imageID + "_3_seg_GM_1mm_test"
            segWMFilePath = segFolder + subjectID + "_S" + studyID + "_I" + imageID + "_3_seg_WM_1mm_test"
            segCSFFilePath = segFolder + subjectID + "_S" + studyID + "_I" + imageID + "_3_seg_CSF_1mm_test"

            # ROI Files
            ROIFilePath = ROIFolder + subjectID + "_S" + studyID + "_I" + imageID + "_4_ROI_1mm_test"

            # Writing FSL execution output
            f = open(logFolder + 'ProcessOutput_' + filename.replace('.', '_') + '.txt', 'w')
            f.write("Process of image id: " + imageID + " has begun.\n")
            f.write("------------------------------------------------------\n")

            # Step 1: Brain extraction
            f.write("Step 1: BET \n")
            initialTime = time.time()
            startTime = initialTime
            f.write("Started at: " + time.ctime(startTime) + "\n")
            # Step 1: Command Execution
            # ROBEX
            f.write(ROBEXFolder + "runROBEX.sh" + " " + filepath + " " + brainFolder + subjectID + "_S" + studyID + "_I" + imageID + "_brain_test" + " " + "-R -f 0.5 -g 0" + "\n")
            f.write(str(subprocess.check_output(
                [ROBEXFolder + "runROBEX.sh", filepath, brainFilePath + ".nii", brainFilePath + "_mask.nii"])))

            # Compressing files
            f.write(str(subprocess.check_output(["gzip", brainFilePath + ".nii", brainFilePath + "_mask.nii"])))

            endTime = time.time()
            elapsedTime = endTime - startTime
            f.write("Finished at: " + time.ctime(endTime) + "\n")
            f.write("Elapsed time: " + time.strftime('%H:%M:%S', time.gmtime(elapsedTime)) + "\n")
            f.write("Step 1 has finished.\n")
            # Step 1: Brain extraction Finish
            f.write("------------------------------------------------------\n")


            # Step 2: FLIRT Affine
            f.write("Step 2: Affine Process\n")
            startTime = time.time()
            f.write("Started at: " + time.ctime(startTime) + "\n")
            # Step 2: Command Execution
            f.write("flirt" + " -in " + brainFilePath + " -ref " + MNI152brain_1mm + " -out " + affineFilePath + ".nii.gz" + " -omat " + affineFilePath + ".mat" + " -bins 256 -cost corratio -searchrx -90 90 -searchry -90 90 -searchrz -90 90 -dof 12" + "\n")
            f.write(str(subprocess.check_output(
                ["flirt", "-in", brainFilePath, "-ref", MNI152brain_1mm, "-out", affineFilePath + ".nii.gz", "-omat",
                 affineFilePath + ".mat", "-bins", "256", "-cost", "corratio", "-searchrx", "-90", "90", "-searchry",
                 "-90", "90", "-searchrz", "-90", "90", "-dof", "12"])))
            # "-interp", "trilinear"

            endTime = time.time()
            elapsedTime = endTime - startTime
            f.write("Finished at: " + time.ctime(endTime) + "\n")
            f.write("Elapsed time: " + time.strftime('%H:%M:%S', time.gmtime(elapsedTime)) + "\n")
            f.write("Step 2 has finished.\n")
            # Step 2: FLIRT Affine Finish
            f.write("------------------------------------------------------\n")

            # Step 3: FAST Segmentation
            f.write("Step 3: Segmentation Process\n")
            startTime = time.time()
            f.write("Started at: " + time.ctime(startTime) + "\n")
            # Step 3: Command Execution
            f.write("fast" + " -t 1 -n 3 -H 0.1 -I 8 -l 20.0 -B -b -o " + segFilePath + " " + affineFilePath + "\n")
            f.write(str(subprocess.check_output(
                ["fast", "-t", "1", "-n", "3", "-H", "0.1", "-I", "8", "-l", "20.0", "-B", "-b", "-o", segFilePath,
                 affineFilePath])))

            if os.path.isfile(segFilePath + "_pve_0.nii.gz") == True:
                f.write(str(subprocess.check_output(["mv", segFilePath + "_pve_0.nii.gz", segCSFFilePath + ".nii.gz"])))
                f.write(str(subprocess.check_output(["mv", segFilePath + "_pve_1.nii.gz", segGMFilePath + ".nii.gz"])))
                f.write(str(subprocess.check_output(["mv", segFilePath + "_pve_2.nii.gz", segWMFilePath + ".nii.gz"])))

            endTime = time.time()
            elapsedTime = endTime - startTime
            f.write("Finished at: " + time.ctime(endTime) + "\n")
            f.write("Elapsed time: " + time.strftime('%H:%M:%S', time.gmtime(elapsedTime)) + "\n")
            f.write("Step 3 has finished.\n")
            # Step 3: Segmentation Finished
            f.write("------------------------------------------------------\n")

            # Step 4: ROIs Extraction
            f.write("Step 4: ROIs Extraction\n")
            startTime = time.time()
            f.write("Started at: " + time.ctime(startTime) + "\n")
            # Step 4: Command Execution
            roiExtraction(f, segGMFilePath, ROImaskFilename, ROIFilePath)

            endTime = time.time()
            elapsedTime = endTime - startTime
            f.write("Finished at: " + time.ctime(endTime) + "\n")
            f.write("Elapsed time: " + time.strftime('%H:%M:%S', time.gmtime(elapsedTime)) + "\n")
            f.write("Step 4 has finished.\n")
            # Step 4: ROI Extraction Finished
            f.write("------------------------------------------------------\n")


# Defining class myThread
class myThread(threading.Thread):
    def __init__(self, threadID, name, q):


        threading.Thread.__init__(self)
        self.threadID = threadID
        self.name = name
        self.q = q

    def run(self):
        print("Starting " + self.name + "\n")
        process_data(self.name, self.q)
        print("Exiting " + self.name + "\n")


# Queue Consumer
def process_data(threadName, q):
    while not exitFlag:
        queueLock.acquire()
        if not workQueue.empty():
            data = q.get()
            queueLock.release()
            print("%s processing %s\n" % (threadName, data))
            textrow = fslthread(threadName, data)
        else:
            queueLock.release()


def getFilesinPath(path):
    '''
    Function to list all the files within a given path
    '''
    # Changing directory
    fileListStr = subprocess.check_output(["find", "-name", '*.nii'], cwd=path)
    fileListStr = fileListStr.decode()
    fileList = []
    for filePath in fileListStr.split('\n')[0:5]:
        if filePath.strip():
            fileList.append(filePath.replace('./', path))
    return fileList


# Define a function for the thread
def print_time(threadName, delay, counter):
    while counter:
        if exitFlag:
            threadName.exit()
        time.sleep(delay)
        print("%s: %s\n" % (threadName, time.ctime(time.time())))
        counter -= 1


Test = False
# path of ROBEX
ROBEXFolder = "/media/chao/claudiohome/Toolkit/ROBEX/"
# path of MNI152 template
MNI152brain_1mm = "/usr/local/fsl/data/standard/MNI152_T1_1mm_brain.nii.gz"

if __name__ == "__main__":
    opt = parser.parse_args()
    exitFlag = 0
    if len(sys.argv) == 5 or Test:
        if Test:
            nthreads = 2
            # folder path including MRI data
            fileFolderPath = '/media/chao/claudio_local/ADNI_Screening/'
            # MASK file path
            ROImaskFilename = '/media/chao/claudio_local/sys/AtlasROIMask/harvardoxford_prob_Hippo_Amyg_Thal_Puta_Pall_Caud_MTL_ROI_Mask_1mm_test.nii.gz'
            # output folder path
            OutputDir = '/media/chao/claudio_local/sys/output/o9/'
        else:
            nthreads = opt.nthreads
            fileFolderPath = opt.fileFolderPath
            ROImaskFilename = opt.ROImaskFilename
            OutputDir = opt.OutputDir
    else:


        print("Usage:")
        print("python3 preprocess.py <#Threads> <MRI Files Folder> <ROI Mask File> <Output Path>")
        print("eg. python3 preprocess.py 2 /media/chao/claudio_local/ADNI_Screening/ /media/chao/claudio_local/sys/AtlasROIMask/harvardoxford_prob_Hippo_Amyg_Thal_Puta_Pall_Caud_MTL_ROI_Mask_1mm_test.nii.gz /media/chao/claudio_local/sys/output/o9/")
        exit()
    if nthreads < 1 or not os.path.exists(fileFolderPath) or not os.path.exists(ROImaskFilename):
        print("Wrong option")
        exit()

    # Folders and Files
    baseFolder = OutputDir
    outputFolder = baseFolder + 'OutputFiles/'
    brainFolder = outputFolder + '01_Brain/'
    affineFolder = outputFolder + '02_Affine/'
    segFolder = outputFolder + '03_Segmentation/'
    # nonLinearRegFolder = outputFolder + '04_NonLinearReg/'
    ROIFolder = outputFolder + '04_ROI/'
    logFolder = baseFolder + 'LogFiles/'
    check_path(baseFolder)
    check_path(outputFolder)
    check_path(brainFolder)
    check_path(affineFolder)
    check_path(segFolder)
    check_path(ROIFolder)
    check_path(logFolder)

    fileList = getFilesinPath(fileFolderPath)

    threadList = []
    for i in range(nthreads):
        threadList.append("Thread-{0:d}".format(i + 1))

    queueLock = threading.Lock()
    workQueue = queue.Queue(len(fileList) + 1)
    threads = []
    threadID = 1

    for tName in threadList:
        thread = myThread(threadID, tName, workQueue)
        thread.start()
        threads.append(thread)
        threadID += 1

    # Fill the queue
    queueLock.acquire()
    for ifile in fileList:
        workQueue.put(ifile)
    queueLock.release()

    # Wait for work queue to empty
    while not workQueue.empty():
        pass

    # Notify threads it's time to exit
    exitFlag = 1

B.2 Classification

B.2.1 Label generation

This program is used to generate the label file for training the neural network. After running this script with the required arguments, the label file is created. The usage can be displayed with the command python3 label_generation.py -h.
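As an illustration only, using the example paths from the script's own help text (machine-specific and reconstructed from the listing), the label file can be generated with:

python3 label_generation.py /home/chao/data_3dcnn/data/label /home/chao/data_3dcnn/label.txt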


import argparse, os
from xml.dom.minidom import parse
import xml.dom.minidom
import numpy as np
import time

parser = argparse.ArgumentParser(description='Script for label generation, usage example: python3 label_generation.py /home/chao/data_3dcnn/data/label /home/chao/data_3dcnn/label.txt')

parser.add_argument('xmlLabelFolder', type=str, help='XML data folder. eg. /home/chao/data_3dcnn/data/label')
parser.add_argument('outputFile', type=str, help='Output label file. eg. /home/chao/data_3dcnn/label.txt')


def path_exist(p):
    if not os.path.exists(p):
        return False
    return True


def get_labels(label_path):
    # collect labels start
    # label_path = '../data/label'
    file_list = os.listdir(label_path)
    labels_image = []
    for file in file_list:
        DOMTree = xml.dom.minidom.parse(label_path + "/" + file)
        collection = DOMTree.documentElement
        subjectIdentifiers = collection.getElementsByTagName("subjectIdentifier")[0]
        subjectIdentifier = subjectIdentifiers.childNodes[0].data
        subjectInfo = collection.getElementsByTagName("subjectInfo")[0]
        seriesIdentifiers = collection.getElementsByTagName("seriesIdentifier")[0]
        seriesIdentifier = "S" + seriesIdentifiers.childNodes[0].data
        imageUIDs = collection.getElementsByTagName("imageUID")[0]
        imageUID = "I" + imageUIDs.childNodes[0].data
        key = subjectIdentifier + "_" + seriesIdentifier + "_" + imageUID

        if subjectInfo.childNodes[0].data == 'AD':
            label = 2
        elif subjectInfo.childNodes[0].data == 'Normal':
            label = 0
        else:
            label = 1

        # info = tuple([key, str(label), subjectInfo.childNodes[0].data])
        labels_image.append([key, str(label), subjectInfo.childNodes[0].data])

    return labels_image


if __name__ == "__main__":
    opt = parser.parse_args()
    if not path_exist(opt.xmlLabelFolder):
        print('Folder or file does not exist!')
        exit()

    labels_image = get_labels(opt.xmlLabelFolder)
    np.savetxt(opt.outputFile, labels_image, fmt="%s")
    # np.savetxt("./label_all.txt", labels_image)
    # label = get_label()


def get_label(path='/home/chao/data_3dcnn/data/label_all.txt'):
    labels = np.loadtxt(path, dtype=np.str)
    label = {}
    for info in labels[:]:
        label[info[0]] = info[1]
    return label

B.2.2 Classification

This program is used to train the neural network for classification. The usage can be displayed with the command python3 torch_cv.py -h.
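As an illustration only (the paths are the machine-specific examples from the script's help text, and the optional flags show one possible configuration rather than the settings of any particular reported run), a two-way NC vs. AD run with cross-entropy loss and a learning rate of 0.01 could be started with:

python3 torch_cv.py --nWay 2 --lossFunction 0 --lr 0.01 /media/chao/claudio_local/3dcnn_data /media/chao/claudio_local/3dcnn_data/HIPPO_AMYG_THAL_PUTA_PALL_CAUD_MTL_NEW /home/chao/data_3dcnn/data/label_all.txt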


import numpy as np
import argparse, os
import nibabel as nib
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torch.autograd import Variable
import random
import time


def total_gradient(parameters):
    parameters = list(filter(lambda p: p.grad is not None, parameters))
    totalnorm = 0
    for p in parameters:
        modulenorm = p.grad.data.norm()
        totalnorm += modulenorm ** 2
    totalnorm = totalnorm ** (1. / 2)
    return totalnorm


parser = argparse.ArgumentParser(description='PyTorch for AD, usage example: python3 torch_cv.py /media/chao/claudio_local/3dcnn_data /media/chao/claudio_local/3dcnn_data/HIPPO_AMYG_THAL_PUTA_PALL_CAUD_MTL_NEW /home/chao/data_3dcnn/data/label_all.txt')

parser.add_argument('--lossFunction', '-loss', type=int, default=1, choices=[0, 1], help='Loss Function: 0 CrossEntropy, 1 Mean Square Error, default 1')
parser.add_argument('--optimizer', '-optm', type=int, default=0, choices=[0, 1], help='0 gradient descent, 1 Adam, default 0')
parser.add_argument('--crossValidation', '-cv', type=int, default=10, help='Cross Validation: <= 1, no CV performed, default 10')
parser.add_argument('--augmentation', action='store_true', help='Augment data using mirror flip, default False')
parser.add_argument('--trainBatchSize', '-trSize', type=int, default=4, help='training batch size, default 4')
parser.add_argument('--testBatchSize', '-teSize', type=int, default=4, help='testing batch size, default 4')
parser.add_argument('--nEpochs', type=int, default=200, help='number of epochs to train for')
parser.add_argument('--lr', '-lr', type=float, default=0.01, help='Learning Rate, default=0.01')
parser.add_argument('--cfg', type=str, default='A', help='Specify the network structure from the predefined list')
parser.add_argument('--cuda', action='store_true', help='use cuda or not? default: False')
parser.add_argument('--threads', type=int, default=4, help='number of threads for data loader to use')
parser.add_argument('--dropout', type=float, default=0.1, help='dropout rate of the standard neural network, default 0.1')
parser.add_argument('--momentum', default=0, type=float, help='momentum, default: 0')
parser.add_argument('--weight-decay', '-wd', default=0.001, type=float, help='weight decay, default: 0, 1e-3')
parser.add_argument('--nWay', '-nWay', default=2, type=int, choices=[2, 3], help='Type of classification. 2 two-way classification, 3 three-way classification')
parser.add_argument('outputFolder', default='', type=str, help='The output folder. eg. /media/chao/claudio_local/3dcnn_data')
parser.add_argument('MRIdataFolder', type=str, help='MRI images folder. eg. /media/chao/claudio_local/3dcnn_data/HIPPO_AMYG_THAL_PUTA_PALL_CAUD_MTL_NEW')
parser.add_argument('labelFile', type=str, help='Label file. eg. /home/chao/data_3dcnn/data/label_all.txt')


class CNN(nn.Module):
    def __init__(self, CNNLayer, last_output, width, height, depth, nLabel):
        super(CNN, self).__init__()
        self.width = width
        self.height = height
        self.depth = depth
        self.nLable = nLabel
        self.cnn = CNNLayer
        print(CNNLayer)
        #
        self.linear = nn.Sequential(
            nn.Linear(last_output * int(self.width) * int(self.height) * int(self.depth), 1000),
            nn.ReLU(),
            nn.Dropout(opt.dropout),
            nn.Linear(1000, 100),
            nn.ReLU(),
            nn.Dropout(opt.dropout),


            nn.Linear(100, self.nLable),
            nn.Softmax()
        )

        f.print_and_log('Size of MRI after CNN:{} {} {}'.format(self.width, self.height, self.depth))

    def forward(self, x):
        x = self.cnn(x)
        x = x.view(x.size(0), -1)
        x = self.linear(x)
        return x


class SSPLayer():
    def __init__(self):
        pass


class LogSystem:
    def __init__(self, logfolder, testflag=False):
        self.test = testflag
        self.sep = '_'
        logfilename = opt.MRIdataFolder.split('/')[-1] + '_' + self.sep.join(comparison) + '_' + time.strftime(
            "%Y_%m_%d_%H_%M", time.localtime()) + '.txt'
        if not self.test:
            self.logfile = open(logfolder + '/' + logfilename, 'w')
            self.print_and_log(logfolder + '/' + logfilename)

    def log(self, text):
        self.logfile.write(text + '\n')

    def print_and_log(self, text):
        print(text)
        if not self.test:
            self.log(text)

    def flush(self):
        if not self.test:
            self.logfile.flush()


class LoadData():
    def __init__(self, datafolder, nLabel, train_test_rate=0.9,
                 labelpath='/home/chao/data_3dcnn/data/label_all.txt', augmentation=False,
                 cross_validation=1):
        self.labelpath = labelpath
        self.nLabel = nLabel
        self.label = {}
        self.sample_data = []
        self.sample_label = []
        self.rate = train_test_rate
        self.augmentation = augmentation
        self.cross_validation = cross_validation
        self.current_test_fold = 0
        self.datafolder = datafolder
        self.d_min = self.h_min = self.w_min = 500
        self.d_max = self.h_max = self.w_max = 0
        self.dic = {'NC': 0, 'MCI': 1, 'AD': 2}
        self.training_set = []
        self.load_data()

    def load_label(self):
        for info in np.loadtxt(self.labelpath, dtype=np.str)[:]:
            self.label[info[0]] = info[1]

    def get_data(self):
        if self.cross_validation > 1:
            training_threshold = 1.0 / self.cross_validation
            testing_start = round(self.current_test_fold * training_threshold * self.get_num_data())


            testing_end = round((self.current_test_fold + 1) * training_threshold * self.get_num_data())
            testing_data = self.sample_data[testing_start:testing_end]
            training_data = self.sample_data[0:testing_start] + self.sample_data[testing_end:]

            self.current_test_fold += 1
            if self.current_test_fold == self.cross_validation:
                self.current_test_fold = 0
        else:
            training_threshold = round(len(self.sample_data) * self.rate)
            training_data = self.sample_data[:training_threshold]
            testing_data = self.sample_data[training_threshold:]

        if self.augmentation:
            training_data = self.augment_data(training_data)

        return training_data, testing_data

    def augment_data(self, data):
        tmp_data = data.copy()
        for d in tmp_data:
            d[0] = np.flip(d[0], 3).copy()
            data.append(d)
        np.random.shuffle(data)
        return data

    def get_num_data(self):
        return len(self.sample_data)

    def get_shape(self):
        return self.d_max - self.d_min, self.h_max - self.h_min, self.w_max - self.w_min

    def get_engagement(self):
        return self.d_min, self.d_max, self.h_min, self.h_max, self.w_min, self.w_max

    def get_engagement_to_string(self):
        sep = ','
        return sep.join(self.get_engagement())

    def load_data(self):
        self.load_label()
        target_dic = {}
        indx = 0
        for k, l in enumerate(comparison):
            target_dic[self.dic[l]] = k
        num_total = len(self.label)
        if num > 0:
            num_total = num

        data_list = os.listdir(self.datafolder)[:num_total]
        np.random.shuffle(data_list)
        for file in data_list:
            tmp_data = nib.load(self.datafolder + "/" + file).get_data()
            sample_label_tmp = np.zeros((1, self.nLabel), dtype=np.float32)
            separate = '_'
            image_id = separate.join(file.split(separate)[0:5])
            target_label = int(self.label[image_id])
            if target_label not in target_dic:
                continue

            sample_label_tmp[0][target_dic[target_label]] = 1.0

            # The size of ADNI1 MRI Data is 182*218*182
            self.sample_data.append([tmp_data[:].reshape(1, 182, 218, 182), sample_label_tmp[0].copy(), image_id])

            d_edge, h_edge, w_edge = tmp_data.nonzero()

            if d_edge.min() < self.d_min: self.d_min = d_edge.min()
            if d_edge.max() > self.d_max: self.d_max = d_edge.max()
            if h_edge.min() < self.h_min: self.h_min = h_edge.min()
            if h_edge.max() > self.h_max: self.h_max = h_edge.max()
            if w_edge.min() < self.w_min: self.w_min = w_edge.min()


            if w_edge.max() > self.w_max: self.w_max = w_edge.max()
            if indx % 100 == 0:
                f.print_and_log(
                    "currently processing: " + time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()) + " - " + str(indx))
            indx += 1

        np.random.shuffle(self.sample_data)
        self.d_max += 2
        self.h_max += 2
        self.w_max += 2
        self.d_min -= 2
        self.h_min -= 2
        self.w_min -= 2
        for data in self.sample_data:
            data[0] = data[0][:, self.d_min:self.d_max, self.h_min:self.h_max, self.w_min:self.w_max]


def layers(config, width, height, depth):
    layer = []
    in_channel = 1

    for v in config:
        if v == 'M':
            layer.append(nn.MaxPool3d(kernel_size=2, stride=2))
            width = width // 2
            height = height // 2
            depth = depth // 2
        elif v == 'R':
            layer.append(nn.ReLU(inplace=True))
        else:
            width = (width + 2 * v['pad'] - v['size']) // v['stride'] + 1
            height = (height + 2 * v['pad'] - v['size']) // v['stride'] + 1
            depth = (depth + 2 * v['pad'] - v['size']) // v['stride'] + 1
            conv3d = nn.Conv3d(in_channel, v['out_channel'], kernel_size=v['size'], stride=v['stride'], padding=v['pad'])
            layer.append(conv3d)
            in_channel = v['out_channel']

    return nn.Sequential(*layer), in_channel, width, height, depth


def path_exist(p):
    if not os.path.exists(p):
        return False
    return True


def mkdir(p):
    if not os.path.exists(p):
        os.mkdir(p)


# if test = True, 100 instances will be read only
test = False
num = 0
if test:
    num = 100

cfg = {
    'A': [{'out_channel': 32, 'size': 3, 'stride': 1, 'pad': 1}, 'M', 'R',
          {'out_channel': 32, 'size': 3, 'stride': 1, 'pad': 1}, 'M', 'R',
          {'out_channel': 32, 'size': 3, 'stride': 1, 'pad': 1}, 'M', 'R', ],
    'B': [{'out_channel': 16, 'size': 3, 'stride': 1, 'pad': 0}, 'R', 'M',
          {'out_channel': 16, 'size': 3, 'stride': 1, 'pad': 0}, 'R', 'M',
          {'out_channel': 16, 'size': 3, 'stride': 1, 'pad': 0}, 'R', 'M', ]
}

if __name__ == "__main__":


    opt = parser.parse_args()
    # print(opt.__str__())
    # opt.outputFolder = "/media/chao/claudio_local/3dcnn_data"
    # opt.MRIdataFolder = '/media/chao/claudio_local/3dcnn_data/HIPPO_AMYG_THAL_PUTA_PALL_CAUD_MTL_NEW'
    # opt.labelFile = '/home/chao/data_3dcnn/data/label_all.txt'
    # /bwfefs/home/fumie/data/chao/download/up_riken/label.txt

    if not path_exist(opt.outputFolder) or not path_exist(opt.MRIdataFolder) or not path_exist(opt.labelFile):
        print('Folder or file does not exist!')
        print("python3 preprocess.py -h")
        exit()

    if opt.cfg not in cfg:
        print('Network structure does not exist!')
        exit()

    if opt.outputFolder[-1] == '/' or opt.MRIdataFolder[-1] == '/':
        print('Wrong format, please remove the last /')
        exit()

    logfolder = opt.outputFolder + '/Log'
    mkdir(logfolder)

    if opt.nWay == 2:
        comparison = ['NC', 'AD']
    elif opt.nWay == 3:
        comparison = ['NC', 'MCI', 'AD']

    nLabel = len(comparison)
    if opt.crossValidation <= 1:
        opt.crossValidation = 1

    f = LogSystem(logfolder, testflag=test)
    f.print_and_log(comparison.__str__())
    f.print_and_log(opt.__str__())

    f.print_and_log("Reading data start")

    # Load data start
    data = LoadData(opt.MRIdataFolder, nLabel, augmentation=opt.augmentation, cross_validation=opt.crossValidation,
                    labelpath=opt.labelFile)
    f.print_and_log("reading data finish, total: " + str(data.get_num_data()))
    cv = 0
    cv_acc = []
    while cv < opt.crossValidation:
        cv += 1
        f.print_and_log("Cross Validation {}".format(cv))
        # get training and testing dataset
        training_sample, testing_sample = data.get_data()

        train_num = len(training_sample)
        test_num = len(testing_sample)

        f.print_and_log("Training, Testing number: {}, {}".format(train_num, test_num))

        sample_depth, sample_height, sample_width = data.get_shape()

        engage = data.get_engagement()

        f.print_and_log("The engagement of data: " + engage.__str__())

        model = CNN(*layers(cfg[opt.cfg], sample_depth, sample_height, sample_width), nLabel)

        f.print_and_log(model.__str__())

        print(model.parameters())

        training_data_loader = DataLoader(dataset=training_sample, num_workers=opt.threads, batch_size=opt.trainBatchSize,
                                          shuffle=False)
        testing_data_loader = DataLoader(dataset=testing_sample, num_workers=opt.threads, batch_size=opt.testBatchSize,
                                         shuffle=False)

        if opt.cuda and not torch.cuda.is_available():


            raise Exception('No GPU found, please run without --cuda')
        opt.seed = random.randint(1, 10000)
        f.print_and_log("Random Seed: " + str(opt.seed))
        torch.manual_seed(opt.seed)
        if opt.cuda:
            torch.cuda.manual_seed(opt.seed)

        if opt.cuda:
            if torch.cuda.device_count() >= 2:
                model = nn.DataParallel(model)
            model = model.cuda()
            f.print_and_log('===> Setting {} GPU'.format(torch.cuda.device_count()))
        else:
            f.print_and_log('===> Setting CPU')

        f.print_and_log('===> Setting Optimizer')
        if opt.optimizer == 1:
            optimizer = optim.Adam(model.parameters(), lr=opt.lr, weight_decay=opt.weight_decay)
            f.print_and_log('===> Adam')
        else:
            optimizer = optim.SGD(model.parameters(), lr=opt.lr, momentum=opt.momentum,
                                  weight_decay=opt.weight_decay)
            f.print_and_log('===> SGD')

        f.print_and_log('===> Setting loss function')
        if opt.lossFunction == 0:
            creterion = nn.CrossEntropyLoss()
            f.print_and_log('===> Cross entropy')
        elif opt.lossFunction == 1:
            creterion = nn.MSELoss()
            f.print_and_log('===> MSE')

        f.print_and_log('===> Start training')
        for epoch in range(1, opt.nEpochs + 1):
            f.print_and_log('Cross Validation: ' + str(cv) + ' epoch = ' + str(epoch) + ' lr = ' + str(
                optimizer.param_groups[0]['lr']) + time.strftime(" %Y-%m-%d %H:%M:%S", time.localtime()))
            acc_test = 0
            for iteration, batch in enumerate(training_data_loader):
                model.train()
                input, target = Variable(batch[0]), Variable(batch[1].type(torch.FloatTensor), requires_grad=False)
                if opt.lossFunction == 0:
                    target = target.max(1)[1]
                if opt.cuda:
                    input = input.cuda()
                    target = target.cuda()
                pre = model(input)
                acc_target = target.data
                if opt.lossFunction == 1:
                    acc_target = target.max(1)[1].data
                acc_test += np.count_nonzero(np.equal(np.argmax(pre.data, axis=1), acc_target))
                loss = creterion(pre, target)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

                if iteration % 300 == 0 or (len(training_data_loader) == (iteration + 1) and epoch == opt.nEpochs):
                    model.eval()
                    # f.print_and_log(time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
                    f.print_and_log(
                        "===> Epoch [{}]({}/{}): Loss: {:.10f}".format(epoch, iteration, len(training_data_loader),
                                                                       loss.data[0]))
                    f.print_and_log('total gradient: ' + str(total_gradient(model.parameters())))

                    test_loss = 0
                    correct = 0.0
                    loss_fn = nn.MSELoss()
                    results = None
                    for batchs in testing_data_loader:  # , testing_label_loader):
                        test_input, test_target, test_img_id = Variable(batchs[0]), Variable(


                            batchs[1].type(torch.FloatTensor), requires_grad=False), batchs[2]
                        if opt.cuda:
                            test_input = test_input.cuda()
                            test_target = test_target.cuda()

                        prediction = model(test_input)

                        tmp_test_loss = loss_fn(prediction, test_target)
                        test_loss += tmp_test_loss.data[0]
                        pred = prediction.data.max(1)[1]
                        targ = test_target.data.max(1)[1]
                        correct += pred.eq(targ).cpu().sum()
                        if len(training_data_loader) == (iteration + 1) and epoch == opt.nEpochs:
                            num_input = len(test_input)
                            r_tmp = None
                            r_tmp = np.concatenate((prediction.data.cpu().numpy(),
                                                    pred.cpu().numpy().reshape(num_input, 1),
                                                    targ.cpu().numpy().reshape(num_input, 1)), axis=1)
                            if results is None:
                                results = r_tmp
                            else:
                                results = np.concatenate((results, r_tmp), axis=0)
                        del tmp_test_loss, prediction
                    test_loss /= len(testing_data_loader)
                    f.print_and_log(
                        'Test set: Average loss: {:.4f}, Accuracy: {}/{} ({:.5f}%)\n'.format(test_loss, correct,
                                                                                             test_num,
                                                                                             correct / test_num * 100.0))
                    test_loss = 0
                    f.flush()
                    # calculate ROC metrics
                    if len(training_data_loader) == (iteration + 1) and epoch == opt.nEpochs:
                        if nLabel == 2:
                            false_positive = 0.0
                            false_negtive = 0.0
                            true_positive = 0.0
                            true_negtive = 0.0
                            positive = 0.0
                            negtive = 0.0
                            other = 0.0
                            for r in results:
                                if r[-1] == 0:
                                    negtive += 1
                                    if r[-1] == r[-2]:
                                        true_negtive += 1
                                    else:
                                        false_positive += 1
                                elif r[-1] == 1:
                                    positive += 1
                                    if r[-1] == r[-2]:
                                        true_positive += 1
                                    else:
                                        false_negtive += 1
                                else:
                                    other += 1

                            f.print_and_log(results.__str__())
                            f.print_and_log("Other is {}".format(other))
                            f.print_and_log("True Positive (SENS): {}/{} {:.5f}%".format(true_positive, positive,
                                                                                         true_positive / positive))
                            f.print_and_log("False Positive: {}/{} {:.5f}%".format(false_positive, negtive,
                                                                                   false_positive / negtive))
                            f.print_and_log("True Negtive (SPEC): {}/{} {:.5f}%".format(true_negtive, negtive,
                                                                                        true_negtive / negtive))

        cv_acc.append(correct / test_num * 100.0)
        f.print_and_log('Training Accuracy: {}/{} {:.5f}%\n'.format(acc_test, train_num, acc_test / train_num * 100))
        f.print_and_log(cv_acc.__str__())


    f.print_and_log("Cross Validation Accuracy: {} {:.5f}%".format(cv_acc.__str__(), np.mean(cv_acc)))