Convolutional neural networks for automated edema segmentation in brain CT
images of patients with intracerebral hemorrhage.
Dyantha van der Sluijs
Dr. F.P.J.M. VOORBRAAK
Student: D.G. VAN DER SLUIJS (10164391)
Location: BIOMEDICAL ENGINEERING AND PHYSICS, Academic Medical Center, Meibergdreef 9, 1105 AZ Amsterdam-Zuidoost
Supervisor: Dr. H.A. MARQUERING, Biomedical engineering and physics
Tutor: Dr. F.P.J.M. VOORBRAAK, Medical Informatics
February, 2017 - January, 2018
1 Introduction
2 Preliminaries
  2.1 Artificial neural networks
  2.2 Gradient descent
  2.3 Convolutional Neural Networks (CNNs)
3 Methods and materials
  3.1 Dataset
  3.2 Inclusion and exclusion
  3.3 Pre-processing
    3.3.1 Manual segmentations
    3.3.2 Patches
    3.3.3 Thin- and thick sliced images
  3.4 Hardware and Software
  3.5 CNN training
    3.5.1 CNN architecture
    3.5.2 Training models
    3.5.3 Training on difficult patches
  3.6 Statistical analyses
  3.7 Post-processing
4 Results
  4.1 Inter observer variability
  4.2 Optimal probability threshold
  4.3 CNN performance
    4.3.1 Performance of training models on ICH segmentations
    4.3.2 Performance of training models on edema segmentations
    4.3.3 Performance on test sets
  4.4 Post-processing
5 Discussion
6 Conclusion
Summary (translated from Dutch)
Introduction
Intracerebral hemorrhage is a common form of stroke with high morbidity and mortality. Edema often develops around the hemorrhage. Because edema increases the risk of a poor outcome, edema quantification is needed to optimize treatment of the hemorrhage. Convolutional neural networks (CNNs) have already been shown to be a reliable method for medical image segmentation. In this study we introduce a CNN for automatic quantification of edema.
Methods
We used 191 scans for training and 48 scans for testing our CNN. The CNN architecture consisted of 2 convolutional layers, 2 pooling layers, a fully connected layer and a softmax function. We tested 8 models that varied in patch size, number of iterations and slice thickness. In addition, 1 model used difficult patches in the training set, 1 model used segmentations from a different observer, and 1 model used segmentations from 2 observers. Model performance was measured with the area under the curve (AUC), Dice scores and intraclass correlation coefficient (ICC) values. Interobserver variability between 4 observers was measured to put this performance in context. The CNN segmentations were optimized by removing incorrectly classified voxels using probability thresholds and segment removal thresholds. We then eliminated edema segments that were not adjacent to hemorrhages.
Results
The most accurate model was trained with patches of [19 x 19] voxels for 100 iterations. This model achieved an AUC of 0.92, a mean Dice score of 0.32 and a mean ICC of 0.73. After post-processing the segmentation with the methods above, this model achieved a mean Dice score of 0.44 and a mean ICC of 0.68. This is lower than the mean interobserver Dice score of 0.52 and the mean interobserver ICC of 0.83.
Conclusion
CNN is a promising method for edema segmentation, with an accuracy close to the interobserver variability.
Summary
Introduction
Intracerebral hemorrhage (ICH) is a common type of stroke with high morbidity and mortality rates. Edema often forms around the ICH. Because edema increases the chance of a poor outcome, edema quantification is needed to optimize ICH treatment. Convolutional neural networks (CNNs) have been shown to be a reliable method for medical image segmentation. In this study, we introduce a CNN to automatically quantify edema.
Methods
We used 191 scans for training and 48 scans for testing our CNN. The CNN architecture comprised 2 convolutional layers, 2 pooling layers, a fully connected layer and a softmax function. We tested 8 models that varied in patch size, number of epochs, slice thickness, use of difficult patches in the training set and the observer whose manual segmentations were used for training. Performance was measured with the area under the curve (AUC), Dice scores and intraclass correlation coefficient (ICC) values. Inter-observer variability between 4 observers was determined to put this performance in context. Segmentations were optimized using probability thresholds and particle removal thresholds (PRTs) to remove incorrectly classified voxels. Furthermore, we deleted edema particles that were not adjacent to ICH particles.
Results
The most accurate model was trained with patches of [19 x 19] voxels for 100 epochs and achieved an AUC of 0.92, an average Dice score of 0.32 and an average ICC of 0.73. After post-processing, this model achieved an average Dice score of 0.44 and an average ICC of 0.68. This is lower than the average inter-observer Dice score of 0.52 and the average inter-observer ICC of 0.83.
Conclusion
CNN is a promising method for edema segmentation, with an accuracy that approaches manual performance.
Keywords: Brain edema, convolutional neural networks, intracerebral hemorrhage
1 Introduction
Intracerebral hemorrhage (ICH) is a common type of stroke (10-15% of all strokes) with high morbidity and mortality rates [1]. It is caused by the rupture of a small, damaged artery in the brain [2]. Deterioration of blood vessels can be caused by hypertension or amyloid angiopathy. Multiple factors associated with ICH can provoke brain edema [3-7]. Edema is a swelling of the brain caused by excessive accumulation of fluid. It progresses most rapidly during the first 2-3 days after ICH and is visible on CT scans as a hypodense halo around the hyperdense ICH region [8,9].
Edema is a major cause of poor outcome after ICH and can cause further tissue damage or even death [8-12]. This poor outcome is associated with the growth of the edema during the initial 48-72 hours after the ICH [9,10]. Therefore, edema quantification after ICH could improve decision-making for ICH treatment [11,12]. Although volumetric analysis of edema by magnetic resonance imaging (MRI) is more accurate than by CT [13], an MRI scan takes more time and is more costly. Precise manual assessment of the edema volume on CT is difficult, and considerable interobserver differences remain [14,15]. An automatic algorithm for quantifying edema on a CT scan would help to decrease interobserver variance and improve the quantification of edema. In this study, we use convolutional neural networks (CNNs) to create an automated edema quantification method.
Multiple automated ICH segmentation methods using thresholding have been developed [14-17]. With thresholding, a voxel is labeled based on its intensity. This value can be either the Hounsfield unit (HU) [14,16,17] or the relative density increase [15]. A CNN provides an automated method that also considers the intensity values of neighbouring voxels when classifying a voxel. This is an advantage in edema segmentation, because the low contrast between edema and brain regions makes differentiating hypodense edema from brain tissue challenging. Furthermore, CT scans vary in brightness, making it hard to set one threshold for all scans.
A CNN is a machine learning method that uses small learnable filters for voxel classification. Previous studies have demonstrated the effectiveness of CNNs in medical image segmentation [18-21]. A CNN can be used "off-the-shelf" or trained "from scratch". An off-the-shelf CNN has already been trained on non-medical images, making it possible to re-use the pre-trained filters. Training from scratch means that the filters are randomly initialized. Van Ginneken et al. (2015) accurately detected pulmonary nodules using an off-the-shelf CNN [22] and Havaei et al. (2017) successfully segmented brain tumors with a CNN trained from scratch [23]. We train our CNN from scratch. The aim of this study is to produce and test an automated edema segmentation method using a CNN for detecting and quantifying edema following ICH on a non-contrast CT scan. The algorithm was trained and tested on CT scans obtained from the PATCH study [24], which investigated the effect of platelet transfusion in ICH patients.
The remainder of this thesis is structured as follows. First, a general introduction to neural networks is given, followed by an introduction to CNNs. Then, the methods for developing and testing our segmentation method are described, followed by the test results, a discussion of these results and a conclusion.
2 Preliminaries
2.1 Artificial neural networks
Artificial neural networks (ANNs) are computer programs that can be taught to classify patterns by providing examples with corresponding classification labels. These networks are based on biological neural networks. Biological neural networks in the brain consist of billions of interconnected neurons in different layers that process information in parallel. The network learns to recognize a pattern by firing signals to specific neurons in the different layers. As a result, connections between firing neurons are adjusted. ANNs use artificial neurons connected over different layers, starting with an input layer, followed by one or more hidden layers and ending with an output layer (figure 1) [25]. The connections, usually drawn as arrows, carry weights and represent information flow. The values of hidden layer neurons and output layer neurons are weighted sums of the values of the previous layer. Forward propagation is the application of the weights to the input data to calculate the output. For example, to calculate the value of node h1 in figure 1, we use the following equation:
$$h_1 = i_1 w_3 + i_2 w_5 + b_1 w_1 \qquad (1)$$
Often, bias neurons (grey nodes in figure 1) are added to every layer in the network. These neurons usually have a value of 0, -1 or 1 and have adjustable connections to the neurons of the next layer. Because the value of a bias node is constant and cannot be changed by previous layers, it allows the network to shift its output function, such as equation 1, vertically. This extra degree of freedom helps the network fit the input data.
Figure 1: A simple neural network consisting of an input layer (green, nodes i1 and i2), a hidden layer (purple, nodes h1 and h2) and an output layer (red, nodes o1 and o2). The grey circles (b1 and b2) are bias nodes (often 0, -1 or 1) and the grey arrows are the weights (w1 to w12).
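As a minimal sketch of equation 1, the weighted sum for a single hidden node can be written in plain Python. The numeric values below are made up for illustration and are not taken from this study; the helper name `weighted_sum` is hypothetical.

```python
def weighted_sum(inputs, weights, bias_value, bias_weight):
    """Value of one neuron: weighted inputs plus the weighted bias node."""
    total = bias_value * bias_weight
    for value, weight in zip(inputs, weights):
        total += value * weight
    return total


# Hypothetical values for i1, i2, w3, w5, b1 and w1 from equation 1.
i1, i2 = 0.5, -1.0
w3, w5 = 0.8, 0.25
b1, w1 = 1.0, 0.1

h1 = weighted_sum([i1, i2], [w3, w5], b1, w1)
print(h1)
```

The same function would be applied to every node of every layer during forward propagation, each time with that node's own incoming weights.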
To simulate the learning process of biological neural networks, in which connections change upon learning, the weights of an ANN are changed to minimize the error between the values of the output neurons and the actual output values. A cost function is used to calculate this error. A large number of cost functions exist. In this study, we use cross entropy (CE), which is often used in image classification [23,26]. CE has been reported to train faster and more accurately than other cost functions [27]. CE is calculated by the following equation:
$$H_{y'}(y) = -\sum_i y'_i \log(y_i), \qquad (2)$$
where $y_i$ is the predicted probability of class $i$ and $y'_i$ is the true probability of that class.
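To make equation 2 concrete, the following sketch computes the cross entropy for one sample with a one-hot ground truth. The class order, the example probabilities and the small `eps` guard against log(0) are assumptions added for illustration.

```python
import math


def cross_entropy(true_probs, predicted_probs, eps=1e-12):
    """Equation 2: H = -sum_i y'_i * log(y_i).

    eps is an illustrative guard so that log(0) is never evaluated.
    """
    return -sum(t * math.log(max(p, eps))
                for t, p in zip(true_probs, predicted_probs))


# One-hot ground truth for three classes (for example edema, ICH, brain).
truth = [1.0, 0.0, 0.0]
confident = [0.90, 0.05, 0.05]
unsure = [0.40, 0.30, 0.30]

print(cross_entropy(truth, confident))  # small error
print(cross_entropy(truth, unsure))     # larger error
```

With a one-hot ground truth only the term of the true class contributes, so the error depends solely on the probability assigned to the correct class.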
2.2 Gradient descent
To optimize the neural network, the total error should be minimized. Figure 2 shows the cost function for 2 weights. To reach the minimum of the function, steps towards this minimum are taken by changing the weights. The direction of each step is given by the gradient, i.e. the partial derivatives of the cost function with respect to all weights, and the step size is determined by the learning rate. All weights are updated by subtracting their gradients multiplied by the learning rate from the current weight values. This is repeated until a minimum of the total error is reached, where all gradients are 0. This process is called gradient descent. As shown in figure 2, a function can have multiple low points, or minima. Gradient descent can therefore mistake a local minimum for the global minimum, so it should be performed multiple times from different starting points. We refer to a gradient descent iteration as an epoch in the remainder of this thesis.
Figure 2: Gradient descent visualized in a graph with 2 weights on the horizontal axes and the cross entropy on the vertical axis. The black line shows the steps taken to find the lowest cross entropy value. Multiple local minima can be found by gradient descent [28].
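The update rule described above can be sketched for 2 weights as follows. The cost function here is a simple convex toy function chosen for illustration (it is not the cross entropy of a network), and the learning rate and number of steps are arbitrary assumptions.

```python
def gradient(w1, w2):
    """Partial derivatives of the toy cost (w1 - 3)^2 + (w2 + 1)^2."""
    return 2.0 * (w1 - 3.0), 2.0 * (w2 + 1.0)


def gradient_descent(w1, w2, learning_rate=0.1, steps=200):
    """Repeatedly subtract learning_rate * gradient from each weight."""
    for _ in range(steps):
        g1, g2 = gradient(w1, w2)
        w1 -= learning_rate * g1
        w2 -= learning_rate * g2
    return w1, w2


w1, w2 = gradient_descent(0.0, 0.0)
print(w1, w2)  # approaches the minimum of the toy cost at (3, -1)
```

Because this toy cost has a single minimum, every starting point converges to the same weights; with several minima, as in figure 2, the result depends on the starting point.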
2.3 Convolutional Neural Networks (CNNs)
Figure 3 displays an example architecture of a CNN. CNNs are neural networks with multiple types of layers containing learnable weights and biases. CNNs can be used for different purposes. In this study we use CNNs for image recognition. The different types of layers enable the CNN to use spatial information about objects in an image. There is one input layer, one output layer, and one or more hidden layers. In this study, the input is an image. This is comparable to a visual stimulus for the human brain, which learns to recognize the stimulus by categorizing it. The image is provided as a matrix with an intensity value for every voxel. In CT scans, these values are expressed in Hounsfield units (HU). Training a CNN requires input images with corresponding labels.
Figure 3: A simple architecture of a CNN comprising a Conv layer, a pool layer and a FC layer followed by a softmax function.
Multiple types of layers can be used in a CNN [23,25,26,29]:
1. Convolutional (conv) layers: Conv layers extract different features from the image. This layer uses forward propagation and gradient descent to adjust the weights and improve categorization. The weights, defined as filters, are provided as small matrices of an arbitrary size. Each filter detects a particular feature in the image. Every filter is slid over the input in steps of a fixed number of voxels, from the top left corner to the bottom right corner of the image (figure 4). The size of these steps is defined as the stride. At each position, the element-by-element product of the filter and the current part of the input is calculated, summed and added to the bias, resulting in one value of the feature map. This calculation is shown in equation 3, using the red rectangle in figure 4 as an example. Often, zeros are added around the border, which is called zero padding. This preserves the size of the input and the information at the borders.
Figure 4: Multiplication of a filter in one layer with stride 2. The filter is multiplied with the red part, the blue part and the green part in turn as it is slid along the image. The bias is added to each multiplication, resulting in the outcome matrix.
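$$x_{\text{red}} \odot f \qquad (3a)$$
$$\sum \left(x_{\text{red}} \odot f\right) = 1 \qquad (3b)$$
$$o_{\text{red}} = b + \sum \left(x_{\text{red}} \odot f\right) = 1 + 1 = 2 \qquad (3c)$$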
In equations 3, $x_{\text{red}}$ is the small red sub-matrix of the input image in figure 4, $f$ is the filter, and $o_{\text{red}}$ is the outcome value for this sub-matrix. Equation 3a shows the element-by-element multiplication of the sub-matrix with the filter, equation 3b shows the summation of this product and equation 3c adds the bias. The number of filters in a convolutional layer determines the depth of the input for the next layer. Networks with more filters (also called kernels) generally cope better with complex images. This calculation is followed by a non-linear activation function. Multiple non-linear functions exist, such as the hyperbolic tangent (f(x) = tanh(x)) and rectified linear units (ReLU) (f(x) = max(0, x)). ReLUs have been reported to train faster than tanh units [30]. Weights are adjusted with gradient descent. Gradient descent is often a time-consuming process when a large number of patches is provided to the CNN. Therefore, the network is trained on minibatches of images [31]. Each convolutional output of a minibatch is normalized with the following formula:
$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}, \qquad (4)$$
where $\hat{x}_i$ is the normalized convolutional output, $x_i$ is the convolutional output, $\mu_B$ is the batch mean, $\sigma_B^2$ is the batch variance and $\varepsilon$ is a constant added for numerical stability. Filters are updated after each batch.
2. Pooling (pool) layers: These layers reduce the resolution of the image, and thereby the required computation power, by taking the maximum or average of non-overlapping square patches of a pre-defined size from the image. This size is often 2 x 2 voxels. Pool layers help to reduce overfitting.
3. Fully connected (FC) layers: FC layers, or dense layers, are generally used at the end of a CNN for accumulating all products of the previous layer and the previous filters. Here, the filter size should be equivalent to the size of the input volume.
4. Softmax layer: A softmax layer is added to the CNN. This layer computes probabilities for all patches by squashing the outputs of each class to be in the range [0, 1] in such a way that the sum of all outputs is 1. The softmax function is given by:
$$\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \qquad (5)$$
where $z$ is the vector of $K$ outputs indexed by $j$.
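The layer types listed above can be sketched in plain Python on a small 2D example. This is an illustrative sketch under assumptions, not the implementation used in this study: the image, filter and bias values are made up, no zero padding is applied, and batch normalization (equation 4) is shown on a flat list of outputs.

```python
import math


def convolve2d(image, kernel, bias, stride=1):
    """Slide the filter over the image (no padding) and add the bias."""
    k = len(kernel)
    out = []
    for r in range(0, len(image) - k + 1, stride):
        row = []
        for c in range(0, len(image[0]) - k + 1, stride):
            acc = bias
            for i in range(k):
                for j in range(k):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out


def relu(feature_map):
    """ReLU activation: f(x) = max(0, x) for every value."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def batch_normalize(values, eps=1e-5):
    """Equation 4 applied to a flat list of outputs from one minibatch."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def max_pool(feature_map, size=2):
    """Maximum over non-overlapping size x size blocks."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [[max(feature_map[r * size + i][c * size + j]
                 for i in range(size) for j in range(size))
             for c in range(cols)] for r in range(rows)]


def softmax(z):
    """Equation 5; subtracting max(z) only improves numerical stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 5 x 5 image and 3 x 3 filter, stride 2, bias 1.
image = [[1, 0, 2, 1, 0],
         [0, 1, 0, 2, 1],
         [2, 0, 1, 0, 2],
         [1, 2, 0, 1, 0],
         [0, 1, 2, 0, 1]]
kernel = [[1, 0, -1],
          [0, 1, 0],
          [-1, 0, 1]]

features = relu(convolve2d(image, kernel, bias=1.0, stride=2))
pooled = max_pool(features)
probabilities = softmax([1.0, 2.0, 3.0])
print(features, pooled, probabilities)
```

With stride 2 the 3 x 3 filter fits at only 2 x 2 positions of the 5 x 5 image, which illustrates why zero padding is needed when the output should keep the size of the input.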
3 Methods and materials
Manually marked edema and ICH were used as ground truth in the algorithm. We trained our CNN on 80 percent of our scans and validated the algorithm by comparing the edema segmentations and volumes computed by the algorithm with the manually annotated ground truth segmentations in the remaining 20 percent of CT scans. Because the combination of edema and ICH is a unique structure, we chose to train our CNN model from scratch.
3.1 Dataset
For this study, the image dataset of the PATCH study was used [24]. This dataset comprised CT images of 190 patients with ICH that were taken on admission and 24 hours after admission. Patients were older than 18 years and had a Glasgow Coma Scale score between 8 and 15. The amount of intraventricular blood was less than a sedimentation in the posterior horns of the lateral ventricles, and no haematoma was present that was suggestive of epidural, subdural, aneurysmal or arterio-venous malformation haematoma.
3.2 Inclusion and exclusion
Because CNN training requires examples of edema, we only used CT scans in which edema was visible. Furthermore, scans with large artefacts such as beam hardening were excluded. To prevent overfitting on the edema of one patient, we excluded the scan taken 24 hours after admission when the 2 scans of one patient were similar. Similarity was subjectively determined by the main investigator. Finally, 154 patients with 241 scans were included. This set included 135 scans taken on admission and 106 scans taken 24 hours after admission.
Figure 5: Example of a manual segmentation with the hemorrhage segmentation in red and the edema segmentation in green.
3.3 Pre-processing
3.3.1 Manual segmentations
Each scan was manually segmented, using ITK-SNAP, by 3-5 trained observers (table 2). For ground truth, we used manual segmentations that were performed by a trained neurologist and checked by 2 radiologists. To assess interobserver variability, 1 master student Medical Informatics and 2 PhD students of the department of Biomedical Engineering and Physics also segmented the scans. First, ICH was segmented separately. Edema surrounds the ICH, as shown in figure 5. If edema were segmented separately, the edema and ICH masks would be likely to overlap, giving voxels multiple labels. Therefore, edema and ICH were segmented together. Finally, the ICH segmentations were subtracted from the combined segmentations in MATLAB to obtain the edema segmentations.
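The mask subtraction described above was done in MATLAB; the following is only an illustrative Python sketch of the same idea on one small 2D slice. The function name `subtract_masks` and the example masks are hypothetical.

```python
def subtract_masks(combined, ich):
    """Edema = voxels in the combined (edema + ICH) mask that are not ICH."""
    return [[1 if c and not h else 0 for c, h in zip(c_row, h_row)]
            for c_row, h_row in zip(combined, ich)]


# Hypothetical binary masks for one slice.
combined = [[0, 1, 1, 1, 0],
            [0, 1, 1, 1, 0],
            [0, 1, 1, 1, 0]]
ich = [[0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0]]

edema = subtract_masks(combined, ich)
for row in edema:
    print(row)
```

By construction no voxel carries both the edema and the ICH label, which is the property the combined segmentation was meant to guarantee.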
Figure 6: Three example patches of [19 × 19] voxels for the 3 categories: edema, brain and hemorrhage.
3.3.2 Patches
The edema and ICH segmentations were used as binary masks to determine the classification of a voxel. Brain masks were made by stripping the skull and subtracting the edema and ICH masks. Input images for the CNN were patches of [19 × 19] or [17 × 17] voxels (figure 6). Scans were randomly divided into 5 sets, where 4 sets contained 48 scans and 1 set contained 49 scans. For each training run, one set was used as the test set and the 4 other sets were used as the training set. One set contained 7 scans with inconsistent slice thicknesses. This set was never used as a test set, to prevent incorrect volume quantification.
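A random division of 241 scans into 5 sets of 48 or 49 scans could look like the sketch below. The scan identifiers, the fixed seed and the choice of which set serves as test set are assumptions for illustration only.

```python
import random


def split_into_sets(scan_ids, n_sets=5, seed=0):
    """Shuffle the scans and cut them into n_sets nearly equal sets."""
    ids = list(scan_ids)
    random.Random(seed).shuffle(ids)
    base, extra = divmod(len(ids), n_sets)
    sets, start = [], 0
    for k in range(n_sets):
        size = base + (1 if k < extra else 0)
        sets.append(ids[start:start + size])
        start += size
    return sets


sets = split_into_sets(range(241))
test_index = 1  # hypothetical choice of test set for one training run
test_set = sets[test_index]
training_set = [scan for k, fold in enumerate(sets) if k != test_index
                for scan in fold]
print([len(s) for s in sets], len(training_set), len(test_set))
```

Repeating the training with a different `test_index` each time gives one model per test set while every scan is tested at most once.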
For every scan of the training set, an equal number of edema, ICH and brain patches was taken. Normal and horizontally mirrored patches were taken around voxels to increase the number of patches. To maintain an equal number of patches for all classes, we determined which class contained the smallest number of voxels, i.e. ICH or edema. We took both normal and mirrored patches around every voxel of this class.
An equal number of patches was taken around randomly selected voxels for the class with the second smallest number of voxels. If this class had too few voxels to obtain this number of patches (when its number of voxels was smaller than twice the number of voxels of the smallest class), we mirrored patches of randomly selected voxels to obtain this number. The "brain" class always had the largest number of voxels, so normal brain patches were taken around randomly selected voxels until the number of brain patches was equal to the number of patches of the other classes.
All patches were labeled corresponding to the tissue of the center voxel. Therefore, it is possible that a patch is given a certain label, but also includes voxels of another label. Patches of all scans were put together in one file and randomly rearranged.
In contrast to the training set, where patches were obtained for a selected part of all voxels of each scan, test set patches were obtained for every voxel of each scan without class balancing or data augmentation.
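The patch extraction and labeling can be sketched as follows. This is a simplified illustration under assumptions: voxels outside the image are filled with 0 (the text does not state how borders were handled), and the balancing helper simply samples every class down to the size of the smallest class rather than reproducing the exact mirroring rules described above. All function names are hypothetical.

```python
import random


def extract_patch(image, row, col, size=19):
    """Square patch centred on (row, col); voxels outside the image are 0."""
    half = size // 2
    patch = []
    for r in range(row - half, row + half + 1):
        patch_row = []
        for c in range(col - half, col + half + 1):
            inside = 0 <= r < len(image) and 0 <= c < len(image[0])
            patch_row.append(image[r][c] if inside else 0)
        patch.append(patch_row)
    return patch


def mirror(patch):
    """Horizontally mirrored copy of a patch (data augmentation)."""
    return [list(reversed(row)) for row in patch]


def balanced_centres(voxels_per_class, seed=0):
    """Pick the same number of centre voxels for every class."""
    rng = random.Random(seed)
    n = min(len(voxels) for voxels in voxels_per_class.values())
    return {label: rng.sample(voxels, n)
            for label, voxels in voxels_per_class.items()}


# Hypothetical 5 x 5 slice; the label of a patch is that of its centre voxel.
image = [[r * 10 + c for c in range(5)] for r in range(5)]
patch = extract_patch(image, 2, 2, size=3)
flipped = mirror(patch)
centres = balanced_centres({"edema": [(0, 0), (0, 1)],
                            "brain": [(1, 0), (1, 1), (1, 2), (1, 3)]})
print(patch, flipped, centres)
```

For the test set, `extract_patch` would instead be called for every voxel of the scan, without sampling or mirroring.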
3.3.3 Thin- and thick sliced images
The PATCH study dataset consisted of images with slice thicknesses ranging from 0.45 mm to 10 mm. Observer 1 segmented the images at their original thickness. CT scans were resampled from thin slices (< 4 mm) to thick slices (> 4 mm) in MATLAB by averaging the HU values of the thin slices. Thereafter, observers 2-4 segmented edema and ICH in the resampled images. Convert3D from SimpleITK [32] was used to increase the slice thickness of the segmentations of observer 1, using nearest-neighbor interpolation.
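The resampling itself was done in MATLAB; the sketch below only illustrates the averaging idea in Python. It assumes equally thick slices that are merged in groups of a fixed integer size, and it drops leftover slices at the end, neither of which is specified in the text.

```python
def average_slices(volume, factor):
    """Merge every `factor` consecutive slices by averaging their HU values.

    volume is a list of 2D slices (lists of rows); leftover slices that do
    not fill a complete group are dropped (an illustrative assumption).
    """
    thick = []
    for start in range(0, len(volume) - factor + 1, factor):
        group = volume[start:start + factor]
        rows, cols = len(group[0]), len(group[0][0])
        thick.append([[sum(s[r][c] for s in group) / factor
                       for c in range(cols)] for r in range(rows)])
    return thick


# Four hypothetical thin slices of 1 x 2 voxels, merged two by two.
thin = [[[0, 10]], [[2, 20]], [[4, 30]], [[6, 40]]]
thick = average_slices(thin, 2)
print(thick)
```

Averaging is suitable for HU values but not for label masks, which is why the segmentations were resampled with nearest-neighbor interpolation instead.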
3.4 Hardware and Software
All experiments were performed using an Intel Xeon E5-1620 CPU with 32 GB of RAM and an NVIDIA GeForce GTX 1080 GPU. The Microsoft Cognitive Toolkit (CNTK) version 1.7.2 [33] was used for building the CNNs. SimpleITK [32] was used for visualizing segmentations and manually segmenting edema and ICH on the CT scans.
3.5 CNN training
3.5.1 CNN architecture
Earlier research identified an optimal architecture for subarachnoid hemorrhage segmentation [18]. Because edema has similar features and our study also involves ICH segmentation, we used that architecture. It takes patches of [19 × 19] voxels and consists of 2 Conv layers with 128 and 256 filters respectively, each of size [5 × 5] voxels. Weights were randomly initialized from a Gaussian distribution with zero mean and a standard deviation of $0.2 \cdot \sqrt{n_{\text{features}}}$, where the number of features $n_{\text{features}}$ was calculated by multiplying the number of voxels of one filter by the number of filters. Zero padding and a bias node were used for each convolution. The bias node was initialized as 0. A ReLU function was used as the activation function. Each Conv layer was followed by a [2 × 2] max Pool layer. An FC layer with 256 nodes and a softmax layer were added at the end of the CNN. Training used a learning rate of 0.0006 for the first 50 minibatches, 0.0003 for the following 100 minibatches and 0.0001 for the last 50 minibatches.
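To follow how a patch shrinks through this architecture, the sketch below tracks the feature-map shape layer by layer and encodes the learning-rate schedule. It assumes that zero padding keeps the spatial size of each convolution unchanged and that pooling rounds odd sizes down; both are assumptions, since the text does not state them explicitly. The function names are hypothetical.

```python
def conv_same(shape, n_filters):
    """Zero-padded convolution: spatial size unchanged, depth = n_filters."""
    height, width, _ = shape
    return (height, width, n_filters)


def max_pool_2x2(shape):
    """2 x 2 max pooling: halves the spatial size, rounding down."""
    height, width, depth = shape
    return (height // 2, width // 2, depth)


def learning_rate(minibatch_index):
    """Stepwise schedule from the text (0-based minibatch index)."""
    if minibatch_index < 50:
        return 0.0006
    if minibatch_index < 150:
        return 0.0003
    return 0.0001


shape = (19, 19, 1)                            # one input patch
shape = max_pool_2x2(conv_same(shape, 128))    # (9, 9, 128)
shape = max_pool_2x2(conv_same(shape, 256))    # (4, 4, 256)
fc_inputs = shape[0] * shape[1] * shape[2]     # 4096 values feed the FC layer
print(shape, fc_inputs, learning_rate(0), learning_rate(199))
```

Under the same assumptions a [17 × 17] patch would reach the FC layer with the same (4, 4, 256) shape, because 17 also halves to 8 and then to 4.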
3.5.2 Training models
The characteristics of the training images and the CNN architecture used in the training models are presented in table 1. The CNN was trained on segmentations of 1 or 2 observers and tested on segmentations of observer 1 (table 2). When training on multiple observers, segmentations were taken from both observers, so each scan was included twice.
Table 1: Training models with information about the observer that segmented the training images (table 2), slice thickness of training images, patch sizes used for CNN training, inclusion of difficult patches, and the number of training epochs.

Training code   Observer segm.   Slice thickness (mm)   Patch size   Difficult patches   Number of training epochs
A               1                0.45-10                19 × 19      No                  100
B               1                > 4                    19 × 19      No                  100
C               1                > 4                    17 × 17      No                  100
D               1                > 4                    19 × 19      Yes                 100
E               2                > 4                    19 × 19      No                  100
F               1 & 2            > 4                    19 × 19      No                  100
G               1                0.45-10                19 × 19      No                  200
H               1                0.45-10                17 × 17      No                  200
Table 2: Observer information

Observer   Occupation       Nr. of scans
1          Neurologist      239
2          Master student   239
3          PhD              45
4          PhD              48
3.5.3 Training on difficult patches
Multiple structures in the brain have HU values and patterns similar to those of edema. This can lead to misclassification of these brain voxels by the CNN. Training on difficult brain voxels can decrease misclassifications. For this training model, we trained the CNN twice, where the first training was needed to determine which brain patches were too difficult to classify correctly as brain. During this training, the training patches included all edema and ICH patches, and random brain patches. All brain patches of the training set were then classified with the first CNN, and the patches that were misclassified as edema with a probability of more than 90% were marked as "difficult". For the second training, the CNN was trained from scratch on all edema and ICH patches, and all difficult brain patches supplemented with randomly selected brain patches.
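The selection step between the two trainings can be sketched as follows. The patch identifiers, the probability dictionary and the function names are hypothetical; only the 90% threshold comes from the text.

```python
import random


def select_difficult(brain_patch_ids, edema_probability, threshold=0.90):
    """Brain patches the first CNN labels as edema with high probability."""
    return [pid for pid in brain_patch_ids
            if edema_probability[pid] > threshold]


def second_round_brain_patches(brain_patch_ids, edema_probability,
                               n_needed, seed=0):
    """All difficult brain patches, topped up with random other ones."""
    difficult = select_difficult(brain_patch_ids, edema_probability)
    difficult_set = set(difficult)
    remaining = [pid for pid in brain_patch_ids if pid not in difficult_set]
    n_random = max(0, n_needed - len(difficult))
    filler = random.Random(seed).sample(remaining,
                                        min(n_random, len(remaining)))
    return difficult + filler


# Hypothetical edema probabilities from the first CNN for 5 brain patches.
probabilities = {0: 0.95, 1: 0.20, 2: 0.91, 3: 0.50, 4: 0.90}
brain_ids = [0, 1, 2, 3, 4]

difficult = select_difficult(brain_ids, probabilities)
second_round = second_round_brain_patches(brain_ids, probabilities, n_needed=4)
print(difficult, second_round)
```

Patch 4 is not marked as difficult here because its probability equals the threshold rather than exceeding it.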
3.6 Statistical analyses
The area under the curve (AUC) of receiver operating characteristic (ROC) curves was calculated by comparing CNN classifications with manual segmentations. Furthermore, we used Dice scores to compare CNN and manual segmentations and to quantify interobserver variability. Dice scores are calculated by:
$$\text{Dice} = \frac{2\,|X \cap Y|}{|X| + |Y|}, \qquad (6)$$
where X and Y are 2 segmented volumes. Intraclass correlation coefficient (ICC) was used for calculating the variability in edema volumes between CNN segmentations and manual segmentations and again to demonstrate interobserver variability.
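Equation 6 can be sketched for two binary masks given as flat lists of 0 and 1. Returning 1.0 when both masks are empty is an assumption made here to avoid division by zero; the text does not say how that case was handled.

```python
def dice(mask_a, mask_b):
    """Equation 6 for two binary masks given as flat lists of 0/1."""
    a = {i for i, v in enumerate(mask_a) if v}
    b = {i for i, v in enumerate(mask_b) if v}
    if not a and not b:
        return 1.0  # assumption: two empty masks count as full agreement
    return 2.0 * len(a & b) / (len(a) + len(b))


# Two hypothetical segmentations of 4 voxels that overlap in 1 voxel.
print(dice([1, 1, 0, 0], [1, 0, 1, 0]))
```

A Dice score of 1 means identical segmentations and 0 means no overlap at all, so small structures are penalized heavily for every misplaced voxel.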
3.7 Post-processing
By examining scans with low Dice scores between the CNN edema segmentation and the segmentation of observer 1, we found that some scans had been manually segmented incorrectly: multiple slices were not segmented. These scans were excluded. The first version of the edema and ICH segmentations was produced by using the probability value that gave the best average Dice score as threshold. Voxels classified with a probability above this threshold were included in the segmentation. Since small particles of voxels connected by a 6-connected neighborhood were likely to be inaccurate, we used additional particle removal thresholds (PRTs) for edema and ICH particles. These thresholds were determined for multiple probability thresholds using the smallest particle sizes of the accurate connected components of 3 test sets. Additionally, edema particles without neighbouring ICH particles were eliminated, considering that edema always surrounds ICH. Using 3 test sets, we tested which combination of PRTs and probability thresholds for edema and ICH obtained the best Dice scores. We then tested this combination on the 4th test set using Dice scores and ICC values.
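The three post-processing steps (probability threshold, particle removal and the ICH adjacency rule) can be sketched on sets of voxel coordinates. This is an illustration under assumptions: particles of at least `min_size` voxels are kept, and "neighbouring" is taken to mean sharing a face with an ICH voxel; the text does not specify either detail. All names and example values are hypothetical.

```python
from collections import deque

# The 6 face neighbours of a voxel (z, y, x).
NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
              (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def threshold(probabilities, cutoff):
    """Keep voxels whose class probability exceeds the cutoff."""
    return {voxel for voxel, p in probabilities.items() if p > cutoff}


def particles(voxels):
    """Split a set of voxel coordinates into 6-connected components."""
    unvisited = set(voxels)
    components = []
    while unvisited:
        seed = unvisited.pop()
        component, queue = {seed}, deque([seed])
        while queue:
            z, y, x = queue.popleft()
            for dz, dy, dx in NEIGHBOURS:
                n = (z + dz, y + dy, x + dx)
                if n in unvisited:
                    unvisited.remove(n)
                    component.add(n)
                    queue.append(n)
        components.append(component)
    return components


def remove_small(voxels, min_size):
    """Particle removal: drop components smaller than min_size voxels."""
    kept = set()
    for component in particles(voxels):
        if len(component) >= min_size:
            kept |= component
    return kept


def keep_edema_next_to_ich(edema, ich):
    """Drop edema particles that do not touch any ICH voxel."""
    kept = set()
    for component in particles(edema):
        if any((z + dz, y + dy, x + dx) in ich
               for z, y, x in component for dz, dy, dx in NEIGHBOURS):
            kept |= component
    return kept


# Hypothetical example: one edema particle touches the ICH, one does not.
edema_probabilities = {(0, 0, 2): 0.99, (0, 0, 3): 0.97,
                       (0, 5, 5): 0.96, (0, 9, 9): 0.50}
ich = {(0, 0, 0), (0, 0, 1)}

edema = threshold(edema_probabilities, 0.95)
edema = keep_edema_next_to_ich(remove_small(edema, 1), ich)
print(sorted(edema))
```

Working on coordinate sets keeps the sketch short; on full CT volumes an array-based connected-component labeling would be far more efficient.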
4 Results
In the 260 scans, the median edema volume was 6.8 ml according to expert 1 and 9.0 ml according to the CNN. We excluded 2 scans in set 1 and 2 scans in set 4 due to incomplete segmentation by expert 1. Training models that used thick sliced images included around 1.4 billion input patches and training models that used thin sliced images included around 2.9 billion input patches.
4.1 Inter observer variability
Figure 7 shows scatter plots of interobserver variability, comparing the volumes measured by pairs of observers. Dice scores and ICC values are presented in table 3. Average Dice scores varied from 0.43 to 0.66, with an overall average Dice score between 2 observers of 0.52. ICC values varied from 0.75 to 0.90, with an average ICC of 0.83.
Table 3: Inter observer variabilities. The first column lists the observers that are compared and the second column the number of scans that are compared. Dice scores are given as average Dice ± standard deviation [minimum Dice, maximum Dice].

Observers   Number of scans   Dice                       ICC average [95% CI]   P-value
1 & 2       159               0.50 ± 0.15 [0.14, 0.82]   0.90 [0.87-0.93]       < 0.001
1 & 3       45                0.54 ± 0.14 [0.17, 0.77]   0.81 [0.68-0.89]       < 0.001
1 & 4       31                0.43 ± 0.18 [0.08, 0.74]   0.75 [0.45-0.87]       < 0.001
2 & 3       30                0.53 ± 0.15 [0.11, 0.72]   0.87 [0.74-0.93]       < 0.001
2 & 4       48                0.66 ± 0.09 [0.40, 0.79]   0.82 [0.66-0.91]       < 0.001
3 & 4       30                0.48 ± 0.15 [0.06, 0.70]   0.82 [0.66-0.91]       < 0.001
4.2 Optimal probability threshold
Figure 8 shows average Dice scores vs. probability thresholds for edema segmentations and for ICH segmentations. This figure shows the results of model G since this model achieved the best average Dice score. Optimal probability thresholds are 0.95 for edema segmentation and 0.93 for ICH segmentation. Further results use these thresholds. Figure 9 shows that the Dice score is relatively low for small volumes.
Figure 8: Average Dice scores vs. probabilities used as threshold for binary voxel classification of edema (blue) and ICH (red).
Figure 9: Dice scores over edema volumes.
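The search for the optimal probability threshold amounts to a sweep over candidate thresholds, keeping the one with the highest average Dice score over all scans. The sketch below illustrates this with made-up probabilities and ground truth masks; the candidate list and the function names are assumptions.

```python
def dice(pred, truth):
    """Dice score for two binary masks given as flat lists of 0/1."""
    overlap = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(pred) + sum(truth)
    return 1.0 if total == 0 else 2.0 * overlap / total


def best_threshold(scans, candidates):
    """scans: list of (probabilities, truth_mask) pairs, one per scan."""
    best, best_score = None, -1.0
    for cutoff in candidates:
        scores = [dice([1 if p > cutoff else 0 for p in probs], truth)
                  for probs, truth in scans]
        mean = sum(scores) / len(scores)
        if mean > best_score:
            best, best_score = cutoff, mean
    return best, best_score


# Two hypothetical scans with per-voxel edema probabilities and ground truth.
scans = [([0.99, 0.97, 0.60, 0.10], [1, 1, 0, 0]),
         ([0.96, 0.94, 0.92, 0.20], [1, 0, 0, 0])]

cutoff, score = best_threshold(scans, [0.50, 0.90, 0.95])
print(cutoff, score)
```

In this toy example the highest candidate wins because the second scan contains several confidently wrong voxels, mirroring the high optimal thresholds reported above.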
4.3 CNN performance
4.3.1 Performance of training models on ICH segmentations
Table 4 gives the AUC values for each ROC curve in figure 10, the average Dice scores, and the ICC values for ICH segmentations achieved by the training models. Our CNN performed better on ICH segmentation than on edema segmentation, with AUC values ranging from 0.96 to 0.99, Dice scores between 0.70 and 0.74 and ICC values between 0.63 and 0.75.
Figure 10: ROC curves regarding the classification of voxels as ICH for training models A to H (horizontal axis: false positive rate).
Table 4: Performances on ICH segmentations presented by the area under the ROC curve (AUC), average Dice coefficient with the standard deviation and the intraclass correlation coefficient (ICC) of the volumes. Dice scores are given as average Dice ± standard deviation [minimum Dice, maximum Dice].

Training   AUC    Dice average ± stdev [min, max]   ICC average [95% CI]   P-value
A          0.99   0.74 ± 0.14 [0.24, 0.93]          0.75 [0.59-0.85]       < 0.001
B          0.99   0.74 ± 0.14 [0.28, 0.93]          0.75 [0.60-0.85]       < 0.001
C          0.99   0.71 ± 0.15 [0.16, 0.92]          0.74 [0.58-0.85]       < 0.001
D          0.97   0.74 ± 0.17 [0.31, 0.93]          0.74 [0.57-0.84]       < 0.001
E          0.97   0.74 ± 0.17 [0.23, 0.93]          0.74 [0.58-0.84]       < 0.001
F          0.96   0.70 ± 0.22 [0.16, 0.93]          0.70 [0.53-0.82]       < 0.001
G          0.99   0.73 ± 0.16 [0.24, 0.92]          0.74 [0.58-0.85]       < 0.001
H          0.99   0.71 ± 0.17 [0.15, 0.93]          0.63 [0.43-0.78]       < 0.001
4.3.2 Performance of training models on edema segmentations
Table 5 presents AUC values, average Dice scores and ICC values for edema segmentations, with corresponding ROC curves in figure 11. AUC values range from 0.77 to 0.93, average Dice scores range from 0.08 to 0.36 and ICC values range from 0.05 to 0.73. We found that model H obtained the best AUC value (0.93), model G the best average Dice score (0.36), and model A the best ICC value (0.73). Models A and G both have an AUC of 0.92. Figure 12 shows the volume correlation between model G and manual segmentations in a scatter plot.
Figure 11: ROC curves regarding the classification of voxels as edema, with one curve per training model (A to H). Horizontal axis: false positive rate (0.0 to 1.0).
Table 5: Performances on edema segmentations presented by the area under the ROC curve (AUC), average Dice coefficient with the standard deviation and the intraclass correlation coefficient (ICC) of the volumes.

Training   AUC    Dice average ± stdev [min, max]   ICC average [95% CI]   P-value
A          0.92   0.32 ± 0.14 [0.04, 0.61]          0.73 [0.55-0.84]       < 0.001
B          0.91   0.33 ± 0.13 [0.08, 0.54]          0.51 [0.26-0.70]       < 0.001
C          0.92   0.33 ± 0.11 [0.07, 0.53]          0.47 [0.21-0.67]       < 0.001
D          0.77   0.08 ± 0.07 [0.00, 0.36]          0.05 [-0.24-0.33]      0.37
E          0.82   0.25 ± 0.11 [0.01, 0.44]          0.31 [0.03-0.55]       0.02
F          0.82   0.24 ± 0.11 [0.01, 0.44]          0.33 [0.04-0.56]       0.01
G          0.92   0.36 ± 0.13 [0.07, 0.57]          0.47 [0.21-0.67]       < 0.001
H          0.93   0.34 ± 0.13 [0.03, 0.56]          0.49 [0.24-0.68]       < 0.001
4.3.3 Performance on test sets
Results for the 4 test sets were obtained using model G and the 4 training sets. ROC curves for voxel classification of the test sets are presented in figure 13. AUC, Dice scores and ICC values for the 4 sets are given in table 6. Average Dice scores range from 0.29 to 0.39 and ICC values range from 0.47 to 0.64.
Figure 12: Correlation of edema volumes between model G and observer 1.
Figure 13: ROC curves on the 4 test sets. Horizontal axis: false positive rate (0.0 to 1.0).
Table 6: Performances of multiple test sets trained on multiple training sets presented by the area under the ROC curve (AUC), average Dice coefficient with the standard deviation and the intraclass correlation coefficient (ICC) of the volumes. Dice scores are given as Average Dice ± standard deviation [minimum Dice, maximum Dice].

Block      AUC    Dice                       ICC average [95% CI]   P-value
1 (n=46)   0.92   0.36 ± 0.13 [0.07, 0.57]   0.47 [0.21-0.67]       < 0.001
2 (n=48)   0.93   0.31 ± 0.13 [0.06, 0.56]   0.64 [0.44-0.79]       < 0.001
3 (n=48)   0.83   0.29 ± 0.13 [0.04, 0.55]   0.47 [0.21-0.67]       < 0.001
4 (n=46)   0.94   0.39 ± 0.10 [0.17, 0.61]   0.52 [0.27-0.70]       < 0.001
4.4 Post-processing
Table 7 presents the smallest edema and ICH segments, consisting of voxels joined by a 6-connected neighborhood, that included correctly classified voxels. Because a 0.95 probability threshold for ICH classification did not increase ICH Dice scores, we did not use this threshold for post-processing. Likewise, we did not use a 0.70 probability threshold for edema classification, since lower probability thresholds resulted in lower Dice scores. Table 8 shows Dice scores for various probability thresholds and segment size thresholds (particle removal thresholds, PRTs) for both edema and ICH segmentations. The best Dice scores were obtained with a probability threshold of 0.90 for both edema and ICH, combined with particle removal thresholds of 700 voxels for edema and 500 voxels for ICH. We used these thresholds to post-process the segmentations of all models. Table 9 presents Dice scores and ICC values for the models after post-processing the CNN segmentations. Model A achieved the best average Dice score (0.44), with an ICC value of 0.68. The correlation between the edema volumes of this model and those of the manual segmentations is shown in figure 14. Smaller edema volumes correlate better than larger edema volumes. Example segmentations of this model are shown in figure 15.
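The post-processing step can be sketched as follows: threshold the probability map, group the remaining voxels into 6-connected particles, and discard particles below the particle removal threshold. This is a minimal sketch in plain Python under stated assumptions, not the implementation used in this study: the names `particles` and `post_process`, the dictionary representation of a volume, and the choice to keep particles whose size equals the threshold are all illustrative.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Voxel = Tuple[int, int, int]

# Face-sharing neighbours only: the 6-connected neighbourhood in 3D.
NEIGHBOURS_6 = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def particles(voxels: Set[Voxel]) -> List[Set[Voxel]]:
    """Group foreground voxels into 6-connected particles (breadth-first search)."""
    remaining = set(voxels)
    found = []
    while remaining:
        seed = remaining.pop()
        particle, queue = {seed}, deque([seed])
        while queue:
            z, y, x = queue.popleft()
            for dz, dy, dx in NEIGHBOURS_6:
                nb = (z + dz, y + dy, x + dx)
                if nb in remaining:
                    remaining.remove(nb)
                    particle.add(nb)
                    queue.append(nb)
        found.append(particle)
    return found


def post_process(prob: Dict[Voxel, float], prob_threshold: float, prt: int) -> Set[Voxel]:
    """Threshold a {voxel: probability} map, then drop particles smaller than prt."""
    foreground = {v for v, p in prob.items() if p >= prob_threshold}
    kept: Set[Voxel] = set()
    for particle in particles(foreground):
        # Assumption: particles of exactly `prt` voxels are kept.
        if len(particle) >= prt:
            kept |= particle
    return kept


# Toy volume: a 3-voxel particle, an isolated voxel and a low-probability voxel.
prob = {(0, 0, 0): 0.95, (0, 0, 1): 0.92, (0, 1, 1): 0.97,
        (2, 2, 2): 0.99, (0, 0, 2): 0.50}
mask = post_process(prob, prob_threshold=0.90, prt=3)
```

With the thresholds reported above, the corresponding calls would use `prob_threshold=0.90` with `prt=700` for edema and `prt=500` for ICH; the toy value `prt=3` is used here only to keep the example small.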
Table 7: Smallest sizes of particles (in nr. of voxels) that include correctly classified edema or ICH voxels.

Probability threshold   Edema   ICH
0.70                    -       744
0.80                    793     669
0.90                    708     545
0.95                    591     -
Table 8: Dice scores for multiple probability thresholds and PRTs for Edema and ICH segments.

Edema probability   Edema PRT   ICH probability   ICH PRT   Dice score
0.80                700         0.70              700       0.36 ± 0.18 [0.05, 0.77]
0.90                700         0.70              700       0.37 ± 0.16 [0.04, 0.71]
0.90                700         0.80              600       0.38 ± 0.16 [0.02, 0.71]
0.90                700         0.90              500       0.40 ± 0.16 [0, 0.71]
0.95                500         0.80              600       0.33 ± 0.17 [0, 0.68]
Figure 14: Correlation of edema volumes between model A after post-processing and observer 1.
Table 9: Performances after post processing presented by average Dice coefficient with the standard deviation and the intraclass correlation coefficient of the volumes.

Training   Dice                    ICC average [95% CI]   P-value
A          0.44 ± 0.17 [0, 0.73]   0.68 [0.49-0.80]       < 0.001
B          0.42 ± 0.17 [0, 0.66]   0.65 [0.44-0.78]       < 0.001
C          0.41 ± 0.16 [0, 0.72]   0.57 [0.33-0.73]       < 0.001
D          0.06 ± 0.08 [0, 0.37]   0.63 [0.42-0.77]       < 0.001
E          0.32 ± 0.15 [0, 0.58]   0.43 [0.17-0.64]       < 0.001
F          0.29 ± 0.15 [0, 0.57]   0.40 [0.14-0.61]       0.002
G          0.43 ± 0.17 [0, 0.72]   0.61 [0.40-0.76]       < 0.001
H          0.41 ± 0.18 [0, 0.66]   0.69 [0.50-0.81]       < 0.001
Figure 15: Example scans (a), (b) and (c) with segmentations of segmenter 1 (left), segmentations computed by the CNN (middle) and segmentations after post-processing (right).
5 Discussion
In this study, we developed and evaluated multiple CNN architectures to automatically segment hemorrhage and edema in patients with an intracerebral hemorrhage. Although high AUCs were obtained, the spatial agreement, expressed as the Dice coefficient, and the volumetric agreement, expressed as the ICC, were somewhat lower than the agreement between multiple observers. The average Dice scores achieved by the best models after post-processing (up to 0.44) are not much lower than the average inter-observer Dice scores (0.43 to 0.66). Therefore, we conclude that our CNN produces acceptable segmentations. However, ICC values (at most 0.73) were noticeably lower than inter-observer ICC values (0.75 to 0.90). Overall, this suggests that CNNs are promising for edema segmentation, although volumetric agreement still falls short of the agreement between observers.
To our knowledge, our study is the first to use a CNN for edema segmentation. Bardera et al. (2009) previously proposed a promising semi-automated method for edema segmentation using a region growing approach and tested it on 18 scans [34]. In contrast to their study, our study provides a fully automated method for edema segmentation. Moreover, in this study, edema segmentation was performed on CT scans, which are more common in clinical practice than MRI scans because of their high availability and imaging speed.
We showed that results differed little between test sets. We used the same test set for testing every model, since the Dice score and ICC are strict measures that are sensitive to size differences. Segmentations of edema with large volumes tend to have higher Dice scores than segmentations of edema with small volumes. Therefore, the composition of our test set, in terms of scans with large and small edema volumes, could have influenced average outcomes.
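The size dependence of the Dice score can be illustrated with a small calculation. Assuming, purely for illustration, that a segmentation misses a fixed number of voxels and adds the same number of false positives regardless of edema size, the Dice score comes out higher for the larger volume. The function name `dice_from_counts` and the voxel counts are hypothetical.

```python
def dice_from_counts(true_voxels: int, missed: int, false_positives: int) -> float:
    """Dice = 2*TP / (|prediction| + |truth|), computed from voxel counts."""
    tp = true_voxels - missed
    predicted = tp + false_positives
    return 2.0 * tp / (predicted + true_voxels)


# Same absolute error (200 missed and 200 false-positive voxels) on two volumes.
small = dice_from_counts(true_voxels=1_000, missed=200, false_positives=200)
large = dice_from_counts(true_voxels=10_000, missed=200, false_positives=200)
```

Here `small` evaluates to 0.80 and `large` to 0.98, so a test set with more small edema volumes would have a lower average Dice score even if the absolute errors were identical.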
The CNN architecture we used was based on a previous study that created this architecture for subarachnoid hemorrhage segmentation. Although subarachnoid hemorrhage is similar to edema in external structure, it differs in density and internal composition. Edema is often crescent-shaped and surrounds the hemorrhage. Furthermore, the high contrast between ICH voxels and brain voxels makes ICH easier to segment than edema. We found that our CNN performed well for ICH segmentation.
We tested only two patch sizes, of which [19 × 19] voxels achieved the best results for hemorrhage segmentation. For edema, outcomes did not differ greatly between the two patch sizes, but our CNN performed better with a patch size of [19 × 19] than with a patch size of [17 × 17]. Further research should test the performance of CNNs with larger patch sizes.
We used manual segmentations as the gold standard for our study. Due to the difficulty of edema segmentation, a manual segmentation is never perfect. Using MRI scans as the gold standard would have increased the reliability of our test results. Furthermore, manual segmentations of edema were obtained by subtracting manual segmentations of ICH from manual segmentations of edema and ICH combined. The border between edema and ICH was therefore set by the border of the ICH segmentation. As a result, less hyperdense voxels at the edge of the ICH were often not classified as ICH. Because edema often surrounds ICH and the two were segmented together, these border voxels were misclassified as edema. If hypodense edema had been taken into account when determining the ICH border, it would have been evident that these less hyperdense voxels were still part of the ICH.
Brain voxels with HU values similar to those of edema were often classified as edema. We tried to solve this problem by using a CNN model that was trained on difficult brain patches. However, this model performed poorly in edema segmentation. Training on brain patches that resembled edema caused our CNN to misclassify more than half of the edema patches as brain, because it was no longer able to distinguish edema from brain.
We showed, using thick-sliced images (1.4 billion input patches) and thin-sliced images (2.9 billion input patches), that training on double the number of input patches did not notably improve performance. Thus, we assume that using more training patches will not improve our edema segmentation method.
Although post-processing improved the edema segmentation accuracy, this method was not perfect. The optimal threshold differs between scans. A post-processing method that adjusts the particle removal threshold to the size of the largest particle in the CNN segmentation might retain more edema segments, since the largest particles generally include the correct edema segment. An additional improvement was obtained by the elimination of edema particles that are not adjacent to ICH. We believe that additional post-processing could improve results even further.
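The adaptive variant suggested above could look like the following sketch, which sets the particle removal threshold as a fraction of the largest particle in each scan instead of a fixed voxel count. This was not implemented or evaluated in this study; the function name `adaptive_particle_filter`, the `fraction` parameter, its default value and the particle sizes are hypothetical.

```python
from typing import Dict, List


def adaptive_particle_filter(sizes: Dict[int, int], fraction: float = 0.5) -> List[int]:
    """Return labels of particles at least `fraction` times the largest particle.

    `sizes` maps a particle label to its size in voxels.
    """
    if not sizes:
        return []
    # The removal threshold scales with the largest particle in this scan.
    threshold = fraction * max(sizes.values())
    return sorted(label for label, size in sizes.items() if size >= threshold)


# Hypothetical particle sizes (in voxels) from one CNN segmentation.
sizes = {1: 4200, 2: 650, 3: 2300, 4: 90}
kept = adaptive_particle_filter(sizes, fraction=0.5)
```

In this example the threshold becomes 2100 voxels, so particles 1 and 3 are kept. A suitable value for `fraction` would have to be tuned on validation data, just as the fixed thresholds were.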
Optimizing this automatic edema segmentation method could eventually save experts time. Our method could already be used as a semi-automated method in which experts manually correct the output segmentations, which is less time-consuming than manually segmenting edema from scratch.
Our study provides a basis for automatic edema segmentation using CNNs. Although our results are suboptimal, improvements with regard to CNN architecture and post-processing could further enhance edema segmentation.
6 Conclusion
We found that optimal results were achieved with a CNN consisting of 2 conv layers, 2 pool layers, an FC layer and a softmax function. Furthermore, training this CNN for 100 epochs with patches of [19 × 19] voxels achieved optimal results. Using CNNs, we provided a promising automatic hemorrhage and edema segmentation method for patients with ICH. The accuracy, as assessed with Dice scores and ICC values, approaches manual inter-observer variation.
References
1. S. Sacco, C. Marini, D. Toni, L. Olivieri, and A. Carolei, “Incidence and 10-year survival of intracerebral hemorrhage in a population-based registry,” Stroke, vol. 40, 2009.
2. M. E. Fewel, B. G. Thompson Jr., and J. T. Hoff, “Spontaneous intracerebral hemorrhage: a review,” Neurosurg Focus, vol. 15, no. 4, p. E1, 2003.
3. K. R. Lee, A. L. Betz, S. Kim, R. F. Keep, and J. T. Hoff, “The role of the coagulation cascade in brain edema formation after intracerebral hemorrhage,” Acta Neurochirurgica, vol. 138, no. 4, pp. 396–401, 1996.
4. K. R. Lee, N. Kawai, S. Kim, O. Sagher, and J. T. Hoff, “Mechanisms of edema formation after intracerebral hemorrhage: effects of thrombin on cerebral blood flow, blood-brain barrier permeability, and cell survival in a rat model,” Journal of Neurosurgery, vol. 86, no. 2, pp. 272–278, 1997.
5. F.-P. Huang, G. Xi, R. F. Keep, Y. Hua, A. Nemoianu, and J. T. Hoff, “Brain edema after experimental intracerebral hemorrhage: role of hemoglobin degradation products,” Journal of Neurosurgery, vol. 96, no. 2, pp. 287–293, 2002.
6. G. H. Xi, R. F. Keep, and J. T. Hoff, “Erythrocytes and delayed brain edema formation following intracerebral hemorrhage in rats,” Journal of Neurosurgery, vol. 89, no. 6, pp. 991–996, 1998.
7. M.-A. Babi and M. L. James, “Peri-Hemorrhagic Edema and Secondary Hematoma Expansion after Intracerebral Hemorrhage: From Benchwork to Practical Aspects,” Frontiers in Neurology, vol. 8, no. January, pp. 8–11, 2017.
8. A. W. Unterberg, J. Stover, B. Kress, and K. L. Kiening, “Edema and brain trauma,” Neuroscience, vol. 129, no. 4, pp. 1021–1029, 2004.
9. C. Venkatasubramanian, M. Mlynash, A. Finley-Caulfield, I. Eyngorn, R. Kalimuthu, R. W. Snider, et al., “Natural history of perihematomal edema after intracerebral hemorrhage measured by serial magnetic resonance imaging,” Stroke, vol. 42, no. 1, pp. 73–80, 2011.
10. S. B. Murthy, Y. Moradiya, J. Dawson, K. R. Lees, D. F. Hanley, and W. C. Ziai, “Perihematomal Edema and Functional Outcomes in Intracerebral Hemorrhage: Influence of Hematoma Volume and Location,” Stroke, vol. 46, no. 11, pp. 3088–3092, 2015.
11. G. Appelboom, S. S. Bruce, Z. L. Hickman, B. E. Zacharia, A. M. Carpenter, K. A. Vaughan, et al., “Volume-dependent effect of perihaematomal oedema on outcome for spontaneous intracerebral haemorrhages,” Journal of Neurology, Neurosurgery, and Psychiatry, vol. 84, no. 5, pp. 488–493, 2013.
12. J. M. Gebel, E. C. Jauch, T. G. Brott, J. Khoury, L. Sauerbeck, S. Salisbury, et al., “Relative edema volume is a predictor of outcome in patients with hyperacute spontaneous intracerebral hemorrhage,” Stroke, vol. 33, no. 11, pp. 2636–2641, 2002.
13. C. Kidwell, J. Chalela, J. Saver, S. Starkman, M. Hill, A. Demchuk, et al., “Comparison of MRI and CT for detection of acute intracerebral hemorrhage,” JAMA: the Journal of the American Medical Association, vol. 292, no. 15, pp. 1823–1830, 2004.
14. B. Volbers, D. Staykov, I. Wagner, A. Dorfler, M. Saake, S. Schwab, et al., “Semi-automatic volumetric assessment of perihemorrhagic edema with computed tomography,” European Journal of Neurology, vol. 18, no. 11, pp. 1323–1328, 2011.
15. A. M. M. Boers, “Automatic Quantification after Subarachnoid Hemorrhage on Non-Contrast Computed Tomography,” pp. 1–47, 2014.
16. P. b. Dvorak, K. Bartusek, and W. Kropatsch, “Automated segmentation of brain tumor edema in FLAIR MRI using symmetry and thresholding,” Progress in Electromagnetics Research Symposium, pp. 936–939, 2013.
17. T. Y. Wu, O. Sobowale, R. Hurford, G. Sharma, S. Christensen, N. Yassi, et al., “Software output from semi-automated planimetry can underestimate intracerebral haemorrhage and peri-haematomal oedema volumes by up to 41%,” Neuroradiology, vol. 58, no. 9, pp. 867–876, 2016.
18. R. S. Barros, W. E. van der Steen, M. Boers, I. Zijlstra, R. van den Berg, W. E. Youssoufi, et al., “Segmentation of Subarachnoid Hemorrhage in Computed Tomography Images Using Convolutional Neural Networks,” 2017.
19. H. R. Roth, A. Farag, L. Lu, E. B. Turkbey, and R. M. Summers, “Deep convolutional networks for pancreas segmentation in CT imaging,” 2015.
20. W. Shen, M. Zhou, F. Yang, C. Yang, and J. Tian, “Multi-scale convolutional neural networks for lung nodule classification,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 9123, pp. 588–599, 2015.
21. K. H. Cha, L. Hadjiiski, R. K. Samala, H.-P. Chan, E. M. Caoili, and R. H. Cohan, “Urinary bladder segmentation in CT urography using deep-learning convolutional neural network and level sets,” Medical Physics, vol. 43, no. 4, pp. 1882–1896, 2016.
22. B. Van Ginneken, A. A. A. Setio, C. Jacobs, and F. Ciompi, “Off-the-shelf convolutional neural network features for pulmonary nodule detection in computed tomography scans,” 2015 IEEE 12th International Symposium on Biomedical Imaging (ISBI), pp. 286–289, 2015.
23. M. Havaei, A. Davy, D. Warde-Farley, A. Biard, A. Courville, Y. Bengio, et al., “Brain tumor segmentation with Deep Neural Networks,” Medical Image Analysis, vol. 35, pp. 18–31, 2017.
24. K. de Gans, R. J. de Haan, C. B. Majoie, M. M. Koopman, A. Brand, M. G. Dijkgraaf, et al., “PATCH: platelet transfusion in cerebral haemorrhage: study protocol for a multicentre, randomised, controlled trial,” BMC Neurology, vol. 10, p. 19, 2010.
25. S. Hijazi, R. Kumar, C. Rowen, and C. I. Group, “Using Convolutional Neural Networks for Image Recognition,” Cadence Whitepaper, pp. 1–12, 2015.
26. K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, “What is the best multi-stage architecture for object recognition?,” 2009 IEEE 12th International Conference on Computer Vision, pp. 2146–2153, 2009.
27. P. Golik and P. Doetsch, “Cross-Entropy vs. Squared Error Training: a Theoretical and Experimental Comparison,” Interspeech, Isca, vol. 2, no. 2, pp. 1756–1760, 2013.
28. A. Ng, “Machine Learning,” 2017.
29. Y. LeCun, K. Kavukcuoglu, and C. Farabet, “Convolutional networks and applications in vision,” ISCAS 2010 - 2010 IEEE International Symposium on Circuits and Systems: Nano-Bio Circuit Fabrics and Systems, pp. 253–256, 2010.
30. A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” Advances In Neural Information Processing Systems, pp. 1–9, 2012.
31. S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” 2015.
32. P. A. Yushkevich, J. Piven, H. C. Hazlett, R. G. Smith, S. Ho, J. C. Gee, et al., “User-guided 3D active contour segmentation of anatomical structures: Significantly improved efficiency and reliability,” Neuroimage, vol. 31, no. 3, pp. 1116–1128, 2006.
33. D. Yu, A. Eversole, M. Seltzer, K. Yao, Z. Huang, B. Guenter, et al., “An Introduction to Computational Networks and the Computational Network Toolkit,” Tech. Rep. MSR-TR-2014-112, 2015.
34. A. Bardera, I. Boada, M. Feixas, S. Remollo, G. Blasco, Y. Silva, et al., “Semi-automated method for brain hematoma and edema quantification using computed tomography,” Computerized Medical Imaging and Graphics, vol. 33, no. 4, pp. 304–311, 2009.
Acronyms
AUC area under the curve
CE cross entropy
conv convolutional
ICH intracerebral hemorrhage