
The Pitfalls of Sample Selection: A Case Study on Lung Nodule Classification

Vasileios Baltatzis1,2, Kyriaki-Margarita Bintsi2, Loïc Le Folgoc2, Octavio E. Martinez Manzanera1, Sam Ellis1, Arjun Nair3, Sujal Desai4,

Ben Glocker2, Julia A. Schnabel1,5,6

1 School of Biomedical Engineering and Imaging Sciences, King's College London, UK

2 BioMedIA, Department of Computing, Imperial College London, UK
3 Department of Radiology, University College London, UK

4 The Royal Brompton & Harefield NHS Foundation Trust, London, UK
5 Technical University of Munich, Germany

6 Helmholtz Center Munich, Germany
[email protected]

Abstract. Using publicly available data to determine the performance of methodological contributions is important as it facilitates reproducibility and allows scrutiny of the published results. In lung nodule classification, for example, many works report results on the publicly available LIDC dataset. In theory, this should allow a direct comparison of the performance of proposed methods and assess the impact of individual contributions. When analyzing seven recent works, however, we find that each employs a different data selection process, leading to largely varying total numbers of samples and ratios between benign and malignant cases. As each subset will have different characteristics with varying difficulty for classification, a direct comparison between the proposed methods is thus not always possible, nor fair. We study the particular effect of truthing when aggregating labels from multiple experts. We show that specific choices can have severe impact on the data distribution, where it may be possible to achieve superior performance on one sample distribution but not on another. While we show that we can further improve on the state-of-the-art on one sample selection, we also find that on a more challenging sample selection, on the same database, the more advanced models underperform with respect to very simple baseline methods, highlighting that the selected data distribution may play an even more important role than the model architecture. This raises concerns about the validity of claimed methodological contributions. We believe the community should be aware of these pitfalls, and we make recommendations on how they can be avoided in future work.

1 Introduction

Lung nodule characterization is the most difficult step in the pipeline of lung cancer diagnosis according to radiologists, as reflected by the substantial inter-observer disagreement on the task [7,12].


A lung nodule is normally characterized with respect to texture, spiculation, lobulation, and its morphological appearance on a CT scan, and eventually it must be classified as either benign or malignant for patient management. The Lung Imaging Reporting And Data System (Lung-RADS) [9] is a protocol that defines explicit guidelines for nodule management and follow-up planning, and classifies pulmonary nodules into six categories, each of which has its own suggested follow-up. Lung-RADS also integrates the PanCan Model [11], which provides a malignancy probability based on the morphology of a nodule and additional patient information. A certain diagnosis can only be made through biopsy, which, however, is invasive and not always accessible. While determining the malignancy of a nodule from its appearance on a CT scan is not fail-proof, it remains a very useful step in the lung cancer detection pipeline and can be of great value to clinicians in conjunction with patient history and demographics.

Several deep learning methods have been proposed for automated nodule classification from CT. The publicly available Lung Image Database Consortium and Image Database Resource Initiative (LIDC) database [2,10] has been at the core of the majority of such efforts. The LIDC does not primarily contain pathology-confirmed ground truths (besides a very small subset of cases), but rather radiologists' annotations. Nevertheless, it is still heavily used by the research community for the task of lung nodule classification. Interestingly, there are various design choices regarding sample selection that need to be considered, which can have severe impact on the reported results.

The contributions of this paper can be summarized as follows: 1) We analyze several published works reporting results on LIDC nodule classification and examine different assumptions, such as annotation aggregation methods, removal of cases based on clinical guidelines, and data augmentation, which can all affect the resulting sample selection; 2) Through an extensive experimental analysis, we show that the selected data distribution affects the difficulty of the task and may play an even more important role than the model architecture; 3) We demonstrate that reproducibility and direct model comparison are virtually impossible to achieve and provide suggestions towards making this feasible in future work, while also making our data selection publicly available to promote reproducibility. We illustrate the pitfalls of sample selection with a novel methodological approach of curriculum by smoothing for lung nodule classification. Our findings and insights will be of use to the community and aid the design of future approaches for lung nodule classification.

2 State-of-the-art in Lung Nodule Classification

The LIDC dataset contains more than 1000 scans. Each scan was reviewed by four radiologists who pinpointed lesion locations and assigned a variety of annotations including malignancy. For every nodule, each radiologist had to assign a malignancy rating from 1 (most likely benign) to 5 (most likely malignant). Nodules annotated with 3 were regarded as indeterminate.


(Figure 1: image panels omitted. Panel labels, top row: Median: Benign, Mean: Benign; Median: Indeterminate, Mean: Indeterminate; Median: Malignant, Mean: Malignant. Bottom row: Median: Benign, Mean: Indeterminate; Median: Indeterminate, Mean: Benign; Median: Indeterminate, Mean: Malignant; Median: Malignant, Mean: Indeterminate.)

Fig. 1: Lung nodule examples from the LIDC. Top row: nodules that have the same consensus regardless of the aggregation method used. Bottom row: nodules that have a different consensus depending on the aggregation method.

There are a number of preprocessing and data curation steps which are considered fixed when using the LIDC, and almost all recent deep learning papers follow them. These include (1) retaining only nodules that have been annotated by at least three radiologists and (2) discarding nodules annotated as indeterminate. Subsequently, for each nodule a consensus annotation is extracted from the individual annotations through some form of aggregation or truthing (typically using mean, median, or majority voting). Example nodules from the LIDC with different consensus/aggregation combinations can be seen in Figure 1. Given these relatively straightforward steps, it may be surprising to find that every paper we studied reports largely varying numbers of benign and malignant nodules and overall cases (see Table 1). Most studies report that they follow a procedure similar to previous work; however, they rarely provide the exact details about either the sample selection process or the final dataset (e.g. by publishing a list of scan series IDs). Besides the differences in absolute numbers of benign and malignant cases, the characteristics of the underlying data distribution may change significantly. One of the most important characteristics is the size of a nodule (quantified by its diameter), as it plays an essential role in malignancy classification. Another discrepancy arises from the decision to remove cases that have a slice thickness > 2.5mm, which is based on clinical guidelines [5]. Images with thick slices are deemed unsuitable for lung cancer screening. This step was first suggested in the LUNA16 nodule detection challenge [15] and has also been adopted by other studies [20]. One of the few works that release their pre-processed data is by Al-Shabi et al. [1].
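To make the truthing step concrete, the following is a minimal Python sketch of how a per-nodule consensus could be derived from the individual radiologist ratings, and how the choice of mean versus median changes which nodules are kept. The exclusion rules (at least three raters, consensus of exactly 3 excluded) follow the description above, but the function name, example ratings and implementation are illustrative rather than the exact pipeline of any cited work.

```python
import numpy as np

def consensus_label(ratings, method="median"):
    """Aggregate per-radiologist malignancy ratings (1-5) into one consensus label.

    Returns "benign" (< 3), "malignant" (> 3), or None for nodules that are
    excluded (fewer than three raters, or an indeterminate consensus of 3).
    """
    ratings = np.asarray(ratings, dtype=float)
    if len(ratings) < 3:            # keep only nodules rated by >= 3 radiologists
        return None
    score = np.median(ratings) if method == "median" else np.mean(ratings)
    if score < 3:
        return "benign"
    if score > 3:
        return "malignant"
    return None                     # indeterminate consensus -> excluded

# The same nodule can be kept or discarded depending on the truthing choice:
print(consensus_label([1, 3, 3, 4], "median"))  # median 3.0  -> None (excluded)
print(consensus_label([1, 3, 3, 4], "mean"))    # mean 2.75   -> benign (kept)
print(consensus_label([2, 2, 3, 5], "median"))  # median 2.5  -> benign (kept)
print(consensus_label([2, 2, 3, 5], "mean"))    # mean 3.0    -> None (excluded)
```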


Table 1: Overview of previous work for lung nodule classification on LIDC-IDRI in terms of nodule counts and performance. Despite all papers using the same publicly available dataset, the final numbers of benign and malignant cases vary largely, making a direct comparison of the methods' performance impossible.

| Method | Benign count | Malignant count | Accuracy (%) |
|---|---|---|---|
| Local-Global [1] | 442 | 406 | 89.75 |
| DeepLung [20] | 554 | 450 | 90.44 |
| Lightweight multi-CNN [14] | 857 | 448 | 93.18 |
| Interpretable hierarchical CNN [16] | 3212 | 1040 | 84.20 |
| NoduleX [3] | 394 | 270 | 93.20 |
| Multi-crop CNN [17] | 880 | 495 | 87.14 |
| Multi-task w/ margin ranking loss [8] | 972 | 450 | 93.50 |

Here, we attempt to draw a direct comparison to their work with the dataset we have extracted from pre-processing the LIDC (see Figure 2). Such a comparison is not feasible for the other proposed methods, which do not publicly release their sample selection. In this comparison, we want to highlight the important role that the aggregation method (mean vs. median) plays in determining which samples are labeled as benign and malignant. When median aggregation is used, we see that many more nodules have an indeterminate consensus (i.e. median = 3) and are therefore excluded, resulting in a smaller, more balanced dataset, which is much easier to separate based on the key characteristic of nodule diameter. Specifically, median aggregation leads to 442/406 benign/malignant nodules for [1] and 376/357 benign/malignant in our replicated pipeline, respectively. In contrast, mean aggregation results in 653/484 benign/malignant for [1] and 559/451 for us. A factor leading to a discrepancy between the two samples, even when the same aggregation method is used, is that cases with a slice thickness > 2.5mm have been retained by [1]. These factors make reproducibility and direct comparison of methods nearly impossible.

3 Methodology

Here we present the different methods and approaches, including our attempted contribution, which we considered for studying the impact of sample selection on lung nodule classification performance. We use several baselines and state-of-the-art deep learning approaches.

3.1 Diameter-based baselines

Diameter threshold. The first baseline we set is not learning-based but rather a simplistic one. Specifically, given that the size of a nodule is a primary factor in determining whether it is malignant (i.e. large nodules are most likely to be regarded by experts as malignant, while small nodules as benign), we use the provided diameter annotation in LIDC and specify a threshold for classifying nodules into benign and malignant.


(Figure 2: histograms of the number of samples over nodule diameter (mm) for benign and malignant nodules, panels (a)-(d) as listed in the caption. Nodule counts per panel: (a) 442 benign / 406 malignant, (b) 653 benign / 484 malignant, (c) 376 benign / 357 malignant, (d) 559 benign / 451 malignant.)

Fig. 2: Data distributions of benign and malignant samples over nodule diameter. (a) Median aggregation from [1], (b) mean aggregation from [1], (c) our median aggregation, (d) our mean aggregation. Median aggregation produces fewer nodules in total (i.e. more nodules are classified as indeterminate) in both cases, and at the same time more balanced datasets.

This baseline is used as a surrogate to determine the difficulty of the classification, as the overall size difference between structures may be easily picked up by an image-based prediction model such as a convolutional neural network (CNN).
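As a concrete illustration, a minimal sketch of this baseline could look as follows; the candidate threshold grid and the array names are our own assumptions rather than the exact values used in the experiments.

```python
import numpy as np

def best_diameter_threshold(diameters, labels, candidates=None):
    """Pick the diameter cut-off (in mm) that maximizes training accuracy.

    diameters: nodule diameters in mm; labels: 1 = malignant, 0 = benign.
    """
    diameters = np.asarray(diameters, dtype=float)
    labels = np.asarray(labels)
    if candidates is None:
        candidates = np.arange(3.0, 30.0, 0.1)   # illustrative search grid (mm)
    accs = [((diameters >= t).astype(int) == labels).mean() for t in candidates]
    best = int(np.argmax(accs))
    return candidates[best], accs[best]

# e.g. threshold, train_acc = best_diameter_threshold(train_diam, train_lab)
#      test_pred = (test_diam >= threshold).astype(int)
```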

Regressed diameter threshold. Another baseline is similar to the previous one, but uses a CNN trained to regress the diameter through a mean squared error loss. Classification then takes place by applying a threshold, determined as in the first baseline, to the output of the CNN instead of the annotation. Again, if this baseline works well, one may conclude that the task on a given dataset is not very difficult.
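A sketch of this second baseline, under the assumption that a small CNN (e.g. the ShallowNet of Section 3.2) is reused as the regression backbone with a single diameter output, might look like this; `train_step` and the variable names are illustrative.

```python
import torch
import torch.nn as nn

# `backbone` stands for any small CNN whose final layer outputs a single value,
# interpreted here as the regressed nodule diameter in mm (an assumption).
def train_step(backbone, patches, diameters_mm, optimizer):
    optimizer.zero_grad()
    pred = backbone(patches).squeeze(1)               # regressed diameter per patch
    loss = nn.functional.mse_loss(pred, diameters_mm)  # mean squared error loss
    loss.backward()
    optimizer.step()
    return loss.item()

# At test time the class is obtained by thresholding the *predicted* diameter:
#   malignant = (backbone(test_patches).squeeze(1) >= threshold)
```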

3.2 ShallowNet

We also implement a CNN for malignancy classification (termed ShallowNet): a bare-bones CNN comprising four convolutional layers with 3x3 kernels and ReLU activations, each followed by a max-pooling layer with 2x2 kernels, and a fully-connected layer with 1024 neurons at the end for the classification.


This is a deliberately simplistic deep learning baseline used for comparison with the more complicated architectures proposed in the literature.
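A possible PyTorch realisation of such a network is sketched below; the per-layer channel widths are our own assumptions, since the text only fixes the kernel sizes, the number of layers and the final fully-connected layer.

```python
import torch.nn as nn

class ShallowNet(nn.Module):
    """Bare-bones CNN: four 3x3 conv/ReLU blocks, each followed by 2x2 max-pooling,
    then a 1024-neuron fully-connected layer; outputs one malignancy logit."""
    def __init__(self, in_channels=1, widths=(32, 64, 128, 256)):
        super().__init__()
        layers, c = [], in_channels
        for w in widths:                              # channel widths are illustrative
            layers += [nn.Conv2d(c, w, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True),
                       nn.MaxPool2d(kernel_size=2)]
            c = w
        self.features = nn.Sequential(*layers)
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(widths[-1] * 2 * 2, 1024),      # 32x32 input -> 2x2 after 4 poolings
            nn.ReLU(inplace=True),
            nn.Linear(1024, 1))                       # single logit for binary cross-entropy

    def forward(self, x):
        return self.classifier(self.features(x))
```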

3.3 Local-Global

Since we have access to the sample selection of [1], it makes sense to use the state-of-the-art method on this distribution. The Local-Global network was proposed by [1] and consists of two blocks. Each block contains the following sequence: a residual sub-block [4], followed by a non-local sub-block [19] and a dropout layer. After the two blocks, there is an average pooling layer and a fully-connected layer for the classification.
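For orientation, the following is a schematic PyTorch sketch assembled from the description above and from the non-local block of [19]; the channel widths, batch normalisation, dropout rate and other details are our own assumptions and do not reproduce the authors' released implementation.

```python
import torch
import torch.nn as nn

class NonLocal2d(nn.Module):
    """Simplified embedded-Gaussian non-local block in the spirit of [19]."""
    def __init__(self, channels):
        super().__init__()
        inter = max(channels // 2, 1)
        self.theta, self.phi, self.g = (nn.Conv2d(channels, inter, 1) for _ in range(3))
        self.out = nn.Conv2d(inter, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)      # (b, hw, c')
        k = self.phi(x).flatten(2)                        # (b, c', hw)
        v = self.g(x).flatten(2).transpose(1, 2)          # (b, hw, c')
        attn = torch.softmax(q @ k, dim=-1)               # pairwise affinities
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                            # residual connection

class LocalGlobalBlock(nn.Module):
    """One block: residual sub-block -> non-local sub-block -> dropout."""
    def __init__(self, in_ch, out_ch, p_drop=0.5):
        super().__init__()
        self.res = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch))
        self.skip = nn.Conv2d(in_ch, out_ch, 1)
        self.non_local = NonLocal2d(out_ch)
        self.drop = nn.Dropout2d(p_drop)

    def forward(self, x):
        x = torch.relu(self.res(x) + self.skip(x))
        return self.drop(self.non_local(x))

class LocalGlobal(nn.Module):
    """Two Local-Global blocks, average pooling, and a fully-connected classifier."""
    def __init__(self):
        super().__init__()
        self.blocks = nn.Sequential(LocalGlobalBlock(1, 64), LocalGlobalBlock(64, 128))
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, 1))

    def forward(self, x):
        return self.head(self.blocks(x))
```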

3.4 Curriculum by smoothing

Finally, we propose the use of curriculum by smoothing (CBS) [18], which has shown promising results on computer vision classification tasks. CBS plays the role of our attempted methodological contribution to lung nodule classification. The main idea behind CBS is to apply a Gaussian smoothing kernel to the output of each convolutional layer of a CNN. We use θ ⊛ x to denote the convolution of a kernel θ with an input x. Typically, in a CNN, a convolution operation is followed by a non-linear activation function, as described in Equation 1:

z = activation(θ_ω ⊛ x)    (1)

where θ_ω are the trainable parameters of a convolutional layer. The CBS formulation is presented in Equation 2:

z = activation(θ_G ⊛ (θ_ω ⊛ x))    (2)

where θ_G is a predefined Gaussian kernel. The Gaussian kernel is deterministic and is not trained. During the early stages of training it has an initial standard deviation σ, which is annealed as training progresses. This way, high-frequency information is suppressed in the early training steps of the CNN and is only considered at later stages of the training process.
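A minimal sketch of how CBS could wrap a single convolutional layer, following Equation 2 and the annealing schedule we use in Section 4 (initial σ = 1, reduced by 0.5 every 5 epochs), is given below; the class and helper names are ours and the depthwise smoothing is one possible realisation, not the reference implementation of [18].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def gaussian_kernel(sigma, size=3):
    """Fixed (non-trainable) 2D Gaussian kernel theta_G with standard deviation sigma."""
    coords = torch.arange(size, dtype=torch.float32) - (size - 1) / 2.0
    g = torch.exp(-coords ** 2 / (2.0 * sigma ** 2))
    kernel = torch.outer(g, g)
    return kernel / kernel.sum()

class CBSConv2d(nn.Module):
    """z = activation(theta_G conv (theta_w conv x)) with an annealed sigma (Eq. 2)."""
    def __init__(self, in_ch, out_ch, sigma0=1.0, anneal=0.5, every=5):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.sigma0, self.anneal, self.every = sigma0, anneal, every
        self.sigma = sigma0

    def step_epoch(self, epoch):
        # reduce sigma by `anneal` every `every` epochs, keeping it strictly positive
        self.sigma = max(self.sigma0 - self.anneal * (epoch // self.every), 1e-3)

    def forward(self, x):
        x = self.conv(x)                                       # theta_w conv x
        k = gaussian_kernel(self.sigma).to(x.device, x.dtype)
        k = k.repeat(x.shape[1], 1, 1, 1)                      # one kernel per channel
        x = F.conv2d(x, k, padding=1, groups=x.shape[1])       # depthwise smoothing
        return F.relu(x)
```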

It is important to note that while we introduce CBS here as an approach that could enhance the performance of ShallowNet or Local-Global for the task of lung nodule classification, our purpose is not to propose a novel model architecture, but rather to explore whether the selected sample distribution can play a more important role than the model architecture, and to highlight the pitfalls that occur in such a scenario.

4 Experimental Analysis

Following from the differences in the data distributions, we move on to comparing several baseline models, as well as the proposed method from [1].


In this section, we focus on two distributions to demonstrate the impact of sample selection and to understand whether performance differences stem from the data or the methods. Specifically, we use the data produced with median aggregation (Figure 2a) from [1] (henceforth denoted as D1), as this is the one the authors report results for, and mean aggregation (Figure 2d) for our data (denoted as D2). We do not consider mean aggregation to be superior to median; instead, we want to study the differences in performance that are caused by this specific choice of truthing. Median aggregation leads to the two classes being more easily separated based on nodule diameter (Figures 2a, 2c), even though 5-10 mm is considered the most difficult range in which to separate malignant from benign nodules.

In both D1 and D2, a nodule is considered benign when the consensus has a value lower than 3 and malignant when it has a value greater than 3.

CT scans with a slice thickness greater than 2.5mm are removed according to clinical guidelines [5], and every remaining scan is resampled to 1mm isotropic resolution across all three dimensions; one 32x32 mm patch is extracted along each orthogonal plane at each nodule location. The final classification result for each nodule is obtained by averaging the individual classifications of its three planes. Some experiments include offline data augmentation (i.e. the size of the dataset itself is increased six-fold through the addition of nodule augmentations); these augmentations are the ones suggested by [1] and include rotations, horizontal flips and Gaussian smoothing. For the proposed methodological contribution of employing CBS, we choose 3x3 kernels, with an initial standard deviation σ = 1 of the Gaussian smoothing kernel and an annealing of 0.5 every 5 epochs, based on guidelines provided by the authors of [18] and our own validation performance. All models are evaluated using 10-fold cross-validation and the reported results are the average of the performance across the 10 folds. The networks are trained using the Adam optimizer [6] with learning rate 10^-3 and binary cross-entropy loss for 50 epochs and a batch size of 256 samples. We also deploy early stopping to avoid overfitting. All experiments were conducted using PyTorch [13].
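For concreteness, the per-nodule prediction described above (one forward pass per orthogonal patch, averaged probabilities) could be sketched as follows; `model` stands for any of the single-logit classifiers from Section 3, and extracting the three resampled 32x32 patches is assumed to happen beforehand.

```python
import torch

def nodule_probability(model, axial, coronal, sagittal):
    """Average the malignancy probabilities of the three orthogonal 32x32 patches."""
    model.eval()
    patches = torch.stack([axial, coronal, sagittal])       # (3, 1, 32, 32)
    with torch.no_grad():
        probs = torch.sigmoid(model(patches)).squeeze(1)    # one probability per plane
    return probs.mean().item()

# Training configuration used throughout this section (values from the text above):
#   optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
#   criterion = torch.nn.BCEWithLogitsLoss()   # binary cross-entropy on a single logit
#   50 epochs, batch size 256, early stopping, 10-fold cross-validation
```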

The results of the comparison can be found in Table 2. First, we show that even separating the samples based on nodule diameter (i.e. thresholding) can achieve quite a high accuracy (85.02% for D1 and 83.46% for D2). In each case, we select the threshold that maximizes training accuracy. The threshold differs considerably between the two cases (7.2mm for D1 and 11.5mm for D2) because of the different aggregation methods used and also because the equivalent diameter (i.e. the diameter of the sphere having the same volume as the estimated nodule volume) is the one used in [1]. Then we use a shallow CNN (ShallowNet) to regress the nodule diameter and apply a threshold (7.7mm for D1 and 11mm for D2) on that in order to classify the nodule. If we focus on D1, we see that a ShallowNet trained directly on malignancy initially just outperforms the diameter-based baselines (85.74%), but its performance improves progressively when we use either CBS (86.80%) or offline augmentations (89.74%), and reaches up to 90.91% if we use both. We observe the same pattern for Local-Global [1], which starts from 89.15% when we do not use CBS or augmentations and eventually reaches 90.91% when we use both.


Table 2: Comparison of methods on the different data distribution settings. The reported results are averaged across the 10 folds. D1 is the data distribution used in [1], which results from median aggregation, while D2 has been extracted from the LIDC by us using mean aggregation. We use accuracy (Acc), sensitivity (Sens) and specificity (Spec) to report the performance of each method; all reported values are percentages (%). Even from the baselines, it is evident that D1 is an easier task to solve than D2. All methods perform better when augmented with CBS on D1. On D2, all configurations perform similarly to the diameter baseline, and there is no improvement from progressively increasing the complexity of the model by adding augmentations and/or CBS.

| Method | D1 Acc | D1 Sens | D1 Spec | D2 Acc | D2 Sens | D2 Spec |
|---|---|---|---|---|---|---|
| Diameter threshold | 85.02 | 90.14 | 80.31 | 83.46 | 69.62 | 94.63 |
| CNN-regressed diameter threshold | 84.43 | 84.23 | 84.61 | 81.58 | 68.95 | 91.77 |
| ShallowNet | 85.74 | 77.09 | 93.67 | 83.86 | 74.94 | 91.05 |
| ShallowNet + CBS | 86.80 | 78.57 | 94.35 | 82.77 | 71.17 | 92.12 |
| ShallowNet (w/ aug) | 89.74 | 85.96 | 93.21 | 84.35 | 77.38 | 89.98 |
| ShallowNet (w/ aug) + CBS | 90.91 | 89.40 | 92.30 | 82.37 | 73.61 | 89.44 |
| Local-Global [1] | 89.15 | 89.16 | 89.14 | 82.97 | 74.72 | 89.62 |
| Local-Global + CBS | 89.26 | 91.40 | 86.94 | 81.98 | 75.38 | 87.29 |
| Local-Global (w/ aug) [1] | 89.75 | 90.17 | 88.17 | 82.57 | 79.15 | 85.33 |
| Local-Global (w/ aug) + CBS | 90.91 | 90.64 | 91.17 | 81.88 | 70.06 | 91.41 |

The progressive gains from CBS and augmentations that are present on D1, however, are not replicated on D2. All methods in that case perform very similarly to the diameter-based baselines, with ShallowNet being the only one that surpasses them marginally in terms of accuracy (84.35% with augmentations).

5 Discussion

The LIDC dataset has been instrumental for the majority of recent works on lung nodule classification. Here, we take a critical look at the aspect of sample selection after discovering inconsistencies in the reported literature. We aimed to examine different factors that affect the performance of a model and thus the apparent value of its methodological contribution. Starting from the pre-processing steps that various studies have applied to the LIDC dataset, we observe that a number of different assumptions during the sample selection process can lead to very different resulting data distributions (Table 1). Such factors are the choice of the aggregation method (e.g. median or mean) used to extract a consensus from the multiple annotations per nodule, or the removal of certain cases which are considered unsuitable for the task due to clinical guidelines.

The aggregation method, in particular, plays a very important role. First, it affects the total number of nodules that are retained, since median aggregation leads to more nodules having an indeterminate consensus, and consequently being removed, compared to mean aggregation.


It is fair to say that these nodules, which are retained in the dataset with mean aggregation, are harder examples, and therefore the classification task that results from mean aggregation is more difficult. Second, the prevalence of the two classes in the dataset changes substantially, since median aggregation leads to a more balanced, and potentially more favorable for classification, dataset.

It is easy to understand that these choices change the nature of the underlying data distribution and hence of the classification task itself. Comparing the performance of different methods applied to different distributions is thus complex and makes the objective assessment of the value of methodological contributions difficult, which we also demonstrate experimentally. We initially devise several baselines. The first one is a simple thresholding based on the nodule diameter annotation. A size-related annotation is usually a core part of a lung nodule dataset, including the LIDC, and therefore this baseline is applicable in all future studies. In the second baseline we apply a threshold to diameter predictions regressed by a neural network. This can indicate the degree of bias that a neural network has towards associating large nodules with malignancy and small ones with a benign nature. Given the very similar performance of the ShallowNet trained on malignancy prediction itself and the ShallowNet trained to regress the diameter, we conclude that this bias is actually quite severe. It is well documented [11] that the size of the nodule is an important factor in determining whether a nodule is benign, but from a clinical perspective there are also other indications, such as texture or spiculation, which do not seem to be picked up by the neural network. The aforementioned baselines can describe the difficulty of the task, and we suggest their adoption by the research community working on lung nodule classification. Additionally, we intend to publicly release our sample selection, and we urge the research community to do the same to promote reproducibility.

The core argument of our paper is epitomized when we compare the performance of all methods on the two distributions. Overall, we see that on D1, adding data augmentation or increasing the complexity of the model (i.e. Local-Global instead of ShallowNet) consistently leads to a distinct increase in performance. Using CBS during training results in a performance increase for every single method, marginally outperforming even the state-of-the-art (Local-Global w/ augmentations) on D1. However, on D2, all methods are bounded by the diameter threshold baseline, and even CBS does not have the impact it did on D1. This highlights the pitfalls of sample selection, which may lead to incorrect conclusions about methodological contributions. If we were to report only results on D1, we might have concluded that CBS is beneficial for lung nodule classification and even outperforms previous works.


6 Conclusion

In this paper we have investigated the effect of sample selection in the context of lung nodule classification using deep learning. We have examined different factors that cause the various published studies to report completely different numbers of nodules, and we show experimentally that these factors explicitly affect network performance. We have demonstrated that using progressively more complex methods systematically improves performance on the task, if and only if the assumptions regarding the data selection process allow for it. On the other hand, if the data distribution presents a more challenging classification task, as is the case when mean aggregation of the nodule annotations is used, then model complexity or data augmentation do not offer any performance boost compared to even the simplest baseline.

7 Acknowledgments

This work is funded by the King's College London & Imperial College London EPSRC Centre for Doctoral Training in Medical Imaging (EP/L015226/1), EPSRC grant EP/023509/1, the Wellcome/EPSRC Centre for Medical Engineering (WT 203148/Z/16/Z), and the UKRI London Medical Imaging & Artificial Intelligence Centre for Value Based Healthcare. The Titan Xp GPU was donated by the NVIDIA Corporation.

References

1. Al-Shabi, M., Lan, B.L., Chan, W.Y., Ng, K.H., Tan, M.: Lung nodule classification using deep Local-Global networks. International Journal of Computer Assisted Radiology and Surgery 14(10), 1815-1819 (2019). https://doi.org/10.1007/s11548-019-01981-7
2. Armato, S.G., McLennan, G., Bidaut, L., et al.: The Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI): A completed reference database of lung nodules on CT scans. Medical Physics 38(2), 915-931 (2011). https://doi.org/10.1118/1.3528204
3. Causey, J.L., Zhang, J., Ma, S., et al.: Highly accurate model for prediction of lung nodule malignancy with CT scans. Scientific Reports 8(1) (2018). https://doi.org/10.1038/s41598-018-27569-w
4. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778 (2016). https://doi.org/10.1109/CVPR.2016.90
5. Kazerooni, E.A., Austin, J.H., Black, W.C., et al.: ACR-STR practice parameter for the performance and reporting of lung cancer screening thoracic computed tomography (CT): 2014 (Resolution 4). Journal of Thoracic Imaging 29(5), 310-316 (2014). https://doi.org/10.1097/RTI.0000000000000097


6. Kingma, D.P., Ba, J.L.: Adam: A method for stochastic optimization. In: 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings (2015)
7. Lin, H., Huang, C., Wang, W., Luo, J., Yang, X., Liu, Y.: Measuring interobserver disagreement in rating diagnostic characteristics of pulmonary nodule using the Lung Imaging Database Consortium and Image Database Resource Initiative. Academic Radiology 24(4), 401-410 (2017). https://doi.org/10.1016/j.acra.2016.11.022
8. Liu, L., Dou, Q., Chen, H., Qin, J., Heng, P.A.: Multi-task deep model with margin ranking loss for lung nodule analysis. IEEE Transactions on Medical Imaging 39(3), 718-728 (2020). https://doi.org/10.1109/TMI.2019.2934577
9. McKee, B.J., Regis, S.M., McKee, A.B., Flacke, S., Wald, C.: Performance of ACR Lung-RADS in a clinical CT lung screening program. Journal of the American College of Radiology 13(2), R25-R29 (2016). https://doi.org/10.1016/j.jacr.2015.12.009
10. McNitt-Gray, M.F., Armato, S.G., Meyer, C.R., et al.: The Lung Image Database Consortium (LIDC) data collection process for nodule detection and annotation. Academic Radiology 14(12), 1464-1474 (2007). https://doi.org/10.1016/j.acra.2007.07.021
11. McWilliams, A., Tammemagi, M.C., Mayo, J.R., et al.: Probability of cancer in pulmonary nodules detected on first screening CT. New England Journal of Medicine 369(10), 910-919 (2013). https://doi.org/10.1056/NEJMoa1214726
12. Nair, A., Bartlett, E.C., Walsh, S.L., et al.: Variable radiological lung nodule evaluation leads to divergent management recommendations. European Respiratory Journal 52(6), 1-12 (2018). https://doi.org/10.1183/13993003.01359-2018
13. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., et al.: Automatic differentiation in PyTorch (2017). https://openreview.net/forum?id=BJJsrmfCZ
14. Sahu, P., Yu, D., Dasari, M., Hou, F., Qin, H.: A lightweight multi-section CNN for lung nodule classification and malignancy estimation. IEEE Journal of Biomedical and Health Informatics 23(3), 960-968 (2019). https://doi.org/10.1109/JBHI.2018.2879834
15. Setio, A.A.A., Traverso, A., de Bel, T., et al.: Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: The LUNA16 challenge. Medical Image Analysis 42, 1-13 (2017). https://doi.org/10.1016/j.media.2017.06.015
16. Shen, S., Han, S.X., Aberle, D.R., Bui, A.A., Hsu, W.: An interpretable deep hierarchical semantic convolutional neural network for lung nodule malignancy classification. Expert Systems with Applications 128, 84-95 (2019). https://doi.org/10.1016/j.eswa.2019.01.048
17. Shen, W., Zhou, M., Yang, F., et al.: Multi-crop convolutional neural networks for lung nodule malignancy suspiciousness classification. Pattern Recognition 61, 663-673 (2017). https://doi.org/10.1016/j.patcog.2016.05.029


18. Sinha, S., Garg, A., Larochelle, H.: Curriculum by smoothing. In: Advances in Neural Information Processing Systems, vol. 33, pp. 21653-21664. Curran Associates, Inc. (2020). https://proceedings.neurips.cc/paper/2020/file/f6a673f09493afcd8b129a0bcf1cd5bc-Paper.pdf
19. Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7794-7803 (2018). https://doi.org/10.1109/CVPR.2018.00813
20. Zhu, W., Liu, C., Fan, W., Xie, X.: DeepLung: Deep 3D dual path nets for automated pulmonary nodule detection and classification. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV) (2018). https://doi.org/10.1101/189928, http://arxiv.org/abs/1801.09555