Monica Ramakrishnan, Surya Rajasekaran, Barsa Nayak, Akshay Bhagdikar SANTA CLARA UNIVERSITY AUTOMATED LUNG CANCER NODULE DETECTION


Apr 15, 2018


Monica Ramakrishnan, Surya Rajasekaran, Barsa Nayak, Akshay Bhagdikar

SANTA CLARA UNIVERSITY

AUTOMATED LUNG CANCER NODULE DETECTION


1 PRE-INTRODUCTION

1.1 PREFACE

The purpose of this project is to develop a model that utilizes various concepts from image processing, data mining, and machine learning to detect lung cancer nodules amongst high-risk patients. This report has been made in fulfillment of the requirement for the subject Pattern Recognition & Data Mining in June 2017, under the supervision of Dr. Ming-Hwa Wang.

1.2 ACKNOWLEDGEMENTS

We would like to express our heartfelt gratitude to Dr. Ming-Hwa Wang for providing us with an opportunity to explore our interests in data mining as well as image processing. Without his tremendous support, encouragement, and valuable input, this project could not have materialized. The guidance and support received from all the members who contributed and who are contributing to this project was vital for our success.


1.3 TABLE OF CONTENTS

1 PRE-INTRODUCTION
1.1 PREFACE
1.2 ACKNOWLEDGEMENTS
1.3 TABLE OF CONTENTS
1.4 LIST OF TABLES & FIGURES
1.5 ABSTRACT:

2 INTRODUCTION
2.1 OBJECTIVE:
2.2 WHAT IS THE PROBLEM:
2.3 PROJECT RELATION TO CLASS:
2.4 WHY OTHER SOLUTIONS ARE INADEQUATE:
2.5 WHY OUR APPROACH IS BETTER:
2.6 STATEMENT OF PROBLEM
2.7 SCOPE OF INVESTIGATION

3 THEORETICAL BASES AND LITERATURE REVIEW
3.1 DEFINITION OF THE PROBLEM:
3.2 THEORETICAL BACKGROUND OF THE PROBLEM:
3.3 RELATED RESEARCH TO SOLVE THE PROBLEM:
3.4 ADVANTAGE/DISADVANTAGE OF THOSE RESEARCH
3.5 SOLUTION TO PROBLEM
3.6 WHERE SOLUTION IS DIFFERENT FROM OTHERS
3.7 WHY SOLUTION IS AN IMPROVEMENT ON CURRENT METHODS
3.8 TERMINOLOGY

4 HYPOTHESIS
4.1 MULTIPLE HYPOTHESIS
4.2 POSITIVE/NEGATIVE HYPOTHESIS:

5 METHODOLOGY
5.1 HOW TO GENERATE/COLLECT INPUT DATA:
5.2 HOW TO SOLVE THE PROBLEM:
5.3 ALGORITHM DESIGN:
5.4 LANGUAGES AND TOOLS USED:
5.5 HOW TO GENERATE OUTPUT:
5.6 HOW TO TEST AGAINST HYPOTHESIS:

6 IMPLEMENTATION:
6.1 CODE
6.2 DESIGN DOCUMENT:
6.3 DESIGN DOCUMENT AND FLOWCHART

7 DATA ANALYSIS AND DISCUSSION
7.1 OUTPUT GENERATION:
7.2 OUTPUT ANALYSIS:
7.3 COMPARE OUTPUT AGAINST HYPOTHESIS
7.4 ABNORMAL CASE EXPLANATION
7.5 DISCUSSION:

8 CONCLUSION AND RECOMMENDATIONS
8.1 SUMMARY AND CONCLUSIONS:
8.2 RECOMMENDATIONS FOR FUTURE STUDIES:
8.3 BIBLIOGRAPHY
8.4 PROGRAM FLOWCHART
8.5 PROGRAM SOURCE CODE WITH DOCUMENTATION
8.6 INPUT/OUTPUT LISTING

1.4 LIST OF TABLES & FIGURES

FIGURE 1: SAMPLE SLICES OF DICOM IMAGE
FIGURE 2: HOUNSFIELD UNITS FOR VARIOUS SUBSTANCES
FIGURE 3: CT SCAN SLICE COLORED
FIGURE 4: AFTER THRESHOLDING
FIGURE 5: AFTER EROSION AND DILATION
FIGURE 6: VGG16 ARCHITECTURE
FIGURE 7: MLP PROCESS
FIGURE 8: SAMPLE EPOCH OUTPUT


1.5 ABSTRACT:

Lung cancer is a disease of uncontrolled cell growth in tissues of the lung. Since lung cancer is one of the leading causes of death, early detection of malignant tumors is imperative for a successful recovery. In general, early-stage lung cancer diagnosis techniques mainly utilize X-ray chest films, CT, MRI, etc. Computed tomography (CT) produces a series of cross-sectional images covering a part of the human body. For our case specifically, we will focus on the thoracic region. Visually identifying and examining these images for potential abnormalities is a challenging and time-consuming task due to the large amount of information that needs to be processed and the short amount of time given. Medical image mining is currently an up-and-coming topic and shows a lot of research potential in the area of computational intelligence. By auto-analyzing a patient's records and images through data mining and image processing techniques, we would reduce the risk of human error in nodule detection. By applying a combination of techniques in data preprocessing, feature extraction, and classification, we ultimately seek to increase the accuracy rate of cancer detection while simultaneously reducing the false positive diagnosis rate.

In this project, we propose to use a deep artificial neural network architecture, which is a combination of a CNN along with an RNN, for the fully automated detection of pulmonary nodules in CT scans. The VGG16 convolutional neural network is trained to distinguish pixels across images, and can be utilized in our case to extract nodule information. Our project will demonstrate that by leveraging these techniques, we substantially increase the sensitivity to detect pulmonary nodules without inflating the false positive rate. Thus, from the available LIDC/IDRI dataset consisting of around 1500 CT scans, we have provided an innovative approach of implementing a CNN using the pretrained VGG model for feature extraction and an RNN for feature classification, for identification of pulmonary nodules in lung cancer detection.


2 INTRODUCTION

2.1 OBJECTIVE:

The objective of this project is to improve the current cancer detection rate by reducing the false positives while maintaining a low false negative rate. The false positive detection of cancer is dangerous, as an erroneous diagnosis utilizes precious resources, causes unnecessary apprehension for the patient, and finally, poses a variety of legal threats to the doctors. Thus, to prevent this, our objective is to develop an algorithm that efficiently reduces the false positive rate while maintaining the overall cancer detection accuracy.

2.2 WHAT IS THE PROBLEM:

Lung carcinoma, also known as lung cancer, is characterized by malignant tumors that arise when gene changes in the DNA of the cells promote unnatural growth. Lung cancer is the most common type of cancer, with approximately 225K new cases in 2016 alone, which led to $12 billion in annual healthcare costs. The most common age at diagnosis is 70 years. Overall, the lung cancer survival rate in the United States is extremely low: only 17.4% of patients diagnosed with lung cancer survive five years post diagnosis. Further, uncontrolled cell growth can spread to surrounding areas or metastasize to other organs if it is not detected early. Doctors currently use low-dose CT scans to help assess if a person is at risk of lung cancer, or even other pulmonary diseases.

Using a data set of thousands of high-resolution lung scans provided by the National Cancer Institute, we will develop a model that accurately determines whether lesions in the lungs of high-risk patients are cancerous. Current models have extremely high false positive rates, which do not allow oncologists and radiologists to focus on patients that have an imminent cancer threat. Further, the high false positive rates lead to unnecessary patient anxiety, additional follow-up imaging, and interventional treatments. Thus, by building a classification model that reduces the number of both false positives and false negatives, patients can have earlier access to life-saving interventions, and doctors gain an opportunity to prioritize patient care.

2.3 PROJECT RELATION TO CLASS:

The approach we will take to model an accurate prediction system of lung cancer will utilize multiple techniques in data mining and pattern recognition. Classification problems are rooted in feature processing, clustering, feature extraction, feature engineering, and machine learning models. First, by mining a large data set of CT scans, we utilize techniques in data mining and image processing that are crucial for accurate feature extraction. We also plan to utilize both unsupervised (clustering) and supervised (classification) models in order to extract characteristics of patients' lungs and finally classify patients as cancerous or not cancerous. By using these various techniques, we are combining widely used techniques in data mining and pattern recognition.


2.4 WHY OTHER SOLUTIONS ARE INADEQUATE:

Most of the standard mechanisms for classification use PCA, SVM, Random Forest, or an XGBoost classifier to classify the data. Other classifiers fail if they have not taken the scenarios below into account:

1. Detecting the location of Pulmonary Nodule(s): The current approaches for detecting pulmonary nodules generally include manual intervention. When the location of the nodules is not provided in the dataset, it becomes an extremely tedious task to manually parse the images to find potentially malignant nodules. This proves even more difficult if the images are rotated or twisted, leading to further erroneous processing.

2. Feature Extraction: The generation of a small number of features leads to a loss of data. Once the nodule position is determined, the nodule is extracted from the entire image. Then, features such as total area, average area, maximum area, average eccentricity, average equivalent diameter, standard equivalent diameter, weighted X, weighted Y, number of nodes, and number of nodes per slice are calculated. With this approach, a lot of information is lost.

2.5 WHY OUR APPROACH IS BETTER:

Our proposed approach has the following two steps:

1. Feature Extraction Phase: Instead of manual intervention, we make use of image processing techniques to first highlight the lung region and then apply a state-of-the-art pretrained Convolutional Neural Network model for feature extraction from the images. This reduces the human effort of nodule position detection, and since our model is not restricted to the position of the nodule, it is not affected if the image is rotated or twisted.

2. Classification Phase: Recurrent Neural Networks with a multilayer perceptron will be used to classify the CT scans. Although random forest classification models typically require more data for a similar accuracy, they typically generate a robust model. Thus, we will use random forest to set a baseline accuracy for our analysis. On the other hand, deep learning is more favorable, as complex problems such as image classification can be handled better, and this is the basis of pulmonary nodule detection.

The benefits of using deep learning (Recurrent Neural Networks) are:

1. Automatic feature extraction without having to extract the nodule position information and other features.

2. In the case of datasets which are complex 3D images, deep learning gives better classification results as compared to other methods.

3. Tree-based models (the kind that XGBoost is good at) handle tabular data very well, but a deep network can capture things like image, audio, and possibly text quite well by modeling the spatial and temporal locality.

4. Neural-network-based deep learning is an accuracy-focused method, whereas XGBoost is an interpretation-focused method.


2.6 STATEMENT OF PROBLEM

Although CT scans are an established means for detecting pulmonary nodules, small lesions in the lung still remain difficult to identify, especially when using a single-detector CT scan. This poses a challenge when attempting early detection of lung cancer. Since early detection is the key to a successful remission and recovery, the inability to manually see the small lesions further hinders the possibility of early detection. Current CT scanners produce up to 300 cross-sectional 2D images, each of which must be individually evaluated under a time constraint. Despite the diagnostic benefits provided by CT imaging, the increased manual workload required to read 200-300 slices per exam leads to an increasing error rate of cancer detection. Given that nodules can appear in different positions depending on the patient, the process to detect lung cancer becomes extremely labor-intensive and manual. This manual process further increases the probability of human error: either the doctor detects cancer in a patient who is cancer free, or the doctor fails to detect the malignant nodule. Current studies have demonstrated that the rate of erroneous CT interpretation and analysis ranges from 7% to 15% when a radiologist performs more than 20 CT examinations per day (Bechtold, 1997).

In order to address these issues, there has been a sudden increase in research and development of CAD (computer-aided diagnosis) systems for high-accuracy pulmonary nodule detection. Using the CT scans, the adoption of CAD systems has led to an improvement in the sensitivity of current detection algorithms: present-day systems are successfully able to detect nodules with a 3 mm diameter. Thus, several approaches are being proposed to overcome the challenges of detecting pulmonary nodules and thus maximize the chances of survival of the patients.

2.7 SCOPE OF INVESTIGATION

For the purposes of this project, we will specifically focus on lesions in the lungs of high-risk patients. The general approach can potentially be applied to different organs and to patients of various demographics, but for this investigation, we will limit our scope to this group.


3 THEORETICAL BASES AND LITERATURE REVIEW

3.1 DEFINITION OF THE PROBLEM:

Since lung cancer is one of the leading causes of death in the United States, and given that early detection increases the probability of a successful remission, our problem statement revolves around creating an automated, high-accuracy model for nodule detection. Current methodologies place a heavy focus on reducing false negative rates, but at the expense of significantly over-predicting the cancerous class. Further, our research has suggested that current approaches to nodule detection all require some level of manual image processing. That is, the images are parsed by hand for the coordinates of the nodule. Thus, the solution will focus on 1) automating nodule detection, and 2) reducing the false positive rate of cancer detection while maintaining a good false negative rate. The availability of large data sets provided by the National Cancer Institute offers an opportunity for further research in data mining. The important problem in this area is to make an efficient detection algorithm to aid with early detection.

3.2 THEORETICAL BACKGROUND OF THE PROBLEM:

With the advancement of technology and Computer-Aided Diagnosis (CAD), scientists have proposed many automated systems to address the issue of reducing false positives while estimating the presence of pulmonary nodules in the CT scans of patients. Thus, today we have a surplus of data pertaining to CT scan patients. From here, we have the opportunity to use current topics in image processing, data mining, and machine learning to identify hidden patterns in nodule size, location, structure, etc., and to construct a model that increases the probability of malignant tumor detection. With the advent of pattern recognition and machine learning, data scientists have proposed many approaches which were robust in finding the hidden patterns and reducing the false positives. However, as the data for patients detected with lung cancer increased, the CT scans tended to differ more significantly from each other across patients. Consequently, deep learning has come into the picture for complex image classification, in order to ensure that outliers and anomalies are properly handled in the model, thus reducing the false positive rate for malignant pulmonary nodule detection.

3.3 RELATED RESEARCH TO SOLVE THE PROBLEM:

There has been a lot of research in recent times on the development of computer-aided diagnosis (CAD) systems for pulmonary nodule detection using CT imaging. Advancements in the image processing field have increased the accuracy of cancer prediction from CT scans. There are plenty of research papers which discuss the various methods and outputs. A few examples include: 'Recurrent Convolutional Networks for Pulmonary Nodule Detection in CT Imaging', 'Combining deep neural network and traditional image features to improve survival prediction accuracy for lung cancer patients from diagnostic CT', 'Computerized Detection of Lung Tumors in PET/CT Images', and 'Lung cancer classification using neural networks for CT images'.


3.4 ADVANTAGE/DISADVANTAGE OF THOSE RESEARCH

Advantages

The current research is extremely helpful to many students who would like to implement the findings in their projects, as well as to do further research. The advantage of these research findings is that they also help to improve the detection of cancer at an early stage and to reduce the detection error.

Disadvantages

The disadvantage is that the training data set is usually very large, which increases the time taken to train. In order to obtain a beyond-average accuracy, we are required to use an extremely large data set to train the model. Although it is time-consuming for the code to train the network, it is a one-time process that can be redone if there are new types of cancers that need to be trained.

3.5 SOLUTION TO PROBLEM

In order to detect cancerous tissue in the lungs with a relatively high accuracy, our solution will include the following steps:

1. Image Processing
2. Feature Extraction & Engineering
3. Clustering
4. Classification Model

This hybrid approach allows us to combine the most efficient individual methodologies into an end-to-end model. We believe that this combination will extract the advantages from each approach and ultimately provide more conclusive results. Our goal is to build an automated nodule detection model with high accuracy in order to aid the detection of lung cancer in patients.

3.6 WHERE SOLUTION IS DIFFERENT FROM OTHERS

Our solution will utilize a combination of machine learning techniques and place a unique focus on minimizing the false positive rate. While most models place the primary focus on total cancerous lungs predicted, we will place our focus on reducing the false positive rate, in order to make our results more reliable. Further, most of the techniques being used for lung cancer detection pre-process the CT scan manually, which is the most critical step for high-accuracy detection. Current solutions require manual feature extraction, which increases the probability of human error. In contrast, our method will input the entire CT scan as a DICOM image into the classification system, and thus we eliminate the need for a manual preprocess.


3.7 WHY SOLUTION IS AN IMPROVEMENT ON CURRENT METHODS

Our solution is aimed at reducing the number of false positives in the classification model. This can be achieved by a combination of feature engineering (i.e., selecting features and characteristics that are highly correlated with lung cancer) and training a classification model using bootstrapping and various other resampling techniques, given our unbalanced data set. By doing so, we even out the distribution between the classes of patients in our data set. Creating a weighting function for the rare class (cancerous scans) forces the model not to overfit to the data, which ultimately results in a minimal number of false positives.

Further, as previously mentioned, for any repetitive, pattern-based task, machines have a much larger capacity and increased efficiency for image processing. By eliminating the need for human interaction, we have proposed an automated tool which will input raw CT scans and perform the required image processing, model training, and final classification.

3.8 TERMINOLOGY

This section highlights the common terminology used in lung cancer and image processing.

Terminology: Definition

CT: Computed Tomography; uses a computer-processed combination of X-ray images taken at different angles to produce the scan

Pulmonary Nodule: Mass in the lung that may represent a cancerous lesion

DICOM: Digital Imaging and Communications in Medicine

Instance Number: Identifies the sequence of images in a DICOM

Slice Thickness: Thickness of a slice, which depends on the CT scanner used

Hounsfield Units: Quantitative scale used to measure the density of substances found in the body

ROI: Region of Interest


4 HYPOTHESIS

4.1 MULTIPLE HYPOTHESIS

Hypothesis: By using the solution outlined above, our model will reduce the number of false positives while maintaining a good accuracy rate.

4.2 POSITIVE/NEGATIVE HYPOTHESIS:

Positive Hypothesis: When features for the ROI are extracted and quantified slice by slice, the feature structure is poorly designed; hence, 3D features are taken into account by recombining the slices while extracting the features, in order to gain accuracy.

Negative: The pulmonary nodule can be of any shape and could merge with a blood vessel, which could cause issues when detecting the nodule accurately. Since the accuracy is not 100%, this leaves room for error. However, this is still better than manual detection done by humans.


5 METHODOLOGY

5.1 HOW TO GENERATE/COLLECT INPUT DATA:

Our input data is taken from Kaggle, which has the data of more than 1,000 high-risk patients. The data set is approximately 130 GB, which required a significant amount of computing power to process. In this data set, we are given over a thousand low-dose CT images from high-risk patients in DICOM format. Each DICOM image contains a series of 2D gray-scale images that represent multiple axial slices of the chest cavity; that is, there are approximately 300 images per patient. Each scan has a variable number of 2D slices, which can vary based on the type of machine used in the scan. The DICOM files have a header that contains the necessary information about the patient id, as well as other scan parameters such as the slice thickness. The data set also includes a CSV file containing the classification as cancerous or non-cancerous per patient id. The images in our data set vary in quality, depending on when the scan was taken. For example, older scans were imaged with less sophisticated equipment and thus have a lower resolution than more recent scans.

To summarize, each patient has around 300 images in DICOM format, which represent the slices of the thoracic region CT scan, together with the following information:

1. Patient ID, name, date of birth, and other metadata of the patient and image information
2. CSV file mapping each patient id to whether the person has cancer or not

5.2 HOW TO SOLVE THE PROBLEM:

The following flowchart highlights the process overview for detecting nodules in the lung.

The steps are outlined as follows:

1. Data Input: Download data for around 1,500 patients from Kaggle. This gives 2D slices of CT scan images for each patient, which are in the DICOM format. There are approximately 300 slices per patient.

2. Image Processing: Reconstruct the 3D structure of the lungs, remove the noise from the image, and make the data of all the patients uniform and thus eligible for feature extraction. This includes sorting, morphological operations (dilation, erosion), segmentation, masking, etc.

3. Classification: A Recurrent Neural Network with an MLP as its core is used for the classification. The MLP, or multilayer perceptron, has a memory element, LSTM (long short-term memory), associated with it. This further enhances the classification by emphasizing significant features while de-emphasizing aspects which are unimportant.

4. Validation: 75% of the data is used to train the model and 25% is used for testing purposes.
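The validation split in step 4 can be sketched as follows. This is only an illustration under stated assumptions: the helper name `split_patients`, the fixed seed, and the patient ids are hypothetical, and the split is done per patient so that all slices of one scan stay on the same side.

```python
import random


def split_patients(patient_ids, train_fraction=0.75, seed=0):
    """Shuffle patient ids and split them into (train, test) lists."""
    ids = sorted(patient_ids)            # sort first so the shuffle is reproducible
    rng = random.Random(seed)            # seed is an arbitrary, illustrative choice
    rng.shuffle(ids)
    cut = int(round(train_fraction * len(ids)))
    return ids[:cut], ids[cut:]


# Hypothetical patient ids, for illustration only
all_ids = [f"patient_{i:03d}" for i in range(100)]
train_ids, test_ids = split_patients(all_ids)
```

Splitting on patient ids rather than on individual slices is a design choice of this sketch; it avoids slices of the same patient appearing in both the training and the test set.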

[Flowchart: CT scans of all patients are fed into the tool in DICOM format (about 300 images per patient) -> the images of each patient are taken as slices, the 3D image is reconstructed, and noise is reduced -> image processing is done on the entire scan for feature extraction -> 75% of the data is taken for training and the remaining 25% is used for testing and validation of the model]


5.3 ALGORITHM DESIGN:

As a part of our algorithm design, we will implement two different approaches in order to determine the model with the highest prediction accuracy for detecting lung cancer. Our image processing steps will remain constant throughout the approaches. The difference will lie within the feature extraction and classification techniques. As our cursory analysis, we will design a relatively simple method of feature extraction and classification.

1. Image Processing: Given the sets of CT scans, our algorithm will construct the 3D lung scan and extract only the lung region. It is especially important that edge detection is done with high accuracy to minimize the error rate of our model. We also utilize segmentation and machine learning techniques to pre-process the image, including auto-detecting boundaries that surround the volume of interest. The images represent the 2D slices of the patient's thoracic region in DICOM format. Thus, before processing, it is very important to reconstruct the image in 3D form as follows:

a. Arrange the Slices: Arrange the slices in non-decreasing order of their Instance Number to make sure the CT scan can be reconstructed. As mentioned, the CT scan is in the form of numerous 2D images (slices), which need to be rearranged in numerical order to be able to recreate the 3D structure. This information lies in the Instance Number of the slices, which gives the order in which each slice has to appear when generating the 3D view. An example of three slices in a particular DICOM image is shown in Figure 1.
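A minimal sketch of the slice ordering, assuming each slice header has already been parsed into an object that exposes the DICOM attribute `InstanceNumber`. The `Slice` class below is a hypothetical stand-in for such an object (a DICOM library would normally provide it), and the values are made up.

```python
from dataclasses import dataclass


@dataclass
class Slice:
    """Hypothetical stand-in for one parsed DICOM slice header."""
    InstanceNumber: int
    ImagePositionPatient: tuple  # (x, y, z) position in mm


def arrange_slices(slices):
    """Return the slices in non-decreasing InstanceNumber order."""
    return sorted(slices, key=lambda s: int(s.InstanceNumber))


# Slices as they might arrive from an unordered directory listing
scan = [
    Slice(3, (0.0, 0.0, -5.0)),
    Slice(1, (0.0, 0.0, 0.0)),
    Slice(2, (0.0, 0.0, -2.5)),
]
ordered = arrange_slices(scan)
```

Once ordered, the pixel arrays of the slices can be stacked along a new first axis to obtain the 3D volume.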

b. Calculate Slice Thickness: Once the image is formed, we have to ensure that slice thickness is taken into account. The slice thickness is calculated based on the metadata present in the DICOM format. Since the input data has been taken from many sources, its thickness varies for different scanners. Thus, we have two formats of metadata from which to calculate thickness: one can consider either the ImagePositionPatient attribute or the SliceLocation attribute of the DICOM format, based on which is available.
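The fallback between the two attributes might look like the sketch below. It assumes two adjacent, already ordered slice headers, represented here as plain dictionaries keyed by the DICOM attribute names; the function name and the numeric values are illustrative only.

```python
def slice_thickness(first, second):
    """Estimate the slice spacing in mm from two adjacent slice headers.

    Prefers the z component of ImagePositionPatient and falls back to
    SliceLocation when the position attribute is not available.
    """
    pos_a = first.get("ImagePositionPatient")
    pos_b = second.get("ImagePositionPatient")
    if pos_a is not None and pos_b is not None:
        return abs(float(pos_a[2]) - float(pos_b[2]))
    return abs(float(first["SliceLocation"]) - float(second["SliceLocation"]))


# Hypothetical headers from two different scanners
with_position = slice_thickness(
    {"ImagePositionPatient": (-150.0, -150.0, -30.0)},
    {"ImagePositionPatient": (-150.0, -150.0, -32.5)},
)
with_location = slice_thickness({"SliceLocation": 10.0}, {"SliceLocation": 11.25})
```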

Figure 1: Sample slices of DICOM image

c. Convert the Image into HU: The image is then converted to HU (Hounsfield units) to isolate the lung region. Since we have an image of the entire thoracic region, it is important to distinguish the lung region from the non-lung region. The first step of the conversion is setting the pixel values that fall outside the scan bounds to 0, so that after the rescale intercept is applied they approximately correspond to the HU value of air. Then, the HU value of each region is obtained by multiplying by the rescale slope and adding the intercept (which are conveniently stored in the metadata of the scans). For reference, the table below contains the average HU range per substance:

Substance    HU
Air          -1000
Lung         -500
Fat          -100 to -50
Blood        +30 to +45
Water        0
Muscle       +10 to +40

Figure 2: Hounsfield Units for Various Substances

When adding a colored filter to a particular slice of the DICOM images, we are better able to visualize and outline the various substances in the lung:

Figure 3: CT Scan Slice Colored
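The HU conversion described in step c is a linear rescale and can be sketched as follows. The function name `to_hounsfield`, the sentinel value of -2000 for pixels outside the scanned area, and the slope and intercept used in the example are assumptions chosen for illustration; the real values come from each scan's metadata.

```python
import numpy as np


def to_hounsfield(raw, slope, intercept, outside_value=-2000):
    """Convert raw CT pixel values to Hounsfield units: HU = slope * raw + intercept."""
    image = np.asarray(raw, dtype=np.float64).copy()
    # Assumption: pixels outside the scanned area carry a sentinel value.
    # Setting them to 0 places them near air once the intercept is added.
    image[image == outside_value] = 0
    return (slope * image + intercept).astype(np.int16)


# Toy 2x2 slice with one out-of-scan pixel, one air pixel and two tissue pixels.
raw_slice = np.array([[-2000, 0], [1024, 1064]])
hu_slice = to_hounsfield(raw_slice, slope=1.0, intercept=-1024.0)
```

With these example values, the out-of-scan pixel and the air pixel both land at the bottom of the scale, while the two tissue pixels fall in the water and muscle ranges of the table above.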

d. Resampling: Then, we apply resampling to obtain a homogeneous dataset. Uniformity is of prime concern in image processing, so that we do not bias any set of images. A common method of dealing with this problem is to resample the full dataset to a particular isotropic resolution. Thus, we use a resampling technique to ensure that the spacing of the images is uniform for the entire input dataset.

e. Interpolation: Zoom interpolation using the spline methodology is applied to smooth the image and reduce visual distortion.
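The arithmetic behind steps d and e can be worked through on its own. The sketch below only plans the resample (target shape, zoom factor per axis, and the spacing actually achieved after rounding to whole voxels); the helper name `resample_plan` and the example scan geometry are assumptions, and the spline interpolation itself is left to the resample function listed in section 6.1.

```python
def resample_plan(shape, spacing, new_spacing=(1.0, 1.0, 1.0)):
    """Return (new_shape, zoom_factors, achieved_spacing) for an isotropic resample."""
    # Physical extent per axis is voxels * spacing; divide by the target spacing
    # and round, because an image can only have a whole number of voxels.
    new_shape = [int(round(n * s / t)) for n, s, t in zip(shape, spacing, new_spacing)]
    # The zoom factor that actually produces the rounded shape.
    zoom = [m / n for m, n in zip(new_shape, shape)]
    # Because of the rounding, the achieved spacing is only approximately the target.
    achieved = [s / z for s, z in zip(spacing, zoom)]
    return new_shape, zoom, achieved


# Example: 120 slices of 512x512 pixels, 2.5 mm apart, 0.7 mm in-plane spacing.
new_shape, zoom, achieved = resample_plan(shape=(120, 512, 512), spacing=(2.5, 0.7, 0.7))
```

The rounding step is why the returned spacing can differ slightly from the requested 1 mm.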

f. Thresholding: While there exist various kinds of thresholding techniques, for the purposes of our project we utilize a clustering-based method, where the gray-level samples are clustered in two parts as background and foreground (object). Thus, we apply the k-means algorithm in order to separate the lung region from the background. An example of a slice post-thresholding can be found below. Note that we have successfully isolated the lung region, but can still visibly see the noise in the image.

Figure 4: After thresholding
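Clustering-based thresholding with two clusters can be sketched in a few lines. For one-dimensional intensities, k-means with k=2 reduces to repeatedly splitting the pixels at the midpoint of the two cluster centers, which is what the hypothetical helper `two_means_threshold` below does. It is a simplified stand-in for a library k-means, and the toy slice is made up.

```python
import numpy as np


def two_means_threshold(values, iterations=20):
    """Cluster intensities into two groups and return the midpoint of the two centers."""
    values = np.asarray(values, dtype=np.float64).ravel()
    low, high = values.min(), values.max()  # initial cluster centers
    for _ in range(iterations):
        midpoint = (low + high) / 2.0
        dark, bright = values[values < midpoint], values[values >= midpoint]
        if dark.size == 0 or bright.size == 0:
            break  # constant image: nothing to separate
        low, high = dark.mean(), bright.mean()
    return (low + high) / 2.0


# Toy standardized slice: lung/air pixels are dark, tissue pixels are bright.
slice_ = np.array([[-1.2, -1.0, 0.9],
                   [-1.1,  1.0, 1.1],
                   [-0.9,  1.2, 0.8]])
threshold = two_means_threshold(slice_)
lung_candidate = np.where(slice_ < threshold, 1, 0)  # 1 marks the darker cluster
```

The resulting binary image still contains any small dark structures, which is the noise that the next step addresses.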

g. Erosion and Dilation: We then apply erosion, a technique for shrinking the foreground regions of the image, to ensure that the unwanted noise introduced so far diminishes. Then, when we dilate the image, we regain the original size of the image regions, keeping the actual size of the lungs and the nodules intact. Together, these two processes reduce the noise in the image. Examples of noise in CT scans include veins, capillaries, etc.

Figure 5: After erosion and dilation
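The effect of erosion followed by dilation can be illustrated with a small NumPy-only sketch. The helpers `erode` and `dilate` below are simplified illustrations using an odd-sized square neighborhood; they are not the library routines used in the project, and they use the same neighborhood size for both steps, which is an assumption made to keep the example short.

```python
import numpy as np


def _neighborhood_stack(binary, size):
    """Stack every shifted copy of the image covered by a size x size square (size odd)."""
    pad = size // 2
    padded = np.pad(binary, pad, mode="constant", constant_values=0)
    rows, cols = binary.shape
    return np.stack([padded[r:r + rows, c:c + cols]
                     for r in range(size) for c in range(size)])


def erode(binary, size=3):
    """A pixel survives only if its whole neighborhood is foreground."""
    return _neighborhood_stack(binary, size).min(axis=0)


def dilate(binary, size=3):
    """A pixel is set if any pixel in its neighborhood is foreground."""
    return _neighborhood_stack(binary, size).max(axis=0)


# A 5x5 "lung" region plus one isolated noise pixel in the corner.
image = np.zeros((8, 8), dtype=int)
image[1:6, 1:6] = 1
image[7, 7] = 1

opened = dilate(erode(image))  # erosion removes the speck, dilation restores the region
```

The isolated pixel disappears during erosion and is not brought back by dilation, while the large region returns to its original extent.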

h. Segmentation: Masking of the image is done by highlighting the ROI (Region of Interest). After getting rid of noise and reducing false positives, the image is segmented. Thus, a mask is made by outlining the edges of the lung (distinguishing the lung from the muscular surroundings) as well as the nodules contained within the organ. We then apply the mask to isolate the ROI.

The image below gives an overview of the entire process described above:

2. Feature Extraction & Engineering: Using the 3D scan of the lungs allows us to retain as much of the information as possible. As explained, we then convert the CT scans (in Hounsfield Units) to pixels, allowing us to input a pixel array into our classifier as well as the features that we extract. Here, we utilize a two-tiered approach. The first approach focuses on a simplistic method of feature extraction, followed by our second approach, which places a heavy emphasis on deep learning. By doing so, we allow the simplistic model to set a baseline for accuracy, and we are then able to fully understand the impact that deep learning has on our precision.

Approach 1:
a) For every patient, we divide the set of images into 4 sections. For example, if the patient has 200 images, then we divide the set into sections of 50 images each. By doing this we essentially divide the lung into 4 regions: top, two middle, and bottom.

b) After preprocessing the image, we get an image with binary pixel values of 1 or 0. We read every image array in a section, sum those arrays, and finally average them and store the averaged values in a new array. This new array is a mean representation of that specific section of lung area. By doing so, we average out the dark areas per region, with the hypothesis that the higher the average value, the larger the probability of nodule detection.

c) Similarly, we find the averaged array for every section. The values stored in this array range from 0 to 1. We round off these values and finally get four arrays of 0s and 1s for every patient.

d) We calculate the number of 1s in every array. So finally, for every patient we get four values: the number of 1s in every section.

e) These values are then used as the feature input for our classification model.
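Steps a) to e) of Approach 1 can be sketched with NumPy as follows. The function name `section_features` is hypothetical, the toy volume is made up, and the rule that an average of exactly 0.5 rounds up is an assumption, since the text only says the values are rounded off.

```python
import numpy as np


def section_features(binary_volume, n_sections=4):
    """Count the 1s in the rounded mean image of each section of a binary scan."""
    # Split the stack of slices (axis 0) into consecutive sections.
    sections = np.array_split(binary_volume, n_sections, axis=0)
    features = []
    for section in sections:
        mean_image = section.mean(axis=0)            # values between 0 and 1
        rounded = np.where(mean_image >= 0.5, 1, 0)  # assumption: 0.5 rounds up
        features.append(int(rounded.sum()))          # number of 1s in the section
    return features


# Toy patient: 8 binary slices of 2x2 pixels, i.e. 2 slices per section.
volume = np.zeros((8, 2, 2), dtype=int)
volume[0:2] = 1        # section 1: every pixel set in both slices
volume[2, 0, 0] = 1    # section 2: partial overlap between the two slices
volume[3, 0, :] = 1
volume[6, 1, 1] = 1    # section 4: one pixel set in one of the two slices

features = section_features(volume)
```

Each patient is thereby reduced to four integers, one per lung region, which form the feature row passed to the classifier.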

Approach 2:
First, feature extraction is largely dependent on image identification. Thus, any customized CNN model, unlike a human being, first has to be fed a large amount of training data for it to understand which category an image belongs to. In the process, it does so by updating its weights (i.e. applying appropriate filters), thus emphasizing the features to be considered for recognition. This process is time consuming, as with each input the model updates its weights and learns to identify the features. The model used below was created by the Visual Geometry Group (VGG), who trained a CNN model on the ImageNet dataset (which spans roughly 22,000 categories in total) to obtain the best possible weights for feature extraction. For our case, we can use the model directly for feature extraction without having to update the weights ourselves. With 16 layers, VGG16 is an extremely efficient image processing algorithm. Using a pre-trained model allows us to utilize the most up-to-date technologies in machine learning and data mining. By running this CNN on the data, we are able to auto-detect features in each DICOM image. Thus, our output from the CNN is a data frame with approximately 40,000 processed features per patient. The architecture for VGG is below:

Figure 6: VGG16 Architecture

This model is available for both the Theano and TensorFlow backends, and can be built with either the "channels_first" data format (channels, height, width) or the "channels_last" data format (height, width, channels).
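The two data formats differ only in where the channel axis sits, which a transpose makes concrete. The sketch below also shows one plausible way to present grayscale CT slices to a 3-channel network, namely grouping three consecutive slices into one image. That grouping is an assumption for illustration; the project's own generator is not reproduced in this report.

```python
import numpy as np


def to_channels_last(batch):
    """(batch, channels, height, width) -> (batch, height, width, channels)."""
    return np.transpose(batch, (0, 2, 3, 1))


def to_channels_first(batch):
    """(batch, height, width, channels) -> (batch, channels, height, width)."""
    return np.transpose(batch, (0, 3, 1, 2))


# Toy volume: 6 grayscale slices of 4x5 pixels.
volume = np.arange(6 * 4 * 5).reshape(6, 4, 5)

# Assumption: every 3 consecutive slices become the 3 channels of one image.
batch_first = volume.reshape(-1, 3, 4, 5)    # channels_first layout
batch_last = to_channels_last(batch_first)   # channels_last layout
```

Converting back with `to_channels_first` recovers the original array, so the choice of format affects only the layout and not the data.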

3. Classification: Finally, we can input our data into a classification model. The data frame will have each individual patient as a row, and the different features as columns. The final column in the data frame will be a binary value: 1 if the patient has cancer, and 0 if the patient does not have cancer. For validation purposes, we will use 75% of our original data for training the classifier, and 25% for validation. As mentioned in the previous discussion, we are taking two different tracks in order to further analyze the accuracy differences between using a simple vs. complex model. Thus, our classification steps for both approaches are as follows:

Approach 1:
Once our images are processed and the pixel array is extracted from each CT scan using approach 1 mentioned above, we can feed the data frame into a clustering algorithm to group similar patients together. As such, using the four features previously generated, we utilized the K-Means clustering algorithm in order to partition the ~1,500 patients into k=2 clusters (positive or negative for cancer), where each patient belongs to the cluster with the nearest mean. By partitioning the data space as such, we are able to gain a preliminary understanding of the distinguishing characteristics between the two groups. The K-Means algorithm partitioned the data into the two groups with relatively high accuracy.

We were then able to add the K-Means cluster assignment as an input parameter to our next classification method: Random Forest. Since Random Forests are known to perform well on various tasks, including unscaled data and variable selection, we decided to employ this approach in order to understand the effectiveness of our feature extraction methodology. If our Random Forest returns a low accuracy, we can conclude that our feature extraction is not adequate, setting a basis for our deep learning approach.

Approach 2:
The classification of the dataset produced by our second approach using convolutional neural networks is done by a multilayer perceptron (MLP), which follows the structure in the image below.

Figure 7: MLP Process

Using a multilayer perceptron has various advantages, including the capability to learn both linear and non-linear models and, most importantly, the capability to learn models in real time (on-line learning) using partial fit. The MLP then follows the steps highlighted below:

• Checks if there is a pretrained model and loads it before training.
• A Sequential model is used for training.
• The model uses relu activation across its various layers.
• Dropout is applied after every stage to prevent overfitting.
• The model is compiled with the 'binary_crossentropy' loss, which is well suited to binary classification; the optimizer being used is RMSprop.
• A checkpointer is used to save the best model based on the accuracy metric.
• After training, evaluation is done to show the efficiency of the trained model.

5.4 LANGUAGES AND TOOLS USED:
Our project will utilize libraries from both R and Python. We will primarily use Python to do the majority of our data extraction and image processing, while the classification models will be built and analyzed using both R and Python.

5.5 HOW TO GENERATE OUTPUT:

The program reads the input image (in DICOM) and applies the image processing functions, followed by the prediction algorithm, in order to generate a binary classification for lung cancer detection.

The flow chart below describes the process taken to generate the prediction for lung cancer detection:

5.6 HOW TO TEST AGAINST HYPOTHESIS:
Once we acquire a dataset, we intend to divide it into two subsets:

Training: Here, we take 75% of the entire dataset to train the predictive model.

Test: Here, we take the remaining 25% of the data to assess the likely future performance of the model. If our model fits the training dataset much better than it fits the test set, we will likely have an overfitting problem.
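A minimal sketch of the 75/25 split and of the train-versus-test comparison follows. The patient identifiers, the fixed shuffle seed, and the helper names are hypothetical; the accuracy figures passed to `overfitting_gap` are the ones reported later in section 7.3.

```python
import random


def split_patients(patient_ids, train_fraction=0.75, seed=0):
    """Shuffle the patient IDs and split them into training and test subsets."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)  # fixed seed so the split is reproducible
    cut = int(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


def overfitting_gap(train_accuracy, test_accuracy):
    """A large positive gap suggests the model fits the training data too closely."""
    return train_accuracy - test_accuracy


# Hypothetical identifiers for roughly 1,500 patients.
patients = ["patient_{:04d}".format(i) for i in range(1500)]
train_ids, test_ids = split_patients(patients)

gap = overfitting_gap(train_accuracy=0.90, test_accuracy=0.70)
```

Splitting by patient rather than by slice keeps all images of one person in the same subset, so the test score is not inflated by near-duplicate slices.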

Since our focus is on the reduction of false positives, and we wish to prevent overfitting, we have accounted for both of these concerns by using dropout layers in the neural network. The purpose of such a layer is to provide regularization: without it, training may produce a deceptively smooth loss curve while overfitting the data, ultimately leading to poor performance on unseen data.
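What a dropout layer does during training can be sketched with NumPy. The function below implements the common "inverted dropout" formulation, in which a random fraction of the activations is zeroed and the survivors are rescaled so that the expected activation is unchanged. It is an illustration of the idea rather than the library's implementation, and the rate of 0.3 matches the value used in the classifier code of section 6.1.

```python
import numpy as np


def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        return activations  # dropout is disabled at evaluation time
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob  # True for units that are kept
    return activations * mask / keep_prob


rng = np.random.default_rng(0)
hidden = np.ones((1000, 128))          # stand-in for one layer's activations
dropped = dropout(hidden, rate=0.3, rng=rng)
```

Because a different random subset of units is silenced on every batch, the network cannot rely on any single unit, which is the regularizing effect described above.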

Flow chart (referenced in section 5.5):
Collect input data
Apply image preprocessing techniques
Feature extraction is done using the VGG-16 CNN model, which automatically generates the list of features
Feed the processed data to the MLP binary classifier in batches, with the input data increasing gradually

6 IMPLEMENTATION:
6.1 CODE
Image Processing & Extraction:
The following Python scripts have been created in order to process the DICOM images as well as perform the feature extraction:

1. settings.py
2. common.py
3. preprocess_step_1.py
4. preprocess_step_2.py

Classification:
The following Python script has been created in order to apply the classification model to our processed dataset:

1. mlp_binary_classifcation.py

Snippets of code:
Given that the code of the project is too vast to reproduce in full, as it contains various aspects of image processing as well as feature extraction and classification, we have provided the important functions below. The descriptions for these functions can be found in section 6.2: the design document.

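preprocess_step_1.py

    import os

    import numpy as np
    import dicom  # pydicom < 1.0; newer releases are imported as `pydicom`


    def load_scan(src_dir):
        # Read every DICOM file in the patient's directory.
        slices = [dicom.read_file(os.path.join(src_dir, s)) for s in os.listdir(src_dir)]
        # Order the slices by InstanceNumber so the volume can be rebuilt.
        slices.sort(key=lambda x: int(x.InstanceNumber))
        # Derive the thickness from whichever position attribute is available.
        try:
            slice_thickness = np.abs(slices[0].ImagePositionPatient[2] - slices[1].ImagePositionPatient[2])
        except Exception:
            slice_thickness = np.abs(slices[0].SliceLocation - slices[1].SliceLocation)
        for s in slices:
            s.SliceThickness = slice_thickness
        return slices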

    import scipy.ndimage


    def resample(image, scan, new_spacing=[1, 1, 1]):
        # Current voxel spacing: (slice thickness, row spacing, column spacing).
        spacing = np.array([float(scan[0].SliceThickness)] +
                           [float(v) for v in scan[0].PixelSpacing])

        # Round the target shape to whole voxels, then recompute the exact factor.
        resize_factor = spacing / new_spacing
        new_real_shape = image.shape * resize_factor
        new_shape = np.round(new_real_shape)
        real_resize_factor = new_shape / image.shape
        new_spacing = spacing / real_resize_factor

        # Spline interpolation onto the new grid.
        image = scipy.ndimage.zoom(image, real_resize_factor)
        return image, new_spacing

    from sklearn.cluster import KMeans
    from skimage import measure, morphology
    import matplotlib.pyplot as plt


    def make_lungmask(img, display=False):
        row_size = img.shape[0]
        col_size = img.shape[1]

        # Standardize the pixel values.
        mean = np.mean(img)
        std = np.std(img)
        img = img - mean
        if std > 0:
            img = img / std

        # Find the average pixel value near the lungs to renormalize washed out images.
        # Row and column sizes are matched to their axes (identical for square slices).
        middle = img[int(row_size / 5):int(row_size / 5 * 4),
                     int(col_size / 5):int(col_size / 5 * 4)]
        mean = np.mean(middle)
        img_max = np.max(img)
        img_min = np.min(img)
        # To improve threshold finding, move underflow and overflow to the pixel spectrum.
        img[img == img_max] = mean
        img[img == img_min] = mean

        # Use KMeans to separate foreground (soft tissue/bone) and background (lung/air).
        kmeans = KMeans(n_clusters=2).fit(np.reshape(middle, [np.prod(middle.shape), 1]))
        centers = sorted(kmeans.cluster_centers_.flatten())
        threshold = np.mean(centers)
        thresh_img = np.where(img < threshold, 1.0, 0.0)  # threshold the image

        # First erode away the finer elements, then dilate to include some of the pixels
        # surrounding the lung, as we don't want to accidentally clip the lung.
        eroded = morphology.erosion(thresh_img, np.ones([3, 3]))
        dilation = morphology.dilation(eroded, np.ones([8, 8]))

        labels = measure.label(dilation)  # different labels are displayed in different colors
        regions = measure.regionprops(labels)
        good_labels = []
        for prop in regions:
            B = prop.bbox  # (min_row, min_col, max_row, max_col)
            if (B[2] - B[0] < row_size / 10 * 9 and B[3] - B[1] < col_size / 10 * 9
                    and B[0] > row_size / 5 and B[2] < row_size / 5 * 4):
                good_labels.append(prop.label)
        mask = np.zeros([row_size, col_size], dtype=np.int8)

        # After just the lungs are left, we do another large dilation
        # in order to fill in and out the lung mask.
        for N in good_labels:
            mask = mask + np.where(labels == N, 1, 0)
        mask = morphology.dilation(mask, np.ones([10, 10]))  # one last dilation

        # Crop the mask and the image to the central region and apply the mask.
        mask = mask[int(row_size / 5):int(row_size / 5 * 4),
                    int(col_size / 5):int(col_size / 5 * 4)]
        img = img[int(row_size / 5):int(row_size / 5 * 4),
                  int(col_size / 5):int(col_size / 5 * 4)]
        masked_lung = mask * img

        if display:
            fig, ax = plt.subplots(3, 2, figsize=[12, 12])
            panels = [("Original", img, 'gray'),
                      ("Threshold", thresh_img, 'gray'),
                      ("After Erosion and Dilation", dilation, 'gray'),
                      ("Color Labels", labels, None),
                      ("Final Mask", mask, 'gray'),
                      ("Apply Mask on Original", masked_lung, 'gray')]
            for axis, (title, data, cmap) in zip(ax.flatten(), panels):
                axis.set_title(title)
                axis.imshow(data, cmap=cmap)
                axis.axis('off')
            plt.show()
        return masked_lung

preprocess_step_2.py

    def dump_features(patient_id):
        patient_id = patient_id[:-2]
        try:
            global COUNT
            COUNT = COUNT + 1
            print("{}, {}".format(patient_id, COUNT))
            # VGG16 without its classification head, loaded with ImageNet weights.
            model = applications.VGG16(include_top=False, weights='imagenet',
                                       input_shape=(settings.ct_width, settings.ct_height, 3))
            features_train = model.predict_generator(generator(patient_id),
                                                     steps=settings.ct_depth // 3)
            # Flatten all features of the patient into a single row.
            features_train = np.reshape(features_train, (1, -1))
            if settings.is_csv:
                np.savetxt("{}/{}.csv".format(settings.preprocess_step2_csv, patient_id),
                           features_train, fmt='%.2f', delimiter=',')
            else:
                with open("{}/{}.b".format(settings.preprocess_step2, patient_id), 'wb') as f:
                    pickle.dump(features_train, f)
        except Exception as e:
            print(e)

mlp_binary_classifcation.py

    def create_model():
        model = Sequential()
        model.add(Dense(128, input_dim=input_dim, activation='relu',
                        kernel_initializer="normal",
                        # kernel_regularizer=regularizers.l1(),
                        # activity_regularizer=regularizers.l1(0.0001)
                        ))
        model.add(Dropout(0.3))
        # Progressively narrower hidden layers, each followed by dropout.
        for units in (64, 32, 16, 8):
            model.add(Dense(units, activation='relu', kernel_initializer="normal"))
            model.add(Dropout(0.3))
        model.add(Dense(1, activation='sigmoid'))
        # Note: this SGD optimizer is defined but not passed to compile(), which uses 'adam'.
        sgd = SGD(lr=0.01, decay=1e-2, momentum=0.9)
        model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
        return model

    def train_model():
        model = get_model()
        patient_ids = common.get_patient_ids()
        train_pct = 0.7
        steps_per_epoch = int(len(patient_ids) * train_pct / settings.batch_size)
        validation_steps = max(1, int(steps_per_epoch * (1 - train_pct)))
        # Save the model whenever the monitored accuracy improves.
        checkpointer = ModelCheckpoint(filepath=settings.model_filename, monitor='acc',
                                       verbose=1, save_best_only=True)
        # Note: early_stopping is defined but not included in the callbacks below.
        early_stopping = EarlyStopping(monitor='acc', min_delta=0.001, patience=10,
                                       verbose=1, mode='auto')
        history = model.fit_generator(
            common.generate_arrays_from_file(input_shape=(input_dim,), train_pct=train_pct),
            steps_per_epoch=steps_per_epoch,
            epochs=settings.epochs,
            validation_data=common.generate_arrays_from_file(input_shape=(input_dim,),
                                                             train_pct=train_pct,
                                                             is_test=True),
            validation_steps=validation_steps,
            # validation_data=read_validation_data(30, input_shape=(input_dim,)),
            workers=4,
            pickle_safe=False,
            callbacks=[checkpointer])
        score = model.evaluate_generator(
            common.generate_arrays_from_file(input_shape=(input_dim,), train_pct=0.80,
                                             is_test=True),
            steps=1)
        print(score)
        return model, history


    # score = model.evaluate(x_test, y_test)
    # print score
    model, history = train_model()
    model.save(settings.model_filename)
    with open(settings.model_dump_dir + "/cnn_history.b", "wb") as f:
        pickle.dump(history.history, f)
    # plot_model(model, to_file=settings.output_dir + "/" + "mip_model.png")
    common.plot_history(history)

6.2 DESIGN DOCUMENT:

Documentation of Functions Used:

common.py

Method
    Description

get_patient_ids()
    This function fetches the patient IDs for all the patients.

threadsafe_generator(f)
    This is a decorator that takes a generator function and makes it thread safe.

generate_arrays_from_file(input_shape, train_pct=0.9, is_test=False)
    Generates an array from values in the file.

plot_history(history)
    This function is used to plot the training accuracy and the test accuracy.

preprocess_step_1.py

Method
    Description

load_scan(src_dir)
    This function loops over the image files and stores everything into a list.

get_pixels_hu(scans)
    This function is used for segregating the lung part by converting pixels to HU units.

sample_stack(stack, rows=6, cols=6, start_with=10, show_every=3)
    Used to sample the stack of 2D images and display them.

resample(image, scan, new_spacing=[1, 1, 1])
    Resampling is done by this function to make sure that the distance between the slices is uniform.

reshape_ct(image, new_shape=[settings.ct_depth, settings.ct_width, settings.ct_height])
    Reshapes the 2D images as per the desired shape for the next input.

make_lungmask(img, display=False)
    Function to standardize the pixel values.

preprocess(data_path)
    This function applies all of the functions defined above.

dump_preprocess_data(patient_id)
    Used to save the preprocessed image in the form of a binary file/csv to be used by the feature extraction model.

preprocess_all()
    Function used to iterate over every patient and preprocess each slice per patient to make sure it is ready for feature extraction.

preprocess_step_2.py

Method
    Description

read_preprocess_1(patient_id)
    This function is used to load the preprocessed images.

generator(patient_id)
    This function takes the preprocessed image and generates new images with the dimensions that can be fed into the VGG-16 model.

dump_features(patient_id)
    In this function, the features extracted using the CNN model (VGG-16) are saved to be fed to the classifier in the next stage.

preprocess2_all()
    Function used to extract the features for each of the scans with the VGG16 model.

mlp_binary_classifcation.py

Method
    Description

create_model()
    This function is used to create a sequential model that is used for simple binary classification.

get_model()
    Function used to load a pre-existing trained model.

train_model()
    This function gets a pretrained model if present, or otherwise creates a new model, and then trains it in epochs (as per the number of epochs specified).

evaluate_model()
    Once the model is trained, this function is used to evaluate the test data.

6.3 DESIGN DOCUMENT AND FLOWCHART
The following section outlines the flow chart of our analysis.

Start
Load the input CT scan images of each patient
Preprocess the images with image processing methods for uniformity and noise reduction
Feature extraction using the VGG-16 CNN model
Divide the input data into train data and test data
Check if there is a pretrained model and load it before proceeding with training
Train the model using the training dataset
Find the accuracy and update the model if the latest accuracy is better than the earlier one
Give the test data for evaluation of the model
Display the output
Stop

7 DATA ANALYSIS AND DISCUSSION
7.1 OUTPUT GENERATION:
The program reads the input of a CT scan in the form of a DICOM image per patient, and applies the prediction algorithm to it to generate the output. The images are 2D slices of the CT scan. We then train the model and analyze the accuracy against the actual result for each patient (i.e. cancerous or non-cancerous). We then find the difference between the predicted and actual outputs and update the weights, with a dropout rate of 30% to avoid overfitting. The final output of our algorithm is a classification of each patient as either positive or negative for cancer. This output is generated for all selected patients.

7.2 OUTPUT ANALYSIS:
Approach 1:
As previously mentioned, our first approach utilizes the Random Forest classifier on the simplistic feature extraction model. As such, the confusion matrix is as follows:

                      PREDICTED: NO CANCER    PREDICTED: YES CANCER
ACTUAL: NO CANCER     301                     383
ACTUAL: YES CANCER    117                     128

Here, the confusion matrix displays the total number of correct and incorrect predictions made by the classification model in comparison to the actual outcomes. For our purposes, the performance of our models will be evaluated using the data in the matrix. From the above output, we can see that out of 684 patients who did not have cancer, 383 were predicted incorrectly, indicating that our false positive rate is 56%. Similarly, out of the 245 patients who have cancer, only 128 were predicted correctly, indicating that our false negative rate is 48%. As we can see, the above output indicates that our feature extraction process is clearly inadequate for detecting lung cancer nodules. Our error rate is around 50% for both the false positives and false negatives, and thus we can conclude that the simplistic model based on high-level feature extraction and basic classification is not able to predict lung cancer nodules at a desirable accuracy. However, the simplistic approach provides us with a baseline for further analysis. As we proceed to analyze our more complex model involving a pre-trained feature extraction convolutional neural network, we can judge our accuracy not only by a raw percentage, but also as a relative increase compared to our simplistic model.

Approach 2:
Now, as we move forward with the deep learning approach previously mentioned, we used a convolutional neural network to classify patients as either cancerous or non-cancerous. Through the process of training the classifier, we look at the epoch output as follows: at each step, the CNN calculates the estimated time of arrival (ETA), the loss and accuracy at each stage, and the total time, total loss, and total accuracy for the entire step. Sample output is found below:

Figure 8: Sample Epoch Output

As we can see from the above output, the CNN classifier categorizes the data, calculates the loss function, and then adjusts the weights accordingly. This process is repeated until the accuracy (ACC) no longer improves, as shown in the output below:

Thus, after our classification model is trained with the data, we can compute the confusion matrix as follows:

                      PREDICTED: NO CANCER    PREDICTED: YES CANCER
ACTUAL: NO CANCER     744                     1
ACTUAL: YES CANCER    57                      198

Here, we are able to see that by utilizing concepts in deep learning, we were able to significantly increase our accuracy rate. Not only were we able to correctly identify the patients with lung cancer nodules at a rate of roughly 78% (198 of 255), we were also able to reduce the false positive rate to roughly 0.1% (1 of 745) in our training data. Now, when we take our trained classification model and apply it to our testing data, we see that there is an overall accuracy of 70%. A potential reason for the drastic difference between the accuracy on the test and train data is overfitting: a modeling error which occurs when a function is too closely fit to a limited set of data points. While a point for further research includes handling the overfitting of the data, our achieved accuracy of 70% is much higher than that of our simplistic model. Further, given the complexity of the problem and the lack of manual analysis, we can conclude that an overall accuracy of 70% is adequate.
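The rates quoted in this section follow directly from the two confusion matrices, and a short sketch makes the arithmetic explicit. The function name `rates` and the dictionary keys are illustrative choices; the counts are copied from the two tables above, and the second matrix describes the training data.

```python
def rates(tn, fp, fn, tp):
    """Summary rates for a binary confusion matrix (positive = cancer)."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "false_positive_rate": fp / (fp + tn),  # healthy patients flagged as cancer
        "false_negative_rate": fn / (fn + tp),  # cancer patients that were missed
        "sensitivity": tp / (tp + fn),          # cancer patients that were found
    }


# Approach 1: Random Forest on the four section features.
approach_1 = rates(tn=301, fp=383, fn=117, tp=128)

# Approach 2: VGG16 features with the MLP classifier, on the training data.
approach_2 = rates(tn=744, fp=1, fn=57, tp=198)
```

Reporting the false positive and false negative rates separately matters here because the two classes are imbalanced, so a single accuracy figure can hide a high miss rate among the cancer patients.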

7.3 COMPARE OUTPUT AGAINST HYPOTHESIS
The efficiency of the model depends on the preprocessing, the feature extraction, and the classification model. It also largely depends on the dataset used, the quality of the CT scan images, and the volume of data, i.e. the number of patients. With the increasing rate of lung cancer over time, more data will become available, which can improve the efficiency of the model accordingly. With the dataset of around 1,500 patients with low-dose CT scan images (around 300 images per patient), where 75% of the data is used for training and 25% for testing, we managed to get a 90% accuracy with our training data and a 70% accuracy rate with our test data with completely automated pulmonary nodule detection.

7.4 ABNORMAL CASE EXPLANATION
Since the data available was of low-dose CT scans, the image quality could also have potentially affected the efficiency. With the advancement of technology, better CT scanners have been developed which can take CT scans with higher precision and image quality, hence providing a better input for the model.

7.5 DISCUSSION:
We selected 1,000 patients to train our CNN classification model, and used the remaining 200 patients to test the algorithm. Many iterations of testing various parameters were attempted before concluding with the above model. As seen, we may have a potential problem with overfitting, but for now, we will conclude that our model is adequate for predicting the detection of lung cancer nodules.

8 CONCLUSION AND RECOMMENDATIONS
8.1 SUMMARY AND CONCLUSIONS:
In this paper, we study the use of image processing, data mining, and machine learning techniques to predict lung cancer nodules in high-risk patients. Based on the research and analysis conducted for this project using a publicly available dataset of lung CT scans, we were able to develop a successful model for lung cancer nodule detection. By using a hybrid of approaches in image processing and classification, we were able to develop an end-to-end process that detects lung cancer nodules with high accuracy. Further, by placing a heavy emphasis on automation of image processing as well as a reduction of false positives, we were able to develop a full model that runs with 70% accuracy on test data.

Given the difficult nature of the problem, we faced various challenges throughout the process. First, the segmentation of lungs is a very challenging problem due to inhomogeneity in the lung region and pulmonary structures of similar densities, such as arteries, veins, bronchi, and bronchioles. Second, because of the different scanners and scanning protocols and the difference in quality of the CT scans, the images had to be made uniform before processing. Finally, with hundreds of images per patient, processing the CT scans was constrained by memory and was also time consuming, since the data per patient was relatively large (around 300 images per patient with around 1,500 patients).

8.2 RECOMMENDATIONS FOR FUTURE STUDIES:
For feature extraction, we have used VGG16, as the simplicity of the model makes it easy to implement. However, there are other pre-trained CNN models available for feature extraction. For a complicated process like image classification, many other deep learning technologies have been proposed. At each stage, one could use a novel approach to retain the maximum number of features, which would overall lead to a better model. Other pre-trained models, such as ResNet and GoogLeNet, could be used for feature extraction, and other deep learning modules could be combined with different loss functions, layers, and optimization techniques for better results. As previously mentioned, our classification model is potentially overfitting the data, and thus there exists a clear opportunity to research different methodologies to combat this problem. Finally, the classifier being used is an MLP classifier with a sequential model; however, other regularization techniques and functions could be explored to update the weights in order to increase the accuracy of the model.

8.3 BIBLIOGRAPHY

"1.17. Neural Network Models (supervised)." 1.17. Neural Network Models (supervised) - Scikit-learn 0.18.1 Documentation. Google, n.d. Web. 12 June 2017.

American Cancer Society. Cancer facts and figures, 2015. URL http://www.cancer.org/research/cancerfactsstatistics/index.

Bechtold, Robert E., Michael Y. M. Chen, David J. Ott, Ronald J. Zagoria, Eric S. Scharling, Neil T. Wolfman, and David J. Vining. "Interpretation of Abdominal CT: Analysis of Errors and Their Causes." Journal of Computer Assisted Tomography 21.5 (1997): 681-85. Web.

Chauhan, Divya, and Varun Jaiswal. "An Efficient Data Mining Classification Approach for Detecting Lung Cancer Disease." 2016 International Conference on Communication and Electronics Systems (ICCES) (2016): n. pag. Web.

"Getting Started with the Keras Sequential Model." Guide to the Sequential Model - Keras Documentation. N.p., n.d. Web. 12 June 2017.

Golan, Rotem, Christian Jacob, and Jorg Denzinger. "Lung Nodule Detection in CT Images Using Deep Convolutional Neural Networks." 2016 International Joint Conference on Neural Networks (IJCNN) (2016): n. pag. Web.

Hawkins, Samuel H., John N. Korecki, Yoganand Balagurunathan, Yuhua Gu, Virendra Kumar, Satrajit Basu, Lawrence O. Hall, Dmitry B. Goldgof, Robert A. Gatenby, and Robert J. Gillies. "Predicting Outcomes of Nonsmall Cell Lung Cancer Using CT Image Features." IEEE Access 2 (2014): 1418-426. Web.

Jafar, Iyad, Hao Ying, Anthony F. Shields, and Otto Muzik. "Computerized Detection of Lung Tumors in PET/CT Images." 2006 International Conference of the IEEE Engineering in Medicine and Biology Society (2006): n. pag. Web.

Juma, Kassimu, Ma He, and Yue Zhaoc. "Lung Cancer Detection and Analysis Using Data Mining Techniques, Principal Component Analysis and Artificial Neural Network." American Scientific Research Journal for Engineering, Technology, and Sciences (n.d.): n. pag. Web.

Kuruvilla, Jinsa, and K. Gunavathi. "Lung Cancer Classification Using Neural Networks for CT Images." Computer Methods and Programs in Biomedicine 113.1 (2014): 202-09. Web.

Paul, Rahul, Samuel H. Hawkins, Lawrence O. Hall, Dmitry B. Goldgof, and Robert J. Gillies. "Combining Deep Neural Network and Traditional Image Features to Improve Survival Prediction Accuracy for Lung Cancer Patients from Diagnostic CT." 2016 IEEE International Conference on Systems, Man, and Cybernetics (SMC) (2016): n. pag. Web.

Rao, R. Bharat, Jinbo Bi, Glenn Fung, Marcos Salganicoff, Nancy Obuchowski, and David Naidich. "LungCAD." Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '07 (2007): n. pag. Web.

Raschka, Sebastian. "KDnuggets." KDnuggets Analytics Big Data Data Mining and Data Science. N.p., n.d. Web. 12 June 2017.

Rosebrock, Adrian. "ImageNet: VGGNet, ResNet, Inception, and Xception with Keras." PyImageSearch. Pyimagesearch, 03 May 2017. Web. 12 June 2017.

Srivastava, Nitish, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. "Dropout: A Simple Way to Prevent Neural Networks from Overfitting." Journal of Machine Learning Research (2014): n. pag. Web.

Ypsilantis, Petros-Pavlos, and Giovanni Montana. "Recurrent Convolutional Networks for Pulmonary Nodule Detection in CT Imaging." (2016): 1-36. https://arxiv.org/pdf/1609.09143.pdf. Web. May 2017.

Zisserman, Andrew, and Karen Simonyan. "Very Deep Convolutional Networks for Large-Scale Visual Recognition." Visual Geometry Group Home Page. Visual Geometry Group, Department of Engineering Science, University of Oxford, n.d. Web. 12 June 2017.

8.4 PROGRAM FLOW CHART
The program flow chart can be found in section 6.3, and detailed instructions for running the code can be found in the submitted README file.

8.5 PROGRAM SOURCE CODE WITH DOCUMENTATION
The code can be found in the submitted files, along with a README file containing the full documentation.

8.6 INPUT/OUTPUT LISTING
Input: As previously mentioned, the input files are in the form of a DICOM image. Examples of slices in a DICOM image for one patient can be found below.

Output: The final output is a classification value: 1 for cancerous, and 0 for non-cancerous.