Accelerating Image Recognition Processing Using Tiered Storage and Spark
Isom Crawford Jr., Ph.D., IBM
David Chen, Ph.D., IBM
Spark Processing
• Open-source framework, improvements on Hadoop
• Popular for data analytics – text, imagery, etc.
Tiers of Storage
• Multiple “levels” of latency, bandwidth, and cost
[Figure: storage tiers; labels: Processor, SSD, SAS etc., Cloud, Sequential Media]
Tiered Storage
• Organized and managed to move data being processed closer to the processor
• And vice versa: move data not being processed to less-expensive storage
Problem – Less-expensive storage I/O speeds
• Tiered storage
  • Various levels of I/O performance
  • Faster -> smaller
  • Larger -> slower
• Data (or Information) Lifecycle Management
  • Handles automatic migration between tiers, e.g., ILM policies
  • Eliminates manual retrieval
• But on-demand retrieval adds a lot of latency
Example – Image Recognition
• Consider image recognition
  • Once the model is built, recognition is relatively fast
  • Faster than cloud or sequential media can feed data
• Problem: how to improve I/O speeds from less-expensive storage
Sequential approach
[Figure: sequential timeline; labels: Recognition Processing (Tp), Data Retrieval (Td)]
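As a rough worked estimate of why the sequential approach leaves the processor idle, assume every one of N images takes the same retrieval time Td and the same processing time Tp (a simplification; the values below are hypothetical and not from the slides):

```latex
\[
T_{\text{seq}} = N\,(T_d + T_p),
\qquad
U_{\text{proc}} = \frac{T_p}{T_d + T_p}
\]
% Hypothetical example: T_d = 2\,\text{s},\; T_p = 0.5\,\text{s}
\[
U_{\text{proc}} = \frac{0.5}{2 + 0.5} = 0.2
\]
```

Under these assumed values the processor does useful work only 20% of the time and waits on retrieval for the rest.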
Pipeline approach: single processor, multi-retriever
[Figure: pipeline timeline; labels: Recognition Processing (Tp), Data Retrieval (Td)]
• Ostensibly (ideally?), the number of retrieval threads is the ratio of retrieval time to processing time: Nr ~ Td/Tp
• Challenges:
  • Nonuniform processing times Tp
  • Retrieval threads may correspond to tape drives (or may not!)
  • Retrieval times are not likely to be proportional to image size
    • Initial Td (initial tape load, etc.)
    • Network traffic (cloud)
  • Generally non-deterministic
  • Nr limited by number of tape drives, network connections
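The single-processor, multi-retriever idea can be sketched in Python as below. This is a minimal illustration, not the authors' implementation: `retriever_count`, `run_pipeline`, and the `retrieve` and `recognize` callables are hypothetical names, and the sketch assumes uniform Td and Tp and does not manage cache capacity.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def retriever_count(td, tp, max_retrievers):
    """Estimate Nr ~ Td/Tp, capped by available drives or connections."""
    return max(1, min(max_retrievers, math.ceil(td / tp)))


def run_pipeline(names, retrieve, recognize, n_retrievers):
    """Overlap retrieval with recognition: many retrievers, one processor.

    `retrieve` and `recognize` are caller-supplied (hypothetical) functions.
    """
    results = []
    with ThreadPoolExecutor(max_workers=n_retrievers) as pool:
        # Queue every retrieval up front; the pool runs at most
        # n_retrievers of them concurrently. Retrieved objects pile up
        # until consumed, so a real version would bound this.
        futures = [pool.submit(retrieve, name) for name in names]
        # The single processing thread consumes objects in input order,
        # blocking only when the next object has not arrived yet.
        for name, future in zip(names, futures):
            results.append((name, recognize(future.result())))
    return results
```

For example, with an assumed Td of 2 s, Tp of 0.5 s, and 8 available drives, `retriever_count(2.0, 0.5, 8)` suggests 4 retrieval threads.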
Strategies
• Prefetch input datasets
• Use “close” storage as cache
• Communication between processing threads and prefetch threads
  • Use comm file, Spark shared RDD (IgniteRDD, IBM Conductor/shared RDD)
• Leverage schedulers (Lava, LSF, SGE, Conductor for Spark, etc.) to start prefetch job(s) before processing
  • Need to manage “cache” capacity – use communication to purge
  • Leverage Information Lifecycle functionality (Scale/GPFS, etc.)
Simple Prefetch Approach
Simple data structure

(Filename)   Status
Abc1.png     processed
Abc2.png     processed
Abc3.png     processing
Abc4.png     cache
Abc5.png     cache
Abc6.png     tape
Abc7.png     tape
…
• Prefetch threads find the next input object still archived and initiate retrieval
• If/when the cache approaches being full, archive threads move objects to less-expensive storage
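The status table and the two thread roles above might look like the following sketch. Several details are assumptions not stated on the slide: the `ARCHIVED` status, the 90% fullness threshold, counting capacity in objects rather than bytes, and archiving only objects already processed. All function names are hypothetical, and a multi-threaded version would need a lock around the table.

```python
from enum import Enum


class Status(Enum):
    TAPE = "tape"
    CACHE = "cache"
    PROCESSING = "processing"
    PROCESSED = "processed"
    ARCHIVED = "archived"  # assumption: marks objects moved back out of cache


def example_table():
    """The table from the slide; dicts keep insertion order."""
    statuses = ["processed", "processed", "processing",
                "cache", "cache", "tape", "tape"]
    return {f"Abc{i}.png": Status(s) for i, s in enumerate(statuses, start=1)}


def prefetch_one(table, retrieve):
    """Find the next object still on tape and pull it into the cache."""
    for name, status in table.items():
        if status is Status.TAPE:
            retrieve(name)  # caller-supplied retrieval function
            table[name] = Status.CACHE
            return name
    return None


def archive_if_full(table, capacity, archive, threshold=0.9):
    """When the cache nears capacity, move processed objects out."""
    in_cache = (Status.CACHE, Status.PROCESSING, Status.PROCESSED)
    resident = [n for n, s in table.items() if s in in_cache]
    moved = []
    if len(resident) >= threshold * capacity:
        for name in resident:
            if table[name] is Status.PROCESSED:
                archive(name)  # caller-supplied archival function
                table[name] = Status.ARCHIVED
                moved.append(name)
    return moved
```

With the slide's table, one prefetch step picks Abc6.png; if the cache is assumed to hold 6 objects, the archive step then moves Abc1.png and Abc2.png out.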
Implementation Approaches
• Use comm file
  • Use POSIX file locking to allow communication
• Spark shared RDD (Apache Ignite, IBM Conductor-with-Spark)
  • Leverage shared-memory approach
  • Faster communication
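A comm file guarded by POSIX locks could be sketched as follows, using Python's `fcntl.lockf` (Unix only). The JSON layout and the function names `update_status` and `read_status` are assumptions for illustration; the slides do not specify a file format. Note that POSIX record locks are held per process, so this coordinates separate prefetch and processing jobs rather than threads inside one process.

```python
import fcntl
import json
import os


def update_status(path, name, status):
    """Set one object's status in a JSON comm file under an exclusive lock."""
    # Open read/write, creating the file on first use.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+") as f:
        fcntl.lockf(f, fcntl.LOCK_EX)  # blocks until the lock is granted
        try:
            text = f.read()
            table = json.loads(text) if text else {}
            table[name] = status
            # Rewrite the whole file in place while still holding the lock.
            f.seek(0)
            f.truncate()
            json.dump(table, f)
            f.flush()
        finally:
            fcntl.lockf(f, fcntl.LOCK_UN)
    return table


def read_status(path):
    """Read the whole table under a shared lock."""
    with open(path, "r") as f:
        fcntl.lockf(f, fcntl.LOCK_SH)
        try:
            return json.load(f)
        finally:
            fcntl.lockf(f, fcntl.LOCK_UN)
```

A prefetch job would call `update_status(path, "Abc6.png", "cache")` once retrieval completes, and the processing job would poll `read_status(path)` for objects ready to process.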
Summary
• Tiered storage presents both efficiencies and challenges
• Prefetching technology is well understood for “single tier” (file caching)
• Future investigation
  • Identify shared RDD template, syntax to better support multi-tier prefetch
  • Prefetch objects, files, and/or blocks (granularity challenges & efficiencies)
  • Application-triggered archival, post-processing (selective, read-only, etc.)