Accelerating Image Recognition Processing Using Tiered Storage and Spark

Isom Crawford Jr., Ph.D., IBM
David Chen, Ph.D., IBM

Post on 11-Sep-2020



Spark Processing

• Open-source framework, improvements on Hadoop
• Popular for data analytics: text, imagery, etc.

Tiers of Storage

• Multiple “levels” of latency, bandwidth, and cost

[Figure: storage tiers relative to the processor. Labels: Processor; SSD; SAS, etc.; Cloud; Sequential Media]

Tiered Storage

• Organized and managed to move data being processed closer to the processor
• And vice versa: move data not being processed to less-expensive storage

[Figure: storage tiers relative to the processor. Labels: Processor; SSD; SAS, etc.; Cloud; Sequential Media]

Problem: Less-Expensive Storage I/O Speeds

• Tiered storage
  • Various levels of I/O performance
  • Faster -> smaller
  • Larger -> slower
• Data (or Information) Lifecycle Management
  • Handles automatic migration between tiers, e.g., ILM policies
  • Eliminates manual retrieval
• But on-demand retrieval adds a lot of latency

Example: Image Recognition

• Consider image recognition
  • Once the model is built, recognition is relatively fast
  • Faster than cloud or sequential media can feed data
• Problem: how to improve I/O speeds from less-expensive storage

Sequential approach

[Figure: timeline of Data Retrieval (Td) and Recognition Processing (Tp) steps]

Pipeline approach: single processor, multi-retriever

[Figure: timeline of Recognition Processing (Tp) fed by Nr Data Retrieval (Td) threads]

Ostensibly (ideally?), the number of retrieval threads is the ratio of retrieval time to processing time: Nr ~ Td/Tp

Challenges:

• Nonuniform processing times Tp
• Retrieval threads may correspond to a tape drive (or may not!)
• Retrieval times not likely to be proportional to image size
  • Initial Td (initial tape load, etc.)
  • Network traffic (cloud)
  • Generally non-deterministic
• Limited by number of tape drives, network connections
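The Nr ~ Td/Tp rule of thumb can be checked with a small timing model. The sketch below is not from the slides: it assumes uniform Td and Tp, retrievers that all start at time zero and never wait for cache space, and hypothetical function names. The challenges listed above (non-uniform, non-deterministic times) are exactly what this idealization ignores.

```python
import math


def retriever_count(t_retrieve, t_process):
    """Smallest whole number of retrievers N_r with N_r >= T_d / T_p."""
    if t_retrieve <= 0 or t_process <= 0:
        raise ValueError("times must be positive")
    return max(1, math.ceil(t_retrieve / t_process))


def sequential_time(n_images, t_retrieve, t_process):
    """Sequential approach: each image is retrieved, then processed."""
    return n_images * (t_retrieve + t_process)


def pipelined_time(n_images, t_retrieve, t_process, n_retrievers):
    """Idealized finish time with one processor and n_retrievers fetchers.

    Assumption: image i is fetched in round i // n_retrievers, so it is
    ready at (round + 1) * t_retrieve; the processor handles images in order.
    """
    finish = 0
    for i in range(n_images):
        ready = (i // n_retrievers + 1) * t_retrieve
        finish = max(finish, ready) + t_process
    return finish


# Example: Td = 8 time units per retrieval, Tp = 2 per recognition.
n_r = retriever_count(8, 2)
total_seq = sequential_time(8, 8, 2)
total_pipe = pipelined_time(8, 8, 2, n_r)
```

With these illustrative numbers, four retrievers keep the single processor busy after the first retrieval completes; adding more retrievers would not shorten the run under this model.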

Strategies

• Prefetch input datasets
• Use “close” storage as cache
• Communication between processing threads and prefetch threads
  • Use comm file, Spark shared RDD (IgniteRDD, IBM Conductor/shared RDD)
• Leverage schedulers (Lava, LSF, SGE, Conductor for Spark, etc.) to start prefetch job(s) before processing
  • Need to manage “cache” capacity: use communication to purge
  • Leverage Information Lifecycle functionality (Scale/GPFS, etc.)
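The prefetch and “close storage as cache” strategies can be illustrated with plain Python threads. This is a minimal sketch under stated assumptions, not the authors' implementation: a bounded in-memory queue stands in for the cache tier, and `fetch` and `recognize` are hypothetical callables supplied by the caller (for example, a tape or cloud read and a model inference call).

```python
import queue
import threading


def run_pipeline(names, fetch, recognize, n_retrievers=4, cache_slots=8):
    """Single processor fed by several prefetch (retrieval) threads."""
    todo = queue.Queue()
    for name in names:
        todo.put(name)

    # Bounded queue: stands in for limited "close" storage used as a cache.
    cache = queue.Queue(maxsize=cache_slots)

    def retriever():
        while True:
            try:
                name = todo.get_nowait()
            except queue.Empty:
                return  # nothing left to prefetch
            # put() blocks while the cache is full, throttling prefetch.
            cache.put((name, fetch(name)))

    threads = [threading.Thread(target=retriever) for _ in range(n_retrievers)]
    for t in threads:
        t.start()

    # The single processing thread consumes objects as they arrive.
    results = {}
    for _ in range(len(names)):
        name, data = cache.get()
        results[name] = recognize(data)

    for t in threads:
        t.join()
    return results
```

The sketch assumes `fetch` never raises; a real version would need to report failed retrievals to the processing loop so it does not wait forever.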

Simple Prefetch Approach

Simple data structure:

(Filename)   Status
Abc1.png     processed
Abc2.png     processed
Abc3.png     processing
Abc4.png     cache
Abc5.png     cache
Abc6.png     tape
Abc7.png     tape
…

Prefetch threads find the next input object still archived and initiate retrieval.

If/when the cache approaches being full, archive threads move objects to less-expensive storage.
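The status table and the two thread roles above can be sketched as a small state machine. The statuses `tape`, `cache`, `processing`, and `processed` come from the slide; the `archived` status, the function names, and the integer capacity and high-water thresholds are assumptions added for illustration, and actual data movement is omitted.

```python
from enum import Enum


class Status(Enum):
    TAPE = "tape"
    CACHE = "cache"
    PROCESSING = "processing"
    PROCESSED = "processed"
    ARCHIVED = "archived"  # assumption: processed and moved back out


# Statuses that occupy space in the "close" storage cache.
RESIDENT = {Status.CACHE, Status.PROCESSING, Status.PROCESSED}


def cache_usage(table):
    """Number of objects currently held in the cache tier."""
    return sum(1 for status in table.values() if status in RESIDENT)


def prefetch_step(table, capacity):
    """Prefetch thread: mark the next archived input object as cached."""
    if cache_usage(table) >= capacity:
        return None  # cache full, wait for the archive thread
    for name, status in table.items():
        if status is Status.TAPE:
            table[name] = Status.CACHE  # a real system would copy data here
            return name
    return None


def archive_step(table, high_water):
    """Archive thread: evict processed objects while usage >= high_water."""
    moved = []
    for name, status in table.items():
        if cache_usage(table) < high_water:
            break
        if status is Status.PROCESSED:
            table[name] = Status.ARCHIVED
            moved.append(name)
    return moved


table = {
    "Abc1.png": Status.PROCESSED, "Abc2.png": Status.PROCESSED,
    "Abc3.png": Status.PROCESSING, "Abc4.png": Status.CACHE,
    "Abc5.png": Status.CACHE, "Abc6.png": Status.TAPE,
    "Abc7.png": Status.TAPE,
}
```

In a multi-threaded or multi-process setting, each step would have to run under a lock or through a shared structure so that two prefetch threads do not claim the same object.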

Implementation Approaches

• Use comm file
  • Use POSIX file-locking to allow communication
• Spark shared RDD (Apache Ignite, IBM Conductor-with-Spark)
  • Leverage shared memory approach
  • Faster communication
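The comm-file option can be sketched with Python's `fcntl.lockf`, which wraps POSIX advisory record locks. The JSON layout, the file path, and the function names are assumptions for illustration; the slides do not specify a file format. This runs on POSIX systems only.

```python
import fcntl
import json
import os


def update_status(path, name, status):
    """Set one object's status in a JSON comm file under an exclusive lock."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+") as f:
        fcntl.lockf(f, fcntl.LOCK_EX)  # blocks until no other process holds it
        try:
            text = f.read()
            table = json.loads(text) if text else {}
            table[name] = status
            f.seek(0)
            f.truncate()  # rewrite the whole file from the start
            json.dump(table, f)
            f.flush()
            os.fsync(f.fileno())
        finally:
            fcntl.lockf(f, fcntl.LOCK_UN)
    return table


def read_status(path):
    """Read the whole status table under a shared lock."""
    with open(path, "r") as f:
        fcntl.lockf(f, fcntl.LOCK_SH)
        try:
            text = f.read()
        finally:
            fcntl.lockf(f, fcntl.LOCK_UN)
    return json.loads(text) if text else {}
```

POSIX record locks are advisory and owned by the process, so this coordinates separate prefetch and processing processes; threads inside one process would still need their own mutex.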

Summary

• Tiered storage presents efficiencies and challenges
• Prefetching technology is well understood for “single tier” (file caching)
• Future investigation
  • Identify shared RDD template, syntax to better support multi-tier prefetch
  • Prefetch objects, files, and/or blocks (granularity challenges & efficiencies)
  • Application-triggered archival, post-processing (selective, read-only, etc.)
