Accelerating Image Recognition Processing Using Tiered Storage and Spark Isom Crawford Jr., Ph.D., IBM David Chen, Ph.D., IBM

Sep 11, 2020

Transcript
  • Accelerating Image Recognition Processing Using Tiered Storage and Spark

    Isom Crawford Jr., Ph.D., IBM

    David Chen, Ph.D., IBM

  • Spark Processing

    • Open-source framework, improvements on Hadoop
    • Popular for data analytics – text, imagery, etc.

  • Tiers of Storage

    • Multiple "levels" of latency, bandwidth, and cost

    [Figure: storage tiers. Labels: Processor, SSD, SAS, etc., Cloud, Sequential Media]

  • Tiered Storage

    • Organized and managed to move data being processed closer to the processor
    • And vice versa: move data not being processed to less-expensive storage

    [Figure: storage tiers. Labels: Processor, SSD, SAS, etc., Cloud, Sequential Media]

  • Problem – Less-expensive storage I/O speeds

    • Tiered storage
      • Various levels of I/O performance
      • Faster -> smaller
      • Larger -> slower

    • Data (or Information) Lifecycle Management
      • Handles automatic migration between tiers, e.g., ILM policies
      • Eliminates manual retrieval

    • But on-demand retrieval adds a lot of latency

  • Example – Image Recognition

    • Consider image recognition
      • Once the model is built, recognition is relatively fast
      • Faster than cloud or sequential media can feed data

    • Problem: how to improve I/O speeds from less-expensive storage

  • Sequential approach

    [Figure: timeline alternating Data Retrieval (Td) and Recognition Processing (Tp)]

  • Pipeline approach: single processor, multi-retriever

    [Figure: timeline with Recognition Processing (Tp) overlapped by parallel Data Retrieval (Td)]
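    The single-processor, multi-retriever pipeline can be sketched in Python with several retrieval threads feeding one processing loop through a bounded queue. This is only a minimal illustration, not the authors' implementation: `retrieve`, `recognize`, and `run_pipeline` are hypothetical names, and the `time.sleep` calls stand in for the real tape or cloud fetch (Td) and the real recognition step (Tp).

```python
import queue
import threading
import time


def retrieve(name, t_d):
    """Hypothetical stand-in for a slow fetch from tape or cloud (takes ~Td)."""
    time.sleep(t_d)
    return name, b"image-bytes"


def recognize(name, data, t_p):
    """Hypothetical stand-in for running the recognition model (takes ~Tp)."""
    time.sleep(t_p)
    return name, "label"


def run_pipeline(names, n_retrievers, t_d=0.05, t_p=0.01):
    """One processing loop consumes images fetched by n_retrievers threads."""
    todo = queue.Queue()
    for name in names:
        todo.put(name)
    # Bounded queue: retrievers block when the "close" cache is full.
    ready = queue.Queue(maxsize=2 * n_retrievers)

    def retriever():
        while True:
            try:
                name = todo.get_nowait()
            except queue.Empty:
                return  # nothing left to fetch
            ready.put(retrieve(name, t_d))

    threads = [threading.Thread(target=retriever) for _ in range(n_retrievers)]
    for t in threads:
        t.start()

    results = []
    for _ in range(len(names)):
        name, data = ready.get()  # blocks until some retrieval finishes
        results.append(recognize(name, data, t_p))
    for t in threads:
        t.join()
    return results
```

    Images are processed in the order their retrievals complete, which need not match the input order.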

  • Pipeline approach: single processor, multi-retriever

    [Figure: timeline with Recognition Processing (Tp) fed by Nr parallel Data Retrieval (Td) threads]

    Ostensibly (ideally?), the number of retrieval threads is the ratio of retrieval time to processing time: Nr ~ Td / Tp

    Challenges:
    • Nonuniform processing times Tp
    • Retrieval threads may correspond to tape drives (or may not!)
    • Retrieval times are not likely to be proportional to image size
      • Initial Td (initial tape load, etc.)
      • Network traffic (cloud)
      • Generally non-deterministic
    • Limited by number of tape drives, network connections
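    As a worked illustration of the ratio above, the following sketch estimates the retriever count and compares idealized sequential and pipelined run times. It assumes constant Td and Tp, which the challenges listed above say will not hold in practice. The function names and the `max_retrievers` cap (standing in for the number of tape drives or network connections) are hypothetical.

```python
import math


def retriever_count(t_d, t_p, max_retrievers=None):
    """Nr ~ Td / Tp, rounded up, optionally capped by available drives/links."""
    n_r = max(1, math.ceil(t_d / t_p))
    if max_retrievers is not None:
        n_r = min(n_r, max_retrievers)
    return n_r


def sequential_time(n_images, t_d, t_p):
    """Each image is retrieved, then processed, with no overlap."""
    return n_images * (t_d + t_p)


def pipelined_time(n_images, t_d, t_p, n_r):
    """Idealized model: wait for the first retrieval, then the slower
    stage (processing, or retrieval spread over n_r threads) sets the pace."""
    per_image = max(t_p, t_d / n_r)
    return t_d + (n_images - 1) * per_image + t_p


if __name__ == "__main__":
    t_d, t_p, n = 2.0, 0.5, 100  # made-up timings in seconds
    n_r = retriever_count(t_d, t_p)
    print(n_r)                                # 4
    print(sequential_time(n, t_d, t_p))       # 250.0
    print(pipelined_time(n, t_d, t_p, n_r))   # 52.0
```

    With fewer retrievers than the ratio suggests, retrieval remains the bottleneck: capping at two retrievers in the same example gives 101.5 seconds.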

  • Strategies

    • Prefetch input datasets

    • Use "close" storage as cache

    • Communication between processing threads and prefetch threads
      • Use comm file, Spark shared RDD (IgniteRDD, IBM Conductor/shared RDD)

    • Leverage schedulers (Lava, LSF, SGE, Conductor for Spark, etc.) to start prefetch job(s) before processing
      • Need to manage "cache" capacity – use communication to purge
      • Leverage Information Lifecycle functionality (Scale/GPFS, etc.)

  • Simple Prefetch Approach

    Simple data structure:

    (Filename)    Status
    Abc1.png      processed
    Abc2.png      processed
    Abc3.png      processing
    Abc4.png      cache
    Abc5.png      cache
    Abc6.png      tape
    Abc7.png      tape
    …

    Prefetch threads find the next input object still archived and initiate retrieval

    If/when the cache approaches being full, archive threads move objects to less-expensive storage
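    The status table and the two rules above might be represented as follows. This is a sketch under stated assumptions rather than the authors' code: the state names come from the slide, but the helper functions, the `cached` size map, the `high_water` threshold, and the choice to evict already-processed objects first are all hypothetical.

```python
TAPE, CACHE, PROCESSING, PROCESSED = "tape", "cache", "processing", "processed"


def next_to_prefetch(status):
    """First object (in table order) that is still archived."""
    for name, state in status.items():
        if state == TAPE:
            return name
    return None


def next_to_process(status):
    """First object (in table order) already waiting in the cache."""
    for name, state in status.items():
        if state == CACHE:
            return name
    return None


def archive_candidates(status, cached, capacity, high_water=0.9):
    """Processed objects to move out until usage drops to the high-water mark.

    `cached` maps each cache-resident object to its size (assumed bookkeeping).
    """
    used = sum(cached.values())
    victims = []
    for name, state in status.items():
        if used <= high_water * capacity:
            break
        if state == PROCESSED and name in cached:
            victims.append(name)
            used -= cached[name]
    return victims


if __name__ == "__main__":
    status = {
        "Abc1.png": PROCESSED, "Abc2.png": PROCESSED, "Abc3.png": PROCESSING,
        "Abc4.png": CACHE, "Abc5.png": CACHE, "Abc6.png": TAPE, "Abc7.png": TAPE,
    }
    # Made-up sizes: five resident objects of 40 units fill a 200-unit cache.
    cached = {name: 40 for name in list(status)[:5]}
    print(next_to_prefetch(status))                   # Abc6.png
    print(next_to_process(status))                    # Abc4.png
    print(archive_candidates(status, cached, 200))    # ['Abc1.png']
```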

  • Implementation Approaches

    • Use comm file
      • Use POSIX file-locking to allow communication

    • Spark shared RDD (Apache Ignite, IBM Conductor-with-Spark)
      • Leverage shared memory approach
      • Faster communication
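    A comm-file approach could look like the sketch below, which uses Python's `fcntl.lockf` (POSIX record locking, Unix only) so that prefetch and processing processes can each claim work from a shared status file without racing. The JSON file layout and the `claim_next` helper are assumptions made for illustration; the slides do not specify a file format.

```python
import fcntl
import json


def claim_next(path, from_state, to_state):
    """Find the first entry in `from_state`, flip it to `to_state`, and
    return its name, all while holding an exclusive lock on the comm file.
    Returns None when no entry is in `from_state`."""
    with open(path, "r+") as f:
        fcntl.lockf(f, fcntl.LOCK_EX)  # blocks until no other process holds it
        try:
            table = json.load(f)  # {"Abc1.png": "processed", ...}
            for name, state in table.items():
                if state == from_state:
                    table[name] = to_state
                    # Rewrite the whole file in place before releasing the lock.
                    f.seek(0)
                    f.truncate()
                    json.dump(table, f)
                    f.flush()
                    return name
            return None
        finally:
            fcntl.lockf(f, fcntl.LOCK_UN)
```

    A processing worker would call `claim_next(path, "cache", "processing")`, while a prefetch worker would claim entries still marked `"tape"`. POSIX locks are held per process, so this coordinates separate processes; threads within one process would need an additional in-process lock.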

  • Summary

    • Tiered storage presents efficiency and challenges
    • Prefetching technology well-understood for "single tier" (file caching)
    • Future investigation
      • Identify shared RDD template, syntax to better support multi-tier prefetch
      • Prefetch objects, files, and/or blocks (granularity challenges & efficiencies)
      • Application-triggered archival, post-processing (selective, read-only, etc.)