MicroBenchmarks - BenchCouncil

INSTITUTE OF COMPUTING TECHNOLOGY

Micro Benchmarks

Wanling Gao

ICT, Chinese Academy of Sciences

HPCA 2019, Washington D.C., USA

BigDataBench HPCA 2019

A scalable big data and AI benchmark suite

n Treat big data, AI and Internet service workloads as a pipeline of units of computation handling (input or intermediate) data

n Target: find the main abstractions of time-consuming units of computation (data motifs)
n The combination of data motifs = complex workloads

• Similar to Relational Algebra

n Data motif-based scalable benchmarking methodology

Wanling Gao, Jianfeng Zhan, Lei Wang, et al. Data Motif: A Lens Towards Fully Understanding Big Data and AI Workloads. PACT 2018.


BigDataBench Publications

n Data Motifs: A Lens Towards Fully Understanding Big Data and AI Workloads. PACT'18.
n BigDataBench: a Scalable and Unified Big Data and Artificial Intelligence Benchmark Suite. Technical Report.
n Understanding Big Data Analytics Workloads on Modern Processors. TPDS'16.
n Auto-tuning Spark Big Data Workloads on POWER8: Prediction-Based Dynamic SMT. PACT'16.
n BigDataBench: a Big Data Benchmark Suite from Internet Services. HPCA'14.
n CVR: Efficient Vectorization of SpMV on X86 Processors. CGO'18.
n BOPS, Not FLOPS! A New Metric, Measuring Tool, and Roofline Performance Model For Datacenter Computing. Technical Report.
n Data Motif-based Proxy Benchmarks for Big Data and AI Workloads. IISWC 2018.


Micro Benchmark Target

n Capture one class of unit of computation in big data and AI

n Can easily be ported to a new computer system or architecture at an earlier stage


Outline

n Summary of Micro Benchmark

n Micro Benchmark Characterization

n Conclusion


Summary

n 27 micro benchmarks

n Covering 6 workload types
• Offline analytics, Graph analytics
• Streaming, NoSQL, Data warehouse
• AI

n Covering 8 data motifs
• Transform, Graph, Set, Sort, Matrix, Logic, Sampling, Basic statistics

n Covering 5 application domains
• Internet Service (Social network, Search engine, E-commerce)
• Recognition Science
• Medical Science


Micro Benchmarks

[Figure: micro benchmarks grouped by workload type: AI, NoSQL, Offline analytics, Graph analytics, Streaming, Data warehouse]


Sort

n Sort key-value records according to a certain order

n Data input: Wikipedia entries

n Software stacks: Hadoop, Spark, Flink, MPI


Grep

n Extract matching strings from text files and count how many times they occurred
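A minimal single-process sketch of this unit of computation in Python, assuming the match criterion is a regular expression and the text is already available as lines in memory. The function name `grep_count` and the sample lines are illustrative and not taken from BigDataBench.

```python
import re
from typing import Dict, Iterable


def grep_count(lines: Iterable[str], pattern: str) -> Dict[str, int]:
    """Return every matched string together with its number of occurrences."""
    regex = re.compile(pattern)
    counts: Dict[str, int] = {}
    for line in lines:
        # finditer yields all non-overlapping matches in the line
        for match in regex.finditer(line):
            text = match.group(0)
            counts[text] = counts.get(text, 0) + 1
    return counts


# Hypothetical sample input
sample = ["big data and AI", "data motifs", "Data input"]
result = grep_count(sample, r"[Dd]ata")
```

A distributed version would apply the same per-line matching in parallel and then merge the per-partition counts.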




WordCount

n Count the number of words in a document
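A short Python sketch of the counting step, assuming words are separated by whitespace and compared case-insensitively (both are assumptions made for the example, not something the slide specifies). It returns per-word counts, from which the total follows.

```python
from collections import Counter


def word_count(document: str) -> Counter:
    """Count occurrences of each whitespace-separated word."""
    # Lower-casing and whitespace tokenization are assumptions of this sketch
    return Counter(document.lower().split())


counts = word_count("the quick fox jumps over the lazy dog")
total_words = sum(counts.values())
```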




MD5

n A widely used hash function producing a 128-bit hash value
n The input message is broken up into 512-bit blocks

n Software stacks: Hadoop, Spark, MPI
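The two properties above can be checked with Python's standard `hashlib` module. This is only a sketch: `md5_hex` and `padded_block_count` are illustrative helper names, and the block count assumes the standard MD5 padding (one marker byte plus a 64-bit length field, rounded up to a multiple of 512 bits).

```python
import hashlib


def md5_hex(message: bytes) -> str:
    """Return the MD5 digest of the message as 32 hexadecimal characters."""
    return hashlib.md5(message).hexdigest()


def padded_block_count(message_len_bytes: int) -> int:
    """Number of 512-bit (64-byte) blocks processed after padding.

    Padding adds at least 1 marker byte and an 8-byte length field.
    """
    return (message_len_bytes + 1 + 8 + 63) // 64


digest = md5_hex(b"abc")
digest_bits = hashlib.md5(b"abc").digest_size * 8  # digest size in bits
```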


Connected Component

n A subgraph in which any two vertices are connected to each other by paths
n Easily computed in linear time using either breadth-first search or depth-first search

n Data input: Facebook social network

n Software stacks: Hadoop, Spark, Flink, GraphLab, MPI
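A sequential sketch of the breadth-first-search approach mentioned above, assuming an undirected graph stored as a symmetric adjacency dictionary. The function name and the toy graph are hypothetical, and the distributed implementations listed on the slide are organized differently.

```python
from collections import deque
from typing import Dict, Hashable, List, Set


def connected_components(adj: Dict[Hashable, List[Hashable]]) -> List[Set[Hashable]]:
    """Find connected components with BFS; each vertex and edge is visited once."""
    seen: Set[Hashable] = set()
    components: List[Set[Hashable]] = []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            vertex = queue.popleft()
            for neighbor in adj.get(vertex, []):
                if neighbor not in seen:
                    seen.add(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components


# Hypothetical toy graph: two multi-vertex components and one isolated vertex
graph = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4], 6: []}
components = connected_components(graph)
```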


RandSample

n Select a subset of samples randomly
n Using a random data generator to determine whether each data item is selected or not
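One way to read this description is as independent per-record (Bernoulli) sampling. The Python sketch below follows that reading; the `fraction` and `seed` parameters and the function name are assumptions made for illustration.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def rand_sample(records: Iterable[T], fraction: float, seed: int = 0) -> List[T]:
    """Keep each record independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed makes the sketch reproducible
    # rng.random() is uniform in [0.0, 1.0), so fraction=1.0 keeps everything
    return [record for record in records if rng.random() < fraction]


subset = rand_sample(range(1000), 0.1, seed=42)
```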




FFT

n Cooley–Tukey algorithm
n Radix-2 decimation-in-time (DIT) FFT

n Data input: two-dimensional matrix
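A compact recursive sketch of the radix-2 decimation-in-time scheme in Python, assuming power-of-two lengths. Applying it to a two-dimensional matrix row by row and then column by column is an assumption of this example (the usual way to build a 2-D transform from a 1-D one); the function names are illustrative.

```python
import cmath
from typing import List, Sequence


def fft_radix2(x: Sequence[complex]) -> List[complex]:
    """Radix-2 DIT FFT; the length must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    if n == 0 or n % 2:
        raise ValueError("length must be a power of two")
    even = fft_radix2(x[0::2])  # transform of even-indexed samples
    odd = fft_radix2(x[1::2])   # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def fft2(matrix: Sequence[Sequence[complex]]) -> List[List[complex]]:
    """2-D FFT: transform every row, then every column."""
    rows = [fft_radix2(list(row)) for row in matrix]
    cols = [fft_radix2(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


spectrum = fft_radix2([1, 0, 0, 0])
```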



Matrix Multiply

n Compute the product matrix of two matrices

n Data input: two-dimensional matrix



NoSQL --- Read, Write, Scan

n Benchmarks
• Read records randomly
• Write new records
• Scan records in order

n Data input: ProfSearch resumes
• A semi-structured dataset from a vertical search engine for scientists

n Software stacks: HBase, MongoDB


OrderBy

n Order the data according to a specific item

n Data input: e-commerce transactions

n Software stacks: Hive, Spark-SQL, Impala


Aggregation

n Gather information and aggregate it in a summary form




Project

n Retrieve specified attributes (columns)




Filter

n Select partial records that match certain criteria




Select

n Select a set of records from one or more tables




Union

n Combine the results of two or more SELECT statements




Convolution

n The general expression

n Data input
• Image dataset --- Cifar, ImageNet
• Convolution kernel

n Software stacks: TensorFlow, Caffe2, PyTorch, Pthread

Note: g(x,y) is the filtered image, f(x,y) is the original image, ω is the filter kernel
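Using the symbols from the note, a plain-Python sketch of 2-D convolution follows. It assumes the usual kernel-convolution form g(x, y) = Σ_i Σ_j ω(i, j) · f(x − i, y − j), with i and j ranging over the kernel offsets, an odd-sized kernel, and zero padding at the borders; all three are assumptions of the example. Note that deep learning frameworks commonly compute the unflipped variant (cross-correlation).

```python
from typing import List

Matrix = List[List[float]]


def convolve2d(f: Matrix, w: Matrix) -> Matrix:
    """Convolve image f with kernel w; output has the same size as f."""
    height, width = len(f), len(f[0])
    a, b = len(w) // 2, len(w[0]) // 2  # kernel half-sizes (odd kernel assumed)
    g = [[0.0] * width for _ in range(height)]
    for x in range(height):
        for y in range(width):
            acc = 0.0
            for i in range(-a, a + 1):
                for j in range(-b, b + 1):
                    xs, ys = x - i, y - j
                    # Pixels outside the image are treated as zero
                    if 0 <= xs < height and 0 <= ys < width:
                        acc += w[i + a][j + b] * f[xs][ys]
            g[x][y] = acc
    return g


image = [[1.0, 2.0], [3.0, 4.0]]
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
filtered = convolve2d(image, identity)
```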


Fully Connected

n Has connections to all neurons in the previous layer
n Matrix multiplication followed by a bias offset

n Data input: image dataset --- Cifar, ImageNet
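The "matrix multiplication followed by a bias offset" can be sketched for a single input vector as y = W·x + b. The names `weights` and `bias` and the example values are hypothetical.

```python
from typing import List, Sequence


def fully_connected(x: Sequence[float],
                    weights: Sequence[Sequence[float]],
                    bias: Sequence[float]) -> List[float]:
    """Compute y = W x + b, with one row of W per output neuron."""
    if len(weights) != len(bias):
        raise ValueError("need one bias value per output neuron")
    if any(len(row) != len(x) for row in weights):
        raise ValueError("every weight row must match the input length")
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


# Three inputs, two output neurons (illustrative values)
y = fully_connected([1.0, 2.0, 3.0],
                    [[1.0, 0.0, 0.0], [1.0, 1.0, 1.0]],
                    [0.5, -1.0])
```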



ReLU

n Abbreviation of rectified linear unit
n Defined as the positive part of its argument

x is the input to a neuron
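"The positive part of its argument" corresponds to f(x) = max(0, x). A one-function Python sketch, with an element-wise helper whose name is illustrative:

```python
from typing import List, Sequence


def relu(x: float) -> float:
    """Rectified linear unit: the positive part of x."""
    return max(0.0, x)


def relu_layer(xs: Sequence[float]) -> List[float]:
    """Apply ReLU element-wise to a vector of neuron inputs."""
    return [relu(x) for x in xs]


activated = relu_layer([-2.0, 0.0, 3.5])
```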


Sigmoid

n Sigmoid activation function




Tanh

n Tanh activation function
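For reference, the Sigmoid and Tanh activations can be sketched in Python as below, assuming the standard definitions sigmoid(x) = 1 / (1 + e^(−x)) and tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)). The branch in `sigmoid` is only there to avoid overflow for large negative inputs.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid, written to avoid overflow in math.exp."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def tanh_act(x: float) -> float:
    """Hyperbolic tangent, expressed through the sigmoid: 2*sigmoid(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


s0 = sigmoid(0.0)
t0 = tanh_act(0.0)
```

Sigmoid maps inputs to (0, 1), while Tanh maps them to (−1, 1) and is zero-centered.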




MaxPooling

n Non-linear down-sampling
n Divides the input image into a set of non-overlapping rectangles

n Outputs the maximum for each sub-rectangle




AvgPooling

n Down-sampling
n Divides the input image into a set of non-overlapping rectangles

n Outputs the average value for each sub-rectangle
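Max pooling and average pooling differ only in the reduction applied to each sub-rectangle, which the Python sketch below makes explicit. It assumes square windows with stride equal to the window size (hence non-overlapping) and drops any incomplete border window; the function names are illustrative.

```python
from typing import Callable, List, Sequence

Matrix = List[List[float]]


def pool2d(image: Sequence[Sequence[float]], size: int,
           reduce_fn: Callable[[List[float]], float]) -> Matrix:
    """Split the image into size x size tiles and reduce each tile to one value."""
    height, width = len(image), len(image[0])
    out: Matrix = []
    for r in range(0, height - size + 1, size):
        row = []
        for c in range(0, width - size + 1, size):
            window = [image[r + i][c + j]
                      for i in range(size) for j in range(size)]
            row.append(reduce_fn(window))
        out.append(row)
    return out


def max_pool(image, size=2):
    return pool2d(image, size, max)


def avg_pool(image, size=2):
    return pool2d(image, size, lambda w: sum(w) / len(w))


img = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
pooled_max = max_pool(img)
pooled_avg = avg_pool(img)
```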




Batch Normalization

n A normalization method/layer for neural networks
n For a layer with d-dimensional input x = (x(1) ... x(d))

Ioffe S, Szegedy C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
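A training-mode sketch in Python, assuming the per-dimension normalization described in the cited paper: each dimension x(k) is normalized over the mini-batch as (x(k) − E[x(k)]) / sqrt(Var[x(k)] + ε) and then scaled and shifted by learned parameters γ(k) and β(k). The defaults γ = 1, β = 0, and ε = 1e-5 are illustrative choices.

```python
import math
from typing import List, Optional, Sequence


def batch_norm(batch: Sequence[Sequence[float]],
               gamma: Optional[Sequence[float]] = None,
               beta: Optional[Sequence[float]] = None,
               eps: float = 1e-5) -> List[List[float]]:
    """Normalize every dimension over the mini-batch, then scale and shift."""
    m, d = len(batch), len(batch[0])
    gamma = [1.0] * d if gamma is None else gamma
    beta = [0.0] * d if beta is None else beta
    means = [sum(sample[k] for sample in batch) / m for k in range(d)]
    # Biased (population) variance over the mini-batch
    variances = [sum((sample[k] - means[k]) ** 2 for sample in batch) / m
                 for k in range(d)]
    return [[gamma[k] * (sample[k] - means[k]) / math.sqrt(variances[k] + eps)
             + beta[k] for k in range(d)]
            for sample in batch]


normalized = batch_norm([[1.0, 10.0], [3.0, 30.0]])
```

At inference time, frameworks typically substitute running averages for the per-batch statistics; that part is omitted here.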


Cosine Normalization

n Using cosine similarity in neural networks
n Instead of dot product

where net_norm is the normalized pre-activation, w⃗ is the incoming weight vector, x⃗ is the input vector, (·) indicates the dot product, and f is the nonlinear activation function

Luo C, Zhan J, Xue X, Wang L, Ren R, Yang Q. Cosine normalization: Using cosine similarity instead of dot product in neural networks. In International Conference on Artificial Neural Networks 2018 Oct 4 (pp. 382-391).
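Based on the symbol definitions above, the normalized pre-activation can be sketched as net_norm = (w⃗ · x⃗) / (|w⃗| |x⃗|), which is then passed to the activation f. The Python below assumes that form; returning 0 for a zero-length vector is a choice made only for this example.

```python
import math
from typing import Callable, Sequence


def cosine_norm(w: Sequence[float], x: Sequence[float]) -> float:
    """Cosine similarity between the weight vector and the input vector."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm = math.sqrt(sum(wi * wi for wi in w)) * math.sqrt(sum(xi * xi for xi in x))
    if norm == 0.0:
        return 0.0  # assumption: define the result as 0 for a zero vector
    return dot / norm


def neuron_output(w: Sequence[float], x: Sequence[float],
                  f: Callable[[float], float] = math.tanh) -> float:
    """Apply the nonlinear activation f to the normalized pre-activation."""
    return f(cosine_norm(w, x))


net_norm = cosine_norm([3.0, 4.0], [4.0, 3.0])
```

Unlike the plain dot product, the result is bounded in [−1, 1] and does not change when either vector is rescaled by a positive factor.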


Dropout

n A regularization technique for reducing overfitting in neural networks
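A sketch of the common "inverted dropout" formulation in Python: during training each activation is zeroed with probability `rate` and the survivors are scaled by 1 / (1 − rate), while at inference the input passes through unchanged. The scaling convention, parameter names, and seed handling are assumptions of this example rather than details from the slide.

```python
import random
from typing import List, Optional, Sequence


def dropout(activations: Sequence[float], rate: float,
            training: bool = True, seed: Optional[int] = None) -> List[float]:
    """Randomly zero activations during training (inverted dropout)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)
    rng = random.Random(seed)
    keep = 1.0 - rate
    # Scaling by 1/keep keeps the expected activation unchanged
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


dropped = dropout([1.0] * 1000, rate=0.5, seed=1)
```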




Software Stacks




Experiment Setups

n Three-node cluster


Data Configuration

n To fully utilize the memory resources

n Big data micro benchmarks
• 100 GB text data
• 2^26-vertex graph data
• 65536 two-dimensional matrix data

n AI micro benchmarks
• Input dimension 224*224, channels 64
• 100K images from ImageNet


System Behaviors

n CPU Utilization & I/O Wait

n Hadoop has higher CPU utilization and less I/O wait than Spark
n AI micro benchmarks have lower I/O wait than big data micro benchmarks
n Some of the AI micro benchmarks are CPU intensive
n Pthread benchmarks have lower CPU utilization and I/O wait in general


I/O Behaviors

n Disk I/O Bandwidth & Network I/O Bandwidth
n The Spark stack has much larger network I/O pressure than the Hadoop stack
• More data shuffles, so it needs to transfer data from one node to another frequently


Execution Performance

n The overall running efficiency of the workloads

n Instruction level parallelism (ILP)
• Retired instructions per cycle (IPC)

n Memory level parallelism (MLP)
• Dividing L1D_PEND_MISS.PENDING by L1D_PEND_MISS.PENDING_CYCLES
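Both metrics are simple ratios of hardware counter values, as the Python sketch below shows. The MLP inputs are the two events named above; using retired instructions and unhalted core cycles as the IPC inputs is an assumption, and the sample counter values are made up.

```python
def ipc(instructions_retired: int, cycles: int) -> float:
    """Instruction level parallelism as retired instructions per cycle."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions_retired / cycles


def mlp(l1d_pend_miss_pending: int, l1d_pend_miss_pending_cycles: int) -> float:
    """Memory level parallelism: average outstanding L1D misses per cycle
    in which at least one miss is pending."""
    if l1d_pend_miss_pending_cycles <= 0:
        raise ValueError("pending cycles must be positive")
    return l1d_pend_miss_pending / l1d_pend_miss_pending_cycles


# Hypothetical counter readings
example_ipc = ipc(2_000_000, 1_000_000)
example_mlp = mlp(3_000, 1_000)
```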


Execution Performance

n ILP & MLP

n Cover a wide range of ILP and MLP behaviors

• Distinct computation and memory access patterns

n Software stack changes computation and memory access patterns

• Hadoop FFT vs. Spark FFT


Top-Down Method

n Issue point as the dividing point

From “A Top-Down Method for Performance Analysis and Counters Architecture”

Whether the micro operation is retired?

Not ready with more uops

Only retiring is “useful work”


Pipeline Efficiency

n Top-Down Methodology
n Retiring, Frontend bound, Backend bound, Bad speculation
• Hadoop: notable stalls due to frontend bound and bad speculation
• Spark: higher backend bound
• AI reflects different bottlenecks
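The four categories can be derived from a handful of counters. The Python sketch below assumes the level-1 formulas of the Top-Down paper and a 4-wide issue pipeline (4 slots per cycle); the parameter names mirror Intel event names, and the sample values are invented for illustration.

```python
from typing import Dict


def top_down_level1(clk_unhalted: int,
                    idq_uops_not_delivered_core: int,
                    uops_issued_any: int,
                    uops_retired_retire_slots: int,
                    int_misc_recovery_cycles: int,
                    width: int = 4) -> Dict[str, float]:
    """Split pipeline slots into the four level-1 Top-Down categories."""
    slots = width * clk_unhalted  # total issue slots (assumes `width` per cycle)
    frontend = idq_uops_not_delivered_core / slots
    bad_spec = (uops_issued_any - uops_retired_retire_slots
                + width * int_misc_recovery_cycles) / slots
    retiring = uops_retired_retire_slots / slots
    backend = 1.0 - (frontend + bad_spec + retiring)  # remainder of the slots
    return {"frontend_bound": frontend, "bad_speculation": bad_spec,
            "retiring": retiring, "backend_bound": backend}


# Hypothetical counter readings
breakdown = top_down_level1(clk_unhalted=1000,
                            idq_uops_not_delivered_core=800,
                            uops_issued_any=2200,
                            uops_retired_retire_slots=2000,
                            int_misc_recovery_cycles=50)
```

With these inputs, half of the slots retire useful work and the rest are split among the three stall categories.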


Frontend Bound

n Frontend latency bound > Frontend bandwidth bound
n Latency bound: notable stalls due to frontend bound and bad speculation
n Bandwidth bound: delivering insufficient uops compared to the theoretical value


Data Motif – Frontend Bound

n Frontend Bound Breakdown
n Top 3: branch resteers, instruction cache miss, MS switch
• The first reason is the delays to obtain the correct instructions
• MS switch: big data and AI systems use many CISC instructions that cannot be decoded by the default decoder


Data Motif – Backend Bound

n Memory bound (data movement delays) > Core bound
n Memory bound: L1, L2, L3, external memory bound
n Core bound: the lack of hardware resources or port under-utilization


Overview

n Looking back at history

n What is Data Motif

n Characterization of Data Motif

n Impact of Data Input

n Conclusion


Impact of Data Input

Size, Pattern, Type & Source


Similarity Analysis

n Three data configurations
• Small, Medium, Large

n Sixty metrics spanning system and micro-architecture

n Measuring similarity
• PCA
• Hierarchical clustering



Size Impact on I/O Behaviors

n I/O Bandwidth
n Using the I/O bandwidth of the Small data size as the baseline, we normalize the I/O bandwidth of the Medium and Large data sizes
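The normalization amounts to dividing each measurement by the Small-size measurement. A minimal Python sketch; the bandwidth values are hypothetical placeholders, not results from the study.

```python
from typing import Dict


def normalize_to_baseline(bandwidth_by_size: Dict[str, float],
                          baseline: str = "Small") -> Dict[str, float]:
    """Express every bandwidth as a multiple of the baseline configuration."""
    base = bandwidth_by_size[baseline]
    if base == 0:
        raise ValueError("baseline bandwidth must be non-zero")
    return {size: bw / base for size, bw in bandwidth_by_size.items()}


# Hypothetical measurements in MB/s
measured = {"Small": 50.0, "Medium": 75.0, "Large": 125.0}
normalized_bw = normalize_to_baseline(measured)
```

A value above 1 means the larger data size drives more I/O bandwidth than the Small configuration.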


Size Impact on Pipeline Behavior

n Data size increases → frontend bound decreases, backend bound increases





Impact of Data Pattern

n Dense matrix vs. sparse matrix
n I/O Bandwidth: Sparse < Dense
n Frontend Stalls: Sparse > Dense





Impact of Data Type and Source

n Unstructured text data & semi-structured sequence data
n System: 1.12-7.29 differences
n Architecture: text format incurs more backend bound




Conclusion

n Website:
• http://www.benchcouncil.org/benchmarks.html
• http://www.benchcouncil/BigDataBench
• http://prof.ict.ac.cn/BigDataBench

n Micro benchmark
• Single data motif implementation

