Handling Memory Accesses in Big Data Environment
Prof. Uri Weiser, Technion, Haifa, Israel
Chipex 2016
Posted Jan 09, 2017 on chiportal

Transcript
Page 1: Prof. Uri Weiser, Technion

Professor Uri Weiser, Technion, Haifa, Israel

Handling Memory Accesses in Big Data Environment

Chipex 2016

The talk covers research done by: T. Horowitz, Prof. A. Kolodny, T. Morad, Prof. A. Mendelson, Daniel Raskin, Gil Shomron, Loren Jamal, Prof. U. Weiser

Page 2:

New Architecture Avenues in the Big Data Environment

The era of heterogeneous computing: HW/SW tuned to fit the application, dynamic tuning, and accelerators for performance and energy efficiency.

Big Data = big: in general, non-repeated access to all the data.

"Big Data": what are the implications?

Page 3:

Heterogeneous computing: Application-Specific Accelerators

Continue the performance trend with architectures tuned to bypass current technological hurdles.

[Figure: performance/power vs. application range; accelerators and tuned architectures deliver higher performance/power over a narrower range of application behavior]

Page 4:

New Architecture Avenues in the Big Data Environment

Heterogeneous computing: "tuning" HW to respond to specific needs

Example: the Big Data memory access pattern and its potential savings

Reduction of data movement and bypassing DRAM

The bandwidth issue and a potential solution

Page 5:

Big Data usage of DATA

[Figure: unstructured input data is read once (non-temporal memory access) and passes through a Funnel; β = BW_out / BW_in]

Page 6:

Machine Learning

[Figure: (A) Structuring: unstructured input data is aggregated into structured data; data structuring = ETL. (B) ML model creation. (C) Model usage @ client]

Page 7:

Does Big Data exhibit a special memory access pattern?

It probably should, since revisiting ALL Big Data items would cause huge, slow data transfers from the data sources. There are two access modes of memory operations:

Temporal Memory Access
Non-Temporal Memory Access

Many Big Data computations exhibit Non-Temporal Memory Accesses and/or a Funnel operation.

Page 8:

Non-Temporal Memory Access. Initial analysis: Hadoop-grep single memory access pattern

~50% of Hadoop-grep's unique memory references are single-access.

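The "single access" fraction can be estimated directly from a memory-address trace. A minimal sketch of one way to do it (the trace format and the cache-line granularity are my assumptions, not the measurement methodology behind the Hadoop-grep number):

```python
from collections import Counter

def single_access_fraction(trace, line_bytes=64):
    """Fraction of unique cache-line addresses that are touched exactly once."""
    counts = Counter(addr // line_bytes for addr in trace)
    singles = sum(1 for c in counts.values() if c == 1)
    return singles / len(counts)

# Toy trace: lines 0 and 1 are reused; lines 2..5 are each touched once.
trace = [0, 64, 0, 64, 128, 192, 256, 320]
print(single_access_fraction(trace))  # 4 of 6 unique lines -> ~0.67
```

A high single-access fraction is exactly the non-temporal pattern the talk targets: caching those lines buys nothing.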

Page 9:

Non-Temporal Memory Accesses: Preliminary Results

WordCount (access to storage): non-temporal locality.
Sort (access to storage): NO non-temporal locality.

[Figure: WordCount I/O utilization and Sort I/O, access rate [KB/s] vs. time [s]]

Page 10:

Where is energy wasted?

• DRAM

• Limited BW

Page 11:

From: Mark Horowitz, Stanford, "Computing's Energy Problem"

From: Bill Dally (NVIDIA and Stanford), "Efficiency and Parallelism, the Challenges of Future Computing"

Page 12:

Energy: DRAM

Page 13:

Memory Subsystem: copies

Data moves from the source to the destination through five copies (approximate sizes and bandwidths):

Source: NV Storage (TBs, ~3 GB/s)
Copy 1: DRAM, main memory (GBs, ~25 GB/s)
Copy 2: LL Cache (10's of MBs, ~500 GB/s)
Copy 3: L2 Cache (MBs)
Copy 4: L1 Cache (10's of KBs)
Copy 5: Registers, destination (KBs, ~TB/s)

Page 14:

Memory Subsystem: DRAM bypass == DDIO

[Figure: the NV Storage stream (3–20 GB/s) is delivered directly to the LL Cache, bypassing Copy 1 (main memory / DRAM); it then flows through the L2 Cache, the L1 Cache, and the Registers (destination) to the core]

Potential savings: at 0.5 nJ/B (DRAM) and 10–20 GB/s of NV bandwidth, bypassing DRAM saves 5 W – 10 W.

Reference: "Optimizing Read-Once Data Flow in Big-Data Applications", Morad, Shomron, Erez, Weiser, Kolodny, IEEE Computer Architecture Letters, 2016
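The 5 W – 10 W figure is simply energy-per-byte times the bypassed bandwidth; a quick check of that arithmetic:

```python
def dram_power_saved(joules_per_byte, bytes_per_sec):
    """Power no longer spent copying the stream through DRAM: J/B * B/s = W."""
    return joules_per_byte * bytes_per_sec

E_DRAM = 0.5e-9  # 0.5 nJ per byte, the slide's DRAM energy figure
for bw_gbs in (10, 20):  # 10-20 GB/s of NV bandwidth
    print(dram_power_saved(E_DRAM, bw_gbs * 1e9))  # -> ~5 W and ~10 W
```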

Page 15:

Bandwidth: when should we use a Funnel at the data source?

Page 16:

A: The bandwidth issue. The memory hierarchy is optimized for temporal locality; systems are built for temporal locality.

[Figure: the memory hierarchy (NV Storage, TBs, 3–20 GB/s; DRAM, GBs, 25 GB/s; LL Cache, 10's of MBs, 500 GB/s; L2 Cache, MBs; L1 Cache, 10's of KBs; Registers, KBs, TB/s) places the highest bandwidth next to the core, while NTMA desires high bandwidth at the storage end]

Page 17:

Initial results.

Read Once (Non-Temporal Memory Accesses): SSD bandwidth = CPU bandwidth.
Temporal Memory Accesses: CPU bandwidth exceeds SSD bandwidth.
Hint: memory accesses per operation.

B: Memory accesses per operation impact BW.

[Figure: bandwidth [MB/s] and CPU utilization [%] vs. number of cores, for the non-temporal and temporal cases]

Page 18:

Solution: flow of "Non-Temporal Data Accesses"

[Figure: temporal-locality accesses use the normal hierarchy (NV Storage, DRAM, LLC Cache, L2$, L1$, Registers, Core), while Non-Temporal Memory Accesses (NTMA) go through the Funnel at the data source]

Use the Funnel when a bandwidth bottleneck occurs:
- "high" memory accesses per instruction
- limited BW
- non-temporal-locality memory accesses

*private communication with: Moinuddin Qureshi

Page 19:

"Funnel"ing "Read-Once" data in storage

Typical SSD architecture*

*Kang, Yangwook, Yang-suk Kee, Ethan L. Miller, and Chanik Park, "Enabling Cost-Effective Data Processing with Smart SSD", in Mass Storage Systems and Technologies (MSST), 2013 IEEE 29th Symposium on, pp. 1-12, IEEE, 2013.
**K. Eshghi and R. Micheloni, "SSD Architecture and PCI Express Interface"

Page 20:

Analytical model of the Funnel

[Figure: data enters the Funnel at bandwidth BW_IN, is post-processed, and leaves at BW_OUT; β = BW_OUT / BW_IN]
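As a concrete reading of the slide's definition: a grep-like Funnel that scans a 10 GB/s input stream but emits only 1 GB/s of matches has β = 0.1 (the workload and the numbers here are illustrative only):

```python
def beta(bw_out, bw_in):
    """Funnel reduction factor, as defined on the slide: beta = BW_OUT / BW_IN."""
    return bw_out / bw_in

# Hypothetical grep-like filter: 10 GB/s scanned, 1 GB/s of matches kept.
print(beta(bw_out=1.0, bw_in=10.0))  # -> 0.1
```

The smaller β is, the less data has to cross the interconnect and the memory hierarchy.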

Page 21:

Proposed Architecture

Baseline configuration: the SSD storage streams data over PCIe to the CPU, and the CPU performs both the NTMA and the TMA work.

Funnel configuration: the Funnel inside the SSD performs the NTMA work; only the reduced stream crosses PCIe, and the CPU performs the TMA work. (B = bandwidth)

Page 22:

Funnel Performance

[Figure: performance and performance improvement vs. β for the baseline configuration (CPU performs the NTMA and TMA work) and the Funnel configuration (SSD performs the NTMA work, CPU the TMA work); the improvement starts near the ratio of 1/PCIe_BW to 1/SSD_BW and drops as the CPU becomes the bottleneck]
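The shape of this curve can be reproduced with a simple min-of-bottlenecks sketch; the model form and the sample bandwidths below are my assumptions, not the talk's exact analysis. In the baseline the raw stream crosses the SSD, PCIe, and CPU; with the in-SSD Funnel only the reduced stream (β times the input) crosses PCIe and reaches the CPU:

```python
def baseline_throughput(ssd_bw, pcie_bw, cpu_bw):
    """Sustainable input rate when the raw stream crosses every stage."""
    return min(ssd_bw, pcie_bw, cpu_bw)

def funnel_throughput(ssd_bw, pcie_bw, cpu_bw, beta):
    """Sustainable input rate when only beta * input leaves the SSD-side Funnel."""
    return min(ssd_bw, pcie_bw / beta, cpu_bw / beta)

ssd, pcie, cpu = 8.0, 3.0, 4.0  # GB/s, hypothetical numbers (PCIe-limited baseline)
for b in (0.1, 0.5, 1.0):
    gain = funnel_throughput(ssd, pcie, cpu, b) / baseline_throughput(ssd, pcie, cpu)
    print(b, gain)  # improvement is capped near SSD_BW/PCIe_BW and shrinks toward 1x
```

At small β the SSD itself becomes the limit (the cap the slide expresses as the ratio of 1/PCIe_BW to 1/SSD_BW); as β grows, PCIe and then the CPU take over as the bottleneck and the advantage disappears.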

Page 23:

Funnel energy

[Figure: energy vs. β for the Funnel configuration against the baseline configuration; the Funnel improves energy while the SSD BW is the bottleneck, then PCIe becomes the bottleneck, and finally the CPU becomes the bottleneck; energy grows as the performance advantage shrinks, and the Funnel processor adds overhead]

Page 24:

Solution: ?

Non-Temporal Memory Accesses should be processed as close as possible to the data source.
Data that exhibits temporal locality should use the current memory hierarchy.
Use Machine Learning (context-aware*) to distinguish between the two phases.

Open questions:
- SW model
- Shared data
- HW implementation
- Computational requirements at the "Funnel"

*Reference: "Semantic Locality and Context Based Prefetching", Peled, Mannor, Weiser, Etsion, in ISCA 2015

Page 25:

Summary

Memory access is a critical path in computing.

The Funnel should be used to:
- Resolve systems' BW bottlenecks for specific applications
- Solve the system's BW issues for "Read Once" cases
- Reduce data movement
- Free up the system's memory resources (re-Spark)
- Provide simple, energy-efficient engines at the front end

Issues
