Everest: scaling down peak loads through I/O off-loading
D. Narayanan, A. Donnelly, E. Thereska, S. Elnikety, A. Rowstron
Microsoft Research Cambridge, UK
Transcript
Page 1

Everest: scaling down peak loads through I/O off-loading

D. Narayanan, A. Donnelly, E. Thereska, S. Elnikety, A. Rowstron

Microsoft Research Cambridge, UK

Page 2

Problem: I/O peaks on servers

• Short, unexpected peaks in I/O load
  – This is not about predictable trends

• Uncorrelated across servers in data center
  – And across volumes on a single server

• Bad I/O response times during peaks

Page 3

Example: Exchange server

• Production mail server
  – 5000 users, 7.2 TB across 8 volumes

• Well provisioned
  – Hardware RAID, NVRAM, over 100 spindles

• 24-hour block-level I/O trace
  – At peak load, response time is 20x the mean
  – Peaks are uncorrelated across volumes

Page 4

Exchange server load

[Figure: load per volume (reqs/s, log scale 100 to 100,000) vs. time of day over the 24-hour trace]

Page 5

Write off-loading

[Diagram: the Everest client sits between the application's reads and writes and the volume, with three Everest stores alongside. No off-loading: reads and writes go to the volume. Off-loading: writes are redirected to the Everest stores. Reclaiming: off-loaded data is read back from the stores and written to the volume.]
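The routing idea on this slide can be summarized in a short sketch. This is illustrative only, not the authors' implementation: the Volume and Store interfaces, the load test, and the version counter are assumptions.

class EverestClientSketch:
    """Illustrative: routes I/O between the base volume and Everest stores."""

    def __init__(self, volume, stores):
        self.volume = volume    # base volume (assumed interface)
        self.stores = stores    # available Everest stores (assumed interface)
        self.latest = {}        # block -> (store, version) for off-loaded blocks
        self.version = 0

    def write(self, block, data):
        if self.volume.is_overloaded() and self.stores:
            # Off-load: send the write to the least-loaded store, tagged with a new version.
            store = min(self.stores, key=lambda s: s.current_load())
            self.version += 1
            store.append_record(block, self.version, data)
            self.latest[block] = (store, self.version)
        else:
            # Normal path: write through to the base volume.
            self.volume.write(block, data)

    def read(self, block):
        # Reads must always return the latest version, wherever it lives.
        if block in self.latest:
            store, version = self.latest[block]
            return store.read_record(block, version)
        return self.volume.read(block)

During a peak, writes go to a lightly loaded store and the volume's queue sees only the remaining reads; once the peak passes, the off-loaded blocks are reclaimed (Page 12).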

Page 6

Exploits workload properties

• Peaks uncorrelated across volumes
  – A loaded volume can find less-loaded stores

• Peaks have some writes
  – Off-load writes → reads see less contention

• Few foreground reads on off-loaded data
  – Recently written, hence in buffer cache
  – Can optimize stores for writes

Page 7

Challenges

• Any write anywhere
  – Maximize potential for load balancing

• Reads must always return latest version
  – Split across stores/base volume if required

• State must be consistent, recoverable
  – Track both current and stale versions

• No meta-data writes to base volume

Page 8

Design features

• Recoverable soft state
• Write-optimized stores
• Reclaiming off-loaded data
• N-way off-loading
• Load-balancing policies

Page 9

Recoverable soft state

• Need meta-data to track off-loads
  – block ID → <location, version>
  – Latest version as well as old (stale) versions

• Meta-data cached in memory
  – On both clients and stores

• Off-loaded writes have meta-data header
  – 64-bit version, client ID, block range
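A minimal sketch of the per-record meta-data header described above. Only the fields come from the slide (64-bit version, client ID, block range); the byte layout, the widths of the other fields, and all names are assumptions.

import struct
from dataclasses import dataclass

@dataclass
class RecordHeader:
    """Header stored with each off-loaded record (layout is illustrative)."""
    version: int      # 64-bit version number
    client_id: int    # identifies the off-loading client
    start_block: int  # first block of the off-loaded range
    block_count: int  # number of blocks in the range

    FMT = "<QQQI"     # assumed packing; the real on-disk format is not given

    def pack(self) -> bytes:
        return struct.pack(self.FMT, self.version, self.client_id,
                           self.start_block, self.block_count)

    @classmethod
    def unpack(cls, raw: bytes) -> "RecordHeader":
        return cls(*struct.unpack(cls.FMT, raw[:struct.calcsize(cls.FMT)]))

The client's in-memory map then only needs block range → (store, version) entries; everything else can be rebuilt from these headers.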

Page 10

Recoverable soft state (2)

• Meta-data also persisted on stores
  – No synchronous writes to base volume
  – Stores write data + meta-data as one record

• "Store set" persisted on base volume
  – Small, infrequently changing

• Client recovery → contact store set
• Store recovery → read from disk
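A minimal sketch of the client recovery path listed above, assuming (illustratively) that the base volume holds the store set and that each store can scan its log and list the record headers it holds:

def recover_client_state(volume, stores_by_id):
    """Rebuild the client's in-memory map after a crash (illustrative)."""
    latest = {}
    # The store set is the only state persisted on the base volume; it is
    # small and changes infrequently, so reading it is cheap.
    for store_id in volume.read_store_set():
        store = stores_by_id[store_id]
        # A recovering store rebuilds its own meta-data by scanning its log,
        # then reports the record headers it holds for this client.
        for header in store.list_records(client_id=volume.client_id):
            block, version = header.start_block, header.version
            if block not in latest or version > latest[block][1]:
                latest[block] = (store, version)
    return latest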

Page 11

Everest stores

• Short-term, write-optimized storage
  – Simple circular log
  – Small file or partition on existing volume
  – Not LFS: data is reclaimed, no cleaner

• Monitors load on underlying volume
  – Only used by clients when lightly loaded

• One store can support many clients
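One way a store might expose the "lightly loaded" signal is sketched below; the sliding window and the idle threshold are illustrative assumptions, not parameters from the talk.

import time
from collections import deque

class StoreLoadMonitor:
    """Tracks the recent request rate on the store's underlying volume (illustrative)."""

    def __init__(self, window_s=10.0, idle_threshold_reqs_s=50.0):
        self.window_s = window_s
        self.idle_threshold = idle_threshold_reqs_s  # assumed cut-off
        self.events = deque()

    def record_request(self):
        now = time.monotonic()
        self.events.append(now)
        while self.events and now - self.events[0] > self.window_s:
            self.events.popleft()

    def current_load(self) -> float:
        """Recent request rate in reqs/s over the sliding window."""
        return len(self.events) / self.window_s

    def is_lightly_loaded(self) -> bool:
        # Clients only off-load to this store while the underlying volume is lightly loaded.
        return self.current_load() < self.idle_threshold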

Page 12

Reclaiming in the background

[Diagram: the Everest client issues "read any" to the Everest stores, which return <block range, version, data>; the client writes the data back to the volume and then sends delete(block range, version) to the stores]

• Multiple concurrent reclaim "threads"
  – Efficient utilization of disk/network resources
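A minimal sketch of one reclaim "thread" following the flow in the diagram: read any off-loaded record from a store, write the data back to the base volume, then delete that version from the store. The interfaces are assumptions; several such workers run concurrently to keep the disk and network busy.

def reclaim_worker(client, store):
    """One background reclaim loop (illustrative; names are assumed)."""
    while True:
        record = store.read_any()            # any off-loaded record, or None
        if record is None:
            break                            # nothing left to reclaim from this store
        block_range, version, data = record

        # Stale record: a newer version exists somewhere, so just drop this one.
        if client.latest_version(block_range) != version:
            store.delete(block_range, version)
            continue

        # Latest version: write it back to the base volume, then drop it from the store.
        client.volume.write_range(block_range, data)
        client.mark_reclaimed(block_range, version)
        store.delete(block_range, version)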

Page 13

Correctness invariants

• I/O on off-loaded range always off-loaded
  – Reads: sent to correct location
  – Writes: ensure latest version is recoverable
  – Foreground I/Os never blocked by reclaim

• Deletion of a version only allowed if
  – Newer version written to some store, or
  – Data reclaimed and older versions deleted

• All off-loaded data eventually reclaimed
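The deletion rule above can be written as a guard the client checks before letting a store drop a version; this is a sketch of the stated invariant, not the authors' code.

def may_delete(client, block_range, version):
    """Return True only if deleting this version cannot lose the latest data."""
    latest = client.latest_version(block_range)

    # Case 1: a newer version of this range has already been written to some store.
    if latest is not None and latest > version:
        return True

    # Case 2: this is the latest version, it has been reclaimed to the base
    # volume, and every older version has already been deleted.
    if (client.is_reclaimed(block_range, version)
            and not client.has_older_versions(block_range, version)):
        return True

    return False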

Page 14

Evaluation

• Exchange server traces
• OLTP benchmark
• Scaling
• Micro-benchmarks
• Effect of NVRAM
• Sensitivity to parameters
• N-way off-loading

Page 15

Exchange server workload

• Replay Exchange server trace
  – 5000 users, 8 volumes, 7.2 TB, 24 hours

• Choose time segments with peaks
  – Extend segments to cover all reclaim

• Our server: 14 disks, 2 TB
  – Can fit 3 Exchange volumes

• Subset of volumes for each segment

Page 16

Trace segment selection

[Figure: total I/O rate (reqs/s, log scale 100 to 1,000,000) vs. time of day over the 24-hour trace]

Page 17

Trace segment selection

[Figure: the same total I/O rate plot with the three selected segments marked Peak 1, Peak 2, and Peak 3]

Page 18

Three volumes/segment

[Diagram: for each segment, the traced volumes are ranked by load and the minimum, median, and maximum volumes are replayed as Everest clients; each client also hosts a store sized at 3% of a volume]

Page 19

Mean response time

[Figure: mean response time (ms, 0 to 200) for Peak 1-3 reads and writes, comparing no off-load and off-load]

Page 20

99th percentile response time

[Figure: 99th-percentile response time (ms, 0 to 2000) for Peak 1-3 reads and writes, comparing no off-load and off-load]

Page 21

Exchange server summary

• Substantial improvement in I/O latency
  – On a real enterprise server workload
  – Both reads and writes, mean and 99th percentile

• What about application performance?
  – I/O trace cannot show end-to-end effects

• Where is the benefit coming from?
  – Extra resources, log structure, ...?

Page 22

OLTP benchmark

[Diagram: an OLTP client drives a SQL Server binary over the LAN; the Everest client is interposed between SQL Server and its Log and Data volumes via Detours DLL redirection, off-loading to an Everest store]

• 10 min warmup
• 10 min measurement

Page 23

OLTP throughput

[Figure: OLTP throughput (tpm, 0 to 3000) for five configurations: no off-load, off-load, log-structured, 2-disk striped, and striped + log-structured. Annotations: the off-load gain = extra disk + log layout; 2x disks, 3x speedup?]

Page 24

Off-loading not a panacea

• Works for short-term peaks
• Cannot be used to improve performance 24/7
• Data usually reclaimed while store is still idle
  – Long-term off-load → eventual contention
• Data is reclaimed before the store fills up
  – Long-term → log cleaner issue

Page 25

Conclusion

• Peak I/O is a problem
• Everest solves this through off-loading
• By modifying the workload at block level
  – Removes writes from the overloaded volume
  – Off-loading is short term: data is reclaimed

• Consistency, persistence are maintained
  – State is always correctly recoverable

Page 26

Questions?

Page 27

Why not always off-load?

[Diagram: two SQL Servers, each serving its own OLTP client from its own Data volume and Store; SQL Server 1 runs an Everest client whose off-loaded writes land on disks that SQL Server 2 is also reading and writing]

Page 28

10 min off-load, 10 min contention

[Figure: speedup (0 to 4x) during the off-load phase and during the contention phase, for server 1 and server 2]

Page 29

Mean and 99th pc (log scale)

[Figure: response time (ms, log scale 1 to 10,000) for Peak 1-3 reads and writes, no off-load vs. off-load]

Page 30

Read/write ratio of peaks

[Figure: CDF of the write percentage of peaks (cumulative fraction vs. % of writes, 0 to 100)]

Page 31

Exchange server response time

[Figure: response time (log scale) vs. time of day over the 24-hour trace]

Page 32

Exchange server load (volumes)

[Figure: per-volume load (reqs/s, log scale 100 to 100,000) vs. time of day, showing the max, mean, and min volume]

Page 33

Effect of volume selection

[Figure: Peak 1 load (reqs/s/volume, 0 to 40,000) vs. time of day, comparing all volumes with the selected subset]

Page 34

Effect of volume selection

[Figure: Peak 2 load (reqs/s/volume, 0 to 70,000) vs. time of day, comparing all volumes with the selected subset]

Page 35

Effect of volume selection

[Figure: Peak 3 load (reqs/s/volume, 0 to 18,000) vs. time of day, comparing all volumes with the selected subset]

Page 36

Scaling with #stores

[Diagram: the OLTP setup from Page 22, with the Everest client off-loading from the SQL Server Log and Data volumes to multiple Everest stores over the LAN]

Page 37

Scaling: linear until CPU-bound

[Figure: speedup (0 to 6x) vs. number of stores (0 to 3)]

Page 38

Everest store: circular log layout

[Diagram: circular on-disk log with a header block, head and tail pointers, an active log region, and stale records; arrows mark reclaim and delete]
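A minimal sketch of the circular-log behaviour in the diagram: records are appended at the head, deletes mark records stale, and the tail advances over stale records to free space. All names and the in-memory representation are illustrative assumptions; a real store writes records to disk and wraps around the log.

class CircularLogStoreSketch:
    """Illustrative circular log: append at head, mark stale on delete, advance tail."""

    def __init__(self, capacity_records):
        self.capacity = capacity_records
        self.log = []      # stands in for the on-disk log
        self.tail = 0      # index of the oldest record still occupying space

    def append_record(self, block_range, version, data):
        if len(self.log) - self.tail >= self.capacity:
            raise RuntimeError("log full: reclaim has not freed space yet")
        # Each record carries its own meta-data, so the log is self-describing.
        self.log.append({"range": block_range, "version": version,
                         "data": data, "stale": False})

    def delete(self, block_range, version):
        # Deletes only mark records stale; space is freed as the tail advances.
        for rec in self.log[self.tail:]:
            if rec["range"] == block_range and rec["version"] == version:
                rec["stale"] = True
        self._advance_tail()

    def _advance_tail(self):
        # Advance the tail over any prefix of stale records, freeing their space.
        while self.tail < len(self.log) and self.log[self.tail]["stale"]:
            self.tail += 1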

Page 39

Exchange server load: CDF

[Figure: CDF of request rate per volume (reqs/s, log scale)]

Page 40

Unbalanced across volumes

[Figure: CDFs of request rate per volume (reqs/s, log scale) for the min, mean, and max volumes]