Top Banner
TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers Kinson Chan , Cho-Li Wang ICPADS 2011, Tainan
85

TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Jul 17, 2015

Download

Software

Kinson Chan
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

TrC-MC: Decentralized SoftwareTransactional Memory forMulti-Multicore Computers

Kinson Chan, Cho-Li WangICPADS 2011, Tainan

Page 2: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Multicore Processors

• Multicore processors‣ a.k.a. chip multiprocessing‣ multiple cores on a processor‣ cores share a common cache‣ faster information

sharing among threads‣ more threads per cabinet

• Chip Multithreading‣ e.g. hyperthreading,

coolthreads, ...‣ ≥ 2 threads per core‣ hide the data load latency

2

L2 L2 L2 L2

ALU ALU ALU ALU

QPIDDR3

L1 L1 L1 L1

L3

QPI

DD

R3-1066 x3

25.6 GB/sec

Host ProcessorHost Memory

Page 3: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Multi-Multicore Computers

• Multiple multicore processors in a computer• Shared cache is not really global

• Accessing others’ cache content is slow...‣ Write back, remote memory access (through QPI), ...‣ About the speed of previous-generation SMP

3

L2 L2 L2 L2

ALU ALU ALU ALU

QPIDDR3

L1 L1 L1 L1

L3

QPI

L2L2L2L2

ALUALUALUALU

QPI DDR3

L1L1L1L1

L3

QPI

DD

R3-1066 x3

25.6 GB/sec

DD

R3-

1066

x3

25.6

GB/

sec

Host Processor 1 Host Processor 2Host Memory Host Memory

Page 4: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Processor Clockspeed Intra-Die Inter-Die

Core 2 Quad Q6600

Dual Xeon E5540

Dual Xeon E5550

Dual Xeon X5670

Quad Xeon X7750

2.40 GHz 1.6 • 107 trips/s 3.9 • 106 trips/s

2.53 GHz 1.3 • 107 trips/s 5.4 • 106 trips/s

2.66 GHz 1.5 • 107 trips/s 6.3 • 106 trips/s

2.93 GHz 1.4 • 107 trips/s 6.2 • 106 trips/s

2.00 GHz 5.4 • 106 trips/s 9.0 • 105 trips/s

Ping-pong trips per second

4

Page 5: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Processor Clockspeed Intra-Die Inter-Die

Core 2 Quad Q6600

Dual Xeon E5540

Dual Xeon E5550

Dual Xeon X5670

Quad Xeon X7750

2.40 GHz 63 ns/trip 256 ns/trip

2.53 GHz 77 ns/trip 158 ns/trip

2.66 GHz 66 ns/trip 158 ns/trip

2.93 GHz 71 ns/trip 161 ns/trip

2.00 GHz 185 ns/trip 1111 ns/trip

Ping-pong round-trip times

5

Page 6: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Processor Clockspeed Intra-Die Inter-Die

Core 2 Quad Q6600

Dual Xeon E5540

Dual Xeon E5550

Dual Xeon X5670

Quad Xeon X7750

2.40 GHz 63 ns/trip 256 ns/trip

2.53 GHz 77 ns/trip 158 ns/trip

2.66 GHz 66 ns/trip 158 ns/trip

2.93 GHz 71 ns/trip 161 ns/trip

2.00 GHz 185 ns/trip 1111 ns/trip

Ping-pong round-trip times

5

Inter-processor cache transfer is 2 to 6 times longer...

Page 7: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Now and future multicores

6

Micro-architecture Clock rate Cores Threads per

coreThreads per

package Shared cache Memory arrangement

IBM Power 7 ~ 3 GHz 4 ~ 8 4 32 Max 4 MBshared L3 NUMA

Sun Niagara2 1.2 ~ 1.6 GHz 4 ~ 8 8 64 Max 4 MBshared L2 NUMA

Intel Westmere ~ 2 GHz 4 ~ 8 2 16 Max 12 ~ 24 MB

shared L3 NUMA

Intel Harpertown ~ 3 GHz 2 x 2 2 8 2 x 6 MB

shared L3 UMA

AMD Bulldozer ~ 2 GHz 2 x 6 ~ 2 x 8 1 16 Max 8 MB

shared L3 NUMA

AMD Magny-Cours ~ 3 GHz 8 modules 2 per module 16 Max 8 MB

shared L3 NUMA

Intel Terascale ~ 4 GHz 80? 1? 80? 80 x 2 KBdist. cache NUCA

Page 8: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Now and future multicores

6

Micro-architecture Clock rate Cores Threads per

coreThreads per

package Shared cache Memory arrangement

IBM Power 7 ~ 3 GHz 4 ~ 8 4 32 Max 4 MBshared L3 NUMA

Sun Niagara2 1.2 ~ 1.6 GHz 4 ~ 8 8 64 Max 4 MBshared L2 NUMA

Intel Westmere ~ 2 GHz 4 ~ 8 2 16 Max 12 ~ 24 MB

shared L3 NUMA

Intel Harpertown ~ 3 GHz 2 x 2 2 8 2 x 6 MB

shared L3 UMA

AMD Bulldozer ~ 2 GHz 2 x 6 ~ 2 x 8 1 16 Max 8 MB

shared L3 NUMA

AMD Magny-Cours ~ 3 GHz 8 modules 2 per module 16 Max 8 MB

shared L3 NUMA

Intel Terascale ~ 4 GHz 80? 1? 80? 80 x 2 KBdist. cache NUCA

How to exploit parallelism for better performance?

Page 9: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Multi-threading and synchronization

7

Page 10: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Multi-threading and synchronization

7

Coarse grain locking

Easy / Correct(few locks, predictable)

Difficult to scale(excessive mutual

exclusion)

Page 11: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Multi-threading and synchronization

7

Coarse grain locking

Easy / Correct(few locks, predictable)

Difficult to scale(excessive mutual

exclusion)

Fine-grain locking

Error prone(deadlock, forget to

lock, ...)

Scales better(allows more parallelism)

Page 12: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Multi-threading and synchronization

7

Coarse grain locking

Easy / Correct(few locks, predictable)

Difficult to scale(excessive mutual

exclusion)

Fine-grain locking

Error prone(deadlock, forget to

lock, ...)

Scales better(allows more parallelism)

Do we have anything in between?

Easy / Correct

Scales good

Page 13: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

C. J. Rossbach, O. S. Hofmann and Emmett Witchel, Is transactional programming actually easier, In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 45–56, 2010.

STM is easy

• In the University of Texas at Austin, 237 students taking Operating System courses were instructed to program the same problem with coarse locks, fine-grained locks, monitors and transactions...

8

DevelopmentTime

Errors

CodeComplexity

Page 14: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

C. J. Rossbach, O. S. Hofmann and Emmett Witchel, Is transactional programming actually easier, In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 45–56, 2010.

STM is easy

• In the University of Texas at Austin, 237 students taking Operating System courses were instructed to program the same problem with coarse locks, fine-grained locks, monitors and transactions...

8

DevelopmentTime

Errors

CodeComplexity

LongShort

TMCoarse Fine

Page 15: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

C. J. Rossbach, O. S. Hofmann and Emmett Witchel, Is transactional programming actually easier, In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 45–56, 2010.

STM is easy

• In the University of Texas at Austin, 237 students taking Operating System courses were instructed to program the same problem with coarse locks, fine-grained locks, monitors and transactions...

8

DevelopmentTime

Errors

CodeComplexity

LongShort

TMCoarse Fine

Simple Complex

TMCoarse Fine

Page 16: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

C. J. Rossbach, O. S. Hofmann and Emmett Witchel, Is transactional programming actually easier, In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 45–56, 2010.

STM is easy

• In the University of Texas at Austin, 237 students taking Operating System courses were instructed to program the same problem with coarse locks, fine-grained locks, monitors and transactions...

8

DevelopmentTime

Errors

CodeComplexity

LongShort

TMCoarse Fine

Simple Complex

TMCoarse Fine

Less More

TM Coarse Fine

Page 17: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

STM optimistic execution

9

Begin Begin

Proceed Proceed

Commit Commit

Retry

Commit

x=x+4

y=y-4

x=x+2

y=y-2

Begin

Proceed Begin

Proceed

Commit

Commit

x=x+4

y=y-4

w=w+5

z=w

Thread 1 Thread 2 Thread 3 Thread 1 Thread 2 Thread 3

x=x+2

y=y-2

Success

Success

Success

Success

conflict detection

conflict detection

begin; x=x+4; y=y-4; commit;

begin; x=x+2; y=y-2; commit;

begin; x=x+4; y=y-4; commit;

begin; w=w+5; z=w; commit;

begin;x = x + 4;y = y - 4;commit;

Page 18: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

STM optimistic execution

9

Begin Begin

Proceed Proceed

Commit Commit

Retry

Commit

x=x+4

y=y-4

x=x+2

y=y-2

Begin

Proceed Begin

Proceed

Commit

Commit

x=x+4

y=y-4

w=w+5

z=w

Thread 1 Thread 2 Thread 3 Thread 1 Thread 2 Thread 3

x=x+2

y=y-2

Success

Success

Success

Success

conflict detection

conflict detection

begin; x=x+4; y=y-4; commit;

begin; x=x+2; y=y-2; commit;

begin; x=x+4; y=y-4; commit;

begin; w=w+5; z=w; commit;

Page 19: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

STM optimistic execution

9

Begin Begin

Proceed Proceed

Commit Commit

Retry

Commit

x=x+4

y=y-4

x=x+2

y=y-2

Begin

Proceed Begin

Proceed

Commit

Commit

x=x+4

y=y-4

w=w+5

z=w

Thread 1 Thread 2 Thread 3 Thread 1 Thread 2 Thread 3

x=x+2

y=y-2

Success

Success

Success

Success

conflict detection

conflict detection

begin; x=x+4; y=y-4; commit;

begin; x=x+2; y=y-2; commit;

begin; x=x+4; y=y-4; commit;

begin; w=w+5; z=w; commit;

case 1: two transactions conflicts:rollback and retry one of them.

Page 20: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

STM optimistic execution

9

Begin Begin

Proceed Proceed

Commit Commit

Retry

Commit

x=x+4

y=y-4

x=x+2

y=y-2

Begin

Proceed Begin

Proceed

Commit

Commit

x=x+4

y=y-4

w=w+5

z=w

Thread 1 Thread 2 Thread 3 Thread 1 Thread 2 Thread 3

x=x+2

y=y-2

Success

Success

Success

Success

conflict detection

conflict detection

begin; x=x+4; y=y-4; commit;

begin; x=x+2; y=y-2; commit;

begin; x=x+4; y=y-4; commit;

begin; w=w+5; z=w; commit;

case 1: two transactions conflicts:rollback and retry one of them.

case 2: two transactions do not conflict: they execute together,

achieving better parallelism.

Page 21: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Timestamp-based detection

10

Tx Desc 1 Tx Desc nTimestamp

TableShared

Data

Clock

IDLE IDLE

234

232

229

230

Page 22: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Timestamp-based detection

11

Tx Desc 1 Tx Desc nTimestamp

TableShared

Data

Clock

ACTIVE ACTIVE

234

232

229

230

234 234

Page 23: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Running

Timestamp-based detection

12

Tx Desc 1 Tx Desc nTimestamp

TableShared

Data

Clock

ACTIVE ACTIVE

234

232

229

230

234 234 write

read

read

Page 24: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Running

Tx Desc 1 Tx Desc nTimestamp

TableShared

Data

Clock

Timestamp-based detection

13

ACTIVE ACTIVE

234

232

229

230

234 234

write

read

Page 25: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Running

Timestamp-based detection

14

Tx Desc 1 Tx Desc nTimestamp

TableShared

Data

Clock

COMMIT ACTIVE

234

232

229

230

234 234

Page 26: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Running

Timestamp-based detection

15

Tx Desc 1 Tx Desc nTimestamp

TableShared

Data

Clock

COMMIT ACTIVE

234

LOCKED

229

230

234 234

Page 27: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Running

Timestamp-based detection

16

Tx Desc 1 Tx Desc nTimestamp

TableShared

Data

Clock

COMMIT ACTIVE

235

LOCKED

229

230

234 234

Page 28: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Running

Timestamp-based detection

17

Tx Desc 1 Tx Desc nTimestamp

TableShared

Data

Clock

DONE ACTIVE

235

235

229

230

234 234

Page 29: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Running

Timestamp-based detection

18

Tx Desc 1 Tx Desc nTimestamp

TableShared

Data

Clock

DONE ACTIVE

235

235

229

230

234 234 read

Page 30: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Running

Timestamp-based detection

19

Tx Desc 1 Tx Desc nTimestamp

TableShared

Data

Clock

DONE CONFLICT

235

235

229

230

234 234 read

Page 31: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Timestamp-based detection

20

Tx Desc 1 Tx Desc nTimestamp

TableShared

Data

Clock

DONE CONFLICT

235

235

229

230

234 234

CommonAccessVariable

Page 32: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Centralized Data Structure

• All transactions touch the same clock...• Even if they do not overlap.

• Serious contention among the processors.

• Lots of time spent on cache coherence.

• Example: ssca2‣ Very high commit ratio (> 95%, green dots)‣ Very high commit rate (M/sec, red dashes)‣ Not scalable on word-based STM (black)

21

1 2 4 8 16 32 64 1 2 4 8 16 32 64

100%

0%

100%

0%

100%

0%

100%

0%

100%

0%

900

0

0

0

0

0

0

0

0

0

0

1.5M

1.2M

80M

20M

100

2.5M

1.2M

1M

250K

8

8

3

4

2 2

12

12

2

10

Com

mit

Rat

io (G

reen

Dot

ted

Line

s)

Com

mit

Rat

e (P

er S

econ

d, R

ed D

ashe

d Li

nes)

Spee

dup

Rat

io (B

lack

Sol

id L

ines

)

Com

mit

Rat

e (P

er S

econ

d, R

ed D

ashe

d Li

nes)

Spee

dup

Rat

io (B

lack

Sol

id L

ines

)

0

0

0

0

0

0

0

0

0

0

bayes

genome

intruder

kmeans-1

kmeans-2

labyrinth

ssca2

vacation-1

vacation-2

yada

Number of Threads

50%

50%

50%

50%

50%

450

750K

600K

40M

10M

50

1.25M

600K

500K

125K

4

4

1.5

2

1

5

1

6

6

1

Figure 1. Commit Ratios, Commit Rates and Speedup Ratios ofSTAMP Benchmark on TinySTM

Time

Com

mit

Rat

e(P

er S

econ

d,R

ed D

ashe

d Li

ne)

Com

mit

Rat

io(G

reen

Dot

ted

Line

) 100%

0%

200K

0

100K50%

Figure 2. Instantaneous Commit Ratios and Commit Rates of va-cation with 64 Threads

is completed. These engaged data locks obstruct other trans-actions to proceed and they cause a cascading effect in whichfurther more transactions are obstructed and aborted.

• Optimal concurrencies are application-dependent: Differentapplications have different optimal concurrencies (sweet spot).For instance, genome scales the best at 32 threads while yadascales the best at 8. While kmeans-1 and kmeans-2 are essen-tially the same application program, they have different speedupcurves as they have different input data.

• Transactional nature changes in runtime: As a program maytake multiple stages of data processing, the nature of the trans-actions may change along the timeline. Figure 2 shows the in-stantaneous commit rates and ratios while vacation-2 is run-

Processor Clockspeed Intra-Die Inter-DieCore 2 Quad Q6600 2.40 GHz 63 ns 256 ns

2⇥ Xeon E5540 2.53 GHz 77 ns 185 ns2⇥ Xeon E5550 2.66 GHz 66 ns 158 ns2⇥ Xeon X5670 2.93 GHz 71 ns 161 ns4⇥ Xeon X7550 2.00 GHz 185 ns 1111 ns

Table 1. Cache Ping-Pong Round-Trip Times of some Processors

ning. Although the work nature is uniform, the commit rate andratio are constantly changing through time.

• Programs with low commit ratio may still be scalable: Althoughyada has commit ratio as low as 5% when there are 2 threads,doubling number of threads to 4 yields around 30% perfor-mance gain. If an ACC policy watches on the commit ratio only,it may over-react and stall more threads than necessary.

• Lower commit ratio does not imply worse performance: Despiteof a lot of coincidences, there are also a few cases where thesystem performs better with lower commit ratio. When numberof threads is increased from 8 to 32, labyrinth’s commit ratiodrops by around 15%. Meanwhile the speedup ratio increasesby 2.

• Commit rate acts as better performance index: Generally,higher commit rate on the graphs also implies higher speedupratio. This assumption holds on all of the benchmarks we havetested on this paper.

An adaptive concurrency control policy is required to guidea STM to have optimal concurrency and the best performance.To be a reasonable policy, it should make decisions based on theinstantaneous commit rates, not ratios.

2.3 Modern Computer Cache HierarchyModern computers are equipped with multicore (CMP) processors.A multicore processor features a shared cache and several compu-tation cores, on which at least one thread can execute. With chipmultithreading (CMT) technology, a core makes use of the cachelatency and runs more than one thread. Threads within the samecore share the L1 cache while threads on different cores (but sameprocessor) share L2 or L3 cache. To pursue high performance, high-end computers are equipped with multiple multicore processors.This complicates the cache-sharing situation as threads on differ-ent processors do not share any common cache.

We run ping-pong tests on several systems equipped with multi-ple multicore processors, as shown on Table 1.3 Our aim is to eval-uate how fast data sharing can be. Inter-thread data sharing speedmay give us some inspiration for how STM policies and protocolscan be designed.

We notice it takes tens of nanoseconds to transfer a cache linefrom a processor to another. From another view point, if a cacheline (e.g., a commit counter) has to be uniformly updated by threadson different processors, there can be only millions of cache transferper second. When there are more transactions than this capacity,threads slow down as they contend for the cache line ownership.

For instance, Chan, et al [4] were puzzled why ssca2 cannotgain benefit from their adaptive concurrency policies. The systemperformance drops by 26% when their ACC policy is activated.We can repeat the same observation by running similar softwareset on our computers. With millions of transactional commits persecond, the commit counter in their solution is saturated and causes

3 Although Q6600 comes in a single quad-core package, it features two setsof shared L2 cache, each to be shared by two cores only. According to ourdefinition in Section 1, we treat it as two processors.

3 2011/8/20

1 2 4 8 16 32 64 1 2 4 8 16 32 64

100%

0%

100%

0%

100%

0%

100%

0%

100%

0%

900

0

0

0

0

0

0

0

0

0

0

1.5M

1.2M

80M

20M

100

2.5M

1.2M

1M

250K

8

8

3

4

2 2

12

12

2

10

Com

mit

Ratio

(Gre

en D

otte

d Li

nes)

Com

mit

Rate

(Per

Sec

ond,

Red

Das

hed

Line

s)

Spee

dup

Ratio

(Bla

ck S

olid

Lin

es)

Com

mit

Rate

(Per

Sec

ond,

Red

Das

hed

Line

s)

Spee

dup

Ratio

(Bla

ck S

olid

Lin

es)

0

0

0

0

0

0

0

0

0

0

bayes

genome

intruder

kmeans-1

kmeans-2

labyrinth

ssca2

vacation-1

vacation-2

yada

Number of Threads

50%

50%

50%

50%

50%

450

750K

600K

40M

10M

50

1.25M

600K

500K

125K

4

4

1.5

2

1

5

1

6

6

1

Figure 1. Commit Ratios, Commit Rates and Speedup Ratios ofSTAMP Benchmark on TinySTM

Time

Com

mit

Rate

(Per

Sec

ond,

Red

Dash

ed L

ine)

Com

mit

Ratio

(Gre

en D

otte

d Li

ne) 100%

0%

200K

0

100K50%

Figure 2. Instantaneous Commit Ratios and Commit Rates of va-cation with 64 Threads

is completed. These engaged data locks obstruct other trans-actions to proceed and they cause a cascading effect in whichfurther more transactions are obstructed and aborted.

• Optimal concurrencies are application-dependent: Differentapplications have different optimal concurrencies (sweet spot).For instance, genome scales the best at 32 threads while yadascales the best at 8. While kmeans-1 and kmeans-2 are essen-tially the same application program, they have different speedupcurves as they have different input data.

• Transactional nature changes in runtime: As a program maytake multiple stages of data processing, the nature of the trans-actions may change along the timeline. Figure 2 shows the in-stantaneous commit rates and ratios while vacation-2 is run-

Processor Clockspeed Intra-Die Inter-DieCore 2 Quad Q6600 2.40 GHz 63 ns 256 ns

2⇥ Xeon E5540 2.53 GHz 77 ns 185 ns2⇥ Xeon E5550 2.66 GHz 66 ns 158 ns2⇥ Xeon X5670 2.93 GHz 71 ns 161 ns4⇥ Xeon X7550 2.00 GHz 185 ns 1111 ns

Table 1. Cache Ping-Pong Round-Trip Times of some Processors

ning. Although the work nature is uniform, the commit rate andratio are constantly changing through time.

• Programs with low commit ratio may still be scalable: Althoughyada has commit ratio as low as 5% when there are 2 threads,doubling number of threads to 4 yields around 30% perfor-mance gain. If an ACC policy watches on the commit ratio only,it may over-react and stall more threads than necessary.

• Lower commit ratio does not imply worse performance: Despiteof a lot of coincidences, there are also a few cases where thesystem performs better with lower commit ratio. When numberof threads is increased from 8 to 32, labyrinth’s commit ratiodrops by around 15%. Meanwhile the speedup ratio increasesby 2.

• Commit rate acts as better performance index: Generally,higher commit rate on the graphs also implies higher speedupratio. This assumption holds on all of the benchmarks we havetested on this paper.

An adaptive concurrency control policy is required to guidea STM to have optimal concurrency and the best performance.To be a reasonable policy, it should make decisions based on theinstantaneous commit rates, not ratios.

2.3 Modern Computer Cache HierarchyModern computers are equipped with multicore (CMP) processors.A multicore processor features a shared cache and several compu-tation cores, on which at least one thread can execute. With chipmultithreading (CMT) technology, a core makes use of the cachelatency and runs more than one thread. Threads within the samecore share the L1 cache while threads on different cores (but sameprocessor) share L2 or L3 cache. To pursue high performance, high-end computers are equipped with multiple multicore processors.This complicates the cache-sharing situation as threads on differ-ent processors do not share any common cache.

We run ping-pong tests on several systems equipped with multi-ple multicore processors, as shown on Table 1.3 Our aim is to eval-uate how fast data sharing can be. Inter-thread data sharing speedmay give us some inspiration for how STM policies and protocolscan be designed.

We notice it takes tens of nanoseconds to transfer a cache linefrom a processor to another. From another view point, if a cacheline (e.g., a commit counter) has to be uniformly updated by threadson different processors, there can be only millions of cache transferper second. When there are more transactions than this capacity,threads slow down as they contend for the cache line ownership.

For instance, Chan, et al [4] were puzzled why ssca2 cannotgain benefit from their adaptive concurrency policies. The systemperformance drops by 26% when their ACC policy is activated.We can repeat the same observation by running similar softwareset on our computers. With millions of transactional commits persecond, the commit counter in their solution is saturated and causes

3 Although Q6600 comes in a single quad-core package, it features two setsof shared L2 cache, each to be shared by two cores only. According to ourdefinition in Section 1, we treat it as two processors.

3 2011/8/20

threads

ssca2

Page 33: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Solutions?

• Abandon timestamp-based detection?‣ We are still retaining some global data... e.g.,‣ A single clock (TML, NOrec)‣ A ring (RingSTM)‣ We are just having cache contention on some other things.

• Copy ideas from STM for cluster?‣ Usually involves network messages, esp. broadcasts.‣ Network messages can be taken as remote-memory operations.

• Finally, TL2C by Anvi?

22

Page 34: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Tx Desc 1 Tx Desc 2Timestamp

TableTx Desc 3

TL2C by Hillel Anvi

23

IDLE IDLE

2…232

2…220

3…125IDLE

1…1232…2343…124

1…1202…2303…124

1…1232…2333…126

Page 35: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Tx Desc 1 Tx Desc 2Timestamp

TableTx Desc 3

TL2C by Hillel Anvi

24

IDLE IDLE

2…232

2…220

3…125IDLE

1…1232…2343…124

1…1202…2303…124

1…1232…2333…126

Separated Clocks

Page 36: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Tx Desc 1 Tx Desc 2Timestamp

TableTx Desc 3

TL2C by Hillel Anvi

25

IDLE IDLE

2…232

2…220

3…125IDLE

1…1232…2343…124

1…1202…2303…124

1…1232…2333…126

Cache Clocks

Page 37: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Tx Desc 1 Tx Desc 2Timestamp

TableTx Desc 3

TL2C by Hillel Anvi

26

IDLE IDLE

2…232

2…220

3…125IDLE

1…1232…2343…124

1…1202…2303…124

1…1232…2333…126

Timestamp = Node ID + Clock

Page 38: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Tx Desc 1 Tx Desc 2Timestamp

TableTx Desc 3

TL2C by Hillel Anvi

27

IDLE IDLE

2…232

2…220

3…125IDLE

1…1232…2343…124

1…1202…2303…124

1…1232…2333…126

Page 39: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Tx Desc 1 Tx Desc 2Timestamp

TableTx Desc 3

TL2C by Hillel Anvi

28

ACTIVE IDLE

2…232

2…220

3…125ACTIVE

1…1232…2343…124

1…1202…2303…124

1…1232…2333…126

Page 40: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Running

Tx Desc 1 Tx Desc 2Timestamp

TableTx Desc 3

TL2C by Hillel Anvi

29

ACTIVE IDLE

2…232

2…220

3…125ACTIVE

1…1232…2343…124

1…1202…2303…124

1…1232…2333…126

write

read

234 > 232 && 234 > 220 => okay

Page 41: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Running

Tx Desc 1 Tx Desc 2Timestamp

TableTx Desc 3

TL2C by Hillel Anvi

30

ACTIVE IDLE

2…232

2…220

3…125ACTIVE

1…1232…2343…124

1…1202…2303…124

1…1232…2333…126

read

write

233 > 232 && 233 > 220 => okay

Page 42: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Running

Tx Desc 1 Tx Desc 2Timestamp

TableTx Desc 3

TL2C by Hillel Anvi

31

COMMIT IDLE

2…232

2…220

3…125ACTIVE

1…1232…2343…124

1…1202…2303…124

1…1232…2333…126

Page 43: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Running

Tx Desc 1 Tx Desc 2Timestamp

TableTx Desc 3

TL2C by Hillel Anvi

32

COMMIT IDLE

LOCKED

2…220

3…125ACTIVE

1…1232…2343…124

1…1202…2303…124

1…1232…2333…126

Page 44: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Running

Tx Desc 1 Tx Desc 2Timestamp

TableTx Desc 3

TL2C by Hillel Anvi

33

COMMIT IDLE

LOCKED

2…220

3…125ACTIVE

1…1242…2343…124

1…1202…2303…124

1…1232…2333…126

Page 45: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Running

Tx Desc 1 Tx Desc 2Timestamp

TableTx Desc 3

TL2C by Hillel Anvi

34

DONE IDLE

1…124

2…220

3…125ACTIVE

1…1242…2343…124

1…1202…2303…124

1…1232…2333…126

Page 46: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Running

Tx Desc 1 Tx Desc 2Timestamp

TableTx Desc 3

TL2C by Hillel Anvi

35

DONE IDLE

1…124

2…220

3…125ACTIVE

1…1242…2343…124

1…1202…2303…124

1…1232…2333…126

read

Page 47: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Running

Tx Desc 1 Tx Desc 2Timestamp

TableTx Desc 3

TL2C by Hillel Anvi

36

DONE IDLE

1…124

2…220

3…125CONFLICT

1…1242…2343…124

1…1202…2303…124

1…1232…2333…126

read

123 < 124 => not okay

Page 48: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Running

Tx Desc 1 Tx Desc 2Timestamp

TableTx Desc 3

TL2C by Hillel Anvi

37

DONE ACTIVE

1…124

2…220

3…125ABORT

1…1242…2343…124

1…1202…2303…124

1…1242…2333…126

Page 49: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Running

Tx Desc 1 Tx Desc 2Timestamp

TableTx Desc 3

TL2C by Hillel Anvi

38

DONE ACTIVE

1…124

2…220

3…125ABORT

1…1242…2343…124

1…1202…2303…124

1…1242…2333…126

read

Page 50: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Running

Tx Desc 1 Tx Desc 2Timestamp

TableTx Desc 3

TL2C by Hillel Anvi

39

DONE CONFLICT

1…124

2…220

3…125ABORT

1…1242…2343…124

1…1202…2303…124

1…1232…2333…126

120 < 124 => not okay

Page 51: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Running

Tx Desc 1 Tx Desc 2Timestamp

TableTx Desc 3

TL2C by Hillel Anvi

40

DONE CONFLICT

1…124

2…220

3…125ABORT

1…1242…2343…124

1…1242…2303…124

1…1242…2333…126

Page 52: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Running

Tx Desc 1 Tx Desc 2Timestamp

TableTx Desc 3

TL2C by Hillel Anvi

41

DONE RETRY

1…124

2…220

3…125ABORT

1…1242…2343…124

1…1242…2303…124

1…1242…2333…126

read

read

Page 53: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Running

Tx Desc 1 Tx Desc 2Timestamp

TableTx Desc 3

TL2C by Hillel Anvi

42

DONE CONFLICT

1…124

2…220

3…125ABORT

1…1242…2343…124

1…1242…2303…124

1…1242…2333…126

read

124 < 125 => not okay

Page 54: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Running

Tx Desc 1 Tx Desc 2Timestamp

TableTx Desc 3

TL2C by Hillel Anvi

43

DONE CONFLICT

1…124

2…220

3…125ABORT

1…1242…2343…124

1…1242…2303…125

1…1242…2333…126

Page 55: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

H. Avni and N. Shavit, “Maintaining consistent transactional states without a global clock,” in Proceedings of the 15th international colloquium on Structural Information and Communication Complexity. Berlin, Heidelberg: Springer-Verlag, 2008, pp. 131–140.

Disadvantages

• Acknowledge new clocks by conflicts only‣ No more global clock to inform the newest timestamp value‣ New timestamp, Conflict, and then Abort Inevitable

• An update generates at most (n-1) unnecessary “conflicts”when there are n threads. Thus high resultant abort rate.

• “The hope is that in the future, on larger distributed machines, the cost of the higher abort rate will be offset by the reduction in the cost that would have been incurred by using a shared global clock.” — Hillel Anvi and Nir Shavit, SIROCCO, 2008

44

Page 56: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Improving the design...

1. How do we reduce the chance a transaction encountering a timestamp that is “too new” like transaction 2?Find some other mechanisms to copy the clock values.

2. Can we copy the clock values directly from other transactions?No. This would cause cache invalidation on important data.

3. Would leveraging on the multicore’s shared cache help?

TrC-MC: Transactional Consistency for Multicore... Zone + Timestamp Extension

45

Page 57: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Zones

• We allocate transactions into zones.

• Ideally, threads in the same zone are in the same processor.• Thus, we have fast sharing among the same zone.

• Threads within a zone share a group of cache clocks.• A thread copies the cache clocks before it starts a transaction.

• A thread updates the zone cache encountering new timestamps.

• So, once a thread encounters a conflict, other threads in the zone no longer suffers from the timestamp.

46

Page 58: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Zones

47

Tx Desc 1 Tx Desc 3 TimestampTable

Tx Desc 4

Zone 1 Zone 2

1…1242…180

1…1182…225 2…225

1…118

1…124 2…220 1…119 2…225

1…118

2…219

2…210

IDLE IDLE IDLE

Page 59: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Zones

48

Tx Desc 1 Tx Desc 3 TimestampTable

Tx Desc 4

Zone 1 Zone 2

1…1242…220

1…1182…225 2…225

1…119

1…124 2…220 1…119 2…225

1…118

2…219

2…210

ACTIVE IDLE ACTIVE

Page 60: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Running

Zones

49

Tx Desc 1 Tx Desc 3 TimestampTable

Tx Desc 4

Zone 1 Zone 2

1…1242…220

1…1182…225 2…225

1…119

1…124 2…220 1…119 2…225

1…118

2…219

LOCKED

COMMIT IDLE ACTIVE

Page 61: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Running

Zones

50

Tx Desc 1 Tx Desc 3 TimestampTable

Tx Desc 4

Zone 1 Zone 2

1…1252…220

1…1182…225 2…225

1…119

1…125 2…220 1…119 2…225

1…118

2…219

LOCKED

COMMIT IDLE ACTIVE

Page 62: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Running

Zones

51

Tx Desc 1 Tx Desc 3 TimestampTable

Tx Desc 4

Zone 1 Zone 2

1…1252…220

1…1182…225 2…225

1…119

1…125 2…220 1…119 2…225

1…118

2…219

1…125

DONE IDLE ACTIVE

Page 63: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Running

Zones

52

Tx Desc 1 Tx Desc 3 TimestampTable

Tx Desc 4

Zone 1 Zone 2

1…1252…220

1…1182…225 2…225

1…119

1…125 2…220 1…119 2…225

1…118

2…219

1…125

DONE IDLE CONFLICT

read

Page 64: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Running

Zones

53

Tx Desc 1 Tx Desc 3 TimestampTable

Tx Desc 4

Zone 1 Zone 2

1…1252…220

1…1182…225 2…225

1…125

1…125 2…220 1…125 2…225

1…118

2…219

1…125

DONE IDLE CONFLICT

Page 65: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Running

Zones

54

Tx Desc 1 Tx Desc 3 TimestampTable

Tx Desc 4

Zone 1 Zone 2

1…1252…220

1…1252…225 2…225

1…125

1…125 2…220 1…125 2…225

1…118

2…219

1…125

DONE ACTIVE CONFLICT

Page 66: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Running

Zones

55

Tx Desc 1 Tx Desc 3 TimestampTable

Tx Desc 4

Zone 1 Zone 2

1…1252…220

1…1252…225 2…225

1…125

1…125 2…220 1…125 2…225

1…118

2…219

1…125

DONE ACTIVE CONFLICT

read

Page 67: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Torvald Riegel, Pascal Felber, and Christof Fetzer. A Lazy Snapshot Algorithm with Eager Validation. In Proceedings of the 20th International Symposium on Distributed Computing (DISC), pages 284–298, September 2006.

P. Felber, C. Fetzer, and T. Riegel, “Dynamic performance tuning of word-based software transactional memory,” in Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2008, pp. 237–246.

Timestamp extension

• Originally done by T. Riegel et al.‣ LSA: “Lazy Snapshot Algorithm with Eager Detection”‣ State-of-the-art example: TinySTM

• Effect: fewer aborts as some transactions are rescued.

• When a new timestamp is encountered...1. Update the new clock into the cache clock.2. Re-execute the read set, check for inconsistencies.3. If consistent, the transaction can proceed with new clock.4. Otherwise, the transaction has to abort.

56

Page 68: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Proof

57

Tx Desc 1 Tx Desc 3 TimestampTable

Tx Desc 4

Zone 1 Zone 2

1…124 1…118 1…119

1…124 1…119

1…119IDLE IDLE IDLECache clocks intransaction descriptorsare under-estimation

of those in zone.

Page 69: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Proof

58

Tx Desc 1 Tx Desc 3 TimestampTable

Tx Desc 4

Zone 1 Zone 2

1…124 1…118 1…119

1…124 1…119

1…119IDLE IDLE IDLE

Cache clocks in zones areunderestimations of the largest

values in timestamp table.

Page 70: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Proof

59

Tx Desc 1 Tx Desc 3 TimestampTable

Tx Desc 4

Zone 1 Zone 2

1…124 1…118 1…119

1…124 1…119

1…119IDLE IDLE IDLEThe largest value is in the actualclock in the corresponding zone.

Page 71: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Proof

60

Tx Desc 1 Tx Desc 3 TimestampTable

Tx Desc 4

Zone 1 Zone 2

1…124 1…118 1…119

1…124 1…119

1…119IDLE IDLE IDLETherefore, cache clocks intransaction descriptorsare underestimation of

the actual clocks.

Page 72: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Proof

61

Tx Desc 1 Tx Desc 3 TimestampTable

Tx Desc 4

Zone 1 Zone 2

1…124 1…124 1…124

1…124 1…124

1…124IDLE IDLE IDLEOr, at extreme case,

all these numbers wouldbe equal...

Page 73: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Proof

62

Tx Desc 1 Tx Desc 3 TimestampTable

Tx Desc 4

Zone 1 Zone 2

1…125 1…124 1…124

1…125 1…124

1…124COMMIT IDLE IDLE

1…125

In commit procedure,the clock in the zoneis incremented by 1.

Page 74: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Proof

63

Tx Desc 1 Tx Desc 3 TimestampTable

Tx Desc 4

Zone 1 Zone 2

1…124 1…124 1…124

1…125 1…124

1…124IDLE IDLE IDLE

1…125

After increment, all thesecache clocks would be

absolutely smaller.

Page 75: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Proof

64

Tx Desc 1 Tx Desc 3 TimestampTable

Tx Desc 4

Zone 1 Zone 2

1…124 1…124 1…124

1…125 1…124

1…124IDLE IDLE IDLE

1…125

Therefore, it would generatea conflict / extension whennewly written values are

encountered.

Page 76: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Performance: ssca2

65

Tran

sact

ions

per

sec

ond

Number of threads

Page 77: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

TSTM / CommitsTSTM / AbortsTSTM / Extension

TRC-MC / CommitsTRC-MC / AbortsTRC-MC / Extension

TL2C / CommitsTL2C / AbortsSEQUENTIAL

Performance: ssca2

65

0

3750000

7500000

11250000

15000000

1 2 4 8 16 32 64

Tran

sact

ions

per

sec

ond

Number of threads

Page 78: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

TSTM / CommitsTSTM / AbortsTSTM / Extension

TRC-MC / CommitsTRC-MC / AbortsTRC-MC / Extension

TL2C / CommitsTL2C / AbortsSEQUENTIAL

Performance: ssca2

65

0

3750000

7500000

11250000

15000000

1 2 4 8 16 32 64

Tran

sact

ions

per

sec

ond

Number of threads

Less Aborts

Page 79: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Performance: vacation

66

Tran

sact

ions

per

sec

ond

Number of threads

Page 80: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

TSTM / CommitsTSTM / AbortsTSTM / Extension

TRC-MC / CommitsTRC-MC / AbortsTRC-MC / Extension

TL2C / CommitsTL2C / AbortsSEQUENTIAL

Performance: vacation

66

0

375000

750000

1125000

1500000

1 2 4 8 16 32 64

Tran

sact

ions

per

sec

ond

Number of threads

Page 81: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

TSTM / CommitsTSTM / AbortsTSTM / Extension

TRC-MC / CommitsTRC-MC / AbortsTRC-MC / Extension

TL2C / CommitsTL2C / AbortsSEQUENTIAL

Performance: vacation

66

0

375000

750000

1125000

1500000

1 2 4 8 16 32 64

Tran

sact

ions

per

sec

ond

Number of threads

Less Aborts

Page 82: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

TSTM / CommitsTSTM / AbortsTSTM / Extension

TRC-MC / CommitsTRC-MC / AbortsTRC-MC / Extension

TL2C / CommitsTL2C / AbortsSEQUENTIAL

Performance: vacation

66

0

375000

750000

1125000

1500000

1 2 4 8 16 32 64

Tran

sact

ions

per

sec

ond

Number of threads

Less Aborts

Slightly more extensions

Page 83: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Conclusions

• Cache contention over the global structure in STM.• Especially serious on a computer with multiple multicores.

• Our solution: Zone + Timestamp Extension‣ Compared with single-clock STM (like TinySTM):

✴ Much better performance (with less cache contention)‣ Compared with distributed-clock STM (like TL2C):

✴ Less aborts (at cost of some timestamp extensions)✴ Similar or better performance

• Future work:‣ Having multiple timestamp tables as well for better scalability‣ Experiment similar protocols on cluster STM

67

Page 84: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Contact Us

• Kinson Chan‣ [email protected]‣ http://i.cs.hku.hk/~kchan

• Cho-Li Wang‣ [email protected]‣ http://i.cs.hku.hk/~clwang

• System Research Group‣ http://www.srg.cs.hku.hk/

• Department of Computer Science, University of Hong Kong‣ http://www.cs.hku.hk/

68

Page 85: TrC-MC: Decentralized Software Transactional Memory for Multi-Multicore Computers

Questions?