Nb-GCLOCK: A Non-blocking Buffer Management based on the Generalized CLOCK
Makoto YUI (1), Jun MIYAZAKI (2), Shunsuke UEMURA (3) and Hayato YAMANA (4)
1. Research Fellow, JSPS (Japan Society for the Promotion of Science) / Visiting Postdoc at Waseda University, Japan, and CWI, Netherlands; 2. Nara Institute of Science and Technology; 3. Nara Sangyo University; 4. Waseda University / National Institute of Informatics

ICDE2010 Nb-GCLOCK

May 10, 2015


Makoto Yui

Makoto Yui, Jun Miyazaki, Shunsuke Uemura and Hayato Yamana, "Nb-GCLOCK: A Non-blocking Buffer Management based on the Generalized CLOCK",
Proc. ICDE, March 2010.
Transcript
Page 1: ICDE2010 Nb-GCLOCK

Nb-GCLOCK: A Non-blocking Buffer Management based on the Generalized CLOCK

Makoto YUI (1), Jun MIYAZAKI (2), Shunsuke UEMURA (3) and Hayato YAMANA (4)

1. Research Fellow, JSPS (Japan Society for the Promotion of Science) / Visiting Postdoc at Waseda University, Japan, and CWI, Netherlands
2. Nara Institute of Science and Technology
3. Nara Sangyo University
4. Waseda University / National Institute of Informatics

Page 2: ICDE2010 Nb-GCLOCK

Outline

• Background

• Our approach

– Non-Blocking Synchronization

– Nb-GCLOCK

• Experimental Evaluation

• Related Work

• Conclusion

2

Page 4: ICDE2010 Nb-GCLOCK

Background – Recent trends in CPU development

[Timeline figure: single-core CPUs (Pentium, 1990s), multi-core CPUs (Power4, Core2, Nehalem, 2000s), many-core CPUs (UltraSPARC T2, Azul Vega, Larrabee?)]

- Niagara T2: 8 cores x 8 SMT = 64 processors
- Azul Vega3: 54 cores x 16 chips = 864 processors

The number of CPU cores per chip is doubling roughly every two years.

The many-core era is coming.

Page 9: ICDE2010 Nb-GCLOCK

Background – CPU Scalability of open source DBs

Open source DBs have faced CPU scalability problems.

Ryan Johnson et al., "Shore-MT: A Scalable Storage Manager for the Multicore Era", In Proc. EDBT, 2009.

[Figure: normalized throughput (0 to 10) vs. concurrent threads (1 to 32) for PostgreSQL, MySQL, and BDB; microbenchmark on UltraSparc T1 (32 procs)]

Gain after 16 threads is less than 5%.

You might think: what about TPC-C?

Page 14: ICDE2010 Nb-GCLOCK

CPU scalability of PostgreSQL

Doug Tolbert, David Strong, Johney Tsai (Unisys), "Scaling PostgreSQL on SMP Architectures", PGCON 2007.

TPC-C benchmark result on a high-end Unisys Linux machine (Xeon-SMP 32 CPUs, 16 GB memory, EMC RAID10 storage).

[Figure: TPS vs. CPU cores for PostgreSQL versions 8.0, 8.1, and 8.2]

Gain after 16 CPU cores is less than 5%.

Q. What did the PostgreSQL community do?
A. They revised the synchronization mechanisms in the buffer management module.

Page 19: ICDE2010 Nb-GCLOCK

Synchronization in Buffer Management Module

Several empirical studies have revealed that the largest bottleneck is synchronization in the buffer management module.

[1] Ryan Johnson, Ippokratis Pandis, Anastassia Ailamaki, "Critical Sections: Re-emerging Scalability Concerns for Database Storage Engines", In Proc. DaMoN, 2008.
[2] Stavros Harizopoulos, Daniel J. Abadi, Samuel Madden, and Michael Stonebraker, "OLTP Through the Looking Glass, and What We Found There", In Proc. SIGMOD, 2008.

[Diagram: the buffer manager sits between the CPU/memory and the database files on disk, and reduces disk access by caching database pages]

A page request goes through two steps:
(1) looking up the hash table (hits are served from the buffer), and
(2) running the page replacement algorithm on misses, which go to the database files.

Page 24: ICDE2010 Nb-GCLOCK

Naive buffer management schemes

Both PostgreSQL 8.0 and 8.1 pair a buffer lookup hash table with LRU page replacement; the LRU list always needs to be locked when it is accessed.

PostgreSQL 8.0: a single giant lock protects the whole lookup table. Giant lock sucks! It did not scale at all.

PostgreSQL 8.1: striped the lock across hash buckets. Scales up to 8 processors.
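The 8.1-style fix can be sketched in C with pthreads. This is an illustrative sketch of lock striping in general, not PostgreSQL's actual code; the stripe count and names are assumptions.

```c
#include <pthread.h>
#include <stdint.h>

#define N_STRIPES 16  /* illustrative stripe count, not PostgreSQL's */

static pthread_mutex_t stripes[N_STRIPES];

/* Call once at startup: give every stripe its own mutex. */
static void init_stripes(void) {
    for (int i = 0; i < N_STRIPES; i++)
        pthread_mutex_init(&stripes[i], NULL);
}

/* Each page id hashes to one stripe, so two lookups contend
 * only when their pages land on the same stripe. */
static pthread_mutex_t *lock_for(uint64_t page_id) {
    return &stripes[page_id % N_STRIPES];
}
```

With one giant lock every lookup serializes; with stripes, only lookups that hash to the same stripe do.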

Page 26: ICDE2010 Nb-GCLOCK

Less naive buffer management schemes

PostgreSQL 8.1: hash buckets + LRU page replacement. The LRU list always needs to be locked when it is accessed. Scales up to 8 processors.

PostgreSQL 8.2: hash buckets + CLOCK page replacement. CLOCK does not require a lock when an entry is touched. Scales up to 16 processors.
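The reason CLOCK needs no lock on a hit can be sketched in C11 atomics: a hit only sets a per-frame reference flag, whereas LRU must re-link a shared list. The struct and function names below are illustrative, not PostgreSQL's.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative frame: CLOCK keeps one reference flag per frame. */
struct frame {
    atomic_bool referenced;
};

/* On a buffer hit, CLOCK only marks the frame; there is no list to
 * re-link, so no lock is required. */
void clock_touch(struct frame *f) {
    atomic_store_explicit(&f->referenced, true, memory_order_relaxed);
}

/* The clock hand gives each frame a "second chance": it clears the
 * flag and treats a frame as an eviction candidate only if the flag
 * was already clear. */
bool clock_sweep_candidate(struct frame *f) {
    return !atomic_exchange_explicit(&f->referenced, false,
                                     memory_order_relaxed);
}
```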

Page 27: ICDE2010 Nb-GCLOCK

Outline

• Background

• Our approach

– Non-Blocking Synchronization

– Nb-GCLOCK

• Experimental Evaluation

• Related Work

• Conclusion

27

Page 34: ICDE2010 Nb-GCLOCK

Core idea of our approach

Previous approaches: the buffer manager reduces disk I/Os (good), but its locks are contended (bad). The intuition: with enough processors, threads pile up on the locks while disk bandwidth goes underutilized.

Our optimistic approach: reduce lock granularity to a single CPU instruction and remove the bottleneck. The number of I/Os slightly increases, but there is no contention on locks.

Page 39: ICDE2010 Nb-GCLOCK

Major Difference to Previous Approaches

Previous approaches: reduce disk I/Os, but locks are contended.
Our optimistic approach: the number of I/Os slightly increases, but there is no contention on locks.

Their goal: improve buffer hit rates to reduce I/Os. This has been the single goal for many decades, but is it still valid in the many-core era? There are also SSDs.

Our goal: improve throughput by utilizing (many) CPUs. Use non-blocking synchronization instead of acquiring locks!

Page 45: ICDE2010 Nb-GCLOCK

What's non-blocking and lock-free?

Formally: stopping one thread will not prevent global progress; individual threads make progress without waiting.

Less formally: no thread 'locks' any resource; there are no 'critical sections', locks, mutexes, spin-locks, etc.

Lock-free: every successful step makes global progress and completes within finite time (ensuring liveness).

Wait-free: every step makes global progress and completes within finite time (ensuring fairness).

Page 49: ICDE2010 Nb-GCLOCK

Non-blocking synchronization

A synchronization method that does not acquire any lock, enabling concurrent access to shared resources. It utilizes atomic CPU primitives, such as CAS (compare-and-swap, cmpxchg on x86), and memory barriers.

Blocking:

acquire_lock(lock);
counter++;
release_lock(lock);

Non-blocking:

int old;
do {
    old = *counter;
} while (!CAS(counter, old, old+1));

The counter is incremented only if its value still equals old; otherwise the loop retries.
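The non-blocking variant above maps directly onto C11 atomics. A minimal sketch (the function name is mine): atomic_compare_exchange_weak rewrites `old` with the current value when it fails, so the loop simply retries.

```c
#include <stdatomic.h>

/* Non-blocking increment: no thread ever holds a lock; a thread that
 * loses the CAS race retries with the freshly observed value. */
void nb_increment(atomic_int *counter) {
    int old = atomic_load(counter);
    while (!atomic_compare_exchange_weak(counter, &old, old + 1)) {
        /* CAS failed: 'old' now holds the current value; retry. */
    }
}
```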

Page 52: ICDE2010 Nb-GCLOCK

Making the buffer manager non-blocking

[Diagram: hash buckets feed the GCLOCK page replacement algorithm; hits are served from the buffer, misses go to the database files via "lock; lseek; read; unlock"]

1. Utilized an existing lock-free hash table

2. Removed locks on cache misses (in fig. 6)

Page 56: ICDE2010 Nb-GCLOCK

Making the buffer manager non-blocking (cont.)

3. Needed to keep consistency between the lookup hash table and GCLOCK (in the right half of fig. 3): immediately after changing the page allocation of a buffer frame, the reference in the buffer lookup table still holds a different page identifier.

4. Avoided locks on I/Os by utilizing pread, CAS, and memory barriers (in fig. 5)
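Point 4 rests on the fact that pread carries its own offset. A sketch (the page size and function names are illustrative assumptions, not the paper's code):

```c
#include <unistd.h>
#include <sys/types.h>

#define PAGE_SIZE 8192  /* illustrative page size */

/* Offset arithmetic for a fixed-size page file. */
static off_t page_offset(off_t page_no) {
    return page_no * PAGE_SIZE;
}

/* lseek()+read() share the per-descriptor file offset, so the pair
 * must be wrapped in a critical section (the slide's
 * "lock; lseek; read; unlock").  pread() takes the offset as an
 * explicit argument and never touches the shared offset, so
 * concurrent readers need no lock at all. */
static ssize_t read_page(int fd, void *buf, off_t page_no) {
    return pread(fd, buf, PAGE_SIZE, page_offset(page_no));
}
```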

Page 57: ICDE2010 Nb-GCLOCK

State Machine-based Reasoning for selecting a replacement victim

Construct the algorithm from many small 'steps': build a state machine that ensures global progress.

Page 66: ICDE2010 Nb-GCLOCK

[State machine diagram; E: denotes the action taken on entering a state]

- Select a frame: start finding a replacement victim. If the slot is null, continue to the next slot; otherwise check whether it is evicted.
- Check whether evicted: if already evicted, try the next entry; if not, check whether it is pinned.
- Check whether pinned: if pinned, move the clock hand to the next candidate; if not, try to decrement the refcount.
- Try to decrement the refcount: decrement the weight count of the buffer page with a CAS. If the value was not swapped, retry; on success, if --refcount > 0, move the clock hand; if --refcount <= 0, try to evict.
- Try to evict: if eviction succeeds, fix the frame in the pool and return it as the replacement victim; if not, continue from frame selection.

Threads A and B can walk this machine concurrently. Oops! A candidate found by thread A may be intercepted by thread B; in that case A simply continues sweeping.
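The walk described above can be sketched in C11 atomics. This is an illustrative reduction of the slide's state machine, not the paper's actual code: the frame fields, the simplified clock hand, and the assumption that at least one evictable frame exists are all mine.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative frame: refcount is the GCLOCK weight, pinned marks
 * frames in use, evicted marks frames already claimed. */
struct frame {
    atomic_int  refcount;
    atomic_bool pinned;
    atomic_bool evicted;
};

/* Sweep the clock hand: skip evicted or pinned frames, CAS-decrement
 * the weight of the rest, and claim the first frame whose weight has
 * reached zero.  The CAS on 'evicted' makes interception by another
 * thread (the "Oops!" case) safe: exactly one thread wins the frame.
 * The sketch assumes at least one evictable frame exists. */
struct frame *find_victim(struct frame *frames, size_t n, size_t *hand) {
    for (;;) {
        struct frame *f = &frames[*hand % n];
        *hand += 1;                          /* advance the clock hand */
        if (atomic_load(&f->evicted) || atomic_load(&f->pinned))
            continue;                        /* try the next entry */
        int old = atomic_load(&f->refcount);
        while (old > 0 &&
               !atomic_compare_exchange_weak(&f->refcount, &old, old - 1))
            ;                                /* CAS-decrement the weight */
        if (old > 1)
            continue;                        /* still warm: move on */
        bool expected = false;               /* try to claim the frame */
        if (atomic_compare_exchange_strong(&f->evicted, &expected, true))
            return f;                        /* we won the victim */
        /* else: intercepted by another thread; keep sweeping */
    }
}
```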

Page 68: ICDE2010 Nb-GCLOCK

Outline

• Background

• Our approach

– Non-Blocking Synchronization

– Nb-GCLOCK

• Experimental Evaluation

• Related Work

• Conclusion

68

Page 70: ICDE2010 Nb-GCLOCK

Experimental settings

Workload: Zipf 80/20 distribution (a famous power law), containing 20% sequential scans; the dataset is 32 GB in total.
Machine used: UltraSPARC T2 (64 processors).

We also performed evaluations on various x86 settings in the paper.
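For intuition, the 80/20 skew can be sketched as a deterministic mapping from two uniform draws to a page id: 80% of requests land in the "hot" first 20% of the pages. This coarse stand-in, and the function name, are mine; the paper uses a true Zipf generator.

```c
#include <stdint.h>

/* Illustrative 80/20 sampler: u1 decides hot vs. cold, u2 picks a
 * page within the chosen set; both are uniform in [0,1). */
uint64_t zipf_80_20(double u1, double u2, uint64_t n_pages) {
    uint64_t hot = n_pages / 5;              /* hot set: 20% of pages */
    if (u1 < 0.8)
        return (uint64_t)(u2 * hot);         /* 80% of requests: hot pages */
    return hot + (uint64_t)(u2 * (n_pages - hot));  /* the rest: cold */
}
```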

Page 73: ICDE2010 Nb-GCLOCK

Performance comparison on moderate I/Os (fig. 9)

[Figure: throughput (normalized by LRU, 0.0 to 6.0) vs. processors (8, 16, 32, 64) for LRU, GCLOCK, and Nb-GCLOCK]

CPU utilization: previous approaches are low, about 20%; Nb-GCLOCK is high, more than 95%.

An even larger difference in CPU time can be expected as the number of CPUs increases, so we expect still higher throughput.

Page 77: ICDE2010 Nb-GCLOCK

Maximum throughput to processors

Scalability to processors when all pages are resident in memory, intended to show the scalability limit implied by each algorithm.

Throughput (operations/sec; plotted on a log scale) by processors (cores):

             8 (1)       16 (2)      32 (4)       64 (8)
2Q           890,992     819,975     866,009      662,782
GCLOCK       1,758,605   1,912,000   1,931,268    1,817,748
Nb-GCLOCK    3,409,819   7,331,722   14,245,524   25,834,449

Nb-GCLOCK achieved almost linear scalability, at least up to 64 processors! This is the first attempt to remove locks from buffer management.

Interestingly, GCLOCK hits a CPU-scalability limit at around 16 processors; caching solutions built on GCLOCK share that limit.

Page 82: ICDE2010 Nb-GCLOCK

82

Nb-GCLOCK uses most of the available CPU time because it is non-blocking, while the lock-based schemes use only about 10-20% of it. This gap in CPU utilization would widen further as the number of processors grows, since locking causes ever more contention.

Page 83: ICDE2010 Nb-GCLOCK

TPC-C evaluation using Apache Derby

[Chart: transactions per minute (tpmC, roughly 800 to 1,400) against the number of terminals (threads: 8, 16, 32, 64, 128), comparing stock Derby with Nb-GCLOCK]

83

Sang Kyun Cha et al. Cache-Conscious Concurrency Control of Main-Memory Indexes on Shared-Memory Multiprocessor Systems. In Proc. VLDB, 2001.

Page 84: ICDE2010 Nb-GCLOCK

The original Derby scheme (CLOCK) decreased in throughput as terminals were added; our scheme, in contrast, showed better results.

84

Page 85: ICDE2010 Nb-GCLOCK

With buffer management no longer the bottleneck, throughput is now limited by the latch on the root page of the B+-tree ➜ a concurrent B+-tree (see OLFIT) would be required.

85

Page 86: ICDE2010 Nb-GCLOCK

Outline

• Background

• Our approach

– Non-Blocking Synchronization

– Nb-GCLOCK

• Experimental Evaluation

• Related Work

• Conclusion

86

Page 87: ICDE2010 Nb-GCLOCK

87

Bp-wrapper

[Diagram: page requests flow through hash buckets into the page replacement algorithm (any); hits are served from the buffer, misses go to the database files; accesses are recorded for the replacement algorithm]

Xiaoning Ding, Song Jiang, and Xiaodong Zhang. BP-Wrapper: A System Framework Making Any Replacement Algorithms (Almost) Lock Contention Free. In Proc. ICDE, 2009.

BP-Wrapper eliminates lock contention on buffer hits by using a batching and prefetching technique.

Page 88: ICDE2010 Nb-GCLOCK

88

This is called lazy synchronization in the literature: the physical work (adjusting the buffer replacement list) is postponed, and the logical operation returns immediately.
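The lazy-synchronization idea can be sketched in Java as follows. This is only an illustrative sketch, not BP-Wrapper's actual code: the class and method names are hypothetical, and the prefetching side of the technique is omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of BP-Wrapper-style lazy synchronization: each thread
// records page accesses in a private batch and only takes the
// replacement-policy lock once the batch is full, amortizing one
// lock acquisition over BATCH_SIZE accesses.
public class BatchedRecorder {
    private static final int BATCH_SIZE = 64;

    private final ReentrantLock policyLock = new ReentrantLock();
    // Stands in for the shared replacement list (LRU, 2Q, ...).
    private final List<Integer> committed = new ArrayList<>();

    private final ThreadLocal<List<Integer>> batch =
            ThreadLocal.withInitial(() -> new ArrayList<>(BATCH_SIZE));

    // Called on every buffer hit: the logical operation returns
    // immediately; the physical work is postponed.
    public void recordAccess(int pageId) {
        List<Integer> b = batch.get();
        b.add(pageId);
        if (b.size() >= BATCH_SIZE) {
            flush(b);
        }
    }

    // Physical work: replay the whole batch against the replacement
    // policy under a single lock acquisition.
    private void flush(List<Integer> b) {
        policyLock.lock();
        try {
            committed.addAll(b);
        } finally {
            policyLock.unlock();
        }
        b.clear();
    }

    public int committedCount() {
        policyLock.lock();
        try {
            return committed.size();
        } finally {
            policyLock.unlock();
        }
    }
}
```

Note that accesses sit in the thread-local batch until it fills, which is exactly the postponed physical work described above.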

Page 89: ICDE2010 Nb-GCLOCK

89

Pros: works with any page replacement algorithm.

Cons: does not increase the throughput of CLOCK variants, because CLOCK requires no locks on buffer hits. Moreover, cache misses involve batching, and the longer lock holding time causes more contention.
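The reason CLOCK variants need no lock on a buffer hit is that recording a hit is just an atomic counter update on the frame, with no shared list to reorder. A minimal illustrative sketch (the class and field names are ours, not from any real buffer manager):

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

// In GCLOCK, each frame carries a reference counter (weight).
// A buffer hit only increments that counter atomically -- no list
// reordering and hence no lock is needed, unlike LRU or 2Q, which
// must move the page within a shared list on every hit.
public class GClockHitPath {
    private final AtomicIntegerArray weights;

    public GClockHitPath(int numFrames) {
        this.weights = new AtomicIntegerArray(numFrames);
    }

    // Called on every buffer hit for the frame holding the page.
    public void onHit(int frame) {
        weights.incrementAndGet(frame); // single atomic RMW, lock-free
    }

    public int weight(int frame) {
        return weights.get(frame);
    }
}
```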


Page 93: ICDE2010 Nb-GCLOCK

93

Conclusions

Proposed Nb-GCLOCK, a lock-free variant of the GCLOCK page replacement algorithm.

Achieved almost linear scalability up to 64 processors, while existing lock-based schemes do not scale beyond 16 processors.

This is the first attempt to introduce non-blocking synchronization into database buffer management: optimistic I/O using pread, CAS, and memory barriers.

Linearizability and lock-freedom are proven in the paper. Lock-freedom guarantees a certain throughput: some active thread always completes its operation within a bounded number of steps, ensuring global progress.

This work is also useful for any caching solution that requires high throughput (e.g., handling C10K-scale access rates).
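The core CAS-based victim selection can be sketched as below. This is only an illustrative sketch under simplifying assumptions (a weight per frame, an atomic clock hand, victims claimed by CAS-ing weight 0 to -1); the actual Nb-GCLOCK additionally handles pinned frames, optimistic reads with pread, and memory barriers, and its lock-freedom is proven in the paper.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicIntegerArray;

// Sketch of a lock-free GCLOCK sweep: the clock hand is an
// AtomicInteger and each frame's weight is updated with CAS.
// A thread claims a victim by CAS-ing its weight from 0 to -1,
// so two threads can never evict the same frame.
public class NbGClockSketch {
    private final AtomicInteger hand = new AtomicInteger(0);
    private final AtomicIntegerArray weights;
    private final int numFrames;

    public NbGClockSketch(int numFrames) {
        this.numFrames = numFrames;
        this.weights = new AtomicIntegerArray(numFrames);
    }

    // Buffer hit: a single lock-free atomic increment.
    public void onHit(int frame) {
        weights.incrementAndGet(frame);
    }

    // Sweep until some frame's weight reaches 0 and we win the CAS
    // that claims it. Every CAS failure means another thread made
    // progress, which is the essence of lock-freedom.
    public int evict() {
        while (true) {
            int frame = Math.floorMod(hand.getAndIncrement(), numFrames);
            int w = weights.get(frame);
            if (w > 0) {
                // Decrement the weight; losing this CAS is harmless,
                // it just means a concurrent thread changed it first.
                weights.compareAndSet(frame, w, w - 1);
            } else if (w == 0 && weights.compareAndSet(frame, 0, -1)) {
                return frame; // claimed as the victim
            }
        }
    }

    // After reloading the frame with a new page, make it evictable again.
    public void release(int frame) {
        weights.set(frame, 0);
    }
}
```

For example, with weights [2, 1, 0] the sweep decrements frames 0 and 1 and claims frame 2, all without ever blocking a concurrent hit or eviction.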

Page 94: ICDE2010 Nb-GCLOCK

94

Thank you for your attention!