ICDE2010 Nb-GCLOCK


DESCRIPTION

Makoto Yui, Jun Miyazaki, Shunsuke Uemura and Hayato Yamana. "Nb-GCLOCK: A Non-blocking Buffer Management based on the Generalized CLOCK", Proc. ICDE, March 2010.

Transcript

Nb-GCLOCK: A Non-blocking Buffer Management based on the Generalized CLOCK

Makoto YUI (1), Jun MIYAZAKI (2), Shunsuke UEMURA (3) and Hayato YAMANA (4)

1. Research fellow, JSPS (Japan Society for the Promotion of Science) / Visiting Postdoc at Waseda University, Japan and CWI, Netherlands
2. Nara Institute of Science and Technology
3. Nara Sangyo University
4. Waseda University / National Institute of Informatics

Outline

• Background

• Our approach

– Non-Blocking Synchronization

– Nb-GCLOCK

• Experimental Evaluation

• Related Work

• Conclusion


Background – Recent trends in CPU development

[Figure: timeline of CPU development, roughly 1990 onward – single-core CPUs (Pentium), multi-core CPUs (Power4, Core2, Nehalem), and many-core CPUs (UltraSparc T2, Azul Vega, Larrabee?)]

The number of CPU cores in a chip is doubling in two-year cycles – the many-core era is coming.

- Niagara T2 – 8 cores x 8 SMT = 64 processors
- Azul Vega3 – 54 cores x 16 chips = 864 processors

Background – CPU Scalability of open source DBs

Open source DBs have faced CPU scalability problems.

Ryan Johnson et al., "Shore-MT: A Scalable Storage Manager for the Multicore Era", In Proc. EDBT, 2009.

[Figure: microbenchmark on UltraSparc T1 (32 processors) – throughput (normalized, 0–10) of PostgreSQL, MySQL and BDB at 1, 4, 8, 12, 16, 24 and 32 concurrent threads]

Gain after 16 threads is less than 5%.

You might think… What about TPC-C?

10

CPU scalability of PostgreSQL

Doug Tolbert, David Strong, Johney Tsai (Unisys), “Scaling PostgreSQL on SMP Architectures”, PGCON 2007.

TPC-C benchmark result on a high-end Linux machine of Unisys

(Xeon-SMP 32 CPUs, Memory 16GB, EMC RAID10 Storage)

11

CPU scalability of PostgreSQL

Doug Tolbert, David Strong, Johney Tsai (Unisys), “Scaling PostgreSQL on SMP Architectures”, PGCON 2007.

TPC-C benchmark result on a high-end Linux machine of Unisys

(Xeon-SMP 32 CPUs, Memory 16GB, EMC RAID10 Storage)

Version 8.0

Version 8.1

Version 8.2

TPS

CPU cores

12

Gain after 16 CPU cores is less than 5%

CPU scalability of PostgreSQL

Doug Tolbert, David Strong, Johney Tsai (Unisys), “Scaling PostgreSQL on SMP Architectures”, PGCON 2007.

TPC-C benchmark result on a high-end Linux machine of Unisys

(Xeon-SMP 32 CPUs, Memory 16GB, EMC RAID10 Storage)

Version 8.0

Version 8.1

Version 8.2

TPS

CPU cores

13

Gain after 16 CPU cores is less than 5%

CPU scalability of PostgreSQL

Doug Tolbert, David Strong, Johney Tsai (Unisys), “Scaling PostgreSQL on SMP Architectures”, PGCON 2007.

TPC-C benchmark result on a high-end Linux machine of Unisys

(Xeon-SMP 32 CPUs, Memory 16GB, EMC RAID10 Storage)

Version 8.0

Version 8.1

Version 8.2

TPS

CPU cores Q. What PostgreSQL community did?

14

Gain after 16 CPU cores is less than 5%

CPU scalability of PostgreSQL

Doug Tolbert, David Strong, Johney Tsai (Unisys), “Scaling PostgreSQL on SMP Architectures”, PGCON 2007.

TPC-C benchmark result on a high-end Linux machine of Unisys

(Xeon-SMP 32 CPUs, Memory 16GB, EMC RAID10 Storage)

Version 8.0

Version 8.1

Version 8.2

TPS

CPU cores Q. What PostgreSQL community did?

Revised their synchronization mechanisms in the buffer management module

[1] Ryan Johnson, Ippokratis Pandis, Anastassia Ailamaki: “Critical Sections: Re-emerging Scalability Concerns for Database Storage Engines”, In Proc. DaMoN, 2008. [2] Stavros Harizopoulos, Daniel J. Abadi, Samuel Madden, and Michael Stonebraker: OLTP Through the Looking Glass, and What We Found There, In Proc.SIGMOD, 2008.

Synchronization in Buffer Management Module

Several empirical studies have revealed that the largest bottleneck is synchronization in the buffer management module.

[1] Ryan Johnson, Ippokratis Pandis, Anastassia Ailamaki: "Critical Sections: Re-emerging Scalability Concerns for Database Storage Engines", In Proc. DaMoN, 2008.
[2] Stavros Harizopoulos, Daniel J. Abadi, Samuel Madden, and Michael Stonebraker: "OLTP Through the Looking Glass, and What We Found There", In Proc. SIGMOD, 2008.

[Figure: the buffer manager sits between the CPU/memory and the database files on disk; it reduces disk access by caching database pages]

[Figure: a page request is served by (1) looking up the hash table – hits are returned from the buffer, while misses invoke (2) the page replacement algorithm and a read from the database files]

Naive buffer management schemes

[Figure: two buffer manager layouts side by side – PostgreSQL 8.0 (a single lookup hash table plus an LRU page replacement list) and PostgreSQL 8.1 (the lookup hash table split into buckets with striped locks, still in front of an LRU list)]

- PostgreSQL 8.0: one giant lock over the whole lookup structure ("Giant lock sucks!") – did not scale at all.
- PostgreSQL 8.1: striped the lock into hash buckets (see the sketch below), but the LRU list always needs to be locked when it is accessed – scales up to 8 processors.
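A minimal sketch of the lock-striping idea used by PostgreSQL 8.1 (an illustration only, not the PostgreSQL source; names such as stripe_of and lookup are made up here): one mutex per hash bucket replaces the single giant lock, so lookups of pages that hash to different buckets proceed in parallel.

    /* Lock striping: one mutex per bucket instead of one giant lock. */
    #include <pthread.h>
    #include <stddef.h>

    #define N_STRIPES 16

    typedef struct entry {
        int page_id;
        void *frame;
        struct entry *next;
    } entry_t;

    typedef struct {
        pthread_mutex_t locks[N_STRIPES];   /* one lock per bucket */
        entry_t *buckets[N_STRIPES];
    } striped_table_t;

    static size_t stripe_of(int page_id) { return (size_t)page_id % N_STRIPES; }

    /* Look up a page: only its own bucket is locked. */
    void *lookup(striped_table_t *t, int page_id) {
        size_t s = stripe_of(page_id);
        void *frame = NULL;
        pthread_mutex_lock(&t->locks[s]);
        for (entry_t *e = t->buckets[s]; e != NULL; e = e->next) {
            if (e->page_id == page_id) { frame = e->frame; break; }
        }
        pthread_mutex_unlock(&t->locks[s]);
        return frame;
    }

The LRU list, however, remains a single shared structure, which is why 8.1 still stops scaling at around 8 processors.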

Less naive buffer management schemes

[Figure: PostgreSQL 8.1 (striped hash buckets + LRU replacement) versus PostgreSQL 8.2 (striped hash buckets + CLOCK replacement)]

- PostgreSQL 8.1: the LRU list always needs to be locked when it is accessed – scales up to 8 processors.
- PostgreSQL 8.2: CLOCK does not require a lock when an entry is touched (see the sketch below) – scales up to 16 processors.
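Why CLOCK needs no lock on a buffer hit: touching a frame only bumps its per-frame usage counter, which is a single atomic update, whereas LRU must unlink and relink a list node under the list lock. A minimal C11 sketch (assumptions: a hypothetical frame_t whose usage counter saturates at MAX_USAGE, as in a GCLOCK weight):

    #include <stdatomic.h>

    #define MAX_USAGE 5            /* hypothetical cap on the usage weight */

    typedef struct {
        atomic_int usage;          /* inspected and decremented by the clock hand */
        /* ... page id, frame contents, dirty flag, ... */
    } frame_t;

    /* Called on a buffer hit: bump the usage counter, saturating at MAX_USAGE.
     * No list manipulation and no lock - just one atomic read-modify-write. */
    static void touch_on_hit(frame_t *f) {
        int u = atomic_load_explicit(&f->usage, memory_order_relaxed);
        while (u < MAX_USAGE &&
               !atomic_compare_exchange_weak_explicit(&f->usage, &u, u + 1,
                                                      memory_order_relaxed,
                                                      memory_order_relaxed))
            ;                      /* u is refreshed on failure; retry */
    }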

Outline

• Background

• Our approach

– Non-Blocking Synchronization

– Nb-GCLOCK

• Experimental Evaluation

• Related Work

• Conclusion


Core idea of our approach

[Figure: previous approaches versus our optimistic approach – in both, the buffer manager serves page requests between the CPU/memory and the database files on disk]

Previous approaches:
○ Reduce disk I/Os
× Locks are contended

Intuition: there are enough processors, yet the disk bandwidth is not utilized.

Our optimistic approach: reduce the lock granularity to one CPU instruction and remove the bottleneck.
△ # of I/Os slightly increases
○ No contention on locks

Major Difference to Previous Approaches

(Recap – previous approaches: ○ reduce disk I/Os, × locks are contended. Our optimistic approach: △ # of I/Os slightly increases, ○ no contention on locks.)

Their goal is to improve buffer hit rates in order to reduce I/Os. This has been the sole goal for many decades – but is it still valid in the many-core era? There are also SSDs now.

Our goal is to improve throughput by utilizing (many) CPUs. Use non-blocking synchronization instead of acquiring locks!

What’s non-blocking and lock-free?

Formally: stopping one thread will not prevent global progress; individual threads make progress without waiting.

Less formally: no thread 'locks' any resource – no 'critical sections', locks, mutexes, spin-locks, etc.

Lock-free: every successful step makes global progress and completes within finite time (ensuring liveness).

Wait-free: every step makes global progress and completes within finite time (ensuring fairness).

Non-blocking synchronization

A synchronization method that does not acquire any lock, enabling concurrent accesses to shared resources.

- Utilize atomic CPU primitives: CAS (compare-and-swap), cmpxchg on x86
- Utilize memory barriers

Blocking:

    acquire_lock(lock);
    counter++;
    release_lock(lock);

Non-blocking:

    int old;
    do {
        old = *counter;
    } while (!CAS(counter, old, old + 1));

counter is incremented only if its current value is still equal to old; otherwise the CAS fails and the loop retries.
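A self-contained C11 rendering of the non-blocking counter above (a sketch; the slide's CAS() maps onto atomic_compare_exchange, which on x86 compiles down to cmpxchg):

    #include <stdatomic.h>

    static atomic_int counter;     /* statically zero-initialized */

    void increment(void) {
        int old = atomic_load(&counter);
        /* Retry until no other thread changed counter between our load and the CAS.
         * On failure, atomic_compare_exchange_weak reloads old with the current value. */
        while (!atomic_compare_exchange_weak(&counter, &old, old + 1))
            ;
    }

No thread ever blocks: a CAS can only fail because some other thread's increment succeeded, so the system as a whole makes progress – exactly the lock-free property defined on the previous slide.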

Making the buffer manager non-blocking

[Figure: striped hash buckets in front of the GCLOCK page replacement algorithm; in the baseline, a miss reads the page from the database files under "lock; lseek; read; unlock"]

1. Utilized an existing lock-free hash table.
2. Removed the locks on cache misses (fig. 6 in the paper).
3. Need to keep consistency between the lookup hash table and GCLOCK (the right half of fig. 3): immediately after the page allocation of a buffer frame changes, the reference in the buffer lookup table may still carry a different page identifier.
4. Avoided locks on I/Os by utilizing pread, CAS, and memory barriers (fig. 5); a sketch of this idea follows below.
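A hedged sketch of the "optimistic I/O" idea in point 4 (an illustration of the technique, not the paper's exact protocol; page_slot_t and read_page_optimistic are hypothetical names): pread() takes an explicit offset, so no lock is needed around lseek+read, and the loaded frame is published with a single CAS. A thread that loses the race simply discards its copy – which is why the number of I/Os can increase slightly while lock contention disappears.

    #include <stdatomic.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define PAGE_SIZE 8192

    typedef struct {
        _Atomic(void *) frame;     /* NULL until some thread has loaded the page */
    } page_slot_t;

    void *read_page_optimistic(page_slot_t *slot, int fd, off_t offset) {
        void *cur = atomic_load(&slot->frame);
        if (cur != NULL)
            return cur;                        /* already loaded by someone else */

        void *buf = malloc(PAGE_SIZE);
        if (buf == NULL || pread(fd, buf, PAGE_SIZE, offset) != PAGE_SIZE) {
            free(buf);
            return NULL;                       /* error handling elided */
        }

        void *expected = NULL;
        if (atomic_compare_exchange_strong(&slot->frame, &expected, buf))
            return buf;                        /* we published the frame */

        free(buf);                             /* lost the race: a duplicate read */
        return expected;                       /* the winner's frame */
    }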

State Machine-based Reasoning for selecting a replacement victim

Construct the algorithm from many small 'steps' – build a state machine to ensure global progress.

[Figure: state machine for victim selection. States: Select a frame, Check whether Evicted, Check whether Pinned, Try to decrement the refcount, Try to evict, Fix in pool. Entry actions ("E:"): move the clock hand, try next entry, decrement the refcount, CAS value, evict. Transitions: null / !null, evicted / !evicted, pinned / !pinned, --refcount > 0 / --refcount <= 0, swapped / !swapped, continue, success.]

Reading the machine from left to right:
- Start finding a replacement victim (select a frame).
- Decrement the weight count of the buffer page under the clock hand.
- If the weight is not yet exhausted, advance the CLOCK hand and check the next candidate.
- Otherwise, try to evict and return the frame as the replacement victim.

Two threads (Thread A and Thread B) can run the machine concurrently, so a candidate may be intercepted ("Oops! Candidate is intercepted."): another thread evicts the frame first, and the thread that lost the CAS simply continues scanning. Every transition is a bounded atomic step, so a stalled thread never blocks the others.
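A condensed C11 sketch of this state machine (illustrative only; field names such as refcount, pinned and evicted follow the diagram, and select_victim is a hypothetical signature). Each transition is one atomic step, so a thread that is preempted between steps cannot prevent other threads from finding a victim.

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct {
        atomic_int  refcount;   /* GCLOCK weight, decremented by the clock hand */
        atomic_bool pinned;     /* frame currently fixed by some thread */
        atomic_bool evicted;    /* frame already claimed as a victim */
    } frame_t;

    frame_t *select_victim(frame_t *frames, int nframes, atomic_uint *clock_hand) {
        for (;;) {
            /* E: move the clock hand (atomic increment, wrap around). */
            unsigned pos = atomic_fetch_add(clock_hand, 1) % (unsigned)nframes;
            frame_t *f = &frames[pos];

            if (atomic_load(&f->evicted) || atomic_load(&f->pinned))
                continue;                                  /* E: try next entry */

            /* E: decrement the refcount with a CAS loop. */
            int rc = atomic_load(&f->refcount);
            while (rc > 0 &&
                   !atomic_compare_exchange_weak(&f->refcount, &rc, rc - 1))
                ;                                          /* rc refreshed on failure */
            if (rc > 1)
                continue;                                  /* --refcount > 0: keep scanning */

            /* E: try to evict; only one thread can flip evicted from false to true. */
            bool expected = false;
            if (atomic_compare_exchange_strong(&f->evicted, &expected, true))
                return f;                                  /* we own the victim */
            /* Candidate was intercepted by another thread ("Oops!"); keep scanning. */
        }
    }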

Outline

• Background

• Our approach

– Non-Blocking Synchronization

– Nb-GCLOCK

• Experimental Evaluation

• Related Work

• Conclusion


Experimental settings

- Workload: Zipf 80/20 distribution (a famous power law), containing 20% sequential scans; the dataset is 32GB in total.
- Machine: UltraSPARC T2, 64 processors.

We also performed evaluations on various x86 settings in the paper.
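For concreteness, one way to drive such a workload (a sketch under the assumption that "80/20" means 80% of requests hit the hottest 20% of pages; the paper's generator and its sequential-scan component are not reproduced here):

    #include <stdlib.h>

    /* Return the next requested page id out of npages pages. */
    long next_page(long npages) {
        long hot = npages / 5;                        /* the hottest 20% of pages */
        if (rand() < RAND_MAX / 5 * 4)                /* ~80% of requests */
            return rand() % hot;
        return hot + rand() % (npages - hot);         /* the remaining pages */
    }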

Performance comparison on moderate I/Os (fig. 9)

[Figure: throughput (normalized by LRU, 0–6) versus number of processors (8, 16, 32, 64) for LRU, GCLOCK and Nb-GCLOCK]

CPU utilization – previous approaches: low, about 20%; Nb-GCLOCK: high, more than 95%.

An even larger difference in CPU time can be expected as the number of CPUs increases ➜ we expect higher throughput.

Maximum throughput to processors

Scalability to processors when all pages are resident in memory, intended to show the scalability limit of each algorithm.

Throughput (operations/sec; plotted on a log scale) versus processors (cores):

    Processors (cores)    8 (1)       16 (2)      32 (4)       64 (8)
    2Q                      890,992     819,975     866,009      662,782
    GCLOCK                1,758,605   1,912,000   1,931,268    1,817,748
    Nb-GCLOCK             3,409,819   7,331,722  14,245,524   25,834,449

Nb-GCLOCK achieved almost linear scalability, at least up to 64 processors! This is the first attempt that removed locks from buffer management.

It is also interesting that GCLOCK hits a CPU-scalability limit at around 16 processors – caching solutions built on GCLOCK have their scalability limit there.

Max throughput (operations/sec) evaluation

Workload is Zipf 80/20, evaluated on UltraSPARC T2 (64 procs). Accesses are issued from 64 threads for 60 seconds, so ideally 64 x 60 = 3,840 seconds of CPU time can be used.

- Nb-GCLOCK: most of the CPU time is used, because Nb-GCLOCK is non-blocking.
- Previous schemes: only about 10-20% of the CPU time is used.

The gap in CPU utilization widens as the number of processors grows, because the locks cause ever more contention.

TPC-C evaluation using Apache Derby

[Figure: tpmC (transactions per minute, 800–1400) versus number of terminals (threads): 8, 16, 32, 64, 128 for the original Derby and for Nb-GCLOCK]

The original scheme of Derby (CLOCK) decreased in throughput as terminals were added; our scheme showed a better result.

With the contention in the buffer management module removed, the remaining bottleneck is the latch on the root page of the B+-tree ➜ a concurrent B+-tree would be required (see OLFIT).

Sang Kyun Cha et al., "Cache-Conscious Concurrency Control of Main-Memory Indexes on Shared-Memory Multiprocessor Systems", In Proc. VLDB, 2001.

Outline

• Background

• Our approach

– Non-Blocking Synchronization

– Nb-GCLOCK

• Experimental Evaluation

• Related Work

• Conclusion


Bp-Wrapper

Xiaoning Ding, Song Jiang, and Xiaodong Zhang: "BP-Wrapper: A System Framework Making Any Replacement Algorithms (Almost) Lock Contention Free", In Proc. ICDE, 2009.

[Figure: page requests first pass through an access-recording stage in front of the lookup hash buckets and an arbitrary page replacement algorithm; hits and misses then proceed to the database files as usual]

Bp-Wrapper eliminates lock contention on buffer hits by using a batching and prefetching technique – called lazy synchronization in the literature: it postpones the physical work (adjusting the buffer replacement list) and immediately returns the logical operation.

Pros:
- Works with any page replacement algorithm.

Cons:
- Does not increase the throughput of CLOCK variants, because CLOCK does not require locks on buffer hits.
- Cache misses involve replaying the batch; the longer lock-holding time causes more contention.
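A hedged sketch of the lazy-synchronization idea behind Bp-Wrapper (illustrative only; BATCH_SIZE, record_hit and replay_access are made-up names, not the paper's API): on a hit the access is merely appended to a thread-local batch, and the replacement list's lock is taken once per batch to replay all recorded accesses.

    #include <pthread.h>

    #define BATCH_SIZE 64

    static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Stand-in for adjusting the shared replacement list (LRU, 2Q, ...). */
    static void replay_access(int page_id) { (void)page_id; }

    typedef struct { int page_ids[BATCH_SIZE]; int n; } access_batch_t;
    static _Thread_local access_batch_t batch;     /* one batch per thread */

    void record_hit(int page_id) {
        batch.page_ids[batch.n++] = page_id;
        if (batch.n == BATCH_SIZE) {
            pthread_mutex_lock(&list_lock);        /* one acquisition per batch */
            for (int i = 0; i < batch.n; i++)
                replay_access(batch.page_ids[i]);
            pthread_mutex_unlock(&list_lock);
            batch.n = 0;
        }
    }

This amortizes the lock over BATCH_SIZE hits; as the slide notes, it does not help CLOCK variants, which already avoid the lock on hits.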

Conclusions

Proposed a lock-free variant of the GCLOCK page replacement algorithm, named Nb-GCLOCK.

- Almost linear scalability up to (at least) 64 processors, while existing locking-based schemes do not scale beyond 16 processors.
- The first attempt to introduce non-blocking synchronization into database buffer management; optimistic I/Os using pread, CAS and memory barriers.
- Linearizability and lock-freedom are proven in the paper. Lock-freedom guarantees a certain throughput: any active thread taking a bounded number of steps ensures global progress.
- This work is also useful for any caching solution that requires high throughput (e.g., C10K accesses).

Thank you for your attention!
