Theory of Memory
W. Paul, Saarland University and DFKI
bmb+f project Verisoft-XT
Joint work with Ulan Degebaev and Norbert Schirmer, Saarland University
Jan 14, 2016
Why might this be important?
• Unites theories of
  – store buffers
  – interlocking
  – caches
  – cache coherence
  – out of order execution
  – X64 instruction set
  – address translation
  – optimized compilation
  – structured parallel C semantics
• Explains why hypervisor might run structured parallel C
• VCC is supposed to mirror structured parallel C semantics
• thus VCC might be(come) sound
Specifying Memory
[Figure: memory M; address x holds the value M(x)]
Store Buffer
[Figure: memory M with a store buffer; labels w(i), r(j), sbuf(y)]
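The store buffer picture can be sketched as a small executable model. This is only an illustration under assumptions not stated on the slide: one FIFO buffer per processor, reads forwarded from the newest buffered write to the same address, and default memory content 0. The class and method names are hypothetical.

```python
from collections import deque


class StoreBufferedMemory:
    """Memory M fronted by one FIFO store buffer (hypothetical model)."""

    def __init__(self):
        self.M = {}          # main memory: address -> value
        self.sbuf = deque()  # pending writes, oldest first

    def write(self, addr, value):
        # A write enters the buffer; memory M is not updated yet.
        self.sbuf.append((addr, value))

    def read(self, addr):
        # Forward from the newest buffered write to addr, if there is one.
        for a, v in reversed(self.sbuf):
            if a == addr:
                return v
        return self.M.get(addr, 0)

    def drain_one(self):
        # Commit the oldest buffered write to memory M.
        if self.sbuf:
            a, v = self.sbuf.popleft()
            self.M[a] = v


m = StoreBufferedMemory()
m.write(0x10, 1)
m.write(0x10, 2)
local_view = m.read(0x10)       # forwarded from the buffer
memory_view = m.M.get(0x10, 0)  # what another processor would still see in M
m.drain_one()
m.drain_one()
```

Before the buffer drains, the writing processor and memory M disagree about address 0x10. This gap is what the later "sbufs invisible" argument has to close.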
Caches
[Figure: memory M and cache ca]
Many Caches: Snooping
[Figure: memory M and caches ca(1), ..., ca(p)]
Many Caches
[Figure: memory M and caches ca(1), ..., ca(p); address x with components x.la and x.off]
Many Caches
[Figure: memory M and caches ca(1), ..., ca(p); address component x.off]
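A minimal sketch of the address components x.la and x.off, assuming they denote the line address and the offset within a cache line. The line size of 64 bytes and the function names are assumptions; the slides state neither.

```python
LINE_BYTES = 64  # assumed cache line size, not given on the slides


def split(x):
    """Return (la, off): line address and offset of byte address x."""
    return x // LINE_BYTES, x % LINE_BYTES


def join(la, off):
    """Inverse of split: rebuild the byte address."""
    return la * LINE_BYTES + off


la, off = split(0x12345)
```

Caches and the coherence protocol operate on line addresses, while the offset only selects bytes inside a line.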
Overlapping Transactions
[Figure: overlapping transactions; labels public (a), a, b, c]
Sequentially Consistent Memory (Lemma 5)
[Figure: overlapping transactions; labels public (a), a, b, c]
Tomasulo Schedulers for OOO
[Figure: IF, issue, reservation stations, functional units, CDB, ROB, WB]
Two Memory Units
[Figure: MMU, ROB, functional units, CDB, LS, RS, RS, sbuf, m]
Single Processor OOO correctness (Lemma 6)
[Figure: MMU, ROB, functional units, CDB, LS, RS, RS, sbuf, m]
Multi Processor OOO implementation
[Figure: MMU, functional units, CDB, LS, RS, RS, sbuf, m, ROB, data(i,j)]
Multi Processor OOO correctness (Lemma 7)
[Figure: MMU, functional units, CDB, LS, RS, RS, sbuf, m, ROB, data(i,j)]
X64 architecture
• CPU core
  – R: user registers
  – SR: system registers
    • CR3
  – acc: access
  – segmentation
• mmu: memory management unit
  – tlb: translation look aside buffer
• memory system
  – mm: main memory
  – ca: cache
  – sbuf: store buffer
[Figure: core with R, CR3, acc and segmentation; mmu with tlb; sbuf; ca; mm]
Segmentation off (Lemma 8)
• 1 segment
• large as entire address space
• segmentation invisible
[Figure: core with R, CR3, acc and segmentation; mmu with tlb; sbuf; ca; mm]
Bad news: cache state is visible
• CPU core
  – acc: access
    • acc.adr: address
    • acc.r: rights (user, write, exe)
    • acc.data
    • acc.mmode: memory mode
      – WB: write back
      – WT: write through ...
      – NC: no cache
[Figure: core with R, CR3 and acc; mmu with tlb; sbuf; ca; mm or devices]
Good News: no device, no NC mode
• acc.mmode: memory mode
  – WB: write back
  – WT: write through ...
  – NC: no cache (not used)
[Figure: core with R, CR3 and acc; mmu with tlb; sbuf; ca; mm]
Sequentially Consistent Physical Memory (Lemma 9)
• acc.mmode: memory mode
  – WB: write back
  – WT: write through ...
  mix on same address
• PM: sequentially consistent physical memory abstraction
  – Proof: MOESI invariants are maintained
[Figure: core with R, CR3 and acc; mmu with tlb; sbuf; PM]
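As an illustration of what "MOESI invariants" can mean, the sketch below checks the usual textbook state-compatibility conditions for a single cache line across all caches. It is not the invariant set of the proof referred to above, which may be stronger (for example, it may also constrain the data). The function name is hypothetical.

```python
def moesi_ok(states):
    """Check state compatibility for one line.

    states: one of 'M', 'O', 'E', 'S', 'I' per cache.
    """
    exclusive = sum(s in ("M", "E") for s in states)  # modified or exclusive copies
    owners = sum(s == "O" for s in states)            # owned copies
    valid = sum(s != "I" for s in states)             # all non-invalid copies
    if exclusive > 1 or owners > 1:
        return False
    if exclusive == 1 and valid != 1:
        # An M or E copy must be the only valid copy.
        return False
    return True


examples = {
    "owner_with_sharers": moesi_ok(["O", "S", "S", "I"]),
    "modified_with_sharer": moesi_ok(["M", "S"]),
    "two_exclusive": moesi_ok(["E", "E"]),
}
```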
Initialize page tables
• 1 processor
  – sbuf invisible
• operating mode: paging disabled
  – mmu invisible
• set up page table tree in PM
[Figure: core with R, CR3 and acc; mmu with tlb; sbuf; PM containing the page tables]
Translated Linear Memory
• many processors
• operating mode: paging enabled
• keep tlb consistent
[Figure: core with R, CR3 and acc; mmu with tlb; sbuf; PM containing the page tables]
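The translation from linear to physical addresses through the page table tree rooted at CR3 can be sketched as follows. This is a simplified model under stated assumptions: 4-level x64 paging with 4 KiB pages only, physical memory modelled as a Python dict from entry address to 64-bit entry, and no rights checks, large pages or tlb. The names walk, pm and the example tree are hypothetical.

```python
PRESENT = 1                        # bit 0 of a page table entry
ADDR_MASK = 0x000F_FFFF_FFFF_F000  # bits 12..51: physical base address


def walk(pm, cr3, la):
    """Translate linear address la; return None on a missing entry."""
    base = cr3 & ADDR_MASK
    for level in (3, 2, 1, 0):
        index = (la >> (12 + 9 * level)) & 0x1FF  # 9 index bits per level
        entry = pm.get(base + 8 * index, 0)       # entries are 8 bytes wide
        if not entry & PRESENT:
            return None                           # would raise a page fault
        base = entry & ADDR_MASK
    return base | (la & 0xFFF)                    # append the 12-bit page offset


# A tiny tree that maps the linear page 0x400000 to the physical page 0x9000.
pm = {
    0x1000 + 8 * 0: 0x2000 | PRESENT,
    0x2000 + 8 * 0: 0x3000 | PRESENT,
    0x3000 + 8 * 2: 0x4000 | PRESENT,
    0x4000 + 8 * 0: 0x9000 | PRESENT,
}
pa = walk(pm, 0x1000, 0x400123)
unmapped = walk(pm, 0x1000, 0x800000)
```

A tlb caches the results of such walks, which is why changes to the page tables require keeping the tlb consistent.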
Translated Consistent Linear Memory + sbufs (Lemma 10)
• many processors
• operating mode: paging enabled
• keep tlb consistent
[Figure: core with R, CR3 and acc; sbuf; LM containing the page tables]
C0: Pascal with C syntax
configurations
• c = (pr, rd, lms, hm, gm)
  – pr: program rest
  – rd: recursion depth
  – lms: [0 : recursion depth] → {local memories}
  – hm: heap memory
  – gm: global memory
• subvariables
  – (m,i)[17].gpr[3]
• value of pointers: subvariables!
[Figure: memory m with variable (m,i); base address ba(m,i), size size(m,i), value va(c,(m,i))]
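The configuration c = (pr, rd, lms, hm, gm) and the idea that pointer values are subvariables can be sketched as below. The encoding is an assumption made for illustration: memories are Python dicts, a subvariable is a triple (memory name, variable index, selector path), and an integer memory name k selects the local memory lms[k]. Only the component names and va come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Config:
    pr: list                                         # program rest
    rd: int = 0                                      # recursion depth
    lms: list = field(default_factory=lambda: [{}])  # local memories lms[0..rd]
    hm: dict = field(default_factory=dict)           # heap memory
    gm: dict = field(default_factory=dict)           # global memory


def memory_of(c, m):
    """Select a memory by name: 'gm', 'hm', or frame number k for lms[k]."""
    if m == "gm":
        return c.gm
    if m == "hm":
        return c.hm
    return c.lms[m]


def va(c, subvar):
    """Value of subvariable (m, i, selectors) in configuration c."""
    m, i, selectors = subvar
    value = memory_of(c, m)[i]
    for s in selectors:  # array indices and struct component names
        value = value[s]
    return value


procs = [{"gpr": [0, 0, 0, 0]} for _ in range(18)]
procs[17]["gpr"][3] = 42
c = Config(pr=[], gm={0: procs})
target = ("gm", 0, (17, "gpr", 3))  # the subvariable (gm,0)[17].gpr[3]
c.gm[1] = target                    # a pointer variable: its value is a subvariable
pointee_value = va(c, va(c, ("gm", 1, ())))
```

Dereferencing is then just a second application of va, with no reference to machine addresses.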
Parallel C
• c = (pr, rd, lms, hm, gm)
  – pr: program rest
  – rd: recursion depth
  – lms: [0 : recursion depth] → {local memories}
  – hm: heap memory
  – gm: global memory
• Share
  – gm
  – hm
• Interleave at small-step semantics steps
• Problem:
  – Processor interleaves instructions of compiled programs code(p)
[Figure: memory m with variable (m,i); base address ba(m,i), size size(m,i), value va(c,(m,i))]
simulation relation consis(c, alloc, d)
[Figure: variables p and y allocated in LM at alloc(c,p) and alloc(c,y)]
Non-optimizing compiler: step-by-step simulation
Optimizing compiler: simulation between IO-steps
IO-steps (1): volatile accesses
Volatiles Sequentially Consistent (Lemma 11)
Structured Parallel C
• Implement Locks using Volatiles
• IO-steps (2): lock release
• Run Processors alone on locked portions of linear memory
• Lemma 1: sbufs invisible
• Lemma 10: Ordinary C code in linear memory
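Why locking lets each processor run alone on its portion of memory can be illustrated by enumerating all interleavings of a tiny program. This is a sketch under assumptions: two threads, a sequentially consistent shared store, and a lock acquire modelled as a single atomic step standing in for a volatile access. All names are hypothetical.

```python
def acquire(sh, loc):
    if sh["lock"]:
        return False  # blocked: the step is not enabled
    sh["lock"] = True
    return True


def release(sh, loc):
    sh["lock"] = False
    return True


def load(sh, loc):
    loc["r"] = sh["x"]  # read the shared variable into a local register
    return True


def store_inc(sh, loc):
    sh["x"] = loc["r"] + 1  # write back the incremented local copy
    return True


def outcomes(program, nthreads=2):
    """Final values of x over all interleavings of nthreads copies of program."""
    results = set()

    def explore(sh, locs, pcs):
        if all(pc == len(program) for pc in pcs):
            results.add(sh["x"])
            return
        for t in range(nthreads):
            if pcs[t] == len(program):
                continue
            sh2 = dict(sh)
            locs2 = [dict(l) for l in locs]
            if program[pcs[t]](sh2, locs2[t]):  # execute one small step of thread t
                pcs2 = list(pcs)
                pcs2[t] += 1
                explore(sh2, locs2, pcs2)

    explore({"x": 0, "lock": False}, [{} for _ in range(nthreads)], [0] * nthreads)
    return results


unlocked = outcomes([load, store_inc])
locked = outcomes([acquire, load, store_inc, release])
```

Without the lock, the final value of x depends on the schedule. With the lock, only the order of the locked portions varies, so interleaving needs to be considered only at the IO-steps.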
Summary
• Implement Locks using Volatiles
• IO-steps (2): lock release
• Run Processors alone on locked portions of linear memory
• Lemma 1: sbufs invisible
• Lemma 10: Ordinary C code in linear memory
• Outlined correctness proof for implementation of structured parallel C
  – Initialisation
  – compilation