LOTS: A Software DSM Supporting Large Object Space
Benny Wang-Leung Cheung, Cho-Li Wang, and Francis Chi-Moon Lau
Department of Computer Science, The University of Hong Kong
September 2004

Presentation Outline
• Why LOTS? (Objectives)
• DSM Background and Related Work
• Design of LOTS
• Performance Testing and Results
• Conclusion and Future Work

The Problem in Current DSM
• Lack of shared object (memory) space
– Another major problem apart from performance
– Fixed address mapping in virtual memory
– Shared object space size < process space
  • TreadMarks: ~ min RAM size among all machines
  • JIAJIA V1.0: 128 MB
– 32-bit machines: max 4 GB shared space
– Unscalable: fixed regardless of # machines
– Large problems (with > 4 GB shared memory need) can't be run directly: the programmer needs to change the application code to reduce the memory utilization.

Objectives of LOTS
• Using 64-bit machines is not a total solution!
• 32-bit machines are dominating the market (poor man's clusters :<)
• Hence we introduce LOTS:
– Large Shared Object Space > 4 GB
– Dynamic run-time memory mapping technique
– Local disk as the backing store for temporarily unused objects
– Shared space size now limited by disk space
– Lazy disk read/write: reasonable

• Home-based (JIAJIA) vs Homeless (TreadMarks) vs Migrating-Home (JUMP)
• Write-update vs Write-invalidate
• Adaptive Protocol (DOSA, ADSM)
– The coherence protocol has to match the memory model for higher efficiency
• No DSM deals with Large Object Space!

Related Work
• Large object space support:
– Pointer swizzling
  • Artificial, invalid addresses are translated to machine-addressable form during access
  • Used in persistent stores (QuickStore, Thor-1)
[Figure: process space. Compiler-generated addresses cause page faults at runtime and are translated to valid ones; unused objects free their virtual addresses and are swapped out (i.e., swizzled out) to hard disk.]
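
The swizzling idea above can be sketched in a few lines of C++. This is only an illustration of the general technique, not code from QuickStore, Thor-1, or LOTS: the `Store` map stands in for a persistent store, and `SwizzledRef` and its members are hypothetical names. The reference starts out holding a store id, is translated to a real pointer on first access, and can later be swizzled out again.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <unordered_map>

// Hypothetical stand-in for a persistent store: object id -> stored value.
using Store = std::unordered_map<std::uint64_t, int>;

// A reference that holds a store id and is swizzled to a real
// in-memory pointer the first time it is dereferenced.
class SwizzledRef {
public:
    SwizzledRef(Store& store, std::uint64_t id) : store_(&store), id_(id) {}

    // Dereference: load the object from the store if it is not resident.
    int& operator*() {
        if (!resident_) resident_ = std::make_unique<int>(store_->at(id_));
        return *resident_;
    }

    // Swizzle out: write the value back and drop the in-memory copy.
    void swizzle_out() {
        if (!resident_) return;
        (*store_)[id_] = *resident_;
        resident_.reset();
    }

    bool resident() const { return resident_ != nullptr; }

private:
    Store* store_;
    std::uint64_t id_;
    std::unique_ptr<int> resident_;
};
```

Changes made through the reference reach the store only when `swizzle_out()` runs, which is the property that lets unused objects give up their addresses.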

Design of LOTS
• Dynamic Memory Mapping (DMM)
– Uses C++ operator overloading as the interface, keeping the syntax used in original C/C++
• Uses mmap() to get physical memory and map the shared object data into the process space.
– Free queues and used queues
– Small & large objects allocated separately
[Figure: the DMM area spans addresses 0x50000000 to 0x70000000 and is tracked with a free queue.]
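
A minimal sketch of this style of mapping, assuming a Linux/POSIX system, might look as follows. It is not the LOTS implementation: `DiskBackedInts`, `swap_out()` and the temporary-file backing store are hypothetical choices made for illustration, and the free/used queues and the fixed DMM address range are left out. The overloaded `operator[]` maps the object into the process space with `mmap()` on first access, and `swap_out()` releases the virtual addresses while the data stays on local disk.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical illustration only: an int array whose data lives in a
// file on local disk and is mapped into the process space on demand.
class DiskBackedInts {
public:
    explicit DiskBackedInts(std::size_t count)
        : bytes_(count * sizeof(int)), fd_(-1), data_(nullptr) {
        assert(count > 0);                // mmap() rejects zero-length maps
        char path[] = "/tmp/dmm_sketch_XXXXXX";
        fd_ = mkstemp(path);              // backing store on local disk
        if (fd_ < 0) throw std::runtime_error("mkstemp failed");
        unlink(path);                     // file disappears once fd_ is closed
        if (ftruncate(fd_, static_cast<off_t>(bytes_)) != 0) {
            close(fd_);
            throw std::runtime_error("ftruncate failed");
        }
    }
    ~DiskBackedInts() { swap_out(); close(fd_); }
    DiskBackedInts(const DiskBackedInts&) = delete;
    DiskBackedInts& operator=(const DiskBackedInts&) = delete;

    // Overloaded access: maps the object in lazily on first use.
    int& operator[](std::size_t i) {
        if (data_ == nullptr) swap_in();
        return data_[i];
    }

    // Flush to disk and release the virtual address range.
    void swap_out() {
        if (data_ == nullptr) return;
        msync(data_, bytes_, MS_SYNC);
        munmap(data_, bytes_);
        data_ = nullptr;
    }

    bool is_mapped() const { return data_ != nullptr; }

private:
    void swap_in() {
        void* p = mmap(nullptr, bytes_, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd_, 0);
        if (p == MAP_FAILED) throw std::runtime_error("mmap failed");
        data_ = static_cast<int*>(p);
    }

    std::size_t bytes_;
    int fd_;
    int* data_;
};
```

Because the mapping is `MAP_SHARED` over the file, a value written before `swap_out()` is read back unchanged after the next access maps the object in again.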
• Principle: To eliminate as much all-to-all data communication as possible

Mixed Coherence Protocol
• An Example:
[Figure: timeline of processes P0 to P3 acquiring and releasing locks L1 and L2 (Acq/Rel) while updating objects X and Y (x1, y1, x2) around a barrier. Labels mark the home of X and Y, the new home, and "Inv X, Y" messages; the legend distinguishes updates movement from home token movement.]
When the processes arrive at the barrier, the process that holds the token of the object will become the new home of that object, and other processes will send the updates to the home.
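
The barrier rule described above can be modelled with a small amount of bookkeeping. The sketch below is not LOTS code: `ObjectState`, `record_write` and `arrive_at_barrier` are hypothetical names, and the assumption that the most recent writer holds the token is made only to keep the example short.

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <vector>

// Hypothetical bookkeeping for one shared object.
struct ObjectState {
    int home;                 // process currently acting as home
    int token_holder;         // process holding the object's token
    std::set<int> writers;    // processes that wrote since the last barrier
};

// Record a write by process `pid`.
// Assumption for this sketch: the latest writer holds the token.
void record_write(ObjectState& obj, int pid) {
    obj.writers.insert(pid);
    obj.token_holder = pid;
}

// At the barrier the token holder becomes the new home, and every other
// writer sends its updates there. Returns (sender, receiver) pairs.
std::vector<std::pair<int, int>> arrive_at_barrier(ObjectState& obj) {
    std::vector<std::pair<int, int>> messages;
    obj.home = obj.token_holder;
    for (int writer : obj.writers) {
        if (writer != obj.home) messages.push_back({writer, obj.home});
    }
    obj.writers.clear();
    return messages;
}
```

With two writers, only one update message is needed at the barrier, since the new home already has its own changes.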

Making LOTS More Efficient
• Eliminating the Diff Accumulation Problem
– Lock and timestamp info in DSM control area
– Calculate diff on request, no redundancy
[Figure: update records for X1 to X8 laid out by time and length, comparing two cases.]
– With diff accumulation: all updates need to be sent (17 units of data + 8 units of control)
– With diffs calculated on request: only 7 units of data + 8 units of control data are sent
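
One way to picture "calculate diff on request" is to keep a timestamp per word and build the diff only when another process asks for it. The sketch below is an assumption-laden model, not the LOTS data layout: `TimestampedObject`, `diff_since` and the per-word timestamps are hypothetical. The point it shows is that repeated writes to the same word produce one diff entry instead of one per write.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical model: each word of an object carries the time of its
// most recent write, so older values never need to be transmitted.
class TimestampedObject {
public:
    explicit TimestampedObject(std::size_t words)
        : value_(words, 0), stamp_(words, 0), writes_logged_(0) {}

    void write(std::size_t index, int v, unsigned time) {
        value_[index] = v;
        stamp_[index] = time;
        ++writes_logged_;
    }

    // Diff computed on request: words written after `seen`,
    // each with its latest value only.
    std::vector<std::pair<std::size_t, int>> diff_since(unsigned seen) const {
        std::vector<std::pair<std::size_t, int>> diff;
        for (std::size_t i = 0; i < value_.size(); ++i) {
            if (stamp_[i] > seen) diff.push_back({i, value_[i]});
        }
        return diff;
    }

    // Number of writes an accumulating scheme would have recorded.
    std::size_t writes_logged() const { return writes_logged_; }

private:
    std::vector<int> value_;
    std::vector<unsigned> stamp_;
    std::size_t writes_logged_;
};
```

A requester that has already seen everything up to some time passes that time as `seen` and receives only the newer words.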

Other Components in LOTS
• C++ runtime library in Linux
• Minimal set of functions as interface
– Retains as much C++ syntax as possible to improve programmability
• Synchronization: Locks and Barriers
– Barriers: with/without memory effect
• Communication: Sockets with UDP/IP
• SIGIO handler for incoming messages

Performance Testing
• Two kinds of testing
1. Without invoking large object space support
• Compare performance with another DSM (JIAJIA V1.0, as both have a similar communication protocol)
• Report no. of messages and bytes sent
• Calculate large object space support overhead
• 16 Pentium IV 2 GHz machines with 100 Mbps Fast Ethernet connection, 128 MB memory, Linux Fedora
2. With large object space support
• Use an application with large memory demand
• Run on different platforms for analysis
• Expect disk read/write overhead to dominate

Test 1: Timing Performance
[Figure: execution time charts. x-axis: problem size; y-axis: execution time in seconds. Legend: LOTS = LOTS enabled; LOTS-x = LOTS disabled.]
• LOTS beat JIAJIA V1.0 in most applications
– Mixed protocol + "diff accumulation elimination" reduce data traffic
• Large object space support and access checking incur a considerable overhead
– About 5-15% of total execution time (application dependent)

Test 2: Large Object Space
• Using 4-node PC and server clusters
• Test program: simple matrix operations
• With a 120 GB (SCSI) hard disk in each machine, able to claim 117.77 GB of Shared Object Space
• Disk read and write time is closely related to the OS version.
Table columns: CPU (MHz), OS, RAM (MB), # Shared Objs (X), Per Obj Size (MB), Total Shared Obj Size (GB)

• LOTS succeeds in:
– Providing a large shared object space larger than the local process space during runtime
– Performing reasonably well by reducing data traffic through Scope Consistency, the mixed coherence protocol and the "diff accumulation elimination" technique
– Providing a programming interface similar to C++

Future Work
• A number of optimizations:
– Further increase shared object space: "the minimum hard disk space x number of processes / 2".
  • Recent progress: 64 GB (4 GB x 16) of shared objects can be allocated in 16 machines, each having a 9 GB hard disk.
– Coherence protocol adapting to network traffic and processor loading (e.g., avoid too many "homes" on a single machine)

Questions?
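
As a quick arithmetic check, the quoted expression can be evaluated for the 16-machine configuration mentioned in the recent-progress note. The symbols below are assumptions introduced for this check only: $S$ for the value of the quoted expression, $D_{\min}$ for the minimum hard disk space, and $P$ for the number of processes. The slide does not state how this value relates to the 64 GB already allocated, so the two figures are simply placed side by side.

```latex
S = \frac{D_{\min} \times P}{2}
  = \frac{9\,\text{GB} \times 16}{2}
  = 72\,\text{GB},
\qquad
4\,\text{GB} \times 16 = 64\,\text{GB}.
```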

Test 1: No. of Messages Sent
[Figure: chart. x-axis: no. of procs (p) = 2, 4, 8, 16; y-axis: % (0 to 90). Series: FL (n=2048), LU (n=1024), ME (n=8192), RX (n=8192), RB (n=2048).]
The percentage is obtained by dividing the number of messages sent in LOTS by that in JIAJIA for the same application.
Due to the mixed protocol, LOTS sends fewer messages through the network than JIAJIA.

Test 1: No. of Bytes Sent
[Figure: chart. x-axis: no. of procs (p) = 2, 4, 8, 16; y-axis: % (0 to 100). Series: FL (n=2048), LU (n=1024), ME (n=8192), RX (n=8192), RB (n=2048).]
The percentage is obtained by dividing the number of bytes sent in LOTS by that in JIAJIA for the same application.

Test 2: Large Object Space
• Allocate shared objects with total size > 4 GB, and another process accesses each of them once (array addition with p=4)

    int main(int argc, char **argv)
    {
        int i, j, pp, local[4];
        // 2D int array
        Pointer<Pointer<int> > a;
        lots_init();                  // init LOTS
        // shared memory allocation
        a.alloc(X);
        for (i = 0; i < X; i++)
            a[i].alloc(size);
        nm_barrier();                 // barrier
        for (j = 0; j < linec; j++) {
            pp = (dsmid + j) % linec;
            for (i = pp; i < X; i += 4) {
                acq(i);
                a[i][0] += rand();
                rel(i);
            }
        }                             // array addition