
MN-Mate: Resource Management of Manycores with DRAM and Nonvolatile Memories

Kyu Ho Park, Youngwoo Park, Woomin Hwang, and Ki-Woong Park
Computer Engineering Research Laboratory

Korea Advanced Institute of Science and Technology {kpark, ywpark, wmhwang, woongbak}@core.kaist.ac.kr

Abstract

The advent of the manycore era breaks the performance wall, but it also brings severe energy consumption. Using NVRAM as main memory is a promising way to reduce the energy consumed by large DRAM capacities. In this paper, we propose MN-MATE, a novel architecture and a set of management techniques for allocating a large number of cores and a large main memory composed of DRAM and NVRAM. In MN-MATE, a hypervisor dynamically partitions and allocates cores and memory to guest OSes. Properly matching heterogeneous cores, DRAM, and NVRAM clearly enhances system performance, and selectively placing data in a main memory composed of DRAM and NVRAM significantly reduces energy consumption. Preliminary results show that integrating dynamic resource partitioning with the selective memory allocation scheme in MN-MATE reduces energy usage significantly while suppressing the performance loss caused by NVRAM's characteristics.

1. Introduction

Since the performance of a single CPU chip has nearly hit the instruction-level-parallelism wall and the power wall, manycore chips have become the main route to further performance gains. There have been continuous efforts to integrate many CPUs on a single chip, and Intel has announced an 80-core chip [15]. The number of integrated cores will keep increasing and is expected to reach 1000 within 10 years. To exploit manycores, there has been a great deal of research on scalable mechanisms such as symmetric multiprocessors and NUMA systems. Tessellation [16] is an OS structured around space-time partitioning and two-level scheduling between the global and partition runtimes. The factored operating system [17] is also under development on 16-core x86_64 hardware; its authors view an OS as a collection of Internet-inspired services and target single chips with more than 1000 cores. Corey [18] runs on 16-core AMD and Intel machines and is an exokernel-like [19] OS built on the principle that applications should control all shared resources. There are also the Barrelfish manycore OS [20], Larrabee for visual computing [21], Disco [22], and many more.

A typical architecture for running operating systems on a large-scale shared memory is to use a hypervisor, as shown in Fig. 1. The hypervisor dynamically manages CPUs and other resources such as memory modules, which consist of DRAM and nonvolatile memories such as phase-change RAM (PRAM) and NAND flash. The amount of DRAM required for a manycore system is so large that its energy consumption becomes a serious problem. For example, we have measured the power consumption of a cloud computing testbed at KAIST [30]: 160 Intel Xeon chips, equivalent to 640 cores, consume 4.5 kW, and 1 TB of DRAM consumes 2 kW. If we could replace the DRAM with an NVRAM such as PRAM, the energy consumption would drop enormously. Our motivation is therefore to combine manycores with nonvolatile memory efficiently so that the system obtains performance gains from the manycores and energy gains from the nonvolatile memory.

Figure 1. Overall Architecture of MN-MATE



Among several nonvolatile memories, we selected PRAM, whose characteristics are shown in Table I, for our experiments because it is available now. The access time of PRAM is 5-10 times longer than that of DRAM, but its idle power is very small compared with DRAM. Our goal is to manage cores, DRAM, and nonvolatile memory such as PRAM dynamically to save energy with minimal degradation of system performance. This is being implemented in our hypervisor, named MN-MATE, an abbreviation of Manycore and Nonvolatile memory MAnagemenT for Energy efficiency. The overall architecture of MN-MATE is shown in Fig. 1. The goal of MN-MATE is to run guest OSes with minimal effort on a platform consisting of manycores, DRAM, and nonvolatile memories; the cores and memories must therefore be managed dynamically. The hypervisor consists of a performance monitor, a resource partitioner, memory/core/storage virtualization modules, and an I/O resource management module. The implementation issues are described in detail in the following sections. Besides the hardware platform, we are developing a platform simulator on which the hypervisor will run. MN-MATE will be integrated into the cloud computing testbed as the key primitive for boosting computation-intensive processes and pursuing a green computing environment.

The remainder of this paper is organized as follows. Section 2 describes the overall architecture of MN-MATE, and Section 3 presents the management of cores and memories with preliminary results. Section 4 presents virtual memory management for DRAM and PRAM with further preliminary results. Section 5 discusses related work, and Section 6 concludes.

2. Architecture of MN-MATE

2.1. Hardware Architecture

Fig. 2 shows the hardware architecture of MN-MATE. The MN-MATE hardware combines manycores with non-volatile memory, the cutting-edge technologies in current processors and memories. It places several tens of general-purpose, homogeneous cores in a processor chip, and large-scale non-volatile memory is used together with DRAM as main memory. In this study, we assume that MN-MATE is built on a hierarchical ring-based architecture [15] that allows traffic to avoid network congestion, as shown in Fig. 2.

With manycores, many applications on various operating systems can run simultaneously because of the increased core count. A much larger memory is then necessary to accommodate the application data. However, the density and energy consumption of conventional DRAM make it difficult to increase the size of main memory. In the MN-MATE architecture, next-generation non-volatile memory is therefore used alongside DRAM as a main-memory alternative. As shown in Fig. 2, the DRAM and the non-volatile memory are coupled by a memory controller and mapped into a single physical address space.

MN-MATE also uses a memory controller per core package to increase memory bandwidth and reduce energy consumption. The hybrid memory of DRAM and non-volatile memory is closely attached to a single core package, so the bandwidth between a core package and its nearby memory is maximized. Because many independent operating systems and applications are executed, relatively little memory is shared. If independent applications are scheduled onto different core packages, the MN-MATE architecture improves cache and memory locality and enables greater hardware scalability. In addition, all cores and memory in a package can be powered up or down as the workload fluctuates; in particular, the non-volatile memory consumes no energy at all in the idle state. We can therefore expect a reduction in overall energy consumption.

Table I. Properties of PRAM

Figure 2. Assumed Hardware Architecture of MN-MATE



2.2. Software Architecture

We incorporated a hypervisor and SMP-Linux into MN-MATE as the underlying base software layer to provide elastic resource partitioning and a way of efficiently and cost-effectively managing the resources of the manycore system. The software architecture of MN-MATE is designed to offer a user-friendly programming environment that supports user-level multithreaded coding in C/C++. Its four major components are as follows:

• pThread/OpenMP library: responsible for providing a well-organized parallel programming interface. Programmers can use these libraries to parallelize computation-intensive tasks through the library interfaces.

• MP-Compiler: produces a single fat binary consisting of parallelized executable code. The compiler translates OpenMP parallel routines into a series of calls to the SMP-Linux layer over the affiliation interface, which is responsible for allocating tasks appropriately for parallel execution. Each computation-intensive assembly block is replaced with a call into an affiliation routine that locates the corresponding binary code in the fat binary.

• SMP-Linux: a Linux version that works on SMP (Symmetric Multi-Processor) machines. SMP support was introduced with kernel version 2.0 and has improved steadily ever since. Among the sub-components of SMP-Linux, the affiliation interface is responsible for translating the programmer-specified OpenMP directives into primitives that create and manage shreds, which carry out parallel execution on the manycore system. The affiliation interface partitions the execution code for fork-join parallel thread execution: when such a construct is encountered, a number of threads are spawned to execute the dynamic extent of the parallel region. This team of threads, including the main thread that spawned them, participates in the parallel computation.

• VM & Hypervisor: like a conventional VM interface, the VM layer provides an abstraction that hides the details of managing the compiled, parallelized execution code from users of the OpenMP pragmas. Built on the MN-MATE hardware architecture, the VM and hypervisor layers are responsible for scheduling and dispatching the parallelized threads onto the manycores and the heterogeneous memory, based on the information passed from the guest OS through the resource-partitioning clue interface shown in Fig. 3. A small example of the programmer-visible code follows this list.
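
As a concrete illustration of the programmer-visible side of this stack, the sketch below shows a plain OpenMP parallel region in C of the kind the MP-Compiler would translate into a fat binary and dispatch through the affiliation interface. The affiliation interface itself is internal to the SMP-Linux and hypervisor layers and is not shown; the loop, array sizes, and reduction are illustrative assumptions, not code from MN-MATE.

```c
/* A minimal OpenMP example of the C code the MN-MATE stack targets.
 * The parallel region below is what the MP-Compiler would lower into
 * affiliation calls; only the programmer-visible side is shown. */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++) {      /* initialize input vectors */
        a[i] = i * 0.5;
        b[i] = i * 2.0;
    }

    /* Threads spawned for this region are dispatched by the hypervisor
     * onto a core partition of the manycore chip. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i] * b[i];

    printf("dot product = %f using up to %d threads\n",
           sum, omp_get_max_threads());
    return 0;
}
```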

2.3. Hypervisor

The hypervisor dynamically partitions the hardware resources among VMs and maximizes system-wide performance without significant changes to the guest OS. A number of CPU cores and multiple types of NVRAM are allocated to each VM so that the hypervisor achieves high system-wide resource utilization. To this end, the hypervisor consists of three main components: a performance estimator, a resource partitioner, and a core/memory virtualizer, described below.

• Performance Estimator: application performance is usually not proportional to the amount of allocated resources such as CPU cores and memory. Furthermore, a manycore processor and a main memory with NVRAM consist of heterogeneous components. To estimate each resource element's effect on performance while workloads execute, the estimator periodically monitors performance-related hardware events through on-core hardware Performance Monitoring Units (PMUs) and collects software-level information through a VM monitoring module. From the collected data, the estimator predicts the future performance under changed resource allocations and then determines the amount of memory to allocate to each VM.

• Resource Partitioner: the hypervisor allocates cores and memory to each VM based on the collected monitoring data. Using the measured performance data from the on-core PMUs and the software monitoring module, the hypervisor determines how many cores each VM requires, and likewise for all the NVRAM, including DRAM. If the workload changes over time, the cores and memory allocated to the VMs are re-balanced to maximize system-wide performance. The partitioning policy also allows static allocation of some resources to a specific VM or task for isolation purposes.

Figure 3. MN-Mate Software Architecture

• Core/Memory Virtualizer: the hardware resources allocated to each VM are virtualized so that the VM is not affected when physical resources move. As a result of resource partitioning and re-balancing, the allocated memory is neither contiguous nor of a homogeneous type; the allocated cores can also be heterogeneous and are shared among VMs over time. Furthermore, this configuration changes dynamically as re-balancing follows the workload. Such changes in the underlying hardware would otherwise force the guest operating system or its applications to reconfigure. Virtualizing the CPU cores and memory presented to the VM and its applications overcomes this limitation, because the guest OS and applications remain unaware that the allocated resources are being changed for system-wide performance optimization. A minimal sketch of the resulting monitor-estimate-rebalance cycle follows this list.
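
The following is a self-contained sketch of the monitor, estimate, and re-balance cycle that these three components implement together. The vm_stats fields and the pressure heuristic (swap usage plus a weighted cache-miss rate) are illustrative assumptions rather than the actual estimator; only the shape of the decision is intended: pick the most and least memory-pressured VMs and balloon a chunk of memory between them.

```c
/* Sketch of one hypervisor re-balancing step under assumed metrics. */
#include <stdio.h>

#define NVM 4

struct vm_stats {
    int    id;
    int    cores;           /* PCores currently assigned            */
    long   mem_mb;          /* DRAM + NVRAM currently assigned (MB) */
    double cache_miss_rate; /* from on-core PMUs                    */
    long   swap_mb;         /* from the software-level monitor      */
};

/* Crude benefit estimate: VMs that swap heavily or miss often are
 * assumed to gain the most from extra memory. */
static double mem_pressure(const struct vm_stats *v)
{
    return (double)v->swap_mb + 1000.0 * v->cache_miss_rate;
}

static void rebalance(struct vm_stats vm[], int n, long chunk_mb)
{
    int needy = 0, victim = 0;
    for (int i = 1; i < n; i++) {
        if (mem_pressure(&vm[i]) > mem_pressure(&vm[needy])) needy = i;
        if (mem_pressure(&vm[i]) < mem_pressure(&vm[victim])) victim = i;
    }
    if (needy == victim || vm[victim].mem_mb <= chunk_mb) return;
    vm[victim].mem_mb -= chunk_mb;  /* reclaim via ballooning       */
    vm[needy].mem_mb  += chunk_mb;  /* donate to the beneficiary VM */
    printf("moved %ld MB from VM%d to VM%d\n",
           chunk_mb, vm[victim].id, vm[needy].id);
}

int main(void)
{
    struct vm_stats vm[NVM] = {
        {0, 4, 512, 0.02,   0}, {1, 4, 512, 0.09, 128},
        {2, 2, 256, 0.01,   0}, {3, 2, 256, 0.15, 512},
    };
    rebalance(vm, NVM, 64);   /* one balancing step */
    return 0;
}
```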

3. Core and Memory Management

Although more physical resources generally benefit task performance, the application performance on a guest OS is not proportional to the number of cores and the memory size allocated to the VM. This tendency may come from each core's asymmetric capability and from applications' differing memory access locality. Things get more complicated when the main memory has varied characteristics such as asymmetric access times, volatility, and physical lifetime limitations. To enhance system-wide performance, the hypervisor estimates the resource requirements of all the VMs and searches for a proper matching of resource types according to the resources' characteristics. In this section, we present the basic design of our core management mechanism, the main-memory balancing architecture, and preliminary results showing our architecture's effectiveness.

3.1. Resource Monitoring

To evaluate how efficiently resources translate into VM and application performance, the hypervisor in MN-MATE collects software-level semantic information and hardware-level performance data. Fig. 4 shows the concept of the system monitoring. Hardware events such as cache misses, TLB1 misses, page faults, and interrupts are collected from the on-core performance monitoring units (PMUs). Software-level semantic information from each VM is gathered by a software-level PMU; it covers the state of monitored memory pages, the swap usage of each VM, task scheduling information, and the reference patterns of monitored page frames. To make this information useful, MN-MATE distinguishes each task at the hypervisor and classifies the collected performance data according to its source task, as sketched below. The performance estimation unit then analyzes the data and uses it as hints for resource partitioning and resource virtualization.
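
A minimal sketch of the per-task bookkeeping this implies is shown below. The event set, field names, and task-identification mechanism are assumptions for illustration, not the hypervisor's actual data structures.

```c
/* Sketch of per-task attribution of monitored events (assumed layout). */
#include <stdint.h>
#include <stdio.h>

enum hw_event { EV_CACHE_MISS, EV_TLB_MISS, EV_PAGE_FAULT, EV_INTERRUPT, EV_MAX };

struct task_profile {
    uint32_t vm_id;
    uint32_t task_id;         /* guest task distinguished at the hypervisor */
    uint64_t hw[EV_MAX];      /* counters sampled from the on-core PMUs     */
    uint64_t swap_pages;      /* software-level PMU: guest swap usage       */
    uint64_t pagecache_refs;  /* software-level PMU: page-cache references  */
};

/* Attribute one PMU sample to the task that was running when it fired. */
static void account_event(struct task_profile *p, enum hw_event ev, uint64_t n)
{
    p->hw[ev] += n;
}

int main(void)
{
    struct task_profile t = { .vm_id = 1, .task_id = 42 };
    account_event(&t, EV_TLB_MISS, 3);
    printf("VM%u task %u: %llu TLB misses\n",
           (unsigned)t.vm_id, (unsigned)t.task_id,
           (unsigned long long)t.hw[EV_TLB_MISS]);
    return 0;
}
```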

3.2. Core Management

A manycore processor has a number of heterogeneous or homogeneous CPU cores. Because application performance is not proportional to the number of CPU cores, a proper core partitioning method is required in the hypervisor to enhance system-wide core utilization. In MN-MATE, the hypervisor creates multiple core partitions for the VMs; a core partition associates virtual cores (VCores) with physical cores (PCores).

1 TLB (Translation Lookaside Buffer): a CPU cache that the memory management hardware uses to speed up virtual address translation.

Figure 4. Concept of system monitoring
Figure 5. Concept of core partitioning and balancing



Fig. 5 shows three example operations of core partitioning in the hypervisor. First, a service or an application can be dedicated to specific PCores inside a core partition. A VM initially starts with the maximum number of VCores, some of which are disabled for later use; the core partition associates each active VCore with an idle PCore. With this association operation, the hypervisor can statically execute kernel-level services or user-level applications on specific PCores, and it provides a programming interface, called the affiliation call, to support static allocation of PCores. Second, the hypervisor and the core partitions support synergistic thread grouping: a number of related threads are grouped together inside the hypervisor and allocated to the same core partition. This grouping, followed by a dedication operation, is performed when the administrator explicitly requests it or when the core partitioner implicitly does so. Third, core partitions can be re-balanced by relocating PCores to more suitable partitions on the basis of the resource-monitoring results. When the hypervisor moves a PCore from a victim core partition to a beneficiary core partition, the procedure is as follows: i) in the victim core partition, the mapping to the stolen PCore is redirected to another PCore in the same partition; alternatively, the hypervisor disables the corresponding VCore and moves the threads from the disabled VCore's run-queue to other active VCores. ii) In the beneficiary core partition, the hypervisor associates the newly received PCore with an existing active VCore; if a disabled VCore exists, the hypervisor can activate it and associate it with the inserted PCore.
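
The sketch below illustrates the third operation in a compact form: the victim's VCore that was backed by the stolen PCore is disabled, and a disabled VCore in the beneficiary is activated and backed by it. The array-based mapping and the function names are illustrative assumptions, not MN-MATE's actual structures.

```c
/* Sketch of VCore<->PCore association and one PCore re-balancing step. */
#include <stdio.h>

#define MAX_VCORES 8
#define UNMAPPED   (-1)

struct core_partition {
    int vm_id;
    int vcore_to_pcore[MAX_VCORES];  /* UNMAPPED = VCore disabled */
};

/* Move one PCore from a victim partition to a beneficiary partition. */
static void move_pcore(struct core_partition *victim,
                       struct core_partition *benef, int pcore)
{
    /* i) disable the victim VCore that used this PCore; its threads
     *    would migrate to the remaining active VCores */
    for (int v = 0; v < MAX_VCORES; v++)
        if (victim->vcore_to_pcore[v] == pcore)
            victim->vcore_to_pcore[v] = UNMAPPED;

    /* ii) activate a disabled VCore in the beneficiary and back it */
    for (int v = 0; v < MAX_VCORES; v++)
        if (benef->vcore_to_pcore[v] == UNMAPPED) {
            benef->vcore_to_pcore[v] = pcore;
            printf("PCore %d now backs VM%d VCore %d\n",
                   pcore, benef->vm_id, v);
            return;
        }
}

int main(void)
{
    struct core_partition a = { 0, { 0, 1, 2, 3,
        UNMAPPED, UNMAPPED, UNMAPPED, UNMAPPED } };
    struct core_partition b = { 1, { 4, 5, UNMAPPED, UNMAPPED,
        UNMAPPED, UNMAPPED, UNMAPPED, UNMAPPED } };
    move_pcore(&a, &b, 3);   /* re-balance one PCore from VM0 to VM1 */
    return 0;
}
```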

3.3. Memory Management

Conventionally, partitioning memory is more important than partitioning CPU cores and their time slices because of memory inertia and the higher penalty of a mis-decision in memory partitioning; when workloads fluctuate, this importance becomes critical. Unlike conventional DRAM-based main memory, NVRAM has different access latencies depending on the memory type and the operation type, which raises challenging issues for memory partitioning aimed at system-wide performance. First, the hypervisor must figure out which VM needs memory and which VM holds the least useful memory, and of which memory type. Second, when the hypervisor re-balances memory among VMs, it has to decide the memory type, the memory size, and the target VM for the transfer.

To solve these problems in MN-MATE, the hypervisor acquires the state of some memory pages of each VM and estimates the reclamation cost according to the memory type. As shown in Fig. 6, the hypervisor then monitors memory access patterns in software. With the data collected from the on-core PMUs and the software-level monitoring modules, the hypervisor determines the memory requirements of all the VMs on the basis of the memory access characteristics of their workloads. The hypervisor then estimates which memory type and size would be more valuable to another VM; this estimation also considers changing the memory type if that is better for system-wide performance. Finally, the actual balancing is carried out using the estimation result: the hypervisor transparently reclaims the selected page frames from the VM that has extra memory, called the victim guest OS, and donates the reclaimed memory to the recipient, called the beneficiary VM, through ballooning [4]. The hypervisor uses various criteria to determine the size and type of memory to transfer among VMs; for example, it can determine the transfer size by comparing the swap storage usage of the guest OSes. Consequently, every VM concerned ends up with a proper amount of memory of an adequate type.
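
The sketch below captures the shape of this type-aware decision under two illustrative rules that are our assumptions rather than the exact policy: the transfer size follows the guests' swap-usage gap, and the memory type follows the beneficiary's write intensity (write-heavy beneficiaries receive DRAM, read-mostly ones NVRAM).

```c
/* Sketch of a type-aware balancing decision under assumed rules. */
#include <stdio.h>

enum mem_type { MEM_DRAM, MEM_NVRAM };

struct guest_mem {
    int    id;
    long   swap_mb;      /* swap usage reported by the software monitor    */
    double write_ratio;  /* fraction of monitored accesses that are writes */
};

struct transfer { int from, to; long size_mb; enum mem_type type; };

static struct transfer plan_transfer(const struct guest_mem *victim,
                                     const struct guest_mem *benef)
{
    struct transfer t;
    t.from = victim->id;
    t.to   = benef->id;
    /* transfer size tracks the swap-usage disparity, in 16 MB steps */
    t.size_mb = ((benef->swap_mb - victim->swap_mb) / 16) * 16;
    if (t.size_mb < 16) t.size_mb = 16;
    /* write-heavy beneficiaries get DRAM; read-mostly ones can take
     * slower-to-write NVRAM with little performance loss */
    t.type = (benef->write_ratio > 0.3) ? MEM_DRAM : MEM_NVRAM;
    return t;
}

int main(void)
{
    struct guest_mem victim = { 1,  16, 0.10 };
    struct guest_mem benef  = { 2, 400, 0.45 };
    struct transfer t = plan_transfer(&victim, &benef);
    printf("balloon %ld MB of %s from VM%d to VM%d\n",
           t.size_mb, t.type == MEM_DRAM ? "DRAM" : "NVRAM", t.from, t.to);
    return 0;
}
```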

To realize this architecture, we performed a preliminary implementation with DRAM only [29]. Although [4], [5], [6], [7], [8], and [9] have already proposed memory balancing methods, we believe our balancing architecture is more efficient and that it will also be effective with NVRAM.

Figure 7. A preliminary architecture for DRAM balancing

Figure 6. Procedure for dynamic partitioning of Next Generation Memory



Fig. 7 illustrates our preliminary memory management architecture for MN-MATE. In this architecture, each guest OS passes on information about pages within its own page cache, which caches data stored in permanent storage. The information specifies which pages are inserted into, evicted from, and reused within the cache. In addition, memory accesses from the VMs to those pages are intercepted by the hypervisor and used to update its management data structures. With this information, the hypervisor classifies the monitored pages according to their reference pattern and maintains the relative age of each page frame within each class.

Based on the technique presented in [10], our pattern classification mechanism categorizes page frames into two types: sequential references and unclassified references. Two partitions accommodate the classified page frames separately, and each partition is managed according to the characteristics of the page frames that belong to it. In the partition that holds sequential references, a sequence consists of page frames that are sequentially referenced within a task-level file stream, and all sequences are arranged according to the MRU strategy. Within the partition that holds unclassified references, we use the LRU strategy. If a page frame in the sequential partition is referenced again, it is moved to the unclassified partition and the associated sequence is split into two new sequences of the same age.
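
A simplified sketch of the resulting victim ordering is given below: sequentially referenced frames are sacrificed first in MRU order, then unclassified frames in LRU order. Sequence bookkeeping, splitting, and ageing are omitted for brevity, and the array-backed layout is an assumption.

```c
/* Simplified two-partition victim ordering (MRU for sequential,
 * LRU for unclassified); data layout is an assumption. */
#include <stdio.h>

#define MAX_FRAMES 1024

struct partition {
    unsigned long frame[MAX_FRAMES]; /* page frame numbers, oldest first */
    int n;
};

/* MRU victim: the most recently appended sequentially referenced frame. */
static long pick_sequential_victim(struct partition *seq)
{
    return seq->n ? (long)seq->frame[--seq->n] : -1;
}

/* LRU victim: the oldest unclassified frame. */
static long pick_unclassified_victim(struct partition *unc)
{
    if (!unc->n) return -1;
    long victim = (long)unc->frame[0];
    for (int i = 1; i < unc->n; i++) unc->frame[i - 1] = unc->frame[i];
    unc->n--;
    return victim;
}

static long pick_victim(struct partition *seq, struct partition *unc)
{
    long v = pick_sequential_victim(seq);  /* sequential frames go first */
    return v >= 0 ? v : pick_unclassified_victim(unc);
}

int main(void)
{
    struct partition seq = { { 100, 101, 102 }, 3 };
    struct partition unc = { { 7, 42 }, 2 };
    printf("victim pfn = %ld\n", pick_victim(&seq, &unc));
    return 0;
}
```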

The overall procedure of a memory transfer is as follows. When a page is requested for memory balancing, the least valuable page frame is selected as the victim on the basis of its reference pattern. The page frame is transparently reclaimed shortly before the hypervisor schedules the beneficiary VM. Depending on the reason for the request, a different allocation mechanism is then applied to the beneficiary VM: if the reason is a paging-in event for a page previously reclaimed by the hypervisor, a victim page is allocated directly; if the request is the result of memory balancing, the hypervisor allocates the page to the guest OS through ballooning. Later, if more memory has been stolen from a victim VM than a threshold level, the hypervisor requests explicit memory borrowing from that VM. This borrowing reduces the disparity between the memory size that the victim guest OS believes it has and the memory size actually allocated to it.

In our page frame management, we concentrate on clean page frames that belong to a page cache because of the volatility of clean pages. Guest operating systems such as Linux generally try to use any available memory for their own caching purposes; as a result, only a small amount of memory is left free, and the rest holds content that is also stored in permanent storage. Because cached non-dirty content can be rebuilt by re-reading it from its storage location, the guest can tolerate the loss of the page content. Thus, when the hypervisor reclaims those pages, there is no need to swap their content out to a hypervisor swap device, and dual swapping is avoided. Hence, we restrict our monitoring to page frames used as a page cache and choose clean pages as victims. This approach requires no swap storage area for the hypervisor, no additional management cost, and no data-flush overhead.
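
The eligibility test this implies is small enough to state explicitly; the flag layout in the sketch below is an illustrative assumption.

```c
/* Sketch of the victim eligibility test: only clean page-cache frames
 * are reclaimed, so the hypervisor never needs its own swap area. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PG_PAGECACHE (1u << 0)   /* frame caches file data from storage */
#define PG_DIRTY     (1u << 1)   /* frame differs from its on-disk copy */

struct pframe { uint64_t pfn; uint32_t flags; };

/* Eligible if its content can be rebuilt by re-reading storage. */
static bool eligible_victim(const struct pframe *f)
{
    return (f->flags & PG_PAGECACHE) && !(f->flags & PG_DIRTY);
}

int main(void)
{
    struct pframe clean = { 1234, PG_PAGECACHE };
    struct pframe dirty = { 1235, PG_PAGECACHE | PG_DIRTY };
    printf("pfn %llu eligible: %d\n",
           (unsigned long long)clean.pfn, eligible_victim(&clean));
    printf("pfn %llu eligible: %d\n",
           (unsigned long long)dirty.pfn, eligible_victim(&dirty));
    return 0;
}
```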

Table II. Runtime Monitoring Overhead

3.4. Preliminary Evaluation of Memory Management

Our scheme is implemented on an Intel server with 4 cores (two Intel Xeon 3.40 GHz dual-core processors) and 8 GB of DRAM. By default, every VM is initially allocated 256 MB of memory and configured with 512 MB as its highest possible allocation. The beneficiary VM is scheduled first to minimize the time needed to present page frames whenever the frames requested for transfer are ready. For the dynamic partitioning, i.e., ballooning, we also change the scheduling order of the victim VM with an interrupt so that the requested page frames are relinquished immediately, thereby minimizing the delay caused by VM scheduling.

1) Overhead of access monitoring: the runtime overhead of our scheme comes mainly from the intentionally induced page faults on page frames in the page cache, the detection of reference patterns, and the reloading of data from permanent storage. To evaluate this overhead, we measure the execution time of the Tiobench (version 0.3.3) [11] benchmark with our scheme turned on but without any DRAM reallocation.



As shown in Table II, the average monitoring overhead is negligible.

2) Effect of the memory transfer speed: to evaluate how memory transfer speed affects application performance, we first assign a memory allocation job to one VM while the other three VMs perform sequential and random reads with the Tiobench benchmark, and we then measure the memory transfer speed. We use each VM's swap usage as the criterion for memory balancing. Table III shows the average time to reclaim a page from a victim VM using balloon inflation and using our scheme. Ballooning takes much longer than our scheme because it requires the victim domain's involvement, which consumes time to acquire free memory. It therefore hinders fine-grained memory balancing, since its delay is proportional to the number of balancing trials.

Table III. Elapsed time to release a page for transfer

Fig. 8 shows the elapsed time of memory allocation in the beneficiary VM while page frames are transferred as a result of memory balancing. It presents the effect of the memory transfer speed and the victim selection policy on application performance. Faster transfer of page frames enlarges the amount of free memory available when the guest OS requires it; as a result, it reduces the number of time-consuming memory reclamation trials triggered when no free memory exists, so the guest OS can respond faster to the memory allocation requests of user applications. In addition, if we reclaim and restore a whole sequence instead of a single page, we obtain an even better result, indicated as Sequential + WHOLE_SEQ, because a sequence contains more page frames and its data can be recovered with fewer long accesses to permanent storage.

Sacrificing sequential references before unclassified pages affects performance. To see how this effect compares with the ordinary LRU strategy, we compare the dwell time of a reclaimed page, i.e., the time it stays in the reclaimed state, under each reclamation method; the results are shown in Fig. 9. All victim VMs execute the Tiobench benchmark. Maintaining a long dwell time is important for a reclaimed page: a long dwell time means the reclaimed page causes no unnecessary page fault or data reload from permanent storage, both of which degrade performance. The results confirm that our victim selection policy for hypervisor-level paging is effective. Because of the policy mismatch, a page reclaimed by the LRU victim selection policy in the hypervisor is soon reclaimed again by the guest OS, and this quick re-reclamation results in a short dwell time. In contrast, our scheme selects sequentially referenced pages first and reclaims them before others on the basis of the MRU strategy. Although inaccuracies in the sequential-reference detection occasionally produce a short dwell time, a near-maximum dwell time is guaranteed for most of the reclaimed page frames. Thus, the dwell time of pages reclaimed by our scheme is much longer than that of pages reclaimed by the LRU policy.

This measurement is taken in a situation where the Tiobench benchmark actively drives reclamation in the guest OS page cache. If the workload fluctuates, the dwell-time difference caused by the policy mismatch grows, which further lowers the victim VM's overhead from page faults on reclaimed pages.

4. Virtual Memory Management

In the MN-MATE architecture, both DRAM and non-volatile memory are used as main memory devices and mapped into the same physical address space. Because the characteristics of the two memories are very different, new virtual memory management facilities are necessary to reduce the access latency and energy consumption of main memory.



In addition, MN-MATE has a non-volatile region in main memory, so permanent file data can be stored using virtual-memory resources. Because non-volatile memory is free from seek latency, we can increase the storage performance of MN-MATE by using non-volatile memory as an alternative storage device. Therefore, we also need a management method for sharing non-volatile memory between the virtual memory system and the file system.

Fig. 10 shows the virtual memory management architecture of MN-MATE. In this architecture, the conventional memory management facilities, such as the free page manager and the page swap manager, are extended to support the hybrid memory of DRAM and non-volatile memory, and non-volatile memory pages are dynamically shared between the virtual memory system and the file system. Thus, although non-volatile memory serves as both additional main memory and a storage device, the virtual memory management architecture in MN-MATE offers applications simple memory and storage abstractions, increases memory and storage performance, and reduces the energy consumption of the MN-MATE system.

4.1. Selective free page allocation

The first role of the virtual memory system is the allocation of physical memory pages. Previously, page allocation was simply a matter of finding and managing free pages in DRAM. In the MN-MATE architecture, however, page allocation requires additional effort to manage the non-volatile memory pages and to choose the more beneficial page between DRAM and non-volatile memory. The selective page allocation manager, an extended version of the conventional page allocator, is proposed to handle page allocation in MN-MATE.

Basically, the selective allocation manager uses the general access pattern of the process address-space region that will contain the allocated page. As shown in Fig. 11, the process address space of a traditional operating system is partitioned into several linear address intervals called segments, each associated with a specific data type and a specific access pattern. For example, the text segment contains read-only, infrequently accessed pages because it stores the application code, whereas the stack segment contains the local variables of the application program and is accessed and updated frequently.

Therefore, if the selective page allocation manager knows the segment of the page being allocated, it can select the better free page between DRAM and non-volatile memory. As explained in Section 1, non-volatile memory is slower than DRAM and has an endurance problem. Frequently updated pages in the stack and heap segments are therefore allocated to DRAM, whereas infrequently accessed and read-only pages in the text, data, and BSS segments are better allocated to non-volatile memory.
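
Under the stated assumption that segment type is a good proxy for update frequency, the placement rule reduces to a small function like the sketch below; the segment encoding is illustrative.

```c
/* Sketch of segment-driven DRAM/NVRAM placement (assumed encoding). */
#include <stdio.h>

enum segment { SEG_TEXT, SEG_DATA, SEG_BSS, SEG_HEAP, SEG_STACK };
enum mem_dev { DEV_DRAM, DEV_NVRAM };

static enum mem_dev place_page(enum segment seg)
{
    switch (seg) {
    case SEG_STACK:
    case SEG_HEAP:
        return DEV_DRAM;   /* write-intensive: avoid NVRAM latency/wear  */
    default:
        return DEV_NVRAM;  /* read-mostly or read-only: NVRAM saves energy */
    }
}

int main(void)
{
    printf("stack page -> %s\n",
           place_page(SEG_STACK) == DEV_DRAM ? "DRAM" : "NVRAM");
    printf("text page  -> %s\n",
           place_page(SEG_TEXT) == DEV_DRAM ? "DRAM" : "NVRAM");
    return 0;
}
```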

For file pages, on the other hand, the basic metric for the allocation algorithm is data size. It is generally known that small files are accessed frequently and randomly. Because non-volatile memory has a fast access time but far smaller capacity than disk, it is better to keep only the data of small files in non-volatile memory. In the selective page allocation manager, if the requested data size exceeds several pages, contiguous physical pages on disk are selected to store the data. We also propose write-request merging using the non-volatile buffer: even if the data size of a request is small, if it is a sequential request whose virtual address is adjacent to the previous one, the selective allocation manager decides that the pages belong to the same file and allocates them on disk. This allows a large file whose size keeps growing to be allocated to disk. Consequently, write-request merging reduces the number of disk accesses and the waste of non-volatile memory space.
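
The file-placement and write-request-merging rules can be sketched as below; the "several pages" threshold and the adjacency test are illustrative assumptions rather than the implemented values.

```c
/* Sketch of size- and locality-based file placement with merging. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE      4096u
#define SMALL_REQ_MAX  (4 * PAGE_SIZE)   /* assumed "several pages" limit */

enum store { STORE_NVRAM, STORE_DISK };

struct req { uint64_t vaddr; uint32_t size; };

static enum store place_write(const struct req *cur, const struct req *prev)
{
    /* sequential if the request extends the previous one */
    bool merges = prev && cur->vaddr == prev->vaddr + prev->size;
    if (cur->size > SMALL_REQ_MAX || merges)
        return STORE_DISK;   /* large or growing file: keep it on disk */
    return STORE_NVRAM;      /* small random write: keep it in NVRAM   */
}

int main(void)
{
    struct req first  = { 0x10000, PAGE_SIZE };
    struct req second = { 0x11000, PAGE_SIZE };  /* continues first */
    printf("first  -> %s\n",
           place_write(&first, NULL) == STORE_DISK ? "disk" : "NVRAM");
    printf("second -> %s\n",
           place_write(&second, &first) == STORE_DISK ? "disk" : "NVRAM");
    return 0;
}
```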

After the device for the allocation is selected, the selective allocation manager finds a free page on that device. In the MN-MATE architecture, the conventional free page management system (the buddy system) is extended so that DRAM and non-volatile memory each have their own free page list, from which a free page on the chosen device can easily be taken. Finally, once the page mapping of the allocated page is updated, the page allocation process is finished.

Figure 12. Access time and energy consumption of hybrid main memory of DRAM and non-volatile memory

Figure 11. Selective allocation of memory page



4.2. Unified page swapping

Swapping is originally used when there are not enough free pages in memory. Similarly, we use the unified page swap manager to create free pages, but with several differences.

First of all, the unified page swap manager needs an additional procedure for page swapping because many pages in non-volatile memory contain file data. For example, if the selective allocation manager tries to write new file data to non-volatile memory, the unified page swap manager can swap a page out from non-volatile memory to disk. This file-page swapping is basically similar to memory-page swapping in that it moves the page to disk; however, after the page has been moved, the file system can access the swapped-out page on disk without a separate swap-in process.

In addition, the unified page swap manager migrates pages according to their access locality. We use the selective allocation manager because there are two candidate devices for memory and storage, but it cannot fully predict the future access pattern of a page: even if a page is placed perfectly at the beginning, its access pattern changes over time. It is therefore necessary to adjust the page's location to maintain system performance. To manage access locality, the unified page swap manager migrates a page to another physical device when the page is stored on an unprofitable device; after the migration, the page table must be modified because the actual location of the data has changed. Many criteria can be used to select pages for swapping. In our current design, the migration mechanism monitors each page's access count for a certain period and selects migration candidates using a simple policy similar to LRFU. A policy dedicated to page swapping would further improve the performance of the virtual memory management.
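
A sketch of such an access-count-driven migration check, run at the end of each monitoring window, is shown below. The decayed score (an LRFU-like blend), the thresholds, and the window mechanism are illustrative assumptions rather than the implemented policy.

```c
/* Sketch of access-count-driven page migration between devices. */
#include <stdint.h>
#include <stdio.h>

enum dev { DEV_DRAM, DEV_NVRAM };

struct page_stat {
    uint64_t pfn;
    enum dev where;
    double   score;     /* decayed access count over recent windows */
};

#define HOT_THRESHOLD   32.0
#define COLD_THRESHOLD   2.0
#define DECAY            0.5

static void end_of_window(struct page_stat *p, unsigned accesses_this_window)
{
    p->score = p->score * DECAY + accesses_this_window;  /* LRFU-like blend */

    if (p->where == DEV_NVRAM && p->score > HOT_THRESHOLD) {
        p->where = DEV_DRAM;   /* promote hot page to fast DRAM */
        printf("pfn %llu -> DRAM\n", (unsigned long long)p->pfn);
    } else if (p->where == DEV_DRAM && p->score < COLD_THRESHOLD) {
        p->where = DEV_NVRAM;  /* demote cold page to NVRAM */
        printf("pfn %llu -> NVRAM\n", (unsigned long long)p->pfn);
    }
    /* the page table entry is updated after a migration (omitted here) */
}

int main(void)
{
    struct page_stat p = { 512, DEV_NVRAM, 0.0 };
    end_of_window(&p, 80);   /* a heavily accessed window promotes the page */
    return 0;
}
```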

4.3. Preliminary evaluation

The virtual memory management architecture of MN-MATE is simulated with the M5 full-system simulator [12], which we use to generate access profiles for estimating the access latency and energy consumption of the memory and storage system. The management facilities are implemented on Linux 2.6.27 running on the M5 simulator. Currently, the non-volatile memory and DRAM are mapped into the same main-memory address space, and the selective page allocator can dynamically allocate pages among DRAM, non-volatile memory, and disk. The unified page swapper, however, is not evaluated in the current implementation.

We first compare the proposed hybrid main memory of DRAM and non-volatile memory with a conventional 256 MB DRAM-only memory architecture. Using the proposed virtual memory management scheme, we reduce the DRAM portion of main memory from 256 MB to 32 MB and use 224 MB of non-volatile memory as main memory in place of DRAM; this is a reasonable assumption given the higher density of PRAM [27]. We use the SPECweb benchmark included in the M5 simulator to extract memory access traces. All experimental results are evaluated using the parameters from Numonyx [13] and the assumption that only 30% of memory is actively used [31]. Fig. 12 presents the access latency and energy consumption of the DRAM-only and the proposed memory architectures.

From this result, we can see that the hybrid main memory without selective allocation more than doubles the total access latency even though it reduces energy consumption by 50%. This is because much of the physical address space of main memory is replaced with non-volatile memory, which has much higher latency but lower energy consumption than DRAM. However, when memory pages are dynamically allocated between DRAM and non-volatile memory by the selective allocator, the proposed virtual memory architecture achieves almost the same access latency as DRAM while still saving 50% of the energy.

In addition, we evaluate the proposed non-volatile-memory-supported storage against a disk, estimating the total access time while executing the OLTP workload [14]. Fig. 13 shows the results. We compare three request allocation policies: random, selective, and selective merging. Random allocation assigns a page to non-volatile memory or disk at random; selective allocation uses non-volatile memory only for pages belonging to small requests; and selective merging combines selective allocation with write-request merging.

Figure 13. Total storage access time of non-volatile memory supported storage



The experimental results show that using non-volatile memory increases storage performance, because non-volatile memory is free from seek latency and well suited to small random accesses. Random allocation, however, cannot fully exploit the non-volatile memory because it wastes it on large sequential data. In contrast, selective allocation and write-request merging are very effective because they identify sequentially accessed data and reduce the number of disk seeks. With 256 MB of non-volatile memory used as storage, we reduce the disk access time by more than 40%.

5. Previous Work

Researchers have started to explore the management of manycore resources. Corey [18] and fos [17] are manycore OSes that increase core scalability by controlling the sharing of kernel data structures. Tessellation [16] and fos [17] propose space-time sharing and scheduling of core resources, and Barrelfish [20] also considers the management of heterogeneous core resources. Unlike these systems, MN-MATE is designed so that independent, parallel applications do not contend for the same resources. MN-MATE provides elastic resource partitioning and scheduling at the hypervisor level and uses existing operating systems with only the additional affiliation call interface.

There is also much research on NVRAM. HeRMES [23], Conquest [24], and PFFS [25] use non-volatile memory to enhance file system performance. Lee [26], Qureshi [27], and Dhiman [28] suggest new main memory architectures using non-volatile memory. However, these studies do not consider using non-volatile memory in a manycore system. MN-MATE exploits non-volatile memory for manycore systems and runs many OSes and applications on a large-scale shared memory with little effort. The MN-MATE hypervisor applies system-wide optimization to the non-volatile memory resources, increases both storage and main memory performance, and minimizes overall system power consumption.

6. Conclusion

In this paper, we presented the MN-MATE architecture and its management techniques as the primary utilities for resource management in a highly energy-efficient manycore computing system. We concentrate on energy efficiency with minimal performance loss for the combined architecture of manycores, DRAM, and NVRAM. We are currently implementing the MN-MATE hardware and software components, including resource partitioning and dynamic balancing. The MN-MATE hypervisor monitors all tasks and collects performance data; by analyzing the collected data, it partitions the computing resources, including cores and memory comprising DRAM and NVRAM. Each guest OS then uses the partitioned resources in an energy-efficient manner. Preliminary results for memory partitioning and the virtual memory management scheme show the effectiveness of the MN-MATE components and point to a promising direction for manycore research.

We will soon complete the design and implementation of memory partitioning and NVRAM main-memory management, and realize an integrated resource management architecture spanning the hypervisor and the guest OS.

7. Acknowledgments

The authors wish to thank their anonymous referees for all of their invaluable comments and suggestions. The work presented in this paper was supported by MKE (Ministry of Knowledge Economy, Republic of Korea), Project No. 10035231-2010-01.

8. References

[1] OpenMP Group, "Introduction to OpenMP Library," http://software.intel.com/openmplibrary/

[2] OpenMP, MP-Compiler Toolkit, http://www.openmp.org/

[3] SMP-Linux Group, “Introduction to SMP-Linux”, http://www.ibm.com/developerworks/linux-smp/

[4] C. A. Waldspurger, “Memory Resource Management in VMware ESX Server,” in Proceedings of Fifth Symposium on Operating Systems Design and Implementation (OSDI ’02), Dec. 2002.

[5] P. Lu and K. Shen, “Virtual machine memory access tracing with hypervisor exclusive cache,” in ATC’07: Proceedings of the 2007 USENIX Annual Technical Conference. USENIX Association, 2007, pp. 1-15.

[6] D. Magenheimer, “Memory Overcommit... without the commitment,” Xen Summit, June 2008.

[7] Xen Co.Ltd Press, “Transcendent Memory on Xen,” Xen Summit, 2009.

[8] W. Zhao and Z. Wang, "Dynamic memory balancing for virtual machines," in VEE '09: Proceedings of the 2009 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments. ACM, 2009, pp. 21-30.

[9] M. Schwidefsky, H. Franke, R. Mansell, D. Osisek, H. Raj, and J. Choi, “Collaborative Memory Management in Hosted Linux Systems,” in Proceedings of the 2006 Ottawa Linux Symposium, 2006.

[10] J. Choi, S. H. Noh, S. L. Min, and Y. Cho, "Towards application/file-level characterization of block references: a case for fine-grained buffer management," in SIGMETRICS '00: Proceedings of the 2000 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems. ACM, 2000, pp. 286-295.

[11] M. Kuoppala, “Tiobench - Threaded I/O bench for Linux,” 2002.

[12] N. L. Binkert et al., "The M5 Simulator: Modeling Networked Systems," IEEE Micro, 2006.

[13] S. Eilert et al., "Phase Change Memory: A new memory enables new memory usage models," IEEE International Memory Workshop, 2009.

[14] OLTP Application I/O and Search Engine I/O. UMass Trace Repository, http://traces.cs.umass.edu/index.php/Storage/Storage.

[15] Teraflops Research Chip, http://techresearch.intel.com/articles/Tera-Scale/1449.htm.

[16] R. Liu et al., "Tessellation: Space-Time Partitioning in a Manycore Client OS," USENIX Workshop on Hot Topics in Parallelism, 2009.

[17] D. Wentzlaff, A. Agarwal, "Factored operating systems (fos): the case for a scalable operating system for multicores,” ACM SIGOPS Operating Systems Review, Vol. 43, Issue 2, 2009, pp. 76-85.

[18] S. Boyd-Wickizer et al., "Corey: An Operating System for Many Cores," in Proceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation, 2008.

[19] D. R. Engler et al., "Exokernel: an operating system architecture for application-level resource management," in Proceedings of the ACM Symposium on Operating Systems Principles, 1995.

[20] Andrew Baumann, Paul Barham, Pierre-Evariste Dagand, Tim Harris, Rebecca Isaacs, Simon Peter, Timothy Roscoe, Adrian Schupbach, and Akhilesh Singhania. “The Multikernel: A new OS architecture for scalable multicore systems,” In Proceedings of the 22nd ACM Symposium on OS Principles, Big Sky, MT, USA, October 2009.

[21] L. Seiler et al., "Larrabee: a many-core x86 architecture for visual computing," ACM Transactions on Graphics (TOG), Vol. 27, Issue 3, 2008.

[22] Kinshuk Govil, Dan Teodosiu, Yongqiang Huang, Mendel Rosenblum, “Cellular disco: resource management using virtual clusters on shared-memory multiprocessors,” ACM Transactions on Computer Systems (TOCS), Vol 18, Issue 3, 2000, pp. 229-262.

[23] E. L. Miller et al., "HeRMES: High-performance reliable MRAM-enabled storage," in Proceedings of the 8th IEEE Workshop on Hot Topics in Operating Systems, 2001.

[24] A.-I. A. Wang et al., "Conquest: Better Performance through a Disk/Persistent-RAM Hybrid File System," in Proceedings of the 2002 USENIX Annual Technical Conference, 2002.

[25] Y. Park et al., "PFFS: A Scalable Flash Memory File System for the Hybrid Architecture of Phase-change RAM and NAND Flash," in Proceedings of the 2008 ACM Symposium on Applied Computing, 2008.

[26] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, "Architecting Phase Change Memory as a Scalable DRAM Alternative," in Proceedings of the 36th Annual International Symposium on Computer Architecture, 2009.

[27] M. K. Qureshi et al., "Scalable High Performance Main Memory System Using Phase-Change Memory Technology," in Proceedings of the 36th Annual International Symposium on Computer Architecture, 2009.

[28] G. Dhiman et al., "PDRAM: A Hybrid PRAM and DRAM Main Memory System," in Proceedings of the 46th Annual Design Automation Conference, 2009.

[29] W. Hwang et al., "HyperDealer: Reference-pattern-aware Instant Memory Balancing for Consolidated Virtual Machines," in Proceedings of the 3rd International Conference on Cloud Computing, 2010.

[30] NexR Co., Ltd., "iCube Cloud Computing and Elastic-Storage Service," http://www.icubecloud.com, 2010.

[31] D. Meisner and T. F. Wenisch, "PowerNap: Eliminating Server Idle Power," in Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2009.
