

Transparent Superpages for FreeBSD on ARM

Zbigniew Bodek
Semihalf, The FreeBSD Project

zbb@{semihalf.com, freebsd.org}

Abstract

This paper covers recent work on providing transparent superpages support for the FreeBSD operating system on ARM. The superpages mechanism is a virtual memory optimization that allows for efficient use of TLB translations, effectively reducing the overhead related to memory management. This technique can significantly increase system performance at the interface between the CPU and main memory, thus improving overall efficiency.

The primary goal of this work is to describe how the superpages functionality has been implemented on FreeBSD/arm and what the results of its application are. The paper presents real-life measurements and benchmarks performed on a modern, multiprocessor ARM platform. Actual performance gains and areas of application are shown. Finally, the article summarizes the possibilities of future work and further improvements.

1 Introduction

ARM technology is becoming more and more prevalent, not only in the mobile and embedded space. Contemporary ARM architecture (ARMv7 and the upcoming ARMv8) is already on a par with the traditional PC industry standards in terms of advanced CPU features such as:

• MMU (with TLB)

• Multi-level Cache

• Multi-core

• Hardware coherency

The performance and scalability of an ARM-based machine largely depend on these features. The majority of modern ARM chips are capable of running complex software and handling multiple demanding tasks simultaneously. In fact, general purpose operating systems have become the default choice for these devices.

The operating system (kernel) is an essential component of many modern computer systems. The main goal of the kernel is to provide a runtime environment for user applications and to manage the available hardware resources in an efficient and reasonable way. Memory handling is one of the top priority kernel services. The growing requirements of contemporary applications result in significant memory pressure and increasing access overhead. The performance impact related to memory management is likely to be at the level of 30% up to 60% [1]. This can be a serious issue, especially for a system that operates under heavy load.

Today’s ARM hardware is designed to improve the handling of contemporary memory management challenges. The key to FreeBSD’s success on this architecture is a combination of sophisticated techniques that take full advantage of the hardware capabilities and hence provide better performance in many applications. One such technique is the transparent superpages mechanism.

The superpages mechanism is a virtual memory system feature whose aim is to reduce memory access overhead by making better use of the capabilities of the CPU’s Memory Management Unit. In particular, this mechanism enlarges the TLB (translation cache) coverage at runtime and results in less overhead related to memory accesses. This technique had already been applied on the i386 and amd64 architectures and brought excellent results.

FreeBSD incorporates verified and mature, high-level methods to handle superpages. The work presented in this paper introduces the machine-dependent portion of the superpages support for ARMv6 and ARMv7 on FreeBSD.

To summarize, this paper makes the following contributions:

• Problem analysis and explanation

• Introduction to possible problem solutions

• Implementation of the presented solution

• Validation (benchmarks and measurements)

• Code upstreamed to the mainline FreeBSD 10.0-CURRENT

The project was sponsored by Semihalf and The FreeBSD Foundation. The code is publicly available beginning with FreeBSD 10.0.

2 Problem Analysis

In a typical computer system, memory is divided into a few general levels:

• CPU cache

• DRAM (main memory)

• Non-volatile backing storage (hard drive, SSD, flash memory)

Each successive level in the hierarchy has significantly greater capacity and lower cost per storage unit, but also longer access time. This design provides the best compromise between speed, price and the capabilities of contemporary electronics. However, the same architecture poses a number of challenges for the memory management system.

User applications stored in external, non-volatile memory need to be copied to the main memory so that the CPU can access them. The operating system is expected to handle all physical memory allocations, the movement of segments between DRAM and external storage, and the protection of memory chunks belonging to concurrently running jobs. The virtual memory system carries out these tasks without any user intervention. The concept makes it possible to implement various useful memory management techniques such as on-demand paging, copy-on-write, shared memory and others.

2.1 Virtual Memory

The processor core uses a so-called Virtual Address (VA) to refer to a particular memory location. Therefore, the set of addresses that are ’visible’ to the CPU is often called the Virtual Address Space. On the other hand, there is a real or Physical Address (PA) Space, which can incorporate all system bus agents such as DRAM, SoC registers and I/O.

Virtual memory introduces an additional layer of translation between those spaces, effectively separating them and providing an artificial private work environment for each application. This mechanism, however, requires some hardware support to operate. Most application processors incorporate a special hardware entity for managing address translations called the Memory Management Unit (MMU). Address translation is performed with page granularity. A page defines the VA → PA translation for the subset of addresses within that page. Hence, for each resident page in the VA space there exists exactly one frame in physical memory. For a CPU access to a virtual address to succeed, the MMU has to provide a valid translation to the corresponding physical frame. The translations are stored in main memory in the form of virtually indexed arrays, the so-called Translation Tables or Page Tables.


To speed up the translation procedure, the Memory Management Unit maintains a table of recently used translations called the Translation Lookaside Buffer (TLB).

2.1.1 TLB Translations

Access to pages that still have their translations cached in the TLB is performed immediately and implies minimal overhead beyond the access itself. In other scenarios it is necessary to search for a proper translation in the Translation Tables (presented in Figure 1) or, in case of failure, to handle a time-consuming exception. The TLB is therefore in the critical path of every memory access and for that reason it needs to be as fast as possible. In practice, TLBs are fully associative arrays limited in size to several dozen entries. In addition, operating systems usually configure TLB entries to cover the smallest available page size so that fine page granularity, and thus low memory fragmentation, can be maintained. These factors form the concept of TLB coverage, which can be described as the amount of memory that can be accessed directly, without a TLB miss.

Another significant TLB behavior can be observed during frequent, numerous accesses to different pages in memory (such a situation can occur when a large data set is being processed). Because a lot of pages are being touched in the process, free TLB entries become occupied fast. In order to make room for subsequent translations, some entries need to be evicted. TLB evictions are made according to an eviction algorithm, which is implementation defined. However, regardless of the eviction algorithm, significant paging traffic can cause recently used translations to be evicted even though they will need to be restored in a moment. This phenomenon is called TLB thrashing. It is directly associated with TLB coverage and can seriously impact system performance.
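As a rough illustration of TLB coverage, the sketch below multiplies the number of TLB entries by the amount of memory each entry maps. The function name tlb_coverage() and the 64-entry TLB used in the usage note are hypothetical, chosen only for the example; they do not describe a specific CPU.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: bytes of memory reachable without a TLB miss,
 * assuming every TLB entry maps a page of the same size.
 */
uint64_t
tlb_coverage(uint32_t entries, uint64_t page_size)
{
    /* Page sizes are powers of two. */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return ((uint64_t)entries * page_size);
}
```

Under these assumptions, tlb_coverage(64, 4096) gives 256 KB, while the same 64 entries mapping 1 MB pages would cover 64 MB.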

2.1.2 Constraints and opportunities

It is estimated that the performance degradation caused by TLB misses is at 30-60%. That is at least 20%, up to 50% more than in the 1980s and 1990s [1]. TLB miss reduction is therefore expected to improve memory bandwidth and hence overall system performance, especially for resource-hungry processes. Reducing the number of TLB misses is equivalent to enhancing TLB coverage.

Figure 1: Memory access with TLB miss. (Diagram labels: CPU, MMU, TLB, TTW, MEM; TLB hit and TLB miss paths; VA: virtual address, PA: physical address.)

Obvious solutions to achieve that would be to:

◦ Enlarge the TLB itself.
However, a bigger translation cache means more logic, higher complexity and greater energy consumption, and may still bring little improvement. To sustain satisfactory TLB characteristics with currently available technologies, translation buffers can usually hold from tens up to a few hundred entries.

◦ Increase the base page size.
The majority of microprocessor architectures support more than one page size. This gives the opportunity to cover larger memory areas while consuming only a single entry in the TLB. However, this solution has a major drawback in the form of increased fragmentation and hence inefficient memory utilization. An application may need to access a very limited amount of memory that is placed in a few distinct locations. If small pages are used as the base allocation unit, less memory is reserved and more physical frames are available for other agents. On the other hand, using superpages as the main allocation unit results in rapid exhaustion of the memory available for new allocations. In addition, a single page descriptor contains only one set of access permissions and page attributes, including the dirty and referenced bits. For that reason, a whole dirty superpage needs to be written back to external storage on page-out, since there is no way to determine which fraction of the superpage has actually been written. This may cause serious disk traffic that can outweigh the benefit of reducing TLB misses.

◦ Allow the user to choose the page size.
In that case, the user would have to be aware of the memory layout and requirements of the running applications. That approach could be as effective for some cases as it would be ineffective for others. In fact, this method contradicts the idea of virtual memory, which should be a fully transparent layer.

2.1.3 Universal Solution

Reduction of the TLB miss rate has proven to be a complex task that requires support from both the hardware and the operating system. OS software is expected to provide low-latency methods for memory layout control, a superpage allocation policy, efficient paging and more.

The FreeBSD operating system offers a generic, machine-independent framework for transparent superpages management. The superpages mechanism is a well-developed technology on FreeBSD, which allows for runtime page size adjustment based on the actual needs of the running processes. This feature is already being successfully utilized on the i386 and amd64 platforms. The observed memory performance boost for those architectures is at 30%. These promising numbers encouraged applying the superpages technique to another, recently popular architecture: ARM. Modern ARM revisions (ARMv6, ARMv7 and the upcoming ARMv8) are capable of using various page sizes, which allows the superpages mechanism to be used.

3 Principles of Operation

The virtual memory system consists of two main components. The machine-independent VM manages abstract entities such as address spaces, objects in memory or software representations of physical frames. The architecture-dependent pmap(9), on the other hand, operates on the memory management hardware, page tables and all low-level structures. The superpages framework affects both aspects of the virtual memory system. Therefore, in order to illustrate the main principles of the superpages mechanism, the relevant VM operations are described first. Then the Virtual Memory System Architecture (VMSA) introduced in ARMv6/v7-compliant processors is described, along with the opportunities to take advantage of the superpages technique on those architectures.

3.1 Reservation-based Allocation

VM uses the vm_page structure to represent a physical frame in memory. In fact, the physical space is managed on a page-by-page basis through this structure [2]. In the context of superpages, vm_page can be called the base page since it usually represents the smallest translation unit available (in most cases a 4 KB page). The operating system needs to track the state and attributes of all resident pages in memory. This knowledge is necessary for the pager to maintain an effective page replacement policy and decide which pages should be kept in main memory and which ought to be discarded or written back to the external disk.

Files or any areas of anonymous memory are represented by virtual objects. A vm_object stores information about the related vm_pages that are currently resident in main memory, the size of the area described by the object, a pointer to shadow objects that hold private copies of modified pages, and other information [3]. At system boot time, the kernel detects the number of free pages in memory and assigns vm_page structures to them (except for pages occupied by the kernel itself). When processes begin to execute and touch memory areas, they generate page faults since no pages from the free list have yet been filled with relevant contents and assigned to the corresponding object. This mechanism is a part of on-demand paging and implies that only requested (and subsequently utilized) pages of any object are cached in main memory.

The superpages technique relies on this virtual memory feature and is in a way its extension. When reservation-based allocation is enabled (VM_NRESERVLEVEL set to a non-zero value) and the referenced object is of superpage size or greater, VM will reserve a contiguous physical area in memory for that object. This is justified by the fact that a superpage mapping translates a contiguous range of virtual addresses to a contiguous range of physical addresses within a single memory frame. Pages within the created area are grouped in a population map. If the process that refers to the object keeps touching subsequent pages inside the allocated area, the population map will eventually fill up. In that case, the related memory chunk becomes a candidate for promotion to a superpage. The mechanism is briefly visualized in Figure 2.

Figure 2: Basic overview of the reservation-based allocation. (Diagram stages: reservation, promotion.)
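The way a population map fills up can be sketched with a simple bitmap. This is a minimal illustration and not the FreeBSD vm_reserv code: struct reserv, reserv_init() and reserv_populate() are hypothetical names, and the reservation size of 256 base pages assumes 4 KB base pages and a 1 MB superpage.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define RESERV_NPAGES 256   /* assumed: base pages per reservation */

/* Hypothetical, simplified reservation: one bit per base page. */
struct reserv {
    uint32_t popmap[RESERV_NPAGES / 32];
    int      popcnt;        /* number of populated base pages */
};

void
reserv_init(struct reserv *rv)
{
    memset(rv, 0, sizeof(*rv));
}

/*
 * Mark base page 'idx' as populated. Returns true once every page of
 * the reservation is populated, i.e. when the area becomes a candidate
 * for promotion to a superpage.
 */
bool
reserv_populate(struct reserv *rv, int idx)
{
    uint32_t bit;

    assert(idx >= 0 && idx < RESERV_NPAGES);
    bit = 1u << (idx % 32);
    if ((rv->popmap[idx / 32] & bit) == 0) {
        rv->popmap[idx / 32] |= bit;
        rv->popcnt++;
    }
    return (rv->popcnt == RESERV_NPAGES);
}
```

Touching the same page twice does not advance the count, so in this sketch only a reservation whose 256 distinct pages have all been touched reports itself as full.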

Not all reservations can be promoted even though the underlying pages satisfy the contiguity requirements. That is because a single superpage translation has only one set of attributes and access permissions for the entire area covered by the mapping. Therefore, all base pages within the population map must be consistent in terms of all settings and state for promotion to succeed. In addition, superpages are preferably promoted read-only unless all base pages have already been modified and are marked ’dirty’. The intention is to avoid increased paging traffic to the disk. Since there is only one modification indicator for the whole superpage, there is no way to determine which portion of the corresponding memory has actually been written. Hence, the entire superpage area would need to be written back to external storage. Demoting a read-only superpage on a write attempt has proven to be a more effective solution [1]. To summarize, the following requirements must be met to allow superpage promotion:

• The area under the superpage has to be contiguous in both the virtual and physical address spaces

• All base mappings within the superpage need to have identical attributes, state and access permissions

Not all reservations can always be completed. If the process is not using pages within the population map, then the reservation is just holding free space for nothing. In that case VM can evict the reserved area in favor of another process. This shows that the superpages mechanism truly adapts to the current system needs, as only active pages participate in page promotion.

3.2 ARM VMSA

The Virtual Memory System Architecture introduced in ARMv7 is an extension of the definition presented in ARMv6. The differences between those revisions are not relevant to this work since backward compatibility with ARMv6 has to be preserved (ARMv6 and ARMv7 share the same pmap(9) module).

ARMv6/v7-compliant processors use Virtual Addresses to describe a memory location in their 32-bit Virtual Address Space. If the CPU’s Memory Management Unit is disabled, all Virtual Addresses refer directly to the corresponding locations in the Physical Address Space. However, when the MMU is enabled, the CPU needs additional information about which physical frame to access when a given virtual address is used. Both the logical and physical address spaces are divided into chunks: pages and frames respectively. The appropriate translations are provided in the form of memory-resident Translation Tables. A single entry in a translation table can hold either invalid data that will cause a Data/Prefetch abort on access, a valid virtual → physical translation, or a pointer to the next level of translation. ARMv7 (without the Large Physical Address Extension) defines two-level translation tables.

The L1 table consists of 4096 word-sized entries, each of which can:

• Cause an abort exception

• Translate a 1 MB page to a 1 MB physical frame (section mapping)

• Point to a second level translation table

In addition, a group of 16 L1 entries can translate a 16 MB chunk of virtual space using just one supersection mapping. The L1 translation table occupies 16 KB of memory and needs to be aligned to that boundary.

The L2 translation table incorporates 256 word-sized entries that can:

• Cause an abort exception

• Provide a mapping for a 4 KB page (small page)

Similarly to L1 entries, 16 L2 descriptors can be used to translate a 64 KB large page with a single TLB entry. The L2 translation table takes 1 KB of memory and has to be aligned to a 1 KB boundary.
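The table sizes quoted above follow directly from the entry counts and the 32-bit descriptor size. The sketch below reproduces that arithmetic; the macro and function names are illustrative and are not taken from the FreeBSD headers.

```c
#include <assert.h>
#include <stdint.h>

#define DESC_SIZE        4u          /* one 32-bit word per descriptor */
#define L1_ENTRIES       4096u
#define L2_ENTRIES       256u
#define SECTION_SIZE     (1u << 20)  /* 1 MB section mapping */
#define SMALL_PAGE_SIZE  (1u << 12)  /* 4 KB small page */

/* Memory occupied by a table holding 'entries' descriptors. */
uint32_t
table_bytes(uint32_t entries)
{
    return (entries * DESC_SIZE);
}

/* Virtual address range translated by a fully used table. */
uint64_t
table_span(uint32_t entries, uint32_t bytes_per_entry)
{
    assert(entries > 0 && bytes_per_entry > 0);
    return ((uint64_t)entries * bytes_per_entry);
}
```

With these values, table_bytes(L1_ENTRIES) is 16 KB and table_bytes(L2_ENTRIES) is 1 KB. A full L1 table of sections spans the whole 4 GB address space, and one L2 table of small pages spans exactly the 1 MB covered by a single L1 entry.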

Recently used translations are cached in the unified TLB. Most modern ARM processors have additional ’shadow’ TLBs for instructions and data. These are designed to speed up the translation process even more and are fully transparent to the programmer. Usually, TLBs in ARMv6/v7 CPUs can hold tens of entries, so the momentary TLB coverage is rather small. The exception is when pages bigger than 4 KB are used.

3.2.1 Translation Process

When a TLB miss occurs, the MMU is expected to find a mapping for the referenced page. The process of fetching translations from page tables into the TLB is called a Translation Table Walk (TTW) and on ARM it is performed by hardware.

For the short page descriptor format (LPAE disabled), the translation table walk logic may need to access both L1 and L2 tables to acquire the proper mapping. TTW starts with the L1 page directory, whose address in memory is passed to the MMU via the Translation Table Base Register (TTBR0/TTBR1). First, the 12 most significant bits of the virtual address (VA[31:20]) are used as an index into the L1 translation table (page directory). If the L1 descriptor encodes a section (1 MB) or supersection (16 MB) mapping, that mapping is inserted into the TLB and the translation table walk is over. However, if the L1 entry points to an L2 table, then the 8 subsequent bits of the virtual address (VA[19:12]) serve as an index to the destination L2 descriptor in that table. Finally, the information from the L2 entry can be used to insert a small (4 KB) or large (64 KB) mapping into the TLB. An invalid L1 or L2 descriptor format results in a data or prefetch abort, depending on the access type.
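The index extraction performed during the walk can be sketched as plain bit operations on a 32-bit virtual address. The function names are hypothetical; section_pa() additionally assumes the short-descriptor section format, in which descriptor bits [1:0] equal binary 10 and bits [31:20] hold the section base address.

```c
#include <assert.h>
#include <stdint.h>

/* VA[31:20]: index into the 4096-entry L1 table. */
uint32_t
l1_index(uint32_t va)
{
    return (va >> 20);
}

/* VA[19:12]: index into the 256-entry L2 table. */
uint32_t
l2_index(uint32_t va)
{
    return ((va >> 12) & 0xffu);
}

/* VA[11:0]: byte offset within a 4 KB small page. */
uint32_t
page_offset(uint32_t va)
{
    return (va & 0xfffu);
}

/*
 * Physical address produced by a section (1 MB) L1 descriptor:
 * the section base from the descriptor combined with VA[19:0].
 */
uint32_t
section_pa(uint32_t l1_desc, uint32_t va)
{
    assert((l1_desc & 0x3u) == 0x2u);   /* must be a section descriptor */
    return ((l1_desc & 0xfff00000u) | (va & 0x000fffffu));
}
```

For the address 0x12345678 this yields L1 index 0x123, L2 index 0x45 and page offset 0x678. A section mapping needs only the first lookup, which is why it shortens the walk.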


3.2.2 Page Table Entry

Both L1 and L2 page descriptors hold not only the physical address and size of the related pages but also a set of encoded attributes that define access permissions, memory type, cache mode and others. The page descriptor format is programmable to some extent, depending on the enabled features and overall CPU/MMU settings (access permissions model, type extension, etc.). In general, every aspect of any memory access is fully described by the page table entry. This also means that any attempt to reference a page in a manner other than allowed will cause an exception.

4 Superpages Implementation for ARM

This section elaborates on how the superpages mechanism has been implemented and how it operates on ARM. The main modifications to the virtual memory system are described along with an explanation of the applied solutions.

4.1 Superpage size selection

The first step in supporting superpages on a new architecture is to tune the VM parameters. In particular, reservation-based allocation needs to be enabled and configured according to the chosen superpage sizes.

The machine-independent layer requires two parameters declared in sys/arm/include/vmparam.h:

• VM_NRESERVLEVEL - specifies the number of promotion levels enabled for the architecture. Effectively, this indicates how many superpage sizes are used simultaneously.

• VM_LEVEL_{X}_ORDER - for each reservation level, this parameter determines (as a power of two) how many base pages fully populate the related reservation level.

At this stage a decision regarding the supported superpage sizes had to be made. The 1 MB section mapping has been chosen as the superpage, whereas the 4 KB small mapping has remained the base page. This approach has a twofold advantage:

1. Shorter translation table walk on a TLB miss in an area covered by a section mapping.
In that scenario, the TTW penalty is limited to one memory access only (L1 table) instead of two (L1 and L2 tables).

2. Better comparison with other architectures.
i386 and amd64 can operate on just one superpage size of 2/4 MB. A similar performance impact was expected when using comparable page sizes on ARM.

To summarize, the VM parameters have been configured as follows:

• VM_NRESERVLEVEL set to 1 - indicates one reservation level and therefore one superpage size in use.

• VM_LEVEL_0_ORDER set to 8 - a level 0 reservation consists of 256 (1 << 8) base pages.
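The relation between these two parameters and the chosen superpage size can be checked with a few lines of arithmetic. The parameter names and values come from the text; PAGE_SHIFT of 12 corresponds to the 4 KB base page, and the two helper functions are hypothetical names used only for this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT        12    /* 4 KB base page */
#define VM_NRESERVLEVEL   1     /* one superpage size in use */
#define VM_LEVEL_0_ORDER  8     /* log2 of base pages per reservation */

/* Number of base pages in a fully populated level 0 reservation. */
uint32_t
level0_npages(void)
{
    return (1u << VM_LEVEL_0_ORDER);
}

/* Size in bytes of a level 0 reservation. */
uint32_t
level0_size(void)
{
    /* Guard against overflowing a 32-bit size. */
    assert(VM_LEVEL_0_ORDER + PAGE_SHIFT < 32);
    return (level0_npages() << PAGE_SHIFT);
}
```

With an order of 8, a reservation holds 256 base pages of 4 KB each, which is exactly the 1 MB covered by one section mapping.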

4.2 pmap(9) extensions

The core of the machine-dependent portion of superpages support is focused on the pmap module. From a high-level point of view, VM “informs” the lower layer when a particular reservation is fully populated. This event implies a chance to promote a range of mappings to a superpage, but the promotion itself still may not succeed for various reasons. There are no explicit directives from VM that would influence superpages management. The pmap module is therefore expected to handle:

• promotion of base pages to a superpage

• explicit superpage creation

• superpage demotion

• superpage removal

Figure 3: Page tables and kernel structures organization. (Diagram labels: pmap, pm_l2, l2_bucket, L1 Table, L2 Table, TAILQ (L1 LRU list), SLIST (L1 list).)

4.2.1 Basic Concepts

The pmap(9) module is responsible for managing the real mappings that are recognizable by the MMU hardware. In addition, it has to control the state of all physical maps and pass the relevant bits to the VM. The main module file is located at sys/arm/arm/pmap-v6.c and is supplemented by the appropriate structure definitions from sys/arm/include/pmap.h. The core structure representing a physical map is struct pmap.

During virtual memory system initialization, the pmap module allocates one L1 translation table for every fifteen user processes out of the maximum pool of maxproc. Sharing of L1 entries is achieved by marking all L1 descriptors with the appropriate domain ID. The architecture defines 16 domains, of which 15 are used for user processes and one is reserved for the kernel. This design reduces KVM occupancy, as each L1 table requires 16 KB of memory which is never freed. Each pmap structure holds the pm_l1 pointer to the corresponding L1 translation table metadata (l1_ttable), which provides the table’s physical address to load into the TTBR on a context switch, as well as other information used to allocate and free the L1 table on process creation and exit.
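The memory saved by sharing L1 tables between processes can be estimated as follows. This is only a back-of-the-envelope sketch: the helper names are hypothetical, and rounding up to a whole number of tables is an assumption, so the exact count computed by the kernel may differ.

```c
#include <assert.h>
#include <stdint.h>

#define USER_DOMAINS   15u              /* domains available to user processes */
#define L1_TABLE_SIZE  (16u * 1024u)    /* bytes per L1 translation table */

/* Assumed: one L1 table per group of up to 15 processes, rounded up. */
uint32_t
l1_tables_needed(uint32_t maxproc)
{
    assert(maxproc > 0);
    return ((maxproc + USER_DOMAINS - 1) / USER_DOMAINS);
}

/* Memory permanently taken by the L1 tables for 'maxproc' processes. */
uint64_t
l1_tables_bytes(uint32_t maxproc)
{
    return ((uint64_t)l1_tables_needed(maxproc) * L1_TABLE_SIZE);
}
```

For a hypothetical maxproc of 1500, this gives 100 tables and about 1.6 MB, compared with roughly 24 MB if every process had its own 16 KB L1 table.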

Figure 3 shows the organization of the page tables and their relation to the corresponding kernel structures. An L1 page table entry points to an L2 table, which collects up to 256 L2 descriptors. Each L2 entry can map 4 KB of memory. An L2 table is allocated on demand and can be freed when unused. This technique effectively saves 1 KB of KVA for each unused L2 table.

The pmap’s L2 management is performed via the pm_l2 array of type struct l2_dtable. Each pm_l2 field holds enough L2 descriptors to cover 16 MB of data. Hence, for every 16 L1 table entries there exists one pm_l2 entry. The l2_dtable structure incorporates 16 elements of type struct l2_bucket, each of which describes a single L2 table in memory. In the current pmap-v6.c implementation, both l2_dtable and the L2 translation table are allocated at runtime using the UMA(9) zone allocator. l2_occupancy and l2b_occupancy track the number of allocated buckets and L2 descriptors respectively. An l2_bucket can be deallocated if none of the 256 L2 entries within the L2 table is in use. Similarly, an l2_dtable can be freed as soon as all 16 l2_buckets within the structure are deallocated.
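The mapping from an L1 table index to a pm_l2 entry and to a bucket within it can be sketched as a division and a remainder by 16. The function names below are hypothetical and do not reproduce the macros used in pmap-v6.c; only the 4096 L1 entries and the 16 buckets per l2_dtable come from the text.

```c
#include <assert.h>
#include <stdint.h>

#define L1_ENTRIES          4096u
#define BUCKETS_PER_DTABLE  16u     /* l2_bucket elements per l2_dtable */

/* L1 table index for a virtual address (VA[31:20]). */
uint32_t
va_to_l1idx(uint32_t va)
{
    return (va >> 20);
}

/* Which pm_l2 entry (l2_dtable) covers the given L1 entry. */
uint32_t
dtable_index(uint32_t l1idx)
{
    assert(l1idx < L1_ENTRIES);
    return (l1idx / BUCKETS_PER_DTABLE);
}

/* Which l2_bucket inside that l2_dtable describes the L2 table. */
uint32_t
bucket_index(uint32_t l1idx)
{
    assert(l1idx < L1_ENTRIES);
    return (l1idx % BUCKETS_PER_DTABLE);
}
```

For the address 0x12345678 the L1 index is 0x123 (291), which falls into pm_l2 entry 18, bucket 3. With 4096 L1 entries grouped by 16, the pm_l2 array in this sketch has 256 entries, each covering 16 MB.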

An additional challenge for the pmap module is to track multiple mappings of the same physical page. Different mappings can have different states even if they point to the same physical frame. When modifying the physical layout (page-out, etc.) it is necessary to take into account the wired, dirty and other attributes of all pages related to a particular physical frame. The described functionality is provided by using pv_entry structures organized in chunks and maintained for each pmap in the system. When a new mapping is created for any pmap, the corresponding pv_entry is allocated and put into the PV list of the related vm_page.

Superpages support required extending the mentioned mechanisms and techniques. Apart from implementing routines for explicit superpage management, the objective was to make the existing code superpage-aware.

4.2.2 Promotion to a Superpage

The decision whether to attempt promotion is based on two main conditions:

• vm_reserv_level_iffullpop() - indicates that the physical reservation map is fully populated

• l2b_occupancy - implies that the (aligned) virtual region of superpage size is fully mapped using base pages

Both events will most likely occur during the insertion of a new mapping into the address space of the process. Therefore the promotion attempt is performed right after a successful pmap_enter() call.

The page promotion routine (pmap_promote_section()) starts with a preliminary classification of the page table entries within the potential superpage. At this point a decision had to be made about which pages to promote and which to exclude from promotion. In the presented implementation, promotion to a superpage is abandoned in the following cases:

• VA belongs to a vectors page
Access to a page containing exception vectors must never abort, so the page should be excluded from any kind of manipulation for safety reasons. Every abort in this case would result in a nested exception and a fatal system error.

• Page is not under PV management
With Type Extension (TEX) disabled, the page table entry does not have enough room to store all the necessary status bits. For that reason the pv_flags field of the pv_entry structure holds the additional data, including bits relevant for promotion to a superpage.

• Mapping is within the kernel address space
On ARM, kernel pages are already mapped using as many section mappings as possible. The mappings are then replicated in each pmap.

The page table entry in the L2 table under promotion is also tested for the reference and modification bits as well as for write permission. A superpage is preferably a read-only mapping to avoid expensive, superpage-sized transfers to disk on page-out. Therefore it is convenient to clear the write permission of a base page if it has not already been marked dirty. All of the mentioned tests apply to the first base page descriptor in the set. This approach reduces the overhead of an unsuccessful promotion attempt since it allows invalid mappings to be disregarded quickly. However, if the first descriptor is suitable for promotion, then the remaining 255 entries from the L2 table still need to be checked.

Apart from the above-mentioned criteria, the area under the superpage must satisfy the following conditions:

1. Contiguity in the VA space

2. Contiguity in the PA space
   Physical addresses stored in the subsequent L2 descriptors must differ by the size of the base page (4 KB).

3. Consistency of the pages' attributes and states
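The physical-address and attribute conditions can be sketched as a single pass over the 256 descriptors of one L2 table. Everything here is an illustrative assumption rather than the code of pmap_promote_section(): the descriptor is treated as a frame address in bits 31..12 with opaque attribute bits below it, validity checks are omitted, and the 1 MB alignment test reflects the hardware requirement for section mappings. Virtual contiguity is implied, because consecutive entries of one L2 table map consecutive virtual pages.

```c
#include <stdbool.h>
#include <stdint.h>

#define L2_ENTRIES    256u
#define PAGE_SIZE_4K  0x1000u
#define SECTION_MASK  0x000fffffu  /* offset within a 1 MB section */
#define FRAME_MASK    0xfffff000u  /* frame bits of a small-page descriptor */

/*
 * Returns true when the 256 descriptors describe one 1 MB-aligned,
 * physically contiguous region with identical attribute bits.
 * Bits outside FRAME_MASK are compared verbatim.
 */
static bool l2_table_promotable(const uint32_t l2[L2_ENTRIES])
{
    uint32_t first_pa = l2[0] & FRAME_MASK;
    uint32_t attrs = l2[0] & ~FRAME_MASK;

    if ((first_pa & SECTION_MASK) != 0)
        return false;                      /* not 1 MB aligned */
    for (uint32_t i = 1; i < L2_ENTRIES; i++) {
        if ((l2[i] & FRAME_MASK) != first_pa + i * PAGE_SIZE_4K)
            return false;                  /* hole in the PA space */
        if ((l2[i] & ~FRAME_MASK) != attrs)
            return false;                  /* attributes differ */
    }
    return true;
}
```

Checking the first descriptor before entering the loop mirrors the early-exit approach described earlier: an unsuitable region is rejected without touching the remaining 255 entries.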

When all requirements are met, it is possible to create a single 1 MB section mapping for a given area. It is important that the L2 table zone is not deallocated during the promotion process. Instead, the corresponding l2_bucket is stashed to speed up superpage demotion in the future.

The actual page promotion can be divided into two stages:

• pmap_pv_promote_section()
  At this point the pv_entry related to the first vm_page in a superpage is moved to another PV list, associated with the 1 MB physical frame. The remaining PV entries can be deallocated.

• pmap_map_section()
  The routine constructs the final section mapping and inserts it into the L1 page descriptor. Mapping attributes, access permissions and cache mode are identical to those of all the base pages.

Successful promotion ends with the TLB invalidation, which flushes old translations and allows the MMU to put the newly created superpage into the TLB.

4.2.3 Explicit Superpage Creation

Incremental population of the reservation map is not always necessary. When a mapping is inserted for an entire virtual object, it is possible to determine the object's size and its physical alignment. The described situation can take place when pmap_enter_object() is called. If the object is at least of superpage size and the VM has performed the proper alignment, it is possible to explicitly map the object using section mappings.
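The size and alignment precondition can be expressed as a short predicate. This is a sketch under the assumption that both the virtual and the physical address must be aligned to 1 MB; the function and parameter names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define SECTION_SIZE 0x00100000u          /* 1 MB */
#define SECTION_MASK (SECTION_SIZE - 1u)

/*
 * va, pa:    start of the region still to be mapped
 * remaining: number of bytes of the object left to map
 */
static bool can_map_section(uint32_t va, uint32_t pa, uint32_t remaining)
{
    return (va & SECTION_MASK) == 0 &&
           (pa & SECTION_MASK) == 0 &&
           remaining >= SECTION_SIZE;
}
```

A caller walking an object would use a section mapping whenever the predicate holds and fall back to 4 KB pages otherwise.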

pmap_enter_section() has been implemented to create direct superpage mappings. The routine has to perform a preliminary page classification similar to the one in pmap_promote_section(). This time, however, it is not necessary to check any of the base pages within the potential superpage since they do not exist yet. The bits that still need to be tested are:

• PV management status

• L1 descriptor status
  The given L1 descriptor cannot be used for a section mapping if it is already a valid section or is already serving as a page directory for an L2 table.

Direct insertion of the mapping requires allocating a new pv_entry for a 1 MB frame. This task is performed by pmap_pv_insert_section(), which may not succeed. In case of failure the superpage cannot be mapped; otherwise the section mapping is created immediately.
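The L1 descriptor test and the allocation-failure path can be sketched as below. The two low bits of an ARMv7 short-descriptor L1 entry encode its type (0 for invalid, 1 for a page table, 2 for a section). The function names, the attribute mask and the boolean that models the PV allocation result are assumptions made for this example only.

```c
#include <stdbool.h>
#include <stdint.h>

#define L1_TYPE_MASK       0x3u
#define L1_TYPE_INVALID    0x0u
#define L1_TYPE_PAGE_TABLE 0x1u
#define L1_TYPE_SECTION    0x2u
#define L1_SECTION_FRAME   0xfff00000u  /* section base address bits */

/* An L1 slot can take a new section only if it maps nothing yet. */
static bool l1_slot_free_for_section(uint32_t l1_desc)
{
    return (l1_desc & L1_TYPE_MASK) == L1_TYPE_INVALID;
}

/*
 * pv_ok models the outcome of the PV entry allocation for the 1 MB frame.
 * Returns true and writes the descriptor only when every check passes.
 */
static bool enter_section_sketch(uint32_t *l1_slot, uint32_t pa,
                                 uint32_t attrs, bool pv_ok)
{
    if (!l1_slot_free_for_section(*l1_slot) || !pv_ok)
        return false;
    /* Attribute bits are opaque here; the mask only keeps them clear
     * of the base address and type fields. */
    *l1_slot = (pa & L1_SECTION_FRAME) | (attrs & 0x000ffffcu) |
               L1_TYPE_SECTION;
    return true;
}
```

On failure the slot is left untouched, so the caller can simply continue with base pages.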

4.2.4 Superpage Demotion and Removal

When there is a need to page out or modify one of the base pages within the superpage, it is required to destroy the corresponding section mapping. Lack of any mapping for a memory region that is currently in use would cause a chain of expensive vm_fault() calls. The demotion procedure (pmap_demote_section()) is designed to overcome this issue by recreating the L2 translation table in place of the removed L1 section.

There are two possible scenarios of superpage demotion:

1. Demotion of a page created as a result of promotion.
   In that case it is possible to reuse the already allocated l2_bucket that has been stashed after the promotion. This scenario has two major advantages:

   • No need for any memory allocation for the L2 directory and L2 table.

   • If the superpage attributes have not changed, there is no need to modify or fill the L2 descriptors.

2. Demotion of a page that was directly inserted as a superpage.
   This implies that there is no stashed L2 table and it needs to be allocated and created from scratch. Any allocation failure results in an immediate exit due to speed restrictions. Sleeping is not an option.

The demotion routine has to check whether the superpage has exactly the same attributes and status bits as the stashed (or newly created) L2 table entries. If not, the L2 entries need to be recreated using the current L1 descriptor. PV entries also need to be allocated and recreated using the pv_entry linked with the 1 MB page. Finally, when the L2 table is in place again, the L1 section mapping can be replaced with the proper L1 page directory entry and the corresponding translation in the TLB ought to be flushed.
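The L2 part of demotion can be sketched as "compare, then refill only if needed". The code below is an assumption-laden illustration, not pmap_demote_section(): translating section attributes into the small-page format is left out, the caller passes the small-page attribute bits directly, and PV entry handling and the TLB flush are omitted.

```c
#include <stdbool.h>
#include <stdint.h>

#define L2_ENTRIES       256u
#define PAGE_SIZE_4K     0x1000u
#define SECTION_FRAME    0xfff00000u  /* base address bits of a section */
#define SMALL_ATTR_MASK  0x00000fffu  /* non-address bits of a small page */

/* True if every entry already maps the section's frames with these bits. */
static bool l2_matches_section(const uint32_t l2[L2_ENTRIES],
                               uint32_t section_desc, uint32_t small_attrs)
{
    uint32_t pa = section_desc & SECTION_FRAME;

    for (uint32_t i = 0; i < L2_ENTRIES; i++)
        if (l2[i] != ((pa + i * PAGE_SIZE_4K) | small_attrs))
            return false;
    return true;
}

/*
 * Make l2 (stashed or freshly allocated) describe the same 1 MB as
 * section_desc. Returns true if the table had to be rewritten.
 */
static bool demote_section_sketch(uint32_t l2[L2_ENTRIES],
                                  uint32_t section_desc, uint32_t small_attrs)
{
    uint32_t pa = section_desc & SECTION_FRAME;

    small_attrs &= SMALL_ATTR_MASK;
    if (l2_matches_section(l2, section_desc, small_attrs))
        return false;                 /* stashed table reused unchanged */
    for (uint32_t i = 0; i < L2_ENTRIES; i++)
        l2[i] = (pa + i * PAGE_SIZE_4K) | small_attrs;
    return true;
}
```

The early return corresponds to the cheap case of a superpage that was created by promotion and whose attributes have not changed since.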

The last function used for superpage deletion is pmap_remove_section(). It is used to completely unmap any given section mapping. Calling this function can speed up the pmap_remove() routine if the removed area is mapped with a superpage and the size of the space to unmap is at least of superpage size.
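How a removal loop might choose between the two paths is sketched below on a toy address space. The structure, the names and the extra requirement that the current address be 1 MB aligned are assumptions of this example. A real implementation would demote a partially covered section before removing single pages, which the sketch only counts.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE_4K  0x1000u
#define SECTION_SIZE  0x00100000u
#define SECTION_MASK  (SECTION_SIZE - 1u)

struct remove_stats {
    unsigned int sections;  /* regions unmapped as one whole section */
    unsigned int pages;     /* regions unmapped page by page */
};

/*
 * l1[] holds one descriptor per 1 MB of VA, starting at VA 0.
 * sva and eva are page-aligned and sva <= eva.
 */
static struct remove_stats remove_range_sketch(const uint32_t *l1,
                                               uint32_t sva, uint32_t eva)
{
    struct remove_stats st = { 0u, 0u };
    uint32_t va = sva;

    while (va < eva) {
        bool is_section = (l1[va >> 20] & 0x3u) == 0x2u;

        if (is_section && (va & SECTION_MASK) == 0 &&
            eva - va >= SECTION_SIZE) {
            st.sections++;          /* fast path: drop the whole section */
            va += SECTION_SIZE;
        } else {
            st.pages++;             /* slow path: one base page at a time */
            va += PAGE_SIZE_4K;
        }
    }
    return st;
}
```

Removing 1 MB plus two pages from a range whose first megabyte is a section takes three steps instead of 258.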

4.2.5 Configuration and control

At the time of writing, superpages support is disabled by default in pmap-v6.c. It can be enabled at system boot by setting a loader variable:

vm.pmap.sp_enabled=1

in loader.conf, or it can be turned on at compilation time by setting the

sp_enabled

variable in sys/arm/arm/pmap-v6.c to a non-zero value.

System statistics related to superpages utilization can be displayed by invoking the

sysctl vm.pmap

command in the terminal. Example output can be seen below:

vm.pmap.sp_enabled: 1
vm.pmap.section.demotions: 258
vm.pmap.section.mappings: 0
vm.pmap.section.p_failures: 301
vm.pmap.section.promotions: 1037

demotions – number of demoted superpages
mappings – explicit superpage mappings
p_failures – promotion attempts that failed
promotions – number of successful promotions
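One way to read these counters is as a promotion success ratio. The helper below is a sketch that assumes the total number of attempts equals promotions plus p_failures; the function name is hypothetical.

```c
/*
 * Fraction of promotion attempts that succeeded, assuming
 * attempts = promotions + p_failures.
 */
static double promotion_success_ratio(unsigned long promotions,
                                      unsigned long p_failures)
{
    unsigned long attempts = promotions + p_failures;

    return attempts == 0 ? 0.0 : (double)promotions / (double)attempts;
}
```

With the sample output above (1037 promotions, 301 failures), roughly 77.5% of the attempts succeeded.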

5 Results and benchmarks

The functionality has been extensively tested using various benchmarks and techniques. The performance improvement depends to a large extent on the application behavior, usage scenarios and the amount of available memory in the system. Processes allocating large areas of contiguous memory or operating on big sets of data will benefit more from superpages than those using small, independent chunks.

The presented measurements and benchmarks have been performed on a Marvell Armada XP (a quad-core ARMv7-compliant chip).

5.1 GUPS

The most significant results can be observed using the Giga Updates Per Second (GUPS) benchmark. GUPS measures how frequently a system can issue updates to randomly generated memory locations. In particular, it measures both memory latency and bandwidth. On the multi-core ARMv7 platform, measured CPU time usage and real time duration dropped by 34%. The number of updates performed in the same amount of time increased by 52%.

                   SP disabled   SP enabled
CPU time [s]          146.42        96.45
Real time [s]         143.42        93.45
Updates [mln/s]        36.6         55.6

Figure 4: GUPS results. CPU time used [s], number of updates performed [100000/s].
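The access pattern that GUPS exercises can be sketched in a few lines. This is not the benchmark's source: the xorshift64 generator is a stand-in for the official random sequence, and the function names are hypothetical. It only illustrates why the workload is dominated by TLB behavior: every update touches an unpredictable location in a large table.

```c
#include <stddef.h>
#include <stdint.h>

/* xorshift64 step; a stand-in generator, not the official GUPS sequence.
 * The state must be non-zero. */
static uint64_t next_random(uint64_t x)
{
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return x;
}

/*
 * Apply n_updates read-modify-write operations to pseudo-random slots.
 * table_len must be a power of two so that masking selects a valid index.
 */
static void gups_updates(uint64_t *table, size_t table_len,
                         uint64_t n_updates, uint64_t seed)
{
    uint64_t r = seed;

    for (uint64_t i = 0; i < n_updates; i++) {
        r = next_random(r);
        table[r & (table_len - 1)] ^= r;
    }
}
```

Because the update is an XOR, running the same sequence twice restores the table, which gives a simple self-check. Dividing the number of updates by the elapsed time for a table much larger than the TLB reach yields an updates-per-second figure.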

5.2 LMbench

LMbench is a popular suite of system performance benchmarks. It is equipped with a memory testing program and can be used to examine memory latency and bandwidth. Measured memory latency dropped by 37.85% with superpages enabled. Memory bandwidth improvement varied depending on the type of operation and ranged from 2.26% for mmap reread to 8.44% for memory write. It is worth noting that LMbench uses the STREAM benchmark to measure memory bandwidth, which uses floating point arithmetic to perform the operations on memory. FreeBSD does not yet support the FPU on ARM, which had a negative impact on the results.

Mmap reread [MB/s]   Bcopy (libc) [MB/s]   Bcopy (hand) [MB/s]   superpages
645.4                305.4                 432.3                 no
660.0                312.4                 446.9                 yes

Table 1: LMbench. Memory bandwidth measured on various system calls.

Mem read [MB/s]   Mem write [MB/s]   Mem latency [ns]   superpages
681               3043               238.8              no
696               3300               148.4              yes

Table 2: LMbench. Memory bandwidth and latency measured on memory operations.

The results summary is shown in Tables 1 and 2. Table 3, on the other hand, shows the percentage improvement of the parameters with the best test results.

Mem write %   Rand mem latency %
8.44          37.85

Table 3: LMbench. Percentage improvement of the selected parameters.
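The percentages follow directly from the measured values in Tables 1 and 2. The two helpers below are a small sketch with hypothetical names: one for metrics where higher is better (bandwidth) and one for metrics where lower is better (latency).

```c
/* Relative gain, in percent, for "higher is better" metrics. */
static double gain_percent(double before, double after)
{
    return (after - before) / before * 100.0;
}

/* Relative reduction, in percent, for "lower is better" metrics. */
static double drop_percent(double before, double after)
{
    return (before - after) / before * 100.0;
}
```

For example, gain_percent(3043, 3300) gives about 8.44 for memory write, gain_percent(645.4, 660.0) gives about 2.26 for mmap reread, and drop_percent(238.8, 148.4) gives about 37.85 for memory latency.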

5.3 Self-hosted world build

Using superpages helped to reduce the self-hosted world build time when using GCC. The results are summarized in Table 4. The time needed for building the whole set of user applications comprising the root file system dropped by 1 hour 22 minutes (20% shorter). No significant change was noted when using CLANG.

GCC        CLANG      superpages
6h 36min   6h 16min   no
5h 14min   6h 15min   yes

Table 4: Self-hosted make buildworld completion time.

5.4 Memory stress tests

The presented functionality has also been tested in terms of overall stability and reliability. For that purpose two popular stress benchmarks have been used:

• forkbomb: forkbomb -M
  The application can allocate the entire available memory using realloc() and access this memory.

• stress: stress --vm 4 --vm-bytes 400M
  The benchmark imposes certain types of compute stress on the system. In this case 4 processes were spinning on malloc()/free() calls, each of them working on 400 MB of memory.

No anomalies or instabilities were detected even during long runs.

6 Future work

The presented functionality has a significant impact on the system's performance but does not cover all of the hardware and OS capabilities. There are possible ways of improvement.

Adding support for the additional 64 KB page size will further increase the number of created superpages, enabling a smoother and more efficient promotion from a 4 KB small page to a 1 MB section. In addition, a larger number of processes will be capable of taking advantage of superpages if the required population map size is smaller.

In addition, the current pmap(9) implementation uses PV entries to store some information about the mapping type and status. This implies the necessity to search through PV lists on each promotion attempt. TEX (Type Extension) support would allow those additional bits to be moved to the page table entry descriptors and would reduce the promotion failure penalty.

7 Conclusions

The presented work has brought transparent superpages support to the ARM architecture on FreeBSD. The paper described the virtual memory system from both the OS and hardware points of view. The system's bottlenecks and design constraints have been carefully described. In particular, the work has elaborated on the TLB miss penalty and its influence on overall system performance.

The mechanisms implemented during the project met their objectives and provided a performance gain at the interface between CPU and memory. This statement has been supported by various tests and benchmarks performed on real ARM hardware. Test results vary between different benchmarks, but improvement can be observed in all cases and is at the level of 20%.

The introduced superpages support has been committed to the official FreeBSD SVN repository and is available starting from revision 254918.

8 Acknowledgments

Special thanks go to the following people:

Grzegorz Bernacki and Alan Cox, for all the help and mentorship.
Rafał Jaworowski, mentor of this project.

Work on this project was sponsored by Semihalf and The FreeBSD Foundation.

9 Availability

The support has been integrated into the mainline FreeBSD 10.0-CURRENT and is available with the FreeBSD 10.0-RELEASE. The code can also be downloaded from the FreeBSD main SVN repository.

References

[1] Juan E. Navarro, Transparent operating system support for superpages, 2004

[2] The FreeBSD Documentation Project, FreeBSD Architecture Handbook, 2000-2006, 2012-2013

[3] Marshall Kirk McKusick, The Design and Implementation of the FreeBSD Operating System, 2004
