
Mälardalen University Licentiate Thesis 200

Utilizing Hardware Monitoring to Improve the Performance of Industrial Systems

Marcus Jägemar

Mälardalen University Press Licentiate Theses No. 200


June 2016

Copyright © Marcus Jägemar, 2016
ISBN 978-91-7485-203-5
ISSN 1651-9256
Printed by Arkitektkopia, Västerås, Sweden

Abstract

The drastically increasing use of Information and Communications Technology has resulted in a growing demand for network capacity. In this Licentiate thesis, we show how to monitor, model and finally improve network performance for large industrial systems. We also show how to use modeling techniques to move performance testing to an earlier design phase, with the aim to reduce the total development time of large systems. Our first contribution is a low-intrusive method for long-term hardware characteristic measurements of production nodes located at customer sites. Our second contribution is a technique to mimic the hardware usage of a production environment by creating a characteristics model. The cloned environment makes function test suites more realistic. The goal when creating the model is to reduce the system development time by moving late-stage performance testing to early design phases, thereby improving the quality of the test environment. The third and final contribution is a network performance improvement where we dynamically trade computational capacity for a message round-trip time reduction when there are CPU cycles to spare. We have implemented an automatic feedback controlled mechanism for transparent message compression resulting in improved messaging performance between interconnected network nodes. Our mechanism continuously evaluates eleven compression algorithms on message stream content and network congestion level. The message subsystem will use the compression algorithm that provides the lowest messaging time. If the message content or network load change, a new evaluation is performed. We have conducted several case studies in an industrial environment and verified all contributions on a large telecommunication system manufactured by Ericsson. System engineers frequently use the monitoring and modeling functionality for debugging purposes in production environments. We have deployed all techniques in a complicated industrial legacy system with minimal impact. We show that we can provide not only a solution but a cost-effective solution, which is an important requirement for industrial systems.


Summary

The telecommunications industry is currently facing a major challenge in which communication performance and short delivery times are becoming increasingly important for positioning oneself amid growing competition. In this licentiate thesis, we describe how to observe, model and finally improve the communication performance of telecommunication systems and other large industrial computer systems. We also show how to shorten the total development time by using model systems for performance evaluation in the early parts of the development process.

The first research contribution is a case study with an efficient method for continuously reading out hardware characteristics from a telecom system deployed in production. We have focused on techniques with low impact on the observed system, which is suitable for investigations in performance-critical production environments. We use the extracted hardware characteristics in our second research contribution, where we have created an execution model that runs on a smaller lab system. The goal of the model is 1) to shorten the time between development start and performance tests and 2) to create a better test environment for characteristics tests. In the third and final research contribution, we present a method for performance improvements by selectively compressing messages when doing so gives a shorter transfer time in the communication system. Several compression algorithms are evaluated continuously, and the compression algorithm that gives the shortest transfer time is used for a majority of the messages. Changes in the message stream or in the utilization level of the network are monitored continuously and used when evaluating the available compression algorithms.

All software development and testing have been carried out on an industrial telecommunication system manufactured by Ericsson. All techniques are implemented for use in production environments, and the monitoring and modeling functionality is used continuously for debugging the production system. The techniques we present in this thesis also provide a cost-effective solution, which is an important requirement for industrial systems.


To Karolinn

Acknowledgements

First of all, I would like to thank my supervisors and co-authors, Björn Lisper, Sigrid Eldh and Andreas Ermedahl for your patience and helpful discussions during my studies. I would also like to express gratitude towards my manager, Magnus Schlyter, who has always supported me throughout the work on this thesis. The work presented in this Licentiate thesis has been funded by Ericsson and the Swedish Knowledge Foundation (KK-stiftelsen) through the ITS-EASY program at Mälardalen University.

Furthermore, thanks to all students in the ITS-EASY research group, we all share the ups and downs of studying for a PhD; Apala Ray, Daniel Hallmans, Daniel Kade, David Rylander, Eduard Paul Eniou, Fredrik Ekstrand, Gaetana Sapienza, Kristian Wiklund, Markus Wallmyr, Mehrdad Saadatmand, Melika Hozhabri, Sara Dersten, Stephan Baumgart, and Tomas Olsson.

I would also like to thank my additional co-authors: Björn Lisper, Sigrid Eldh, Andreas Ermedahl, Gordana Dodig-Crnkovic, Rafia Inam, Mikael Sjödin, Daniel Hallmans, Stig Larsson and Thomas Nolte. I really enjoyed working with you.

I have the greatest gratitude to my parents; my mother and father who always wanted me to study hard to become something they never could.

Finally and foremost, I want to express my endless love for Karolinn and our three daughters, Amelie, Lovisa and Elise. I would not have been able to write this thesis without your support and encouragement.

Marcus Jägemar

Sigtuna, May 2016


List of Publications

Included Publications

A. Marcus Jägemar, Sigrid Eldh, Andreas Ermedahl, Björn Lisper and Gabor Andai. Automatic Load Synthesis for Performance Verification in Early Design Phases. Technical Report, 2016. [68]. This technical report, quoted in Chapter 7, is an extension of the already published papers C [64], E [65] and the technical report I [63].

B. Marcus Jägemar, Sigrid Eldh, Andreas Ermedahl and Björn Lisper. Automatic Message Compression with Overload Protection. In press: Journal of Systems and Software, 2016. [67]. This paper, quoted in Chapter 8, is an extension of the already published paper G [66].

Changes to Included Publications

Papers A and B are quoted in full but have been reformatted to fit the layout of this thesis. Chapter 5 includes the related work sections of both papers. In a similar fashion, Chapter 6 contains future work from both papers.


Other Publications

C. Marcus Jägemar, Sigrid Eldh, Andreas Ermedahl and Björn Lisper. Towards Feedback-Based Generation of Hardware Characteristics. In Proceedings of the International Workshop on Feedback Computing, 2012. [64]

D. Rafia Inam, Mikael Sjödin and Marcus Jägemar. Bandwidth Measurement using Performance Counters for Predictable Multicore Software. Proceedings of the International Conference on Emerging Technologies and Factory Automation (ETFA12), 2012. [58]

E. Marcus Jägemar, Sigrid Eldh, Andreas Ermedahl and Björn Lisper. Automatic Multi-Core Cache Characteristics Modelling. In Proceedings of the Swedish Workshop on Multicore Computing, Halmstad, 2013. [65]

F. Daniel Hallmans, Marcus Jägemar, Stig Larsson and Thomas Nolte. Identifying Evolution Problems for Large Long Term Industrial Evolution Systems. In Proceedings of IEEE International Workshop on Industrial Experience in Embedded Systems Design, Västerås, 2014. [54]

G. Marcus Jägemar, Sigrid Eldh, Andreas Ermedahl and Björn Lisper. Autonomous Feedback Controlled Message Compression. In Proceedings of Computers, Software and Applications Conference (COMPSAC), Västerås, 2014. [66]

H. Marcus Jägemar and Gordana Dodig-Crnkovic. Cognitively Sustainable ICT with Ubiquitous Mobile Services - Challenges and Opportunities. In Proceedings of the International Conference on Software Engineering (ICSE), Firenze, Italy, 2015. [62]

Other Technical Reports

I. Marcus Jägemar, Sigrid Eldh, Andreas Ermedahl and Björn Lisper. Technical Report: Feedback-Based Generation of Hardware Characteristics, 2012. [63].

Key Concepts

Table 1 lists the most common abbreviations used throughout this thesis.

| Key Concept | Description |
|---|---|
| 2G (GSM) | The second generation telecom network, 1991, introduced digital communication. |
| 3G | The third telecom network generation, 1998, enabled large scale digital communication with increased bandwidth and service availability. |
| 3GPP | 3GPP is a standardization organization created by the telecommunication industry. 3GPP aims to create a global standard that is used for development and maintenance of telecommunication systems. |
| 4G (LTE) | Long Term Evolution is the fourth generation telecommunication network, 2008, with increased capacity. |
| Action Research (AR) | A research method where the researcher is an active part of an incremental procedure (plan, act/observe and reflect), which is repeatedly used to improve the object being investigated. AR was first expressed in 1946 by Lewin [83]. |
| ASIC | Application Specific Integrated Circuits are circuits that can be pre-programmed with specific functionality. |
| Capacity | As specified by the Oxford English dictionary: "Ability to receive or contain; holding power". In this thesis, we use capacity to describe the maximal capability of a resource. |
| Compression Ratio | Compression ratio is denoted as $cr = \frac{size_{uncompr}}{size_{compr}}$. A high cr means that the compressed data is smaller than the uncompressed. |
| COTS | Commercial Off The Shelf devices do not need to be tailored for a specific need; they can be bought from other device manufacturers that produce common hardware for many purposes. |
| CPI | Cycles Per Instruction is a metric to determine the performance of a computer system. An average estimation explains how large a part of the total execution can be attributed to different execution parts, such as cache misses, branch misses, TLB misses etc. Eyerman, Eeckhout and Karkhanis provide a good explanation of a modern CPI structure in their paper [40]. |
| Five Nines | 99.999% uptime, which results in a maximum of approx. 5 min downtime per year. |
| FPGA | Field Programmable Gate Arrays are generic circuits that can be programmed at runtime with new functionality. |
| HW | HW is a simple abbreviation for hardware, which means all physical parts in the network, including computers, cables, circuit boards etc. |
| ICT | Information and Communication Technology, which makes it possible for people to communicate and easily access information. |
| Low-intrusive Monitoring | The monitoring mechanism does not affect the behavior or performance of the monitored system. There is no noticeable effect on the system. |
| Node | A computer designed for message processing, which is part of a telecommunication system. |
| Performance | As specified by the Oxford English dictionary: "The quality of execution of such an action, operation, or process; the competence or effectiveness of a person or thing in performing an action; spec. the capabilities, productivity, or success of a machine, product, or person when measured against a standard." [93]. More specifically, a quantifiable metric of how well a particular action is performed. |
| PID Controller | Proportional Integral Derivative controller [12]. |
| Production Node | One node that is running at a customer site handling real end-user traffic. |
| Superscalar Processors | Low-level instructions can be executed in parallel to achieve higher performance, typically more than one instruction per clock cycle. The first commercial appearance was in 1988 with Intel i960CA [85]. |
| SW | As specified by the Oxford English dictionary: "The programs and procedures required to enable a computer to perform a specific task, as opposed to the physical components of the system" [93]. |
| Test Node | Test nodes are typically smaller than production nodes and usually only accessible by corporate personnel. Economic reasons and the wish to keep debugging simple drive the demand to keep test nodes small. |

Table 1: Key concepts used in the context of this thesis.
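Two of the entries in Table 1 are plain arithmetic, and a short worked example may make them concrete. The following is a minimal sketch only: the function names and the 365-day year are assumptions made for illustration and do not come from the thesis.

```python
def compression_ratio(size_uncompr: int, size_compr: int) -> float:
    """cr = size_uncompr / size_compr, following the definition in Table 1."""
    if size_compr <= 0:
        raise ValueError("compressed size must be positive")
    return size_uncompr / size_compr


def yearly_downtime_minutes(availability: float, days_per_year: float = 365.0) -> float:
    """Maximum downtime per year, in minutes, for an availability fraction.

    The 365-day year is an assumption made for this illustration.
    """
    return (1.0 - availability) * days_per_year * 24 * 60


if __name__ == "__main__":
    # A 1000-byte message compressed to 250 bytes has cr = 4.0.
    print(compression_ratio(1000, 250))
    # "Five nines" leaves roughly five minutes of downtime per year.
    print(round(yearly_downtime_minutes(0.99999), 2))
```

Under these assumptions, five nines corresponds to about 5.26 minutes per year, which agrees with the "approx. 5 min" stated in the table.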

Contents

I Thesis

1 Introduction
    1.1 Monitoring a Production System
    1.2 Modeling a Production System
    1.3 Improving the Communication System
    1.4 Outline

2 Background
    2.1 Telecommunication Standards
    2.2 Telecommunication Services
    2.3 Industrial Systems
    2.4 Deploying Our Target System
    2.5 System Details

3 Research Summary
    3.1 Research Questions
        3.1.1 System Monitoring
        3.1.2 System Modeling
        3.1.3 Improving System Performance
    3.2 Delimitations
    3.3 Achievements
        3.3.1 System Monitoring
        3.3.2 System Modeling
        3.3.3 System Improvement
        3.3.4 Message Compression
    3.4 Research Methodology
    3.5 Threats to Validity
        3.5.1 Construct Validity
        3.5.2 Internal Validity
        3.5.3 Conclusion Validity
        3.5.4 Method Applicability

4 Contributions
    4.1 Publication Mapping
    4.2 Publication Hierarchy and Timeline
    4.3 Paper A (Based on Papers C, E and I)
    4.4 Paper B (Based on Paper G)

5 Related Work
    5.1 System Monitoring
    5.2 System Modeling
    5.3 Message and Data Compression
    5.4 Adaptive Compression

6 Conclusion and Future Work
    6.1 Conclusion
    6.2 Future Work

Bibliography

II Included Papers

7 Automatic Load Synthesis for Performance Verification in Early Design Phases
    7.1 Introduction
    7.2 Background
    7.3 Method
        7.3.1 Method Details
    7.4 Target System
        7.4.1 Target System Details
    7.5 Implementation
        7.5.1 The Characteristics Monitor
        7.5.2 The CPI Stack
        7.5.3 The Load Controller
        7.5.4 Generating L1 I-cache Misses
        7.5.5 Generating L1 and L2 Data Cache Misses
        7.5.6 Experimental Setup
    7.6 Results
        7.6.1 Running The Test Application With The Load Generator
        7.6.2 Production vs. Modeled Characteristics
        7.6.3 System Performance Measurement
        7.6.4 Performance Prediction When Switching OS
    7.7 Related Work
    7.8 Future Work
    7.9 Conclusions
    References

8 Automatic Message Compression with Overload Protection
    8.1 Introduction
        8.1.1 Definitions
    8.2 Problem Formulation and System Model
    8.3 Adaption
        8.3.1 The Communication Procedure
        8.3.2 Network Measurements
        8.3.3 Compression Measurements
        8.3.4 Selecting the Best Compression Algorithm
        8.3.5 Compression Throttling
    8.4 Test System Setup
        8.4.1 The Test System
        8.4.2 Compression Algorithms
        8.4.3 Putting it All Together
        8.4.4 Real-World Compression Throttling
    8.5 Results
        8.5.1 Automatic Compression
        8.5.2 Algorithm Selection Methods
        8.5.3 Automatic Algorithm Selection for Changing Message Streams
        8.5.4 Overload Handling
    8.6 Related Work
    8.7 Future Work
    8.8 Conclusions
    References

I Thesis

More and better collaboration between academia and the software industry is an important means of achieving the goals of more studies with high quality and relevance and better transfer of research results.

— D. Sjøberg, T. Dybå, M. Jørgensen [111]

1 Introduction

We have investigated how to improve the communication performance of a large-scale telecommunication system [13] with a major market share [121]. Our most important driving force is the ever increasing demand for higher communication capacity. Mobile operators are compelled to make significant investments in more efficient and powerful telecommunication equipment to meet the requests from end-users. As a telecommunication equipment manufacturer, it is getting increasingly important to enhance the system performance continuously, both for current implementations and by developing new infrastructure. We describe the findings from our work on increasing the capacity of a large-scale telecommunication system. We have focused on two ways to improve the communication performance.

The first improvement area investigated by us is how to achieve higher system capacity by increasing the release rate for new software and hardware. Our method is to reduce the development time by running performance verification earlier in the development process. Many development processes do performance verification at the end of the development phase. Our suggestion is to monitor the hardware characteristics of production systems, Section 1.1, and then synthesize a hardware usage model, Section 1.2. By using this model, it is possible to test a large part of the performance of newly developed software during the design phase, thus reducing the total development time.

As a second improvement area, we have designed, implemented and used a characteristics measurement tool to systematically monitor and improve the performance of selected subsystems. In this thesis, we have addressed one performance problem where we have reduced the round-trip message time through selective message compression, Section 1.3.


1.1 Monitoring a Production System

We have implemented a characteristics monitoring tool intended to run at customer sites. Our goal with the monitoring tool was to get a better understanding of real-world systems by sampling hardware (HW) characteristics.

Our monitor samples HW events from the CPU or any other low-level HW components. We have grouped these events into sets that represent a certain type of behavior, for example, cache usage, TLB usage, or cycles per instruction.

Running a monitoring tool in a production environment poses special restrictions and requirements, such as:

• It must be possible to run the monitor alongside the production system.

• The monitor must have a low probe effect [43] since it is not allowed to affect the behavior and performance of the production system.

• The monitor must be able to capture long time intervals because the system behavior changes slowly depending on end-customer usage.

We have addressed the production environment constraints by being very restrictive when implementing the monitoring application. First, we implemented our application as simply as possible. It is vital that no undesired behavior or faults occur when running in a sensitive environment. Second, we have chosen a low HW event sample frequency (1 Hz) to reduce the probe effect. This sampling frequency is sufficient for the slowly changing behavior of our target system.
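The sampling scheme described above can be sketched in a few lines. This is an illustrative sketch, not the monitoring tool itself: the event names, the `EVENT_SETS` grouping and the `read_counter` callback are hypothetical stand-ins for whatever interface exposes the free-running hardware counters on a given platform.

```python
import time
from typing import Callable, Dict, List

# Hypothetical grouping of HW events into sets; the names are illustrative
# and do not correspond to the counters used in the thesis.
EVENT_SETS: Dict[str, List[str]] = {
    "cache": ["l1i_miss", "l1d_miss", "l2d_miss"],
    "tlb": ["itlb_miss", "dtlb_miss"],
    "cpi": ["cycles", "instructions"],
}


def sample_event_sets(
    read_counter: Callable[[str], int],
    n_samples: int,
    period_s: float = 1.0,  # 1 Hz, the low sampling rate chosen in the text
    sleep: Callable[[float], None] = time.sleep,
) -> List[Dict[str, Dict[str, int]]]:
    """Return per-period counter deltas for every event set.

    read_counter is an assumed callback returning the current value of a
    free-running counter; the monitor only reads, it never resets counters.
    """
    previous = {
        event: read_counter(event)
        for events in EVENT_SETS.values()
        for event in events
    }
    samples: List[Dict[str, Dict[str, int]]] = []
    for _ in range(n_samples):
        sleep(period_s)  # the monitor is idle between samples
        snapshot: Dict[str, Dict[str, int]] = {}
        for set_name, events in EVENT_SETS.items():
            deltas: Dict[str, int] = {}
            for event in events:
                current = read_counter(event)
                deltas[event] = current - previous[event]
                previous[event] = current
            snapshot[set_name] = deltas
        samples.append(snapshot)
    return samples
```

Keeping the loop this small reflects the design constraint stated above: the less the monitor does per sample, and the less often it samples, the smaller its probe effect is likely to be.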

1.2 Modeling a Production System

We have devised a method that automatically synthesizes an HW characteristics model from data obtained by the monitoring tool, see Section 1.1. The model can replicate the HW usage of the production system.

Our goal was to create an improved test suite consisting of an HW characteristics model together with a functional test suite. Our assumption was that a test suite covering both the functional and the characteristics perspective should improve testing in the early stages of system development. Improving the test suite should also make it possible to discover bugs, primarily performance-related ones, earlier in the development process. Finding bugs in the early design phases fits well with the desire to reduce the total system development time, since bug-fixing becomes much more difficult and time-consuming the further it is from the introduction of the bug.

Our method uses a Proportional Integral Derivative (PID) controller [12] to automatically synthesize the model from the HW characteristics data obtained through our monitoring tool. No manual intervention is needed. The overall method is generic and supports any hardware characteristic. The system we have investigated is IO-bound and mostly limited by cache and memory bandwidth. We have implemented one PID control loop per characteristics entity. In our model, we have used L1 Instruction, L1 Data and L2 Data cache usage to represent the behavior of the system.
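The idea of one PID control loop per characteristic can be illustrated with a small simulation. The sketch below is an assumption-laden toy: the gains, the `toy_plant` function (which pretends that the measured miss rate is half the load generator intensity) and all names are hypothetical and only show how a textbook discrete PID loop drives a measured value towards a target taken from production data.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class PID:
    """Textbook discrete PID controller (positional form)."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: float = 0.0

    def update(self, setpoint: float, measured: float, dt: float = 1.0) -> float:
        error = setpoint - measured
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def toy_plant(intensity: float) -> float:
    """Hypothetical stand-in for the test node: miss rate = 0.5 * intensity."""
    return 0.5 * intensity


def synthesize_load(
    targets: Dict[str, float],
    plant: Callable[[float], float] = toy_plant,
    steps: int = 200,
) -> Dict[str, float]:
    """Run one independent PID loop per characteristic towards its target."""
    # Illustrative gains; real gains would have to be tuned for the target HW.
    controllers = {name: PID(kp=0.4, ki=0.3, kd=0.1) for name in targets}
    measured = {name: 0.0 for name in targets}
    for _ in range(steps):
        for name, target in targets.items():
            # The controller output is the load generator intensity,
            # which cannot be negative.
            intensity = max(0.0, controllers[name].update(target, measured[name]))
            measured[name] = plant(intensity)
    return measured


if __name__ == "__main__":
    # Hypothetical target miss rates, as if obtained from a production node.
    print(synthesize_load({"l1i_miss": 4.0, "l1d_miss": 2.5, "l2d_miss": 1.0}))
```

In this toy setting each loop settles on its target without manual tuning per run, which is the property the method relies on; in a real system the characteristics interact (for example, L1 misses feed L2 accesses), so independent loops are a simplification.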

We have evaluated our monitoring and modeling method by synthesizing a model for L1 Instruction, L1 Data, and L2 Data cache misses according to the hardware characteristics extracted from a running production system. We have successfully tested the model on a test node together with an unmodified functional test suite. Our experiments show that using our characteristics model during the test of a production system bug fix causes the detected message round-trip time (RTT) to increase by 10.8%. Using the traditional performance measurement tests results in a 0.75% RTT increase, which may be too small a change to be detectable in an automated test suite.

1.3 Improving the Communication System

We have devised and implemented a mechanism that automatically finds and uses the compression algorithm that provides the shortest message Round-Trip Time (RTT).

Our goal, when performing this work, was to improve the communication performance of our target system. We had already implemented the monitoring tool, Section 1.1, and the characteristics model, Section 1.2, and could use these tools for performance measurements.

We added a software metric to our monitoring tool, measuring message RTT. We could deduce that 1) the message RTT varied depending on the network congestion level and 2) the hardware usage varied but was relatively low under certain conditions. Our assumption was that we could trade computational capacity for an increased messaging capacity by using message compression. We defined some critical considerations, such as:

• The compression algorithm must be selected automatically because the message content can change over time and depend on the location of system deployment.

• Our mechanism should only use message compression if there are computational resources to spare since other co-located services should not starve.

• Our mechanism must handle overload situations gracefully and resume message compression when the system has returned to normal operation.

Our implementation automatically selects the most efficient compression algorithm depending on the current message content, CPU load and network congestion level. We have evaluated our implementation by using production system communication data gathered at customer sites and replayed in a lab (with explicit customer consent). Our experiment shows that the automatic compression mechanism produces a 9.6% reduction in RTT and that it is resilient to manually induced overload situations.
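The trade-off between CPU time and transfer time can be sketched as follows. This is a simplified illustration and not the mechanism evaluated in the thesis: it uses three compressors from the Python standard library as stand-ins for the eleven algorithms mentioned in the abstract, it ignores decompression time, and the bandwidth figure, the CPU load value and the 0.8 load limit are assumed inputs.

```python
import bz2
import lzma
import time
import zlib
from typing import Callable, Dict

# Stand-in candidates from the standard library, chosen only for illustration.
# "none" sends the message uncompressed.
ALGORITHMS: Dict[str, Callable[[bytes], bytes]] = {
    "none": lambda data: data,
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}


def estimate_send_time(
    compress: Callable[[bytes], bytes],
    payload: bytes,
    bandwidth_bytes_per_s: float,
) -> float:
    """Measured compression time plus estimated transfer time.

    Decompression time on the receiver is ignored in this sketch.
    """
    start = time.perf_counter()
    compressed = compress(payload)
    compress_time = time.perf_counter() - start
    return compress_time + len(compressed) / bandwidth_bytes_per_s


def select_algorithm(
    payload: bytes,
    bandwidth_bytes_per_s: float,
    cpu_load: float,
    cpu_load_limit: float = 0.8,  # hypothetical overload threshold
) -> str:
    """Pick the candidate with the lowest estimated send time.

    When the CPU is overloaded, compression is skipped so that co-located
    services do not starve.
    """
    if cpu_load >= cpu_load_limit:
        return "none"
    estimates = {
        name: estimate_send_time(fn, payload, bandwidth_bytes_per_s)
        for name, fn in ALGORITHMS.items()
    }
    return min(estimates, key=estimates.get)


if __name__ == "__main__":
    sample = b"session-setup;cell=17;" * 5000  # hypothetical, repetitive content
    print(select_algorithm(sample, bandwidth_bytes_per_s=1_000.0, cpu_load=0.1))
    print(select_algorithm(sample, bandwidth_bytes_per_s=1_000.0, cpu_load=0.95))
```

On a congested (low bandwidth) link, the transfer term dominates and a compressor is likely to win; on an uncongested link, the compression time may outweigh the savings. Re-running the selection when the message content or network load changes mirrors the re-evaluation described above.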

1.4 Outline

The thesis consists of two major parts. The first part puts our research into its context and explains the method we have used. The second part contains the scientific papers covered in the thesis.

Part I starts at Chapter 1 with an introduction to performance benchmarking and modeling of hardware behavior of industrial systems. The thesis continues in Chapter 2 with further explanations of our target system. We describe standards and functionality supported by the telecommunication system we have investigated. We also describe system setup, design, and structure.

In Chapter 3 we give a detailed summary of our research problems, research questions, and research methodology. A summary of our contributions is presented in Chapter 4. We further contextualize this thesis by reviewing related work in Chapter 5. Chapter 6 concludes Part I of the thesis by describing findings and references to future work.

Part II begins with Chapter 7, where Paper A describes how to monitor and model parts of a large-scale industrial system. Chapter 8 includes Paper B, which describes how to improve the performance of a telecommunication system by using online message compression.

I believe that many events in my work and life have been a matter of luck or accident. But I am also aware of several occasions on which I explicitly made choices to step off the obvious path, and do something that others thought odd or worse... I have come to think of these events as 'detours' from the obvious career paths stretching before me. Frequently these detours have become the main road for me. There are obvious costs to such detours. Other choices might have made me richer, more influential, more famous, more productive, and so on. But I like what I am doing, even though the path has involved a lot of wandering through uncharted territory.

— L.D. Brown, quoted from the book by M. Brydon-Miller, D. Greenwood and P. Maguire [20]

2 Background

In this chapter, we will further describe our target system. We start by listing telecommunication standards, Section 2.1, and how they relate to current and future telecommunication services, Section 2.2. The platform we have worked with supports various standards spanning from 2G (GSM) via 3G (UMTS, WCDMA) and 4G (LTE) and further towards the current 5G standard. The main driver for new communication standards is the growing demand for higher communication bandwidth. Both traffic applications and remote control of equipment require low message latency and power efficient communication.

We continue, in Section 2.3, by defining our view of large-scale industrial systems. Such systems have common attributes such as 1) low acceptance for system faults, 2) many simultaneously deployed software and hardware generations within one system, 3) long lifetime spanning several decades, 4) very large size and complexity, and 5) continuous development over the complete system lifespan.

Section 2.4 illustrates our production system, which is an example of a large-scale industrial system. We show several deployment scenarios and their effect on system complexity. A complete production system spans from single circuit boards with one CPU up to multiple circuit boards with a total of several thousand CPUs.

We conclude this chapter, Section 2.5, with a detailed description of our target system. The system we have investigated has a layered structure using many different programming languages and has been developed continuously over several decades. It is a very large, fault-tolerant system with high requirements on uptime and robustness.


| Telecom. Standard | Max Downlink Speed | First Introd. | Main Features |
|---|---|---|---|
| 1G (NMT, C-Nets, AMPS, TACS) | - | Early 1980s | Several different analog standards for mobile voice telephony. |
| 2G (GSM) | 14.4 kbit/s circuit switched, 22.8 kbit/s packet data [45] | 1991 | The first mobile phone network using digital radio. Introduced services such as SMS. |
| → GPRS | 30–100 kbit/s | 2000 | Increased bandwidth over GSM. |
| → EDGE | 236.8 kbit/s | 2003 | Increased bandwidth over GSM-GPRS. |
| 3G (UMTS, WCDMA) | 384 kbit/s | 2001 | Mobile music and other types of smart-phone apps started to be used through more advanced smart-phones, which changed awareness and increased communication bandwidth. |
| → HSPA | 14.4–672 Mbit/s [90] | 2010 | Increased bandwidth over 3G. |
| 4G (LTE) | 100 Mbit/s–1 Gbit/s | 2009 | Mobile video. |
| 5G | 1 Gbit/s to many users simultaneously | 2018 | Massive deployment of high bandwidth to mobile users, smart homes, high definition video transmission. |

Table 2.1: The most important telecommunication standards and their communication bandwidth linked to the main features introduced by the standard.

2.1 Telecommunication Standards

Telecommunication systems are complex because they implement several communication standards. Standards define how systems should interact and are a fundamental tool when connecting systems from different manufacturers. The standards continuously evolve to reflect customer demands, which drives equipment manufacturers to continually develop new features and system improvements. Several standards execute concurrently for efficiency reasons. See Table 2.1 for a list of telecommunication standards and their main features.

Groupe Spécial Mobile (GSM) [120] (2G) was introduced in 1991 and provided the second generation of mobile communication. It was the first commercial and widely available mobile communication system that supported digital communication [97]. The GSM system was an astonishing commercial success, with one billion subscribers in 2002 [123] and 3.5 billion in 2009 [52]. The introduction of GSM changed the way people communicate by allowing a significant portion of the population in industrialized countries to use mobile phones. Several extensions to the GSM standard, such as GPRS and EDGE, further increased the communication bandwidth, thus allowing the implementation of even more complex services.

In 2001, the third generation (3G) standard was introduced as a response to customer demands for further increased bandwidth. The 3G standard is also known as Universal Mobile Telecommunication System (UMTS).

A fourth increment (4G) of the telecommunication standard, also called Long Term Evolution (LTE) [61], was introduced to the market in 2009. At this point, a large part of the industrialized world had adopted the "always-online" paradigm. Society as a whole looks favorably on mobile broadband and social networking services [62], demanding higher capacity in the telecommunication infrastructure.

Today, in 2016, we are standing on the brink of the next telecommunication standard to be implemented (5G). It is estimated to be released to the market in 2020 with substantial improvements compared to LTE [14]. The first improvement is a massive increase in bandwidth when there are many simultaneous users. A drastically reduced latency (below 1 ms) is needed to support traffic safety and industrial infrastructure processes [36]. There is also an increasing demand for a reduction of energy consumption [21] so that it is environmentally friendly [37], while also making it possible to install network nodes in remote places [38] with scarce power supply.


[Figure 2.1, panel (a): traffic in ExaBytes per year, 2010–2019. Series: Voice Communication, Mobile Phone Data, Mobile Computer Data.]

(a) Voice and data traffic.

[Figure 2.1, panel (b): traffic in ExaBytes per year, 2010–2019. Series: Sum, Video, Audio, Web, File sharing, Social Networking Services.]

(b) Mobile application traffic.

Figure 2.1: World-wide market outlook for mobile traffic 2010–2019 [34]. Previously published in Paper H [62].

[Figure 2.2: Available Apps [M×#] and Downloaded Apps [B×#] over time, January 2008 to July 2014. Series: Available Apple Apps, Available Google Apps, Apple Downloads, Google Downloads.]

Figure 2.2: Download statistics for mobile phone applications [1, 9, 10, 59, 118]. Previously published in Paper H [62].


2.2 Telecommunication Services

The introduction of mobile phones quickly made voice communication the most important service. It was the natural way to extend the already existing wire-bound voice service into the mobile era. Voice services have now reached their peak from a capacity perspective [34], see Figure 2.1a. It is also apparent that data communication is rapidly increasing for both mobile phones and mobile computers. A report [35] by Ericsson Consumer Lab attributes the increased data usage to three main usage areas:

• Streaming services are quickly gaining acceptance among the population and include on-demand services such as music, pay-per-view TV and movies. Ericsson estimates that mobile video will be one of the most requested services in the coming years (2010–2019), see Figure 2.1b.

• Home appliance monitoring is increasing rapidly. Examples include water flood monitoring, heat and light control, refrigerator warning systems, coffee-machine refill sensors, and entry and leave detection.

• Data usage is expected to increase further at a rapid pace with the use of Information Communication Technology (ICT) devices such as mobile phones, watches, tablets and laptops. There is a common acceptance of using ICT devices for a large portion of daily activities [24] such as bank transactions, purchases and navigation. The use of devices is expected to further increase the utilization of telecommunication networks [129]. The extraordinary increase in the download rate of mobile apps indicates the acceptance of mobile usage among people, see Figure 2.2.

• Vehicle communication to support self-driving cars [36] and automated vehicle fleet management [37].

• Reduced network latency is needed to implement industrial infrastructure [36] operations over wireless networks.

The overall increase in geographical and population coverage, paired with new services such as the ones described above, will contribute to an enormous growth in mobile data traffic. In 2014, the geographical coverage was mainly focused on Europe and the USA, with Asia, mainly India and China, quickly catching up and surpassing them [37]. In 2015 there were approximately 7.4 (3.4)¹ billion mobile subscribers world-wide, and it is estimated that there will be 9.1 (6.4) billion subscriptions by 2021 [37]. Increasing both geographical and population coverage causes an unprecedented change in global mobile data usage, which is currently one of the biggest challenges for network operators.

¹ The number of advanced smartphone subscriptions in parentheses.


[Figure 2.3: an industrial system composed of many nodes connected through internal interfaces. The system connects to the Internet and to another industrial system through standardized interfaces.]

Figure 2.3: Industrial systems interact with surrounding systems using standardized interfaces. We have concentrated on node-internal characteristics and performance improvements for internal interfaces.

2.3 Industrial Systems

The system we have targeted and performed our experiments upon is an execution platform handling several generations of telecommunication standards. The platform was developed by Ericsson and is called Cello or Connectivity Packet Platform (CPP) [3, 76]. The platform is generic and supports many existing communication standards [28], including 3G and LTE. The telecommunication system we have investigated in this thesis shares similar properties with other large-scale industrial systems. We believe that other systems can also use our research results since they share a similar system structure and behavior. A simplified view of the telecommunication system we have investigated is shown in Figure 2.3. The system is distributed over many computers, denoted nodes. Internal nodes that implement a subset of the system functionality do not necessarily use standardized communication protocols. Performance improvements can, therefore, be achieved using proprietary protocol implementations. Standardized communication is, of course, necessary for external communication. We have defined [54] behavioral patterns that are common to industrial and telecommunication systems, for example:

• There is a low acceptance for system downtime.

• There are multiple concurrent hardware and software generations.

• The lifetime spans several decades.

• The size and system complexity cause long lead times when developing new functionality.

• There is substantial internal communication between nodes inside the industrial system. External connections often use standardized protocols, for example 3GPP for telecommunication systems, see Figure 2.3.

We have tried to generalize our research results as far as possible. In general, our research results should be applicable to many other systems sharing the same structure and behavior as the telecommunication system we have investigated. Some industrial systems are located in large server facilities, providing easy access for engineers and scientists. Other industrial systems are located in “friendly” places where a support engineer can access them and extract any information needed. Telecommunication systems are typically deployed in a different environment. Most network operators have their own infrastructure where the telecommunication nodes are located. Support and maintenance personnel are often employed by the operator. In the rare cases when the operator receives support from the equipment manufacturer, the manufacturer is not given full access to the nodes. Such restrictions make it difficult to monitor hardware characteristics for production nodes. Operators are traditionally very restrictive towards running diagnostics, test programs or monitoring tools that are not verified as production-level software.

Physical access restrictions also make it vital to have adequate error handling that gathers enough information when a fault occurs. It is not possible to retrieve additional troubleshooting information at a later time, meaning that all necessary information must be packaged together with the trouble report. The scenario of restricted node access is one aspect we have tried to address in the work leading up to this thesis. System developers have always demanded hardware characteristics measurements for production nodes, but it has been hard to obtain such information.


Figure 2.4: Many circuit boards (to the left) are interconnected to form a cabinet (to the right). Courtesy of Ericsson 2016.

Figure 2.5: Several interconnected cabinets construct a large-scale telecommunication system. One node in Figure 2.3 can vary in size from a single circuit board up to several cabinets. Courtesy of Ericsson 2016.


Figure 2.6: Complex lab test environment. Courtesy of Ericsson 2016.

2.4 Deploying Our Target System

The physical layout of a telecommunication system is governed by strict rules. One cabinet, to the right in Figure 2.4, consists of three vertically mounted sub-racks. Each sub-rack holds up to 20 circuit boards, illustrated to the left in Figure 2.4. In total, a cabinet holds up to approximately 20 × 3 = 60 circuit boards, depending on the desired configuration. Several cabinets can be connected to form a large-scale node, see Figure 2.5. Each circuit board can have several CPUs with tens of cores each. In total, the largest systems can consist of thousands of CPUs.

It is possible to deploy the system at several different levels, which is particularly useful for testing purposes. Running one board by itself provides the most basic level of system, used for low-level testing. A slightly bigger system is achieved when at least two boards are interconnected to form a small cluster. This level of system is useful for verifying cluster functionality. Much more complex testing scenarios can be formed by configuring larger nodes, such as the one in Figure 2.6. These types of nodes are seldom available for software design purposes since they are very costly. Large-scale nodes are mainly used when testing complex traffic scenarios and for performance-related verification.


[Figure 2.7: layered system view. Abstraction levels 1–5: Hardware (1), Generic Drivers and Target Specific Drivers (2), Operating System with Local Adjustments (3), Cluster Functions (4), Application (5). Functional parts: Hardware, Platform, Business Logic. Hardware variants A, B and C, ranging from Legacy to Latest.]

Figure 2.7: There are five abstraction levels (right) implementing the complete system, spanning from hardware to business logic (left). There are multiple hardware implementations (bottom), spanning from legacy single-core processors (1–A) to advanced multi-core processors (1–C). The same platform (2–4) and application (5) supports all hardware implementations.


2.5 System Details

We have followed the guidelines presented by Peterson [96] to contextualize our investigated system. We have investigated a large telecommunication system [13, 121] where each node in the system overview, Figure 2.3, is described internally as in Figure 2.7. From a high-level perspective there are five abstraction levels (to the right in the figure) that are structured in three functional parts (to the left in the figure).

The hardware (level 1) is implemented with custom-made circuit boards with varying performance capabilities depending on desired functionality and year of manufacture. The performance spans from older single-core boards up to boards with several CPUs, each utilizing tens of cores. Memory capacity varies from a few MB up to many GB per CPU.

Hardware variations put great emphasis on designing drivers (level 2) that must be generic as well as support target-specific functionality. The drivers must maintain a stable legacy interface towards the Operating System (OS). Application programming interface stability is vital in large-scale system development.

Third-party vendors deliver the Operating System (level 3), and depending on the use case it is either a specifically tailored proprietary real-time OS or Linux. The API functionality supplied by the OS must be both backward and forward compatible regardless of changes to the OS and the HW. Changes to low-level functionality should not propagate upwards to higher levels.

Cluster functionality is implemented (level 4) to support board interoperability, communication mechanisms, initial configuration, error management, error recovery and much more. The majority of the platform source code is implemented at this level. It is a complex part of the platform (levels 2–4) with complicated system functionality to maintain high availability. Sharing the platform between multiple hardware platforms is vital for the maintainability of the complete system.

The application runs on the uppermost level of the system (level 5). It is by far the largest portion of all layers when comparing computational capacity, memory footprint and any functional metric. There are several applications that each implement a complete telecommunication standard, such as GSM [120], WCDMA [56, p 1–10] or LTE [61]. Several high-level modeling languages have been used to model these applications in combination with low-level native code. The model is, in some cases, used to generate low-level programming code that is natively compiled for a specific target. The resulting code is complex to debug, especially from a performance perspective. One issue is the sheer size of the application, whose footprint is many gigabytes. Furthermore, it sometimes runs inside an interpreting/compiling virtual machine that obscures internal functionality. We have mainly worked with the platform parts in our studies (levels 2–4).

Maturity and Quality

The CPP telecommunication platform is a very mature product. Ericsson deployed the first test system in 1998 [122], and the first commercial system was released in 2001. It has been deployed worldwide and in 2015 it had a market share of 40% [102]. Nokia-Alcatel-Lucent (35%) and Huawei (20%) hold most of the remaining market. Being competitive is a key factor, and one of the most critical success factors for the resulting products is to keep development times as short as possible [48, 104, 117, 119, 121]. There are, in general, new hardware releases every 12–24 months to improve performance and/or consolidate functionality on fewer boards. Constant development activities using an agile development process result in continuous customer releases of new software versions.

There are strict quality requirements on telecommunication systems, similar to other large infrastructure systems. In particular, there is little acceptance for downtime. Typically, a system is required to supply 99.999% availability [80]. There are many simultaneously running generations of software and hardware in an interconnected system [54]. Multiple software and hardware revisions increase the complexity, especially when designing new functionality and debugging legacy problems.
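To make the 99.999% requirement concrete, the short sketch below converts an availability percentage into the downtime budget it allows. It assumes that the figure refers to availability measured over a 365-day year. The function name and the constant are ours and purely illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for a given availability."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


if __name__ == "__main__":
    for availability in (99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(availability)
        print(f"{availability}% -> {minutes:.2f} minutes of downtime per year")
```

Under these assumptions, 99.999% corresponds to a little over five minutes of accumulated downtime per year, which illustrates why software upgrades and fault recovery must work without taking the system out of service.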

Size and Type of System

To give an idea of the system size, we present the number of source lines of code (SLOC) [88]. The operating system is either a legacy third-party real-time OS (many million lines)² or Linux (15 million lines [84]). Running on top of the OS is a management layer providing cluster awareness and robustness. This layer consists of several million lines of code. The business logic is implemented using a model-based approach with large and complex models. It implements the complete communication standard for terminating traffic and handling call setup. This part of the system has cost several thousand man-years to develop, and the execution footprint is many GB.

² Business aspects prohibit us from disclosing the exact number of lines of code.

The system is an extensive embedded distributed system [113]. Each execution unit (board) runs a (soft) real-time OS. The boards are interconnected to form a large distributed system. Processes executing on one board can easily connect to processes executing on another. This interconnection poses many practical difficulties for standard OSes, for example the vast number of concurrently running processes. Furthermore, the system is designed to be both robust and scalable [49]. Customizing a telecommunication platform is a significant and challenging task. There is an operation and maintenance interface containing literally thousands of possible customization options. To further add to the overall complexity, it is also possible to make individual choices on how to connect each physical node in the network, see Figure 2.3.

Programming Languages

The system is built using many different programming paradigms. Drivers, abstraction level 2 in Figure 2.7, are implemented in either assembler or C. The operating system (OS), level 3, is also implemented in C, with assembler where high performance is needed. The rationale for selecting C as the main programming language is historical; knowledge (at the time) and execution efficiency were the main reasons for the decision. The OS, level 3, is supplied by a third-party company. For maintainability reasons, the surrounding code implements local OS adjustments. During our research, we have mainly implemented functionality in level 3.

Moving the abstraction further from the hardware changes the programming paradigm to support higher-level programming languages. For cluster functionality, level 4, several programming languages are used, such as C and C++ for legacy code. Depending on requirements, recent functional additions may be implemented in either Java or Erlang.

Various model-based approaches have been used when implementing the application layer, level 5. There are several applications implementing different parts of the telecommunication standards described in Section 2.1. The applications share the common execution environment provided by lower levels (1–4).

Hardware

A message processing system usually consists of two parts [108, p1], the control system and the data plane. The control system implements functionality for configuring and maintaining an operational system. The data plane is mainly concerned with payload handling, i.e. routing messages towards their destination. In our system, the control system HW is different from the data plane HW. The former is partially implemented with commercial off-the-shelf hardware while the latter uses tailored CPUs with specialized hardware support for packet handling. We have investigated the control system, which has a communication rate in the range of Gbit/s. The traffic terminates at the destination node where the CPU performs some message processing. We have not investigated the data plane.

The CPP system runs on more than 20 [13] different hardware platforms depending on the required performance. Low-power boards may use ARM CPUs while high-end circuit boards aimed at heavier calculations may use powerful PowerPC or x86 CPUs. Supporting multiple hardware architectures is a challenging task. Platform code from level 4 and upwards, Figure 2.7, must be HW agnostic to be easily portable and efficiently maintained. The same applies to the application software, level 5, executing on top of the platform.

Development Process

Developing an extensive infrastructure system places great demands on development tools and development flow. Tracking each code change must be possible. Customers require continuous improvements with little or no regard to the age of the hardware. It is hard to support systems with mixed hardware generations, and each software release must support several simultaneously running hardware generations. As an indication of the system size, thousands of skilled engineers [54] have spent decades implementing the system. The design organization is distributed over many geographic locations, requiring intense coordination.

Du ska alltid tänka: Jag är här på jorden denna enda gång! Jag kan aldrig komma hit igen! Och detsamma sa Sigfrid till sig själv: Tag vara på ditt liv! Akta det väl! Slarva inte bort det! För nu är det din stund på jorden!

My own translation:

You should always think: I am here on earth only once! I can never get back here again! Sigfrid said the same thing to himself: Take care of your life! Take care of it! Don’t waste it! For this is your moment on earth!

— Moberg V. [87]

3 Research Summary

During the work on this thesis, we have had a large-scale telecommunication system at our disposal. The needs of that particular system influenced us when we formulated the three research questions presented in Section 3.1. We have tried to express the research questions generically to ensure that they can address issues that are problematic for many other large-scale communication systems. We have also clarified some essential requirements related to each research question.

Closely related to the research questions are the delimitations we have made when performing our research. We have listed several significant delimitations in Section 3.2. Section 3.3 summarizes our achievements for the three research areas: monitoring, modeling and improving.

We have done several case studies during the monitoring and modeling phases to explore and describe our environment. In the performance improvement phase, we have adopted a more hands-on approach to solving a particular problem. We list the research method in Section 3.4. The chapter is concluded in Section 3.5 by listing validity threats.


3.1 Research Questions

The goal of our research is a systematic collection of characteristics data that can be used to model the hardware usage of the system and to find performance improvement areas. We present our three research questions in the following subsections.

3.1.1 System Monitoring

The telecommunication system we have focused on in this thesis is well understood and thoroughly tested from a functional perspective. The system has not reached the same level of maturity with respect to characteristics testing. New functionality is well defined and implemented according to detailed specifications by engineers with long experience in system development. However, the system complexity and the difficulty of monitoring behavior and hardware usage make it hard to understand what impact new software changes will have on the system behavior. This leads to the first research question:

Q1 How is it possible to monitor the hardware and software characteristics of a production system?

We refine research question Q1 with additional constraints so that it complies with general requirements for our industrial system:

• The probe effect must be negligible for the tool to be admitted to run in a production environment.

• Sustained monitoring times of several days or weeks are favored over high-frequency sampling.

• The monitoring mechanism must be easily adaptable to different systems and scenarios.

• We must have complete control over the source code to guarantee security and quality of service.

3.1.2 System Modeling

As a continuation of our work with characteristics monitoring, see Section 3.1.1, we understood that our monitoring method could be useful for other purposes than only characteristics monitoring. The design organization where we performed our tests had for a long time struggled with the problem of long lead times between platform development and characteristics testing. According to system architects, the long lead time results in difficult and time-consuming bug fixes. Early error detection is very difficult [4], but when successful it allows software errors to be corrected sooner than previously possible [116], leading to a reduction in development cost [17, 18]. This reasoning leads up to the second research question:

Q2 How to correctly model hardware characteristics of a production system based on data collected from production nodes?

• The use of the synthesize-mechanism should be fully automatic because we want to include it in the automated test framework.

• The synthesize-mechanism should be generic for most types of industrial systems, which should then apply to our telecommunication system.

3.1.3 Improving System Performance

Our first two research questions targeted characteristics monitoring of a production system and performance bottlenecks. The natural next step is to target performance improvements for the system. How can the extracted characteristics information be used to identify areas where the communication performance of our target system can be improved? This reasoning leads to the third and last research question:

Q3 How can the communication performance of a large production system be improved based on a model derived from hardware and software monitoring?

• Performance improvements must be fully automatic since network operators do not allow access to the system after deployment.

• Network congestion level and CPU utilization differ between deployment scenarios and also change over time due to alternating usage patterns. Any communication improvement method must automatically adapt to a changing environment, and it is therefore not possible to optimize it for a specific scenario.

• The system must handle multiple concurrent communication streams.

• Other co-located services, such as databases, Java machines, and SFTP, SSH and Telnet servers, should not be negatively affected by the communication improvements.

• Robustness and automaticity have higher priority than pure performance.

Improving the performance of our investigated system is the overall goal of this thesis. The target is to design an automatic mechanism that is robust and works well in an industrial environment.
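The requirement that any improvement must adapt automatically to a changing environment can be illustrated with a small selection routine. The sketch below is not the mechanism developed in this thesis. It is a minimal illustration that assumes three compressors from the Python standard library as stand-ins, a hypothetical cost model (measured compression time plus estimated transfer time at a given bandwidth), and hypothetical names such as `CANDIDATES`, `estimate_send_time` and `select_algorithm`. Decompression time at the receiver is ignored for brevity.

```python
import bz2
import lzma
import time
import zlib
from typing import Callable, Dict, Tuple

# Hypothetical candidate set; the identity entry means "send uncompressed".
CANDIDATES: Dict[str, Callable[[bytes], bytes]] = {
    "none": lambda data: data,
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}


def estimate_send_time(compress: Callable[[bytes], bytes],
                       sample: bytes, bandwidth_bps: float) -> float:
    """Estimate compression time plus transfer time for one sample message."""
    start = time.perf_counter()
    payload = compress(sample)
    cpu_seconds = time.perf_counter() - start
    transfer_seconds = len(payload) * 8 / bandwidth_bps
    return cpu_seconds + transfer_seconds


def select_algorithm(sample: bytes, bandwidth_bps: float) -> Tuple[str, float]:
    """Return the candidate with the lowest estimated time for this sample."""
    estimates = {
        name: estimate_send_time(fn, sample, bandwidth_bps)
        for name, fn in CANDIDATES.items()
    }
    best = min(estimates, key=estimates.get)
    return best, estimates[best]


if __name__ == "__main__":
    message = b"subscriber=12345;cell=678;state=connected;" * 1500
    for bandwidth_bps in (1e6, 1e9):
        name, seconds = select_algorithm(message, bandwidth_bps)
        print(f"{bandwidth_bps:.0e} bit/s -> {name} ({seconds * 1e3:.3f} ms)")
```

In a running system, such a selection would be repeated whenever the message content or the available bandwidth changes, so that no single deployment scenario has to be assumed in advance.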

3.2 Delimitations

We have chosen to limit the scope of our investigation to one particular industrial system, the telecommunication system to which we have privileged access. It is hard to gain access to other industrial systems since we need to modify the investigated system to perform our research. We have performed our experiments on one type of system, but we believe that the general methods are applicable to many other large-scale industrial systems, although the specific results are unique to our target system.

We have implemented and tested our achievements in a particular telecommunication system. By using one system for testing we have made some specific limitations to the research questions:

• We have not yet explicitly verified that characteristics testing in early design phases reduces the total system development time, but earlier research [17, 18, 116] strongly implies that it does.

• The telecommunication system we have investigated is IO-bound, and we have therefore mostly focused on modeling the low-level cache usage.

• We have opted to use a low sample frequency (1 Hz) that may be insufficient in some cases. We think that it is sufficient for our static model synthesis procedure, since the characteristics of our target system are relatively static and the resource usage changes slowly depending on end-user behavior. The reason for the low sample frequency was that operator requirements forced us to guarantee that the production environment would not experience any performance impact.

Most limitations stem from the fact that it is challenging to get customer consent to access production nodes. Customers are very concerned that any system change may affect stability, security or performance, and it is usually difficult to run any monitoring tool at a customer site.

We have merged the two steps of synthesizing a model and load replication into one concept that we call modeling because we did not make this distinction in our first papers. We will define the two steps in future publications.

3.3 Achievements

We have tried to observe the system as a whole [112] when improving the system performance, instead of diving into the details of each implementation. We have devised a systematic approach to finding performance problems in the early stages of the system development process. The following subsections describe our achievements.

3.3.1 System Monitoring

No system monitoring tool was available for our legacy OS in 2011, at the start of our investigation. At the same time there were some tools for Linux, for example Perf [25], that implemented a subset of our requirements. Because of the GPL license, it is politically difficult to port Perf to a proprietary OS. We opted to implement a tailored monitoring tool to support all requirements.

Our first contribution is a low-intrusive method for long-term monitoring of hardware (HW) characteristics in production environments. The characteristics profile is used to understand and investigate system behavior for different usage scenarios. It is vital to understand the behavior of the target system when trying to improve the performance [8].

We have implemented a tool called Charmon to monitor SW and HW characteristics of our target system. Charmon currently runs on two different operating systems, namely Enea’s OSE for the legacy system, and Linux for current and future platforms. We use Charmon for long-term monitoring, and it runs continuously while sampling various types of HW metrics through the Performance Monitor Counters (PMC) [33] at a fixed frequency. PMCs can also be denoted Performance Monitor Unit (PMU). A PMC is an HW-implemented event counter that, once programmed, autonomously counts the occurrences of the specified event. PMC events [44] that are common to many HW architectures are, for example, cache misses, RAM accesses and branch misses. There are also other types of events that are unique to each architecture, for instance events related to the execution pipeline and memory subsystems.

[Figure 3.1: Charmon measurement cycle over time. Steps: 1. IRQ, 2. Read, 3. Store measurement, 4. Get next counter set, 5. Write, 6. Return. Charmon reads and writes the Performance Monitor Counters (PMC) and stores measurements in a local database. Hardware PMC counter sets: 0 L1-I cache, 1 L1-D cache, 2 L2-Common. Software counters: Nr ctx switches, CPU load, Signal RTT, with actions nr:ctx_fcn, cpu_load_fcn, sig_rtt_fcn.]

Figure 3.1: HW Characteristics measurements using Charmon.

Charmon iterates over a list of PMC event sets, each of which is programmed into the PMC for a period. As shown in Figure 3.1, Charmon is awoken (1) by a timer interrupt at fixed intervals and sleeps in between. Charmon starts by reading (2) the resulting values for the previous HW counter set. Reading HW counters is, for the legacy OS, low-intrusive since it uses the mfspr assembly instruction. The PowerPC instruction set defines this particular instruction, but there are similar instructions for other architectures. On Linux, we use the Perf API [25] for reading HW metrics and our own implementation for reading SW metrics. The logical functionality, which is the major part of the Charmon application, is the same for both OSes. For both OSes, the measurements are stored (3) in a local database (DB). Next, the subsequent HW performance counter set is read (4) from a table and programmed (5) into the PMC registers. The PMC programming is similar to reading: mtspr for the legacy OS and the Perf API for Linux. It is also possible to add any other SW metrics, such as CPU load, context switches and signal turnaround time. Our implementation uses CPU load, which is supplied by the OS, and message round-trip time, which is supplied by the messaging application. Measurements for SW metrics are stored in the DB to provide a contextualized and time-stamped log of both HW and SW utilization. Charmon provides the possibility to have a mix of both low-level and high-level metrics, which is useful when debugging or investigating performance-related problems. After setting a new set (5) of HW counters, Charmon sleeps for a predefined interval, then restarts at step (1). When using multi-core CPUs we follow a similar procedure where Charmon simultaneously programs all cores with the same counter set.
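The sampling cycle described above can be sketched as follows. This is a minimal simulation under stated assumptions, not Charmon's actual implementation: `FakePmc`, `COUNTER_SETS`, `monitor` and the event names are hypothetical, the PMC is replaced by a stub that returns random numbers, and the local database is a plain Python list. On real hardware, the `read` and `program` steps would correspond to mfspr/mtspr or to the Perf API.

```python
import itertools
import random
import time
from typing import Dict, List, Tuple

# Hypothetical event sets; real event names depend on the CPU architecture.
COUNTER_SETS: List[Tuple[str, ...]] = [
    ("l1i_miss", "l1d_miss"),
    ("l2_miss", "branch_miss"),
    ("instructions", "cycles"),
]


class FakePmc:
    """Stand-in for the PMC registers; returns random counts, not real data."""

    def __init__(self) -> None:
        self.events: Tuple[str, ...] = ()

    def program(self, events: Tuple[str, ...]) -> None:
        self.events = events

    def read(self) -> Dict[str, int]:
        return {event: random.randint(0, 1_000_000) for event in self.events}


def monitor(pmc: FakePmc, samples: int, interval_s: float) -> List[dict]:
    """Rotate through COUNTER_SETS, measuring one set per sampling interval."""
    database: List[dict] = []
    sets = itertools.cycle(enumerate(COUNTER_SETS))
    set_id, events = next(sets)
    pmc.program(events)                  # initial programming before the loop
    for _ in range(samples):
        time.sleep(interval_s)           # (1) sleep until the timer fires
        values = pmc.read()              # (2) read the previous counter set
        database.append(                 # (3) store a time-stamped measurement
            {"time": time.time(), "set": set_id, "values": values}
        )
        set_id, events = next(sets)      # (4) get the next counter set
        pmc.program(events)              # (5) write it to the PMC
    return database                      # (6) return


if __name__ == "__main__":
    for row in monitor(FakePmc(), samples=6, interval_s=0.01):
        print(row["set"], row["values"])
```

Rotating the counter sets lets a small number of physical counters cover many events, at the cost of observing each event only during a fraction of the total monitoring time.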

Charmon implements two types of counter sets. The first and by far largest set uses HW PMC counters. The second set uses SW counters. We started by investigating the first set, which contains HW metrics describing the system performance, for example, instructions per second and cycles per second. By using these two metrics, it is possible to calculate Cycles Per Instruction (CPI), which to some extent describes the efficiency of the system [40]. The next area of interest is to understand where the system loses performance. Interviews with senior technicians within our organization indicate that the target system we are investigating is very IO-bound. Therefore, we implemented several counter sets to observe cache usage at every cache level. Using the CPI-metrics [40, 41] as a guideline, we implemented many more metrics, such as counters for Translation Lookaside Buffers (TLBs), branches, floating point units, and others. We know that we must be careful when using CPI-stacks since they can be misleading [5], especially for multi-core CPUs. We also include counters for all pipeline stages since that helps us understand where stalls could occur.
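As a small worked example of the CPI calculation mentioned above, the sketch below divides a cycle count by an instruction count taken over the same sampling interval. The function name and the sample values are illustrative assumptions, not measurements from the investigated system.

```python
def cycles_per_instruction(cycles, instructions):
    """Return CPI for counts sampled over the same interval."""
    if instructions <= 0:
        raise ValueError("instruction count must be positive")
    return cycles / instructions

# Illustrative values only: 3 million cycles spent on 2 million instructions.
cpi = cycles_per_instruction(cycles=3_000_000, instructions=2_000_000)
```

Because both counters cover the same interval, the same ratio is obtained from cycles per second and instructions per second.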

The second type of counter set utilized by Charmon is SW counters. An SW counter can in practice be anything that is countable, but the two primary software metrics monitored in Charmon are CPU-load, supplied by the OS, and message round-trip time.

Charmon has been designed and implemented to allow easy addition of more counter sets whenever the need arises. In the future, we expect memory subsystem metrics to be of specific interest because new HW architectures introduce more multi-level cache hierarchies, non-uniform memory accesses, and other complex techniques.


[Figure 3.2 (image not reproduced). Axes: Activity versus Time. Labels: Platform, Application, Characteristics Test, Delivery, Development phase, Test phase, "Late stage iterations between design and test", "Continuous testing throughout the design phase", "The characteristics test effort varies over time", "Lead-Time Reduction".]

a) Characteristics testing and corrections are iteratively performed at the end of the design process.

b) Performing characteristics testing throughout the development process shortens the total development time.

Figure 3.2: Two different processes for characteristics verification.1

The development organization at Ericsson uses Charmon as its preferred hardware usage monitoring tool, especially when running long-term performance evaluations on large-scale systems. The sampling rate of the monitoring tool is deliberately set low, typically 1 sample per second, so as not to affect node operation. This was an explicit design choice because network operators are very restrictive about any tool that has the slightest impact on the production environment. The sampling rate is configurable, and the tool supports much higher sampling frequencies.

1 The original figure has previously been published in Paper A [68].


3.3.2 System Modeling

Characteristics modeling of systems is difficult [72, 73, 91]. It is even harder to model large industrial systems running in a production environment. As a first obstacle, it is hard to get network operator consent to access the system. Second, the size of the system makes it problematic to evaluate the production system manually and synthesize a model. Some earlier approaches, such as that of Bell and John [11], are similar to ours but create the model manually. Alameldeen et al. [4] describe the need for cheap models of expensive large-scale systems.

We want to overcome the problems of modeling large industrial systems. The resulting model should be sufficiently accurate that it is possible to measure system performance in the early stages of the system development process. Figure 3.2a depicts the current, sequential development process with only a slight overlap between the development activities. Ending the development phase with characteristics testing often results in long lead times between the start of platform development and the finished system. According to Boehm [16], it is expensive to perform corrections late in the development process:

Finding and fixing software problems after delivery is often 100 times more expensive than finding and fixing it during the requirements and design phase.

For industrial systems with long lead times, it can be even harder to implement corrections because the responsible engineers may have moved on to other development projects. Using our approach, depicted in Figure 3.2b, we should be able to reduce the overall development time and cost by finding and fixing bugs earlier [17, 18, 116]. Reducing the total development time should also reduce the Time-To-Market, which is a critical factor in the highly competitive telecommunication market [104, 117, 119, 121]. Our idea is to move performance verification earlier in the development process. As depicted in Figure 3.2b, the characteristics testing effort is similar to that of the original design process but starts substantially earlier. We also try to avoid costly late-stage iterations between testers and developers by using a characteristics model of the production system before finalizing the complete system. For more details of our contributions, see Section 4.3. Our method makes it possible to start incremental performance verification at the earliest possible stage and continue throughout the development project. Such a change to the development process should shorten the lead time and provide earlier feedback to developers concerning potential performance problems.


[Figure 3.3 (image not reproduced). Step 1, Production Node: Application and Charmon on Platform Rev A; use Charmon to get hardware and software characteristics from the production system by reading the performance counters. Step 2, Test Node: Test Appl., Loadgen and Charmon on Platform Rev A; create a model on the test system, where the application is modeled by a test application and a load generator. Step 3, Test Node: Test Appl. and Loadgen on Platform Rev B; multiple platform SW releases can use the same model.]

Figure 3.3: Three steps in the modeling process. (1) Extract characteristics from production node; (2) Create a model of the production system using the original platform A; (3) Use the production node model for testing purposes when a new platform B is released.

The modeling process, shown in Figure 3.3, is straightforward and achieves the goals described above in three steps: 1) Extract the hardware characteristics from the target system while it is running in a production environment. 2) Create a hardware characteristics model on a test system that emulates the hardware usage of the target system, using the original platform A. 3) Use the model on a test system together with the functional test suite to detect any performance deviations in future software releases, denoted Platform revision B. We use the Charmon tool to get the hardware characteristics, see Section 3.3.1.

The modeling process is generic and can use any hardware metric. In this thesis, we have focused on modeling the cache usage, mainly because the system we are investigating is IO-bound and depends heavily on the cache and memory subsystem. Additionally, minimizing the number of metrics reduces the modeling complexity. The modeling application has been implemented using several PID-controllers, one for each modeled cache property (L1-Instruction, L1-Data, and L2-Data). The modeling procedure is fully automatic, and the model is created in 1–5 minutes.
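To illustrate the idea of one PID-controller per modeled cache property, the following sketch drives a toy load generator until its simulated miss ratio matches a target taken from the production node. Everything specific here is an assumption made for illustration: the linear `toy_miss_ratio` plant, the controller gains, the target values, and the intensity range are not taken from the thesis implementation.

```python
class PID:
    """Textbook discrete PID controller."""

    def __init__(self, kp, ki, kd=0.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def update(self, error, dt=1.0):
        self.integral += error * dt
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

def toy_miss_ratio(intensity, gain=0.002):
    """Toy plant (assumption): miss ratio grows linearly with load intensity."""
    return gain * intensity

def synthesize(targets, steps=60):
    """Tune one load-generator intensity per cache property."""
    controllers = {name: PID(kp=100.0, ki=200.0) for name in targets}
    intensity = {name: 0.0 for name in targets}
    for _ in range(steps):
        for name, target in targets.items():
            error = target - toy_miss_ratio(intensity[name])
            output = controllers[name].update(error)
            intensity[name] = min(max(output, 0.0), 100.0)  # clamp to valid range
    return {name: toy_miss_ratio(intensity[name]) for name in targets}

# Hypothetical target miss ratios, as if measured on a production node.
targets = {"L1-Instruction": 0.05, "L1-Data": 0.08, "L2-Data": 0.02}
achieved = synthesize(targets)
```

In a real system the three cache properties interact, so the independent toy plants above are a simplification; the sketch only shows the feedback structure.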


The characteristics model maps hardware characteristics from the production node to a smaller test node. The main goal is to give the test node a more realistic execution environment, similar to that of the production node. It is well known in the industry that functional test suites are good at testing the required functions but do not stress the system in the same way as the real production system. Running tests while stressing the system increases the ability to provoke congestion scenarios that may lead to the detection of hidden bugs.

3.3.3 System Improvement

The concluding part of our three-step procedure is to reduce the effects of the system bottleneck identified in Section 3.3.1 and modeled in Section 3.3.2. We have used the following method in our work with performance improvements:

1. Start the procedure by measuring production environment characteristics for the desired use-case.

2. Evaluate the characteristics metrics obtained in step (1). Modify the metrics set if additional characteristics information is needed and rerun step (1). If it is possible to find the problem by viewing the characteristics data, a software fix can be implemented directly, and we can jump to step (4).

3. If necessary, the production node can be modeled by a test node to reduce debugging lead time. We should start by implementing a tailored test suite that covers the desired services in the same way as the production system. By running the function test and the load generator simultaneously, the execution environment becomes similar to the production system.

4. Implement a fix for the performance bottleneck.

5. Rerun the test with the new software release using the same test setup as in step (3). Release the software if the test is satisfactory; otherwise, implement another solution in step (4). This step can also be used to select, out of several possible fixes, the one that provides the best characteristics result.

The general methodology outlined above can easily be adapted to fit any other test or debugging scenario. The most important task is to find a characteristics metric set that matches the scenario where the problem occurs. The metric set should capture the performance-related problem so that it is possible to detect metric changes when altering the code.


[Figure 3.4 (image not reproduced). Labels: Applications or Processes A and B; Communication API; Transparent Wrapper Layer with Message Compression and Message Decompression; OS Communication API; Operating System; Low-level Network Communication; functions snd_msg(), rcv_msg(), snd_msg()' and rcv_msg()'; "Compressed and Uncompressed Messages Simultaneously on the Network".]

Figure 3.4: Adaptive Online Message Compression.

In this thesis, we have selected one area in which to make performance improvements. In the next section, we outline how we found this particular improvement area and how we improved the performance.

3.3.4 Message Compression

We started our performance evaluation by using the procedure outlined in Section 3.3.3 on a particular telecommunication production system. When evaluating the characteristics measurements, two things were clear: 1) Network congestion was high. 2) The CPU load varied between moderate (∼25%) and high (100%) depending on the execution pattern of the services sharing the same hardware resources. Network communication is a well-known bottleneck [125] when computational capacity grows quicker than the bandwidth. From these data, we formed the hypothesis that it would be possible to increase the bandwidth under certain scenarios by compressing messages [53, 74, 89]. Compressing messages requires processing capacity. In many cases the node has spare capacity, but not always, depending on the additional services sharing the same execution environment. Furthermore, the data transferred by the system changes over time and depends on the deployment, making it difficult to decide manually what compression algorithm to use. As a response to the challenges above, we have modified an existing communication system by introducing a transparent message compression layer, see Figure 3.4. The legacy communication API containing snd_msg() and rcv_msg() is wrapped by snd_msg()' and rcv_msg()' to capture the message data. Our implementation of the API transparently compresses messages and then uses the standard communication API supplied by the OS.

[Figure 3.5 (image not reproduced). Three plots over Time with markers 1 to 6: system-wide CPU usage (0% to 100%) with the overload threshold and a region of system CPU overload; the ratio of messages compressed as a function of CPU usage (Ratio Compr./Uncompr.); and the average message latency (Low to High).]

Figure 3.5: Adaptive online message compression overload protection.

Our implementation of the adaptive online message compression mechanism consists of three parts: 1) A selection mechanism that finds the compression algorithm giving the lowest message RTT. 2) A system-level overload handler that keeps the CPU usage below a threshold. 3) A set of compression algorithms that may be used to compress messages.

Our online mechanism continuously compresses messages using all available algorithms. The distribution of compression algorithms used for a given set of messages depends on the measured round-trip time. The selection mechanism uses the best-performing algorithm for bulk compression. By continuously using all compression algorithms, we trade some performance for the advantage of automatically detecting changes in the communication stream. Automatic selection is a major advantage over manual selection because it can adapt to a changing environment.
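One way to picture the selection mechanism is the sketch below: most messages use the algorithm with the lowest smoothed round-trip time, while every tenth message probes one of the algorithms in turn so that changes in the stream are noticed. The class name, the probing interval, the smoothing factor, and the algorithm labels are assumptions made for this illustration; the thesis does not specify its selection policy at this level of detail.

```python
class AlgorithmSelector:
    """Pick the algorithm with the lowest smoothed RTT, probing the others."""

    def __init__(self, algorithms, probe_every=10, alpha=0.5):
        self.algorithms = list(algorithms)
        self.probe_every = probe_every   # assumption: probe every Nth message
        self.alpha = alpha               # assumption: smoothing factor
        self.rtt = {}                    # smoothed RTT per algorithm
        self.sent = 0
        self.next_probe = 0

    def best(self):
        return min(self.rtt, key=self.rtt.get)

    def choose(self):
        self.sent += 1
        untried = [a for a in self.algorithms if a not in self.rtt]
        if untried:
            return untried[0]            # measure every algorithm at least once
        if self.sent % self.probe_every == 0:
            algorithm = self.algorithms[self.next_probe % len(self.algorithms)]
            self.next_probe += 1
            return algorithm             # periodic probe, round-robin
        return self.best()               # bulk traffic uses the current best

    def record(self, algorithm, rtt):
        old = self.rtt.get(algorithm)
        if old is None:
            self.rtt[algorithm] = rtt
        else:
            self.rtt[algorithm] = (1 - self.alpha) * old + self.alpha * rtt

def simulate(selector, true_rtt, messages):
    """Feed the selector with RTTs from a fixed table and return its choice."""
    for _ in range(messages):
        algorithm = selector.choose()
        selector.record(algorithm, true_rtt[algorithm])
    return selector.best()

selector = AlgorithmSelector(["none", "fast", "dense"])
first = simulate(selector, {"none": 10.0, "fast": 7.0, "dense": 9.0}, 50)
# The message content changes, so the relative performance changes too.
second = simulate(selector, {"none": 10.0, "fast": 12.0, "dense": 6.0}, 200)
```

The probes cost some round-trip time on a minority of messages, which mirrors the trade-off described above between raw performance and the ability to detect a changed stream.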

The overload mechanism is illustrated in Figure 3.5. In the left part of the figure, the CPU load is well below (1) the threshold. A temporary load increase (2) surpasses the threshold, and our mechanism reduces the compression quota, resulting in fewer compressed messages. Message compression is resumed when the system load is reduced (3). Our overload mechanism also handles scenarios with partially compressed message streams. If the total CPU load caused by message compression and other services is above the threshold (4), our mechanism reduces the compression quota. A quota reduction may lead to a partially compressed message stream with compressed messages intermixed with uncompressed ones. Message compression is gradually resumed (5) as the total system load converges to the overload threshold (6).
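The quota adjustment can be sketched as a simple step controller. The threshold of 80%, the step size, and the function names are assumptions for illustration only; the sketch also treats the CPU load as an external input, whereas in the real system the load depends on how many messages are compressed.

```python
def adjust_quota(quota, cpu_load, threshold=0.8, step=0.1):
    """Return the new fraction of messages to compress (0.0 to 1.0)."""
    if cpu_load > threshold:
        quota -= step      # overload: compress fewer messages
    else:
        quota += step      # headroom: gradually resume compression
    return min(1.0, max(0.0, quota))

def replay(cpu_loads, quota=1.0):
    """Apply the controller to a sequence of CPU load samples."""
    history = []
    for load in cpu_loads:
        quota = adjust_quota(quota, load)
        history.append(round(quota, 2))
    return history

# A temporary overload (three samples above the threshold) followed by recovery.
history = replay([0.3, 0.9, 0.95, 0.9, 0.5, 0.5, 0.5])
```

A quota strictly between 0 and 1 corresponds to the partially compressed message stream described above.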

We have integrated eleven state-of-the-art compression algorithms in the telecommunication system we have investigated. Each algorithm has special compression properties. One algorithm can, for example, provide a high compression ratio but require much CPU time (LZMA [94]), which may be suitable for networks experiencing high congestion. Other algorithms may have special target areas, such as efficient text compression (SNAPPY [50]), or be fast at the cost of a lower compression ratio (QLZ [99]).

Our implementation requires the same communication API for both the sender and the receiver. The snd_msg()' function prepends a new header to each transmitted message. The header contains information on the compression algorithm used for compressing the particular message. When the receiving node calls the rcv_msg()' function, the API transparently decompresses the message. The API sends decompression statistics to each sender, making it possible to calculate the complete compression → transmission → decompression time.
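The header-based wrapping can be illustrated with a minimal sketch. The one-byte header layout, the algorithm ids, the function names `snd_msg_wrapped` and `rcv_msg_wrapped`, and the list used as a stand-in for the OS communication API are all assumptions; the codecs are Python standard-library stand-ins rather than the eleven algorithms integrated in the thesis, and the decompression statistics feedback is omitted.

```python
import bz2
import lzma
import zlib

# Hypothetical algorithm ids; id 0 means "sent uncompressed".
CODECS = {
    0: (lambda data: data, lambda data: data),
    1: (zlib.compress, zlib.decompress),
    2: (bz2.compress, bz2.decompress),
    3: (lzma.compress, lzma.decompress),
}

def snd_msg_wrapped(payload, algorithm_id, snd_msg):
    """Compress the payload and prepend a one-byte algorithm header."""
    compress, _ = CODECS[algorithm_id]
    snd_msg(bytes([algorithm_id]) + compress(payload))

def rcv_msg_wrapped(rcv_msg):
    """Read the header and transparently decompress the payload."""
    frame = rcv_msg()
    _, decompress = CODECS[frame[0]]
    return decompress(frame[1:])

wire = []                       # stand-in for the OS communication API
payload = b"status=ok;" * 100
for algorithm_id in CODECS:
    snd_msg_wrapped(payload, algorithm_id, wire.append)
received = [rcv_msg_wrapped(lambda: wire.pop(0)) for _ in range(len(CODECS))]
```

Because each message carries its own header, compressed and uncompressed messages can be mixed on the same connection, which is what the overload mechanism relies on.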

3.4 Research Methodology

We have used two qualitative methods [109] to obtain the research results presented in this thesis. We have used case studies [105, 106, 128] to explore and describe the investigated object. Similarly, we have used action research [83] when iteratively implementing improvements in an industrial environment. Our technical report A [68] includes previously published case studies reported in papers C [64], E [65] and I [63]. We used the case study method to get a better understanding of the system characteristics and to describe the system behavior. We were active participants in the design organization [95] during the research for paper B [67], which extends paper G [66]. Changing from an observer's view to a participatory role allowed us to switch method towards action research or an improvement-centric view. Table 3.1 relates each research question to publication and research method.

Research Question   Sect.   Publ.         Type of Question               Research Method
Q1                  3.1.1   A (C, E, I)   Exploratory/Descriptive        Case study
Q2                  3.1.2   A (C, E, I)   Exploratory/Descriptive        Case study
Q3                  3.1.3   B (G)         Problem Solving/Improvement    Action research

Table 3.1: Mapping the research questions to methods [101, 106].

3.5 Threats to Validity

We have performed all our research within the scope of an industrial environment. One of the benefits is that an industrial environment provides great insight into a real production system with customers and user scenarios. For example, the data we have used in Papers A and B have been gathered at customer sites running production systems with real traffic.

However, performing research in the scope of an industrial system introduces some difficulties normally not seen in pure academic environments. For example, it is hard to obtain the scientific rigor needed for academic publications, and it is challenging to publish raw data or implementation details due to corporate secrecy. It is also difficult to get extensive access to a production system for unrestricted testing. We have often been allowed only a very limited time frame for running our implementations on production nodes, with far-reaching limitations on capacity usage.

We have followed the guidelines by Runeson [106] and Wohlin [127] to categorize and describe how we have performed our experiments. We divide the validity discussion into subcategories described in the following sections.


3.5.1 Construct Validity

The construct validity [127, p108] describes the relationship between theory and observation, for example, whether our test design has captured the theoretical requirements.

Our test design for Paper A was to 1) extract characteristics data from a production system running at a customer site, 2) synthesize a model using a production test system, and 3) test the model using a customer bug fix. We tried to duplicate the real development process in our test design, which indicates that our early-stage performance benchmarking approach works in a real-world application. We have also assumed, according to earlier research by Boehm [17, 18] and Tassey [116], that it is economically beneficial to catch bugs in the initial phases of the development process. They state that the cost of fixing a bug increases with the distance between where the bug is introduced and where it is fixed. We have not tested this ourselves.

For Paper B, we sampled communication data from a production system, which a test system replayed. We also added synthetic data to force the test system into corner cases where our automatic compression algorithm mechanism temporarily selects compression algorithms other than the one used for production system messages. We also introduced synthetic overload to mimic an overload scenario.

3.5.2 Internal Validity

The internal validity [19] reflects the quality of the data analysis, in other words: Is the data we have gathered relevant for the outcome?

Before starting our research, several senior system architects stated that the system we are investigating is IO-bound and memory-bound, and that these are in effect the system bottlenecks. We empirically verified their statement by using our characteristics monitor to investigate the system characteristics. We have run our tests on one telecommunication system that is similar to other large-scale systems, see Section 2.3.

For Paper A, we synthesized a model for the L1-Instruction, L1-Data and L2-Data cache miss ratios. The model was then used to clone the production system hardware usage on a test node. We believe that the model is sufficiently accurate because we verified that the performance impact of a real bug fix is similar in the model environment and the production environment.

For Paper B, we have sampled production system message data for use with the automatic compression mechanism.


3.5.3 Conclusion Validity

The conclusion validity describes the relationship between the treatment and the outcome [127, p104]. We have identified some threats to the conclusion validity for Paper A. We have tested our monitoring and modeling method on one production system. Considering all types of systems, the statistical test set is too small, which forces us to limit our conclusions to the particular type of system we have investigated. Our target system has the highest market share (40% [102]) among telecommunication systems, which strengthens our belief that the research is representative for this particular system type. It is very difficult to get operator consent to verify our modeling mechanism on other manufacturers' equipment. However, we believe that the generic mechanism is usable for other types of systems with minor modifications. Adapting our modeling method requires the cache generator functions to be adapted to different cache structures.

We have implemented the automatic message compression mechanism described in Paper B on the same system as Paper A. We extracted our test data from a running production system, but we believe our mechanism is sufficiently generic to be used on any system. Migrating the mechanism would, of course, require minor modifications such as modifying the set of compression algorithms to suit the new system.

3.5.4 Method Applicability

We argue that the findings in Paper A provide great insight into the behavior of our investigated telecommunication system. Paper B has improved the messaging performance by applying selective compression when there are CPU cycles to spare. We have implemented our ideas in one particular telecommunication system but also published the results in the academic community. Our contributions are now part of the corporate product portfolio, which further indicates that our research is needed and valuable. We believe that our target system is representative of other large-scale systems, especially systems with extensive communication. We therefore think that the research results we present should apply to other systems.

Det här är inget man kan diskutera, jag har rätt och du har fel.2

My own translation:

This isn’t something to discuss, I am right and you are wrong.

— H. Rosling [103]

2 Hans Rosling exclaimed this during a Danish DR2 TV interview, when the program host said that the world is in chaos with regard to war and refugees.

4 Contributions

We present several contributions in this thesis. The contributions originate from several published papers that are consolidated in publications A (C, E, I) and B (G). The main contributions and their corresponding research questions (Q) are:

• A low-intrusive characteristics monitoring application (Q1).

• An automatic production node characteristics modeling mechanism (Q2).

• An automatic message compression mechanism reducing the message round-trip time (Q3).

The monitoring and modeling techniques are implemented and incorporated in the industrial production environment. The first of our tools provides monitoring functionality useful for understanding system behaviour. The industrial development environment is currently testing our modeling tool. Our automatic message compression mechanism is implemented and tested using production node data.

We continue this chapter by mapping each publication to the research questions in Section 4.1. In Section 4.2, we show the relationship between the two contributing publications (A, B) and the already published papers (C, E, I, G). The chapter concludes with a detailed description of the contributions in Paper A (Section 4.3) and Paper B (Section 4.4).


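[Figure 4.1 (image not reproduced). Three steps, 1 Monitor, 2 Model and 3 Improve, linked to research questions Q1, Q2 and Q3 and to Papers A, B, C, E, G and I.]

Figure 4.1: The three steps of this thesis mapped to the research questions (Q) and to our published papers (A, B, C, E, G, I).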

4.1 Publication Mapping

Figure 4.1 shows the procedure we have followed in this thesis together with each corresponding research question (Q) and publication. Starting with monitoring (Q1), the first step resulted in the contributions presented in Papers A, C, E, I. The second step, modeling (Q2), describes the ability to model a production system on smaller test systems, which is described in Papers A, E. In the third and final step, Papers B, G, we show how to improve the performance of messaging systems (Q3).


[Figure 4.2 (image not reproduced). Two areas: Characteristics Measurements and Load Replication (Papers A, C, E, I) and Adaptive Online Message Compression (Papers B, G).]

Figure 4.2: We present two major research areas in this thesis. The first area is described in Paper A, which incrementally embraces earlier publications C, E and I. The second area, published in Paper B, extends the previously published paper G.

4.2 Publication Hierarchy and Timeline

This thesis consists of two major areas, A and B in Figure 4.2. The first area relates to characteristics measurements and modeling, described in Section 7. The characteristics and modeling section is based on paper A [68], which in turn supersedes the sequence of earlier publications: papers C [64], E [65] and the technical report I [63].

[Figure 4.3 (image not reproduced). Timeline from 2012 to 2016 placing Papers A to I in the rows Monitoring, Modeling and Improvement, grouped under Information Communication Technology and Large-scale Systems.]

Figure 4.3: Publication order.

Adaptive online message compression is the second technical area described in this thesis, see Section 8. The message compression section is based on the journal Paper B [67], which in turn is an incremental extension of the published conference paper G [66]. The order of publication is depicted in Figure 4.3. Papers A and C cover several areas and are therefore presented in multiple rows.


4.3 Paper A (Based on Papers C, E and I)

We have addressed research questions Q1 (Section 3.1.1) and Q2 (Section 3.1.2) in:

Marcus Jagemar, Sigrid Eldh, Andreas Ermedahl, Bjorn Lisper and Gabor Andai. Automatic Load Synthesis for Performance Verification in Early Design Phases. Technical Report, 2016. [68]

Paper abstract

This paper describes a method to extract hardware characteristics and synthesize a model of a system running in a production environment. It is common to perform characteristics testing at the end of the development process, resulting in complex and costly bug fixes. Using our characteristics model makes it possible to implement continuous performance testing throughout the whole development process. Early characteristics testing is important because it improves system-development efficiency by shortening the total development time. The reduced lead time is an advantage in a competitive market, such as for the telecommunication system we have investigated in this paper. The modeling method is generic and supports any hardware metric. We have modeled the L1-instruction, L1-data and L2-data cache in our experiment. We have applied our method to a large-scale telecommunication system and verified that it is possible to detect performance-related problems during the design phase rather than at the end of the product development cycle.

My Contribution

I am the main author of Paper A and also the earlier papers on which it is based. Paper A expands Papers C, E, and I concerning characteristics measurements by providing a more detailed and theoretical explanation of the monitoring and modeling mechanisms. I have also supervised a master's thesis [7] investigating the possibility of using our monitoring-modeling mechanism to predict the performance impact of migrating a telecommunication system from a legacy OS to Linux.

I am also the main author of Papers C, E, and I. My contribution is the idea to model the hardware characteristics of production nodes on test nodes. I have also implemented all functionality in a telecommunication system.


4.4 Paper B (Based on Paper G)

We have addressed research question Q3 (Section 3.1.3) in:

Marcus Jagemar, Sigrid Eldh, Andreas Ermedahl, and Bjorn Lisper. Online Message Compression with Overload Protection. In press: Journal of Systems and Software, 2016. [67]

Paper abstract

In this paper, we show that it is possible to increase the message throughput of a large-scale industrial system by selectively compressing messages. The demand for new high-performance message processing systems conflicts with the cost effectiveness of legacy systems. The result is often a mixed environment with several concurrent system generations. Such a mixed environment does not allow a complete replacement of the communication backbone to provide the increased messaging performance. Thus, performance-enhancing software solutions are highly attractive. Our contribution is 1) an online compression mechanism that automatically selects the most appropriate compression algorithm to minimize the message round trip time; 2) a compression overload mechanism that ensures ample resources for other processes sharing the same CPU. We have integrated 11 well-known compression algorithms/configurations and tested them with production node traffic. In our target system, automatic message compression results in a 9.6% reduction of message round trip time. The selection procedure is fully automatic and does not require any manual intervention. The automatic behavior makes it particularly suitable for large systems where it is difficult to predict future system behavior.

My Contribution

I am the main author of Papers B and G [66]. My main contribution is the idea to selectively compress messages depending on network congestion level, message content, and current CPU usage. I have also implemented and evaluated the complete message compression selection mechanism in a telecommunication system. The journal article, Paper B, is an extension of conference Paper G [66]. I have extended Paper G by adding compression algorithms and thoroughly reworking the paper structure. I have also elaborated on a scenario where the content of a message stream changes.

Never, for the sake of peace and quiet, deny your own experience or convictions.

— D. Hammarskjold1

1 Secretary-General of the United Nations 1953-61, Nobel Peace Prize winner 1961.

5 Related Work

This chapter contextualizes our research by listing the most important related publications. We have divided the chapter into four subsections that each describe one aspect of our work. The sections cover system monitoring, system modeling, message and data compression, and adaptive compression. The last two sections relate to system performance improvements.

Section 5.1 describes the state of the art for system monitoring. We are in particular interested in continuous long-term observation and monitoring of large-scale industrial systems. Researchers have used system monitoring for a long time, and there are several research results of particular interest to us.

Second, in Section 5.2, we address system modeling. There have been several efforts to synthesize models of the system execution environment, using several methods.

The third and fourth sections of this chapter, Section 5.3 and Section 5.4, relate to performance improvements. When investigating performance bottlenecks, we have focused on the messaging performance. Section 5.3 details the compression algorithms we have evaluated during our work. The typical usage of a compression algorithm is to select the best algorithm statically using off-line evaluations. Since we have used an automatic mechanism that continuously evaluates and chooses the best algorithm, we have investigated adaptive compression in Section 5.4.


5.1 System Monitoring

Understanding hardware utilisation is, according to Eranian [33], normally a key factor when improving the performance of a computer system. As stated by Eyerman, Eeckhout and Karkhanis [40], it is hard to understand hardware metrics for modern superscalar [85] processors with out-of-order execution [57]. Eyerman, Eeckhout, and Karkhanis also state that it is difficult to get Cycles Per Instruction (CPI) information from hardware counters in what they denote “naive” processors. In their opinion, a CPU like the IBM Power 5 provides much better CPI measurement capabilities than “simple” CPUs since it considers the effect of superscalar pipelines. The main benefit comes from non-overlapping counters that provide a more accurate CPI calculation.

Allam, Eyerman, and Eeckhout [6] have implemented hardware functionality to measure the CPI stack directly, which improves monitoring capabilities even further. The scenario described in these papers closely matches the industrial environment we have worked within. The key issue is to extract vital hardware usage information without any probe effect.

In a subsequent publication, Eyerman and Eeckhout [39] question the validity of CPI comparisons when evaluating multi-core CPUs. The main problem is related to shared resources, where different applications and multiple cores compete for hardware resources. The authors suggest application-level metrics instead of low-level metrics. In our work, we have tried to bridge this gap by using system-level metrics such as signal round-trip time vs. cache usage. For our purposes of achieving a test environment, we have succeeded in this task. Eyerman and Michaud expand system monitoring into the multi-core era. In their paper [42] they express critical opinions and motivations on what type of metric can be used to measure the performance and characteristics of multi-core systems. We have selected message round-trip time as a system-level metric.

An early paper by Anderson, Berc and Dean [8] describes continuous system monitoring by implementing a low-intrusive (1%-3%) interrupt-triggered, sample-based mechanism to gather system-wide information. An interrupt is generated after a predefined number of events, which triggers a sample of the program counter as well as additional Performance Monitor Counter (PMC) register information. Our method reduces the probe effect further by less frequent PMC sampling. One of the standard works when measuring systems is the LM-Bench suite by McVoy and Staelin [86]. It is useful to measure and calculate cache and memory timings with a standard tool because it makes it easy to compare our tests with results from other, already existing platforms.
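The event-triggered sampling described for [8] can be illustrated with a purely software simulation. This is a sketch under stated assumptions: the trace of program-counter values, the function name and the sampling period are hypothetical, and a real implementation would rely on a hardware counter overflow interrupt rather than a Python loop.

```python
from collections import Counter


def sample_on_overflow(events, period):
    """Yield the program counter seen at every `period`-th event,
    mimicking a counter-overflow interrupt (simulation only)."""
    count = 0
    for pc in events:
        count += 1
        if count == period:
            yield pc
            count = 0


# Hypothetical trace: a four-instruction loop executed 250 times.
trace = [0x100, 0x104, 0x108, 0x10C] * 250

# A longer period means fewer samples and therefore a lower probe effect.
histogram = Counter(sample_on_overflow(trace, period=7))
```

In this toy trace the period is chosen to be coprime with the loop length so that the samples do not always land on the same instruction.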


5.2 System Modeling

System modeling is within the research community also known as system synthesis and dimensioning. A characteristics model can be created to act as a replacement for the original system with respect to some characteristics metrics. In our investigation, we have modeled the cache characteristics, but in theory any metric can be monitored and modeled. Such a tool can serve many purposes, such as an improved test environment, overload testing or other similar tasks.

Eeckhout et al. [29] describe several types of simulation and modeling techniques. The first is functional simulation, which tests the functional aspects of an application. Our legacy test application is a functional simulator. Specialized cache and predictor simulation tries to synthesize and model cache usage. According to Eeckhout et al., this technique may by itself be too simple to provide accurate results [29]. We have combined both these techniques to provide an execution environment that is similar to the production environment.

Bell and John [11] describe a similar approach to ours. They define a method to model an application by synthesising low-level parts of the target application and inserting inline assembly instructions into the synthesis code. They use the model to create a synthetic test application with similar characteristics to the original one. They have applied this method to the SPEC2000 benchmark suite, and the result shows that Instructions Per Cycle (IPC) differs on average 2.4% between the original applications and the model applications. Other metrics differ to a slightly higher degree than ours: I-cache by 8.6% and L2 cache misses to a large extent. We use a feedback control loop to model the system, while Bell and John [11] use statistical simulation with instruction traces for the synthesis procedure, as described by Nussbaum and Smith [91]. Bell and John’s synthesis procedure is semi-automatic, and an average of ten passes with some manual intervention is needed to tune the synthesis parameters. As a comparison, our feedback controller allows the synthesis procedure to converge with no user interaction at all. Additionally, the model in our case is described by configuration parameters fed to a generic application. For Bell and John, the configuration parameters are evaluated at compile time, which requires repeated application re-compilations. Another difference between our approaches is that we use system-level message round-trip time to detect any performance changes between releases, while Bell and John use low-level IPC. Joshi et al. [72] have formulated a concept called performance cloning that can be used to synthesize application characteristics from a proprietary application and create a model that mimics a similar behavior. In effect, Joshi et al. implement a similar methodology to that of Bell and John [11], but have refined the memory and branching model to be hardware agnostic.

Doucette and Fedorova [27] have implemented functionality similar to ours by generating cache misses to determine application sensitivity to different architectures. They try to forecast the application behavior when moving from one hardware platform to another without actually running the target application on the new hardware.

Our load generator steals hardware resources from other applications sharing the same common resource, which is the L1 instruction, L1 data and L2 cache in our implementation. The main idea is to starve the target application in the same way as done by the Cache Pirate [30] and the Bandwidth Bandit [31, 32]. In our work, we act on the core-private cache instead of a shared cache. Saavedra and Smith [107] explain how to understand the cache memory structure, including how to generate misses, determine associativity and more.

Alameldeen et al. [4] investigate server platforms and come to the interesting conclusion that it is quite difficult to create simulations of production systems. In their work they model the desired characteristics by using a tailored workload suite. Our approach is similar, but since they report difficulties in modeling a similar hardware-load profile, we use a feedback-based load generator to achieve an approximation of the production application.
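The idea of a feedback-based load generator can be sketched as a control loop that adjusts the generator intensity until a measured metric matches the value sampled in production. The sketch below is only an illustration: the controller type (a plain integral controller), the gain, and the function `measure_miss_rate`, which stands in for reading a hardware counter, are all assumptions and not the implementation used in this thesis.

```python
def measure_miss_rate(intensity):
    """Stand-in for reading a cache-miss metric on the test system.
    Hypothetical saturating response, used only to make the loop runnable."""
    return 0.02 + 0.3 * intensity / (1.0 + intensity)


def converge(target, gain=2.0, max_steps=200, tolerance=1e-4):
    """Adjust the load-generator intensity until the measured metric
    is within `tolerance` of the production target."""
    intensity = 0.0
    for step in range(max_steps):
        error = target - measure_miss_rate(intensity)
        if abs(error) < tolerance:
            return intensity, step
        # Integral action: accumulate the error into the control signal.
        intensity = max(0.0, intensity + gain * error)
    return intensity, max_steps


# Target value as it might be sampled on a production node (made up).
intensity, steps = converge(target=0.12)
```

No manual tuning pass is needed between iterations; the loop stops by itself once the error is small enough.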

Diniz et al. [26] have investigated the use of feedback mechanisms to improve program execution. They have modified a compiler to automatically accept performance feedback from test runs of the application. Their feedback method allows the compiler to exploit underlying mechanisms that cannot be determined using static methods at compile time. Lau et al. [82] extend earlier work by investigating how feedback control techniques can improve the performance of Java programs executing in a Virtual Machine (VM). Lau et al. state that there are plenty of known optimization techniques available, but it is hard for a VM to identify which one to use. Sometimes a function optimization decreases the performance rather than improving it because the data set varies over time.

Kim et al. [75] present a method to sample characteristics metrics such as L1D and branch misses. They use perf [25] for monitoring purposes. Their implementation gathers characteristics metrics that are used to mimic the dynamic behavior of the target system. We have chosen a much lower sampling frequency in our implementation because we want a lower probe effect.


5.3 Message and Data Compression

In this thesis we suggest using message compression as a means to improve communication performance. There are numerous compression techniques within the research community, many with radically different characteristics and implementations [74]. Many of these techniques have open source implementations, allowing them to be easily used and evaluated in research projects. We have investigated several compression algorithms for inclusion into the selection mechanism described in Section 5.4: LZFX [23], LZO [92], LZO-SAFE (a safe configuration of LZO), LZMA [94], LZW [124], BZ2 [110], LZ4 [22], FastLZ level 1 and 2 [55], Snappy [50], and QLZ [99]. Ringwelski, Renner, Reinhardt, Weigel and Turau [100] manually investigate a number of compression techniques with regard to compression ratio and computational resources. The manual labor of investigating compression techniques is the starting point for our investigation. We asked ourselves how it would be possible to evaluate and select the most appropriate compression algorithm without any off-line measurements and calculations.
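To make the kind of manual, off-line comparison mentioned above concrete, the following sketch measures compression ratio and compression time for a sample payload. It only uses the three codecs shipped with the Python standard library (zlib, bz2 and lzma), which is a subset chosen for convenience and not the set of eleven algorithms evaluated in this thesis; the payload is made up.

```python
import bz2
import lzma
import time
import zlib

# Standard-library codecs only; the other algorithms named in the text
# would need third-party bindings and are left out of this sketch.
CODECS = {"zlib": zlib.compress, "bz2": bz2.compress, "lzma": lzma.compress}


def profile(payload):
    """Return compression ratio and wall-clock compression time per codec."""
    results = {}
    for name, compress in CODECS.items():
        start = time.perf_counter()
        packed = compress(payload)
        elapsed = time.perf_counter() - start
        results[name] = {
            "ratio": len(payload) / len(packed),
            "seconds": elapsed,
        }
    return results


# Made-up, highly repetitive message content.
payload = b"example telecom signalling message " * 200
report = profile(payload)
```

Repeating such a measurement by hand for every message type and network condition is exactly the effort an automatic selection mechanism tries to avoid.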

During the work with this thesis we have initiated several MSc theses. One of them, Karlsson and Hansson [74], investigates typical characteristics of certain compression techniques, considering compression/decompression rate, compression ratio and resource usage. Their work also shows the suitability of each algorithm in the context of communication scenarios. Gutwin, Fedak, Watson, Dyck and Bell [53] describe transparent message compression for groupware. Their framework supports both text and serialized objects. Nicolae [89] investigates how to apply compression to cloud computing and storage. The small computational overhead is justified by the gain when using message compression for network communication. An interesting part of this paper is that it applies a practical implementation on the Grid5000 research network to obtain results. A significant reduction of network traffic was observed using both the LZO and BZIP2 compression algorithms.

In recent CPUs, there is a trend to include hardware support for compression. Intel has released hardware support for LZO [60], and the AHA company has implemented specific circuitry for 80 Gbps Gzip [46], Zlib [47] and LZS¹ compression, as well as separate cores for inclusion in customer ASIC/FPGA designs [2]. The benefit is that tailored hardware relieves the CPU of the heavy burden of compressing messages. In our investigation, this means that such an algorithm will have the special characteristics of a relatively low compression ratio but a very fast compression rate.

¹ Lempel-Ziv-Stac (LZS) compression technique. It was created by Stac Electronics and has been widely used for tape and disk compression.

5.4 Adaptive Compression

In traditional message compression systems, a compression algorithm is manually selected by a system designer. Our implementation automates the selection mechanism by continuously evaluating several compression algorithms. A message stream will, therefore, contain messages compressed with several algorithms when using our automatic compression selection mechanism. In this section, we list some publications related to automatic algorithm selection. Wiseman et al. [126] investigate lossless compression in communication systems. By running a micro-benchmark they pre-generate an off-line data representation of each supported compression algorithm. When sending messages, the most appropriate compression algorithm grade is selected according to the pre-generated algorithm characteristics.

There is a series of related papers written by a group of researchers interested in message compression. In the first paper by Knutsson and Bjorkman [78], closely followed by Knutsson [77], they have included Zlib [47] functionality in the Linux kernel. To support an adaptive scheme they monitor the length of the send queue: if it grows, the outgoing messages should be compressed more efficiently, and the opposite applies when the send-queue length is reduced. The series continues with Knutsson and Bjorkman [79], where they deduce that there is no performance gain when compressing messages smaller than 4 kB for the system they have investigated. The paper was published in 1999, so the measurements may have changed with the introduction of modern CPUs and communication equipment. Our communication mechanism handles all sizes of messages. If the compression gain is too small, our selection mechanism will stop message compression and send the messages uncompressed. To continue the series, Jeannot, Knutsson and Bjorkman [71] revisit the adaptive compression algorithm and expand it to be more generally available. Kernel-internal code is replaced with a more portable user-mode implementation. This is also the first release of the freely available and portable version of Adaptive Online Data Compression (AdOC) [70]. The final paper by Jeannot [69] related to this topic is a rebuttal to critical opinions by other researchers, for example the request for increased adaptiveness to CPU usage and network usage by Wiseman, Schwan, and Widener [126]. The reply is a new version of AdOC with improved handling of small messages, compression-send parallelism and other improvements.
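The queue-driven adaptation described for [78] can be sketched in a few lines. This is not the code of the cited papers or of AdOC: the function names, the one-step level change and the level bounds are assumptions made for illustration, and only the zlib calls are real library functions.

```python
import zlib


def next_level(level, prev_queue_len, queue_len, lowest=0, highest=9):
    """Raise the zlib level when the send queue grows and lower it
    when the queue shrinks (hypothetical one-step policy)."""
    if queue_len > prev_queue_len:
        return min(highest, level + 1)
    if queue_len < prev_queue_len:
        return max(lowest, level - 1)
    return level


def encode(message, level):
    """Compress with the given level; level 0 sends the message as-is."""
    return zlib.compress(message, level) if level > 0 else message


# Example: the queue grew from 10 to 20 pending messages.
level = next_level(3, prev_queue_len=10, queue_len=20)
wire_data = encode(b"status report " * 100, level)
```

A real protocol would also need a header that tells the receiver whether, and how, each message was compressed; that framing is omitted here.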

Sucu and Krintz [115] expand previous discoveries by creating a communication environment called the Adaptive Compression Environment (ACE). It aims to change the behavior of socket communication by introducing message compression. Only messages larger than 32 kB are affected, while smaller messages are sent uncompressed. Sucu and Krintz also extend their previous paper in a new paper [81], in which they add further compression algorithms such as Bzip2 [110], zlib [47] and LZO [92].

Pu and Singaravelu [98] discuss the trade-off between available bandwidth and the required computational capacity when compressing messages. They present a thorough investigation of fundamental compression schemes such as “compress-all messages” or “compress-none” with algorithms such as Bzip2 [110], gzip [46] and LZO [92]. Furthermore, they investigate the effects of mixed messaging (denoted Fine-Grained mixing), which they define as a messaging stream containing both compressed and uncompressed messages. Pu and Singaravelu’s message mixing technique is similar to our solution, but our mechanism automatically evaluates the full message transit time and selects the most appropriate compression algorithm. Gray, Peterson and Reiher [51] expand the earlier work by Pu and Singaravelu and discuss the problem of deciding when to compress messages.

A patent by Biederman [15] shows a general idea of receiving, compressing and sending messages. Biederman’s method is similar to ours but differs in the following aspects: 1) We adopt a feedback control loop to control the CPU resources spent on compression. Controlling the allocated CPU load allows other services to coexist with the message compression functionality. 2) Biederman uses different levels of compression, whereas we suggest simultaneously evaluating several compression algorithms to let the best algorithm dominate.

[Speaking of computers] But they are useless. They can only give you answers.

— P. Picasso

6 Conclusion and Future Work

We will, in this chapter, briefly answer the research questions asked in Section 3.1. We give our answers in the frame of the telecommunication system we have defined in Section 2.5 and delimited in Section 3.2. We believe that our research has led to some advances in the industrial use of automated synthesis of characteristics models. The software design and test organisation within Ericsson uses our characteristics monitoring tool when evaluating production system performance.

Our thoughts related to future work conclude this thesis. We try to answer questions like: What do we think is the future of the area we have investigated? What would we do to investigate further issues that were left behind due to time restrictions? The answer is to improve the modeling mechanism through several actions, such as a higher sampling frequency to support a dynamic model, modeling of more hardware metrics, and testing the modeling mechanism on many other types of systems.


6.1 Conclusion

In this thesis, we have formulated three research questions. To answer the first research question (Q1), Section 3.1.1, we have implemented a characteristics monitoring application that can observe large industrial systems in a production environment. The monitoring application periodically samples hardware characteristics information with low impact on the system behavior.

As a response to the second question (Q2), Section 3.1.2, we have devised a method to automate the synthesis process when modeling the hardware usage of a production system. We have tested our method by using hardware characteristics information sampled by our monitoring application to create an execution model on a much smaller and cheaper test system. The characteristics model makes it possible to run performance tests 1) without using the business logic of the production system and 2) much earlier in the development process. Both approaches aim to reduce the overall development time and cost.

To answer the third and final question (Q3), Section 3.1.3, we needed to understand how the performance of our target communication system could be improved. As a first step, we implemented a message compression mechanism that automatically selects the most appropriate compression algorithm depending on the network congestion level, message content, and CPU load. Our mechanism uses the compression algorithm that provides the shortest round-trip message time for bulk message transmission while continuously assessing the performance of all supported compression algorithms. We plan to continue using the monitor-model-improve methodology to find additional performance improvements.
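The selection principle summarized above can be illustrated with a simplified sketch that picks the codec with the lowest estimated transit time for a sample message. Several simplifications are assumed: the transit time is estimated as compression time plus transfer time plus decompression time instead of being measured as a real round trip, the bandwidth is passed in as a parameter, CPU load is ignored, and only standard-library codecs are used.

```python
import bz2
import lzma
import time
import zlib

# (compress, decompress) pairs; "none" sends the message unchanged.
CODECS = {
    "none": (lambda data: data, lambda data: data),
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def estimated_transit(payload, codec, bandwidth_bytes_per_s):
    """Estimate transit time: compress + transfer + decompress (seconds)."""
    compress, decompress = CODECS[codec]
    start = time.perf_counter()
    packed = compress(payload)
    t_compress = time.perf_counter() - start
    start = time.perf_counter()
    decompress(packed)
    t_decompress = time.perf_counter() - start
    return t_compress + len(packed) / bandwidth_bytes_per_s + t_decompress


def pick_codec(payload, bandwidth_bytes_per_s):
    """Return the codec name with the lowest estimated transit time."""
    return min(
        CODECS,
        key=lambda name: estimated_transit(payload, name, bandwidth_bytes_per_s),
    )


# Made-up message and a heavily congested link (100 bytes per second).
message = b"example telecom signalling message " * 200
best = pick_codec(message, bandwidth_bytes_per_s=100.0)
```

Re-running `pick_codec` when the message content or the available bandwidth changes mirrors the continuous re-evaluation described in the text, where sending uncompressed can win on a fast, uncongested link.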

We have implemented and tested all of our research results using a telecommunication system. We believe that our generic methods are usable for other systems, although we have only tested them on one particular system. The corporate test department currently uses the monitoring and modeling tool for early-stage performance testing. We are currently evaluating the automatic message compression mechanism for possible inclusion in the communication subsystem.

6.2 Future Work

Every researcher knows that it is difficult to delimit one’s work when performing research. Plunging deeper into a problem and investigating it more thoroughly is always interesting and gratifying, but as all researchers know, there must always be an end to the study. In this section, we list some areas that we would like to investigate further, given the time and resources.

We think that adding model support for dynamic behavior would make the model more accurate. Currently, we use the mean value of a metric when creating the model, which is sufficient for our current purposes. Adding the possibility to model dynamic memory usage would make it possible to investigate additional areas, for example undesirable memory bus side-effects caused by data bursts. It would also be useful to add additional hardware metrics to the model, such as branch misses, last-level caches, and TLB misses. Additionally, we think that it is possible to further increase the usage of the monitoring functionality, which is currently limited to the test organization. Our belief is that the design organization would also benefit from an early understanding of the system behavior and performance bottlenecks.
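A tiny example shows why a mean-based model cannot capture bursts. The two traces below are invented; they have the same mean but very different peaks, so a model built only from the mean would treat them as identical.

```python
from statistics import mean

# Hypothetical samples of a hardware metric over eight intervals.
steady = [50, 50, 50, 50, 50, 50, 50, 50]
bursty = [10, 10, 10, 170, 10, 10, 10, 170]


def summarize(trace):
    """Return the mean used by a static model and the peak it hides."""
    return {"mean": mean(trace), "peak": max(trace)}


steady_summary = summarize(steady)
bursty_summary = summarize(bursty)
```

A dynamic model would need to retain more than one number per metric, for example a time series or a distribution, at the cost of a higher sampling frequency.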

We have assumed that finding performance-related bugs in the initial phases of the development process will reduce the total development time. We would like to perform a study to investigate whether our assumptions are correct. We would also like to implement and test our methods on a wider range of systems to verify that they support varying types of systems.

Another suggestion is to add additional features to the automatic message compression mechanism. The natural extension of the current mechanism is to add additional compression techniques. It would be interesting to evaluate hardware-supported compression algorithms included in recent processors. It would also be interesting to add machine learning techniques to predict recurring changes to the message stream, and consequently also predict the compression algorithm to use.

When writing this thesis, we have concluded that there is an infinite demand for performance investigations and improvements within the industry. We believe that there will continue to be a demand for more advanced monitoring techniques, allowing system engineers to understand and draw conclusions regarding system characteristics and performance. The continuous need for increased bandwidth is promising for the development of more advanced and efficient adaptive message compression techniques. We estimate that modern CPUs will increasingly support hardware acceleration for compression algorithms.

Bibliography

[1] 148Apps. Count of Active Applications in the App Store. http://148apps.biz/app-store-metrics/?mpage=appcount, 2014. [Accessed 2015-03-04].

[2] AHA. AHA378 Gzip/Zlib/LZS compression/decompression hardware. http://www.aha.com, 2014. [Accessed 2015-03-04].

[3] Goran Ahlforn and Erik Ornulf. Ericsson’s family of carrier-class technologies. Technical Report 4, Ericsson, 2001.

[4] Alaa R. Alameldeen, Milo Martin, Carl J. Mauer, Kevin E. Moore, and Min Xu. Simulating a $2M Commercial Server on a $2K PC. IEEE Computer, 36(2):50–57, 2003.

[5] Alaa R. Alameldeen and David A. Wood. IPC considered harmful for multiprocessor workloads. IEEE Micro, pages 8–17, 2006.

[6] Osman Allam, Stijn Eyerman, and Lieven Eeckhout. An efficient CPI stack counter architecture for superscalar processors. Proceedings of the Great Lakes Symposium on VLSI, pages 55–58, 2012.

[7] Gabor Andai. Performance monitoring on high-end general processing boards. Master thesis, KTH Royal Institute of Technology, 2014.

[8] Jennifer M. Anderson, Lance M. Berc, Jeffrey Dean, Sanjay Ghemawat, Monika R. Henzinger, Shun-Tak A. Leung, Richard L. Sites, Mark T. Vandevoorde, Carl A. Waldspurger, and William E. Weihl. Continuous profiling: where have all the cycles gone? ACM SIGOPS, 15(4):357–390, 1997.

[9] Apple. Apple’s Revolutionary App Store Downloads Top One Billion in Just Nine Months. www.apple.com, 2009. [Accessed 2015-03-04].


[10] Apple. App Store Tops 40 Billion Downloads with Almost Half in 2012. www.apple.com, 2013. [Accessed 2015-03-04].

[11] Robert H. Bell and Lizy K. John. Improved automatic testcase synthesis for performance model validation. In Proceedings of International Conference on Supercomputing, pages 111–120. 2005.

[12] S. Bennett. Nicolas Minorsky and the Automatic Steering of Ships. IEEE Control Systems Magazine, 4(4):10–15, 1984.

[13] Mikael Bergqvist, Jakob Engblom, Mikael Patel, and Lars Lundegard. Some experience from the development of a simulator for a telecom cluster (CPPemu). In Proceedings of the International Association of Science and Technology for Development, pages 13–21. 2006.

[14] Jo Best. The race to 5G: Inside the fight for the future of mobile as we know it. TechRepublic. URL http://www.techrepublic.com/article/does-the-world-really-need-5g/.

[15] Daniel Biederman. Communication system with content-based data compression. US Patent 7069342, 2001.

[16] Barry Boehm. The Incremental Commitment Spiral Model: Principles and Practices for Successful Systems and Software. Technical report, 2013.

[17] Barry Boehm and Victor R. Basili. Software Defect Reduction Top 10 List. Computer Journal, 34(1):135–137, 2001.

[18] Barry Boehm and Philip N. Papaccio. Understanding and controlling software costs. IEEE Transactions on Software Engineering, 14(10):1462–1477, 1988.

[19] Marilynn B. Brewer. Research design and issues of validity. Handbook of research methods in social and personality psychology. 2000.

[20] Mary Brydon-Miller, Davydd Greenwood, and Patricia Maguire. Why Action Research? Action Research, 1(1):9–28, July 2003.

[21] R. L. G. Cavalcante, S. Stanczak, M. Schubert, A. Eisenblatter, and U. Turke. Toward Energy-Efficient 5G Wireless Communications Technologies. IEEE Signal Processing Magazine, accepted f(October):24–34, 2014.


[22] Yann Collet. lz4 Data Compression Library. http://fastcompression.blogspot.se/p/lz4.html, 2013. [Accessed 2015-03-04].

[23] Andrew Collette. LZFX Data Compression Library. http://code.google.com/p/lzfx/, 2013. [Accessed 2015-03-28].

[24] Ericsson Consumerlab. Hot Consumer Trends 2016. Technical Report December 2015, Ericsson Consumer Lab, 2016.

[25] Arnaldo Carvalho de Melo. The New Linux ’perf’ Tools. Linux Kongress, 2010.

[26] Pedro C. Diniz and Martin C. Rinard. Dynamic feedback: An Effective Technique for Adaptive Computing. ACM SIGPLAN Notices, 32(5):71–84, May 1997.

[27] Daniel Doucette and Alexandra Fedorova. Base vectors: A potential technique for microarchitectural classification of applications. In Proceedings of the Workshop on the Interaction between Operating Systems and Computer Architecture. 2007.

[28] Denis Duka. Connectivity packet platform in the GSM/WCDMA network. Proceedings Elmar - International Symposium Electronics in Marine, pages 163–166, 2006.

[29] Lieven Eeckhout, Sebastien Nussbaum, James E. Smith, and Koen De Bosschere. Statistical Simulation: Adding Efficiency to the Computer Designer’s Toolbox. IEEE Micro, 23(5):26–38, 2003.

[30] David Eklov, Nikos Nikoleris, David Black-Schaffer, and Erik Hagersten. Cache Pirating: Measuring the Curse of the Shared Cache. In Proceedings of International Conference on Parallel Processing, pages 165–175. September 2011.

[31] David Eklov, Nikos Nikoleris, David Black-Schaffer, and Erik Hagersten. Bandwidth bandit: Understanding memory contention. ISPASS 2012 - IEEE International Symposium on Performance Analysis of Systems and Software, pages 116–117, 2012.

[32] David Eklov, Nikos Nikoleris, David Black-Schaffer, and Erik Hagersten. Bandwidth Bandit: Quantitative characterization of memory contention. Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization, CGO 2013, 2013.


[33] Stephane Eranian. What can performance counters do for memory subsystem analysis? In Proceedings of the ACM SIGPLAN workshop on Memory Systems Performance and Correctness, pages 26–30. 2008.

[34] Ericsson. Market Outlook. Technical report, Ericsson, 2013.

[35] Ericsson. Ericsson Consumer Lab: 10 Hot Consumer Trends 2014. Technical report, Ericsson Consumer Lab, 2014.

[36] Ericsson. 5G Radio Access - Technology and Capabilities. Technical Report February, Ericsson White Paper, 2015.

[37] Ericsson. Ericsson Mobility Report November 2015. Technical Report November, Ericsson Consumer Lab, 2015.

[38] Ericsson AB. 5G Energy Performance - Key Technologies and Design Principles. Technical Report April, Ericsson White Paper, 2015.

[39] Stijn Eyerman and Lieven Eeckhout. System-level performance metrics for multiprogram workloads. IEEE Micro, 28(3):42–53, 2008.

[40] Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis, and James E. Smith. A Top-Down Approach to Architecting CPI Component Performance Counters. IEEE Micro, 27(1):84–93, 2007.

[41] Stijn Eyerman, K. Hoste, and Lieven Eeckhout. Mechanistic-empirical processor performance modeling for constructing CPI stacks on real hardware. In International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 216–226. 2011.

[42] Stijn Eyerman and Pierre Michaud. Defining metrics for multicore throughput on multiprogrammed workloads. Technical report, Ghent University - Team ALF, 2013.

[43] Colin Fidge. Fundamentals of distributed system observation. IEEE Software, 13(6), 1996.

[44] Freescale. Advanced QorIQ Debug and Performance Monitoring. Rev. d edition, 2011.

[45] Anders Furuskar, Jonas Naslund, and Hakan Olofsson. Edge - enhanced data rates for GSM and TDMA/136 evolution. Ericsson Review (English Edition), 76(1):28–37, 1999.


[46] Jean-loup Gailly and Mark Adler. gzip. http://gzip.org, 2014. [Accessed 2015-03-04].

[47] Jean-loup Gailly and Mark Adler. zlib. http://www.zlib.net/, 2014. [Accessed 2015-03-04].

[48] Gartner. High Tech and Telecom Providers. http://www.gartner.com/technology/consulting/high-tech-telecom-providers.jsp, 2012. [Accessed 2015-03-04].

[49] Adithya Gollapudi and Arvind Ojha. Comparing Applicability of Test Design Techniques for Telecom systems. Ph.D. thesis, Malardalen University, 2009.

[50] Google. Snappy Compression Library. https://code.google.com/p/snappy, 2013. [Accessed 2015-03-28].

[51] Michael Gray, Peter Peterson, and Peter Reiher. Scaling Down Off-The-Shelf Data Compression: Backwards-Compatible Fine-Grain Mixing. In Proceedings of Distributed Computing Systems, pages 112–121. 2012.

[52] GSM World. GSM Market Data Report. Technical report, 2009.

[53] Carl Gutwin, Christopher Fedak, Mark Watson, Jeff Dyck, and Tim Bell. Improving network efficiency in real-time groupware with general message compression. In Proceedings of Conference on Computer Supported Cooperative Work, pages 119–128. ACM Press, New York, USA, 2006.

[54] Daniel Hallmans, Marcus Jagemar, Stig Larsson, and Thomas Nolte. Identifying Evolution Problems for Large Long Term Industrial Evolution Systems. In Proceedings of IEEE International Workshop on Industrial Experience in Embedded Systems Design (COMPSAC14). Vasteras, 2014.

[55] Ariya Hidayat. FastLZ. http://fastlz.org/, 2014. [Accessed 2015-03-28].

[56] Harri Holma and Antti Toskala. WCDMA for UMTS, 3rd edition. John Wiley & Sons Ltd., 2004.


[57] Wen-Mei Hwu and Yale N. Patt. HPSm, a high performance restricted data flow architecture having minimal functionality. ACM SIGARCH Computer Architecture News, 14(2):297–306, 1986.

[58] Rafia Inam, Mikael Sjodin, and Marcus Jagemar. Bandwidth Measurement using Performance Counters for Predictable Multicore Software. In Proceedings of the International Conference on Emerging Technologies and Factory Automation (ETFA12). 2012.

[59] Nathan Ingraham. Apple by the numbers: 30 billion app downloads, 650,000 apps available in the App Store. http://www.theverge.com/2012/6/11/3077792/apple-wwdc-2012-stats-ios-mac-growth, 2012. [Accessed 2015-03-04].

[60] Intel. LZO hardware compression. http://software.intel.com/en-us/articles/lzo-data-compression-support-in-intel-ipp, 2013. [Accessed 2015-03-04].

[61] Anand Padmanabha Iyer, Li Erran Li, and Ion Stoica. CellIQ: Real-Time Cellular Network Analytics at Scale. In NSDI, pages 218–234. 2015.

[62] Marcus Jagemar and Gordana Dodig-Crnkovic. Cognitively Sustainable ICT with Ubiquitous Mobile Services - Challenges and Opportunities. In Proceedings of International Conference on Software Engineering (ICSE15). 2015.

[63] Marcus Jagemar, Sigrid Eldh, Andreas Ermedahl, and Bjorn Lisper. Technical Report: Feedback-Based Generation of Hardware Characteristics. Technical report, Malardalen University, 2012.

[64] Marcus Jagemar, Sigrid Eldh, Andreas Ermedahl, and Bjorn Lisper. Towards Feedback-Based Generation of Hardware Characteristics. In Proceedings of the 7th International Workshop on Feedback Computing. 2012.

[65] Marcus Jagemar, Sigrid Eldh, Andreas Ermedahl, and Bjorn Lisper. Automatic Multi-Core Cache Characteristics Modelling. In Proceedings of the Swedish Workshop on Multicore Computing (MCC13), page 4. Halmstad, 2013.


[66] Marcus Jagemar, Sigrid Eldh, Andreas Ermedahl, and Bjorn Lisper. Adaptive Online Feedback Controlled Message Compression. In Proceedings of Computers, Software and Applications Conference (COMPSAC14). Vasteras, 2014.

[67] Marcus Jagemar, Sigrid Eldh, Andreas Ermedahl, and Bjorn Lisper. Automatic Message Compression with Overload Protection. In press: The Journal of Systems and Software, 2016.

[68] Marcus Jagemar, Sigrid Eldh, Andreas Ermedahl, Bjorn Lisper, and Gabor Andai. Automatic Load Synthesis for Performance Verification in Early Design Phases. Technical report, Malardalen University, 2016.

[69] Emmanuel Jeannot. Improving Middleware Performance with AdOC: an Adaptive Online Compression Library for Data Transfer. In Proceedings of International Parallel and Distributed Processing Symposium, page 70. 2005.

[70] Emmanuel Jeannot. ADOC homepage. http://www.labri.fr/perso/ejeannot/adoc/adoc.html, 2012. [Accessed 2015-03-04].

[71] Emmanuel Jeannot, Bjorn Knutsson, and Mats Bjorkman. Adaptive online data compression. In IEEE High Performance Distributed Computing. 2002.

[72] Ajay Joshi, Lieven Eeckhout, Robert H. Bell, and Lizy K. John. Distilling the essence of proprietary workloads into miniature benchmarks. ACM Transactions on Architecture and Code Optimization, 5(2):1–33, aug 2008.

[73] Ajay Joshi, Lieven Eeckhout, Robert H Bell Jr, and Lizy John. Performance Cloning: A Technique for Disseminating Proprietary Applications as Benchmarks. International Symposium on Workload Characterization, 2006.

[74] Stefan Karlsson and Erik Hansson. Lossless Message Compression. Bachelor thesis, Malardalen University, 2013.

[75] Keunsoo Kim, Changmin Lee, Jung Ho Jung, and Won Woo Ro. Workload synthesis: Generating benchmark workloads from statistical execution profile. In 2014 IEEE International Symposium on Workload Characterization (IISWC), pages 120–129. 2014.


[76] Lars-orjan Kling, Ake Lindholm, Lars Marklund, and Gunnar B Nilsson. CPP Cello packet platform. Technical Report 2, Ericsson Review, 2002.

[77] Bjorn Knutsson. Increasing Communication Performance via Adaptive Compression. In Proceedings of the Seventh Swedish Workshop on Computer Systems Architecture. Gothenburg, Sweden, 1998.

[78] Bjorn Knutsson and Mats Bjorkman. Trading Computation for Communication by End-to-End Compression. In Proceedings of the International Workshop on High Performance Protocol Architectures. 1997.

[79] Bjorn Knutsson and Mats Bjorkman. Adaptive end-to-end compression for variable-bandwidth communication. Computer Networks, 31(7):767–779, apr 1999.

[80] N Krajnovic. The design of a highly available enterprise ip telephony network for the power utility of Serbia company. Communications Magazine, IEEE, 47(4):118–122, apr 2009.

[81] Chandra Krintz and Sezgin Sucu. Adaptive on-the-fly compression. IEEE Transactions on Parallel and Distributed Systems, 17(1):15–24, jan 2006.

[82] Jeremy Lau, Matthew Arnold, Michael Hind, and Brad Calder. Online performance auditing. In Proceedings of ACM SIGPLAN Conference on Programming language design and implementation, pages 239–251. jun 2006.

[83] Kurt Lewin. Action research and minority problems. Journal of Social Issues, 2(4):34–46, 1946.

[84] Linuxcounter. Lines of code of the Linux Kernel Versions. URL https://www.linuxcounter.net/statistics/kernel.

[85] Steven McGeady, Randy Steck, Glenn Hinton, and Atiq Bajwa. Performance enhancements in the superscalar i960MM embedded microprocessor. COMPCON Spring ’91 Digest of Papers, 1991.

[86] Larry Mcvoy and Carl Staelin. lmbench: Portable Tools for Performance Analysis. In Proceedings of the USENIX Annual Technical Conference, pages 279–294. 1996.


[87] Vilhelm Moberg. Din stund pa jorden. 1963.

[88] Vu Nguyen, Sophia Deeds-Rubin, Thomas Tan, and Barry Boehm. A SLOC Counting Standard. pages 1–15, 2007.

[89] Bogdan Nicolae. On the benefits of transparent compression for cost-effective cloud data storage. In Proceedings of Transactions on Large Scale Data and Knowledge Centered Systems, volume 3, pages 167–184. 2011.

[90] Nokia Siemens Networks. Long Term HSPA Evolution: Mobile Broadband Evolution beyond 3GPP Release 10 HSPA has Transformed Mobile Networks. Technical report, Nokia Siemens Networks, 2010.

[91] S. Nussbaum and J.E. Smith. Modeling superscalar processors via statistical simulation. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, pages 15–24. 2001.

[92] Markus Oberhumer. LZO (Lempel-Ziv-Oberhumer) Data Compression Library. http://www.oberhumer.com/opensource/lzo/, 2013. [Accessed 2015-03-04].

[93] Oxford. English Dictionary (online), 2014.

[94] Igor Pavlov. LZMA Software Development Kit. http://www.7-zip.org/sdk.html, 2013. [Accessed 2015-03-27].

[95] Kai Petersen, C Gencel, and N Asghari. Action research as a model for industry-academia collaboration in the software engineering context. In Proceedings of the 2014 international workshop on Long-term industrial collaboration on software engineering, pages 55–62. 2014.

[96] Kai Petersen and Claes Wohlin. Context in industrial software engineering research. In International Symposium on Empirical Software Engineering and Measurement, pages 401–404. Orlando, Florida, USA, 2009.

[97] Ian Poole. Cellular Communications Explained: From Basics to 3G. Elsevier, 1st edition, 2006.

[98] Calton Pu and Lenin Singaravelu. Fine-Grain Adaptive Compression in Dynamically Variable Networks. In Proceedings of the International Conference on Distributed Computing Systems, pages 685–694. 2005.


[99] Lasse Mikkel Reinhold. QuickLZ - Fast compression library for C, C# and Java. http://www.quicklz.com/, 2011. [Accessed 2013-05-31].

[100] Martin Ringwelski, Christian Renner, Andreas Reinhardt, Andreas Weigel, and Volker Turau. The hitchhiker’s guide to choosing the compression algorithm for your smart meter data. In 2nd IEEE ENERGYCON Conference & Exhibition, pages 935–940. 2012.

[101] Colin Robson. Real world research. Blackwell, Oxford, 2nd edition, 2002.

[102] Jussi Rosendahl and Leila Abboud. Nokia buys Alcatel to take on Ericsson in telecom equipment. http://www.reuters.com/article/2015/04/15/nokia-alcatel-lucent-ma-idUSL5N0XC0X220150415, 2015.

[103] Hans Rosling. Hans Rosling is lecturing the Danish Radio Channel 2 program (deadline) host Adam Holm, sep 2015.

[104] Kim Rowe. Time to market is a critical consideration. http://www.embedded.com/electronics-blogs/industry-comment/4027610/Time-to-market-is-a-critical-consideration, 2010. [Accessed 2015-03-04].

[105] Per Runeson. Case Study Research or Anecdotal Evidence? Technical report, 2010.

[106] Per Runeson and Martin Host. Guidelines for conducting and reporting case study research in software engineering. Empirical Software Engineering, 14(2):131–164, dec 2008.

[107] Rafael H. Saavedra and Alan J. Smith. Measuring cache and TLB performance and their effect on benchmark runtimes. IEEE Transactions on Computers, 44(10):1223–1235, 1995.

[108] Max Schuchard, Eugene Y. Vasserman, Abedelaziz Mohaisen, Denis Foo Kune, Nicholas Hopper, and Yongdae Kim. Losing Control of the Internet: Using the Data Plane to Attack the Control Plane. In Computer and Communications Security, pages 726–728. 2010.

[109] Carolyn B. Seaman. Qualitative methods in empirical studies of software engineering. IEEE Transactions on Software Engineering, 25(4):557–572, 1999.


[110] Julian Seward. BZIP2, a program and library for data compression. http://www.bzip.org, 2013. [Accessed 2015-03-04].

[111] Dag Sjøberg, Tore Dyba, and Magne Jørgensen. The Future of Empirical Methods in Software Engineering Research. Future of Software Engineering, SE-13(1325):358–378, 2007.

[112] Jan Christiaan Smuts. Holism and Evolution, volume 119. Macmillan and Co., London, 2nd edition, 1927.

[113] Niklas Stahle. Implementing Transaction Tracing in Real-Time Systems. Master of Science thesis, Royal Institute of Technology, 2009.

[114] Jorgen Stenmark. Intellectual property rights and copyright laws versus file-sharing in Cyberspace.

[115] Sezgin Sucu and Chandra Krintz. Ace: A resource-aware adaptive compression environment. In Proceedings of International Conference of Information Technology: Coding and Computing, pages 183–188. 2003.

[116] Gregory Tassey. The economic impacts of inadequate infrastructure for software testing. Technical Report 7007, National Institute of Standards and Technology, 2002.

[117] Paul Taylor. Battle lines are drawn for the future of 4G. http://www.ft.com/intl/cms/s/0/399b1508-d9d8-11dc-bd4d-0000779fd2ac.html#axzz1va5rEtRx, 2008. [Accessed 2015-03-04].

[118] Techcrunch. Apples App Store Hits 50 Billion Downloads, 900K Apps. http://techcrunch.com/2013/06/10/apples-app-store-hits-50-billion-downloads-paid-out-10-billion-to-developers/, 2013. [Accessed 2015-03-04].

[119] Telecomasia. Faster time to market with next-gen OSS. http://www.telecomasia.net/content/faster-time-market-next-gen-oss, 2012. [Accessed 2015-03-04].

[120] Ericsson Nikola Tesla, Denis Duka, and Keywords Cpp. Connectivity Packet Platform in the GSM/WCDMA Network. In Access, June, pages 7–9. 2006.


[121] Hans Vestberg. Ericsson unveils new products, partnerships and increased market share. In Proceedings of at Mobile World Conference. 2012.

[122] Wan Vinny. CPP in LTE Overview. Technical report, Ericsson, 2014.

[123] Johan De Vriendt, Philippe Laine, Christophe Lerouge, and Xiaofeng Xu. Mobile network evolution: a revolution on the move. IEEE Communications magazine, (April):104–111, 2002.

[124] Terry A Welch. A Technique for High-Performance Data Compression. Computer, 17(6):8–19, 1984.

[125] Benjamin Welton, Dries Kimpe, Jason Cope, Christina M. Patrick, Kamil Iskra, and Robert Ross. Improving I/O Forwarding Throughput with Data Compression. 2011 IEEE International Conference on Cluster Computing, pages 438–445, sep 2011.

[126] Y. Wiseman, K. Schwan, and P. Widener. Efficient end to end data exchange using configurable compression. ACM SIGOPS Operating Systems Review, pages 4–23, 2005.

[127] Claes Wohlin, Per Runeson, Martin Host, Magnus C. Ohlsson, Bjorn Regnell, and Anders Wesslen. Experimentation in Software Engineering: An Introduction. Springer Science+Business Media LLC, Lund, 2000.

[128] Robert K. Yin. Case study research: Design and methods, volume 5.Sage, 2nd edition, 1994.

[129] Li Zhang, Dhruv Gupta, and Prasant Mohapatra. How expensive are free smartphone apps? ACM SIGMOBILE Mobile Computing and Communications Review, 16(3):21–32, dec 2012.

II

Included Papers


– Jag vet ingenting om tur, bara att ju mer jag tranar desto mer tur har jag.¹

My own translation:

– I don’t know anything about luck, but the more I train the more lucky I get.

— I. Stenmark [114]

¹ A reporter implies that I. Stenmark has had a large portion of luck when competing in slalom.

7 Automatic Load Synthesis for Performance Verification in Early Design Phases

Marcus Jagemar, Sigrid Eldh, Andreas Ermedahl, Bjorn Lisper and Gabor Andai. Automatic Load Synthesis for Performance Verification in Early Design Phases. Technical Report, 2016. [68].

This technical report is an extension of the already published papers C [64], E [65] and the technical report I [63].

Paper A - Automatic Load Synthesis for Performance Verification in Early Design Phases

Abstract

This paper describes a method to extract hardware characteristics and synthesize a model of a system running in a production environment. It is common to perform characteristics testing at the end of the development process, resulting in complex and costly bug fixes. Using our characteristics model makes it possible to implement continuous performance testing throughout the whole development process. Early characteristics testing is important because it improves system-development efficiency by shortening the total development time. The reduced lead time is an advantage in a competitive market, such as for the telecommunication system we have investigated in this paper. The modeling method is generic and supports any hardware metric. We have modeled the L1-instruction, L1-data and L2-data cache in our experiment. We have applied our method to a large-scale telecommunication system and verified that it is possible to detect performance-related problems during the design phase rather than at the end of the product development cycle.

7.1 Introduction

Performance is an important issue for most software development projects, and it is one of the major differentiating factors for most new software releases in a highly competitive market. Time-to-market is also a key factor [16, 48] for large-scale industrial systems [21]. The ability to deliver performance improvements for the next generation of an existing product has received an increased focus in the software industry. Product verification is time-consuming, and performance measurements are often executed during the system verification phase, which is usually late in the software life-cycle [23]. Finding bugs and performance problems late in the development process is expensive [37] because the cost to correct bugs increases with the distance from where they were introduced [9,10]. We asked ourselves: Can we devise a method to detect performance-related problems in the early stages of the development process?

We have answered the question by devising a method to 1) extract hardware characteristics from a running production system; 2) synthesize a hardware characteristics model representing the production system; and 3) run the characteristics model in the early phases of the development process by mimicking the hardware usage of the production system. We have tested our method on an existing execution platform supporting a large-scale telecommunication system. Our contributions are:

• A low-intrusive method to periodically sample hardware performance counters from a production system.

• A fully automatic synthesis method to create a hardware characteristics model of a production system by using a Proportional Integral Derivative (PID) [7] control algorithm.

• We show that our model mimics the production system and can be used to find performance-related problems in early design process stages.

• We have verified our method by creating a cache characteristics model of a large telecommunication system.

We have tested our method by synthesizing a model for L1 instruction, L1 data, and L2 data cache misses according to the hardware characteristics extracted from a running production system. We generate a similar cache usage on a test system as on the production system while running the legacy test application. Our experiments show that, when using our technique, a production system bug fix causes the measured message Round Trip Time (RTT) to increase by 10.8%. Using the traditional performance measurement tests results in a 0.75% RTT increase, which may be too small to be detectable in automated test suites.
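As a minimal sketch of why the size of the measured change matters, the snippet below computes a relative RTT increase and compares it with a detection threshold. Only the 10.8% and 0.75% figures come from the experiments; the function names, the absolute RTT values and the 2% threshold are illustrative assumptions and do not describe the test suites used in this work.

```python
def relative_increase(baseline: float, measured: float) -> float:
    """Return the relative change of `measured` versus `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (measured - baseline) / baseline * 100.0


def is_detectable(increase_pct: float, threshold_pct: float) -> bool:
    """A regression is flagged only if it exceeds the (assumed) noise threshold."""
    return increase_pct > threshold_pct


# Hypothetical baseline RTT of 100 time units, chosen so the percentages are easy to read.
BASELINE_RTT = 100.0
THRESHOLD_PCT = 2.0  # assumed measurement-noise tolerance of an automated suite

with_model = relative_increase(BASELINE_RTT, 110.8)    # RTT under the synthesized load
traditional = relative_increase(BASELINE_RTT, 100.75)  # RTT in the traditional test

print(is_detectable(with_model, THRESHOLD_PCT))
print(is_detectable(traditional, THRESHOLD_PCT))
```

Under these assumed numbers, only the measurement taken under the synthesized load would exceed the threshold.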

Our method to synthesize the hardware characteristics model is fully automatic, in contrast to previous research [1, 6, 31] that requires human interaction. We have verified our contributions by performing experiments on a large-scale telecommunication system with a 40% [102] market share. The method is generic and can be used to synthesize models of similar types of systems.

We begin this paper, in Section 7.2, by motivating and giving background information about the type of system we have investigated. Section 7.3 describes our method to synthesize a model from characteristics data sampled from a production system. We continue with Section 7.4, where we provide a detailed description of our target system. We explain the system structure, how to perform hardware characteristics monitoring, and how to generate load. Our target-specific implementation is explained in Section 7.5, where we describe the system we have used for testing. We present our results in Section 7.6, related work in Section 7.7 and future work in Section 7.8. We end the paper by presenting our conclusions in Section 7.9.

7.2 Background

We have, in this paper, targeted characteristics monitoring and load replication of large-scale systems. We have defined [21] some behavioral patterns common to industrial systems, such as:

• There is a low acceptance for system downtime.

• There are multiple concurrent hardware and software generations.

• The lifetime spans over several decades.

• The size and system complexity cause long lead-times when developing new functionality.

• There is substantial internal communication between nodes inside the industrial system. External connections often use standardized protocols, for example 3GPP for telecommunication systems.

For the type of system we have investigated, a functional change within the platform will first be approved by a system department and then implemented by the design department, see Figure 7.1. New platform releases are integrated at the next stage in the development process, which is also where application development starts. Characteristics measurements usually require a suitable system release consisting of both the platform and the application. Complete systems are only available in the later stages of system integration, which causes characteristics measurements to be performed close to the customer delivery. The practical effect of this scenario is that performance-related problems are detected very late in the development process, and it is therefore difficult and expensive [9,10] to fix them since a long time has passed between implementation and performance test [21].

Figure 7.1: System development waterfall model. (Diagram labels: Approval of Functional Change, Implementation of Functional Change, Application Development, Test, System Deployment; System Dept., Platform Design Dept., Appl. Design Dept., Test Dept., Customer Organization; Development, Deployment. Annotation: iterations between the test and design process stages are costly.)

Figure 7.2: Different processes for characteristics verification of a large-scale industrial system. a) Characteristics testing and corrections are iteratively performed at the end of the design process. b) Performing characteristics testing throughout the development process shortens the total development time. (Both panels plot Activity over Time for Platform and Application up to Delivery. Panel a is annotated with late-stage iterations between design and test; panel b with continuous testing throughout the design phase, a characteristics test effort that varies over time, and the resulting lead-time reduction.)

In many large-scale development projects, the main priority is functional requirements, and characteristics requirements fall under the phrase “The overall performance must not be worse than before”. We have depicted this scenario in Figure 7.2a, where numerous iterations are needed after finishing the main development cycle. Late-stage testing and bug-fixing is, of course, expensive and requires a lot of extra resources from both design and verification. Performing early-stage characteristics measurements reduces the development time, see Figure 7.2b for an illustration. It is then possible to handle characteristics deviations during the main design phase, i.e. when engineers are in the middle of the implementation. The major reasons why it is important to get early-stage characteristics feedback are that:

• The lab costs (hardware and personnel) for doing tests on a complete large-scale system are orders of magnitude larger compared to using small test nodes.

• There might be a substantial time between the platform delivery and the final application delivery, causing difficulties when correcting characteristics problems. It is expensive to fix [37] bugs late in the development phase since the developers that wrote the initial code might not be available to provide a correction for the problem.

• It is vital to get early feedback on the performance when performing cost-reduction activities for an existing product. Several functions previously executing on different CPUs are often co-located on one CPU to reduce cost, which may have undesirable performance side effects.

Measuring the correct behavioural characteristics for complex large-scale computer systems is difficult. Characteristics measurements typically require either a full production system or advanced test programs running on large test systems. It is essential to measure behavioural characteristics after a software update and check that the nature of the system has not changed. Behavioural changes can result in costly and time-consuming verification later in the development cycle. Several iterations of testing and redesign at different development stages may be necessary to make sure that the system behaves correctly. Late-stage detection of unfulfilled performance requirements due to characteristics changes leads to increased development lead-time since parts of the system must be re-investigated and re-implemented. Such an increase in development time is hard to accept for products where a short time-to-market is essential [16, 32, 38, 39]. Our approach aims to provide a step towards the solution for this problem.

7.3 Method

Our method to simulate a production system environment on a test node consists of three steps. The first step is to gather hardware characteristics from the target system, see Figure 7.3a for an illustration. The characteristics cover both the platform and the application.

We collected the characteristics from a system running in a production environment at a customer site. The characteristics monitor ran continuously for several days, gathering hardware usage information such as cache, TLB and branch prediction utilization and other hardware-related counters.
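The method later uses the average value of each modeled metric as the target for the load generator. As a minimal sketch of that reduction, the snippet below averages a series of periodic samples per metric. The metric names and numbers are made up for illustration, and the in-memory list of dictionaries is an assumption; it is not the storage format used by the characteristics monitor.

```python
from collections import defaultdict
from statistics import mean


def average_characteristics(samples):
    """Reduce periodic samples to one average value per hardware metric.

    `samples` is an iterable of dicts mapping a metric name to the value
    observed in one sampling period. Metrics missing from a sample are skipped.
    """
    per_metric = defaultdict(list)
    for sample in samples:
        for metric, value in sample.items():
            per_metric[metric].append(value)
    return {metric: mean(values) for metric, values in per_metric.items()}


# Hypothetical samples (misses per second); names and numbers are made up.
samples = [
    {"l1i_miss_rate": 1000.0, "l1d_miss_rate": 4000.0, "l2d_miss_rate": 300.0},
    {"l1i_miss_rate": 1200.0, "l1d_miss_rate": 3600.0, "l2d_miss_rate": 500.0},
    {"l1i_miss_rate": 1100.0, "l1d_miss_rate": 3800.0},  # one counter not sampled
]
targets = average_characteristics(samples)
```

Averaging each metric over the samples in which it appears tolerates periods where a counter was not part of the active counter set.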

The second step is to create a model with the same characteristics as the target system, with the difference that the test system is running a test application instead of the production application. As shown in Figure 7.3b, the extracted characteristics are fed into the load controller running on a test node. On the test node, a test application is running instead of the large production application.


Figure 7.3: Schematic description of characteristics extraction from a production system, how to create an execution model and how to simulate the production system execution environment on a test system. a) Extract hardware characteristics from a production node execution environment. b) Use the extracted characteristics information to create an execution model on a test node. c) Use the execution model to simulate the production node execution environment on a test node. (Diagram labels: Production Node with Production Appl., Platform, PMC and Charmon; Test Node with Test Appl., Load Generator, Load Controller, PMC, Platform or Modified Platform, and Charmon; extracted HW characteristics; retrieved generator parameters; generated load. Legend: system structure, our implemented functionality, configuration data direction, extraction of characteristics data.)

After the initialisation phase, the feedback controller automatically converges within a couple of minutes, reaching a stable state where the system-wide hardware characteristics are similar to those of the production node. The model is stored for later retrieval.
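The convergence described above can be illustrated with a minimal discrete PID loop. This is a sketch under stated assumptions, not the controller used in the platform: the gains, the `simulated_miss_rate` stand-in for a PMC reading, and the linear relation between generator intensity and miss rate are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PID:
    """Textbook discrete PID controller (positional form)."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: float = 0.0

    def step(self, target: float, measured: float, dt: float) -> float:
        error = target - measured
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulated_miss_rate(intensity: float) -> float:
    """Toy stand-in for a PMC reading: test-application misses plus generated misses."""
    background = 200.0       # assumed misses/s caused by the test application itself
    misses_per_unit = 50.0   # assumed misses/s added per unit of generator intensity
    return background + misses_per_unit * max(intensity, 0.0)


def converge(target: float, steps: int = 200, dt: float = 1.0):
    """Drive the generator intensity until the measured rate matches the target."""
    pid = PID(kp=0.004, ki=0.004, kd=0.001)  # hypothetical gains tuned for the toy plant
    intensity = 0.0
    for _ in range(steps):
        measured = simulated_miss_rate(intensity)
        # A negative intensity has no meaning for a load generator, so clamp at zero.
        intensity = max(0.0, pid.step(target, measured, dt))
    return intensity, simulated_miss_rate(intensity)


intensity, reached = converge(target=1200.0)
```

In this toy setting the stable intensity is the value that would be stored as the model; a real load generator would need one such parameter per modeled metric.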

When the test node has reached a stable state, its hardware characteristics are similar to those of the production node, and the test node provides a good environment for the test application to execute in. The third and final step is the actual performance comparison between the legacy and the new release of the platform. We do this by checking one high-level performance metric of the test application when running it with the same hardware characteristics as the production system, see Figure 7.3c.

7.3.1 Method Details

The process of obtaining characteristics from a production application and mimicking them in a much smaller test environment can be described in three steps. The assumption is that the same type of hardware is used both in the production system and in the test system. Each step is briefly described below.

1. The procedure starts by sampling characteristics for a production system running in its designated target environment, see Figure 7.3a.

2. The second step, see Figure 7.3b, is to create a simulated environment on a test node, substituting the production application with a test application together with a load generator mimicking the characteristics obtained in Step 1. We use the average value for each modeled metric.

(a) Run the test application on the same platform as in Step 1, i.e. exactly the same software release of the platform.

(b) Use the PID-control algorithm to reach the same average hardware load characteristics as in Step 1.

(c) Retrieve metrics from the control algorithm. In our case, these are the internal counters that describe the number of cache misses generated by the load generation algorithm.

In the scenario above we have sampled behavioural characteristics from a real customer system and then mimicked a similar execution environment for a test application that performs functional tests. By storing the internal load-generation parameters, in Step 2c, we can generate the same rate of cache misses without using the control algorithm. This allows us to change the platform and then apply the same rate of cache misses. Investigating the ratio of misses allows us to detect changes in platform behaviour. In the continued procedure below we can measure characteristics for a different release of the platform to get an indication of how it will perform when running the production system.

3. In this last step, see Figure 7.3c, we can detect changes in behavioural characteristics, such as signal turnaround time, for a modified platform without running the production application.

(a) Start the modified platform together with the test application.

(b) Generate hardware-load at the same rate as obtained in Step 2c.

(c) Check behavioural characteristics for the benchmarking application, such as signal turnaround time.

If the signaling turnaround time has changed in Step 3c, the characteristics of the modified platform are different from those of the original one. Low-level changes to the operating system can drastically influence the overall performance of the applications if there are problems in cache handling or memory footprint.
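Storing the generator parameters in Step 2c and replaying them without the control algorithm in Step 3b can be sketched as follows. The `GeneratorModel` fields, the JSON serialization and the `set_intensity` callback are illustrative assumptions; the source does not describe how the model is stored or applied.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GeneratorModel:
    """Load-generator parameters captured once the controller is stable (Step 2c)."""
    platform_release: str
    l1i_intensity: float
    l1d_intensity: float
    l2d_intensity: float


def save_model(model: GeneratorModel) -> str:
    """Serialize the model so it can be reused for later platform releases."""
    return json.dumps(asdict(model), sort_keys=True)


def load_model(text: str) -> GeneratorModel:
    return GeneratorModel(**json.loads(text))


def replay(model: GeneratorModel, set_intensity) -> None:
    """Step 3b: apply the stored parameters in open loop, without the controller."""
    for metric in ("l1i", "l1d", "l2d"):
        set_intensity(metric, getattr(model, f"{metric}_intensity"))


# Hypothetical values; a real model would come from the converged controller.
model = GeneratorModel("legacy", l1i_intensity=20.0, l1d_intensity=35.5, l2d_intensity=4.25)
restored = load_model(save_model(model))

applied = {}
replay(restored, applied.__setitem__)  # a dict stands in for the load generator
```

Because the replay is open loop, any difference in the resulting turnaround time can be attributed to the platform change rather than to the controller adapting to it.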

7.4 Target System

We have investigated and tried our techniques on a telecommunication system [16] consisting of a platform and an application. The platform has been developed by Ericsson and is called the Connectivity Packet Platform [3, 76]. The platform is generic and supports many existing communication standards [28], including 3G and LTE. The platform supplies functionality such as hardware abstraction, OS services, connectivity and cluster awareness. We use a tailored operating system for embedded systems, adapted for performance and reliable execution.

The platform abstraction layer consists of roughly 5 million lines of source code [8], excluding the operating system, and executes on more than 20 different architectures and hardware. The application runs on the platform and is modeled using a high-level language. The memory footprint is in the range of gigabytes, and the application uses a multi-threaded execution environment spanning multiple cores and CPUs. Each such execution environment is then interconnected with other similar environments to form a complete system with a CPU count in the hundreds. The application is designed to handle mobile telephone calls and can be seen as an advanced routing function. It transfers a huge amount of data between different interfaces and parses some header information for each message.

7.4.1 Target System Details

Our target system consists of three major parts: 1) the hardware, 2) the software platform and middleware, and 3) the application. The hardware has a release cycle of approximately 24 months, delivering new CPU(s), improved memory configuration and other low-level changes to provide better performance and/or new functionality. The operators expect long-term support (decades) for each hardware configuration.

The platform [8] consists of a proprietary operating system or Linux, as well as additional middleware software that supports cluster awareness, communication channels, load-balancing, error handling, failure recovery mechanisms, and others. The application using the platform is huge in both size and complexity. It has been developed over decades by thousands of engineers and has a memory footprint of several gigabytes (GB), spanning many cores and CPUs depending on the current partition of functionality. It is not possible to give an explicit description of the configuration since it is flexible and depends on the desired usage of the platform.

The main task for the application is to control the traffic behavior of a telecommunication system and to serve mobile calls and data transmissions. The development process is complicated and time-consuming because of the large source code size. The overall complexity leads to a long lead-time between each step in the development process. There may be up to one year from the start of platform development to the first performance test of the complete system including applications. Naturally, there are some test applications trying to detect problematic behavior much earlier than system verification, but it has so far been difficult to do so with adequate results. The test applications we use mainly target functional properties and do not stress-test the system. It has been particularly difficult to obtain early-stage characteristics results where the utilization of the hardware is similar to that of the production application. This leads to a number of requirements when developing these techniques:

1. Characteristics measurements and function testing should be moved much earlier in the development chain, to reduce development time and cost.

2. Only low-intrusive profiling on the production system is allowed.

The platform we have investigated uses a proprietary embedded real-time operating system (OS) or Linux, and it runs on a Freescale p4080 processor.


7.5 Implementation

We start this section by describing our implementation of the monitoring application, see Section 7.5.1. We continue, in Section 7.5.2, by showing one way to determine vital hardware metrics affecting the system performance. Section 7.5.3 shows one way to mimic the characteristics of a production system using the selected metrics. We show implementation details on L1-instruction cache miss generation in Section 7.5.4 and similar details for L1-data and L2-data cache miss generation in Section 7.5.5.

Events
CPU Pipeline Unit Stalls from the following pipeline stages:
    → Fetch (Nr. fetches, Nr. prefetches, Instruction Buffer empty/full)
    → Decode (Nr. stalls)
    → Issue (Simple/Complex integer, Load-Store, Branch, Floating point, Altivec)
    → Schedule (Simple/Complex integer, Load-Store, Branch, Floating point, Altivec)
    → Retire (Completion buffer empty/full)
CPI/IPC
Data Load/Store miss rate/ratio
Branch miss rate/ratio
L1 Instr Cache miss rate/ratio (*)
L1 Data Cache miss rate/ratio (*)
L2 I and D miss rate/ratio (*)
L3 Cache read miss rate/ratio (System Wide)
L3 cache write statistics (System Wide)
Cycles/Interrupts
ITLB miss rate/ratio
DTLB miss rate/ratio
L2 TLB miss rate/ratio

Table 7.1: Hardware performance monitor counters events.

Event                      Data Source
CPU-load                   The operating system.
Message Round Trip Time    Either the production application or a test application depending on the scenario.

Table 7.2: Software counter events.


7.5.1 The Characteristics Monitor

The Characteristics Monitor, called Charmon, is implemented with the explicit requirement to be continuously running within the platform. It samples the HW-usage by periodically setting and reading performance monitor counters, PMCs [19, 20]. Each individual counter inside the PMC can be configured to count one HW-event such as cache misses, TLB misses, branch statistics etc. Charmon stores counted events in a database together with commonly requested Key Performance Indices (KPIs). The probe effect is very low since the PMC hardware is implemented inside the CPU with negligible performance penalty and the database storage and PMC reprogramming occur infrequently. In the current implementation for p4080, we have 13 different sets of performance counters, see Table 7.1, where each set is designed to provide an understanding of one particular KPI. Each set uses between one and six PMCs depending on the desired functionality and supported hardware. The number of sets differs depending on the actual hardware it is executed on, and in our investigation for this paper we have focused on three counter sets, marked with (*) in Table 7.1. Charmon also measures two software metrics that are related to the hardware utilization, see Table 7.2. CPU-load shows the number of processes in the ready-queue. The round-trip message time describes performance on a system level. Measuring characteristics is done in three steps:

1. First, the counter events for one set are programmed into the PMC and triggered to start counting.

2. Then the characteristics monitor sleeps for 1 second until woken by a timer. We make sure that no counter overrun has occurred.

3. The counters are read and stored in the database and the next set of performance counters is programmed into the PMC.
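The three steps above can be sketched as a round-robin loop over the counter sets. This is a minimal sketch and not the Charmon source: the `charmon_hooks` structure and its function pointers are hypothetical stand-ins for the platform-specific PMC programming, timer, and database calls, which the text does not show.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_SETS 13   /* counter sets in the p4080 implementation */
#define MAX_PMCS 6    /* each set uses between one and six PMCs */

/* Hypothetical platform hooks; real PMC access is hardware specific. */
typedef struct {
    void   (*program_set)(int set);                  /* step 1: program and start */
    void   (*sleep_seconds)(unsigned seconds);       /* step 2: wait for the timer */
    size_t (*read_set)(int set, uint64_t *values);   /* step 3: read, returns count */
    void   (*store)(int set, const uint64_t *values, size_t count);
} charmon_hooks;

/* Runs `rounds` complete passes over all counter sets and
 * returns the number of samples that were stored. */
unsigned charmon_run(const charmon_hooks *h, unsigned rounds)
{
    unsigned stored = 0;
    for (unsigned r = 0; r < rounds; r++) {
        for (int set = 0; set < NUM_SETS; set++) {
            uint64_t values[MAX_PMCS];
            h->program_set(set);
            h->sleep_seconds(1);
            size_t n = h->read_set(set, values);
            h->store(set, values, n);
            stored++;
        }
    }
    return stored;
}

/* Seconds a set is idle between the end of one sample and its next sample. */
unsigned charmon_idle_seconds(unsigned num_sets, unsigned sample_seconds)
{
    return (num_sets - 1) * sample_seconds;
}
```

With 13 sets and 1-second samples, `charmon_idle_seconds(13, 1)` gives the 12-second wait between two samples of the same set.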

Our implementation has 13 sets, so each counter set is sampled for 1 second and then waits for an additional 12 seconds before being sampled once more. The period and sample length have their limitations, but shortening the time to obtain a finer granularity increases the intrusiveness. We have decided to sample infrequently for several reasons:

• The customer system is very sensitive to running testing tools; we must be certain that our monitoring application does not affect the system performance or behavior.

• We are using the extracted data to create a semi-static model that does not require a higher sampling frequency.


Furthermore, no code instrumentation is needed when using PMCs. Other CPU architectures such as x86 implement performance counters in a performance monitor unit [15]. The Freescale P4080 reference manual has a general description of the PMC [19] and more details can be found in the e500mc core reference manual [20, Table 9-47]. The cache for our target CPU is structured as follows [19, 20]:

• L1 Cache Size: 32KB separate 8-way set assoc. I and D cache/core with 64B cache line size.

• L2 Cache Size: 128KB common 8-way set assoc. I and D cache/core.

• L3 Cache Size: 2MB common I and D cache for all cores.

All cache levels use a pseudo Least Recently Used (LRU) replacement strategy.
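The geometry that follows from these figures can be checked with a little arithmetic. The helper names below are illustrative only. The L1 numbers come from the list above; applying the same 64B line size to L2 is an assumption, since the text gives the line size for L1 only.

```c
#include <stddef.h>

/* Number of sets = capacity / (associativity * line size). */
size_t cache_num_sets(size_t capacity_bytes, size_t ways, size_t line_bytes)
{
    return capacity_bytes / (ways * line_bytes);
}

/* Bytes covered by one way, i.e. the address distance at which
 * two lines fall into the same set. */
size_t cache_way_bytes(size_t capacity_bytes, size_t ways)
{
    return capacity_bytes / ways;
}
```

For the 32KB, 8-way L1 cache with 64B lines this gives 64 sets and 4KB per way, which is the set count used when generating data cache misses in Section 7.5.5.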

7.5.2 The CPI Stack

We have created a CPI-stack to get an indication of the most relevant hardware metrics to monitor. Eyerman et al. [18] describe CPI as the total execution cost of a single instruction, including wasted capacity caused by processor stalls such as branch prediction misses, TLB misses, and cache misses. Splitting the CPI into each contributing metric builds a CPI stack, which illustrates how large a part of the total execution time is spent doing real work compared to being wasted. We are using the PowerPC p4080 CPU in our target system, which has no specific hardware counters for measuring the number of lost cycles contributed by individual shared resources. We have therefore calculated each CPI-stack contributor by multiplying the arithmetic mean [28] cost of each access by the number of accesses. The performance monitor counters can measure the number of accesses. The mean performance penalty was obtained by interviewing hardware designers and through testing. Following our reasoning, we estimate the CPI-stack as:

CPI = CPI_{cache} + CPI_{TLB} + CPI_{branch} + CPI_{base}
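One way to evaluate this sum from counter readings is sketched below. The type and function names are illustrative, and the penalties passed in are placeholders for the mean penalties obtained from hardware designers and testing. Treating the base component as the remainder of the measured CPI is an assumption of this sketch, not something the text states.

```c
#include <stdint.h>

typedef struct {
    uint64_t events;        /* event count read from a PMC, e.g. cache misses */
    double   mean_penalty;  /* assumed mean cost per event, in cycles */
} cpi_source;

typedef struct { double cache, tlb, branch, base; } cpi_stack;

/* CPI contribution of one source: events * mean penalty / instructions. */
double cpi_component(cpi_source s, uint64_t instructions)
{
    if (instructions == 0)
        return 0.0;
    return ((double)s.events * s.mean_penalty) / (double)instructions;
}

/* Splits the measured CPI (cycles / instructions) into its contributors.
 * The base component is whatever the three penalties do not explain. */
cpi_stack cpi_stack_estimate(uint64_t cycles, uint64_t instructions,
                             cpi_source cache, cpi_source tlb, cpi_source branch)
{
    cpi_stack st = {0.0, 0.0, 0.0, 0.0};
    if (instructions == 0)
        return st;
    double total = (double)cycles / (double)instructions;
    st.cache  = cpi_component(cache, instructions);
    st.tlb    = cpi_component(tlb, instructions);
    st.branch = cpi_component(branch, instructions);
    st.base   = total - (st.cache + st.tlb + st.branch);
    if (st.base < 0.0)
        st.base = 0.0;  /* a blunt mean-penalty model can overshoot */
    return st;
}
```

Because every penalty is a mean value, the components can add up to more than the measured CPI; the sketch clamps the base term to zero in that case.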

We agree with other researchers [3, 18, 30] that our simplified CPI-stack is blunt, but it confirmed the opinions expressed by senior hardware and software designers working within our target organization. Figure 7.4 illustrates our measurements. In this particular test run, we focus on cores 6 and 7, which are where our target application runs. Cores 0-5 run other applications not closely related to this investigation. They will not have much impact on the tested application apart from sharing the same L3-cache and other peripheral resources. Figure 7.4 shows that the majority of wasted execution time is spent waiting for L1-instruction, L1-data, and L2-data caches, which is the reason why we have used these metrics when synthesizing our characteristics model.

Figure 7.4: CPI stack for the production system. [Stacked bars for CPU cores 0-7; y-axis: CPI stack ratio per consuming HW resource [%]; series: Base+other, L1-cache, L2-cache, L3-cache, TLB, Branch miss.]

7.5.3 The Load Controller

The load controller is designed to model a production system. The model is synthesized by gradually increasing the generated HW-load until it reaches a user-supplied limit. When the desired limit is reached, a model is automatically created, which can be stored for later use. The desired limits can be obtained by extracting production system characteristics, as in Figure 7.3a. Current measurements are obtained through the Charmon utility while automatically creating the model, see Figure 7.3b. When the model synthesis process has reached a stable state, it is possible to extract the controller parameters for offline storage. As shown in Figure 7.3c, the stored parameters can later be directly fed into the load controller. When the load controller receives the model parameters, it can start generating HW load according to the model. The main benefit of using this procedure is that the model is synthesized once and then deployed widely without the need for remodeling.

```c
int bigswitch(int n) {
    switch (n) {
    case 1: n += 10;
        break;
    case 2: n += 11;
        break;
    case 3: n += 11;
        break;
    /* ... */
    case 99999: n += 50009;
        break;
    default: n += 20;
    }
    return n;
}
```

Listing 7.1: Generating L1 Instruction cache misses.

Each HW metric has its own PID-controller. It has proved difficult to implement an autonomous feedback control loop taking multiple metrics into consideration. For each additional metric, the convergence is slower and the oscillation tendencies increase. We are aware that there are more advanced techniques that may support multi-metric control, but due to lack of time, we have not investigated this issue further.

In our investigation we have implemented control algorithms for L1-instruction, L1-data and L2-data cache misses since they cause the largest portion of lost cycles, see Section 7.5.2. We have actively chosen a subset of the characteristics contributors because adding more metrics increases the complexity of the control algorithm.
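A per-metric controller of this kind could look like the following textbook PID step. This is a generic sketch, not the controller used in the load generator: the structure, the gains, and the output clamping are assumptions made for illustration. The output would drive one load parameter, such as the jump distance or the stride, for one metric.

```c
/* One controller instance per HW metric (L1-I, L1-D, L2-D miss ratio). */
typedef struct {
    double kp, ki, kd;        /* gains: assumed values, tuned empirically */
    double integral;          /* accumulated error */
    double prev_error;        /* error from the previous step */
    double out_min, out_max;  /* limits for the generated load parameter */
} pid_ctrl;

void pid_init(pid_ctrl *c, double kp, double ki, double kd,
              double out_min, double out_max)
{
    c->kp = kp;
    c->ki = ki;
    c->kd = kd;
    c->integral = 0.0;
    c->prev_error = 0.0;
    c->out_min = out_min;
    c->out_max = out_max;
}

/* target:   desired miss ratio from the production measurements
 * measured: current miss ratio sampled by the monitor
 * dt:       seconds since the previous step
 * returns:  new load parameter, clamped to [out_min, out_max] */
double pid_step(pid_ctrl *c, double target, double measured, double dt)
{
    double error = target - measured;
    double derivative = (dt > 0.0) ? (error - c->prev_error) / dt : 0.0;
    c->integral += error * dt;
    c->prev_error = error;

    double out = c->kp * error + c->ki * c->integral + c->kd * derivative;
    if (out < c->out_min) out = c->out_min;
    if (out > c->out_max) out = c->out_max;
    return out;
}
```

Running three such independent loops keeps each one simple, at the cost of the coupling between cache levels described in Section 7.5.5.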


Figure 7.5: Address translation and physical address usage in the cache structure. [Diagram: the 32-bit Effective Address (EA) to be accessed is translated to a 36-bit Physical Address (PA) (step 1); the PA selects the address tag among Way 0 to Way 7 (step 2a), one of the 64 sets (step 2b), and the byte in the 64B cache line (step 3). Each way holds an address tag, status bits, and a 64B cache line.]


7.5.4 Generating L1 I-cache Misses

We use a large switch-case statement [35], see Listing 7.1, to generate L1 instruction cache misses. The bigswitch() function is called with varying argument values for the switch-case index n. A call is more likely to generate an instruction cache hit if the jump distance from the previous case is short, and more likely to generate a miss if the distance is long. The feedback controller varies the jump distance to produce the desired amount of cache misses. The main advantage of this method is its simplicity, while it still does not generate too many data accesses that would affect the other feedback controllers working with data cache modeling. The model is formed by retrieving the iteration counters when the feedback controller has converged.
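How the index sequence might be produced for a given jump distance is sketched below. The function names and the wrap-around policy are assumptions; the text only states that the controller varies the jump distance. The `target` pointer stands for a function such as bigswitch() from Listing 7.1.

```c
/* Next switch-case index after jumping `jump` cases forward,
 * wrapping within 1..num_cases. Assumes jump >= 0. */
int next_case(int current, int jump, int num_cases)
{
    return (current - 1 + jump) % num_cases + 1;
}

/* Calls `target` `iterations` times with indices spaced `jump` cases apart.
 * The return values are summed so the calls cannot be optimized away. */
long run_icache_load(int (*target)(int), int jump, int iterations, int num_cases)
{
    long sink = 0;
    int n = 1;
    for (int i = 0; i < iterations; i++) {
        sink += target(n);
        n = next_case(n, jump, num_cases);
    }
    return sink;
}
```

A small `jump` keeps consecutive calls within nearby code and therefore mostly within cached lines, while a large `jump` spreads the calls over more of the switch body.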

7.5.5 Generating L1 and L2 Data Cache Misses

By using varying strides through memory [24] we generate L1 and L2 data cache misses. The tags and sets in the cache memory will be exercised differently depending on the desired ratio of cache misses, see Figure 7.5. The address translation for Freescale processors starts with a 32-bit Effective Address (EA), which is converted to a 36-bit Physical Address (PA), step 1 in Figure 7.5. There are eight address tags for each cache set, each paired with its corresponding 64B cache line. The selection of the address tag, 2a, is done with bits PA[0:23] in parallel to the set selection, 2b, with bits PA[24:29]. A specific byte within the cache line is selected using bits PA[30:35], see [19, 20].
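The field split described above can be written out as bit operations. The sketch assumes the Power Architecture bit-numbering convention, in which bit 0 is the most significant bit, so PA[30:35] are the six least significant bits of the 36-bit address. The type and function names are illustrative.

```c
#include <stdint.h>

typedef struct {
    uint32_t tag;     /* PA[0:23],  24 bits, compared against the eight ways */
    uint32_t set;     /* PA[24:29],  6 bits, selects one of 64 sets */
    uint32_t offset;  /* PA[30:35],  6 bits, byte within the 64B line */
} pa_fields;

/* Splits a 36-bit physical address into tag, set index, and byte offset. */
pa_fields pa_split(uint64_t pa)
{
    pa_fields f;
    f.offset = (uint32_t)(pa & 0x3F);
    f.set    = (uint32_t)((pa >> 6) & 0x3F);
    f.tag    = (uint32_t)((pa >> 12) & 0xFFFFFF);
    return f;
}
```

The three widths add up to 24 + 6 + 6 = 36 bits, matching the 36-bit PA, and the 6-bit set and offset fields match the 64 sets and 64B lines of the L1 cache.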

We have devised a loop that iterates over all (8) tags in a cache set and then over all (64) sets. The loop strives to generate cache misses in a controlled manner and according to the desired level. The general approach to generate data cache misses is straightforward. Striding within the number of sets/tags supported by the L1 cache, see Section 7.5.1, results in cache hits, but the L1 cache miss ratio starts to increase when adding more sets to the stride. The same applies to L2 cache misses, but a larger data set has to be used. A side effect of this multi-level cache hierarchy is that introducing more L2 cache misses will also generate L1 misses. While the L2 controller is converging to its desired iteration parameters, the L1 controller is constantly monitoring and changing its parameters. Our contribution is to control the amount of misses on different cache levels by implementing a PID-feedback control loop that checks the current ratio for each cache level and then strives to approach the desired level for all of them. The implementation starts by working inside the sets/tags, causing only cache hits, and then increases the number of misses by subsequently accessing data outside the sets/tags.
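A strided access loop in this spirit is sketched below. It is not the thesis implementation: the function names and the parameter `lines_per_set` are assumptions, and the constants are the L1 geometry from Section 7.5.1. Touching at most eight lines per set stays within the eight ways, while touching more than eight forces evictions and therefore misses on later passes.

```c
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 64u                      /* L1 cache line size */
#define L1_SETS    64u                      /* sets in the L1 cache */
#define WAY_BYTES  (LINE_BYTES * L1_SETS)   /* 4KB: lines this far apart share a set */

/* Reads `lines_per_set` distinct cache lines in each of `sets` sets.
 * Returns the sum of the bytes read so the loads are not optimized away. */
uint64_t stride_touch(const volatile uint8_t *buf, size_t sets, size_t lines_per_set)
{
    uint64_t sum = 0;
    for (size_t set = 0; set < sets; set++)
        for (size_t line = 0; line < lines_per_set; line++)
            sum += buf[line * WAY_BYTES + set * LINE_BYTES];
    return sum;
}

/* Buffer size needed for a given number of lines per set. */
size_t stride_buffer_bytes(size_t lines_per_set)
{
    return lines_per_set * WAY_BYTES;
}
```

A feedback controller would then adjust `lines_per_set`, or mix passes with different values, until the measured miss ratio reaches the target. The same loop with a larger way size and buffer would exercise L2.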

7.5.6 Experimental Setup

We have used a telecommunication system consisting of two interconnected nodes to test the techniques presented in this paper. Each node in the system runs one or more applications spanning multiple cores on the CPUs. Maintenance and administrative software typically runs on core 0 on each CPU. We have omitted core 0 from our tests because it is usually not utilized to the same extent as the other cores, and the investigated application does not execute on that core. The telecommunication application is a message processing machine that handles incoming messages received via multiple network connections. Advanced algorithms process the message header and decide the action to perform on the message content. The simplest action is to forward the message to other nodes within the network; other actions decode the content for processing or re-code it into other formats or protocols.

7.6 Results

We have utilized our automatic modeling procedure, described in Section 7.3, to synthesize a cache characteristics model of a production system, see Section 7.4. We have compared the characteristics measured on a production system with the mimicked behavior of a model system. The production system was at the time of measurement handling end-user telecommunication traffic (voice and data) in a large metropolitan city.¹

We will describe the outcome from three experiments in the following sections. In our first experiment, see Section 7.6.2, we compare the hardware characteristics of a production node with characteristics from a test node, both with and without a characteristics model. In our second experiment, see Section 7.6.3, we describe one example of finding a performance-related bug in the early phases of the development process. In the third, and last, experiment we use our modeling mechanism to make an initial estimate of the performance impact when switching from a legacy OS to Linux, see Section 7.6.4.

¹ We cannot disclose the location due to business considerations.


Figure 7.6: CPI and cache miss ratio when bouncing signals. [Eight panels, Core 0 to Core 7; x-axis: Time [minutes]; left y-axis: Cache Miss Ratio [%]; right y-axis: Cycles Per Instruction, CPI; series: L1 I$ [%], L1 D$ [%], L2 D$ [%], CPI, RTT [us].]


7.6.1 Running The Test Application With The Load Generator

We use one PID-controller for each HW metric. As can be seen in Figure 7.6, the test application has initial characteristics (to the left) that differ from the final characteristics (to the right) reached after the control loop has started to converge. Each of the eight graphs shows CPI and cache misses when running a signaling application sending signals between two processes located on the same core. The load controller is set to generate a cache miss ratio according to the values in Table 7.3. The effect of an increased cache miss ratio is that the CPI also increases. This will affect the software execution on the core. In this example, we control all eight cores simultaneously. The initial control loop parameters for each of the properties were empirically chosen to provide stability rather than quick convergence.

7.6.2 Production vs. Modeled Characteristics

We use our modeling method to create a hardware characteristics model that converges the cache usage to the desired level, see Figure 7.6. All cores in the CPU are modeled and show similar characteristics to the production node. In this section, we look closer at core 6, which runs the main application functionality that we are investigating. The other cores handle different support services that are related to traffic handling, but we have omitted them due to space constraints. The average cache usage and CPI have been modeled successfully, see Figure 7.7. Our measurements show that the hardware usage of the test application, which is supposed to mimic the functional behavior of the production application, is not similar to the production system. When we add the load generator (cache-usage model) to the test environment, the average cache usage becomes almost identical to the production system.

Core    L1-instr.   L1-data   L2-data   CPI
0       0.74%       3.3%      22%       1.91
1       0.74%       3.3%      14%       1.70
2       0.74%       3.3%      15%       1.76
3       0.74%       3.3%      16%       1.78
4       0.74%       3.3%      17%       1.82
5       0.74%       3.3%      18%       1.86
6       0.74%       3.3%      19%       1.87
7       0.74%       3.3%      20%       1.91

Table 7.3: CPI and cache miss ratio when bouncing signals, corresponding to the graphs in Figure 7.6.

Metric                      Production System   Test Appl.   Test Appl. with Generated Load
CPI [Cycles/Instruction]    2.04                1.15         1.96
L1 ICache miss ratio [%]    0.74                0.06         0.75
L1 DCache miss ratio [%]    3.3                 0.14         3.35
L2 DCache miss ratio [%]    22                  1            21.2

Figure 7.7: CPI and cache miss ratio for core 6. [Bar chart; the values above are the bar labels.]

A closer investigation reveals that the production characteristics jitter over time, Figure 7.8a, to a much greater degree than those of the test node running the modeled environment, Figure 7.8b. This observation has a two-fold meaning: 1) The model system is within the measured behavior of the production system, which is good since we have succeeded in reaching a similar average cache usage ratio as the production system. 2) The model system does not jitter to the same extent as the production system. The effects are difficult to predict, but although the mean value is similar, memory access bursts tend to congest memory subsystems and could, therefore, have effects on access times. We plan to improve future versions of our modeling method by supporting dynamic behavior. It is also possible to detect periodically reoccurring events by looking for peaks in the data. For example, our measurements imply that the production node executes a memory-intensive task roughly every second minute because there are repeating CPI peaks occurring with this interval, see Figure 7.8a.

Figure 7.8: A comparison of HW characteristics for a production system and a modeled test system. (a) Production system. (b) Modeled test system. [Both panels: x-axis: Time [minutes]; y-axes: Miss ratio [%] and Cycles Per Instruction (CPI); series: L1 I$, L1 D$, L2 D$, CPI; markers for mean value, std. dev., max, and min.]
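The jitter comparison rests on summary statistics of the sampled series (mean, spread, minimum, and maximum, as annotated in Figure 7.8). A minimal sketch of such a summary follows; the names are illustrative, and the population variance is returned instead of the standard deviation to keep the example free of the math library. The standard deviation is its square root.

```c
#include <stddef.h>

typedef struct { double mean, variance, min, max; } sample_stats;

/* Summarizes n samples of one metric, e.g. CPI sampled over time. */
sample_stats summarize(const double *x, size_t n)
{
    sample_stats s = {0.0, 0.0, 0.0, 0.0};
    if (n == 0)
        return s;

    double sum = 0.0;
    s.min = x[0];
    s.max = x[0];
    for (size_t i = 0; i < n; i++) {
        sum += x[i];
        if (x[i] < s.min) s.min = x[i];
        if (x[i] > s.max) s.max = x[i];
    }
    s.mean = sum / (double)n;

    double sq = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = x[i] - s.mean;
        sq += d * d;
    }
    s.variance = sq / (double)n;  /* population variance */
    return s;
}
```

Two systems can share the same mean while differing widely in variance and max, which is the situation described for the production and modeled nodes.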

7.6.3 System Performance Measurement

We have verified our method of detecting performance degradation by modeling a production system and applying the model to the test of a real software delivery that relates to a cache errata bug fix. The bug fix only concerned the platform code and made no changes to the application. The initial tests, using only the already existing test suite, showed a minor performance degradation: the message RTT increased by 0.75%, as shown in Table 7.4.

We added additional hardware load by running our load generator according to the production-environment model. When running the test suite together with the model, the message RTT increased by as much as 10.8%, see Table 7.5. Such a considerable performance degradation gave a clear indication that this particular bug fix would cause performance-related problems for the production application. To validate our method we delivered the bug fix for formal performance verification. Much later, when the bug fix finally reached the production site, the CPU load increased by 8.4%, see Table 7.6, which verified that the bug fix was not possible to deliver due to the performance impact it caused.

        Original Release   Modified Release   Comparison
Core    [ms]               [ms]               [ms]      [%]
0       Omitted from the simulation.
1       0.5934             0.5982             0.0048    0.81%
2       0.5935             0.5999             0.0064    1.08%
3       0.6017             0.6068             0.0051    0.85%
4       0.6022             0.6057             0.0035    0.59%
5       0.6022             0.6058             0.0036    0.60%
6       0.6025             0.6060             0.0036    0.59%
7       0.6015             0.6061             0.0046    0.76%
Average                                                 0.75%

Table 7.4: Mean message RTT for a test application w/ and w/o a particular software change while running on a test system. The data was sampled on a second-level basis during several hours.
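The percentage columns in Tables 7.4 and 7.5 are relative RTT increases per core, and the last row is their average. A small sketch of that arithmetic, with illustrative function names:

```c
#include <stddef.h>

/* Relative increase from `original` to `modified`, in percent. */
double relative_increase_pct(double original, double modified)
{
    return (modified - original) / original * 100.0;
}

/* Average of the per-core relative increases, in percent. */
double mean_relative_increase_pct(const double *original,
                                  const double *modified, size_t n)
{
    if (n == 0)
        return 0.0;
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += relative_increase_pct(original[i], modified[i]);
    return sum / (double)n;
}
```

For core 1 in Table 7.4, the increase from 0.5934 ms to 0.5982 ms is 0.0048 ms, or about 0.81%.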

Please note that it is difficult to make a direct comparison between the signal turnaround time on the test system, t_s, and the CPU load on the production system. However, we estimate that t_s = t_p + t_t, where t_p is the time it takes to process a message and t_t is the time in transit between nodes. In the system we are investigating, t_t ≪ t_p since the available bandwidth is very high and the communication path is relatively short. Thus, t_s is proportional to t_p, and we can deduce that CPU load is a major contributor to the signal turnaround time. A higher CPU load results in an increased signal turnaround time due to longer processing time.

7.6.4 Performance Prediction When Switching OS

We can also use our modeling technique for more comprehensive performance predictions. In this experiment, we will predict the performance impact when switching from a legacy OS to Linux. Many production systems still run legacy OSes, which are expensive to maintain and port to new architectures. There is a great drive to move such systems to Linux, which supports many more architectures and also has a well-maintained execution environment and tools. It is beneficial to get a performance indication before making the decision to port the complete system, including applications, to Linux.

        Original Release   Modified Release   Comparison
Core    [ms]               [ms]               [ms]      [%]
0       Omitted from the simulation.
1       2.0238             2.2492             0.2254    11.14%
2       2.0937             2.3423             0.2487    11.88%
3       1.9284             2.1527             0.2243    11.63%
4       2.0195             2.2548             0.2353    11.65%
5       1.9945             2.1778             0.1832    9.19%
6       2.1637             2.4100             0.2463    11.38%
7       1.9952             2.1704             0.1752    8.78%
Average                                                 10.81%

Table 7.5: Mean message RTT for a test application w/ and w/o a particular software change while simultaneously running a load generator on a test system. The data was sampled on a second-level basis during several hours.

Original Release
Core    Instr.   Cycls.   IPC    CPI    CPU
0       Omitted from the simulation.
1       1111M    1498M    0.74   1.35   63.7%
2       1040M    1498M    0.70   1.44   63.8%
3       1043M    1498M    0.70   1.44   62.8%
4       1046M    1498M    0.70   1.44   63.3%
5       1044M    1498M    0.70   1.44   64.0%
6       1042M    1498M    0.70   1.44   63.3%
7       1044M    1501M    0.70   1.44   63.3%
Mean    1053M    1498M    0.70   1.42   63.5%

Modified Release
Core    Instr.   Cycls.   IPC    CPI    CPU
0       Omitted from the simulation.
1       1106M    1499M    0.74   1.36   69.3%
2       961M     1502M    0.64   1.56   69.2%
3       966M     1500M    0.64   1.55   68.7%
4       967M     1499M    0.65   1.55   68.6%
5       963M     1502M    0.64   1.56   69.1%
6       959M     1496M    0.64   1.56   67.9%
7       966M     1500M    0.64   1.55   68.5%
Mean    984M     1500M    0.66   1.53   68.8%

Comparison
Core    Instr.    Cycls.    IPC       CPI      CPU
0       Omitted from the simulation.
1       -0.49%    0.09%     -0.49%    0.49%    8.79%
2       -7.54%    0.26%     -7.96%    8.65%    8.46%
3       -7.38%    0.15%     -7.51%    8.12%    9.39%
4       -7.50%    0.05%     -7.37%    7.95%    8.37%
5       -7.74%    0.29%     -7.83%    8.49%    7.97%
6       -8.04%    -0.16%    -7.89%    8.57%    7.27%
7       -7.46%    -0.02%    -7.44%    8.04%    8.21%
Mean    -6.6%     0.1%      -6.6%     7.2%     8.4%

Table 7.6: Comparison of two releases of a telecommunication system running in a production environment. The modified system contains a hardware errata fix related to cache handling. The characteristics data was sampled each second during several hours at a customer site in a large metropolitan city and the table shows the average value.

This experiment aims to evaluate if a performance test is more accurate when using a synthesized hardware characteristics model together with a test suite compared to running the performance test with the test suite only. We perform the experiment on a function test suite that sends messages between two interconnected nodes. The test suite also performs some message processing, which causes a moderate CPU load. We have executed these tests on the Freescale p4080 platform. The legacy OS also uses Variable Size Pages (VSP)² to reduce the pressure on small (4KB) pages. There is no significant background activity running in either OS. Andai [4] has previously presented this experiment in his master thesis report, which the main author (Jagemar) of this paper supervised.

As a first reference test, we run a currently existing function test suite without any modifications (Scenario S1 in Table 7.7). The reference test results in a 183% increase in message RTT when using Linux compared to the real-time legacy OS. For our second test, we run the legacy function test suite together with our load generator (Scenario S2 in Table 7.7). The complete test setup mimics, from the cache-usage point of view, the execution environment of a production node using the legacy OS. The message RTT degradation for the Linux model system is 14%. Later, when we had ported the complete system to Linux, it was possible to verify that the message RTT performance degradation was 15%, which is very close to our prediction (S2).

We have found some clues to the radical drop in performance (+183% message RTT) when interpreting the characteristics for running only the function test application (S1). Starting from the top of Table 7.7, the first metric that stands out is a decrease of 1.2 percentage points (pp) in the L1-instruction cache hit ratio. It may, at first glance, look like a negligible change, but we know from experience that even small decreases in the cache hit ratio affect the performance. The next metric is the L1-data cache hit ratio, which has decreased by 0.6pp. Such a hit rate reduction gives a first hint that the working set is larger for Linux. The platform and test application remain the same for both OSes, which suggests that Linux by itself causes the increased working set size. Investigating the Linux source code shows that it is much more complex than the legacy OS. We attribute much of the complexity to the generic and modular design of Linux. We can observe a similar increase for the shared L2 cache where the L1-cache spillover affects the number of accesses. The L2 cache hit ratio is negligible for the legacy OS because there are too few accesses.

² The Linux kernel we have tested does not support Transparent Huge Pages.

                    S1: Only Test Prog.               S2: Test Prog. w/ Loadgen
Metric              Legacy   Linux   Increase         Legacy   Linux   Increase
Signal RTT          6us      17us    11us (+183%)     25us     29us    4us (+14%)
L1 I$ hit ratio     100%     98.8%   -1.2pp           99.3%    98.8%   -0.5pp
L1 D$ hit ratio     99.8%    99.2%   0.6pp            98.2%    98.5%   0.3pp
L2 hit ratio        -        100%    -                85.0%    90.0%   5pp
L2 accesses         0        16M     16M              6M       13.5M   6.5M (+108%)
ITLB 4kB reloc.     0        750k    750k             0        450k    450k
DTLB 4kB reloc.     0        750k    750k             0        550k    550k
DTLB VSP reloc.     50k      0       -50k             20k      0       -20k
L2 TLB reloc.       0        0       0                0        110k    110k
Branch hit ratio    100%     84%     -16pp            94%      85%     9pp
Branch hit rate     200M     120M    -80M (-40%)      80M      90M     10M (12.5%)
Interrupts          0        230k    +230k            0        250k    250k

Comments:
Signal RTT: Big difference w/ and w/o Loadgen.
L1 I$ hit ratio: Legacy fits in the cache.
L1 D$ hit ratio: Similar cache usage for both OSes.
L2 hit ratio: L2 is not used for original legacy but modeled with Loadgen.
ITLB 4kB reloc.: The Linux code base is much larger.
DTLB 4kB reloc.: The Legacy system uses Variable Size Pages to reduce TLB pressure.
L2 TLB reloc.: The Linux data set is larger for S2.
Branch hit ratio: Smaller code base for the legacy system results in good branch prediction.
Interrupts: The network driver implementation differs.

Table 7.7: Scenario S1 shows a messaging test application. S2 shows the same test application running with an additional hardware load generator. (pp = percentage point).

Closely related to the cache is the number of Translation Lookaside Buffer (TLB) relocations. The number of instruction TLB relocations has increased from 0 → 750k. Our conclusion is that the size of the executing code has grown between the legacy OS and Linux. If we further investigate the DTLBs, we can observe that Linux does not use large pages (Transparent Huge Pages, THP [26]) because the number of DTLB relocations has increased from 0 → 750k at the same time as DTLB Variable Size Page (VSP) relocations are reduced from 50k → 0 when moving from the legacy OS to Linux.

The execution flow is also affected by the increased code base, which leads to a branch hit ratio drop (100% → 84%). The last counter in this experiment is the number of external interrupts. It seems likely that the Linux network driver implementation uses many more interrupts than the legacy OS. A polling driver would probably reduce the number of interrupts for a messaging application such as ours.

In our next test (S2) we simultaneously run the function test suite and the load generator. The goal is to run the same test as in S1 but with an execution environment that is similar to the production system. In S2, the message RTT is increased for both the legacy OS (6us → 25us, see Table 7.7) and Linux (17us → 29us), but the difference between the OSes is much smaller, only +14%. We can still see that the instruction flow is slower for Linux, with an L1-instruction cache hit ratio of 99.3% → 98.8% and ITLB relocations 0 → 450k. For the data flow, we can observe a better cache hit ratio for Linux, 98.2% → 98.5%, but there are still many more DTLB 4KB relocations, 0 → 550k, due to missing VSP/THP support. The L2 cache usage shows a greater number of accesses by Linux but also a higher hit ratio. The branch hit ratio still shows a performance impact for Linux, and there is still a great number of interrupts.

Based on the data collected, we found strong indications that the performance degradation can mainly be attributed to 1) a larger total code and data set (including the OS), and 2) the number of interrupts handled by the network driver. One explanation for the significant difference in message RTT between S1 and S2 is that the size of the complete code and data set is greater in Linux. Even a small size increase of the working set leads to an increased number of cache misses in lower level caches.

To overcome the performance degradation we suggest the following actions: 1) We should enable VSP/THP or similar functionality to utilize large memory pages, which will reduce the pressure on small (4kB) pages, 2) The Linux network driver should use interrupt coalescing or polled drivers to reduce the number of handled interrupts, and 3) We need to further investigate why the total code and data set is greater for Linux.

7.7 Related Work

Bell and John [6] describe a method to model an application by synthesizing vital metrics, which are then used to automatically create a representative test application with characteristics similar to the original one. Starting with the synthesizing procedure, we use a feedback control loop to model the system while Bell and John [6] use statistical simulation with instruction traces, described by Nussbaum and Smith [31]. Bell and John state that the synthesis procedure is semi-automatic, and an average of ten passes with some manual intervention is needed to tune the synthesis parameters. As a comparison, our feedback control algorithm allows the synthesis procedure to converge with no user interaction. Additionally, our model is described by the resulting configuration parameters, which are fed to the generic method. For Bell and John, model derivation is done at compile time, thus requiring a recompilation when altering the configuration. Another difference between our approaches is that we use a signaling application to detect any performance changes between releases while Bell and John use IPC.

Joshi et al. [25] have formulated a concept called performance cloning that can be used to synthesize characteristics from a proprietary application and create a model that mimics a similar behavior. In effect, Joshi et al. implement a similar methodology as Bell and John in [6] but have refined the memory and branching model to be hardware agnostic.

Doucette and Fedorova [12] have implemented functionality similar to ours, generating cache misses to determine application sensitivity to different architectures. For example, if an application is sensitive to accesses to a particular resource and another architecture has a different amount of that resource, the application performance is related to the hardware upon which it runs. By using their approach, it is to some extent possible to estimate the suitability of a new hardware platform for a particular application. Their load configuration is static, in contrast to our automatic mechanism.

Similar to the research by Eklov et al. [13, 14] and Tang et al. [36], our load generator steals hardware resources from other applications, thus starving them. In contrast to their method, we use a cache miss generator to mimic a certain execution environment, while Eklov et al. and Tang et al. determine the demand of the application concerning cache and memory bandwidth. Our work has concentrated on core-private cache instead of a shared cache.

Alameldeen et al. [1] investigate server platforms and come to the conclusion that it is quite difficult to create simulations of production systems. In their work they mimic the desired characteristics by using a manually tailored workload suite. In contrast, we use an automatic feedback-based load generator to achieve an approximation of the production application.

In the area of continuous system monitoring, we can find interesting related work, such as Anderson et al. [5]. In their approach, they implement a low-intrusive (1%-3%) sample-based mechanism to gather system-wide information. Their implementation samples hardware performance counters when they generate overflow interrupts. Our implementation uses a timer to sample the hardware performance counters periodically.

One of the standard works when monitoring or measuring system performance is the LM-Bench suite by Mcvoy and Staelin [29]. It is useful to measure and calculate cache and memory timings to understand the system behavior. Unfortunately, our legacy platform does not support all API-calls and development tools required by LM-Bench.

Eranian [15] claims that performance monitor counters are an essential component in performance measurements and evaluations. The investigation was done on the x86 hardware architecture, which is different from the PowerPC hardware architecture used by us, but the basic approach is similar: a measurement application is run to gather information for later evaluation. In our case, we have extended this idea to let the samples provide input to a feedback control algorithm that mimics the monitored system.

Diniz et al. [11] have investigated how to use feedback control mechanisms to improve program compilation. They have modified a compiler to use performance feedback results from an initial application test-run. Their method allows the compiler to utilize intricate optimization mechanisms that are not usually possible to utilize when using static methods at compile time. Lau et al. [27] extend the work by investigating how feedback control techniques can improve the performance of Java programs executing in a Virtual Machine (VM). They state that there are plenty of known optimization techniques available, but it is hard for a VM to know which one to use in particular cases. Sometimes a function optimization decreases the overall performance because the function may operate on different data during the next iteration.

Eyerman et al. [18] describe an architecture for measuring the CPI-stack via tailored Power5 hardware performance counters. They also describe the difficulties and errors when using simpler methods like multiplying the number of misses with the average cost. The total number of misses may, for example, contain entries in mispredicted execution paths that should be omitted in the real CPI calculation. In our case, we could only choose the “naive” method since our target hardware architecture only implemented simple counters. As with all measurement activities there are pros and cons. Some researchers, such as Eyerman and Eeckhout [17], suggest that CPI-stacks simplify the execution environment and may be misleading. Their suggestion is to use a system-level metric to describe the performance, see Alameldeen and Wood [2] and Eyerman and Eeckhout [17]. We opted to use CPI as the low-level metric to get an indication of which shared hardware resources to synthesize. We use message round-trip time as the system-level metric when measuring the system performance.
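As a minimal sketch of the “naive” method referred to above, the following Python function multiplies each miss count by an assumed average penalty and normalizes by the instruction count. The event names, counts and penalty values are hypothetical and chosen only for illustration; they are not measurements from the target system.

```python
def naive_cpi_stack(instructions, base_cpi, miss_counts, avg_penalty_cycles):
    """Estimate CPI components as (miss count x average penalty) / instructions.

    This is the simple method the text calls "naive": every counted miss is
    charged its average cost, including misses on mispredicted paths.
    """
    stack = {"base": base_cpi}
    for event, count in miss_counts.items():
        stack[event] = count * avg_penalty_cycles[event] / instructions
    stack["total"] = sum(stack.values())
    return stack


# Hypothetical counter values for one sampling interval.
example = naive_cpi_stack(
    instructions=1_000_000,
    base_cpi=1.0,
    miss_counts={"l1d_miss": 20_000, "l2_miss": 2_000},
    avg_penalty_cycles={"l1d_miss": 10, "l2_miss": 100},
)
```

With these assumed numbers each miss category contributes 0.2 cycles per instruction, which shows how the components indicate where cycles are spent, and also why over-counted misses inflate the estimate.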

Sherwood et al. [33,34] deduce that it is possible to model subsections of an application by dividing it into Basic Block Vectors and providing a hardware-independent metric. In our case we have problems simulating the production application because of its size and complexity, so it may be difficult to use this approach, but it is nevertheless a very interesting complement to the techniques we are presenting.

7.8 Future Work

We would like to further investigate the advantages of performance estimation when designing new hardware. When developing a large-scale system, a significant portion of the software application will usually remain the same for subsequent system releases. Only a relatively small part of the system is changed or added for each release. Many times a new system release only consists of updated hardware and the corresponding hardware drivers. Still, the development process is long and time-consuming, resulting in a desire to estimate the performance of the new hardware in the early design phases. We would like to investigate if our method can be utilized to improve the understanding of hardware performance in the early design phases, maybe even during the hardware creation.

A natural extension would be to extend our method with additional hardware and software metrics. We believe that it would improve the model accuracy and make it more applicable for other types of systems. Metrics to include may for example be L3 cache hit rate, branch hit/miss ratio, interrupt rate, etc. We would also like to add more system-level software metrics, which depend on the system being monitored.


In this paper, we have used a simplified model describing the behavior of the system by using the average value for each characteristics metric. Using an average value suffices in some cases, for example when finding certain performance-related bugs such as the one described in Section 7.6. However, when looking at the dynamic behavior of the system, see Figure 7.8a and Figure 7.8b, we realize that a dynamic model could describe the system dynamics as well. Memory access bursts tend to congest the memory bus, so a dynamic model can be a step towards a more complete model. Our current implementation has several for-loops that access instructions and data in specific patterns, which causes the desired level of cache misses. With the current synthesis method the model is created by modifying the for-loop iteration counters until the mimicked environment is similar to the production environment, i.e. the average values for the metrics are identical between the production and test environment, resulting in a counter set $C_{avg}$. Our suggestion to create a dynamic model is to create multiple sets of iteration counters, $C_0, C_1, \ldots, C_n$, by running the synthesis method for many sampled metric sets. By using this new approach, we will provide several for-loop iteration counter sets that each represent different cache characteristics. It will then be possible to provide an execution scheme that is similar to the original dynamic characteristics behavior, for example, $C_0 \to C_5 \to C_2 \to C_0 \to C_1 \to C_2 \to \ldots$
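The suggested execution scheme can be illustrated with a short sketch. It assumes that each counter set is a mapping from a (hypothetical) loop name to an iteration count and that `apply_counters` is a placeholder for whatever reconfigures the load generator; neither is taken from the actual implementation.

```python
from itertools import cycle, islice


def replay_schedule(counter_sets, schedule, steps, apply_counters):
    """Apply counter sets in the order given by `schedule`, repeating it
    until `steps` sets have been applied. Returns the visited indices."""
    visited = []
    for index in islice(cycle(schedule), steps):
        apply_counters(counter_sets[index])
        visited.append(index)
    return visited


# Hypothetical counter sets C0..C5; the values are placeholders.
counter_sets = {i: {"data_loop": 100 * (i + 1), "instr_loop": 50 * (i + 1)}
                for i in range(6)}
applied = []
order = replay_schedule(counter_sets, [0, 5, 2, 0, 1, 2], steps=8,
                        apply_counters=applied.append)
```

In a real load generator each step would run for one sampling interval before switching, so that the sequence of counter sets follows the sampled behavior of the production system.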

Using the current control algorithm, it takes approx. 15 minutes to converge. An improved control algorithm would be beneficial to decrease this time and improve accuracy. The HW properties we are controlling are connected to each other, causing undesired side effects. For example, when one of the properties, such as L1-data cache, is changed it may cause a change for L2-data cache. This causes problems converging to the desired state. Using more advanced control algorithms may reduce the time to converge.

We would also like to implement a quick-starting mechanism where our model synthesis starts from a previous state instead of starting from the beginning. We believe that quick-starting can decrease the convergence time.

In the current implementation the sampling interval is set to 1 second per metric set. This means that it takes several seconds between each sample for a specific set. This causes the control algorithm to converge at a moderate pace. It is also difficult to observe transients since the arithmetic mean over the sampling interval is recorded. Reducing the interval should decrease the convergence time and also provide more information on the dynamic behaviour of the system. We have not investigated how intrusive our monitoring mechanism is. Before increasing the sample frequency we need to study the sample cost further.


7.9 Conclusions

Developing large-scale systems is a very time-consuming task. Many designers are working in parallel, and the typical waterfall process model can result in performance testing at the very end of the development process. It is expensive to fix performance-related bugs found in the later stages of the development process because the original designer has moved on to new tasks. We have targeted this problem and tried to devise a method to find and fix performance-related bugs in the earlier stages of the development process.

Our first contribution is a low-intrusive method to periodically sample hardware performance counters from a production system. Our characteristics monitoring application has proven valuable in system development. It is possible to gain a better understanding of the investigated system’s performance by using the monitoring application. The characteristics monitor is now part of the product and used by both design and test departments within our organization.

The second contribution is a fully automatic synthesis method to create a hardware characteristics model of a production system. We have modeled parts of an extensive telecommunication production system on a much smaller test node. Our aim was to make it easier to detect performance-related problems in the earlier stages of system development.

As the third and final contribution, we have shown that our model mimics the production system and we have also verified our method by creating a cache characteristics model of a large telecommunication system. We have verified the model on a real telecommunication production system with a 40% market share.

References

[1] A. R. Alameldeen, M. Martin, C. J. Mauer, K. E. Moore, and M. Xu. Simulating a $2M Commercial Server on a $2K PC. IEEE Computer, 36(2):50–57, 2003.

[2] A. R. Alameldeen and D. A. Wood. IPC considered harmful for multiprocessor workloads. IEEE Micro, pages 8–17, 2006.

[3] O. Allam, S. Eyerman, and L. Eeckhout. An efficient CPI stack counter architecture for superscalar processors. Proceedings of the Great Lakes Symposium on VLSI, pages 55–58, 2012.


[4] G. Andai. Performance monitoring on high-end general processing boards. Master thesis, KTH Royal Institute of Technology, 2014.

[5] J. M. Anderson, L. M. Berc, J. Dean, S. Ghemawat, M. R. Henzinger, S.-T. A. Leung, R. L. Sites, M. T. Vandevoorde, C. A. Waldspurger, and W. E. Weihl. Continuous profiling: where have all the cycles gone? ACM SIGOPS, 15(4):357–390, 1997.

[6] R. H. Bell and L. K. John. Improved automatic testcase synthesis for performance model validation. In Proceedings of International Conference on Supercomputing, pages 111–120, 2005.

[7] S. Bennett. Nicolas Minorsky and the Automatic Steering of Ships. IEEE Control Systems Magazine, 4(4):10–15, 1984.

[8] M. Bergqvist, J. Engblom, M. Patel, and L. Lundegard. Some experience from the development of a simulator for a telecom cluster (CPPemu). In Proceedings of the International Association of Science and Technology for Development, pages 13–21, 2006.

[9] B. Boehm and V. R. Basil. Software Defect Reduction Top 10 List. Computer Journal, 34(1):135–137, 2001.

[10] B. Boehm and P. N. Papaccio. Understanding and controlling software costs. IEEE Transactions on Software Engineering, 14(10):1462–1477, 1988.

[11] P. C. Diniz and M. C. Rinard. Dynamic feedback: An Effective Technique for Adaptive Computing. ACM SIGPLAN Notices, 32(5):71–84, May 1997.

[12] D. Doucette and A. Fedorova. Base vectors: A potential technique for microarchitectural classification of applications. In Proceedings of the Workshop on the Interaction between Operating Systems and Computer Architecture, 2007.

[13] D. Eklov and D. Black-Schaffer. StatCC: a statistical cache contention model. Proceedings of the International conference on Parallel architectures and compilation techniques, pages 551–552, 2010.

[14] D. Eklov, N. Nikoleris, D. Black-Schaffer, and E. Hagersten. Cache Pirating: Measuring the Curse of the Shared Cache. In Proceedings of International Conference on Parallel Processing, pages 165–175, Sept. 2011.


[15] S. Eranian. What can performance counters do for memory subsystem analysis? In Proceedings of the ACM SIGPLAN workshop on Memory Systems Performance and Correctness, pages 26–30, 2008.

[16] H. Vestberg. Ericsson unveils new products, partnerships and increased market share. In Proceedings of at Mobile World Conference, 2012.

[17] S. Eyerman and L. Eeckhout. System-level performance metrics for multiprogram workloads. IEEE Micro, 28(3):42–53, 2008.

[18] S. Eyerman, L. Eeckhout, T. Karkhanis, and J. E. Smith. A Top-Down Approach to Architecting CPI Component Performance Counters. IEEE Micro, 27(1):84–93, 2007.

[19] Freescale. P4080 Reference Manual Rev F. 2009.

[20] Freescale. e500mc Core Reference Manual Rev F. 2010.

[21] D. Hallmans, M. Jagemar, S. Larsson, and T. Nolte. Identifying Evolution Problems for Large Long Term Industrial Evolution Systems. In Proceedings of IEEE International Workshop on Industrial Experience in Embedded Systems Design (COMPSAC14), Vasteras, 2014.

[22] A. Hartstein, V. Srinivasan, T. Puzak, and P. Emma. Cache miss behavior: is it √2? In Proceedings of Conference on Computing Frontiers, pages 313–320, 2006.

[23] IEEE. International Standard ISO/IEC 15288 - Systems and software engineering - System life cycle processes, volume 8. 2008.

[24] M. Jagemar, S. Eldh, A. Ermedahl, and B. Lisper. Feedback-Based Generation of Hardware Characteristics. Technical report, Malardalen University, 2012.

[25] A. Joshi, L. Eeckhout, R. H. Bell, and L. K. John. Distilling the essence of proprietary workloads into miniature benchmarks. ACM Transactions on Architecture and Code Optimization, 5(2):1–33, Aug. 2008.

[26] C. Lameter. Bazillions of Pages - The Future of Memory Management under Linux. In Proceedings of the Linux Symposium, volume 1, pages 275–284. Ottawa, Ontario, Canada, 2008.


[27] J. Lau, M. Arnold, M. Hind, and B. Calder. Online performance auditing. In Proceedings of ACM SIGPLAN Conference on Programming language design and implementation, pages 239–251, June 2006.

[28] J. Mashey. War of the benchmark means: time for a truce. ACM SIGARCH Computer Architecture News, 32(4):1–14, 2004.

[29] L. Mcvoy and C. Staelin. lmbench: Portable Tools for Performance Analysis. In Proceedings of the USENIX Annual Technical Conference, pages 279–294, 1996.

[30] P. Michaud. Demystifying multicore throughput metrics. IEEE Computer Architecture Letters, pages 10–13, 2012.

[31] S. Nussbaum and J. Smith. Modeling superscalar processors via statistical simulation. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, pages 15–24, 2001.

[32] K. Rowe. Time to market is a critical consideration, 2010.

[33] T. Sherwood and B. Calder. Basic Block Distribution Analysis to Find Periodic Behavior and Simulation Points in Applications. Number September, 2001.

[34] T. Sherwood and G. Hamerly. Automatically Characterizing Large Scale Program Behavior. In Proceedings of the conference on Architectural Support for Programming Languages and Operating Systems, 2002.

[35] Stackoverflow. Generate Instruction Cache misses.

[36] L. Tang, J. Mars, N. Vachharajani, and M. L. Soffa. The Impact of Memory Subsystem Resource Sharing on Datacenter Applications Categories and Subject Descriptors. In Proceedings of the 38th annual international symposium on Computer architecture, pages 283–294, 2011.

[37] G. Tassey. The economic impacts of inadequate infrastructure for software testing. Technical Report 7007, National Institute of Standards and Technology, 2002.

[38] P. Taylor. Battle lines are drawn for the future of 4G, 2008.

[39] Telecomasia. Faster time to market with next-gen OSS, 2012.

[40] W. F. Tichy. Should Computer Scientists Experiment More? IEEE Computer, 31(5):32–40, 1998.

8 Automatic Message Compression with Overload Protection

Marcus Jagemar, Sigrid Eldh, Andreas Ermedahl and Bjorn Lisper. Automatic Message Compression with Overload Protection. In press: Journal of Systems and Software, 2016. [67]. This paper is an extension of the already published paper G [66].


Abstract

In this paper, we show that it is possible to increase the message throughput of a large-scale industrial system by selectively compressing messages. The demand for new high-performance message processing systems conflicts with the cost effectiveness of legacy systems. The result is often a mixed environment with several concurrent system generations. Such a mixed environment does not allow a complete replacement of the communication backbone to provide the increased messaging performance. Thus, performance-enhancing software solutions are highly attractive. Our contribution is 1) an online compression mechanism that automatically selects the most appropriate compression algorithm to minimize the message round trip time; 2) a compression overload mechanism that ensures ample resources for other processes sharing the same CPU. We have integrated 11 well-known compression algorithms/configurations and tested them with production node traffic. In our target system, automatic message compression results in a 9.6% reduction of message round trip time. The selection procedure is fully automatic and does not require any manual intervention. The automatic behavior makes it particularly suitable for large systems where it is difficult to predict future system behavior.


8.1 Introduction

There is a great demand for high-performance communication in today’s industry, both within and between systems. We are in the midst of the evolution into multicore CPUs, which causes the computational capacity [1, 2] to grow quicker than the available communication bandwidth [3]. Large-scale industrial systems [4] have additional problem areas, for example, the large and very expensive legacy of already installed systems. It is not always economically feasible to replace current systems with newer ones just because they can provide higher performance. Industrial requirements explicitly state that older systems must coexist with more modern ones, which poses difficulties when requiring substantial performance improvements. We have formulated the following questions to address these problems:

Q1 Is it possible to increase the overall message processing performance byusing message compression?

Manual configuration is frowned upon for large-scale systems with changing message content. Industrial systems implicitly require auto-configurable messaging systems, removing the need for off-line decisions. Communication mechanisms must be able to handle different initial scenarios and a dynamic environment.

Q2 Can an automatic method provide the best message processing performance by selecting the best compression algorithm from a set of algorithms?

Q3 Can an automatic method continuously ensure the best message processing performance when message content changes?

Message compression implies that we are trading CPU resources for a reduction in message size, which in turn leads to quicker message transit time. A message processing node in our industrial environment runs several essential services that must not have their execution disrupted. Since unrestricted message compression may overload the CPU, there must be an overload protection for message compression. We formulate this in the last question:

Q4 Is it possible to limit CPU resources, used for message compression, so that it does not seriously affect other services co-located on the same CPU?


In Section 8.2 we give an overview of the general ideas presented in this paper. We describe the typical use cases where automatic message compression can be applied and the practical results of the current implementation. Our main contributions are:

• A novel mechanism that automatically and transparently evaluates the messaging performance of different compression algorithms and selects the most efficient one for the current communication stream. (Section 8.3.1)

• Our automatic selection mechanism detects network congestion and message content changes and will continuously select the best compression algorithm. (Section 8.3.2 and 8.3.3)

• Our selection mechanism can simultaneously handle multiple communication streams. Each stream will have its own environment, providing the possibility to have different compression algorithms for different communication streams depending on their suitability. (Section 8.3.4)

We have addressed the overload problem by dynamically adjusting the available CPU resources allocated for compression:

• We have implemented a Proportional-Integral-Derivative (PID) feedback controller that evaluates the CPU usage and dynamically adapts the ratio of compressed and uncompressed messages. (Section 8.3.5)

We have tested our implementation on a large telecommunication system with a major market share. We have used a realistic test setup to show that our implementation adds performance improvements to an already existing system.

• We have used production data in the test environment. (Section 8.4.1)

• We have evaluated 11 different compression algorithms and configurations with varying characteristics. (Section 8.4.2)

In Section 8.5 we display the test results from an experiment using data from a telecommunication production system. The first results, in Section 8.5.1, show a 9.6% decrease of message Round Trip Time (RTT) when using the automatic compression algorithm selection method compared to not using compression. In Section 8.5.2 we show that it is possible to automatically select the compression algorithm providing the lowest message RTT. We have extended the automatic selection technique in Section 8.5.3, showing a message-stream change during the test case execution. The content change is detected, and a different compression algorithm is selected to keep the lowest RTT. We have also shown that the selection process can handle CPU overload situations, see Section 8.5.4. During the overload situation, message compression is temporarily and proportionally reduced. As the CPU load returns to a lower non-overload level, message compression is resumed to the initial level. We have also reviewed how our results compare to related work, see Section 8.5. Previous publications use either very few compression algorithms [5] or specialize on explicit types of communication patterns [6], for example, patterns such as data properties, message length, etc. Such static configuration contrasts with our automatic selection mechanism that automatically adapts to the execution environment without any manual intervention. We conclude the paper with future work, Section 8.7, and conclusions in Section 8.8.

8.1.1 Definitions

We define host in our communication system to be a computer that first receives a message and then spends some time processing it, producing a result to be sent onwards to another host later. A host is usually, for cost-saving reasons, also configured to handle additional concurrent tasks, such as statistics measurements, user interaction, database management and other similar actions. Further, cost-effectiveness considerations encourage a move of software previously implemented on separate hardware to shared processing boards. Such software consolidations put high demands on the CPU usage of applications sharing a resource, which further complicates the execution environment.

We define messaging performance as a function of the time from the snd_msg() call until the receiver obtains the message, i.e. $t_{s1} + t_{l1} + t_{r1}$ in Figure 8.1a. The messaging performance varies depending on properties such as link speed and host distance in the network. In the case of message compression, additional properties can be added such as compression/decompression rate and compression ratio.

The concept message processing performance is the measurable ability to process messages per time interval. In our study, we measure this as the message RTT between two interconnected nodes.

We define the best compression algorithm to be the one that gives the lowest message RTT.

8.2 Problem Formulation and System Model

The main idea presented in this paper is to reduce message round trip time by using selective message compression. Manually finding the best compression algorithm is both complex and time-consuming. There are also practical considerations such as how to handle the scenario when the message-stream content changes. Our approach is to implement a selection mechanism that automatically chooses the best compression algorithm, depending on current message content and network congestion levels. For systems with no compression, this is a straight-forward procedure depicted in Figure 8.1a.

Figure 8.1: Schematic description of generic messaging system. (a) Without message compression: process A calls snd_msg() and process B calls rcv_msg(), with send time $t_{s1}$, link time $t_{l1}$ and receive time $t_{r1}$. (b) With message compression: compression time $t_c$ and decompression time $t_d$ are added to send time $t_{s2}$, link time $t_{l2}$ and receive time $t_{r2}$.

Process A in Figure 8.1a communicates with process B, located on another processor. A and B use legacy functionality without message compression. Using message compression makes it possible to increase messaging performance, see Figure 8.1b.

With our novel solution, we can transparently improve messaging performance by adding message compression to the legacy Application Programming Interface (API). We suggest adding selective message compression functionality inside the snd_msg() function and corresponding decompression functionality in rcv_msg(). It is a substantial cost-saving for any industrial software to improve performance without API changes. API changes are usually frowned upon as they often have a significant impact on other software using the API. Breaking a legacy API will often require costly modifications to other system components as well as complex regression testing and challenging customer discussions. Our solution is to transparently update the existing API to monitor each communication instance. Process A → B is one instance and process A → C is another, where A, B and C are processes in the communication system. The communication API can transparently utilize the most suitable compression algorithm for different instances and types of message content.

The message scenario starts in a round-robin fashion by sending messages using all compression methods. Each communication instance stores compression and network statistics. After an initial evaluation period, one compression algorithm will be selected if it is predicted to give the lowest message round trip time.

The total time ($t_{t1}$) in Equation 8.1 is the sum of the time inside the send function ($t_{s1}$), the time to travel on the link ($t_{l1}$) and the time inside the receive function ($t_{r1}$). See Figure 8.1a.

$$t_{t1} = t_{s1} + t_{l1} + t_{r1} \tag{8.1}$$

As illustrated in Figure 8.1b, our approach is to add compression ($t_c$) and decompression ($t_d$), see Equation 8.2.

$$t_{t2} = t_c + t_{s2} + t_{l2} + t_{r2} + t_d \tag{8.2}$$

The send and receive functions do not differ between Figure 8.1a and 8.1b, therefore we assume that $t_{s1} = t_{s2}$ and $t_{r1} = t_{r2}$. This means that message compression is beneficial if the time it takes to compress and decompress messages is lower than the saved link time:

$$t_c + t_{l2} + t_d < t_{l1} \tag{8.3}$$
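Equation 8.3 can be checked with a small worked example. The sketch below assumes that link time is simply message size divided by link bandwidth (ignoring latency and protocol overhead) and uses made-up sizes and timings; only the inequality itself comes from the text.

```python
def link_time_us(size_bytes, bandwidth_mbit_s):
    """Transfer time in microseconds, assuming time = bits / bandwidth."""
    return size_bytes * 8 / bandwidth_mbit_s


def compression_is_beneficial(t_c, t_l2, t_d, t_l1):
    """Equation 8.3: t_c + t_l2 + t_d < t_l1 (all times in the same unit)."""
    return t_c + t_l2 + t_d < t_l1


# Hypothetical message: 1000 bytes compressed to 400 bytes,
# 200 us to compress and 100 us to decompress.
T_C, T_D = 200.0, 100.0
slow = compression_is_beneficial(T_C, link_time_us(400, 10), T_D,
                                 link_time_us(1000, 10))
fast = compression_is_beneficial(T_C, link_time_us(400, 100), T_D,
                                 link_time_us(1000, 100))
```

With these assumed numbers, compression pays off on the 10 Mbit/s link (620 µs versus 800 µs) but not on the 100 Mbit/s link (332 µs versus 80 µs), which illustrates why the decision depends on the current network conditions.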

We can predict the performance of each compression algorithm by using statistics gathered from prior communication. The statistical data is then used to predict future message compression, send, link, receive and decompression time for each compression algorithm. The prediction method selects the compression algorithm that gives the lowest message time ($t_t$). The selected algorithm is used for a majority of messages until a different algorithm outperforms the current one. To make sure that it is possible to detect a network or message content change, some messages are sent using other compression algorithms. The idea is to gather statistical data for all implemented algorithms, not only the one that is selected.

We assume that F is the selected compression algorithm for a message stream. F is, therefore, used for the majority of the transmitted messages. The rest of the algorithms will continuously be evaluated by compressing a small number of messages. If the content of the message stream changes, the algorithm evaluation may favour a different algorithm, which will then be chosen as the best one.
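A minimal sketch of this selection step is shown below. It assumes that the prediction is the arithmetic mean of previously measured total message times per algorithm; the text does not specify the estimator, and the class and algorithm names are hypothetical.

```python
class MessageTimeStats:
    """Per-algorithm history of measured total message times (t_t)."""

    def __init__(self):
        self._sum = {}
        self._count = {}

    def record(self, algorithm, total_time):
        self._sum[algorithm] = self._sum.get(algorithm, 0.0) + total_time
        self._count[algorithm] = self._count.get(algorithm, 0) + 1

    def predicted_time(self, algorithm):
        """Mean of recorded times; unmeasured algorithms predict as infinity."""
        count = self._count.get(algorithm, 0)
        return self._sum[algorithm] / count if count else float("inf")

    def best(self, algorithms):
        """Algorithm with the lowest predicted total message time."""
        return min(algorithms, key=self.predicted_time)


stats = MessageTimeStats()
for name, t in [("none", 150.0), ("lz", 100.0), ("lz", 120.0), ("bz", 180.0)]:
    stats.record(name, t)
first_choice = stats.best(["none", "lz", "bz"])

# The message content changes and "lz" now performs poorly.
for t in (400.0, 400.0):
    stats.record("lz", t)
second_choice = stats.best(["none", "lz", "bz"])
```

A plain mean reacts slowly to changes; a windowed or exponentially weighted average would be a natural alternative, but that choice is outside what the text states.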

If the CPU load increases beyond a limit, a PID-feedback controller reduces the message compression time-quota, causing messages to be sent uncompressed. Consider the following scenario:

Figure 8.2: At the start of each round, r, messages are being compressed. When the compression quota is exhausted, messages are transmitted uncompressed to reduce CPU load. In this example, the ratio compressed/uncompressed is increasing slowly.

We assume that process S shares the same CPU as process A. The throttling mechanism will reduce the compression quota if the load of A and S together exceeds a predefined limit. For example, reduce the number of compressed messages when S performs time-consuming calculations and A solicits heavy compression. Such an overload handling mechanism aims to increase the availability of service S.
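The throttling behaviour can be sketched with a textbook PID controller. The gains, the CPU load set point and the unit of the compression quota are assumptions made for illustration; the text only states that a PID feedback controller adjusts the quota based on CPU usage.

```python
class PIDController:
    """Textbook PID controller working on error = setpoint - measurement."""

    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.prev_error = None

    def update(self, measured, dt=1.0):
        error = self.setpoint - measured
        self.integral += error * dt
        # No derivative term on the first sample (no previous error yet).
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def next_quota(quota, controller, cpu_load, max_quota):
    """Adjust the per-round compression time quota and clamp it to [0, max_quota]."""
    return min(max(quota + controller.update(cpu_load), 0.0), max_quota)


# Hypothetical values: CPU load limit of 80 %, quota in milliseconds per round.
pid = PIDController(kp=0.5, ki=0.1, kd=0.05, setpoint=80.0)
quota = 50.0
overloaded = next_quota(quota, pid, cpu_load=95.0, max_quota=100.0)
recovered = next_quota(overloaded, pid, cpu_load=60.0, max_quota=100.0)
```

When the combined load of A and S is above the set point the error is negative and the quota shrinks, so more messages are sent uncompressed; when the load drops the quota grows again.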

We think that it is hard to manually, in an offline manner, consider all possible scenarios and select the most suitable compression algorithm. Our approach, using an automatic mechanism, greatly simplifies this task and provides the flexibility needed for a changing message stream while at the same time being able to provide CPU resources for other services sharing the same hardware resources.

8.3 Adaption

In Section 8.2 we demonstrated that it is beneficial to use message compression for particular situations, see Equations 8.1–8.3. Up to now we have assumed that the optimal compression algorithm is known. However, in many scenarios, this is not the case. The message content may be unknown to the programmer, which makes it difficult to select an appropriate compression algorithm manually. To find the most suitable compression algorithm there are two approaches. The first method is to manually choose the compression algorithm that the operator thinks is the best one. The second method, the one we suggest in this paper, is to use an automatic selection mechanism that evaluates and selects the compression algorithm that is most suitable for the current message stream.


8.3.1 The Communication Procedure

We explain our automatic message compression method by showing the send and receive procedure. All messages being sent belong to a message stream, which is owned by the sending process, see Figure 8.2. The message stream is divided into rounds separated by an assessment period where the previous round is evaluated. The system decides the compression strategy for the next round at the end of each evaluation period. It is desirable to keep the round as long as possible while still letting adjustments take effect in a reasonable time. Additionally, it must be long enough so that there is sufficient information to evaluate to make a good estimation of the compression algorithm performance.

Sending a message

The list below summarizes our suggested communication procedure. It shows the major steps and links to later sections describing more details. For each new round r do:

1. Evaluate previous statistics and assign r its parameters.

(a) Calculate the algorithm distribution for r by setting a weighted probability, see Section 8.3.4. A more efficient algorithm gets a higher probability of being selected than other algorithms.

(b) Derive a compression time budget for r, see Section 8.3.5.

2. Send messages during round r.

(a) Select a random compression algorithm according to its weighted probability. This means that there might be different compression algorithms for adjacent messages depending on message size and content. If the time budget is empty, our mechanism will not compress the message.

(b) Send the message.

(c) Update the statistics for the compression algorithm used and decrease the available time budget, reflecting the time it took to compress the current message.

3. Until end of round, goto item 2.

4. When the round ends, goto item 1.

The round length is determined by any quantifiable metric, for example, wall-clock time or number of messages. Which metric to choose depends on system requirements and communication system behavior.
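As an illustration only, step 2 of the send procedure can be sketched in Python as follows. This is a minimal sketch, not the implementation described in this chapter: the names `run_round`, `weights`, `budget_s` and `send` are hypothetical, the round is assumed to be delimited by message count, and `zlib` and `bz2` from the Python standard library stand in for the real algorithm set.

```python
import bz2
import random
import time
import zlib

# Stand-in algorithms (assumption); "none" sends the message uncompressed.
ALGORITHMS = {"none": lambda b: b, "zlib": zlib.compress, "bz2": bz2.compress}


def run_round(messages, weights, budget_s, send, rng=random):
    """Send one round of messages and return per-algorithm statistics."""
    stats = {name: {"count": 0, "time_s": 0.0} for name in ALGORITHMS}
    names = list(weights)
    for msg in messages:
        # Step 2a: weighted random choice, or no compression on an empty budget.
        if budget_s > 0:
            name = rng.choices(names, weights=[weights[n] for n in names])[0]
        else:
            name = "none"
        start = time.perf_counter()
        payload = ALGORITHMS[name](msg)
        elapsed = time.perf_counter() - start
        send(name, payload)                    # Step 2b: hand over to transport
        stats[name]["count"] += 1              # Step 2c: update statistics
        stats[name]["time_s"] += elapsed
        if name != "none":
            budget_s -= elapsed                # Step 2c: charge the time budget
    return stats
```

The returned statistics are what the evaluation in step 1 would consume when it assigns the weights and the time budget for the next round.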


Receiving a message

The receive procedure is simpler than the send procedure since it does not require any compression algorithm selection. The following procedure outlines the necessary steps to be performed by the receiver.

1. The message is received.

2. The header of the received message is parsed to reveal the compression algorithm used by the sender.

3. The message is decompressed.

(a) Memory is allocated for the uncompressed message.

(b) The time it took to decompress the message is stored in a database. The stored time will be used for probe message handling, Section 8.4.4.

4. The application can resume operation with the newly received message.

The communication procedure is designed to be completely transparent to the application. The legacy-compliant API hides all functionality related to message decompression.
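The receive steps can be sketched in the same spirit. The header layout is not specified here, so the one-byte algorithm identifier below is purely an assumption, as are the names `pack`, `receive` and `decompress_times`; `zlib` again stands in for the real algorithms.

```python
import time
import zlib

# Hypothetical one-byte algorithm identifiers; the real header layout is not given.
DECOMPRESSORS = {0: lambda b: b, 1: zlib.decompress}
decompress_times = {}  # algorithm id -> list of decompression times [s]


def pack(alg_id, payload):
    """Sender side: prepend the algorithm identifier to the payload."""
    return bytes([alg_id]) + payload


def receive(frame):
    """Parse the header, decompress, record the time, return the message."""
    alg_id, payload = frame[0], frame[1:]
    start = time.perf_counter()
    message = DECOMPRESSORS[alg_id](payload)
    decompress_times.setdefault(alg_id, []).append(time.perf_counter() - start)
    return message
```

The application only ever sees the return value of `receive`, which mirrors the transparency requirement stated above.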

8.3.2 Network Measurements

Our implementation measures the network capacity by continuously monitoring the send-time, ts, and Round Trip Time (RTT), trtt. The send-time is measured inside the snd msg() API function call. Periodic probe messages are sent to measure the RTT. Measurement data is stored in a local database and used in the compression algorithm selection process. Each measurement is a cumulative moving average to ensure that network changes, such as congestion, will influence the algorithm selection procedure. In general, hard compression is favored on slower, congested networks with high transmission time. Using cumulative average data has proven to be successful for our target system. If another system has different demands, it is easy to change the statistical model. Only looking at recent messages may give a quicker response at the expense of not being as resilient.
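To make the two statistical models concrete, the sketch below contrasts a cumulative moving average with an exponentially weighted average that favors recent samples. The class names and the smoothing factor `alpha` are assumptions chosen for illustration; only the cumulative variant corresponds to what the text describes.

```python
class CumulativeAverage:
    """Mean over all samples seen so far, updated incrementally."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, sample):
        self.count += 1
        self.mean += (sample - self.mean) / self.count
        return self.mean


class ExponentialAverage:
    """Alternative model: recent samples weigh more (quicker, less resilient)."""

    def __init__(self, alpha):
        self.alpha = alpha
        self.mean = None

    def update(self, sample):
        if self.mean is None:
            self.mean = float(sample)
        else:
            self.mean = self.alpha * sample + (1.0 - self.alpha) * self.mean
        return self.mean
```

Swapping one class for the other changes how quickly a congestion episode shows up in the stored send-time and RTT values without touching the rest of the selection logic.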

8.3.3 Compression Measurements

Several algorithm-specific metrics are measured, such as compression rate, tcr, decompression rate, tdr, and compression ratio, rc. The counters are updated each round and show the messaging properties of the system they are running on. The values will differ depending on the hardware and the message content. The values are calculated as a cumulative moving average to provide a good balance between quick response and stable behavior. If a particular system experiences problems with this approach, it is simple to change this mechanism to a more appropriate one.

8.3.4 Selecting the Best Compression Algorithm

It is possible to use many different approaches when selecting the most appropriate compression algorithm. For example, we could make a static selection based on the result of an initial evaluation run. Such a predefined selection would initially provide the best possible overall compression ratio, but any content change in the communication stream would over time cause the selected algorithm to perform poorly. At the other end of the spectrum, we could use all available algorithms evenly in a round-robin fashion. Using round-robin provides a lower overall compression ratio but will be robust whenever there are content changes. For comparison, we have defined three different algorithm distributions: Majority, One Algorithm and Round Robin. They are explained in more detail below.

Compression Selection 1 - Majority Distribution

We start by identifying the best compression algorithm, the one providing the shortest message RTT. The majority distribution uses the best compression algorithm for as many messages as possible while still retaining the ability to switch to another compression algorithm should the message stream change. At the beginning of each round, a compression algorithm distribution is calculated based on the measurements during previous rounds. The distribution is a suitability rating of the compression algorithms and takes many parameters into consideration, such as compression time, transmission time, etc. Message stream content can change depending on the users of the communication channel. A change in message content may lead to vast differences in compression ratio and compression time for different algorithms. This is one reason for allowing all compression techniques to run in each round. Completely turning off a compression algorithm would make the strategy static and unable to cope with a changing situation. We use a simple scheme to decide the compression distribution where the bulk of messages are compressed using the best algorithm and each of the other algorithms receives 1% of the compression budget. Such a distribution allocates the majority of the compression time quota to the algorithm causing the shortest message round-trip time.


We have selected this distribution because it is simple while still being reasonably efficient. Should the need arise, it is always possible to implement more advanced and tailored distributions.
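A minimal sketch of the majority distribution, under the assumption that suitability is expressed as an estimated RTT per algorithm where lower is better. The function name `majority_distribution` and the parameter `minor_share` are hypothetical; the 1% share per non-optimal algorithm is the value given above.

```python
def majority_distribution(suitability, minor_share=0.01):
    """Map algorithm name -> share of the compression budget.

    suitability: dict of algorithm name -> estimated RTT (lower is better).
    Every non-best algorithm keeps minor_share so it can still be evaluated.
    """
    best = min(suitability, key=suitability.get)
    dist = {name: minor_share for name in suitability}
    dist[best] = 1.0 - minor_share * (len(suitability) - 1)
    return dist
```

With eleven candidates, for example, the ten non-optimal algorithms would together take 10% of the budget, leaving 90% for the best one.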

Compression Selection 2 - One Algorithm

One algorithm is manually selected and gets the complete compression budget. No other compression algorithm will be used, which in effect makes this an offline method. If it is possible to find the best algorithm, it will provide the best compression ratio. However, it is a static selection, which may result in poor performance for message streams where the content changes.

Compression Selection 3 - Round Robin

The compression budget is shared equally between the available compression algorithms. Each algorithm gets the same number of slots with no regard to its individual performance. Using all algorithms equally results in great flexibility, but lower performance since it does not fully utilize the best compression algorithm.
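For comparison, the two reference distributions can be expressed as budget shares per algorithm. This is a sketch only; the function names are hypothetical.

```python
def one_algorithm_distribution(names, selected):
    """The manually selected algorithm gets the complete budget."""
    if selected not in names:
        raise ValueError("selected algorithm is not available")
    return {name: 1.0 if name == selected else 0.0 for name in names}


def round_robin_distribution(names):
    """Every algorithm gets the same share, regardless of performance."""
    share = 1.0 / len(names)
    return {name: share for name in names}
```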

8.3.5 Compression Throttling

We want to make sure that all processes get a fair chance of running. In some cases, high CPU-load and excessive message compression may starve other services running on the same CPU. We have implemented a control algorithm that throttles the number of CPU cycles that can be used for message compression. The idea is to continuously monitor CPU-usage and current communication bandwidth. If the current CPU-load is low and the desired bandwidth is higher than what is available, it is possible to trade computational capacity for an increased compression level, causing the perceived bandwidth to increase.

Compression Time

We can measure the time spent compressing messages during a round. This is achieved by looking at the compression time tc(n) and decompression time td(n) for each individual message n. Adding tc(n) and td(n) for all N messages during a round results in the total time, ttot, spent performing compression activities (8.4). Our mechanism measures the decompression time and piggybacks the result to probe messages that are sent between the sending and receiving node.

$$t_{tot} = \sum_{n=0}^{N-1} \left( t_c(n) + t_d(n) \right) \tag{8.4}$$
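Equation 8.4 amounts to a plain sum over the per-message measurements of a round. The helper below is a sketch with a hypothetical name; it assumes the sender already holds the decompression times reported back by the receiver.

```python
def total_compression_time(samples):
    """Equation 8.4: sum of t_c(n) + t_d(n) over the N messages of a round.

    samples: iterable of (t_c, t_d) pairs in seconds.
    """
    return sum(t_c + t_d for t_c, t_d in samples)
```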

Controlling the total compression time

We want to throttle the total time spent compressing messages, ttot, per round to adjust the computational capacity spent on message compression. We use a feedback control algorithm that maximizes the amount of time assigned for compression to match the desired CPU-load target level. We want to find the optimal distribution of CPU-load between computational capacity and message compression to maximize the throughput without overloading the CPU.

The Control Algorithm

We use a Proportional-Integral-Derivative (PID) controller to restrict the CPU capacity assigned to message compression. We have implemented this functionality because message compression can easily overload the CPU if used for all messages. An overload situation can seriously affect the functionality of other processes executing on the same CPU. A system engineer can define the amount of CPU capacity assigned to message compression depending on the desired system requirements.

The PID control algorithm uses CPU load as input and ttot as output. The control algorithm will continuously adapt ttot to converge to the target CPU load given at the time of system configuration. Increasing ttot will cause more CPU time to be allocated for message compression.

Initial Values

The compression budget (ttot) is initially set to 0 since we do not have any knowledge of the message stream content when a new system is deployed. For the first round, there will be no compression, and for the subsequent rounds the feedback control loop will increase ttot to gradually allow more compression. The desired CPU load is system dependent and, therefore, configurable. Most systems will have different services running at the same time as the messaging system, so the CPU-load value must be carefully determined.
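A textbook PID loop with CPU load as input and the compression budget as output could look like the sketch below. The class name, the gains `kp`, `ki`, `kd`, and the clamp `max_budget_s` are assumptions; the tuning used on the target system is not stated, and anti-windup handling is omitted for brevity.

```python
class CompressionBudgetController:
    """PID controller: measured CPU load in, compression budget t_tot out."""

    def __init__(self, target_load, kp, ki, kd, max_budget_s):
        self.target_load = target_load
        self.kp, self.ki, self.kd = kp, ki, kd
        self.max_budget_s = max_budget_s
        self.integral = 0.0
        self.prev_error = 0.0
        self.budget_s = 0.0  # t_tot starts at 0: no compression in round one

    def update(self, measured_load):
        """Call once per round; returns the budget for the next round."""
        error = self.target_load - measured_load
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        output = self.kp * error + self.ki * self.integral + self.kd * derivative
        # The budget is a time, so clamp it to [0, max_budget_s].
        self.budget_s = min(self.max_budget_s, max(0.0, output))
        return self.budget_s
```

With a dominant integral gain, the budget grows gradually while the load stays below the target and shrinks once the load exceeds it, which matches the behavior described above.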


[Figure 8.3 diagram: Process A and Process B, each with services on top of a communication subsystem offering the algorithms None, LZF, LZ4, ..., Algorithm n; messages in transit on the communication channel are marked None, LZ4, LZ4, LZFX.]

Figure 8.3: The structure and behavior of two interconnected nodes in a communication system. Process A and B have multiple message streams using different message compression algorithms depending on their payload content.

8.4 Test System Setup

We have tested our adaptive message compression implementation by running several experiments on a test setup designed for large-scale telecommunication systems. In the following section we will map the generic description in Sections 8.2 and 8.3 to our implementation. The implementation is individual for each target system since the procedure to obtain metrics may differ depending on hardware and operating system.

8.4.1 The Test System

We have followed the guidelines and checklist suggested by Petersen [7] to explain the system we have investigated. We have investigated a telecommunication system [8, 9] where the platform consists of a third-party embedded real-time operating system (OS). On top of the OS, approximately 5M Source Lines Of Code (SLOC) implement the middleware. The middleware implements cluster control, error management, upgrade functionality and other generic tasks that are common to all types of applications running on top of the OS. Our test system consists of multiple CPU boards where each CPU runs several thousand processes, see Figure 8.3. Each process can have its own message queue, which stores metadata such as compression algorithm statistics and network congestion levels. The statistical data is used to predict the optimal compression strategy. The result is that two message streams may use different compression algorithms depending on their message content, destination node and network congestion level. Thousands of engineers have developed the system [4] over many years. Multiple stakeholders compete for the inclusion of their functionality within the platform. As a result of the size and complexity, many months pass from the requirement phase to the final release of the implemented functionality. The platform runs on more than 20 types of hardware boards with varying complexity and performance levels, ranging from single-core CPUs with MBs of memory up to large multi-core clusters with many GBs of memory. They have different functionality and hardware layout, servicing both voice and data communication. The telecommunication system we have investigated has about 25% market share for telecommunication network equipment and ranks as the most used radio access network in the world [10].

For our experiments, we use control traffic gathered from a production environment, intercepted by Wireshark [11]. We replay these messages to provide a real-life scenario. We have implemented the functionality described in this paper as part of a Linux-based system. The application we are targeting runs inside an emulation layer that adds support for a legacy operating system. We have only made changes in the communication API by adding message compression support before entering the TCP/IP layer inside the Linux kernel. We have made no changes at all to the Linux kernel itself. The test setup is as follows: Node A is a Quad-Core AMD Opteron [email protected] running Linux 3.2; Node B is an Intel [email protected] running Linux 3.2. A 100Mbit Ethernet network shared with a large number of other workstations connects both nodes.

8.4.2 Compression Algorithms

There are numerous compression algorithms, all designed for various uses and with different characteristics. Some [12] [13] of the algorithms focus on pure compression ratio with little regard to the compression and decompression rate. Others [14] [15] provide a lower compression ratio but are faster. Both approaches are useful in different situations. Additionally, it is possible to accelerate compression algorithms, partially or completely, by implementing them in hardware [16]. In this paper we focus on lossless compression, thus completely disregarding lossy compression techniques. Lossy techniques could be applied to data that doesn't need to be transmitted in an unchanged manner. Our implementation of the automatic mechanism uses eleven compression algorithms/configurations: LZFX [17], LZO [14], LZO-SAFE, which is a safe configuration of LZO, LZMA [12], LZW [18], BZ2 [13], LZ4 [15], FastLZ level 1 and 2 [19], Snappy [20], and QLZ [21]. The key properties are listed in Table 8.1. The list is extended from the results by Karlsson and Hansson [22], who compare several compression techniques with regard to compression ratio and resource usage. Their work also investigates the suitability of each algorithm in the context of communication scenarios. Ringwelski et al. [23] manually investigate a number of compression techniques with regard to compression ratio and computational resources. This is the starting point for our investigation: how can this task be fully automated?

Table 8.1: Implemented compression algorithms and their characteristics.

Compr. Alg.   Key Characteristics                                              Ref.
LZFX          Fast compression, low cr                                         [17]
LZO           Fast compression, low cr                                         [14]
LZO-SAFE      A safe, slightly slower, configuration of LZO.                   [14]
LZMA          Slow compression, high cr                                        [12]
LZW           Medium compression, medium cr                                    [18]
BZ2           Medium compression, high cr                                      [13]
LZ4           Fast compression, low cr                                         [15]
FastLZ lv1    Fast compression, low cr, suitable for small messages.           [19]
FastLZ lv2    Slightly slower than lv1, higher cr than lv1.                    [19]
Snappy        Very fast compression, medium cr, suitable for text messages.    [20]
QLZ           Very fast compression, medium cr, suitable for small messages.   [21]

To measure the performance of each compression algorithm we define three key properties: compression time (Definition 8.4.2), decompression time (Definition 8.4.2) and compression ratio (Definition 8.4.2). A compression ratio rc ≤ 1 indicates that no compression is achieved or, even worse, that the resulting message is larger than the uncompressed one. A compression ratio rc > 1 means that the compressed message is smaller than the original one.

Definition The compression time, tc = s/tcr, is defined as the time to compress a particular message of size s. The compression rate, tcr, is the speed of compression [B/s].

Definition The decompression time, td = s/tdr, is defined as the time to decompress a particular message of size s. The decompression rate, tdr, is the speed of decompression [B/s].

Definition The compression ratio of a particular algorithm is defined as rc = su/sc, where su is the size of the uncompressed message and sc is the corresponding size after compression.

Adding new compression algorithms is simple. The current implementation uses a list where the compression algorithms are transparently used. All algorithm measurements are generic and do not depend on the specific implementation.
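The three properties can be measured generically for any entry in an algorithm list, as the sketch below suggests. The registry `CODECS` and the function `measure` are hypothetical names, and the codecs shown are Python standard-library stand-ins rather than the eleven algorithms of Table 8.1.

```python
import bz2
import lzma
import time
import zlib

# name -> (compress, decompress); adding an algorithm is one new entry.
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def measure(name, message):
    """Return (t_c, t_d, r_c): times in seconds and the ratio s_u / s_c."""
    compress, decompress = CODECS[name]
    start = time.perf_counter()
    compressed = compress(message)
    t_c = time.perf_counter() - start
    start = time.perf_counter()
    restored = decompress(compressed)
    t_d = time.perf_counter() - start
    if restored != message:
        raise ValueError("codec is not lossless for this message")
    return t_c, t_d, len(message) / len(compressed)
```

A highly regular payload gives rc well above 1, whereas a very short message typically gives rc below 1 because the container overhead exceeds the savings.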

8.4.3 Putting it All Together

The message stream is divided into communication rounds, see Figure 8.2. As described in Definition 8.4.3, each round consists of two phases, evaluation and transmission. Transmitting more messages each round reduces the relative cost of the evaluation phase but also reduces the ability to detect a message content change. In our target system we have empirically set a round to 1000 messages.

Definition A communication round is started by an evaluation phase followedby a transmission phase, and it is delimited by a fixed number of messages.

The first part of a round is the evaluation of statistical data retrieved from previous messaging rounds, see Sections 8.3.2-8.3.3. The evaluation procedure uses historical data to predict if message compression should be used for subsequent messages, as defined by Equation 8.2. The most suitable compression algorithm is chosen depending on its ability to reduce the message transmission time. During the second part, see Figure 8.2, messages will be compressed using the algorithm distribution decided by the evaluation phase.

Estimating the Compression Metrics

The time to transfer a message between two interconnected nodes is a central concept in this paper and is denoted transmission time, see Definition 8.2. We have split this procedure into several parts that are individually measurable in a run-time environment. The first part is the sending time, see Definition 8.4.3. It is the time it takes to prepare the message and present it to the driver that will perform the actual link transmission.


Definition Sending time is the time spent inside the snd msg() function call. The time to send a message is defined as ts = s/tsr, where s is the size of the message in bytes and tsr is the send rate [B/s].

The second part is the Round Trip Time (RTT), see Definition 8.4.3.

Definition The round-trip time (RTT), trtt = s/trttr, is defined as the time it takes for a message of size s to travel from node A to node B and back to A. The amount of data per time unit sent back and forth between two communication partners, the RTT rate, is denoted trttr [B/s].

Estimating the Transmission Time

With Equations 8.1–8.3 in Section 8.2 we can describe the time that passes from when the sending application calls the snd msg() function until the receiver has obtained the decompressed message and can operate on it. For our system we assume that the link time (tl) is roughly half the RTT, as defined by Equation 8.5. We assume that the transfer takes equally long in both directions between two nodes, which gives an estimate for our proposed algorithm.

$$t_l = \frac{t_{rtt}}{2} \tag{8.5}$$

Combining Equation 8.2 with Equation 8.5 results in Equation 8.6.

$$t_t = t_c + t_s + \frac{t_{rtt}}{2} + t_d \tag{8.6}$$
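Equation 8.6 translates directly into a small helper. The comparison function below is an assumption about how the estimate might be used; the exact decision criterion is given by Equations 8.1–8.3 in Section 8.2 and is not reproduced here.

```python
def transmission_time(t_c, t_s, t_rtt, t_d):
    """Equation 8.6: t_t = t_c + t_s + t_rtt / 2 + t_d (all in the same unit)."""
    return t_c + t_s + t_rtt / 2 + t_d


def compression_pays_off(compressed, uncompressed):
    """Compare two (t_c, t_s, t_rtt, t_d) tuples of measured averages."""
    return transmission_time(*compressed) < transmission_time(*uncompressed)
```

For example, with tc = 0.25, ts = 0.5, trtt = 1.0 and td = 0.25 (arbitrary units), the estimated transmission time is 0.25 + 0.5 + 0.5 + 0.25 = 1.5.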

Initial Configuration

The compression algorithm evaluation procedure needs statistical data to be able to predict the need for message compression and calculate the distribution over available compression algorithms. No predefined statistical data is present at the start of the first communication round (in the system lifetime). Initial data is obtained by using all compression algorithms in a Round-Robin (R-R) manner. We have empirically chosen an R-R period of 10 seconds for our target system since it gives ample measurements to use in the first iteration of the selection process. Initial algorithm selection jitter is also filtered out by running R-R during the first part of communication. In our target system the message content changes, but not very frequently, which makes it possible to use the first set of messages as an indication of the near-future content of the message stream. Later message stream changes will naturally affect the compression algorithm selection if needed. Each algorithm will have its own set of measurements for compression time (tc), send time (ts) and decompression time (td). The round trip time (trtt) is a network-dependent parameter stored on a per-network basis.
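The warm-up rotation can be illustrated as below. Note the simplification: the sketch delimits the warm-up by a message count, whereas the target system uses a 10-second period; the function name is hypothetical.

```python
import itertools


def warmup_schedule(names, n_messages):
    """Assign algorithms to the first n_messages in strict rotation."""
    rotation = itertools.cycle(names)
    return [next(rotation) for _ in range(n_messages)]
```

Every algorithm is exercised at roughly the same rate, so each one has comparable statistics when the first evaluation phase runs.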

8.4.4 Real-World Compression Throttling

Compression throttling, see Section 8.3.5, depends on the ability to accurately measure the current CPU usage of the system. We assume that people interested in utilizing this algorithm have such metrics available. The CPU load is used as input to the automatic throttling functionality when determining actions for current and future messaging strategies. In particular, we are interested in the current CPU-load and communication properties since these are used to determine the compression technique. If the CPU is already saturated with other computational tasks, we will avoid burdening it further with compression to allow the highest possible throughput while preserving the availability of other services.

Measuring the CPU-load

The CPU load is measured on a per-core basis, see Definition 8.4.4, and is used by the feedback control loop determining how much time should be spent on compression. Normally the target CPU load should be set to a value that allows other services to run in the desired way. Setting the target CPU load too high may cause the system to overload since all messages will be compressed, and in the worst case some process may starve out vital functionality on the system.

Definition The CPU-load, L, is defined as the number of processes, ready to execute, in the run-queue of the operating system.

We use the 1-minute running average of the CPU-load to avoid jitter. This greatly reduces oscillation problems in the feedback controller, where intermittent background service usage may cause spikes in the CPU-load. The load values are retrieved from the Linux system by reading /proc/stat. To measure other parameters such as round-trip time, RTT, we issue periodic probe messages to other nodes. These messages are used to determine how saturated the network currently is.
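How the running average is formed from /proc/stat is not detailed, so the following is only one plausible sketch. It assumes the system-wide `procs_running` counter of /proc/stat is sampled periodically (for instance once per second with a 60-sample window); a per-core breakdown is not shown, and the names `parse_procs_running` and `RunningAverage` are hypothetical.

```python
from collections import deque


def parse_procs_running(stat_text):
    """Extract the procs_running counter from /proc/stat-formatted text."""
    for line in stat_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "procs_running":
            return int(fields[1])
    raise ValueError("procs_running not found")


class RunningAverage:
    """Average over the most recent `window` samples."""

    def __init__(self, window):
        self.samples = deque(maxlen=window)

    def add(self, value):
        self.samples.append(value)
        return sum(self.samples) / len(self.samples)
```

On a Linux system the input text would come from reading the file /proc/stat; the parser takes a string so that it can be exercised without that file.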


8.5 Results

We have answered the research questions, see Section 8.1, by performing experiments on a test system that emulates a large telecommunication system. As experiment data, we use intercepted control traffic from a production environment, gathered by Wireshark [11]. The production node was fully operational, handling control plane traffic and other routine maintenance tasks, while we were capturing the experimental data. The test data excludes data plane traffic such as call or video data. Our assumption is that we will provide a realistic evaluation of our suggested techniques by using production traffic data. Even though we have tested our implementation on a particular type of system, the proposed mechanism is adaptable to any other kind of system. Each test has followed the process defined in Section 8.3.1, including the initial 10-second round-robin execution of all algorithms.

8.5.1 Automatic Compression

Q1 Is it possible to increase the overall message processing performance by using message compression?

In this experiment, we show that our online mechanism automatically finds the most efficient compression algorithm from a set of available algorithms, depending on the content of the message stream. We have run three different test setups within this experiment. All tests send an equal amount of data but with different content and different settings for the automatic compression algorithm selector.

The first test is a reference measurement using production data, see Figure 8.4a. We have temporarily suspended message compression by disabling the selection mechanism. This scenario shows the original production system behavior when not using any message compression.

In the second test, we use a synthetic data set containing only zeros, see Figure 8.4b, with the automatic compression selection mechanism enabled. Snappy [20] is a fast and efficient compression algorithm for simpler textual messages. It is selected as the best compression algorithm for this synthetic data set because it outperforms all other algorithms. Network disturbances at 7k–10k messages in Figure 8.4b trigger an additional compression algorithm selection. When the network has returned to a stable state, Snappy is once more selected, finally resulting in a 14.0% reduction in RTT compared to the first test. This experiment shows that different compression techniques can be chosen depending on the transmitted message data.

[Figure 8.4 plot: panels a) Uncompressed, b) Compressed zero-pattern (annotated "Network disturbances cause an additional algorithm selection"), c) Compressed wireshark messages. x-axis: message number [x1000]; left y-axis: RTT [ms]; right y-axis: algorithm selection ratio [0-1]. Legend: Avg. RTT, Current RTT, NO-COMPR, LZFX, LZO, LZO-SAFE, LZMA, LZW, BZ2, LZ4, FASTLZ1, FASTLZ2, SNAPPY, QLZ.]

Figure 8.4: Three different message stream experiments; a) reference experiment using uncompressed messages; b) compressed messages with zero-pattern data; c) compressed messages with production system message data.

The third and final test, shown in Figure 8.4c, uses production data together with the automatic compression selection mechanism. In this experiment, the selection mechanism chooses QLZ [21] as the most efficient compression algorithm with an 8.9% reduction in message RTT over uncompressed messaging. We have decided to connect the test nodes to a busy network to provide a more dynamic environment. External network congestion causes the RTT transient at 20k messages.

[Figure 8.5 plot: x-axis: message number [x1000]; y-axis: average RTT [ms] (lower is better). Annotations: "Overhead for the adaptive mechanism", "Performance gain over uncompressed". Legend: Avg. RTT, QLZ (Manual), Uncompressed, Adaptive improvement, Adaptive overhead.]

Figure 8.5: Cumulative average round trip time (RTT) over multiple communication rounds.

8.5.2 Algorithm Selection Methods

Q2 Can an automatic method provide the best message processing performance by selecting the best compression algorithm from a set of algorithms?

As described in Section 8.5.1, the automatic selection mechanism selects the best compression algorithm depending on the message content and the network environment. We have designed a test to demonstrate the algorithm selection methods by replaying data from a production node. Figure 8.5 shows how RTT converges towards a specific value over multiple communication rounds. QLZ [21] is automatically selected as the most efficient compression algorithm for this test setup because it provides the best balance between compression ratio and compression rate. The automatic mechanism is inferior to uncompressed messaging during the first part of the test, between 0–6k messages. At 6k messages, the QLZ performance surpasses uncompressed messaging, resulting in a 9.6% RTT reduction after a total of 50k messages.

As a comparison, a manual offline selection of QLZ as the compression algorithm results in a 14.0% reduction in RTT, see Table 8.2. The current cost of automating compression algorithm selection is thus 4.4 percentage points (14.0% − 9.6%), which can be attributed to the intermittent use of non-optimal compression algorithms needed to detect changes in the message stream. It is possible to reduce the selection cost by increasing the usage of the best compression algorithm while reducing the usage of other algorithms. Changing the ratio between the selected and other algorithms may reduce the ability to detect changes in the message stream. In our setup, we have empirically decided to assign 1% of the compression quota to each of the non-optimal algorithms and the remaining quota to the best algorithm. The ratio between the best algorithm and the rest is easily configurable and dependent on the desired system behavior.

Table 8.2: Relative improvements for different algorithm selection strategies.

Selection Mechanism                                        RTT [ms]   Relative Time Reduction
Uncompressed (Reference)                                   1.57       0.0
Round Robin                                                1.45       -7.6%
Automatic                                                  1.42       -9.6%
Manually selecting QLZ as the best compression algorithm   1.35       -14.0%
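The percentages in Table 8.2 follow from the listed RTT values, which can be checked with a few lines of Python. The function name `relative_reduction` is hypothetical; the RTT figures are taken from the table.

```python
def relative_reduction(reference_ms, measured_ms):
    """Relative RTT reduction in percent versus the reference."""
    return 100.0 * (reference_ms - measured_ms) / reference_ms


REFERENCE_MS = 1.57  # uncompressed RTT from Table 8.2
rtt_ms = {"Round Robin": 1.45, "Automatic": 1.42, "Manual QLZ": 1.35}
reductions = {k: round(relative_reduction(REFERENCE_MS, v), 1) for k, v in rtt_ms.items()}
automation_cost = reductions["Manual QLZ"] - reductions["Automatic"]
```

The difference between the manual and the automatic strategy, `automation_cost`, corresponds to the 4.4 percentage points discussed in the text.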

8.5.3 Automatic Algorithm Selection for Changing Message Streams

Q3 Can an automatic method continuously ensure the best message processing performance when message content changes?

In this experiment, we show that our automatic mechanism can select different compression algorithms when a message stream changes. The mechanism initially selects Snappy [20] for text messages and later selects QLZ [21] for production node data. Figure 8.6 shows the measurements when changing the message stream content after 10k messages. From 0–10k messages, the data set contains a zero pattern. At 10k messages, the data set is switched to production node data. The first and uppermost graph, Figure 8.6a, shows the RTT, which at 5k messages is lower than for uncompressed messaging. The second graph, Figure 8.6b, shows the cumulative average compression ratio for all evaluated compression algorithms. Snappy is chosen for the first data set and provides a low RTT for the system. At 10k messages, the message stream switches to production data, for which the QLZ compression algorithm performs better; QLZ is subsequently selected at 15k messages. The third graph, Figure 8.6c, shows the suitability of each compression algorithm. The fourth graph, Figure 8.6d, shows the cumulative number of messages compressed by each algorithm. Before 15k messages, the Snappy algorithm is used for most messages. After 15k messages, there is a selection period during which the best algorithm switches between several candidates. From 30k messages, the QLZ algorithm is used as the sole compression algorithm. The final graph, Figure 8.6e, illustrates the relative compression time for each algorithm.

[Figure 8.6 plot: x-axis: message number [x1000]. Panels: a) Current and cumul. avg. RTT [ms] (lower is better), with uncompressed messages at average 1.65; b) Cumul. avg. compr. ratio per algorithm (higher is better); c) Algorithm suitability (lower is better); d) Cumul. nr. msg. compressed by each algorithm [x1000] (higher is better); e) Rel. compr. time for each algorithm (lower is better). Legend: Avg. RTT, Current RTT, NO-COMPR, LZFX, LZO, LZO-SAFE, LZMA, LZW, BZ2, LZ4, FASTLZ1, FASTLZ2, SNAPPY, QLZ.]

Figure 8.6: Message stream change at 10k messages triggers a re-selection (Snappy→QLZ).

[Figure 8.7 plot: x-axis: message number [x1000]; left y-axis: average RTT [ms]; right y-axis: algorithm selection ratio [%] and CPU-load [%]. Intervals: 1 Startup, 2 Overload, 3 Load restored to normal. Legend: Average RTT, QLZ [%], NO COMPR [%], CPU-load.]

Figure 8.7: Overload handling by reducing compression time budget.

8.5.4 Overload Handling

Q4 Is it possible to limit CPU resources used for message compression so that it does not seriously affect other services co-located on the same CPU?

In this experiment, we show that our feedback control algorithm can handle overload situations. We have divided the overload experiment result into three intervals, shown in Figure 8.7. In the initial interval, between 0 and 35k messages, QLZ is selected as the most appropriate compression algorithm.

The second interval, starting at 35k messages, shows a CPU overload situation triggered by manually starting Cpuburn [24]. Cpuburn is designed to stress a system by creating massive CPU load. The feedback control algorithm detects the overload situation and reduces the time quota assigned for compression, which frees CPU resources for other applications executing on the shared resource. Due to this compression time-quota reduction, our mechanism compresses fewer messages during the second interval.


The third interval starts at 50k messages, when the Cpuburn application is terminated and the overload situation ends. The usage of the QLZ compression algorithm increases at 60k messages, after the initial CPU load level has been gradually restored.
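The throttling behavior seen in the three intervals can be illustrated with a minimal sketch of a PID-controlled compression time quota. This is not the implementation evaluated in this chapter: the class name, the gains, the quota unit (milliseconds per control period), and the 70% target load are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class PidController:
    """Textbook discrete PID controller; the gains are illustrative assumptions."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: float = 0.0

    def update(self, setpoint: float, measured: float, dt: float = 1.0) -> float:
        """Return the control output for one sampling period of length dt."""
        error = setpoint - measured
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def adjust_quota(quota_ms, pid, target_load, measured_load, max_quota_ms=10.0):
    """Return the new compression time quota, clamped to [0, max_quota_ms]."""
    correction = pid.update(target_load, measured_load)
    return min(max(quota_ms + correction, 0.0), max_quota_ms)


if __name__ == "__main__":
    pid = PidController(kp=0.05, ki=0.01, kd=0.0)
    quota = 5.0
    # Hypothetical CPU load samples: normal, overload, overload, normal.
    for load in (60.0, 95.0, 95.0, 60.0):
        quota = adjust_quota(quota, pid, target_load=70.0, measured_load=load)
        print(f"CPU load {load:.0f}% -> compression quota {quota:.2f} ms")
```

When the measured load exceeds the target, the correction is negative and the quota shrinks, so fewer messages are compressed; when the load drops again, the quota grows back toward its cap.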

8.6 Related Work

We have split the related work into three subsections. The first details message compression from a general perspective. The second introduces different compression algorithm selection methods, more specifically, how to find the algorithm that provides the best messaging performance. In the third and last part, we summarize the state of the art related to overload handling.

Message Compression

Several earlier publications and implementations propose that compression can improve the overall communication performance. Wiseman et al. [25] investigate lossless compression in communication systems. They benchmark each compression algorithm using offline data and use these measurements to automate algorithm selection. Gutwin et al. [6] describe a transparent and efficient way of compressing groupware messages. Their method is convenient and easy to use for framework users since it supports both text and serialized objects.

Nicolae [26] applies compression to cloud computing and investigates its effect on cloud storage. The implementation has been tested on the Grid5000 research network, and a significant network traffic reduction is reported when using LZO and BZIP2. The trend in recent CPUs is to include hardware support for compression and decompression [16]. Hardware support offloads the CPU and can improve the overall system messaging performance. The selection mechanism presented in this paper will treat a hardware-supported compression algorithm the same way as a software algorithm.

The compression techniques described above adopt a semi-passive behavior by selecting the compression algorithm at configuration time. If we have prior knowledge of the system behavior, a static selection of compression algorithms can be acceptable or even desirable since it provides determinism to the system. For other types of systems, such as our dynamic telecommunication system, the communication streams change over time, and their content is difficult to predict in advance. Providing a way to support dynamic message content is one of the major reasons for implementing our online compression algorithm selection methods.

Compression Algorithm Selection Methods

Jeannot, Knutsson, and Björkman have written a series of papers [5, 27, 28] that describe an implementation of adaptive message compression. They have changed the Linux communication mechanism to compress and decompress messages as a function of available memory and communication capacity. Their implementation uses predefined compression levels, whereas our approach continuously evaluates all algorithms. Jeannot, Knutsson, and Björkman [28] re-implement the previous techniques in user space to improve portability. This method is similar to ours in that a user-space API hides message compression.

In a later paper, Jeannot [29] describes an official library that supports automatic message compression. The AdOC library is freely available at the official web page [30]. There are some differences compared to our work. The first is that AdOC uses POSIX standard calls, while we have adapted a legacy-compliant asynchronous messaging system. The AdOC implementation uses multiple threads to compress and communicate data. Our implementation executes the compression in single-threaded user mode to isolate the functionality to a single execution context. Isolation is essential to reduce inter-thread communication in the current system implementation. Furthermore, AdOC uses large (200 kB) buffers, whereas our messages are usually in the thousands of bytes. Our assumption is that coalescing multiple messages into larger chunks would increase the message round-trip time, and our target system would suffer a performance impact if the round-trip time increased.

Jeannot's implementation monitors the send-queue length. A growing queue indicates that the network is saturated, which means that the message stream needs a higher compression ratio. AdOC predefines a set of compression algorithms, each of which corresponds to a given compression level. Depending on the desired compression level, AdOC selects one algorithm out of the set of available ones. Our method, on the other hand, continuously assesses all compression algorithms, which makes it more flexible. In our solution, there is no need for offline algorithm evaluation to determine the suitability of each algorithm for different message streams.

Sucu and Krintz [31] have created a communication environment called Adaptive Compression Environment (ACE). It aims to change the behavior of socket communication by compressing particular types of messages. ACE will only compress messages larger than 32 kB. This approach differs from ours, where all messages are transparently evaluated to detect whether compression can improve messaging performance. ACE uses one compression algorithm, Zlib, compared to our method where we have implemented eleven algorithms. In their later paper, Krintz and Sucu [32] implement additional compression algorithms such as Bzip2, Zlib, and LZO. The main difference compared to our implementation is that their algorithm selection depends on offline measurements using training files. Each compression algorithm is profiled to measure its performance. The adaptive online algorithm then uses this profile information when selecting the most suitable compression algorithm. Offline operations require the system engineer to have prior knowledge of the message stream content. Our implementation does not require any manual evaluation since it continuously evaluates the compression performance.

Pu and Singaravelu [33] describe the trade-off between available bandwidth and the required computational capacity when compressing messages. They present a thorough investigation of simple schemes such as “compress all messages” or “compress none”. Gray et al. [34] point out that it is hard to decide whether to compress messages or not. They expand the earlier work by Pu and Singaravelu [33] by including mixed sets of compressed and uncompressed messages. We employ this technique in our messaging subsystem by sending both uncompressed messages and messages compressed with different algorithms.

Brunet et al. [35] describe a technique to auto-tune compression parameters depending on the application hardware. Hardware profiling is performed once for each platform, and the resulting profile is an indicator of successful network communication parameters. In our algorithm selection process, we continuously monitor the performance of each compression algorithm, in contrast to the offline evaluation proposed by Brunet et al. [35].

Biederman [36] holds a patent describing a general idea of receiving, compressing, and sending messages. The method is similar to ours but static. We have implemented a PID feedback controller to manage the CPU quota allocated for message compression. The feedback controller adjusts the CPU load to ensure that other services can coexist on the same CPU. Biederman uses several predefined compression levels that depend on the message content. This contrasts with our solution, which simultaneously evaluates several compression algorithms and allows the best algorithm to dominate.

In general, our approach is more flexible than the publications described above. We have provided a messaging subsystem with the primary goal of being automatically adaptable to a changing message stream and system behavior.


Overload Handling

Our target system needs an overload mechanism because many co-located services share the same hardware. Message compression is computationally heavy and can easily starve other processes. We have managed this problem by automatically constraining the amount of computational capacity available for message compression. There are other ways to control the CPU quota assigned to message compression. Jeannot, Knutsson, and Björkman [5, 27, 28] have implemented AdOC, which uses the send-queue length as an indication of the desired compression level. AdOC deduces that a higher compression ratio is required when the send queue fills up. Message compression takes longer if the processor is busy with higher-priority tasks, resulting in a small number of messages in the send queue. A low message count in the send queue triggers a compression-level reduction. Our implementation does not require the type of low-level kernel modifications needed by AdOC. Avoiding modifications to a third-party (3PP) kernel is a tremendous benefit with respect to kernel upgrades and support agreements. Our design is also much more fine-grained since it is possible to specify the exact maximum CPU usage. Some industrial systems have a strict CPU usage cap, and strict overload handling helps such systems stay within their predefined CPU usage limit.

8.7 Future Work

In this section, we list the most important issues that should be investigated further.

Compression Techniques

It is easy to add more algorithms and thereby provide a larger working set of compression algorithms to choose from. In this paper, we have integrated a set of eleven compression algorithms as a starting point. This is sufficient to test our implementation and gives a large increase in message throughput. In recent CPUs, Intel provides hardware support for LZO compression [16]. Our automatic selection mechanism already supports adding hardware-accelerated compression algorithms, and their effect could be investigated. Similar changes to the outcome should occur when running on networks with far greater bandwidth. In the implementation described in this paper, we perform several simple memcpy() calls when compressing and decompressing messages. The side effect is lower performance, and a zero-copy approach would probably increase the performance manyfold. Testing a hardware implementation would be interesting to see how it competes with the software implementations. This area is increasingly supported by CPU manufacturers and is becoming part of common off-the-shelf components.
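The difference between copying and zero-copy buffer handling can be sketched in Python, where slicing a bytes-like object copies the data while slicing a memoryview does not. The 4-byte header length and the function names are hypothetical, and the system described here is not written in Python, so the sketch only illustrates the principle.

```python
HEADER_LEN = 4  # hypothetical fixed header size, for illustration only


def payload_copy(frame):
    """Slicing bytes or bytearray allocates a new object and copies the payload."""
    return frame[HEADER_LEN:]


def payload_view(frame) -> memoryview:
    """Slicing a memoryview references the original buffer without copying."""
    return memoryview(frame)[HEADER_LEN:]


if __name__ == "__main__":
    frame = bytearray(b"HDR0" + b"x" * 16)
    copied = payload_copy(frame)
    view = payload_view(frame)
    frame[HEADER_LEN] = ord("y")  # mutate the underlying buffer
    # The copy keeps the old byte; the view observes the change.
    print(bytes(copied[:4]), bytes(view[:4]))
```

The view shares storage with the original frame, so handing it to a compressor avoids one buffer copy per message, at the price of having to keep the frame alive and unmodified while the view is in use.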

Automatic Compression

We would like to experiment with other types of compression handling mechanisms. For example, when the compression time budget is empty, it may be more efficient to reduce the compression level for all messages in a round rather than sending a few messages with high compression and the rest without. We would also like to use more metrics in the system utility calculation, for example, the receiver CPU load and the presence of multicore CPUs. Do such CPUs need special treatment, or is the current technique transparent so that it works out of the box?

Message and Data Compression

Several uncompressed messages can be coalesced into one larger compressed message to reduce much of the overhead related to packet marshalling, memory allocation, etc. This should provide a significant performance improvement over the current implementation but requires more complex message receivers. Finally, it would be a challenge to find other techniques that automatically select appropriate compression techniques for dynamically changing message stream contents, especially techniques with a lower overhead that still remove the need for offline algorithm selection.
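The coalescing idea can be sketched as follows. This is a minimal illustration under stated assumptions, not part of the implementation: zlib from the Python standard library stands in for the algorithms used in this paper, and the 4-byte length-prefix framing and the function names are hypothetical.

```python
import struct
import zlib


def coalesce(messages):
    """Length-prefix each message, concatenate them, and compress once."""
    payload = b"".join(struct.pack("!I", len(m)) + m for m in messages)
    return zlib.compress(payload)


def split(blob):
    """Receiver side: decompress once, then recover the individual messages."""
    payload = zlib.decompress(blob)
    messages, offset = [], 0
    while offset < len(payload):
        (length,) = struct.unpack_from("!I", payload, offset)
        offset += 4
        messages.append(payload[offset:offset + length])
        offset += length
    return messages


if __name__ == "__main__":
    msgs = [b"status=OK;node=%d;" % i for i in range(50)]
    one_blob = len(coalesce(msgs))
    separate = sum(len(zlib.compress(m)) for m in msgs)
    print(f"coalesced: {one_blob} bytes, compressed separately: {separate} bytes")
```

The `split` function shows the extra work placed on the receiver, which is the added complexity mentioned above. The sketch does not model the latency cost of waiting for enough messages to fill a batch.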

Temporal Locality

In the current implementation, little regard is taken to the temporal locality of statistical data. Old message compression statistics weigh the same as recent data, which favors stable behavior. Recent data should be given higher importance than old data to decrease the adaptation time when new communication circumstances occur.
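One common way to give recent samples more weight is an exponentially weighted moving average (EWMA). The sketch below compares it with a cumulative average; EWMA is an illustrative choice rather than a technique used in the current implementation, and the smoothing factor and the RTT samples are made up.

```python
def cumulative_average(samples):
    """Every sample weighs the same, regardless of age."""
    return sum(samples) / len(samples)


def ewma(samples, alpha=0.2):
    """Exponentially weighted moving average; a higher alpha favors recent data."""
    estimate = samples[0]
    for sample in samples[1:]:
        estimate = alpha * sample + (1.0 - alpha) * estimate
    return estimate


if __name__ == "__main__":
    # Hypothetical RTT samples [ms]: the stream changes after 50 messages.
    rtt = [1.0] * 50 + [2.0] * 10
    print(f"cumulative average: {cumulative_average(rtt):.3f} ms")
    print(f"EWMA (alpha=0.2):   {ewma(rtt):.3f} ms")
```

After the change, the EWMA moves most of the way to the new level within ten samples, while the cumulative average is still dominated by the fifty old samples. A smaller alpha restores some of the stability that the current equal weighting provides.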

8.8 Conclusions

We have shown that it is possible to increase the message processing performance of large-scale industrial systems [4] by selectively and automatically compressing messages. We have also shown that message compression can coexist with other services without introducing starvation problems. We have integrated eleven algorithms and compression configurations: LZFX [17], LZO [14], LZO-SAFE, LZMA [12], LZW [18], BZ2 [13], LZ4 [15], FastLZ level 1 and 2 [19], Snappy [20], and QLZ [21]. The system automatically chooses the compression algorithm that provides the lowest message round-trip time (RTT). The RTT is affected by factors such as compression rate, compression ratio, and link speed. There are also additional external factors that influence the suitability of each algorithm, such as the CPU processing power and the network congestion level. Furthermore, we have implemented a mechanism that continuously evaluates the algorithm suitability. The mechanism detects when the content of a message stream varies by monitoring the algorithm suitability. By continuously evaluating all compression algorithms, we provide a robust and fully automatic algorithm selection suitable for large-scale deployment. The automation is particularly suitable for environments where it is hard to manually decide the optimal compression algorithm.
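How compression rate, compression ratio, and link speed interact can be illustrated with a first-order timing model. This is a simplification for illustration only: the mechanism described in this paper selects algorithms from measured round-trip times rather than from a model, and the profile names and numbers below are hypothetical.

```python
import math


def estimated_time(size, ratio, comp_rate, decomp_rate, link_rate):
    """Seconds to compress, transmit the compressed bytes, and decompress.

    size in bytes; rates in bytes per second; ratio = original / compressed.
    """
    return size / comp_rate + (size / ratio) / link_rate + size / decomp_rate


def pick_algorithm(size, profiles, link_rate):
    """Return the profile name with the lowest estimated transfer time."""
    return min(profiles,
               key=lambda name: estimated_time(size, *profiles[name], link_rate))


# Hypothetical profiles: (ratio, compression rate, decompression rate).
PROFILES = {
    "none": (1.0, math.inf, math.inf),  # no compression costs no CPU time
    "fast": (2.0, 200e6, 500e6),
    "strong": (4.0, 10e6, 50e6),
}

if __name__ == "__main__":
    for link_rate in (1e6, 20e6, 1e9):
        choice = pick_algorithm(10_000, PROFILES, link_rate)
        print(f"link {link_rate:.0e} B/s -> {choice}")
```

With these made-up numbers, a slow link favors the algorithm with the highest ratio, a very fast link favors no compression at all, and intermediate links favor the fast algorithm, which is why a static choice cannot be optimal when the link conditions change.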

We have also implemented an overload protection to ensure that excessive compression does not starve other services co-located on the same CPU as the communication application. The compression throttling is implemented using a Proportional-Integral-Derivative (PID) controller that monitors the CPU usage. When the communication application starts, all messages are sent without compression. The CPU quota assigned for message compression is increased until the desired maximum load is reached. Simultaneously, our communication mechanism continuously evaluates all compression algorithms to find the one producing the shortest message round-trip time.

We have implemented the automatic message compression method on a large-scale industrial telecommunication platform [8] with a major market share [9]. We have tested our implementation with production data gathered at customer sites and replayed in a lab. We show that the automatic compression mechanism yields a 9.6% reduction in RTT when using production data.

References

[1] J. Gustafson, Reevaluating Amdahl’s law, Communications of the ACM 31 (5) (1988) 532–533.

[2] M. Hill, M. Marty, Amdahl’s law in the multicore era, Computer 41 (7) (2008) 33–38.

[3] J. Nielsen, Nielsen’s law of internet bandwidth (online), http://www.nngroup.com/articles/law-of-bandwidth/ (1998).

[4] D. Hallmans, M. Jägemar, S. Larsson, T. Nolte, Identifying Evolution Problems for Large Long Term Industrial Evolution Systems, in: Proceedings of IEEE International Workshop on Industrial Experience in Embedded Systems Design (COMPSAC14), Västerås, 2014.

[5] B. Knutsson, Increasing Communication Performance via Adaptive Compression, in: Proceedings of the Seventh Swedish Workshop on Computer Systems Architecture, Gothenburg, Sweden, 1998.

[6] C. Gutwin, C. Fedak, M. Watson, J. Dyck, T. Bell, Improving network efficiency in real-time groupware with general message compression, in: Proceedings of Conference on Computer Supported Cooperative Work, ACM Press, New York, USA, 2006, pp. 119–128.

[7] K. Petersen, C. Wohlin, Context in industrial software engineering research, in: International Symposium on Empirical Software Engineering and Measurement, Orlando, Florida, USA, 2009, pp. 401–404.

[8] M. Bergqvist, J. Engblom, M. Patel, L. Lundegard, Some experience from the development of a simulator for a telecom cluster (CPPemu), in: Proceedings of the International Association of Science and Technology for Development, 2006, pp. 13–21.

[9] H. Vestberg, Ericsson unveils new products, partnerships and increased market share, in: Proceedings of Mobile World Conference, 2012.

[10] H. Vestberg, Ericsson Annual Report, Tech. rep. (2013).

[11] G. Combs, Wireshark, http://www.wireshark.org/ (2014).

[12] I. Pavlov, LZMA Software Development Kit, http://www.7-zip.org/sdk.html (2013).

[13] J. Seward, BZIP2, a program and library for data compression, http://www.bzip.org (2013).

[14] M. Oberhumer, LZO (Lempel-Ziv-Oberhumer) Data Compression Library, http://www.oberhumer.com/opensource/lzo/ (2013).


[15] Y. Collet, lz4 Data Compression Library, http://fastcompression.blogspot.se/p/lz4.html (2013).

[16] Intel, LZO hardware compression, http://software.intel.com/en-us/articles/lzo-data-compression-support-in-intel-ipp (2013).

[17] A. Collette, LZFX Data Compression Library, http://code.google.com/p/lzfx/ (2013).

[18] T. A. Welch, A Technique for High-Performance Data Compression, Computer 17 (6) (1984) 8–19.

[19] A. Hidayat, FastLZ, http://fastlz.org/ (2014).

[20] Google, Snappy Compression Library, https://code.google.com/p/snappy (2013).

[21] L. M. Reinhold, QuickLZ - Fast compression library for C, C# and Java, http://www.quicklz.com/ (2011).

[22] S. Karlsson, E. Hansson, Lossless Message Compression, Tech. rep. (2013).

[23] M. Ringwelski, C. Renner, A. Reinhardt, A. Weigel, V. Turau, The hitchhiker’s guide to choosing the compression algorithm for your smart meter data, in: 2nd IEEE ENERGYCON Conference & Exhibition, 2012, pp. 935–940.

[24] M. Mienik, CPU burnin, http://cpuburnin.com

[25] Y. Wiseman, K. Schwan, P. Widener, Efficient end to end data exchange using configurable compression, ACM SIGOPS Operating Systems Review (2005) 4–23.

[26] B. Nicolae, On the benefits of transparent compression for cost-effective cloud data storage, in: Proceedings of Transactions on Large Scale Data and Knowledge Centered Systems, Vol. 3, 2011, pp. 167–184.

[27] B. Knutsson, M. Björkman, Adaptive end-to-end compression for variable-bandwidth communication, Computer Networks 31 (7) (1999) 767–779.









[28] E. Jeannot, B. Knutsson, M. Björkman, Adaptive online data compression, in: IEEE High Performance Distributed Computing, 2002.

[29] E. Jeannot, Improving Middleware Performance with AdOC: an Adaptive Online Compression Library for Data Transfer, in: Proceedings of International Parallel and Distributed Processing Symposium, 2005, p. 70.

[30] E. Jeannot, ADOC homepage, http://www.labri.fr/perso/ejeannot/adoc/adoc.html (2012).

[31] S. Sucu, C. Krintz, ACE: A resource-aware adaptive compression environment, in: Proceedings of International Conference of Information Technology: Coding and Computing, 2003, pp. 183–188.

[32] C. Krintz, S. Sucu, Adaptive on-the-fly compression, IEEE Transactions on Parallel and Distributed Systems 17 (1) (2006) 15–24.

[33] C. Pu, L. Singaravelu, Fine-Grain Adaptive Compression in Dynamically Variable Networks, in: Proceedings of the International Conference on Distributed Computing Systems, 2005, pp. 685–694.

[34] M. Gray, P. Peterson, P. Reiher, Scaling Down Off-The-Shelf Data Compression: Backwards-Compatible Fine-Grain Mixing, in: Proceedings of Distributed Computing Systems, 2012, pp. 112–121.

[35] E. Brunet, F. Trahay, A. Denis, R. Namyst, A sampling-based approach for communication libraries auto-tuning, in: Proceedings of International Conference on Cluster Computing, 2011, pp. 299–307.

[36] D. Biederman, Communication system with content-based data compression, US Patent 7069342 (2001).



Nu ar det slut. *

— Fem myror ar fler an fyra elefanter

*This quote is taken from the concluding scene of a TV show, famous to all Swedish children born during the 70’s. A pink elephant exclaims “This is the end” and then skedaddles away while trumpeting with its trunk.
