COSTAC 2015

Proceedings of the 1st Workshop on COmputing Science and Technology for smArt Cities

ETSISI, Madrid, May 13, 2015


Preface

The 1st Workshop on COmputing Science and Technology for smArt Cities (COSTAC) is an initiative of the PhD Program on COmputing Science and Technology for smArt Cities of the Technical University of Madrid (UPM), which provides the PhD students of the program with a means to present and discuss the most recent and significant findings and experiences of their PhD theses.

The workshop consists of 10 research works that are included in these proceedings. These proceedings are CONFIDENTIAL and intended only for internal distribution and use within the PhD Program. Any redistribution or partial publication of these proceedings is prohibited and requires the permission of the authors.

  Madrid, May 2015                    Sergio Arévalo
                                      Jennifer Pérez

       


Table of Contents  

“Towards a supported process for reengineering software architectures” .......... 5
Daniel A. Guamán, Jessica Díaz, Jennifer Pérez

“A logic-algebraic approach to decision taking in a railway interlocking system” .......... 11
Antonio Hernando, Eugenio Roanes-Lozano and Roberto Maestre-Martínez

“Non Delay Causal Forwarding Protocols for Hierarchical Communication Architectures” .......... 25
Isabel Muñoz, Sergio Arévalo

“HCE-oriented payments vs. SE-oriented payments. Security Issues” .......... 27
Rubén Nieto

“Detection of Vulnerable Users using V2x Communications” .......... 41
Maria Ines Stimolo, Marcela Porporato, Gustavo Porporato Daher

“How the LDA Method Works for Collaborative Filtering” .......... 71
Priscila Valdiviezo, Guido Riofrío

“Knowledge Description for Bodies of Knowledges” .......... 91
Pablo Quezada and Juan Garbajosa

“Detection of Vulnerable Users using V2x Communications” .......... 109
José Javier Anaya, Edgar Talavera, David Giménez, Felipe Jiménez and José Eugenio Naranjo

“Implementing Uniform Reliable Broadcast in Anonymous Distributed Systems with Fair Lossy Channels” .......... 123
Jian Tang, Mikel Larrea, Sergio Arevalo, and Ernesto Jiménez

“A survey on subspace clustering” .......... 139
Bo Zhu and Alexandru Mara

 

 

 


Towards a process for reengineering software architectures

Daniel Guamán1, Jessica Díaz2, Jennifer Pérez2

1Technical University of Loja (UTPL) Marcelino Champagnat S/N San Cayetano Alto, 1101608 – Loja, Ecuador

Department of Software Engineering and Technology Management – SDISGTI [email protected]

2Technical University of Madrid (UPM) E.T.S. de Sistemas Informáticos - CITSEM

Ctra. Valencia Km. 7, E-28031 Madrid, Spain [email protected], [email protected]

Abstract. This paper presents the outline of the research that is going to be performed for the construction of a process for reengineering software architectures. The state of the art and the research questions to be addressed are described in detail.

Keywords: software architecture, reengineering, reverse engineering, pattern.

1. Outline of the research topic

Most software systems are built on architectural principles that allow software components, subsystems and their relationships to interact in order to provide functionality. According to Clements and Northrop (1996), “Software Architecture is the abstraction of the common features inherent in the systems design, which must take into account a wide range of activities, concepts, methods, strategies and results”. The definitions of Perry and Wolf (1992) and Garlan and Shaw (1993) focus on the system, defining architecture as “the description of software systems in terms of functional, reusable and independent components that are connected among them”, which allows the abstraction of system complexity; whereas the definition of the Software Engineering Institute (2014) focuses on the process and its features, stating that Software Architecture includes the analysis and selection of structural elements and interfaces where functionality, ease of use, flexibility, performance, reuse, comprehensibility, economic and technological limitations, advantages, disadvantages and aesthetic concerns are required. However, it is important to take into account that software is continually evolving, as the evolution law states: “No matter where you are in the system life cycle, the system will change, and the desire to change it will persist throughout the life cycle” (Bersoff EH, Henderson VD, and Siegel SG 1980), or as the first evolution law of Lehman, “Continuing Change”, states: “a program that is used must be continually adapted else it becomes progressively less


satisfactory” (Lehman, 1997). Therefore, any software asset is susceptible to change throughout its life cycle, and at that moment the second law of Lehman, “Increasing Complexity”, emerges: “as a program is evolved its complexity increases unless work is done to maintain or reduce it”, and consequently the seventh law, “Declining Quality”, must also be taken into account: “programs will be perceived as of declining quality unless rigorously maintained and adapted to a changing operational environment”. So, a well-defined process is required to modify software architectures without degrading them too much, and to translate the changes into code. However, what really happens at present is that there are software systems without a well-defined architecture, or with a software architecture that is constantly being redefined depending on software code modifications, which implies a continuous erosion of the software architecture. Therefore, processes and mechanisms to repair or improve degraded software systems are required.

During the evolution of software systems for their reparation or improvement, different problems could emerge, related to deficiencies in the architectural design, a poorly written code base, implementation and feature validation errors, among others. Therefore, if the software is too deteriorated, the investment in its evolution may sometimes not pay off. In order to determine whether investing in the maintenance/evolution of a software product is economically profitable for a company, some authors use the concept of Technical Debt (Samarthyam and Sharma 2014) and (Chris et al. 2012). Technical Debt commonly focuses on how bad the code can be, and also on whether it affects the architecture and requires architectural changes that have occurred during system development or after this process, which directly affects the cost of the architectural changes. As a result, before changing a software product, it would be desirable to know whether this change would provide a ROI (Return On Investment), by analyzing the software system and using tools based on the concept of Technical Debt (Schmidt, 2012).
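As a purely illustrative aside (not part of the cited works), the following sketch shows one way such a tool-supported decision could be computed: a SQALE-style technical debt ratio comparing the remediation effort reported by a static-analysis tool against the estimated cost of developing the system, with a hypothetical threshold for recommending a reconstruction. All figures and the threshold are invented.

```python
# Minimal sketch: deciding whether an architecture reconstruction may be worth the
# investment, based on a SQALE-style technical debt ratio.
# Inputs are assumptions for illustration; real values would come from tools
# such as SonarQube, Lattix or SonarGraph.

def debt_ratio(remediation_effort_days: float, development_cost_days: float) -> float:
    """Technical debt ratio = cost to fix the issues / cost to develop the system."""
    return remediation_effort_days / development_cost_days

def reconstruction_recommended(remediation_effort_days: float,
                               development_cost_days: float,
                               threshold: float = 0.20) -> bool:
    """Hypothetical decision rule: recommend reconstruction while the debt is
    still cheaper to repay than a given fraction of rebuilding from scratch."""
    return debt_ratio(remediation_effort_days, development_cost_days) < threshold

if __name__ == "__main__":
    # Example figures (invented): 180 person-days of remediation effort
    # against an estimated 1,500 person-days of development cost.
    print(debt_ratio(180, 1500))                    # 0.12
    print(reconstruction_recommended(180, 1500))    # True
```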

Based on the aforementioned facts, this research proposal focuses on providing a process that guides architects by recommending the best options for constructing the software architecture of those software systems that have no architecture or have a degraded one, in order to improve the software architecture.

2. Goals and research questions

To define a process to reconstruct software architectures, it will be necessary to apply reengineering and reverse engineering techniques, architectural styles and patterns, standards, quality attributes, design patterns, and technical debt analysis, among others.


The first stage of the research will start by analyzing the source code of the software systems through an architectural archeology that involves metrics, documentation, Architecture Description Languages (ADLs), and Domain Specific Languages (DSLs), among others, in order to identify the architectural style and patterns that make up the system and, if an architecture exists, whether it matches the running software or there are inconsistencies between them. This first step is shown in Fig. 1, where, with the source code of the system as an entry point, the system is studied to identify its design, as set out in (Allman, 2012). To do that, some studies about technical debt (Kruchten, Nord, and Ozkaya 2012) will be considered in order to evaluate whether the reconstruction of the architecture provides a ROI for the company. This analysis will be supported by tools such as SonarQube, Lattix and SonarGraph, among others.

 Fig. 1. First step of the research proposal
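As a hedged illustration of what the architectural-archeology step could look like in practice (this sketch is not part of the proposed process; the layer assignment and the dependency list are invented examples), the following code checks a module dependency graph, as it could be recovered from source code, against an assumed layered style and reports upward dependencies as possible architectural inconsistencies.

```python
# Illustrative sketch only: checking an assumed layered architectural style
# against module dependencies extracted from source code. Both the layer
# assignment and the dependency list are invented examples.

layers = {"ui": 3, "services": 2, "domain": 1, "persistence": 0}

# (source module, target module) dependencies, e.g. recovered by static analysis
dependencies = [
    ("ui", "services"),
    ("services", "domain"),
    ("domain", "persistence"),
    ("persistence", "services"),   # suspicious: a lower layer calling upwards
]

def layering_violations(dependencies, layers):
    """A layered style only allows dependencies from higher to lower layers."""
    return [(src, dst) for src, dst in dependencies
            if layers[src] < layers[dst]]

print(layering_violations(dependencies, layers))   # [('persistence', 'services')]
```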

From this first step the application of the complete reconstruction process emerges. But to formally define this process, it is necessary to deal with the following research questions:

- How can we identify the architectural style or patterns of a software system from its source code?

- Is it possible to recommend or advise against a software architecture reconstruction using Technical Debt evaluation techniques?

- Is it possible to recommend architectural styles and design patterns by applying reengineering and reverse engineering techniques?

- Which reverse engineering, reengineering, reconstruction and pattern detection techniques exist? If any, what are their advantages and disadvantages?


- Which software architecture recommending processes exist? If any, which are their advantages and disadvantages?

According to Committee, S. (2000) and Nord, Ozkaya, Kruchten, and Gonzalez-Rojas (2012), software architecture is known as the fundamental organization of a system represented in its components, their relationships, the environment and the principles that guide its design and evolution. By taking this concept as a reference, it is possible to associate with it the set of decisions about the organization of a software system (Microsoft 2009), which will focus this research on topics such as:

- Methods, procedures, models or techniques for architecture reengineering, reverse engineering, reconstruction and pattern detection, in order to identify the architecture of a system.

- Use of metrics and technical debt tools to determine whether a software architecture reconstruction provides a ROI.

- Use of metrics, tools and standards to determine recommendations associated with a design pattern or an architectural style or pattern. Next, these concepts are presented in detail:

Architectural Style: An architectural style (Microsoft 2009; Microsoft 2015b) is a collection of components and connectors that, combined with design patterns, seeks to obtain beneficial qualities for defining the architectural design of a software system, applicable in a context, and responds to the question: what?

Architectural Style = {types of architectural elements + types of relationship rules}

Design Pattern: According to Clements and Northrop (1996) and Microsoft (2015a), the term pattern is adopted by software designers who explore its benefits to abstract the parts of a system and provide a description of how to build these parts within a particular context. Design patterns provide a proven and documented solution to common problems in software development, where communication between objects and custom classes is used to solve, at the level of general design, a problem within a particular context. In a computing context, a design pattern is similar to concepts such as class libraries, frameworks, techniques and/or refactoring tools, or extreme programming.

Design Pattern = {context} problem + solution

Architectural Pattern: Architectural patterns have a high level of abstraction; they are software design patterns that define implementation strategies and the relationships of the components and connectors of a system, allowing to answer the question: how? Through an architectural pattern, an essential structural organization scheme for a software system can be expressed, which can consist of subsystems, responsibilities


and relationships. Compared with design patterns, architectural patterns have a higher level of abstraction. (Microsoft, 2015a).

Architectural Pattern = Pattern Design + {rationale}

3. Current stage of the research

The investigation is at an early stage. The starting point has been the development of a Systematic Literature Review (SLR) in order to find related work and determine the scope, tools and tips that can support the research. During the SLR, searches about reverse engineering, software reengineering and architectural technical debt were performed. The aim of these searches is the identification of the architectural style of a system. From them, it has been possible to highlight works, such as those by Samarthyam and Sharma (2014), Chris et al. (2012), Schmidt (2012) and Kruchten et al. (2012), which could guide this work.

References

Allman, E. (2012). Managing technical debt. Communications of the ACM. Retrieved from http://dl.acm.org/citation.cfm?id=2160733

Bersoff EH, Henderson VD, & Siegel SG. (1980). Software Configuration Management. An Investment in Product Integrity. Addison-Wesley, (New York).

Chris, F. (UPB), Daga, E. (CNR), Engels, G. (UPB), Germesin, S. (DFKI), Kilic, O. (SRDC), Nagel, B. (UPB), & Sauer, S. (UPB). (2012). IKS Alpha Development. Retrieved May 01, 2015, from http://wiki.iks-project.eu/index.php/IKS_Alpha_Development

Clements, P. C., & Northrop, L. M. (1996). Software Architecture: An Executive Overview. Weatherwise (Vol. 49, pp. 42–47).

Committee, S. (2000). IEEE Recommended Practice for Architectural Description of Software-Intensive Systems. October, 1471-2000(42010), i–23. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.102.9904

Garlan, D., & Shaw, M. (1993). An Introduction to Software Architecture. Advances in Software Engineering and Knowledge Engineering, I (New Jersey).

Kruchten, P., Nord, R. L., & Ozkaya, I. (2012). Technical debt: From metaphor to theory and practice. IEEE Software.


Lehman, M. (1997). Laws of Software Evolution Revisited.

Microsoft. (2009). Software Architecture and Design. Retrieved April 29, 2015, from https://msdn.microsoft.com/en-us/library/ee658098.aspx

Microsoft. (2015a). Patterns & Practices. Retrieved May 03, 2015, from http://pnp.azurewebsites.net/

Microsoft. (2015b). Strategic Architect Forum 2015. Retrieved May 03, 2015, from https://msdn.microsoft.com/architects-overview-msdn

Nord, R. L., Ozkaya, I., Kruchten, P., & Gonzalez-Rojas, M. (2012). In Search of a Metric for Managing Architectural Technical Debt. 2012 Joint Working IEEE/IFIP Conference on Software Architecture and European Conference on Software Architecture, 91–100. http://doi.org/10.1109/WICSA-ECSA.212.17

Perry, D., & Wolf, A. (1992). Foundations for the Study of Software Architecture. ACM Software Engineering Notes, 17, 40–52.

Samarthyam, G., & Sharma, T. (2014). Refactoring for Software Design Smells: Managing Technical Debt (First Edit). Retrieved from http://www.amazon.com/Refactoring-Software-Design-Smells-Technical/dp/0128013974

Schmidt, D. C. (2012). Strategic Management of Architectural Technical Debt. Retrieved May 02, 2015, from http://blog.sei.cmu.edu/post.cfm/strategic-management-of-architectural-technical-debt

Software Engineering Institute. (2014). Defining Software Architecture. Retrieved May 02, 2015, from http://www.sei.cmu.edu/architecture/


A Logic-Algebraic Approach to Decision Taking in a Railway Interlocking System

Antonio Hernando, Eugenio Roanes-Lozano, and Roberto Maestre-Martínez

1 Antonio Hernando, Departamento de Sistemas Informáticos, Escuela Universitaria de Informática, Universidad Politécnica de Madrid, [email protected]
2 Eugenio Roanes-Lozano, Departamento de Álgebra, Facultad de Educación, Universidad Complutense de Madrid, [email protected]
3 Roberto Maestre-Martínez, Departamento de Sistemas Informáticos, Escuela Universitaria de Informática, Universidad Politécnica de Madrid

Abstract. The safety of a railway network is a very important issue whose verification is considered very labour-intensive. The authors have developed different approaches in order to automatically check the safety of mid-small railway networks. Although these approaches are very simple to implement, they have the drawback of being unsuitable for large networks, since the algorithms take a long time to run. In this paper, we show a new algebraic model which, besides also being simple to implement, has the advantage of being very fast and can consequently be used for checking the safety of a large railway network.

Keywords: Railway Interlocking Systems, Logic, Decision Theory, Graph Theory.

1 Introduction

This paper deals with a new algebraic method for checking the safety of a railway network.

A railway network is composed of a set of sections (portions of a railway line) on which trains can be placed. The topology of the railway network is described by means of an adjacency relation between sections, indicating the possibility of passing from one section to another. The topology of a railway network can be changed by means of semaphores and turnouts.

– A semaphore is at the end of a section S1 connecting to S2. When the colour of the semaphore is green, it is possible to pass from section S1 to S2. When the colour of the semaphore is red, it is not possible to pass from section S1 to S2.

– Turnouts are devices connecting a section S1 to either a section S2 or a section S3, depending on the state of a mobile part of the turnout (termed the switch rail).


Railway networks are conceived so that different trains can be placed on different sections. However, the semaphores and turnouts in a railway network must be set so that two trains cannot crash when moving. The railway interlocking system has the purpose of warning when the railway network is not safe, because two trains could crash if they moved in a particular way in the railway network.

Within a railway interlocking system, a route denotes a path along the topology of the station or junction (for instance, a path from an entrance of the station to a certain track where the train will stop). In the classic approach to railway interlocking system design, the admissible train routes are predefined.

Establishing a route implies adequately setting the turnouts and signals/semaphores along the train route. Once a driver has been given a clear signal indicating a route, the route cannot be changed before the train has completely cleared it (the route is then said to be locked). The standard approach to railway interlocking system design is to predefine the admissible train routes and to manually study their compatibility in advance. As described in [40],

“The development of railway interlocking systems is currently very labour-intensive. Specialists develop the interlocking design for a particular area and manually check for completeness and consistency.”

There is an impressive number of papers on computer applications to railway interlocking systems (see, for instance, [5]). These works either create a formal specification for an existing railway system (in order to verify it or to create a new decision-taking tool) or describe a completely new model for railway interlocking systems.

An extremely detailed work about British Railways' interlocking logic is the Ph.D. Thesis [25], where a prototype verification system using a theorem prover is implemented in higher-order logic. This work is revisited using an annotated logic programme with temporal reasoning in [26].

In [40] a collaborative project with Queensland Railway to model-check railway interlocking systems is detailed. It uses ordered binary decision diagrams for the symbolic model checking. It accepts the (unrealistic) constraint that trains occupy one section at a time.

Meanwhile, the task of [18] is to automate the design of the relevant data model and to verify its safety. It is applied to the Slovak National Railways' technical standards and uses Z notation.

An example of a new decision tool is [14], where the Danish State Railways' informal specifications for interlocking systems are formalised and a VDM model is presented.

It is surprising that errors can be found even in minimal topologies. A real case study is [6], where detail is given on how an unacceptable error was found (using the Sternol Verification Tool). The station is a tiny subway station, and the railway interlocking system had been developed by a well-known signalling company.

In [16] a line block system (instead of a railway interlocking system) is verified.


An early topology-independent formal model is [24]. It uses different layers of abstraction (called domains). Petri nets are used for the dynamic domains, and double point graphs as well as logic invariants are used for the static domains. It is implemented in Objective-C and PROLOG. Let us underline that it does not follow the standard approach to railway interlocking systems; the concept of routes has been replaced by a context-free check of the permissibility of each controlling command.

We have also already developed different approaches to decision making in a railway interlocking system that do not use the standard approach to railway interlocking system design. And, unlike other approaches, ours does not consider the direction of the trains, which allows us to deal directly with special situations like reversing loops and reversing triangles. In our approaches the code is topology-independent and surprisingly brief. In [28] the problem is translated into graph theory language and treated using a matrix-based approach. In [29, 30], the problem is directly translated into an algebraic problem. The safety of the proposed situation is equivalent to the compatibility of a non-linear algebraic system (which can be decided in any computer algebra system using Groebner bases [8, 9]). In [34], the problem is translated into many SAT problems, which can be solved using an algorithm based on the DPLL technique [41].

However, all of the previous approaches are unsuitable for mid-large railway networks because of the long time required to run the algorithms on them. In this paper, we will show a new approach which combines the calculation of Groebner bases with the approach presented in [34]. In this new algebraic approach, unlike the algebraic approach in [30], we deal with Boolean polynomials, and we use specialized algorithms for calculating Groebner bases over Boolean polynomials. This makes the new approach much faster than the previous ones (see Section 5). Indeed, unlike the previous approaches, it takes a very short time to check the safety of large railway stations.

This paper presents an improvement over previous work on railway interlocking systems. Indeed, it introduces a new algebraic approach to railway interlocking systems which can be seen as a combination of techniques presented previously:

– In [30], the problem is solved by means of an algebraic approach involving Groebner bases.

– In [34], the problem is solved by means of a logic approach.

As already established by a previous result [32], the logic approach in [34] can be dealt with algebraically, giving way to another algebraic approach, different from that in [30]. In this way, the approach presented here is different from [34] because we now deal with an algebraic approach instead of a logic approach. The approach presented here is different from [30] because we now deal with Boolean polynomials (defined over the field Z2) instead of ordinary polynomials (defined over the field Q). An important advantage of our algebraic model over the algebraic approach in [30] is that we can determine, unlike the algebraic approach in [30], the unsafe sections by means of the calculation of the normal


form of polynomials. Besides, the algebraic approach presented here significantly improves on the approaches in [30] and [34], as will be shown in Section 5 below.

In Section 2, we formally define concepts related to the railway network. In Section 3, we review a propositional logic model which was studied in [34]. In Section 4, we study a new algebraic approach based on the calculation of Groebner bases of ideals of Boolean polynomials. In Section 5, we show how our model runs much faster than the previous ones. Finally, in Section 6 we discuss the conclusions and possible extensions of our technique.

2 Formalism of a railway interlocking system

Let us consider a railway network with m sections and n trains placed in it. A section is a connected (single-piece) part of the network, separated from the adjacent (neighbour) sections by a semaphore or signal and/or a turnout. It is possible to pass from section si to section sj if all of these conditions hold:

– si and sj are connected by an endpoint, that is to say, these sections are really adjacent⁴,

– in case there is a turnout between section si and section sj, its switch directs trains from si to section sj and conversely⁵,

– in case there is a semaphore or a signal controlling the pass from si to sj, its colour is green⁶.

Here, we provide a recursive definition illustrating the idea that a train may reach a given section:

Definition 1. The notion “a train ti may reach a section sj” is recursively defined as follows:

– if ti is placed on the section sj, then the train ti may reach the section sj,
– if ti may reach the section sk and it is possible to pass from sk to sj, then the train ti may reach the section sj.

Definition 2. A section s is safe if and only if there are no two trains t1, t2 which may reach the section s.

Definition 3. The railway network is safe if and only if all of its sections are safe.
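Before the algebraic machinery is introduced, Definitions 1-3 can be read directly as a graph-reachability check. The following sketch (an illustration only, not the algebraic method proposed in this paper) encodes the currently passable transitions and the train positions of the station of Figure 1 (Example 1 below) and reports the sections reachable by two or more trains.

```python
# Minimal sketch (not the algebraic method of this paper): checking Definitions 1-3
# directly by graph traversal. 'passable' lists the ordered pairs (si, sj) such that
# it is currently possible to pass from si to sj; 'position' maps each train to the
# section(s) it occupies. Data below reproduce Example 1 / Figure 1.

from collections import defaultdict, deque

passable = [("s1", "s2"), ("s5", "s3"), ("s3", "s4"), ("s4", "s3")]
position = {"t5": ["s1"], "t10": ["s5"], "t9": ["s4"]}

def reachable(train, passable, position):
    """Sections a train may reach (Definition 1): its own sections plus every
    section reachable through currently passable transitions."""
    graph = defaultdict(list)
    for si, sj in passable:
        graph[si].append(sj)
    seen, queue = set(position[train]), deque(position[train])
    while queue:
        s = queue.popleft()
        for nxt in graph[s]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def unsafe_sections(passable, position):
    """Sections reachable by two or more trains (negation of Definition 2)."""
    counts = defaultdict(set)
    for train in position:
        for s in reachable(train, passable, position):
            counts[s].add(train)
    return {s for s, trains in counts.items() if len(trains) >= 2}

print(unsafe_sections(passable, position))   # {'s3', 's4'} -> the network is not safe
```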

Example 1. Let us consider the tiny station in Figure 1 (a station with just one passing loop) on a single track line.

The position of the switches of the turnouts is represented by:

⁴ That is, they share an endpoint, like s1 and s5 or s3 and s4 in Figure 1.
⁵ For instance, the turnout connecting s3 with s2 and s5 in Figure 1 is in the diverted track position, so it allows to move from s3 to s5 and conversely.
⁶ For instance, the semaphore controlling the movement from s4 to s3 in Figure 1 is green.


Fig. 1. A very simple station.

– a small segment, if the switch is in the direct track position (see for instance the turnout between sections s1 and s2, s5 in Figure 1), or

– a small angle, if the switch is in the diverted track position (see for instance the turnout between sections s3 and s2, s5 in Figure 1).

(In view that Figure 1 is printed here in black and white, the colouring of the semaphores is represented by “R” (red) or “G” (green).)

So the main line is divided into sections s1, s2, s3, s4 (there is a semaphore and a turnout between sections s1 and s2, another turnout and two semaphores between sections s2 and s3, and another semaphore between sections s4 and s3). The section corresponding to the passing loop is denoted by s5.

Let us observe that signalling is installed on the right-hand side. Therefore, for example, the upper left “R” in Figure 1 means that moving from section s5 to section s1 is not allowed by the corresponding semaphore (the switch is not in the correct position either).

As may be observed, the position of the trains is represented by thick lines, and the numbers of the trains are included on top of these thick lines. So we have train t5 in section s1, train t10 in section s5 and train t9 in section s4.

3 Logic description of the problem in the logic-algebraic model

The logic description of the problem in the new logic-algebraic model is very similar to that of the logic model [34]. It is based on propositional logic and allows us to detect if a certain section is safe (see Lemma 1). In this logic model, we shall consider the following kinds of atomic propositions:

– for each section in the part of the network supervised by the system, we shall consider an atomic proposition si. Consequently, if there are m sections, we have m atomic propositions related to sections: s1, ..., sm.

– for each train placed in the part of the network supervised by the system, we shall consider an atomic proposition ti. Consequently, if there are n trains, we have n atomic propositions related to trains: t1, ..., tn.

Once the atomic propositions are established, we consider three kinds of formulae:


– formulae related to the topology of the network and the movements allowed by switches and semaphores. We consider the set of formulae:

S = {sj → si | it is possible to pass from si to sj}

– formulae related to the position of the trains. We consider the set of formulae:

P = {si → tj | the section si is occupied by the train tj}

(observe that a long train could simultaneously occupy more than one section),

– formulae related to trains. We consider the set of formulae:

T = {¬ti ∨ ¬tj | ti and tj are atomic propositions related to trains}

In [34] we show the following result, with which we can determine when a section is safe.

Lemma 1. A section, s, is safe if and only if the set of formulae {s} ∪ S ∪ P ∪ T is consistent.

In [34], we use the previous lemma in order to determine whether the railway network is safe: for each section in the railway network, we need to solve a SAT problem. This SAT problem can be solved by using algorithms like MiniSat [42], based on the DPLL technique [41].
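As a small illustration of Lemma 1 (not the MiniSat-based implementation of [34]), the following brute-force sketch checks the consistency of {s} ∪ S ∪ P ∪ T for the tiny station of Example 1 by enumerating all truth assignments; a real implementation would hand the clauses to a SAT solver instead.

```python
# Illustration of Lemma 1 (not the MiniSat/DPLL implementation used in [34]):
# a brute-force consistency check of {s} ∪ S ∪ P ∪ T for the tiny station of
# Example 1. Implications are encoded as pairs (a, b) meaning a -> b.

from itertools import product

sections = ["s1", "s2", "s3", "s4", "s5"]
trains = ["t5", "t9", "t10"]
S = [("s2", "s1"), ("s3", "s5"), ("s4", "s3"), ("s3", "s4")]   # sj -> si, pass si to sj
P = [("s1", "t5"), ("s5", "t10"), ("s4", "t9")]                # si -> tj, tj occupies si

def consistent(section):
    """Is {section} ∪ S ∪ P ∪ T satisfiable?"""
    atoms = sections + trains
    for values in product([False, True], repeat=len(atoms)):
        v = dict(zip(atoms, values))
        if not v[section]:
            continue
        if any(v[a] and not v[b] for a, b in S + P):           # a -> b violated
            continue
        if any(v[ti] and v[tj] for ti in trains for tj in trains if ti < tj):
            continue                                            # ¬ti ∨ ¬tj violated
        return True
    return False

print([s for s in sections if not consistent(s)])   # ['s3', 's4'] -> unsafe sections
```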

In this paper, we will use the previous lemma and the following well-known proposition (see Proposition 1) in order to state Theorem 1, which will be very useful in the next section.

Proposition 1. Let A1, ..., Ar, B be propositional formulae. We have the following:

{A1, ..., Ar, B} is inconsistent ⇔ {A1, ..., Ar} |= ¬B

Theorem 1. A section s is not safe if and only if the following holds:

S ∪ P ∪ T |= ¬s

Proof. (According to Lemma 1) the section s is not safe ⇔ {s} ∪ S ∪ P ∪ T is inconsistent ⇔ (according to Proposition 1) S ∪ P ∪ T |= ¬s. ⊓⊔

4 Algebraic approach to the logic description of the problem in the logic-algebraic model

4.1 Relation between algebra and propositional logic

In this section, we will review some relations between algebra and propositional logic which have been studied in previous papers [15, 17, 19, 22, 31–33].

Let C be the set of possible propositional formulae using the atomic propositions X1, ..., Xm. Each formula in C is associated with a polynomial in Z2[x1, ..., xm]. In order to define this association, we previously define the following ideal:

I = 〈x1² + x1, ..., xm² + xm〉


Definition 4. We define recursively the function ϕ : C −→ Z2[x1, ..., xm] as follows:

– If A ≡ Xi, then ϕ(A) = xi
– If A ≡ ¬B where B ∈ C, then ϕ(A) = NF(1 + ϕ(B), I)
– If A ≡ B ∧ C where B, C ∈ C, then ϕ(A) = NF(ϕ(B) · ϕ(C), I)
– If A ≡ B ∨ C where B, C ∈ C, then ϕ(A) = NF(ϕ(B) + ϕ(C) + ϕ(B) · ϕ(C), I)
– If A ≡ B → C where B, C ∈ C, then ϕ(A) = NF(1 + ϕ(B) + ϕ(B) · ϕ(C), I)

(where NF stands for “normal form”, i.e., the reduction of a polynomial modulo a polynomial ideal⁷).

The previous representation of formulae as polynomials implies an interesting property: given a formula A, ϕ(A) is a polynomial in Z2[x1, ..., xm] whose variables are never raised to a power greater than 1 (although the total degree of the polynomial may be, at most, equal to the number of variables). On account of this property, the polynomials representing formulae are very simple to deal with (this property improves the computational cost of calculating Groebner bases).
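The translation ϕ can be implemented directly by representing each Boolean polynomial as a set of multilinear monomials, so that the reduction modulo I (i.e., xi² = xi) is built in. The following sketch is only an illustration of Definition 4, not the PolyBoRi-based implementation used later in the paper.

```python
# Sketch of the translation ϕ of Definition 4, working directly with Boolean
# polynomials in Z2[x1, ..., xm] reduced modulo the ideal I (i.e. with x_i^2 = x_i).
# A polynomial is stored as a set of monomials; a monomial is a frozenset of
# variable names (the empty frozenset is the constant 1).

ONE = frozenset()

def add(p, q):            # addition mod 2 = symmetric difference of monomial sets
    return p ^ q

def mul(p, q):            # product, using x_i^2 = x_i (union of variable sets)
    r = set()
    for a in p:
        for b in q:
            m = a | b
            r ^= {m}      # coefficients are in Z2, so equal monomials cancel
    return frozenset(r)

def var(name):            # ϕ(X_i) = x_i
    return frozenset({frozenset({name})})

def neg(p):               # ϕ(¬B) = 1 + ϕ(B)
    return add(frozenset({ONE}), p)

def conj(p, q):           # ϕ(B ∧ C) = ϕ(B)·ϕ(C)
    return mul(p, q)

def disj(p, q):           # ϕ(B ∨ C) = ϕ(B) + ϕ(C) + ϕ(B)·ϕ(C)
    return add(add(p, q), mul(p, q))

def implies(p, q):        # ϕ(B → C) = 1 + ϕ(B) + ϕ(B)·ϕ(C)
    return add(add(frozenset({ONE}), p), mul(p, q))

# Example: ϕ(¬(s2 → s1)) = s2 + s1·s2 = (s1 + 1)·s2, which is exactly the kind of
# generator used for the ideal S in Section 4.2.
poly = neg(implies(var("s2"), var("s1")))
print(sorted("*".join(sorted(m)) or "1" for m in poly))   # ['s1*s2', 's2']
```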

Having described how Boolean formulae are represented by polynomials, we show, by Theorem 2, how logic problems are translated into algebraic problems⁸:

Theorem 2. Let A1, ..., An, B ∈ C. The following holds:

{A1, ..., An} |= B ⇔ ϕ(¬B) ∈ 〈ϕ(¬A1), ..., ϕ(¬An)〉 + I

The importance of Theorem 2 is derived from the fact that the algebraic problems involved in this theorem may be solved making use of Groebner bases [8]. On these grounds, many expert systems have been developed [21, 23, 27, 31, 35].

A reduced Groebner basis of an ideal J is a finite set of polynomials G = {q1, ..., qk} such that 〈q1, ..., qk〉 = J. Obviously, the definition of reduced Groebner basis involves more restrictions on these polynomials q1, ..., qk, but, as we have mentioned above, here we are merely concerned with such features as pertain to our paper's purpose (see [9] for more details). Indeed, for our purposes in this paper, we need only to highlight Proposition 2. This proposition states that the question of determining whether a variable belongs to an ideal may be directly answered by checking whether this variable belongs to the reduced Groebner basis of this ideal.

Proposition 2. Let J ≠ ∅ be an ideal in a polynomial ring K[x1, ..., xm] such that 1 ∉ J, let G be the reduced Groebner basis of the ideal J and let xi be a variable of K[x1, ..., xm]. We have that:

xi ∈ J ⇔ xi ∈ G

⁷ The theory of Groebner bases [8, 9] provides an effective algorithm for computing normal forms.

⁸ Indeed, as may be seen in [22, 31, 32], there are many deep relations between algebraic structures and propositional logic. A generalization of this theorem to many-valued logics (propositional logics with a prime number of truth values) may be found in [32], or with a power of a prime integer in [15]. Details on this algebraic approach to logic, first introduced in [17, 19], can be found in [22, 33].


4.2 The new logic-algebraic model

Now, we shall propose a model based on polynomials which allows us to detect whether the part of the network supervised by the interlocking system is safe (see Theorem 3). We will transform our logic model into an algebraic model, as we have seen in Section 4.1.

Each propositional formula described in Section 3 will be translated into a polynomial in Z2[s1, ..., sm, t1, ..., tn].

We will consider the following ideals:

– the ideal I of polynomials (related to properties of the translation of logic into algebra):

I = 〈s1² + s1, ..., sm² + sm, t1² + t1, ..., tn² + tn〉

– the ideal S of polynomials related to the topology of the network and the movements allowed by switches and semaphores:

S = 〈ϕ(¬A) | A ∈ S〉 = 〈(si + 1) · sj | it is possible to pass from si to sj〉

– the ideal P of polynomials related to the position of the trains:

P = 〈ϕ(¬A) | A ∈ P〉 = 〈si · (tj + 1) | the section si is occupied by the train tj〉

(let us remember that a long train could simultaneously occupy more than one section),

– and the ideal T of polynomials related to trains:

T = 〈ϕ(¬A) | A ∈ T〉 = 〈ti · tj | ti ≠ tj are variables related to trains〉

Lemma 2. Let I, S, P, T be the ideals associated to a proposed situation of switches and semaphores in the part of the railway network supervised by the interlocking system.

A section s is not safe ⇔ s ∈ I + S + P + T.

Proof. According to Theorem 1, a section s is not safe ⇔ S ∪ P ∪ T |= ¬s, which is equivalent, according to Theorem 2, to s ∈ I + S + P + T. ⊓⊔

According to the following theorem, we can determine when a section is safe.

Theorem 3. Let I, S, P, T be the ideals associated to a proposed situation of switches and semaphores in the part of the railway network supervised by the interlocking system. Let G be the reduced Groebner basis of the ideal I + S + P + T:

a section s is not safe ⇔ s ∈ G.

Proof. It can be immediately proven by taking into account Lemma 2 and Proposition 2. ⊓⊔


Example 2. We will study the safety of the railway network described in Example 1.

In order to check the safety of the railway network, we will consider the following ideals:

– Ideal I (there are five sections: s1, s2, s3, s4, s5, and three trains: t5, t9, t10).

I = 〈s1² + s1, s2² + s2, s3² + s3, s4² + s4, s5² + s5, t5² + t5, t9² + t9, t10² + t10〉

– Ideal S related to the topology of the railway network.

S = 〈(s1 + 1)s2, (s5 + 1)s3, (s3 + 1)s4, (s4 + 1)s3〉

– Ideal P related to the position of the trains.

P = 〈s1(t5 + 1), s5(t10 + 1), s4(t9 + 1)〉

– Ideal T related to the trains (there are three trains: t5, t9 and t10).

T = 〈t5t9, t5t10, t9t10〉

Now, we shall test if the situation proposed here is safe. We calculate G, the reduced Groebner basis of the ideal I + S + P + T:

G = 〈s5t9, s5t5, s1t10, s1s5, s2t10, s2s5, s1t9, s2t9, s5t10 + s5, s1t5 + s1, s2t5 + s2, t9t10, t5t10, t5t9, s1s2 + s2, t10² + t10, t9² + t9, t5² + t5, s5² + s5, s2² + s2, s1² + s1, s3, s4〉

Since s3 ∈ G and s4 ∈ G, according to Theorem 3, we have that s3 and s4 are not safe sections⁹. Therefore, the network is not safe.
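Example 2 can also be reproduced with a general-purpose computer algebra library. The following sketch uses SymPy instead of the PolyBoRi implementation employed by the authors, assuming SymPy's groebner() accepts a modulus=2 option for coefficients in Z2; it then checks, following Theorem 3, which section variables appear in the reduced basis.

```python
# Reproducing Example 2 with a general-purpose library (SymPy) instead of the
# PolyBoRi implementation used by the authors; shown only as an illustration of
# Theorem 3, assuming SymPy's groebner() accepts coefficients modulo 2.

from sympy import symbols, groebner

s1, s2, s3, s4, s5, t5, t9, t10 = symbols("s1 s2 s3 s4 s5 t5 t9 t10")

I = [v**2 + v for v in (s1, s2, s3, s4, s5, t5, t9, t10)]       # field equations
S = [(s1 + 1)*s2, (s5 + 1)*s3, (s3 + 1)*s4, (s4 + 1)*s3]        # topology
P = [s1*(t5 + 1), s5*(t10 + 1), s4*(t9 + 1)]                    # train positions
T = [t5*t9, t5*t10, t9*t10]                                     # one train per section

G = groebner(I + S + P + T, s1, s2, s3, s4, s5, t5, t9, t10,
             modulus=2, order="lex")

# Theorem 3: a section is not safe iff its variable appears in the reduced basis.
unsafe = [s for s in (s1, s2, s3, s4, s5) if s in G.exprs]
print(unsafe)   # expected: [s3, s4], so the network is not safe
```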

5 Performance comparison

Comparison of this model with the one presented in [30]. Both use simple polynomials. However, this model works with Boolean polynomials. There are very efficient specialized algorithms for calculating Groebner bases when working with Boolean polynomials [11–13]. PolyBoRi [7] is a recent C++ library specialized in Boolean polynomials which provides high speed in calculations over this kind of polynomial.

Comparison of this model with the one presented in [34]. While the model presented here needs to calculate one Groebner basis, the logic model requires solving many SAT problems (as many as the number of sections in the railway network).

We have made a comparison between the times required to calculate the safety of the railway network for a mid-large station in Spain. As may be seen, our method is efficient for a classic large station in Spain: the old broad-gauge Algodor junction, famous because of its mechanical interlocking

⁹ Indeed, as may be seen in Figure 1, both trains t9 and t10 can reach the sections s3 and s4.


(now preserved) and the big railway village associated with it [10]. Determining the safety of a sample situation with 10 trains takes 0.160 seconds on a standard computer, while the other approaches take more than 10 seconds.

In Table 1, we show the comparison between the times required to calculate the safety of the railway network for different situations. As may be seen, the efficiency of our method is always highly superior to that of the other approaches.

Model                                            | m=60, n=10 | m=100, n=10 | m=250, n=20
New model described in this paper                | 0.160 s    | 0.186 s     | 0.360 s
Model based on Logic described in [34]           | 10.23 s    | 250.250 s   | > 1 h
Optimized model based on Logic described in [34] | 4.47 s     | 120.365 s   | > 1 h
Model based on Groebner Bases described in [30]  | 15.356 s   | 380.368 s   | > 24 h

Table 1. Time comparison of the implemented methods for detecting the safety in a railway interlocking system with m sections and n trains.

As may be seen, the efficiency of our method is always superior to that of the other approaches. We have also tested our approach with a large real-world problem: the broad-gauge Algodor junction, famous because of its mechanical interlocking (now preserved) and the big railway village associated with it [10]. In Table 2, we show the comparison of the different approaches for calculating the safety of this specific railway network (the Algodor junction) for different numbers of trains. As may be seen, our approach is highly superior to the others.

Number of trains                  | Groebner Bases | Logic    | Logic improved | New approach
1 train                           | > 1 h          | 16.258 s | 7.583 s        | 0.156 s
5 trains                          | > 1 h          | 26.437 s | 9.845 s        | 0.174 s
10 trains (no possible collision) | > 1 h          | 38.948 s | 10.235 s       | 0.182 s

Table 2. Time comparison of the different approaches for detecting potentially dangerous situations in the Algodor junction.

6 Conclusions

In this paper, we have presented a new algebraic model for a railway interlocking system. According to this new model, the railway network is not safe if a variable representing a section belongs to the reduced Groebner basis of a set of Boolean polynomials. We have implemented this model on the computer algebra system PolyBoRi, resulting in a very short programme code. Moreover, we have compared the execution times with those of other very fast models implemented previously.


Acknowledgements

This work was partially supported by the research projects TIN2009-07901 (Spanish Government) and UCM2008-910563 (UCM - BSCH Gr. 58/08, research group ACEIA, Spain).

We would also like to thank the anonymous referees for their most valuable comments, which have greatly improved this paper.

References

1. Anon., Proyecto y obra del enclavamiento electrónico de la estación de Madrid-Atocha. Proyecto Técnico, Siemens, Madrid, 1988.
2. Anon., Microcomputer Interlocking Hilversum, Siemens, Munich, 1986.
3. Anon., Microcomputer Interlocking Rotterdam, Siemens, Munich, 1989.
4. Anon., Puesto de enclavamiento con microcomputadoras de la estación de Chiasso de los SBB, Siemens, Munich, 1989.
5. D. Bjørner, The FMERail/TRain Annotated Rail Bibliography, 2005. Available from: URL: http://www2.imm.dtu.dk/~db/fmerail/fmerail/
6. A. Boralv, Case Study: Formal Verification of a Computerized Railway Interlocking, Formal Aspects of Computing 10 (1998) 338–360.
7. M. Brickenstein, A. Dreyer, PolyBoRi: A framework for Groebner-basis computations with Boolean polynomials, Journal of Symbolic Computation 44/9 (2009) 1326–1345.

8. B. Buchberger, Bruno Buchberger's PhD thesis 1965: An algorithm for finding the basis elements of the residue class ring of a zero dimensional polynomial ideal, Journal of Symbolic Computation 41/3–4 (2006) 475–511.
9. D. Cox, J. Little, D. O'Shea, Ideals, Varieties and Algorithms, Springer-Verlag, Berlin - Heidelberg - New York, 1992.
10. D. Cuellar-Villar, M. Jimenez-Vega, F. Polo-Muriel (Eds.), Historia de los poblados ferroviarios en España, Fundación de los Ferrocarriles Españoles, Madrid, 2005.
11. J. C. Faugère: A new efficient algorithm for computing Groebner bases. Journal of Pure and Applied Algebra 139/1 (1999) 61–88.
12. J. C. Faugère: A new efficient algorithm for computing Groebner bases without reduction to zero. In: T. Mora, ed.: Proceedings of the 2002 International Symposium on Symbolic and Algebraic Computation ISSAC 2002. ACM Press (2002) 75–83.
13. V. P. Gerdt, M. V. Zinin: A Pommaret Division Algorithm for Computing Groebner Bases in Boolean Rings. In: J. R. Sendra, L. González-Vega, eds.: Symbolic and Algebraic Computation, International Symposium, ISSAC 2008. ACM Press (2008) 95–102.

14. K. M. Hansen, Formalising Railway Interlocking Systems, in: Nordic Seminar on Dependable Computing Systems, Department of Computer Science, Technical University of Denmark, Lyngby, 1994, pp. 83–94.
15. A. Hernando, E. Roanes-Lozano, L.M. Laita: A Polynomial Model for Logics with a Prime Power Number of Truth Values. Journal of Automated Reasoning, Springer, doi: 10.1007/s10817-010-9191-0.
16. T. Hlavaty, L. Preucil, P. Stepan, S. Klapka, Formal Methods in Development and Testing of Safety-Critical Systems: Railway Interlocking System, in: Intelligent Methods for Quality Improvement in Industrial Practice, vol. 1, CTU FEE, Department of Cybernetics, The Gerstner Laboratory, Prague, 2002, pp. 14–25.


17. J. Hsiang, Refutational Theorem Proving using Term-rewriting Systems, Artificial Intelligence, 25 (1985) 255–300.
18. A. Janota, Using Z Specification for Railway Interlocking Safety, Periodica Polytechnica Ser. Trans. Eng. 28/1–2 (2000) 39–53.
19. D. Kapur, P. Narendran, An Equational Approach to Theorem Proving in First-Order Predicate Calculus, 84CRD296, General Electric Corporate Research and Development Report, Schenectady, NY, March 1984, rev. Dec 1984. Also in: A. K. Joshi (Ed.), Proceedings of IJCAI-85, Morgan Kaufmann, 1985 (pages 1146–1153).
20. M. Losada, Curso de Ferrocarriles: Explotación Técnica, E.T.S.I. Caminos, Madrid, 1991.
21. L.M. Laita, E. Roanes-Lozano, V. Maojo, L. de Ledesma, L. Laita: An Expert System for Managing Medical Appropriateness Criteria Based on Computer Algebra Techniques. Computers and Mathematics with Applications 51/5 (2000) 473–481.
22. L. M. Laita, L. de Ledesma, E. Roanes-Lozano, E. Roanes-Macías: An Interpretation of the Propositional Boolean Algebra as a k-algebra. Effective Calculus. In: J. Campbell, J. Calmet (editors): Proceedings of the Second International Workshop/Conference on Artificial Intelligence and Symbolic Mathematical Computing (AISMC-2). Lecture Notes in Computer Science 958. Springer-Verlag (1995) 255–263.
23. M. Lourdes-Jimenez, J. M. Santamaría, R. Barchino, L. Laita, L. M. Laita, L. A. Gonzalez, A. Asenjo: Knowledge representation for diagnosis of care problems through an expert system: Model of the auto-care deficit situations. Expert Systems with Applications 34 (2008) 2847–2857.
24. M. Montigel, Modellierung und Gewährleistung von Abhängigkeiten in Eisenbahnsicherungsanlagen (Ph.D. Thesis), ETH Zürich, Zurich, 1994. Available from: URL: http://www.inf.ethz.ch/research/disstechreps/theses
25. M. J. Morley, Modelling British Rail's interlocking logic: Geographic data correctness, Technical Report ECS-LFCS-91-186, Laboratory for Foundations of Computer Science, Department of Computer Science, University of Edinburgh, 1991.
26. K. Nakamatsu, Y. Kiuchi, A. Suzuki, EVALPSN Based Railway Interlocking Simulator, in: M. Gh. Negoita et al., eds., Knowledge-Based Intelligent Information and Engineering Systems, Springer LNAI 3214, Berlin - Heidelberg, 2004, pp. 961–967.
27. C. Perez-Carretero, L.M. Laita, E. Roanes-Lozano, L. Lazaro, J. Gonzalez-Cajal, L. Laita: A Logic and Computer Algebra-Based Expert System for Diagnosis of Anorexia. Mathematics and Computers in Simulation 58 (2002) 183–202.
28. E. Roanes-L., L.M. Laita, An Applicable Topology-Independent Model for Railway Interlocking Systems, Math. Comput. Simul. 45/1 (1998) 175–184.
29. E. Roanes-Lozano, L. M. Laita, E. Roanes-Macías, An Application of an AI Methodology to Railway Interlocking Systems using Computer Algebra, in: A. Pasqual del Pobil, J. Mira, M. Ali, eds., Tasks and Methods in Applied Artificial Intelligence, Proceedings of IEA-98-AIE, Vol. II, Springer LNAI 1416, Berlin - Heidelberg, 1998, pp. 687–696.
30. E. Roanes-Lozano, E. Roanes-Macías, L.M. Laita, Railway interlocking systems and Groebner bases, Math. Comput. Simul. 51/5 (2000) 473–481.
31. E. Roanes-Lozano, L. M. Laita, E. Roanes-Macías: Maple V in A.I.: The Boolean Algebra Associated to a KBS. CAN Nieuwsbrief 14 (1995) 65–70.
32. E. Roanes-Lozano, L. M. Laita and E. Roanes-Macías, A Polynomial Model for Multivalued Logics with a Touch of Algebraic Geometry and Computer Algebra. Mathematics and Computers in Simulation 45/1 (1998) 83–99.


33. E. Roanes-Lozano, L. M. Laita, A. Hernando, E. Roanes-Macías, An algebraic approach to rule-based expert systems. RACSAM 104/1 (2010) 19–40. DOI: 10.5052/RACSAM.2010.04
34. E. Roanes-Lozano, A. Hernando, J.A. Alonso, L. M. Laita, A Logic Approach to Decision Taking in a Railway Interlocking System using Maple, Mathematics and Computers in Simulation, doi: 10.1016/j.matcom.2010.05.024
35. C. Rodríguez-Solano, L.M. Laita, E. Roanes-Lozano, L. Lopez-Corral, L. Laita, A Computational System for Diagnosis of Depressive Situations. Expert Systems with Applications 31 (2006) 47–55.

37. J. Westwood, ed., Trains, Octopus Books Ltd., London, 1979.
38. URL: http://en.wikipedia.org/wiki/Railroad_switch
39. URL: http://www.voestalpine.com/vaers/en/products/railway_infrastructure/switchsystems/special_trackwork/trailable_point.html
40. K. Winter, W. Johnston, P. Robinson, P. Strooper, L. van den Berg, Tool Support for Checking Railway Interlocking Designs, in: T. Cant, ed., Proceedings of the 10th Australian Workshop on Safety Related Programmable Systems, Australian Computer Society, Inc., Sydney, 2006, pp. 101–107.

41. M. Davis, G. Logemann, D. Loveland, A machine program for theorem-proving.Communications of the ACM 5/7 (1962) 394–397.

42. URL: http://minisat.se/


Non Delay Causal Forwarding Protocols for Hierarchical Communication Architectures

Isabel Muñoz, Sergio Arévalo
EUI, Universidad Politécnica de Madrid, 28031 Madrid, Spain
{imunoz,sergio.arevalo}@eui.upm.es

September 11, 2013

Abstract. Hierarchical communication architectures of groups of processes have been proposed to make vector-clock-based causal reliable protocols scalable. In this way, the size of the causal information carried in messages is of the order of the group size instead of the system size. In these architectures, gateway processes forward messages among groups in causal order, delaying unordered messages if necessary. Message delays in hierarchical architectures produce convoy effects, which in turn produce distortions in the network such as router scheduling delays and link congestion. This paper proposes a non-delay causal forwarding protocol for hierarchical architectures, hence free of the convoy effect. We design a causal communication service and its protocol for one-stage causal forwarding, and sketch a two-stage causal forwarding protocol (a daisy architecture) that uses the previous one recursively.

1 Introduction

Many distributed applications nowadays need high availability, using replication, or high performance, using distributed process cooperation. These replicas and processes need some kind of consistency, such as causal, sequential, or atomic consistency. This consistency can be implemented using multicast communication services with some kind of quality of service, such as reliability, causal order, total order, etc. Causal and reliable multicast was first proposed as part of the ISIS system [2], being considered a good candidate for scalable systems because of its weak properties compared with reliable and total order multicast protocols. Its implementation needed only a round of messages [3] carrying vector clocks [8] [4]. But vector clocks do not scale well, needing multicast messages to carry a vector of integers of size N (the number of processes).
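For reference, the vector-clock mechanism alluded to above works as follows (a generic textbook sketch, not the forwarding protocol proposed in this paper): a message from process j carrying vector clock V is causally delivered at a process with local clock VC once V[j] = VC[j] + 1 and V[k] ≤ VC[k] for every k ≠ j.

```python
# Generic vector-clock causal delivery check (textbook mechanism, not the
# non-delay forwarding protocol proposed in this paper). Each message carries a
# vector of N integers, which is exactly the scalability problem hierarchical
# architectures try to reduce.

def can_deliver(msg_clock, sender, local_clock):
    """Causal delivery condition at the receiving process."""
    if msg_clock[sender] != local_clock[sender] + 1:
        return False                       # a previous message from the sender is missing
    return all(msg_clock[k] <= local_clock[k]
               for k in range(len(local_clock)) if k != sender)

def deliver(msg_clock, sender, local_clock):
    """Update the receiver's vector clock after delivery."""
    local_clock[sender] += 1

# Example with N = 3 processes: process 2 receives two messages from process 0.
local = [0, 0, 0]
m1, m2 = [1, 0, 0], [2, 0, 0]
print(can_deliver(m2, 0, local))   # False: m1 has not been delivered yet
print(can_deliver(m1, 0, local))   # True
deliver(m1, 0, local)
print(can_deliver(m2, 0, local))   # True: now m2 can be delivered
```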

In order to improve scalability, Baldoni et al. proposed in [1] a hierarchical daisy architecture for multicasting reliable and causal messages (Fig. 1a). In this architecture, processes are organized in 'leaf' groups surrounding a central 'gateway' group. Fig. 1b shows how a process p1 from a leaf group multicasts a message m to all processes of the system. First, p1 multicasts m to its own leaf group and also to a special gateway process of the group, which belongs both to the leaf group and to the gateway group. Then the gateway process forwards (first stage) m to all members of the gateway group, all of which are gateway processes of different groups. And then every gateway process forwards (second stage) m in turn to its respective leaf group. To guarantee causal delivery at system


HCE-oriented payments vs. SE-oriented payments.

Security Issues

Rubén Nieto

Escuela Técnica Superior de Ingeniería de Sistemas Informáticos

Universidad Politécnica de Madrid

[email protected]

Abstract. Host Card Emulation (HCE) has emerged over the last year as an alternative to traditional payments. Near Field Communication (NFC) and its card emulation operating mode have enabled mobile payments: first with SE-oriented payments, using the Secure Element as the center of payment transactions, and now with HCE-oriented payments, using the host operating system as the center of payment transactions. Tokenization has helped to secure the adoption of these payment technologies, especially HCE, given its weaker security. This paper surveys the security issues involved in HCE- and SE-oriented payments.

1 Introduction

NFC technology is present in our daily lives. In a relatively short period of time it has gone from being relatively unknown (only mentioned in academic journals or scientific articles) to being part of everyday use.

Although it can be considered an evolution of RFID, its global adoption has been more difficult than expected, despite its multiple uses in today's society, for example ticketing [1], identification, collaborative games [2] and even interactive videogames; there are many disciplines in which this technology facilitates the interaction between humans and computers. Since NFC is mainly mobile-device oriented, and given the global increase in these types of devices, it seems unavoidable that enabling mobile payments has become one of its main uses.

Thanks to one of the operating modes of NFC, card emulation, it is possible to emulate Smart Cards, like the commonly named chip of credit cards.

One of the pioneers to adopt this technology from a mobile payments point of view was, as was to be expected, Google, with Google Wallet [3], seeking to enable small transactions using mobile phones with Android OS. It was based on SE-oriented payments, so its operation relied on storing in the SE the keys necessary to enable payments. The Secure Element (SE) is tamper-resistant cryptographic hardware that can store applications and data in a secure way. When Google Wallet appeared its impact was minimal, largely due to limitations that are implicit to the ecosystem encompassing SE-oriented payments.


At that time, many stakeholders wanted a say in the management of the procedure, starting with the MNO (Mobile Network Operator), the TSM (Trusted Service Manager), the handset manufacturer and the Service Provider, and continuing with the stakeholders directly involved in payment transactions such as the Merchant, the Acquirer or the Issuer. Finding a balance of forces between the parties involved proved so complex that, despite the emerging technology and the availability of the necessary means, the idea did not work.

Several years before Google Wallet was developed, an alternative that did not need the presence of an SE was created; it was also based on card emulation, but using software instead of hardware. The NFC Forum had already defined this possibility long before, but it was the handset manufacturer RIM (Research In Motion) that first implemented it in the operating system (BlackBerry OS 7) of one of its NFC-capable models, materializing this technology under the name Virtual Target Emulation.

Google renamed this technology HCE (Host Card Emulation) and implemented it in its Android operating system from version 4.4. Unlike SE-oriented payments, where the communication inside the mobile phone takes place directly between the NFC controller and the SE, in HCE-oriented payments this communication takes place between the host OS and the NFC controller. As will be explained throughout the article, this implies multiple security issues.

Due to the possibilities of HCE, Google stepped up and adapted Google Wallet to HCE-oriented payments. With that movement, it removed the majority of the stakeholders involved in SE-oriented payments since, as described above, with HCE it is not necessary to use an SE, because payment applications run in the host operating system.

HCE technology, together with the adaptation of the vast majority of terminals (PoS) to NFC contactless technology, has allowed many financial institutions to enter the mobile payments market. In Spain a clear example is BBVA with its app BBVA Wallet. But not all companies have opted for HCE; some of them have chosen SE-oriented payments, such as MNOs like Vodafone with Vodafone Wallet, or technological giants like Apple with Apple Pay.

The abovementioned cases share a common characteristic: the substitution of payment credentials by tokens. Tokenization is the process of replacing the PAN with a single-use or multiple-use controlled token. This technique adds complexity to the SE and HCE payment ecosystem, but it also adds security.

Regarding security, despite the fact that SE-oriented and HCE-oriented payments share common protocols, several security differences still exist between them. Throughout this article it will be examined which security aspects are safeguarded by each of them.

2 NFC

NFC technology is a short-range, half-duplex communication protocol which provides easy and secure communication between devices. NFC is distinct from the far-field RF communication used in personal area and longer-range wireless networks. NFC relies on inductive coupling between the transmitting and receiving devices. The communication occurs between two compatible devices within a few centimeters, at a 13.56 MHz operating frequency [4].

Three NFC devices can be involved in NFC communication: an NFC mobile, an NFC tag and an NFC reader. NFC technology operates in three different operating modes: reader/writer, peer-to-peer, and card emulation, where communication occurs between an NFC mobile on one side and an NFC tag, an NFC mobile, or an NFC reader on the other side, respectively. Each operating mode uses a distinct communication interface at the RF layer and has different technical, operational and design requirements.

In reader/writer mode, an active NFC mobile initiates the wireless communication, and can both read and modify the data stored in NFC tags.

In peer-to-peer mode, two NFC mobiles establish a bidirectional connection to exchange information.

In card emulation mode, the mode on which this article focuses, NFC devices use the same digital protocol and analogue techniques as smart cards and are completely compatible with the smart card standards based on ISO/IEC 14443 Type A, Type B and FeliCa.

3 Secure Element

A Secure Element (memory and secure execution environment) is a dynamic environment in which the code of an application and its related data can be stored and managed securely, and in which applications can be executed in a secure way [5].

The Secure Element (SE), in its form factor, consists mainly of a Smart Card, like the ones used in credit or debit cards. The SE provides delimited memory for each application and other functions such as encryption, decryption, and digital signature of data packets. This element is a fundamental part of NFC technology, being the main place where sensitive information is stored securely, above all in the case of payments, which is the focus of this paper.

The SE can be found in several formats:

UICC/SIM: Mobile Network Operator (MNO) dependent.

Embedded SE: Handset vendor dependent.

MicroSD: Manufacturer dependent.

The SE architecture fundamentally consists of: secure microcontrollers, a CPU, an operating system, immutable memory (ROM), mutable memory (EEPROM), volatile memory (RAM), crypto engines, sensors, timers, an RNG, communication ports, and FIPS and CC certifications.

The SE stores the high-profile credentials necessary for payment transactions to be made.

The typical components of an NFC handset are: the Secure Element (SE), the NFC controller, a mobile wallet (the UI application for consumer interaction), communication protocols/interfaces (ISO 7816, ISO 14443, SWP, UART, I2C, SPI), the host OS (Android OS, BlackBerry OS, iOS) and the SE OS (Java, Multos, proprietary).

The Secure Element is directly connected to the NFC controller, which includes the Contactless Front-end (CLF), through the Single Wire Protocol (SWP).

After clarifying the architecture of an SE, it is worth analyzing how it fits within SE-oriented payments.

To fully understand the operation of SE-oriented payments, it is necessary to understand how a payment made at a Point of Sale (PoS) with a chip credit card is completed. This mechanism is ruled by EMV, which defines its own specifications and is managed by EMVCo (MasterCard, Visa, etc.).

The chip integrated in current credit and debit cards is known as a Smart Card, and it is similar to the SE, embedded or not, in an NFC-enabled mobile phone, as described before. The Secure Element stores all the information necessary to complete a payment transaction (Primary Account Number, validity date, name, card issuer, etc.). Normally, when the SE comes in a SIM form factor, the owner is an MNO, which must grant access to it so that the bank app can be installed. In many other cases, the owner of the SE is the handset manufacturer.

Every time a payment is about to be made, an NFC communication between the PoS and the mobile device has to be initiated using the ISO 14443 protocol stack. Once the connection is established, the information interchange starts via the ISO 7816-4 protocol.

In SE-oriented payments the payment application runs on the SE; in addition, there is a UI app that runs on the host OS of the mobile phone and is connected to the SE app. The payment app runs on the JVM (Java Virtual Machine) of the SE and uses an AID (Application Identifier), which makes it possible to distinguish each app from the other apps running within the SE.

Consequently, a communication between the payment application and the PoS is performed, passing through the NFC interface (CLF), which establishes the steps needed to complete the payment. This includes a connection to the bank to verify the personal information stored within the SE and to approve the proposed amount, with possible additional security mechanisms like a PIN, biometric signals, etc.

All the applications stored in the Secure Element can communicate using contactless technology. The CLF works as a gateway between the NFC antenna and the SE, and redirects all the communication to the SE directly, without any host OS involvement.

4 Host Card Emulation

Nowadays, credit cards with integrated chips are still used to perform payments. This commonly known chip is a Smart Card.

Host Card Emulation (HCE) enables a mobile phone to work as a Smart Card. Therefore, no Smart Card is needed to make payment transactions.

In HCE-oriented payments, the payment application is executed on the operating system of the mobile device and communicates directly with the NFC controller.


HCE was initially defined by the NFC Forum [6], and since its origin it has been integrated into the card emulation operating mode, part of the main set of NFC specifications. As described before, HCE allows application-based software emulation of a Smart Card.

Soft-SE, or HCE, was introduced by RIM in BlackBerry 7. It does not permit the emulation of some hardware card schemes like MIFARE or FeliCa, and it does not provide any hardware-specific security; its services are software based. An HCE service behaves like any other Android application.

Android supports card emulation based on ISO 14443-4 (the ISO-DEP specification) and processes Application Protocol Data Units as defined in the ISO 7816-4 specification. Android mandates the use of ISO-DEP only on top of NFC-A technology; support for NFC-B technology is optional.

SE and HCE can coexist on the same system, so it is possible to manage both HCE-oriented and SE-oriented payments.

HCE and SE services must register the Application IDs (AIDs) they want to manage in the appropriate Android manifest at application installation time. The default route addresses the host; therefore, if SE-oriented services are to be found, Android must register their AIDs in the routing table of the NFC controller. When a terminal (PoS) selects an AID, the communication is routed by default to the host OS or, if the AID has an entry in the routing table, to the SE. A minimal sketch of this routing decision is shown below.

Android defines two NFC service categories: payment and other. Naturally, SE and HCE payment applications should register in the payment category.

At this point it is worth describing the protocols used by HCE-oriented payments, as adapted to the payments ecosystem.

ISO 14443. Contactless coupling.

Emulated cards may be Type A or Type B, both of which communicate via radio at 13.56 MHz. The main differences between these types concern modulation methods, coding schemes (ISO 14443-2) and protocol initialization procedures (ISO 14443-3). Both Type A and Type B emulated cards use the same transmission protocol, described in ISO 14443-4.

ISO 14443-3 specifies initiation and anti-collision

The PoS alternates between Type A and Type B communication until it detects an NFC-enabled mobile device. ISO 14443-3 specifies a number of rules for this polling process. It also describes the process of establishing communication between the PoS and the mobile device and the anti-collision methods to be used for selecting a specific device.

Due to their different modulation methods, Type A and Type B emulated cards have different protocol frames and anti-collision methods.

Type A: a dynamic binary search is used for the initialization and selection of Type A emulated cards.

Type B: Type B proximity emulated cards use a dynamic slotted Aloha algorithm for selection.


14443-4 specifies transmission protocol.

It is based on the T=1 protocol (defined in ISO 7816-3), a block-oriented transmission protocol.

Type A: if the emulated card supports the 14443-4 protocol, the terminal sends a RATS command to the emulated card, and the emulated card answers with an ATS. RATS and ATS are used to exchange data and parameters in order to determine which data transmission options are supported by the card and the terminal. After that, PPS can be used to configure the modifiable parameters to make the best use of the capabilities of the card and the terminal.

RATS and ATS values compatible with Android are defined in the Android developer guide.

The protocol is a half-duplex block transmission protocol, sometimes referred to as the ISO-DEP protocol. It sits at the transport layer of the OSI model. Its role is to ensure correct addressing of the data blocks, sequential transmission of excessively sized data blocks (chaining), monitoring of timing procedures and handling of transmission errors.

The protocol is capable of transferring application data via application protocol data units (APDUs), defined in ISO/IEC 7816-4.

At the beginning, the card waits for a command sent by the reader. Each command is processed by the card and a response is sent back to the reader. This pattern cannot be broken; the card itself cannot initiate any communication without receiving a command first.

The basic protocol structure is a data block. Three distinct types of blocks exist:

• I-block (information block): transfer of application data.

• R-block (receive ready block): used for positive or negative acknowledgements. The acknowledgement relates to the last received block.

• S-block (supervisory block): for exchanging control information.

The PCB (protocol control byte) specifies the type of the block and the fields included. The CID (card ID) is a logical card identifier for addressing a specific card. The NAD (node address field) is reserved to build up and address different logical connections; its use is not further defined in the specification. The INF field carries the payload (such as APDUs) from the application layer in the case of an I-block.

Finally, the EDC (error detection code) is a 16-bit cyclic redundancy check for error detection.

The chaining mechanism allows transmitting a payload too big to fit into a single block. The maximum size of the block is determined by the FSCI and FSDI values. For chaining control, the chaining bit of the PCB field is used.

7816-5 specifies the registration of application providers

When the user taps a device on an NFC reader, the Android system needs to know which HCE service the NFC reader actually wants to talk to. ISO 7816-4 defines a way to select applications based on an Application ID (AID).

This part of ISO/IEC 7816 specifies a registration procedure for application providers, and establishes the authorities and procedures to ensure and optimize the reliability of this registration.

The structure of the AID is the following: a 5-byte RID plus 0 to 7 bytes of PIX.

The RID is the unique identifier of the application provider. To make sure that nobody else uses your RID, you should register it with a national or international certification institution, depending on the scope of your application. Registering the RID is not mandatory; the only requirement is that an unregistered RID must start with "F". It is recommended to use only registered AIDs for serious applications.

The PIX is the Proprietary Application Identifier Extension, whose uniqueness the application provider should maintain.
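A small Python sketch of this structure follows; the RID and PIX values are invented for illustration, with the "F" prefix marking an unregistered RID.

# Sketch of the AID structure: 5-byte RID + up to 7 bytes of PIX (hex strings).
def build_aid(rid_hex, pix_hex=""):
    rid, pix = rid_hex.upper(), pix_hex.upper()
    if len(rid) != 10:                        # 5 bytes = 10 hex digits
        raise ValueError("RID must be exactly 5 bytes")
    if len(pix) % 2 or len(pix) > 14:         # PIX is 0..7 whole bytes
        raise ValueError("PIX must be 0..7 whole bytes")
    return rid + pix

def is_unregistered(aid):
    """Unregistered (proprietary) RIDs are expected to start with 'F'."""
    return aid.upper().startswith("F")

aid = build_aid("F001020304", "0506")         # hypothetical unregistered AID
print(aid, is_unregistered(aid))              # F0010203040506 True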

7816-4 specifies organization, security and commands for interchange

APDUs (application protocol data units) are used to exchange all the data that passes between the smart card (application) and the terminal (PoS). This layer is the equivalent of layer 7 of the OSI model (the application layer).

The protocol-dependent data units of the transmission protocol layer are called transmission protocol data units (TPDUs).

C-APDUs are the commands sent to the card; R-APDUs are the replies to these commands returned by the card.

APDUs are transferred transparently by the transmission protocol. APDUs that comply with ISO 7816-4 are designed to be independent of the transmission protocol.

Common transmission protocols: T=0 (byte oriented) and T=1 (block oriented).
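As a concrete illustration of this command/response exchange, the Python sketch below builds a SELECT-by-AID C-APDU (header 00 A4 04 00, as defined in ISO 7816-4) and checks the 0x9000 status word of the R-APDU; the AID used is hypothetical.

# Build a SELECT-by-AID command APDU and parse a response APDU.
def build_select_apdu(aid):
    # CLA INS P1 P2 Lc | data (AID) | Le
    return bytes([0x00, 0xA4, 0x04, 0x00, len(aid)]) + aid + bytes([0x00])

def parse_response(r_apdu):
    data, sw = r_apdu[:-2], r_apdu[-2:]
    return data, sw.hex(), sw == b"\x90\x00"   # 0x9000 = success

aid = bytes.fromhex("F0010203040506")          # hypothetical AID
print(build_select_apdu(aid).hex())            # 00a4040007f00102030405060  0
print(parse_response(bytes.fromhex("6F0084009000")))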

5 Tokenization

Tokenization is the process of replacing a high-value credential, like a Primary Account Number (PAN), with a substitute value that is used in transactions instead of the credential [7].

Tokenization can map the credential to a new value with the same or a different format than the credential being replaced. Regarding payments, the aim of tokenization is to remove sensitive payment data and substitute it with something useless outside the environment in which the token was created. Tokenization is not a new concept, but recent data breaches have increased awareness of the need to take special care of payment credentials. This technique can be used to prevent payment credentials from being stolen and used for fraudulent transactions.

There are several types of tokens and different ways to generate them. A token can be merchant specific, single-use or multiple-use, and stored and managed in the cloud, in a Token Vault, or at the merchant location. Each token is created following a strict process defined by the Token Service Provider. Once the token has been generated, it can be associated with a device, an individual transaction or a payment card.

There are mainly two types of tokens currently being used in the payment industry: tokens that replace the real payment account number to perform a payment transaction, and tokens that work in place of the payment account number and are stored by merchants or acquirers, substituting the actual account numbers and used multiple times.
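The PAN-token mapping kept in a Token Vault can be pictured with the following toy Python sketch (hypothetical class and method names; real tokens are issued under the TSP's controlled generation process, not with a simple random generator as here).

# Toy illustration of a Token Vault mapping tokens back to PANs.
import secrets

class TokenVault:
    def __init__(self):
        self._by_token = {}

    def tokenize(self, pan):
        # same 16-digit format as the PAN it replaces, useless outside the vault
        token = "9" + "".join(str(secrets.randbelow(10)) for _ in range(15))
        self._by_token[token] = pan
        return token

    def detokenize(self, token):
        return self._by_token.get(token)

vault = TokenVault()
t = vault.tokenize("4111111111111111")   # a well-known test card number
print(t, vault.detokenize(t))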


The use of tokens in payment transactions and the creation and management process of tokenization differ depending on the type of credential. There are many proprietary tokenization solutions available in the market.

Industry bodies like ANSI, ASC X9, EMVCo and PCI SSC have started to develop tokenization specifications for the use of banking payment cards. EMVCo released a complete specification in 2014.

Case study of SE-oriented payment with tokenization. Apple Pay

The process of configuring Apple Pay for the first time starts with the user taking a photo of or typing the credit card number [8]. This information is encrypted and sent to Apple's servers; none of this data is saved on the mobile device. Afterwards, Apple decrypts the data, determines the card's payment network and re-encrypts the data with a unique key known only to that payment network.

Subsequently, Apple sends the encrypted data, including iTunes and device information, to the card Issuer. The Issuer determines whether the card is added to the Apple Pay repository. If it is approved, the payment network or the issuer generates a device account number, a unique token associating the device and the PAN, specific to the device, and sends it along with other sensitive data, like the key necessary to generate dynamic security codes for each transaction (single-use tokens). Although Apple cannot decipher this token, it is added to the SE.

The payment process in a store starts when a compatible device (currently only the iPhone 6) approaches an NFC-enabled PoS. The communication is initiated via NFC, and the device selects its default credit card. An authentication is requested at this point, either via Touch ID or a password. Afterwards, the SE provides the Device Account Number and a transaction-unique security code, generated from the stored key, together with any additional payment information needed to complete the operation. The payment network or the Issuer verifies the payment information, checking the uniqueness of the security code and its association with the device.

Case study of HCE-oriented payment with tokenization.

The provisioning process starts with the Issuer sending the PAN to the Token Service Provider (TSP) [9]. Introduced by EMVCo, the TSP creates and manages tokens throughout their life cycle. The TSP can be located at the Issuer, at the Payment Network, or at a third party. It is responsible for keeping a register of the entities that can obtain tokens, a token provisioning system, token security and the Token Vault (a repository where the PAN-token associations are stored). The TSP generates a token associated with the PAN it has been given, saves the PAN-token association in the Token Vault and sends the token to the Issuer.

The Issuer sends a specified number of tokens to the payment application on the NFC mobile device. These tokens are stored on the handset and can be used when a payment transaction has to be made and there is no network connectivity.

Afterwards, the user brings the NFC mobile device close to the NFC terminal (PoS) and communicates the token; depending on the type of transaction, the amount or the preferences, entering a PIN (on the PoS or the mobile phone) may be needed.

The EMV specification describes the information that is sent: the payment token, the token expiry date, the token cryptogram, the token requestor ID, and the token assurance level, which defines the token security level. A small sketch of these fields is shown below.
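The following Python sketch simply groups these fields in a small data structure, purely as an illustration of what travels with the transaction; the field names paraphrase the list above and the values are invented, so this is not the literal EMVCo message format.

# Illustrative container for the tokenized payment data listed above.
from dataclasses import dataclass

@dataclass
class TokenizedPaymentData:
    payment_token: str        # substitutes the PAN
    token_expiry: str         # e.g. MMYY
    token_cryptogram: str     # transaction-unique cryptogram
    token_requestor_id: str   # who requested the token from the TSP
    token_assurance: int      # assurance level assigned when the token was issued

tap = TokenizedPaymentData("9123456789012345", "1217", "8A1FC3", "40010030273", 2)
print(tap.payment_token, tap.token_assurance)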

The PoS sends the token to the Acquirer, indicating in the PoS Entry Mode that it is a contactless transaction.

The Acquirer performs routine checks and sends the token to the Payment Network.

The Payment Network sends the token to the Issuer.

The Issuer sends the token, along with the type of payment (contactless) and the amount of money, to the TSP. The Token Vault within the TSP validates the token and the Issuer authorizes the payment.

The Issuer completes the validation of the bank account and the authorization information and sends the PAN to the Payment Network.

The Payment Network generates a reply cryptogram and replaces the PAN with the payment token. After that, it sends the token, the token assurance level, and the last four digits of the PAN to the Acquirer.

The Acquirer passes the authorization reply to the Merchant.

The Consumer is notified whether the transaction has been processed successfully or not.
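The whole authorization round trip can be summarized with the following toy Python sketch, in which all the parties are collapsed into plain functions and the token/PAN values are invented; it only traces the path of the token through the chain described above.

# Toy end-to-end walk-through of the authorization flow (illustrative only).
TOKEN_VAULT = {"9123456789012345": "4111111111111111"}   # token -> PAN

def pos_to_acquirer(token, amount):
    return {"token": token, "amount": amount, "entry_mode": "contactless"}

def issuer_via_tsp(msg):
    pan = TOKEN_VAULT.get(msg["token"])     # Token Vault look-up at the TSP
    return pan is not None                   # toy issuer authorization decision

def authorize(token, amount):
    msg = pos_to_acquirer(token, amount)     # PoS -> Acquirer -> Network
    return "approved" if issuer_via_tsp(msg) else "declined"

print(authorize("9123456789012345", 12.50))  # approved
print(authorize("0000000000000000", 12.50))  # declined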

6 HCE vs. SE. Security Issues

Taking a closer look at the protocol stack used by SE-oriented and HCE-oriented payments, it can be seen that they share the ISO protocols 14443-1 (physical layer), 14443-2 (RF signal interface), 14443-3 (activation and anti-collision), 14443-4 (transmission protocol) and 7816-4 (card organization and structure).

Of all the previously mentioned protocols, only 7816-4 provides default security mechanisms for safeguarding the different security aspects.

ISO 7816-4 includes Secure Messaging (SM), which defines multiple security mechanisms to achieve data confidentiality and data authentication.

The 7816-4 protocol was initially defined to exchange data between a Smart Card and a terminal but nowadays, due to the implementation of HCE in several operating systems, it is also supported by applications that are executed on an operating system, allowing communication with a PoS. This procedure is the basis of HCE-oriented payments.

Each data exchange between a PoS and a Smart Card is performed using electric pulses on the I/O line of the Smart Card. The data transmission should normally be designed so that an attacker cannot obtain any advantage from eavesdropping on the data transmission or from inserting blocks into the protocol.

Several mechanisms and techniques, known collectively as Secure Messaging (SM), can be used to defend against these attackers and their sophisticated attack methods.

A security mechanism is defined as a function that requires a set of items: a cryptographic algorithm, a key, an argument, and initial data. An indispensable condition must also be satisfied: every security mechanism must be completely transparent with regard to the existing protocol layers, to ensure compatibility with the existing standardized processes.

Moreover, taking into account the recommendations of the European Union in 2005, we can achieve integrity in addition to the previously mentioned confidentiality and authentication. Encryption, with the AES or TDES algorithms, provides confidentiality, and the checksum provides integrity.

Cryptograms are generated with a symmetric algorithm (TDES or AES). A cryptogram is always followed by a cryptographic checksum. According to ISO 7816-4, the data is first encrypted and the cryptographic checksum is then computed over the encrypted data; a minimal sketch of this ordering is given below.

It is also possible to add authentication and non-repudiation, besides integrity, via the digital signatures considered in 7816-4 SM.
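The sketch below uses AES-CBC for the cryptogram and an AES-CMAC for the checksum (via the Python cryptography package). Real ISO 7816-4 Secure Messaging wraps these values in SM data objects with their own padding and MAC rules, so this only illustrates the encrypt-then-checksum ordering.

# Simplified encrypt-then-MAC sketch (not ISO 7816-4 SM data objects).
import os
from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes
from cryptography.hazmat.primitives.cmac import CMAC

def protect(plain, enc_key, mac_key):
    iv = os.urandom(16)
    padder = padding.PKCS7(128).padder()
    padded = padder.update(plain) + padder.finalize()
    enc = Cipher(algorithms.AES(enc_key), modes.CBC(iv)).encryptor()
    cryptogram = enc.update(padded) + enc.finalize()       # confidentiality
    mac = CMAC(algorithms.AES(mac_key))
    mac.update(iv + cryptogram)                             # checksum over ciphertext
    return iv, cryptogram, mac.finalize()                   # cryptogram + checksum

iv, ct, tag = protect(b"PAN=4111111111111111", os.urandom(16), os.urandom(16))
print(len(ct), tag.hex())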

Security must be controlled at the application layer because the lower-layer protocols do not implement security.

As we have seen so far, because SE-oriented and HCE-oriented payments share protocols, the way they manage security at this level is the same, centralized in 7816-4.

From here on, depending on the ecosystem used to perform payment transactions, each one manages security in a different way, as will be specified.

Throughout this article it has been shown that HCE-oriented payments base their security on the operating system and the mobile device, with the inherent risks this involves. On the contrary, SE-oriented payments base their security on dedicated hardware, with strongly controlled access.

Analyzing security from a physical point of view, it becomes obvious that unauthorized access to a mobile phone is straightforward. An attacker can obtain confidential information stored in the mobile device simply by removing the flash memory and, with the help of a regular connector, mounting it in another device. It might seem that, thanks to the disk encryption managed by the operating system (as happens in Android), the disk would be secure enough to prevent an attacker from accessing the information. Nevertheless, it has been shown that, using brute force, obtaining the key and therefore accessing the information is possible.

On the contrary, extracting data from the EEPROM or the flash storage of an SE requires much more effort and far more sophisticated equipment.

If we consider the physical security of the processor entrusted with executing the instructions that enable the exchange of payment information, it is clear that in the case of HCE-oriented payments the mobile phone CPU has to be analyzed. ARM CPUs are among the most used in handsets, and from this type of CPU it may be possible to obtain signals that reveal information about cryptographic operations. In contrast, one of the main principles of SE design is tamper resistance; consequently, for SE-oriented payments the processor is robust and protected from unauthorized access [10].

The attack surface of an application can be defined as the union of the code, interfaces, services, protocols and practices available to all users, paying special attention to what is accessible to unauthenticated users. In HCE-oriented payments, for an Android app, the following direct attack surfaces can be outlined:


UI inputs

Network connectivity

Communication mechanisms between processes (intents…)

File system

NFC

From this, it seems quite clear that an app is exposed to multiple risk factors.

For SE-oriented payments, the direct attack surfaces are the following:

Wired: the SE is normally hard-wired to the host device.

Contactless or NFC.

Write access to the SE is often limited, because the SE is usually the property of the handset manufacturer, if it is embedded, or of the Mobile Network Operator, if it is a SIM. Therefore, the TSM (Trusted Services Manager), the handset manufacturer or the Mobile Network Operator, depending on which one owns the SE, administers the SE and can access it to make any change, including the initial process known as personalization, in which bank apps and their associated data are inserted. It should be noted that the access keys are known only to the owner of the SE.

Regarding Android, on the contrary, the user can install whatever he wants with few restrictions.

However, an issue exists because the app running on the SE connects to a UI app running on the mobile host OS, so if there is any breach there, the SE can be compromised.

Fortunately, the SE only supports create and delete, but not upgrade; therefore, in the case of a security breach, existing data cannot be modified, although it can be leaked. Anyhow, this scenario is highly unlikely, since in the SE environment the root concept does not exist, contrary to Android.

No credential handled by an Android app is completely protected against an attacker with administrative privileges (root) on an Android device. But this is not only an Android issue: on other vendors' devices, such as Apple (iOS) or Microsoft (Windows Phone), the operating system has root privileges (for example, to install updates), so if someone takes control of it they can gain root access.

Handling root access can be very complex, but how does this complexity affect HCE-oriented and SE-oriented payments? The Secure Element environment is not affected, because the host OS is not involved in the communication process. The main issue arises in the HCE environment, where the communication always passes through the host OS. The operating system does add basic security mechanisms; for example, in Android each application is executed in its own sandbox, which theoretically prevents a malicious app from accessing other applications' data. But all of this is removed if the device is rooted [11].

The information stored in applications, including sensitive information like payment credentials, can be compromised if a privilege escalation is carried out by the actual user, an involved third party, or malware that has infected the device. Moreover, there may be unauthorized access to data stored in the credit card or in banking services connected through the mobile application. It can also happen that the application code and data are stolen for fraudulent use, or that the application code is modified to make it malfunction or behave in an unwanted way.

Even if the device is not rooted, dangers exist. Payment applications, running on the SE or on the host OS, are identified from the outside by their AID (Application ID). Malicious applications running on the operating system may try to claim AID ranges of known payment applications in order to mount denial of service (DoS) attacks.

SE-oriented applications communicate with the outside through the CLF (Contactless Front-End), while HCE-oriented ones use the operating system as an intermediary. Hence, the CLF needs to manage a routing table to decide whether the requested app (for example, when it receives a communication from a PoS) is running on the SE or on the host OS. If the entry is not found in the routing table, Android defines a default route that points to an HCE-oriented application. This default route can be problematic, because it contradicts the architecture proposed for SE management.

Another possible issue involving HCE-oriented payments relates to backups and cloud storage services, because they can contain sensitive information necessary for performing payment transactions. Besides the inherent risk of cloud storage, namely that the environment is run by a provider rather than by us, there is the possibility that an attacker obtains the credentials controlling the applications that have access to the service.

7 Conclusion

In this paper we have presented a comparison, from a security perspective, of the principal mobile payment technologies: HCE-oriented payments and SE-oriented payments.

We started by explaining Near Field Communication, the emerging technology that enables SE and HCE payments through its operating mode named card emulation. This mode allows the interaction between an NFC-enabled mobile device and a Point of Sale; the communication link that is created permits payment transactions.

The Secure Element is a fundamental part of NFC, and it can store high-value data in a secure way, from the hardware point of view thanks to tamper resistance, and from the software point of view thanks to its own operating system, independent from the host OS.

Although it was implemented a few years ago, Host Card Emulation had been waiting for Android to arrive. In just a year or so, HCE and its ecosystem have become very popular among payment stakeholders, in great part because of the liberation from all the constraints surrounding the SE ecosystem. The independence to access the device and manipulate data without needing other parties involved is something new for stakeholders who choose to develop HCE-oriented payments. But everything comes at a price, in this case security.


There is no doubt that SE-oriented payments are more secure than HCE-oriented payments in almost every possible aspect. But given the possibilities that HCE provides, it has become necessary to develop security mechanisms that raise its security level as high as possible.

Tokenization can help secure the communication between an NFC device and a PoS by substituting the PAN with a token. It can benefit both HCE-oriented and SE-oriented payments but, as has been described, it is more important for HCE because of its security problems regarding the storage of payment credentials.

References

1. Suikkanen, J., Reddmann, D.: Vision: Touching the Future, Ticketing. In: Tuikka, T., Isomursu, M. (eds.) Touch the Future with a Smart Touch, Espoo, Finland, pp. 233–236 (2009); Research Notes 2492
2. Broll, G., Graebsch, R., Holleis, P., Wagner, M.: Touch to Play – Mobile Gaming with Dynamic, NFC-based Physical User Interfaces. In: Proc. MobileHCI 2010 (2010)
3. Wallen, J., http://www.techrepublic.com/article/manage-loyalty-programs-within-google-wallet-for-more-convenient-shopping/
4. Coskun, V., Ok, K., Ozdenizci, B.: Near Field Communication (NFC): From Theory to Practice. Wiley, London (February 2012). ISBN 978-1-1199-7109-2
5. Roland, M.: Software Card Emulation in NFC-enabled Mobile Phones: Great Advantage or Security Nightmare? In: 4th International Workshop on Security and Privacy in Spontaneous Interaction and Mobile Phone Use, Newcastle, UK (June 2012), http://www.medien.ifi.lmu.de/iwssi2012/papers/iwssi-spmu2012-roland.pdf
6. Smart Card Alliance, http://www.smartcardalliance.org/publications-host-card-emulation-101/
7. Dridan, R., Oepen, S.: Tokenization: Returning to a Long Solved Problem. A Survey, Contrastive Experiment, Recommendations, and Toolkit. In: Proceedings of the 50th Meeting of the Association for Computational Linguistics, Jeju, Republic of Korea, pp. 378–382 (July 2012)
8. https://www.apple.com/apple-pay/
9. https://www.emvco.com/specifications.aspx?id=263
10. Roland, M.: Comparison of the Usability and Security of NFC's Different Operating Modes in Mobile Devices
11. https://developer.android.com/guide/topics/connectivity/nfc/hce.html


Evidence of sticky costs in multiple industries of Argentina

Maria Ines Stimolo1, Marcela Porporato*2, Gustavo Porporato Daher 3

*Corresponding author

1 Facultad de Ciencias Económicas Universidad Nacional de Córdoba (Argentina)

Ciudad de Valparaíso s/n. Ciudad Universitaria, Córdoba – Argentina. CP: X5000HRV mstimolo@gmail

2 School of Administrative Studies Room 282 Atkinson College Building - York University 4700 Keele St. – Toronto ON M3J 1P3 CANADA

3 CFO Maersk Line [email protected] - Iberia Cluster

Parque Empresarial "La Finca" Paseo del Club Deportivo, 1 - Edificio 18 28223 Madrid SPAIN

[email protected]

March 2014

Abstract. An assumption made in traditional cost accounting books is that variable costs change proportionately with revenues. Recent studies on sticky costs challenged this assumption (Anderson et al., 2003). This study shows that sticky costs are observed in all Argentinean industries in the period 2004-2012 except for firms operating in the Agricultural sector. Total costs are sticky because the magnitude of the increase associated with an increase in activity level is larger than the magnitude of the fall associated with a decrease in activity level. These results suggest that industry specific cost structure and macroeconomic climate are valid explanations for cost behavior of firms operating in emerging economies.

Keywords. sticky costs, Argentina, cost structure, agricultural sector, cost behavior.


1 INTRODUCTION

A critical assumption in traditional cost accounting is that variable costs move proportionately with revenues. In the last decade it has been suggested that the magnitude of the change in costs depends both on the magnitude of the change in the cost driver and on the direction of this change (ascending or descending). In the management accounting literature this situation is known as sticky costs; although there are several alternative and complementary explanations, so far it cannot be asserted that all cost behavior is necessarily the result of managers' decisions; however, industry specific cost structure seems to have some explanatory power.

The motivation of this study is to test whether the concept of sticky costs is observed in a set of companies operating in a G-20 emerging economy. The study focuses on a set of five industries as a way to improve the generalization of Anderson et al. (2003) by using similar firms in the analysis (Balakrishnan and Gruca 2008). This approach assumes that cost structure is constant over time across sample firms in the same industry, because cost structures and scale economies are likely to be similar across firms in the same industry; however, such an approach is likely to limit sample size (Balakrishnan and Gruca 2008; Balakrishnan et al. 2004). Building on previous work, this paper focuses on a set of industries typical of emerging economies; of particular interest is the inclusion of a subsample of agriculture-based firms, which has not been documented in prior empirical studies.

Sticky costs have attracted attention recently, but little has been explored yet in specific industries from emerging economies, particularly in Latin America. Given the implications of the idea of sticky costs, it is useful to apply it in new contexts. As many authors argue, traditional techniques or models do not provide much guidance on how they should be applied to emerging markets (Pereiro, 2006); therefore, the main purpose of this study is to replicate the sticky cost behavior model of Anderson et al. (2003), while addressing the factors mentioned by Balakrishnan et al. (2010), with the population of Argentinean companies, in order to determine whether currently available models can have the same explanatory power in emerging and advanced economies. Argentina has been selected because between 2004 and 2007 the country experienced important growth, so a significant increase in the activity levels of firms is expected; in 2008 and 2009 the country's level of activity started to decline or stagnate due to the international financial crisis, providing excellent conditions to test how inflexible (sticky) costs are in the short term.

Building on previous work, this paper focuses on companies from an emerging economy to validate the external generalization of the sticky costs idea that firm-specific cost structure is a plausible explanation. This study was originally designed to improve the generalization of Anderson et al. (2003) in a scarcely researched geographical context, with the purpose of testing whether sticky costs are observed across various industries where cost structures and scale economies are likely to be similar (Balakrishnan and Gruca, 2008; Balakrishnan et al., 2004). Of particular interest is how costs behave in agriculture-focused organizations, which is the main novelty and contribution of this study.


Selecting companies from various industries operating in an emerging economy allows this study to manipulate two variables, namely cost structure and macroeconomic climate. This study is designed to offer an empirical validation of the idea of sticky costs in turbulent emerging economies by using the financial statements of companies listed on the local Stock Exchange between 2004 and 2012. The selection of subjects allows us to provide a coherent answer to the two main research questions of this study: 1) are total costs in Argentinean companies sticky? and 2) can the degree of cost stickiness be explained by industry specific cost structure and macroeconomic conditions?

The results obtained confirm the validity of the existing literature on sticky costs applied in a different and new context, while manipulating industry specific cost structure and macroeconomic climate effects. The results show that sticky costs are present in Argentina for the years 2004-2012. A second set of results suggests that both industry related cost structure and macroeconomic climate are good explanations of cost behavior. The degree of cost stickiness is different for each industry, which is explained by the flexibility each industry has to adjust its cost structure to changes in activity levels. An important contribution of this study is the finding that costs in agricultural firms do not show a sticky behavior.

The study is organized in five sections including this introduction. The second section offers a literature review of the historical evolution of sticky costs, together with the main reasons explored and the principal critiques. The third section presents the hypotheses, variables, models and details of how the data set was assembled. Results and discussion are presented in the fourth section. The study ends with a conclusion section.

2 Literature review of Sticky costs

A cost behavior labeled as "sticky" in the management accounting literature refers to costs that respond asymmetrically to changes in activity levels. It has been suggested that several factors lead to "sticky" cost behavior, in which costs adjust asymmetrically: faster for upward than for downward demand (revenue) changes. Some authors suggest that cost management decisions and their determinants can be inferred from cost behavior because all cost behavior is the result of cost management. Recent studies have contributed to developing a theory in which adjustment costs drive cost management and explain sticky cost behavior. Although the management accounting literature declares itself interested in documenting how costs behave, there is a widespread belief that it is the connection to cost management that is of interest to most researchers and professionals.


2.1 Traditional view of Cost Behavior

Sticky costs may seem to be a new concept; however, it has been around for decades, sometimes incorrectly referred to as 'active cost management' as well. The study of Malcom (1991, p. 76) is one of the first to introduce the concept of sticky costs, indicating that "many of these new costs tend to be somewhat nonvariable in character, i.e., lumpy and not strictly proportional to changes in activity. A common example is materials ordering and handling costs. As production grows, additional employees are added to handle the additional load; but, if production decreases, these personnel are not immediately laid off. Thus these lumpy costs stick even if activity declines and such costs have therefore sometimes been labeled 'sticky costs'". Another early study that mentions this concept is Mak and Roush (1994), in the context of how Activity Based Costing is reflected in flexible budgets. They refer to Malcom (1991), saying that sticky costs are those that can be increased in the short term but do not decrease when activity declines.

Costing methods propose different procedures to identify cost drivers and their relation to cost objects. All these methods assume that variable costs are "proportionally variable" with the firm's activity level. This assumption of proportionality is present in the cost methods used in all cost and management accounting textbooks (Horngren et al., 2012; Garrison et al., 2012), but it is not always fulfilled in practice, as costs generally show an asymmetric behavior. A fundamental concept in cost accounting is the total cost formula, normally represented as:

TC = FC + VCu x Q

where TC is total costs, FC is fixed costs, VCu is the variable cost per unit and Q is the number of units sold.

The total cost formula relates cost records to financial accounting reporting through a decomposition of net income before interest and taxes:

Π = Revenue - CGS - SG&A

where Π is profit (net income before interest and taxes), CGS is cost of goods sold and SG&A is selling, general and administrative expenses.


If net income before interest and taxes is stated using the variables from cost accounting, we have the following:

Π = SPu x Q – VCu x Q – FC

where SPu represents the selling price per unit. Reorganizing the terms of the equation we have:

Π = (SPu – VCu) x Q – FC

where SPu – VCu is the marginal contribution per unit (MgC). Hence:

Π = MgC x Q – FC

This last equation is used in cost and management accounting textbooks as the basis for short term decisions through the Cost-Volume-Profit model. According to the accepted knowledge in the field, short term decisions are based on three variables that directly affect net profit: total fixed costs, quantity of units sold and the contribution margin per unit, defined as the difference between selling price and variable costs.
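A short worked example, with figures invented purely for illustration, shows how the Cost-Volume-Profit relation is used:

# Worked Cost-Volume-Profit example (illustrative figures only).
SPu, VCu, FC = 10.0, 6.0, 2_000.0   # selling price, variable cost per unit, fixed costs
MgC = SPu - VCu                      # marginal contribution per unit

def profit(q):
    return MgC * q - FC              # profit = MgC x Q - FC

break_even = FC / MgC                # quantity at which profit is zero
print(profit(400), profit(600), break_even)   # -400.0 400.0 500.0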

2.2 Sticky Costs literature

All the costing techniques designed to identify, measure, track, assign and report costs are based on the distinction between fixed and variable elements with respect to operations volume. Dividing costs between fixed and variable has generated some problems with the supposition of proportionality of changes in variable costs in relation to the change in operations volume. This problem of asymmetric proportionality between costs and changes in the activity level began to be made explicit in some studies.

The sticky nature of costs has been raised in different studies; among them, Noreen (1994) demonstrated algebraically that activity based costing must fulfill the condition of proportionality in order for its results to be useful in decision making. Later, Noreen and Soderstrom (1997) offered the first empirical work analyzing whether general costs are proportional to activities. But it is the study by Anderson, Banker and Janakiraman (2003) that had the greatest impact on the development of empirical research on sticky cost behavior, by providing evidence that selling, general and administrative costs increase with increases in revenues but decrease in a lesser proportion when revenues decline.
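The asymmetric specification usually attributed to Anderson et al. (2003) regresses the log change in costs on the log change in revenues plus an interaction with a decrease dummy; a negative interaction coefficient indicates stickiness. The following Python sketch illustrates that specification on synthetic data (it is not the estimation performed in this study, and the data are invented).

# Sketch of the two-way logarithmic sticky-cost regression on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
rev_change = rng.normal(0.0, 0.10, 500)            # log revenue changes
down = (rev_change < 0).astype(float)              # 1 if revenue decreased
cost_change = (0.01 + 0.7 * rev_change
               - 0.3 * down * rev_change
               + rng.normal(0.0, 0.02, 500))       # built-in stickiness

X = np.column_stack([np.ones_like(rev_change), rev_change, down * rev_change])
b0, b1, b2 = np.linalg.lstsq(X, cost_change, rcond=None)[0]
print(round(b1, 2), round(b2, 2))                  # about 0.7 and -0.3: sticky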

Anderson et al. (2003) is considered the seminal work of the empirical studies on sticky costs, but from 2009 onward some studies question the validity of the empirical findings. Anderson and Lanen (2007) find that total costs show a sticky behavior with respect to activity level changes, but that when the costs that depend on managerial decisions are analyzed separately from the rest, they do not always show this behavior. These results lead the authors to question whether the proposed empirical models really prove that managerial decisions are reflected in sticky cost behavior. Balakrishnan, Labro and Soderstrom (2010) argue that sticky cost behavior might stem from the cost structure and not be driven by managerial decisions. Based on data from a simulation model that does not incorporate managers' decisions, the authors suggest that future research should consider the following factors: cost structure (fixed costs and economies of scale in variable costs), and industry and time specific differential rates of growth. Continuing with the Anderson et al. (2003) model, Banker, Byzalov and Plehn-Dujowich (2010) offer a complete literature review that synthesizes the development of this model, summarizing the empirical evidence that justifies each of the hypotheses used.

Another group of studies, also derived from the Anderson et al. (2003) seminal work, tested the existence of sticky costs in different economies. Calleja, Steliaros and Thomas (2006) extended the concept worldwide and showed that in American, English, French and German companies, operating costs are sticky with respect to changes in sales. Banker, Byzalov and Plehn-Dujowich (2010) studied the behaviour of operating costs in 19 countries and found evidence of sticky costs. Similar results were obtained for marketing, administrative and personnel costs by Ribeiro de Medeiros and de Souza Costa (2004) in Brazilian companies. This study was the first to show that the supposed symmetrical relation between variable costs and sales level does not hold, and that costs are even more rigid downward in companies that operate in emerging economies.

After more than a decade of studies published on this issue, the vast literature can be grouped in three relevant groups given the motivation and focus of this study:

1. Studies that first talk about the issue of lack of proportionality:
   (a) Malcom, R.E. (1991) "Overhead Control Implications of Activity Costing".
   (b) Noreen, E. (1994) "Conditions Under Which Activity-Based Cost Systems Provide Relevant Costs".
   (c) Noreen, E. and Soderstrom, N. (1997) "The Accuracy of Proportional Cost Models: Evidence from Hospital Service Departments".
2. Seminal empirical studies that set trends in how to study the issue:
   (a) Anderson, Banker and Janakiraman (2003) "Are Selling, General and Administrative Costs 'Sticky'?"
   (b) Anderson and Lanen (2007) "Understanding Cost Management: What Can We Learn from the Evidence on 'Sticky Costs'?"
   (c) Banker, Byzalov and Plehn-Dujowich (2010) "Sticky Cost Behavior: Theory and Evidence".
   (d) Balakrishnan, Labro and Soderstrom (2010) "Cost Structure and Sticky Costs".
3. Empirical studies performed with data of companies from emerging economies:
   (a) Ribeiro de Medeiros and De Souza Costa (2004) "Cost Stickiness in Brazilian Firms".


   (b) Pervan, Maja and Ivica Pervan (2012) "Analysis of sticky costs: Croatian Evidence".
   (c) Porporato and Werbin (2012) "Active Cost Management in Banks: Evidence of Sticky Costs in Argentina, Brazil and Canada".
   (d) Werbin and Porporato (2012) "Active Cost Management (Sticky costs) in Argentinean Banks".
   (e) Poorzamani, Zahra and Bakhtiary, Mohammadreza (2013) "Reviewing the impact of macroeconomic factors on operating cost stickiness in Tehran stock exchange".

Exhibit 1 briefly presents the results of selected previous studies.

Exhibit 1: Results of selected sticky costs studies

Study | Main contribution | Costs change when revenues increase 1% | Costs change when revenues decrease 1%
Balakrishnan et al. (2004) | Focused on the healthcare sector. Key new variable: capacity utilization | Increase 0.51% | Decrease 0.36%
Ribeiro & Souza Costa (2004) | Replication of the ABJ (2003) model in Brazilian companies | Increase 0.59% | Decrease 0.27%
Banker & Chen (2006b) | Study focused on 19 developed countries. Key new variable: labour market structure and restrictions | Increase 0.88% | Decrease 0.80%
Calleja et al. (2006) | Study focused on USA, UK, France and Germany. Key new variable: operative costs | Increase 0.97% | Decrease 0.91%
Chen, Lu & Sougiannis (2008) | Key new variable: managerial incentive and compensation plans | Increase 0.70% | Decrease 0.46%
Porporato & Werbin (2012) | Study focused on banks from Argentina, Brazil and Canada | Argentina: incr. 0.60%; Brazil: incr. 0.82%; Canada: incr. 0.94% | Argentina: decr. 0.38%; Brazil: decr. 0.48%; Canada: decr. 0.55%
Kama & Weiss (2010) | Key new variables: incentives to achieve profitability targets and technological options to react to the demand | Increase 0.77% | Decrease 0.68%
Dierynck et al. (2009) | Study focused on Belgium. Key new relation between variables: managerial incentives and compensation with labour costs | Increase 0.61% | Decrease 0.45%
Anderson et al. (2012) | Key new variables: use of resources that are flexible or committed | Increase 0.64% | Decrease 0.45%

2.3 Reasons for sticky costs

Several studies were derived from Anderson et al.'s (2003) seminal work, with the common characteristic of testing the existence of sticky costs and the factors that might explain them. Among the papers that contributed to the generalization of the sticky behavior of selling, general and administrative costs is Banker and Chen (2006a), while Steliaros, Thomas and Calleja (2006) considered operating costs. Sticky costs cannot be fully explained by the fact that resources are not perfectly divisible, implying that resources cannot be added or removed in small quantities to perfectly match the level of activity. The results of more recent studies indicate that costs are not uniformly sticky, and that stickiness depends on several factors.

It is argued that the principal reason for the existence of sticky costs is the uncertainty about the products' future demand, which leads managers to delay cost reductions until they are sure of the fall in volume. Anderson et al. (2003) show that the effect of sticky costs tends to diminish in subsequent periods because costs stop being sticky and diminish along with the level of activity. Normally, managers are more confident of a demand decline when the fall occurs for more than two successive periods. On the other hand, if the macroeconomic environment is positive, managers are not inclined to reduce costs because they expect that the activity level will recover soon. Banker, Ciftci and Mashruwala (2008) analyzed how managers' optimism (or pessimism) affects resource allocation decisions. Their results reinforce the argument that managers deliberately adjust resources in response to the observed conditions of demand.

Recent research has documented that factors such as capacity utilization (Balakrishnan, Petersen and Soderstrom, 2004), the criticality of the cost (Balakrishnan and Gruca, 2008) and empire-building incentives (Chen, Lu and Sougiannis, 2007) moderate the asymmetric response of costs to activity changes. Balakrishnan and Soderstrom (2008) documented that the more closely a function is linked with the central (core) business, the more difficult it is to reduce its costs. In a similar vein, Balakrishnan and Gruca (2008) found that the criticality of the function is related with cost behavior. None of these studies provides evidence showing that the ownership of the resource influences sticky cost behavior. Chen, Lu and Sougiannis (2007) argue that cost stickiness is driven not only by economic factors but also by managerial empire-building incentives.

Another identified reason relates to the particularities of the industry in which companies operate. Several studies demonstrated that companies with a large proportion of fixed assets have stickier costs (Anderson et al., 2003). Grouping companies by industries allows analyzing the factors that affect costs' sticky behavior, such as managerial decisions, fixed costs, economies of scale and asset structure.


Among studies focused on the analysis of different industries, Weidenmier and Subramaniam (2003) is sometimes identified as the first serious attempt. Their study of companies belonging to the manufacturing, commercialization, financial and services sectors between 1979 and 2000 shows that the manufacturing industry exhibited the "stickiest" behavior, due to high levels of fixed assets and inventory, while merchandising companies were the least sticky given the competitive nature of the industry. Their main contribution was to show that cost behavior changes across industries, which implies that the affiliation of a company to a certain industry is important for sticky costs research.

Industry-focused research has expanded beyond the classic geographical areas of Anglo-Saxon countries and is starting to produce some results in emerging economies. In non Anglo-Saxon countries, Werbin et al. (2011) analyzed two industries in Spain: furniture manufacturing and hospitality (hotels and restaurants). That study confirmed that the sticky costs phenomenon is also present in Spanish companies; nevertheless, important differences appear between both industries, with hospitality companies showing more flexibility to adjust their costs to activity levels. In emerging economies, Porporato and Werbin (2012) offer an analysis of the cost behavior of banks in Argentina, and although banks are not compared with other sectors, the authors draw some conclusions about cost behavior based on the regulations and macroeconomic situation of the industry. Exhibit 2 offers a summary of the three studies mentioned as valid antecedents of one of the lines of inquiry of this study.

Exhibit 2: Empirical Studies focused on industry as a reason for diverse cost behavior

Study | Key variable to measure costs | Costs change when revenues increase 1% | Costs change when revenues decrease 1%
Weidenmier and Subramaniam (2003) | Selling, General and Administrative | Manufacturing: increase 0.71%; Commercialization: increase 0.81%; Services: increase 0.74%; Financial: increase 0.51% | Manufacturing: decrease 0.56%; Commercialization: decrease 0.72%; Services: decrease 0.64%; Financial: decrease 0.57%
Werbin et al. (2011) | Operating costs | Spain, furniture manuf.: increase 0.97%; hospitality: increase 0.90% | Spain, furniture manuf.: decrease 0.44%; hospitality: decrease 0.84%
Porporato and Werbin (2012) | Total costs | Argentina, banks: increase 0.60% | Argentina, banks: decrease 0.38%


3 Hypothesis, Data Set, Variables and Models

The motivation of this work is to test whether the concept of sticky costs is observed in a set of companies operating in a G-20 emerging economy. The analysis is structured by industries as a way to improve the generalization of Anderson et al.'s (2003) idea by using similar firms in the analysis (Balakrishnan and Gruca 2008). This approach assumes that the cost structure is constant over time across the sample firms in the same industry, because cost structures and scale economies are likely to be similar across firms of the same industry; however, such an approach is likely to limit the sample size (Balakrishnan and Gruca 2008; Balakrishnan et al. 2004). Building on previous work, this paper focuses on a set of industries typical of emerging economies; of particular interest is the inclusion of a subsample of agriculture-based firms, which has not been documented in prior empirical studies.

Selecting companies from the most relevant industries allows this study to manipulate two variables not included in previous sticky costs studies of emerging economies: one is the cost structure, which depends on the nature of each industry, and the other is the macroeconomic climate. This study is designed to offer an empirical validation of the idea of sticky costs in emerging economies by using the financial statements of companies listed on the Buenos Aires Stock Exchange between the years 2004 and 2012. Between 2004 and 2007 the Argentinean economy experienced important growth, therefore one expects to observe a significant increase of income and consequently of costs (in a lesser magnitude). In 2008 and 2009 the country's level of activity started to decline or stagnate due to the international financial crisis, providing excellent conditions to test how inflexible (sticky) costs are in the short term.

3.1 Hypothesis Development

The sticky costs literature suggests that the magnitude of the change in costs does not only depend on the magnitude of the change in the cost driver, but also on the direction of this change (ascending or descending). However, the results of recent studies indicate that costs are not uniformly sticky; this behavior depends on several factors such as industry, economic environment, and also on cost category and structure. To date there is no study designed to test whether the principle of sticky costs holds in agricultural companies of emerging economies and how they differ from those observed in other industries. Therefore we propose that the following will be observed:

Proposition: if total costs are sticky, then when total income increases by 1% total costs increase too, and when total income decreases by 1% total costs decrease too, but in a lesser proportion. This means that if total costs are sticky, then the magnitude of the increase associated with an increase in volume is larger than the magnitude of the fall associated with a decrease in volume.
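In terms of the coefficients of the empirical model introduced in Section 3.4 below (the notation is anticipated here only to make the proposition concrete), the statement amounts to the following sketch:

```latex
% Sticky total costs expressed with the elasticities estimated later:
%   beta_1           = % change in total costs when total income increases 1%
%   beta_1 + beta_2  = % change in total costs when total income decreases 1%
\beta_1 > 0
\quad\text{and}\quad
0 < \beta_1 + \beta_2 < \beta_1
\;\Longleftrightarrow\;
\beta_2 < 0 .
```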


The only precedents in Argentina are the works of Werbin (2009), Werbin and Porporato (2012) and Porporato and Werbin (2012), which empirically show evidence of costs' sticky behavior in banks for the period 2004-2009. Besides these, there are no other studies done with data from Argentinean companies, making the question of whether the cost structure of Argentine companies exhibits a 'cost sticky' behavior still valid. If said behavior exists, the next natural research questions are: is it found in all industries? Are there any relevant differences among industries? And if so, how can they be described, or which are the key differentiating factors? Collecting these arguments and replicating Balakrishnan and Gruca's (2008) sticky costs hypothesis, we will test the following:

HYPOTHESIS 1: Total costs of Argentinean companies are sticky. The rate of increase in costs exceeds the rate of decline in costs as activity volumes change.

This study also seeks to explore the role of cost structures on cost stickiness. Specifically, when companies have a larger proportion of fixed costs, they are expected to show higher levels of stickiness, as Balakrishnan et al. (2010) suggest in their simulation. The same expected result was found by Anderson and Lanen (2007), because increasing the proportion of fixed costs increases the degree of asymmetry in the cost response. To measure the impact of cost structure, the selection of industries to compare is carefully done given the restrictions on data availability. Collecting these arguments we have the first part of the second hypothesis:

HYPOTHESIS 2a: The factors that affect cost stickiness are related with the internal cost structure of the firm.

The second explanation of differences in sticky costs among the industries selected is related with the macroeconomic climate. Banker et al. (2008) suggest that optimism affects resource allocation. In an optimistic environment managers tend to delay cost decreases. To measure the impact of the macroeconomic climate, the selection of macroeconomic metrics has to be very careful, as they shall allow differentiating the particular conditions experienced by each industry. Collecting these arguments we have the second part of the second hypothesis:

HYPOTHESIS 2b: The factors that affect cost stickiness are related with the macroeconomic variables in general and with those particular to the industry.


3.2 Subjects of Sticky costs analysis

The sample is composed of a set of Argentine companies that listed their shares on the Buenos Aires Stock Exchange between the years 2004 and 2012. The sample excludes those belonging to the financial and insurance sectors, as the local accounting and disclosure standards are not directly comparable with those of the rest of the companies. The database was prepared relying on the data publicly available on the Buenos Aires Stock Exchange web site (http://www.bolsar.com/net/principal/contenido.aspx). It was decided to include financial statements closed in the year 2004 onwards to avoid the special period of the grave economic and social crisis of 2001 and the immediately following years. The sample includes only companies that presented a positive ordinary operating income, defined as net sales larger than cost of sales plus selling and administrative expenses. Another restriction imposed on the companies included in the sample is that in the period 2004 to 2012 they have reported a minimum of three financial statements. The companies in the sample belong to different economic sectors or industries that present very different cost structures. Exhibit 3 lists the industries included and the classification adopted in this study according to the Industrial International Uniform Code. The agricultural sector includes activities of agriculture, cattle raising and forestry; the energy sector includes electricity, gas and water as well as oil extraction, refining and transport. The manufacturing sector was divided into manufacturing of products of agricultural origin and manufacturing of products of industrial origin because their cost structures are perceived to be quite different. The rest of the sectors were included in a remaining group labeled construction, trade and services in general. This last group was made due to the low number of companies left once the other four groups were formed, showing where the strengths of the Argentine economy lie.

Exhibit 3: Industries included in the sample

Sector | Code | IIUC codes included
Agriculture | AGR | 1 & 2
Energy | ENE | 11, 40, 41 & 60
Manufacturing of agricultural origin | MOA | 15 to 21
Manufacturing of industrial origin | MOI | 22 to 37
Construction, commerce and services | COM | 45, 50 to 95 except for 60
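As an illustration only, the selection rules just described could be applied to a pre-loaded panel of financial statements along the following lines. This is a minimal sketch: the DataFrame layout and column names ('firm', 'year', 'sector', 'net_sales', 'cost_of_sales', 'selling_exp', 'admin_exp') are assumptions for illustration, not the paper's actual database schema.

```python
# Illustrative sketch of the sample-selection rules described above.
import pandas as pd

def build_sample(df: pd.DataFrame) -> pd.DataFrame:
    # Fiscal years 2004-2012, excluding financial and insurance firms.
    df = df[df["year"].between(2004, 2012) & ~df["sector"].isin(["FIN", "INS"])]

    # Positive ordinary operating income: net sales larger than cost of sales
    # plus selling and administrative expenses.
    op_income = df["net_sales"] - (df["cost_of_sales"] + df["selling_exp"] + df["admin_exp"])
    df = df[op_income > 0]

    # Keep only firms reporting at least three financial statements in the window.
    counts = df.groupby("firm")["year"].transform("count")
    return df[counts >= 3]
```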


3.3 Variables considered

To estimate cost behavior two variables intervene: costs, and the activity level measured through sales. However, each new study adds different variables to detect reasons for costs' sticky behavior; this study relies on three sets of variables. The first set is comprised of those that measure sticky cost behavior; a second set of variables seeks to explain this behavior, and is in turn divided between variables internal to the company and external or macroeconomic variables; and finally there is a dummy variable that detects the periods in which the activity level declines.

The first relation to determine is between costs and activity levels. Financial statements do not contain data about the volume of sales and as such that information is not directly observable; therefore the proxy used is revenues. Regarding costs, this study focuses on administrative and selling expenses and how they relate with revenues. The variables considered in the models are net revenues, selling expenses, administrative expenses¹ and total expenses, which result from adding selling and administrative expenses. To relate the change in costs to the change in sales, this study uses the annual rate of change in expenses and the annual rate of change of revenues. Since Anderson et al.'s (2003) work, studies analyze and explain this behavior considering different reasons.
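A minimal sketch of how these rate-of-change variables can be built from a firm-year panel is shown below; the column names ('firm', 'year', 'net_sales', 'selling_exp', 'admin_exp') are illustrative assumptions carried over from the previous sketch.

```python
# Minimal sketch of the annual rate-of-change variables described above.
import numpy as np
import pandas as pd

def add_change_rates(df: pd.DataFrame) -> pd.DataFrame:
    df = df.sort_values(["firm", "year"]).copy()
    df["total_exp"] = df["selling_exp"] + df["admin_exp"]
    for col in ["net_sales", "selling_exp", "admin_exp", "total_exp"]:
        lagged = df.groupby("firm")[col].shift(1)
        # Log of the ratio between consecutive years, as in the models below.
        df[f"dln_{col}"] = np.log(df[col] / lagged)
    # Decrease dummy: 1 in firm-years where revenues fell relative to the previous year.
    df["dec"] = (df["dln_net_sales"] < 0).astype(int)
    return df
```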

3.4 Empirical model

The empirical modeling of the problem was introduced by Anderson et al. (2003). It is based on a regression model using panel data where the variables are defined as change rates between two periods. The model stems from the Cobb-Douglas cost function. The first step is to determine whether costs behave in a sticky manner: costs increase with an increase of revenues in a proportion bigger than the decrease of costs when there is an equivalent decrease in revenues. Anderson et al. (2003) define the following empirical model:

¹ The Argentinean GAAP in effect at the time of this study was Resolución Técnica Nº 9 (RT9) of FACPCE. Chapter 5 of RT9 defines as Selling Expenses those related with the sales and distribution of products or services rendered by the firm. RT9 Chapter 5 states that Administration Expenses are expenses incurred by the firm in order to carry on its activities that cannot be attributed to any of the following functions: purchasing (procurement), production (operations), selling, research and development, or financing of goods or services. The same chapter of RT9 states that net sales (revenues) are to be presented in the income statement and that the amount shall exclude returns, discounts and taxes.


\ln\left(\frac{C_{i,t}}{C_{i,t-1}}\right) = \beta_0 + \beta_1 \ln\left(\frac{V_{i,t}}{V_{i,t-1}}\right) + \beta_2 \, DEC_{i,t} \cdot \ln\left(\frac{V_{i,t}}{V_{i,t-1}}\right) + \varepsilon_{i,t} \qquad (1)

Where:
Ci,t is the level of cost (however it is measured) of firm i in year t.
Vi,t is the revenue or activity level of firm i in year t.
DECi,t is a dummy variable that takes the value 1 when revenues (the activity level) decrease for firm i in year t.
β1 is the coefficient of change in costs when the activity level (revenues) increases 1%, and β1 + β2 is the coefficient of change in costs when the activity level (revenues) decreases 1%. Sticky costs occur when β1 > 0 and β2 < 0.
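A minimal sketch of estimating model (1) by pooled OLS with statsmodels, reusing the dln_* and dec columns built in the earlier sketch, is shown below. Studies in this literature usually add refinements such as clustered standard errors or firm and year effects; those are omitted here for brevity.

```python
# Minimal sketch of estimating model (1) by pooled OLS.
import statsmodels.formula.api as smf

def estimate_model_1(df, cost_change="dln_total_exp"):
    data = df.dropna(subset=[cost_change, "dln_net_sales"]).copy()
    data["dec_x_dlnv"] = data["dec"] * data["dln_net_sales"]
    res = smf.ols(f"{cost_change} ~ dln_net_sales + dec_x_dlnv", data=data).fit()
    b1, b2 = res.params["dln_net_sales"], res.params["dec_x_dlnv"]
    sticky = (b1 > 0) and (b2 < 0)        # condition for sticky costs stated above
    stickiness_degree = (b1 + b2) / b1    # degree of stickiness reported in Section 4
    return res, sticky, stickiness_degree
```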

This basic empirical model started to include other terms to test other hypotheses about costs' sticky behavior. One of these modifications emerged from the idea that, when faced with small changes in activity levels, firms tend to maintain existing resources, whereas more important variations in activity levels force managers to change the firm's cost structure. When the increase in revenues is important, the adjustment in costs is more rapid than when revenues decrease, since it is not possible to reduce committed costs such as personnel, fixed assets, etc. It is more likely that managers change the cost structure with an increase in the activity level than with a decrease of the same order, under the presumption that the decrease can be temporary (Cooper and Kaplan, 1998; Balakrishnan et al., 2004; Weidenmier and Subramaniam, 2003). With this change, the hypothesis to test can be transformed into asking whether the magnitude of the change in revenues (activity level) affects the degree of cost stickiness.

To empirically test how the magnitude of changes in activity levels affects the sticky behavior of costs, a model is proposed that stratifies the changes in activity levels. In the literature there is no consensus on how many intervals to consider or how wide they are supposed to be; therefore in this study they were established after analyzing the data and finding the set of intervals that best explained the data. The second equation of the improved model incorporates terms for every interval of revenues considered.

\ln\left(\frac{C_{i,t}}{C_{i,t-1}}\right) = \beta_0 + \sum_{j=1}^{4} \beta_j \, Rj_{i,t} \ln\left(\frac{V_{i,t}}{V_{i,t-1}}\right) + \sum_{j=1}^{4} \beta_{j+4} \, Rj_{i,t} \, DECj_{i,t} \ln\left(\frac{V_{i,t}}{V_{i,t-1}}\right) + \varepsilon_{i,t} \qquad (2)

Where:
Ci,t is the level of cost (however it is measured) of firm i in year t.
Vi,t is the revenue or activity level of firm i in year t.
R1i,t = 1 if the rate of revenue change is between [-0.05, 0.05].
R2i,t = 1 if the rate of revenue change is between [-0.10, -0.05) or (0.05, 0.10].
R3i,t = 1 if the rate of revenue change is between [-0.20, -0.10) or (0.10, 0.20].
R4i,t = 1 if the rate of revenue change is between (-∞, -0.20) or (0.20, +∞).
DECji,t are dummy variables that take the value 1 when revenues (the activity level) decrease for firm i in year t within the corresponding interval:
DEC1i,t = 1 if the rate of revenue change is between [-0.05, 0).
DEC2i,t = 1 if the rate of revenue change is between [-0.10, -0.05).
DEC3i,t = 1 if the rate of revenue change is between [-0.20, -0.10).
DEC4i,t = 1 if the rate of revenue change is between (-∞, -0.20).
β1 to β4 are the change rate coefficients of costs when revenues (the activity level) increase 1%, and βj + βj+4 for j = 1, 2, 3, 4 are the change rate coefficients of costs when revenues (the activity level) decrease 1%. Costs are said to be sticky if βj > 0 and βj+4 < 0 for all values of j.
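The interval dummies can be built mechanically from the annual revenue change rate. The following minimal sketch assumes the rate Vt/Vt-1 - 1 is stored in a column named 'rev_rate' (an illustrative name, not taken from the paper):

```python
# Sketch of the interval dummies R1-R4 and DEC1-DEC4 defined above.
import numpy as np
import pandas as pd

INTERVALS = [(0.00, 0.05), (0.05, 0.10), (0.10, 0.20), (0.20, np.inf)]

def add_interval_dummies(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    rate = df["rev_rate"]
    for j, (lo, hi) in enumerate(INTERVALS, start=1):
        if j == 1:
            in_interval = rate.abs() <= hi                        # [-0.05, 0.05]
        else:
            in_interval = (rate.abs() > lo) & (rate.abs() <= hi)  # e.g. (0.05, 0.10] in absolute value
        df[f"R{j}"] = in_interval.astype(int)
        df[f"DEC{j}"] = (in_interval & (rate < 0)).astype(int)    # decrease side of interval j
    return df
```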

These models are used in this study. The formulas presented so far can be adapted to test whether the sticky behavior is less significant when multiple periods are considered, or whether costs are less sticky when revenues have already declined in the previous period. Also, variations of these models are used to test whether costs are stickier in periods of macroeconomic deceleration, or in firms with higher levels of assets, or in firms with a large number of employees. There are other ways of relating costs with revenues, particularly when selling expenses are considered, as they directly affect the volume of sales. To represent this simultaneous relation between revenues and costs it is possible to use a model of simultaneous equations that includes changes in costs and revenues as endogenous variables. Selling expenses include advertising expenses that have a direct relation with the company's level of sales; to model these relations separately we can use a system of simultaneous equations that considers, in the second equation, the change in revenues as depending on the change in advertising expenses.
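One hedged way to write down that simultaneous-equations variant is sketched below; the second equation and the advertising variable A are illustrative, since the paper does not report the exact specification:

```latex
% Illustrative two-equation system; A_{i,t} denotes advertising expenses (an assumption).
\ln\frac{C_{i,t}}{C_{i,t-1}} = \beta_0 + \beta_1 \ln\frac{V_{i,t}}{V_{i,t-1}}
    + \beta_2 \, DEC_{i,t} \ln\frac{V_{i,t}}{V_{i,t-1}} + \varepsilon_{i,t}
\qquad
\ln\frac{V_{i,t}}{V_{i,t-1}} = \gamma_0 + \gamma_1 \ln\frac{A_{i,t}}{A_{i,t-1}} + u_{i,t}
```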

3.5 Limitations

The principal limitation is given by the availability of financial data. The population of this study is limited to the companies listed on the Buenos Aires Stock Exchange, which in the first decade of the 2000s were at no point more than 120. A second limitation emerges from the financial accounting regulation in Argentina, which limits some analyses because the data is not disclosed. An example of this limitation is the number of employees, as it is not mandatory to report it; instead, the Argentinean regulations require reporting the total amount paid in salaries and social security contributions. The third limitation deals with the time frame considered. Argentina experienced a significant crisis by the end of 2001; therefore only financial statements presented from 2004 onwards are not significantly affected by the crisis and its aftermath.


On the other hand, the limit of 2012 is due to changes in Argentine legislation that require companies to adopt IFRS, which leads to changes in disclosure and measurement that make financial statements not comparable.

4 Results and discussion

4.1 Descriptive Statistics

The first step in the analysis was to detect and eliminate outliers. Outliers were identified by considering all firms and using the Anderson et al. (2003) model to calculate the influence measures denoted by DFITS², Cook's distance and Welsch's distance. These three measures include in their estimation the size of atypical residuals and the leverage (values that significantly affect the model coefficients). Outliers were eliminated from the sample when any of the three measures was outside the standard limits. A total of 684 observations corresponding to 99 firms were considered. A preliminary descriptive analysis is summarized in Exhibit 4.
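A minimal sketch of this screening with statsmodels follows. The cutoffs shown are the conventional "standard limits" used by common statistical packages (the paper does not report its exact thresholds), and Welsch's distance is derived here from DFITS and the leverage values; treat all of this as an assumption-laden illustration.

```python
# Sketch of the outlier screening: fit model (1), compute DFITS, Cook's distance
# and Welsch's distance, and flag observations outside conventional limits.
import numpy as np

def outlier_mask(res):
    infl = res.get_influence()
    n = int(res.nobs)
    k = int(res.df_model) + 1                  # number of estimated parameters
    dffits, _ = infl.dffits                    # DFITS for each observation
    cooks, _ = infl.cooks_distance             # Cook's distance
    leverage = infl.hat_matrix_diag
    welsch = dffits * np.sqrt((n - 1) / (1 - leverage))
    return (
        (np.abs(dffits) > 2 * np.sqrt(k / n))
        | (cooks > 4 / n)
        | (np.abs(welsch) > 3 * np.sqrt(k))
    )
# Observations where outlier_mask(res) is True would be dropped and the model re-estimated.
```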

Exhibit 4: Preliminary Descriptive Analysis

Variable | Mean | Standard deviation | Variation coefficient | 1st Quartile | Median | 3rd Quartile | n
Net revenue (in millions of pesos) | 2,110 | 5,740 | 2.72 | 199 | 573 | 1,680 | 684
Selling expenses (in millions) | 194 | 487 | 2.51 | 6.8 | 31.3 | 135 | 684
Administrative expenses (in millions) | 85.7 | 187 | 2.18 | 11.6 | 27.7 | 73.5 | 684
Selling expenses as a % of net revenues | 9.2% | 8.5% | 0.92 | 3.4% | 5.5% | 8.0% |
Administrative expenses as a % of net revenues | 4.1% | 3.3% | 0.80 | 5.8% | 4.8% | 4.4% |
Periods when net revenue declined (% of decrease) | 11.4% | 88.6% | | 3.3% | 8.7% | 15.3% | 198

² DFITS is a scaled difference between the predicted values for the ith case when the regression is fit with and without the ith observation, hence the name.


Average net revenues reach $2,110 million (median $573 million). Selling expenses average $194 million (median $31.3 million) and the mean of administrative expenses is $85.7 million (median $27.7 million). The variability is high but similar for revenues and expenses, as reflected in the variation coefficients. Selling and administrative expenses represent on average 9.2% and 4.1% of net revenues respectively. A total of 29.91% of observations (company-years) presented decreases in net revenues when compared with the previous period. The average decrease was 11.4% (median 8.7%).

4.2 Results for Equation 1: existence of costs sticky behavior

The results of the equations are robust and concordant with expectations. The first hypothesis suggests that administrative and selling expenses present a sticky behavior in relation to variations in revenues. This hypothesis is confirmed because the values of the β2 coefficients are negative, as shown in Exhibit 7, with all coefficients statistically significant at the 5% level. Selling expenses increase 0.89% when there is an increase of 1% in revenues, but they only diminish 0.17% when revenues decrease 1%. Administrative expenses show a similar but less pronounced effect, with an increase of 0.3% and a decrease of 0.1%. Total expenses increase 0.43% with a 1% increase in revenues, but they decrease 0.15% with a decrease of 1% in revenues. The decrease of total expenses is less than their increase, meaning that within a one-year period costs cannot adjust proportionally to revenue changes, causing a decrease in short-term profitability. Exhibit 7 also includes the degree of cost stickiness calculated as (β1 + β2) / β1; the lower the result, the stickier the cost behavior; in this case, selling expenses presented the stickiest behavior.

Exhibit 7: Empirical results of applying the model of equation (1) to the Argentinean data

Equation of Model (1) | Selling expenses | Administrative expenses | Total expenses
β0 | 0.022 (1.44) | 0.05 (5.20) | 0.05 (5.82)
β1 | 0.89** (12.19) | 0.30** (7.14) | 0.43** (11.75)
β2 | -0.72** (-5.21) | -0.20** (-2.44) | -0.28** (-3.73)
Degree of cost stickiness | 0.19 | 0.33 | 0.35
R2 adjusted | 0.2261 | 0.0882 | 0.2099

** significance level 5%. t-values are reported in parentheses.
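As a quick check, the "degree of cost stickiness" column of Exhibit 7 can be reproduced from the reported coefficients (a small illustrative snippet):

```python
# Reproducing the degree of cost stickiness of Exhibit 7: (beta1 + beta2) / beta1.
exhibit_7 = {
    "Selling expenses": (0.89, -0.72),
    "Administrative expenses": (0.30, -0.20),
    "Total expenses": (0.43, -0.28),
}
for label, (b1, b2) in exhibit_7.items():
    print(f"{label}: {(b1 + b2) / b1:.2f}")
# Output: 0.19, 0.33 and 0.35, matching the exhibit up to rounding.
```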


4.3 Results for Equation 2: different cost behavior explained by changes in activity levels

The second equation presented in this study is rooted in the idea that costs' sticky behavior is different at various levels of change in the absolute values of revenues. Exhibit 8 reports the coefficients β1 to β8; the first four capture the change in costs when revenues increase 1% within each interval. These coefficients show a positive relation, with a trend that rises in the second and third intervals considered but then goes down in the fourth interval, when changes in revenues are more than 20%. The coefficients that turned out to be significantly different from zero are β3 (0.54) and β4 (0.43), implying that a 1% increase in revenues results in an increase in total expenses of only 0.54% when revenues change between 10% and 20%. These results are not aligned with previous studies, where coefficients tend to increase with each interval; but given the particularities of the Argentinean economy, changes of more than 20% are an indication of inflation, and companies have been exposed to inflationary processes before, therefore it is plausible that managers overreact in an inflationary context. Argentinean companies show a sticky behavior in their costs when changes in the absolute values of revenues are larger than 5%, but stickiness jumps for changes in revenues above 20%. In this context, an increase of 1% in revenues involves an increase of 0.43% in costs, whereas a decrease of 1% in revenues represents a decrease of 0.15% in costs. The degree of cost stickiness (0.35) is the same as that calculated with equation (1) for total expenses (selling and administrative).

Exhibit 8: Empirical results of applying the model of equation (2) to the Argentinean data (total expenses)

Coefficient (interval of absolute revenue change) | Estimate (t-value)
β0 | 0.047 (2.88)
β1 (up to 5%) | 0.44 (0.57)
β2 (between 5% and 10%) | 0.59* (1.86)
β3 (between 10% and 20%) | 0.54** (3.42)
β4 (more than 20%) | 0.43** (9.51)
β5 (decrease interaction, up to 5%) | 0.21 (0.15)
β6 (decrease interaction, between 5% and 10%) | -0.18 (-0.33)
β7 (decrease interaction, between 10% and 20%) | -0.44 (-1.55)
β8 (decrease interaction, more than 20%) | -0.28** (-3.25)
R2 adjusted | 0.2063

** significance level 5%; * significance level 10%. t-values are reported in parentheses.


4.4 Results for Equation 1: costs sticky behavior affected by a macroeconomic measure

In previous stages of the analysis a model using equation (1) was calculated year by year, detecting that costs' sticky behavior changed and in some years was not present. For a better interpretation of results, the model of equation (1) was applied to periods of three years. The definitive periods were the trienniums 2004-2006, 2007-2009 and 2010-2012. In the trienniums 2004-2006 and 2010-2012, as Exhibit 9 shows, costs in general present a sticky behavior, but this does not happen in the triennium 2007-2009, when the β2 coefficients are positive except for selling expenses, which present a sticky behavior but to a lesser extent than in the other two trienniums. To interpret these differences the model incorporated macroeconomic variables, such as the gross domestic product. The results in Exhibit 9 suggest that costs turn out to be stickier in periods of growth, which supports the argument of hypothesis H2b of this study. This is coherent with the literature: in the short term managers will try to maintain resources when market expectations are positive, avoiding the cost of reducing them only to later contract or hire those resources again when activity levels recover. Worth noticing is the result of the triennium 2004-2006, where administrative expenses (which increase 0.56% and diminish 0.05%) turn out to be stickier than selling expenses (which increase 0.76% and diminish 0.16%). Also, this relation inverts in the triennium 2010-2012, where administrative expenses (increase 0.30% and decrease 0.10%) have a lower degree of stickiness than selling expenses (increase 1.02% and decrease 0.12%).

Exhibit 9: Empirical results of equation (1) applied to the Argentinean data organized in trienniums

Model coefficients | Selling expenses | Administrative expenses | Total expenses

Period 2004-2006 (annual rate of change of GDP: mean 8.89%, median 9.03%)
β0 | 0.04 (1.26) | -0.02 (-1.25) | 0.10 (0.66)
β1 | 0.76** (5.52) | 0.56** (8.13) | 0.62** (9.91)
β2 | -0.60* (-1.91) | -0.61** (-3.57) | -0.56** (-3.53)
Degree of cost stickiness | 0.21 | -0.09 | 0.10
R2 adjusted | 0.1467 | 0.2501 | 0.3363

Period 2007-2009 (annual rate of change of GDP: mean 5.42%, median 6.76%)
β0 | 0.036 (1.45) | 0.09** (5.72) | 0.09** (6.32)
β1 | 0.95** (7.88) | 0.07 (1.00) | 0.23** (3.87)
β2 | -0.63** (-2.32) | 0.27 (1.57) | 0.11 (0.80)
Degree of cost stickiness | 0.3 | |
R2 adjusted | 0.1467 | 0.0329 | 0.1223

Period 2010-2012 (annual rate of change of GDP: mean 6.64%, median 8.87%)
β0 | -0.011 (-0.44) | 0.07** (4.65) | 0.049** (3.49)
β1 | 1.02** (8.04) | 0.30* (3.84) | 0.42** (6.08)
β2 | -0.90** (-4.70) | -0.20** (-1.67) | -0.29** (-2.63)
Degree of cost stickiness | 0.12 | 0.33 | 0.31
R2 adjusted | 0.2813 | 0.0855 | 0.195

** significance level 5%; * significance level 10%. t-values are reported in parentheses.
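A minimal sketch of the three-year split behind Exhibit 9 is shown below, reusing estimate_model_1 from the earlier sketch; the GDP growth figures shown in the exhibit come from external macroeconomic data and are not computed here.

```python
# Sketch of re-estimating model (1) within each triennium.
TRIENNIUMS = {
    "2004-2006": [2004, 2005, 2006],
    "2007-2009": [2007, 2008, 2009],
    "2010-2012": [2010, 2011, 2012],
}

def estimate_by_triennium(df, cost_change="dln_total_exp"):
    out = {}
    for label, years in TRIENNIUMS.items():
        res, sticky, degree = estimate_model_1(df[df["year"].isin(years)], cost_change)
        out[label] = {"coefficients": res.params, "sticky": sticky, "degree": degree}
    return out
```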

4.5 Costs sticky behavior affected by the industry or economic sector

Anderson et al. (2003) suggest that asset structure and the number of employees affect the degree of cost stickiness, because firms with more fixed assets and employees find it costlier to decrease the use of resources when revenues decrease marginally. Higher levels of assets and employees imply that the firm relies more on its own resources and less on third-party resources that are easily adjustable. However, asset size and employees tend to maintain similar proportions in firms operating in the same industry; therefore it is said that firms belonging to the same economic sector present, on average, a similar structure, allowing us to empirically test whether minor changes in resource structures affect the degree of cost stickiness.
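Under that assumption, the by-industry analysis reported below can be sketched as re-estimating model (1) within each group of Exhibit 3; the 'industry' column name is an assumption for illustration.

```python
# Sketch of the by-industry analysis, reusing estimate_model_1 from the earlier sketch.
def estimate_by_industry(df, cost_change="dln_total_exp"):
    results = {}
    for industry, group in df.groupby("industry"):
        res, sticky, degree = estimate_model_1(group, cost_change)
        results[industry] = {"coefficients": res.params, "sticky": sticky, "degree": degree}
    return results
```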

Classifying the firms in the sample into the five industries identified in Exhibit 3, the 684 observations used in this study are arranged to provide some descriptive statistics in Exhibit 10, Panels A and B. Exhibit 11 presents the descriptive statistics of the Argentinean firms considered in this study, organized by industry.

Exhibit 11 shows that although the variability is high, it is similar for net sales and expenses (coefficient of variation: net sales 2.72, selling expenses 2.51 and administrative expenses 2.18). Analyzing these variables by industry, the energy sector is the one presenting the largest average net revenues and expenses, but it is also the one with the largest variability (coefficient of variation: net revenues 2.44, selling expenses 2.76 and administrative expenses 2.15). The most homogeneous values are observed in the agricultural industry, but it is worth mentioning its high level of administrative costs. In total, 29.91% of observations presented a decrease in net revenues, with an average decrease of 11.4%. The energy industry is the one presenting the largest share of revenue decreases, with 43% of its observations, but it is the agricultural industry that shows the largest average net revenue decrease, with 19.2%. These numbers show the impact of industry-specific government regulations, because the reduction in revenues occurred in a period when international prices of commodities increased significantly.


The reduction of revenues in the agricultural sector can be explained by a sum of factors: tax withholdings (35% for soya beans and 25% for corn in 2008) and the gap between the official exchange rate and inflation (revenues are international sales while the bulk of costs, such as labour, is internal). In this period, changes in regulations were introduced that significantly increased the administrative expenses related to the compliance requirements to commercialize agricultural products.

Exhibit 10: Classification of Observations

Panel A: Distribution by Industry
Industry | Observations | Percentage
AGRO | 25 | 3.7
COM | 155 | 22.7
ENER | 165 | 24.1
MOA | 143 | 20.9
MOI | 196 | 28.7
Total | 684 | 100.0

Panel B: Distribution by year
Year | Observations | Percentage
2004 | 70 | 10.2
2005 | 73 | 10.7
2006 | 75 | 11.0
2007 | 80 | 11.7
2008 | 89 | 13.0
2009 | 91 | 13.3
2010 | 91 | 13.3
2011 | 87 | 12.7
2012 | 28 | 4.1
Total | 684 | 100

Exhibit 11: Descriptive Statistics, firms arranged by Industry

Variable | Mean | Standard deviation | Variation coefficient | 1st Quartile | Median | 3rd Quartile
Net Revenues (in millions of pesos) | 2,110 | 5,740 | 2.72 | 199 | 573 | 1,680
AGRO | 341 | 284 | 0.83 | 146 | 208 | 550
ENER | 4,390 | 10,700 | 2.44 | 469 | 898 | 2,270
MOA | 1,620 | 2,220 | 1.37 | 166 | 590 | 2,660
MOI | 1,280 | 2,320 | 1.81 | 130 | 401 | 1,580
COM | 1,470 | 2,310 | 1.57 | 233 | 491 | 1,150
Selling Expenses (in millions of pesos) | 194 | 487 | 2.51 | 7 | 31 | 135
AGRO | 27 | 27 | 0.99 | 22 | 35 | 35
ENER | 284 | 785 | 2.76 | 0 | 35 | 137
MOA | 213 | 274 | 1.29 | 16 | 85 | 85
MOI | 62 | 102 | 1.63 | 26 | 26 | 70
COM | 274 | 521 | 1.90 | 7 | 24 | 178
Administrative Expenses (in millions of pesos) | 86 | 187 | 2.18 | 11.6 | 27.7 | 73.5
AGRO | 23 | 16 | 0.72 | 12 | 19 | 36
ENER | 142 | 305 | 2.15 | 18 | 49 | 90
MOA | 64 | 80 | 1.24 | 11 | 27 | 78
MOI | 55 | 104 | 1.88 | 9 | 21 | 56
COM | 94 | 177 | 1.89 | 12 | 32 | 71
Selling expenses / Net Revenues (in percentage) | 9.20% | 8.50% | 0.92 | 3.40% | 5.50% | 8.00%
AGRO | 6.37% | 4.80% | 0.75 | 1.39% | 6.68% | 9.70%
ENER | 5.90% | 5.26% | 0.89 | 0.34% | 5.80% | 10.04%
MOA | 13.69% | 8.55% | 0.62 | 6.99% | 11.71% | 18.16%
MOI | 7.48% | 5.20% | 0.70 | 3.67% | 5.89% | 11.13%
COM | 13.28% | 11.65% | 0.88 | 4.94% | 10.33% | 22.92%
Administrative Expenses / Net Revenues (in percentage) | 4.10% | 3.30% | 0.80 | 5.80% | 4.80% | 4.40%
AGRO | 12.71% | 14.06% | 1.11 | 5.53% | 8.08% | 9.03%
ENER | 7.48% | 9.58% | 1.28 | 3.08% | 5.99% | 9.14%
MOA | 6.93% | 5.81% | 0.84 | 3.00% | 5.62% | 8.97%
MOI | 5.75% | 3.87% | 0.67 | 3.59% | 5.25% | 7.12%
COM | 9.19% | 10.28% | 1.12 | 4.67% | 6.71% | 11.20%
Periods with a decrease in Net Revenues (% of decrease) | Mean | Standard deviation | Observations (share) | 1st Quartile | Median | 3rd Quartile
All firms | 11.40% | 88.60% | 198 (29%) | 3.28% | 8.71% | 15.27%
AGRO | 19.17% | 83.29% | 7 (28%) | 5.76% | 15.32% | 36.22%
ENER | 11.22% | 87.96% | 71 (43%) | 3.96% | 8.67% | 13.07%
MOA | 8.39% | 93.55% | 34 (24%) | 3.17% | 8.30% | 12.54%
MOI | 12.45% | 88.04% | 50 (26%) | 4.47% | 9.60% | 15.79%
COM | 11.39% | 88.57% | 36 (23%) | 2.19% | 8.38% | 19.42%

The second hypothesis suggested that administrative and selling expenses present a sticky behavior when compared with changes in sales, but that this behavior would differ between industries. In the total sample costs are found to be sticky, but not all industries show the same behavior. This hypothesis is supported by the empirical results reported in Exhibit 12, because the values of the β2 coefficients are negative. When considering all firms, selling expenses increase 0.89% when net revenues increase 1%, but they only diminish 0.17% (0.89 - 0.72) when sales decrease 1% (Exhibit 12, Panel A). The same effect is observed when total expenses are considered: they increase 0.43% but decrease 0.15% (0.43 - 0.28) with 1% changes in net revenues (Exhibit 12, Panel C). These results reflect that in the period of one year costs cannot be adjusted proportionally to the variation in sales, producing in Argentinean firms a reduction in short-term profitability.

The analysis allows a better explanation of this behavior in firms with different cost structures. First, it shall be noted that both types of costs present a sticky behavior in all industries with the same intensity, but the cost stickiness is larger for selling expenses. The agricultural industry is the only one that did not show a clear pattern of cost stickiness, regardless of the fact that it is the one where administrative expenses as a percentage of net revenues are the largest. Firms in the energy industry present a sticky behavior only for selling expenses. Firms in the manufacturing of agricultural origin (MOA) have the highest proportion of selling expenses in relation to sales, and although total costs have a sticky behavior, it is surprising to see how sticky administrative expenses are, because when net revenues decrease, administrative expenses increase. This result can be partially explained by the increase in government regulation and reporting required for selling and buying agricultural products.


The other manufacturing industry (MOI) reflects cost stickiness only for administrative expenses. The last industry considered, Construction, Trade and Services (COM), also shows strong and robust results of cost stickiness.

Exhibit 12: Results of Model 1 by industry

Panel A: Selling Expenses
Industry | Total | AGRO | COM | ENER | MOA | MOI
N | 582 | 18 | 130 | 118 | 137 | 179
β0 | 0.02 | -0.30 | 0.03 | 0.02 | 0.03 | 0.07***
β1 | 0.90*** | 2.01 | 1.02*** | 0.75*** | 0.82*** | 0.49***
β2 | -0.72*** | -2.46 | -0.78* | -0.66** | -0.61* | -0.05
Degree of cost stickiness | 0.20 | | 0.24 | 0.12 | 0.25 |
R2 adjusted | 0.23 | 0.32 | 0.31 | 0.09 | 0.25 | 0.17

Panel B: Administrative Expenses
Industry | Total | AGRO | COM | ENER | MOA | MOI
N | 623 | 22 | 137 | 139 | 141 | 184
β0 | 0.05*** | 0.13** | 0.03 | 0.07*** | 0.04** | 0.03*
β1 | 0.31*** | 0.26* | 0.33*** | -0.01 | 0.49*** | 0.39***
β2 | -0.21** | -0.03 | -0.07 | 0.16 | -0.74*** | -0.38**
Degree of cost stickiness | 0.32 | | | | -0.53 | 0.03
R2 adjusted | 0.09 | 0.20 | 0.11 | 0.01 | 0.14 | 0.10

Panel C: Total Expenses
Industry | Total | AGRO | COM | ENER | MOA | MOI
N | 639 | 22 | 143 | 147 | 141 | 186
β0 | 0.05*** | 0.09 | 0.04*** | 0.06*** | 0.04*** | 0.05***
β1 | 0.43*** | 0.40** | 0.52*** | 0.10 | 0.60*** | 0.39***
β2 | -0.28*** | -0.11 | -0.50*** | 0.06 | -0.53** | -0.13
Degree of cost stickiness | 0.36 | | 0.04 | | 0.11 |
R2 adjusted | 0.21 | 0.38 | 0.41 | 0.03 | 0.28 | 0.13

*** significance level 1%; ** significance level 5%; * significance level 10%.


4.6 Discussion

The results of this section add empirical evidence to the already documented costs' sticky behavior in emerging economies. It is possible to affirm that the costs of Argentinean firms present a sticky behavior, particularly those related with selling expenses. These results are in line with, but also expand, the results of Werbin (2009), Porporato and Werbin (2012) and Werbin and Porporato (2012), where costs in Argentine banks are sticky. The results also show that this sticky behavior differs according to the level of change in revenues: when the change in revenues exceeds 20%, managers begin to change the cost structure by adjusting the resources used by the firm. These results support Hypothesis 1.

This study is the first analysis of costs' sticky behavior across multiple industries in Argentina, and it complements previous studies focused on financial institutions. The results are consistent with prior literature and Hypothesis 2a is supported because costs are not sticky across all industries. Results clearly reveal that one type of firm, agricultural, does have a cost structure and behavior compatible with the traditional textbook formula of fixed and variable costs, with variable costs changing in a proportional manner when activity changes, no matter what the direction of that change is. There are two other industries with a partially sticky behavior in their costs: the energy sector shows that total and administrative costs are not sticky, while MOI shows only that selling costs are not sticky.

Hypothesis 2b is also supported by the data because the only macroeconomic variable included helps to explain cost behavior. Possible explanations were attempted with macroeconomic variables, finding that the degree of cost stickiness is different in each of the three trienniums considered, which reflect the expectations and performance of the national economy. In particular, when the gross domestic product went down, costs did not have a sticky behavior, implying that managers were not reluctant to reduce resources when the economy was decelerating.

5 Conclusion

Understanding cost behavior is fundamental to managing any company. The evidence in general shows that in the short and medium term (not so much in the long term) costs are sticky, as they increase more for an increase in activity levels than they decrease when activity levels decrease. There are several reasons to explain this behavior; one of the more compelling is managers' tendency to delay cutting costs (reducing the resources used) when activity levels decrease, because the cost of having idle capacity is lower than the cost of exit and replacement of the resources being disposed of. The idea of sticky costs challenges the traditional model of cost behavior, where a variable cost changes proportionally with changes in the cost driver, whatever its direction.

The results obtained in this study, which replicate the work of Anderson et al. (2003) and others, show that sticky costs are also observed in publicly traded firms of Argentina for the years 2004-2012.


The relation between an increase of total income and an increase of costs is positive, as the theory predicts, but the magnitude of the increase associated with an increase in volume is larger than the magnitude of the fall associated with a decrease in volume. A second set of results suggests that both the cost structure and the macroeconomic climate are good explanatory factors of cost behavior. The cost structure of firms affects the level of cost responses to increases and decreases in activity levels. Firms labeled as agricultural are the only ones that do not show a sticky behavior of their costs; this finding is an important contribution of the study.

The results presented here support the idea that costs are sticky in all industries operating in Argentina except one, the agricultural industry. The degree of cost stickiness is different for each industry, which is explained by the flexibility they have in adjusting their cost structures to changes in activity levels. This analysis of sticky costs in Argentina allowed finding empirical evidence of their existence and of how they change given different macroeconomic circumstances. However, more detailed explanations are needed. Although the results are promising, more research is needed to clearly identify the causes of the observed sticky behavior by incorporating new hypotheses and variables, which would allow this phenomenon to be better explained and quantified and to be related to all economic sectors of an emerging economy.

The results of this and similar studies reinforce the idea of sticky costs. It is an important topic for accountants and other professionals who evaluate changes in costs in relation to changes in activity volume or revenues, since it implies moving from a traditional model of proportional variable costs to another that considers the influence of managers' decisions on the short-term behavior of costs. The empirical test of the existence and behavior of sticky costs in firms operating in emerging economies is an area that deserves special attention in the research agendas of universities and regulatory authorities. Understanding the behavior of these costs and managing them suitably can generate very positive results.

This study recognizes one significant limitation, related mostly to the sample size. Basing the study on publicly available data extracted from annual reports, we cannot pinpoint the particular factors that might explain why cost behavior is sticky in all sectors but the agricultural one. Future research has to be designed in such a way as to allow for the proper segregation of factors, whether cost structure, asset intensity, strategic intent or economic conditions. An important sticky costs factor is the full understanding of active cost management and managerial behavior, but it is necessary to complement panel data regressions with structured interviews and field-based research to collect information not available in financial reports.

6 BIBLIOGRAPHY

Anderson, M., O. Asdemir and A. Tripathy (2012). "Use of precedent and antecedent information in strategic cost management". Journal of Business Research. Electronic copy of this paper was available on April 2013 at http://dx.doi.org/10.1016/j.jbusres.2012.08.021


Anderson, M., R. Banker and S. Janakiraman (2003). "Are Selling, General and Administrative Costs 'Sticky'?". Journal of Accounting Research, Vol. 41, Nº 1, pp. 47-63.

Anderson, S. and W. Lanen (2007). "Understanding Cost Management: What Can We Learn from the Evidence on 'Sticky Costs'?". Electronic copy of this paper was available on April 2010 at: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=975135.

Balakrishnan, R. and N. Soderstrom (2008). "Cross-Sectional Variation in Cost Stickiness". Electronic copy of this paper was available on April 2010 at: http://center.uvt.nl/sem/balakrishnan.pdf

Balakrishnan, R. and T. Gruca (2008). "Cost Stickiness and Core Competence: A Note". Contemporary Accounting Research, Vol. 25, Nº 4, pp. 993-1006.

Balakrishnan, R., E. Labro and N. Soderstrom (2010). "Cost Structure and Sticky Costs". Electronic copy of this paper was available on January 2011 at: http://ssrn.com/abstract=1562726.

Balakrishnan, R., M. Petersen and N. Soderstrom (2004). "Does Capacity Utilization Affect the 'Stickiness' of Cost?". Journal of Accounting, Auditing & Finance, Vol. 19, Nº 3, pp. 283-299.

Banker, R. and L. Chen (2006a). "Predicting Earnings Using a Model Based on Cost Variability and Cost Stickiness". The Accounting Review, Vol. 81, Nº 2, pp. 285-307.

Banker, R. and L. Chen (2006b). "Labor market characteristics and cross-country differences in cost stickiness". Electronic copy of this paper was available on April 2010 at: http://ssrn.com/abstract=921419

Banker, R., D. Byzalov and J. Plehn-Dujowich (2010). "Sticky Cost Behavior: Theory and Evidence". AAA 2011 Management Accounting Section (MAS) Meeting Paper. Electronic copy of this paper was available on April 2012 at: http://ssrn.com/abstract=1659493

Banker, R., M. Ciftci and R. Mashruwala (2008). "Managerial Optimism, Prior Period Sales Changes and Sticky Cost Behavior". Electronic copy of this paper was available on October 2009 at: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=902546.

Calleja, K., M. Steliaros and D. Thomas (2006). "A note on cost stickiness: Some international comparisons". Management Accounting Research, Vol. 17, Nº 2, pp. 127-140.

Chen, C., H. Lu and T. Sougiannis (2008). "Managerial Empire Building, Corporate Governance, and the Asymmetrical Behavior of Selling, General, and Administrative Costs". AAA 2008 Financial Accounting and Reporting Section (FARS) Paper. Electronic copy of this paper was available on October 2009 at SSRN: http://ssrn.com/abstract=1014088

Cooper, R. and R. Kaplan (1998). The Design of Cost Management Systems: Text, Cases and Readings (2nd Edition). Prentice Hall, Upper Saddle River, N.J.


Dierynck, B., W. Landsman and A. Renders (2009). "Do Managerial Incentives Drive Cost Behavior? Evidence about the Role of the Zero Earnings Benchmark for Labor Cost Behavior in Belgian Private Firms". Electronic copy of this paper was available on April 2012 at SSRN: http://ssrn.com/abstract=1458305 or http://dx.doi.org/10.2139/ssrn.1458305.

Garrison, R., E. Noreen, P. Brewer, G. Chesley, R. Carrol, A. Webb and T. Libby (2012). Managerial Accounting, Ninth Canadian Edition. McGraw-Hill Ryerson.

Horngren, C., S. Datar and M. Rajan (2012). Cost Accounting: A Managerial Emphasis, Fourteenth Edition. Prentice Hall.

Kama, I. and D. Weiss (2010). "Do Managers' Deliberate Decisions Induce Sticky Costs?". Electronic copy of this paper was available on April 2012 at SSRN: http://ssrn.com/abstract=1558953 or http://dx.doi.org/10.2139/ssrn.1558953

Mak, Y. and M. Rousch (1994). "Flexible Budgeting and Variance Analysis in an Activity-Based Cost Environment". Accounting Horizons, Vol. 8, Nº 2, pp. 93-103.

Malcom, R. (1991). "Overhead Control Implications of Activity Costing". Accounting Horizons, December, pp. 69-78.

Noreen, E. (1994). "Conditions Under Which Activity-Based Cost Systems Provide Relevant Costs". Journal of Management Accounting Research, Vol. 3, pp. 159-168.

Noreen, E. and N. Soderstrom (1997). "The Accuracy of Proportional Cost Models: Evidence from Hospital Service Departments". Review of Accounting Studies, Vol. 2, pp. 89-114.

Pereiro, L. (2006). "The Practice of Investment Valuation in Emerging Markets: Evidence from Argentina". Journal of Multinational Financial Management, Nº 16, pp. 160-183.

Pervan, M. and I. Pervan (2012). "Analysis of sticky costs: Croatian Evidence". Recent Researches in Business and Economics. ISBN: 978-1-61804-102-9.

Poorzamani, Zahra and Bakhtiary, Mohammadreza (2013). "Reviewing the impact of macroeconomic factors on operating cost stickiness in Tehran stock exchange". Electronic copy of this paper was available on January 2014 at: http://tjeas.com/wp-content/uploads/2013/05/842-850.pdf

Porporato, M. and E. Werbin (2012). "Evidence of Sticky Costs in Banks of Argentina, Brazil and Canada". International Journal of Financial Services Management, Vol. 5, Nº 4, pp. 303-320.

Ribeiro de Medeiros, O. and P. De Souza Costa (2004). "Cost Stickiness in Brazilian Firms". Electronic copy of this paper was available on April 2010 at: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=632365 (October 2004).

Steliaros, M., D. Thomas and K. Calleja (2006). "A Note on Cost Stickiness: Some International Comparisons". Management Accounting Research, Vol. 17, pp. 127-140.

CONFIDENTIA

L

68

Page 71: COSTAC 2015 - UPMauthors use the concept of Technical Debt (Samarthyam and Sharma 2014) and (Chris et al. 2012). Technical Debt commonly focuses on how bad the code can be, and also

Weidenmier, M. and C. Subramaniam (2003). “Additional Evidence on the Sticky Behav-ior of Costs”. TCU Working Paper. Available on April 2010 at SSRN: http://ssrn.com/abstract=369941 or http://dx.doi.org/10.2139/ssrn.369941 Werbin E. and M. Porporato (2012). “Active Cost Management (Sticky costs) in Argentinean Banks”. International Journal of Business and Economic Research, Vol 4, Nº6, pp 679-703. Werbin, E. (2009) “Los costos pegadizos (sticky costs): Una prueba empírica en bancos argentinos”. Electronic copy of this paper was available on April 2012 at: http://www.observatorio-iberoamericano.org/RICG/n%C2%BA%2014/Eliana_Mariela_Werbin.pdf Werbin, E. L. Marín Vinuesa and M. Porporato (2011). “Un estudio empírico de costos pegadizos (stcky costs) en empresas españolas”. Contaduría y Administración. Electronic copy of this paper was available on April 2013 at http://www.contaduriayadministracionunam.com.mx/userFiles/app/pp_18012011.pdf

CONFIDENTIA

L

69

Page 72: COSTAC 2015 - UPMauthors use the concept of Technical Debt (Samarthyam and Sharma 2014) and (Chris et al. 2012). Technical Debt commonly focuses on how bad the code can be, and also

 

CONFIDENTIA

L

70

Page 73: COSTAC 2015 - UPMauthors use the concept of Technical Debt (Samarthyam and Sharma 2014) and (Chris et al. 2012). Technical Debt commonly focuses on how bad the code can be, and also

How the LDA Method Works for Collaborative Filtering

Priscila Valdiviezo, Guido Riofrío

Computer Science and Electronics Department, Universidad Técnica Particular de Loja {pmvaldiviezo, geriofrio}@utpl.edu.ec

Abstract. Recommender systems serve as important tools for narrowing a large set of options down to a reduced number of items according to the preferences or interests of users. In this paper, we present an application of the LDA method for making recommendations based on collaborative filtering, that is, by considering data comprised of user-item interactions expressed in the form of ratings. LDA has commonly been used in content-based filtering to facilitate topic modeling. In this study, however, we use the method to help users obtain items through collaborative filtering, taking into consideration latent user information. Furthermore, we apply the basic matrix factorization method in order to compare its results with those obtained with LDA. The model was evaluated using the MovieLens dataset, with the Precision and Recall metrics validating the quality of the recommendations.

Keywords. Recommender systems, Content-based filtering, Collaborative filtering, Latent Dirichlet Allocation

1 Introduction

The sheer amount of information and services available on the Internet is leading users to want information that is ever more relevant and precise. In this sense, recommender systems are becoming an alternative means of helping users obtain personalized information from the large repositories available on the Internet.

Recommender systems recommend information to users by analyzing their past preferences, or based on the preferences of similar users. For this process, several information-filtering techniques may be used; the best known are Content-Based Filtering (CBF) and Collaborative Filtering (CF) [1]. Content-based filtering systems can be designed to recommend items similar to those that a given user liked in the past [2]. These items could be documents, books, news, songs, movies, and websites, among others. In this type of filtering the content


of items is important for predicting their relevance with respect to the user profile, which includes the preferences, tastes and needs of the users. Only items that have a high level of similarity with the user profile are recommended to the user [3]. On the other hand, collaborative filtering (CF) systems base their predictions and recommendations on the ratings or behaviors of other users in the system [4]. There are two CF approaches: user-based and item-based. The user-based approach analyzes a group of users that share similar interests or experiences and recommends the items that the group generally prefers, while item-based CF recommends items that are most similar to the list of items that an active user has rated in the past [3].

In these recommender systems, the behavior of users is often influenced by their hidden interests [5], information that is very valuable for providing better recommendations.

Consequently, in this paper we propose a collaborative filtering recommender system that uses Latent Dirichlet Allocation (LDA) to infer users' preferences in a latent space based on their historical (past) ratings. We also apply the basic matrix factorization method in order to compare its results with those obtained with LDA. For the experimentation and evaluation of the model, we used the MovieLens dataset and applied classification metrics such as Precision and Recall.

2 Related Work

Some studies carried out on the application of LDA relate to: text segmentation [6], modeling of the evolution of user interests (based on personalized ranking) [5], document analysis [7], and the classification and grouping of documents [8], [9], [10].

Recently, this method has had considerable success in recommender systems; for example, [11] proposes a content-unaware probabilistic recommendation method inspired by LDA, which uses the behavior of users to provide recommendations. Instead of modeling with labels or contexts, this method builds a recommendation model from the behavior collected from user-item interactions, without using the content of items. Other studies, presented by [12], [13], [14], utilize LDA for the recommendation of tags with the aim of finding the latent relationships


between the keywords in the descriptions of items and the tags that users create for items. In this way, items may be recommended based on tags. In [5], the authors propose a new three-layer recommendation schema, namely user-interest-item, based on collaborative filtering. The main objective is to aid the understanding of the interactions between users, items, and user interests. In addition, these authors consider the expansion of user interest by means of personalized ranking. They likewise utilize LDA to model the latent interests of users, and show how information about those interests can be extracted.

Another study of great interest relating LDA to collaborative filtering is presented by [3], who use a hybrid approach to making recommendations: LDA is employed to discover the latent (hidden) semantic structure in the documents that users have read, which includes the distribution of words over latent topics and the distributions of latent topics over documents. The result is then incorporated into the item-based CF similarity calculation to facilitate the CF prediction process. In this way, a hybrid system combining content-based and collaborative filtering is obtained, with the objective of minimizing the drawbacks of each approach.

Similarly, the authors in [15] propose a new approach to improve standard CF-based recommenders by utilizing Latent Dirichlet Allocation (LDA) to learn the latent properties of items, expressed as topic proportions derived from their textual descriptions. The users' topic preferences are inferred in the same latent space from their historical ratings. In this study, each item to be recommended is represented as a document containing a textual description of that item.

In the next section, we analyze this method and its application to collaborative filtering recommender systems. According to [5], it is easier to extract details about the interests of users by tracking textual information such as keywords, tags or labels in recommender systems that are content-based or follow a hybrid approach. In collaborative filtering systems, however, the latent interests of users are difficult to identify, given that the only information available concerns the users' interactions with the system.


3 Probabilistic Methods for Recommender Systems

According to [4], the main idea behind these methods is to calculate the probability that user u selects item i, represented by P(i | u), or the probability distribution P(r_{u,i} | u) over the rating of user u for item i, as well as the related problem of the expected value of the rating of u for i, represented by E[r_{u,i}].

Ekstrand, Riedl, and Konstan [4] highlight that probabilistic models are applicable when the recommender process follows behavior models of users.

Among the probabilistic models that incorporate these types of methods are those related to topic modeling [16], which are based on the idea that documents consist of a mixture of topics, where a topic is a probability distribution over words [5]. The most widely used of these models are Probabilistic Latent Semantic Analysis (PLSA) [16] and Latent Dirichlet Allocation (LDA) [17], which can be used as dimensionality-reduction approaches for exploring the characteristics of text content [3] via the discovery of the latent topics of each document.

LDA places a Dirichlet prior distribution on users' preferences over topics [17], thus overcoming some of PLSA's problems, among them the linear growth in the number of estimated parameters, which tends to lead to overfitting [3]. According to these authors, LDA treats the topic mixture weights as a hidden random variable of dimension K, instead of using a vast set of parameters tied directly to the training set.

Against this background, this study presents a detailed analysis of the LDA method so that it can later be applied to recommendations based on collaborative filtering.

3.1 The LDA Method

Latent Dirichlet Allocation (LDA) is an unsupervised probabilistic generative model for modeling vast corpora of text, which randomly generates the documents observed in the corpus. LDA models each document as a random mixture over latent (hidden) topics, where each topic is characterized as a mixture or distribution over words [17]. The literature has demonstrated that LDA is capable of


capturing the latent semantic information of a collection of documents to a greater extent than other models [3].

Following [4], in this model users are represented by their latent preference factors (a distribution P(k | u)), which are instances of random variables drawn from a Dirichlet distribution, where k is a latent variable belonging to a set of K latent topics. This model requires two hyper-parameters that can be learnt, α and β: α is the parameter vector of the Dirichlet distribution from which the user mixtures are drawn, and β is a matrix of topic-item probabilities, represented by P(i | k). The literature often uses the default values β = 0.01 and α = 50/K, where K is the number of latent topics [3], [7], [18]; in some collaborative filtering settings, it refers to the number of latent interests [5].
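As an illustration only, the following sketch shows how these default priors might be set when instantiating an LDA model; the choice of scikit-learn and the variable names are assumptions for the example and do not reflect the implementation used in this work.

```python
# Hedged sketch: an LDA model with the default priors discussed above
# (alpha = 50/K, beta = 0.01). Library and names are illustrative only.
# (scikit-learn uses variational inference rather than Gibbs sampling.)
from sklearn.decomposition import LatentDirichletAllocation

K = 20                 # number of latent topics (latent interests)
alpha = 50.0 / K       # symmetric Dirichlet prior over per-user topic mixtures
beta = 0.01            # symmetric Dirichlet prior over per-topic item distributions

lda = LatentDirichletAllocation(n_components=K,
                                doc_topic_prior=alpha,
                                topic_word_prior=beta,
                                random_state=0)
```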

On the other hand, several measures may be used to evaluate the predictive power of these models, among them perplexity (the measure most widely used in language modeling) [17], where a low perplexity value indicates better generalization performance. To measure the quality of recommendations, information retrieval metrics such as Precision and Recall are often used [1]. These help determine how well the recommended items match those that are actually relevant to the user.

The underlying idea is that LDA may be used to learn the characteristics of items from the given rating records alone, without including any additional content or prior knowledge about the items [5]. In the following sections, we discuss how to apply this method to collaborative filtering by focusing only on user-item ratings.

4 Matrix factorization methods

Latent factor models approach collaborative filtering by trying to discover latent features that explain the observed ratings; examples include pLSA, Latent Dirichlet Allocation (LDA), and models induced by factorization of the user-item ratings matrix [1].

Matrix factorization models characterize both users and items in a latent factor space of dimensionality f, inferred from the item ratings. High correspondence between user and item factors leads to a recommendation [19].


The authors in [20] indicate that matrix factorization methods consist in decomposing a matrix into two or more matrices, which can be used to extract latent factors from the data.

In the basic matrix factorization model, the goal is to find, for each user u, a vector p_u ∈ ℜ^k measuring the extent of interest the user has in items that are high on the corresponding factors, and, for each item i, a vector q_i ∈ ℜ^k measuring the extent to which the item possesses those factors [19]. User-item interactions are then modeled as the dot product of the corresponding vectors:

r̂(u,i) = q_i^T p_u    (1)

which denotes the estimated rating of user u on item i [19].
Matrix factorization methods can be combined with methods

such as LDA for specific recommendations, as mentioned in [21].

5 LDA-based collaborative filtering

LDA has mostly been used in content-based recommender systems for topic modeling, where documents are assumed to be generated by the following process [19] (a small sampling sketch is given after the list):

- Generate the topics: select a distribution over words for each of the K topics.
- Select a specific number of words for the document.
- Randomly select a distribution over topics for the document.
- For each of the N words in the document:
  o Randomly select a topic from a multinomial distribution, to indicate from which topic the word will be sampled.
  o Randomly select a word by sampling from the multinomial probability conditioned on that topic.
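To make the generative process concrete, the following is a small illustrative sketch (not taken from the paper) that samples one synthetic document under the process above; the sizes K, V and N and all names are hypothetical.

```python
# Illustrative sketch of the LDA generative process described above.
import numpy as np

rng = np.random.default_rng(0)
K, V, N = 3, 10, 8                               # topics, vocabulary size, words per document

phi = rng.dirichlet(np.full(V, 0.01), size=K)    # a word distribution for each topic
theta = rng.dirichlet(np.full(K, 50.0 / K))      # topic mixture for this document

document = []
for _ in range(N):
    k = rng.choice(K, p=theta)                   # pick a topic from the document's mixture
    w = rng.choice(V, p=phi[k])                  # pick a word from that topic's distribution
    document.append(int(w))
print(document)                                  # a list of N word indices
```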

Conversely, in LDA-based collaborative filtering systems, it is assumed that documents and the words in a document are analogous to users and items [17], and the topics become hidden interests [5]. In our case, the hidden factors (topics) relate to the latent preferences of the user. Likewise, the vocabulary of words is replaced with a vocabulary of items (movies), and the number of occurrences of each word in the document


is replaced by the rating that the user has given each item; that is, the rating acts as the number of times the item occurs in the document.

In light of the study carried out by [3] and the process highlighted earlier, this paper follows the steps below to produce recommendations based on LDA with collaborative filtering techniques (a code sketch of the full pipeline is given after the steps).

1. A vocabulary is calculated from a list of movies, and the number of topics is determined.
2. The LDA model is built from a collection of items (in our case, movies) which the users have seen or liked before. For this, we apply the following generative process:

   a. Obtain the latent distributions of topics over users: θ ∼ Dirichlet(α).
   b. Obtain the latent distributions of items over topics: β ∼ Dirichlet(δ).
   c. For each of the N items:
      i. Randomly select a topic k_i ∼ Multinomial(θ).
      ii. Select an item from the multinomial probability distribution conditioned on topic k_i, given by P(i_i | k_i, β).

   In this case, β is the distribution of items over topics and gives the probability of occurrence of an item for a given topic. The latent topics are assumed to be multinomially distributed over users (denoted by θ), and the items are assumed to be multinomially distributed over the latent topics (denoted by β, whose Dirichlet prior is δ).

3. The model is trained using the set of observed user ratings.

4. For each test user, the movies to recommend are predicted; some of these have not yet been seen by the user. The likelihood of a movie, given the set of observed movies, is obtained from the posterior distribution over topics.
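A minimal sketch of steps 1-4 is shown below, assuming a MovieLens-style ratings table with columns userId, movieId and rating. The library (scikit-learn), the file name and all helper names are illustrative assumptions, not the implementation used for the experiments reported here.

```python
# Hedged sketch of the LDA-based collaborative filtering pipeline (steps 1-4).
# Users act as documents, movies as words, and each rating as the "count"
# of that movie in the user's document.
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.decomposition import LatentDirichletAllocation

ratings = pd.read_csv("ratings.csv")                   # columns: userId, movieId, rating

user_codes, users = pd.factorize(ratings["userId"])
item_codes, items = pd.factorize(ratings["movieId"])   # step 1: the item vocabulary
counts = csr_matrix((ratings["rating"], (user_codes, item_codes)),
                    shape=(len(users), len(items)))

K = 20
lda = LatentDirichletAllocation(n_components=K,        # step 2: build the LDA model
                                doc_topic_prior=50.0 / K,
                                topic_word_prior=0.01,
                                random_state=0)
lda.fit(counts)                                        # step 3: train on observed ratings

theta = lda.transform(counts)                          # user x topic mixtures
phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)  # topic x item
scores = theta @ phi                                   # step 4: per-user item "importance"

def top_n(u, n=10):
    """Top-n movie ids for user index u, excluding movies the user already rated."""
    seen = set(item_codes[user_codes == u])
    ranked = [i for i in np.argsort(-scores[u]) if i not in seen]
    return items[ranked[:n]]
```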

As we can see from this process, each user is represented by a probability distribution over topics, and each topic is a probability distribution over items. According to [5], users can have multiple characteristics that belong to many interests or latent preferences


(topics). At the same time, with the discovered latent topics, it is possible to derive similar items with greater precision, understand the needs of the users, and make more relevant recommendations [3].

Extracting information about the interests of the user with LDA is a latent-variable inference process. In this process, the values of the latent variables must maximize the posterior distribution given the totality of the user's rating records [5].

To estimate the values of θ and δ, we can apply Variational Expectation-Maximization (VEM) [17] or Gibbs sampling [7]. Most research in this area applies Gibbs sampling, as it is more effective in terms of convergence and has greater tolerance to local optima [3].

6 Experimentation and Evaluation of the model

In our case, the experiment consisted of constructing an LDA-based model for collaborative filtering. The dataset corresponds to MovieLens users' ratings of movies, described in the next section. We then evaluate the model's performance with metrics such as Precision and Recall.

To train the model, we divided the dataset into 75% (749510 records) for training and 25% (250012 records) for testing. The LDA model was then built: we determined an a priori value of k, representing the number of topics in the model, and set the Dirichlet hyper-parameters α and β.

6.1 Dataset

The dataset utilized for the experimentation and evaluation of the model was obtained from the MovieLens1 website and corresponds to users' ratings of movies, gathered during different periods of time depending on the size of the set. In our case, it corresponded to 1 million ratings (1M) from 6000 users on 4000 movies. The MovieLens rating scale is from 1 to 5.

The MovieLens dataset is limited to users who rated at least 20 movies. With this dataset, five experiments were carried out with different values of the hyper-parameters (α and β).

1 MovieLens: http://movielens.org


6.2 Experiment I

In this first experiment, we consider the default values α = 50/K and β = 0.01. In addition, we set k = 20. The item vocabulary comprised 3326 movies, and a value of N-recommendations was fixed. To estimate the latent values of θ and δ, we used Gibbs sampling to determine the posterior probability of the latent variables. A matrix of predictions was then obtained, listing the possible recommended items for each user; part of the results is shown in the following table:

Table 1. Predictions of items for each user

        M2858     M260      M1196     M1210     M2028
u1      0.00388   0.00569   0.00456   0.00459   0.00537
u2      0.00456   0.00576   0.00566   0.00494   0.00871
u3      0.00428   0.00977   0.00944   0.00730   0.00714
u4      0.00725   0.01044   0.01062   0.00719   0.00703
u5      0.00643   0.00076   0.00058   0.00087   0.00260
u6      0.00192   0.00279   0.00202   0.00399   0.00232
u7      0.00020   0.00916   0.00809   0.00767   0.00759
u8      0.00280   0.00084   0.00003   0.00076   0.00261
u9      0.00824   0.00500   0.00504   0.00425   0.00812
u10     0.00165   0.00411   0.00357   0.00313   0.00150
u11     0.00568   0.00175   0.00232   0.00192   0.00421
u12     0.00759   0.00491   0.00422   0.00218   0.00275
u13     0.00081   0.00972   0.00847   0.00746   0.00582
u14     0.01378   0.00458   0.00325   0.00322   0.00360
u15     0.00547   0.00170   0.00166   0.00264   0.00481
u16     0.01857   0.00345   0.00215   0.00422   0.00436
u17     0.00365   0.00756   0.00660   0.00547   0.00368
u18     0.00207   0.00660   0.00529   0.00532   0.00497
u19     0.00526   0.00689   0.00597   0.00534   0.00472
u20     0.00373   0.00982   0.00939   0.00742   0.00730

In Table 1, the columns correspond to movies (3326 in total) and the rows to users (6040 in total). The value in each cell represents the importance of that movie for that user: the higher the value, the more important the item, and therefore the more likely it is a potential recommendation for that user.


The item information is then sorted in descending order; for instance, for user u1, items M260 and M2028 have the highest values: 0.0056 and 0.0053 respectively.

In order to measure the performance of the model, we explain below the process that we followed.

• Precision:
This metric represents the probability of a recommended item being relevant [20]. It is determined by the number of items in the test set that appear in the top-N recommended items (items both liked by the user and recommended), divided by the total number of recommended items. In this case:

P = (Rl and Rc) / [(Rl and Rc) + (NRl and Rc)] = (Rl and Rc) / Rc        (1)

• Recall:
This metric represents the probability of a relevant item being recommended [20]. It is defined as the number of items that are both relevant and recommended, divided by the total number of relevant items in the test set.

R = (Rl and Rc) / [(Rl and Rc) + (Rl and NRc)] = (Rl and Rc) / Rl        (2)

For this, we used the following matrix, which represents the relationship between the items: Relevant (Rl), Not Relevant (NRl), Recommended (Rc), and Not Recommended (NRc):

Table 2: Matrix to compute precision and recall

                        Relevant (Rl)                 Not Relevant (NRl)   Total
Recommended (Rc)        Rl and Rc                     NRl and Rc           Rc = Rl and Rc + NRl and Rc
Not Recommended (NRc)   Rl and NRc                    NRl and NRc
Total                   Rl = Rl and Rc + Rl and NRc


An item is Relevant (Rl) if its rating is greater than or equal to 4, and Not Relevant (NRl) if its rating is less than 4. An item is Recommended (Rc) if it is within the N-recommendations defined in the experiment and it has been rated. It is therefore possible for a user to be recommended fewer items than defined by N-recommendations, or even none, since items that are not rated are removed from the test dataset. Items outside the N-recommendations are the not recommended items (NRc). Note that another option for computing the metrics is to consider unrated items as not relevant; however, in our case the metrics are computed according to the rules above.
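As an illustration of how the metrics can be computed under these rules (relevant means a rating of at least 4; recommended means the item is in the user's top-N list and rated in the test set), the following sketch is provided; the data structures and names are assumptions, not the actual evaluation code.

```python
# Hedged sketch of the Precision and Recall computation under the rules above.
# test_ratings: dict {(user, item): rating} for the test set.
# recommended:  dict {user: set of items in that user's top-N list}.
def precision_recall(test_ratings, recommended, threshold=4):
    rl_and_rc = rc = rl = 0
    for (user, item), rating in test_ratings.items():
        relevant = rating >= threshold                   # Rl vs NRl
        recced = item in recommended.get(user, set())    # Rc vs NRc
        rl += relevant
        rc += recced
        rl_and_rc += relevant and recced
    precision = rl_and_rc / rc if rc else 0.0            # (Rl and Rc) / Rc
    recall = rl_and_rc / rl if rl else 0.0               # (Rl and Rc) / Rl
    return precision, recall
```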

Consequently, given that the results matrix contains some movies that have not been rated by the users, i.e. some user ratings are not known, the following simplifications are made to compute the metrics:

• We did not include unrated items.
• We removed the ratings of a specific user that are contained in the training dataset, so that the metrics are constructed from the test dataset only and the results of these metrics are not overestimated.

Therefore, the number of recommended items for each user can be reduced. Once the predictions matrix was processed, we computed Precision and Recall. For the first experiment, the confusion matrix obtained from the test data is as follows:

Table 3. Confusion Matrix of Experiment I

                    Relevant    Not Relevant
Recommended         4301        1066
Not Recommended     26504       10671

The above table shows the number of relevant and non-relevant movies in the dataset, and the movies recommended and not recommended for this set. Considering the above formulas and the results in Table 3, the precision and recall results for N-recommendations = 10 are:

• Precision = 0.8013788, which indicates that 80% of the recommended items are relevant.
• Recall = 0.1396202, which means that approximately 14% of the relevant items were recommended.


6.3 Experiment II

In this new experiment we lower the number of recommendations to 5, keeping the default values α = 50/K and β = 0.01. Consequently, for k = 20 and N-Recommendations = 5, we obtained the following results:

Table 4. Confusion Matrix of Experiment II

                    Relevant    Not Relevant
Recommended         2264        541
Not Recommended     28541       11196

Based on these results, we calculated the performance metrics:

• Precision = 0.80713. This means that 80% of the recommended movies are relevant.
• Recall = 0.07349. Hence 7% of the relevant movies were recommended.

6.4 Experiment III

In this third experiment, we considered the same dataset for training and test, as well as the value k = 20, but changed the hyper-parameter values to α = 0.02 and β = 0.02, and also varied the number of N-Recommendations.

For N-Recommendations= 5, we had the following results:

Table 5. Confusion Matrix of Experiment III

                    Relevant    Not Relevant
Recommended         2253        505
Not Recommended     27173       11013

The performance values of the model were:

• Precision = 0.816. This means that 81% of the recommended items were relevant.
• Recall = 0.076. Thus 7% of the relevant items were recommended.


In this case, varying the number of recommendations provided to the user improves the precision value.

6.5 Experiment IV

In this experiment, we considered the same dataset for training and test, as well as the value k = 20, with the hyper-parameter values α = 0.02 and β = 0.02.

For N-Recommendations= 10, we had the following results:

Table 6. Confusion Matrix of Experiment IV

                    Relevant    Not Relevant
Recommended         4182        1090
Not Recommended     25244       10428

The performance values of the model were:

• Precision = 0.7932473
• Recall = 0.1421192

In this case, 79% of the recommended items were relevant, and 14% of the relevant items were recommended.

6.6 Experiment V

We considered the same dataset for training and test, as well as the value k = 20 and the hyper-parameter values α = 0.02 and β = 0.02. However, N-Recommendations was increased to 20. We obtained the following results:

Table 7. Confusion Matrix of Experiment V

                    Relevant    Not Relevant
Recommended         7459        2164
Not Recommended     21967       9354

The performance values of the model were:
• Precision = 0.77. This means that 77% of the recommended items were relevant.


• Recall = 0.25; that is, 25% of the relevant items were recommended.

We found that increasing the number of recommendations decreases the precision value and, conversely, increases the recall.

6.7 Experiment VI: With the matrix factorization method

In this experiment, the idea is to predict how users would rate the items that they have not yet rated, based on the observed ratings. The steps we follow are described below.

We start from a dataset of N users and M items, and an N×M ratings matrix R, where the entry R_ij represents the rating for item j by user i.

An example of part of the user-item matrix is as follows:

Table 8. Rating Matrix

      M1   M2   M3   M4   M5   M6
u1    5    0    0    0    0    0
u2    0    0    0    0    0    0
u3    0    0    0    0    0    0
u4    0    0    0    0    0    0
u5    0    0    0    0    0    2

Movie ratings can range from 1 to 5 (zero if not rated). For training we consider a subset of the matrix with the greatest number of ratings present, due to the time required to train a matrix of the large dimensions used in this study. For this we follow the process below.

Obtaining the reduced matrix

The actual dataset has a total of 1000209 rating records from 6040 users (N) on 3706 movies (M). The dataset is converted into a user-item matrix, which gives a total of 22384240 user-item pairs, obtained by multiplying N * M. Of this total, 1000209 are observed ratings and the rest is filled with zeros. This means that only about 4.5% of the user-item pairs are observed; the matrix therefore has a sparsity of 95.5% and is highly sparse.
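These figures can be checked with a few lines of arithmetic (illustrative only):

```python
# Quick arithmetic check of the figures quoted above.
n_users, n_items, n_ratings = 6040, 3706, 1000209
pairs = n_users * n_items        # 22,384,240 user-item pairs
density = n_ratings / pairs      # ~0.0447, i.e. roughly 4.5% of pairs observed
sparsity = 1 - density           # ~0.955, i.e. roughly 95.5% of pairs missing
```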


Based on this, a reduction process was applied to the matrix in order to obtain a matrix with as much common data as possible across rows and columns (users and movies). For this we used a Kohonen maps algorithm to group data by similarity criteria (creating groups of objects with similar characteristics). From now on we will call this matrix the reduced matrix. We then carried out a visual exploration to select those elements with the greatest similarity measures. The objective was to obtain a small user-item matrix with most of its ratings observed, thus reducing the degree of sparsity. As a result we obtained 166 users and 384 movies, i.e. 63744 user-item pairs, of which 47052 (73.8%) are observed user-item pairs and 16692 (26.1%) are user-item pairs with zeros.

We then proceeded to the training and test phases. For this we reserved 20% of the reduced matrix for testing and 80% for training, replacing the 20% of extracted data with zeros at random. In the test set, unrated items are removed. We then build the model as follows.

We determine a set of latent features associated with users and items through two new matrices: P, an N×K user latent factor matrix, and Q, an M×K movie feature loading matrix. Here K is the assumed number of latent features; we set K = 20, which must be smaller than the number of users and items. We then initialized the matrices P and Q with random values, which may be positive or negative.

The estimated rating is then computed as the product of the latent feature vectors, using equation (1). To learn the factor vectors, we minimize the regularized squared error over the set of known ratings using stochastic gradient descent. For this we compute the prediction error:

e_{u,i} = r_{u,i} − q_i^T p_u    (2)

Then we modify the values of p and q as follows:

q_i ← q_i + α (e_{u,i} · p_u − β · q_i)    (3)

p_u ← p_u + α (e_{u,i} · q_i − β · p_u)    (4)
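The following is a minimal sketch of how equations (2)-(4) might be applied with stochastic gradient descent; the function name, parameter defaults and loop structure are illustrative assumptions rather than the exact implementation used here.

```python
# Hedged sketch of stochastic gradient descent for the basic matrix
# factorization model, following equations (2)-(4). R is a dense user x item
# numpy array with 0 meaning "not rated"; all names are illustrative.
import numpy as np

def factorize(R, K=20, alpha=0.002, beta=0.02, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = rng.normal(scale=0.1, size=(n_users, K))      # user latent factors
    Q = rng.normal(scale=0.1, size=(n_items, K))      # item latent factors
    observed = np.argwhere(R > 0)                     # indices of known ratings
    for _ in range(epochs):
        for u, i in observed:
            e = R[u, i] - P[u] @ Q[i]                 # prediction error, eq. (2)
            Q[i] += alpha * (e * P[u] - beta * Q[i])  # eq. (3)
            P[u] += alpha * (e * Q[i] - beta * P[u])  # eq. (4)
    return P @ Q.T                                    # full matrix of estimated ratings
```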

In our case we set α = 0.002 and β = 0.02. Subsequently, we compute the rating again as the product of the two modified vectors, using equation (1). The result is a complete matrix of estimated ratings; an extract of this matrix is shown below:


Table 9. Matrix of estimated ratings

      M1      M2      M3      M4      M5      M6
u1    4.353   4.320   5.055   5.865   4.296   4.244
u2    4.080   3.307   3.739   4.337   3.661   3.523
u3    3.533   2.621   4.065   3.926   3.641   3.849
u4    5.235   4.940   5.040   5.252   5.027   4.825
u5    3.230   2.909   3.772   3.504   3.362   3.054
u6    4.534   4.528   5.603   4.920   4.082   4.670
u7    4.411   3.537   4.649   4.872   4.104   4.193
u8    4.323   3.788   4.426   4.178   4.239   3.907
u9    3.015   3.225   3.230   3.336   2.871   2.896

In this case we have also obtained predictions for the unobserved ratings, which can be considered as the blank (zero) values in the original matrix.

Recommendation Process
For this task we first tried to set N-Recommendations = 10 movies, as with LDA, considering the movies that were rated in the test set. However, we detected a problem: the recommendation could leave out an item with a high estimated rating because it falls outside the range of N-Recommendations, which affected the final result. We therefore decided to apply the following strategy: recommend the items present in the test set and set an N-Recommendations value for each user based on the estimated ratings greater than 3.5. While in the first alternative we had a fixed N-Recommendations, in this second proposal N-Recommendations is variable and depends on the predicted rating: values greater than 3.5 are considered recommended. Therefore, the final list consists of N items that have already been evaluated by the user in question, and it can be different for each user (a small sketch of this strategy is given below).
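An illustrative sketch of this second strategy, with assumed names and data structures, could be:

```python
# Hedged sketch of the variable-length recommendation strategy: for each user,
# recommend the test-set items whose estimated rating is greater than 3.5.
# `estimated` is a user x item array of predicted ratings; `test_items` maps
# each user to the set of items that user rated in the test set.
def recommend(estimated, test_items, threshold=3.5):
    recommendations = {}
    for user, items in test_items.items():
        recommendations[user] = {i for i in items if estimated[user, i] > threshold}
    return recommendations
```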

Precision and Recall
For the computation of the metrics, we identified the relevant and non-relevant items in the whole test set and, based on the results previously obtained, the following confusion matrix is obtained:


Table 10. Confusion Matrix of Experiment VI

                    Relevant    Not Relevant
Recommended         2263        1317
Not Recommended     1388        4535

As a result we obtained: Precision = 0.76, meaning that 76% of the recommended movies are relevant, and Recall = 0.77, meaning that 77% of the relevant movies were recommended.

7 Analysis of results

The following table summarizes the results obtained in the five experiments carried out with LDA, for k = 20 topics:

Table 11: Comparative table of results

Experiment   N-Recommendations   α      β      Precision (%)   Recall (%)
1            10                  50/k   0.01   80              13
2            5                   50/k   0.01   80              7
3            5                   0.02   0.02   81              7
4            10                  0.02   0.02   79              14
5            20                  0.02   0.02   77              25

As we can see, when the hyper-parameter values (α and β) change, the precision and recall percentages vary. For N-Recommendations = 10, the best prediction is given by the default values of the hyper-parameters, whereas for N-Recommendations = 5, the best prediction is obtained when the hyper-parameter values are 0.02.

However, across all the experiments performed, when the number of recommendations is smaller (N-Recommendations = 5), for the same values of α and β, a better precision value is obtained, i.e. the model performs better.

Apparently the best performance occurs when the number of recommendations is 5 and the hyper-parameters are α = 0.02 and β = 0.02.

In addition, the following table summarizes the results obtained with basic matrix factorization:


Table 12. Matrix Factorization Results

Experiment   N-Recommendations           α        β       Precision (%)   Recall (%)
VI           Different for each user     0.0002   0.002   76              77

Therefore, this method works better than LDA in terms of Recall: the results show that the matrix factorization model performs better on the Recall metric, unlike the LDA method, which has low Recall performance. However, with LDA it is possible to handle the sparsity problem without having to obtain a reduced matrix.

Conclusions
In this research, we propose applying the Latent Dirichlet Allocation (LDA) model to collaborative filtering recommender systems in the movie domain.

With the LDA results, the item distributions over the latent topics and the latent topic mixture distributions over users can be uncovered.

Five experiments were conducted to examine the performance of our proposed approach, in which we determined that the LDA model was able to recommend relevant movies to users with 81% precision, compared with basic matrix factorization, which recommends relevant movies with 76% precision; both are considered acceptable.

Finally, we conclude that for LDA the Precision and Recall values obtained depend on the length of the list of recommendations provided to the user: when more movies are returned to the user, the recall value increases and the precision decreases. Therefore, the number of predicted items considered valuable for the user conditions these values.

References

1. Ricci, F., Rokach, L., Shapira, B., & Kantor, P. B. (Eds.): Recommender Systems Handbook. Boston, MA: Springer (2011).


2. Lops, P., Gemmis, M., & Semeraro, G.: In F. Ricci, L. Rokach, B. Shapira, & P. B. Kantor (Eds.), Recommender Systems Handbook. Boston, MA: Springer US (2011).
3. Chang, T., & Hsiao, W.: LDA-based Personalized Document. PACIS (2013).
4. Ekstrand, M. D., Riedl, J., & Konstan, J.: Collaborative Filtering Recommender Systems. Foundations and Trends in Human-Computer Interaction, 4, pp. 81–173 (2010).
5. Liu, Q., Chen, E., Xiong, H., & Ding, C.: Enhancing Collaborative Filtering by User Interest Expansion via Personalized Ranking. 42, pp. 218–233 (2012).
6. Misra, H., Yvon, F., Cappé, O., & Jose, J.: Text segmentation: A topic modeling perspective. Information Processing & Management, 47(4), 528–544 (2011).
7. Griffiths, T. L., & Steyvers, M.: Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America, 101(Suppl 1), pp. 5228–5235 (2004).
8. Wei, X., & Croft, W. B.: LDA-based document models for ad-hoc retrieval. Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '06), 178–185 (2006).
9. Wang, Z., & Qian, X.: Text categorization based on LDA and SVM. Proceedings of the International Conference on Computer Science and Software Engineering (CSSE '08), 1, 674–677 (2008).
10. Ramage, D., Heymann, P., Manning, C. D., & Garcia-Molina, H.: Clustering the tagged web. Proceedings of the 2nd ACM International Conference on Web Search and Data Mining (WSDM '09), New York, NY, USA, 54–63 (2009).
11. Xie, W., Dong, Q., & Gao, H.: A Probabilistic Recommendation Method Inspired by Latent Dirichlet Allocation Model. Mathematical Problems in Engineering, 1–10 (2014).
12. Krestel, R., Fankhauser, P., & Nejdl, W.: Latent Dirichlet allocation for tag recommendation. Proceedings of the 3rd ACM Conference on Recommender Systems, 61–68 (2009).
13. Si, X., & Sun, M.: Tag-LDA for Scalable Real-time Tag Recommendation. 1, 23–30 (2009).


14. Song, Y., Zhang, L., & Giles, C. L.: Automatic tag recommendation algorithms for social recommender systems. ACM Transactions on the Web, 5 (2011).
15. Wilson, J., Chaudhury, S., & Lall, B.: Improving Collaborative Filtering Based Recommenders Using Topic Modelling. IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT), 340–346 (2014).
16. Hofmann, T.: Probabilistic latent semantic analysis. 15th Conference on Uncertainty in Artificial Intelligence (UAI), 289–296 (1999).
17. Blei, D. M., Ng, A. Y., & Jordan, M. I.: Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022 (2003).
18. Chen, W., Chu, J. C., Luan, J., Bai, H., Wang, Y., & Chang, E. Y.: Collaborative filtering for Orkut communities: Discovery of user latent behavior. Proceedings of the 18th International Conference on the World Wide Web (WWW), 681–690 (2009).
19. Koren, Y., Bell, R., & Volinsky, C.: Matrix Factorization Techniques for Recommender Systems. IEEE Computer Society, 42–49 (2009).
20. Luostarinen, T., & Kohonen, O.: Using Topic Models in Content-Based News Recommender Systems. Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013), 2, Norway, 239–251 (2013).
21. Herlocker, J. L., Konstan, J. A., & Terveen, L. G.: Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems, 22(1), 5–53 (2004).
22. Adomavicius, G., & Tuzhilin, A.: Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. Knowledge and Data Engineering, 17(6), 734–749 (2005).
23. Beheshti, B., & Desmarais, M.: Improving Matrix Factorization Techniques of Student Test Data with Partial Order Constraints, pp. 346–350 (2012).
24. Wang, C., & Blei, D. M.: Collaborative topic modeling for recommending scientific articles. KDD, pp. 448–456 (2011).


Knowledge Description for Bodies of Knowledge

Pablo Quezada1 and Juan Garbajosa1

Departamento de Sistemas Informáticos, CITSEM, Technical University of Madrid, Spain
[email protected], [email protected], www.upm.es

Abstract. Bodies of Knowledge (BOK) contain the relevant knowledge for a discipline. A BOK must embody the consensus reached by the community for which it will be of application; this consensus is a prerequisite for the adoption of the BOK by the community. While several BOK have been developed, a set of widely agreed guidelines on how to develop a BOK and, specifically, on the way to describe the knowledge is not yet available. It turns out that whenever the development of a BOK is started, an effort-intensive activity to define how knowledge will be described also has to commence. This lack of guidelines to describe knowledge will 1) dramatically increase the effort required to produce a BOK, 2) make it very difficult to compare related BOK content, and 3) make it a hard task to reuse knowledge descriptions. Therefore this lack of guidelines results in a large amount of redundant effort and inefficiency. This paper presents the conclusions of a literature study of the different ways in which knowledge is represented in BOK in the areas of software engineering and Information Technology (IT). This paper can be considered as a first effort to build a model that can be universally used to describe knowledge in BOK.

Keywords: Body of knowledge, BOK, IT, Information Technology, area breakdown, software engineering, knowledge structure, knowledge description, knowledge taxonomy, stakeholders, educators, practitioners

1 Introduction

A Body of Knowledge (BOK) is a term used to represent the complete set of concepts, terms and activities that make up a professional domain. It encompasses the core teachings, skills and research in a field or industry [1]. One of the main concerns of the software industry is to develop the talent of its human resources, since the quality and innovation of its products and services depend to a great extent on the knowledge, ability and talent of its technical engineers. The knowledge already exists; the goal is to gain consensus on the core subset of knowledge characterizing the engineering discipline [2].


A profession's BOK is its common intellectual ground: it is shared by everyone in the profession regardless of employment or engineering discipline [3]. The Engineering BOK is defined as the depth and breadth of knowledge, skills, and attitudes appropriate to enter practice as a professional engineer in responsible charge of engineering activities that potentially have an impact in different contexts. For the purposes of the Engineering BOK, the knowledge, skills, and attitudes are referred to as capabilities. A capability is defined as what an individual is expected to know and be able to do by the time of entry into professional practice in a responsible role. A given capability typically consists of many diverse and specific abilities [3].

The development of a Body of Knowledge is a complex activity. It has been carried out with the help of guiding knowledge, determining that knowledge can often be represented through Knowledge Areas (KA), which has not met the expectations of BOK stakeholders. This is so because BOK are quite extensive and are used not only by engineering stakeholders but also, for instance, in government [4]. The relevance of the Software Engineering Body of Knowledge and the IT BOK for the different stakeholders, especially in the contexts of industry [5],[6], education [7],[8],[9] and government [4], is analyzed focusing on the way the knowledge is described. In the past decade, 2000 to 2010, knowledge has also been represented through the core knowledge, skills and attitudes that are generally accepted and applied by investment professionals worldwide. In the same context, [10] mentions that knowledge is often written in a proprietary language; rules and algorithms are not compatible with other KBE frameworks [11] and are usually not at a level that is understandable for engineers and domain experts. The descriptions vary across the studies and criteria developed by the authors [11].

"Articulating a Body of Knowledge is an essential step toward developing a profession because it represents a broad consensus regarding what an engineering professional should know. Without such a consensus, no licensing examination can be validated, no curriculum can prepare an individual for an examination, and no criteria can be formulated for accrediting a curriculum" [12].

However, widely agreed guidelines do not yet exist for the description of BOK in general or, specifically, for describing the BOK knowledge, which is a first step towards the development of a BOK. The description of the knowledge allows us to understand the context of each discipline, its relation with other disciplines, its structure and contents, and the capabilities necessary for the correct development of the BOK [3]. To cover the gap in the description of BOK knowledge, this paper presents the conclusions of an analysis of the elements that have been used in the available literature to describe knowledge in BOK.

Nowadays, while several BOK have been developed, a set of widely agreed guidelines on how to develop a BOK and, specifically, on the way to describe the knowledge is not yet available. It turns out that whenever the development of a BOK is started, an effort-intensive activity to define how knowledge will be described also has to commence, often requiring an in-depth literature analysis. It also happens, in practice, that communities split, with each subcommunity producing its own BOK. Furthermore, many


disciplines are connected, and the knowledge described in different BOK is related. Therefore this lack of guidelines to describe knowledge 1) dramatically increases the effort required to produce a BOK, 2) makes it very difficult to compare BOK addressing closely related disciplines, and 3) makes it more difficult than necessary to reuse knowledge descriptions whenever that is suitable. This lack of guidelines thus results in a large amount of redundant effort that, instead of focusing on the BOK construction, focuses on unproductive discussions that could, to a large extent, be avoided.

This paper presents the conclusions of a literature study of the different ways in which knowledge is represented in BOK for the software engineering and IT disciplines, and how it should be structured, as the first step in building a model for knowledge description for BOK. With this objective, this paper has reviewed the existing literature using a Systematic Literature Review. As part of the study, the relevance of the knowledge in BOK for the different stakeholders of the software engineering and IT communities is also analyzed, since the usefulness of knowledge for a stakeholder depends on how well focused it is on the stakeholder's activity. The contribution of this paper can be considered a first effort towards the definition of a model that can be universally used to describe BOK knowledge.

This paper is structured as follows: Section 1, this one, is the Introduction; Section 2 introduces a background to BOK; Section 3 describes the research methodology used; the most relevant findings are presented in Section 4. Section 5 then introduces a discussion and some conclusions, and the last section, Section 6, describes the next steps in this research.

2 Background

A BOK (Body of Knowledge) is a collection of substantial concepts and skills that represent the knowledge of a certain area in an engineering or scientific discipline, and ensures its common understanding. A BOK may include technical terms and theoretical concepts as well as recommended practices [13].

A Body of Knowledge is not just a professional reading list, a library, a website or a collection of websites, a description of professional functions or a collection of information. A BOK is a list of knowledge, skills and abilities (competencies), organized into an integrated structure (taxonomy) with a specific level of accomplishment specified for each competency (proficiency). It is the sum of knowledge within a profession that includes proven traditional practices which are widely accepted, emerging innovative practices, as well as published and unpublished material. It is a living body of information that requires updating and feeding to remain current [14].

Professional communities have created and used Bodies of Knowledge (BOKs) to consolidate their discipline, standardize practices, improve processes, and warehouse community knowledge [13]. Formal BOKs have been used across disciplines as varied as medical practice management


[15], computer usability [16], personal software process [17], the SCAMPI method for process improvement [18], software engineering [19], project management [20], IT security [21], IT information [22], INCOSE [23],[24], SEBoK [25] and others. In the context of this paper, the description of BOK is focused on software engineering and IT (Information Technology).

BOKs are a relatively recent development in information modeling, but they draw on a rich heritage from other models. Unfortunately, the Body of Knowledge on BOKs remains to be written, so it can be difficult to understand how BOKs relate to other knowledge representations. Nonetheless we believe those relationships are essential to the disciplined creation of BOKs [13].

The logical evolution of a BOK comprises four phases:

1. Controlled vocabulary: a collection of preferred terms that are used to more precisely retrieve content, categorize content, build labeling systems, create style guides, and design database schemata [26].

2. Taxonomies: a set of hierarchically related terms in a controlled vocabulary. In this context we consider the Bloom and Vincenti taxonomies [27],[28],[29].
Building a taxonomy involves finding an appropriate breakdown. We start with the most general category, which will be the root of the tree, and then find its subcategories. For any category, each subcategory is itself a taxonomy [30]. Essentially, creating a taxonomy involves splitting a set into subsets and repeating the process on the subsets recursively. The criteria used to choose appropriate splits depend on the application.
In this context, in 1956 Benjamin Bloom created a taxonomy for categorizing the level of abstraction of questions that commonly occur in educational settings. This taxonomy is used in the SWEBOK [31] to specify the expected level of understanding of each topic within its Knowledge Areas (KAs) for a "graduate plus four years of experience". In its final form, Bloom's Taxonomy of the Cognitive Domain comprises six levels of intellectual behaviors: knowledge, comprehension, application, analysis, synthesis, and evaluation [29].
To characterize software engineering on the basis of Bloom's taxonomy, it is necessary to consider the following:

– Software engineering is an intellectual process. This means that human factors are significant to many aspects of the application of its knowledge [29].
– Knowledge on software engineering consists of generic software engineering knowledge (as described in the SWEBOK, for instance) as well as target software domain specific knowledge [29].
– The engineering process can be refined into Primary Process, Supporting Process, and Organizational Process as in, for example, ISO/IEC 12207 [32].
– Applicable technologies vary depending on the category, or application domain, of the target software.

In the same context, it is important to consider the taxonomy of Vincenti, who studied the epistemology of engineering based on the historical analysis


analysis of five case studies in aeronautical engineering covering a roughly fifty-year period, and proposed a taxonomy of engineering knowledge.

Vincenti, on the basis of his analysis of the evolution of aeronautical engineering knowledge, identified different types of engineering knowledge and classified them into six categories:

– Fundamental design concepts.
– Criteria and specifications.
– Theoretical tools.
– Quantitative data.
– Practical considerations.
– Design instrumentalities [33].

Maibaum [33], in contrast, has used Vincenti's classification to investigate the mathematical foundations of software engineering, but did not look at specific knowledge areas within software engineering.

We can use Vincenti's classification to recognize and identify, in the SWEBOK 2004 Guide [34], the types of engineering knowledge included, with their current status and description. By extension, this analysis also provides insight into the elements of engineering knowledge that may be missing, either because they do not exist or because they have been missed in the information gathering and successive review processes.

Vincenti's categories are not mutually exclusive; it is therefore important to understand the relationships between them. An initial modeling of Vincenti's categories of engineering knowledge is presented in Figure 1. This figure illustrates that, in seeking a design solution, designers move up and down within categories, as well as back and forth from one category to another.

Fig. 1. Vincenti's classification of engineering knowledge [35]


Table 1 shows Vincenti's engineering knowledge categories and their goals.

Table 1. Vincenti's engineering knowledge categories and goals

– Fundamental design concepts: designers embarking on any normal design bring with them fundamental concepts about the device in question.
– Criteria and specifications: to design a device embodying a given operational principle and normal configuration, the designer must have, at some point, specific requirements in terms of hardware.
– Theoretical tools: to carry out their design function, engineers use a wide range of theoretical tools. These include intellectual concepts as well as mathematical methods.
– Quantitative data: even with fundamental concepts and technical specifications at hand, mathematical tools are of little use without data for the physical properties or other quantities required in the formulas. Other kinds of data may also be needed to lay out the details of the device or to specify manufacturing processes for production.
– Practical considerations: to complement the theoretical tools and quantitative data, which are not sufficient, designers also need less sharply defined considerations derived from experience.
– Design instrumentalities: besides the analytical tools, quantitative data and practical considerations required for their task, designers need to know how to carry out those tasks. How to employ procedures productively constitutes an essential part of design knowledge.

3. Ontology: a set of statements about a knowledge domain consisting of terms from a controlled vocabulary and the relationships among them. We follow Jurisica et al. in distinguishing several subtypes of ontologies, such as static and dynamic ontologies [14].

4. Metamodel: an ontology template whose parameters can be set to generate ontologies. The metamodel seeks to discover underlying similarities between the BOK being developed and other, related BOKs.

In order to describe knowledge it is necessary to consider the context of the discipline. In the case of engineering, the best way to describe knowledge in a Body of Knowledge is to use levels that describe knowledge areas, units and topics with their respective breakdowns.


Another way to describe knowledge in the engineering context is to focus on capacities; [3] considers 30 capacities needed to properly develop a BOK in an engineering context. In the context of software and IT engineering, the respective BOKs provide a catalog of relevant factual knowledge and organize that knowledge into different Knowledge Areas.

3 RESEARCH METHODOLOGY

In order to support this paper we used a Systematic Literature Review as the research method. The Systematic Literature Review (SLR) was proposed for software engineering research by Kitchenham [36] as a method to report reliable conclusions about a research area by systematically collecting quality evidence. As can be seen in Figure 2, the process has three main phases:
– Planning the review, to develop a review protocol.
– Conducting the review, to execute the previous protocol.
– Reporting the review, to provide the obtained results to the community.
The whole process includes several iterations to get valuable feedback, improving the overall research.

Fig. 2. Phases of the SLR method, proposed by Kitchenham

To conduct the systematic literature review it was necessary to use a formal search strategy. This made the localization of the scientific contributions that could provide relevant answers to the research questions more reliable. Besides, following a formal search strategy enables recommended research practices such as repeatability and external review of this contribution. In an attempt to perform an exhaustive search we identified electronic sources of relevance in the BOK context, for example: IEEE, ACM Digital Library, Springer, ISI Web of Knowledge, Science Direct, the FIE 2014 conference [9] and other relevant resources in the area of this research.
In the same context we defined inclusion and exclusion criteria in order to identify those primary studies that provide direct evidence about the research question. In order to reduce the likelihood of bias, selection criteria should be decided during the protocol definition, although they may be refined during the search process [36], [37].
Study selection is a multistage process. Initially, selection criteria should be interpreted liberally, so that unless a study identified by the electronic and hand searches can be clearly excluded based on its title and abstract, a full copy should be obtained. In this paper we consider two criteria to


evaluate the related work and theoretical support.

Inclusion criteria: scientific material (papers, experience reports, summaries of workshops, FIE 2014 conference contributions, etc.) written in English and accessible digitally. In the same context, it was necessary to consider studies framed in other disciplines but linked to the concept of Body of Knowledge, as well as IEEE and ACM workshops and meetings.

Exclusion criteria: non-scientific material and material not written in English.

The next step is to apply the inclusion and exclusion criteria based on practical issues [38] such as: language, journal, authors, setting, participants or subjects, research design, sampling method and date of publication.
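As an illustration of this screening step, the sketch below is our own hypothetical example of how such practical criteria could be applied programmatically to a list of candidate studies; the field names, the English-only rule and the cut-off year are assumptions used only for demonstration, not part of the cited protocol.

```python
# Hypothetical sketch of the practical screening step; the fields and the
# cut-off year are illustrative assumptions, not part of the cited protocol.
from dataclasses import dataclass

@dataclass
class Candidate:
    title: str
    language: str
    year: int
    venue: str

def passes_screening(study: Candidate, min_year: int = 1999) -> bool:
    """Apply the practical inclusion/exclusion filters (language and date)."""
    if study.language.lower() != "english":   # exclusion: material not written in English
        return False
    if study.year < min_year:                 # assumed publication window
        return False
    return True

candidates = [
    Candidate("Building a body of knowledge on model checking", "English", 2013, "COMPSAC"),
    Candidate("Un estudio sobre cuerpos de conocimiento", "Spanish", 2012, "JISBD"),
]
selected = [c for c in candidates if passes_screening(c)]  # keeps only the first study
```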

In the same context, data extraction was carried out: the objective of this stage is to design data extraction forms to accurately record the information researchers obtain from the primary studies. To reduce the opportunity for bias, data extraction forms should be defined and piloted when the study protocol is defined [36].

Once the data from all the studies have been properly extracted, they should be synthesized in order to provide new knowledge in the research areas, in this case in the contexts of bodies of knowledge and software engineering. The data extraction forms must be designed to collect all the information needed to address the research questions.

Regarding the synthesis strategies, it is important not to include multiple publications of the same data in a systematic literature review, because duplicate reports would seriously bias the results. It may be necessary to contact the authors to confirm whether or not reports refer to the same study.

On the other hand, 11 criteria were used to evaluate the papers. Table 2 shows a summary of the quality assessment criteria for the studies used to support this research.


Table 2. Summary of the quality assessment criteria for studies

1. Is the paper based on research (or is it merely a lessons-learned report based on expert opinion)?
2. Is there a clear statement of the aims of the research?
3. Is there an adequate description of the context in which the research was carried out?
4. Was the research design appropriate to address the aims of the research?
5. Was the recruitment strategy appropriate to the aims of the research?
6. Was there a control group with which to compare treatments?
7. Was the data collected in a way that addressed the research issue?
8. Was the data analysis sufficiently rigorous?
9. Has the relationship between researcher and participants been considered to an adequate degree?
10. Is there a clear statement of findings?
11. Is the study of value for research or practice?
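A minimal sketch of how the answers to these eleven criteria could be recorded per primary study is shown below; the attribute names and the unweighted scoring are our own assumptions, used only to illustrate the assessment step.

```python
# Minimal illustration (not part of the published protocol) of recording the
# eleven quality-assessment answers of Table 2 as yes/no flags per study.
QUALITY_CRITERIA = [
    "based_on_research", "clear_aims", "context_described", "appropriate_design",
    "appropriate_recruitment", "control_group", "adequate_data_collection",
    "rigorous_analysis", "researcher_participant_relation", "clear_findings",
    "value_for_research_or_practice",
]

def quality_score(answers: dict) -> int:
    """Count how many of the eleven criteria a study satisfies (unweighted by assumption)."""
    return sum(1 for criterion in QUALITY_CRITERIA if answers.get(criterion, False))

# A hypothetical study that satisfies the first nine criteria:
example = {criterion: True for criterion in QUALITY_CRITERIA[:9]}
print(quality_score(example))  # -> 9
```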

After applying the protocol and evaluation criteria we obtained the findings that support this research.

4 Findings

Bodies of Knowledge are used by individuals for extending their skills and for career development. Researchers may find them useful for identifying technology applicable to their research and to help define the skills required for research teams. The process of building a BOK should also assist in highlighting similarities across disciplines, for example, techniques used in materials science that are common to chemistry and physics [39].
Regarding the knowledge levels in a BOK, they define the amount of knowledge to be offered within a specific level of an educational program: the greater the level, the more knowledge is offered. RaPSEEM defines the following levels:

– Level 0, implying that BOKs are not offered at all.

– Level 1, Basic Level, implying that core BOKs are offered at a general level. Students are aware of their existence and of their relationships to other BOKs.

– Level 2, Mid-Level, implying that BOKs are offered at an intermediate level. Students deepen their knowledge of the core BOKs and extend it with new BOKs.


– Level 3, Detailed Level, implying that the BOKs are offered at a detailed level within a specific engineering domain and role [40].

Bodies of Knowledge have a specific structure according to the area of engineering or science. In this paper we describe the general structure of a Body of Knowledge in the context of engineering.

Firstly, we need to consider the context of the area of study of the Body of Knowledge in order to establish its core (the skills, knowledge and experience to be taught in the curriculum to achieve the expected student outcomes). In the same way, the BOK establishes Knowledge Areas (KAs). Each Knowledge Area description should use the following structure:

– Acronyms.

– Introduction.

– Breakdown of Topics of the KA.

– Matrix of Topics vs. Reference Material.

– List of Further Readings.

– References [34],[31].

Each area is broken down into smaller divisions called units [34], [31], [29], which represent individual thematic modules within an area. Each unit is further subdivided into a set of topics, which are the lowest level of the hierarchy. The topics depend on the evolution and context of the knowledge area and discipline [34], [31].

In the context of a BOK, a process for updating the knowledge is necessary as the discipline advances and the needs of society change.

In general, BOKs have different committees, organizations and collaboration groups that develop and update their contents as science and engineering advance. In order to formulate a BOK in a bottom-up manner it is necessary to consider the materials from which knowledge about the targeted discipline can be extracted. We paid special attention to the materials used, such as PowerPoint slides, documents, articles, and books used in education and research in the targeted discipline. By analyzing these materials, we presumed that a certain level of knowledge could be obtained and used to formulate a BOK [41].

BOK is the general name for the three types of resources that the BOK Constructor manages, and it also refers to a novel BOK design principle for new disciplines. The resources are materials, descriptions, and the BOK tree, which are linked.

Within this paper, a general structure of a BOK by levels is presented below.

Firstly, we need to consider the core of the BOK.

Secondly, we establish the Knowledge Areas of the Body of Knowledge. Each area is broken down into smaller divisions called units, which represent individual thematic modules within an area; adding a two- or three-letter suffix to the area identifies each unit. Each unit is further subdivided into a set of topics, which are the lowest level of the structure and content [34], [31].

The integration of new areas, units and topics depends on the criteria considered by the organizations and institutions that regulate the different BOKs [34], [31], [29].


On the other hand, it is necessary to consider another level in the structure of BOKs where the topics are further detailed (sub-topics). These sub-topics address different knowledge and skills. Likewise, to develop a BOK it is necessary to consider: process model, deliverables, organization, technology focus, tools, assignment focus and exercise domain [42].

Fig. 3. Model of a Body of Knowledge

According to Figure 3, to describe a Body of Knowledge we need to consider its hierarchical structure. In C1, the core of the BOK, we show the levels of abstraction and the organization of the Knowledge Areas. Knowledge categories are high-level structural elements used for organizing, classifying, and describing engineering knowledge; a knowledge category is composed of knowledge areas [43]. A knowledge area (KA) is a subdivision of a knowledge category that represents software engineering knowledge that is logically cohesive and related to the knowledge category through inheritance or aggregation; a knowledge area is composed of a set of knowledge units [43]. A knowledge unit (KU) is a subdivision of a knowledge area that represents a basic component of engineering knowledge with a crisp and explicit description; for the purposes of this activity, the knowledge unit is atomic, that is, it is not subdivided into simpler or more basic elements [43]. In the same context we propose another level of depth, the sub-topic, in which each component of a topic is detailed.
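To make the hierarchy concrete, the following sketch represents the knowledge area / unit / topic / sub-topic breakdown as a simple tree; the sample entries are loosely inspired by SWEBOK-style breakdowns and are only illustrative, not a prescribed structure.

```python
# Illustrative sketch of the hierarchical BOK structure described above
# (knowledge area -> unit -> topic -> sub-topic); the names are examples only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)

bok = Node("Software Engineering BOK (core)", [
    Node("Software Design (knowledge area)", [
        Node("Software Design Fundamentals (unit)", [
            Node("Design principles (topic)", [
                Node("Separation of concerns (sub-topic)"),
            ]),
        ]),
    ]),
])

def height(node: Node) -> int:
    """Height of the BOK tree; BOKs usually cap it to remain readable."""
    return 1 + max((height(child) for child in node.children), default=0)

print(height(bok))  # -> 5
```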


In addition, we propose relating the context of the disciplines (software and IT engineering) to capacities and related disciplines. In the proposed model we consider the domain of application of the BOK, in this case software and IT engineering. It is also necessary to consider that, as a basis, a professional education in an engineering context needs to provide the professional level of knowledge that can be outlined as follows [44]:

– Knowing the basic concepts and major application areas, in this case in software and IT engineering.

– Knowing similar concepts (and the relations between concepts) and alternatives, as well as application-specific areas.

– Knowing the basic technologies and their relation to the basic concepts. Beyond this, a professional needs additional dimensions of awareness and critical thinking.

– Knowing authoritative (and non-authoritative) sources of information and how to evaluate the quality of information.

– Ability to work with standards (which are not an easy source of information).

– Ability to critically evaluate and filter information.

Additionally, we need to consider that educational programs in engineering and engineering technology have been developed to address many technical aspects associated with computers [45]. The computer engineering technology Body of Knowledge, the SWEBOK and the ITBOK are based on inputs provided from several perspectives, including industry demand, previous work in creating computing bodies of knowledge, and institutional factors. In the same context, the ASCE Body of Knowledge [46] highlights the need for engineers to understand the impact of their solutions with regard to society, culture and industry. The BOK can also be used by individuals for extending their skills and for career development. Researchers may find it useful for identifying technology applicable to their research and to help define the skills required for research teams. The process of building the BOK should assist in highlighting similarities across disciplines, for example, techniques used in materials science [47]. A BOK is normally used for certification and education or training [48], [49]. The knowledge must reflect current best practice, which inevitably changes over time; however, updates cannot be undertaken in an uncontrolled manner, since the associated lectures and other education material need to be maintained in line with the BOK [31]. Other important factors to consider in engineering are the stakeholders [9]: various people, groups, companies, and other organizational or governmental entities have a stake in educational programs. They should all be identified and their responsibilities towards the educational programs should be specified. Only then can one make sure that the educational programs fulfill the requirements of those who affect or are affected by the programs. RaPSEEM suggests defining responsibilities for four groups [50]:

– Students, whose competencies are the main outcome of the educational programs.


– Educators, who are the people, groups or organizational entities that provide education.

– Industry, which employs the students.

– Government, which affects the educational programs by promoting education within various disciplines.

5 CONCLUSIONS

– The findings presented in Section 4 show the criteria used to develop the general structure and contents of BOKs in the field of engineering. The proposed way of elaborating a BOK allows a real understanding of the Knowledge Areas and their relation with the related disciplines.

– A BOK needs a prior consensus on its Knowledge Areas and related disciplines.

– A BOK generally uses a tree structure to represent knowledge, and a certain limit is set on its height to aid understandability and readability. The main objective of a BOK is to provide a classification of knowledge together with its detailed explanation.

– The definition of a BOK in the context of engineering is important to respond to the training needs of future professionals, so that they acquire the required competencies in the social, business, educational and industrial spheres.

– The Body of Knowledge provides the basis for curriculum development and maintenance and supports professional development and any current and future certification schemes. Lastly, it promotes integration and connections with related disciplines.

– The Body of Knowledge is the sum total of our human understanding of the world around us. Studies in the area of strength and conditioning make up one of the many fields of knowledge, and strength and conditioning professionals must understand how this understanding is created in order to successfully use it to optimize their professional practices, approaches, and exercise prescriptions.

– A general structure of BOKs in engineering was established. This structure begins with the set of Knowledge Areas, continues with Units and ends with Topics, according to the research area.

– We need to consider that BOKs have a hierarchical structure ordered by levels, which depends on the needs of each discipline and on its social relevance.

– Another aspect to consider when developing a BOK is the degree of maturity of the discipline, its skills, and its competences.

– The software engineering body of knowledge is an all-inclusive term that describes the sum of knowledge within the profession of software engineering. The Guide to the Software Engineering Body of Knowledge, however, seeks to identify and describe the subset of the body of knowledge that is generally accepted, in other words, the core body of knowledge.

– A BOK can fulfill a role for stakeholders in supporting education, certification, professional stature, professional development, and organizational improvement.


6 FUTURE WORK

This paper is a first effort towards building a model to support knowledge description in software engineering and IT. This effort has been focused on reporting the conclusions obtained from analyzing the existing literature concerning knowledge description in BOKs. The next step will be to define a model based on the conclusions obtained from the current study.

7 Acknowledgment

This research was supported by the Technical University of Madrid (UPM), especially by PhD Juan Garbajosa Sopea. We thank our professors PhD Sergio Arevalo and PhD Jennifer Perez, who organized the 1st Workshop on COmputing Science and Technology for smArt Cities (COSTAC), an initiative of the PhD Program on COmputing Science and Technology for smArt Cities of the Technical University of Madrid (UPM).

References

1. B. Penzenstadler, D. Mendez Fernandez, D. Richardson, D. Callele, and K. Wnuk, "The requirements engineering body of knowledge (REBOK)," in Requirements Engineering Conference (RE), 2013 21st IEEE International, July 2013, pp. 377–379.

2. J. Rivera-Ibarra, J. Rodriguez-Jacobo, and M. Serrano-Vargas, "Competency framework for software engineers," in Software Engineering Education and Training (CSEE&T), 2010 23rd IEEE Conference on, March 2010, pp. 33–40.

3. Licensure and Q. for Practice Committee of the National Society of Professional Engineers, "Engineering body of knowledge," National Society of Professional Engineers (NSPE), 2013, 67 pages.

4. C. Smith and D. J. Brooks, Security Science: The Theory and Practice of Security. Newton, MA, USA: Butterworth-Heinemann, 2013.

5. V. P. Lopes and G. H. Travassos, "Knowledge repository structure of an experimental software engineering environment," in Proceedings of the 2009 XXIII Brazilian Symposium on Software Engineering, ser. SBES '09. Washington, DC, USA: IEEE Computer Society, 2009, pp. 32–42. [Online]. Available: http://dx.doi.org/10.1109/SBES.2009.12

6. J. Mylopoulos, V. Chaudhri, D. Plexousakis, A. Shrufi, and T. Topologlou, "Building knowledge base management systems," The VLDB Journal, vol. 5, no. 4, pp. 238–263, Dec. 1996. [Online]. Available: http://dx.doi.org/10.1007/s007780050027

7. H. Welch, "Teaching a service course in software engineering," in Frontiers In Education Conference - Global Engineering: Knowledge Without Borders, Opportunities Without Passports, 2007. FIE '07. 37th Annual, Oct 2007, pp. F4B-6–F4B-11.


8. R. Dony, P. Botman, W. Briggs, R. Haggart, and P. Taylor, "The software engineering body of knowledge for professional engineering in Canada," in Electrical and Computer Engineering, 2002. IEEE CCECE 2002. Canadian Conference on, vol. 2, 2002, pp. 743–748.

9. "[Front cover]," in Frontiers in Education Conference (FIE), 2014 IEEE, Oct 2014, pp. c1–c1.

10. P. Klein, D. Pugliese, J. Lützenberger, G. Colombo, and K.-D. Thoben, "Exchange of knowledge in customized product development processes," vol. 21, 2014, pp. 99–104, 24th CIRP Design Conference. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S221282711400691X

11. H. Yang, J. Chen, N. Ma, and D. Wang, "Implementation of knowledge-based engineering methodology in ship structural design," Computer-Aided Design, vol. 44, no. 3, pp. 196–202, 2012, Applications in Ship and Floating Structure Design and Analysis. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0010448511001540

12. P. Bourque, R. Dupuis, A. Abran, J. Moore, and L. Tripp, "The guide to the software engineering body of knowledge," IEEE Software, vol. 16, no. 6, pp. 35–44, Nov 1999.

13. K. Taguchi, H. Nishihara, T. Aoki, F. Kumeno, K. Hayamizu, and K. Shinozaki, "Building a body of knowledge on model checking for software development," in Computer Software and Applications Conference (COMPSAC), 2013 IEEE 37th Annual, July 2013, pp. 784–789.

14. D. A. Mundie and R. Ruefle, "Building an incident management body of knowledge," in Proceedings of the 2012 Seventh International Conference on Availability, Reliability and Security, ser. ARES '12. Washington, DC, USA: IEEE Computer Society, 2012, pp. 507–513. [Online]. Available: http://dx.doi.org/10.1109/ARES.2012.83

15. Medical Group Management Association, Englewood, "Body of knowledge for medical practice management." Medical Group Management Association.

16. N. Bevan, "Usability body of knowledge," Bloomingdale: Usability Professionals Association, 2005. [Online]. Available: http://www.usabilitybok.org

17. J. M. C. R. Marsh P., Huff and S. M., "Personal software process (PSP) body of knowledge, version 1.0," Pittsburgh, PA: Software Engineering Institute, Carnegie Mellon University, vol. 1, 2005. [Online]. Available: http://www.sei.cmu.edu/library/abstracts/reports/05sr003.cfm

18. J. M. Steve Masters, Sandra Behrens and R. Charles, "SCAMPI lead appraiser body of knowledge (SLA BOK) (CMU/SEI-2007-TR-019)," Pittsburgh, PA: Software Engineering Institute, Carnegie Mellon University, vol. 1, 2005. [Online]. Available: http://www.sei.cmu.edu/library/abstracts/reports/07tr019.cfm

19. F. Robert, A. Abran, and P. Bourque, "A technical review of the software construction knowledge area in the SWEBOK guide," in Software Technology and Engineering Practice, 2002. STEP 2002. Proceedings. 10th International Workshop on, Oct 2002, pp. 36–42.


20. P. M. Institute, "A guide to the project management body of knowledge," in Proceedings of the 2009 XXIII Brazilian Symposium on Software Engineering. Washington, DC, USA: Newton Square, 2008.

21. O. o. C. Department of Homeland Security and N. C. S. D. Communications, "A competency and functional framework for IT security workforce development," in Proceedings of the 2009 XXIII Brazilian Symposium on Software Engineering. Washington, DC, USA: Newton Square, 2008. [Online]. Available: http://www.us-cert.gov/ITSecurityEBK/EBK2008.pdf

22. W. Agresti, "An IT body of knowledge: The key to an emerging profession," IT Professional, vol. 10, no. 6, pp. 18–22, Nov 2008.

23. "[Front cover]," in Guide to the Systems Engineering Body of Knowledge (G2SEBoK), April 2012, pp. 1–52.

24. "INCOSE Insight," in The INCOSE Fellows Edition: The technical vision of systems engineering; the intellectual content of systems engineering, March 2006, vol. 8, issue 2, pp. 1–64.

25. A. Squires, N. Hutchison, A. Pyster, T. Ferris, D. Olwell, S. Enck, and D. Gelosh, "Work in process: A body of knowledge and curriculum to advance systems engineering (BKCASE)," in Systems Conference (SysCon), 2011 IEEE International, April 2011, pp. 250–255.

26. V. Lombardi, "Metadata glossary," Noise Between Stations, 2012. [Online]. Available: http://noisebetweenstations.com/personal/essays

27. P. Bourque, L. Buglione, A. Abran, and A. April, "Bloom's taxonomy levels for three software engineer profiles," in Software Technology and Engineering Practice, 2003. Eleventh Annual International Workshop on, Sept 2003, pp. 123–129.

28. "Using ontologies for knowledge management: An information systems perspective," Proc. ASIS Annual Mtg, vol. 3, pp. 482–496, 1999.

29. M. Azuma, F. Coallier, and J. Garbajosa, "How to apply the Bloom taxonomy to software engineering," in Software Technology and Engineering Practice, 2003. Eleventh Annual International Workshop on, Sept 2003, pp. 117–122.

30. A. Hunter and W. Liu, "A survey of formalisms for representing and reasoning with scientific knowledge," Knowledge Eng. Review, vol. 25, no. 2, pp. 199–222, 2010. [Online]. Available: http://dx.doi.org/10.1017/S0269888910000019

31. P. Bourque and R. Dupuis, "Guide to the software engineering body of knowledge, 2014 version," SWEBOK, 2014.

32. "IEEE/EIA standard industry implementation of international standard ISO/IEC 12207:1995 (ISO/IEC 12207) standard for information technology software life cycle processes," IEEE/EIA 12207.0-1996, pp. 1–75, 1998.

33. T. Maibaum, "Mathematical foundations of software engineering: A roadmap," in Proceedings of the Conference on The Future of Software Engineering, ser. ICSE '00. New York,


NY, USA: ACM, 2000, pp. 161–172. [Online]. Available: http://doi.acm.org/10.1145/336512.336548

34. P. Bourque and R. Dupuis, "Guide to the software engineering body of knowledge, 2004 version," SWEBOK, 2004.

35. W. G. Vincenti, What Engineers Know and How They Know It: Analytical Studies from Aeronautical History. Baltimore: Johns Hopkins University Press, 1990.

36. B. Kitchenham and S. Charters, "Guidelines for performing systematic literature reviews in software engineering," EBSE Technical Report EBSE-2007-01, Tech. Rep., 2007.

37. B. A. Kitchenham, S. L. Pfleeger, L. M. Pickard, P. W. Jones, D. C. Hoaglin, K. E. Emam, and J. Rosenberg, "Preliminary guidelines for empirical research in software engineering," IEEE Trans. Softw. Eng., vol. 28, no. 8, pp. 721–734, Aug. 2002. [Online]. Available: http://dx.doi.org/10.1109/TSE.2002.1027796

38. K. Green, "Conducting research literature reviews: From the internet to paper (3rd ed.) by Arlene Fink. Sage, Los Angeles, CA (2010)," Library & Information Science Research, vol. 32, no. 4, pp. 290–291, 2010. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0740818810000630

39. ISO/IEC TR 19759:2005, "Bloom's taxonomy levels for three software engineer profiles," in ISO/IEC TR 19759:2005 Software Engineering (2005), Guide to the Software Engineering Body of Knowledge. International Organization for Standardization. IEEE Computer Society, Sept 2005.

40. M. Kajko-Mattsson, "A method for designing software engineering educational programs," in Software Engineering Education and Training (CSEE&T), 2012 IEEE 25th Conference on, April 2012, pp. 139–143.

41. Y. Masunaga, K. Ito, T. Yabuki, and T. Morita, "Edit conflict resolution in WikiBOK: A wiki-based BOK formulation-aid system for new disciplines," in Privacy, Security, Risk and Trust (PASSAT), 2012 International Conference on and 2012 International Conference on Social Computing (SocialCom), Sept 2012, pp. 210–218.

42. J. Han, "Software engineering course design for undergraduates," J. Comput. Sci. Coll., vol. 26, no. 4, pp. 166–172, Apr. 2011. [Online]. Available: http://dl.acm.org/citation.cfm?id=1953573.1953601

43. T. Hilburn, I. Hirmanpour, S. Khajenoori, R. Turner, and A. Qasem, "A software engineering body of knowledge version 1.0," Software Engineering Institute, Carnegie Mellon University, Pittsburgh, PA, Tech. Rep. CMU/SEI-99-TR-004, 1999. [Online]. Available: http://resources.sei.cmu.edu/library/asset-view.cfm?AssetID=13359

44. Y. Demchenko, D. Bernstein, A. Belloum, A. Oprescu, T. Wlodarczyk, and C. De Laat, "New instructional models for building effective curricula on cloud computing technologies and engineering," in Cloud Computing Technology and Science (CloudCom), 2013 IEEE 5th International Conference on, vol. 2, Dec 2013, pp. 112–119.


45. J. Evans and D. Jacobson, "A computer engineering technology body of knowledge," in Frontiers in Education Conference (FIE), 2010 IEEE, Oct 2010, pp. T3E-1–T3E-6.

46. The Body of Knowledge Committee of the Committee on Academic Prerequisites for Professional Practice (BOK Committee), Civil Engineering Body of Knowledge for the 21st Century: Preparing the Civil Engineer for the Future. Reston, VA: American Society of Civil Engineers, 2008. [Online]. Available: http://ascelibrary.org/doi/abs/10.1061/9780784406496

47. M. Shaw, J. Herbsleb, I. Ozkaya, and D. Root, "Deciding what to design: Closing a gap in software engineering education," in Software Engineering Education in the Modern Age, ser. Lecture Notes in Computer Science, P. Inverardi and M. Jazayeri, Eds. Springer Berlin Heidelberg, 2006, vol. 4309, pp. 28–58. [Online]. Available: http://dx.doi.org/10.1007/119493743

48. H. Idrus, "Developing well-rounded graduates through integration of soft skills in the teaching of engineering courses," in Frontiers in Education Conference (FIE), 2014 IEEE, Oct 2014, pp. 1–9.

49. J. Benning, A. Surovek, D. Dolan, L. Wilson, A. Thompson, and R. Pyatt, "Cultural considerations in service learning with American Indian reservation community stakeholders," in Frontiers in Education Conference (FIE), 2014 IEEE, Oct 2014, pp. 1–4.

50. M. Kajko-Mattsson, "A method for designing software engineering educational programs," in 25th IEEE Conference on Software Engineering Education and Training, CSEE&T 2012, Nanjing, China, April 17-19, 2012, pp. 139–143. [Online]. Available: http://dx.doi.org/10.1109/CSEET.2012.34


DETECTION OF VULNERABLE USERS USING V2X COMMUNICATIONS

José Javier Anaya Catalán (1), Edgar Talavera Muñoz (2), David Giménez Masvidal (3), Felipe Jiménez Alonso (4) and José Eugenio Naranjo Hernández (5)

(1) Master Mechanical Engineer. INSIA. Polytechnic University of Madrid
(2) Master Informatics Engineer. INSIA. Polytechnic University of Madrid
(3) Graduate in Software Engineering. INSIA. Polytechnic University of Madrid
(4) Doctor Mechanical Engineer. INSIA. Polytechnic University of Madrid
(5) Doctor Informatics Engineer. INSIA. Polytechnic University of Madrid

Abstract. Vehicle-to-vehicle (V2V) communications allow the exchange of information in real time. Efforts are moving in two directions: on the one hand, the development of communications infrastructure to support information exchange; on the other, the development of advanced driver assistance systems (ADAS) that make use of these communications. Among the systems identified, few specifically target vulnerable road users such as pedestrians, motorcyclists and cyclists, taking into account the peculiarities of their circulation that distinguish them from passenger cars, trucks and buses. In this paper we present an ADAS aimed at avoiding accidents involving motorcyclists and cyclists using V2V communications, incorporating them into the vehicular network while taking into account the intrinsic characteristics of this group. With it, approaching drivers can be warned further in advance than onboard sensors allow.

1 Introduction

Safety systems in road transport have evolved in recent years with the development of Information and Communications Technology (ICT). ICT opens up numerous application possibilities to improve mobility on the road. One of the key issues identified as a priority from the point of view of road safety is reducing accidents involving vulnerable road users (VRU): pedestrians, motorcyclists and cyclists. The accident rate in this group is one of the few on the rise today, with the corresponding social impact. Every day 75 people die on average on European roads and 750 serious injuries also occur [H2020 (2013)]. Vulnerable road users, pedestrians, cyclists and motorcyclists pose particularly serious problems in the field of safety, as they represent a disproportionately high percentage of the total number of fatalities and serious injuries. At the same time, policy and regulatory measures to improve the safety of these groups often involve a significant cost and tend to be slow to implement.


Therefore, in the field of transport research we face the challenge of improving the safety of vulnerable road users through the development of technological tools that can be applied to reduce the number of accidents.

Specifically, in 2012 there were 25,651 motorcycle accidents in Spain and 5,150 accidents involving cyclists and, far from diminishing, these figures are increasing year after year [DGT 2012].

Motorcycle riders are at fault in only 5% of the accidents in which they are involved with other vehicles [FMM 2011]. Among the causes of accidents not caused by motorcyclists are those derived from riding in parallel with other vehicles, where motorcycles often go unseen because of the so-called "blind spot" of the rear-view mirror, rear impacts to bikes by other vehicles, and distraction of the drivers of other vehicles.

Similarly, in Spain the number of bicycle accidents in cities is relatively low compared with neighboring countries; however, road deaths of Spanish cyclists are among the highest in Europe, with cars being the cause of many of these accidents, mainly due to improper overtaking speed or failure to respect the lateral distance when overtaking, often because of poor visibility or distractions [Mapfre 2013].

Today there is a clear lack of technological solutions aimed at reducing the number of accidents in this group, and the few that exist are expensive, intrusive and complex, requiring external energy sources or additional devices, equipment or interfaces. In fact, VRUs are not included in the current ITS safety use cases [Thielen, D. 2012]. In [VOLVO 2014], Volvo presents a system to improve bicycle safety based on a smartphone connected to a Bluetooth helmet, where an Internet connection is required to include the rider's GPS position in a data cloud. Similarly, [TNO 2014] presents the design of an intelligent bicycle that includes batteries, actuators, communications modules, control units and displays. As mentioned by Tal et al. [Tal, I 2013], it is assumed that the only type of technology suitable for equipping electric bicycles is V2X, due to the energy requirements involved.

At the European level, the WATCH-OVER project [Andreone, L. 2007] focuses on developing solutions for VRUs, including vehicular communications but relying mainly on onboard sensors. These sensors are mainly onboard computer vision [Cheng-En, W. 2013] or laser scanners [Niknejad, HT 2011] [Prabhakar, Y. 2011] and are limited to the visual horizon of the vehicle. Also, Jaguar recently announced in the press a detection system for bicycles and pedestrians [Jaguar, 2015], which is able to detect this type of user using onboard sensors and to warn the driver with sounds and vibration signals.

This paper describes a system for detecting vulnerable road users by means of V2X communications, focused on motorcyclists and cyclists, which takes into account the specific characteristics of each type of transport and includes them in a natural way in vehicular ad-hoc communication networks. A warning system has been implemented on top of the monitoring system, capable of informing the driver of the presence of nearby VRUs. This warning system has been tested in real situations, including highways and rural roads.


In addition, the system presented is compatible with a pedestrian detection system based on smartphones [Anaya, JJ 2014], covering the entire scope of VRUs. This work was conducted at the University Institute for Automotive Research (INSIA) of the Polytechnic University of Madrid (UPM).

2 Detection system of vulnerable road users based on V2X communications

2.1 Vehicle ad-hoc network: devices and protocols

A vehicular ad-hoc network (VANET) can be defined as a particular type of wireless network whose nodes are vehicles or roadside access points, without previously established infrastructure, decentralized and self-organized, and in which multihop data exchange is allowed. Communications made through a VANET, where the vehicle is the main receiver and transmitter of information, are commonly known as vehicle-to-X (V2X) communications [Zeadally, S. 2012].

V2X communications include vehicle-to-vehicle and vehicle-to-infrastructure information exchange based on dedicated short range communications (DSRC) technology, which manages data exchange between vehicles and VANET roadside nodes.

Each onboard or roadside communication unit is considered an ITS station, which must follow the same protocols and communication standards to ensure interoperability between nodes. Five standardization bodies are currently working to define the different elements of the communication architecture: the International Organization for Standardization (ISO), the European Committee for Standardization (CEN), the European Telecommunications Standards Institute (ETSI), SAE International and the Institute of Electrical and Electronics Engineers (IEEE). In some cases, these institutions are linked to a particular geographical area, producing equivalent standards which are not, in general, harmonized.

Therefore, each VANET node must contain a DSRC ITS station to provide network access to the vehicle. For the work presented in this article, we have used the ITS DSRC stations developed at INSIA to provide this access, including hardware, protocols and application software. This ITS INSIA station follows current communications standards and provides V2X connectivity. Furthermore, a rich set of interfaces has been added to the platform to increase its connectivity features: GPS, CAN bus, Bluetooth and Wi-Fi.

Figure 1 shows the architecture of the ITS INSIA station, following the OSI reference model adapted to support ITS communications and supported by the ETSI EN 302 363 standards. In our case we have selected the European family of standards (ETSI) for the higher layers and the IEEE family of standards for the lower layers in order to provide communication services.


[Figure 1 depicts the protocol stack of the ITS INSIA station: security and efficiency applications at the application layer, the ETSI EN 302 636-5-1 Basic Transport Protocol and TCP at the transport layer, ETSI EN 302 636-4-1 GeoNetworking at the network layer, IEEE 802.2 at the LLC layer, IEEE 802.11 at the MAC layer and IEEE 802.11p at the PHY layer.]

Fig. 1. OSI stack model of protocols of the ITS INSIA station.

The different protocols used in the reference model of the ITS INSIA station are described below:

ETSI EN 302 636-5-1: Intelligent Transport Systems (ITS); Vehicular Communications; GeoNetworking; Part 5: Transport Protocols; Sub-part 1: Basic Transport Protocol. The basic transport protocols for the ITS station architecture are included here, together with TCP.

ETSI EN 302 636-4-1: Intelligent Transport Systems (ITS); Vehicular Communications; GeoNetworking; Part 4: Geographical addressing and forwarding for point-to-point and point-to-multipoint communications; Sub-part 1: Media-Independent Functionality. The network-level definitions are included here, including geographical addressing for the ITS station architecture.

IEEE 802.2 defines the Logical Link Control (LLC), which is the top sublayer of the data link layer for LANs. The LLC sublayer provides a uniform user interface for the data link service, usually to the network layer.

IEEE 802.11 is the standard for Information Technology - Telecommunications and information exchange between systems - Local and metropolitan area networks - Specific requirements - Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY).

IEEE 802.11p is an approved amendment to the IEEE 802.11 standard that defines the enhancements to 802.11 required to support intelligent transport systems (ITS). This protocol stack is implemented and installed in the ITS INSIA stations to ensure connectivity in the VANET.

2.2 Detection system for VRUs

Description. Cyclists and motorcyclists are the two groups of vulnerable road users with least access to technological aids to improve their safety. In part, this is due to the characteristics of their vehicles, in which the human body is the main part of the chassis and the possibilities of including equipment and additional user interfaces are very limited because of the lack of space, the lack of a power supply and the small extra weight that can be carried.

In this case, we have developed a VRU detection system tailored specifically to this group and taking its limitations into account, so that it can be used in real situations.

The main priority of the detection system is therefore to prevent collisions involving vulnerable road users. Consequently, the system is designed to maintain a one-way flow of information between the vulnerable and the conventional vehicle, keeping the non-vulnerable driver informed in real time about the presence and location of vulnerable road users and warning when an accident is possible.

The basis of the system is vehicular communications, which maintain an electronic horizon between users that is broader than the much more limited visual horizon of the road, the latter being the cause of most accidents. This feature makes the system especially suited to country or mountain roads, which is where accidents involving vulnerable users are most common [Mapfre 2013].

Architecture. As mentioned above, the system architecture has been defined so as to adapt the equipment and resources to the type of vehicle and the driver interaction capabilities. Figure 2 shows the functional architecture of the VRU detection system.


Fig. 2. Functional architecture of the VRU detection system.

In this architecture, the vehicles that have their own source of energy carry a V2X communications unit with extensive connectivity capabilities in order to provide information to the VANET. In this case, these vehicles are cars and motorcycles. The warning system uses a smartphone as the man-machine interface through the "MotoWarn" application. This application is connected to the V2X vehicle unit through Wi-Fi and a TCP socket, and it is able to retrieve information about the VRUs circulating in the VANET, in particular data such as distance, speed, identification or GPS position. With this information, the smartphone calculates the distance to the VRU and the possibility of interaction, and, if necessary, it warns the driver with an audio message and a display on the screen showing the position with respect to the car. The position information is supplied to the VANET by the VRUs in two ways. A motorcycle is equipped with its own V2X unit and transmits its position and trajectory to the VANET using standard messages; these messages are received by nearby cars and stored in the database of their V2X modules, identifying them as motorcycles.
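As a rough illustration of this smartphone-side connection, the sketch below shows how an application could read VRU updates from the V2X unit over a TCP socket. The address, port and the line-based JSON message format are assumptions made for the example; the paper does not specify the actual wire format of the ITS INSIA unit.

```python
# Hedged sketch of the smartphone-side TCP client described above. The unit
# address and the one-JSON-object-per-line format are assumptions for
# illustration only; the real ITS INSIA wire format is not specified here.
import json
import socket

V2X_UNIT_ADDR = ("192.168.1.1", 5000)   # hypothetical address/port of the onboard V2X unit

def read_vru_updates():
    """Yield VRU records (id, type, distance, speed, position) sent by the V2X unit."""
    with socket.create_connection(V2X_UNIT_ADDR) as sock:
        stream = sock.makefile("r")
        for line in stream:
            yield json.loads(line)

for vru in read_vru_updates():
    # Warn about motorcycles reported closer than the 50 m threshold from the text.
    if vru.get("type") == "motorcycle" and vru.get("distance", float("inf")) < 50:
        print("Warn driver: motorcycle within 50 m of the vehicle")
```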

Bicycles, in contrast, cannot carry complex equipment, so their communication capabilities are reduced. In this case, it was decided to install only one onboard iBeacon device. The iBeacon is a new class of inexpensive low-power transmitter that can report its presence to nearby Bluetooth LE devices. It has a small size and its own battery, so it is very suitable for installation on a bicycle without affecting its use. The iBeacon sends Bluetooth signals that can be received by the Bluetooth interface of the V2X units within a range of 50 meters. This means that a V2X unit can detect bicycles at this distance and, once detected, include their information in the VANET for use by any high-level application that requires it, in this case the "MotoWarn" application.

In this way, the driver can be warned of the presence of VRUs in the vicinity of the vehicle using a smartphone application that takes advantage of the information available about these users in the VANET, where the bicycle data is now also included, even though bicycles were not part of the vehicular network before.


Configuration. On the one hand, each of the applications operates within a specific range in order to avoid unnecessary warnings; on the other hand, it exploits the maximum range of the communications. The ITS INSIA communication unit has a range of over 300 meters, so motorcycles moving in the vicinity will be connected to the vehicular network at least from that distance, which is nevertheless too far to start generating warnings to the driver, since it would risk saturating the driver with information. Therefore, for the motorcycle proximity warning system, it has been established that warnings to the driver through the MotoWarn application are generated when the distance is less than 50 meters around the vehicle that equips the system. In addition to the longitudinal distance, the transverse distance to the vehicle itself is checked, i.e. the lane in which the motorcycle circulates with respect to the equipped vehicle. Furthermore, MotoWarn allows a grid to be defined that has the vehicle in question as its center and in which the motorcycle is located in one of the cells in order to inform the driver. Specifically, we selected a grid of 5x5 cells with a size of 3x6 meters for the row where the vehicle is located and 3x13.5 meters for the remaining rows. This means that the system will alert the driver that there is a motorcycle in the vicinity when it is within 30 meters of the vehicle and in its own lane or one of the two adjacent lanes. This grid is depicted in Figure 3.

Fig. 3. MotoWarn motorcycle grid detection.
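A simplified version of this grid check is sketched below: lanes are taken as 3 m wide and the warning zone extends roughly 30 m ahead of and behind the vehicle over the vehicle's own lane and its adjacent lanes. The cell boundaries are a simplification of the 5x5 grid of Figure 3, not the actual MotoWarn implementation.

```python
# Simplified sketch of the proximity grid described in the text; the exact
# MotoWarn cell layout (Fig. 3) is approximated by a lane-width / range check.
LANE_WIDTH_M = 3.0           # lane (column) width, from the text
LONGITUDINAL_RANGE_M = 30.0  # warning distance ahead/behind the vehicle, from the text

def should_warn(dx_forward_m: float, dy_lateral_m: float) -> bool:
    """dx_forward_m: metres ahead (+) or behind (-) the car;
    dy_lateral_m: metres right (+) or left (-) of the car's centreline."""
    lane_offset = round(dy_lateral_m / LANE_WIDTH_M)   # 0 = own lane, +/-1 = adjacent lane
    return abs(dx_forward_m) <= LONGITUDINAL_RANGE_M and abs(lane_offset) <= 1

print(should_warn(25.0, 2.5))   # motorcycle 25 m ahead in the adjacent lane -> True
print(should_warn(40.0, 0.0))   # same lane but beyond the warning range -> False
```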

Obviously, the goal is to warn about motorcycles circulating in blind spots or areas of poor visibility in the same direction of travel, so motorcycles approaching from the opposite direction are filtered out by the system.

Furthermore, the motorcycle detection system is combined with a bicycle detection system. In this case, the operation and objectives are different. To reduce and simplify the installation as much as possible, the bicycles are


equipped with iBeacons that continuously generate information which can be read by any Bluetooth LE device, such as those equipping the ITS INSIA stations. From this information we can obtain the iBeacon handle, the transmission power (Tx_Power) and the received signal strength indicator (RSSI). These data make it possible to detect the presence of a bicycle within the range of the iBeacon, set at 50 meters (Fig. 4). In addition, using equation (1) [Rhodes 2007] the distance to the cyclist can be estimated:

d = 10^((Tx_Power - RSSI) / (10 · Y))    (1)

where d is the estimated distance and Y is a constant related to multipath signal propagation error which, in the case of open space, takes the value Y = 2.

With this configuration, MotoWarn will alert the driver in time when a cyclist is less than 50 meters from the vehicle, also indicating the approximate distance through the labels "In the vicinity", "Close" and "Danger".

Fig. 4. Functional scheme of the cyclist proximity warning system.

This form of presenting the information is necessary because the distance estimate fluctuates depending on many parameters; using coarse levels prevents distracting the driver with confusing or misleading messages.
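The following sketch works through equation (1) with Y = 2 (open space) and maps the resulting distance to the three warning labels. The 50-metre outer limit comes from the text, while the intermediate thresholds chosen here for "Close" and "Danger" are illustrative assumptions, as are the example Tx_Power and RSSI values.

```python
# Worked sketch of equation (1) and the three-level cyclist warning.
# Y = 2 (open space) as stated in the text; the 25 m and 10 m thresholds and
# the sample Tx_Power / RSSI values are assumptions for illustration only.
def ibeacon_distance(tx_power_dbm: float, rssi_dbm: float, y: float = 2.0) -> float:
    """Log-distance estimate: d = 10 ** ((Tx_Power - RSSI) / (10 * Y)), in metres."""
    return 10 ** ((tx_power_dbm - rssi_dbm) / (10.0 * y))

def cyclist_warning(distance_m: float):
    if distance_m > 50:
        return None              # outside the iBeacon range, no warning
    if distance_m > 25:
        return "In the vicinity"
    if distance_m > 10:
        return "Close"
    return "Danger"

d = ibeacon_distance(tx_power_dbm=-59, rssi_dbm=-75)
print(round(d, 1), cyclist_warning(d))   # ~6.3 m -> "Danger"
```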


3 Tests

Two trials have been carried out to demonstrate the operation of the VRU detection system. Both tests were conducted on the premises of INSIA, on a closed circuit and under controlled conditions. Three INSIA vehicles were used: a Mitsubishi iMIEV car equipped with an ITS communication module, a Kimco SuperDynk motorcycle equipped with another ITS communication module, and a bicycle equipped with an iBeacon. The MotoWarn application runs in the car and has been implemented on an Android smartphone.

The first test shows the functionality of the VRU detection system when the motorcycle circulates in the vicinity of the vehicle (Figure 5).

Fig. 5. Test of the motorcycle detection system.

In this test, the trajectories of the car running the MotoWarn application and of the motorcycle equipped with a communication module are represented in Figure 5. These paths correspond to a maneuver in which the motorcycle is traveling at 30 km/h and approaches a vehicle traveling in the same lane and direction at 24 km/h. At one point, the motorcycle driver decides to overtake the car and performs this maneuver. The graph shows the positions of both vehicles at the same time instants, where the approach and overtaking maneuvers can be appreciated. The MotoWarn application begins to alert the driver when the distance between the two vehicles is less than 30 meters, about 11 seconds after the start. From that moment, the application shows the driver the exact location of the motorcycle with respect to the car, even if it is in a blind spot, thus helping to avoid a possible accident. Figure 6 shows the sequence of warnings displayed on the application screen.

The second test shows the performance of the system when a bicycle is in the vicinity of the vehicle, on the same path and in the same direction as the car (Figure 7).

Fig. 6. Screenshots of the MotoWarn application during the first test.

Thus, in this test a vehicle traveling at 25 km/h meets a bicycle circulating in the same direction at a speed of 5 km/h.


Fig. 7. Test of the bicycle detection system.

In this case, Figure 7 shows that the car receives the signal from the iBeacon when it is less than 50 meters from the bicycle, and a warning appears for the driver, which is maintained while the overtaking maneuver takes place. The total detection time is 27 seconds, after which the car is ahead of the bicycle and continues its route safely.

4 Conclusions

We have developed a detection system for vulnerable road users based on standardized V2X communications using the ITS INSIA communication modules. This detection system takes advantage of the connectivity of these communication modules, as well as of the functionality offered by iBeacons and smartphones. The system has been designed, implemented and tested on real vehicles, achieving the expected results in terms of safety and efficiency, and constituting a first step towards improving the safety of cyclists and motorcyclists.


Acknowledgements

This work has been developed within the following projects: State Programme for Research, Development and Innovation Oriented to Societal Challenges TRA 2013-48314-C3-2-R, DGT SPIP20141452 and Madrid SEGVAUTO-TRIES S2013/MIT-2713.

References

1. H2020 (2013). HORIZON 2020 WORK PROGRAMME 2014 – 2015. Smart, green and

integrated transport. European Commission 2013.

2. DGT (2012). Las principales cifras de la siniestralidad vial. DGT, 2012.

3. FMM, (2011). I Estudio de siniestralidad vial en motocicletas. Fundación Mutua Madrileña.

2011.

4. Mapfre (2013). Ciclistas: Cascos y lesiones en la cabeza. Fundación Mapfre, 2013.

5. Thielen, D. (2012). Thielen, D.; Lorenz, T.; Hannibal, M.; Koster, F.; Plattner, J., "A

feasibility study on a cooperative safety application for cyclists crossing intersections,"

Intelligent Transportation Systems (ITSC), 2012 15th International IEEE Conference on,

vol., no., pp.1197-1204, 16-19 Sept. 2012.

6. Tal, I (2013). Tal, I.; Tianhua Zhu; Muntean, G.-M., "Short paper: On the potential of V2X

communications in helping electric bicycles saving energy," Vehicular Networking

Conference (VNC), 2013 IEEE , vol., no., pp.218-221, 16-18 Dec. 2013.

7. Andreone, L. (2007). L. Andreone, F. Visintainer, and G. Wanielik, "Vulnerable road users

thoroughly addressed in accident prevention: the WATCH-OVER European project,"

2007.

8. Cheng-En, W. (2013). Cheng-En Wu; Yi-Ming Chan; Li-Chen Fu; Pei-Yung Hsiao; Shin-

Shinh Huang; Han-Hsuan Chen; Pang-Ting Huang; Shao-Chung Hu, "Combining multiple

complementary features for pedestrian and motorbike detection," Intelligent

Transportation Systems - (ITSC), 2013 16th International IEEE Conference on , vol., no.,

pp.1358,1363, 6-9 Oct. 2013.

9. Niknejad, H.T. (2011). Niknejad, H.T.; Takahashi, K.; Mita, S.; McAllester, D., "Embedded

multi-sensors objects detection and tracking for urban autonomous driving," Intelligent

Vehicles Symposium (IV), 2011 IEEE , vol., no., pp.1128,1135, 5-9 June 2011.

10. Prabhakar, Y. (2011). Prabhakar, Y.; Subirats, P.; Lecomte, C., "A new method for the

detection of motorbikes by laser rangefinder," Communications and Signal Processing

(ICCSP), 2011 International Conference on , vol., no., pp.246,249, 10-12 Feb. 2011.


11. Anaya, J.J. (2014). JJ Anaya, P Merdrignac, O Shagdar, F Nashashibi, JE Naranjo,

“Vehicle to pedestrian communications for protection of vulnerable road users”, Intelligent

Vehicles Symposium Proceedings, 2014 IEEE, 1037-1042.

12. Volvo(2014), https://www.media.volvocars.com/global/en-

gb/media/pressreleases/155565/volvo-cars-and-poc-to-demonstrate-life-saving-wearable-

cycling-tech-concept-at-international-ces-201

13. TNO(2014), https://www.tno.nl/en/about-tno/news/2014/12/elderly-safer-on-the-road-with-

first-intelligent-bike/

14. Jaguar(2015). http://newsroom.jaguarlandrover.com/en-in/jlr-

corp/news/2015/01/jlr_bike_sense_200115/

15. Zeadally, S. (2012). S. Zeadally, R. Hunt, Y. Chen, A. Irwin, A. Hassan, “Vehicular ad hoc

networks (VANETS): status, results, and challenges”, Telecommunication Systems,

August 2012, Volume 50, Issue 4, pp 217-241.

16. Rodas (2007), Rodas, J., Fernández-Caramés, T.M., Iglesia, D.I., Escudero, C.J.,

“Sistema de Posicionamiento Basado en Bluetooth con Calibrado Dinámico”, Proc. URSI,

Santa Cruz de Tenerife, Spain, 2007.


Implementing Uniform Reliable Broadcast in Anonymous Distributed Systems with Fair Lossy Channels ?

Jian Tang1, Mikel Larrea2, Sergio Arevalo3, and Ernesto Jimenez3,4

1 Distributed Systems Laboratory (LSD), Universidad Politecnica de Madrid, 28031 Madrid, Spain
2 University of the Basque Country UPV/EHU, 20018 San Sebastian, Spain
3 Universidad Politecnica de Madrid, 28031 Madrid, Spain
4 Prometeo Researcher, Escuela Politecnica Nacional, 170515 Quito, Ecuador
{sergio.arevalo,ernes}@eui.upm.es

Abstract. Uniform Reliable Broadcast (URB) is an important abstraction in distributed systems, offering delivery guarantee when spreading messages among processes. Informally, URB guarantees that if a process (correct or not) delivers a message m, then all correct processes deliver m. This abstraction has been extensively investigated in distributed systems where all processes have different identifiers. Furthermore, the majority of papers in the literature usually assume that the communication channels of the system are reliable, which is not always the case in real systems. In this paper, the URB abstraction is investigated in anonymous asynchronous message passing systems with fair lossy communication channels. Firstly, a simple algorithm is given to solve URB in such a system model assuming a majority of correct processes. Then a new failure detector class AΘ is proposed. With AΘ, URB can be implemented with any number of correct processes. Due to the message loss caused by fair lossy communication channels, every correct process in this first algorithm has to broadcast all URB delivered messages forever, which makes the algorithm non-quiescent. In order to get a quiescent URB algorithm in anonymous asynchronous systems, a perfect anonymous failure detector AP* is proposed. Finally, a quiescent URB algorithm using AΘ and AP* is given.

Keywords: Fault-tolerance, Uniform Reliable Broadcast, Message Passing System, Anonymous System, Asynchronous System, Fair Lossy Channel, Failure Detector, Quiescence.

? This paper has been accepted by the 17th Workshop on Advances in Parallel and Distributed Computational Models with IPDPS 2015.


1 Introduction

The broadcast communication abstraction plays an important role in fault-tolerant distributed systems. It is used to disseminate messages among a set of processes, and it has several different forms according to its quality of service [1]. Uniform Reliable Broadcast (URB) is the form which offers the best quality of service; it was proposed by Hadzilacos and Toueg ([2], [3], [4]). Uniform Reliable Broadcast, with URB-broadcast() and URB-deliver() operations, guarantees that if a process (no matter correct or not) delivers a message m, then all correct processes deliver m.

This service has been extensively studied in the non-anonymous system model, where each process has a unique identifier, usually assuming that communication channels are reliable (if a process p sends a message m to a process q, and q is correct, then q eventually receives m) or quasi-reliable (if a process p sends a message m to a process q, and the two processes are correct, then q eventually receives m) [5]. However, real channels are neither always reliable nor quasi-reliable; most of them are unreliable (e.g., fair lossy, which means that if a message is sent an arbitrary but finite number of times, there is no guarantee on its reception; the channel can lose an infinite number of messages [6]). In this regard, several works have addressed the construction of reliable channels over unreliable channels in non-anonymous systems ([6], [7]).

As far as we know, the first research on anonymous systems was conducted by Angluin [8], which led to the works of Yamashita and Kameda ([9], [10]). Then, several papers appeared in this field, e.g., on ring anonymous networks and shared memory anonymous systems ([11], [12], [13], [14]). In [15], the reliable broadcast abstraction has been studied in anonymous systems assuming reliable channels.

In classic message passing distributed systems, processes communicate with each other by sending and receiving messages. Because they all have unique identifiers, senders can choose the recipients of their messages, and recipients are aware of the identities of the senders of the messages they receive [16]. However, all these rules have to be changed in anonymous systems. In this paper, each process has a broadcast() communication primitive, with which a process can send a message to all processes (including itself).

Our Contributions This work is devoted to the study of the Uniform Reliable Broadcast (URB) abstraction in anonymous asynchronous message passing distributed systems where processes may crash and communication channels are fair lossy. There are four main contributions in this paper:

– A simple, non-quiescent uniform reliable broadcast algorithm in such a system model assuming a majority of correct processes, which proves that URB can be solved in anonymous asynchronous message passing distributed systems.

– An impossibility result on solving URB without a majority of correct processes.

– Two new classes of anonymous failure detectors, AΘ and AP*.

– A quiescent uniform reliable broadcast algorithm using AΘ and AP*, which does not require a majority of correct processes.

Roadmap This paper is organized as follows. The system model and several definitions are presented in Section 2. A simple and non-quiescent algorithm implementing


uniform reliable broadcast is proposed in Section 3 under the condition of a majority of correct processes. Then, in Section 4, an impossibility result on solving uniform reliable broadcast without the condition of a majority of correct processes is given. In order to circumvent this impossibility result and make the algorithm quiescent, two classes of failure detectors, AΘ and AP*, are proposed in Section 5. Then, a quiescent uniform reliable broadcast algorithm with AΘ and AP* is given in Section 6. Finally, the conclusions are presented in Section 7.

2 System Model and Definitions

In this paper, the anonymous asynchronous distributed system is considered as a system in which processes have no identifiers and communicate with each other via a completely connected network with fair lossy communication channels. Two primitives are used in this system to send and receive messages: broadcast(m) and receive(m). We say that a process pi broadcasts a message m when it invokes broadcasti(m). Similarly, a process pi receives a message m when it invokes receivei(m).

Process The anonymous asynchronous distributed system is formed by a set of n anonymous processes, denoted as Π = {pi}i=1,...,n, such that its size is |Π| = n, where i (1 ≤ i ≤ n) is the index of each process of the system. All processes are anonymous, which means that they have no identifiers and execute the same algorithm. The index i of a process cannot be known by any process of the system; we just use it as a notation like p1, · · · , pn to simplify the description of the algorithms. Furthermore, all processes are asynchronous, that is, there is no assumption on their respective speeds.

There is a global clock whose values are the positive natural numbers. Note that this global clock is an auxiliary concept that we only use for notation; processes cannot check or modify it.

Failure model A process that does not crash in a run is correct in that run, otherwise it is faulty. We use Correct to denote the set of correct processes in a run, and Faulty to denote the set of faulty processes. A process executes its algorithm correctly until it crashes. A crashed process cannot execute any more statements or recover. We also assume that at least one correct process exists in the system (i.e., t ≤ n − 1).

Communication Each pair of processes is connected by bidirectional fair lossy communication channels. Processes communicate among them by sending and receiving messages through these channels. We assume that these channels neither duplicate nor create messages, but may lose messages. In an anonymous system, when a process receives a message, it cannot determine who is the sender of this message.

Fair Lossy Channel A channel between two processes p and q is called a fair lossy channel if it satisfies the following properties [17]:

– Fairness: If p sends a message m to q an infinite number of times and q is correct, then q eventually receives m from p.

– Uniform Integrity: If q receives a message m from p, then p previously sent m to q; and if q receives m infinitely often from p, then p sends m infinitely often to q.
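For intuition only, the following minimal Python sketch simulates a fair lossy channel by dropping each transmission with some probability; repeated sends of the same message make its eventual reception increasingly likely, which mimics (but does not formally capture) the fairness property. The class name and drop probability are illustrative assumptions, not part of the model above.

import random

class FairLossyChannel:
    """Toy simulation of a fair lossy channel: each individual send may be
    dropped, and the channel itself neither creates nor duplicates messages."""

    def __init__(self, drop_probability: float = 0.6) -> None:
        self.drop_probability = drop_probability
        self.delivered = []          # messages that reached the receiver

    def send(self, message) -> None:
        # A single send attempt: the message is lost with the given probability.
        if random.random() >= self.drop_probability:
            self.delivered.append(message)

if __name__ == "__main__":
    channel = FairLossyChannel()
    # Sending the same message many times: fairness says that if it were sent
    # infinitely often to a correct receiver, it would eventually be received.
    for _ in range(20):
        channel.send("m")
    print(f"'m' received {channel.delivered.count('m')} times out of 20 sends")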


Uniform Reliable Broadcast Uniform Reliable Broadcast offers complete delivery guarantees when spreading messages among processes. That is, when a process delivers a message m, then all correct processes have to deliver it. It is also defined in terms of two primitives: URB_broadcast(m) and URB_deliver(m). They satisfy the following three properties:

– Validity: If a correct process broadcasts a message m, then it eventually delivers m.

– Uniform Agreement: If some process delivers a message m, then all correct processes eventually deliver m.

– Uniform Integrity: For every message m, every process delivers m at most once, and only if m was previously broadcast by sender(m).

Failure Detector A failure detector is a module that provides each process with a read-only local variable containing failure information (which may be unreliable) about processes. It can be divided into different classes according to the quality of the provided failure information. It was introduced in [18].

Notation The system model is denoted by AAS_Fn,t[∅] or AAS_Fn,t[D]. AAS_F is an acronym for anonymous asynchronous message passing distributed systems with fair lossy communication channels; ∅ means there is no additional assumption, and D means the system is enriched with a failure detector of class D. The variable n represents the total number of processes in the system, and t represents the maximum number of processes that can crash.

3 Implementing Uniform Reliable Broadcast in AAS_Fn,t[t < n/2]

In this section, a simple implementation algorithm of uniform reliable broadcast under the condition of a majority of correct processes is proposed. The system model of this section is denoted by AAS_Fn,t[t < n/2].

As far as we know, the implementation of the URB abstraction in the classic (non-anonymous) asynchronous systems with a majority of correct processes is simple. In order to ensure the URB termination property, the construction relies on one condition: a message m can be locally URB-delivered to the upper application layer when this m has been received by at least one non-faulty process. As n > 2t, this means that, without risking to be blocked forever, a process may URB-deliver m as soon as it knows that at least t+1 processes have received a copy of m. Obviously, this condition also needs to be satisfied in anonymous distributed systems. However, there is no easy way to identify which correct process has received m in anonymous asynchronous message passing distributed systems, due to the fact that processes have no identifiers. In order to overcome this difficulty, the idea of implementing URB in AAS_Fn,t[t < n/2] is as follows: 1) to add a unique tag to each message by its sender before it is broadcast; 2) to add a unique tag_ack to each acknowledgment message (denoted by ACK) when a process receives a message.

With the idea mentioned above, the URB deliver condition can be expressed in an equivalent way: each process can deliver a message m if it has received a majority of distinct ACKs of m (see Algorithm 1).


Algorithm 1 Uniform Reliable Broadcast in AAS_Fn,t[t < n/2] (code of pi)

1  Initialization
2    sets MSGi, MY_ACKi, ALL_ACKi, URB_DELIVEREDi empty
3    activate Task 1

4  When URB_broadcasti(m) is executed
5    tag ← randomi()
6    insert (m, tag) into MSGi

7  When receivei(MSG, m, tag) is executed
8    if (m, tag) is not in MSGi then
9      insert (m, tag) into MSGi
10   end if
11   if (m, tag, tag_ack) is in MY_ACKi then
12     broadcasti(ACK, m, tag, tag_ack)
13   else
14     tag_ack ← randomi()
15     insert (m, tag, tag_ack) into MY_ACKi
16     broadcasti(ACK, m, tag, tag_ack)
17   end if

18 When receivei(ACK, m, tag, tag_ack) is executed
19   if (m, tag, tag_ack) is not in ALL_ACKi then
20     insert (m, tag, tag_ack) into ALL_ACKi
21   end if
22   if there is a majority of (m, tag, −) in ALL_ACKi then
23     if (m, tag) is not in URB_DELIVEREDi then
24       insert (m, tag) into URB_DELIVEREDi
25       URB_deliveri(m)
26     end if
27   end if

Task 1:
28 repeat forever
29   for every message (m, tag) in MSGi do
30     broadcasti(MSG, m, tag)
31   end for
32 end repeat

Together with the condition of a majority of correct processes, it is guaranteed that at least one correct process has received m.

Description of the algorithm: Algorithm 1 is the implementation algorithm of the uniform reliable broadcast abstraction in AAS_Fn,t[t < n/2]. In this algorithm, two types of messages are transmitted: MSG (a message that needs to be URB delivered) and ACK (reception acknowledgment of a message). Each process manages a random function random() and four local sets:

– MSGi, initialized to empty, records all messages that it has received.


– URB_DELIVEREDi, initialized to empty, records all URB delivered messages.

– MY_ACKi, initialized to empty, records all acknowledgment messages of each message generated by itself.

– ALL_ACKi, initialized to empty, records all acknowledgment messages of each message it has received (from any process).

Then, let us consider a process pi to simplify the description. At the beginning, pi initializes all its sets to empty and activates Task 1 (lines 1-3).

When pi calls URB_broadcast(m), it assigns a unique random tag to this message m, and inserts the pair (m, tag) into MSGi (lines 4-6). Then, this (m, tag) is broadcast forever in Task 1 to propagate it to all processes (lines 28-32).

When pi receives a message (MSG, m, tag) (which may come from itself or from another process), there are three cases:

– If pi receives (MSG, m, tag) from itself for the first time (i.e., this (m, tag) already exists in MSGi, but its ACK message (m, tag, tag_ack) does not exist in MY_ACKi). The process will execute line 14 and generate a random tag_ack to tag the acknowledgment message of (m, tag). Then, pi inserts this acknowledgment message (m, tag, tag_ack) into its set MY_ACKi, and broadcasts (ACK, m, tag, tag_ack) to all processes to acknowledge the reception of (m, tag) (lines 15, 16). This tag_ack is unique for each pair (m, tag), which means that tag_ack cannot be changed for the same pair (m, tag) once it is generated. The local set MY_ACKi is used to maintain this uniqueness, to distinguish the tag_ack generated by itself from the tag_ack received from other processes.

– If pi receives (MSG, m, tag) from another process for the first time (i.e., this (m, tag) is not in MSGi, and its ACK message (m, tag, tag_ack) does not exist in MY_ACKi). It inserts this message into MSGi (lines 8, 9). Then, as in the first case, pi generates a random tag_ack to tag the acknowledgment message of (m, tag) (line 14). Then, pi inserts this acknowledgment message (m, tag, tag_ack) into MY_ACKi, and broadcasts (ACK, m, tag, tag_ack) to all processes to confirm the reception of (m, tag) (lines 15, 16).

– If pi has already received this (m, tag) (i.e., this (m, tag) already exists in MSGi and its ACK message (m, tag, tag_ack) also exists in MY_ACKi), it re-broadcasts the identical acknowledgment message (ACK, m, tag, tag_ack) to all processes in order to confirm the reception of (m, tag), to overcome the message loss caused by the fair lossy communication channels (lines 11, 12).

When pi receives an acknowledgment message (denoted by ACK) for the first time, it inserts this ACK message into its set ALL_ACKi (lines 19-21).

When pi receives a majority of acknowledgment messages (m, tag, tag_ack) of (m, tag) (more than n/2 different tag_ack), and this m with tag has not been URB delivered yet, then pi URB delivers m one time (lines 22-25).

Theorem 1 Algorithm 1 guarantees the properties of URB. The correctness proof of this theorem is straightforward, and it can be found in [20].

Remark: Algorithm 1 can achieve a fast URB_deliver() of a message due to the property of the fair lossy communication channels and the asynchrony of the system. For


example, a process may receive a majority of acknowledgment messages (ACK, m, tag, tag_ack) and URB deliver m (according to line 22). Hence, this URB deliver can be earlier than the reception of (MSG, m, tag). However, this does not violate the properties of URB, even if this fast-delivering process crashes after URB delivering m, because this fast-delivering process has received a majority of acknowledgment messages of m before URB delivering m, which means that a majority of processes have received this m (different processes generate distinct ACKs for the same m). Together with the condition that there is a majority of correct processes, it is guaranteed that at least one correct process has received m. Then, this correct process will broadcast m forever, guaranteeing that all correct processes will receive m. If the fast-delivering process is correct, it will eventually receive m from other correct processes.

It is necessary to generate a unique tag for each MSG and a unique tag_ack for each ACK in this algorithm. However, it is possible that one random value is shared by two messages only if one is of MSG type and the other one is of ACK type.
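To make the message flow of Algorithm 1 concrete, the following Python sketch shows a single process's handlers under the same tagging scheme. It is a simplified illustration of the pseudocode above (no real networking; the total number of processes n and the send callback are assumed parameters), not the authors' implementation.

import random

class URBProcess:
    """Simplified single-process view of Algorithm 1: tags, ACK counting and
    delivery once a majority of distinct ACKs for (m, tag) has been received."""

    def __init__(self, n: int, send):
        self.n = n                    # total number of processes in the system
        self.send = send              # callback standing in for broadcast_i()
        self.msgs = set()             # MSG_i: (m, tag) pairs to keep rebroadcasting
        self.my_acks = {}             # MY_ACK_i: (m, tag) -> own tag_ack
        self.all_acks = {}            # ALL_ACK_i: (m, tag) -> set of tag_ack seen
        self.delivered = set()        # URB_DELIVERED_i

    def urb_broadcast(self, m):
        self.msgs.add((m, random.random()))          # lines 4-6

    def on_receive_msg(self, m, tag):
        self.msgs.add((m, tag))                      # lines 7-10
        if (m, tag) not in self.my_acks:             # lines 13-16
            self.my_acks[(m, tag)] = random.random()
        self.send(("ACK", m, tag, self.my_acks[(m, tag)]))

    def on_receive_ack(self, m, tag, tag_ack):
        acks = self.all_acks.setdefault((m, tag), set())
        acks.add(tag_ack)                            # lines 18-21
        if len(acks) > self.n / 2 and (m, tag) not in self.delivered:
            self.delivered.add((m, tag))             # lines 22-25
            print(f"URB_deliver({m!r})")

    def task1_step(self):
        for (m, tag) in self.msgs:                   # lines 28-32 (one iteration)
            self.send(("MSG", m, tag))

A full simulation would wire several such objects through lossy channels and call task1_step() repeatedly; the sketch only mirrors the per-process bookkeeping.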

4 An Impossibility Result

In this section, it is proved that the assumption of a majority of correct processes in Algorithm 1 is a necessary condition to solve URB in AAS_Fn,t[∅] without any other additional assumption.

Theorem 2 It is impossible to solve URB in AAS_Fn,t[∅] without a majority of correct processes.

Proof: The proof is by contradiction. Let us suppose there exists an algorithm A that solves URB in AAS_Fn,t[t ≥ n/2]. Then we divide all processes of the system into two subsets S1 and S2, such that |S1| = ⌈n/2⌉ and |S2| = ⌊n/2⌋. Now, we consider two runs: R1 and R2.

– Run R1. In this run, all processes of S2 crash initially, and all the processes in S1 are non-faulty. Moreover, a process in S1 issues URB_broadcast(m). Due to the very existence of the algorithm A, every process of S1 URB delivers m.

– Run R2. In this run, all processes of S2 are non-faulty, and no process of S2 ever issues URB_broadcast(). The processes of S1 behave as in R1: a process issues URB_broadcast(m), and they all URB deliver m. Moreover, after it has URB delivered m, every process of S1 crashes, and all messages ever sent by the processes of S1 are lost, so none of them is ever received by a process of S2. Hence, no process in S2 will URB deliver m.

It is easy to see that all processes of S1 cannot distinguish run R2 from run R1 before they URB deliver m, so they behave as they did in run R1. Then, after all processes in S1 have crashed, together with the fair lossy channels, no process in S2 has received m. This violates the uniform agreement of URB, so the algorithm A does not exist. We complete the proof of Theorem 2.

5 Two Failure Detector Classes

In this section, two classes of failure detectors are proposed. One is used to circumvent the impossibility result mentioned above, and the other is used to make Algorithm 1 quiescent.


5.1 Failure Detector AΘ

Following the previous impossibility result, one question appears naturally: what extra information is needed if the uniform reliable broadcast abstraction is to be implemented under the assumption that any number of processes can crash? The answer is the confirmation that a message m has been received by at least one correct process pj before a process pi (i ≠ j) URB delivers this m. Thanks to the failure detector abstraction proposed by S. Toueg, this confirmation can be guaranteed by the usage of the (unreliable) failure information it provides. In this section, we try to circumvent this impossibility result by using a failure detector.

In non-anonymous systems, the failure detector Θ is considered the weakest one to solve URB. It is defined such that it always trusts at least one correct process (accuracy) and eventually every correct process does not trust any crashed process (completeness) [17]. The counterpart of this Θ in anonymous distributed systems is named AΘ. Then, we try to define AΘ in the anonymous asynchronous distributed systems.

AΘ provides the same failure information as Θ if each process has a unique identifier. However, it is impossible to give such information in anonymous systems because each process has no identifier. So, the key point to define AΘ is how to identify every process without breaking the anonymity of the system. We are inspired by the definition of the failure detector class AΣ, which was introduced by F. Bonnet and M. Raynal [19], to define AΘ. This AΘ provides each process with a read-only local variable a_thetai that contains several pairs (label, number), in which a label represents a temporary identifier of one process and number represents the number of correct processes that know this label. For example, process pj's local variable is a_thetaj = {(label1, number1), ..., (labeli, numberi), ..., (labeln, numbern)} if there are n processes in the system. A label is assigned randomly to each process without breaking the anonymity of the system, due to the fact that each process does not know the mapping relationship between a label and a process (not even for itself).

The definition of AΘ is given as follows:

– AΘ-completeness: There is a time after which the output variable a_theta permanently contains pairs (label, number) associated to all correct processes.

– AΘ-accuracy: If there is a correct process, then at every time, all pairs (label, number) output by the failure detector AΘ are such that every subset T, of size number, of processes that know a label contains at least one correct process (i.e., for each label, there always exists one correct process in any set of number processes that know this label).

Then we give this definition more formally.

Formal definition of AΘ:

S(label) = {i | ∃ τ ∈ N: (label, −) ∈ a_theta_i^τ}. S(label) is the set of all processes that have known the label.

• AΘ-completeness: ∃ τ ∈ N, ∀ i ∈ Correct, ∀ τ′ ≥ τ, ∀ (label, number) ∈ a_theta_i^τ′ : |S(label) ∩ Correct| = number.

• AΘ-accuracy: Correct ≠ ∅ =⇒ ∀ τ ∈ N, ∀ i ∈ Π, ∀ (label, number) ∈ a_theta_i^τ : ∀ T ⊆ S(label), |T| = number: T ∩ Correct ≠ ∅.


5.2 Failure Detector AP*

An algorithm is quiescent if eventually no process sends or receives messages. Hence, it is obvious that Algorithm 1 is a non-quiescent algorithm, since every correct process has to broadcast all URB delivered messages forever. However, a quiescent algorithm is more valuable and practical in real systems. In this section, we try to solve this quiescence problem.

The intuitive idea to obtain a quiescent URB algorithm is to terminate the forever broadcast of Algorithm 1. According to the properties of uniform reliable broadcast, this forever broadcast can be stopped when a message has been URB delivered by all correct processes (i.e., messages that have been URB delivered by all correct processes can be deleted from the set MSG). In order to realize this idea, another failure detector, AP*, is needed to enrich the system model, providing the information of which processes of the system are correct.

The anonymous perfect failure detector AP* provides each process with a read-only local variable a_p* that contains several pairs (label, number), in a way similar to the failure detector AΘ.

To be more clear, the definition of AP* is given as follows:

– AP*-completeness: There is a time after which the output variable a_p* permanently contains pairs (label, number) associated to all correct processes.

– AP*-accuracy: If a process crashes, the label of this process and the number corresponding to the label will eventually and permanently be deleted from the output variable a_p*.

Eventually, the number of pairs (label, number) is equal to the number of correct processes.

Let us define AP* more formally:

S(label) = {i | ∃ τ ∈ N: (label, −) ∈ a_p*_i^τ}, which is the set of all processes that have known this label at time τ according to a_p*_i.

• AP*-completeness: ∃ τ ∈ N, ∀ i ∈ Correct, ∀ τ′ ≥ τ, ∀ (label, number) ∈ a_p*_i^τ′ : |S(label) ∩ Correct| = number.

• AP*-accuracy: ∀ i, j ∈ Π, i ∈ Correct, j ∈ Faulty, ∃ τ : ∀ τ′ ≥ τ : (labelj, numberj) ∉ a_p*_i^τ′.

6 Quiescent Uniform Reliable Broadcast in AAS_Fn,t[AΘ, AP*]

In this section, the anonymous asynchronous distributed system model is enriched with both failure detectors AΘ and AP*, denoted by AAS_Fn,t[AΘ, AP*]. Algorithm 2 is the quiescent implementation algorithm of the uniform reliable broadcast abstraction in AAS_Fn,t[AΘ, AP*] under the assumption that any number of processes can crash. We first give a detailed description of it as follows:

Each process initializes its four sets MSGi, URB_DELIVEREDi, MY_ACKi, ALL_ACKi and activates Task 1 (lines 1-3). We take a process pi as an example to simplify the description. When pi calls URB_broadcast(m), it generates a random tag for this message m and inserts (m, tag) into the set MSGi (lines 4-6). Then, this m


Algorithm 2 Quiescent Uniform Reliable Broadcast in AAS_Fn,t[AΘ, AP*] (code of pi)

1  Initialization
2    sets MSGi, URB_DELIVEREDi, MY_ACKi, ALL_ACKi empty
3    activate Task 1

4  When URB_broadcasti(m) is executed
5    tag ← randomi()
6    insert (m, tag) into MSGi

7  When receivei(MSG, m, tag) is executed
8    if (m, tag) is not in MSGi then
9      if (m, tag) is not in URB_DELIVEREDi then
10       insert (m, tag) into MSGi
11     end if
12   end if
13   if (m, tag, tag_ack) is in MY_ACKi then
14     labelsi ← {label | (label, −) ∈ a_thetai}
15     broadcasti(ACK, m, tag, tag_ack, labelsi)
16   else
17     tag_ack ← randomi()
18     insert (m, tag, tag_ack) into MY_ACKi
19     labelsi ← {label | (label, −) ∈ a_thetai}
20     broadcasti(ACK, m, tag, tag_ack, labelsi)
21   end if

22 When receivei(ACK, m, tag, tag_ack, labelsj) is executed
23   if (m, tag, −, −) is not in ALL_ACKi then
24     allocate array label_counteri[(m, tag), −]
25     allocate array all_labelsi[(m, tag), −]
26   end if
27   if (m, tag, tag_ack) is not in ALL_ACKi then
28     insert (m, tag, tag_ack) into ALL_ACKi
29     all_labelsi[(m, tag), tag_ack] ← labelsj
30     for each label ∈ labelsj do
31       label_counteri[(m, tag), label] ← label_counteri[(m, tag), label] + 1
32     end for
33   else
34     for each label in labelsj but not in all_labelsi[(m, tag), tag_ack] do
35       all_labelsi[(m, tag), tag_ack] ← all_labelsi[(m, tag), tag_ack] ∪ {label}
36       label_counteri[(m, tag), label] ← label_counteri[(m, tag), label] + 1
37     end for
38     for each label in all_labelsi[(m, tag), tag_ack] but not in labelsj do
39       all_labelsi[(m, tag), tag_ack] ← all_labelsi[(m, tag), tag_ack] \ {label}
40       delete label_counteri[(m, tag), label]
41       for each label in both all_labelsi[(m, tag), tag_ack] and labelsj do
42         label_counteri[(m, tag), label] ← label_counteri[(m, tag), label] - 1
43       end for
44     end for
45   end if
46   if ∃ (label, number) ∈ a_thetai: label_counteri[(m, tag), label] = number then
47     if (m, tag) is not in URB_DELIVEREDi then
48       insert (m, tag) into URB_DELIVEREDi
49       URB_deliveri(m)
50     end if
51   end if

Task 1:
52 repeat forever
53   for every message (m, tag) in MSGi do
54     broadcasti(MSG, m, tag)
55     if for each pair (label, number) ∈ a_p*i: label_counteri[(m, tag), label] = number ∧ all_labelsi[(m, tag), −] = {label | (label, −) ∈ a_p*i} then
56       if (m, tag) is in URB_DELIVEREDi then
57         delete (m, tag) from MSGi
58       end if
59     end if
60   end for
61 end repeat


is broadcast to all processes forever in Task 1, in the form of (MSG, m, tag) (lines 52-54).

When pi receives a message (MSG, m, tag), it first checks whether this (m, tag) already exists in its MSGi. If not, then it checks whether this (m, tag) has already been URB delivered (lines 8, 9). If not, pi inserts this message into MSGi (line 10). Otherwise, pi ignores this reception. Then, there are three cases, as in Algorithm 1:

– If pi receives (MSG, m, tag) from itself for the first time (i.e., this (m, tag) already exists in MSGi, but its ACK message (m, tag, tag_ack) does not exist in MY_ACKi). Then, pi executes line 17, which generates a random tag_ack to tag the acknowledgment message of (m, tag). Then, pi inserts this acknowledgment message (m, tag, tag_ack) into its set MY_ACKi, and reads the label information from its failure detector AΘi. Then, pi broadcasts (ACK, m, tag, tag_ack, labelsi) to all processes to acknowledge the reception of (m, tag) (lines 17-20).

– If pi receives (MSG, m, tag) from another process for the first time (i.e., this (m, tag) exists neither in MSGi nor in URB_DELIVEREDi, and its ACK message (m, tag, tag_ack) does not exist in MY_ACKi). It inserts this message into MSGi (line 10). Then, pi does the same as in the first case (lines 17-20).

– If pi has already received this (m, tag) (i.e., this (m, tag) already exists in MSGi or URB_DELIVEREDi and its ACK message (m, tag, tag_ack) also exists in MY_ACKi), it re-broadcasts the identical acknowledgment message, but with updated label information, (ACK, m, tag, tag_ack, labelsi) to all processes, in order to confirm the reception of (m, tag) and overcome the message loss caused by the fair lossy communication channels (lines 13-15).

When pi receives an acknowledgment message (ACK, m, tag, tag_ack, labelsj) from pj (which could be itself), there are three cases as follows:

– pi receives the very first ACK message of (m, tag), which also means that this is the first time it receives an ACK message with tag_ack (checked by whether (m, tag) exists in the set ALL_ACKi or not), i.e., this is the first ACK message from one process (a tag_ack represents a process). pi allocates an array label_counteri[(m, tag), −] (used to record the number of occurrences of every label received together with (m, tag)) and an array all_labelsi[(m, tag), −] (used to record all labels in each ACK message of (m, tag)) (lines 23-25).

– pi receives an ACK message coming from a new process (checked by whether (m, tag, tag_ack) exists in ALL_ACKi or not). (Case 1 is naturally included in case 2, but case 2 considers not only the very first ACK but also later ACKs from other processes.) pi first inserts (m, tag, tag_ack) into the set ALL_ACKi, and labelsj into the array all_labelsi[(m, tag), tag_ack]. After that, for each received label in labelsj, pi increases its count by 1 (lines 30-32) (the increment reflects that every such label is known by the process that generated this ACK with tag_ack).

– pi receives a repeated ACK message (with the same tag_ack) (i.e., one process re-broadcast an ACK due to the fair lossy channel).


There are two mutually exclusive cases: 1) a repeated ACK with more (new) label information (lines 34-37); 2) a repeated ACK with less label information (due to the completeness property of AΘ, which needs some time to delete the label of a crashed process) (lines 38-44). In one instance of Algorithm 2, only one of these two cases can happen. In case 1, for each new label, pi inserts it into all_labelsi[(m, tag), tag_ack] and increases its count by 1. In case 2, for each disappeared label, pi deletes it from all_labelsi[(m, tag), tag_ack] together with its corresponding label counter, and decreases by 1 the count of each repeatedly received label (this label had been counted once too many because of the ACK from the crashed process).

After counting the number of occurrences of each label, if there exists one pair (label, number) output by AΘi satisfying the condition that the count of this label recorded in label_counteri[(m, tag), label] is equal to the output number, then pi checks whether this m has already been URB delivered. If not, pi URB delivers m one time.
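As a small illustration of this delivery rule (line 46 of Algorithm 2), the snippet below checks, for a given (m, tag), whether some (label, number) pair currently output by AΘ has been acknowledged by exactly number distinct processes. The data structures are simplified Python stand-ins for label_counter and a_theta, with invented example values.

# Simplified stand-ins for the structures used at line 46 of Algorithm 2.
# a_theta: current output of the AΘ failure detector, as (label, number) pairs.
# label_counter: for one (m, tag), how many distinct ACKs carried each label.

def can_urb_deliver(a_theta: set, label_counter: dict) -> bool:
    """Line 46: deliver if some label has been acknowledged by exactly `number`
    processes, i.e. by a set that is guaranteed to contain a correct one."""
    return any(label_counter.get(label, 0) == number
               for (label, number) in a_theta)

if __name__ == "__main__":
    a_theta = {("x7", 2), ("k3", 3)}            # illustrative AΘ output
    label_counter = {"x7": 2, "k3": 1}          # ACK counts for one (m, tag)
    print(can_urb_deliver(a_theta, label_counter))   # True, thanks to label "x7"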

In Task 1, for each pair (label, number) in the output of AP*i, if the following condition is satisfied (line 55): the count of each label, label_counteri[(m, tag), label], is equal to the corresponding number (meaning that it has received number different ACKs (tag_ack) of (m, tag)), and the set of labels received for (m, tag), all_labelsi[(m, tag), −], is equal to the label set output by AP*i, {label | (label, −) ∈ a_p*i} (meaning that the received ACKs (tag_ack) come from the correct processes), and together with the fact that (m, tag) has already been URB delivered, then pi deletes (m, tag) from its MSGi set (line 57).

Lemma 1: If a correct process broadcasts a message m, then it eventually delivers m. (Validity)

Proof: Let us consider a non-faulty process pi that broadcasts m. A unique random tag is assigned to this message m (line 5), and (m, tag) is inserted into the set MSGi to be broadcast a bounded but unknown number of times (until the condition of line 55 is satisfied) in Task 1 (lines 52-54). Together with the fairness property of the fair lossy channels, all correct processes (including pi) will receive this m eventually.

Then, when a correct process receives (MSG, m, tag) for the first time, it generates a second unique tag_ack for the corresponding acknowledgment message and broadcasts it to all processes. Due to the bounded but unknown number of broadcasts of (MSG, m, tag) in Task 1 of pi, each correct process receives it a bounded but unknown number of times. Hence, each process broadcasts an acknowledgment message a bounded but unknown number of times too. By the same fairness property of the communication channels, pi will receive all acknowledgment messages of (m, tag) from the correct processes. Then, it is obvious that the condition of line 46 is satisfied, and pi delivers m. We complete the proof of Lemma 1.

Lemma 2: If some process delivers a message m, then all correct processes eventually deliver m. (Uniform Agreement)

Proof: To prove this lemma, we consider the following two cases:

Case 1: A message m is delivered by a correct process.

Suppose this correct process is pi. Then, according to lines 52-54 of Task 1, pi will broadcast m a bounded but unknown number of times (until the condition of line 55 is satisfied) to all processes. With the fairness property of the channels, all correct processes will eventually receive m. Then, all correct processes will do the same as pi and broadcast this


m a bounded but unknown number of times. Together with Lemma 1, all correct processes eventually deliver this m.

Case 2: A message m is delivered by a crashed process.

The condition of line 46 was satisfied before this crashed process delivered m. Due to the accuracy property of AΘ, at least one correct process has received this m. Then, this correct process will broadcast m a bounded but unknown number of times (until the condition of line 55 is satisfied). Together with Lemma 1, it is obvious that all correct processes will deliver m.

Following cases 1 and 2, we can see that Lemma 2 is correct.

Lemma 3: For every message m, every process delivers m at most once, and only if m was previously broadcast by sender(m). (Uniform Integrity)

Proof: It is easy to see that any message m was previously broadcast by its sender, because each process only forwards messages it has received and the fair lossy channel does not create, duplicate, or garble messages.

To prove that a message is delivered at most once, let us observe that two kinds of tags exist in the system: one is used to label the message itself, and one is used to label the acknowledgment of this message. The set MY_ACKi is used to guarantee that each process broadcasts the identical acknowledgment message for the same (m, tag) (line 18). The set URB_DELIVEREDi records all messages that have been delivered (line 48).

Even though each message is broadcast a bounded but unknown number of times (until the condition of line 55 is satisfied) and will be received by every correct process a bounded but unknown number of times (lines 52-54), one message cannot be modified or relabeled as a new message, due to the tags and sets mentioned above. Moreover, every message is checked for membership in the set URB_DELIVEREDi (line 47) before URB delivering it. With those mechanisms, it is certain that no message m will be delivered more than once. Hence, the proof is completed.

Theorem 3 Algorithm 2 is a quiescent implementation of the uniform reliable broadcast communication abstraction in AAS_Fn,t[AΘ, AP*].

Proof: From Lemmas 1, 2 and 3, it is easy to see that Algorithm 2 is an implementation of uniform reliable broadcast. Then, it is only necessary to prove that Algorithm 2 satisfies the quiescence property.

An algorithm is quiescent if eventually no process sends or receives messages. In Algorithm 2, it is obvious that the broadcast of an ACK message is invoked by the reception of a MSG message. Hence, the proof reduces to showing that the number of broadcasts of MSG messages is finite. Moreover, a faulty process only broadcasts MSG messages a finite number of times. Hence, the rest of this proof focuses on showing that each correct process broadcasts MSG messages a finite number of times.

It is easy to see that the broadcast of MSG only exists in Task 1. Let us consider two processes p (correct) and q, such that p broadcasts (MSG, m, tag) to q a bounded but unknown number of times (p repeats the broadcast of (MSG, m, tag) until the condition of line 55 is satisfied).

– If q is correct, then eventually both p and q receive this MSG a bounded but unknown number of times, due to the fairness property of the fair lossy communication channels; then p delivers m only once, upon the first reception of m. Since q broadcasts


(ACK, m, tag, labelq) to p each time it receives MSG, q broadcasts ACK to p the same number of times that it receives MSG. (p can also receive MSG from itself and broadcast its own ACK; here, we only take q as an example.) By the fairness property of the channels, p receives ACK a bounded but unknown number of times. According to lines 22-51, p has to count every label present in the ACKs received from q and from itself, obtaining label_counterp[(m, tag), labelq] = 2 and label_counterp[(m, tag), labelp] = 2. Together with the property of the failure detector AP*, the output of AP*p is composed of the labels and numbers of the correct processes, that is, [(labelq, 2), (labelp, 2)]. Then, the condition of line 55 is satisfied, so (m, tag) is deleted from MSG and p stops the repeated broadcast of (MSG, m, tag), which proves that this case is quiescent.

– If q is faulty, then p only receives ACK from itself and, together with the accuracy property of AP*p, the label and corresponding number of q will eventually and permanently be removed from the output of AP*p. Hence, it is trivial that the condition of line 55 is satisfied, which proves that this case is quiescent too.

Hence, according to the description above, we complete the proof.

7 Conclusions

In this paper, we have studied how to implement the uniform reliable broadcast abstraction in anonymous asynchronous message passing distributed systems with fair lossy communication channels. A non-quiescent algorithm under the assumption of a majority of correct processes is proposed first. Then, in order to obtain a quiescent algorithm and to circumvent the impossibility result of implementing URB without the assumption of a majority of correct processes, two classes of failure detectors are defined and used. Finally, the quiescent uniform reliable broadcast algorithm is proposed.

References

1. C. Cachin, R. Guerraoui, and L. Rodrigues. Reliable and secure distributed programming. Springer (second edition), 2011.

2. V. Hadzilacos. Issues of fault tolerance in concurrent computation. Ph.D thesis, HarvardUniversity, 1984.

3. V. Hadzilacos, and S. Toueg. Fault tolerant broadcast and related problems. S.J. Mullender(Ed.), Distributed Systems. New York: ACM Press & Addison-Wesley, 1993.

4. V. Hadzilacos and S. Toueg. A modular approach to fault-tolerant broadcasts and relatedproblems. Technical Report 94-1425, 83 pages, Cornell University, Ithaca (USA), 1994.

5. A. Schiper. Failure detection vs group membership in fault-tolerant distributed systems: hid-den trade-offs. Proceedings of the Second Joint International Workshop on Process Alge-bra and Probabilistic Methods, Performance Modeling and Verification, pp. 1–15, Springer-Verlag, London, 2002.

6. A. Basu, B. Charron-Bost, and S. Toueg. Simulating reliable links with unreliable linksin the presence of process crashes. Proceedings of the 10th International Workshop onDistributed Algorithms, pp. 105–122, Springer-Verlag, London, 1996.


7. Y. Afek, H. Attiya, A. D. Fekete, M. Fisher, N. Lynch, Y. Mansour, D. Wang, and L. Zuck.Reliable communication over unreliable channels. Journal of the ACM, 41(6): pp. 1267–1297, 1994.

8. D. Angluin. Local and global properties in networks of processors (extended abstract).Proceedings of the twelfth annual ACM symposium on Theory of computing (STOC’80), pp. 82–93, ACM New York, 1980.

9. M. Yamashita, and T. Kameda. Computing on anonymous networks, part I: characterizingthe solvable cases. IEEE Transactions on Parallel and Distributed Systems, 7(1): pp. 69–89,1996.

10. M. Yamashita, and T. Kameda. Computing on anonymous networks, part II: decisionand membership problems. IEEE Transactions on Parallel and Distributed Systems,7(1): pp. 90–96, 1996.

11. H. Buhrman, A. Panconesi, R. Silvestri, and P. Vityani. On the importance of having anidentity or is consensus really universal?. Distributed Computing, 18(3), pp. 167–175, 2006.

12. C. Delporte-Gallet, H. Fauconnier and A. Tielmann. Fault-Tolerant consensus in unknownand anonymous networks. Proceeding of 29th IEEE International Conference on DistributedComputing Systems (ICDCS’09), pp. 368–375, 2009.

13. R. Guerraoui and E. Ruppert. Anonymous and fault-tolerant shared-memory computing.Distributed Computing, 20(3), pp. 165–177, 2007.

14. C. Delporte-Gallet, H. Fauconnier, and H. Tran-the. Homonyms with forgeable identifiers.Proceedings of the 19th international conference on Structural Information and Communi-cation Complexity (SIROCCO’12), pp. 171–182. Springer-Verlag Berlin, Heidelberg, 2012.

15. Sergio Arevalo, Ernesto Jimenez, and Jian Tang. Fault-tolerant broadcast in anonymoussystems. Technical Report of Deparmento de Sistemas Informaticos, 21 pages, UniversidadPolitecnica de Madrid, Madrid (Spain), 2014.

16. D. Angluin, J. Aspnes, D. Eisenstat, E. Ruppert. On the power of anonymous one-waycommunication. Principles of Distributed Systems, Lecture Notes in Computer ScienceVolume 3974, pp. 396–411,Springer Berlin Heidelberg, 2006.

17. M. K. Aguilera, S. Toueg, and B. Deianov. Revisiting the weakest failure detector for uni-form reliable broadcast. Proceeding of the 13th International Symposium on DistributedComputing (DISC’99), pp. 19–33, Bratislava, Slovak Republic, 1999.

18. Tushar Deepak Chandra and Sam Toueg. Unreliable Failure Detectors for Reliable Distributed Systems. Journal of the ACM, 43:2, pp. 225–267, March 1996.

19. F. Bonnet and M. Raynal. Anonymous asynchronous systems: the case of failure de-tectors. Proceedings of the 24th International Symposium on Distributed Computing(DISC’10), pp. 206–220, Cambridge, MA, USA, 2010.

20. Jian Tang, Mikel Larrea, Sergio Arevalo, and Ernesto Jimenez. Implementing Reliable Broadcast in Anonymous Distributed Systems with Fair Lossy Channels. Technical Report of University of the Basque Country UPV/EHU, San Sebastian, Spain, 2014. http://www.sc.ehu.es/acwlaalm/research/EHU-KAT-IK-03-14.pdf


A survey on subspace clustering ?

Bo Zhu and Alexandru Mara

Universidad Politecnica de Madrid, Madrid, Spain

Abstract. Subspace clustering techniques were proposed as an extension of traditional clustering algorithms to discover hidden clusters that only exist in certain subsets of the full feature space. In recent years, this research area has been intensively studied: a large number of algorithms have been proposed and successfully applied in many domains. In this paper we conduct a survey on both centralized and parallel subspace clustering algorithms.

Keywords: subspace, parallel, clustering, high-dimensional data

1 Introduction

Clustering is one of the main techniques for unsupervised knowledge discovery out of unlabeled datasets. This technique uses the notion of similarity to group data points into entities known as clusters. Generally, clustering techniques can be classified into several groups: partition-based approaches, such as K-means [1], hierarchical approaches, such as DIANA [2], and density-based ones, such as DBSCAN [3]. These traditional algorithms conduct the clustering process taking the whole feature space into consideration, based on the mutual similarity of the data points. However, due to the great advances in data collection and management, both the number of samples and the number of features of current datasets keep exploding. Clustering huge datasets consisting of a large number of features has become a challenging task, despite the fact that traditional clustering methods are mature enough to handle smaller datasets.

Achieving scalability w.r.t. the number of dimensions is a much more complex task, as most traditional clustering techniques suffer from what is known as the curse of dimensionality [4]. The curse of dimensionality causes distance measurements to become meaningless in very high-dimensional spaces. This is because, when computing a certain distance measurement (e.g. the Euclidean distance) among data points in a very large hyperspace, they all seem to be equidistant from each other, due to the skew introduced by the dimensions in which the points are not clustered. Apart from the problem mentioned above, in the field of high-dimensional clustering, the existence of irrelevant features or of correlation among different

? The research leading to these results has been developed within the ONTIC project, which has received funding from the European Union's Seventh Framework Programme (FP7/2007-2011) under grant agreement n◦ 619633.


subsets of features also has a strong negative influence on the performance of clustering algorithms. To deal with these problems, a number of dimensionality reduction methods were proposed. Feature extraction techniques, e.g. Principal Component Analysis [5], try to transform data from the original high-dimensional space to another space with fewer dimensions. This transformation to a lower-dimensional space maximizes the variance of the original data by means of decomposition techniques, at the cost of data precision and of the interpretability of the transformed dimensions. Feature selection techniques, such as the mRMR feature selection algorithm [6], intend to select an optimal relevant subset of the original features. This type of method rests on the assumption that the original dataset contains irrelevant features that provide no meaningful information in any context and can be removed by filter, wrapper or embedded methods.
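The distance-concentration effect behind the curse of dimensionality mentioned above can be observed with a tiny experiment: as dimensionality grows, the relative gap between the nearest and the farthest point from a query shrinks. The following Python sketch (using only NumPy, with arbitrary sample sizes) illustrates this phenomenon; it is an illustration, not part of any surveyed algorithm.

import numpy as np

def relative_contrast(n_points: int = 500, dims: int = 2, seed: int = 0) -> float:
    """(max_dist - min_dist) / min_dist from a query point to uniform random data:
    large in low dimensions, close to 0 as dimensionality grows."""
    rng = np.random.default_rng(seed)
    data = rng.random((n_points, dims))
    query = rng.random(dims)
    dists = np.linalg.norm(data - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

if __name__ == "__main__":
    for d in (2, 10, 100, 1000):
        print(f"dims={d:4d}  relative contrast={relative_contrast(dims=d):.2f}")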

However, these dimensionality reduction techniques ignore the fact that clusters can be found in different relevant subsets of features. In [13], H.-P. Kriegel et al. proposed the notion of local feature relevance (or local feature correlation), which refers to the phenomenon that different features, or a different correlation of features, may be relevant for different clusters. Here, clusters are extended to clusters of instances in subspaces of the full data space. As a consequence, global dimensionality reduction techniques applied to the whole space of a high-dimensional dataset may generate unsatisfactory results, given the fact that each cluster may exist in a different subspace.

To overcome the limitation of dimensionality reduction techniques, a specialfamily of algorithms, which derived from the frequent pattern mining field[7],rapidly constituted a novel field named subspace clustering. These techniquessolve the two principle problems of high-dimensional clustering:

1) They search for the relevant subspaces of the original feature space toproperly define a certain cluster so that the distance computations will not bedistorted.

2) They detect all the (overlapping or non-overlapping) clusters existing in different subspaces.
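The following toy example illustrates the situation these methods target: two clusters that are compact in different, disjoint subsets of dimensions and pure noise elsewhere (the dimension choices, spreads and sizes are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
d = 10

# Cluster A is tight only in dimensions (0, 1); cluster B only in (2, 3);
# every remaining dimension is uniform noise for both clusters.
A = rng.random((200, d))
A[:, [0, 1]] = rng.normal(loc=0.2, scale=0.02, size=(200, 2))
B = rng.random((200, d))
B[:, [2, 3]] = rng.normal(loc=0.8, scale=0.02, size=(200, 2))
X = np.vstack([A, B])

# The per-cluster spread is small only in each cluster's own relevant dimensions,
# so no single global projection or feature subset captures both clusters.
print(np.round(A.std(axis=0), 2))
print(np.round(B.std(axis=0), 2))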

In this survey we first present an overview of the related work in Section 2. In Section 3 we categorize basic subspace clustering algorithms. In Section 4 we introduce all the parallel subspace clustering algorithms that have been proposed to date. Finally, Section 5 outlines our conclusions.

2 Related work

Many research works have tackled the subspace clustering task in the last two decades. There are several surveys[4, 12–14] that conducted either theoretical or experimental comparisons among different subspace clustering algorithms. These approaches have been categorized into several classes according to different algorithmic aspects. [12] divided a number of classical methods into bottom-up and top-down groups based on the search strategy applied by the algorithms. In some recent surveys[4, 13, 14], subspace clustering algorithms are classified into three paradigms with regard to the underlying cluster definition and parametrization.

Grid-based approaches, e.g. SCHISM[20] and MaxnCluster[21], try to find sets of grid bins that contain more data than a density threshold in different subspaces. Density-based approaches, e.g. FIRES[22], INSCY[23] and PreDeCon[24], search for dense regions separated by sparse regions by computing distances over the relevant dimensions. Clustering-oriented approaches, e.g. STATPC[25] and GAMer[26], define global properties of the whole cluster set and are similar to the top-down approaches of [12].

3 Centralized subspace clustering algorithms

In this survey we categorize basic subspace clustering algorithms into three main families, namely lattice-based algorithms, approximation algorithms and hybrid algorithms.

3.1 Lattice-based algorithms

A large part of the existing subspace clustering algorithms fall into the lattice-based category. As mentioned in [12], either a top-down or a bottom-up traversal is carried out on a lattice, each of whose nodes represents a set of candidates; these candidates can be the units of a grid-based subspace cluster, the windows of a window-based subspace cluster, feature values, features, etc. Potential subspace clusters are sought at each node.

The bottom-up group first generates 1-D histogram information for each dimension and then performs a bottom-up traversal of the subspace lattice. By starting from one dimension and adding another dimension at a time, bottom-up approaches tend to work efficiently in relatively small subspaces. Consequently, they generally show better scalability when uncovering hidden subspace clusters of low dimensionality. However, their performance decreases dramatically with the size of the candidate subspaces in which clusters are found[12], because the time complexity grows (at most) exponentially w.r.t. the dimensionality of the hidden clusters.
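A simplified, CLIQUE[15]-flavoured sketch of such a bottom-up traversal is given below: 1-D dense grid units are computed first, and dense k-dimensional units are then joined into (k+1)-dimensional candidates, of which only the dense ones survive. The equal-width bins, the thresholds and the naive join are illustrative simplifications rather than the original algorithm:

import itertools
import numpy as np

def bin_points(X, n_bins):
    """Discretize every dimension into equal-width bins; returns per-point bin indices."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    scaled = (X - mins) / np.where(maxs > mins, maxs - mins, 1.0)
    return np.minimum((scaled * n_bins).astype(int), n_bins - 1)

def is_dense(bins, unit, min_points):
    """A unit is a dict {dimension: bin index}; check how many points fall inside it."""
    mask = np.ones(len(bins), dtype=bool)
    for dim, b in unit.items():
        mask &= bins[:, dim] == b
    return mask.sum() >= min_points

def bottom_up_dense_units(X, n_bins=8, min_points=20, max_dim=3):
    """Level-wise search for dense grid units in subspaces of growing dimensionality."""
    bins = bin_points(X, n_bins)
    # Level 1: dense units in single dimensions.
    level = [{dim: b} for dim in range(X.shape[1]) for b in range(n_bins)
             if is_dense(bins, {dim: b}, min_points)]
    all_dense, k = list(level), 1
    while level and k < max_dim:
        candidates = []
        for u, v in itertools.combinations(level, 2):
            if any(u[dim] != v[dim] for dim in set(u) & set(v)):
                continue                       # units must agree on shared dimensions
            merged = {**u, **v}
            if len(merged) == k + 1 and merged not in candidates:
                candidates.append(merged)
        # Anti-monotonicity in action: only candidates that are dense survive.
        level = [c for c in candidates if is_dense(bins, c, min_points)]
        all_dense.extend(level)
        k += 1
    return all_dense

Each returned unit is a set of (dimension, bin) constraints; adjacent dense units within the same subspace would then be merged into clusters, a step omitted here for brevity.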

In contrast with bottom-up approaches, top-down methods start from the equally-weighted full feature space and generate an approximation of the set of clusters. After this initialization, updates of the weight of each dimension in each cluster and the regeneration of the clusters are conducted iteratively. Finally, a refinement of the clustering result is carried out to achieve clusters of better quality. Since multiple iterations of the clustering process are conducted in the full feature space, sampling techniques are generally used to increase efficiency at the expense of the accuracy of the clustering results. Clusters generated by this kind of approach are non-overlapping and of similar dimensionality due to the mandatory input parameters.
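The sketch below illustrates this generic top-down loop: it starts from equal weights on the full feature space and alternates a weighted assignment step with a per-cluster weight update in which dimensions with a small within-cluster spread gain weight. The exponential weighting and the bandwidth parameter h are assumptions in the spirit of LAC[29] (discussed in Section 4), not a faithful reproduction of any single algorithm:

import numpy as np

def top_down_subspace_clustering(X, k=3, h=1.0, n_iter=10, seed=0):
    """Minimal sketch of a top-down projected-clustering loop with per-cluster
    dimension weights; parameter names and the weighting scheme are illustrative."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    centroids = X[rng.choice(n, k, replace=False)].astype(float)
    weights = np.full((k, d), 1.0 / d)             # equally-weighted full space
    for _ in range(n_iter):
        # Assignment step: weighted squared Euclidean distance to each centroid.
        dist = np.stack([((X - centroids[j]) ** 2 * weights[j]).sum(axis=1)
                         for j in range(k)], axis=1)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) == 0:
                continue
            centroids[j] = members.mean(axis=0)
            # Dimensions with small within-cluster spread receive larger weight.
            spread = ((members - centroids[j]) ** 2).mean(axis=0)
            w = np.exp(-spread / h)
            weights[j] = w / w.sum()
    return labels, centroids, weights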

Moreover, depending on how the traversal is conducted, it can be further divided into breadth-first and depth-first variants. In breadth-first traversal, all parent nodes in the neighboring upper level must be visited before moving down to the next level of the lattice. In depth-first traversal, in order to prevent duplicate traversals of the same child nodes, the lattice is treated as a set enumeration tree in which each child node is visited by only one of its parents.

Since the time complexity of a naive traversal of the lattice is exponential w.r.t. the dimensionality of the dataset, the majority of these approaches prune candidate subspaces using an anti-monotonicity property to efficiently reduce the search space.

Anti-Monotonicity Property: if no cluster exists in a subspace $S_k$, then no cluster exists in any higher-dimensional subspace $S_{k+1} \supset S_k$ either, i.e.

$\exists S_k : C_{S_k} = \emptyset \;\Rightarrow\; \forall S_{k+1} \supset S_k : C_{S_{k+1}} = \emptyset.$

A similar monotonicity property was first introduced in the research area of frequent itemset mining to efficiently reduce the search space. Representative examples of this kind of approach are CLIQUE[15], ENCLUS[16], DOC[17], PROCLUS[18] and ORCLUS[19].
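In practice the property is exploited Apriori-style: a (k+1)-dimensional subspace remains a candidate only if every one of its k-dimensional sub-subspaces is already known to contain a cluster. A minimal sketch, with subspaces represented simply as sets of dimension indices:

from itertools import combinations

def prune_candidates(candidate_subspaces, clustered_subspaces):
    """Keep a (k+1)-dimensional candidate only if all of its k-dimensional
    sub-subspaces are known to contain a cluster (anti-monotonicity)."""
    clustered = {frozenset(s) for s in clustered_subspaces}
    kept = []
    for cand in candidate_subspaces:
        cand = frozenset(cand)
        if all(frozenset(sub) in clustered
               for sub in combinations(cand, len(cand) - 1)):
            kept.append(cand)
    return kept

# Example: clusters were found in subspaces {0,1}, {0,2} and {1,2}, but not in {1,3}.
clustered = [{0, 1}, {0, 2}, {1, 2}]
candidates = [{0, 1, 2}, {0, 1, 3}]
print(prune_candidates(candidates, clustered))     # only {0, 1, 2} survives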

3.2 Approximation algorithms

Since some of the underlying clustering optimization problems are NP-hard and computationally infeasible, approximation algorithms are used at the cost of the completeness and quality of the clustering results. [32] proposed a greedy and iterative subspace clustering method for uncertain data. Each iteration generates a set of subspace clusters for a set of randomly chosen medoids (representative objects of the clusters, similar to centroids but always members of the dataset) according to specific quality measures, and the subspace cluster with the best quality is selected and output as a final cluster. In the next iteration, instances that appear with high probability in the selected subspace clusters are excluded from the original dataset in order to prevent the same subspace cluster from being generated again. This process is repeated until no new subspace cluster appears. Another example is the RESCU algorithm[33], which performs subspace clustering under a relevance model. This algorithm maintains a list of subspace clusters sorted in descending order of cluster gain. RESCU carries out an iterative process in which, at each iteration, a significant subspace cluster is selected from the list and added to the final cluster set. The process terminates when no significant subspace cluster can be added. FIRES[22] is another representative density-based approximation algorithm.
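The greedy outer loop that these approximation methods share can be outlined as follows; the gain function, which scores a candidate cluster against the clusters already selected, is a hypothetical placeholder for the quality or relevance measures used by the algorithms above:

def greedy_cluster_selection(candidates, gain, min_gain=0.0):
    """Repeatedly pick the candidate subspace cluster with the highest gain with
    respect to the already selected clusters; stop when nothing significant is left."""
    selected, remaining = [], list(candidates)
    while remaining:
        best = max(remaining, key=lambda c: gain(c, selected))
        if gain(best, selected) <= min_gain:
            break                         # no remaining cluster is significant enough
        selected.append(best)
        remaining.remove(best)
    return selected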

3.3 Hybrid algorithms

With the aim of increasing efficiency, some hybrid algorithms combine different algorithms so that the strengths of each are absorbed. For example, the MASC algorithm proposed in [34] combines optimization and frequent pattern mining techniques to discover actionable subspace clusters based on centroids of the original data. It first finds an optimal set of weights that indicates whether an instance belongs to a certain subspace cluster containing a centroid.

A weight threshold is then determined heuristically, and the weights are binarized into 1 and 0 by comparing their values to this threshold. This process is repeated for each feature, generating a transformed binary weight matrix. Finally, maximal biclique subgraphs are sought in the matrix and reported as actionable subspace clusters.
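The binarization step can be pictured with a toy weight matrix; the threshold rule below is a deliberately simple stand-in for MASC's heuristic, and the matrix is random:

import numpy as np

rng = np.random.default_rng(1)
weights = rng.random((6, 4))          # toy 6-instance x 4-feature weight matrix
threshold = weights.mean()            # placeholder for MASC's heuristic threshold
binary = (weights >= threshold).astype(int)
print(binary)

Maximal biclique subgraphs of the bipartite graph encoded by this binary matrix (instances on one side, features on the other) would then be mined as the actionable subspace clusters.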

4 Parallel subspace clustering algorithms

During a thorough survey of the state of the art on subspace clustering, we detected a lack of parallel clustering algorithms on subspaces, despite the significant performance improvements that could be obtained from parallelization. As far as we know, only two parallel implementations have been proposed to date, and they rely on specific architectures such as the Message Passing Interface (MPI) and the Parallel Random Access Machine (PRAM). Compared with the novel Spark framework, these models suffer from a non-negligibly high communication cost and a complex failure recovery mechanism. Apart from this, they do not scale as well as Spark and do not provide much flexibility to programmers. They are also not compatible with current big data analysis tools such as Hadoop, HDFS, etc.

[27] proposed the grid-based parallel MAFIA algorithm, which extends CLIQUE by partitioning each dimension into several discrete adaptive bins with distinct densities. However, an expensive sorting preprocessing step for each subspace is inevitable before the generation of the adaptive grids. The discretization of the original values of each dimension increases efficiency and works well for categorical data, but it causes loss of precision for numerical data and thus decreases the accuracy of the clustering results [4]. What is more, the sensitivity to parameter values such as the size of the bins is another drawback of grid-based algorithms [13]. Experiments in [12] showed that clusters discovered by MAFIA left out some significant dimensions, and that some clusters were reported as two separate clusters instead of being properly merged. Moreover, in order to run the algorithm in parallel, the original dataset is randomly partitioned into several parts that are read into the local disks of different nodes. Although data parallelization based on a shared-nothing architecture can bring a significant reduction of execution time, the generated partitions can be highly skewed, which greatly affects the quality of the clustering result. The parallel architecture used in [27] resembles a naive version of the MapReduce paradigm. However, as [28] claims, for iterative algorithms like subspace clustering, Spark outperforms MapReduce by two orders of magnitude in execution time.
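To illustrate how naturally the data-parallel first phase of such grid-based algorithms maps onto Spark, the sketch below computes the per-dimension histograms (the starting point of CLIQUE/MAFIA) with PySpark RDD operations. It assumes the input values are already scaled into [0, 1) and uses fixed equal-width bins, whereas MAFIA would subsequently merge them into adaptive bins:

from pyspark import SparkContext

def parallel_histograms(sc, points, n_bins=10):
    """Count, in parallel, how many points fall into each (dimension, bin) cell."""
    rdd = sc.parallelize(points)
    counts = (rdd
              .flatMap(lambda p: [((dim, min(int(v * n_bins), n_bins - 1)), 1)
                                  for dim, v in enumerate(p)])
              .reduceByKey(lambda a, b: a + b))
    return counts.collect()                     # [((dimension, bin), count), ...]

if __name__ == "__main__":
    sc = SparkContext(appName="subspace-histograms")
    data = [(0.10, 0.90), (0.12, 0.88), (0.50, 0.40), (0.11, 0.91)]
    print(parallel_histograms(sc, data))
    sc.stop()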

The other parallel algorithm is based on the Locally Adaptive Clustering (LAC) algorithm. LAC was proposed in [29] as a top-down approach that assigns values to a weight vector based on the relevance of each dimension within the corresponding subspace cluster. A priori knowledge is necessary for an appropriate setting of the parameters. LAC maintains all the information, but it also suffers from expensive dataset scans, which leads to extremely long run times. In order to improve the efficiency of the original algorithm, [30] proposed the parallel LAC (PLAC) algorithm.

PLAC transforms the subspace clustering task into the problem of finding K centroids of clusters in an N-dimensional space. Given K x N processors that share a global memory, PLAC distributes the whole dataset over a grid of K centroids with N nodes per centroid. In the reported experiments, one machine was used as the global shared memory, assuming that the whole dataset could fit in a single node. This architectural design severely limits the scalability of PLAC; indeed, the largest dataset reported in [30] contained no more than 10,000 instances.

5 Conclusion

Research on subspace clustering has been an active area for the past decade. Many basic subspace clustering algorithms have been proposed and successfully applied to various domains. In this paper we presented a survey on subspace clustering, categorizing basic algorithms into three families. We also highlighted the lack of parallel subspace clustering algorithms, which is a promising research area.

References

1. E.W. Forgy. Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics 21, 1965, pp. 768-769.

2. L. Kaufman, P.J. Rousseeuw. Finding Groups in Data. John Wiley & Sons, 1990.

3. M. Ester, H.-P. Kriegel, J. Sander, X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. AAAI Press, 1996, pp. 226-231.

4. E. Muller, S. Gunnemann, I. Assent, T. Seidl. Evaluating clustering in subspace projections of high dimensional data. Proceedings of the VLDB Endowment, Vol(2), Issue 1, 2009, pp. 1270-1281.

5. K. Pearson. On Lines and Planes of Closest Fit to Systems of Points in Space. Philosophical Magazine 2 (11), 1901, pp. 559-572.

6. H. Peng, F. Long, C. Ding. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. Pattern Analysis and Machine Intelligence, Vol(27), Issue 8, 2005, pp. 1226-1238.

7. A. Zimek, I. Assent, J. Vreeken. Frequent Pattern Mining Algorithms for Data Clustering. Frequent Pattern Mining, Chapter 16. Springer International Publishing, 2014, pp. 403-423.

8. K. Kailing, H.-P. Kriegel, P. Kroger. Density-Connected Subspace Clustering for High-Dimensional Data. Proceedings of the 4th SIAM Int. Conf. on Data Mining, 2004, pp. 246-257.

9. J. Dean, S. Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, Vol(51), Issue 1, 2008, pp. 107-113.

10. K. Shvachko et al. The Hadoop distributed file system. Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, IEEE, 2010.

11. M. Zaharia et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, USENIX Association, 2012.

12. L. Parsons, E. Haque, H. Liu. Subspace clustering for high dimensional data: a review. ACM SIGKDD Explorations Newsletter, Vol(6), Issue 1, 2004, pp. 90-105.

13. H.-P. Kriegel, P. Kroger, A. Zimek. Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. Transactions on Knowledge Discovery from Data, Vol(3), Issue 1, 2009, Article No. 1.

14. K. Sim, V. Gopalkrishnan, A. Zimek, G. Cong. A survey on enhanced subspace clustering. Data Mining and Knowledge Discovery, Vol(26), Issue 2, 2013, pp. 332-397.

15. R. Agrawal, J. Gehrke, D. Gunopulos, P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. Proceedings of ACM SIGMOD, ACM Press, 1998, pp. 94-105.

16. C. Cheng, A. Fu, Y. Zhang. Entropy-based subspace clustering for mining numerical data. Proceedings of the 5th SIGKDD, ACM Press, 1999, pp. 84-93.

17. C. M. Procopiuc, M. Jones, P. K. Aggarwal, T. M. Murali. A Monte Carlo algorithm for fast projective clustering. Proceedings of the ACM SIGMOD International Conference on Management of Data, ACM Press, 2002, pp. 418-427.

18. C. C. Aggarwal, J. L. Wolf, P. S. Yu, C. Procopiuc, J. S. Park. Fast algorithms for projected clustering. Proceedings of the ACM SIGMOD International Conference on Management of Data, ACM Press, 1999, pp. 61-72.

19. C. C. Aggarwal, P. S. Yu. Finding generalized projected clusters in high dimensional spaces. Proceedings of the ACM SIGMOD International Conference on Management of Data, ACM Press, 2000, pp. 70-81.

20. K. Sequeira, M. Zaki. SCHISM: A new approach for interesting subspace mining. Proceedings of ICDM, 2004, pp. 186-193.

21. G. Liu, K. Sim, J. Li, L. Wong. Efficient mining of distance-based subspace clusters. Statistical Analysis and Data Mining, Vol(2), Issue 5-6, 2010, pp. 427-444.

22. H.-P. Kriegel, P. Kroger, M. Renz, S. Wurst. A generic framework for efficient subspace clustering of high-dimensional data. Proceedings of ICDM, 2005, pp. 250-257.

23. I. Assent, R. Krieger, E. Muller, T. Seidl. INSCY: Indexing subspace clusters with in-process-removal of redundancy. Proceedings of ICDM, 2008, pp. 719-724.

24. H.-P. Kriegel, P. Kroger, I. Ntoutsi, A. Zimek. Density Based Subspace Clustering over Dynamic Data. Scientific and Statistical Database Management, Vol(6809), 2011, pp. 387-404.

25. G. Moise, J. Sander. Finding non-redundant, statistically significant regions in high dimensional data: a novel approach to projected and subspace clustering. Proceedings of SIGKDD, 2008, pp. 533-541.

26. S. Gunnemann, I. Farber, B. Boden, T. Seidl. Subspace Clustering Meets Dense Subgraph Mining: A Synthesis of Two Paradigms. Proceedings of the 10th ICDM, 2010, pp. 845-850.

27. S. Goil, H. Nagesh, A. Choudhary. MAFIA: Efficient and Scalable Subspace Clustering for Very Large Data Sets. Proceedings of the 5th SIGKDD, 1999.

28. Spark: https://spark.apache.org/

29. C. Domeniconi, D. Papadopoulos, D. Gunopulos, S. Ma. Subspace Clustering of High Dimensional Data. Proceedings of SIAM, 2004.

30. H. Nazerzadeh, M. Ghodsi, S. Sadjadian. Parallel Subspace Clustering. Proceedings of the 10th Annual Conference of the Computer Society of Iran, 2005.

31. M. Zaharia, P. Wendell, A. Konwinski, H. Karau. Learning Spark. O'Reilly Media, Inc., 2015.

32. S. Gunnemann, I. Farber, E. Muller, T. Seidl. ASCLU: alternative subspace clustering. MultiClust, KDD 2010.

33. E. Muller, I. Assent, R. Krieger, S. Gunnemann, T. Seidl. DensEst: density estimation for data mining in high dimensional spaces. SDM 2009.

34. K. Sim, A.K. Poernomo, V. Gopalkrishnan. Mining actionable subspace clusters in sequential data. SDM 2010.
