Aggregation Functions in OCL

Specifying Aggregation Functions in Multidimensional Models with OCL

Jordi Cabot, Jose-Norberto Mazón, Jesús Pardillo, Juan Trujillo

École des Mines de Nantes & Universidad de Alicante

ER 2010

Introduction

Conceptual modeling has proved to be very useful in the development of data warehouse systems.

Main benefits of conceptual modeling:
– Implementation-independent view of the system
– Possibility of (semi)automatic code generation
– Better maintainability and evolution
– …

Several proposals go in this direction:
– UML Profile for multidimensional modeling of data warehouses
  • [Luján et al., DKE 2007]
– Model-driven approach for the development of data warehouses
  • [Mazón & Trujillo, DSS 2008]

Conceptual Modeling of DWH (1/2)

Modeling multidimensional concepts at the conceptual level:
– Data is structured in a multidimensional space
– Dimensions specify different ways the data can be viewed, aggregated, and sorted
  • E.g., according to time, store, customer, product, etc.
– Events of interest for an analyst are represented as facts, which are associated with cells or points in the multidimensional space and described in terms of a set of measures

The conceptual level abstracts away logical details:
– Technology: relational, multidimensional, ...
– Logical variations: star schema, snowflake schema, ...

A logical representation can then be obtained automatically:
– Model-driven approach

Conceptual Modeling of DWH

An airline’s marketing department wants to analyze the flight activity of each member of its frequent flyer program

Conceptual Modeling of DWH (1/2)

… once annotated with the Profile becomes …

Conceptual Modeling of DWH

… BUT (there’s always a ‘but’)

Right now, only the structural aspects of the DWH are modeled but decision makers require a set of multidimensional queries

These multidimensional queries are not specified as part of the Conceptual Schema (CS) of the DWH

They are only added once the DWH is implemented

As a result:
– The MDE approach is broken
– The completeness of the DWH cannot be validated until it is implemented (i.e., does the DWH contain enough information?)
– Defining queries requires expertise in the target platform
– No reusability
– …

This limitation affects not only multidimensional models but, in general, all kinds of CSs (their informative function is ignored)

Limitations of CM languages

The main restriction for defining queries at the CS level -> poor support in current CM languages

In particular, CM languages exhibit a lack of rich constructs for the specification of aggregation functions (key in DWH systems)

Usually only basic ones (sum, avg,…) are covered but DWH systems require richer analysis functions (e.g. rank, percentile, min, max,…)

For instance, OCL (most popular query language for CSs) only includes the sum, size and count functions

If a designer wants to know the ranking of frequent flyers, he or she has to build the ranking function from scratch

Very time consuming and error-prone

Don’t you prefer to have the “*” operator even if “+” is enough?

Goal

Extending OCL with a new set of predefined aggregation functions

Making sure these functions can be integrated in current MDD methods

Provide a way to specify complex queries as part of the CS definition

We will apply these new OCL functions in combination with our UML profile for DWH modeling

The functions themselves are independent of the profile and can be used to complement any CS

OCL: Basic Concepts (1/2)

OCL is a rich language that offers predefined mechanisms for:
– Retrieving the values of an object
– Navigating through a set of related objects
– Iterating over collections of objects (e.g., forAll, exists, select)

OCL includes a predefined standard library: a set of types plus operations on them
– Primitive types: Integer, Real, Boolean and String
– Collection types: Set, Bag, OrderedSet and Sequence
– Examples of operations: and, or, not (Boolean); +, −, *, >, < (Real and Integer); union, size, includes, count and sum (Set)

All these constructs can be used in the definition of OCL constraints, derivation rules, queries and pre/post-conditions

OCL: Basic Concepts (2/2)

Template for queries:

context Class::Q(p1: T1, ..., pn: Tn): Tresult
body: Query-ocl-expression

Example query (total miles earned by a frequent flyer on his/her trips from Denver in a given fare class):

context Customer::sumMiles(fc: FareClass): Real
body: self.frequentFlyerLegs->select(f | f.fareClass = fc and f.origin.city.name = 'Denver').Miles->sum()
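For readers less familiar with OCL, the example query can be read as a filter followed by a sum. The following is a minimal Python sketch of that reading, not part of the original proposal; the class layout, the attribute names (`fare_class`, `origin_city`, `miles`) and the sample data are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FlightLeg:
    # Hypothetical flattening of f.fareClass, f.origin.city.name and the miles measure
    fare_class: str
    origin_city: str
    miles: int


@dataclass
class Customer:
    legs: list = field(default_factory=list)

    def sum_miles(self, fare_class):
        # select(...) keeps the matching legs; sum() then adds up their miles
        return sum(
            leg.miles
            for leg in self.legs
            if leg.fare_class == fare_class and leg.origin_city == "Denver"
        )


# Made-up sample data
customer = Customer([
    FlightLeg("Economy", "Denver", 800),
    FlightLeg("Economy", "Boston", 1200),
    FlightLeg("Business", "Denver", 950),
    FlightLeg("Economy", "Denver", 400),
])
total = customer.sum_miles("Economy")
```

Only the two Economy legs departing from Denver contribute to `total`.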

But there is no easy way to define the more complex queries required to properly analyze the data of the system

Extending OCL

The extension is classified into three groups of functions:
– Distributive functions: can be defined by structural recursion
  • Max, min, sum, count, count distinct, …
– Algebraic functions: finite algebraic expressions over distributive functions
  • Avg, variance, stddev, covariance, …
– Holistic functions: the rest
  • Mode, descending rank, ascending rank, percentile, median

These operations can be combined to provide more advanced ones (e.g., top(x), which is implemented using rank)
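The practical difference between the three groups shows up when data is split into partitions. The sketch below is an illustration in Python under assumed, made-up data: a distributive function can be computed from per-partition results, an algebraic one from a fixed number of distributive results, while a holistic one (median here) generally needs the whole collection.

```python
from statistics import median

# Made-up measure values, split into three partitions
partitions = [[3, 1, 4], [1, 5], [9, 2, 6]]
all_values = [v for part in partitions for v in part]

# Distributive: combining per-partition aggregates gives the global aggregate
max_from_parts = max(max(p) for p in partitions)
sum_from_parts = sum(sum(p) for p in partitions)
count_from_parts = sum(len(p) for p in partitions)

# Algebraic: avg is a finite expression over distributive results (sum / count)
avg_from_parts = sum_from_parts / count_from_parts

# Holistic: the median of per-partition medians is generally NOT the global median
median_of_medians = median(median(p) for p in partitions)
true_median = median(all_values)
```

With this data the median of the partition medians differs from the median of all values, whereas max, sum and avg agree with their global counterparts.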

How are these operations defined?

Functions are defined as an extension of the OCL standard library (unfortunately, OCL does not yet have an import mechanism for adding an external OCL analytics library)

Defined in OCL by specifying their operation contract (same style as used in the standard)

No changes to the OCL metamodel

Each operation is attached to the most appropriate (primitive or collection) type

Users can call them in the same way as the normal ones

Some examples (1/3)

MAX: Returns the element with the highest value in a non-empty collection of objects of type T.

context Collection::max(): T
pre: self->notEmpty()
post: result = self->any(e | self->forAll(e2 | e >= e2))

COUNT DISTINCT: Returns the number of different elements in a collection.

context Collection::countDistinct(): Integer
post: result = self->asSet()->size()
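One way to read these two contracts operationally is sketched below in Python. This is an illustration only: the function names `ocl_max` and `count_distinct` are hypothetical, and raising `ValueError` is just one possible way to model a violated precondition.

```python
def ocl_max(collection):
    items = list(collection)
    # pre: self->notEmpty()
    if not items:
        raise ValueError("max() requires a non-empty collection")
    result = items[0]
    for e in items[1:]:
        if e > result:
            result = e
    # post: result is an element of the collection that is >= every element
    assert result in items and all(result >= e2 for e2 in items)
    return result


def count_distinct(collection):
    # post: result = self->asSet()->size()
    return len(set(collection))


highest = ocl_max([800, 1200, 950])
cities = count_distinct(["Denver", "Boston", "Denver"])
```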

Some examples (2/3)

AVG: Returns the arithmetic average value of the elements in a non-empty collection.

context Collection::avg(): Real
pre: self->notEmpty()
post: result = self->sum() / self->size()

COVARIANCE: Returns the covariance value between two ordered sets.

context OrderedSet::covariance(Y: OrderedSet): Real
pre: self->size() = Y->size() and self->notEmpty()
post: let avgY: Real = Y->avg() in
      let avgSelf: Real = self->avg() in
      result = (1 / self->size()) * self->iterate(e; acc: Real = 0 | acc + ((e - avgSelf) * (Y->at(self->indexOf(e)) - avgY)))
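The covariance contract divides by the collection size, i.e. it describes the population covariance, and the `iterate` expression is an accumulator loop over paired elements. A minimal Python sketch of both contracts follows; the function names and sample values are assumptions, and pairing by position with `zip` stands in for the `at(indexOf(e))` lookup.

```python
def avg(values):
    values = list(values)
    # pre: self->notEmpty()
    if not values:
        raise ValueError("avg() requires a non-empty collection")
    return sum(values) / len(values)


def covariance(xs, ys):
    xs, ys = list(xs), list(ys)
    # pre: self->size() = Y->size() and self->notEmpty()
    if len(xs) != len(ys) or not xs:
        raise ValueError("covariance() requires two non-empty collections of equal size")
    avg_x, avg_y = avg(xs), avg(ys)
    acc = 0.0  # plays the role of the iterate accumulator
    for x, y in zip(xs, ys):
        acc += (x - avg_x) * (y - avg_y)
    return acc / len(xs)


# Made-up sample data: ys is exactly twice xs
cov = covariance([1, 2, 3, 4], [2, 4, 6, 8])
```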

Some examples (3/3)

MODE: Returns the most frequent value in a collection.

context Collection::mode(): T
pre: self->notEmpty()
post: result = self->any(e | self->forAll(e2 | self->count(e) >= self->count(e2)))

DESCENDING RANK: Returns the position (i.e., ranking) of an element within a Collection.

context Collection::rankDescending(e: T): Integer
pre: self->includes(e)
post: result = self->size() - self->select(e2 | e >= e2)->size() + 1
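The rank postcondition amounts to "one plus the number of elements strictly greater than e", so tied elements share a rank. The Python sketch below illustrates this, together with one possible way a top(x) helper could be built on rank; the `top` function is a hypothetical illustration, since the slides only state that top(x) is implemented using rank.

```python
from collections import Counter


def mode(values):
    values = list(values)
    # pre: self->notEmpty()
    if not values:
        raise ValueError("mode() requires a non-empty collection")
    # most_common(1) yields one element whose count is maximal
    return Counter(values).most_common(1)[0][0]


def rank_descending(values, e):
    values = list(values)
    # pre: self->includes(e)
    if e not in values:
        raise ValueError("element is not in the collection")
    # post: size - |{e2 | e >= e2}| + 1
    return len(values) - sum(1 for e2 in values if e >= e2) + 1


def top(values, x):
    # Hypothetical helper: keep the elements whose descending rank is at most x
    values = list(values)
    return sorted((v for v in values if rank_descending(values, v) <= x), reverse=True)


miles = [500, 1200, 800, 1200, 300]  # made-up sample data
```

Here both occurrences of 1200 get rank 1, and 800 gets rank 3 because two elements are strictly greater.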

Using our new aggregate functions

Our functions can be used wherever an OCL standard function can be used

They are called exactly in the same way

Example: using the avg function to compute the average number of miles earned by a customer per flight leg.

context Customer::avgMilesPerFlightLeg(): Real
body: self.frequentFlyerLegs.Miles->avg()

MDD of our “enriched” DWH CSs

To be useful, we need to make sure that CSs using our new aggregate functions can be used as input of MDD processes and tools

Current MDD methods do NOT need to be extended to cope with enriched CSs
– Our library is written in OCL itself (platform-independent)
– Complex functions can be reduced to standard OCL functions

Two scenarios, depending on whether the target implementation platform directly supports our functions
– If it does, the functions map directly to their native counterparts
– If it does not, our functions must be preprocessed to re-express them in terms of standard OCL operations

Existing OCLtoX (X=Java, SQL,…) tools can help in the process

MDD Scenario 1: Direct implementation

create view AvgMilesFlight as
  select c.id, avg(l.miles) as avgMiles
  from customer c, frequentflyerlegs l
  where c.id = l.customer
  group by c.id

(a) DBMS code


MDD Scenario 2: Normalization/unfolding

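import java.util.Vector;

class Customer {
    int id;
    String name;
    Vector<FrequentFlyerLegs> f;
    ...
    // Unfolded avg(): sum of miles divided by the number of flight legs.
    // sumMiles is assumed to be defined in the elided part of the class.
    public float avgMiles() {
        return (float) sumMiles(f) / f.size();
    }
}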

(b) Java code

context Customer::avgMilesPerFlightLeg(): Real
post: result = self.frequentFlyerLegs.Miles->sum() / self.frequentFlyerLegs.Miles->size()
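The unfolding step can be pictured as a rewrite over the expression tree: every call to avg() is replaced by a division of sum() by size() on the same source. The Python sketch below is an illustration under strong simplifications, not the authors' tooling: the tiny expression classes (`Nav`, `Call`, `Div`) are hypothetical and cover only what this one example needs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Nav:
    path: str  # a navigation expression, kept as opaque text


@dataclass(frozen=True)
class Call:
    source: object  # expression the collection operation is applied to
    name: str       # operation name, e.g. "avg"


@dataclass(frozen=True)
class Div:
    left: object
    right: object


def unfold(expr):
    # Rewrite avg() into sum() / size(); leave everything else untouched
    if isinstance(expr, Call):
        source = unfold(expr.source)
        if expr.name == "avg":
            return Div(Call(source, "sum"), Call(source, "size"))
        return Call(source, expr.name)
    if isinstance(expr, Div):
        return Div(unfold(expr.left), unfold(expr.right))
    return expr


def show(expr):
    # Print the tree back in OCL-like concrete syntax
    if isinstance(expr, Nav):
        return expr.path
    if isinstance(expr, Call):
        return f"{show(expr.source)}->{expr.name}()"
    return f"{show(expr.left)} / {show(expr.right)}"


query = Call(Nav("self.frequentFlyerLegs.Miles"), "avg")
unfolded = show(unfold(query))
```

After the rewrite, the expression only uses operations that existing OCL-to-code generators already understand.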


Validation

Our OCL extension has been validated by using the UML Specification Environment (USE) tool

Our functions have been added to USE as new user-defined functions

Two-phase analysis:
– Syntactic analysis: USE parses the OCL operations and checks their syntactic correctness
– Semantic analysis: USE executes the operations on sample scenarios. By analyzing the results, we can check whether the operations behave as expected
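The idea behind the semantic phase, running an operation on sample scenarios and comparing the result against its contract, can be sketched independently of any tool. The Python below is such a sketch and does not use or mimic the USE tool's actual interface; `check_contract`, the lambdas and the scenarios are all hypothetical.

```python
def check_contract(operation, pre, post, scenarios):
    """Run operation on every scenario that satisfies pre; return those violating post."""
    failures = []
    for scenario in scenarios:
        if not pre(scenario):
            continue  # the contract says nothing when the precondition does not hold
        result = operation(scenario)
        if not post(scenario, result):
            failures.append(scenario)
    return failures


# Contract of max(): non-empty input; result is an element >= every element
pre_not_empty = lambda c: len(c) > 0
post_max = lambda c, r: r in c and all(r >= e for e in c)

# Made-up sample scenarios, including an empty collection
scenarios = [[800, 1200, 950], [400], [], [300, 300, 100]]

good = check_contract(max, pre_not_empty, post_max, scenarios)
# A deliberately wrong implementation that always returns the first element
buggy = check_contract(lambda c: c[0], pre_not_empty, post_max, scenarios)
```

A correct implementation produces no failing scenarios, while the deliberately wrong one is caught on the first scenario.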


Conclusions

Complex aggregation functions should be part of the predefined constructs provided by modeling languages

We made this possible by extending OCL

Queries written with this “extended OCL” can be animated and validated at design time and automatically implemented along with the rest of the DWH CS

Further Work

Providing mechanisms for defining/validating multidimensional queries at the conceptual level in a more intuitive manner
– Natural language
– OCL <-> Semantics of Business Vocabulary and Business Rules (SBVR) [Cabot et al., Inf. Syst. 2010]

Verifying the proper use of the aggregation function chosen by the designer. The kind of aggregation function that can be applied depends on the kind of measure and the kind of dimension. E.g.:
– Temperatures cannot be aggregated along either the time or the location dimension

Continuing the discussion

http://modeling-languages.com


@softmodeling
