Specifying Aggregation Functions in Multidimensional Models with OCL. Jordi Cabot, Jose-Norberto Mazón, Jesús Pardillo, Juan Trujillo. École des Mines de Nantes & Universidad Alicante. ER 2010
May 10, 2015
Introduction
Conceptual modeling has proved to be very useful in the development of data warehouse systems.
Main benefits of conceptual modeling:
– Implementation-independent view of the system
– Possibility of (semi)automatic code generation
– Better maintainability and evolution
– …
Several proposals go in this direction:
– UML Profile for multidimensional modeling of data warehouses
  • [Luján et al., DKE 2007]
– Model-driven approach for development of data warehouses
  • [Mazón & Trujillo, DSS 2008]
Conceptual Modeling of DWH (1/2)
Modeling multidimensional concepts at the conceptual level:
– Data is structured in a multidimensional space
– Dimensions specify different ways the data can be viewed, aggregated, and sorted
  • E.g., according to time, store, customer, product, etc.
– Events of interest for an analyst are represented as facts, which are associated with cells or points in the multidimensional space and described in terms of a set of measures
Logical details are abstracted away:
– technology: relational, multidimensional, ...
– logical variations: star, snowflake schema, ...
A logical representation is obtained automatically:
– model-driven approach
Conceptual Modeling of DWH
An airline’s marketing department wants to analyze the flight activity of each member of its frequent flyer program
Conceptual Modeling of DWH (1/2)
… once annotated with the Profile becomes …
Conceptual Modeling of DWH
… BUT (there’s always a ‘but’)
Right now, only the structural aspects of the DWH are modeled, but decision makers also require a set of multidimensional queries
These multidimensional queries are not specified as part of the Conceptual Schema (CS) of the DWH
They are only added once the DWH is implemented
As a result:
– The MDE approach is broken
– The completeness of the DWH cannot be validated until it is implemented (i.e., does the DWH contain enough information?)
– Defining queries requires expertise in the target platform
– No reusability
– …
This limitation affects not only multidimensional models but, in general, all kinds of CSs (their informative function is ignored)
Limitations of CM languages
The main obstacle to defining queries at the CS level is the poor support in current CM languages
In particular, CM languages exhibit a lack of rich constructs for the specification of aggregation functions (key in DWH systems)
Usually only basic ones (sum, avg, …) are covered, but DWH systems require richer analysis functions (e.g., rank, percentile, min, max, …)
For instance, OCL (most popular query language for CSs) only includes the sum, size and count functions
If a designer wants to know the ranking of frequent flyers, he or she has to build the ranking function from scratch
This is very time-consuming and error-prone
Don’t you prefer to have the “*” operator even if “+” is enough?
Goal
Extend OCL with a new set of predefined aggregation functions
Make sure these functions can be integrated into current MDD methods
Provide a way to specify complex queries as part of the CS definition
We will apply these new OCL functions in combination with our UML profile for DWH modeling
The functions themselves are independent of the profile and can be used to complement any CS
OCL: Basic Concepts (1/2)
OCL is a rich language that offers predefined mechanisms for:
– Retrieving the values of an object
– Navigating through a set of related objects
– Iterating over collections of objects (e.g., forAll, exists, select)
OCL includes a predefined standard library: a set of types plus operations on them
– Primitive types: Integer, Real, Boolean and String
– Collection types: Set, Bag, OrderedSet and Sequence
– Examples of operations: and, or, not (Boolean); +, −, *, >, < (Real and Integer); union, size, includes, count and sum (Set)
All these constructs can be used in the definition of OCL constraints, derivation rules, queries and pre/post-conditions
OCL: Basic Concepts (2/2)
Template for queries
context Class::Q(p1: T1, ..., pn: Tn): Tresult
body: Query-ocl-expression
Example query (total miles earned by a frequent flyer in his/her trips from Denver in a given fare class)
context Customer::sumMiles(fc: FareClass): Integer
body: self.frequentFlyerLegs->select(f | f.fareClass = fc and f.origin.city.name = 'Denver').Miles->sum()
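The example query reads as a filter followed by a sum. A minimal Python sketch of that reading is given below; Python is used only for illustration, and the `FlightLeg` and `Customer` classes, their attribute names, and the sample data are hypothetical stand-ins for the airline schema, not part of the original model.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for the schema classes; only the attributes
# used by the query are modeled.
@dataclass
class FlightLeg:
    origin_city: str  # stands for f.origin.city.name
    fare_class: str   # stands for f.fareClass
    miles: int        # stands for the Miles measure


@dataclass
class Customer:
    frequent_flyer_legs: list

    def sum_miles(self, fare_class: str) -> int:
        # select(f | f.fareClass = fc and f.origin.city.name = 'Denver')
        selected = [f for f in self.frequent_flyer_legs
                    if f.fare_class == fare_class and f.origin_city == "Denver"]
        # collect the miles of the selected legs and sum them
        return sum(f.miles for f in selected)


# Made-up sample data
customer = Customer([
    FlightLeg("Denver", "Economy", 500),
    FlightLeg("Denver", "Business", 800),
    FlightLeg("Boston", "Economy", 300),
    FlightLeg("Denver", "Economy", 700),
])
print(customer.sum_miles("Economy"))  # 1200
```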
But there is no easy way to define the more complex queries required to properly analyze the data of the system
Extending OCL
The extension is classified into three groups of functions:
– Distributive functions: can be defined by structural recursion
  • Max, min, sum, count, count distinct, …
– Algebraic functions: finite algebraic expressions over distributive functions
  • Avg, variance, stddev, covariance, …
– Holistic functions: the rest
  • Mode, descending rank, ascending rank, percentile, median
These operations can be combined to provide more advanced ones (e.g., top(x), which is implemented using rank)
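The three groups can be told apart by how much of the collection each function needs to see. The Python sketch below is only an illustration of that distinction (the sample values are made up): distributive functions are written as folds over the elements, an algebraic function is an expression over distributive results, and a holistic function needs the whole collection at once.

```python
from functools import reduce
from statistics import median

values = [4, 8, 15, 16, 23, 42]  # made-up sample data

# Distributive: a fold (structural recursion) that visits one element at a time.
total = reduce(lambda acc, e: acc + e, values, 0)
count = reduce(lambda acc, _: acc + 1, values, 0)
highest = reduce(lambda acc, e: e if e > acc else acc, values)

# Algebraic: a finite expression over distributive results.
average = total / count

# Holistic: no fixed-size running state suffices; median() sorts the whole collection.
middle = median(values)

print(total, count, highest, average, middle)  # 108 6 42 18.0 15.5
```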
How are these operations defined?
Functions are defined as an extension of the OCL standard library (unfortunately, OCL does not yet have an import mechanism for adding an external OCL analytics library)
Defined in OCL by specifying their operation contract (same style as used in the standard)
No changes on the OCL metamodel
Each operation is attached to the most appropriate (primitive or collection) type
Users can call them in the same way as the normal ones
Some examples (1/3)
MAX: Returns the element in a non-empty collection of objects of type T with the highest value.
context Collection::max(): T
pre: self->notEmpty()
post: result = self->any(e | self->forAll(e2 | e >= e2))
COUNT DISTINCT: Returns the number of different elements in a collection.
context Collection::countDistinct(): Integer
post: result = self->asSet()->size()
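To make the two contracts concrete, here is a Python sketch that computes each result and checks it against the stated pre- and post-conditions. The function names `ocl_max` and `count_distinct` are illustrative choices, and Python lists stand in for OCL collections.

```python
def ocl_max(coll):
    # pre: self->notEmpty()
    assert len(coll) > 0, "max() is undefined on an empty collection"
    result = coll[0]
    for e in coll[1:]:
        if e > result:
            result = e
    # post: result is an element of the collection that is >= every element
    assert result in coll and all(result >= e2 for e2 in coll)
    return result


def count_distinct(coll):
    # post: result = self->asSet()->size()
    return len(set(coll))


print(ocl_max([3, 9, 4]))               # 9
print(count_distinct([1, 1, 2, 3, 3]))  # 3
```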
Some examples (2/3)
AVG: Returns the arithmetic average value of the elements in the non-empty collection.
context Collection::avg(): Real
pre: self->notEmpty()
post: result = self->sum() / self->size()
COVARIANCE: Returns the covariance value between two ordered sets.
context OrderedSet::covariance(Y: OrderedSet): Real
pre: self->size() = Y->size() and self->notEmpty()
post: let avgY: Real = Y->avg() in
  let avgSelf: Real = self->avg() in
  result = (1 / self->size()) * self->iterate(e; acc: Real = 0 | acc + ((e - avgSelf) * (Y->at(self->indexOf(e)) - avgY)))
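The covariance contract divides the accumulated products by the collection size, i.e. it is the population covariance. A Python sketch under that reading follows; lists stand in for ordered sets, elements are paired by position, and the sample data is made up.

```python
def avg(coll):
    # pre: self->notEmpty()
    assert len(coll) > 0
    # post: result = self->sum() / self->size()
    return sum(coll) / len(coll)


def covariance(xs, ys):
    # pre: both collections have the same size and are non-empty
    assert len(xs) == len(ys) and len(xs) > 0
    avg_x, avg_y = avg(xs), avg(ys)
    # iterate(e; acc = 0 | acc + (e - avgSelf) * (Y->at(indexOf(e)) - avgY))
    acc = 0.0
    for x, y in zip(xs, ys):
        acc += (x - avg_x) * (y - avg_y)
    # (1 / size) * accumulated sum
    return acc / len(xs)


print(avg([1, 2, 3, 4]))                       # 2.5
print(covariance([1, 2, 3, 4], [2, 4, 6, 8]))  # 2.5
```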
Some examples (3/3)
MODE: Returns the most frequent value in a collection.
context Collection::mode(): T
pre: self->notEmpty()
post: result = self->any(e | self->forAll(e2 | self->count(e) >= self->count(e2)))
DESCENDING RANK: Returns the position (i.e., ranking) of an element within a collection.
context Collection::rankDescending(e: T): Integer
pre: self->includes(e)
post: result = self->size() - self->select(e2 | e >= e2)->size() + 1
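The Python sketch below follows the two post-conditions literally and then builds a `top(x)` helper on top of the rank, as one possible reading of how such a combined operation could be derived. The `top` definition and the sample mileage figures are assumptions made for illustration.

```python
def mode(coll):
    # pre: self->notEmpty()
    assert len(coll) > 0
    # any(e | forAll(e2 | count(e) >= count(e2))): some most frequent element
    return max(coll, key=coll.count)


def rank_descending(coll, e):
    # pre: self->includes(e)
    assert e in coll
    # size() - select(e2 | e >= e2)->size() + 1
    return len(coll) - sum(1 for e2 in coll if e >= e2) + 1


def top(coll, x):
    # Assumed definition: the elements whose descending rank is at most x.
    return sorted((e for e in coll if rank_descending(coll, e) <= x), reverse=True)


miles = [1200, 300, 800, 300, 950]  # made-up sample data
print(mode(miles))                   # 300
print(rank_descending(miles, 1200))  # 1
print(rank_descending(miles, 950))   # 2
print(top(miles, 3))                 # [1200, 950, 800]
```

Note that tied elements share a rank under this contract: both occurrences of 300 get rank 4.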
Using our new aggregate functions
Our functions can be used wherever an OCL standard function can be used
They are called exactly in the same way
Example: using the avg function to compute the average number of miles earned by a customer per flight leg.
context Customer::avgMilesPerFlightLeg(): Real
body: self.frequentFlyerLegs.Miles->avg()
MDD of our “enriched” DWH CSs
To be useful, we need to make sure that CSs using our new aggregate functions can be used as input of MDD processes and tools
Current MDD methods do NOT need to be extended to cope with enriched CSs
– Our library is written in OCL itself (platform-independent)
– Complex functions can be reduced to standard OCL functions
Two scenarios, depending on whether the target implementation platform directly supports our function
– If it does not, our functions must be preprocessed to re-express them in terms of standard OCL operations
Existing OCLtoX (X=Java, SQL,…) tools can help in the process
MDD Scenario 1: Direct implementation
context Customer::avgMilesPerFlightLeg(): Real
body: self.frequentFlyerLegs.Miles->avg()
(a) DBMS code
create view AvgMilesFlight as
select c.id, avg(l.miles)
from customer c, frequentflyerlegs l
where c.id = l.customer
group by c.id
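Because SQL already provides `avg`, the OCL call maps directly onto the database function. The sketch below tries this out with Python's built-in `sqlite3` module on an in-memory database; the table layout, the per-customer grouping, the column aliases, and the sample rows are assumptions made for the sake of a runnable example.

```python
import sqlite3

# In-memory database with a hypothetical two-table layout.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE frequentflyerlegs (
        customer INTEGER REFERENCES customer(id),
        miles INTEGER
    );
    INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO frequentflyerlegs VALUES (1, 500), (1, 700), (2, 300);

    -- avg() is native to the DBMS, so no unfolding is needed
    CREATE VIEW AvgMilesFlight AS
        SELECT c.id AS customer, avg(l.miles) AS avg_miles
        FROM customer c JOIN frequentflyerlegs l ON c.id = l.customer
        GROUP BY c.id;
""")

rows = conn.execute(
    "SELECT customer, avg_miles FROM AvgMilesFlight ORDER BY customer"
).fetchall()
print(rows)  # [(1, 600.0), (2, 300.0)]
```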
MDD Scenario 2: Normalization/unfolding
context Customer::avgMilesPerFlightLeg(): Real
body: self.frequentFlyerLegs.Miles->avg()
is unfolded into
context Customer::avgMilesPerFlightLeg(): Real
post: result = self.frequentFlyerLegs.Miles->sum() / self.frequentFlyerLegs.Miles->size()
(b) Java code
class Customer {
    int id;
    String name;
    Vector<FrequentFlyerLegs> f;
    ...
    // avg() unfolded into sum / size; the cast avoids integer division
    public float avgMiles() {
        return (float) sumMiles(f) / f.size();
    }
}
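The unfolding step can be pictured as a rewrite of the OCL expression before it is handed to an existing OCL-to-code tool. The Python sketch below does this with a naive textual substitution; it is a toy illustration only (a real preprocessor would work on the parsed OCL expression, not on strings), and the function name `unfold_avg` is hypothetical.

```python
import re


def unfold_avg(ocl_expression: str) -> str:
    # Rewrite "<source>->avg()" as "<source>->sum() / <source>->size()".
    # Only handles sources made of dotted navigations such as a.b.c.
    return re.sub(r"([\w.]+)->avg\(\)", r"\1->sum() / \1->size()", ocl_expression)


body = "self.frequentFlyerLegs.Miles->avg()"
print(unfold_avg(body))
# self.frequentFlyerLegs.Miles->sum() / self.frequentFlyerLegs.Miles->size()
```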
Validation
Our OCL extension has been validated by using the UML Specification Environment (USE) tool
Our functions have been added to USE as new user-defined functions
Two-phase analysis:
– Syntactic analysis: USE parses the OCL operations and checks their syntactic correctness
– Semantic analysis: USE executes the operations on sample scenarios. By analyzing the results, we can check whether the operations behave as expected
Conclusions
Complex aggregation functions should be part of the predefined constructs provided by modeling languages
We made this possible by extending OCL
Queries written with this “extended OCL” can be animated and validated at design time and automatically implemented along with the rest of the DWH CS
Further Work
Providing mechanisms for defining/validating multidimensional queries at the conceptual level in a more intuitive manner
– Natural language
– OCL <-> Semantics of Business Vocabulary and Business Rules (SBVR) [Cabot et al., Inf. Syst. 2010]
Verifying the proper use of the aggregation function chosen by the designer. The kind of aggregation function to be applied depends on the kind of measure and the kind of dimension. E.g.:
– Temperatures cannot be aggregated along the time or location dimensions
Continuing the discussion
http://modeling-languages.com
@softmodeling