ICT, STREP
FERARI ICT-FP7-619491
Flexible Event pRocessing for big dAta aRchItectures
Collaborative Project
D1.3
Application Scenario and Prototype Report
01.02.2016 – 31.01.2017 (preparation period)
Contractual Date of Delivery: 31.01.2017
Actual Date of Delivery: 31.01.2017
Author(s): Damir Bogadi (HT), Marko Štajcer (PI),
Fabiana Fournier (IBM), Inna Skarbovsky
(IBM), Izchak Sharfman (TECHNION),
Michael Kamp (FhG), Michael Mock (FHG)
Institution: HT
Workpackage: WP1
Security: PU/CO (Appendix B and C)
Nature: R
Total number of pages: 51
Project funded by the European Community under the Information and Communication Technologies Programme Contract ICT-FP7-619491
Project coordinator name: Michael Mock
Project coordinator organisation name:
Fraunhofer Institute for Intelligent Analysis
and Information Systems (IAIS)
Revision: 1
Schloss Birlinghoven, 53754 Sankt Augustin, Germany
URL: http://www.iais.fraunhofer.de
Abstract:
FERARI aims at developing a general-purpose architecture for flexible,
communication efficient distributed complex event processing on massively
distributed streams of data. While the architecture, developed in WP2 and described in D2.3, is general purpose and hence not restricted to specific use cases or application domains, it was instantiated and validated in the FERARI project in two specific use cases working with real-world data. The instantiation of the FERARI architecture in these use cases and its evaluation on real-world data is addressed in WP1. The use-case applications run on top of the general-purpose FERARI platform, work on real-world data provided by HT, and were instantiated in a test bed on cluster hardware installed at HT. The purpose of this document (D1.3) is to explain the setup of the test bed hardware, the definition of the evaluation criteria tested (number of fraudsters detected, detection time achieved, and theoretical revenue savings), the performance results in terms of latency and throughput, and the usability evaluation (end-user testing of the frontend).
Revision history
Administration Status
Project acronym: FERARI ID: ICT-FP7-619491
Document identifier: D1.3 Application Scenario and Prototype Report (01.02.2016 – 31.01.2017)
Leading Partner: HT
Report version: 1
Report preparation date: 22.01.2017
Classification: PU / CO (Appendix B and C)
Nature: REPORT
Author(s) and contributors: Damir Bogadi (HT), Marko Štajcer (PI), Fabiana Fournier
(IBM), Inna Skarbovsky (IBM), Izchak Sharfman
(TECHNION), Michael Kamp (FhG), Michael Mock (FHG)
Status: - Plan
- Draft
- Working
- Final
x Submitted
Copyright
This report is © FERARI Consortium 2017. Its duplication is restricted to the personal use
within the consortium and the European Commission. www.ferari-project.eu
Document History

Version  Date        Author             Change Description
0.1      21/12/2016  Damir Bogadi (HT)  First draft
1.0      20/01/2017  Damir Bogadi (HT)  Draft after integration of parts
1.1      30/01/2017  Damir Bogadi (HT)  Finalization for submission
Table of Contents 1 Introduction .......................................................................................................................................... 2
1.1 Purpose and Scope of the Document ........................................................................................... 2
1.2 FERARI Setting ............................................................................................................................... 3
1.3 Workpackage 1 Description .......................................................................................................... 4
1.4 Relationship with other Documents ............................................................................................. 5
2 Testing Environment ............................................................................................................................. 6
2.1 Prototype Environment ................................................................................................................ 6
2.2 Privacy Constraints for Publication ............................................................................................... 9
2.2.1 Privacy Requirements ........................................................................................................... 9
3 Scenarios Evaluation ........................................................................................................................... 10
3.1 Evaluation Procedure and Tasks Performed ............................................................................... 10
3.2 Fraud Use Case ............................................................................................................................ 12
3.2.1 Business Evaluation Criteria ................................................................................................ 12
3.2.1.1 KPIs event-driven implementation ................................................................................. 13
3.2.1.1.1 Event types ................................................................................................................ 14
3.2.1.2 Dataset ............................................................................................................................ 14
3.2.2 Usability Evaluation ............................................................................................................. 15
3.2.2.1 FERARI Dashboard Overview .......................................................................................... 17
3.2.3 Performance Evaluation ...................................................................................................... 20
3.2.3.1 Recall and Precision ........................................................................................................ 21
3.2.3.1.1 Alarm X ...................................................................................................................... 21
3.2.3.1.2 Alarm Y ...................................................................................................................... 22
3.2.3.1.3 Alarm Z ...................................................................................................................... 22
3.2.3.2 Timing of detection ......................................................................................................... 22
3.2.3.3 Latency ............................................................................................................................ 22
3.2.3.4 Throughput ..................................................................................................................... 25
3.2.3.5 Revenue at Risk ............................................................................................................... 25
3.2.4 Other Test Results ............................................................................................................... 26
3.2.4.1 Ease of Context Configuration or Event Processing ........................................................ 26
3.2.4.2 Interoperability ............................................................................................................... 28
3.2.5 Summary of our Findings in the Fraud Use Case ................................................................ 28
3.3 System Health Monitoring Use Case ........................................................................................... 30
4 Summary ............................................................................................................................................. 31
5 Appendix A - Hardware configuration ................................................................................................ 32
6 Appendix B – Fraud Alarms Description and Results (CONFIDENTIAL) .............................................. 33
6.1 Event processing agents.............................................................................................................. 33
6.1.1 EPA1: Alarm X ...................................................................................................................... 33
6.1.2 EPA2: Alarm Y ...................................................................................................................... 35
6.1.3 EPA3: Alarm Z ...................................................................................................................... 36
6.2 Timing of detection in more details ............................................................................................ 37
7 Appendix C - DSLAM Descriptive Statistics Results (CONFIDENTIAL) ................................................. 38
List of Tables Table 1: Acronyms ........................................................................................................................................ ix
Table 2: Data sources, periods and volumes of collection ........................................................................... 8
Table 3: Criteria Evaluation Matrix ............................................................................................................. 10
Table 4: KPI’s for fraud use case ................................................................................................................. 13
Table 5: Event types for the KPI EPN .......................................................................................................... 14
Table 6: Usability evaluation criteria categories ......................................................................................... 16
Table 7: MEGS' flagged calling numbers for alarm X .................................................................................. 21
Table 8: PROTON's derivations (situations) times for alarm X versus MEGS .............................................. 22
Table 9: Performance Results Summary (90th percentile values) .............................................. 24
Table 10: Revenue saved by earlier detection ............................................................................................ 25
Table 11: PROTON's derivations (situations) times for alarm X versus MEGS – real values ...................... 37
Table 12: The probability of a port's LSI on the next day given its current LSI ............................................ 39
List of Figures Figure 1: Relationship between WP1 (Prototype) and WP2 (Software Platform) ........................................ 2
Figure 2: Three Phases of the FERARI Project and WP1 ............................................................................... 4
Figure 3: Deliverables in WP1 ....................................................................................................................... 4
Figure 4: Integration of the prototype within the FERARI architecture ....................................................... 8
Figure 5: KPI EPN ......................................................................................................................................... 14
Figure 6: Login form .................................................................................................................................... 18
Figure 7: The derived events in graph representation ............................................................................... 18
Figure 8: List of the derived events ............................................................................................................. 19
Figure 9: List of the most frequent calls ..................................................................................................... 19
Figure 10: Information about subscriber .................................................................................................... 20
Figure 11: List of calls that led to the selected derived event .................................................................... 20
Figure 12: Mesos cluster topology .............................................................................................................. 24
Figure 13: The event processing (EPA) of type SUM that specifies Alarm X ............................................... 27
Figure 14: The definition of the input event (CallPOPDHW), i.e. the list of attributes and their types for this event type ............................................................................................................................................ 27
Figure 15: The definition of the segmentation context, in our case, we partition the input events (CallPOPDHW) by their calling_number attribute ...................................................................................... 28
Figure 16: Event recognition process for Alarm X EPA ............................................................................... 34
Figure 17: Context for Alarm X EPA per single customer ........................................................................... 35
Figure 18: Event recognition process for Alarm Y EPA ............................................................................... 35
Figure 19: Event recognition process for Alarm Z EPA ............................................................................... 36
Figure 20: Daily percentage of port link loss events and line stability index .............................................. 38
Figure 21: Daily percentage of port with LSI <= 2 ....................................................................................... 39
Figure 22: Distribution of link losses over the time of day. .......................................................................... 39
Acronyms
CEP Complex Event Processing
DSLAM Digital Subscriber Line Access Multiplexer
FERARI Flexible Event pRocessing for big dAta aRchItectures
PROTON PROactive Technology Online - IBM tool
WP Work Package

Table 1: Acronyms
1 Introduction
1.1 Purpose and Scope of the Document
In WP1 we develop the basic application scenarios on which all further development is dependent. This WP
is driven by the end user in the project, i.e. HT.
Specific objectives of WP 1 are:
Selection and definition of the application scenarios in the telecommunications;
Definition of testing and evaluation criteria for the end users at HT;
Setting up of a test bed in HT and at the project partner’s local sites; and
Implementation and evaluation of scenarios to demonstrate the advantage of FERARI with
respect to the state of the art as well as to demonstrate its business value.
Figure 1: Relationship between WP1 (Prototype) and WP2 (Software Platform)
Figure 1 summarizes the main characteristics of WP2 and WP1, showing the differences and relationship
between these two work packages. The architecture developed in WP2 is made available as open source
platform in the FERARI open source repository https://bitbucket.org/sbothe-iais/ferari. It is intended to be
of general purpose and usable in any application domain that works on distributed streaming data. It provides
flexible mechanisms for complex event processing, libraries and run-time components for the distributed
execution of applications. To this end, the FERARI open source platform makes use of communication-efficient protocols for in-situ processing and distributed execution of complex event processing runtimes. Application
development of large-scale Big Data streaming applications is supported and significantly simplified by the
FERARI open source platform. However, specific application-dependent algorithms like fraud detection or
system health monitoring are not part of the open source platform, but will be developed for the purpose of
validating the FERARI platform against the use cases developed and described in WP1. These specific use-
case related applications running on top of the general purpose FERARI platform are working on real-world
data provided by HT and will be instantiated in a test bed on cluster hardware installed at HT. As these
applications are very specific and closely related to the HT data, they will not be open sourced (see also DoW
p. 6 of 28 in the WP1 description).
The purpose of this document is to explain:
Testing performed against business driven KPIs. Please note that the use case specific
evaluation criteria being described here do not stand for their own, but extend the
general purpose goals of the FERARI architecture such as scalability at very large scale,
communication efficiency, and flexibility in the expressiveness of the Complex Event
Processing.
Usability evaluation as performed by HT’s experts. Although FERARI is of general
purpose and not restricted or related to any specific kind of GUI, the evaluation of a specific use case application makes it necessary to provide one.
Recall and Precision of the event-driven application. PROTON’s results are compared to
the MEGS system that is currently in place in HT.
Performance of the FERARI architecture including latency.
This report is structured as follows: Chapter 2 gives an overview of the testing environment, Chapter 3 describes the scenario setup, evaluation criteria and test results, and Chapter 4 summarizes the testing results.
1.2 FERARI Setting
The FERARI project aims to develop a highly scalable distributed streaming architecture supporting complex
event processing in a communication efficient manner. A key element of the architecture will be
communication efficient distributed methods for monitoring global functions on globally distributed states by
partitioning of the global function to distributed local functions that communicate only if needed. The general
applicability of these methods will be demonstrated in various application scenarios, including distributed
online machine learning. The use cases for the evaluation of the framework and the machine learning
algorithms are real world use cases from Hrvatski Telekom. One use case focuses on fraud discovery in
mobile networks, which includes SIMbox fraud, premium rate service fraud and roaming fraud amongst
others. The other use case is system health monitoring, a problem attracting increasing attention because current failure detectors typically model normal behavior from historical data. This approach is becoming ever more challenging as network components grow more complex and dynamic with evolving technologies and growing data consumption.
1.3 Workpackage 1 Description
WP1 will develop the basic application scenarios on which all further development crucially depends. This
WP is driven by the end user (HT) in the project.
Figure 2: Three Phases of the FERARI Project and WP1
Work package 1 (WP1) “Applications scenarios, Test Bed, Prototype” develops the basic application
scenarios and use cases of the FERARI project. The WP is driven by the end users in the project, who selected and defined the use cases and performed testing of the final product based on predetermined evaluation criteria.
Figure 3: Deliverables in WP1
WP 1 - Applications scenarios, Test Bed, Prototype
D1.1 Application Scenario Description
and Requirement Analysis
D1.2 Final Application Scenarios and
Description of Test Environment
D1.3 Application Scenario and
Prototype Report
Specific objectives as per Description of Work are:
1. Selecting and defining the application scenarios in the telecommunications domain;
2. Definition of testing and evaluation criteria for the end users at HT;
3. Setting up of a test bed both at HT and at the project partner’s local sites; and
4. Implementation and evaluation of scenarios to demonstrate the advantage of FERARI with
respect to the state of the art as well as to demonstrate its business value.
1.4 Relationship with other Documents
The general purpose architecture described in D2.3 was instantiated in WP1 to handle the use cases described
in D1.2 and test results reported in D1.3. Fraud detection was implemented via the instantiation of specific
fraud rules in the Complex Event Processing – see detailed description in D4.1 and its application in the
flexible event model described in D4.3. D3.2 evaluates a communication efficient in-situ implementation of a
fraud rule on the fraud data described in D1.1, and D1.2. The distributed CEP optimizer described in D5.2,
and D5.3 was used to find optimal distributed placements for the complex event expressions that implement
the fraud rules.
2 Testing Environment
2.1 Prototype Environment
HT has installed a dedicated cluster hardware as test bed for the FERARI project. This test bed served two
main purposes:
Data repository for real-world data: Real-world data from HT is copied from the HT
operational infrastructure on the FERARI test bed. It is anonymized on the test bed by
the partner PI. Anonymized data is made available to the rest of the consortium.
Installation of FERARI use cases: Use-case specific application running on the FERARI
architecture was installed at the test bed. Recall that one final goal of the FERARI
architecture is to support communication efficient in-situ operations, which implies that the data is processed close to where it is created (i.e. in the HT infrastructure). However, since FERARI is a research project, experimenting directly on the HT operational infrastructure was not possible.
The test bed environment of the FERARI project is located in the Utrine data center in Zagreb, where the FERARI server is hosted. All consortium partners received remote access to the environment. All partners have access to anonymized data, while only Poslovna Inteligencija has access to raw data, which it anonymizes and stores in a separate file directory for the other consortium members to use.
The Hardware and Software configuration of the FERARI test bed is as follows:
Testing environment is placed in Data Center Utrina, Kombolova 2, 10 000 Zagreb
Physical Server; CentOS Linux release 7.1.1503 (Core)
Hardware Components : HP configuration – details in Appendix A
System SW installed :
o OS CentOS,
o Backup App. Networker
Application SW installed:
o NodeJS with attached modules
o Redis
o MySQL
o Java
o Anonymization engine
Additionally: Storm/FERARI platform will be installed and configured
Scheme of File System directories with volume of data collected:
o System directories
o oriData filesystem – contains raw data
o anonData filesystem – contains anonymized data
In a similar way, data from the HT network infrastructure is collected and copied into the FERARI test bed
hardware. The following Table summarizes the data connections and amount of data being transferred
between HT infrastructure and the FERARI test bed:
Data Sources and Volumes
HT data sources with period of collecting
Data Warehouse (CDR) – batch load
Fraudsters data – batch load (weekly)
VMware – Cloud Health logs – on daily basis
DSLAM (POLraw, POLraw_1 data bases) –On daily basis
DSLAM HW measurement (Netcool) – On daily basis
Data Volumes
DWH CDR (currently): 0.4 TB – planned additional 0.6 TB in next batch
Cloud Health logs (currently): 0.5 TB
DSLAM Line data: 1.9 TB
Total: 2.8 TB
Table 2: Data sources, periods and volumes of collection
The following Figure shows the relationship between the FERARI test bed and the FERARI architecture
described in D2.2 and how the data for fraud detection and system health monitoring is collected and
processed on the FERARI test bed. Note that the data from the operational HT infrastructure is copied into
the FERARI test bed.
Figure 4: Integration of the prototype within the FERARI architecture
The complete FERARI runtime system is installed on the FERARI test bed. The (anonymized) real-world
data, previously being collected and copied from the HT infrastructure to the FERARI test bed, is being fed
as streaming data into the FERARI Architecture. Use case specific algorithms implemented in the FERARI
architecture evaluate the streaming data and generate use case specific output, which can be fed into a use
case specific GUI.
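As a minimal sketch of how the collected data can be fed as a stream, the following hypothetical Python replayer reads anonymized CDR rows and emits them with their original relative timing. The file layout, the epoch-seconds timestamp format, and the function name are assumptions for illustration; the actual FERARI ingestion uses the platform's own components.

```python
import csv
import time

def replay_cdr_stream(path, speedup=1.0):
    """Yield anonymized CDR rows one by one, sleeping between events to
    preserve the relative spacing of their timestamps, optionally
    compressed by `speedup`. Assumes epoch-second timestamps in a
    `call_start_date` column (a hypothetical layout)."""
    with open(path, newline="") as f:
        prev_ts = None
        for row in csv.DictReader(f):
            ts = float(row["call_start_date"])
            if prev_ts is not None:
                # Wait for the (compressed) gap between consecutive events.
                time.sleep(max(0.0, (ts - prev_ts) / speedup))
            prev_ts = ts
            yield row
```

A large `speedup` replays the trace almost instantly, which is also the idea behind the time compression applied to the dataset in Section 3.2.1.2.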
2.2 Privacy Constraints for Publication
Due to privacy restrictions, the real thresholds used in HT for detecting fraud cannot be published. However, these values were disclosed to consortium members in order to perform valid testing, i.e. to enable a comparison of results.
2.2.1 Privacy Requirements
Sensitive customer data is used as input for FERARI and therefore strict privacy policies and anonymisation
rules have to be applied. In month 3 of the project, new EU Opinion on Anonymisation caused re-evaluation
of anonymisation techniques previously in use.
New anonymisation techniques were developed by Poslovna Inteligencija and evaluated by Hrvatski Telekom
privacy team. Approval of new anonymisation techniques was required (and successfully obtained) from
Deutsche Telekom Headquarter in Bonn.
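The deliverable does not detail the new anonymisation techniques. As one common, purely illustrative approach (not necessarily what Poslovna Inteligencija implemented), keyed hashing yields stable but irreversible pseudonyms, so records belonging to the same subscriber remain joinable across data sets without revealing the subscriber number:

```python
import hashlib
import hmac

def pseudonymize_msisdn(msisdn: str, secret_key: bytes) -> str:
    """Replace a subscriber number with a stable pseudonym via keyed
    hashing (HMAC-SHA256): the same MSISDN always maps to the same
    token, but the mapping cannot be reversed without the key."""
    digest = hmac.new(secret_key, msisdn.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # truncated token, still join-stable
```

Keeping the key with the anonymizing partner only (here, PI) ensures the other consortium members can correlate events per subscriber without access to the raw identifiers.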
3 Scenarios Evaluation
This chapter describes the use cases investigated in FERARI and testing methodologies and business values
in detail. Two real world use cases from Hrvatski Telekom were selected to test the solutions to be developed
within FERARI. These use cases are:
Fraud Mining; and
System health monitoring.
Each use case is described in a dedicated section.
3.1 Evaluation Procedure and Tasks Performed
When evaluating the FERARI solution, we took several criteria into consideration:
Criteria Evaluation Through KPIs

CRITERIA                          FRAUD USE CASE                 SYSTEM HEALTH MONITORING USE CASE
Processing time                   Fraud case detection time      Failure prediction time
Number of false alarms / value    Number of fraudsters;          False alarm ratio;
of the proposed solution          revenue at risk due to fraud   failure precision percentage

Table 3: Criteria Evaluation Matrix
For each use case testing we focused on:
Correct implementation of theory in the solution;
Testing of the solution automation;
The ability of key components to communicate among each other;
Correctness of output data format and content; and
The defined KPIs.
For the purpose of testing we performed the following:
Formulated evaluation criteria, verification procedure and feedback report guidelines
Created usability matrix which was used to perform usability evaluation of the implemented
Fraud use case by HT's end users
Held workshops to perform dedicated evaluation sessions engaging both end users and
developers
Evaluated the developed software tools by testing functionalities, accessibility, respect of user
needs
Evaluated integration and execution times
3.2 Fraud Use Case
This section defines the evaluated KPIs and business value from FERARI fraud detection use case.
Our goal is to assess the accuracy of the set of event rules specified for the mobile fraud use case. To this end,
we have implemented a new set of event rules applying real thresholds and tested it on a real data set as
described in the following sub-sections.
Note that, as per HT’s privacy requirements, the thresholds shown in the event patterns are masked with “???”. The implementation, however, uses the real values, so performance evaluation is possible.
3.2.1 Business Evaluation Criteria
Our aim was to benchmark HT’s fraud detection system against the results derived from the FERARI fraud use case. To make the results comparable, we set the real thresholds used in HT’s fraud detection system in the FERARI implementation. The result of the test is the set of subscribers flagged as fraudsters.
For the purpose of results evaluation, the following KPIs have been determined:
Fraud Use Case KPIs

KPI: Number of fraudsters
Description: Number of customers where fraudulent behavior is detected. Our goal is to detect as many customers who exhibit fraudulent behavior as possible; the expectation from the FERARI project is to increase the number of detected fraudsters. When counting the fraudsters detected in a data population, we have to take the false positive and false negative alarm ratios into account: declaring the whole population fraudulent would yield a perfect detection count but also the highest percentage of false positives. In general, the false positive rate should be no higher than that produced by the current fraud mining system deployed in HT.
Validation criteria: Increase in effectiveness or value with the proposed solution. Detect at least as many fraudsters as are detected today.

KPI: Fraud case detection time
Description: Time elapsed between the moment a fraud case happens and the moment HT detects it. We aim to detect fraudulent behavior as soon as possible and to take measures to prevent it. The expectation from the FERARI project is to decrease fraud detection time, meaning we will be able to detect fraud cases sooner than now.
Validation criteria: Velocity improvement. Decrease fraud detection time by 10 percent. The current fraud case detection time is 20 minutes.

KPI: Revenue at risk due to fraud
Description: Estimated sum of charges generated by fraudsters that will never be collected by HT. Fraudsters generate significant charges through fraudulent behavior and never pay the resulting invoices, which represents a direct revenue loss for the service provider. We aim to detect all fraudsters and to detect fraudulent behavior as soon as possible, so that fraudsters have limited time to generate fraudulent charges. This will in turn decrease the revenue at risk and bring more value.
Validation criteria: Increase in value with the proposed solution. Decrease revenue at risk due to fraud by 15 percent.

Table 4: KPIs for fraud use case
The first KPI is evaluated in terms of recall and precision as described in Section 3.2.3.1.
The second KPI is evaluated by comparing the detection times of the flagged calling numbers by MEGS to
their counterpart detection time derived by PROTON. A description of this evaluation is detailed in Section
3.2.3.2.
The third KPI is evaluated by multiplying revenue by the saving in detection time.
In addition, we also provide numbers related to the average latency of our application to give an estimation of
its performance (see Section 3.2.3.3).
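The computation behind the third KPI can be sketched as follows. The charge rate below is an illustrative placeholder; only the 20-minute current detection time and the 10 percent improvement target come from Table 4.

```python
def revenue_saved(charge_rate_per_min, old_detection_min, new_detection_min):
    """Revenue at risk avoided by earlier detection: the charges a
    fraudster would have generated during the detection-time difference."""
    return charge_rate_per_min * (old_detection_min - new_detection_min)

# Hypothetical charge rate of 5 units/min; detection in 18 minutes
# instead of the current 20 (the 10 percent improvement target):
saving_per_case = revenue_saved(5.0, 20.0, 18.0)  # 10.0 units per case
```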
3.2.1.1 KPIs event-driven implementation
The new event processing network (EPN) consisting of three event processing agents (EPAs) is shown in
Figure 5 and detailed in the following sections. For the sake of simplicity we only show the EPAs and the
events flow in the network.
In this new EPN we fire alerts in the cases described in detail in Appendix B – Fraud Alarms Description and Results (CONFIDENTIAL). As noted above, the implemented EPN runs with the real thresholds.
We follow the same semantics and naming conventions as in previous deliverables and show each EPA in terms of its event recognition process and context (please refer to the work package 4 deliverables on the FERARI project website1).
1 http://www.ferari-project.eu/key-deliverables/
Figure 5: KPI EPN
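To make the EPA semantics concrete, the following is an illustrative sketch (not the PROTON implementation) of a SUM-type EPA such as the one behind Alarm X: a segmentation context partitions the input calls by calling_number, and a situation is derived once the per-partition sum of charges exceeds the threshold, which is masked here behind a constructor parameter in line with HT's privacy requirements.

```python
from collections import defaultdict

class SumEPA:
    """Illustrative sketch of a SUM-type event processing agent: within a
    segmentation context keyed by calling_number, accumulate the charges
    of incoming Call events and derive an Alarm X situation once the
    running sum exceeds the threshold (the real HT value is confidential
    and masked as "???" in the deliverable)."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.sums = defaultdict(float)  # one partition per calling number

    def on_call(self, event):
        key = event["calling_number"]
        self.sums[key] += event["total_call_charge_amount"]
        if self.sums[key] > self.threshold:
            return {"name": "Alarm X", "calling_number": key,
                    "amount_spent": self.sums[key]}
        return None
```

The attribute names (calling_number, total_call_charge_amount, amount_spent) follow Table 5; the in-memory dictionary stands in for PROTON's context management.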
3.2.1.1.1 Event types
Four event types have been defined that comprise the event inputs, outputs/derived, and situations as shown
in Table 5. As in our previous application, the input event is a CDR record, “Call” for short.
Event name               Payload
Billed_msisdn (or Call)  object_id; billed_msisdn; call_service; call_start_date; calling_number; call_number; call_direction; tap_related; conversation_duration; other_party_tel_number; total_call_charge_amount; customer_activation_date; customer_type
Alarm X                  calling_number; amount_spent; call_start_dates
Alarm Y                  calling_number; amount_spent; call_start_dates
Alarm Z                  calling_number; number_of_calls; call_start_dates
Table 5: Event types for the KPI EPN
3.2.1.2 Dataset
The dataset comprised 80 GB of masked CDRs over a time span of 30 days, corresponding to July 2016. We
counted 18,426,138 events per 24 hours on a random day, meaning an injection rate of roughly 213 events per second.
Extrapolating from the number of records per day, we arrive at approximately 553 million records per month.
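These rates follow directly from the daily count; a quick arithmetic check using the figures above:

```python
events_per_day = 18_426_138                 # counted on a random day of the dataset
events_per_second = events_per_day / (24 * 3600)
events_per_month = events_per_day * 30
print(round(events_per_second))             # 213 events per second
print(events_per_month)                     # 552,784,140, i.e. approx. 553 million per month
```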
We had to solve the following issues:
• Running the application on the FERARI server at HT, due to privacy issues as well as the size of the input file.
• Running the application as if events were arriving in real time (according to their timestamps).
• Compressing the timescale so that we do not have to wait one month to get the results of a run.
The first issue was solved by installing ProtonOnStorm on FERARI's server and running the application on
HT's premises.
We tackled the last two issues by applying our utility PETITE (Proton EvenT Injection & Time compression),
developed in the scope of the SPEEDD project2. PETITE provides a mechanism that takes the
input events, injects them into the engine at very high rates (much higher than in reality imposed by
their timestamps), and processes them very fast without altering the logic of the application (for a description of
PETITE refer to3).
The PETITE script is an external utility to PROTON that, given an ordered input file and a total elapsed time
for processing all the events in the input file (a compression "ratio"):
• Checks the feasibility of the requested ratio. This means that events can be injected in a way that
the engine can process without their being out of order, that contexts do not become too small, and
that a minimal interval between input events is preserved. For example, given an input file whose
first input event occurs on 1st Dec 2014 and whose last input event occurs on 7th Dec 2014, if we
target running all these events in only one hour, then our ratio is 168 (we "compress" or reduce all
times by 168, as we move from one week, or 168 hours, to one hour).
• Alters the temporal contexts in the JSON definition file for PROTON in such a way that the ratio
is met and the application logic is maintained.
• Alters the time intervals between input events so that they are compressed to the required
injection rate. The original date and time remain intact for the purposes of date comparisons
and manipulation; however, an additional column of timestamps is added to the input file, so
that it can be used as the "OccurenceTime" attribute in the Timed File Input adapter of PROTON
for injecting events based on those intervals.
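The interval-compression step can be illustrated with a minimal sketch (this illustrates the idea only, not the actual PETITE implementation; the function name is ours):

```python
from datetime import datetime, timedelta

def compress_timestamps(event_times, target_hours):
    """Compress a sorted list of event datetimes so that the whole
    span fits into target_hours, preserving relative spacing.
    Returns the ratio and the compressed injection timestamps."""
    span = event_times[-1] - event_times[0]
    ratio = span.total_seconds() / (target_hours * 3600.0)
    base = event_times[0]
    return ratio, [base + (t - base) / ratio for t in event_times]

# one week of daily events (starting 1st Dec 2014) squeezed into one hour
times = [datetime(2014, 12, 1) + timedelta(days=d) for d in range(8)]
ratio, injected = compress_timestamps(times, target_hours=1)
print(ratio)                        # 168.0
print(injected[-1] - injected[0])   # 1:00:00
```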
In our KPIs application, we compressed one month of input data into 8 hours.
As aforementioned we tested for the detection of three event patterns: Alarm X, Alarm Y, and Alarm Z.
3.2.2 Usability Evaluation
Based on the defined usability evaluation matrix, HT's experts evaluated the Fraud use case along the following
segments:
Fraud Use Case – Usability Evaluation
Criteria Results Description
Features and functionality  Features and functionality meet common user goals and objectives to a limited
extent. Additional tracking options for suspect fraudsters should be added to the
solution (e.g. flagging users). Also, additional filtering options and help should be added.
2 http://speedd-project.eu/
3 https://github.com/ishkin/Proton/
Homepage/starting page  The starting page provides a clear snapshot and overview of the content, features
and functionality available. Layout is clear and uncluttered with sufficient 'white
space'.
Navigation  Navigation is easier than in the current system in HT. The navigation is intuitive
and consistent. When adding complexity to the solution, care should be
taken not to clutter the interface and make the navigation
more complex and less intuitive.
Search  A search function is available for each field, which offers quick, Excel-like filtering.
However, the current system implemented in HT offers much more complex filtering
options (e.g. subscriber type, filtering by flags set by the observer, filtering by the
number of alarms triggered by a subscriber, etc.).
Control and feedback  Users can easily undo, go back, and change or cancel actions. This seems more
advanced than what is implemented in the current solution in HT.
Errors Error handling is not present on the Dashboard side of the prototype and as such
could not be evaluated.
Content and text  Terms, language and tone used are consistent (e.g. the same term is used
throughout). The language, terminology and tone used are appropriate and readily
understood by the target audience.
Performance Site performance does not inhibit the user experience (e.g. slow page downloads
and long delays). Various user technologies are supported (e.g. browsers,
resolutions, computer specs). No reliability issues were found that could inhibit
the user experience.
Table 6: Usability evaluation criteria categories
The following scoring was used:
Very poor (less than 29) - Users are likely to experience very significant difficulties using the site
and might not be able to complete a significant number of important tasks.
Poor (between 29 and 49) - Users are likely to experience some difficulties using the site and
might not be able to complete some important tasks.
Moderate (between 49 and 69) - Users should be able to use the site and complete most
important tasks, however the user experience could be significantly improved.
Good (between 69 and 89) - Users should be able to use this site with relative ease and should
be able to complete the vast majority of important tasks.
Excellent (more than 89) - The site provides an excellent user experience for users. Users should
be able to complete all important tasks on the site.
The overall usability score for the FERARI dashboard was evaluated as GOOD, with an absolute score of 77/100.
The architectural integration of the end user application into the FERARI platform is described in detail in
D2.3.
3.2.2.1 FERARI Dashboard Overview
The FERARI dashboard contains the minimum information needed for basic analysis. The MEGS system, however,
offers more filtering and settings options. In addition, MEGS enables better tracking of history per subscriber and better
case visualization. Visualization is one of the key tools used by HT's experts in analyzing potential fraud cases.
The FERARI frontend should be upgraded with a minimum set of options in order to provide real value
as a day-to-day tool:
• A more complex dashboard
• More filtering options
• More information about subscribers from CRM and other systems
• Graphical representations of events
• More statistical information about the subscribers
• Ability to set a case to a certain state (e.g. checked, unchecked, monitoring, etc.)
• A setup window for creating new alarms or editing existing ones
• A separate, more complex visualization of the traffic created by a subscriber
• Fraud expert notification via e-mail or an SMS gateway
Given that FERARI's advantage lies in its speed over the existing system, HT will consider the effort needed to
implement the underlying architecture and upgrade the frontend.
Figure 6 shows the login form, the first screen of the application.
Figure 6: Login form
Figure 7 shows a graphical representation of the derived events detected by the system in
the last hour, including a real-time counter of the incoming events.
Figure 7: The derived events in graph representation
Figure 8 shows all derived events detected by the system.
Figure 8: List of the derived events
Figure 9 shows the most frequent calls and the number of their occurrences.
Figure 9: List of the most frequent calls
Figure 10 gives information about a subscriber as well as export functionality.
Figure 10: Information about subscriber
Figure 11 lists the calls that led to the selected derived event.
Figure 11: List of calls that led to the selected derived event
3.2.3 Performance Evaluation
In order to assess the accuracy of our event-driven application, we performed two types of analysis:
calculation of recall and precision (see Section 3.2.3.1) and of latency (see Section 3.2.3.3).
We would like to note that the event processing speed currently in place is tuned to actual business needs. The
centralized system in HT could be tuned to higher performance if requested. Therefore, the results presented
below are compared to the actual production status of the platform and do not represent the maximal
performance of the fraud management system used in HT.
3.2.3.1 Recall and Precision
In order to assess the accuracy of our event-driven application in terms of recall and precision, we compared
PROTON's results to the results produced by MEGS on the same data set (July 2016) for the three alarms we had
(this relates to the first business KPI, which aims to detect at least as many fraudsters as before). That is, we
compared the calling numbers in the derived events from PROTON to the ones detected by MEGS, for each
of the three event rules. MEGS' flagged calling numbers were provided to us by partners HT and PI in
csv files containing the flagged calling numbers (Billed_msisdn) along with the detection time by MEGS,
one csv file for each of the three event patterns.
3.2.3.1.1 Alarm X
The csv file contained 10 flagged calling numbers, as shown in Table 7. Table 8 shows PROTON's
derivations on the data set for this alarm. As can be seen from the tables, calling numbers 3859932QJACY
and 3859942YKEWM (rows 1 and 3 in Table 7) were flagged by MEGS but not by PROTON. On the other
hand, the calling numbers color-coded green (385985CYDWI, 3859587OFFWQ, 3859941JSEPA) in Table 8
were found by PROTON but not by MEGS.
Row # Billed_msisdn MEGS’ detection date
1 3859942YKEWM 01/07/2016
2 3859932FHJRR 01/07/2016
3 3859932QJACY 01/07/2016
4 3859966BYSFE 04/07/2016
5 3859942IHCQC 06/07/2016
6 3859816SNBYQ 12/07/2016
7 385986LZGVG 13/07/2016
8 3859942HYCEA 16/07/2016
9 3859947RTRKA 26/07/2016
10 3859980ZBDXE 28/07/2016
Table 7: MEGS' flagged calling numbers for alarm X
Row # in Table 7  Billed_msisdn  PROTON's detection time  MEGS' detection time
2 3859932FHJRR 01/07/2016-00:39:25 Refer to Section 6.2
4 3859966BYSFE 04/07/2016-15:09:59 Refer to Section 6.2
5 3859942IHCQC 06/07/2016-00:29:35 Refer to Section 6.2
6 3859816SNBYQ 12/07/2016-13:37:17 Refer to Section 6.2
7 385986LZGVG 13/07/2016-17:56:32 Refer to Section 6.2
8 3859942HYCEA 16/07/2016-20:03:51 Refer to Section 6.2
3859587OFFWQ 16/07/2016-21:01:38 Refer to Section 6.2
385985CYDWI 25/07/2016-13:42:34 Refer to Section 6.2
9 3859947RTRKA 26/07/2016-16:42:33 Refer to Section 6.2
10 3859980ZBDXE 28/07/2016-00:04:01 Refer to Section 6.2
3859941JSEPA 31/07/2016-13:00:29 Refer to Section 6.2
Table 8: PROTON's derivations (situations) times for alarm X versus MEGS
Further analysis in collaboration with experts in HT resulted in the following:
Rows 1 and 3 in Table 7 could not have been detected by PROTON, since the dates for these calling numbers
are July 1, meaning we would need events from June 30 in order to derive the corresponding derived
event. Therefore, we exclude these two numbers from our calculations of precision and recall. Regarding the
calling numbers found by PROTON and not found by MEGS, two of these three numbers were not in
the list of X alarms but were mistakenly entered under Y alarms in the MEGS annotation table (3859941JSEPA
and 385985CYDWI), since the annotation table was prepared manually. Therefore, these two situations are
considered true positives. With regard to the third situation, i.e. 3859587OFFWQ, the reason why it was not
detected by MEGS is not fully clear. We apply a conservative approach and consider this calling number a
false positive, though our drill-down of the pattern matching clearly shows that this calling number should be
flagged as fraudulent according to the event rule.
Therefore, we have a recall of 100% (we identified all fraudulent numbers marked by MEGS), while
the precision is 91% (assuming MEGS is the ground truth, we have 10 true positives and 1 false positive, i.e. 10/11).
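The computation behind these figures is standard set arithmetic; a small sketch (the calling numbers here are hypothetical placeholders):

```python
def precision_recall(detected, ground_truth):
    """Precision and recall of detected calling numbers
    against an annotated ground-truth set."""
    tp = len(detected & ground_truth)
    fp = len(detected - ground_truth)
    fn = len(ground_truth - detected)
    return tp / (tp + fp), tp / (tp + fn)

# 11 numbers flagged by the engine, 10 of which are in the ground truth
truth = {f"num{i}" for i in range(10)}
flagged = truth | {"num_extra"}
precision, recall = precision_recall(flagged, truth)
print(round(precision, 2), recall)  # 0.91 1.0
```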
3.2.3.1.2 Alarm Y
The csv file contained 95 flagged calling numbers. A similar analysis as for Alarm X resulted in a
precision of 97% (as before, assuming MEGS is the ground truth, we have 95 true positives and 3 false
positives) and a recall of 100%. Again, we apply a conservative approach and consider these calling numbers
false positives, though our drill-down of the pattern matching clearly shows that they should
be flagged as fraudulent according to the event rule.
3.2.3.1.3 Alarm Z
This type of alarm had not been working properly in MEGS for some time, so we were not able to get any
annotations. However, PROTON also did not detect any fraudulent calling
number for this alarm type, so we cannot draw any conclusions based on it.
3.2.3.2 Timing of detection
One of the main characteristics of complex event processing systems is their ability to derive situations in
real-time. This is extremely important in use cases such as fraud detection, since the sooner an operator
blocks a fraudulent mobile phone number, the higher the savings can be. To check whether we were able to
detect fraudulent calls before MEGS, we compared the timestamps of PROTON derived events to their
counterparts in the csv file.
Time saving results are significant both for Alarm X and Y. Further details regarding timing are discussed in
Section 6.2 due to privacy restrictions.
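The comparison itself reduces to timestamp subtraction in the "dd/mm/yyyy-HH:MM:SS" format used in the tables; a sketch with hypothetical timestamps (the real MEGS detection times are confidential):

```python
from datetime import datetime

FMT = "%d/%m/%Y-%H:%M:%S"

def minutes_saved(proton_ts, megs_ts):
    """Whole minutes gained by PROTON's earlier detection."""
    delta = datetime.strptime(megs_ts, FMT) - datetime.strptime(proton_ts, FMT)
    return int(delta.total_seconds() // 60)

# hypothetical example: PROTON fires 23 minutes before MEGS
print(minutes_saved("01/07/2016-00:39:25", "01/07/2016-01:02:25"))  # 23
```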
3.2.3.3 Latency
Latency is defined as the elapsed time between the detection time of the last input event required for a
pattern matching and the corresponding detection time of the output event. For example, assuming that the
(derived) event D is defined as a sequence of events (E1, E2, E3), the latency is measured as the time
from when an instance of E3 arrives in the system until the emission of the corresponding instance of D.
Of course, this definition does not apply to all event patterns. For instance, it is not applicable to 'absence'
event patterns or any other event patterns triggered by the expiration of a time window. Therefore, for
this definition to hold, we analyze the performance of PROTON on a set of "applicable" patterns with the
IMMEDIATE evaluation policy. All patterns in our KPI implementation use the IMMEDIATE
evaluation policy and are TREND or aggregation patterns, and are therefore applicable for latency analysis.
In order to correlate the derived event with the latest input event that triggered the derivation, we leverage the
feature of PROTON that allows attaching to a derived event the matching set of contributing input events.
Thus, given an instance of the derived event, one can easily obtain the list of events that contributed to
the event pattern, along with their timestamps, so it is straightforward to compute the latency as defined
above.
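Given the matching set attached to a derived event, the latency computation itself is trivial; a sketch (the dictionary layout of the matching set is our assumption for illustration, not PROTON's actual API):

```python
from datetime import datetime

def pattern_latency_ms(derived_emit_time, matching_set):
    """Latency = time from arrival of the last contributing
    input event to emission of the derived event, in ms."""
    last_input = max(e["detection_time"] for e in matching_set)
    return (derived_emit_time - last_input).total_seconds() * 1000.0

matching = [
    {"name": "Call", "detection_time": datetime(2016, 7, 1, 0, 39, 25, 100000)},
    {"name": "Call", "detection_time": datetime(2016, 7, 1, 0, 39, 25, 212000)},
]
print(pattern_latency_ms(datetime(2016, 7, 1, 0, 39, 25, 228000), matching))  # 16.0
```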
The scale of the data requires that the performance analysis of latency be done on a cluster infrastructure,
testing different numbers of workers. To this end, we exploited the infrastructure developed in the scope of the
SPEEDD2 project, comprising a cluster of four physical machines with the following configuration:
• CPU: 2 x Intel Xeon E5520 @ 2.27 GHz, 16 threads (8 cores)
• RAM: 12 GB ECC
• Disks: 2 x 1 TB (RAID1)
• NIC: 4 x 1 Gbps
• OS: Debian 8
On this cluster we have already carried out performance tests and have a benchmark of different patterns and
applications. In our case we only have two patterns, COUNT and SUM with the IMMEDIATE pattern policy,
for which we already have results, as detailed below.
The cluster has the Mesos4 framework installed, which manages the computational resources and thus simplifies
the task of cluster configuration and resource allocation. A single virtual machine runs on every physical
machine (to simplify maintenance and management), with the exception of one machine on which an additional small VM
runs that functions as a Mesos gateway server. The storm-mesos5 framework is used to run the STORM cluster on
Mesos, and the kafka-mesos6 framework is used to run the Kafka cluster on Mesos.
4 http://mesos.apache.org/
5 https://github.com/mesos/storm
6 https://github.com/mesos/kafka
The topology of the Mesos cluster is shown in Figure 12 below:
Figure 12: Mesos cluster topology
Table 9 lists the different configurations tested. As can be seen in the table, configurations that involve multiple
Kafka partitions and brokers were not tested in this version. The reason is that initial test results
demonstrated that the messaging layer in its minimal configuration (single broker, single executor for the kafka-storm
spout, 1-2 executors for the kafka-bolt) was operating significantly below its capacity, and further increasing the
messaging power would not improve the performance of the entire system. As can be seen from the table, adding
more workers improves latency.
Config  Number of workers  CEP parallelization factor  End-to-end latency (ms), 50 events/sec  End-to-end latency (ms), 500 events/sec
1       1                  1                           31                                      189
2       1                  2                           86                                      253
3       2                  4                           25                                      111
4       4                  8                           14                                      44.7
5       4                  16                          16                                      87
6       8                  16                          11                                      16
Table 9: Performance Results Summary (90th percentile values)
Since the injection rate required by our use case is around 200 events/second, we can see that with the
specified configuration we can reach a latency of no more than 16 ms (90th percentile; the average latency
is probably much lower). If the rate of events increases, or faster processing is required, we can add
more STORM workers to further reduce the processing latency.
3.2.3.4 Throughput
Throughput is defined as the maximum rate at which events can be processed. The complete HT mobile network
generates approximately 500 million call/SMS events per month, which implies that the required average
throughput for the complete HT network is less than 200 events per second. Nevertheless, the required real-time
throughput might be higher, since in a peak minute the HT network may generate up to 340 events per second,
and on some occasions even more. The highest recorded rate was about 1,750 events per second.
Experiments on the current FERARI server show that we were able to run a stable system with an injection rate
of 500 events per second, which is more than enough to handle the current load.
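The average-rate figure follows directly from the monthly volume:

```python
events_per_month = 500_000_000            # approximate HT network volume
avg_per_second = events_per_month / (30 * 24 * 3600)
print(round(avg_per_second))              # 193, i.e. under 200 events per second
```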
FERARI server hardware configuration:
• CPU: 2 x Intel Xeon E5-2660v2 @ 2.2GHz -- Cores : 10
• RAM: 16 GB
• Disks: 8x3TB
• NIC : 1Gbps and 10Gbps
Setting up the system in a distributed manner on HT's infrastructure will be considered once it goes into
production use. Cell towers do not allow additional data processing and cannot be used to process call events,
but HT already has a set of 80 probe servers in 6 locations across Croatia that can be used to run the
FERARI sites in a distributed manner. These servers are used for network monitoring and generate aggregated
information per call, including the A and B numbers, timestamps, and additional call attributes needed for fraud
detection. Using these servers, we can scale out our architecture and improve throughput far beyond what
would ever be needed.
3.2.3.5 Revenue at Risk
In order to demonstrate the real business value of the implemented prototype, we estimated the sum of charges
generated by fraudsters that will never be collected by HT. Our aim was to decrease the revenue at risk due to
fraud by 15 percent.
Table 10: Revenue saved by earlier detection
Revenue at risk
ALARM    % of revenue at risk saved in relation to alarm threshold value    % of revenue saved on average per fraud
Alarm X  62%                                                                30%
Alarm Y  403%                                                               30%
The data in the previous table was calculated under the following assumptions and considerations:
• The actual average value of one hour of fraud and the actual average value of a fraud were taken
into consideration for Alarms X and Y (with Alarm X having a triggering threshold value
6.5 times higher).
• This value was extrapolated to the total minutes saved by the earlier detection time.
• We divided this number by the alarm threshold value of each alarm (??? minutes;
refer to Section 6.2 for exact values).
• % of revenue at risk saved in relation to alarm threshold value = Saved revenue per
alarm / Alarm threshold value.
• Further on, we divided the average saved revenue of earlier detection by the average
fraud value in HT for the selected alarms to calculate the decrease of revenue at risk.
• % of revenue saved on average per fraud = Saved revenue per alarm / Average fraud
value.
Since the average time saved by earlier detection is some ??? minutes (refer to Section 6.2 for exact values),
we can calculate that for Alarms X and Y we would decrease the revenue at risk by some 30%, which is
significantly more than the 15% set in the KPI.
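The two ratios above can be written out as a small sketch; all numeric inputs below are hypothetical placeholders, since the actual thresholds and fraud values are confidential:

```python
def revenue_ratios(saved_revenue_per_alarm, alarm_threshold_value, avg_fraud_value):
    """The two percentages reported in Table 10."""
    vs_threshold = saved_revenue_per_alarm / alarm_threshold_value
    vs_avg_fraud = saved_revenue_per_alarm / avg_fraud_value
    return vs_threshold, vs_avg_fraud

# hypothetical inputs: 60 kn saved per alarm, threshold 100 kn, average fraud 200 kn
vs_threshold, vs_avg_fraud = revenue_ratios(60.0, 100.0, 200.0)
print(vs_threshold, vs_avg_fraud)  # 0.6 0.3
```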
3.2.4 Other Test Results
3.2.4.1 Ease of Context Configuration or Event Processing
Proton's authoring tool is used to define CEP applications, i.e. Event Processing Networks (EPNs), and to
deploy them to an engine through a web-based interface.
The Proton metadata file is a JSON file containing all EPN definitions created by the Proton authoring tool.
This includes definitions for event and action types, EPAs, contexts, and producers and consumers.
When the Proton runtime starts, it accesses the metadata file, loads and parses all the definitions, creates a thread
per input and output adapter, and starts listening for events incoming from the input adapters and forwarding
events to the output adapters.
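As an illustration only, a runtime could load such a metadata file along the following lines (the JSON structure here is a simplified, hypothetical stand-in for the actual Proton schema):

```python
import json

metadata = json.loads("""
{
  "epn": {
    "events":    [{"name": "Call"}, {"name": "AlarmX"}],
    "epas":      [{"name": "EPA1_AlarmX", "epaType": "Sum"}],
    "contexts":  [{"name": "By_calling_number", "type": "segmentation"}],
    "producers": [{"name": "CDRFile", "type": "TimedFile"}],
    "consumers": [{"name": "AlarmFile", "type": "File"}]
  }
}
""")

# list each definition section and the names it contains
for section, defs in metadata["epn"].items():
    print(section, [d["name"] for d in defs])
```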
Figure 13: The event processing (EPA) of type SUM that specifies Alarm X
Figure 14: The definition of the input event (CallPOPDHW), i.e. the list of attributes and their types for this event type
Figure 15: The definition of the segmentation context, in our case, we partition the input events (CallPOPDHW) by their calling_number attribute
3.2.4.2 Interoperability
ProtonOnSTORM provides an API to integrate into a standard STORM topology. This means that the system
benefits from all the various interoperability solutions the STORM platform has to offer. STORM provides
pluggable implementations for integration with different queuing and data systems as information
sources/sinks: messaging systems like Kestrel, RabbitMQ, Kafka, and JMS, and databases like MongoDB, RDBMSs,
Cassandra, etc. Since STORM has a large open-source community, this list of ready-to-use integration
solutions grows continuously.
3.2.5 Summary of our Findings in the Fraud Use Case
Our results are very positive. First, we achieved a very high percentage of recall and precision. This is
not surprising, as we specified the rules according to the domain experts and the way they are implemented in
MEGS. However, it shows that our implementation runs correctly and as expected.
More importantly, we were able to detect the alarms in real time, as opposed to the batch
processing currently performed by MEGS, enabling more rapid detection of frauds. We derive situations
on average ??? minutes earlier than MEGS (refer to Section 6.2 for exact values). This slack can be used by
the operators for earlier inspection and, if necessary, blocking of a specific mobile phone number.
Regarding the KPIs listed in Section 3.2.1, as we alert immediately when the pattern is matched and do not
wait until the end of the hour, the time savings are (MEGS processing time + the remaining time until the
end of the hour time window). As already mentioned, the average saving surpassed the expected KPI1
outcome (target of 18 minutes after collection). Regarding KPI2, we have only 1 false positive for Alarm X while
we have 3 false positives for Alarm Y (the reason for which is undetermined, since judging by the data and the
rule definition they should have been marked as fraud by MEGS as well), with a recall of 1 in both cases.
3.3 System Health Monitoring Use Case
In the third year of the FERARI project, the Technion looked into DSLAM performance data provided
by HT. The data consisted of hourly readings of measurements from more than 678,000 DSLAM ports. In
addition, HT provided logs of events related to the performance of the ports. Examples of such events
are a port successfully establishing a link with the endpoint, a port losing a link, or the endpoint equipment
losing power. Finally, a daily quality score, known as the Line Stability Index (LSI), was calculated by
HT for every port in the dataset. LSI values range from 2 to 5, where a score of 5 indicates a very stable line
and a score of 2 indicates a very unstable line. Poor line quality (a score of 2 or 3) was taken as an indication of
a failure in the DSLAM equipment.
After closely examining the data and receiving further clarification from HT, it became apparent that while
poor line quality can be a product of malfunctioning DSLAM equipment, it is also affected by the
performance of the endpoint equipment (e.g. a DSL modem at the subscriber's house or network equipment
connected to it) and by actions of the end user (powering down, restarting, or disconnecting such equipment).
Since it is impossible to isolate the effects of endpoint equipment failures and actions on line quality from
those of DSLAM equipment failures, and since an off-the-shelf DSL modem is more likely to fail than
telco-grade DSLAM equipment, predicting poor line quality based on DSLAM performance indicators seems to be an
intractable task.
Hence, we decided to continue the evaluation of system health monitoring on other real-world data, as
described in D3.3.
We present descriptive statistics of the data in Appendix C due to its confidential nature.
4 Summary
In Year 3 of the FERARI project we tested the solution in terms of business-defined KPIs, system performance, and general usability.
Overall, the results show the business benefits of complex event processing for business users, i.e. quicker reaction times and potential cost savings compared to the current speed of the solution implemented in HT (though the current system in HT could be improved to a certain extent). We have achieved all defined business KPIs: detection of at least the same number of fraudsters, better detection time, and decreased revenue at risk through better detection time. Additionally, latency is easily reduced by adding more STORM workers to the system if needed as the rate of injected events increases. Proton's authoring tool enables easier configuration of event processing, thus enabling easier deployment in a production environment. Though the solution interface for the Fraud use case enables first-level analysis, it requires some additional development in order to provide a minimum set of usability options for day-to-day use. However, given that all three business KPIs for the Fraud use case were achieved, this investment should have a positive ROI and shall be considered.
We have analyzed the DSLAM performance data and come to the conclusion that, while poor line quality of DSLAM devices can be a sign of a malfunction of the device itself, it can also be caused by poor endpoint equipment. Since it is currently impossible to isolate the effects of endpoint equipment failures and actions on line quality from those of DSLAM equipment failures, and given that an off-the-shelf DSL modem is more likely to fail than telco-grade DSLAM equipment, predicting poor line quality based on DSLAM performance indicators seems to be an intractable task. Therefore, we continued the evaluation of system health monitoring on other real-world data, as described in D3.3.
5 Appendix A - Hardware configuration
HP ProLiant DL380p Gen8 12 LFF Configure-to-order Server
HP DL380p Gen8 Intel Xeon E5-2660v2 (2.2GHz/10-core/25MB/95W) FIO Processor Kit
HP DL380p Gen8 Intel Xeon E5-2660v2 (2.2GHz/10-core/25MB/95W) Processor Kit
HP 16GB (1x16GB) Dual Rank x4 PC3-12800R (DDR3-1600) Registered CAS-11 Memory Kit
8 x HP 3TB 6G SATA 7.2K rpm LFF (3.5-inch) SC Midline 1yr Warranty Hard Drive
HP 2GB P-series Smart Array Flash Backed Write Cache
HP Ethernet 1Gb 4-port 331FLR FIO Adapter
HP Ethernet 10Gb 2-port 561T Adapter
HP 750W Common Slot Platinum Plus Hot Plug Power Supply Kit
HP iLO Advanced including 1yr 24x7 Technical Support and Updates Single Server License
HP Insight Control including 1yr 24x7 Support ProLiant ML/DL/BL-bundle Tracking License
HP 2U LFF BB Gen8 Rail Kit with CMA
6 Appendix B – Fraud Alarms Description and Results (CONFIDENTIAL)
It is important to note that, as per HT's privacy requirements, the thresholds shown in the event patterns are
masked with "???". However, the implementation uses the correct numbers, so
performance evaluation is possible.
In this new event processing network (EPN) we fire alerts in the following cases (for detailed descriptions of
each EPA see Section 6.1):
• This alarm detects all customers that spend ??? kn or more on voice premium rate services in a
period of 24 hours (EPA1, Alarm X).
• This alarm detects all customers that have been active ??? days or less and spend ??? kn or more on
voice premium rate services in a period of 24 hours (EPA2, Alarm Y).
• This alarm detects all customers that make ??? calls or SMSes to premium rate voice and SMS
VAS services in a 4-hour period (EPA3, Alarm Z).
Note the following:
• As previously noted, the implemented EPN runs with the real thresholds.
• Premium services are calling numbers with prefix 06.
• Voice services are CDRs where call_service = T11.
• SMS services are CDRs where call_service = T12.
6.1 Event processing agents
Henceforth, we describe the EPAs in the following order: event name; motivation; event recognition
process; contexts along with the temporal context policy; and pattern policies.
In the event recognition process we only show the steps that are relevant in the specific EPA,
while the others are greyed out. For the filtering step we show the filtering expression; for the matching step we
denote the pattern variables; and for the derivation step we denote the value assignments and calculations.
Please note that for the sake of simplicity we only show the assignments that are not copies of values (all other
derived event attribute values are copied from the input events). For attributes, we just denote their names
without the 'attribute_name.' prefix.
6.1.1 EPA1: Alarm X
Motivation: We are looking for customers with high usage of voice premium services as potential fraudsters. For
this type of alarm MEGS is triggered every hour (45 min after the full hour, e.g., 8:45, 9:45, and so on)
and looks for this pattern 24 hours backwards.
Event recognition process
Figure 16: Event recognition process for Alarm X EPA
Pattern policies
Evaluation Cardinality Repeated Consumption
IMMEDIATE SINGLE FIRST REUSE
Context
Segmentation: calling_number
Temporal (fixed sliding window):
• Initiator: 8:45 AM
• Duration: 24 hours
• Sliding period: 1 hour
Meaning: For each customer, the first window opens at 8:45 AM, and a new window opens every hour afterwards, so we have overlapping sliding windows. Each window lasts 24 hours. For example (see Figure 17), let’s assume that for a certain customer we detect one alarm X in the first open temporal window (in blue), one in the second (in azure), and one in the third (in green). Note that in the first temporal window we actually detect two derived events (the second one in grey); however, since the policies applied are IMMEDIATE and SINGLE, we fire only one derived event, as soon as it is detected. This derived event is fired in the scope of the second temporal window (in azure). The color of the events in each temporal window denotes the events specific to that window, while the blue events denote the events common to all open windows in our example (non-overlapping events can only occur in the tails of the windows, when new input events arrive).
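The overlapping-window behavior described above can be sketched as follows. This is an illustrative helper only (not part of the PROTON implementation; all names are ours): given an event timestamp, it lists every open sliding window (initiator 8:45 AM, 24-hour duration, 1-hour slide) that contains the event.

```python
from datetime import datetime, timedelta

def open_windows(event_time, first_open, duration, slide):
    """Return the [start, end) spans of all sliding windows containing event_time.

    Windows open at first_open and then every `slide`; each stays open for
    `duration`, so up to duration/slide windows overlap at any moment.
    """
    windows = []
    start = first_open
    while start <= event_time:
        end = start + duration
        if start <= event_time < end:
            windows.append((start, end))
        start += slide
    return windows

# A call event at 10:00 AM on the second day falls into every window opened
# during the preceding 24 hours, i.e. 24 overlapping windows.
event = datetime(2016, 7, 2, 10, 0)
wins = open_windows(event, datetime(2016, 7, 1, 8, 45),
                    timedelta(hours=24), timedelta(hours=1))
```

Since duration/slide = 24, a single input event can participate in up to 24 window evaluations, which is why the pattern policies (IMMEDIATE, SINGLE) matter for suppressing duplicate derivations.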
Figure 17: Context for Alarm X EPA per single customer
6.1.2 EPA2: Alarm Y
Motivation: We are looking for relatively new customers in the system with heavy use of voice premium services as potential fraudsters. For this type of alarm, MEGS is triggered every hour (45 min after the rounded hour, e.g., 8:45, 9:45, and so on) and looks for this pattern 24 hours backwards.
Figure 18: Event recognition process for Alarm Y EPA
Pattern policies
• Evaluation: IMMEDIATE
• Cardinality: SINGLE
• Repeated type: FIRST
• Consumption: REUSE
Context
Segmentation: calling_number
Temporal (fixed sliding window): Same as for Alarm X (see Figure 17).
• Initiator: 8:45 AM
• Duration: 24 hours
• Sliding period: 1 hour
6.1.3 EPA3: Alarm Z
Motivation: We are looking for customers with a large number of calls to premium rate voice and SMS VAS services within a 4-hour period as potential fraudsters. For this type of alarm, MEGS is triggered every hour (30 min after the rounded hour, e.g., 8:30, 9:30, and so on) and looks for this pattern 4 hours backwards.
Figure 19: Event recognition process for Alarm Z EPA
Pattern policies
• Evaluation: IMMEDIATE
• Cardinality: SINGLE
• Repeated type: FIRST
• Consumption: REUSE
Context
Segmentation: calling_number
Temporal (fixed sliding window): Same as for Alarm X and Alarm Y, except that the length of the window is 4 hours instead of 24 hours, and the initiator is 30 min after the rounded hour (8:30, 9:30, 10:30, and so on).
• Initiator: 8:30 AM
• Duration: 4 hours
• Sliding period: 1 hour
6.2 Timing of detection in more detail
Row # in Table 2 | Billed_msisdn | PROTON’s detection time | MEGS’ detection time
2  | 3859932FHJRR | 01/07/2016-00:39:25 | 01/07/2016-00:58:28
4  | 3859966BYSFE | 04/07/2016-15:09:59 | 04/07/2016-17:54:08
5  | 3859942IHCQC | 06/07/2016-00:29:35 | 06/07/2016-02:58:28
6  | 3859816SNBYQ | 12/07/2016-13:37:17 | 12/07/2016-15:52:00
7  | 385986LZGVG  | 13/07/2016-17:56:32 | 13/07/2016-18:57:35
8  | 3859942HYCEA | 16/07/2016-20:03:51 | 16/07/2016-21:46:32
   | 3859587OFFWQ | 16/07/2016-21:01:38 |
   | 385985CYDWI  | 25/07/2016-13:42:34 |
9  | 3859947RTRKA | 26/07/2016-16:42:33 | 26/07/2016-18:50:02
10 | 3859980ZBDXE | 28/07/2016-00:04:01 | 28/07/2016-02:53:53
   | 3859941JSEPA | 31/07/2016-13:00:29 |
Table 11: PROTON’s derivation (situation) times for alarm X versus MEGS – real values
In our case, MEGS runs the query for each of the alarms every hour. Therefore, theoretically, the maximum potential time that can be gained by PROTON is (60 + MEGS processing time) minutes, assuming that the detection is done just after the window is opened. In the worst case, that is, when the detection by PROTON is done only at the end of the temporal window, the potential time that can be gained is (MEGS processing time) minutes.
Calculating the differences between the detection times of MEGS and PROTON from Table 11, we see that PROTON detects earlier by an average of 115 minutes, with a maximum of 169 minutes and a minimum of 20 minutes. The results for alarm Y were almost the same, showing that we can obtain a significant time saving due to our real-time detection.
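As a sanity check, the per-row gains can be recomputed directly from the timestamp pairs in Table 11 (restricted to the rows detected by both systems); the short sketch below uses only the Python standard library, and all variable names are ours:

```python
from datetime import datetime

FMT = "%d/%m/%Y-%H:%M:%S"

# (PROTON detection time, MEGS detection time) pairs from Table 11,
# for the rows where both systems raised the alarm.
pairs = [
    ("01/07/2016-00:39:25", "01/07/2016-00:58:28"),
    ("04/07/2016-15:09:59", "04/07/2016-17:54:08"),
    ("06/07/2016-00:29:35", "06/07/2016-02:58:28"),
    ("12/07/2016-13:37:17", "12/07/2016-15:52:00"),
    ("13/07/2016-17:56:32", "13/07/2016-18:57:35"),
    ("16/07/2016-20:03:51", "16/07/2016-21:46:32"),
    ("26/07/2016-16:42:33", "26/07/2016-18:50:02"),
    ("28/07/2016-00:04:01", "28/07/2016-02:53:53"),
]

# Gain in whole seconds for each row: MEGS time minus PROTON time.
gains = [
    int((datetime.strptime(m, FMT) - datetime.strptime(p, FMT)).total_seconds())
    for p, m in pairs
]

avg_minutes = sum(gains) // len(gains) // 60  # average gain, floored to minutes
max_minutes = max(gains) // 60                # largest gain, floored to minutes
```

Flooring to whole minutes reproduces the average of 115 minutes and the maximum of 169 minutes quoted above.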
7 Appendix C - DSLAM Descriptive Statistics Results (CONFIDENTIAL)
In this section we present several characteristics of the network equipment data, focusing on a subset of the data for October 2016 and on the following types of daily events:
1. Ports that lost their link to the endpoint equipment at least once in a given day
2. Ports with an LSI score of 2 or 3 (“unstable” or “very unstable”)
Figure 20 depicts the daily percentage of ports that experienced one of the events described above between 01.10.2016 and 31.10.2016. During this period, an average of 9.94% of the ports experienced at least one link loss per day, and an average of 2.41% of the ports were scored as “unstable” or “very unstable”. These event rates reached a peak of 21.95% for ports with at least one link loss on 15.10.2016, and of 9.28% for ports with an LSI of 2 or 3 on 03.10.2016. Figure 21 focuses on the daily percentage of ports with an LSI score of 2 or 3.
Figure 22 depicts the distribution of link losses over the time of day. Most link losses (5.39%) occur between 10:00 AM and 11:00 AM. After 06:00 PM the probability of a link loss starts declining, and it starts rising again at 02:00 AM.
Finally, Table 12 presents the probability of a port’s LSI score on the succeeding day given its current LSI. This table reveals that a port that received an LSI score of 5 (“very stable”) has a 97.85% chance of receiving the same score on the next day, and a 99.55% chance of receiving a score of 4 or above. Ports with an LSI score of 2 (“very unstable”) have a probability of 60.75% of receiving the same score on the next day and a probability of 73.92% of receiving a score of 3 or below (“unstable” or “very unstable”). Ports with an LSI score of 2 have a probability of only 17.15% of receiving an LSI score of 5 (“very stable”) on the succeeding day.
Figure 20: Daily percentage of port link loss events and line stability index (y-axis: percentage of ports, 0–25%; series: ports with link loss, LSI = 2, LSI = 3)
Figure 21: Daily percentage of ports with an LSI score of 2 or 3 (LSI ≤ 3)
Figure 22: Distribution of link losses over the time of day
Current LSI | LSI 2 on succeeding day | LSI 3 | LSI 4 | LSI 5
2 | 60.75% | 13.17% | 8.93%  | 17.15%
3 | 11.07% | 19.29% | 21.86% | 47.77%
4 | 2.06%  | 7.63%  | 22.49% | 67.82%
5 | 0.08%  | 0.37%  | 1.70%  | 97.85%
Table 12: The probability of a port’s LSI on the next day given its current LSI
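A quick consistency check (our own sketch, not part of the deliverable’s tooling) confirms that each row of Table 12 sums to roughly 100% and reproduces the aggregate probabilities quoted above:

```python
# Day-to-day LSI transition probabilities from Table 12 (percent),
# keyed by current LSI score; inner keys are next-day scores 2..5.
transitions = {
    2: {2: 60.75, 3: 13.17, 4: 8.93, 5: 17.15},
    3: {2: 11.07, 3: 19.29, 4: 21.86, 5: 47.77},
    4: {2: 2.06, 3: 7.63, 4: 22.49, 5: 67.82},
    5: {2: 0.08, 3: 0.37, 4: 1.70, 5: 97.85},
}

# Every row of a transition matrix must sum to (close to) 100%;
# small deviations come from rounding to two decimal places.
for current, row in transitions.items():
    assert abs(sum(row.values()) - 100.0) < 0.05, current

# Probability that a currently "very unstable" port (LSI = 2) stays
# "unstable" or "very unstable" (next-day LSI of 3 or below): 73.92%.
stays_unstable = transitions[2][2] + transitions[2][3]

# Probability that a "very stable" port (LSI = 5) scores 4 or above: 99.55%.
stays_stable = transitions[5][4] + transitions[5][5]
```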