S-CUBE LP: Self-healing in Mixed Service-oriented Systems

www.s-cube-network.eu

S-Cube Learning Package

Self-* infrastructures:

Self-healing in Mixed Service-oriented Systems

TU Wien (TUW)

Harald Psaier, TUW

© Harald Psaier

Learning Package Categorization

S-Cube

Self-* Service Infrastructure and Discovery Support

Self-* Service Infrastructure

Self-healing in SOA

Learning Package Overview

Problem Description

Self-healing research

Example: Self-healing policies for Mixed Service-oriented Systems

Conclusions


Mixed Service-oriented Systems

A dynamic service environment open to humans and services

– distributed coordination and communication

– no predefined top-down compositions, but flexible ones

Interactions are ad hoc and dynamic, and usually stay within the boundaries of an activity

Mixed Systems (MS) involve collaboration between two main and distinct types of services:

Human-Provided Services (HPS)

– Humans provide knowledge, skills, and expertise as services

– Close the gap between required human expertise and the difficulty of implementing it as software

Software-Based Services (SBS)


http://www.s-cube-network.eu/km/terms/d/dynamic-binding



http://www.infosys.tuwien.ac.at/prototypes/HPS/HPS_index.html



http://www.s-cube-network.eu/km/terms/s/service



Examples of mixed systems

Review services: include shared reviewing activities around documents, code, and evaluations

Innovation services: foster various ideas for a new product design

Support services: provide solutions for questions and problems on multiple or selected subjects

Current platforms with massive use of MSs: crowdsourcing platforms. These include, e.g., Amazon’s Mechanical Turk, Yahoo! Answers, and uTest.


Let’s Consider a Scenario (1)

Humans and services interact to perform work described by

the activities in the process model.


[Figure: scenario diagram. Labels: Service Registry, Process Model, invoke, human, service, activity scopes]

http://www.s-cube-network.eu/km/terms/p/process-model




One of the services fails to complete an assigned activity.

In a loop, self-healing monitors, recognizes, and adapts to the incident.

[Figure: scenario diagram with self-healing. Labels: Process Model, Deployment with Dependency Management, Run-Time Environment, Monitoring, invoke, X, Adaptation, Self-healing Policies]




http://www.s-cube-network.eu/km/terms/m/monitoring

http://www.s-cube-network.eu/km/terms/a/adaptation


The reaction is controlled by policies connected to the process activities

The main challenge for the autonomous system is the complexity of MSs (cf. the dynamicity of MSs).

The goal of self-* properties is to support administrators in system management.

In particular, the tasks of self-healing in MS include:

– Avoid errors in design

– Avoid errors in configuration

– Replace failing services at runtime

– Handle adaptation complexity transparently to keep the system healthy

– Support the need for service maintenance




What is self-healing?

A self-healing system should recover from the abnormal (or “unhealthy”) state and return to the normative (“healthy”) state, and function as it did prior to the disruption.

A system with self-healing properties can be identified as a system that combines fault-tolerant, self-stabilizing, and survivable system capabilities and, where needed, is supported by humans.


The 3 common states are Normal, Broken, and Degraded. The challenge is to identify Degraded in time and to recover soundly.

http://www.s-cube-network.eu/km/terms/s/self-healing-system



Self-healing origins

A fault-tolerant system continues working to a reasonable degree in the presence of faults.

A self-stabilizing system continuously stabilizes itself after any perturbation.

Survivable systems withstand the unexpected.


Self-healing research: autonomic computing (1/2)

IBM's autonomic computing research envisions a layered structure that manages itself according to high-level objectives given by administrators.

Motivated by the cost of, and the overwhelming effort involved in, system maintenance

The research tries to cover all adaptable layers down to the network and operating system

Defines 4 properties for a self-managing system (self-CHOP):

– self-configuring: the ability to readjust itself “on the fly”

– self-healing: discover, diagnose, and react to disruptions

– self-optimizing: maximize resource utilization to meet end-user needs

– self-protecting: anticipate, detect, identify, and protect itself from attacks


http://www.s-cube-network.eu/km/terms/s/self-configuration






http://www.s-cube-network.eu/km/terms/s/self-optimization



http://www.s-cube-network.eu/km/terms/s/self-protection



Self-healing research: self-adaptive systems (2/2)

Self-adaptive systems evaluate their own behavior and adapt when irregularities occur or when better functionality or performance is possible.

The research primarily covers the application and middleware layers and focuses on the system as a whole.

It also includes self-healing, understood as a combination of self-diagnosing and self-repairing, with the capabilities to diagnose and recover from malfunctions.


Self-healing characteristics


What:

Continuous availability by compensating for the dynamics of a running system.

Why:

Momentary maintenance of health and ...

Enduring continuity through resilience against unintended behavior

How:

Detect disruptions

Diagnose the root cause

Derive a recovery strategy

Self-healing requirements

A closed-loop design that integrates sufficient sensor and effector interfaces.

A status knowledge database and logic for accurate state recognition.

State recognition must include failure classification for adequate handling of the problem.

A collection of recovery policies in the format <trigger, rule, action>. Usually this collection is preconfigured, but it must also be configurable to obtain…

Fitness and evolutionary aspects. Self-* properties are generally applied to maintain long-term use of the system.
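The <trigger, rule, action> format can be illustrated with a small data structure. This is only a sketch under stated assumptions: the slides do not define the fields further, so the `RecoveryPolicy` class, the `Status` snapshot, the event names, and the thresholds below are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical status snapshot, e.g. {"queue_length": 12}
Status = Dict[str, float]


@dataclass
class RecoveryPolicy:
    """One <trigger, rule, action> entry; the field semantics are assumed."""
    trigger: str                     # event name that activates the policy
    rule: Callable[[Status], bool]   # condition checked against the status
    action: Callable[[Status], str]  # recovery step; here it only returns a label


def applicable_actions(policies: List[RecoveryPolicy],
                       event: str, status: Status) -> List[str]:
    """Run every policy whose trigger matches the event and whose rule holds."""
    return [p.action(status) for p in policies
            if p.trigger == event and p.rule(status)]


# A preconfigured collection that can still be extended at runtime.
policies = [
    RecoveryPolicy(
        trigger="queue_report",
        rule=lambda s: s["queue_length"] > 10,
        action=lambda s: "redirect new tasks",
    ),
    RecoveryPolicy(
        trigger="timeout",
        rule=lambda s: s["queue_length"] > 0,
        action=lambda s: "replace service",
    ),
]

overloaded = applicable_actions(policies, "queue_report", {"queue_length": 12})
healthy = applicable_actions(policies, "queue_report", {"queue_length": 3})
```

Keeping the policies in a plain list is one way to make the collection both preconfigured and configurable: new entries can be appended without touching the evaluation logic.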


http://www.s-cube-network.eu/km/terms/r/recovery

http://www.s-cube-network.eu/km/terms/e/evolution

Self-healing loop


A self-healing loop comprises 3 common states: detecting, diagnosing, recovering

detecting: filters any suspicious status information

diagnosing: performs root cause analysis and calculates an appropriate recovery

recovering: carefully applies the planned adaptations

These are connected to the sensors and effectors of the system

In the background, a knowledge base supports the states
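One pass through such a loop might look as follows. This is a minimal sketch, not the design from the slides: the `KnowledgeBase` class, the reading fields (`response_time`, `queue_length`), the threshold, and the plan names are assumptions chosen for illustration.

```python
from typing import Callable, Dict, List, Optional

Reading = Dict[str, float]  # hypothetical sensor reading


class KnowledgeBase:
    """Assumed background store: a threshold plus a log of applied plans."""

    def __init__(self, max_response_time: float) -> None:
        self.max_response_time = max_response_time
        self.history: List[str] = []


def detect(reading: Reading, kb: KnowledgeBase) -> bool:
    """Detecting: flag suspicious status information."""
    return reading["response_time"] > kb.max_response_time


def diagnose(reading: Reading, kb: KnowledgeBase) -> str:
    """Diagnosing: toy root cause analysis that selects a recovery plan."""
    return "balance_load" if reading["queue_length"] > 5 else "replace_service"


def recover(plan: str, effector: Callable[[str], None], kb: KnowledgeBase) -> None:
    """Recovering: apply the plan through the effector and remember it."""
    effector(plan)
    kb.history.append(plan)


def healing_step(sensor: Callable[[], Reading],
                 effector: Callable[[str], None],
                 kb: KnowledgeBase) -> Optional[str]:
    """One iteration of the closed loop: sensor -> detect -> diagnose -> recover."""
    reading = sensor()
    if not detect(reading, kb):
        return None
    plan = diagnose(reading, kb)
    recover(plan, effector, kb)
    return plan


applied: List[str] = []
kb = KnowledgeBase(max_response_time=2.0)
result = healing_step(lambda: {"response_time": 3.5, "queue_length": 8},
                      applied.append, kb)
```

The sensor and effector are passed in as plain callables, which keeps the loop independent of how the monitored system is actually reached.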

Self-healing states

The most general states in self-healing research are:

Normal: The system is in a “healthy” state. In particular, it signals intended functioning and all requirements are met as expected.

Broken: This is an “unhealthy” system. It can generally be identified by an unacceptable response which most probably is the cause of a failure or error.

Degraded: The system is in a fuzzy transition zone between the former two. Behavior is expected to be unpredictable, and parts of the system will drift from an acceptable state to some failure state. In large-scale systems this is in many cases recognizable by considerable performance loss. If the system is redundant, its size in most cases provides additional recovery time.
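The three states can be sketched as a simple classification. The inputs and the threshold are assumptions: the slides only say that Broken shows an unacceptable response and that Degraded is often visible as performance loss, so `throughput_ratio` and `degraded_below` are hypothetical names and values.

```python
from enum import Enum


class HealthState(Enum):
    NORMAL = "normal"
    DEGRADED = "degraded"
    BROKEN = "broken"


def classify(response_ok: bool, throughput_ratio: float,
             degraded_below: float = 0.8) -> HealthState:
    """Map two assumed observations to one of the three states.

    response_ok      -- whether the system still answers acceptably
    throughput_ratio -- current throughput divided by expected throughput
    degraded_below   -- assumed cut-off for "considerable performance loss"
    """
    if not response_ok:
        return HealthState.BROKEN
    if throughput_ratio < degraded_below:
        return HealthState.DEGRADED
    return HealthState.NORMAL
```

A hard threshold is a simplification; the Degraded zone is described as fuzzy, so a real classifier would likely combine several metrics over time.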


Failure classification: Failure types (1/2)

The main goal of this classification is to assist root cause analysis and find an adequate resolution for the failure.

Common failure types are:

– Crash failure: undetectable malign service interruption

– Fail-stop: a detected failure caused a service interruption

– Transient: instantaneous transparent interruption with measurable side effects

– Omission: message loss, transmission errors in the communication infrastructure

– Performance: violation of agreements on execution time

– Arbitrary: any type of failure with no specific pattern
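As an illustration, the failure types above can be encoded as an enumeration with a toy classifier. The `Observation` record and the order of the checks are assumptions made for this sketch; the slides do not prescribe how a failure is recognized.

```python
from dataclasses import dataclass
from enum import Enum


class FailureType(Enum):
    CRASH = "crash"
    FAIL_STOP = "fail-stop"
    TRANSIENT = "transient"
    OMISSION = "omission"
    PERFORMANCE = "performance"
    ARBITRARY = "arbitrary"


@dataclass
class Observation:
    """Hypothetical monitoring record; every field is an assumption."""
    service_interrupted: bool = False
    failure_detected: bool = False      # the failure itself was detected
    message_lost: bool = False
    deadline_violated: bool = False
    side_effects_measured: bool = False


def classify_failure(obs: Observation) -> FailureType:
    """Pick the first matching failure type; fall back to ARBITRARY."""
    if obs.service_interrupted:
        # Fail-stop is a detected failure; a crash goes undetected.
        return FailureType.FAIL_STOP if obs.failure_detected else FailureType.CRASH
    if obs.message_lost:
        return FailureType.OMISSION
    if obs.deadline_violated:
        return FailureType.PERFORMANCE
    if obs.side_effects_measured:
        return FailureType.TRANSIENT
    return FailureType.ARBITRARY
```

The resulting type can then serve as input to root cause analysis, for example to select which recovery policies are even considered.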


Failure classification: Policies (2/2)

Policies provide configuration and settings for detection and recovery.

There are three different types of policies:

– Action policies: reactive policies with a specialized trigger; an immediate response is expected.

– Goal policies: define a set of desired states. They also calculate the set of actions for the transition from the current (failure-affected) state to a desired state.

– Utility function policies: the set of states is connected to a utility function. Problem solving involves extensive analysis, including history information, adaptation knowledge, and comprehensive system awareness.

Common recovery actions include:

– Replacement, balancing, isolation, persistence, redirection, etc.
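A utility function policy can be sketched as choosing the recovery action whose predicted outcome scores highest. Everything concrete here is assumed for illustration: the state attributes (`availability`, `cost`), the weighting in `utility`, and the predicted outcomes are made-up values, not data from the slides.

```python
from typing import Callable, Dict

State = Dict[str, float]  # hypothetical description of a system state


def utility(state: State) -> float:
    """Assumed utility: reward availability, penalize adaptation cost."""
    return state["availability"] - 0.1 * state["cost"]


def choose_action(predicted: Dict[str, State],
                  utility_fn: Callable[[State], float]) -> str:
    """Return the action whose predicted resulting state has the highest utility."""
    return max(predicted, key=lambda action: utility_fn(predicted[action]))


# Predicted outcome per recovery action (illustrative numbers only).
predicted = {
    "replacement": {"availability": 0.99, "cost": 3.0},
    "balancing": {"availability": 0.95, "cost": 1.0},
    "isolation": {"availability": 0.80, "cost": 0.5},
}

best = choose_action(predicted, utility)
```

With these numbers, balancing wins because replacement's higher availability does not outweigh its cost. An action policy, by contrast, would map a trigger directly to one fixed response without such a comparison.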


Fitness and evolution

Current large-scale systems, especially self-* enhanced ones, must be designed for long-term service.

This means they must be resilient to changes and allow any required future variations.

The issues to keep in mind are:

– Most arising requirements are not known a priori but emerge over time

– Interventions and changes to the current system must respect the system’s essential functionality and avoid malicious failures at any cost

– Adaptation might reach its limits in resources

The current solution is to create self-* systems with exposed configuration management and thus human-assisted adaptations


S-Cube contributions to Self-healing/-* research


Psaier H., Dustdar S. (2010). A survey on self-healing systems: approaches and systems. Computing. Springer Wien.

Di Nitto, E., Ghezzi, C., Metzger, A., Papazoglou, M., Pohl, K. (2008). A journey to highly dynamic, self-adaptive service-based applications. Automated Software Engineering, 15(3), p 313—341. Springer.

Hielscher, J., Kazhamiakin, R., Metzger, A., Pistore, M. (2008). A framework for proactive self-adaptation of service-based applications based on online testing. Towards a Service-Based Internet. P 122—133. Springer.

Pernici, B. (2009). Self-healing Systems and Web Services: The WS-Diamond Approach. Business Process Management Workshops. p 440—442. Springer.

Psaier H., Skopik F., Schall D., Dustdar S. (2010). Behavior Monitoring in Self-healing Service-oriented Systems. 34th Annual IEEE Computer Software and Applications Conference (COMPSAC), July 19-23, 2010, Seoul, South Korea. IEEE.

Papazoglou, M.; Pohl, K.; Parkin, M.; Metzger, A. (2010). S-Cube - Towards Engineering, Managing and Adapting Service-Based Systems. Springer. 1st Edition., 2010, XVIII, 374 p.

http://portal.acm.org/citation.cfm?id=1719370

http://www.springerlink.com/content/3lm6525736145126/



http://www.springerlink.com/content/v803458l2315gr70/





http://www.springerlink.com/content/g2jq403jxj943467/






http://www.springerlink.com/content/w24886132x411v34/





http://www.infosys.tuwien.ac.at/staff/skopik/2010_behaviormon_pssd.pdf






http://www.springer.com/computer/database+management+&+information+retrieval/book/978-3-642-17598-5











Mixed Service-oriented Systems: Challenges

Mixed Service-oriented Systems, a.k.a. Mixed Systems (MS), are open to humans and services.

They inherit all properties of SOA, including distributed, ad hoc interactions along with a communication infrastructure and coordination.

… and aforementioned properties

… and examples

What are the challenges in MS?

– The “openness” of the system allows many, possibly unreliable, services to join

– Humans in particular are unreliable because of, e.g., their different working hours, particular preferences, current mood, and context.


Scenario: Expert Network

The key is to distribute the subtasks of the activity among the experts appropriate for each subtask. This is usually solved by delegation and re-delegation. However, this can fail because of individual misbehavior.

Main challenge: How can we guarantee that the activity is completed, and completed on time?


The scenario includes two parties: the service consumer, with a request as an activity, and the experts and resources in the service network.

The network combines all knowledge required to jointly process the activity

Delegation and processing behavior

A model of the network helps to analyze a possible problem

– HPS and SBS are represented as nodes

– Interactions are allowed over established channels

– The current workload of nodes is indicated by the queues

At runtime the model additionally indicates

– The delegation direction and frequency, by the arrow direction and the thickness of the connection

– The current workload, by the queue fill state

With the model we can present two main patterns of misbehavior
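The network model described above could be represented roughly as follows. This is a sketch under assumptions: the class names, the `capacity` field used to compute the queue fill state, and the use of a delegation counter as the "thickness" of a connection are all hypothetical choices.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, Tuple


@dataclass
class Node:
    """An HPS or SBS with a task queue; capacity is an assumed attribute."""
    name: str
    kind: str          # "HPS" or "SBS"
    capacity: int
    queue: Deque[str] = field(default_factory=deque)

    def fill_state(self) -> float:
        """Fraction of the queue that is currently occupied."""
        return len(self.queue) / self.capacity


class Network:
    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}
        # (source, destination) -> number of delegations seen on that channel
        self.channels: Dict[Tuple[str, str], int] = {}

    def add_node(self, node: Node) -> None:
        self.nodes[node.name] = node

    def open_channel(self, src: str, dst: str) -> None:
        self.channels.setdefault((src, dst), 0)

    def delegate(self, src: str, dst: str, task: str) -> None:
        """Delegation is only allowed over an established channel."""
        if (src, dst) not in self.channels:
            raise ValueError(f"no channel from {src} to {dst}")
        self.channels[(src, dst)] += 1
        self.nodes[dst].queue.append(task)


net = Network()
net.add_node(Node("a", "HPS", capacity=4))
net.add_node(Node("b", "SBS", capacity=4))
net.open_channel("a", "b")
net.delegate("a", "b", "review document")
net.delegate("a", "b", "review code")
```

After the two delegations, the channel counter corresponds to the arrow thickness and `fill_state()` of node b to its queue fill state.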


1st misbehavior pattern: Delegation Factory

The delegation factory misbehavior pattern:

– Node a frequently accepts and delegates particular tasks

– However, a processes few tasks itself and keeps a short task queue

The factory behavior impact:

– produces unusual amounts of task delegations

– tasks miss their deadline

– leads to performance degradation of the entire network


2nd misbehavior pattern: Delegation Sink

The delegation sink misbehavior pattern:

– Node d accepts too many offered tasks

– However, d processes tasks slowly (e.g., it overestimates its capability relative to the load it receives)

Sink behavior impact:

– produces unusual amounts of task delegations

– tasks miss their deadline

– leads to performance degradation of the entire network
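Both misbehavior patterns can be approximated by simple checks on per-node metrics. This is an illustrative sketch only: the `NodeMetrics` fields and all thresholds are assumptions, and the metrics actually used to identify the patterns may differ.

```python
from dataclasses import dataclass


@dataclass
class NodeMetrics:
    """Hypothetical counters collected for one node over an observation window."""
    accepted: int       # tasks accepted
    delegated: int      # tasks passed on to other nodes
    processed: int      # tasks completed by the node itself
    queue_length: int   # tasks currently waiting
    capacity: int       # assumed maximum queue size


def is_factory(m: NodeMetrics, min_delegation_ratio: float = 0.7,
               max_fill: float = 0.2) -> bool:
    """Factory: delegates most accepted tasks while keeping a short queue."""
    if m.accepted == 0:
        return False
    return (m.delegated / m.accepted >= min_delegation_ratio
            and m.queue_length / m.capacity <= max_fill)


def is_sink(m: NodeMetrics, min_fill: float = 0.9,
            max_processing_ratio: float = 0.3) -> bool:
    """Sink: queue is nearly full while only few accepted tasks get processed."""
    if m.accepted == 0:
        return False
    return (m.queue_length / m.capacity >= min_fill
            and m.processed / m.accepted <= max_processing_ratio)


node_a = NodeMetrics(accepted=20, delegated=18, processed=2, queue_length=0, capacity=10)
node_d = NodeMetrics(accepted=30, delegated=0, processed=5, queue_length=10, capacity=10)
```

Here `node_a` matches the factory pattern and `node_d` the sink pattern; a node with balanced counters matches neither.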


Observing and avoiding misbehavior

A successful self-healing architecture that can handle the misbehavior situations must

– avoid unpredictable system behavior leading to faults

– identify and handle degraded states. Degraded states here relate to poor progress in activity processing because of increasing factory/sink behavior

Feasible adaptation actions must not include direct punishment of the misbehaving experts. Instead, a transparent, temporary decoupling from the system is considered.

Also, the architecture must be aware of the side effects of the healing actions.

– a feedback loop reports on the success of the adaptation


The VieCure Framework


With the MS on top, a monitoring and adaptation layer in between connects it to the framework.

Events are derived from the interaction logs and diagnosed.

The Behavior Registry provides the metrics to identify the misbehavior patterns

During recovery the interaction channels are adjusted

Self-healing steps on misbehavior

System is in perfect health

An overload in node b is detected

Assuming a causes the most overload traffic, the recovery action regulates channel (i) between a and b

However, b remains overloaded. An additional unknown cause is assumed

An alternative for b is found and channels to d are opened

Channels (ii) and (iii) are now available
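The two recovery actions in these steps, regulating a channel and opening channels to an alternative node, can be sketched as operations on a channel table. This is not the VieCure implementation: the function names, the traffic counts, the rate factor, and the assumption that channels to d are opened from every node that delegates to b are all hypothetical.

```python
from typing import Dict, List, Tuple

Channel = Tuple[str, str]  # (source, destination)


def regulate_heaviest_sender(traffic: Dict[Channel, int],
                             rate: Dict[Channel, float],
                             target: str, factor: float = 0.5) -> str:
    """Throttle the channel of the node sending the most tasks to target."""
    senders = {src: n for (src, dst), n in traffic.items() if dst == target}
    heaviest = max(senders, key=senders.get)
    rate[(heaviest, target)] *= factor
    return heaviest


def open_alternative(rate: Dict[Channel, float], target: str,
                     alternative: str, initial: float = 1.0) -> List[Channel]:
    """Open a channel to the alternative node from every sender of target."""
    opened = []
    for (src, dst) in list(rate):
        if dst == target and src != alternative and (src, alternative) not in rate:
            rate[(src, alternative)] = initial
            opened.append((src, alternative))
    return opened


# Observed delegations into the overloaded node b (illustrative numbers).
traffic = {("a", "b"): 40, ("c", "b"): 10}
rate = {("a", "b"): 1.0, ("c", "b"): 1.0}  # allowed delegation rate per channel

throttled = regulate_heaviest_sender(traffic, rate, target="b")
new_channels = open_alternative(rate, target="b", alternative="d")
```

Neither action removes a node or a channel; both only adjust the interaction channels, in line with the requirement to decouple misbehaving participants temporarily rather than punish them.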




Summary

Self-healing research principles

– A self-healing system should recover from the abnormal (or “unhealthy”) state and return to the normative (“healthy”) state, and function as it did prior to the disruption.

– The 3 common states are Normal, Broken, and Degraded. The challenge is to identify Degraded in time and to recover soundly.

– In order to recover, a self-healing loop is required that detects, diagnoses, and recovers the system.

Self-healing in MS

– The “openness” of the system and the generally unpredictable human behavior are sources of system degradation.

– The two presented misbehavior models are delegation factory and sink. Either a node delegates without respecting the capacity of its neighbors or a node overestimates its own capacity.

– The VieCure Framework considers and resolves both cases.


Further S-Cube Reading


Psaier H., Juszczyk L., Skopik F., Schall D., Dustdar S. (2010). Runtime Behavior Monitoring and Self-Adaptation in Service-Oriented Systems. 4th IEEE International Conference on Self-Adaptive and Self-Organizing Systems (SASO), September 27 - October 01, 2010, Budapest, Hungary. IEEE.

Psaier H., Skopik F., Schall D., Juszczyk L., Treiber M., Dustdar S. (2010). A Programming Model for Self-Adaptive Open Enterprise Systems. 5th Workshop of the 11th International Middleware Conference (MW4SOC), November 29 - December 3, 2010, Bangalore, India. ACM.

Psaier H., Skopik F., Schall D., Dustdar S. (2011). Resource and Agreement Management in Dynamic Crowdcomputing Environments. 15th IEEE International EDOC Conference (EDOC), 29th August - 2nd September, 2011, Helsinki, Finland. IEEE.

Dustdar, S.; Schall, D.; Skopik, F.; Juszczyk, L.; Psaier, H. (Eds.) (2011). Socially Enhanced Services Computing -- Modern Models and Algorithms for Distributed Systems. (1) p. 37. Springer


http://www.infosys.tuwien.ac.at/staff/hpsaier/papers/SaSo2010.pdf






http://www.infosys.tuwien.ac.at/staff/hpsaier/papers/paper10-psaier.pdf




http://www.infosys.tuwien.ac.at/staff/hpsaier/papers/EDOC2011_CR.pdf



http://www.springer.com/computer/information+systems+and+applications/book/978-3-7091-0812-3





Acknowledgements

The research leading to these results has

received funding from the European

Community’s Seventh Framework

Programme [FP7/2007-2013] under grant

agreement 215483 (S-Cube).

