Reliability Week 11 - Lecture 2

Reliability Week 11 - Lecture 2. What do we mean by reliability? Correctness – system/application does what it has to do correctly. Availability – Be.

Dec 20, 2015

Reliability

Week 11 - Lecture 2

What do we mean by reliability?

• Correctness – system/application does what it has to do correctly.

• Availability – the system is available within the agreed time frame

• Consistency – the system provides much the same response time on each occasion

Service Level Agreement

• Reliability and performance requirements are usually built into an SLA or Service Level Agreement

• An SLA defines the level of service the organisation and the users can expect from the DIS

• It is negotiated between the organisation and the service provider, whether that is the internal IT department or an outside body
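
As a rough illustration of how the three reliability aspects might be written into an SLA and checked, here is a minimal Python sketch. The target values, field names and the use of standard deviation as a proxy for consistency are all assumptions made for this example; they do not come from any particular SLA.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class SlaTargets:
    """Hypothetical SLA targets; the field names and values are illustrative."""
    min_availability: float      # fraction of agreed service time, e.g. 0.995
    max_mean_response_s: float   # ceiling on mean response time, in seconds
    max_response_stdev_s: float  # ceiling on spread, a proxy for consistency


def availability(agreed_minutes, downtime_minutes):
    """Fraction of the agreed service window during which the system was up."""
    return (agreed_minutes - downtime_minutes) / agreed_minutes


def check_sla(targets, agreed_minutes, downtime_minutes, response_times_s):
    """Return one pass/fail flag per SLA target."""
    return {
        "availability": availability(agreed_minutes, downtime_minutes)
        >= targets.min_availability,
        "mean_response": mean(response_times_s) <= targets.max_mean_response_s,
        "consistency": pstdev(response_times_s) <= targets.max_response_stdev_s,
    }


targets = SlaTargets(min_availability=0.995,
                     max_mean_response_s=2.0,
                     max_response_stdev_s=0.5)

# 30 days of 12-hour service windows = 21,600 agreed minutes; 90 minutes of outage.
report = check_sla(targets, 21_600, 90, [1.2, 1.4, 1.3, 3.9, 1.1])
print(report)
```

In this made-up sample the availability and mean response time targets are met, but the single 3.9 second response breaks the consistency target, which is why consistency is tracked separately from speed.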

All components affect reliability

• Any component can affect the reliability of the whole system, but different components affect different aspects: correctness, availability and consistency

• We will look at:

– Application software

– System software – O/S, DBMS & Middleware

– Server hardware

– Network

– Storage

– Change management and Problem management

Application Software

• Application software can affect availability for a few, some or all customers in the event of a failure.

• Main area for bugs – particularly if developed in-house or modified.

• Can affect correctness and consistency if changes to application software are not rigorously tested.

System software (DBMS, O/S, etc)

• System software failures generally affect availability for all customers on a server.

• Operating at high utilisation (90-95% capacity) can affect reliability, because parts of the system that are not often used (e.g. queuing logic) become active.

Server hardware

• Hardware failure will affect availability for all users on the server.

• One server supporting an application/database is a single point of failure (to be avoided).

• Server problems can affect consistency (e.g. failure of one processor in a multi-processor server will affect performance).

Networks - LAN

• LAN failures will affect availability for a few or many users.

• Changes to routers, switches or cabling can affect availability.

• LAN component failures/changes generally affect availability and consistency.

Networks - WAN

• It is a purchased service, controlled by an external company.

• WAN failure will generally affect all users (e.g. ISP failure will affect all access to the Internet)

• It requires:

– Careful selection of supplier

– Sufficient capacity for peak loads

– Carefully negotiated SLA

– Capable network management

Planning for Reliability

• Managing problems and changes.

• Planning for application and system software reliability

• Planning for hardware reliability

• Planning for disaster recovery

Managing Problems/Changes

• The cause of all problems MUST be determined and then resolved (or they will simply return again and again to affect availability)

• All application and system software changes MUST:

– be reviewed by a committee before implementation

– have been thoroughly tested

– have a back-out plan

– be APPROVED by all affected parties

– be implemented outside normal availability periods
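
The change rules above can be treated as a gate that a change must pass before it is scheduled. The following Python sketch is one possible way to express that; the `ChangeRequest` record, its field names and the 08:00 to 18:00 service window are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeRequest:
    """Hypothetical change record; every field name here is illustrative."""
    reviewed_by_committee: bool
    tested: bool
    back_out_plan: str                 # empty string means no plan was written
    affected_parties: frozenset
    approvals: frozenset = field(default_factory=frozenset)
    scheduled_hour: int = 2            # 24-hour clock


def blockers(change, service_hours=range(8, 18)):
    """List the change rules that this change does not yet satisfy."""
    problems = []
    if not change.reviewed_by_committee:
        problems.append("not reviewed by committee")
    if not change.tested:
        problems.append("not thoroughly tested")
    if not change.back_out_plan.strip():
        problems.append("no back-out plan")
    missing = change.affected_parties - change.approvals
    if missing:
        problems.append("missing approval: " + ", ".join(sorted(missing)))
    if change.scheduled_hour in service_hours:
        problems.append("scheduled inside normal availability period")
    return problems


change = ChangeRequest(
    reviewed_by_committee=True,
    tested=True,
    back_out_plan="restore previous release from backup",
    affected_parties=frozenset({"finance", "operations"}),
    approvals=frozenset({"operations"}),
    scheduled_hour=10,
)
print(blockers(change))
```

A change is ready only when `blockers` returns an empty list; the example change is held back because one affected party has not approved it and it is scheduled during service hours.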

Planning System Reliability

• Server selection and operating system must fit the scale of the operation.

• A regular system software update plan should be followed to fix bugs and implement new features.

• Each update should be fully investigated, because it may:

– introduce new bugs

– cause problems for applications

– introduce performance problems

Planning Application Reliability

• Starts in design – how the objects and components are packaged and the interfaces designed

• Software package selection must place high weight on reliability factors (availability etc.)

• Implementations need formal processes:

– Test plans

– Testing techniques

– Test scripts

Planning for Hardware Reliability

• Build in redundancy, avoid single points of failure (even within hardware items).

• Use servers with multiple processors and hot-swap capability. Use server clusters if appropriate.

• Build redundancy and alternate routes into the network. The LAN is under the organisation's own control.

• Disks have many mechanical parts and will fail often. Use RAID or other redundancy whenever possible
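
To see why redundancy matters, it can help to put numbers on it. The Python sketch below uses the standard probability rules for components in series and in parallel; the availability figures are invented for illustration, and the parallel formula assumes that failures are independent, which real clusters only approximate.

```python
from math import prod


def series_availability(parts):
    """Every part must work for the service to work, so availabilities multiply."""
    return prod(parts)


def redundant_availability(part, copies):
    """The group fails only if every copy fails (assumes independent failures)."""
    return 1 - (1 - part) ** copies


# Illustrative figures, not measurements.
single_server = 0.99
chain = series_availability([0.99, 0.999, 0.995])   # server, LAN, storage
clustered = redundant_availability(single_server, 2)

print(f"chain of single components: {chain:.4f}")
print(f"two-server cluster:         {clustered:.4f}")
```

A chain of single components is always less available than its weakest member, while duplicating the weakest member raises its availability sharply.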

RAID

• Redundant Arrays of Independent Disks

• Groups of drives are linked to a special controller

• They appear as a single logical drive

• Take advantage of multiple physical drives to store data redundantly

• Six different RAID approaches numbered 0 to 5

RAID 0

• Data striping, block oriented

• No redundancy – no protection from disk loss

• Reads and writes for contiguous blocks overlap, giving improved performance

• No space overhead

RAID 1

• Disk mirroring – all data written to two disks

• Full data protection

• Improved read access

• Doubles disk space required

• Easy to implement, easy to recover

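RAID 5

• Data striping, block oriented, distributed parity

• Full protection against the loss of a single disk, but slower to recover than RAID 1

• Slow write, good read performance

• 25% overhead in disk space for a four-disk array (one disk's worth of capacity holds parity)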
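
Parity-based protection works because XOR is its own inverse: the parity block is the XOR of the data blocks, so any one missing block equals the XOR of the survivors. The Python sketch below shows this for a single stripe. It is a simplification under stated assumptions: a real RAID 5 controller rotates the parity block across the disks from stripe to stripe, and the block contents and helper names here are made up.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_overhead(disks):
    """Fraction of raw capacity used for parity in an n-disk single-parity array."""
    return 1 / disks


# One stripe across four disks: three data blocks plus one parity block.
data = [b"ACCT", b"0042", b"PAID"]
parity = xor_blocks(data)

# Simulate losing the second disk and rebuilding its block from the survivors.
rebuilt = xor_blocks([data[0], data[2], parity])

print(rebuilt == data[1])    # the lost block is recovered exactly
print(parity_overhead(4))    # one disk in four holds parity
```

The rebuild has to read every surviving disk, which is why recovery is slower than with mirroring, and every write must also update parity, which is why writes are slow.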

Planning for Business Continuance (or Disaster Recovery)

• Planning to continue business in the event of a disaster is a design job (examples: 1993 and 9/11).

• Consider all scenarios, plan the recovery approach, then test and document it.

• Common causes are fires (Sydney), floods (Brisbane) or backhoes.

• Test recovery regularly (every 3-6 months)

Performance

Week 11 - Lecture 2

Why is Performance Important?

• DIS systems have potential for performance issues

• New systems almost always require performance tuning

• DIS performance affects user productivity

• Performance is a measure of value for money

A simple test

• In most systems, what is likely to be the highest priority for users?

– Improved functionality

– Improved reliability

– Improved performance

Performance Measures

• Response time - time taken to complete a task or transaction

• Throughput - the amount of work (transactions) that can be completed in a set time period (sec or hour)

• The relationship between the two is generally inverse (although not always)

Concurrency is the answer

[Figure: two cases plotted against time: slow response time with high throughput, and fast response time with low throughput]
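
One way to make the link between concurrency, response time and throughput concrete is Little's Law, which is brought in here as an outside result rather than something the slides state. For a stable system it says that the average number of requests in progress equals throughput multiplied by average response time. The Python sketch below rearranges it; the request counts and timings are invented.

```python
def throughput(concurrent_requests, response_time_s):
    """Little's Law rearranged: requests completed per second in steady state."""
    return concurrent_requests / response_time_s


# One request at a time, each answered quickly.
serial = throughput(concurrent_requests=1, response_time_s=0.5)

# Twenty requests in progress at once, each answered more slowly.
concurrent = throughput(concurrent_requests=20, response_time_s=2.0)

print(serial, concurrent)
```

The serial case has the faster response time but completes 2 requests per second, while the concurrent case has a response time four times slower and completes 10 per second. Concurrency is what allows throughput to rise even when individual responses get slower.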

A user requires consistency, then speed.

• A user wants a transaction to run consistently. The faster, the better.

• A user sees response time at the PC or terminal.

• A user is not concerned with the entire infrastructure that supports a transaction.

• IT staff see response time only in their own domain of responsibility (server, database, network etc.)

Difficult to measure total response time

• How do you add together web server + application server + database server + network?

• Do you get statistics from each group? Will each group maintain statistics in the same format?

• You need to measure total response time and response in each area (server, database etc).

• New network monitors may be able to provide statistics closer to what you need
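
The format problem can be sketched in a few lines of Python: each group exports timings in its own unit, and they have to be normalised before they can be added up. The record layout, unit table and tier names are hypothetical. The sketch also assumes the tiers handle a transaction one after another with no overlap; if the application server's figure already includes time spent waiting on the database, a simple sum would count that time twice.

```python
from collections import defaultdict

# Hypothetical conversion table: each group reports in its own unit.
UNIT_TO_SECONDS = {"s": 1.0, "ms": 1e-3, "us": 1e-6}


def normalise(records):
    """Convert (transaction_id, tier, value, unit) rows to seconds."""
    return [(txn, tier, value * UNIT_TO_SECONDS[unit])
            for txn, tier, value, unit in records]


def breakdown(records):
    """Per-transaction dict of tier -> seconds, plus a 'total' entry."""
    result = defaultdict(dict)
    for txn, tier, seconds in normalise(records):
        result[txn][tier] = result[txn].get(tier, 0.0) + seconds
    for tiers in result.values():
        tiers["total"] = sum(tiers.values())
    return dict(result)


records = [
    ("t1", "web server", 40, "ms"),
    ("t1", "application server", 0.12, "s"),
    ("t1", "database server", 250_000, "us"),
    ("t1", "network", 90, "ms"),
]
report = breakdown(records)
for tier, seconds in report["t1"].items():
    print(f"{tier:20s} {seconds:.3f} s")
```

Keeping both the per-tier figures and the total lets the user's view (total response time) be compared with each group's view of its own domain.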

Improving performance

• You can add more resources (faster servers, faster disks, networks etc) to improve response time and throughput.

• However, performance improvements may not be proportional to the additional resources.

• A 100% increase in resources may only bring, say, a 70% performance improvement. This is the question of scalability.
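
The 100% versus 70% figures can be turned into a scaling efficiency, and one common model for projecting further is Amdahl's Law. That model is an assumption added here, not something the slides name: it supposes a fixed fraction of the work benefits from extra resources while the rest does not. A minimal Python sketch:

```python
def scaling_efficiency(resource_factor, performance_factor):
    """Performance gained per unit of resource added (1.0 means linear scaling)."""
    return (performance_factor - 1) / (resource_factor - 1)


def amdahl_speedup(scalable_fraction, resource_factor):
    """Amdahl's Law: speedup when only part of the work benefits from resources."""
    return 1 / ((1 - scalable_fraction) + scalable_fraction / resource_factor)


def scalable_fraction_from(resource_factor, observed_speedup):
    """Invert Amdahl's Law to estimate the scalable fraction from one observation."""
    return (1 - 1 / observed_speedup) / (1 - 1 / resource_factor)


# Doubling resources (2.0x) gave a 70% improvement (1.7x).
efficiency = scaling_efficiency(2.0, 1.7)
fraction = scalable_fraction_from(2.0, 1.7)
projected = amdahl_speedup(fraction, 4.0)

print(f"efficiency {efficiency:.2f}, scalable fraction {fraction:.2f}, "
      f"projected speedup at 4x resources {projected:.2f}")
```

Under this model, quadrupling resources would give only about 2.6 times the performance, so returns keep diminishing as more hardware is added.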

Monitoring Performance

• Performance is a process, not a task.

• Performance should be constantly monitored. The cost of monitoring must be weighed against the cost of the “do nothing” option

• Performance tuning should be carried out to correct performance problems.
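
Continuous monitoring can be as simple as keeping a rolling window of response times and flagging when it drifts away from an agreed baseline. The Python sketch below is one possible shape for that; the class name, window size, baseline and tolerance are assumptions chosen for the example, not recommended values.

```python
from collections import deque
from statistics import mean


class ResponseTimeMonitor:
    """Rolling-window check of response time against a baseline (illustrative)."""

    def __init__(self, window, baseline_s, tolerance=1.5):
        self.samples = deque(maxlen=window)   # oldest samples drop off automatically
        self.baseline_s = baseline_s
        self.tolerance = tolerance

    def record(self, response_time_s):
        """Add a sample; return True when the rolling mean breaches the limit."""
        self.samples.append(response_time_s)
        return self.needs_tuning()

    def needs_tuning(self):
        """Only judge once the window is full, to avoid alerts on too few samples."""
        full = len(self.samples) == self.samples.maxlen
        return full and mean(self.samples) > self.baseline_s * self.tolerance


monitor = ResponseTimeMonitor(window=3, baseline_s=1.0)
alerts = [monitor.record(t) for t in [1.0, 1.1, 0.9, 2.4, 2.6, 2.2]]
print(alerts)
```

A single slow response does not trigger the flag; it is raised only once the rolling mean stays above 1.5 times the baseline, which is the point at which tuning would be considered.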