Implementing a Major Incident Management Process · 2017-04-20

SESSION 404

Thursday, May 11, 10:00am - 11:00am Track: Service Desk Masters

Implementing a Major Incident Management Process

Lisa Callihan, IT Customer Experience Manager, University of Michigan

[email protected]

Session Description

Do you have a process in place for quickly restoring services in the event of a major incident? Beginning in 2010, the University of Michigan started a journey of implementing just such a process. This session will share that journey through vivid examples of the development, implementation, and results they achieved. You’ll leave with an understanding of just how crucial a major incident process is, as well as a toolkit that includes a process guide, templates, communication plans, and costing model to help you create a similar process for your organization.

Speaker Background

For the past ten years, Lisa Callihan has worked for the University of Michigan’s IT department, as a service desk manager, incident and knowledge process owner, and, currently, as customer experience manager. She also spent fifteen years in corporate help desk management. A Wayne State University graduate, Lisa has earned her ITIL v3 Foundations, ITIL v3 Support and Restore Practitioner, and HDI Support Center Manager certifications. She previously served as VP of communications for the Motown local chapter.


Implementing a Major Incident Management Process

Lisa Callihan

University of Michigan

IT Customer Experience Manager

What We Will Cover

• Background & About Us

• Challenges & Initial State

• Release & Adoption

• Process Improvements

• What We Learned

• Current Process

• Q & A

Major Incident Journey

• Pre – IT Infrastructure Library (ITIL)

• 2010 Launch

• Milestones

• Process Improvements

• Current State

• New Challenges

• And Beyond…

University of Michigan

• Founded in 1817

• Three Campuses

– Ann Arbor: 19 Schools & Colleges

– Dearborn & Flint: 4 Schools & Colleges

• By the numbers

– 44,718 Students & 575,000 Alumni

– 7,886 Faculty & 32,288 Staff

– 6,295 Grad Assistants & Research Fellows

• IT Support at Michigan

The Early Years

Help Desk during an IT outage

What would help?

Answers!

What is the issue?

What is the impact?

Who is needed?

Is there a workaround?

When will it be fixed?

What can I tell customers?

Does leadership know?

After it is fixed, what was the resolution?

Can you think of anything else?

Initial Release

Major Incident Identification

Process Guidelines & Template

Limited Roles & Decision Making

Process Purpose

1. Restoration of Service as quickly as possible

2. Allow the right people to fix the issue

3. Effective communication across teams & leadership

4. Organized plan to resolve the issue

5. Timely and accurate outage documentation

Steps

1. Determine & Notify

2. Initial Stand Up Meeting

3. Coordination

4. Ongoing Meetings

5. Major Incident Closure

First Year Results

• Nine Major Incidents

• Successful Meetings

• Manual Process

• Helpful to Clarify

• Adoption Rate

Process Improvements

Staff Training

Process Adoption

Consistency

Ease of Use

Major Incident Levels

Level Definition

• HIGH: Cause and resolution unknown. Multiple, single, or limited services impacted.

• MEDIUM: Cause may or may not be known. The resolution is pending.

• LOW: Cause is known. The resolution is in progress.

Process Requirements

• HIGH: Must follow full Major Incident process. Major Incident Calendar notifications. Review Meeting conducted.

• MEDIUM: Same process and calendar notifications as HIGH. Review Meeting conducted where appropriate.

• LOW: Same process and calendar notifications as HIGH. Review Meeting is not conducted.

Attendance Expectations

• HIGH: “All Hands on Deck” - Any Staff with the potential to assist must call in for the initial stand up meeting. All identified oncall Staff.

• MEDIUM: Service Owner, Manager or Representative(s); Impacted Service and Technical Teams; Communications Team; Service Desk or Data Center Operations; Customer Relationship Manager.

• LOW: Service Owner, Manager or Representative(s); Communications Team; Service Desk or Data Center Operations; Customer Relationship Manager.

Facilitator

• HIGH, MEDIUM, and LOW: Service Management
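The level definitions above can be sketched as a small lookup. This is only an illustration, not part of the University of Michigan process: the names `Level`, `classify_level`, and `REVIEW_MEETING` are hypothetical, and the handling of cause/resolution combinations that the slide does not list is an assumption.

```python
from enum import Enum


class Level(Enum):
    HIGH = "HIGH"
    MEDIUM = "MEDIUM"
    LOW = "LOW"


# Review Meeting expectation per level, as stated under Process Requirements.
REVIEW_MEETING = {
    Level.HIGH: "conducted",
    Level.MEDIUM: "where appropriate",
    Level.LOW: "not conducted",
}


def classify_level(cause: str, resolution: str) -> Level:
    """Map cause/resolution status to a level.

    Listed combinations follow the slide. Unlisted combinations fall back
    to the more severe neighbouring level (an assumption of this sketch).
    """
    cause = cause.strip().lower()
    resolution = resolution.strip().lower()
    if resolution == "unknown":
        return Level.HIGH
    if cause == "known" and resolution == "in progress":
        return Level.LOW
    return Level.MEDIUM


print(classify_level("Unknown", "Unknown"))      # Level.HIGH
print(classify_level("May be Known", "Pending")) # Level.MEDIUM
print(classify_level("Known", "In Progress"))    # Level.LOW
```

Keeping the level as a function of only two inputs mirrors the slide: the facilitator can restate the level on each status call as the cause and resolution become clearer.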

Success

Leadership Quote:

“I've already seen some nice improvements and have heard good feedback from many staff on this process. It's quickly becoming one of our most mature processes. Usage of the process is up and it's used widely across our organization.”

Continuous Improvement

Participation

Facilitation

On Call

Roles

Major Incident Quick Reference

Call 4-HELP

• Is there a degradation/outage critically impacting service?

• Has an incident been created and assigned?

• Have service restoration attempts been unsuccessful?

• Gather the supporting details and call.
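The quick-reference questions above read like a pre-call checklist. A minimal sketch follows, assuming that all three questions must be answered "yes" before calling; the slide does not state that rule explicitly, and `QuickCheck` and `should_call_major_incident` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class QuickCheck:
    critical_impact: bool     # degradation/outage critically impacting service
    incident_assigned: bool   # an incident has been created and assigned
    restoration_failed: bool  # restoration attempts have been unsuccessful


def should_call_major_incident(check: QuickCheck) -> bool:
    # Assumption: every question must be answered "yes" before calling.
    return (
        check.critical_impact
        and check.incident_assigned
        and check.restoration_failed
    )


print(should_call_major_incident(QuickCheck(True, True, True)))   # True
print(should_call_major_incident(QuickCheck(True, False, True)))  # False
```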

RACI

R Responsible – Person working on incident

A Accountable – Person with decision authority

C Consult – Key stakeholder who should be included

I Inform – Needs to know of decision or action

MAJOR INCIDENT RACI

Activities (columns, in order):

1. Determine

2. Schedule Calls; Take Notes

3. Conduct Calls

4. Attend Calls

5. Generate Communications

6. Decide When Resolved

7. Perform Post-Root Cause Analysis

8. Attend Post-Lessons Learned

All Staff R -- -- -- -- -- -- --

Scheduler I R (schedule 1st call; no notes) -- -- -- -- -- --

Facilitator C A/R A/R R I I -- --

Support I R (schedule update calls; all notes) I R I I -- --

Service Owner A I I A/R A/R A/R A A

Communications Representative I I -- R C/R I -- --

On-Call Teams I I -- R R (status pages only) C R R
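The RACI matrix above can also be held as data so that questions such as "who is accountable for deciding when the incident is resolved?" can be answered by lookup. This is a sketch only: the cell values are as read from the flattened slide text, and the names `ACTIVITIES`, `RACI`, and `roles_with` are hypothetical.

```python
# Column order as read from the slide.
ACTIVITIES = [
    "Determine",
    "Schedule Calls; Take Notes",
    "Conduct Calls",
    "Attend Calls",
    "Generate Communications",
    "Decide When Resolved",
    "Perform Post-Root Cause Analysis",
    "Attend Post-Lessons Learned",
]

# One row per role; "--" means the role has no assignment for that activity.
RACI = {
    "All Staff": ["R", "--", "--", "--", "--", "--", "--", "--"],
    "Scheduler": ["I", "R", "--", "--", "--", "--", "--", "--"],
    "Facilitator": ["C", "A/R", "A/R", "R", "I", "I", "--", "--"],
    "Support": ["I", "R", "I", "R", "I", "I", "--", "--"],
    "Service Owner": ["A", "I", "I", "A/R", "A/R", "A/R", "A", "A"],
    "Communications Representative": ["I", "I", "--", "R", "C/R", "I", "--", "--"],
    "On-Call Teams": ["I", "I", "--", "R", "R", "C", "R", "R"],
}


def roles_with(letter: str, activity: str) -> list[str]:
    """Return the roles holding the given RACI letter for an activity."""
    col = ACTIVITIES.index(activity)
    return [role for role, row in RACI.items() if letter in row[col].split("/")]


print(roles_with("A", "Decide When Resolved"))  # ['Service Owner']
print(roles_with("A", "Conduct Calls"))         # ['Facilitator']
```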

What It Looks Like Today

The Future

Avoidance

Automation

Accountability

Action

Resources

Process Documentation

Major Incident Template

Analysis Questions

Facilitator Checklist

Service Owner Checklist

Resolution Summary

Cost Analysis Template

Questions?

Lisa Callihan

IT Customer Experience Manager

[email protected]

Thank you for attending this session.

Please complete the short evaluation for this session on your mobile device. It is available in your email or through the conference app.

Example Major Incident Process*

*Note: The University of Michigan has rebranded Major Incidents to be called Significant Incidents.


MAJOR INCIDENT TEMPLATE

PURPOSE and FOCUS: Restoration of service as quickly as possible

AGENDA

1. Verify oncall group representation

2. Overview of issue*

3. Next steps and/or contingencies*

4. Action items and assignment*

5. Establish next check in date/time

6. Communication plan

REQUIRED PARTICIPANTS/REPRESENTATION

Initial Call:

● Service Owner (if known yet) or assigned delegate

● All support and on-call teams represented

● Communications representative

Ongoing Calls:

● Service Owner or assigned delegate

● All support teams working on issue

● Communications representative

SUMMARY

Name: <yyyymmdd>_<Service or Application Name_Symptom>

Status: <New, Active, Closed>

Cause: <Unknown, May be Known, Known>

Resolution: <Unknown, Pending, In Progress>

Incident Summary:

This Major Incident was called by <name> on <date / time>. The following is being impacted:

● <Service and/or applications>

The symptoms experienced:

● <Symptoms>

Incident Resolution:

<What was done to restore the service and/or application>

Parent Incident: <INC# Number / Hyperlink>

No. Related Incidents:

Problem (PRB#): <Number / Hyperlink>

Resolution Summary: <Service Owner completes a Resolution Summary>

Review Meeting Notes: <Service Owner completes section provided>
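The Name field in the summary above follows the pattern `<yyyymmdd>_<Service or Application Name_Symptom>`. A minimal sketch of building such a name is shown below; the function name `incident_doc_name` and the sample values are hypothetical, and the template does not say how spaces or special characters should be handled, so they are left untouched here.

```python
from datetime import date


def incident_doc_name(called_on: date, service: str, symptom: str) -> str:
    """Build a document name as <yyyymmdd>_<service>_<symptom>."""
    return f"{called_on:%Y%m%d}_{service}_{symptom}"


# Illustrative values only.
print(incident_doc_name(date(2017, 5, 11), "Email", "Login failures"))
# 20170511_Email_Login failures
```

Leading with the date keeps stored Major Incident documents in chronological order when sorted by name.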

COMMUNICATION

Conference Line:

Email Group:

Communication Details:

SERVICE ROLES

Service Owner(s):

Service Manager(s):

Product Manager(s):

Technical Lead(s):


INITIAL CALL TEAM REPRESENTATION

Complete this section during first call; if a team is missing, page them as needed.

On-Call Group Representatives (separate names with commas to keep rows brief)

CURRENT STATUS

After the initial call, copy the Status Update Calls Notes template here before the next call; keep the most current notes on top.

<Date and time> INITIAL CALL NOTES, next call scheduled for <date and time>

Attendees Facilitator: <facilitator name>; Notetaker: <notetaker name>

Note to facilitators: For the initial call, names are entered above; no need to duplicate them here. OK to just indicate “See initial call list above.”

Status <Incident update>

Discussion <meeting collaboration notes>

Action Items <include task, person assigned, and time>

STATUS UPDATE CALLS NOTES TEMPLATE

Copy this table for status update calls. Move it to the top of all status notes so most recent are always shown first.

<Date and time> SI Status Update Notes, next call scheduled for <date and time>

Attendees Facilitator: <facilitator name>; Notetaker: <notetaker name>; All Others: Please add your name in the comments

Note to facilitator: We recommend using a comment box to capture the attendee list for status calls. It keeps the notes from getting lengthy when there are staff who aren’t necessarily required but just want to listen in. You might then indicate here whether specific required teams were or weren’t represented.

Status <Incident update> <Last call action items updates>

Note to facilitator: We recommend you also copy in the action items from the most recent call and record their status here.


Discussion <meeting collaboration notes>

Action Items <include task, person assigned, and time>

MAJOR INCIDENT REVIEW MEETING NOTES

Review Meeting ● <date and time of scheduled meeting>

Attendees ● <list names of Significant Incident Review participants>

Key Findings ● <list key findings>

Root Cause ● <document root cause, if known>

Action Items ● <include task, person assigned and target date>

MAJOR INCIDENT ANALYSIS QUESTIONS

Consider printing this talking point reference and having it in front of you during the call as a discussion guide if needed.

WHAT

What Services are affected?

Which related services are not affected?

What are the symptoms (Error messages, slowness, timeouts)?

What is considered to be “normal” state?

What is the user or customer population that is impacted?

WHERE

Where are the affected areas, geographically?

Are there specific areas within the geographic area that are affected?

WHEN

When did we first know that Services were affected?

How frequently is the interruption/degradation happening (constant, intermittent, pattern)?

Does the interruption occur in conjunction with a business process?

Is there a point in the business cycle that is or could be affected?

EXTENT

How many items (out of how many in total), or how many users, are affected? (ex: 4 of 6 servers for the Service are affected)

Are the symptoms getting worse or getting better?

Have there been any Changes to the Service or a dependent Service leading up to this Incident?

DOCUMENTATION

What type of documentation (e.g. architecture map, recent changes) would be helpful in this Major Incident? Who will get it?

ACTION

What action(s) do we believe that we may take to restore service?

What are the benefits or risks associated with the actions identified to restore service?

Who is performing or responsible for each action?

Example Facilitator Checklist

Activity Information and Links

1. Confirm Major Incident Document link is added to the Calendar event. Update with information already known; review Document agenda at top

Timing: 10+ min before initial call

<insert information and links directly related to the activity>

2. Send Service Owner or delegate a link to Service Owner Checklist as soon as they are identified. They may be identified during the initial call.

Timing: As soon as possible or during initial call

3. Dial into conference line; notetaker also joins

Timing: 3-5 min before initial call

4. Verify required group representation before starting conversation:

● Service Owner and Comm Team member

● Representation from all teams

Timing: During call

5. Request overview of issue from person reporting Major Incident. Also:

● Issue posts to Status page

● Any Security concern?

● Any Customer relations concern?

Timing: During call

6. Identify next steps and/or contingencies. By now there should also be a Service Owner/Delegate identified. Do not end the call without a manager-level or higher person identified to act as Service Owner or delegate.

Timing: During call

7. Confirm action items and assignment.

Timing: During call

8. Establish next check in date/time (callers often start to drop off now)

● Confirm required attendees for future calls

● Ask Comm, Service Owner, Technical SME to stay on to develop the Communications Plan

Timing: During call

9. Confirm Communication Plan before formally ending call

● Status pages (Technical SME will often update these)

● Leadership and Campus Facing (Comm Rep assists Service Owner)

Timing: During call

10. Determine ongoing coordination methods.

Timing: During or after call

11. Schedule next status update call. Between calls:

● Get link to comm plan from Comm rep; add to Major Incident Document

● Clean up Document Notes and get the doc ready for next call

● Be aware of Major Incident related communications

Timing: After initial call

12. Conduct Ongoing Status Update Calls and Update Document notes

Timing: As needed

13. Confirm Major Incident resolution and permission to close Incident with Service Owner

● Confirm final communication task assignments

● Confirm post - problem investigation assignment to technical staff

● Remind Service Owner of follow up items in Service Owner Checklist

● Clean Document Notes; close and store with other Major Incident Documents

● If Service Owner completed a Resolution Summary, also obtain and store it with Major Incident docs

Timing: When Service Owner indicates Major Incident is closed

Example Service Owner Checklist

Activity Information and Links

1. As soon as possible (even before initial call) communicate with your Service Portfolio Owner (SPO). The SPO is responsible for direct communications with CIO.

<insert information and links directly related to the activity>

2. If an SI is a sensitive communications issue, contact Communications immediately for guidance

3. Contact the Facilitator throughout the Major Incident, as needed.

4. Help determine required teams for the Major Incident and ongoing status calls. Release nonessential staff as early as possible.

5. Remain on initial call when it ends to collaborate with Communications Rep on communications plan. Comm Rep is responsible for helping compose the initial status email to leadership and campus-facing communications. Engage with the Customer Relationship team for hands-on outreach to units, if needed.

6. If the Major Incident team wants a HipChat or physical WAR room, it’s helpful to offer your administrative support staff to assist with setup. The designated WAR room may need to be cleared of other meetings.

7. Attend all Major Incident Status Update meetings to partner with the Facilitator. Continue working with Communications Rep for ongoing leadership and campus-facing communications.

8. When you are satisfied service has been restored to a working state and at your discretion, have the Facilitator close the Major Incident.

9. Make sure the technical team is assigned to conduct a root-cause analysis and provide a ServiceLink problem record number to the Facilitator ASAP.

10. After the Major Incident, work with the Communications Rep for closure communications including the Resolution Summary that is due within 3 days from Major Incident close date. Work with the Facilitator to store it with the Major Incident document.

11. Follow up with the technical team to make sure they are working the root-cause analysis and closing the problem record in a timely manner.

12. Conduct a Major Incident Review/Lessons Learned meeting with participants.

13. In the Major Incident Document that was recorded, complete the Review Meeting Notes section located on the last page.


Major Incident Resolution Summary Template

Send to:

Author:

Date:

Service Owner:

Service Manager:

Incident Record Number:

Problem Record Number:

Change Record Number(s):

Description of Service Outage: (Describe the outage. Provide the general idea of impact by quantifying customers and locations impacted, the duration of the outage, and what issues it may have caused as a result of being unavailable.)

Impacted Service(s): (List all known Business Services or Applications Services Impacted)

Resolution: (Summary of efforts completed to resolve the service disruption and prevent reoccurrence)

●

Root Cause: (The identified point or points of failure that led to the service disruption)

●

Supporting Information: (Include any links to supporting documentation, presumed questions & answers, a timeline, technical documentation or reports)

● Incident Details: <insert link>

●

●

Incident Begin Date End Date # of hours of

Conference Calls Number of Elapsed Time Total Number of $/hour Total $

Call 1 0 0 0

Call 2 0 0 0

Call 3 0 0 0

Call 4 0 0 0

Review Meeting 0 0 0

$0

Estimate of 0 0 $0

Significant Incident 0 0 $0

Number of $/Incident Total $

0 0 $0

Total Cost of Incident $0

Average Salary: 0

Hours per year: 2080

Hourly rate: 0.00

Average hourly: 0.00

Elapsed time =
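The arithmetic behind the cost template above can be sketched roughly as follows. Several column headers are cut off in the template, so this sketch assumes that "Number of" means the number of attendees on a call and that a call's cost is attendees × elapsed hours × hourly rate. The 2080 hours per year comes from the template; the function names and the sample salary and call figures are hypothetical.

```python
HOURS_PER_YEAR = 2080  # value shown in the template


def hourly_rate(average_salary: float, hours_per_year: int = HOURS_PER_YEAR) -> float:
    """Average hourly rate derived from an average annual salary."""
    return average_salary / hours_per_year


def call_cost(attendees: int, elapsed_hours: float, rate: float) -> float:
    """Assumed cost of one call: attendees x elapsed hours x hourly rate."""
    return attendees * elapsed_hours * rate


def total_call_cost(calls: list[tuple[int, float]], rate: float) -> float:
    """Sum the cost over (attendees, elapsed_hours) pairs."""
    return sum(call_cost(a, h, rate) for a, h in calls)


# Illustrative figures only, not University of Michigan data.
rate = hourly_rate(83_200)                       # 40.0
calls = [(20, 1.0), (8, 0.5)]                    # (attendees, elapsed hours)
print(rate, total_call_cost(calls, rate))        # 40.0 960.0
```

The review meeting can be costed the same way by adding it as one more (attendees, elapsed hours) pair.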