SESSION 404
Thursday, May 11, 10:00am - 11:00am
Track: Service Desk Masters
Implementing a Major Incident Management Process
Lisa Callihan
IT Customer Experience Manager, University of Michigan
[email protected]
Session Description
Do you have a process in place for quickly restoring services in the event of a major incident? Beginning in 2010, the University of Michigan started a journey of implementing just such a process. This session will share that journey through vivid examples of the development, implementation, and results they achieved. You’ll leave with an understanding of just how crucial a major incident process is, as well as a toolkit that includes a process guide, templates, communication plans, and costing model to help you create a similar process for your organization.
Speaker Background
For the past ten years, Lisa Callihan has worked for the University of Michigan’s IT department, as a service desk manager, incident and knowledge process owner, and, currently, as customer experience manager. She also spent fifteen years in corporate help desk management. A Wayne State University graduate, Lisa has earned her ITIL v3 Foundations, ITIL v3 Support and Restore Practitioner, and HDI Support Center Manager certifications. She previously served as VP of communications for the Motown local chapter.
• By the numbers
– 44,718 Students & 575,000 Alumni
– 7,886 Faculty & 32,288 Staff
– 6,295 Grad Assistants & Research Fellows
• IT Support at Michigan
The Early Years
Help Desk during an IT outage
What would help?
Answers!
What is the issue?
What is the impact?
Who is needed?
Is there a workaround?
When will it be fixed?
What can I tell customers?
Does leadership know?
After it is fixed, what was the resolution?
Can you think of anything else?
Initial Release
Major Incident Identification
Process Guidelines & Template
Limited Roles & Decision Making
Process Purpose
1. Restoration of Service as quickly as possible
2. Allow the right people to fix the issue
3. Effective communication across teams & leadership
4. Organized plan to resolve the issue
5. Timely and accurate outage documentation
Steps
1. Determine & Notify
2. Initial Stand Up Meeting
3. Coordination
4. Ongoing Meetings
5. Major Incident Closure
First Year Results
• Nine Major Incidents
• Successful Meetings
• Manual Process
• Helpful to Clarify
• Adoption Rate
Process Improvements
Staff Training
Process Adoption
Consistency
Ease of Use
Major Incident Levels
Level Definition
• HIGH
– Cause and resolution unknown
– Multiple, single, or limited services impacted
• MEDIUM
– Cause may or may not be known
– The resolution is pending
• LOW
– Cause is known
– The resolution is in progress
Process Requirements
• HIGH
– Must follow full Major Incident process
– Major Incident Calendar notifications
– Review Meeting conducted
• MEDIUM
– Same
– Same
– Review Meeting conducted where appropriate
• LOW
– Same
– Same
– Review meeting is not conducted
Attendance Expectations
• HIGH
– “All Hands on Deck” - Any Staff with the potential to assist must call in for the initial stand up meeting
– All identified on-call Staff
• MEDIUM
– Service Owner, Manager or Representative(s)
– Impacted Service and Technical Teams
– Communications Team
– Service Desk or Data Center Operations
– Customer Relationship Manager
• LOW
– Service Owner, Manager or Representative(s)
– Communications Team
– Service Desk or Data Center Operations
– Customer Relationship Manager
Facilitator
• HIGH
– Service Management
• MEDIUM
– Service Management
• LOW
– Service Management
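The level table can also be read as a small lookup structure. The following is a minimal sketch, not part of the original toolkit: the names `Level`, `LevelPolicy`, `POLICIES`, and `classify`, and the resolution labels "unknown", "pending", and "in progress", are hypothetical. The table does not say how to treat a known cause with an unknown resolution, or an unknown cause with a resolution in progress; the sketch assumes the stricter level in both cases.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Level(Enum):
    HIGH = "HIGH"
    MEDIUM = "MEDIUM"
    LOW = "LOW"


@dataclass(frozen=True)
class LevelPolicy:
    review_meeting: str            # wording taken from the Process Requirements row
    attendees: Tuple[str, ...]     # wording taken from the Attendance Expectations row
    facilitator: str = "Service Management"  # same for every level in the table


POLICIES = {
    Level.HIGH: LevelPolicy(
        "conducted",
        ("Any Staff with the potential to assist",
         "All identified on-call Staff"),
    ),
    Level.MEDIUM: LevelPolicy(
        "conducted where appropriate",
        ("Service Owner, Manager or Representative(s)",
         "Impacted Service and Technical Teams",
         "Communications Team",
         "Service Desk or Data Center Operations",
         "Customer Relationship Manager"),
    ),
    Level.LOW: LevelPolicy(
        "not conducted",
        ("Service Owner, Manager or Representative(s)",
         "Communications Team",
         "Service Desk or Data Center Operations",
         "Customer Relationship Manager"),
    ),
}


def classify(cause_known: bool, resolution_state: str) -> Level:
    """Map the Level Definition row to a level.

    resolution_state is one of "unknown", "pending", "in progress"
    (hypothetical labels for the three columns).
    """
    if resolution_state == "unknown":
        # Assumption: an unknown resolution is HIGH even if the cause is known.
        return Level.HIGH
    if resolution_state == "in progress" and cause_known:
        return Level.LOW
    if resolution_state in ("pending", "in progress"):
        # Assumption: in progress with an unknown cause falls back to MEDIUM.
        return Level.MEDIUM
    raise ValueError(f"unrecognized resolution state: {resolution_state!r}")
```

For example, `POLICIES[classify(True, "in progress")].review_meeting` yields the LOW rule, "not conducted".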
Success
Leadership Quote:
“I've already seen some nice improvements and have heard good feedback from many staff on this process. It's quickly becoming one of our most mature processes. Usage of the process is up and it's used widely across our organization.”
Continuous Improvement
Participation
Facilitation
On Call
Roles
Major Incident Quick Reference
Call 4-HELP
• Is there a degradation/outage critically impacting service?
• Has an incident been created and assigned?
• Have service restoration attempts been unsuccessful?
• Gather the supporting details and call.
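The three questions above work as a pre-call check. The sketch below assumes all three must be answered "yes" before calling, which the slide implies but does not state outright; `unmet_checks` and `ready_to_call` are hypothetical helper names.

```python
# Questions copied from the Major Incident Quick Reference.
QUICK_REFERENCE_CHECKS = (
    "Is there a degradation/outage critically impacting service?",
    "Has an incident been created and assigned?",
    "Have service restoration attempts been unsuccessful?",
)


def unmet_checks(answers):
    """Return the questions answered "no" or not answered at all.

    answers maps each question string to True (yes) or False (no).
    """
    return [q for q in QUICK_REFERENCE_CHECKS if not answers.get(q, False)]


def ready_to_call(answers):
    """Assumption: every question must be "yes" before calling 4-HELP."""
    return not unmet_checks(answers)
```

A caller who has not yet opened an incident would get that question back from `unmet_checks`, which is a prompt to create and assign the incident before gathering details and calling.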
RACI
R Responsible – Person working on incident
A Accountable – Person with decision authority
C Consult – Key stakeholder who should be included
Service Owner(s) Service Manager(s) Product Manager(s) Technical Lead(s)
INITIAL CALL TEAM REPRESENTATION
Complete this section during first call; if a team is missing, page them as needed.
On-Call Group Representatives (separate names with commas to keep rows brief)
CURRENT STATUS After the initial call, copy Status Update Call Notes template to here before next call; keep most current notes on top.
<Date and time> INITIAL CALL NOTES, next call scheduled for <date and time>
Attendees Facilitator: <facilitator name> Notetaker: <notetaker name> Note to facilitators: For initial call names are entered above; no need to duplicate them here. OK to just indicate “See initial call list above.”
Status <Incident update>
Discussion <meeting collaboration notes>
Action Items <include task, person assigned, and time>
STATUS UPDATE CALLS NOTES TEMPLATE
Copy this table for status update calls. Move it to the top of all status notes so most recent are always shown first.
<Date and time> SI Status Update Notes, next call scheduled for <date and time>
Attendees Facilitator: <facilitator name> Notetaker: <notetaker name> All Others: Please add your name in the comments Note to facilitator: Recommend you use a comment box for capturing the attendee list for status calls. It keeps the notes from getting lengthy when there are staff who aren’t necessarily required but just want to listen in. You might then indicate here if specific required teams were or weren’t represented.
Status <Incident update> <Last call action items updates> Note to facilitator: recommend you also copy in the action items from most recent call and record their status here
Discussion <meeting collaboration notes>
Action Items <include task, person assigned, and time>
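The template's two conventions, newest notes on top and the previous call's action items copied into the next status, can be sketched in a few lines. This is an illustration only: the toolkit describes a shared document, not software, and `CallNotes`, `MajorIncidentDocument`, and `carried_over` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CallNotes:
    held_at: str                      # "<Date and time>" from the template
    facilitator: str
    notetaker: str
    status: str = ""
    discussion: str = ""
    action_items: List[str] = field(default_factory=list)
    # Action items copied from the most recent earlier call, per the
    # "Note to facilitator" in the Status row.
    carried_over: List[str] = field(default_factory=list)


class MajorIncidentDocument:
    def __init__(self):
        self.notes: List[CallNotes] = []   # index 0 is always the latest call

    def add_call(self, notes: CallNotes) -> CallNotes:
        if self.notes:
            # Copy, so later edits to the old list do not leak into the new notes.
            notes.carried_over = list(self.notes[0].action_items)
        self.notes.insert(0, notes)        # keep most recent notes on top
        return notes
```

Usage: after adding the initial call and then a status update call, `doc.notes[0]` is the status update, and its `carried_over` list holds the initial call's action items ready for a status check.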
MAJOR INCIDENT REVIEW MEETING NOTES
Review Meeting ● <date and time of scheduled meeting>
Attendees ● <list names of Significant Incident Review participants>
Key Findings ● <list key findings>
Root Cause ● <document root cause, if known>
Action Items ● <include task, person assigned and target date>
MAJOR INCIDENT ANALYSIS QUESTIONS Consider printing this talking point reference and having it in front of you during the call as a discussion guide if needed.
WHAT
What Services are affected?
Which related services are not affected?
What are the symptoms (Error messages, slowness, timeouts)?
What is considered to be “normal” state?
What is the user or customer population that is impacted?
WHERE
Where are the affected areas, geographically?
Are there specific areas within the geographic area that are affected?
WHEN
When did we first know that Services were affected?
How frequently is the interruption/degradation happening (constant, intermittent, pattern)?
Does the interruption occur in conjunction with a business process?
Is there a point in the business cycle that is or could be affected?
EXTENT
How many items, out of how many in total, are affected (for example, number of users)? (ex: 4 of 6 servers for the Service are affected)
Are the symptoms getting worse or getting better?
Have there been any Changes to the Service or a dependent Service leading up to this Incident?
DOCUMENTATION
What type of documentation (e.g. architecture map, recent changes) would be helpful in this Major Incident? Who will get them?
ACTION
What action(s) do we believe that we may take to restore service?
What are the benefits or risks associated with the actions identified to restore service?
Who is performing or responsible for each action?
Example Facilitator Checklist
Activity Information and Links
1. Confirm Major Incident Document link is added to the Calendar event. Update with information already known; review Document agenda at top
Timing: 10+ min before initial call
<insert information and links directly related to the activity>
2. Send Service Owner or delegate a link to Service Owner Checklist as soon as they are identified. They may be identified during the initial call.
Timing: As soon as possible or during initial call
3. Dial into conference line; notetaker also joins
Timing: 3-5 min before initial call
4. Verify required group representation before starting conversation:
● Service Owner and Comm Team member
● Representation from all teams
Timing: During call
5. Request overview of issue from person reporting Major Incident. Also:
● Issue posts to Status page
● Any Security concern?
● Any Customer relations concern?
Timing: During call
6. Identify next steps and/or contingencies. By now there should also be a Service Owner/Delegate identified. Do not end the call without a manager-level or higher person identified to act as Service Owner or delegate
Timing: During call
7. Confirm action items and assignment.
Timing: During call
8. Establish next check in date/time (callers often start to drop off now)
● Confirm required attendees for future calls
● Ask Comm, Service Owner, Technical SME to stay on to develop the Communications Plan
Timing: During call
9. Confirm Communication Plan before formally ending call
● Status pages (Technical SME will often update these)
● Leadership and Campus Facing (Comm Rep assists Service Owner)
Timing: During call
10. Determine ongoing coordination methods.
Timing: During or after call
11. Schedule next status update call. Between calls:
● Get link to comm plan from Comm rep; add to Major Incident Document
● Clean up Document Notes and get the doc ready for next call
● Be aware of Major Incident related communications
Timing: After initial call
12. Conduct Ongoing Status Update Calls and Update Document notes
Timing: As needed
13. Confirm Major Incident resolution and permission to close Incident with Service Owner
● Confirm final communication task assignments
● Confirm post - problem investigation assignment to technical staff
● Remind Service Owner of follow up items in Service Owner Checklist
● Clean Document Notes; close and store with other Major Incident Documents
● If Service Owner completed a Resolution Summary, also obtain and store it with Major Incident docs
Timing: When Service Owner indicates Major Incident is closed
Example Service Owner Checklist
Activity Information and Links
1. As soon as possible (even before initial call) communicate with your Service Portfolio Owner (SPO). The SPO is responsible for direct communications with CIO.
<insert information and links directly related to the activity>
2. If an SI is a sensitive communications issue, contact Communications immediately for guidance
3. Contact the Facilitator throughout the Major Incident, as needed.
4. Help determine required teams for the Major Incident and ongoing status calls. Release nonessential staff as early as possible.
5. Remain on initial call when it ends to collaborate with Communications Rep on communications plan. Comm Rep is responsible to help compose initial status email to leadership and campus-facing communications. Engage with the Customer Relationship team for hands-on outreach to units, if needed.
6. If the Major Incident team wants a HipChat or physical WAR room, it’s helpful to offer your administrative support staff to assist with setup. The designated WAR room may need to be cleared of other meetings.
7. Attend all Major Incident Status Update meetings to partner with the Facilitator. Continue working with Communications Rep for ongoing leadership and campus-facing communications.
8. When you are satisfied service has been restored to a working state and at your discretion, have the Facilitator close the Major Incident.
9. Make sure the technical team is assigned to conduct a root-cause analysis and provide a ServiceLink problem record number to the Facilitator ASAP.
10. After the Major Incident, work with the Communications Rep for closure communications including the Resolution Summary that is due within 3 days from Major Incident close date. Work with the Facilitator to store it with the Major Incident document.
11. Follow up with the technical team to make sure they are working the root-cause analysis and closing the problem record in a timely manner.
12. Conduct a Major Incident Review/Lessons Learned meeting with participants.
13. In the Major Incident Document that was recorded, complete the Review Meeting Notes section located on the last page.
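Item 10 sets the one hard deadline in this checklist: the Resolution Summary is due within 3 days of the Major Incident close date. A minimal sketch of that date arithmetic follows. It assumes calendar days, since the checklist does not say whether business days are meant; the function names are hypothetical.

```python
from datetime import date, timedelta

# From checklist item 10: "due within 3 days from Major Incident close date".
RESOLUTION_SUMMARY_DAYS = 3


def resolution_summary_due(close_date: date) -> date:
    """Last day the Resolution Summary is on time (assumes calendar days)."""
    return close_date + timedelta(days=RESOLUTION_SUMMARY_DAYS)


def is_overdue(close_date: date, today: date) -> bool:
    """True once today is past the due date."""
    return today > resolution_summary_due(close_date)
```

Under these assumptions, a Major Incident closed on May 11 has its summary due by May 14. An organization counting business days would need a different calculation.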
Major Incident Resolution Summary Template Send to:
Author:
Date:
Service Owner:
Service Manager:
Incident Record Number:
Problem Record Number:
Change Record Number(s):
Description of Service Outage: (Describe the outage. Provide the general idea of impact by quantifying customers and locations impacted, the duration of the outage, and what issues it may have caused as a result of being unavailable.)
Impacted Service(s): (List all known Business Services or Applications Services Impacted)
Resolution: (Summary of efforts completed to resolve the service disruption and prevent reoccurrence)
●
Root Cause: (The identified point or points of failure that led to the service disruption)
●
Supporting Information: (Include any links to supporting documentation, presumed questions & answers, a timeline, technical documentation or reports)
● Incident Details: <insert link> ● ●
Incident    Begin Date    End Date    # of hours of
Conference Calls    Number of    Elapsed Time    Total Number of    $/hour    Total $
Call 1    0    0    0
Call 2    0    0    0
Call 3    0    0    0
Call 4    0    0    0
Review Meeting    0    0    0
$0
Estimate of    0    0    $0
Significant Incident    0    0    $0
Number of    $/Incident    Total $
0    0    $0
Total Cost of Incident    $0
Average Salary    Hours per year    Hourly rate    Average hourly
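The arithmetic behind the conference-call rows of the costing model can be sketched as follows. Several column headers above are truncated, so the reading used here is an assumption: each call's cost is number of attendees × elapsed time in hours × $/hour, and the hourly rate is average salary ÷ hours per year. All names and figures in the example are hypothetical, and the other rows of the model are not covered.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ConferenceCall:
    name: str
    attendees: int          # assumed meaning of the "Number of" column
    elapsed_hours: float    # assumed meaning of the "Elapsed Time" column

    def staff_hours(self) -> float:
        # Assumed meaning of the "Total Number of" column.
        return self.attendees * self.elapsed_hours


def hourly_rate(average_salary: float, hours_per_year: float) -> float:
    """Hourly rate = Average Salary / Hours per year."""
    return average_salary / hours_per_year


def call_cost(call: ConferenceCall, rate: float) -> float:
    """Total $ for one call = staff hours x $/hour."""
    return call.staff_hours() * rate


def total_call_cost(calls: Iterable[ConferenceCall], rate: float) -> float:
    return sum(call_cost(c, rate) for c in calls)


if __name__ == "__main__":
    # Hypothetical inputs, not figures from the session.
    rate = hourly_rate(average_salary=62_400, hours_per_year=2_080)
    calls = [
        ConferenceCall("Call 1", attendees=10, elapsed_hours=1.0),
        ConferenceCall("Call 2", attendees=6, elapsed_hours=0.5),
    ]
    for c in calls:
        print(c.name, call_cost(c, rate))
    print("Total", total_call_cost(calls, rate))
```

Even with rough inputs, pricing each call this way shows how quickly an "All Hands on Deck" call adds up, which supports releasing nonessential staff as early as possible.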