Incident Management in a Nutshell by Quinn Murphy

Published on Jun 23, 2016

Incidents that disrupt service are not simply anomalies. Disruptive events reveal weaknesses in your system that a specific configuration of events has exposed. We therefore need to treat incident management first and foremost as a method for reducing weakness and improving. How might we do that, and what do we need?

PRESENTATION OUTLINE

Incident Management

in a nutshell

Photo by steffenz

Response to Service Impacting Events

...aka, Incident Management

Photo by taberandrew

Thinking about Incident Management

Photo by ansik

Incidents aren't anomalies

Photo by ansik

Outages are weaknesses revealed

We can't fix weaknesses during an outage

During incidents, our main job is to communicate effectively

Communicate with

Stakeholders
Problem Solvers
Customers

COMMUNICATION

Observe
Notify
Marshal
Assess/Inform
Describe

...UNTIL INCIDENT RESOLVED

We improve after an incident

We Improve:

Process
Culture
Systems

We improve by making incremental fixes

Incremental > Systemic

one part versus the whole elephant

Photo by digitalART2

Incremental Improvement through post-mortems

Blameless post-mortems are vital.

Photo by bizstone

Identify changes to be made and track them.

ITERATE

small but constant changes

Photo by coloneljohnbritt

Problems

Our workload versus our structure

Photo by coloneljohnbritt

Ops lacks structure for improvement...

...and also the time/resources to dedicate to building structures

aka, Firefighting

Fix current problem and try to catch it earlier next time.

Photo by U.S. Pacific Fleet

We grow into problems.

We learn to get by, afraid of systemic pain points.

Photo by garryknight

What to do about Incident Management

Photo by Jed Sullivan

During Incidents

establish procedures/structures for better communication

Contacts

Who do I contact when X breaks?
Who do I call for an S4? An S1?
Where is the vendor contact information?

Simple

but the time lost to searching and not knowing adds up.

Keep who to contact, and when to contact them, in one easy-to-find place
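
One way to picture "one easy-to-find place" is a single lookup table keyed by what broke and how severe it is. This is only a minimal sketch: the component names, the contacts, and the `who_to_contact` helper are all hypothetical. The S1 and S4 labels come from the slide, but the deck does not say which is more severe, so the routing below is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contact:
    name: str
    channel: str  # how to reach them, e.g. "pager", "phone", "ticket"


# Hypothetical data: (component, severity) -> who to contact.
# Assumes S1 is urgent and S4 is routine; the slides do not define this.
CONTACTS = {
    ("database", "S1"): Contact("DBA on-call", "pager"),
    ("database", "S4"): Contact("DBA team queue", "ticket"),
    ("network", "S1"): Contact("Network vendor support", "phone"),
}

# Used when nothing more specific is listed, so nobody is left searching.
FALLBACK = Contact("Ops duty manager", "phone")


def who_to_contact(component: str, severity: str) -> Contact:
    """Return the listed contact, or the fallback if none is listed."""
    return CONTACTS.get((component, severity), FALLBACK)
```

The fallback entry is the design choice worth noting: the question "who do I call?" always has an answer, even for a component nobody thought to list.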

How to talk

Determine ahead of time how to communicate in stressful situations.

Incident Communications

Use the Incident Command System
Lines of Communication
Pre-made text for outside communications
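
Pre-made text can be as simple as a template with a few blanks to fill in under pressure. The sketch below uses Python's standard `string.Template`. The wording of the message and the field names (`service`, `impact`, `next_update`) are illustrative assumptions, not text from the deck.

```python
from string import Template

# Hypothetical wording, agreed on ahead of time rather than mid-incident.
STATUS_UPDATE = Template(
    "We are investigating an issue affecting $service. "
    "Impact: $impact. Next update by $next_update."
)


def render_status_update(service: str, impact: str, next_update: str) -> str:
    """Fill in the pre-made text; substitute() raises KeyError on a missing field."""
    return STATUS_UPDATE.substitute(
        service=service, impact=impact, next_update=next_update
    )
```

Because `substitute()` fails loudly on a missing field, a half-filled message is less likely to reach customers.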

After Incidents

Supporting and Tracking Growth.

Post-mortems

Blameless. Scheduled immediately after the incident.

Post-mortems

Provide a standard form for reporting
Create and track tickets
Regularly follow up on progress
Create high-level summaries for execs
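
The four points above could be sketched as one small data structure: a standard form, action items tied to tickets, a way to see what is still open, and a one-line summary for execs. The field names and the `OPS-123` style ticket keys are hypothetical, and the form says nothing about who was at fault, in keeping with the blameless approach.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ActionItem:
    description: str
    ticket: str          # hypothetical ticket key, e.g. "OPS-123"
    done: bool = False


@dataclass
class PostMortem:
    title: str
    summary: str
    timeline: List[str] = field(default_factory=list)
    contributing_factors: List[str] = field(default_factory=list)
    action_items: List[ActionItem] = field(default_factory=list)

    def open_items(self) -> List[ActionItem]:
        """Action items still to follow up on."""
        return [item for item in self.action_items if not item.done]

    def exec_summary(self) -> str:
        """One-line, high-level summary with follow-up progress."""
        total = len(self.action_items)
        closed = total - len(self.open_items())
        return f"{self.title}: {self.summary} ({closed}/{total} action items closed)"
```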

Observability

What exactly is going on in our system?

Photo by washingtonydc

Observability

Improve our knowledge about resources as they are used
Monitor based on data instead of "what hit us most recently"
Metrics and analysis
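
As a rough illustration of monitoring based on data, an alert threshold can be derived from a metric's own history instead of being set after the last outage. The sketch below uses mean plus `k` sample standard deviations. That rule, the default `k = 3.0`, and the function names are assumptions made for illustration, not the method described in the deck.

```python
from statistics import mean, stdev
from typing import Sequence


def threshold_from_history(samples: Sequence[float], k: float = 3.0) -> float:
    """Alert threshold: mean of past samples plus k sample standard deviations."""
    return mean(samples) + k * stdev(samples)


def is_anomalous(value: float, samples: Sequence[float], k: float = 3.0) -> bool:
    """True when a new reading exceeds the threshold derived from history."""
    return value > threshold_from_history(samples, k)
```

A rule like this needs at least two historical samples and assumes the metric is roughly stable; metrics with strong daily cycles would need a different baseline.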

People

Culture and Mental Models

Troubleshooting Ability Varies

Can you/can't you teach it?

Photo by cesarastudillo

Troubleshooting

Whether it is teachable or not, we can supply basic processes and tools
Focus on giving people the best tools
Empower people to use tools
Provide good documentation

...this is just the beginning

Where I've Started

Working on Monitoring (Icinga + InfluxDB + Grafana)
Creating Post-mortem Workflow with Jira (following best practices)
Confluence Best Practices

Thinking > Technology

Mental Tools > Software Tools

Photo by Howdy, I'm H. Michael Karshis

Thank You!

Comments/Questions?

Quinn Murphy