Incident Management in a Nutshell

Published on Jun 23, 2016

Incidents that disrupt the service are not simply anomalies. Disruptive events are weaknesses in your system, revealed by a specific configuration of events. We therefore need to treat incident management first and foremost as a method for reducing weakness and improving. How might we do that, and what do we need?

PRESENTATION OUTLINE

Incident Management

in a nutshell

Response to Service Impacting Events

...aka, Incident Management 

Thinking about Incident Management


Incidents aren't anomalies


Outages are weaknesses revealed

We can't fix weaknesses during an outage

During incidents, our main job is to communicate effectively

Communicate with

  • Stakeholders
  • Problem Solvers
  • Customers

COMMUNICATION

  • Observe
  • Notify
  • Marshal
  • Assess/Inform
  • Describe

...UNTIL INCIDENT RESOLVED

We improve after an incident

We Improve:

  • Process
  • Culture
  • Systems

We improve by making incremental fixes

Incremental > Systemic

one part versus the whole elephant

Incremental Improvement through post-mortems

Blameless post-mortems are vital.


Identify changes to be made and track them.

ITERATE

small but constant changes 

Problems

Our workload versus our structure

Ops lacks structure for improvement...

...and also the time/resources to dedicate to building structures

aka, Firefighting

Fix the current problem and try to catch it earlier next time.

We grow into problems.

We learn to get by, afraid of systemic pain points.

What to do about Incident Management


During Incidents

establish procedures/structures for better communication

Contacts

  • Who do I contact when X breaks?
  • Who do I call for S4? S1?
  • Where is vendor contact information?

Simple

but the time lost searching or not knowing adds up.

Keep who to contact, and when to contact them, in *one* easy-to-find place
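
As a rough illustration of a single place for contacts, here is a minimal Python sketch. Everything in it is an assumption made for the example: the component names, the contact strings, the fallback, and the helper names `who_to_contact` and `vendor_for` are all hypothetical. Only the S1 and S4 severity labels come from the slides.

```python
from typing import Optional

# Hypothetical single source of truth for incident contacts.
# Components, contacts, and channels are illustrative placeholders.
CONTACTS = {
    "database": {"S1": "dba-oncall (phone)", "S4": "dba-team (ticket queue)"},
    "network": {"S1": "network-oncall (phone)", "S4": "network-team (ticket queue)"},
}

# Hypothetical vendor contacts, kept next to the internal ones.
VENDORS = {"network": "Example ISP support desk"}

# Used when a component or severity has no explicit entry.
FALLBACK = "ops-duty-manager (phone)"


def who_to_contact(component: str, severity: str) -> str:
    """Return the contact for a component at a severity, or the fallback."""
    return CONTACTS.get(component, {}).get(severity, FALLBACK)


def vendor_for(component: str) -> Optional[str]:
    """Return the vendor contact for a component, if one is recorded."""
    return VENDORS.get(component)
```

The fallback entry means a lookup always returns someone to call, even for a component nobody thought to list.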

How to talk

Determine ahead of time how to communicate in stressful situations.

Incident Communications

  • Use the Incident Command System
  • Lines of Communication
  • Pre-made text for outside communications
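
Pre-made text for outside communications could be kept as fill-in templates. The sketch below uses Python's standard `string.Template`. The message wording, the template keys, the field names, and the `render_update` helper are hypothetical placeholders, not text from the talk.

```python
from string import Template

# Hypothetical pre-written status messages; wording and fields are placeholders.
TEMPLATES = {
    "investigating": Template(
        "We are investigating an issue affecting $service. "
        "Next update by $next_update."
    ),
    "resolved": Template(
        "The issue affecting $service has been resolved as of $resolved_at."
    ),
}


def render_update(kind: str, **fields: str) -> str:
    """Fill in a pre-made message.

    substitute() raises KeyError when a field is missing, so an
    incomplete message is rejected instead of being sent.
    """
    return TEMPLATES[kind].substitute(**fields)


if __name__ == "__main__":
    print(render_update("investigating", service="checkout", next_update="14:30 UTC"))
```

Writing the wording before the incident means that under stress the responder only has to supply the facts.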

After Incidents

Supporting and Tracking Growth.

Post-mortems

Blameless. Scheduled immediately after the incident.

Post-mortems

  • Provide a standard form for reporting
  • Create and track tickets
  • Regularly follow up on progress
  • Create high-level summaries for execs
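
One way to picture a standard form with tracked follow-ups is a small data structure. This is a sketch under assumptions: the field names, the `OPS-101` style ticket keys, and the summary format are hypothetical, and it is not tied to any particular tracker's API. The form records contributing factors rather than a responsible person, in keeping with the blameless approach.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class ActionItem:
    description: str
    owner: str
    ticket: str  # hypothetical tracker key, e.g. "OPS-101"
    done: bool = False


@dataclass
class PostMortem:
    title: str
    incident_date: date
    summary: str
    # Factors, not names: the form has no "who caused it" field.
    contributing_factors: List[str] = field(default_factory=list)
    action_items: List[ActionItem] = field(default_factory=list)

    def open_items(self) -> List[ActionItem]:
        """Action items still needing follow-up."""
        return [item for item in self.action_items if not item.done]

    def exec_summary(self) -> str:
        """One-line, high-level summary with follow-up progress."""
        total = len(self.action_items)
        closed = total - len(self.open_items())
        return (
            f"{self.incident_date.isoformat()} {self.title}: "
            f"{self.summary} ({closed}/{total} actions closed)"
        )
```

Calling `open_items()` at a regular review covers the follow-up step, and `exec_summary()` gives the short version for executives.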

Observability

What exactly is going on in our system?

Observability

  • Improve our knowledge about resources as they are used
  • Monitor based on data instead of "what hit us most recently"
  • Metrics and analysis
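
To make "monitor based on data" concrete, here is a minimal sketch that derives an alert threshold from historical samples instead of hand-picking one after the latest outage. The mean plus k standard deviations rule is just one simple choice, assumed here for illustration and not a method from the talk. The function names and the sample values are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence


def data_driven_threshold(samples: Sequence[float], k: float = 3.0) -> float:
    """Threshold = mean + k sample standard deviations of past values."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate spread")
    return mean(samples) + k * stdev(samples)


def is_anomalous(value: float, samples: Sequence[float], k: float = 3.0) -> bool:
    """True when a new value exceeds the threshold derived from history."""
    return value > data_driven_threshold(samples, k)


if __name__ == "__main__":
    # Hypothetical history of a metric, e.g. request latency in ms.
    history = [10, 12, 11, 13, 12, 11, 10, 12]
    print(is_anomalous(30, history))
    print(is_anomalous(12, history))
```

In practice the history would come from whatever metrics store is in use. A percentile-based rule may suit skewed metrics better than a standard-deviation rule.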

People

Culture and Mental Models

Troubleshooting Ability Varies

Can you teach it, or can't you?

Troubleshooting

  • Whether or not it is teachable, we can supply basic processes and tools
  • Focus on giving people the best tools
  • Empower people to use tools
  • Provide good documentation

...this is just the beginning

Where I've Started

  • Working on Monitoring (Icinga + InfluxDB + Grafana)
  • Creating Post-mortem Workflow with Jira (following best practices)
  • Confluence Best Practices

Thinking > Technology

Mental Tools > Software Tools

Thank You!

Comments/Questions?
