PRESENTATION OUTLINE
Response to Service Impacting Events
Thinking about Incident Management
Incidents aren't anomalies
Outages are weakness revealed
We can't fix weakness during an outage
During incidents, our main job is to communicate effectively
Communicate with
- Stakeholders
- Problem Solvers
- Customers
COMMUNICATION
- Observe
- Notify
- Marshal
- Assess/Inform
- Describe
...UNTIL INCIDENT RESOLVED
We improve after an incident
We improve by making incremental fixes
Incremental Improvement through post-mortems
Blameless post-mortems are vital
Identify changes to be made and track them
Ops lacks structure for improvement...
...and also the time/resources to dedicate to building structures
What to do about Incident Management
Contacts
- Who do I contact when X breaks?
- Who do I call for S4? S1?
- Where is vendor contact information?
Keep who to contact, and when to contact them, in *1* easy-to-find place
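One way to picture "1 easy to find place" is a single lookup keyed by system and severity. The sketch below is only an illustration: the `Contact` type, the `DIRECTORY` entries, the system names, and the `who_to_contact` helper are all hypothetical placeholders, and only the S1/S4 severity labels come from the outline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contact:
    name: str
    channel: str  # how to reach them, e.g. pager, phone, ticket


# Hypothetical directory: every entry here is placeholder data.
DIRECTORY = {
    ("database", "S1"): [
        Contact("DB on-call", "pager"),
        Contact("Vendor support line", "phone"),
    ],
    ("database", "S4"): [Contact("DB team queue", "ticket")],
    ("network", "S1"): [Contact("Network on-call", "pager")],
}

# Fallback used when a system/severity pair has no explicit entry.
DEFAULT = [Contact("Ops duty manager", "pager")]


def who_to_contact(system: str, severity: str) -> list[Contact]:
    """Return the contacts for a broken system at a given severity."""
    return DIRECTORY.get((system, severity), DEFAULT)
```

The fallback entry means the question "who do I contact?" always has an answer, even for a system nobody has listed yet.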
Incident Communications
- Use the Incident Command System
- Lines of Communication
- Pre-made text for outside communications
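Pre-made text can be as simple as a fill-in-the-blanks template, so nobody has to compose wording under pressure. This is a minimal sketch using Python's standard `string.Template`; the message wording, the field names, and the `render_update` helper are assumptions made for illustration, not an agreed format.

```python
from string import Template

# Hypothetical status-update wording; adjust to the organization's own voice.
STATUS_UPDATE = Template(
    "[$severity] $service: $summary. "
    "Next update by $next_update."
)


def render_update(severity: str, service: str, summary: str, next_update: str) -> str:
    """Fill in the pre-made text; raises KeyError if a field is missing."""
    return STATUS_UPDATE.substitute(
        severity=severity,
        service=service,
        summary=summary,
        next_update=next_update,
    )
```

Using `substitute` rather than `safe_substitute` is a deliberate choice here: a half-filled message fails loudly instead of going out to customers with a raw placeholder in it.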
Post-mortems
- Provide a standard form for reporting
- Create and track tickets
- Regularly follow up on progress
- Create high-level summaries for execs
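The four points above could hang off one shared record. The sketch below assumes a hypothetical `PostMortem` form with tracked `ActionItem` entries; the field names and the ticket keys in the usage example are invented for illustration and do not reflect any particular Jira setup.

```python
from dataclasses import dataclass, field


@dataclass
class ActionItem:
    ticket: str        # hypothetical ticket key, e.g. "OPS-1"
    description: str
    done: bool = False


@dataclass
class PostMortem:
    title: str
    impact: str
    timeline: list[str] = field(default_factory=list)
    action_items: list[ActionItem] = field(default_factory=list)

    def open_items(self) -> list[ActionItem]:
        """Action items still needing follow-up."""
        return [item for item in self.action_items if not item.done]

    def exec_summary(self) -> str:
        """One-line, high-level summary of impact and follow-up progress."""
        total = len(self.action_items)
        closed = total - len(self.open_items())
        return f"{self.title}: {self.impact} ({closed}/{total} action items closed)"
```

For example, a post-mortem with one closed and one open item reports "1/2 action items closed", and `open_items()` gives the list to raise at the next regular follow-up.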
Observability
- Improve our knowledge about resources as they are used
- Monitor based on data instead of "what hit us most recently"
- Metrics and analysis
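As one illustration of monitoring from data, an alert threshold can be derived from recorded samples instead of being picked after the latest outage. The sketch below assumes a simple "mean plus three standard deviations" rule, which is just one common heuristic and not necessarily the right one for any given metric; the function names are hypothetical.

```python
import statistics


def alert_threshold(samples: list[float], sigmas: float = 3.0) -> float:
    """Threshold derived from historical samples: mean + sigmas * std dev."""
    mean = statistics.fmean(samples)
    stdev = statistics.pstdev(samples)  # population standard deviation
    return mean + sigmas * stdev


def is_anomalous(value: float, samples: list[float], sigmas: float = 3.0) -> bool:
    """True when a new reading exceeds the data-derived threshold."""
    return value > alert_threshold(samples, sigmas)
```

Because the threshold is recomputed from the samples, it moves with normal usage rather than staying pinned to whatever number caused the last page.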
Troubleshooting Ability Varies
Troubleshooting
- Whether or not troubleshooting is teachable, we can supply basic processes and tools
- Focus on giving people the best tools
- Empower people to use tools
- Provide good documentation
...this is just the beginning
Where I've Started
- Working on Monitoring (Icinga + InfluxDB + Grafana)
- Creating Post-mortem Workflow with Jira (following best practices)
- Confluence Best Practices