Self-Improving Operations
by @jose-compu
Captures process bottlenecks, incident patterns, capacity issues, automation gaps, SLA breaches, and toil accumulation to enable continuous operations improv...
1. Conduct blameless postmortems: focus on systemic causes, not individual blame
2. Automate toil aggressively: if you do it manually 3 times, automate it
3. Define SLOs before SLAs: internal targets should be stricter than customer commitments
4. Maintain runbooks: keep them current, test them during game days, include verification steps
5. Track error budgets: use them to balance feature velocity and reliability work
6. Rotate on-call fairly: equitable distribution, adequate rest, compensatory time off
7. Rehearse incident response: run tabletop exercises and chaos engineering experiments
8. Log immediately: incident context fades fast after resolution
9. Include timelines: timestamps are critical for postmortems and pattern detection
10. Measure DORA metrics: track deployment frequency, lead time, change failure rate, and MTTR
11. Review before on-call shifts: check .learnings/ for known issues and recent patterns
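Practices 5 and 10 in the list above come down to simple arithmetic, which the following Python sketch illustrates. It is not part of the skill itself: the function names, the `Incident` type, and the request-count definition of the error budget are all assumptions made for the example, and a team may well define its budget in minutes of downtime instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent).

    Assumes a request-based SLO: with a 0.999 target, 0.1% of requests
    are allowed to fail before the budget is exhausted.
    """
    allowed_failures = (1.0 - slo_target) * total
    if allowed_failures <= 0:
        raise ValueError("need slo_target < 1.0 and total > 0")
    return 1.0 - failed / allowed_failures


@dataclass
class Incident:
    """Hypothetical incident record; timestamps come from the incident timeline."""
    started: datetime
    resolved: datetime


def mean_time_to_restore(incidents: list[Incident]) -> timedelta:
    """Average duration from incident start to resolution (MTTR)."""
    if not incidents:
        raise ValueError("no incidents to average")
    total = sum((i.resolved - i.started for i in incidents), timedelta())
    return total / len(incidents)


def change_failure_rate(deployments: int, failed_deployments: int) -> float:
    """Share of deployments that caused a failure needing remediation."""
    if deployments <= 0:
        raise ValueError("need at least one deployment")
    return failed_deployments / deployments
```

A team might, for example, pause feature work when `error_budget_remaining` drops below an agreed threshold; where that threshold sits is a policy decision that the sketch does not make.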
clawhub install self-improving-operations