Site Reliability Engineering (SRE)
1. Introduction to Site Reliability Engineering
2. Core Principles of SRE
3. Service Level Management
4. Observability and Monitoring
5. Incident Management and On-Call
6. Toil Management and Automation
7. Change and Release Management
8. System Design for Reliability
9. SRE Organization and Culture
10. Advanced SRE Practices
Incident Management and On-Call
On-Call Engineering
Philosophy of On-Call
Rotation Models and Scheduling
Handoffs and Escalations
Psychological Safety for On-Call Engineers
Managing On-Call Fatigue
On-Call Compensation and Fairness
Incident Response Process
Incident Detection and Alerting
Incident Classification and Severity
Incident Command System
Roles and Responsibilities
Incident Commander
Communications Lead
Subject Matter Experts
Communication Protocols
Internal Communication
External Communication
Status Updates
Incident Documentation
Incident Resolution Strategies
Post-Incident Analysis
Blameless Postmortem Culture
Root Cause Analysis Techniques
Timeline Reconstruction
Generating Actionable Items
Tracking Remediation Work
Learning from Incidents
Sharing Postmortem Findings
Postmortem Review Process