Site Reliability Engineering (SRE)
1. Introduction to Site Reliability Engineering
2. Core Principles of SRE
3. Service Level Management
4. Observability and Monitoring
5. Incident Management and On-Call
6. Toil Management and Automation
7. Change and Release Management
8. System Design for Reliability
9. SRE Organization and Culture
10. Advanced SRE Practices
Core Principles of SRE
Embracing Risk
  Understanding Risk in Reliability
  Acceptable Risk Levels
  Risk Mitigation Strategies
Service Level Objectives
  Purpose of SLOs
  SLOs as Communication Tool
  SLOs and Business Alignment
Eliminating Toil
  Identifying Toil
  Impact of Toil on Productivity
  Toil Reduction as Core Value
Automation
  Benefits of Automation
  Identifying Automation Opportunities
  Automation Best Practices
Release Engineering
  Principles of Reliable Releases
  Release Process Automation
  Rollback and Rollforward Strategies
Simplicity
  Value of Simplicity in Systems
  Techniques for Achieving Simplicity
  Avoiding Unnecessary Complexity