Site Reliability Engineering (SRE)
Observability and Monitoring
Pillars of Observability
  Metrics
    Types of Metrics
    Metric Collection and Aggregation
    Metric Visualization
    Time Series Databases
  Logs
    Log Collection
    Log Retention Policies
    Log Analysis Techniques
    Structured vs Unstructured Logs
  Traces
    Distributed Tracing Concepts
    Trace Collection and Storage
    Trace Analysis
    Sampling Strategies
Monitoring Systems
  White-Box Monitoring
    Instrumentation of Code
    Application-Level Metrics
    Internal System Metrics
  Black-Box Monitoring
    Synthetic Monitoring
    External Probes
    User Journey Monitoring
Alerting Philosophy
  Alerting on Symptoms vs Causes
  Designing Effective Alerts
  Reducing Alert Fatigue
  Tuning Alert Thresholds
Monitoring Strategy
  Monitoring Stack Architecture
  Data Pipeline Design
  Monitoring as Code
  Cross-Service Monitoring
Dashboards and Visualization
  Dashboard Design Principles
  Operational Dashboards
  Executive Dashboards
  Real-Time vs Historical Views