Site Reliability Engineering (SRE)
1. Introduction to Site Reliability Engineering
2. Core Principles of SRE
3. Service Level Management
4. Observability and Monitoring
5. Incident Management and On-Call
6. Toil Management and Automation
7. Change and Release Management
8. System Design for Reliability
9. SRE Organization and Culture
10. Advanced SRE Practices
4. Observability and Monitoring
4.1. Pillars of Observability
4.1.1. Metrics
4.1.1.1. Types of Metrics
4.1.1.2. Metric Collection and Aggregation
4.1.1.3. Metric Visualization
4.1.1.4. Time Series Databases
4.1.2. Logs
4.1.2.1. Log Collection
4.1.2.2. Log Retention Policies
4.1.2.3. Log Analysis Techniques
4.1.2.4. Structured vs Unstructured Logs
4.1.3. Traces
4.1.3.1. Distributed Tracing Concepts
4.1.3.2. Trace Collection and Storage
4.1.3.3. Trace Analysis
4.1.3.4. Sampling Strategies
4.2. Monitoring Systems
4.2.1. White-Box Monitoring
4.2.1.1. Instrumentation of Code
4.2.1.2. Application-Level Metrics
4.2.1.3. Internal System Metrics
4.2.2. Black-Box Monitoring
4.2.2.1. Synthetic Monitoring
4.2.2.2. External Probes
4.2.2.3. User Journey Monitoring
4.2.3. Alerting Philosophy
4.2.3.1. Alerting on Symptoms vs Causes
4.2.3.2. Designing Effective Alerts
4.2.3.3. Reducing Alert Fatigue
4.2.3.4. Tuning Alert Thresholds
4.3. Monitoring Strategy
4.3.1. Monitoring Stack Architecture
4.3.2. Data Pipeline Design
4.3.3. Monitoring as Code
4.3.4. Cross-Service Monitoring
4.4. Dashboards and Visualization
4.4.1. Dashboard Design Principles
4.4.2. Operational Dashboards
4.4.3. Executive Dashboards
4.4.4. Real-Time vs Historical Views