Objective
Service Level Objectives (SLOs) quantify reliability targets. The error budget (the amount of failure, in time or requests, that the SLO permits) makes SLO-based decision making concrete: if the budget is nearly exhausted, stop feature releases and focus on reliability. This exercise implements a complete SLO measurement system using Prometheus recording rules and Sloth-style multi-window burn rate alerts.
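For concreteness, the arithmetic behind the 99.9% availability SLO over a 30-day window used throughout this exercise:

# Error budget for a 99.9% SLO over 30 days:
#   budget ratio   = 1 - 0.999                    = 0.001 (0.1% of requests may fail)
#   budget in time = 0.001 * 30 * 24 * 3600 s     = 2592 s ≈ 43.2 minutes
# Burn rate = observed error ratio / budget ratio:
#   a 1x burn rate spends exactly the whole budget over the 30-day window;
#   at 14.4x the budget is gone in 30 d / 14.4 ≈ 2 days.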
Prerequisites
- kube-prometheus-stack deployed (from d5-e1)
- A web service generating HTTP metrics (deploy podinfo or any HTTP service)
- Grafana accessible
Steps
Deploy a sample service with HTTP metrics
# Deploy podinfo as the service to measure
cat << 'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-service
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels: {app: web-service}
  template:
    metadata:
      labels: {app: web-service}
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9898"
    spec:
      containers:
      - name: app
        image: stefanprodan/podinfo:6.5.0
        ports:
        - containerPort: 9898
---
apiVersion: v1
kind: Service
metadata:
  name: web-service
  namespace: default
  labels:
    app: web-service
spec:
  selector: {app: web-service}
  ports:
  - port: 9898
    targetPort: 9898
EOF
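Note that kube-prometheus-stack discovers scrape targets through ServiceMonitor objects and, in its default configuration, does not act on prometheus.io/* pod annotations, so the annotations above are informational only. A minimal ServiceMonitor sketch that makes Prometheus scrape the service and produce the job="web-service" label the rules below rely on (the release label assumes the default Helm release name from d5-e1):

cat << 'EOF' | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: web-service
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  namespaceSelector:
    matchNames: [default]
  selector:
    matchLabels: {app: web-service}
  endpoints:
  - targetPort: 9898        # podinfo serves Prometheus metrics on /metrics
    path: /metrics
    interval: 15s
EOF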
kubectl wait deployment/web-service \
  --for=condition=Available --timeout=60s
Define the SLO in Prometheus recording rules
Recording rules pre-compute expensive queries and store the results as new time series. This is essential for SLO dashboards: evaluating raw metrics over a 30-day window on every dashboard refresh is far too slow.
# slo-rules.yaml — PrometheusRule CRD
cat << 'EOF' | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: web-service-slo
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  groups:
  - name: web-service.slo
    interval: 30s
    rules:
    # SLI: ratio of good requests to total requests
    - record: slo:web_service:success_rate1m
      expr: |
        sum(rate(http_requests_total{job="web-service",status!~"5.."}[1m]))
        /
        sum(rate(http_requests_total{job="web-service"}[1m]))
    # Error budget remaining over the 30-day compliance window
    # (0-1, where 1 = 100% remaining; SLO: 99.9% = 0.999 target)
    - record: slo:web_service:error_budget_remaining
      expr: |
        1 - (
          (
            sum(increase(http_requests_total{job="web-service",status=~"5.."}[30d]))
            /
            sum(increase(http_requests_total{job="web-service"}[30d]))
          )
          /
          (1 - 0.999)
        )
    # Error ratio over different windows (the "burn rate" series for
    # multi-window alerting; thresholds multiply by 1 - SLO)
    - record: slo:web_service:success_rate30d:burnrate1h
      expr: |
        1 - (
          sum(rate(http_requests_total{job="web-service",status!~"5.."}[1h]))
          /
          sum(rate(http_requests_total{job="web-service"}[1h]))
        )
    - record: slo:web_service:success_rate30d:burnrate6h
      expr: |
        1 - (
          sum(rate(http_requests_total{job="web-service",status!~"5.."}[6h]))
          /
          sum(rate(http_requests_total{job="web-service"}[6h]))
        )
    - record: slo:web_service:success_rate30d:burnrate3d
      expr: |
        1 - (
          sum(rate(http_requests_total{job="web-service",status!~"5.."}[3d]))
          /
          sum(rate(http_requests_total{job="web-service"}[3d]))
        )
EOF
Create multi-window burn rate alert rules
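The thresholds below follow the Google SRE workbook's multi-window pattern. The arithmetic behind the magic numbers:

# budget consumed when the alert fires ≈ burn rate × alert window / SLO period
#   fast page:   14.4 × 1h / 30d ≈ 2% of the budget gone in one hour
#   slow ticket:  3   × 3d / 30d = 30% of the budget gone over three days
# The recorded burnrate series are raw error ratios, so each threshold is
# written as multiplier × (1 - SLO), i.e. multiplier × 0.001.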
# slo-alerts.yaml
cat << 'EOF' | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: web-service-slo-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  groups:
  - name: web-service.slo.alerts
    rules:
    # Page: fast burn consuming ~2% of the monthly budget per hour.
    # A 14.4x burn rate exhausts the 30-day budget in ~2 days.
    - alert: WebServiceSLOFastBurn
      expr: |
        slo:web_service:success_rate30d:burnrate1h > (14.4 * 0.001)
        and
        slo:web_service:success_rate30d:burnrate6h > (14.4 * 0.001)
      for: 1m
      labels:
        severity: critical
        slo: web-service-availability
      annotations:
        summary: "Fast error budget burn rate for web-service"
        description: "Burn rate is {{ $value | humanizePercentage }}, consuming budget ~14x faster than sustainable. Budget will be exhausted in ~2 days."
    # Ticket: slow burn at 3x the sustainable rate; left unchecked,
    # this exhausts the 30-day budget in ~10 days.
    - alert: WebServiceSLOSlowBurn
      expr: |
        slo:web_service:success_rate30d:burnrate6h > (3 * 0.001)
        and
        slo:web_service:success_rate30d:burnrate3d > (3 * 0.001)
      for: 1h
      labels:
        severity: warning
        slo: web-service-availability
      annotations:
        summary: "Slow error budget burn for web-service"
        description: "Budget is being consumed at 3x the sustainable rate. Review error trends."
EOF
Verify recording rules are loaded
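Before port-forwarding, a quick sanity check that both PrometheusRule objects were admitted (names as created above):

kubectl get prometheusrules -n monitoring \
  web-service-slo web-service-slo-alerts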
# Port-forward to Prometheus
kubectl port-forward svc/kube-prometheus-stack-prometheus \
  -n monitoring 9090:9090

# In the Prometheus UI: Status → Rules → search for "web-service"
# Verify the recording rules show State: ok

# Test the queries in Prometheus:
#   slo:web_service:success_rate1m             → should return a value near 1.0
#   slo:web_service:success_rate30d:burnrate1h → near 0 (low errors)

# Generate some errors to see the burn rate move
kubectl exec -n default \
  $(kubectl get pod -l app=web-service -o name | head -1) \
  -- sh -c "
    for i in \$(seq 1 100); do
      wget -q -O /dev/null http://localhost:9898/status/500 || true
    done"
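The same checks can be scripted against the Prometheus HTTP API. A sketch, assuming the port-forward above is still running and jq is installed:

# Query a recorded series via the API; .data.result should be non-empty
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=slo:web_service:success_rate1m' | jq '.data.result'

# List loaded rule groups and confirm web-service.slo appears
curl -s 'http://localhost:9090/api/v1/rules' | jq '.data.groups[].name'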
Build the SLO Grafana dashboard
# Key panels to build in Grafana:

# Panel 1: Current SLO compliance (99.9% target)
#   Query:   slo:web_service:success_rate1m * 100
#   Display: Gauge, threshold at 99.9

# Panel 2: Error budget remaining (%)
#   Query:
#     (
#       1 - (
#         sum(increase(http_requests_total{job="web-service",status=~"5.."}[30d]))
#         /
#         sum(increase(http_requests_total{job="web-service"}[30d]))
#       ) / 0.001
#     ) * 100
#   Display: Gauge, 0-100%, threshold at 10% (warning)

# Panel 3: Burn rate over time
#   Query:   slo:web_service:success_rate30d:burnrate1h
#   Display: Time series, add a reference line at 1x (the sustainable rate)

# Panel 4: Projected exhaustion date
#   Error budget seconds remaining:
#     (
#       30 * 24 * 3600 * 0.001   # total budget in seconds (43.2 min)
#       -
#       sum(increase(http_requests_total{job="web-service",status=~"5.."}[30d]))
#       /
#       sum(rate(http_requests_total{job="web-service"}[30d]))
#     )
#   Display: Stat panel showing days remaining
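Dashboards can also be provisioned as code. A minimal sketch of Panel 1 as a ConfigMap, assuming the Grafana dashboard sidecar that kube-prometheus-stack enables by default (it watches ConfigMaps labeled grafana_dashboard); the panel JSON is illustrative, not a complete dashboard:

cat << 'EOF' | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: web-service-slo-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  web-service-slo.json: |
    {
      "title": "Web Service SLO",
      "uid": "web-service-slo",
      "schemaVersion": 39,
      "panels": [
        {
          "id": 1,
          "title": "SLO compliance (target 99.9%)",
          "type": "gauge",
          "gridPos": {"h": 8, "w": 8, "x": 0, "y": 0},
          "targets": [
            {"refId": "A", "expr": "slo:web_service:success_rate1m * 100"}
          ],
          "fieldConfig": {
            "defaults": {
              "unit": "percent",
              "min": 99,
              "max": 100,
              "thresholds": {
                "mode": "absolute",
                "steps": [
                  {"color": "red", "value": null},
                  {"color": "green", "value": 99.9}
                ]
              }
            }
          }
        }
      ]
    }
EOF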
Success Criteria
- Both PrometheusRule objects load without errors and Status → Rules shows every group as OK
- slo:web_service:success_rate1m returns a value near 1.0 and the burnrate series return values near 0
- Generating errors visibly moves the burn rate series (a sustained burst would trip WebServiceSLOFastBurn)
- The Grafana dashboard shows SLO compliance, remaining error budget, and burn rate over time
Further Reading
- Google SRE workbook — sre.google/workbook/implementing-slos
- Sloth SLO generator — sloth.dev
- Multi-window burn rate alerting — sre.google/workbook/alerting-on-slos
- Prometheus recording rules — prometheus.io/docs/prometheus/latest/configuration/recording_rules