Observability L2 · PRACTICAL ~60 min

Define an SLO with Error Budget Tracking

Define a 99.9% availability SLO for a sample web service over 30 days. Implement Prometheus recording rules to track the error budget. Build a Grafana dashboard showing current burn rate and projected exhaustion date.

Objective

Service Level Objectives (SLOs) quantify reliability targets. The error budget (time or requests where you can fail) makes SLO-based decision making concrete: if the budget is nearly exhausted, stop feature releases and focus on reliability. This exercise implements a complete SLO measurement system using Prometheus recording rules and Sloth-style multi-window burn rate alerts.

A 99.9% availability SLO over 30 days allows 43.2 minutes of error budget. If errors consume the budget at 14.4x the uniform rate, you exhaust it in about two days. Multi-window burn rate alerting catches both fast burns (pages) and slow burns (tickets).
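This arithmetic is easy to sanity-check. A short Python sketch (the numbers mirror the text above; nothing here is cluster-specific):

```python
# Error budget for a 99.9% availability SLO over a 30-day window.
SLO = 0.999
WINDOW_DAYS = 30

error_budget_ratio = 1 - SLO  # 0.1% of requests/time may fail
budget_minutes = WINDOW_DAYS * 24 * 60 * error_budget_ratio

# At a burn rate of B (multiples of uniform consumption), the
# budget lasts WINDOW_DAYS / B days.
def days_to_exhaustion(burn_rate: float) -> float:
    return WINDOW_DAYS / burn_rate

print(f"total budget: {budget_minutes:.1f} minutes")         # 43.2
print(f"at 14.4x burn: {days_to_exhaustion(14.4):.2f} days") # 2.08
print(f"at 1x burn: {days_to_exhaustion(1.0):.0f} days")     # 30
```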

Prerequisites

A Kubernetes cluster with kubectl access and the kube-prometheus-stack Helm chart (Prometheus Operator, Prometheus, and Grafana) installed in the monitoring namespace.

Steps

01

Deploy a sample service with HTTP metrics

# Deploy podinfo as the service to measure
cat << 'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-service
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels: {app: web-service}
  template:
    metadata:
      labels: {app: web-service}
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9898"
    spec:
      containers:
      - name: app
        image: stefanprodan/podinfo:6.5.0
        ports:
        - containerPort: 9898
---
apiVersion: v1
kind: Service
metadata:
  name: web-service
  namespace: default
  labels:
    app: web-service
spec:
  selector: {app: web-service}
  ports:
  - port: 9898
    targetPort: 9898
EOF

kubectl wait deployment/web-service \
  --for=condition=Available --timeout=60s

02

Define the SLO in Prometheus recording rules

Recording rules pre-compute expensive queries and store the results as new time series. This matters for SLO work: evaluating raw metrics over a 30-day window on every dashboard refresh is too slow.

# slo-rules.yaml — PrometheusRule CRD
cat << 'EOF' | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: web-service-slo
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  groups:
  - name: web-service.slo
    interval: 30s
    rules:
    # SLI: ratio of good requests to total requests
    - record: slo:web_service:success_rate1m
      expr: |
        sum(rate(http_requests_total{job="web-service",status!~"5.."}[1m]))
        /
        sum(rate(http_requests_total{job="web-service"}[1m]))

    # Error budget remaining (0-1, where 1 = 100% remaining)
    # SLO: 99.9% = 0.999 target, so the allowed error ratio is 0.001
    # Consumed fraction = 30d error ratio / allowed error ratio
    - record: slo:web_service:error_budget_remaining
      expr: |
        1 - (
          (
            sum(increase(http_requests_total{job="web-service",status=~"5.."}[30d]))
            /
            sum(increase(http_requests_total{job="web-service"}[30d]))
          )
          /
          (1 - 0.999)
        )

    # Burn rate over different windows (for multi-window alerting)
    - record: slo:web_service:success_rate30d:burnrate1h
      expr: |
        1 - (
          sum(rate(http_requests_total{job="web-service",status!~"5.."}[1h]))
          /
          sum(rate(http_requests_total{job="web-service"}[1h]))
        )

    - record: slo:web_service:success_rate30d:burnrate6h
      expr: |
        1 - (
          sum(rate(http_requests_total{job="web-service",status!~"5.."}[6h]))
          /
          sum(rate(http_requests_total{job="web-service"}[6h]))
        )

    - record: slo:web_service:success_rate30d:burnrate3d
      expr: |
        1 - (
          sum(rate(http_requests_total{job="web-service",status!~"5.."}[3d]))
          /
          sum(rate(http_requests_total{job="web-service"}[3d]))
        )
EOF
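The rule expressions above are plain ratio arithmetic. A Python sketch with synthetic counter values (the counts are illustrative, not podinfo output) shows what each recorded series represents:

```python
# Mirrors the burn-rate and budget expressions from the recording
# rules, using synthetic request counts in place of rate() sums.
def error_ratio(errors: float, total: float) -> float:
    """Equivalent of 1 - (good / total) over a window (the burnrate series)."""
    return 1 - (total - errors) / total

def budget_remaining(error_ratio_30d: float, slo: float = 0.999) -> float:
    """1 minus the consumed fraction of the error budget."""
    return 1 - error_ratio_30d / (1 - slo)

# 50 failed requests out of 100,000 -> 0.05% error ratio,
# half of the 0.1% budget consumed.
r = error_ratio(errors=50, total=100_000)
print(f"error ratio: {r:.4f}")                    # 0.0005
print(f"budget left: {budget_remaining(r):.2f}")  # 0.50
```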
03

Create multi-window burn rate alert rules

# slo-alerts.yaml
cat << 'EOF' | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: web-service-slo-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  groups:
  - name: web-service.slo.alerts
    rules:
    # Page: fast burn consuming 2% of the monthly budget in 1h
    # 14.4x burn rate means exhaustion in ~2 days
    - alert: WebServiceSLOFastBurn
      expr: |
        slo:web_service:success_rate30d:burnrate1h > (14.4 * 0.001)
        and
        slo:web_service:success_rate30d:burnrate6h > (14.4 * 0.001)
      for: 1m
      labels:
        severity: critical
        slo: web-service-availability
      annotations:
        summary: "Fast error budget burn rate for web-service"
        description: "Burn rate is {{ $value | humanizePercentage }}, consuming budget 14.4x faster than normal. At this rate the budget is exhausted in ~2 days."

    # Ticket: slow burn at 3x the normal rate (~10% of the monthly budget per day)
    - alert: WebServiceSLOSlowBurn
      expr: |
        slo:web_service:success_rate30d:burnrate6h > (3 * 0.001)
        and
        slo:web_service:success_rate30d:burnrate3d > (3 * 0.001)
      for: 1h
      labels:
        severity: warning
        slo: web-service-availability
      annotations:
        summary: "Slow error budget burn for web-service"
        description: "Budget consuming at 3x normal rate. Review error trends."
EOF
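The thresholds follow the standard relationship: budget consumed while the condition holds = burn rate × (alert window / SLO window). A quick check of both alerts (the 24-hour figure for the slow burn is the conventional framing, used here for illustration):

```python
# Fraction of the 30-day budget consumed while an alert condition
# holds: burn_rate * window / slo_window.
SLO_WINDOW_HOURS = 30 * 24  # 720

def budget_consumed(burn_rate: float, window_hours: float) -> float:
    return burn_rate * window_hours / SLO_WINDOW_HOURS

# Fast burn: 14.4x sustained for 1 hour burns 2% of the budget.
print(f"{budget_consumed(14.4, 1):.0%}")   # 2%
# Slow burn: 3x sustained for a day burns 10% of the budget.
print(f"{budget_consumed(3, 24):.0%}")     # 10%
```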
04

Verify recording rules are loaded

# Port-forward to Prometheus (run in a separate terminal)
kubectl port-forward svc/kube-prometheus-stack-prometheus \
  -n monitoring 9090:9090

# In Prometheus UI: Status → Rules → search for "web-service"
# Verify the recording rules show State: ok

# Test the queries in Prometheus:
# slo:web_service:success_rate1m  → should return a value near 1.0
# slo:web_service:success_rate30d:burnrate1h  → near 0 (low errors)

# Generate some errors to see the burn rate move
kubectl exec -n default \
  $(kubectl get pod -l app=web-service -o name | head -1) \
  -- sh -c "
  for i in \$(seq 1 100); do
    wget -q -O/dev/null http://localhost:9898/status/500 || true
  done"

05

Build the SLO Grafana dashboard

# Key panels to build in Grafana:

# Panel 1: Current SLO compliance (99.9% target)
# Query: slo:web_service:success_rate1m * 100
# Display: Gauge, threshold at 99.9

# Panel 2: Error budget remaining (%)
# Query:
(
  1 - (
    sum(increase(http_requests_total{job="web-service",status=~"5.."}[30d]))
    /
    sum(increase(http_requests_total{job="web-service"}[30d]))
  ) / 0.001
) * 100
# Display: Gauge, 0-100%, threshold at 10% (warning)

# Panel 3: Burn rate over time
# Query: slo:web_service:success_rate30d:burnrate1h
# Display: Time series, add reference line at 1x (normal rate)

# Panel 4: Projected exhaustion date
# Error budget seconds remaining: total budget minus an estimate of
# downtime-equivalent seconds consumed (error count / avg request rate):
(
  30 * 24 * 3600 * 0.001  # total budget in seconds (43.2 min)
  -
  sum(increase(http_requests_total{job="web-service",status=~"5.."}[30d]))
  /
  sum(rate(http_requests_total{job="web-service"}[30d]))
)
# Display: Stat panel showing days remaining
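The days-remaining stat can also be derived from the two recorded series. A Python sketch of the projection (the series names follow the recording rules defined earlier; how they are combined is an assumption for illustration):

```python
from datetime import timedelta

def projected_exhaustion(budget_remaining: float,
                         burn_rate_1h: float,
                         slo: float = 0.999,
                         window_days: int = 30) -> timedelta:
    """Time until the budget hits zero at the current 1h burn rate.

    budget_remaining: value of slo:web_service:error_budget_remaining
    burn_rate_1h:     value of slo:web_service:success_rate30d:burnrate1h
    """
    burn_multiple = burn_rate_1h / (1 - slo)  # 1.0 = uniform consumption
    days_left = budget_remaining * window_days / burn_multiple
    return timedelta(days=days_left)

# 90% of the budget left while burning at 14.4x -> under two days.
t = projected_exhaustion(budget_remaining=0.9, burn_rate_1h=14.4 * 0.001)
print(f"{t.total_seconds() / 86400:.2f} days")  # 1.88
```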

Success Criteria

Recording rules show State: ok under Status → Rules in Prometheus; the burn-rate series respond when errors are injected; both alert rules are loaded; the Grafana dashboard shows current compliance, budget remaining, burn rate over time, and projected exhaustion.

Further Reading

Google SRE Workbook, "Alerting on SLOs" (multiwindow, multi-burn-rate alerts)
Sloth (slok/sloth), an SLO and alert generator for Prometheus