SLA & SLO Tracking

Track Service Level Agreements and Objectives with automated uptime calculations, error budgets, and burn rate alerts.

Check detail page showing SLO compliance, error budget, and SRE metrics (MTTR, MTBF, Apdex)

Key Concepts

Term | Definition
SLA | Service Level Agreement: a contractual commitment to your customers about uptime (e.g., "99.9% uptime per month"). SysTeam tracks actual uptime so you can verify you're meeting your SLA.
SLI | Service Level Indicator: the actual measured metric (e.g., "99.95% uptime over the last 30 days"). This is what we calculate from your check data.
SLO | Service Level Objective: your internal target for the SLI (e.g., "target 99.9% uptime"). SLOs are typically stricter than SLAs to give you a buffer.
Error Budget | The amount of acceptable downtime within your SLO window. For a 99.9% SLO over 30 days, your error budget is ~43 minutes of downtime.

SLA Tracking

Every check automatically tracks SLA metrics. No configuration needed. Open any check's detail page to see the SLA panel with uptime statistics across multiple time windows:

Period | Metrics
Last 24 hours | Uptime %, total incidents, MTTR
Last 7 days | Uptime %, total incidents, MTTR
Last 30 days | Uptime %, total incidents, MTTR
Last 90 days | Uptime %, total incidents, MTTR

How Uptime is Calculated

Uptime percentage is calculated as:

Uptime % = (Total Time - Downtime) / Total Time × 100
  • Downtime includes periods where the check status was DOWN
  • DEGRADED periods are counted as incidents but not as full downtime
  • Maintenance windows are excluded — planned downtime does not count against uptime
  • PAUSED periods are excluded from the calculation entirely
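
The rules above can be sketched in a few lines of Python. This is a minimal illustration, not SysTeam's actual implementation: the `uptime_pct` helper and its second-based inputs are hypothetical, and it assumes that maintenance and PAUSED time is subtracted from the total time before the formula is applied.

```python
def uptime_pct(window_seconds: float, down_seconds: float,
               excluded_seconds: float = 0.0) -> float:
    """Uptime % = (Total Time - Downtime) / Total Time x 100.

    excluded_seconds covers maintenance windows and PAUSED periods
    (assumption: they are removed from the total time).
    down_seconds must only count DOWN time outside those periods.
    """
    total = window_seconds - excluded_seconds
    if total <= 0:
        raise ValueError("no measurable time in the window")
    if not 0 <= down_seconds <= total:
        raise ValueError("downtime must be between 0 and the measured time")
    return (total - down_seconds) / total * 100


# 30-day window, 2 hours of maintenance, 20 minutes of unplanned DOWN time
print(round(uptime_pct(30 * 86400, 20 * 60, excluded_seconds=2 * 3600), 3))
```

Because the excluded time shrinks the denominator, the same 20 minutes of downtime weighs slightly more than it would over a full 30 days.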

MTTR (Mean Time To Recovery)

MTTR is the average duration of incidents within the time period. It tells you how quickly you typically recover from outages:

MTTR = Total Downtime Duration / Number of Incidents

A lower MTTR means faster recovery. Track this over time to measure improvement in your incident response process.
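
As a worked example of the formula, here is a small Python sketch. The `mttr_seconds` helper is hypothetical, and returning 0 when there are no incidents is an assumption chosen to match the 24h entry in the example API response.

```python
def mttr_seconds(incident_durations: list[float]) -> float:
    """MTTR = Total Downtime Duration / Number of Incidents."""
    if not incident_durations:
        return 0.0  # assumption: no incidents is reported as an MTTR of 0
    return sum(incident_durations) / len(incident_durations)


# Three incidents lasting 300 s, 420 s, and 510 s
print(mttr_seconds([300, 420, 510]))  # 410.0
```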


Configuring SLOs

While SLA tracking is automatic, SLOs require explicit configuration. An SLO lets you set a target uptime percentage and get alerted when your error budget is running low.

Creating an SLO

  1. Open a check's detail page
  2. Scroll to the SLO section
  3. Click Configure SLO
  4. Set your target and window
  5. Save

Example: 99.9% Monthly SLO

Target Uptime: 99.9%
Window: 30 days (rolling)
Alert on Burn Rate: Enabled
Burn Rate Threshold: 2× (budget consumed 2× faster than expected)

SLO Settings

Setting | Description
Target Uptime | Your uptime target as a percentage (e.g., 99.9%, 99.95%, 99.99%)
Window | The rolling time window for the SLO calculation (7, 30, or 90 days)
Burn Rate Alert | Enable alerts when your error budget is being consumed faster than expected
Burn Rate Threshold | The multiplier that triggers an alert (e.g., at 2× you would exhaust your budget in half the SLO window)

Common SLO Targets

Target | Allowed Downtime per 30 Days | Use Case
99.0% | ~7 hours 12 min | Internal tools, non-critical services
99.5% | ~3 hours 36 min | Business applications
99.9% | ~43 minutes | Production APIs, customer-facing services
99.95% | ~21 minutes | Critical infrastructure
99.99% | ~4 minutes | Payment systems, core platform
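
The downtime figures in the table follow directly from the target and the window length. The sketch below reproduces them; `allowed_downtime_minutes` is a hypothetical helper for illustration, not part of SysTeam.

```python
def allowed_downtime_minutes(target_pct: float, window_days: int = 30) -> float:
    """Error budget in minutes: the share of the window the target lets you miss."""
    if not 0 < target_pct < 100:
        raise ValueError("target_pct must be between 0 and 100")
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - target_pct) / 100


for target in (99.0, 99.5, 99.9, 99.95, 99.99):
    print(f"{target}% -> {allowed_downtime_minutes(target):.1f} min")
```

Passing `window_days=7` or `window_days=90` gives the budget for the other supported windows. For example, a 99.9% target over 7 days leaves about 10 minutes.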

Choose Realistic Targets

Setting an SLO target higher than what you can realistically achieve leads to a permanently breached error budget, which defeats the purpose. Start with a target slightly below your historical uptime and tighten it over time.


Error Budget

The error budget is the maximum amount of downtime you can afford within your SLO window while still meeting your target. It's displayed as a visual gauge on the check detail page.

Reading the Error Budget Gauge

  • Green — Error budget is healthy. You have plenty of downtime budget remaining.
  • Yellow — Error budget is below 50%. Proceed with caution before deploying risky changes.
  • Red — Error budget is breached (0% remaining). Your uptime is below the SLO target.
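
The gauge logic can be approximated as follows. This is a sketch under stated assumptions: both function names are hypothetical, and the green band is assumed to start at exactly 50% remaining, since only the yellow and red thresholds are spelled out above.

```python
def budget_remaining_pct(budget_minutes: float, downtime_minutes: float) -> float:
    """Share of the error budget still unspent, floored at 0."""
    if budget_minutes <= 0:
        raise ValueError("budget_minutes must be positive")
    return max(0.0, (1 - downtime_minutes / budget_minutes) * 100)


def gauge_color(remaining_pct: float) -> str:
    if remaining_pct <= 0:
        return "red"     # budget breached, 0% remaining
    if remaining_pct < 50:
        return "yellow"  # below 50% remaining
    return "green"       # assumption: 50% or more remaining


# 99.9% over 30 days gives a 43.2-minute budget
for downtime in (10, 30, 50):
    remaining = budget_remaining_pct(43.2, downtime)
    print(downtime, round(remaining, 1), gauge_color(remaining))
```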

How to Use Error Budgets

Error budgets are a powerful tool for making engineering decisions:

  • Budget available — You can safely deploy changes, run experiments, or do maintenance. The budget gives you room for risk.
  • Budget running low — Slow down deployments, focus on reliability. Don't risk further incidents.
  • Budget exhausted — Freeze non-critical deployments. Focus all effort on improving reliability until the budget recovers.

Burn Rate Alerts

Burn rate measures how fast you're consuming your error budget relative to expectations. A burn rate of 1× means you're using the budget at exactly the expected rate (evenly distributed across the SLO window).

Burn Rate | Meaning | Action
< 1× | Consuming budget slower than expected | Healthy, no action needed
1× to 2× | Slightly elevated consumption | Monitor closely
2× to 5× | Rapid consumption | Investigate immediately
> 5× | Emergency level | All hands on deck

When the burn rate exceeds your configured threshold, an alert is sent through your configured notification channels.
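
One common way to compute a burn rate is to compare the downtime fraction observed over a recent lookback period with the downtime fraction the target allows. The sketch below uses that approach. The lookback period, the function names, and the inputs are assumptions for illustration; this page does not specify how SysTeam measures burn rate internally.

```python
def burn_rate(downtime_minutes: float, lookback_minutes: float,
              target_pct: float) -> float:
    """Observed downtime fraction divided by the fraction the SLO allows.

    1.0 means the budget is being spent evenly across the SLO window.
    """
    if lookback_minutes <= 0:
        raise ValueError("lookback_minutes must be positive")
    allowed_fraction = (100 - target_pct) / 100
    observed_fraction = downtime_minutes / lookback_minutes
    return observed_fraction / allowed_fraction


def should_alert(rate: float, threshold: float = 2.0) -> bool:
    """Alert only when the burn rate exceeds the configured threshold."""
    return rate > threshold


# 99.9% target, 6 minutes of downtime in the last 24 hours
rate = burn_rate(6, 24 * 60, 99.9)
print(round(rate, 2), should_alert(rate))
```

In this example the rate is a little over 4×, which falls in the "investigate immediately" band and exceeds a 2× threshold.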


Dashboard SLO Column

The main dashboard includes an SLO column for checks that have SLOs configured. It shows a compact error budget indicator:

  • Small colored progress bar showing remaining budget
  • Green / yellow / red based on budget health
  • Use the SLO filter to show only checks with specific SLO health status

Maintenance Exclusion

Active maintenance windows are automatically excluded from all SLA and SLO calculations. This means:

  • Planned downtime during a maintenance window does not reduce your uptime percentage
  • Incidents during maintenance do not consume your error budget
  • MTTR calculations exclude maintenance-related incidents

This ensures your SLA/SLO metrics accurately reflect unplanned outages only.


API Access

SLA and SLO data is available via the API:

Method | Endpoint | Description
GET | /api/checks/{id}/sla | SLA metrics (24h/7d/30d/90d)
GET | /api/checks/{id}/slo | SLO configuration and current status
POST | /api/checks/{id}/slo | Create or update SLO configuration
DELETE | /api/checks/{id}/slo | Remove SLO configuration

Example: SLA Response
{
  "check_id": 42,
  "periods": {
    "24h": { "uptime_pct": 100.0, "incidents": 0, "mttr_seconds": 0 },
    "7d":  { "uptime_pct": 99.95, "incidents": 1, "mttr_seconds": 320 },
    "30d": { "uptime_pct": 99.92, "incidents": 3, "mttr_seconds": 410 },
    "90d": { "uptime_pct": 99.94, "incidents": 5, "mttr_seconds": 385 }
  }
}
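
A response in this shape is straightforward to post-process. The Python sketch below parses the example payload and lists the periods that fall below a chosen target. It deliberately makes no HTTP call, because the base URL and authentication scheme are not covered on this page; `periods_below_target` is a hypothetical helper.

```python
import json

# The example SLA response shown above, embedded as a string
SLA_RESPONSE = """
{
  "check_id": 42,
  "periods": {
    "24h": { "uptime_pct": 100.0, "incidents": 0, "mttr_seconds": 0 },
    "7d":  { "uptime_pct": 99.95, "incidents": 1, "mttr_seconds": 320 },
    "30d": { "uptime_pct": 99.92, "incidents": 3, "mttr_seconds": 410 },
    "90d": { "uptime_pct": 99.94, "incidents": 5, "mttr_seconds": 385 }
  }
}
"""


def periods_below_target(payload: dict, target_pct: float) -> list[str]:
    """Return the period names whose uptime is under the target."""
    return [
        name
        for name, stats in payload["periods"].items()
        if stats["uptime_pct"] < target_pct
    ]


data = json.loads(SLA_RESPONSE)
print(periods_below_target(data, 99.95))  # ['30d', '90d']
```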
