Playbooks / Event Orchestration

Automate incident response with configurable playbooks. Trigger them automatically on check status changes or inbound events, execute actions sequentially or in parallel, pause for manual human steps, and call outbound webhooks for auto-remediation.

Overview

Playbooks wire together existing building blocks — escalation policies, war rooms, status pages, notifications, and Jira tickets — into repeatable automation flows. Each playbook has a trigger (with optional conditions), steps (16 action types available), and a target incident (check or inbound).

  • Hybrid model — auto-trigger + automated actions + manual human steps + manual invocation
  • 16 action types — from setting priority to firing outbound webhooks
  • Parallel groups — run independent steps concurrently
  • Per-step conditions — skip steps based on incident attributes
  • Suppress mode — playbook takes full control of the notification flow
  • Live execution timeline — SSE-powered progress view

Creating a Playbook

Navigate to Operate > Playbooks in the sidebar and click New Playbook. The editor has three sections:

1. Basics

Field | Description
Name | Short, descriptive name (e.g. "Auto-page P1 database incidents")
Description | Optional longer explanation of what the playbook does
Active | Toggle off to disable without deleting

2. Trigger

Choose one of four trigger types:

Trigger | Fires when
check_status_change | A check transitions between UP / DOWN / DEGRADED / LATE
inbound_incident_created | A new inbound incident is ingested via the Events API
inbound_status_change | An inbound incident is acknowledged or resolved
manual | Only runs when a user triggers it via the "Run Playbook" button

Trigger Conditions (AND / OR)

Restrict when the playbook fires. Available fields depend on the trigger type. Operators: equals, not_equals, in, contains, regex, between, exists.

{
  "logic": "AND",
  "conditions": [
    { "field": "severity", "operator": "equals", "value": "critical" },
    { "field": "service_tier", "operator": "in", "value": ["P1", "P2"] },
    { "field": "time_of_day", "operator": "between", "value": ["22:00", "06:00"] }
  ]
}
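To make the matching rules concrete, here is a minimal Python sketch of how such a condition block could be evaluated. It is an illustration, not the product's implementation. The function names, the flat incident dictionary, the rule that a missing field fails every operator except exists, and the treatment of a between range whose start is later than its end (22:00 to 06:00) as wrapping past midnight are all assumptions.

```python
import re
from typing import Any, Mapping


def _between(actual: Any, bounds: list) -> bool:
    low, high = bounds
    if low <= high:
        return low <= actual <= high
    # Assumption: a range such as 22:00 to 06:00 wraps around midnight.
    return actual >= low or actual <= high


def matches(condition: Mapping[str, Any], incident: Mapping[str, Any]) -> bool:
    """Evaluate one condition against a flat dict of incident attributes."""
    op = condition["operator"]
    value = condition.get("value")
    actual = incident.get(condition["field"])
    if op == "exists":
        return actual is not None
    if actual is None:
        return False  # Assumption: a missing field fails every other operator.
    if op == "equals":
        return actual == value
    if op == "not_equals":
        return actual != value
    if op == "in":
        return actual in value
    if op == "contains":
        return value in actual
    if op == "regex":
        return re.search(value, str(actual)) is not None
    if op == "between":
        return _between(actual, value)
    raise ValueError(f"unknown operator: {op}")


def evaluate(trigger: Mapping[str, Any], incident: Mapping[str, Any]) -> bool:
    """Combine all conditions with the trigger's AND / OR logic."""
    results = (matches(c, incident) for c in trigger["conditions"])
    return all(results) if trigger["logic"] == "AND" else any(results)


TRIGGER = {
    "logic": "AND",
    "conditions": [
        {"field": "severity", "operator": "equals", "value": "critical"},
        {"field": "service_tier", "operator": "in", "value": ["P1", "P2"]},
        {"field": "time_of_day", "operator": "between", "value": ["22:00", "06:00"]},
    ],
}
```

With this sketch, a critical P1 incident at 23:30 matches, while the same incident at 12:00 does not. Zero-padded HH:MM strings compare correctly as plain strings, which is why no time parsing is needed here.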

Tip

Service binding — you can optionally bind a playbook to a specific Service. Only incidents for that service will match, on top of any trigger conditions.

Suppress default notifications

When enabled, the playbook takes full control of the incident response. Default notification channels and escalation policies are skipped for matching incidents — the playbook's steps handle everything.

Warning

Use with care: make sure your playbook has steps to actually notify responders (e.g. run_escalation, add_responders, or send_custom_notification) — otherwise alerts will be silently dropped.
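A pre-save check along the following lines could catch this mistake early. This is a sketch only: the playbook dictionary shape and the suppress_default_notifications and action keys are hypothetical names chosen for illustration. The three action names come from the warning above.

```python
# Actions that actually reach a responder, per the warning above.
NOTIFYING_ACTIONS = {"run_escalation", "add_responders", "send_custom_notification"}


def suppress_mode_is_safe(playbook: dict) -> bool:
    """Return False when suppress mode is on but no step notifies anyone.

    The keys used here are hypothetical, not a documented schema.
    """
    if not playbook.get("suppress_default_notifications", False):
        return True  # Default notifications still fire, so nothing is lost.
    steps = playbook.get("steps", [])
    return any(step.get("action") in NOTIFYING_ACTIONS for step in steps)
```

A playbook that enables suppress mode and contains only set_priority and create_jira_ticket steps would fail this check.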

3. Steps

Each step has a name, an action type, a configuration, and optionally conditions, a parallel group, and a timeout. Steps run top to bottom; drag and drop them to reorder.

Action Types

Notification actions

Action | What it does
add_responders | Notify step-0 targets of an escalation policy via email
run_escalation | Fire a full escalation policy step (user/schedule/channel targets)
send_custom_notification | Send a rendered message to a specific notification channel
notify_subscribers | Email status page subscribers about a state change

Incident management actions

Action | What it does
set_priority | Set incident priority (P1–P5)
set_severity | Set incident severity (critical/error/warning/info)
auto_ack | Acknowledge the incident programmatically
auto_resolve | Mark the incident as resolved
title_enrichment | Append context to the incident title (e.g. recent deploy revision or feature flag)
add_links | Attach runbook / dashboard URLs to the incident timeline

External integrations

Action | What it does
setup_war_room | Create a war room (Jitsi, Google Meet, Slack Huddle, Discord, MS Teams)
create_jira_ticket | Open a Jira issue via your configured Jira channel
create_slack_channel | Create a dedicated Slack channel for the incident (via Bot API)
update_status_page | Publish a message on a public status page
outbound_webhook | Call an external HTTP endpoint (POST/GET/PUT/…) with a templated body and headers; useful for auto-rollback, feature-flag disable, diagnostics collection, etc.

Manual step

Use manual_step to pause execution until a human ticks a checkbox in the execution detail view. Perfect for "verify database is responding" or "confirm rollback succeeded" checks before continuing automation.

Outbound Webhooks

The outbound_webhook action turns playbooks into an auto-remediation engine. Typical uses:

  • Auto-rollback — POST to your deploy API with the previous revision from a Change Event
  • Disable feature flag — POST to LaunchDarkly / Flagsmith to kill a broken flag
  • Scale up — POST to your cloud provider to add capacity
  • Collect diagnostics — trigger a remote script that gathers logs and attaches them to the incident
  • Mirror to PagerDuty — for teams migrating between platforms

Configuration

Field | Description
URL | Endpoint to call
Method | POST, GET, PUT, PATCH, DELETE
Headers | JSON object of headers (supports secret references)
Body template | String with {{variable.path}} placeholders (see Templates)
Retry count | 0–5 attempts with exponential backoff (5s → 15s → 45s)
Timeout | Per-request timeout in seconds (max 120)

The response (status code plus the body, truncated to 1 KB) is logged in the execution detail so you can audit what happened. If the request still fails after all retries, the step is marked failed, but subsequent steps still run.
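The retry and logging behaviour described above can be sketched in a few lines of Python. Treat this as an approximation under stated assumptions: the 5s → 15s → 45s schedule suggests a base of 5 seconds multiplied by 3 per retry, delays beyond the third retry are extrapolated from that pattern, and the function names and the send callable are hypothetical.

```python
from __future__ import annotations

import time
from typing import Callable


def backoff_delays(retry_count: int, base_seconds: int = 5, factor: int = 3) -> list[int]:
    """Delays before each retry: 5, 15, 45, ... (factor 3 is inferred, not documented)."""
    if not 0 <= retry_count <= 5:
        raise ValueError("retry_count must be between 0 and 5")
    return [base_seconds * factor**i for i in range(retry_count)]


def truncate_body(body: bytes, limit: int = 1024) -> bytes:
    """Keep at most 1 KB of the response body for the execution log."""
    return body[:limit]


def call_with_retries(
    send: Callable[[], bool],
    retry_count: int,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call send() once, then retry after each backoff delay until it succeeds.

    send is a hypothetical callable that performs the request and returns
    True on success. A False result here corresponds to a failed step.
    """
    if send():
        return True
    for delay in backoff_delays(retry_count):
        sleep(delay)
        if send():
            return True
    return False
```

Passing sleep as a parameter keeps the sketch testable without real waiting.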

Template Variables

Use {{variable.path}} syntax in any configurable text field (body templates, messages, Jira summaries, Slack channel names, …). Missing values resolve to an empty string — no errors.

Variable | Available for
{{check.name}}, {{check.target}}, {{check.status}} | Check status triggers
{{incident.title}}, {{incident.severity}}, {{incident.priority}} | All triggers
{{service.name}}, {{service.tier}} | Checks bound to a Service
{{playbook.name}}, {{execution.id}} | Always available
{{org.name}}, {{org.slug}} | Always available
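The substitution rule, including the empty-string fallback for missing values, can be illustrated with a short Python sketch. The render function and the nested-dictionary context are assumptions made for the example, as is the tolerance for whitespace inside the braces.

```python
import re

# Matches {{check.name}} and, as an assumption, also {{ check.name }}.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")


def render(template: str, context: dict) -> str:
    """Replace {{variable.path}} placeholders using a nested context dict."""

    def resolve(match: re.Match) -> str:
        value = context
        for key in match.group(1).split("."):
            if not isinstance(value, dict) or key not in value:
                return ""  # Missing values resolve to an empty string.
            value = value[key]
        return "" if value is None else str(value)

    return PLACEHOLDER.sub(resolve, template)
```

For example, rendering "{{check.name}} is {{check.status}} ({{service.tier}})" with a context that contains only check.name = "api" and check.status = "DOWN" yields "api is DOWN ()", because the check is not bound to a Service.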

Parallel Groups

By default, steps run sequentially. To execute several steps in parallel, assign them the same parallel group (Group A / B / C / D). Steps in the same group start at the same time using a thread pool (max 5 concurrent workers per group).

Tip

Use parallel groups for independent actions like create_jira_ticket + create_slack_channel + setup_war_room — none depend on each other and running them sequentially just wastes seconds when every second counts.
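The scheduling model can be approximated with Python's standard thread pool. This is a sketch under assumptions: the step dictionaries, the parallel_group key, and the run_step callable are hypothetical, and steps that share a group are assumed to sit next to each other in the step list.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby
from typing import Any, Callable

MAX_WORKERS = 5  # "max 5 concurrent workers per group"


def run_steps(steps: list, run_step: Callable[[dict], Any]) -> dict:
    """Run ungrouped steps one by one and each parallel group concurrently."""
    results = {}
    # groupby only merges adjacent steps, hence the adjacency assumption.
    for group, batch in groupby(steps, key=lambda s: s.get("parallel_group")):
        batch = list(batch)
        if group is None:
            for step in batch:
                results[step["name"]] = run_step(step)
        else:
            with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
                # pool.map yields results in input order once each finishes.
                for step, outcome in zip(batch, pool.map(run_step, batch)):
                    results[step["name"]] = outcome
    return results
```

Leaving the with block waits for every step in the group, so the next sequential step starts only after the whole group has finished.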

Manual Steps & Resume

When the orchestrator encounters a step marked is_manual, execution pauses with status waiting_for_human. The step appears in the execution timeline with a "Mark Complete" button.

When a user clicks the button, the playbook resumes from the next step — a new task continues where the previous one left off, preserving the full step history.
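The pause-and-resume flow amounts to a small state machine. The Python sketch below shows one way to model it. Only the is_manual flag and the waiting_for_human status come from the text above; the Execution class, the other status names, and the function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Execution:
    steps: list
    next_index: int = 0
    status: str = "pending"  # hypothetical status name
    history: list = field(default_factory=list)


def advance(execution: Execution, run_step: Callable[[dict], Any]) -> Execution:
    """Run steps from next_index until a manual step or the end is reached."""
    execution.status = "running"  # hypothetical status name
    while execution.next_index < len(execution.steps):
        step = execution.steps[execution.next_index]
        if step.get("is_manual"):
            execution.status = "waiting_for_human"
            return execution
        execution.history.append((step["name"], run_step(step)))
        execution.next_index += 1
    execution.status = "completed"  # hypothetical status name
    return execution


def mark_complete(execution: Execution, user: str, run_step: Callable[[dict], Any]) -> Execution:
    """Record the manual step as done, then resume from the next step."""
    if execution.status != "waiting_for_human":
        raise RuntimeError("execution is not waiting on a manual step")
    step = execution.steps[execution.next_index]
    execution.history.append((step["name"], f"completed by {user}"))
    execution.next_index += 1
    return advance(execution, run_step)
```

Because the history list lives on the execution rather than on the task that runs it, the resumed run keeps every earlier step result.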

Running a Playbook

Automatic trigger

When an incident fires (check status changes or inbound event received), matching active playbooks are queued automatically. An SSE event playbook.started is published so the UI updates in real time.

Manual trigger

On a check detail page, scroll to the Playbooks section and click Run Playbook. Pick a playbook and it will be executed against the current incident. Manual triggers are recorded in the execution history with your user ID.

Execution History

Every run is recorded as a PlaybookExecution with full step-by-step results (status, timing, output, errors). To view it, open Playbooks > … menu > History, or click any execution on a check detail page.

The execution detail page shows a live timeline updated via Server-Sent Events — no manual refresh needed.
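For readers who want to consume the same stream outside the UI, the following Python sketch parses the standard Server-Sent Events text format into (event, data) pairs. The parsing rules follow the general SSE wire format. The stream URL and the payload of events such as playbook.started are not documented here, so the JSON body used in the usage note is purely illustrative.

```python
from typing import Iterable, Iterator, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event_name, data) for each event in an SSE text stream."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the event; events without data are dropped.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # Comment line, often used as a keep-alive.
        else:
            name, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]
            if name == "event":
                event = value
            elif name == "data":
                data.append(value)
```

Feeding it the lines "event: playbook.started", 'data: {"execution_id": "abc"}', and an empty line produces one pair whose event name is playbook.started.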

Example: Auto-rollback on failed deploy

Putting it all together, consider this scenario:

  1. CI/CD pushes a ChangeEvent of type deployment
  2. Two minutes later, an HTTP check starts returning 500s
  3. A playbook matches (trigger: check_status_change, condition: has_recent_change_event == true), and:
    • Sets priority to P1
    • In parallel: creates a war room, opens a Jira ticket, sends a custom Slack notification
    • Calls an outbound webhook to roll back the deploy with the previous revision from the change event
    • Pauses on a manual step: "Verify rollback succeeded in production"
    • After the responder confirms, auto_resolve closes the incident
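The scenario above could be expressed as a playbook definition roughly like the Python dictionary below. The action and trigger names come from this page, but the overall schema (key names such as config, parallel_group, and suppress_default_notifications), the deploy URL, and the {{change_event.previous_revision}} variable are hypothetical and shown only to make the structure tangible. Check the API Reference for the real payload shape.

```python
# Hypothetical schema: only trigger and action names are taken from the docs.
PLAYBOOK = {
    "name": "Auto-rollback on failed deploy",
    "active": True,
    "trigger": {
        "type": "check_status_change",
        "logic": "AND",
        "conditions": [
            {"field": "has_recent_change_event", "operator": "equals", "value": True},
        ],
    },
    "steps": [
        {"name": "Raise priority", "action": "set_priority", "config": {"priority": "P1"}},
        # The three steps below share a group, so they start together.
        {"name": "War room", "action": "setup_war_room", "parallel_group": "A"},
        {"name": "Jira ticket", "action": "create_jira_ticket", "parallel_group": "A"},
        {"name": "Slack message", "action": "send_custom_notification", "parallel_group": "A"},
        {
            "name": "Roll back deploy",
            "action": "outbound_webhook",
            "config": {
                "url": "https://deploy.example.com/rollback",  # placeholder URL
                "method": "POST",
                # change_event.previous_revision is a hypothetical variable.
                "body_template": '{"revision": "{{change_event.previous_revision}}"}',
                "retry_count": 3,
                "timeout": 30,
            },
        },
        {
            "name": "Verify rollback succeeded in production",
            "action": "manual_step",
        },
        {"name": "Close incident", "action": "auto_resolve"},
    ],
}
```

The step order mirrors the bullet list: priority first, the parallel group next, then the webhook, the manual gate, and finally auto_resolve.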

API

Playbooks are fully manageable via the REST API. See the API Reference for details.

  • GET /api/organizations/{org_id}/playbooks
  • POST /api/organizations/{org_id}/playbooks
  • PUT /api/organizations/{org_id}/playbooks/{id}
  • POST /api/organizations/{org_id}/playbooks/{id}/execute
  • GET /api/organizations/{org_id}/playbook-executions
  • POST /api/organizations/{org_id}/playbook-executions/{id}/steps/{step}/complete
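As a starting point for scripting against these endpoints, the Python sketch below builds requests with the standard library without sending them. The paths are the ones listed above. The base URL, the bearer-token header, and the helper names are assumptions; consult the API Reference for the actual host, authentication scheme, and request bodies.

```python
import json
from typing import Optional
from urllib.request import Request

BASE_URL = "https://systeam.example.com"  # placeholder host, not a real endpoint


def api_request(method: str, path: str, token: str, payload: Optional[dict] = None) -> Request:
    """Build (but do not send) a request; bearer auth is an assumption."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Authorization": f"Bearer {token}", "Accept": "application/json"}
    if data is not None:
        headers["Content-Type"] = "application/json"
    return Request(BASE_URL + path, data=data, headers=headers, method=method)


def execute_playbook(org_id: str, playbook_id: str, token: str) -> Request:
    """POST /api/organizations/{org_id}/playbooks/{id}/execute"""
    return api_request("POST", f"/api/organizations/{org_id}/playbooks/{playbook_id}/execute", token)


def complete_step(org_id: str, execution_id: str, step: str, token: str) -> Request:
    """POST /api/organizations/{org_id}/playbook-executions/{id}/steps/{step}/complete"""
    path = f"/api/organizations/{org_id}/playbook-executions/{execution_id}/steps/{step}/complete"
    return api_request("POST", path, token)
```

Pass the returned object to urllib.request.urlopen to actually send it. Building the request separately makes it easy to inspect the URL and headers first.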