Playbooks / Event Orchestration

Automate incident response with configurable playbooks. Trigger them automatically on check status changes or inbound events, execute actions sequentially or in parallel, pause for manual human steps, and call outbound webhooks for auto-remediation.

Overview

Playbooks wire together existing building blocks — escalation policies, war rooms, status pages, notifications, and Jira tickets — into repeatable automation flows. Each playbook has a trigger (with optional conditions), steps (16 action types available), and a target incident (check or inbound).

Hybrid model — auto-trigger + automated actions + manual human steps + manual invocation
16 action types — from setting priority to firing outbound webhooks
Parallel groups — run independent steps concurrently
Per-step conditions — skip steps based on incident attributes
Suppress mode — playbook takes full control of the notification flow
Live execution timeline — SSE-powered progress view

Creating a Playbook

Navigate to Operate > Playbooks in the sidebar and click New Playbook. The editor has three sections:

1. Basics

Field	Description
Name	Short, descriptive name (e.g. "Auto-page P1 database incidents")
Description	Optional longer explanation of what the playbook does
Active	Toggle off to disable without deleting

2. Trigger

Choose one of four trigger types:

Trigger	Fires when
`check_status_change`	A check transitions between UP / DOWN / DEGRADED / LATE
`inbound_incident_created`	A new inbound incident is ingested via the Events API
`inbound_status_change`	An inbound incident is acknowledged or resolved
`manual`	Only runs when triggered by a user via "Run Playbook" button

Trigger Conditions (AND / OR)

Restrict when the playbook fires. Available fields depend on the trigger type. Operators: equals, not_equals, in, contains, regex, between, exists.

{
  "logic": "AND",
  "conditions": [
    { "field": "severity", "operator": "equals", "value": "critical" },
    { "field": "service_tier", "operator": "in", "value": ["P1", "P2"] },
    { "field": "time_of_day", "operator": "between", "value": ["22:00", "06:00"] }
  ]
}
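
To make the AND / OR semantics concrete, here is a minimal Python sketch of how a condition block like the one above could be evaluated. It is an illustration, not the product's actual matcher. The function names, the treatment of missing fields as non-matching, and the wrap-around handling of a "between" range that crosses midnight (22:00 to 06:00) are all assumptions.

```python
import re
from datetime import time


def _parse_hhmm(text):
    """Turn 'HH:MM' into a datetime.time (assumed format for time_of_day)."""
    hours, minutes = text.split(":")
    return time(int(hours), int(minutes))


def _between(actual, bounds):
    low, high = bounds
    if isinstance(actual, str) and ":" in actual:
        actual, low, high = _parse_hhmm(actual), _parse_hhmm(low), _parse_hhmm(high)
    if low <= high:
        return low <= actual <= high
    # Assumption: a range such as 22:00-06:00 wraps around midnight.
    return actual >= low or actual <= high


OPERATORS = {
    "equals": lambda actual, expected: actual == expected,
    "not_equals": lambda actual, expected: actual != expected,
    "in": lambda actual, expected: actual in expected,
    "contains": lambda actual, expected: expected in actual,
    "regex": lambda actual, expected: re.search(expected, actual) is not None,
    "between": _between,
}


def matches(rule, incident):
    """Evaluate a {"logic": ..., "conditions": [...]} block against incident fields."""
    results = []
    for cond in rule["conditions"]:
        field, operator = cond["field"], cond["operator"]
        if operator == "exists":
            results.append(field in incident)
        elif field not in incident:
            # Assumption: a missing field never satisfies a comparison.
            results.append(False)
        else:
            results.append(OPERATORS[operator](incident[field], cond["value"]))
    return all(results) if rule["logic"] == "AND" else any(results)


rule = {
    "logic": "AND",
    "conditions": [
        {"field": "severity", "operator": "equals", "value": "critical"},
        {"field": "service_tier", "operator": "in", "value": ["P1", "P2"]},
        {"field": "time_of_day", "operator": "between", "value": ["22:00", "06:00"]},
    ],
}
night = {"severity": "critical", "service_tier": "P1", "time_of_day": "23:30"}
noon = {"severity": "critical", "service_tier": "P1", "time_of_day": "12:00"}
print(matches(rule, night), matches(rule, noon))
```

With "logic" set to "OR", the same block would match as soon as any single condition holds.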

Tip

Service binding — you can optionally bind a playbook to a specific Service. Only incidents for that service will match, on top of any trigger conditions.

Suppress default notifications

When enabled, the playbook takes full control of the incident response. Default notification channels and escalation policies are skipped for matching incidents — the playbook's steps handle everything.

Warning

Use with care: make sure your playbook has steps to actually notify responders (e.g. run_escalation, add_responders, or send_custom_notification) — otherwise alerts will be silently dropped.
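
A simple pre-flight check can catch this mistake before a playbook goes live. The Python sketch below is illustrative only: the "suppress_default_notifications", "steps", and "action" keys are hypothetical field names, and only the three action types named in the warning are treated as notifying responders.

```python
# Action types the warning above lists as ways to reach responders.
NOTIFYING_ACTIONS = {"run_escalation", "add_responders", "send_custom_notification"}


def suppress_warnings(playbook):
    """Return a list of warnings for a playbook definition (hypothetical shape)."""
    if not playbook.get("suppress_default_notifications"):
        return []
    actions = {step["action"] for step in playbook.get("steps", [])}
    if actions & NOTIFYING_ACTIONS:
        return []
    return ["suppress mode is on but no step notifies responders"]


risky = {
    "suppress_default_notifications": True,
    "steps": [{"action": "set_priority"}, {"action": "create_jira_ticket"}],
}
print(suppress_warnings(risky))
```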

3. Steps

Drag and drop steps to reorder them. Each step has a name, an action type, and a configuration, plus optional conditions, a parallel group, and a timeout. Steps run top to bottom unless they share a parallel group.

Action Types

Notification actions

Action	What it does
`add_responders`	Notify step-0 targets of an escalation policy via email
`run_escalation`	Fire a full escalation policy step (user/schedule/channel targets)
`send_custom_notification`	Send a rendered message to a specific notification channel
`notify_subscribers`	Email status page subscribers about a state change

Incident management actions

Action	What it does
`set_priority`	Set incident priority (P1–P5)
`set_severity`	Set incident severity (critical/error/warning/info)
`auto_ack`	Acknowledge the incident programmatically
`auto_resolve`	Mark the incident as resolved
`title_enrichment`	Append context to the incident title (e.g. recent deploy revision or feature flag)
`add_links`	Attach runbook / dashboard URLs to the incident timeline

External integrations

Action	What it does
`setup_war_room`	Create a war room (Jitsi, Google Meet, Slack Huddle, Discord, MS Teams)
`create_jira_ticket`	Open a Jira issue via your configured Jira channel
`create_slack_channel`	Create a dedicated Slack channel for the incident (via Bot API)
`update_status_page`	Publish a message on a public status page
`outbound_webhook`	Call an external HTTP endpoint (POST/GET/PUT/…) with a templated body and headers — for auto-rollback, feature-flag disable, diagnostics collection, etc.

Manual step

Use manual_step to pause execution until a human ticks a checkbox in the execution detail view. Perfect for "verify database is responding" or "confirm rollback succeeded" checks before continuing automation.

Outbound Webhooks

The outbound_webhook action turns playbooks into an auto-remediation engine. Typical uses:

Auto-rollback — POST to your deploy API with the previous revision from a Change Event
Disable feature flag — POST to LaunchDarkly / Flagsmith to kill a broken flag
Scale up — POST to your cloud provider to add capacity
Collect diagnostics — trigger a remote script that gathers logs and attaches them to the incident
Mirror to PagerDuty — for teams migrating between platforms

Configuration

Field	Description
URL	Endpoint to call
Method	POST, GET, PUT, PATCH, DELETE
Headers	JSON object of headers (supports secret references)
Body template	String with `{{variable.path}}` placeholders (see Templates)
Retry count	0–5 retries with exponential backoff between attempts (5s → 15s → 45s)
Timeout	Per-request timeout in seconds (max 120)

The response (status code plus the body, truncated to 1 KB) is logged in the execution detail so you can audit what happened. If the call still fails after all retries, the step is marked failed, but subsequent steps still run.
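
The retry behavior can be pictured with a short Python sketch. It assumes the retry count means retries after the first attempt, that the delay triples each time (inferred from 5s, 15s, 45s), and that a failed call surfaces as an OSError. The function names and the result dictionary are hypothetical, and the "send" callable stands in for the real HTTP request.

```python
import time


def backoff_delays(retry_count, base_seconds=5, factor=3):
    """Delays before each retry: 5, 15, 45, ... (factor 3 is inferred, not documented)."""
    return [base_seconds * factor ** attempt for attempt in range(retry_count)]


def call_with_retries(send, retry_count, sleep=time.sleep):
    """Call send() once, then retry up to retry_count times with backoff."""
    last_error = None
    for delay in [0] + backoff_delays(retry_count):
        if delay:
            sleep(delay)
        try:
            status_code, body = send()
        except OSError as exc:  # assumption: network failures raise OSError
            last_error = exc
            continue
        # Keep only the first 1 KB of the response body for the audit log.
        return {"status": "ok", "status_code": status_code, "body": body[:1024]}
    # All attempts failed: the step is marked failed, later steps still run.
    return {"status": "failed", "error": str(last_error)}


print(backoff_delays(3))
```

Passing "sleep" as a parameter keeps the sketch testable without real waiting.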

Template Variables

Use {{variable.path}} syntax in any configurable text field (body templates, messages, Jira summaries, Slack channel names, …). Missing values resolve to an empty string — no errors.

Variable	Available for
`{{check.name}}`, `{{check.target}}`, `{{check.status}}`	Check status triggers
`{{incident.title}}`, `{{incident.severity}}`, `{{incident.priority}}`	All triggers
`{{service.name}}`, `{{service.tier}}`	Checks bound to a Service
`{{playbook.name}}`, `{{execution.id}}`	Always available
`{{org.name}}`, `{{org.slug}}`	Always available
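
As an illustration of the placeholder rules, here is a minimal Python sketch of a renderer that resolves dotted paths and substitutes an empty string for anything missing. It is not the product's template engine. The function name, the nested-dictionary context, and the tolerance for spaces inside the braces are assumptions.

```python
import re

# Matches {{check.name}} and, as an assumption, {{ check.name }} with spaces.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z0-9_.]+)\s*\}\}")


def render(template, context):
    """Replace {{variable.path}} placeholders using a nested dict of values."""

    def resolve(match):
        value = context
        for key in match.group(1).split("."):
            if not isinstance(value, dict) or key not in value:
                return ""  # missing values resolve to an empty string
            value = value[key]
        return "" if value is None else str(value)

    return PLACEHOLDER.sub(resolve, template)


context = {"check": {"name": "api", "status": "DOWN"}, "org": {"slug": "acme"}}
print(render("[{{org.slug}}] {{check.name}} is {{check.status}} ({{service.tier}})", context))
# prints: [acme] api is DOWN ()
```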

Parallel Groups

By default, steps run sequentially. To execute several steps in parallel, assign them the same parallel group (Group A / B / C / D). Steps in the same group start at the same time using a thread pool (max 5 concurrent workers per group).

Tip

Use parallel groups for independent actions like create_jira_ticket + create_slack_channel + setup_war_room — none depend on each other and running them sequentially just wastes seconds when every second counts.
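
The scheduling idea can be sketched in Python with a thread pool capped at 5 workers. This is an assumption-laden illustration rather than the orchestrator's code: the "parallel_group" and "name" keys are hypothetical, and steps in the same group are assumed to sit next to each other in the step list.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby


def run_steps(steps, run_step):
    """Run ungrouped steps one by one and each parallel group concurrently."""
    results = {}
    # groupby batches consecutive steps that share the same group label.
    for group, members in groupby(steps, key=lambda step: step.get("parallel_group")):
        members = list(members)
        if group is None:
            for step in members:
                results[step["name"]] = run_step(step)
        else:
            # Max 5 concurrent workers per group, as described above.
            with ThreadPoolExecutor(max_workers=5) as pool:
                for step, outcome in zip(members, pool.map(run_step, members)):
                    results[step["name"]] = outcome
    return results


steps = [
    {"name": "set_priority"},
    {"name": "create_jira_ticket", "parallel_group": "A"},
    {"name": "create_slack_channel", "parallel_group": "A"},
    {"name": "setup_war_room", "parallel_group": "A"},
    {"name": "auto_resolve"},
]
print(run_steps(steps, lambda step: "done"))
```

The "with" block waits for every step in the group to finish before the next sequential step starts.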

Manual Steps & Resume

When the orchestrator encounters a step marked is_manual, execution pauses with status waiting_for_human. The step appears in the execution timeline with a "Mark Complete" button.

When a user clicks the button, the playbook resumes from the next step — a new task continues where the previous one left off, preserving the full step history.
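
The pause-and-resume flow can be modeled as a small state machine. The Python sketch below is a simplified illustration: "waiting_for_human" and "is_manual" come from the text above, while the "running" and "completed" statuses, the execution dictionary, and both function names are assumptions.

```python
def advance(execution, steps, run_step):
    """Run steps from execution['next_index'] until the end or a manual step."""
    while execution["next_index"] < len(steps):
        step = steps[execution["next_index"]]
        if step.get("is_manual"):
            execution["status"] = "waiting_for_human"
            return execution
        execution["history"].append((step["name"], run_step(step)))
        execution["next_index"] += 1
    execution["status"] = "completed"  # hypothetical status name
    return execution


def mark_complete(execution, steps, run_step, user_id):
    """Record the human confirmation, then resume from the next step."""
    step = steps[execution["next_index"]]
    execution["history"].append((step["name"], f"completed by {user_id}"))
    execution["next_index"] += 1
    execution["status"] = "running"  # hypothetical status name
    return advance(execution, steps, run_step)


steps = [
    {"name": "rollback"},
    {"name": "verify rollback", "is_manual": True},
    {"name": "resolve"},
]
execution = {"status": "running", "next_index": 0, "history": []}
advance(execution, steps, lambda step: "ok")
print(execution["status"])
```

Because the history list lives on the execution record, the resumed run appends to the same step history instead of starting a new one.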

Running a Playbook

Automatic trigger

When an incident fires (check status changes or inbound event received), matching active playbooks are queued automatically. An SSE event playbook.started is published so the UI updates in real time.

Manual trigger

On a check detail page, scroll to the Playbooks section and click Run Playbook. Pick a playbook to run it against the current incident. Manual triggers are recorded in the execution history with your user ID.

Execution History

Every run is recorded as a PlaybookExecution with full step-by-step results (status, timing, output, errors). Access via Playbooks > … menu > History or click any execution from a check detail page.

The execution detail page shows a live timeline updated via Server-Sent Events — no manual refresh needed.
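
For readers who want to consume the same stream outside the UI, here is a minimal Python sketch of parsing Server-Sent Events text into (event, data) pairs. The wire format ("event:" and "data:" fields, a blank line ending each event) is standard SSE. The stream URL and the payload shape are not documented here, so the sample payload is hypothetical.

```python
def parse_sse(lines):
    """Yield (event_name, data) tuples from an iterable of SSE text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event collected so far.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            data.append(value[1:] if value.startswith(" ") else value)


sample = [
    "event: playbook.started\n",
    'data: {"execution_id": "abc"}\n',  # hypothetical payload
    "\n",
    ": keep-alive\n",
    "data: ping\n",
    "\n",
]
for name, payload in parse_sse(sample):
    print(name, payload)
```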

Example: Auto-rollback on failed deploy

Putting it all together, consider this scenario:

1. CI/CD pushes a ChangeEvent of type deployment.
2. Two minutes later, an HTTP check starts returning 500s.
3. A playbook matches (trigger: check_status_change, condition: has_recent_change_event == true) and runs these steps:
- Sets priority to P1
- In parallel: creates a war room, opens a Jira ticket, sends a custom Slack notification
- Calls an outbound webhook to roll back the deploy with the previous revision from the change event
- Pauses on a manual step: "Verify rollback succeeded in production"
- After the responder confirms, auto_resolve closes the incident
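
The scenario could be written down as a playbook definition along the following lines. This Python sketch only shows the shape of the idea. The trigger, condition, and action names come from this page, but every key name ("trigger", "steps", "config", "parallel_group", and so on), the example.com URL, and the webhook body are hypothetical. The variable that carries the previous revision is not documented here, so the body uses only template variables listed above.

```python
import json

playbook = {
    "name": "Auto-rollback on failed deploy",
    "active": True,
    "trigger": {
        "type": "check_status_change",
        "logic": "AND",
        "conditions": [
            {"field": "has_recent_change_event", "operator": "equals", "value": True},
        ],
    },
    "steps": [
        {"name": "Raise priority", "action": "set_priority", "config": {"priority": "P1"}},
        # The three steps below share group A, so they start together.
        {"name": "Open war room", "action": "setup_war_room", "parallel_group": "A"},
        {"name": "Open Jira ticket", "action": "create_jira_ticket", "parallel_group": "A"},
        {
            "name": "Notify Slack",
            "action": "send_custom_notification",
            "parallel_group": "A",
            "config": {"message": "{{check.name}} is {{check.status}}"},
        },
        {
            "name": "Roll back deploy",
            "action": "outbound_webhook",
            "config": {
                "url": "https://deploy.example.com/rollback",  # placeholder endpoint
                "method": "POST",
                "body_template": '{"check": "{{check.name}}", "execution": "{{execution.id}}"}',
            },
        },
        {"name": "Verify rollback succeeded in production", "action": "manual_step"},
        {"name": "Close incident", "action": "auto_resolve"},
    ],
}

print(json.dumps(playbook, indent=2))
```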

API

Playbooks are fully manageable via the REST API. See the API Reference for details.

GET /api/organizations/{org_id}/playbooks
POST /api/organizations/{org_id}/playbooks
PUT /api/organizations/{org_id}/playbooks/{id}
POST /api/organizations/{org_id}/playbooks/{id}/execute
GET /api/organizations/{org_id}/playbook-executions
POST /api/organizations/{org_id}/playbook-executions/{id}/steps/{step}/complete
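
As a rough illustration of calling these endpoints, the Python sketch below builds (but does not send) a request that marks a manual step complete. The endpoint path comes from the list above. The host name, the Bearer token authentication, and the absence of a request body are assumptions, so check the API Reference before relying on them.

```python
import json
import urllib.request

BASE_URL = "https://monitoring.example.com"  # hypothetical host


def build_request(method, path, token, payload=None):
    """Build an HTTP request object; nothing is sent until urlopen is called."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {
        "Authorization": f"Bearer {token}",  # assumed auth scheme
        "Accept": "application/json",
    }
    if data is not None:
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=method)


def complete_step_request(org_id, execution_id, step, token):
    """Request for POST .../playbook-executions/{id}/steps/{step}/complete."""
    path = (
        f"/api/organizations/{org_id}/playbook-executions/"
        f"{execution_id}/steps/{step}/complete"
    )
    return build_request("POST", path, token)


request = complete_step_request("acme", "42", "5", token="example-token")
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would send it. That call is left out so the sketch runs without network access.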