Incident Management

Track, acknowledge, investigate, and resolve incidents from a single timeline. Collaborate with War Rooms, write postmortems, and force-resolve stuck incidents.

Figure: Incidents page, showing the incident list with status, duration, cause, and acknowledgment controls.

Overview

An incident is created when a check transitions to a bad state (DOWN, DEGRADED, or LATE) or when an external alert triggers via the Inbound Events API. Incidents are the central object for tracking outages and coordinating response.

Two types of incidents exist:

  • Check incidents — created automatically when a health check detects failure. Resolved automatically when the check recovers (goes UP).
  • Inbound incidents — created via the Events API. Can be triggered, acknowledged, snoozed, and resolved via API actions or the UI.
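
The check-incident lifecycle described above can be sketched as a small state tracker. This is a minimal illustration, not the product's implementation: the class name `CheckIncidentTracker`, its fields, and the choice to keep a single incident open while a check moves between bad states (for example DOWN to DEGRADED) are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional

BAD_STATES = {"DOWN", "DEGRADED", "LATE"}  # bad states named in the text


@dataclass
class CheckIncidentTracker:
    """Hypothetical in-memory model of one check's incident lifecycle."""
    status: str = "UP"
    open_incident: Optional[dict] = None
    history: List[dict] = field(default_factory=list)

    def observe(self, new_status: str) -> None:
        self.status = new_status
        if new_status in BAD_STATES and self.open_incident is None:
            # Entering a bad state opens an incident.
            self.open_incident = {"cause": new_status, "resolved": False}
            self.history.append(self.open_incident)
        elif new_status == "UP" and self.open_incident is not None:
            # Recovery resolves the open incident automatically.
            self.open_incident["resolved"] = True
            self.open_incident = None


tracker = CheckIncidentTracker()
for status in ["DOWN", "DEGRADED", "UP", "LATE"]:
    tracker.observe(status)
print(len(tracker.history))
```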

Check Incidents

Check incidents appear in the Incident History panel on the check detail page. Click any incident to open its detail page with full context.

Incident Detail Page

The incident detail page has four tabs:

| Tab | Contents |
|---|---|
| Timeline | Chronological list of all incident activities (creation, escalation, ACK, resolution, notes, War Rooms) |
| Notes | Free-text notes added by team members during investigation |
| War Room | Shared meeting spaces for real-time collaboration |
| Details | Check name, type, target, project, times, duration, cause, acknowledgments |

Acknowledgment (ACK)

Acknowledging an incident signals that someone is actively working on it. ACK has two effects:

  • Stops escalation — no further escalation steps will fire
  • Suppresses repeated notifications — only recovery notifications are sent after ACK (the initial failure alert is not repeated)

Incidents can be acknowledged in three ways:

  1. UI button — click Acknowledge on the incident detail page
  2. Notification link — ACK links are included in email, Slack, Discord, MS Teams, and Telegram notifications
  3. API: POST /api/checks/{check_id}/incidents/{log_id}/ack
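
As a rough sketch of the API option, the request could be built with Python's standard library as shown here. Only the path comes from this page; the base URL, the Bearer-token header, and the empty request body are assumptions, so check your instance's API reference for the real authentication scheme.

```python
import urllib.request


def build_ack_request(base_url: str, check_id: str, log_id: str,
                      api_token: str) -> urllib.request.Request:
    """Build (but do not send) the ACK request for one check incident."""
    url = f"{base_url.rstrip('/')}/api/checks/{check_id}/incidents/{log_id}/ack"
    return urllib.request.Request(
        url,
        data=b"",  # empty body: an assumption, the payload is not documented here
        method="POST",
        headers={"Authorization": f"Bearer {api_token}"},  # assumed auth scheme
    )


# Placeholder host, IDs, and token, for illustration only.
req = build_ack_request("https://monitoring.example.com", "42", "1001", "YOUR_API_TOKEN")
print(req.get_method(), req.full_url)
# Sending it would be: urllib.request.urlopen(req)
```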

Tip

ACK tokens are single-use and expire after 48 hours. Each notification contains its own unique token.
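
To make "single-use and time-limited" concrete, here is a minimal sketch of how such tokens could behave. The `AckTokenStore` class and its in-memory storage are hypothetical and chosen only for illustration; nothing here describes how the product stores or validates tokens.

```python
import secrets
import time
from typing import Dict, Optional, Tuple

ACK_TOKEN_TTL_SECONDS = 48 * 60 * 60  # 48 hours, as stated in the tip


class AckTokenStore:
    """Hypothetical in-memory store; a real system would persist tokens."""

    def __init__(self) -> None:
        # token -> (incident_id, expires_at)
        self._tokens: Dict[str, Tuple[str, float]] = {}

    def issue(self, incident_id: str, now: Optional[float] = None) -> str:
        now = time.time() if now is None else now
        token = secrets.token_urlsafe(32)  # a fresh token per notification
        self._tokens[token] = (incident_id, now + ACK_TOKEN_TTL_SECONDS)
        return token

    def redeem(self, token: str, now: Optional[float] = None) -> Optional[str]:
        """Return the incident id once; None if unknown, already used, or expired."""
        now = time.time() if now is None else now
        entry = self._tokens.pop(token, None)  # pop makes the token single-use
        if entry is None:
            return None
        incident_id, expires_at = entry
        return incident_id if now < expires_at else None
```

Passing `now` explicitly keeps the sketch easy to test; in normal use both methods fall back to the current time.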

Force Resolve

Sometimes an incident needs to be closed manually even though the underlying check has not recovered. The Force Resolve button on the incident detail page does this.

Force resolving an incident:

  • Stops any active escalation for this incident
  • Closes all auto-created War Rooms
  • Records a "force resolved" activity in the timeline
  • Publishes an SSE event for real-time UI updates

Warning

Force resolve does not change the check's status. If the check is still DOWN, monitoring continues normally, and a new incident starts if the check is still failing at the next check interval.
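
The effects listed above, together with the warning, can be modeled in a few lines. This is a sketch under stated assumptions: the `Incident` and `WarRoom` classes are hypothetical, and the name of the SSE event published on force resolve is assumed to be `incident.resolve`, which this page does not confirm.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WarRoom:
    name: str
    auto_created: bool
    is_open: bool = True


@dataclass
class Incident:
    check_status: str
    escalation_active: bool = True
    resolved: bool = False
    war_rooms: List[WarRoom] = field(default_factory=list)
    timeline: List[str] = field(default_factory=list)


def force_resolve(incident: Incident, published_events: List[str]) -> None:
    incident.escalation_active = False            # stop active escalation
    for room in incident.war_rooms:
        if room.auto_created:                     # manual rooms stay open
            room.is_open = False
    incident.timeline.append("force resolved")    # timeline activity
    incident.resolved = True
    published_events.append("incident.resolve")   # assumed SSE event name
    # incident.check_status is deliberately left untouched.
```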

War Rooms

War Rooms are shared meeting or chat spaces for coordinating incident response in real time. Nine providers are supported, five of which can create rooms automatically:

| Provider | Type | Auto-Create | Configuration |
|---|---|---|---|
| Jitsi Meet | Video | Yes (zero-config) | None (instant room generation) |
| Slack Channel | Text | Yes (Bot API) | Slack Bot Token in org settings |
| Discord Text | Text | Yes (Bot API) | Discord Bot Token + Guild ID |
| Discord Voice | Voice | Yes (Bot API) | Discord Bot Token + Guild ID |
| MS Teams | Video | Yes (Graph API) | Tenant ID + Client ID + Secret |
| Google Meet | Video | No ("New Meet" button) | None (opens meet.google.com/new) |
| Slack Huddle | Voice | No (paste URL) | Start huddle in Slack, share link |
| Zoom | Video | No (paste URL) | Paste meeting URL |
| VS Code LiveShare | Debug | No (paste URL) | Paste Live Share session URL |

Manual War Rooms

Click Start War Room in the War Room tab of any incident. Select a provider, enter a name, and optionally enter a URL. For providers that support auto-creation, the backend generates the URL automatically, so there is nothing to paste. For Jitsi, click Generate to create an instant room. For Google Meet, click New Meet to open a new room in a separate tab, then paste the URL back.

Auto-Created War Rooms

War Rooms can be created automatically when escalation fires. See On-Call > War Room Auto-Creation for setup. Auto-created rooms display an Auto badge and are automatically closed when the incident resolves.

War Room Members

War Rooms have members: the participants in the incident response. When a War Room is auto-created, on-call responders from the escalation policy are added automatically. You can add or remove members manually at any time.

Users who have linked their platform accounts (Discord, Slack, MS Teams) under Profile > Linked Platforms receive automatic invites to the platform room when they are added as War Room members.

War Room Providers

Organization admins configure the default provider in Settings > War Room Auto-Creation:

| Provider | Setup Required | Notes |
|---|---|---|
| Jitsi Meet | None | Free, works instantly, auto-generates meeting URLs |
| Google Meet | None | Generates ad-hoc meeting links |
| Discord | Bot token + Guild ID | Creates voice/text channels in your Discord server |
| Slack | Bot token | Creates channels with channel management scopes |
| MS Teams | Tenant ID + Client ID + Secret | Creates online meetings via Azure AD app |

Use the Test Connection button to verify bot credentials before saving.

Closing War Rooms

Click the × button on an active War Room to close it. Auto-created rooms are also closed automatically on incident resolution. Manual War Rooms stay open until explicitly closed, which is useful for postmortem discussions.

Inbound Incidents

Inbound incidents are created via the Inbound Events API. They support additional actions beyond check incidents:

| Action | Description |
|---|---|
| Acknowledge | Mark as being worked on, stop escalation |
| Unacknowledge | Re-enable escalation if needed |
| Snooze | Temporarily suppress escalation for a set duration |
| Unsnooze | Resume escalation before snooze expires |
| Resolve | Close the incident |
| Set Priority | P1 (Critical) through P5 (Informational) |

Inbound incident detail pages have the same four tabs (Timeline, Notes, War Room, Details) plus an Events tab showing raw inbound events.

Postmortems

After an incident is resolved, you can create a Postmortem from the incident detail page. Click Create Postmortem to start.

A postmortem includes:

  • Summary — what happened
  • Impact — who was affected and how
  • Root Cause — why it happened
  • Action Items — checklist of follow-up tasks
  • Timeline — auto-populated from incident activities

A postmortem has one of two statuses: Draft or Published. Draft postmortems are visible only to org members. Published postmortems can be shared.

Managing Postmortems

Navigate to On-Call > Postmortems to view all postmortems across your organization. You can filter by status (Draft/Published) and search by title. To link a postmortem to a specific incident, click Create Postmortem on that incident's detail page; the postmortem's Timeline section is then auto-populated from the incident's activities.

Incident Priority

Inbound incidents can be assigned a priority level to help triage and prioritize response:

| Priority | Use Case |
|---|---|
| P1 (Critical) | Complete service outage, immediate action required |
| P2 (High) | Major degradation, user-facing impact |
| P3 (Medium) | Partial impact, workaround available |
| P4 (Low) | Minor issue, no immediate user impact |
| P5 (Informational) | Advisory, no action needed |

Noise Reduction

Several features help reduce alert fatigue:

Flapping Detection

When a check rapidly toggles between UP and DOWN (more than 5 times in 10 minutes), the system marks it as flapping and suppresses repeated notifications until the status stabilizes. This prevents alert storms from unstable network connections or services that are bouncing.
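
The threshold above (more than 5 status changes in 10 minutes) can be illustrated with a sliding-window counter. This is a minimal sketch, not the system's actual detector: the `FlapDetector` class is hypothetical, and it assumes that every change between consecutive results counts as one toggle.

```python
from collections import deque
from typing import Deque, Optional

FLAP_WINDOW_SECONDS = 10 * 60  # 10 minutes, from the text
FLAP_MAX_TRANSITIONS = 5       # more than 5 toggles marks the check as flapping


class FlapDetector:
    def __init__(self) -> None:
        self._last_status: Optional[str] = None
        self._transitions: Deque[float] = deque()

    def observe(self, status: str, now: float) -> bool:
        """Record a check result; return True while the check counts as flapping."""
        if self._last_status is not None and status != self._last_status:
            self._transitions.append(now)
        self._last_status = status
        # Drop transitions that have fallen out of the sliding window.
        while self._transitions and now - self._transitions[0] > FLAP_WINDOW_SECONDS:
            self._transitions.popleft()
        return len(self._transitions) > FLAP_MAX_TRANSITIONS
```

Once the status stops changing, old transitions age out of the window and `observe` returns False again, which corresponds to the point where notifications would resume.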

Suppression Rules

Create rules to automatically suppress inbound events that match specific patterns. Go to Integrations > Suppression Rules to manage rules.

| Field | Description |
|---|---|
| Field | Which event field to match: summary, source, component, severity |
| Operator | contains, equals, or regex |
| Value | The pattern to match against (e.g., "test-", "staging") |

Suppressed events are logged but do not create incidents or trigger notifications. Use this for known non-actionable alerts from noisy monitoring systems.
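
A rule with these three fields could be evaluated roughly as follows. The `SuppressionRule` class and `is_suppressed` helper are hypothetical, and the matching details (case-sensitive comparison, `re.search` rather than a full-string match for regex) are assumptions that this page does not specify.

```python
import re
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class SuppressionRule:
    field: str     # one of: summary, source, component, severity
    operator: str  # one of: contains, equals, regex
    value: str

    def matches(self, event: dict) -> bool:
        actual = str(event.get(self.field, ""))
        if self.operator == "contains":
            return self.value in actual
        if self.operator == "equals":
            return actual == self.value
        if self.operator == "regex":
            return re.search(self.value, actual) is not None
        raise ValueError(f"unknown operator: {self.operator}")


def is_suppressed(event: dict, rules: Iterable[SuppressionRule]) -> bool:
    """An event is suppressed if any rule matches it."""
    return any(rule.matches(event) for rule in rules)


rules = [
    SuppressionRule("summary", "contains", "test-"),
    SuppressionRule("source", "equals", "staging"),
]
print(is_suppressed({"summary": "test-alert fired", "source": "prod"}, rules))
```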

Alert Grouping

Configure a time window on integration keys to merge rapid-fire alerts into a single incident. When grouping_type is set to time_window, events arriving within the window increment the alert_count on the parent incident instead of creating new ones.
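
The merging behavior can be sketched like this. `grouping_type`, `time_window`, and `alert_count` come from the text; the `TimeWindowGrouper` class, the `window_seconds` parameter, and the choice to measure the window from the parent incident's creation time (rather than from its most recent event) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class InboundIncident:
    opened_at: float
    alert_count: int = 1


class TimeWindowGrouper:
    """Hypothetical grouping for one integration key with grouping_type = time_window."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self.incidents: List[InboundIncident] = []

    def ingest(self, received_at: float) -> InboundIncident:
        parent = self.incidents[-1] if self.incidents else None
        if parent is not None and received_at - parent.opened_at <= self.window_seconds:
            parent.alert_count += 1  # merge into the parent incident
            return parent
        incident = InboundIncident(opened_at=received_at)  # outside the window
        self.incidents.append(incident)
        return incident
```

With a 300-second window, events at 0, 100, and 299 seconds land on one incident with an `alert_count` of 3, while an event at 301 seconds opens a second incident.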

Other Noise Reduction

  • Alert-after-X-failures — only alert after consecutive failures exceed a threshold
  • Quiet hours — suppress notifications during configured off-hours (see Organization Settings)
  • Maintenance windows — suppress alerts during planned downtime (see Maintenance Windows)
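
The alert-after-X-failures option can be illustrated with a simple counter. This sketch reads "exceed a threshold" literally, so the alert fires on the first result that pushes the consecutive-failure count above the threshold; the `FailureThresholdAlerter` class and that exact comparison are assumptions, and the product may count differently.

```python
class FailureThresholdAlerter:
    """Hypothetical counter for alert-after-X-failures."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.consecutive_failures = 0
        self.alerted = False

    def observe(self, ok: bool) -> bool:
        """Return True only on the result that should trigger the alert."""
        if ok:
            # Any success resets the streak and re-arms the alert.
            self.consecutive_failures = 0
            self.alerted = False
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures > self.threshold and not self.alerted:
            self.alerted = True
            return True
        return False
```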

Real-Time Updates

Incident state changes are pushed to all connected clients via Server-Sent Events (SSE). Events include:

  • incident.new — new incident started
  • incident.resolve — incident resolved
  • incident.ack — incident acknowledged
  • warroom.created — War Room created (manual or auto)
  • warroom.closed — War Room closed

The dashboard and incident detail pages update automatically when these events arrive, with a fallback polling interval of 30–60 seconds.
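
A client consuming these events might parse the stream as sketched here. The `event:` and `data:` line format follows the general Server-Sent Events wire format; the JSON payload shape, the `parse_sse` and `dispatch` helpers, and the sample data are assumptions, since this page does not document the payloads or the stream URL.

```python
import json
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def _value(line: str, name: str) -> str:
    """Return the field value, dropping one optional space after the colon."""
    value = line[len(name) + 1:]
    return value[1:] if value.startswith(" ") else value


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event_name, data) pairs from SSE text lines."""
    event, data = "message", []  # type: str, List[str]
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":                   # a blank line ends one event
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):       # comment or keep-alive
            continue
        elif line.startswith("event:"):
            event = _value(line, "event")
        elif line.startswith("data:"):
            data.append(_value(line, "data"))


def dispatch(lines: Iterable[str], handlers: Dict[str, Callable[[dict], None]]) -> None:
    """Call the handler registered for each known event name."""
    for name, data in parse_sse(lines):
        handler = handlers.get(name)
        if handler is not None:
            handler(json.loads(data))    # assumes JSON payloads
```

A dashboard would register one handler per event name (for example `incident.new` or `warroom.closed`) and fall back to polling if the stream disconnects.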

Best Practices

  • Always acknowledge incidents promptly to stop unnecessary escalation
  • Use War Rooms for high-severity incidents involving multiple responders
  • Write postmortems for major incidents to prevent recurrence
  • Configure flapping detection for checks with known intermittent issues
  • Use Force Resolve sparingly — only when you're certain the incident is handled even though the check hasn't recovered
  • Set up escalation policies to ensure someone always responds