How fast should production triage actually happen?

From page to an assigned severity and a clear owner, 15 minutes is a reasonable 2026 target for high-severity production bugs. Anything slower compounds: each extra minute means more affected users and more confused stakeholders pinging Slack. Under 5 minutes is heroic but unsustainable; past 30 minutes, the bug owns you.

What if I cannot reproduce the bug in 15 minutes?

That is fine: triage is not 'fix it.' Triage produces a severity assignment, an owner, and a next action. If repro takes more than 5 of the 15 minutes, stop, triage from evidence (session replay, logs, metrics), and escalate to a senior engineer when you decide the next action in step 4. Keep the triage clock running throughout; reproduction and triage are separate activities.

Should the on-call engineer triage alone?

No. The on-call leads but pulls in two others: typically the most recent person to touch the affected code (git blame) and the customer-facing owner (CSM or PM). Three people in a 15-minute call is the sweet spot.

What goes into the public-facing status post?

Three things and only three: what is broken, in the user's language; the workaround, if any; and the next update time. Resist the urge to speculate on root cause publicly until you actually know.

Does this work for non-emergencies?

The 15-minute frame is for SEV1/SEV2 production issues. Apply the same 5 steps to SEV3/SEV4 in a stretched timebox — 1 hour for SEV3, async for SEV4. The structure is what matters; the clock adjusts to the severity.
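The timebox-per-severity rule can be written down as a small lookup. This is a minimal sketch: the name `triage_timebox` and the use of `None` to mean "async, no clock" are assumptions made for illustration, not part of the playbook.

```python
from datetime import timedelta
from typing import Optional

# Timeboxes from the text: 15 minutes for SEV1/SEV2, 1 hour for SEV3.
# SEV4 is handled asynchronously, so it has no timebox (None).
TRIAGE_TIMEBOX = {
    "SEV1": timedelta(minutes=15),
    "SEV2": timedelta(minutes=15),
    "SEV3": timedelta(hours=1),
    "SEV4": None,
}


def triage_timebox(sev: str) -> Optional[timedelta]:
    """Return the triage timebox for a severity label, or None for async."""
    key = sev.strip().upper()
    if key not in TRIAGE_TIMEBOX:
        raise ValueError(f"unknown severity: {sev!r}")
    return TRIAGE_TIMEBOX[key]
```

Raising on an unknown label, rather than defaulting to a timebox, keeps a typo from silently turning a SEV1 into an untimed issue.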

Playbooks

How to Triage a Production Bug in 15 Minutes: A 5-Step Playbook

A repeatable 5-step playbook to triage production bugs in 15 minutes — from page to severity assignment to first action — without losing your sprint to investigation.

BugMojo Team · May 22, 2026 · 5 min read

What "triage" actually means

Triage produces three artifacts: a severity assignment, a clear owner of the next action, and a one-sentence plan. It does NOT produce a fix, a root-cause analysis, or a post-mortem — those come later. Conflating triage with diagnosis is the most common reason teams blow past the 15-minute window without realizing it.

The single most useful framing: triage is the meeting where you decide what to do, not the meeting where you do it. A clear triage outcome lets the actual fix work happen in parallel with stakeholder communication and root-cause investigation, instead of all three competing for the same attention.

The 15-minute clock starts when…

Three triggers start the triage clock: a PagerDuty alert firing for a SEV1/SEV2 alarm, a customer email or support ticket flagged as urgent by the on-call CSM, or an internal Slack message that escalates an issue from #bugs to #incidents. The clock does not start when "someone notices something might be wrong" — it starts when the issue is named and surfaced.

If you're not sure whether the clock has started, ask: "do I have a specific user-facing symptom and a way to reach the on-call engineer right now?" If yes, the clock is running. If not, you're still in detection, not triage.

Step 1: Acknowledge and pull in 2 people (0–2 min)

The on-call engineer acknowledges the page within the first 2 minutes, then opens an incident Slack channel (or thread) and tags exactly two people: the most recent author of the affected code (via git blame) and the customer-facing owner (CSM, PM, or support lead). Three people total — more becomes a town hall, fewer leaves blind spots.
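Finding the recent author can be scripted ahead of time. The sketch below uses `git log` on a file path instead of line-level `git blame`, which is a simplification; the helper names and the default limit of five commits are assumptions. It must run inside the repository that contains the affected code.

```python
import subprocess
from typing import List


def git_log_command(path: str, limit: int = 5) -> List[str]:
    """Build a git command listing authors of the latest commits touching path."""
    return ["git", "log", "-n", str(limit), "--format=%an <%ae>", "--", path]


def unique_authors(git_output: str) -> List[str]:
    """Deduplicate author lines while keeping most-recent-first order."""
    seen = set()
    authors = []
    for line in git_output.splitlines():
        author = line.strip()
        if author and author not in seen:
            seen.add(author)
            authors.append(author)
    return authors


def recent_authors(path: str, limit: int = 5) -> List[str]:
    """Run git in the current repository and return recent distinct authors."""
    result = subprocess.run(
        git_log_command(path, limit),
        capture_output=True,
        text=True,
        check=True,
    )
    return unique_authors(result.stdout)
```

The first entry returned is the person to tag; the rest are fallbacks if that person is unreachable.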

The Slack message template:

🚨 Possible SEV1: [one-line user-facing symptom]
Reporter: [name + channel — customer email / monitoring / etc.]
On-call: [you]
Pulling in: @[recent-author] @[customer-owner]
Joining call: [Meet/Zoom link]
ETA on triage outcome: 15 min from now ([timestamp])

Drop the link, ping the names, hit send. Don't wait for confirmation before moving to step 2.
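If the incident channel is created by a bot, the template above can be filled in automatically, including the ETA timestamp. A minimal sketch, assuming a hypothetical `kickoff_message` helper and 24-hour local times; posting to Slack is left out.

```python
from datetime import datetime, timedelta

TRIAGE_BUDGET = timedelta(minutes=15)


def kickoff_message(symptom: str, reporter: str, on_call: str,
                    recent_author: str, customer_owner: str,
                    call_link: str, now: datetime) -> str:
    """Fill in the step 1 template; the ETA is 15 minutes after `now`."""
    eta = now + TRIAGE_BUDGET
    return "\n".join([
        f"🚨 Possible SEV1: {symptom}",
        f"Reporter: {reporter}",
        f"On-call: {on_call}",
        f"Pulling in: @{recent_author} @{customer_owner}",
        f"Joining call: {call_link}",
        f"ETA on triage outcome: 15 min from now ({eta:%H:%M})",
    ])
```

Computing the ETA in code removes one piece of mental arithmetic from the first two minutes.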

Step 2: Reproduce or accept "can't repro yet" (2–7 min)

Reproducing the bug locally is the single fastest path to severity assignment — if you can reproduce it, you know exactly what's broken. Give it 5 minutes. If you can't reproduce, switch to evidence: session replay, error logs, metrics, customer screenshots. A bug with corroborating evidence is triageable even without local repro.

The repro checklist for the 5-minute window:

Open the affected URL in an incognito window with the same user state if possible.
Check the customer's session replay if you have one — this is where session-replay tools (rrweb, BugMojo, etc.) pay off.
Check the error tracker for the exception fingerprint and affected user count.
Check deployment timestamps — was there a recent deploy that correlates?
Check the database — is there a data anomaly?
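The deploy-correlation check in the list above is simple enough to automate. This sketch assumes you can export deploys as `(deploy_id, finished_at)` pairs; the function name and the one-hour lookback are illustrative choices, not a recommendation from the playbook.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def suspect_deploys(deploys: List[Tuple[str, datetime]],
                    first_error_at: datetime,
                    lookback: timedelta = timedelta(hours=1)) -> List[str]:
    """Return ids of deploys that finished within `lookback` before the first error, newest first."""
    window_start = first_error_at - lookback
    hits = [
        (finished_at, deploy_id)
        for deploy_id, finished_at in deploys
        if window_start <= finished_at <= first_error_at
    ]
    return [deploy_id for _, deploy_id in sorted(hits, reverse=True)]
```

A deploy in the window is a correlation, not a diagnosis. It tells you where to look first and whether a rollback is worth considering in step 4.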

Step 3: Assign severity (7–9 min)

Severity is a technical assessment based on what is broken, not a business decision. Use a four-level scale: CRITICAL (data loss, security, app-down for a class of users), HIGH (major feature broken, no workaround), MEDIUM (workaround exists), LOW (cosmetic). Elsewhere in this playbook, SEV1 corresponds to CRITICAL and SEV2 to HIGH. The on-call engineer assigns severity; the team confirms in 30 seconds.

Two failure modes to watch for:

Severity inflation under pressure. Customers calling it CRITICAL doesn't make it CRITICAL; it makes it loud. Hold the bar.
Severity deflation to avoid escalation. "It's only HIGH because the workaround is annoying" — if every user has to do the workaround on every interaction, it's CRITICAL.
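Writing the scale down as code makes it harder to bend under pressure. The sketch below is one possible encoding of the four levels plus the deflation rule; the `Facts` fields and the order of the checks are assumptions, and a real team would tune both.

```python
from dataclasses import dataclass


@dataclass
class Facts:
    """What the team observed in step 2 (field names are illustrative)."""
    data_loss: bool = False
    security_issue: bool = False
    app_down_for_user_class: bool = False
    major_feature_broken: bool = False
    workaround_exists: bool = False
    workaround_needed_by_everyone_every_time: bool = False


def assign_severity(f: Facts) -> str:
    """Map observed facts to CRITICAL / HIGH / MEDIUM / LOW."""
    if f.data_loss or f.security_issue or f.app_down_for_user_class:
        return "CRITICAL"
    # Deflation guard: a workaround every user needs on every interaction
    # does not lower the severity.
    if f.workaround_exists and f.workaround_needed_by_everyone_every_time:
        return "CRITICAL"
    if f.major_feature_broken and not f.workaround_exists:
        return "HIGH"
    if f.workaround_exists:
        return "MEDIUM"
    return "LOW"
```

Note that nothing in `Facts` records how loudly a customer is complaining, which is the point of the inflation rule.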

For the deep definition of the severity levels and the priority dimension, see our bug severity vs priority decision framework.

Step 4: Decide the next action (9–13 min)

Four options for the next action: roll back the last deploy, ship a hotfix forward, deploy a workaround (feature flag off, monkey-patch, customer-side advisory), or accept the issue and schedule a normal-cycle fix. The choice depends on severity, time-to-fix, and confidence in the diagnosis.

Action	What you do	When to choose it	Time-to-relief	Risk
Rollback	Revert the last deploy	You are highly confident the recent deploy caused the issue and the rollback is safe	5-15 min	May lose recent legitimate work; database migrations may not be reversible
Hotfix	Ship a targeted forward fix	You can identify the bug and write a 1-3 line fix in under 30 min	30-90 min	The fix may introduce new bugs; harder to test in time
Workaround	Feature flag off / advisory	Bug is gated behind a flag, or customers can avoid it	1-10 min	Feature unavailable for everyone, not just affected users
Schedule	Accept and queue normal fix	Severity is MEDIUM/LOW or affected user count is tiny	Next release	User dissatisfaction; reputational damage
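The table can be read as a decision function. The sketch below is one way to order the checks, preferring the fastest time-to-relief. That ordering, the parameter names, and the "escalate" fallback for cases the table does not cover are assumptions.

```python
def next_action(severity: str,
                deploy_is_likely_cause: bool,
                rollback_is_safe: bool,
                behind_feature_flag: bool,
                small_fix_identified: bool) -> str:
    """Pick rollback / workaround / hotfix / schedule from the triage facts."""
    if severity in ("MEDIUM", "LOW"):
        return "schedule"
    if deploy_is_likely_cause and rollback_is_safe:
        return "rollback"
    if behind_feature_flag:
        return "workaround"
    if small_fix_identified:
        return "hotfix"
    # None of the four options clearly applies: bring in a senior engineer.
    return "escalate"
```

For a HIGH issue caused by a recent deploy with no database migration, `next_action("HIGH", True, True, False, False)` returns `"rollback"`.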

Step 5: Communicate (13–15 min)

Three audiences need updates: affected customers (status page or direct email), internal stakeholders (Slack/Teams), and leadership (a one-line ping for SEV1/CRITICAL). Each gets a different message. The 2-minute window is enough if you have templates ready before the incident.

Public status update template:

Investigating: [user-facing symptom in plain language]
Started: [timestamp]
Affected: [scope — e.g. "approximately 5% of users on checkout"]
Workaround: [if any, in 1 sentence; else "none currently"]
Next update: [timestamp, within next 30 min]
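A small renderer can enforce the "next update within 30 minutes" rule so that a vague promise never reaches the status page. This is a sketch: `public_status` and its parameters are hypothetical, and it assumes all times are naive datetimes in the same zone, labelled by `tz_label`.

```python
from datetime import datetime, timedelta
from typing import Optional

MAX_UPDATE_GAP = timedelta(minutes=30)  # "within next 30 min" from the template


def public_status(symptom: str, started: datetime, scope: str,
                  next_update: datetime, now: datetime,
                  workaround: Optional[str] = None,
                  tz_label: str = "UTC") -> str:
    """Render the public status update, rejecting a past or distant next-update time."""
    if not (now < next_update <= now + MAX_UPDATE_GAP):
        raise ValueError("next update must fall within the next 30 minutes")
    return "\n".join([
        f"Investigating: {symptom}",
        f"Started: {started:%H:%M} {tz_label}",
        f"Affected: {scope}",
        f"Workaround: {workaround or 'none currently'}",
        f"Next update: {next_update:%H:%M} {tz_label}",
    ])
```

The function has no root-cause field on purpose, which keeps speculation out of the public post.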

Internal Slack template:

SEV[level]: [symptom]
Severity assigned: [CRITICAL/HIGH/MEDIUM/LOW]
Next action: [rollback / hotfix / workaround / schedule]
Owner: @[name]
ETA to resolution: [estimate]
Channel: #incident-[id]

Leadership ping (CEO/CTO if CRITICAL):

SEV1 in progress on [product/feature]. [Owner] handling. Update at [time]. Will escalate if needed.

After triage: handoff

Triage ends with a handoff to the fix owner. The handoff packet has four items: the severity, the agreed-upon next action, the owner's name, and a link to the incident channel where context lives. The template below also records a first milestone, so the fix owner has a concrete checkpoint. If the triage owner is also the fix owner, the handoff happens in your head, but write it down anyway for post-mortem traceability.

The handoff template:

[Severity] | Owner: @[name]
Next action: [rollback / hotfix / workaround / schedule]
Incident channel: #incident-[id]
First milestone: [time + outcome — e.g. "11:30 AM — PR open"]
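Treating the handoff as a typed record keeps a half-filled packet from being posted. A minimal sketch, assuming a hypothetical `Handoff` class whose allowed values mirror the options used in this playbook.

```python
from dataclasses import dataclass

SEVERITIES = ("CRITICAL", "HIGH", "MEDIUM", "LOW")
NEXT_ACTIONS = ("rollback", "hotfix", "workaround", "schedule")


@dataclass(frozen=True)
class Handoff:
    severity: str
    owner: str
    next_action: str
    incident_id: str
    first_milestone: str

    def __post_init__(self) -> None:
        # Reject values outside the playbook's vocabulary.
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")
        if self.next_action not in NEXT_ACTIONS:
            raise ValueError(f"unknown next action: {self.next_action!r}")

    def render(self) -> str:
        """Format the packet in the same layout as the handoff template."""
        return "\n".join([
            f"{self.severity} | Owner: @{self.owner}",
            f"Next action: {self.next_action}",
            f"Incident channel: #incident-{self.incident_id}",
            f"First milestone: {self.first_milestone}",
        ])
```

Storing the record alongside the incident gives the post-mortem a fixed point to compare against what actually happened.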

Common mistakes

Treating triage as the fix. The clock runs out, the bug is still unfixed, and you've burned the urgency capital. Triage ends with a plan, not a code change.
Pulling in too many people. Three is the sweet spot. Five becomes a town hall; ten becomes paralysis.
No timestamps in updates. "Next update soon" is useless. "Next update by 11:45 AM PT" is actionable.
Skipping the post-triage handoff. The on-call moves on, the fix owner forgets the context, and the bug stalls.
Speculating publicly. Status pages are not where you brainstorm. Write what you know.

A real example

Time 11:00 AM. PagerDuty fires: error rate on POST /checkout spiking from 0.2% to 8%. On-call engineer Alice acknowledges within 90 seconds, opens #incident-2026-05-22-checkout, pings @bob (last touched /checkout controller) and @carla (CSM lead).

11:02 — repro attempt. Alice tries to check out herself: succeeds. Checks the session-replay tool — sees 3 affected sessions in the past 10 minutes, all from users on a specific payment method (prepaid cards). Repro pattern: known.

11:07 — severity assigned. "Affected users get a 500 with no recovery path. Not all users, but the affected ones lose the cart. HIGH severity, not CRITICAL — workaround is to use a different card."

11:11 — next action. Recent deploy at 10:45 AM added a new validation step that mishandles prepaid card numbers. Rollback is safe (no DB migration). Decision: rollback.

11:13 — communication. Public status: "Some users on prepaid cards are seeing checkout errors. Workaround: use a different payment method. Rolling back the affected change now, ETA 5 min." Internal Slack: SEV2, Alice owning rollback, ETA 11:20.

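At 11:20 the rollback is complete, 20 minutes after the page and 5 minutes past the triage budget, which is acceptable for a SEV2. Triage itself, which ends with a plan and a status update rather than a fix, wrapped up with the 11:13 communication step.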
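The arithmetic behind this timeline is worth keeping for the post-mortem. The sketch below converts each timestamp in the example into minutes after the page; the event labels and helper names are illustrative, and the date comes from the incident channel name.

```python
from datetime import datetime

PAGE = datetime(2026, 5, 22, 11, 0)
BUDGET_MIN = 15

EVENTS = [
    ("repro attempt", datetime(2026, 5, 22, 11, 2)),
    ("severity assigned", datetime(2026, 5, 22, 11, 7)),
    ("next action", datetime(2026, 5, 22, 11, 11)),
    ("communication", datetime(2026, 5, 22, 11, 13)),
    ("rollback complete", datetime(2026, 5, 22, 11, 20)),
]


def minutes_after_page(at: datetime, page: datetime = PAGE) -> int:
    """Whole minutes elapsed between the page and an event."""
    return int((at - page).total_seconds() // 60)


def timeline(events=EVENTS):
    """Return (label, minutes after page) for each event, in order."""
    return [(label, minutes_after_page(at)) for label, at in events]


if __name__ == "__main__":
    for label, minute in timeline():
        over = max(0, minute - BUDGET_MIN)
        print(f"{minute:>2} min  {label}  (+{over} past the 15-minute mark)")
```

Run against real incidents, the same offsets show which step tends to overrun its window.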

Next steps

Build a runbook with the 5 templates above hardcoded into your incident-channel auto-create flow.
Practice the playbook quarterly with a game day — invent a fake SEV1 and run through the 15 minutes.
Capture session replays in production so step 2 doesn't depend on your ability to reproduce locally. BugMojo auto-captures rrweb session replay + console + network for every error, so the on-call has the evidence in hand before they open Slack.

For more on severity assignment specifically, read our bug severity vs priority decision framework.


