How to Triage a Production Bug in 15 Minutes: A 5-Step Playbook
A repeatable 5-step playbook to triage production bugs in 15 minutes — from page to severity assignment to first action — without losing your sprint to investigation.
What "triage" actually means
Triage produces three artifacts: a severity assignment, a clear owner of the next action, and a one-sentence plan. It does NOT produce a fix, a root-cause analysis, or a post-mortem — those come later. Conflating triage with diagnosis is the most common reason teams blow past the 15-minute window without realizing it.
The single most useful framing: triage is the meeting where you decide what to do, not the meeting where you do it. A clear triage outcome lets the actual fix work happen in parallel with stakeholder communication and root-cause investigation, instead of all three competing for the same attention.
The 15-minute clock starts when…
Three triggers start the triage clock: a PagerDuty alert firing for a SEV1/SEV2 alarm, a customer email or support ticket flagged as urgent by the on-call CSM, or an internal Slack message that escalates an issue from #bugs to #incidents. The clock does not start when "someone notices something might be wrong" — it starts when the issue is named and surfaced.
If you're not sure whether the clock has started, ask: "do I have a specific user-facing symptom and a way to reach the on-call engineer right now?" If yes, the clock is running. If not, you're still in detection, not triage.
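The question above can be written down as a tiny check. This is a minimal sketch, not part of any tool: the `Signal` type and its field names are hypothetical, chosen only to mirror the two conditions in the question.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Signal:
    """What you know at the moment you ask whether the clock has started."""
    user_facing_symptom: Optional[str]  # e.g. "checkout returns 500"; None while still vague
    can_reach_on_call: bool             # can you reach the on-call engineer right now?


def triage_clock_running(signal: Signal) -> bool:
    """True once there is a specific symptom AND the on-call engineer is reachable."""
    has_symptom = bool(signal.user_facing_symptom and signal.user_facing_symptom.strip())
    return has_symptom and signal.can_reach_on_call
```

If this returns False, you are still in detection: keep gathering a concrete symptom before paging anyone.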
Step 1: Acknowledge and pull in 2 people (0–2 min)
The on-call engineer acknowledges the page within the first 2 minutes, then opens an incident Slack channel (or thread) and tags exactly two people: the most recent author of the affected code (via git blame) and the customer-facing owner (CSM, PM, or support lead). Three people total — more becomes a town hall, fewer leaves blind spots.
The Slack message template:
🚨 Possible SEV1: [one-line user-facing symptom]
Reporter: [name + channel — customer email / monitoring / etc.]
On-call: [you]
Pulling in: @[recent-author] @[customer-owner]
Joining call: [Meet/Zoom link]
ETA on triage outcome: 15 min from now ([timestamp])
Drop the link, ping the names, hit send. Don't wait for confirmation before moving to step 2.
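If you want the message pre-filled rather than typed under pressure, a small helper can render it. This is a sketch under assumptions: the function name, its parameters, and the `HH:MM` timestamp format are illustrative, and posting the result to Slack is left out.

```python
from datetime import datetime, timedelta

TRIAGE_BUDGET = timedelta(minutes=15)


def triage_kickoff_message(symptom: str, reporter: str, on_call: str,
                           recent_author: str, customer_owner: str,
                           call_link: str, now: datetime) -> str:
    """Fill in the step 1 template; the ETA is the page time plus the 15-minute budget."""
    eta = now + TRIAGE_BUDGET
    return "\n".join([
        f"🚨 Possible SEV1: {symptom}",
        f"Reporter: {reporter}",
        f"On-call: {on_call}",
        f"Pulling in: @{recent_author} @{customer_owner}",
        f"Joining call: {call_link}",
        f"ETA on triage outcome: 15 min from now ({eta:%H:%M})",
    ])
```

Passing `now` in explicitly, instead of calling `datetime.now()` inside the function, keeps the ETA testable and lets you reuse the page time as the reference point.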
Step 2: Reproduce or accept "can't repro yet" (2–7 min)
Reproducing the bug locally is the single fastest path to severity assignment — if you can reproduce it, you know exactly what's broken. Give it 5 minutes. If you can't reproduce, switch to evidence: session replay, error logs, metrics, customer screenshots. A bug with corroborating evidence is triageable even without local repro.
The repro checklist for the 5-minute window:
- Open the affected URL in an incognito window with the same user state if possible.
- Check the customer's session replay if you have one — this is where session-replay tools (rrweb, BugMojo, etc.) pay off.
- Check the error tracker for the exception fingerprint and affected user count.
- Check deployment timestamps — was there a recent deploy that correlates?
- Check the database — is there a data anomaly?
If you can repro in 30 seconds, do it before pulling anyone in. The Slack message in step 1 changes from "possible SEV1" to "confirmed SEV1 — repro steps below" and saves 2 minutes of back-and-forth.
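The deploy-correlation item in the checklist is easy to automate once you have a list of deploy timestamps. A minimal sketch follows; the one-hour lookback is an assumption, not a rule from this playbook, and where the timestamps come from (CI, deploy log) is up to your setup.

```python
from datetime import datetime, timedelta


def deploys_before_incident(deploy_times: list[datetime],
                            incident_start: datetime,
                            lookback: timedelta = timedelta(hours=1)) -> list[datetime]:
    """Return deploys that landed within `lookback` before the incident, newest first."""
    window_start = incident_start - lookback
    suspects = [t for t in deploy_times if window_start <= t <= incident_start]
    return sorted(suspects, reverse=True)


# Example values: a deploy 15 minutes before the page, and one earlier that morning.
suspects = deploys_before_incident(
    [datetime(2026, 5, 22, 8, 0), datetime(2026, 5, 22, 10, 45)],
    incident_start=datetime(2026, 5, 22, 11, 0),
)
print(suspects)  # only the 10:45 deploy falls inside the window
```

A deploy inside the window is a lead, not a verdict: correlation narrows where to look but does not replace the evidence from logs or session replay.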
Step 3: Assign severity (7–9 min)
Severity is a technical assessment based on what is broken, not a business decision. Use the 4-level scale: CRITICAL (data loss, security, app-down for a class of users), HIGH (major feature broken, no workaround), MEDIUM (workaround exists), LOW (cosmetic). Where this playbook says SEV1 or SEV2, read CRITICAL or HIGH respectively. The on-call engineer assigns severity; the team confirms in 30 seconds.
Two failure modes to watch for:
- Severity inflation under pressure. Customers calling it CRITICAL doesn't make it CRITICAL; it makes it loud. Hold the bar.
- Severity deflation to avoid escalation. "It's only HIGH because the workaround is annoying" — if every user has to do the workaround on every interaction, it's CRITICAL.
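A literal reading of the 4-level scale can be sketched as a function. The flag names are hypothetical, and sending anything the scale does not name to MEDIUM is an assumption. The judgment calls in the two failure modes above are deliberately not encoded; they stay with the on-call engineer.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "CRITICAL"
    HIGH = "HIGH"
    MEDIUM = "MEDIUM"
    LOW = "LOW"


def assign_severity(*, data_loss: bool = False, security_issue: bool = False,
                    app_down_for_user_class: bool = False,
                    major_feature_broken: bool = False,
                    workaround_exists: bool = False,
                    cosmetic_only: bool = False) -> Severity:
    """Map observed facts to the 4-level scale, checking the worst cases first."""
    if data_loss or security_issue or app_down_for_user_class:
        return Severity.CRITICAL
    if cosmetic_only:
        return Severity.LOW
    if major_feature_broken and not workaround_exists:
        return Severity.HIGH
    # Assumption: a workaround, or anything the scale does not name, lands here.
    return Severity.MEDIUM
```

Keyword-only flags force the caller to state which fact they observed, which makes the 30-second team confirmation a review of inputs rather than a debate about the label.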
For the deep definition of the severity levels and the priority dimension, see our bug severity vs priority decision framework.
Step 4: Decide the next action (9–13 min)
Four options for the next action: rollback the last deploy, ship a hotfix forward, deploy a workaround (feature flag off, monkey-patch, customer-side advisory), or accept the issue and schedule a normal-cycle fix. The choice depends on severity, time-to-fix, and confidence in the diagnosis.
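The playbook does not prescribe how to weigh those three factors, so the following is one possible ordering, written as a sketch. Every rule in it is an assumption to adapt: the 30-minute hotfix threshold, the preference for rollback when a recent deploy is the suspect, and all parameter names are hypothetical.

```python
from enum import Enum


class NextAction(Enum):
    ROLLBACK = "rollback"
    HOTFIX = "hotfix"
    WORKAROUND = "workaround"
    SCHEDULE = "schedule"


HOTFIX_MAX_MINUTES = 30  # assumed threshold, not from the playbook


def choose_next_action(severity: str, *, recent_deploy_suspected: bool,
                       rollback_safe: bool, diagnosis_confident: bool,
                       est_fix_minutes: int, workaround_available: bool) -> NextAction:
    """Pick one of the four options from severity, time-to-fix, and confidence."""
    if severity == "LOW":
        return NextAction.SCHEDULE          # cosmetic issues wait for the normal cycle
    if recent_deploy_suspected and rollback_safe:
        return NextAction.ROLLBACK          # fastest way back to a known-good state
    if diagnosis_confident and est_fix_minutes <= HOTFIX_MAX_MINUTES:
        return NextAction.HOTFIX            # only fix forward when you know the cause
    if workaround_available:
        return NextAction.WORKAROUND        # buy time with a flag or advisory
    if severity == "MEDIUM":
        return NextAction.SCHEDULE
    return NextAction.HOTFIX                # CRITICAL/HIGH with no other option
```

Whatever ordering you settle on, writing it down before an incident matters more than the exact thresholds: the 4-minute window is for applying the policy, not inventing it.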
Step 5: Communicate (13–15 min)
Three audiences need updates: affected customers (status page or direct email), internal stakeholders (Slack/Teams), and leadership (a one-line ping for SEV1/CRITICAL). Each gets a different message. The 2-minute window is enough if you have templates ready before the incident.
Public status update template:
Investigating: [user-facing symptom in plain language]
Started: [timestamp]
Affected: [scope — e.g. "approximately 5% of users on checkout"]
Workaround: [if any, in 1 sentence; else "none currently"]
Next update: [timestamp, within next 30 min]
Internal Slack template:
SEV[level]: [symptom]
Severity assigned: [CRITICAL/HIGH/MEDIUM/LOW]
Next action: [rollback / hotfix / workaround / schedule]
Owner: @[name]
ETA to resolution: [estimate]
Channel: #incident-[id]
Leadership ping (CEO/CTO if CRITICAL):
SEV1 in progress on [product/feature]. [Owner] handling. Update at [time]. Will escalate if needed.
Do not speculate on root cause in the public status update. Until you actually know, say "investigating." A wrong public guess (you post "a bad cache invalidation" and it turns out to be an unrelated database issue) costs more trust than silence.
After triage: handoff
Triage ends with a handoff to the fix owner. The handoff packet has five items: the severity, the agreed-upon next action, the owner's name, a link to the incident channel where context lives, and a first milestone with a time attached. If the triage owner is also the fix owner, the handoff happens in your head, but write it down anyway for post-mortem traceability.
The handoff template:
[Severity] | Owner: @[name]
Next action: [rollback / hotfix / workaround / schedule]
Incident channel: #incident-[id]
First milestone: [time + outcome — e.g. "11:30 AM — PR open"]
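To keep the handoff from being skipped, it helps to make it a small record that cannot be created with a field missing. A minimal sketch, with a hypothetical `Handoff` class whose fields mirror the template lines:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Handoff:
    """One field per slot in the handoff template; all are required."""
    severity: str
    owner: str
    next_action: str
    incident_channel: str
    first_milestone: str

    def render(self) -> str:
        """Produce the four-line handoff message for the incident channel."""
        return "\n".join([
            f"{self.severity} | Owner: @{self.owner}",
            f"Next action: {self.next_action}",
            f"Incident channel: {self.incident_channel}",
            f"First milestone: {self.first_milestone}",
        ])
```

Because the dataclass has no defaults, forgetting the owner or the milestone fails at construction time instead of surfacing later as a stalled bug.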
Common mistakes
- Treating triage as the fix. The clock runs out, the bug is still unfixed, and you've burned the urgency capital. Triage ends with a plan, not a code change.
- Pulling in too many people. Three is the sweet spot. Five becomes a town hall; ten becomes paralysis.
- No timestamps in updates. "Next update soon" is useless. "Next update by 11:45 AM PT" is actionable.
- Skipping the post-triage handoff. The on-call moves on, the fix owner forgets the context, and the bug stalls.
- Speculating publicly. Status pages are not where you brainstorm. Write what you know.
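The "no timestamps" mistake is the easiest one on this list to catch mechanically. The sketch below only checks that an update contains a clock time; the regular expression is an illustrative assumption and does not validate time zones or dates.

```python
import re

# Matches clock times such as "11:45", "11:45 AM", or "9:05 pm".
CLOCK_TIME = re.compile(r"\b([01]?\d|2[0-3]):[0-5]\d(\s?[AaPp][Mm])?\b")


def has_concrete_time(update: str) -> bool:
    """True if the update names a specific clock time rather than 'soon'."""
    return CLOCK_TIME.search(update) is not None


print(has_concrete_time("Next update by 11:45 AM PT"))  # True
print(has_concrete_time("Next update soon"))            # False
```

A check like this could run before an update is posted, prompting the author to add a time when it returns False.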
A real example
Time 11:00 AM. PagerDuty fires: error rate on POST /checkout spiking from 0.2% to 8%. On-call engineer Alice acknowledges within 90 seconds, opens #incident-2026-05-22-checkout, pings @bob (last touched /checkout controller) and @carla (CSM lead).
11:02 — repro attempt. Alice tries to check out herself: succeeds. Checks the session-replay tool — sees 3 affected sessions in the past 10 minutes, all from users on a specific payment method (prepaid cards). Repro pattern: known.
11:07 — severity assigned. "Affected users get a 500 with no recovery path. Not all users, but the affected ones lose the cart. HIGH severity, not CRITICAL — workaround is to use a different card."
11:11 — next action. Recent deploy at 10:45 AM added a new validation step that mishandles prepaid card numbers. Rollback is safe (no DB migration). Decision: rollback.
11:13 — communication. Public status: "Some users on prepaid cards are seeing checkout errors. Workaround: use a different payment method. Rolling back the affected change now, ETA 5 min." Internal Slack: SEV2, Alice owning rollback, ETA 11:20.
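11:20 — rollback complete. Triage itself wrapped at 11:13 with the communication step, inside the 15-minute budget; the incident was resolved 20 minutes after the page, which is a good outcome for a SEV2.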
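To compare the example against the step windows, here is the elapsed time for each milestone, measured from the 11:00 page. The helper and the dictionary labels are illustrative; the times are the ones in the example.

```python
from datetime import datetime


def elapsed_minutes(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM times."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


TIMELINE = {
    "repro attempt": "11:02",
    "severity assigned": "11:07",
    "next action": "11:11",
    "communication": "11:13",
    "rollback complete": "11:20",
}

elapsed = {label: elapsed_minutes("11:00", t) for label, t in TIMELINE.items()}
print(elapsed)
# {'repro attempt': 2, 'severity assigned': 7, 'next action': 11,
#  'communication': 13, 'rollback complete': 20}
```

The first four milestones line up with the 2, 7, 9, and 13-minute step boundaries closely; the rollback at minute 20 is fix work that follows the triage steps.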
Next steps
- Build a runbook with the 5 templates above hardcoded into your incident-channel auto-create flow.
- Practice the playbook quarterly with a game day — invent a fake SEV1 and run through the 15 minutes.
- Capture session replays in production so step 2 doesn't depend on your ability to reproduce locally. BugMojo auto-captures rrweb session replay + console + network for every error, so the on-call has the evidence in hand before they open Slack.
For more on severity assignment specifically, read our bug severity vs priority decision framework.