What Is a Flaky Test? Why Tests Fail Intermittently and How to Fix It
A flaky test passes and fails on the exact same code. Here is why tests fail intermittently, how to confirm and fix the non-determinism, and what to capture before you hit retry.

Definition
A flaky test is an automated test that both passes and fails on the exact same code, with no change to the code or test in between. Its outcome is non-deterministic, driven by timing, run order, shared state, or environment rather than the behavior it claims to verify.
The word that matters is non-deterministic. A normal test is a pure function of the code: same input, same result, every time. A flaky test smuggles in a hidden input that changes between runs -- a clock, a scheduler, a leftover row, a network round-trip -- so the same commit yields green on Tuesday and red on Wednesday. Datadog defines it precisely as a test that generates 'conflicting results across test runs, without any changes to the code.' Playwright pins the outcome label to the same idea: a result is 'flaky' when a test 'failed on the first run, but passed when retried.'
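To make the "hidden input" idea concrete, here is a minimal Python sketch. The function names (`is_weekend_hidden`, `is_weekend`) are hypothetical and do not come from any tool mentioned in this article. The sketch only illustrates one way to remove a hidden clock input: pass the time in as an explicit argument, so the check becomes a pure function again.

```python
from datetime import datetime


def is_weekend_hidden() -> bool:
    # Hidden input: reads the wall clock, so the result depends on
    # when the test happens to run, not on the code under test.
    return datetime.now().weekday() >= 5


def is_weekend(now: datetime) -> bool:
    # Explicit input: the same argument gives the same result on every run.
    return now.weekday() >= 5


def test_wednesday_is_not_weekend() -> None:
    # 2024-01-03 was a Wednesday.
    assert is_weekend(datetime(2024, 1, 3)) is False


def test_saturday_is_weekend() -> None:
    # 2024-01-06 was a Saturday.
    assert is_weekend(datetime(2024, 1, 6)) is True
```

A test that asserted on `is_weekend_hidden()` would be green on weekdays and red on weekends with no code change, while the two tests above give the same answer on any day.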
Why it matters
Flakiness is not a rare edge case you can wave away -- at scale it is the dominant source of red builds. Google has reported that almost 16% of its tests have some level of flakiness, more than one in seven, and that about 84% of observed transitions from pass to fail involve a flaky test rather than a real regression. Read that second number twice: most of the time a previously-green test goes red, the code is fine and the test is lying. If your team reflexively trusts every red build, you are spending the majority of your investigation time on noise.
It also gets worse the more realistic your test is. Trunk's analysis of 20.2 million CI jobs surfaced Google's own breakdown: 0.5% of small unit tests were flaky versus 1.6% of medium and 14% of large end-to-end tests. The more a test touches real systems -- a browser, a database, the network -- the more hidden inputs it inherits, and the flakier it gets. And tiny per-test rates compound: in Uber's Go monorepo, even a 0.1% individual flake rate meant roughly 63% of pull requests hit at least one flaky failure. That tax has a measurable cost: an ICST 2024 industrial case study found that handling flaky tests consumed at least 2.5% of productive developer time on a ~30-developer project.
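The compounding can be checked with a few lines of arithmetic. This is a sketch under two assumptions that are not in the cited sources: each test flakes independently, and a pull request runs about 1,000 tests. Under those assumptions a 0.1% per-test rate gives roughly a 63% chance that at least one test flakes, which is consistent with the figure quoted above.

```python
def p_at_least_one_flake(per_test_rate: float, tests_per_run: int) -> float:
    # Assumes every test flakes independently with the same probability.
    # P(at least one flake) = 1 - P(no test flakes).
    return 1.0 - (1.0 - per_test_rate) ** tests_per_run


# The suite sizes below are illustrative assumptions, not measured values.
for n in (100, 1000, 5000):
    print(n, round(p_at_least_one_flake(0.001, n), 3))
```

The same function shows why the problem grows with suite size: the per-test rate stays fixed, but the exponent does not.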
Picture the run history for a single commit in which run 2 and run 5 went red. What that history does not tell you is why. That is the whole problem with flakiness -- the assertion that failed (expected true, got false) is identical on every run, so the failure message carries no information about the hidden input that flipped. You cannot debug it from the red X alone. And it is usually baked in from the start: Datadog reports that more than 70% of flaky tests already exhibit flaky behavior the first time they are introduced. Flakiness is usually an authoring defect, not something a test acquires with age.
The industry's two escape hatches are retry and quarantine, and they are different tools. A retry re-runs a failing test up to N times in the same build and passes it if any attempt is green. Playwright keeps the receipt by labeling that outcome 'flaky' rather than 'passed.' Retries are opt-in: by default, failing tests are not retried (retries: 0). Quarantine instead detaches a known-bad test from the build status: in Datadog, a quarantined test keeps running in the background but its failures no longer affect CI status or break the pipeline. Both buy time. Neither removes the non-determinism -- and if you forget that, a quarantined test is just a bug you agreed to stop looking at.
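The retry labeling described above can be sketched in a few lines. This is an illustrative model, not Playwright's implementation; `run_with_retries` and `make_scripted_test` are hypothetical names invented for this example.

```python
from typing import Callable, List


def run_with_retries(test: Callable[[], bool], retries: int = 0) -> str:
    """Return 'passed', 'flaky', or 'failed' for one test.

    retries=0 models the default described above: a failing test is not re-run.
    """
    attempts = 1 + retries
    for attempt in range(attempts):
        if test():
            # Green on the first attempt is a plain pass; green only on a
            # later attempt is recorded as flaky, not silently as passed.
            return "passed" if attempt == 0 else "flaky"
    return "failed"


def make_scripted_test(results: List[bool]) -> Callable[[], bool]:
    # Hypothetical stand-in for a real test: replays a fixed list of outcomes.
    outcomes = iter(results)
    return lambda: next(outcomes)
```

For example, `run_with_retries(make_scripted_test([False, True]), retries=2)` returns `'flaky'`, while the same scripted test with the default `retries=0` returns `'failed'`. The label is the only trace that the first attempt went red, which is why it is worth keeping.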
How this shows up in a real BugMojo bug report
Here is the framing every other flaky-test guide skips. The standard advice ends at 'retry or quarantine, then go fix the non-determinism.' But the moment you decide not to blindly retry, you have a new problem: the failing run is gone. CI re-ran it, it went green, and the one execution that actually reproduced the race -- the stale token, the empty API response, the assertion that fired 12ms too early -- has evaporated. A flaky test is, at root, a reproducibility problem, not a CI-config problem. You cannot fix what you cannot re-observe.
That is the gap BugMojo closes. When a flaky failure shows up against a real browser session, the BugMojo extension captures the failing run with its surrounding state -- an rrweb session replay, the console output (including the assertion and any stack trace), and the exact network responses that fed the test. So instead of a one-line expected true, got false, the report shows the GET /api/cart call that returned an empty body on the run that flaked, next to a scrubbable replay of what the page was doing at that instant. Then the BugMojo MCP server hands that whole bundle to an AI agent (Claude Code, Cursor). The agent reads the actual non-deterministic state -- the race, the stale token, the empty response -- instead of guessing from the assertion. That is the difference between 'this test is flaky, retry it' and 'the assertion races the cart fetch; the run that failed got an empty cart, so await the response before asserting.'
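The "assertion races the cart fetch" diagnosis can be illustrated with a small asyncio sketch. Everything here is hypothetical: `CartPage`, `racy_check`, and `awaited_check` are invented names, `asyncio.sleep` stands in for the network round-trip, and none of it represents BugMojo output or a real test suite. It only shows why a fixed wait flakes and awaiting the response does not.

```python
import asyncio
from typing import List


class CartPage:
    """Hypothetical page model: the cart is empty until the fetch completes."""

    def __init__(self) -> None:
        self.items: List[str] = []

    async def load_cart(self, latency_s: float) -> None:
        # Stand-in for GET /api/cart; latency_s simulates network delay.
        await asyncio.sleep(latency_s)
        self.items = ["sku-123"]


async def racy_check(latency_s: float) -> bool:
    page = CartPage()
    task = asyncio.create_task(page.load_cart(latency_s))
    await asyncio.sleep(0.01)  # fixed wait: only enough if the fetch is fast
    ok = len(page.items) == 1  # may observe the empty cart
    await task  # let the background fetch finish before returning
    return ok


async def awaited_check(latency_s: float) -> bool:
    page = CartPage()
    await page.load_cart(latency_s)  # wait for the response itself
    return len(page.items) == 1
```

With a slow response, `asyncio.run(racy_check(0.2))` returns `False` while `asyncio.run(awaited_check(0.2))` returns `True`. The assertion is identical in both; only the hidden input, the fetch latency, decides the racy one.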
| Capability | BugMojo | Flake tooling (Datadog / Trunk / Playwright) |
|---|---|---|
| Detect and label a flaky result | — | ✓ |
| Quarantine a known-bad test off the critical path | — | ✓ |
| Retry config (retries: N) in the runner | — | ✓ |
| rrweb replay of the run that actually flaked | ✓ | — |
| Exact network responses + console captured with the failure | ✓ | Partial |
| Failing-run bundle handed to an AI agent over MCP | ✓ | — |
Frequently asked questions
Sources
- Test Flakiness -- One of the main challenges of automated testing: ~16% of Google's tests have some flakiness; ~84% of pass-to-fail transitions involve a flaky test — Google Testing Blog (2020-12-16)
- Flaky Tests at Google and How We Mitigate Them -- original source for the 84% pass-to-fail / quarantine-tool figures — Google Testing Blog (2016-05-27)
- What we learned from analyzing 20.2 million CI jobs -- Google small/medium/large flake rates (0.5% / 1.6% / 14%); Uber's Go monorepo and ~63% of PRs hitting a flaky failure — Trunk.io (2024-11-12)
- Flaky tests: their hidden costs and how to address flaky behavior -- defines a flaky test and notes over 70% show flaky behavior when first introduced — Datadog (2024-10-23)
- Retries -- Playwright docs: flaky = a test that 'failed on the first run, but passed when retried'; failing tests are not retried by default (retries: 0) — Microsoft / Playwright (2026)
- Flaky Tests Management -- Datadog docs: quarantined tests keep running but their failures do not affect CI status or break pipelines — Datadog (2026)
- Cost of Flaky Tests in CI: An Industrial Case Study -- handling flaky tests consumed at least 2.5% of productive developer time in a ~30-developer project — IEEE ICST 2024 (2024)
- How to reduce flaky test failures -- root-cause taxonomy of flaky tests (async waits, timeouts, time-of-day, concurrency, test-order dependency) — CircleCI (2024-12-23)

