What is a flaky test, in one sentence?

A flaky test is an automated test that both passes and fails on the exact same code, with no change to the code or the test in between. The result is non-deterministic: run it ten times and you might see eight greens and two reds. Because the source under test never changed, a flaky failure is not a real regression -- it is the test itself being unreliable, usually due to timing, shared state, or environment assumptions baked in when the test was written.

Why do tests fail intermittently instead of consistently?

Intermittent failure means the test depends on something that varies between runs rather than only on the code it claims to check. The usual culprits are async timing and race conditions (the assertion fires before the UI or API is ready), order dependency (test B only passes if test A ran first and left some state), shared or non-isolated state (a leftover database row, a global, a cached token), and environment drift (clock, time zone, locale, network latency, a slow CI machine). None of those are visible in the assertion itself, which is why the same test flips between pass and fail.
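
To make order dependency concrete, here is a minimal sketch. The `CART` list and the two `case_*` functions are hypothetical stand-ins for shared state and for tests in a real suite; the small `run` helper is just an illustrative harness, not any runner's actual behavior.

```python
# Hypothetical shared state: a module-level list standing in for a leftover
# database row, a global, or a cached token.
CART = []

def case_add_item():
    CART.append("book")
    assert CART == ["book"]

def case_cart_has_item():
    # Hidden assumption: case_add_item already ran and left "book" behind.
    assert "book" in CART

def run(cases):
    """Run cases in the given order from a clean slate; return {name: passed}."""
    CART.clear()
    results = {}
    for case in cases:
        try:
            case()
            results[case.__name__] = True
        except AssertionError:
            results[case.__name__] = False
    return results

forward = run([case_add_item, case_cart_has_item])
reverse = run([case_cart_has_item, case_add_item])
print("forward:", forward)
print("reverse:", reverse)
```

Neither function changed between the two runs; only the order did. Having each case build the state it needs, rather than inherit it, is what removes the dependency.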

How do you fix a flaky test?

First confirm it is actually flaky by re-running it on the unchanged commit; if it flips, it is flaky. Then find the source of non-determinism instead of papering over it: replace fixed sleeps with explicit waits on a real condition, make each test set up and tear down its own state so it does not depend on run order, and pin anything time- or locale-sensitive. Auto-retries (re-run on failure) and quarantine (keep it running but stop it failing the build) are useful triage to unblock the pipeline, but they hide the problem rather than fix it -- treat them as a holding pattern, not a cure.
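
As one way to picture "explicit waits on a real condition," the sketch below polls a condition with a bounded timeout instead of sleeping for a fixed interval. `wait_until` is a hypothetical helper written for this example, and the background timer simply simulates a page or API that becomes ready after about 50 ms; real test frameworks ship their own waiting primitives.

```python
import threading
import time

def wait_until(condition, timeout=2.0, interval=0.01):
    """Poll condition() until it is truthy; raise TimeoutError after timeout seconds."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout}s")
        time.sleep(interval)

# Simulated async work: a background timer marks the page ready after ~50 ms.
ready = threading.Event()
timer = threading.Timer(0.05, ready.set)
timer.start()

# Flaky style: time.sleep(0.03) followed by an assertion guesses at the timing.
# Stable style: wait for the condition itself, bounded by a timeout.
waited = wait_until(ready.is_set)
timer.join()
```

The timeout still matters: a wait with no upper bound turns a flaky failure into a hung build.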

What is the difference between retrying and quarantining a flaky test?

A retry re-runs a failing test up to N times in the same build and marks it green if any attempt passes; tools like Playwright label that outcome 'flaky' rather than 'passed' so you still see it happened. Quarantine instead lets the test keep running but detaches its result from the build status -- in Datadog, a quarantined test's failures no longer break the pipeline. Retries mask transient noise so one bad run does not block a merge; quarantine isolates a known-bad test off the critical path while someone fixes it. Both buy time; neither removes the underlying non-determinism.
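
The retry outcome labels can be sketched in a few lines. This is an illustration of the classification described above, not Playwright's implementation; `run_with_retries` and `scripted_test` are hypothetical names, and the scripted test just replays a fixed pass/fail sequence.

```python
def run_with_retries(test, retries=2):
    """Run test up to 1 + retries times and classify the overall outcome."""
    for attempt in range(1 + retries):
        try:
            test()
        except AssertionError:
            continue  # this attempt failed; try again if attempts remain
        # First-attempt pass is a clean pass; a later pass is labeled flaky.
        return "passed" if attempt == 0 else "flaky"
    return "failed"

def scripted_test(outcomes):
    """Build a fake test that passes or fails according to a fixed script."""
    script = iter(outcomes)
    def test():
        assert next(script)
    return test

clean = run_with_retries(scripted_test([True]))
flaky = run_with_retries(scripted_test([False, True]))
broken = run_with_retries(scripted_test([False, False, False]))
print(clean, flaky, broken)
```

Keeping "flaky" as its own label is the point: the build can go green while the report still records that a retry was needed.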

How common are flaky tests, really?

Common enough that large engineering orgs treat them as a permanent tax, not an edge case. Google has reported that almost 16% of its tests show some flakiness and that about 84% of pass-to-fail transitions involve a flaky test rather than a real regression. Flakiness also scales with how realistic a test is: in Google's own data, about 14% of large end-to-end tests were flaky versus 0.5% of small unit tests. At scale even a 0.1% per-test flake rate adds up -- in Uber's Go monorepo it meant roughly 63% of pull requests hit at least one flaky failure.

Can a flaky test hide a real bug?

Yes, and that is the dangerous part. An intermittent failure can be a genuine race condition or memory issue in the product that only surfaces under certain timing -- the same non-determinism that makes the test flaky also makes the underlying bug real but hard to reproduce. Blindly retrying until green can therefore paper over a shipping defect. The safe move is to capture the failing run's full context -- the exact inputs, console output, network responses, and a replay of what happened -- so you can decide whether you are looking at test noise or a real concurrency bug before you hit retry.
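
A rough sketch of "capture the context before you retry" follows. Everything here is hypothetical: `run_and_capture`, the shape of the `context` dictionary, and the sample payload are assumptions for illustration, not the API of any particular tool.

```python
import json
import traceback

def run_and_capture(test, context):
    """Run test(context); on failure return a JSON bundle of what the test saw."""
    try:
        test(context)
        return None
    except AssertionError as exc:
        bundle = {
            "test": test.__name__,
            "error": str(exc),
            "traceback": traceback.format_exc(),
            "inputs": context.get("inputs"),
            "console": context.get("console", []),
            "network": context.get("network", []),
        }
        return json.dumps(bundle, indent=2)

def cart_is_not_empty(context):
    body = context["network"][-1]["body"]
    assert body["items"], "cart came back empty"

# Hypothetical state recorded during the run that flaked.
failing_context = {
    "inputs": {"user": "demo"},
    "console": ["cart fetch started"],
    "network": [{"url": "/api/cart", "status": 200, "body": {"items": []}}],
}
report = run_and_capture(cart_is_not_empty, failing_context)
print(report)
```

With the bundle saved, the retry can go ahead without destroying the only evidence of why the first attempt failed.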

Glossary

What Is a Flaky Test? Why Tests Fail Intermittently and How to Fix It

A flaky test passes and fails on the exact same code. Here is why tests fail intermittently, how to confirm and fix the non-determinism, and what to capture before you hit retry.

Manvi · Jun 5, 2026 · 6 min read



Definition

A flaky test is an automated test that both passes and fails on the exact same code, with no change to the code or test in between. Its outcome is non-deterministic, driven by timing, run order, shared state, or environment rather than the behavior it claims to verify.

The word that matters is non-deterministic. A normal test is a pure function of the code: same input, same result, every time. A flaky test smuggles in a hidden input that changes between runs -- a clock, a scheduler, a leftover row, a network round-trip -- so the same commit yields green on Tuesday and red on Wednesday. Datadog defines it precisely as a test that generates 'conflicting results across test runs, without any changes to the code.' Playwright pins the outcome label to the same idea: a result is 'flaky' when a test 'failed on the first run, but passed when retried.'
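
A small sketch of a hidden input, using the wall clock as the example. `is_business_hours` is a hypothetical function invented for illustration; the technique shown is simply turning the hidden input into an explicit argument so a test can pin it.

```python
from datetime import datetime

def is_business_hours(now=None):
    """True between 09:00 and 17:00; reads the wall clock unless given a time."""
    if now is None:
        now = datetime.now()  # the hidden input
    return 9 <= now.hour < 17

# Flaky: the result depends on when the CI job happens to run.
#   assert is_business_hours()

# Deterministic: the clock is pinned, so the same commit always gives the same answer.
morning = is_business_hours(datetime(2026, 6, 5, 10, 30))
night = is_business_hours(datetime(2026, 6, 5, 22, 0))
print(morning, night)
```

The same move works for other hidden inputs: pass in the random seed, the locale, or the time zone rather than reading them from the environment.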

Why it matters

Flakiness is not a rare edge case you can wave away -- at scale it is the dominant source of red builds. Google has reported that almost 16% of its tests have some level of flakiness, roughly one in six, and that about 84% of observed transitions from pass to fail involve a flaky test rather than a real regression. Read that second number twice: most of the time a previously green test goes red, a flaky test is involved and the code is fine. If your team reflexively treats every red build as a regression, much of your investigation time goes to noise.

It also gets worse the more realistic your test is. Trunk's analysis of 20.2 million CI jobs surfaced Google's own breakdown: 0.5% of small unit tests were flaky versus 1.6% of medium and 14% of large end-to-end tests. The more a test touches real systems -- a browser, a database, the network -- the more hidden inputs it inherits, and the flakier it gets. And tiny per-test rates compound: in Uber's Go monorepo, even a 0.1% individual flake rate meant roughly 63% of pull requests hit at least one flaky failure. That tax has a measurable cost: an ICST 2024 industrial case study found that handling flaky tests consumed at least 2.5% of productive developer time on a ~30-developer project.
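
The compounding is plain probability. The sketch below assumes each test flakes independently and that a pull request runs about 1,000 tests; both are assumptions made here to show the arithmetic, not figures from the cited sources. Under those assumptions a 0.1% per-test rate lands close to the 63% quoted above.

```python
def chance_of_flaky_failure(per_test_flake_rate, tests_per_run):
    """Probability that at least one of N independent tests flakes in a run."""
    return 1 - (1 - per_test_flake_rate) ** tests_per_run

# Assumed suite sizes, for illustration only.
for tests_per_run in (10, 100, 1000):
    p = chance_of_flaky_failure(0.001, tests_per_run)
    print(f"{tests_per_run:>5} tests -> {p:.1%} chance of at least one flaky failure")
```

The per-test rate never changed; only the number of tests per run did, which is why large suites feel flaky even when every individual test looks nearly reliable.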

Flake rate rises with test realism (Google's suite)

| Test size | Share of tests that are flaky |
| --- | --- |
| Small (unit) | 0.5% |
| Medium (integration) | 1.6% |
| Large (end-to-end) | 14% |

Source: Trunk.io analysis of 20.2M CI jobs, 2024-11-12 (citing Google data)

The same test, the same code, five runs: pass, fail, pass, pass, fail. The outcome tracks a hidden input (here a race condition), not the behavior under test. A retry that re-runs the failures would paint this 'green' without changing why it flaked.

Notice what the diagram does not tell you: why run 2 and run 5 went red. That is the whole problem with flakiness -- the assertion that failed (expected true, got false) is identical on every failing run, so the failure message carries no information about the hidden input that flipped. You cannot debug it from the red X alone. And the problem is usually baked in from the start: Datadog reports that more than 70% of flaky tests already exhibit flaky behavior the first time they are introduced. Flakiness is usually an authoring defect, not something a test acquires with age.

The industry's two escape hatches are retry and quarantine, and they are different tools. A retry re-runs a failing test up to N times in the same build and passes it if any attempt is green -- Playwright keeps the receipt by labeling that outcome 'flaky' rather than 'passed.' Retries are opt-in there: by default, failing tests are not retried (retries: 0). Quarantine instead detaches a known-bad test from the build status: in Datadog, a quarantined test keeps running in the background but its failures no longer affect CI status or break the pipeline. Both buy time. Neither removes the non-determinism -- and if you forget that, a quarantined test is just a bug you agreed to stop looking at.
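
Quarantine is easy to picture as a filter on the build status. The sketch below is a generic illustration, not Datadog's implementation; `Result`, `build_status`, and the test names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Result:
    name: str
    passed: bool

def build_status(results, quarantined):
    """Return (status, blocking failures, ignored failures) for one build."""
    failures = [r.name for r in results if not r.passed]
    blocking = [name for name in failures if name not in quarantined]
    ignored = [name for name in failures if name in quarantined]
    return ("red" if blocking else "green"), blocking, ignored

# The quarantined test still ran and still failed; it just does not gate the build.
results = [Result("checkout_flow", False), Result("login", True)]
with_quarantine = build_status(results, quarantined={"checkout_flow"})
without_quarantine = build_status(results, quarantined=set())
print(with_quarantine)
print(without_quarantine)
```

Returning the ignored failures alongside the status keeps the quarantined test visible, which is what separates quarantine from simply deleting the test.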

How this shows up in a real BugMojo bug report

Here is the framing every other flaky-test guide skips. The standard advice ends at 'retry or quarantine, then go fix the non-determinism.' But the moment you decide not to blindly retry, you have a new problem: the failing run is gone. CI re-ran it, it went green, and the one execution that actually reproduced the race -- the stale token, the empty API response, the assertion that fired 12ms too early -- has evaporated. A flaky test is, at root, a reproducibility problem, not a CI-config problem. You cannot fix what you cannot re-observe.

That is the gap BugMojo closes. When a flaky failure shows up against a real browser session, the BugMojo extension captures the failing run with its surrounding state -- an rrweb session replay, the console output (including the assertion and any stack trace), and the exact network responses that fed the test. So instead of a one-line expected true, got false, the report shows the GET /api/cart call that returned an empty body on the run that flaked, next to a scrubbable replay of what the page was doing at that instant. Then the BugMojo MCP server hands that whole bundle to an AI agent (Claude Code, Cursor). The agent reads the actual non-deterministic state -- the race, the stale token, the empty response -- instead of guessing from the assertion. That is the difference between 'this test is flaky, retry it' and 'the assertion races the cart fetch; the run that failed got an empty cart, so await the response before asserting.'
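
The "assertion races the cart fetch" diagnosis can be reproduced in miniature. This sketch uses `asyncio` with a hypothetical `fetch_cart` coroutine standing in for GET /api/cart; the 20 ms delay and the payload are assumptions for illustration, and nothing here is BugMojo's API.

```python
import asyncio

async def fetch_cart():
    """Stand-in for GET /api/cart: resolves after a short delay."""
    await asyncio.sleep(0.02)
    return {"items": ["book"]}

async def racy_check():
    cart = {"items": []}

    async def load():
        cart.update(await fetch_cart())

    task = asyncio.create_task(load())
    await asyncio.sleep(0)       # yields once, but the fetch needs about 20 ms
    seen = list(cart["items"])   # reads state before the response lands
    await task                   # let the fetch finish so nothing is left pending
    return seen

async def awaited_check():
    cart = await fetch_cart()    # wait for the response, then read it
    return list(cart["items"])

racy = asyncio.run(racy_check())
stable = asyncio.run(awaited_check())
print("racy saw:", racy)
print("awaited saw:", stable)
```

In this sketch the race is rigged to lose every time; in a real suite the fetch sometimes wins, which is exactly what makes the test flaky rather than simply broken.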

| Capability | BugMojo | Flake tooling (Datadog / Trunk / Playwright) |
| --- | --- | --- |
| Detect and label a flaky result | No | Yes |
| Quarantine a known-bad test off the critical path | No | Yes |
| Retry config (retries: N) in the runner | No | Yes |
| rrweb replay of the run that actually flaked | Yes | No |
| Exact network responses + console captured with the failure | Yes | Partial |
| Failing-run bundle handed to an AI agent over MCP | Yes | No |

Two-sided: BugMojo ships the failing run as an agent-readable artifact, but it does not manage your CI flake lifecycle.


Sources

Test Flakiness -- One of the main challenges of automated testing: ~16% of Google's tests have some flakiness; ~84% of pass-to-fail transitions involve a flaky test — Google Testing Blog (2020-12-16)
Flaky Tests at Google and How We Mitigate Them -- original source for the 84% pass-to-fail / quarantine-tool figures — Google Testing Blog (2016-05-27)
What we learned from analyzing 20.2 million CI jobs -- Google small/medium/large flake rates (0.5% / 1.6% / 14%); Uber's Go monorepo and ~63% of PRs hitting a flaky failure — Trunk.io (2024-11-12)
Flaky tests: their hidden costs and how to address flaky behavior -- defines a flaky test and notes over 70% show flaky behavior when first introduced — Datadog (2024-10-23)
Retries -- Playwright docs: flaky = a test that 'failed on the first run, but passed when retried'; failing tests are not retried by default (retries: 0) — Microsoft / Playwright (2026)
Flaky Tests Management -- Datadog docs: quarantined tests keep running but their failures do not affect CI status or break pipelines — Datadog (2026)
Cost of Flaky Tests in CI: An Industrial Case Study -- handling flaky tests consumed at least 2.5% of productive developer time in a ~30-developer project — IEEE ICST 2024 (2024)
How to reduce flaky test failures -- root-cause taxonomy of flaky tests (async waits, timeouts, time-of-day, concurrency, test-order dependency) — CircleCI (2024-12-23)
