A product of Softech Infra.

© 2026 BugMojo. All rights reserved.

What Is a Flaky Test? Why Tests Fail Intermittently and How to Fix It

A flaky test passes and fails on the exact same code. Here is why tests fail intermittently, how to confirm and fix the non-determinism, and what to capture before you hit retry.

Manvi · Jun 5, 2026 · 6 min read
TL;DR
  • A flaky test passes and fails on the same code, with no change in between -- the result is non-deterministic.
  • It fails intermittently because it depends on something that varies between runs: timing, run order, shared state, or environment.
  • A flaky failure is usually not a real regression -- but it can hide a genuine race condition, so do not blindly retry.
  • Retry and quarantine unblock the pipeline; neither removes the non-determinism. Fix the cause, and capture the failing run before you move on.

Definition

A flaky test is an automated test that both passes and fails on the exact same code, with no change to the code or test in between. Its outcome is non-deterministic, driven by timing, run order, shared state, or environment rather than the behavior it claims to verify.

The word that matters is non-deterministic. A normal test is a pure function of the code: same input, same result, every time. A flaky test smuggles in a hidden input that changes between runs -- a clock, a scheduler, a leftover row, a network round-trip -- so the same commit yields green on Tuesday and red on Wednesday. Datadog defines it precisely as a test that generates 'conflicting results across test runs, without any changes to the code.' Playwright pins the outcome label to the same idea: a result is 'flaky' when a test 'failed on the first run, but passed when retried.'
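
To make the "hidden input" idea concrete, here is a minimal Python sketch. The function names (greeting, greeting_at) are made up for illustration and do not come from any library. The first version reads the wall clock, so its result depends on when it runs; the second takes the time as an explicit argument, which turns the test back into a pure function of the code.

```python
from datetime import datetime, timezone


def greeting() -> str:
    """Hidden input: reads the wall clock, so the result depends on when it runs."""
    hour = datetime.now(timezone.utc).hour
    return "Good morning" if hour < 12 else "Good afternoon"


def greeting_at(now: datetime) -> str:
    """Same logic, but the clock is an explicit argument the test controls."""
    return "Good morning" if now.hour < 12 else "Good afternoon"


def test_greeting_is_deterministic() -> None:
    # Pinning the input makes the outcome the same on every run.
    morning = datetime(2026, 6, 5, 9, 0, tzinfo=timezone.utc)
    evening = datetime(2026, 6, 5, 18, 0, tzinfo=timezone.utc)
    assert greeting_at(morning) == "Good morning"
    assert greeting_at(evening) == "Good afternoon"


test_greeting_is_deterministic()
```

A test that asserted greeting() == "Good morning" would pass before noon UTC and fail after it, with no change to the code.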

Why it matters

Flakiness is not a rare edge case you can wave away -- at scale it is the dominant source of red builds. Google has reported that almost 16% of its tests have some level of flakiness, more than one in seven, and that about 84% of observed transitions from pass to fail involve a flaky test rather than a real regression. Read that second number twice: most of the time a previously-green test goes red, the code is fine and the test is lying. If your team reflexively trusts every red build, you are spending the majority of your investigation time on noise.

It also gets worse the more realistic your test is. Trunk's analysis of 20.2 million CI jobs cites Google's own breakdown: 0.5% of small unit tests were flaky, versus 1.6% of medium tests and 14% of large end-to-end tests. The more a test touches real systems -- a browser, a database, the network -- the more hidden inputs it inherits, and the flakier it gets. And tiny per-test rates compound, because a pull request runs many tests and a single flake is enough to turn the build red: in Uber's Go monorepo, even a 0.1% individual flake rate meant roughly 63% of pull requests hit at least one flaky failure. That tax has a measurable cost: an ICST 2024 industrial case study found that handling flaky tests consumed at least 2.5% of productive developer time on a ~30-developer project.
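
The compounding can be checked with a few lines of arithmetic. This is a sketch under two assumptions that are not stated in the sources: tests flake independently of each other, and a pull request runs about 1,000 tests. That test count is chosen only because it reproduces the reported figure. Under those assumptions, the chance of at least one flaky failure is 1 - (1 - p)^n.

```python
def chance_of_flaky_failure(per_test_flake_rate: float, tests_run: int) -> float:
    """Probability that at least one of `tests_run` independent tests flakes."""
    return 1.0 - (1.0 - per_test_flake_rate) ** tests_run


# Assumption: 1,000 tests per pull request, each flaking independently 0.1% of the time.
p = chance_of_flaky_failure(0.001, 1000)
print(f"{p:.1%}")  # prints 63.2%
```

The same function shows why suites degrade as they grow: holding the per-test rate fixed, doubling the number of tests pushes the chance of a red build well past 80%.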

The four usual root causes
  1. Async timing / race conditions -- the assertion fires before the UI or API is ready. Replace fixed sleeps with explicit waits on a real condition.
  2. Order dependency -- test B only passes if test A ran first and left state behind. Each test should set up and tear down its own world.
  3. Shared / non-isolated state -- a leftover DB row, a global, a cached token. Isolate fixtures per test.
  4. Environment drift -- clock, time zone, locale, network latency, a slow CI box. Pin anything time- or locale-sensitive. (Taxonomy: CircleCI.)
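
The first cause is the easiest to show in code. Below is a minimal sketch of an explicit wait, using only the Python standard library. The helper wait_until and the FakeCart class are hypothetical stand-ins for whatever wait primitive and page object your test framework provides; they are not a real API.

```python
import threading
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout: float = 2.0, interval: float = 0.01) -> None:
    """Poll `condition` until it is true, or raise after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while not condition():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout}s")
        time.sleep(interval)


class FakeCart:
    """Hypothetical stand-in for a page whose data arrives asynchronously."""

    def __init__(self, load_delay: float) -> None:
        self.items: list[str] = []
        # Simulate a response that lands some time after the test starts.
        threading.Timer(load_delay, lambda: self.items.append("book")).start()


def test_cart_shows_item() -> None:
    cart = FakeCart(load_delay=0.05)
    # Flaky version: time.sleep(0.05); assert cart.items == ["book"]
    # That races the timer. Waiting on the real condition does not.
    wait_until(lambda: len(cart.items) > 0)
    assert cart.items == ["book"]


test_cart_shows_item()
```

The fixed sleep encodes a guess about how long the load takes. The explicit wait encodes the condition the assertion actually needs, and only gives up after a generous timeout.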
Flake rate rises with test realism (Google's suite)
  • Small (unit): 0.5% of tests flaky
  • Medium (integration): 1.6% of tests flaky
  • Large (end-to-end): 14% of tests flaky
Source: Trunk.io analysis of 20.2M CI jobs, 2024-11-12 (citing Google data)
Figure: one test run five times on the same code, each lane ending in a status node. Runs 1, 3 and 4 pass; runs 2 and 5 fail, and a retry loop arrow curves back to run 1.
The same test, the same code, five runs: pass, fail, pass, pass, fail. The outcome tracks a hidden input (here a race condition), not the behavior under test. A retry that re-runs the failures would paint this 'green' without changing why it flaked.

Notice what the diagram does not tell you: why run 2 and run 5 went red. That is the whole problem with flakiness -- the assertion that failed (expected true, got false) is identical on every failing run, so the failure message carries no information about the hidden input that flipped. You cannot debug it from the red X alone. And the defect is usually baked in from the start: Datadog reports that more than 70% of flaky tests already exhibit flaky behavior the first time they are introduced. Flakiness is mostly an authoring defect, not something a test acquires with age.

The industry's two escape hatches are retry and quarantine, and they are different tools. A retry re-runs a failing test up to N times in the same build and passes it if any attempt is green. Playwright keeps the receipt by labeling that outcome 'flaky' rather than 'passed,' and it does not retry failing tests by default (retries: 0). Quarantine instead detaches a known-bad test from the build status: in Datadog, a quarantined test keeps running in the background but its failures no longer affect CI status or break the pipeline. Both buy time. Neither removes the non-determinism -- and if you forget that, a quarantined test is just a bug you agreed to stop looking at.
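
The two mechanisms can be modeled in a few lines. This is an illustrative sketch of the semantics described above, not Playwright's or Datadog's implementation; the function names run_with_retries and build_is_green are hypothetical.

```python
from typing import Callable


def run_with_retries(test: Callable[[], None], retries: int = 0) -> str:
    """Return 'passed', 'flaky', or 'failed' for one test.

    'flaky' means it failed on the first run but passed on a retry.
    """
    for attempt in range(retries + 1):
        try:
            test()
        except AssertionError:
            continue  # this attempt failed; try again if any retries remain
        return "passed" if attempt == 0 else "flaky"
    return "failed"


def build_is_green(outcomes: dict[str, str], quarantined: set[str]) -> bool:
    """Quarantined tests still run, but their failures do not affect the build."""
    return all(
        outcome != "failed"
        for name, outcome in outcomes.items()
        if name not in quarantined
    )


calls = {"n": 0}


def fails_once() -> None:
    calls["n"] += 1
    assert calls["n"] > 1  # fails on the first call only


print(run_with_retries(fails_once, retries=2))  # prints flaky
```

Note what neither function does: nothing here inspects why the first attempt failed. The retry turns the build green and the quarantine hides the red, but the hidden input is untouched.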

How this shows up in a real BugMojo bug report

Here is the framing every other flaky-test guide skips. The standard advice ends at 'retry or quarantine, then go fix the non-determinism.' But the moment you decide not to blindly retry, you have a new problem: the failing run is gone. CI re-ran it, it went green, and the one execution that actually reproduced the race -- the stale token, the empty API response, the assertion that fired 12ms too early -- has evaporated. A flaky test is, at root, a reproducibility problem, not a CI-config problem. You cannot fix what you cannot re-observe.

That is the gap BugMojo closes. When a flaky failure shows up against a real browser session, the BugMojo extension captures the failing run with its surrounding state -- an rrweb session replay, the console output (including the assertion and any stack trace), and the exact network responses that fed the test. So instead of a one-line expected true, got false, the report shows the GET /api/cart call that returned an empty body on the run that flaked, next to a scrubbable replay of what the page was doing at that instant. Then the BugMojo MCP server hands that whole bundle to an AI agent (Claude Code, Cursor). The agent reads the actual non-deterministic state -- the race, the stale token, the empty response -- instead of guessing from the assertion. That is the difference between 'this test is flaky, retry it' and 'the assertion races the cart fetch; the run that failed got an empty cart, so await the response before asserting.'
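
The underlying idea, keeping the failing run's context before a retry can discard it, can be sketched generically. The code below is not BugMojo's extension or MCP API. RunRecorder is a hypothetical class, and the recorded fields are assumptions chosen to mirror the cart example above.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Optional


class RunRecorder:
    """Hypothetical recorder: collects context during a test and saves it on failure."""

    def __init__(self, test_name: str, out_dir: Path) -> None:
        self.test_name = test_name
        self.out_dir = out_dir
        self.events: list[dict[str, Any]] = []
        self.artifact: Optional[Path] = None

    def record(self, kind: str, **details: Any) -> None:
        self.events.append({"kind": kind, **details})

    def __enter__(self) -> "RunRecorder":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        if exc_type is not None:
            # Persist what the failing run saw before a retry can overwrite it.
            self.record("failure", error=repr(exc))
            self.artifact = self.out_dir / f"{self.test_name}.json"
            self.artifact.write_text(json.dumps(self.events, indent=2))
        return False  # never swallow the failure


out_dir = Path(tempfile.mkdtemp())
recorder = RunRecorder("cart_total", out_dir)
try:
    with recorder:
        body = ""  # stands in for the empty response on the run that flaked
        recorder.record("network", method="GET", url="/api/cart", body=body)
        assert body != "", "expected a non-empty cart"
except AssertionError:
    print(recorder.artifact.read_text())
```

The saved file pairs the assertion error with the response that caused it, which is the information the bare failure message lacks. A passing run writes nothing.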

Capability                                                    BugMojo   Flake tooling (Datadog / Trunk / Playwright)
Detect and label a flaky result                               No        Yes
Quarantine a known-bad test off the critical path             No        Yes
Retry config (retries: N) in the runner                       No        Yes
rrweb replay of the run that actually flaked                  Yes       No
Exact network responses + console captured with the failure   Yes       Partial
Failing-run bundle handed to an AI agent over MCP             Yes       No
Two-sided: BugMojo ships the failing run as an agent-readable artifact, but it does not manage your CI flake lifecycle.
Key takeaway

A flaky test is non-determinism, not a real regression -- but the same non-determinism can hide a real race. So do not just retry. Confirm it flips on the unchanged commit, then fix the hidden input (explicit waits, isolated state, pinned clock/locale). Retry and quarantine are a holding pattern; a captured failing run -- replay, console, network -- is what lets you, or an agent, tell test noise from a shipping bug.

Sources

  1. Test Flakiness -- One of the main challenges of automated testing: ~16% of Google's tests have some flakiness; ~84% of pass-to-fail transitions involve a flaky test — Google Testing Blog (2020-12-16)
  2. Flaky Tests at Google and How We Mitigate Them -- original source for the 84% pass-to-fail / quarantine-tool figures — Google Testing Blog (2016-05-27)
  3. What we learned from analyzing 20.2 million CI jobs -- Google small/medium/large flake rates (0.5% / 1.6% / 14%); Uber's Go monorepo and ~63% of PRs hitting a flaky failure — Trunk.io (2024-11-12)
  4. Flaky tests: their hidden costs and how to address flaky behavior -- defines a flaky test and notes over 70% show flaky behavior when first introduced — Datadog (2024-10-23)
  5. Retries -- Playwright docs: flaky = a test that 'failed on the first run, but passed when retried'; failing tests are not retried by default (retries: 0) — Microsoft / Playwright (2026)
  6. Flaky Tests Management -- Datadog docs: quarantined tests keep running but their failures do not affect CI status or break pipelines — Datadog (2026)
  7. Cost of Flaky Tests in CI: An Industrial Case Study -- handling flaky tests consumed at least 2.5% of productive developer time in a ~30-developer project — IEEE ICST 2024 (2024)
  8. How to reduce flaky test failures -- root-cause taxonomy of flaky tests (async waits, timeouts, time-of-day, concurrency, test-order dependency) — CircleCI (2024-12-23)
Manvi · QA Tester

Manvi is a Quality Assurance Tester with three years of experience. For her, quality is not just about finding bugs — it is about ensuring the best possible experience for every user.
