Why your E2E tests are flaky (and why rerunning them is making it worse)

Every engineering team I've worked with has had the same Tuesday: the deploy is ready, CI goes red, someone squints at the failure, says "oh, that one," and clicks re-run. Green. Ship it.

We talk about flaky tests like they're a hygiene problem, as if the team two doors down has a perfectly deterministic suite because they're more disciplined than you. They don't, and they aren't. Google, with effectively unlimited engineering resources, reported that about 1.5% of all test runs across their corpus report a flaky result, and that almost 16% of their tests (roughly one in six, written by their engineers) have some level of flakiness associated with them. Their own post calls the number staggering.

If Google can't engineer flakiness away, the interesting question isn't "how do we get to zero." It's: why is flakiness structural to this kind of testing, what does it actually cost, and which parts of it are telling you something?

This is a long one. It's the piece I wish someone had handed me the first time I was told to "just stabilize the suite."

What the data says actually causes flakes

The best empirical work on this is still Luo, Hariri, Eloussi, and Marinov's study from UIUC — they analyzed 201 commits that fix flaky tests across 51 open-source projects and classified the root causes. Three categories dominate:

Async wait: 45% of flaky-test fixes. The test asserts before the application has finished doing something. The page hasn't rendered, the request hasn't resolved, the animation hasn't released the button, but the assertion already ran. This is nearly half your flakes, and it exists because a script and an application are two concurrent systems pretending to be one synchronous one.

Concurrency: 20%. Data races, atomicity violations, and deadlocks, sometimes in the test, sometimes in the code under test. These are the flakes that are secretly bug reports, which we'll come back to, because it's the most important point in this post.

Test order dependency: 12%. The test passes alone and fails after some other test, because state leaked: a cookie, a database row, a singleton that test #37 mutated and never reset.

Notice what's not on the list: carelessness. Async wait isn't a typo. It's a bet about timing — "this will have loaded by now" — and every waitFor in your suite is a hand-tuned guess about how fast an asynchronous system will be on hardware you don't control. The guess is wrong some fraction of a percent of the time. That fraction is your flake rate.
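To make the bet concrete, here is a minimal Python sketch (standard library only) contrasting a fixed sleep with a polling wait. FakePage, fixed_sleep_check, and wait_until are hypothetical names invented for this illustration, not any test framework's API. Polling only narrows the bet: the timeout is still a guess, just an upper bound instead of a point estimate.

```python
import time
from typing import Callable


class FakePage:
    """Hypothetical stand-in for an app that becomes ready after a delay."""

    def __init__(self, ready_after: float) -> None:
        self._ready_at = time.monotonic() + ready_after

    def is_ready(self) -> bool:
        return time.monotonic() >= self._ready_at


def fixed_sleep_check(page: FakePage, guess: float) -> bool:
    """The timing bet: sleep for a guessed duration, then check exactly once."""
    time.sleep(guess)
    return page.is_ready()


def wait_until(condition: Callable[[], bool], timeout: float = 5.0,
               interval: float = 0.05) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    # Final check, in case the condition became true during the last sleep.
    return condition()


if __name__ == "__main__":
    # The page needs 0.5s; the script guessed 0.05s and loses the bet.
    print("fixed sleep:", fixed_sleep_check(FakePage(ready_after=0.5), guess=0.05))
    # Polling succeeds as soon as the page is ready, up to the 2s bound.
    print("polling:", wait_until(FakePage(ready_after=0.5).is_ready, timeout=2.0))
```

If the page ever takes longer than the timeout, the polling version fails too. That is the residual flake rate the rest of this post is about.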

Where the bets actually fail in a web app

The categories sound abstract until you map them onto a real frontend. Here's where I've personally watched each one bite:

The animation that isn't done animating. Your modal fades and slides in over 200ms. The element exists in the DOM at millisecond 10: visible, clickable, and still translating across the screen. The script clicks where the button was. On your laptop it works; on a loaded CI runner where the frame budget slips, it doesn't. This is the purest form of async-wait flake: the DOM said ready, the pixels said not yet.

The debounced search box. The test types "kindle" and asserts on results. The app waits 300ms after the last keystroke before firing the request. Your test's typing speed, the debounce window, and the API latency are three clocks running against each other, and the assertion wins or loses depending on which clock drifts on a given run.

The third-party script in the critical path. Analytics, chat widgets, payment iframes — anything loaded from someone else's CDN arrives in a different order on different runs. If any listener or layout shift depends on it, your test's timing assumptions inherit somebody else's infrastructure weather.

The clock itself. Tests that pass for three weeks and fail on the 1st of the month, or only between 11pm and midnight UTC, or only during daylight-saving transitions. Time is global mutable state, and almost every app reads it somewhere.
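One common way to take the clock out of the bet is to pass it in. The sketch below assumes a made-up rule, is_billing_day, that fires on the 1st of the month in UTC. The rule and the frozen helper are hypothetical, invented only to show the shape of the technique.

```python
from datetime import datetime, timezone
from typing import Callable, Optional


def utc_now() -> datetime:
    """The real clock, used when no test clock is injected."""
    return datetime.now(timezone.utc)


def is_billing_day(now: Optional[Callable[[], datetime]] = None) -> bool:
    """Hypothetical app rule: billing runs on the 1st of the month (UTC)."""
    current = (now or utc_now)()
    return current.day == 1


def frozen(year: int, month: int, day: int, hour: int = 0) -> Callable[[], datetime]:
    """Build a clock that always returns the same UTC moment."""
    moment = datetime(year, month, day, hour, tzinfo=timezone.utc)
    return lambda: moment


if __name__ == "__main__":
    # The test now pins the date instead of inheriting whatever day CI runs on.
    print(is_billing_day(frozen(2024, 3, 1)))       # True
    print(is_billing_day(frozen(2024, 3, 31, 23)))  # False
```

With the clock injected, "fails on the 1st of the month" becomes a case you write on purpose rather than one you discover by calendar.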

The leftover row. The signup test that fails only when the previous run half-completed and left test@example.com already registered. That's the order-dependency category wearing its most common costume: data that outlived its test.
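A minimal sketch of one fix for the leftover row, assuming your signup accepts plus-addressed emails: give every run its own address so a half-completed previous run cannot collide with this one. FakeUserStore is a hypothetical in-memory stand-in for the real backend, and unique_test_email is an illustrative helper, not part of any library.

```python
import uuid


def unique_test_email(prefix: str = "test") -> str:
    """Return an address that is effectively unique per call."""
    return f"{prefix}+{uuid.uuid4().hex[:12]}@example.com"


class FakeUserStore:
    """Hypothetical stand-in for a backend that rejects duplicate signups."""

    def __init__(self) -> None:
        self._emails = set()

    def signup(self, email: str) -> None:
        if email in self._emails:
            raise ValueError(f"{email} is already registered")
        self._emails.add(email)


if __name__ == "__main__":
    store = FakeUserStore()
    store.signup("test@example.com")  # left behind by a previous run
    try:
        store.signup("test@example.com")  # the fixed address collides
    except ValueError as exc:
        print("flake:", exc)
    store.signup(unique_test_email())  # a fresh address does not
    print("unique signup ok")
```

This trades the flake for accumulating test accounts, so it pairs best with a periodic cleanup job.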

Every one of these is an implementation detail your script bet on without you noticing you'd placed the bet. The page didn't lie to a human even once in any of these scenarios — a person watching the screen would have waited out the animation, seen the spinner, noticed the stale account. The script got lied to, because the script wasn't looking at the page; it was looking at the DOM through a keyhole.

The math that turns 0.1% into a broken pipeline

Here's the part that took me embarrassingly long to internalize. Suppose you do heroic work and get every single test to 99.9% reliability — each one fails spuriously once in a thousand runs. That sounds done, right?

A suite fails when any test fails. With independent tests, the suite passes only if all of them pass:

P(suite passes) = 0.999^N

N =  50 tests  →  95.1% green   (1 red build in 20)
N = 200 tests  →  81.9% green   (1 red build in 5.5)
N = 500 tests  →  60.6% green   (2 red builds in 5)

At 500 well-maintained, 99.9%-reliable tests, two out of five CI runs are false alarms. Nobody on the team did anything wrong — the suite rotted by arithmetic. Every test you add compounds the problem, which means a growing product with a growing suite gets flakier by default, even with constant per-test quality. Per-test quality has to improve continuously just to stand still.
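If you want to plug in your own numbers, the arithmetic fits in a few lines of Python. This sketch assumes what the formula above assumes: every test has the same reliability and tests fail independently. The function names are mine, chosen for illustration.

```python
def suite_pass_rate(per_test_reliability: float, n_tests: int) -> float:
    """P(suite passes) when every test must pass and failures are independent."""
    return per_test_reliability ** n_tests


def required_per_test_reliability(target_pass_rate: float, n_tests: int) -> float:
    """Invert the formula: per-test reliability needed to hit a suite target."""
    return target_pass_rate ** (1.0 / n_tests)


if __name__ == "__main__":
    for n in (50, 200, 500):
        print(f"N = {n:>3}: {suite_pass_rate(0.999, n):.1%} green")
    # N =  50: 95.1% green
    # N = 200: 81.9% green
    # N = 500: 60.6% green

    needed = required_per_test_reliability(0.95, 500)
    print(f"500 tests at 95% green need {needed:.5%} per test")
```

Run the inverse and the treadmill becomes visible: keeping a 500-test suite 95% green needs each test to fail spuriously only about once in ten thousand runs, roughly ten times better than the 99.9% we started with.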

This is why "we'll fix the flaky ones" is a treadmill and not a project. You can absolutely triage the worst offenders — in practice instability is concentrated in a handful of tests, and fixing those is worth it — but the asymptote is set by the model, not by your diligence.

And the cost compounds in a currency worse than CI minutes: trust. The first time someone re-runs a red build and it goes green, they learn a lesson — red doesn't mean broken. The tenth time, the whole team has internalized it, and now a genuinely broken build gets the same shrug-and-rerun as a flake. A test suite people don't believe isn't a safety net with holes in it; it's scenery. The entire point of the suite was to make a red build mean something, and flakiness spends that meaning down to zero a few percent at a time.

The standard mitigations, and what they really cost

Automatic retries. The industry's favorite. CI reruns failures once or twice; only "fails three times" counts as red. Google's post describes exactly this mechanism — the ability to re-run only failing tests, automatic re-runs on failure, and a flaky designation that only reports red after three consecutive fails — and, to their credit, calls the three-strikes approach "hardly a perfect solution," because it trains developers to ignore nondeterminism in their own tests. Sit with that for a second: the organization with the most sophisticated testing infrastructure on the planet, plus a dedicated team just for flakiness information, describes its own best mitigation as a trade-off it isn't happy with. That's not an indictment of Google. It's evidence about the problem.

Here's the sharper version of that concern: a retry is a filter that selectively deletes evidence of timing-dependent behavior. Remember that 20% of flakes are concurrency issues, and some of those live in your application. The checkout that double-charges when two requests race. The token validation with a window between the SELECT and the UPDATE — the exact bug class we walked through in testing magic-link auth. When that bug surfaces in CI as a once-in-thirty-runs failure, it is your test suite working better than designed — it caught something real that only happens under timing pressure. The retry throws that signal away and stamps the build green. You haven't stabilized the suite; you've muted the one alarm that was telling the truth.
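The size of that filter is easy to compute. A rough sketch, assuming each attempt fails independently with the same probability (a real race under CI load may well be correlated, so treat this as an order-of-magnitude estimate):

```python
def red_probability(failure_rate: float, attempts: int) -> float:
    """Chance a test reports red when red requires `attempts` straight failures.

    Assumes attempts are independent, which is a simplification.
    """
    return failure_rate ** attempts


if __name__ == "__main__":
    race = 1 / 30  # a real bug that surfaces once in thirty runs
    for attempts in (1, 2, 3):
        p = red_probability(race, attempts)
        print(f"{attempts} attempt(s): red once in {round(1 / p):,} runs")
```

Under three strikes, a once-in-thirty race shows up as red about once in 27,000 runs. At a few dozen CI runs a day, that is a bug you will not hear about from CI for years.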

Add more waits. Fixes the symptom test-by-test — Luo's data shows most async-wait flakes get fixed with waitFor-style patches — at the cost of slower suites and a new constant to mis-tune. The waits are guesses; the hardware changes; the guesses go stale.

Quarantine. Move flaky tests to a non-blocking lane. Honest in the short term, but quarantine without a paydown plan is where coverage goes to die — six months later the quarantined suite is the size of the real one and tests nothing anyone trusts.

Splitting and isolating CI jobs. Genuinely good — separate databases and backends per runner remove whole classes of state leakage. It attacks the 12% (order dependency) and some environment noise. It does nothing for the 45% (async wait), because that one lives inside the test model itself.
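At the smallest scale, isolation looks like this: each test builds its own store and throws it away. The sketch uses Python's built-in sqlite3 with an in-memory database as a stand-in for "a separate database per runner". The users table and the signup helper are assumptions made for the example.

```python
import sqlite3
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def fresh_database() -> Iterator[sqlite3.Connection]:
    """Yield a brand-new in-memory database that vanishes when closed."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE users (email TEXT PRIMARY KEY)")
        yield conn
    finally:
        conn.close()


def signup(conn: sqlite3.Connection, email: str) -> None:
    """Insert a user; raises sqlite3.IntegrityError on a duplicate email."""
    conn.execute("INSERT INTO users (email) VALUES (?)", (email,))
    conn.commit()


def run_signup_test() -> bool:
    """A signup test that owns its state, so run order cannot matter."""
    with fresh_database() as conn:
        signup(conn, "test@example.com")
        count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
        return count == 1


if __name__ == "__main__":
    # Same fixed email, run twice, passes both times: nothing leaks between runs.
    print(run_signup_test(), run_signup_test())
```

Nothing in this sketch touches timing, which is the point of the paragraph above: isolation removes leaked state, not timing bets.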

All four mitigations share a property worth naming: they manage the symptom (red builds) rather than the cause (scripts making timing bets against an asynchronous system). That's not a criticism — managing symptoms is sometimes the right engineering call — but you should know which one you're buying.

The heresy: maybe the script is the problem

Strip a scripted E2E test to its skeleton and it's a stack of bets:

Element #submit-btn will exist (structure bet)
...within 5 seconds (timing bet)
...and clicking it produces .success-toast (structure bet)
...within 3 seconds (timing bet)

The structure bets break when the UI changes — that's the maintenance problem, the one self-healing tools try to patch. The timing bets break nondeterministically under load — that's the flake problem. Same root: the script encodes assumptions about implementation details it doesn't control, then fails when reality drifts.

Now ask: how does a human tester — the gold standard neither problem applies to — run the same check? They look at the screen. If it's still loading, they wait, watching, until it isn't. They don't time out after exactly 5000ms; they notice the spinner is stuck and call it a bug. They don't assert .success-toast exists; they read the page and judge whether the thing worked. A human doesn't make timing bets. A human observes until the system settles, then evaluates the outcome.

That's the actual reason an agent-based model is interesting for flakiness — not "AI magic," just a different contract. An AI QA agent reads the rendered page each run and acts on what's actually there, the way a person would. There is no selector to go stale and no hand-tuned wait to mis-guess, because the agent's loop is observe → decide → act, not find → assert → timeout. The flake categories map directly:

Async wait (45%): largely dissolves — the agent waits like a human, by watching the page settle, and "this page never finished loading" stops being a false positive and becomes a finding.
Order dependency (12%): each Test Run is an independent browser session with no shared runner state.
Concurrency (20%): unchanged — and that's correct behavior, because those are real bugs. You want them surfaced, with evidence, not retried into silence.

Full honesty, because this post is useless without it: an agent run is not byte-deterministic either. The same prompt can take a slightly different path on two runs — the model is reasoning, not replaying. The difference is what a "failure" gives you. A flaky script gives you a red X, a stack trace pointing at a selector, and a re-run button. A failed Test Run gives you a Monito Session — screenshot timeline, console output, network log, and the agent's reasoning at each step — so "it failed weirdly this time" becomes something you read rather than something you re-roll. We've written about why that evidence-first model catches bugs scripts miss; the flakiness argument is the same coin, other face.

And scripted tests still have a place. Determinism is a real virtue when you need an exact, repeatable gate: the password must be rejected, the total must equal $43.20, every time, in milliseconds. Keep a thin scripted suite for that handful of invariants. The mistake is using a determinism tool for the large majority of checks, the ones that are actually judgment calls about whether a flow works.

What I'd actually do, in order

If your suite is flaky today, the priority order that respects the data:

First, stop auto-retrying silently. If you must retry, log every flake with its failure detail and review the log weekly. The concurrency flakes hiding in there are the cheapest bug reports you'll ever get — they found themselves.
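A minimal sketch of what "retry, but keep the evidence" could look like, assuming your runner lets you wrap a test in a function. run_with_logged_retries and the record fields are hypothetical. Most real runners have their own retry hooks, and the idea carries over to those.

```python
import time
from typing import Any, Callable, Dict, List


def run_with_logged_retries(
    name: str,
    test: Callable[[], None],
    flake_log: List[Dict[str, Any]],
    max_attempts: int = 3,
) -> bool:
    """Run `test` up to `max_attempts` times, recording every failed attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            test()
        except Exception as exc:  # AssertionError is an Exception subclass
            flake_log.append({
                "test": name,
                "attempt": attempt,
                "error": type(exc).__name__,
                "detail": str(exc),
                "timestamp": time.time(),
            })
        else:
            return True  # passed, but any earlier failures stay in the log
    return False


if __name__ == "__main__":
    calls = {"n": 0}

    def sometimes_fails() -> None:
        calls["n"] += 1
        if calls["n"] == 1:
            raise AssertionError("success toast not found")

    log: List[Dict[str, Any]] = []
    print("green:", run_with_logged_retries("checkout", sometimes_fails, log))
    print("flakes recorded:", len(log))
```

The build still goes green, but the first failure now lives in a record you can review weekly instead of disappearing.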

Second, fix the top offenders by category. Pull thirty days of flake history; it'll be concentrated in a few tests. For each, ask whether it is async wait (patch the wait, accept the treadmill), order dependency (isolate the state, which is a permanent fix), or unexplained (suspect the app, not the test, and dig in).
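Ranking the history is a few lines once the flakes are recorded somewhere. This sketch assumes each record is a dict with a "test" key, a format I made up for the example. Adapt it to whatever your CI actually exports.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def top_offenders(records: Iterable[Dict[str, str]],
                  limit: int = 5) -> List[Tuple[str, int]]:
    """Return the `limit` tests with the most recorded flakes."""
    counts = Counter(record["test"] for record in records)
    return counts.most_common(limit)


def share_of_flakes(records: List[Dict[str, str]], limit: int = 5) -> float:
    """Fraction of all recorded flakes caused by the top `limit` tests."""
    if not records:
        return 0.0
    top = sum(count for _, count in top_offenders(records, limit))
    return top / len(records)


if __name__ == "__main__":
    # Made-up history, purely for illustration.
    history = (
        [{"test": "checkout"}] * 6
        + [{"test": "login"}] * 3
        + [{"test": "search"}] * 1
    )
    print(top_offenders(history, limit=2))
    print(f"top 2 tests cause {share_of_flakes(history, limit=2):.0%} of flakes")
```

If the share for your top few tests comes out high, triage is worth the effort. If it comes out flat, the problem is the suite-size arithmetic, not a few bad tests.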

Third, do the arithmetic on your suite size. Multiply your per-test reliability out. If the math says your suite can't stay green at its size, no amount of discipline will save it — shrink the scripted suite to the deterministic invariants and move the judgment-call coverage to a model that doesn't make timing bets.

Fourth, make the flaky flows someone's job to actually watch. The flows that flake most are usually your most asynchronous — checkout, auth, anything with webhooks or emails. That's not a coincidence: the flakiest flows are flaky because they have the most concurrent moving parts, which is exactly why they're also where your worst production bugs live. Those deserve real eyes, scripted or not, on a schedule or in CI.

Hunt one flake tonight

Take your single flakiest scripted test — you know which one — and hand its intent to an agent instead of its steps. The pattern:

Test the checkout flow on https://staging.yourapp.com.

Log in as test@example.com / Password123!, add any product to the
cart, and complete checkout with Stripe test card 4242 4242 4242 4242.

Do this three times in a row, as three separate passes. For each pass,
note how long the order confirmation takes to appear and whether
anything behaves differently from the previous pass — slower loads,
elements that appear in a different state, errors in the console,
failed or duplicated network requests.

If any pass behaves differently from the others, describe exactly
what differed and capture the evidence. Inconsistency between passes
is the finding I care about most.

Run it as a Test Scenario (a full run is typically 8–13 credits, $0.08–$0.13). One of two things happens: every pass is identical and your script's flake was a timing bet — or the passes differ, and you've just watched your "flaky test" reproduce as a real, evidenced, intermittent behavior in your application. Either answer ends the Tuesday re-run ritual for that test.

First run's free. Bring your worst test — the one nobody on the team will say out loud that they've stopped trusting.