Playwright vs AI testing: the case against writing your own Playwright tests

I want to be careful here, because it's easy to read this title as a cheap shot at a great piece of software. It isn't. Playwright is the best browser-automation library anyone has ever shipped. The auto-waiting is genuinely smart, the API is clean, the tracing is a gift, and the team behind it has better engineering taste than most of the products built on top of it. I reach for it happily when I need it.

The argument I actually want to make is narrower and, I think, harder to dismiss: for most teams, writing and owning your own Playwright suite is the wrong default — not because Playwright is bad, but because the thing you're signing up to maintain is a liability that compounds, and you usually have a cheaper way to get the same confidence. The better the library, the easier it is to miss that you've taken on the bad trade.

This is an opinion piece. I'll tell you where I think the line is, and where Playwright is still the right answer.

The library is great. The suite is the liability.

Here's the sleight of hand that gets teams. "Playwright is excellent" is true. "Therefore we should write our E2E tests in Playwright" does not follow, because those two sentences are about different things. The first is about a tool. The second is about a long-term ownership commitment that you, specifically, are taking on.

A Playwright test is not Playwright. It's a stack of assumptions you wrote about your own application's structure and timing, expressed through a very good library. The library will execute those assumptions flawlessly. It will also execute them long after they've stopped being true, and tell you — correctly — that the page changed. That's not a Playwright failure. It's the suite doing its job. The job is just one you have to keep paying for.

And the bill has two parts, which most "Playwright vs X" comparisons blur together:

The maintenance bill. Every selector your test names is a hostage to your own refactors. Rename a button, restructure a form, ship a redesign, and the test breaks — not because the feature broke, but because the test was looking at the implementation. This is the cost self-healing tools exist to reduce, and the honest version of their pitch is that they patch selectors, not meaning.

The flakiness bill. This one's worse because it's silent. I wrote a whole piece on why E2E tests are flaky, and the short version is that flakiness is structural, not a discipline problem: the best empirical study of the topic (Luo et al., analyzing 201 flaky-test fixes) found async-wait timing issues are the single largest root cause, and Google — with effectively unlimited engineering resources — has reported that around 1.5% of their test runs flake and almost 16% of their tests have some flakiness. If Google can't engineer it away, your five-person team won't either.

But Playwright auto-waits — doesn't that fix the flakiness?

This is the fair objection, and it deserves a real answer, because Playwright's auto-waiting is genuinely one of its best features. Before it acts, Playwright runs a set of actionability checks and waits for them to pass — is the element visible, stable, enabled, does it actually receive events — and its web-first assertions auto-retry until the condition is met or a timeout fires. That kills a huge swath of the naive sleep(500) flakiness that plagued Selenium suites. Credit where it's due: this is the right design, and it's why a well-written Playwright test flakes far less than a badly-written one.

But notice the escape hatch in the same docs: the timeout. Auto-waiting waits up to a limit, and then it fails with a TimeoutError. So the flakiness doesn't disappear; it moves. Instead of "I didn't wait long enough," it becomes "the thing I was waiting for took longer than my timeout on a loaded CI runner this one time." You've traded a guess about how long to sleep for a guess about how long to allow, which is a better guess, but it's still a guess about the timing of an asynchronous system on hardware you don't control. And the moment a test gets hard (a third-party iframe, a debounced search, an animation, a race), engineers reach for force: true, which skips the non-essential actionability checks that were protecting them, or for a hard-coded waitForTimeout, which brings back the fixed sleep that auto-waiting was meant to replace. The safety rails are real, and they're also the first thing people unbolt when the test won't go green.

So: auto-waiting raises the floor. It doesn't change the shape of the deal. You're still encoding timing assumptions; Playwright just gives you better defaults for them.
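To make "the flakiness moves" concrete, here is a minimal sketch of a retry-until-timeout check running on a virtual clock. It is not Playwright's implementation; pollUntil, toastVisibleAfter, and the 4800 ms and 5200 ms arrival times are all hypothetical, chosen only to show the same check landing on either side of a 5000 ms budget.

```typescript
// Sketch of "retry until the condition holds or the timeout fires",
// using a virtual clock so the outcome is reproducible.
// NOT Playwright's implementation; every name here is hypothetical.

type PollResult = { passed: boolean; elapsedMs: number };

function pollUntil(
  condition: (nowMs: number) => boolean,
  timeoutMs: number,
  intervalMs = 100,
): PollResult {
  // Re-check the condition at each tick until the budget is spent.
  for (let nowMs = 0; nowMs <= timeoutMs; nowMs += intervalMs) {
    if (condition(nowMs)) {
      return { passed: true, elapsedMs: nowMs };
    }
  }
  return { passed: false, elapsedMs: timeoutMs };
}

// A condition that becomes true once the (simulated) toast has rendered.
const toastVisibleAfter = (readyAtMs: number) => (nowMs: number) => nowMs >= readyAtMs;

// Same test, same app, same 5000 ms budget: only the runner's load differs.
const quietRunner = pollUntil(toastVisibleAfter(4800), 5000);
const loadedRunner = pollUntil(toastVisibleAfter(5200), 5000);

console.log(quietRunner);  // passes at 4800 ms
console.log(loadedRunner); // fails: the budget ran out at 5000 ms
```

Nothing in the application changed between the two runs. The only difference is which side of the timeout the page happened to land on, which is the bet the test author made when picking the number.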

What you're actually buying when you write the test yourself

Strip an E2E test to its skeleton and it's four bets: this element will exist, within this long, and clicking it produces that result, within that long. Two structure bets, two timing bets. The structure bets are your maintenance bill. The timing bets are your flakiness bill. You are personally underwriting all four, forever, for every flow you cover.

Now ask what you were actually trying to buy. You didn't want a Playwright suite. You wanted to know that checkout still works before you ship. The suite was a means. And the means has a cost curve that bends the wrong way: every test you add compounds the flakiness math (a suite fails if any test flakes), and every feature you ship adds selectors to maintain. A growing product with a growing suite gets more expensive to test per unit of confidence over time, even with constant per-test quality. That's not a hypothetical; it's arithmetic, and it's why so many teams have a Playwright suite that everyone has quietly stopped trusting.
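The compounding can be sketched in a few lines. This assumes every test flakes independently at the same rate, which real suites only approximate, and the 1% per-test rate is an illustrative number rather than a measurement. The function name is mine.

```typescript
// Probability that at least one of n independent tests flakes in a run,
// given a per-test flake rate p: 1 - (1 - p)^n.
// Independence and a uniform rate are simplifying assumptions.
function suiteFlakeProbability(perTestFlakeRate: number, testCount: number): number {
  return 1 - Math.pow(1 - perTestFlakeRate, testCount);
}

// Illustrative 1% per-test flake rate across growing suite sizes.
for (const testCount of [10, 50, 100, 300]) {
  const pct = (suiteFlakeProbability(0.01, testCount) * 100).toFixed(1);
  console.log(`${testCount} tests at 1% each: ${pct}% of runs hit a flake`);
}
```

Under those assumptions a 100-test suite sees a flake in roughly 63% of runs, and a 300-test suite in roughly 95%, with no individual test getting any worse.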

Even Octomind — a company whose entire product was generating and maintaining Playwright tests for you, and who wrote some of the most honest engineering content in this space before they wound down in 2026 — built their business on the premise that owning the Playwright code by hand was a cost worth outsourcing. The market validated the problem even as the specific solution didn't survive.

The alternative isn't "AI magic." It's a different contract.

Here's where I have to be careful in the other direction, because the AI-testing pitch is full of people overclaiming. So let me undersell it precisely.

An AI QA agent doesn't fix flakiness with intelligence. It changes the contract. Instead of find selector → assert → timeout, the loop is observe the rendered page → decide → act, the way a human tester works. A person doesn't time out after exactly 5000ms; they watch the spinner and, if it's stuck, they call that the bug. They don't assert .success-toast exists; they read the page and judge whether the thing worked. Hand an agent the intent of a test instead of its steps, and the structure bets disappear (there's no selector to name) and the async-wait bets mostly dissolve (it waits like a person, by watching the page settle). The flake categories I mapped out in the flakiness piece line up directly: order-dependency goes away because each Test Run is an independent browser session, and concurrency bugs stay — which is correct, because those are real bugs you want surfaced, not retried into silence.

And the honest costs, because a post like this is worthless without them:

- An agent run is not byte-deterministic. The same prompt can take a slightly different path twice, because the model is reasoning, not replaying. For a check that must be exact and identical every run, that's a downside.
- It's slower and it costs per run. Tens of seconds, not seconds; a typical full run is 8–13 credits, roughly $0.08–$0.13. A Playwright test is faster and, after it's written, "free" to run, if you don't count the maintenance, which is the whole game.
- Determinism is a real virtue sometimes. When you need an exact, repeatable gate (the password must be rejected, the total must equal $43.20, every time, in milliseconds), a scripted test is the right tool. Keep a thin one for those.

Notice that last bullet is the same advice I'd give about Playwright: it's excellent for the job it's excellent at. My argument isn't "never write Playwright." It's "stop using a determinism tool for the 90% of your checks that are actually judgment calls about whether a flow works for a human."
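For a sense of scale on the per-run cost quoted above, here is a rough sketch. The $0.01-per-credit rate is only what the 8–13 credit and $0.08–$0.13 figures imply, and the workload of 20 runs a day for 30 days is a hypothetical I picked for illustration, not a benchmark.

```typescript
// Implied by the figures above: 8 credits is roughly $0.08, so about $0.01 per credit.
const USD_PER_CREDIT = 0.01;

// Monthly spend for a fixed number of agent runs per day.
function monthlyRunCostUsd(creditsPerRun: number, runsPerDay: number, days = 30): number {
  return creditsPerRun * USD_PER_CREDIT * runsPerDay * days;
}

// Hypothetical workload: one agent run on each of 20 preview deploys a day.
const low = monthlyRunCostUsd(8, 20);
const high = monthlyRunCostUsd(13, 20);
console.log(`$${low.toFixed(2)} to $${high.toFixed(2)} per month`);
```

The number to weigh that against is the engineer time spent each month repairing selectors and re-running flaky jobs, which varies too much by team for me to put a figure on it here.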

Where I'd still write Playwright

To keep myself honest, the cases where I reach for it and wouldn't reach for an agent:

- A two-or-three-flow invariant gate. The handful of checks that must be exact, fast, and identical on every commit. Thin scripted suite, owned deliberately. (This is the world where the generated code is the asset; Octomind's old "you own the code" model lived here too.)
- Deep, deterministic component or integration tests below the E2E layer, where you control the inputs completely and timing isn't a guess.
- A team with a real QA owner and a stable UI. If someone's job is the suite and the product isn't being redesigned every quarter, the maintenance bill is budgeted and the determinism is worth it. That's a legitimate, well-run setup; it's just not most early-stage teams.

What ties those together: a small, stable, deliberately-owned set of checks where determinism is the feature. The mistake is letting that set grow into a sprawling E2E suite covering every flow, because that's where the maintenance and flakiness bills compound past what the confidence is worth. For the broader landscape of what to use instead of a sprawling suite, I went through the options in "Playwright alternatives without the code."

The bigger point

The reason "Playwright is great, so write Playwright tests" is such a sticky idea is that the first half is so obviously true that you stop examining the second half. But "the tool is excellent" and "I should personally take on the long-term liability of a suite built with it" are independent claims. The tool's quality doesn't lower your maintenance bill. It just makes the bill easier to sign up for without reading it.

Write the thin deterministic suite. Own it on purpose. And for everything else — the long tail of "did this deploy break anything a user would notice" — describe the intent and let something that reads the page like a human go check, instead of encoding four bets per flow that you'll be paying off for as long as the product lives.

Try it on your flakiest flow tonight

Take the single Playwright test you trust least — you know which one — and hand its intent to an agent instead of its steps:

Test the checkout flow on https://staging.yourapp.com.

Log in as test@example.com / Password123!, add any product to the
cart, and complete checkout with the Stripe test card
4242 4242 4242 4242, any future expiry, any CVC.

Run this three times in a row as three separate passes. For each pass,
note how long the confirmation takes to appear and whether anything
behaves differently from the previous pass — slower loads, elements in
a different state, console errors, failed or duplicated network
requests. If any pass differs from the others, describe exactly what
differed and capture the evidence. Inconsistency between passes is the
finding I care about most.

Run it as a Test Scenario, or wire it into CI against every preview deploy. Either every pass is identical and your "flaky test" was a timing bet — or the passes differ, and you've just watched the flake reproduce as a real, evidenced, intermittent behavior in your application. Both answers are worth more than another re-run. First run's free; bring the test nobody on the team will admit they've stopped trusting.