Why AI QA agents find bugs your scripts miss

Scripted E2E tests check what you remembered to assert. An AI QA agent explores what you forgot. Here's the technical difference, and where it matters.

ai-qa, engineering, monito
monito

June 2, 2026

A Playwright test is a contract you wrote with your past self. I expect this button to exist. I expect that toast to say "Saved." I expect the URL to be /dashboard. When the contract is met, the test passes. When it's not, the test fails.

The contract is the strength of scripted testing. It's also the limit. (New to the idea of an agent doing the testing instead of a script? Start with the primer — this post is the deeper cut.)

The bugs we ship aren't usually contract violations. They're things we never wrote a contract for: the dropdown that opens behind the modal, the form that accepts whitespace as a valid password, the empty state that flashes for 200ms before the real content loads. Nobody wrote an assertion for any of that. So nobody catches it until a user does.

That's the gap an AI QA agent fills. Not by writing more contracts. By not needing one.

The contract model: what scripted tests are good at

If I ask Playwright to do this:

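import { test, expect } from "@playwright/test";

test("signup lands on the dashboard", async ({ page }) => {
  // Relative paths assume baseURL is set in playwright.config.
  await page.goto("/signup");
  await page.getByLabel("Email").fill("test@example.com");
  await page.getByLabel("Password").fill("Password123!");
  await page.getByRole("button", { name: "Create account" }).click();
  await expect(page).toHaveURL("/dashboard");
});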

I get an extremely precise check. Same inputs, same expected outputs, every time. If anything in that exact sequence breaks, I'll know.

This is great for two things:

  1. Regression detection on flows I already understand. Once I've decided what "the signup flow works" means, this test pins it in place.
  2. Deterministic CI gates. If the test passes locally and fails on main, something measurable changed. The signal is clean.

The contract is also why teams end up with 200+ Playwright tests and still find production bugs every Friday. Because the contract only covers what I thought to write down.

The exploration gap

Real bugs hide in the gap between "I asked for this" and "this is what's actually possible."

A handful of examples that I or our beta testers have hit just running the Monito agent against a normal SaaS app:

  • A "Save changes" button that stays enabled after a successful save, so frustrated users click it five times and create five database rows.
  • A password field with a 200-character limit on the client but a 60-character limit on the server. Passwords between 61 and 200 chars get accepted on signup, then rejected on login. The user is locked out of an account they thought they created.
  • A magic link email that arrives three minutes late. The token is technically valid for ten minutes, but the link expires at "now + 5 minutes" instead of "issued + 10." Tests with sub-second timing never catch it.
  • An empty state that renders briefly during pagination. "No results found" flickers between pages 1 and 2, then disappears. Looks broken. Isn't broken. Easy to fix, easy to miss.

None of these is hard to fix. Each is hard to find with a scripted test. You'd have to know to look for it. The whole point is that you didn't.

This is why scripted E2E suites — even huge, well-maintained ones — leak bugs. Not because the scripts are bad. Because the world is bigger than the scripts.
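To make one of those examples concrete, here is a minimal sketch of the password-length mismatch. The two validators and their limits are hypothetical stand-ins for the client and server rules described above, not code from a real app. Probing lengths on both sides of each limit shows where the two layers disagree.

```typescript
// Hypothetical limits taken from the example: the client allows up to
// 200 characters, the server only 60.
const CLIENT_MAX_PASSWORD_LENGTH = 200;
const SERVER_MAX_PASSWORD_LENGTH = 60;

function clientAccepts(password: string): boolean {
  return password.length >= 1 && password.length <= CLIENT_MAX_PASSWORD_LENGTH;
}

function serverAccepts(password: string): boolean {
  return password.length >= 1 && password.length <= SERVER_MAX_PASSWORD_LENGTH;
}

// Return the probed lengths for which the two layers give different answers.
function findDisagreements(lengths: number[]): number[] {
  return lengths.filter((n) => {
    const candidate = "a".repeat(n);
    return clientAccepts(candidate) !== serverAccepts(candidate);
  });
}

// Probe just below, at, and just above each limit.
const probeLengths = [1, 59, 60, 61, 199, 200, 201];
const mismatchedLengths = findDisagreements(probeLengths);
console.log(mismatchedLengths); // the lengths 61, 199 and 200
```

A scripted test that only ever types Password123! never lands in that range, which is why the mismatch survives a green suite.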

What an AI agent does differently

A QA agent doesn't start with a contract. It starts with an intent, expressed in plain English:

Test the signup flow. Try invalid emails, weak passwords, and what happens after a successful signup.

Then it does what a junior QA hire would do on day one: open the page, look at it, decide what to try, try it, look at the result, decide what to try next.

The mechanical pieces are:

  1. Perception. The agent reads the actual rendered page — DOM plus visual hints — to figure out what's on it.
  2. Planning. Given the intent and the page state, it picks a next action. Sometimes that's "fill the email field with a clearly invalid string." Sometimes it's "click the button labelled Sign up." Sometimes it's "wait two seconds and re-check."
  3. Execution. It drives a real Chromium browser. No simulation, no mock DOM. What the agent sees is what your users see.
  4. Evaluation. After each action, it asks: did anything visible go wrong? Did the URL change unexpectedly? Did an error toast appear? Did the request return 500?
  5. Branching. Edge cases come from the agent deciding to try them: empty strings, unicode, the same email twice, hitting back at an inconvenient moment.

That last point is the magic. The agent generates the test inputs as it goes. It's not bound to a list you wrote in advance.
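The five pieces above can be sketched as one loop. Everything here is an assumption made for illustration: the type names, the FakeSignupPage stand-in (so the sketch runs without a browser), and the planner, which in a real agent would be a model choosing the next input rather than a fixed list.

```typescript
// Every name below is hypothetical. This is a sketch of the loop, not Monito's API.
type Action =
  | { kind: "fill"; field: string; value: string }
  | { kind: "click"; target: string };

interface Observation {
  url: string;
  visibleErrors: string[];
}

interface StepRecord {
  action: Action;
  urlChanged: boolean;
  errors: string[];
}

interface PageDriver {
  observe(): Observation; // perception
  perform(action: Action): void; // execution
}

// Planning and branching: pick the next action from what is on screen and what was tried.
type Planner = (current: Observation, history: StepRecord[]) => Action | undefined;

function runSession(page: PageDriver, plan: Planner, maxSteps: number): StepRecord[] {
  const history: StepRecord[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const before = page.observe();
    const action = plan(before, history);
    if (action === undefined) break; // the planner has nothing left to try
    page.perform(action);
    const after = page.observe();
    // Evaluation: record what visibly changed so it can be judged afterwards.
    history.push({
      action,
      urlChanged: after.url !== before.url,
      errors: after.visibleErrors,
    });
  }
  return history;
}

// In-memory stand-in for a browser, with a deliberate bug:
// a password made only of spaces passes the length check.
class FakeSignupPage implements PageDriver {
  private url = "/signup";
  private errors: string[] = [];
  private fields: Record<string, string> = {};

  observe(): Observation {
    return { url: this.url, visibleErrors: [...this.errors] };
  }

  perform(action: Action): void {
    if (action.kind === "fill") {
      this.fields[action.field] = action.value;
      return;
    }
    // Any click is treated as submitting the form.
    const email = this.fields["email"] ?? "";
    const password = this.fields["password"] ?? "";
    if (!email.includes("@")) {
      this.errors = ["Enter a valid email"];
    } else if (password.length < 8) {
      this.errors = ["Password is too short"];
    } else {
      this.errors = [];
      this.url = "/dashboard";
    }
  }
}

// A fixed list keeps the demo deterministic; a real agent would generate these as it goes.
const scriptedInputs: Action[] = [
  { kind: "fill", field: "email", value: "not-an-email" },
  { kind: "fill", field: "password", value: "Password123!" },
  { kind: "click", target: "Create account" },
  { kind: "fill", field: "email", value: "test@example.com" },
  { kind: "fill", field: "password", value: " ".repeat(8) },
  { kind: "click", target: "Create account" },
];
const demoPlanner: Planner = (_current, history) => scriptedInputs[history.length];

const sessionLog = runSession(new FakeSignupPage(), demoPlanner, 20);
const lastStep = sessionLog[sessionLog.length - 1];
console.log(lastStep.urlChanged); // true: the whitespace-only password was accepted
```

The loop never asserts anything up front. It records what happened after each action, and the whitespace-only password reaching /dashboard shows up in the log as something to investigate.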

Where the agent model is genuinely better

A few situations where a scripted test will never compete, no matter how much effort you pour into it:

Edge cases you didn't enumerate. The agent will try the empty form. The whitespace password. The 500-character email. The double-click on submit. Not because anyone wrote those tests, but because that's how an exploratory tester behaves.
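As a rough illustration, the inputs named above could be written down like this. The list and the helper name are hypothetical. An agent would derive inputs from the page in front of it rather than from a hard-coded table, so treat this as a picture of the kind of values it tries.

```typescript
interface EdgeCase {
  field: "email" | "password";
  label: string;
  value: string;
}

// A hypothetical starter list of awkward signup inputs.
function signupEdgeCases(): EdgeCase[] {
  const domain = "@example.com";
  return [
    { field: "email", label: "empty", value: "" },
    // Pad the local part so the whole address is exactly 500 characters long.
    { field: "email", label: "500 characters", value: "a".repeat(500 - domain.length) + domain },
    { field: "email", label: "non-ASCII local part", value: "üser" + domain },
    { field: "password", label: "empty", value: "" },
    { field: "password", label: "whitespace only", value: " ".repeat(12) },
  ];
}

for (const edgeCase of signupEdgeCases()) {
  console.log(`${edgeCase.field}: ${edgeCase.label} (${edgeCase.value.length} chars)`);
}
```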

Flows you don't fully remember. Six months into a SaaS product, nobody on the team can list every reachable screen. The agent doesn't need a list. Ask it to test "the settings page" and it'll click through everything it finds.

UI you just changed. A redesign breaks half your Playwright selectors. The agent doesn't care that the button moved or got renamed. It cares that there's a button labelled something like "Save."
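A toy version of that difference, assuming a simple word-overlap score stands in for whatever perception a real agent uses (the matcher and its name are hypothetical): an exact-label lookup breaks when "Save" becomes "Save changes", while matching on intent still finds the button.

```typescript
// Split a label into lowercase word tokens: "Save changes" -> ["save", "changes"].
function tokenize(label: string): string[] {
  return label
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((token) => token.length > 0);
}

// Hypothetical matcher: pick the label that shares the most words with the intent.
function findByIntent(labels: string[], intent: string): string | undefined {
  const wanted = tokenize(intent);
  let best: string | undefined;
  let bestScore = 0;
  for (const label of labels) {
    const tokens = tokenize(label);
    const score = wanted.filter((word) => tokens.indexOf(word) !== -1).length;
    if (score > bestScore) {
      best = label;
      bestScore = score;
    }
  }
  return best;
}

// Button labels after a redesign renamed "Save" to "Save changes".
const redesignedButtons = ["Discard", "Save changes"];

const exactMatch = redesignedButtons.indexOf("Save") !== -1; // false: the old label is gone
const intentMatch = findByIntent(redesignedButtons, "save"); // "Save changes"
console.log(exactMatch, intentMatch);
```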

Bugs that depend on visible state. A spinner that never disappears. A modal that's behind another modal. A toast that says something different from what the API returned. Scripted tests check for specific text and specific selectors. The agent looks at the screen.

Where the agent model is worse (be honest about it)

If you're an SDET reading this and thinking "but...", you're right. There are things scripted tests do better, and pretending otherwise is how you lose your audience.

Determinism. A scripted test that passed yesterday will pass tomorrow if nothing changed. An agent reasoning about the page can disagree with itself. We engineer hard against that — same prompt, same browser version, same page state should produce the same verdict — but it's a different kind of guarantee.
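One way to put a number on that weaker guarantee is to repeat a run with identical inputs and measure how often the verdicts agree. The helper below is a hypothetical sketch of that measurement, not a description of how Monito does it.

```typescript
type Verdict = "pass" | "fail";

// Share of runs that agree with the majority verdict (1 means fully consistent).
function agreementRate(verdicts: Verdict[]): number {
  if (verdicts.length === 0) return 1;
  const passes = verdicts.filter((verdict) => verdict === "pass").length;
  const majority = Math.max(passes, verdicts.length - passes);
  return majority / verdicts.length;
}

// Four hypothetical repeats of the same prompt against the same page state.
const repeatedVerdicts: Verdict[] = ["pass", "pass", "fail", "pass"];
console.log(agreementRate(repeatedVerdicts)); // 0.75
```

A scripted test sits at 1 by construction when nothing changed. For an agent, that figure has to be measured.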

Speed. Playwright runs a tight signup test in 4–6 seconds. An agent run is closer to 30–60 seconds because there's a reasoning step between each browser action. You're not going to put 500 agent runs in a pre-commit hook.
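Worked through with the figures above, and assuming the runs execute one after another with no parallelism (a simplification), the gap looks like this:

```typescript
interface DurationRange {
  minSeconds: number;
  maxSeconds: number;
}

// Total wall-clock minutes for a number of serial runs.
function totalMinutes(runs: number, perRun: DurationRange): { min: number; max: number } {
  return {
    min: (runs * perRun.minSeconds) / 60,
    max: (runs * perRun.maxSeconds) / 60,
  };
}

const scriptedTotal = totalMinutes(500, { minSeconds: 4, maxSeconds: 6 });
const agentTotal = totalMinutes(500, { minSeconds: 30, maxSeconds: 60 });

console.log(scriptedTotal); // roughly 33 to 50 minutes
console.log(agentTotal); // 250 to 500 minutes
```

Parallel workers shrink both totals, but the ratio between them stays at roughly 5x to 15x per run.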

Cost per run. Each agent run does LLM work. Even with aggressive caching it's never free in the way expect(...).toBe(...) is free.

Precise assertions. "The dashboard chart must render exactly 12 bars" is a thing Playwright does trivially. An agent can be told to check it, but the verification is fuzzier.

The honest summary: scripts are for things you've decided matter. Agents are for things you don't yet know matter.

You want both. We've never claimed otherwise.

What this looks like in practice with Monito

A Monito session for the signup-flow prompt above looks like this:

  1. Agent opens /signup in a fresh Chromium browser.
  2. It identifies an email field, a password field, and a submit button.
  3. It tries to submit empty. Captures the validation error.
  4. It tries "not-an-email". Captures the validation error.
  5. It tries "a@b". Captures the result (sometimes that does pass — domain validation is harder than people think).
  6. It tries a valid email with a four-character password. Captures the password-strength error.
  7. It tries a valid combination. Captures the redirect, the dashboard URL, the welcome toast, and any console errors that fired.
  8. It tries the same email a second time. Captures the duplicate-email error path.

Every step is recorded: screenshots, the full network log, the console output, the agent's reasoning at each branch. If anything failed — even something the prompt didn't explicitly ask for, like a 500 on a sign-up tracking endpoint — it shows up in the report with a screenshot and reproduction steps.

The deliverable is the thing a developer needs to fix the bug. Not a green checkmark. Not "test 47 of 51 passed." A report.
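As a sketch of what such a report could be built from: the shapes below are hypothetical and are not Monito's actual report format. The point is that every recorded step gets scanned for server errors and console errors, including on requests the prompt never mentioned.

```typescript
// Hypothetical shapes for illustration only.
interface NetworkEntry {
  method: string;
  url: string;
  status: number;
}

interface SessionStep {
  description: string;
  screenshotPath: string;
  consoleErrors: string[];
  network: NetworkEntry[];
}

interface ReportItem {
  stepNumber: number;
  description: string;
  screenshotPath: string;
  problems: string[];
}

// Keep only the steps where something went wrong, with the evidence attached.
function buildReport(steps: SessionStep[]): ReportItem[] {
  const items: ReportItem[] = [];
  steps.forEach((step, index) => {
    const problems = [
      ...step.network
        .filter((entry) => entry.status >= 500)
        .map((entry) => `${entry.method} ${entry.url} returned ${entry.status}`),
      ...step.consoleErrors.map((message) => `console error: ${message}`),
    ];
    if (problems.length > 0) {
      items.push({
        stepNumber: index + 1,
        description: step.description,
        screenshotPath: step.screenshotPath,
        problems,
      });
    }
  });
  return items;
}

const recordedSteps: SessionStep[] = [
  { description: "Submit the empty form", screenshotPath: "shots/step-1.png", consoleErrors: [], network: [] },
  {
    description: "Submit a valid email and password",
    screenshotPath: "shots/step-2.png",
    consoleErrors: [],
    network: [
      { method: "POST", url: "/api/signup", status: 201 },
      // Hypothetical tracking endpoint that fails while the signup itself succeeds.
      { method: "POST", url: "/api/track/signup", status: 500 },
    ],
  },
];

const signupReport = buildReport(recordedSteps);
console.log(signupReport[0].problems);
```

Here the signup itself succeeded, so a scripted assertion on the redirect would pass. The failing tracking call still lands in the report, tied to the step and screenshot where it happened.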

When to reach for which

A pragmatic split that's worked for us and for teams running Monito alongside their existing Playwright suite:

  • Use scripted E2E for the 5–15 flows you've decided must never break. Login, checkout, the one or two paths your business literally depends on. Pin them with Playwright contracts.
  • Use the agent for everything else. New features the day they ship. Settings pages. Multi-step forms with conditional fields. Anything where writing a script would take longer than the feature itself.
  • Use the agent on every Vercel preview deploy. Catch the bugs your PR review missed, in the five minutes between merge and ship.

Don't pick. Use both for what they're good at.


Want to see what your app looks like through an agent's eyes? Run your first test on Monito — describe what to test in plain English, get a full session back with screenshots and a verdict. No code, no setup.