How to test an AI chat UI: streaming, retries, and the weird parts
How to test an AI chat UI when the output is never the same twice — streaming, stop/regenerate, copy buttons, scroll behavior, and message-order races, covered in five plain-English prompts.
Every product grew a chat box this year. And almost nobody is testing it, because the standard playbook doesn't work: you can't write expect(response).toBe("...") against a model that never says the same thing twice.
So teams test the parts that are easy to assert — the input renders, the send button is enabled — and ship the parts that actually break: the stream that dies silently behind a proxy, the copy button that copies the previous message, the auto-scroll that pins you to the bottom while you're trying to read something three messages up.
This is the playbook we'd run against a chat UI, written as plain-English prompts. The trick that makes it tractable: you stop asserting on content and start asserting on behavior. The model's words are non-deterministic. Whether the UI streamed them, rendered them, and kept its own state straight is not.
Why chat UIs break differently
A chat interface looks like a form with a fancy textarea. Mechanically, it's one of the most stateful surfaces in your product:
The response arrives in pieces. Most chat UIs stream over server-sent events or a fetch stream — the browser holds a connection open and renders chunks as they land. The transport has its own failure modes: MDN's guide to server-sent events covers the mechanics, and the classic production bug lives one layer up — a proxy or load balancer that buffers the stream, so locally you get a smooth token flow and in production the answer slams in as one block after ten seconds of nothing. Same HTTP 200 either way.
Errors hide inside successful responses. When a model call dies halfway through, the HTTP response often already returned 200 and started streaming. The failure arrives as an error event mid-stream — or as silence. A test that checks status codes sees nothing wrong. The user sees a paragraph that ends mid-sen
The UI mutates constantly while you interact with it. Tokens append, the scroll position recalculates, buttons (stop, copy, regenerate) mount and unmount with the stream state. Race conditions between "user did something" and "stream did something" are the native bug class here, and they're exactly the kind of timing-dependent behavior that's miserable to pin down with selectors and waits — the same reason scripted tests miss whole bug classes elsewhere in your app.
The state machine has more states than anyone designed for. Idle, composing, sending, streaming, stopped-by-user, errored, retrying, rate-limited. Multiply by "user is offline," "user sent another message during the stream," and "user switched conversations mid-stream." Most chat bugs are an unhandled cell in that grid.
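One way to make that grid concrete is to write it down as a transition table and list the empty cells. This is a minimal sketch, not anyone's real implementation: the state names come from the list above, while the event names, the `TransitionTable` type, and the example table are assumptions made up for illustration.

```typescript
// States taken from the list above.
type ChatState =
  | "idle" | "composing" | "sending" | "streaming"
  | "stopped-by-user" | "errored" | "retrying" | "rate-limited";

// Hypothetical event names for the three "multiply by" conditions.
type ChatEvent = "went-offline" | "sent-another-message" | "switched-conversation";

const STATES: readonly ChatState[] = [
  "idle", "composing", "sending", "streaming",
  "stopped-by-user", "errored", "retrying", "rate-limited",
];
const EVENTS: readonly ChatEvent[] = [
  "went-offline", "sent-another-message", "switched-conversation",
];

// Partial on both axes, so an unhandled cell can actually be represented.
type TransitionTable = Partial<Record<ChatState, Partial<Record<ChatEvent, ChatState>>>>;

// Returns every (state, event) pair that has no defined next state.
function unhandledCells(table: TransitionTable): Array<[ChatState, ChatEvent]> {
  const missing: Array<[ChatState, ChatEvent]> = [];
  for (const state of STATES) {
    for (const event of EVENTS) {
      if (table[state]?.[event] === undefined) missing.push([state, event]);
    }
  }
  return missing;
}

// A deliberately incomplete table: only three cells were ever designed.
const exampleTable: TransitionTable = {
  streaming: { "went-offline": "errored", "switched-conversation": "stopped-by-user" },
  sending: { "went-offline": "errored" },
};
```

Calling `unhandledCells(exampleTable)` lists the pairs nobody decided on, such as streaming plus "sent-another-message". Each of those is a candidate for a manual or agent-driven check.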
An agent-based check sidesteps the content problem because the agent reads the rendered page each run and judges behavior: did a response appear, did it stream incrementally, did the buttons do their job? It never needs to know what the model would say. (If that execution model is new to you, "AI QA testing explained" is the primer.)
Prompt 1: the streaming happy path
Boring on purpose. If this fails, everything else is noise.
That second check is the proxy-buffering bug. If the agent reports "the full answer appeared at once after 9 seconds," your streaming is broken in that environment even though every request succeeded — go look at what sits between the browser and your inference endpoint.
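If you'd rather confirm the buffering from a network capture than from a description, the arrival times of the response chunks are enough. The sketch below assumes you can export those timings as a list; the `ChunkArrival` shape, the function name, and both thresholds are assumptions picked for illustration, not measured values.

```typescript
// One received chunk: when it arrived (ms after the request was sent) and its size.
interface ChunkArrival {
  atMs: number;
  bytes: number;
}

interface StreamReport {
  shape: "incremental" | "single-block" | "empty";
  firstChunkMs: number | null; // time to first chunk, null if nothing arrived
  spreadMs: number;            // time between the first and last chunk
}

// Heuristic: a real stream delivers several chunks spread over time.
// A buffered one delivers everything in one burst, however long the wait was.
function classifyStream(
  chunks: readonly ChunkArrival[],
  minChunks = 3,      // assumed threshold
  minSpreadMs = 250,  // assumed threshold
): StreamReport {
  if (chunks.length === 0) {
    return { shape: "empty", firstChunkMs: null, spreadMs: 0 };
  }
  const times = chunks.map((chunk) => chunk.atMs);
  const first = Math.min(...times);
  const last = Math.max(...times);
  const spreadMs = last - first;
  const shape =
    chunks.length >= minChunks && spreadMs >= minSpreadMs ? "incremental" : "single-block";
  return { shape, firstChunkMs: first, spreadMs };
}
```

A response that lands as three chunks within two milliseconds of each other, nine seconds in, comes back as "single-block" with a `firstChunkMs` of 9000: the same symptom the agent would describe in words.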
Prompt 2: stop, then regenerate
The stop button is the least-tested control in your product. It's also one users smash constantly.
Failure modes this catches: the stop that detaches the UI but leaves the request running (watch the network log — the tokens keep billing), the stuck spinner that needs a refresh, and the regenerate that duplicates the user's message in history.
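The duplicated-message failure is the easiest of the three to check mechanically once you have the conversation history as data. A minimal sketch, assuming a hypothetical `ChatMessage` shape and a test that sends distinct prompts (a user who really does type the same thing twice would trip it too):

```typescript
// Hypothetical history shape; adapt the fields to whatever your API returns.
interface ChatMessage {
  id: string;
  role: "user" | "assistant";
  text: string;
}

// Returns the indexes of user messages whose text repeats the previous user message.
// After a stop + regenerate, a non-empty result suggests the prompt was re-added to history.
function repeatedUserTurns(history: readonly ChatMessage[]): number[] {
  const repeats: number[] = [];
  let previousUserText: string | null = null;
  history.forEach((message, index) => {
    if (message.role !== "user") return;
    if (message.text === previousUserText) repeats.push(index);
    previousUserText = message.text;
  });
  return repeats;
}
```

Run it on the history before and after the regenerate click. The interesting case is a clean result before and a hit after.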
Prompt 3: copy buttons and rendered markdown
Small features, outsized embarrassment. The copy button is a state bug magnet because it's usually wired to a message ID at render time and nobody re-checks it after streams, retries, and re-renders.
That last step is the classic: copy works fine with one message and breaks with two, because the handler closed over stale state.
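For readers who haven't been bitten by it, here is one way a stale closure can produce that symptom, stripped of any framework. Everything in it is hypothetical: `MessageStore` and both handler factories are invented for the illustration, and the handlers return the string they would copy rather than calling the browser's clipboard API, so the sketch runs anywhere.

```typescript
// A tiny stand-in for wherever the UI keeps message text.
class MessageStore {
  private readonly textById = new Map<string, string>();

  upsert(id: string, text: string): void {
    this.textById.set(id, text);
  }

  textOf(id: string): string {
    return this.textById.get(id) ?? "";
  }
}

// Fragile: captures the text as it was when the button rendered.
// Anything streamed or re-rendered afterwards is invisible to this handler.
function makeStaleCopyHandler(textAtRender: string): () => string {
  return () => textAtRender;
}

// Sturdier: captures only the message id and reads the text at click time.
function makeCopyHandlerById(store: MessageStore, messageId: string): () => string {
  return () => store.textOf(messageId);
}
```

Build both handlers while a message is still streaming, let the stream finish, then call them: the first returns the half-finished text and the second returns the full answer. In a real component the fix usually looks the same, which is to look the message up at click time instead of trusting what render captured.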
Prompt 4: the mid-stream interruption
The state-machine torture test. This is where chat UIs actually fall over.
The reload at the end matters more than it looks: plenty of chat UIs keep a coherent story in client state and persist something else entirely. If history-after-reload disagrees with history-before-reload, you've found a server-side ordering bug users will hit every time they switch devices.
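If you capture the visible history before and after the reload, the comparison itself is a few lines. This sketch assumes a hypothetical `HistoryEntry` shape and a strict notion of "same" (role and text must match exactly); if your product deliberately persists a stopped answer differently from how it displayed it, you'd loosen that.

```typescript
// Hypothetical shape for one visible message in the conversation.
interface HistoryEntry {
  role: "user" | "assistant";
  text: string;
}

// Returns the index of the first position where the two histories disagree,
// or null if they match entry for entry. A missing entry counts as a disagreement.
function firstDivergence(
  before: readonly HistoryEntry[],
  after: readonly HistoryEntry[],
): number | null {
  const length = Math.max(before.length, after.length);
  for (let i = 0; i < length; i++) {
    const a = before[i];
    const b = after[i];
    if (a === undefined || b === undefined || a.role !== b.role || a.text !== b.text) {
      return i;
    }
  }
  return null;
}
```

A null result means the reload told the same story. Any index is the first message to go and look at.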
Prompt 5: scroll behavior
Nobody scripts this, everybody notices it.
The "scroll hijack during streaming" bug is probably the single most common complaint about production chat UIs, and it's nearly impossible to express as a selector assertion. It's a one-line instruction to an agent.
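On the fix side, the usual pattern is small: follow the stream only if the reader was already at the bottom when the new content arrived. A minimal sketch of that decision as pure functions, using the standard DOM property names `scrollTop`, `clientHeight`, and `scrollHeight`; the 48-pixel threshold and the function names are assumptions.

```typescript
// The three numbers a scroll container exposes, named as in the DOM.
interface ScrollMetrics {
  scrollTop: number;    // how far the container is scrolled from the top
  clientHeight: number; // visible height of the container
  scrollHeight: number; // total height of the content
}

// True when the reader is at, or within a small threshold of, the bottom.
function isNearBottom(metrics: ScrollMetrics, thresholdPx = 48): boolean {
  const distanceFromBottom = metrics.scrollHeight - metrics.scrollTop - metrics.clientHeight;
  return distanceFromBottom <= thresholdPx;
}

// Decide where the container should sit after new tokens grow the content.
// Measure BEFORE appending: once the content has grown, "near bottom" is no longer true.
function nextScrollTop(
  beforeAppend: ScrollMetrics,
  newScrollHeight: number,
  thresholdPx = 48,
): number {
  if (!isNearBottom(beforeAppend, thresholdPx)) {
    return beforeAppend.scrollTop; // reader scrolled up: leave them alone
  }
  return Math.max(0, newScrollHeight - beforeAppend.clientHeight); // keep following the stream
}
```

The detail that tends to go wrong is the order of operations: if the check runs after the append, a reader who was following the stream looks "scrolled up" and the pin drops, or the code skips the check entirely and hijacks everyone.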
Reading the results
A failed Test Run gives you the full session: screenshot timeline, console output, the network log with the streaming request visible, and the agent's reasoning at each step (the run docs cover pulling session details). For streaming bugs, the network log is the money shot — you can see whether the stream stayed open, when it closed, and whether the UI's behavior matches what the wire actually did.
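When the streaming request in that log is server-sent events, you can classify how it ended from the raw body alone. The sketch below is a simplified reader of the SSE wire format (blank-line-separated events, `event:` and `data:` fields, `:` comment lines), not a full implementation of the spec. The event names `done` and `error` are assumptions: check what your backend actually emits, since some APIs signal completion with a special data payload instead.

```typescript
interface SseEvent {
  event: string; // "message" when the stream did not name the event
  data: string;
}

// Parse a captured SSE body. Only events terminated by a blank line are kept,
// so a final half-written event is dropped, as a browser would drop it.
function parseSse(raw: string): SseEvent[] {
  const blocks = raw.split(/\r?\n\r?\n/);
  blocks.pop(); // trailing piece is either empty or an unterminated event
  const events: SseEvent[] = [];
  for (const block of blocks) {
    let event = "message";
    const data: string[] = [];
    for (const line of block.split(/\r?\n/)) {
      if (line === "" || line.startsWith(":")) continue; // blank or comment line
      const colon = line.indexOf(":");
      const field = colon === -1 ? line : line.slice(0, colon);
      let value = colon === -1 ? "" : line.slice(colon + 1);
      if (value.startsWith(" ")) value = value.slice(1); // one leading space is stripped
      if (field === "event") event = value;
      else if (field === "data") data.push(value);
    }
    if (data.length > 0) events.push({ event, data: data.join("\n") });
  }
  return events;
}

type StreamOutcome = "completed" | "errored" | "truncated";

// "truncated" is the silent failure: the stream just stopped, with no done and no error.
function streamOutcome(
  events: readonly SseEvent[],
  doneEvent = "done",   // assumed event name
  errorEvent = "error", // assumed event name
): StreamOutcome {
  if (events.some((e) => e.event === errorEvent)) return "errored";
  if (events.some((e) => e.event === doneEvent)) return "completed";
  return "truncated";
}
```

Both "errored" and "truncated" can sit behind an HTTP 200, which is why the status code column in the network log tells you nothing here and the body tells you everything.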
Two of the five prompts — the interruption and the stop/regenerate — are the ones we'd bet find a bug on the first run against a chat UI that hasn't been deliberately tested. They're also the two that are most painful to script, which is not a coincidence: timing-dependent, multi-state, and judged by "did the UI do something sane" rather than "does element X contain text Y."
Make it a standing check
Save each prompt as a Test Scenario in your Project and run the set against staging on every release, or wire them into CI on preview deploys. Five scenarios at roughly 8–13 credits each (about $0.08–$0.13 per run) puts a full pass at roughly 40–65 credits, or about $0.40–$0.65. The prompts don't reference a single selector: redesign the chat UI completely and they still run.
Here's the first one to paste in, ready to go:
Your first run is free — point it at your chat page and see what it finds.