How to test an AI chat UI: streaming, retries, and the weird parts

Every product grew a chat box this year. And almost nobody is testing it, because the standard playbook doesn't work: you can't write expect(response).toBe("...") against a model that never says the same thing twice.

So teams test the parts that are easy to assert — the input renders, the send button is enabled — and ship the parts that actually break: the stream that dies silently behind a proxy, the copy button that copies the previous message, the auto-scroll that pins you to the bottom while you're trying to read something three messages up.

This is the playbook we'd run against a chat UI, written as plain-English prompts. The trick that makes it tractable: you stop asserting on content and start asserting on behavior. The model's words are non-deterministic. Whether the UI streamed them, rendered them, and kept its own state straight is not.

Why chat UIs break differently

A chat interface looks like a form with a fancy textarea. Mechanically, it's one of the most stateful surfaces in your product:

The response arrives in pieces. Most chat UIs stream over server-sent events or a fetch stream — the browser holds a connection open and renders chunks as they land. The transport has its own failure modes: MDN's guide to server-sent events covers the mechanics, and the classic production bug lives one layer up — a proxy or load balancer that buffers the stream, so locally you get a smooth token flow and in production the answer slams in as one block after ten seconds of nothing. Same HTTP 200 either way.
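A minimal sketch of what "renders chunks as they land" means on the client, assuming standard text/event-stream framing (events separated by a blank line, payload on data: lines). The names SseParser and parseBlock are ours, not from any library, and the parser deliberately ignores fields other than event: and data:. The point is that a network chunk can end anywhere, including mid-event, so the client has to buffer until an event is complete.

```typescript
// Hypothetical helper: incremental parser for text/event-stream payloads.
// Simplified on purpose: only "event:" and "data:" fields are handled.

interface SseEvent {
  event: string; // "message" when the server sends no "event:" field
  data: string;
}

class SseParser {
  private buffer = "";

  // Feed one network chunk; returns only the events completed so far.
  push(chunk: string): SseEvent[] {
    // Normalize CRLF on the whole buffer so a "\r\n" split across
    // two chunks is still handled once the second half arrives.
    this.buffer = (this.buffer + chunk).replace(/\r\n/g, "\n");
    const events: SseEvent[] = [];
    let boundary = this.buffer.indexOf("\n\n");
    while (boundary !== -1) {
      const block = this.buffer.slice(0, boundary);
      this.buffer = this.buffer.slice(boundary + 2);
      const parsed = parseBlock(block);
      if (parsed !== null) events.push(parsed);
      boundary = this.buffer.indexOf("\n\n");
    }
    return events;
  }
}

function parseBlock(block: string): SseEvent | null {
  let event = "message";
  const dataLines: string[] = [];
  for (const line of block.split("\n")) {
    if (line.startsWith("event:")) {
      event = line.slice(6).trim();
    } else if (line.startsWith("data:")) {
      // A single leading space after the colon is not part of the value.
      const value = line.slice(5);
      dataLines.push(value.startsWith(" ") ? value.slice(1) : value);
    }
    // Comment lines (starting with ":") and other fields are ignored here.
  }
  return dataLines.length > 0 ? { event, data: dataLines.join("\n") } : null;
}
```

Pushing "data: Hel" returns nothing; pushing the rest of the event returns it whole. A buffering proxy changes none of this logic. It only changes when the chunks show up, which is why the bug is invisible to anything that inspects the parsed result rather than the timing.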

Errors hide inside successful responses. When a model call dies halfway through, the HTTP response often already returned 200 and started streaming. The failure arrives as an error event mid-stream — or as silence. A test that checks status codes sees nothing wrong. The user sees a paragraph that ends mid-sen
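One way to make that failure visible in your own client code is to classify how a stream ended instead of trusting the status code. This is a sketch under assumptions: it presumes your backend sends a terminal event named "done" and reports failures as an event named "error". Both names are hypothetical, so substitute whatever your protocol actually uses.

```typescript
// Hypothetical event shape and event names ("done", "error"); adjust to your protocol.
type StreamOutcome = "complete" | "errored" | "truncated";

interface StreamEvent {
  event: string;
  data: string;
}

// Classify a finished stream from the events received before the connection closed.
function classifyStream(events: StreamEvent[]): StreamOutcome {
  for (const e of events) {
    if (e.event === "error") return "errored"; // explicit mid-stream failure
    if (e.event === "done") return "complete"; // server said it finished
  }
  // The connection closed without a terminal event: the "silence" case.
  return "truncated";
}
```

The "truncated" branch is the one a status-code check can never see: the response was a 200, some tokens arrived, and then nothing said the answer was finished.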

The UI mutates constantly while you interact with it. Tokens append, the scroll position recalculates, buttons (stop, copy, regenerate) mount and unmount with the stream state. Race conditions between "user did something" and "stream did something" are the native bug class here, and they're exactly the kind of timing-dependent behavior that's miserable to pin down with selectors and waits — the same reason scripted tests miss whole bug classes elsewhere in your app.

The state machine has more states than anyone designed for. Idle, composing, sending, streaming, stopped-by-user, errored, retrying, rate-limited. Multiply by "user is offline," "user sent another message during the stream," and "user switched conversations mid-stream." Most chat bugs are an unhandled cell in that grid.
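To make the grid concrete, here is a sketch of the states above as an explicit transition table. The event names and the specific transitions are our assumptions for illustration, not a description of any particular chat UI; the useful part is the helper that lists every state/event pair nobody decided on.

```typescript
// States come from the list above; events and transitions are illustrative assumptions.
type ChatState =
  | "idle" | "composing" | "sending" | "streaming"
  | "stopped" | "errored" | "retrying" | "rateLimited";

type ChatEvent =
  | "type" | "send" | "firstToken" | "streamEnd" | "stop"
  | "fail" | "retry" | "rateLimit" | "cooldownOver";

const ALL_EVENTS: ChatEvent[] = [
  "type", "send", "firstToken", "streamEnd", "stop",
  "fail", "retry", "rateLimit", "cooldownOver",
];

const transitions: Record<ChatState, Partial<Record<ChatEvent, ChatState>>> = {
  idle: { type: "composing" },
  composing: { send: "sending" },
  sending: { firstToken: "streaming", fail: "errored", rateLimit: "rateLimited" },
  streaming: { streamEnd: "idle", stop: "stopped", fail: "errored" },
  stopped: { type: "composing", retry: "retrying" },
  errored: { type: "composing", retry: "retrying" },
  retrying: { firstToken: "streaming", fail: "errored", rateLimit: "rateLimited" },
  rateLimited: { cooldownOver: "idle" },
};

// Returns the next state, or "unhandled" when nobody designed that cell.
function nextState(state: ChatState, event: ChatEvent): ChatState | "unhandled" {
  return transitions[state][event] ?? "unhandled";
}

// Every (state, event) pair with no defined transition.
function unhandledCells(): Array<[ChatState, ChatEvent]> {
  const cells: Array<[ChatState, ChatEvent]> = [];
  for (const state of Object.keys(transitions) as ChatState[]) {
    for (const event of ALL_EVENTS) {
      if (transitions[state][event] === undefined) cells.push([state, event]);
    }
  }
  return cells;
}
```

In this sketch, nextState("streaming", "send") comes back "unhandled", which is exactly the "user sent another message during the stream" case. Not every unhandled cell is a bug, but each one is a decision somebody should have made on purpose.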

An agent-based check sidesteps the content problem because the agent reads the rendered page each run and judges behavior — did a response appear, did it stream incrementally, did the buttons do their job — without ever needing to know what the model would say. (If that execution model is new to you, AI QA testing explained is the primer.)

Prompt 1: the streaming happy path

Boring on purpose. If this fails, everything else is noise.

Go to https://staging.yourapp.com/chat and log in with
test@example.com / Password123!

Send the message: "Explain what this product does in about 150 words."

Verify that:
- A response begins to appear within a few seconds
- The response renders incrementally (visibly grows over time),
  not as a single block that appears all at once after a long pause
- While the response is streaming, a stop or cancel control is visible
- When the response finishes, the input is re-enabled and focused
- No errors appear in the console during the entire exchange

Report the time from sending to the first visible text.

That second check is the proxy-buffering bug. If the agent reports "the full answer appeared at once after 9 seconds," your streaming is broken in that environment even though every request succeeded — go look at what sits between the browser and your inference endpoint.
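If you want the same signal from your own instrumentation, the check reduces to arithmetic on chunk arrival times. A rough sketch, assuming you record a millisecond timestamp when the message is sent and another as each chunk arrives (for example with performance.now()). The function name and the 500 ms default are our assumptions, and the heuristic only makes sense for answers long enough that real streaming would take longer than that.

```typescript
// Hypothetical helper: summarize chunk arrival times for one response.
interface StreamTiming {
  timeToFirstChunkMs: number;
  streamDurationMs: number; // first chunk to last chunk
  looksBuffered: boolean;
}

// sentAt and chunkTimes are millisecond timestamps on the same clock.
function analyzeStream(
  sentAt: number,
  chunkTimes: number[],
  minSpreadMs: number = 500, // assumed threshold; tune for your answer lengths
): StreamTiming | null {
  if (chunkTimes.length === 0) return null; // nothing arrived at all
  const first = Math.min(...chunkTimes);
  const last = Math.max(...chunkTimes);
  const streamDurationMs = last - first;
  return {
    timeToFirstChunkMs: first - sentAt,
    streamDurationMs,
    // All text landing inside a very short window means it was not
    // delivered incrementally, whatever the server intended.
    looksBuffered: streamDurationMs < minSpreadMs,
  };
}
```

Chunks at 9000, 9010, and 9020 ms after sending give a 9-second wait and a 20 ms spread, so looksBuffered is true. Chunks at 400, 900, 1500, and 2600 ms give a 2.2-second spread and pass.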

Prompt 2: stop, then regenerate

The stop button is probably the least-tested control in your product. It's also one that users smash constantly.

Go to https://staging.yourapp.com/chat and log in.

Send: "Write a detailed 500-word explanation of HTTP caching."

While the response is still streaming, click the stop button.

Verify that:
- Streaming halts promptly and the partial text remains visible
- The UI returns to a state where you can act (input enabled,
  no spinner stuck on screen)
- No error toast or console error results from a user-initiated stop

Then trigger regenerate (or send "try again" if there is no
regenerate control).

Verify that a new response is produced, the conversation history
still shows the earlier messages in the right order, and the
stopped partial response is either clearly kept or clearly
replaced — not duplicated.

Failure modes this catches: the stop that detaches the UI but leaves the request running (watch the network log: if the stream stays open after the click, tokens are still being generated and billed), the stuck spinner that needs a refresh, and the regenerate that duplicates the user's message in history.
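The duplication bug usually comes from treating regenerate as "send the last user message again." A sketch of the alternative, assuming the "replace" policy (the stopped partial is dropped rather than kept); the Message shape and function names are hypothetical.

```typescript
// Hypothetical message shape for illustration.
type Role = "user" | "assistant";

interface Message {
  id: string;
  role: Role;
  text: string;
}

// History to resend for a regenerate: drop the trailing assistant reply
// (partial or complete) and keep the user's message exactly once.
function historyForRegenerate(history: Message[]): Message[] {
  const last = history[history.length - 1];
  if (last !== undefined && last.role === "assistant") {
    return history.slice(0, -1);
  }
  return history.slice();
}

// Detects the symptom: the same user text twice in a row.
function hasDuplicateUserTurn(history: Message[]): boolean {
  for (let i = 1; i < history.length; i++) {
    const prev = history[i - 1];
    const cur = history[i];
    if (
      prev !== undefined && cur !== undefined &&
      prev.role === "user" && cur.role === "user" &&
      prev.text === cur.text
    ) {
      return true;
    }
  }
  return false;
}
```

If your product keeps the stopped partial instead, the rule is the same in spirit: regenerate appends a new assistant message and never a new user message.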

Prompt 3: copy buttons and rendered markdown

Small features, outsized embarrassment. The copy button is a magnet for state bugs because it's usually wired to a message ID at render time and nobody re-checks it after streams, retries, and re-renders.

Go to https://staging.yourapp.com/chat and log in.

Send: "Show me a Python function that reverses a string, with a
short explanation."

After the response completes, verify that:
- Code renders in a code block, not as plain text with backticks
- The code block has its own copy control if your UI provides one

Use the copy button on the response, then paste into the input
field and compare. The pasted text must come from THIS response,
not an earlier message, and code must not have lost line breaks.

Then send a second message: "Now do it in JavaScript."
After it completes, copy the SECOND response and verify the
clipboard holds the JavaScript answer, not the Python one.

That last step is the classic: copy works fine with one message and breaks with two, because the handler closed over stale state.
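Here is a stripped-down illustration of that bug, with no framework involved. Both factory functions are hypothetical; they return the text that would be handed to the clipboard (in a browser, via navigator.clipboard.writeText) so the difference is easy to see.

```typescript
// Hypothetical message shape for illustration.
interface ChatMessage {
  id: string;
  text: string;
}

// Buggy: picks "the latest message" once, when the handler is created.
// Every later click copies that same message, however many have arrived since.
function makeStaleCopyHandler(messages: ChatMessage[]): () => string {
  const latest = messages[messages.length - 1]; // captured at render time
  return () => latest?.text ?? "";
}

// Safer: the handler receives the clicked message's id and looks up
// the current text at click time.
function makeCopyHandler(
  getMessages: () => ChatMessage[],
): (id: string) => string {
  return (id) => getMessages().find((m) => m.id === id)?.text ?? "";
}
```

With one message in the list, both handlers return the same text, which is why the bug survives a single-message smoke test. Add a second message and the stale handler keeps returning the first one.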

Prompt 4: the mid-stream interruption

The state-machine torture test. This is where chat UIs actually fall over.

Go to https://staging.yourapp.com/chat and log in.

Send: "Write a long, detailed history of web browsers."

While the response is streaming, immediately type a new message
"What year was Netscape founded?" and try to send it.

Observe and report what the UI does. Acceptable behaviors:
- The send is blocked with a clear affordance (disabled button,
  "wait for the response" hint), or
- The current stream is cleanly cancelled and the new message
  is sent

Unacceptable — flag as bugs:
- Both responses stream into the same bubble or interleave
- Messages appear out of order in the history
- The second message silently disappears
- The UI locks up and the input never re-enables

Afterwards, reload the page and verify the conversation history
matches what actually happened.

The reload at the end matters more than it looks: plenty of chat UIs keep a coherent story in client state and persist something else entirely. If history-after-reload disagrees with history-before-reload, you've found a server-side persistence or ordering bug, one users will hit every time they reload or switch devices.
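The before/after comparison itself is simple once both histories are captured as data. A sketch, assuming each history has been reduced to an ordered list of role and text; the HistoryEntry shape and the function name are ours.

```typescript
// Hypothetical shape: one entry per visible message, in display order.
interface HistoryEntry {
  role: "user" | "assistant";
  text: string;
}

// Index of the first entry where the two histories disagree, or -1 if they match.
function firstDivergence(before: HistoryEntry[], after: HistoryEntry[]): number {
  const shared = Math.min(before.length, after.length);
  for (let i = 0; i < shared; i++) {
    const b = before[i];
    const a = after[i];
    if (b === undefined || a === undefined || b.role !== a.role || b.text !== a.text) {
      return i;
    }
  }
  // Same prefix: they match only if neither side has extra messages.
  return before.length === after.length ? -1 : shared;
}
```

A result of -1 means the reload told the same story. Any other index points at the first message that was reordered, altered, or dropped on the way to storage.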

Prompt 5: scroll behavior

Nobody scripts this; everybody notices it.

Go to https://staging.yourapp.com/chat and log in.

Send: "Write a very long answer — at least 800 words — about the
history of databases."

While the response streams:
1. Confirm the view auto-scrolls to keep the newest text visible.
2. Scroll UP to re-read an earlier part of the conversation.
3. Verify the UI STOPS auto-scrolling while you're scrolled up —
   it must not yank you back to the bottom while text arrives.
4. Verify there's a way back down (a "jump to latest" control, or
   manually scrolling down re-engages following).

Also try this in a 375px-wide window and confirm the input bar
isn't covered by the streaming content or the keyboard area.

The "scroll hijack during streaming" bug is probably the single most common complaint about production chat UIs, and it's nearly impossible to express as a selector assertion. It's a one-line instruction to an agent.
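For reference, the behavior the prompt is checking usually comes down to one decision: follow the stream only while the user is already at the bottom. A minimal sketch of that logic, kept free of DOM calls so it reads on its own; scrollTop, clientHeight, and scrollHeight are the standard element properties, while FollowController and the 48px threshold are assumptions for illustration.

```typescript
// Values read from the scroll container (standard DOM element properties).
interface ScrollMetrics {
  scrollTop: number;
  clientHeight: number;
  scrollHeight: number;
}

// True when the viewport is within thresholdPx of the bottom of the content.
function isNearBottom(m: ScrollMetrics, thresholdPx: number = 48): boolean {
  const distanceFromBottom = m.scrollHeight - m.clientHeight - m.scrollTop;
  return distanceFromBottom <= thresholdPx;
}

// Hypothetical controller: decides whether new content should move the view.
class FollowController {
  private following = true;

  // Call from the container's scroll handler.
  onUserScroll(m: ScrollMetrics): void {
    // Scrolling away from the bottom pauses following;
    // scrolling back down re-engages it.
    this.following = isNearBottom(m);
  }

  // Call after each chunk is appended. Returns the scrollTop to apply,
  // or null to leave the user's position alone.
  onContentAppended(m: ScrollMetrics): number | null {
    return this.following ? m.scrollHeight - m.clientHeight : null;
  }
}
```

The hijack bug is what you get when onContentAppended ignores the following flag and always scrolls. The missing "way back down" is what you get when nothing ever sets it back to true.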

Reading the results

A failed Test Run gives you the full session: screenshot timeline, console output, the network log with the streaming request visible, and the agent's reasoning at each step (the run docs cover pulling session details). For streaming bugs, the network log is the money shot — you can see whether the stream stayed open, when it closed, and whether the UI's behavior matches what the wire actually did.

Two of the five prompts, the interruption and the stop/regenerate, are the ones we'd bet on to find a bug on the first run against a chat UI that hasn't been deliberately tested. They're also the two that are most painful to script, which is not a coincidence: both are timing-dependent, multi-state, and judged by "did the UI do something sane" rather than "does element X contain text Y."

Make it a standing check

Save each prompt as a Test Scenario in your Project and run the set against staging on every release, or wire them into CI on preview deploys. Five scenarios at roughly 8–13 credits each (about $0.08–$0.13 per scenario run) comes to about 40–65 credits, or $0.40–$0.65, for the full set. The prompts don't reference a single selector, so you can redesign the chat UI completely and they still run.

Here's the first one to paste in, ready to go:

Go to https://staging.yourapp.com/chat and log in with
test@example.com / Password123!

Send the message: "Explain what this product does in about 150 words."

Verify the response streams incrementally, a stop control is visible
during streaming, the input re-enables when it finishes, and no
console errors occur. Report the time to first visible text, and
flag anything that looks broken — including things I didn't ask about.

Your first run is free — point it at your chat page and see what it finds.