ConvApparel: Finally, a Way to Measure How Realistic Your AI User Simulators Really Are

Google Research just dropped ConvApparel, and honestly, it’s about time someone took a hard look at how we’re testing conversational AI.

We’ve all seen it: a chatbot that handles the first five turns like a champ, then forgets your preference for dark roast coffee and starts recommending tea. Or an assistant that asks clarifying questions perfectly until you hit turn 20, at which point it starts hallucinating your dietary restrictions. The problem isn’t just bad models—it’s bad testing.

For years, the go-to method for improving these systems has been live human testing. That’s the gold standard, sure, but it’s also a nightmare to scale. Expensive, slow, and you’re constantly wrangling participants. So the research community turned to LLM-based user simulators: agents instructed to roleplay as human users. In theory, they should be a cheap, scalable alternative. In practice? They’re often too polite, too patient, and too knowledgeable—basically the opposite of a real user who’s had a long day.

Ofer Meshi and Sally Goldman from Google Research saw this gap and built ConvApparel to measure it. The name is a bit clunky, but the idea is sharp: create a dataset and evaluation framework that quantifies exactly how unrealistic these simulators are, and then gives you a path to fix them.

The realism gap is real

Think about what happens when you put an LLM in the role of a user. These models are trained to be helpful assistants. They don’t naturally know how to be frustrated, impatient, or forgetful. They don’t say “I already told you that” or “just give me something, anything.” They’re too verbose, they lack a consistent persona, and they have this uncanny ability to recall every detail of a conversation from ten turns ago, something real users rarely manage.

If you train your conversational agent only against these pristine simulators, you’re setting yourself up for failure. Deploy that agent in the wild, and real users will eat it alive. It’s like training a pilot only in perfect weather with no crosswinds. The moment a bird hits the engine, they’re lost.

The counterfactual trick

Here’s where ConvApparel gets clever. The real challenge isn’t just building a simulator that matches training data—it’s building one that can react plausibly to novel situations. Specifically, what happens when your simulator encounters a bad agent? One that’s unhelpful, frustrating, or just broken?

Most simulators overfit to the agent they were trained on. Feed them a new, untested policy, and they’ll just repeat patterns from training data. That’s useless for testing.

ConvApparel introduces counterfactual validation. You take your simulator, throw it against an intentionally unhelpful agent (they call it the “Bad” agent), and see if it reacts like a real human would. Does it get annoyed? Does it give up? Does it ask to speak to a manager? If it just keeps being polite and helpful, you know it’s not learning real human behavior—it’s just mimicking.
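To make the idea concrete, here is a minimal sketch of what a counterfactual check could look like. It is not the paper's method: the keyword list, the function names, and the toy conversations are all my own assumptions, and a real study would rely on human annotation or a trained classifier rather than substring matching. The point is only the shape of the comparison: measure how much user behavior shifts between the Good and Bad conditions, for humans and for the simulator.

```python
# Hypothetical frustration cues (an assumption, not taken from the paper).
FRUSTRATION_CUES = ("already told you", "never mind", "forget it",
                    "not helpful", "give up")


def frustration_rate(conversations):
    """Fraction of user turns that contain at least one frustration cue.

    `conversations` is a list of conversations, each a list of user-turn strings.
    """
    turns = [turn.lower() for conv in conversations for turn in conv]
    if not turns:
        return 0.0
    hits = sum(any(cue in turn for cue in FRUSTRATION_CUES) for turn in turns)
    return hits / len(turns)


def counterfactual_shift(good_convs, bad_convs):
    """Extra frustration observed against the Bad agent versus the Good one."""
    return frustration_rate(bad_convs) - frustration_rate(good_convs)


# Toy user turns, invented purely for illustration.
human_good = [["I want a warm jacket", "great, thanks"]]
human_bad = [["I want a warm jacket", "I already told you that", "forget it"]]
sim_good = [["I am looking for a warm jacket", "that looks great, thank you"]]
sim_bad = [["I am looking for a warm jacket", "sure, here is my size again",
            "thank you for your help"]]

human_shift = counterfactual_shift(human_good, human_bad)
sim_shift = counterfactual_shift(sim_good, sim_bad)

print(f"human shift: {human_shift:.2f}, simulator shift: {sim_shift:.2f}")
```

In this toy data the humans get visibly more frustrated with the Bad agent while the simulator stays just as polite, which is exactly the failure the counterfactual test is meant to expose.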

Building the dataset

To make this work, the team collected a new dataset of human-AI conversations. Participants were randomly assigned to either a helpful “Good” agent or an intentionally unhelpful “Bad” agent. This gave them a wide spectrum of human behavior, from satisfaction to profound annoyance. On top of the dataset, they evaluate simulators using three pillars: population-level statistics, human-likeness scoring, and that counterfactual validation I just mentioned.
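For the population-level pillar, a rough sketch of the kind of comparison involved might look like the following. The specific statistics (words per turn, turns per conversation), the function names, and the toy data are assumptions of mine for illustration; the paper may well use different measures.

```python
from statistics import mean


def population_stats(conversations):
    """Summarize conversations, each given as a list of user-turn strings."""
    turn_lengths = [len(turn.split()) for conv in conversations for turn in conv]
    return {
        "mean_words_per_turn": mean(turn_lengths),
        "mean_turns_per_conversation": mean(len(conv) for conv in conversations),
    }


def stat_gaps(human, simulated):
    """Simulated minus human, per statistic. Zero would mean a perfect match."""
    h, s = population_stats(human), population_stats(simulated)
    return {name: s[name] - h[name] for name in h}


# Toy user turns, invented purely for illustration.
human = [["warm jacket", "cheaper"], ["red dress", "no", "ok fine"]]
simulated = [["I am looking for a warm jacket please",
              "Could you show me cheaper options"]]

gaps = stat_gaps(human, simulated)
for name, gap in gaps.items():
    print(f"{name}: {gap:+.2f}")
```

A positive gap on words per turn is the “too verbose” problem in numeric form: the simulated user writes full, polite sentences where the humans type two words and move on.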

The result is a benchmark that doesn’t just measure surface-level mimicry. It tests whether your simulator has actually learned what it means to be human in a conversation.

Why this matters for practitioners

If you’re building conversational agents—whether for customer support, recommendation systems, or anything else—you need realistic testing. ConvApparel gives you a way to quantify how far off your simulators are, and more importantly, a dataset to train them on.

I’ve seen too many teams ship agents that work fine in demo environments but fall apart in production. The reason is almost always the same: they tested against simulators that were too forgiving. ConvApparel won’t solve all your problems, but it will force you to confront the gap.

One caveat: the paper focuses on Conversational Recommender Systems (CRSs), which are a specific domain. The framework is general, but the dataset is tailored to recommendation tasks. If you’re building a troubleshooting bot or a therapy chatbot, you’ll need to adapt the approach.

Still, this is a solid step forward. The days of pretending your LLM-based simulator is realistic are over. Now you have a ruler to measure the gap. Go use it.
