Google Research just dropped a paper that asks a question I’ve been mulling over for a while: how well do LLMs actually mirror human social behavior? Not just in terms of factual accuracy or safety guardrails, but in the messy, nuanced stuff—empathy, assertiveness, how they handle conflict at work or give advice on booking a trip.
The answer, based on their evaluation of 25 models? Not great. But the methodology is what’s interesting.
What they did
The team adapted established psychological questionnaires, such as the Interpersonal Reactivity Index (IRI) for empathy and the Emotion Regulation Questionnaire (ERQ), into Situational Judgment Tests (SJTs). SJTs are a format common in hiring and psychology. Here, each one is a realistic scenario with two possible actions, one leaning toward a trait (e.g., high empathy) and one against it.
Instead of asking the model directly “Are you empathetic?” (which is pointless, since models will just say yes), they present the scenario and let the model generate a natural response. Then an LLM-as-a-judge maps that response to one of the two actions. Separately, they had 550 human annotators rate the same scenarios, which gives a distribution of human preferences to compare the models against.
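To make the flow concrete, here is a minimal sketch of that loop as I understand it from the description above. Everything named here is my own illustration, not the paper's code: `evaluate_scenario`, the `generate` and `judge` callables, the "A"/"B" labels, and the idea of sampling several responses per scenario (`n_samples`) are all assumptions. The toy stand-ins exist only so the snippet runs without any API access.

```python
from collections import Counter
from typing import Callable


def evaluate_scenario(
    scenario: str,
    action_a: str,
    action_b: str,
    generate: Callable[[str], str],
    judge: Callable[[str, str, str, str], str],
    n_samples: int = 5,  # assumption: several responses are sampled per scenario
) -> Counter:
    """Tally which of the two actions the judge assigns to each free-form response."""
    votes: Counter = Counter()
    for _ in range(n_samples):
        # The model under test only sees the scenario, never the two options.
        response = generate(scenario)
        # The judge sees the response plus both options and returns "A" or "B".
        label = judge(scenario, response, action_a, action_b)
        if label not in ("A", "B"):
            raise ValueError(f"judge returned unexpected label: {label!r}")
        votes[label] += 1
    return votes


# Hypothetical stand-ins: a real setup would call an LLM in both places.
def toy_generate(scenario: str) -> str:
    return "I would ask my colleague how they are doing before raising the issue."


def toy_judge(scenario: str, response: str, action_a: str, action_b: str) -> str:
    # Crude keyword rule standing in for an LLM judge.
    return "A" if "how they are doing" in response else "B"


votes = evaluate_scenario(
    scenario="A teammate missed a deadline and looks stressed.",
    action_a="Check in on them first.",
    action_b="Point out the missed deadline right away.",
    generate=toy_generate,
    judge=toy_judge,
    n_samples=3,
)
print(votes)
```

The point of the structure is that the model never picks from a menu. It answers freely, and the forced choice happens only at the judging step.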
The gaps they found
Two interesting types of misalignment popped up:
- Models deviate from human consensus. When most humans agreed on the “right” behavioral response, models often picked something else. This isn’t about safety—it’s about social appropriateness. For example, in a conflict resolution scenario, a model might lean toward blunt assertiveness when most humans would choose a softer approach.
- Models don’t capture the range of human opinion. When humans were split on a scenario (no clear consensus), models tended to pick one side consistently. They flattened the diversity of human judgment into a single, often extreme, preference.
This second gap is the one I find more worrying. If models can’t reflect the natural variation in human social reasoning, they’ll feel robotic or tone-deaf in group settings or when advising on sensitive topics.
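One way to turn the two gaps into numbers, as a rough sketch rather than the paper's actual metrics: treat each scenario as a pair of fractions (share of humans choosing action A, share of model responses mapped to A). The 0.8 consensus threshold, the function name `summarize_gaps`, and the toy data are all my assumptions.

```python
from typing import List, Optional, Tuple


def summarize_gaps(
    items: List[Tuple[float, float]],
    consensus_threshold: float = 0.8,  # assumption: what counts as "most humans agreed"
) -> Tuple[Optional[float], Optional[float]]:
    """Return (consensus miss rate, mean distribution gap on split items).

    Each item is (human_p_a, model_p_a): the fraction of humans and of model
    responses that chose action A for one scenario.
    """
    consensus_total = 0
    consensus_misses = 0
    split_gaps = []
    for human_p_a, model_p_a in items:
        if max(human_p_a, 1 - human_p_a) >= consensus_threshold:
            # Gap 1: humans agree; does the model's majority choice match theirs?
            consensus_total += 1
            if (human_p_a >= 0.5) != (model_p_a >= 0.5):
                consensus_misses += 1
        else:
            # Gap 2: humans are split; how far is the model's split from theirs?
            # For two options, total variation distance reduces to |p - q|.
            split_gaps.append(abs(human_p_a - model_p_a))
    miss_rate = consensus_misses / consensus_total if consensus_total else None
    mean_gap = sum(split_gaps) / len(split_gaps) if split_gaps else None
    return miss_rate, mean_gap


# Made-up numbers for illustration only.
toy_items = [
    (0.9, 0.2),  # human consensus on A, model leans B -> miss
    (0.1, 0.0),  # human consensus on B, model agrees
    (0.5, 1.0),  # humans split evenly, model always picks A
    (0.6, 1.0),  # humans lean A slightly, model always picks A
]
miss_rate, mean_gap = summarize_gaps(toy_items)
print(miss_rate, mean_gap)
```

The last two toy rows are the flattening case: the model is not "wrong" in any single response, but its 100/0 split looks nothing like the humans' 50/50 or 60/40.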
Why this matters
We’re past the point where LLMs are just chatbots. They’re embedded in workflows, customer service, even therapy-adjacent apps. If they consistently misread social dynamics, the consequences aren’t just awkward—they can be harmful. A model that always defaults to “assertive” advice in a workplace dispute could escalate tension instead of de-escalating it.
The researchers are careful to call this an “early step,” and they’re right. The sample of 25 models is decent, but the scenarios are limited to everyday interactions and workplace situations. I’d love to see this extended to cross-cultural contexts, where behavioral norms vary wildly.
My take
This is a refreshing change from much of the alignment evaluation work I see, which tends to focus on refusal rates or jailbreaks. Behavioral alignment is harder to measure but arguably more important for real-world deployment. The framework is clever: using validated psych instruments gives it scientific grounding that many LLM evaluations lack.
That said, I’m skeptical about the LLM-as-a-judge step. If the judge model has its own behavioral biases, you’re compounding errors. The paper acknowledges this but doesn’t fully address it. Also, 10 annotators per SJT feels thin for capturing human diversity, especially on traits like empathy where cultural background matters.
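To put a number on "thin", here is a back-of-the-envelope check using a Wilson score interval for a binomial proportion. This is a standard statistical tool I'm bringing in myself, not something from the paper, and the 7-out-of-10 split is a made-up example.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95% coverage)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Hypothetical case: 7 of 10 annotators pick the high-empathy action.
low, high = wilson_interval(7, 10)
print(f"{low:.2f} to {high:.2f}")
```

With 7 of 10 annotators on one side, the interval runs from roughly 0.40 to 0.89. In other words, the same observed split is compatible with anything from a slight minority to near-consensus in the wider population, which is why I'd want more raters per scenario before reading much into the borderline cases.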
Still, this is the kind of work we need more of. Not just “can the model answer correctly?” but “does the model behave like a decent person?”