I’ve been following the AI reproducibility problem for a while now, and it’s one of those issues that everyone acknowledges but few try to fix. Google Research just published something that gets at a root cause: how we collect human ratings for benchmarks.
The short version is that most benchmarks use way too few raters per item. The standard of 1 to 5 raters? Often not enough. And the reason isn’t laziness — it’s budget. Human annotation is expensive, and researchers have to choose between rating lots of items with few people, or fewer items with more people.
The forest vs. tree problem
The paper, “Forest vs Tree: The (N,K) Trade-off in Reproducible ML Evaluation,” frames this as a choice between breadth and depth. The forest approach: ask 1,000 different people to each rate one item. You get a broad overview but zero insight into disagreement on any specific item. The tree approach: ask 20 people to rate the same 50 items. You get much richer signal per item but cover far less ground.
Historically, AI evaluation has leaned heavily toward the forest. Most papers I read use 1-3 raters per example and call it a day. The assumption is that each item has a single “correct” label that a majority vote (or a similar aggregate) will recover. But that assumption collapses the moment you deal with inherently subjective tasks like toxicity detection, hate speech, or content moderation.
What they actually did
The team built a simulator based on real-world datasets involving subjective tasks. They ran a massive stress test varying two levers:
- N (Scale): Total items rated, from 100 to 50,000
- K (Crowd): Raters per item, from 1 to 500
They tested thousands of configurations and measured statistical reliability at the p < 0.05 significance level. The goal was to find which combinations produce reproducible results, meaning two different research teams running the same evaluation would get the same answer.
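To make the two levers concrete, here is a toy Monte Carlo sketch of that kind of sweep. It is not the paper's simulator. The Beta(2, 2) disagreement model, the two hypothetical models A and B, and their 80% and 75% agreement rates with the population consensus are all assumptions I picked for illustration. It annotates the same benchmark twice with fresh raters and counts how often both runs name the same winner.

```python
import random

def majority_label(p_positive, k, rng):
    """Majority vote of k simulated raters (use an odd k to avoid ties)."""
    votes = sum(rng.random() < p_positive for _ in range(k))
    return 1 if 2 * votes > k else 0

def make_benchmark(n_items, rng, acc_a=0.80, acc_b=0.75):
    """Build a toy benchmark: (p_positive, pred_a, pred_b) per item.

    Assumption: p_positive ~ Beta(2, 2), so many items are contested.
    Assumption: each model matches the population consensus with a fixed rate.
    """
    items = []
    for _ in range(n_items):
        p = rng.betavariate(2, 2)
        consensus = 1 if p > 0.5 else 0
        pred_a = consensus if rng.random() < acc_a else 1 - consensus
        pred_b = consensus if rng.random() < acc_b else 1 - consensus
        items.append((p, pred_a, pred_b))
    return items

def winner(items, k, rng):
    """Annotate once with k raters per item and report the higher-scoring model."""
    score_a = score_b = 0
    for p, pred_a, pred_b in items:
        gold = majority_label(p, k, rng)
        score_a += pred_a == gold
        score_b += pred_b == gold
    if score_a == score_b:
        return "tie"
    return "A" if score_a > score_b else "B"

def replication_agreement(n_items, k, trials=100, seed=0):
    """Share of trials in which two independent annotation runs pick the same winner."""
    rng = random.Random(seed)
    items = make_benchmark(n_items, rng)
    agree = 0
    for _ in range(trials):
        agree += winner(items, k, rng) == winner(items, k, rng)
    return agree / trials

if __name__ == "__main__":
    # Small (N, K) grid; the real study covers a far wider range.
    for n_items in (100, 500):
        for k in (1, 5, 21):
            print(n_items, k, replication_agreement(n_items, k))
```

The numbers it prints depend entirely on the assumed distributions, so treat them as a way to build intuition about N and K, not as a reproduction of the paper's results.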
What they found
The results are pretty striking. The optimal ratio of raters to items depends heavily on the subjectivity of the task, but the general pattern is clear: more raters per item almost always beats more items with fewer raters, up to a point. For highly subjective tasks like hate speech detection, you need on the order of 10 to 20 raters per item to get stable results. For more objective tasks, 5 to 10 may be enough.
This is higher than I expected. Most benchmarks I’ve seen use 3 raters and call it a gold standard. The paper shows that with 3 raters, you’re essentially measuring noise, not signal, for any task with meaningful disagreement.
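A quick back-of-the-envelope check shows why small crowds are so fragile. This is my own arithmetic, not a calculation from the paper, and it assumes raters vote independently with the same probability `p_positive` of flagging the item. Under that assumption, the chance that two separate crowds of K raters give an item the same majority label follows directly from the binomial distribution.

```python
from math import comb

def p_majority_positive(p_positive, k):
    """Probability that a strict majority of k independent raters votes positive."""
    if k % 2 == 0:
        raise ValueError("use an odd k so the vote cannot tie")
    need = k // 2 + 1
    return sum(
        comb(k, j) * p_positive**j * (1 - p_positive) ** (k - j)
        for j in range(need, k + 1)
    )

def relabel_agreement(p_positive, k):
    """Probability that two independent crowds of k raters agree on the majority label."""
    q = p_majority_positive(p_positive, k)
    return q * q + (1 - q) * (1 - q)

if __name__ == "__main__":
    # An item that 60% of raters would flag, labeled by crowds of growing size.
    for k in (1, 3, 5, 11, 21):
        print(k, round(relabel_agreement(0.6, k), 3))
```

For an item that 60% of raters would flag, two independent 3-rater crowds land on the same majority label only about 54% of the time, barely better than the 52% you get with a single rater each.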
The practical takeaway
For anyone building benchmarks or running evaluations, the framework gives you a way to calculate the optimal N and K for your specific budget and task. They released an open-source simulator so you can run your own numbers.
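Before reaching for the simulator, it helps to see which (N, K) pairs your budget allows at all. The helper below is hypothetical and is not part of the released tool. It assumes a flat price per rating (in cents, to keep the arithmetic in integers) and simply lists how many items you can cover at each candidate K.

```python
def feasible_configs(budget_cents, cost_cents_per_rating, k_options):
    """List (n_items, k_raters) pairs that fit the budget at a flat per-rating cost."""
    total_ratings = budget_cents // cost_cents_per_rating
    configs = []
    for k in sorted(k_options):
        n_items = total_ratings // k
        if n_items > 0:  # skip K values the budget cannot fund for even one item
            configs.append((n_items, k))
    return configs

if __name__ == "__main__":
    # Example: a $1,000 budget at $0.50 per rating.
    for n_items, k in feasible_configs(100_000, 50, [1, 3, 5, 10, 20]):
        print(f"K={k:>2} raters per item -> N={n_items} items")
```

Each pair it returns is a candidate design you could then stress-test for reproducibility, instead of defaulting to 3 raters because that is what everyone else does.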
The uncomfortable truth is that many existing benchmarks are less reliable than we think. That leaderboard you’re looking at? If the underlying ratings were collected with 3 raters per item, the rankings might flip if you reran the annotation with a different crowd.
None of this is new in other fields: survey methodology has known about rater effects for decades. But applying it systematically to ML evaluation is overdue. Google’s contribution here isn’t revolutionary in theory, but it’s desperately needed in practice.
One thing I wish they’d addressed more directly: the cost implications. More raters means higher annotation budgets. For well-funded labs, that’s fine. For academic teams or smaller companies, it’s a real constraint. The paper acknowledges the trade-off but doesn’t offer much guidance on what to do when you simply can’t afford 20 raters per item.
Still, this is one of those rare papers that actually makes me want to revisit my own evaluation pipelines. If you’re building any kind of human-annotated benchmark, read this before you spend your next annotation budget.