Can LLMs Actually Help Physicists? Google Put Them to the Test on Superconductivity

I’ve been watching the AI-in-science space with a mix of excitement and skepticism for a while now. Everyone loves to talk about how LLMs will revolutionize research, but the reality is that most of these models are trained on internet text, not on the kind of deep, nuanced, and often contradictory knowledge that physicists deal with daily.

So when Google Research dropped a paper in PNAS testing six different LLMs on high-temperature superconductivity questions, I paid attention. This isn’t another “AI can write a decent email” demo. This is a serious stress test: can these models act as knowledgeable, unbiased thought partners in a field where the answers aren’t settled?

The setup: hard questions, harder grading

The team, led by Subhashini Venugopalan and Eun-ah Kim, focused on cuprates, a class of copper-based compounds that superconduct at temperatures as high as about -140°C (roughly 133 K). That’s still cold, but way warmer than traditional superconductors. The catch? No one fully agrees on why they work. There are competing theories, decades of experimental data, and thousands of papers. A graduate student entering this field faces a brutal learning curve.

The researchers asked six LLMs a set of expert-level questions about the underlying mechanisms. Then a panel of physicists graded the responses not just on factual accuracy, but on comprehensiveness, balance, and whether the models acknowledged open debates.
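To make the grading idea concrete, here’s a minimal sketch of how scores from a panel might be averaged per criterion. None of this comes from the paper: the 1-to-5 scale, the criterion names as dictionary keys, the `aggregate` function, and every number below are my own assumptions for illustration.

```python
from statistics import mean

# The four dimensions described above; the key names are illustrative.
CRITERIA = ("accuracy", "comprehensiveness", "balance", "acknowledges_debate")

def aggregate(panel_scores):
    """Average each criterion across graders for one model's answer."""
    return {c: mean(scores[c] for scores in panel_scores) for c in CRITERIA}

# Made-up scores from three hypothetical graders on an assumed 1-5 scale.
panel = [
    {"accuracy": 4, "comprehensiveness": 3, "balance": 2, "acknowledges_debate": 1},
    {"accuracy": 5, "comprehensiveness": 3, "balance": 3, "acknowledges_debate": 2},
    {"accuracy": 3, "comprehensiveness": 3, "balance": 1, "acknowledges_debate": 3},
]

summary = aggregate(panel)
print(summary)
```

The point of keeping the criteria separate rather than collapsing them into one number is that an answer can score well on accuracy while still scoring poorly on balance, which is exactly the failure mode this kind of grading is meant to catch.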

The winners: closed ecosystems with curated sources

Two systems stood out: NotebookLM and a custom-built tool, both of which draw from a closed ecosystem of certified, quality-controlled references. That’s not surprising, honestly. If you want a model to talk about cutting-edge physics, you can’t let it pull from random Reddit threads or outdated blog posts. You need it to stick to peer-reviewed literature and expert-curated sources.

The custom system in particular was built to retrieve and synthesize information from a hand-picked set of papers. It didn’t just regurgitate abstracts—it could weigh evidence from different studies and present a balanced view of competing theories. That’s exactly what a good thought partner should do.
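I don’t know how the custom system is actually implemented, but the general idea of closed-corpus retrieval can be sketched in a few lines. This is a toy version under loud assumptions: the three “papers” and their IDs are invented, the similarity measure is plain bag-of-words cosine (a real system would more likely use learned embeddings), and `retrieve` and `min_score` are hypothetical names of my own.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in for a curated corpus; IDs and text are invented.
CORPUS = {
    "paper_a": "The pseudogap phase in cuprates appears above the superconducting transition temperature.",
    "paper_b": "Spin fluctuations are one proposed pairing mechanism for d-wave superconductivity in cuprates.",
    "paper_c": "Phonon-mediated pairing explains conventional superconductors described by BCS theory.",
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, corpus=CORPUS, k=2, min_score=0.1):
    """Return up to k document IDs from the closed corpus, best match first.

    An empty list means nothing in the corpus is relevant enough, which is
    the signal to decline rather than answer from outside the corpus.
    """
    q = Counter(tokenize(question))
    scored = [(cosine(q, Counter(tokenize(text))), doc_id) for doc_id, text in corpus.items()]
    scored = [(score, doc_id) for score, doc_id in scored if score >= min_score]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

on_topic = retrieve("What pairing mechanism do spin fluctuations suggest in cuprates?")
off_topic = retrieve("How do I bake sourdough bread?")
print(on_topic, off_topic)
```

The part that matters is the empty result for the off-topic question. A closed system only passes retrieved passages to the model, so when retrieval comes back empty it can say “I don’t have a source for that” instead of improvising.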

Where the others fell short

The other four models—which I won’t name because the paper doesn’t explicitly call them out, but let’s be real, it’s the usual suspects—struggled in predictable ways. They’d give confident-sounding answers that were technically correct but missed important nuances. They’d ignore contradictory evidence. A few times, they flat-out hallucinated references to papers that don’t exist.

This is the danger with LLMs in science: they sound authoritative even when they’re wrong. A junior researcher might trust a confident answer from a model that doesn’t understand the field’s open questions.

What this means for the future

This study isn’t a takedown of LLMs in science. It’s a reality check. The top performers showed that with the right architecture—curated sources, retrieval-augmented generation, and careful prompt engineering—these models can genuinely help researchers navigate complex literature. But the gap between “helpful” and “reliable” is still wide.

I think the most interesting finding is that the models did better when they were forced to stay within a closed corpus. That suggests the future of scientific AI isn’t in bigger general-purpose models, but in specialized systems that know their limits and stick to verified sources.

For now, if you’re a physicist looking for a virtual collaborator, NotebookLM or a custom RAG system is your best bet. But don’t stop double-checking the answers. The models are getting better, but they’re not there yet.
