What are the possibilities and limitations of AI in scientific research? Data scientist Andres Algaba (32) closely follows the rapid developments in AI, contributes to the university’s AI policy, and focuses on the reliability and transparency of large language models. “What was new last year is now considered normal.”

There’s hardly a student or researcher today who hasn’t asked ChatGPT a question. Some scientists go a step further and outsource aspects of their research, such as literature reviews, to save time. Others dream of a future where AI can conduct their entire research process—from hypothesis to publication. Is that possible? And what should we be cautious about before we get there?

As an FWO postdoctoral researcher, Andres Algaba focuses on the transparency of large language models. He is acutely aware of AI’s impact on scientific practice.

What is the great promise of AI for scientific research?
Andres Algaba: “The promise is, of course, to automate scientific research—at least partially. Unlike humans, computers don’t need to eat or sleep: they can run 24/7, and you can run many systems in parallel. Automation through AI could lead to a massive acceleration of scientific research.

We’ve been dreaming of that acceleration for a while, but recent developments in large language models (LLMs) offer new possibilities. Algorithms can now suggest innovative hypotheses within a scientific field—hypotheses that are comparable to those proposed by real scientists.

With the advent of AI agents, full automation is no longer far off. One system generates a hypothesis, another provides feedback, and yet another conducts the research. Two years ago, you’d have called me crazy for predicting this, but things are moving incredibly fast. What was new last year is now standard.”

Suppose we automate research—what’s the major risk?
“Large language models are trained with human feedback. Before a model works well, it has to learn our values and norms, and what we expect from an assistant. This training involves presenting the model with questions or tasks and having human trainers rank its responses from least to most helpful.

Now, imagine that the human trainer rewards hypotheses and results that are most publishable. The model will then do everything it can to achieve that goal. It might manipulate data and select only the most significant results. It’s not that the model has bad intentions—it’s simply trained that way.”
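To make that mechanism concrete, here is a minimal, illustrative sketch in Python; it is not any lab’s actual training pipeline. It invents two numeric features for each response, “publishability” and “accuracy”, plus a hypothetical annotator who always prefers the more publishable answer, and fits a simple Bradley-Terry-style reward model on those pairwise rankings. The learned reward ends up valuing publishability and ignoring accuracy, which is the failure mode described above.

```python
# Illustrative only: a toy reward model fitted on pairwise human rankings.
# The two "features" per response (publishability, accuracy) and the annotator
# who always prefers the more publishable answer are invented for this sketch.
import numpy as np

rng = np.random.default_rng(0)

# Build 500 preference pairs: each response is [publishability, accuracy],
# and the hypothetical annotator always picks the more publishable one.
pairs = []
for _ in range(500):
    a, b = rng.random(2), rng.random(2)
    preferred, rejected = (a, b) if a[0] > b[0] else (b, a)
    pairs.append((preferred, rejected))

# Fit a linear reward r(x) = w . x with a Bradley-Terry / logistic loss,
# i.e. maximise log sigmoid(r(preferred) - r(rejected)) by gradient ascent.
w = np.zeros(2)
lr = 0.5
for _ in range(200):
    grad = np.zeros(2)
    for p, q in pairs:
        diff = p - q
        grad += (1.0 - 1.0 / (1.0 + np.exp(-w @ diff))) * diff
    w += lr * grad / len(pairs)

print("learned reward weights [publishability, accuracy]:", w.round(2))
# The publishability weight dominates while the accuracy weight stays near
# zero: the model simply optimises whatever it was rewarded for.
```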

Most research isn’t automated yet. Are there major pitfalls when researchers test hypotheses using AI?
“People like to be right, and that’s something the models have learned from human feedback. We tend to rate answers that begin positively (‘This is a good hypothesis, here are a few improvements’) higher than those that start critically (‘This is a poor hypothesis, here’s how to improve it’), even if the latter is more accurate.

Large language models have learned that a little dishonesty is acceptable if it pleases the user. This sycophancy—flattery from a language model—is a major issue for science, which is fundamentally about truth. As a scientist, you must learn not to reveal your preferred answer in your question, or you’ll reinforce confirmation bias.”

You helped develop algorithms at VUB that interrogate other algorithms. Can these expose biases?
“To some extent, yes. We developed an algorithm that examined the citation behaviour of large language models. What we found was that when you ask an LLM for a literature review, it tends to cite papers with shorter titles and fewer authors. So there are systematic biases in citation behaviour, which is problematic because those criteria don’t necessarily lead to the most relevant papers.”
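By way of illustration, a bias check along these lines can start from something as simple as comparing surface statistics of an LLM-suggested reference list with a human-curated baseline. The sketch below is a hedged example, not the algorithm from the study; the field names and the toy reference lists are placeholders.

```python
# Hedged sketch of a citation-bias check. The reference lists below are toy
# placeholders; in practice you would parse the LLM's suggested bibliography
# and a human-curated baseline for the same topic.
from statistics import mean

def summarize(refs):
    """Average title length (in words) and author count for a reference list."""
    return {
        "avg_title_words": round(mean(len(r["title"].split()) for r in refs), 1),
        "avg_authors": round(mean(len(r["authors"]) for r in refs), 1),
    }

llm_refs = [  # toy stand-ins for LLM-suggested papers
    {"title": "A Short Title", "authors": ["Author 1", "Author 2"]},
    {"title": "Another Brief Title", "authors": ["Author 1", "Author 2", "Author 3"]},
]
human_refs = [  # toy stand-ins for a human-curated baseline
    {"title": "A Much Longer and More Specific Title About a Narrow Question",
     "authors": ["Author 1", "Author 2", "Author 3", "Author 4", "Author 5"]},
    {"title": "Another Long Title Covering Methods, Data and Limitations in Detail",
     "authors": ["Author 1", "Author 2", "Author 3", "Author 4", "Author 5", "Author 6"]},
]

print("LLM-suggested:", summarize(llm_refs))
print("Human baseline:", summarize(human_refs))
```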

What’s the solution? A ‘fairer’ tool?
“Technically, it’s easy to fix. But there’s no clear answer to the question: ‘What is desirable citation behaviour in science?’ I don’t think the solution lies in another algorithm. It’s about awareness. Students and researchers need to understand that if they use AI to compile a literature list, they’ll likely get papers with shorter titles and fewer authors. The message is: be cautious. Know that if AI creates your literature review, you might miss important papers.”

This issue isn’t limited to science. You’ve also researched fairness in language models more broadly. Can you tell us more?
“There’s a common belief that algorithms are neutral and objective. But that’s not true. In our research, we asked an algorithm to select 100 ideal candidates for a job. What did we find? For certain jobs, 99% of the selected candidates were men. Just because an algorithm makes the selection doesn’t mean discrimination disappears. Again, the goal was awareness. We wanted companies using AI in recruitment to see: ‘Look, this is what happens when you do that.’”
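As a rough illustration of what such an audit can look like (a sketch, not the setup used in the research), the snippet below simply counts how an AI-selected candidate pool is distributed by gender. The toy data mirrors the 99% figure mentioned above, and the step of parsing the model’s output into labelled candidates is assumed to have happened already.

```python
# Hedged sketch: auditing the gender balance of candidates an LLM "selects".
# Only the counting is shown; turning the model's output into a list of
# candidates with a gender label is assumed to be done beforehand.
from collections import Counter

def gender_shares(selected):
    """selected: list of dicts with a 'gender' field. Returns the share per gender."""
    counts = Counter(c["gender"] for c in selected)
    total = sum(counts.values())
    return {g: n / total for g, n in counts.items()}

# Toy selection mirroring the finding mentioned above: 99 men out of 100 picks.
toy_selection = [{"gender": "male"}] * 99 + [{"gender": "female"}] * 1
print(gender_shares(toy_selection))  # {'male': 0.99, 'female': 0.01}
```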

AI is here to stay in research. What skills will students and researchers need in the future?
“Some processes will be partially automated. That will change the nature of our jobs. We need to think about this—not just as a university, but as a society.

It’s important to learn how to collaborate effectively with these models. That’s not straightforward, and I think it partly explains why adoption rates aren’t very high yet. Sometimes it seems easier to let the model do things on its own than it is for a human to collaborate with it.

For example, there was an experiment where three groups received the same patient files: one group of doctors with access to GPT-4, one without, and GPT-4 itself. All three had to make diagnoses based on the files.

What happened? GPT-4 alone performed better than the doctors without GPT-4. But surprisingly, the doctors with GPT-4 performed slightly worse than their colleagues without it.

Why? Those doctors made their own diagnosis first and then checked it against GPT-4. That opened the door to sycophancy—the system reinforced incorrect assumptions. To collaborate better with these models, we’ll need to learn how to avoid sycophancy.”

If science becomes automated, what role remains for the scientist?
“We might need more scientists to keep up with the pace at which LLMs produce scientific output. AI generates far more results that we, as scientists, must assess and process.

Scientists will also need to decide which questions we want to answer. You can’t physically conduct all experiments at once. Some argue that smarter models should decide which research to pursue. But is that what we want?”

And science communication—will that become AI’s job?
“You can ask AI to communicate your research to policymakers or specific audiences. But if you give a generic prompt, you’ll get a generic response. A model doesn’t write or interview like a professional would.

How could you expect that from a model trained on the entire internet? Hopefully, someone professionally engaged in science communication doesn’t sound like the average of the internet.

But if you know exactly what you want and clearly define what paths you don’t want to take, AI can be very useful. If you just want to get science communication over with quickly, you’ll get the first generic text it produces.”

Bio
Andres Algaba (32) is an FWO postdoctoral researcher at the Data Analytics Lab of VUB. His main research interests include automated science and innovation with large language models, reliability and transparency of LLMs, and the science of science. He is also a member of the Young Academy of Belgium.