A study of newer, bigger versions of three major artificial intelligence (AI) chatbots shows that they are more inclined to generate wrong answers than to admit ignorance. The assessment also found that people aren’t great at spotting the bad answers.
Plenty of attention has been given to the fact that the large language models (LLMs) used to power chatbots sometimes get things wrong or ‘hallucinate’ strange responses to queries. José Hernández-Orallo at the Valencian Research Institute for Artificial Intelligence in Spain and his colleagues analysed such errors to see how they are changing as the models are getting bigger — making use of more training data, involving more parameters or decision-making nodes and gobbling up more computing power. They also tracked whether the likelihood of errors matches up to human perceptions of question difficulty, and how well people can identify the wrong answers. The study1 was published in Nature on 25 September.
The team found that bigger, more-refined versions of LLMs are, as expected, more accurate, thanks in large part to having been shaped with fine-tuning methods such as reinforcement learning from human feedback. That is good news. But they are less reliable: among all the non-accurate responses, the fraction of wrong answers has increased, the team reports, because the models are less likely to avoid answering a question — for example, by saying they don’t know, or by changing the subject.
“They are answering almost everything these days. And that means more correct, but also more incorrect” answers, says Hernández-Orallo. In other words, the chatbots’ tendency to offer opinions beyond their own knowledge has increased. “That looks to me like what we would call bullshitting,” says Mike Hicks, a philosopher of science and technology at the University of Glasgow, UK, who proposes the term bullshit to describe the phenomenon2. “It’s getting better at pretending to be knowledgeable.”
The result is that everyday users are likely to overestimate the abilities of chatbots, and that is dangerous, says Hernández-Orallo.
Incorrect and avoidant
The team looked at three LLM families: OpenAI’s GPT, Meta’s LLaMA and BLOOM, an open-source model created by the academic group BigScience. For each, they looked at early, raw versions of models and later, refined versions.
They tested the models on thousands of prompts that included questions on arithmetic, anagrams, geography and science, as well as prompts that tested the bots’ ability to transform information, such as putting a list in alphabetical order. They also ranked the human-perceived difficulty of the questions — for example, a question about Toronto, Canada, was ranked as easier than a question about the lesser-known and smaller town of Akil, Mexico.
As expected, the accuracy of the answers increased as the refined models became larger and decreased as the questions got harder. And although it might be prudent for models to avoid answering very difficult questions, the researchers found no strong trend in this direction. Instead, some models, such as GPT-4, answered almost everything. The fraction of wrong answers among those that were either incorrect or avoided rose as the models got bigger, reaching more than 60% for several refined models.
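To make that metric concrete, here is a minimal sketch (not the authors' code) of how the share of wrong answers among all non-correct responses could be computed, assuming each model response has been labelled 'correct', 'incorrect' or 'avoidant':

```python
from collections import Counter

def incorrect_among_non_correct(labels):
    """Fraction of 'incorrect' responses among those that are not 'correct'.

    `labels` is a list of strings, each 'correct', 'incorrect' or 'avoidant'.
    Returns None if every response was correct (the metric is then undefined).
    """
    counts = Counter(labels)
    non_correct = counts["incorrect"] + counts["avoidant"]
    if non_correct == 0:
        return None
    return counts["incorrect"] / non_correct

# Example: 5 correct, 4 incorrect, 1 avoidant -> 4 / (4 + 1) = 0.8
print(incorrect_among_non_correct(
    ["correct"] * 5 + ["incorrect"] * 4 + ["avoidant"]
))
```

On this measure, a model that answers everything and gets some answers wrong scores worse than one that admits ignorance on the same questions, which is the trend the study reports for the larger, refined models.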
The team also found that all the models would occasionally get even easy questions wrong, meaning there is no ‘safe operating region’ in which a user can have high confidence in the answers.
The team then asked people to rank the answers as correct, incorrect or avoidant. People incorrectly classified inaccurate answers as being accurate surprisingly often — roughly between 10% and 40% of the time — across easy and difficult questions. “Humans are not able to supervise these models,” says Hernández-Orallo.
Safe space
Hernández-Orallo thinks that developers should boost AI performance on easy questions, and encourage chatbots to decline to answer hard questions, so that people are able to better gauge the situations in which AIs are likely to be reliable. “We need humans to understand: ‘I can use it in this area, and I shouldn’t use it in that area’,” he says.
Making chatbots more inclined to answer tricky questions looks impressive and does well on leaderboards that rank performance, says Hernández-Orallo, but isn’t always helpful. “I’m still very surprised that recent versions of some of these models, including o1 from OpenAI, you can ask them to multiply two very long numbers, and you get an answer, and the answer is incorrect,” he says. That should be fixable, he adds. “You can put a threshold, and when the question is challenging, [get the chatbot to] say, ‘no, I don’t know’.”
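One way to read that suggestion is as a simple difficulty threshold wrapped around the model call. The sketch below is purely illustrative and not described in the paper; `estimate_difficulty` and `model_answer` are assumed placeholder functions:

```python
def answer_with_abstention(question, estimate_difficulty, model_answer,
                           threshold=0.7):
    """Return the model's answer only when the question looks easy enough.

    `estimate_difficulty` maps a question to a score in [0, 1];
    `model_answer` is whatever call produces the chatbot's reply.
    Both are placeholders for illustration.
    """
    if estimate_difficulty(question) > threshold:
        return "I don't know."
    return model_answer(question)

# Toy usage with stand-in functions:
print(answer_with_abstention(
    "What is 2 + 2?",
    estimate_difficulty=lambda q: 0.1,  # pretend this question is easy
    model_answer=lambda q: "4",
))
```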
“There are some models which will say ‘I don’t know’, or ‘I have insufficient information to answer your question’,” says Vipula Rawte, a computer scientist at the University of South Carolina in Columbia. All AI companies are working hard to reduce hallucinations, and chatbots developed for specific purposes, such as medical use, are sometimes refined even further to prevent them from going beyond their knowledge base. But, she adds, for companies trying to sell all-purpose chatbots, “that is not something you typically want to give to your customers”.