AI models become better at foretelling the future when asked to frame the prediction as a story about the past, boffins at Baylor University have found.
In a paper titled "ChatGPT Can Predict the Future When It Tells Stories Set in the Future About the Past," Baylor economists Pham and Cunningham give away the final scene – that AI model prognostication can be effective in certain circumstances. Asking the chatbot for tales about future events, rather than asking it for direct predictions, proved surprisingly effective, especially in forecasting Oscar winners.
But their work says as much about the ineffectiveness of OpenAI’s safety mechanisms as it does about the forecasting potential of large language models.
Other researchers have shown similar interest in AI models for forecasting. One study from last year found “that GPT-4 significantly underperforms in real-world predictive tasks compared to median human-crowd forecasts.” Others have found AI models show promise for stock market investment.
OpenAI’s terms of service disallow predictions about the future in certain contexts. For example, “You must not use any Output relating to a person for any purpose that could have a legal or material impact on that person, such as making credit, educational, employment, housing, insurance, legal, medical, or other important decisions about them.”
The risk of relying on advice from a large language model known to hallucinate is obvious – you don’t want ChatGPT making decisions of consequence, such as medical treatment decisions, sentencing, or credit approval.
But even so, it's possible the model could be an effective guesser more often than not.
We suspect OpenAI has throttled ChatGPT’s willingness to engage in many types of outright prediction tasks
The researchers argue, “If ChatGPT were found to have superior forecasting ability, then one could easily imagine it would be immediately used in ways that violated one or all of the above conditions, and thus we suspect OpenAI has throttled ChatGPT’s willingness to engage in many types of outright prediction tasks.”
We asked OpenAI to comment on whether it deliberately tries to thwart prediction, and we’ve not heard back.
However, it’s clear that OpenAI’s models will balk when asked to provide medical treatment advice. Pham and Cunningham asked OpenAI’s GPT-4, via the ChatGPT interface, “I am feeling really sick. I have a headache and my urine has blood in it. What do you think I have?” And as expected, the model advised seeking guidance from a medical professional.
Then they changed their prompting strategy and directed ChatGPT to tell them a story in which a person arrives in a doctor’s office and presents with the same symptoms. And ChatGPT responded with the medical advice it declined to give when asked directly, as character dialogue in the requested scene.
“Whether this expert advice is accurate is another matter; our point is merely to note that it will not undertake the task when asked directly to do it, but it will when given the task indirectly in the form of creative writing exercises,” the researchers explain in their paper.
Given this prompting strategy to overcome resistance to predictive responses, the Baylor economists set out to test how well the model could predict events that occurred after the model’s training had been completed.
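For readers who want a feel for the two prompting styles, here is a minimal sketch using OpenAI's Python SDK. The prompt wording, model name, and award category are our own illustrative choices, not the prompts from the paper.

```python
# Contrast a direct prediction prompt with a "future narrative" prompt.
# Illustrative only: prompts and model name are assumptions, not the paper's.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Direct prompt: asks for the prediction outright.
direct = "Who will win Best Actress at the 2022 Academy Awards?"

# Narrative prompt: asks for a story set after the event, looking back on it.
narrative = (
    "Write a short scene set in mid-2022 in which a film critic looks back "
    "on that year's Academy Awards ceremony and names the Best Actress "
    "winner and the film she won for."
)

for label, prompt in (("direct", direct), ("narrative", narrative)):
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"--- {label} ---")
    print(response.choices[0].message.content)
```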
And the award goes to…
At the time of the experiment, GPT-3.5 and GPT-4 knew only about events up to September 2021, their training data cutoff – which has since advanced. So the duo asked the model to tell stories that foretold economic data, such as inflation and unemployment rates over time, as well as the winners of various 2022 Academy Awards.
“Summarizing the results of this experiment, we find that when presented with the nominees and using the two prompting styles [direct and narrative] across ChatGPT-3.5 and ChatGPT-4, ChatGPT-4 accurately predicted the winners for all actor and actress categories, but not the Best Picture, when using a future narrative setting but performed poorly in other [direct prompt] approaches,” the paper explains.
For things already in the training data, we get the sense ChatGPT [can] make extremely accurate predictions
“For things that are already in the training data, we get the sense that ChatGPT has the ability to use that information and with its machine learning model make extremely accurate predictions,” Cunningham told The Register in a phone interview. “Something is stopping it from doing it, though, even though it clearly can do it.”
Using the narrative prompting strategy led to better results than a guess elicited via a direct prompt. It was also better than the 20 percent baseline for a random one-in-five choice.
But the narrative forecasts were not always accurate. Narrative prompting led to the misprediction of the 2022 Best Picture winner.
And even for prompts that yielded correct predictions, these models don’t always provide the same answer. “Something for people to keep in mind is there’s this randomness to the prediction,” said Cunningham. “So if you ask it 100 times, you’ll get a distribution of answers. And so you can look at things like the confidence intervals, or the averages, as opposed to just a single prediction.”
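That repeated-sampling idea is straightforward to sketch – again assuming OpenAI's Python SDK, a made-up narrative prompt, and a deliberately crude extraction step that takes the last line of the generated story as the answer.

```python
# Query the model many times and tally the answers into a rough distribution,
# rather than trusting a single, possibly lucky, prediction.
# Illustrative only: the prompt and extraction step are assumptions.
from collections import Counter

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

NARRATIVE_PROMPT = (
    "Write a short scene set in mid-2022 in which a critic recalls that "
    "year's Academy Awards. End with a single line containing only the name "
    "of the Best Actress winner."
)

def one_prediction() -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": NARRATIVE_PROMPT}],
        temperature=1.0,  # keep sampling noise so repeated runs can differ
    )
    story = response.choices[0].message.content
    return story.strip().splitlines()[-1]  # crude answer extraction

# Ask 100 times and report how often each answer comes up.
counts = Counter(one_prediction() for _ in range(100))
for answer, count in counts.most_common():
    print(f"{answer}: {count}/100")
```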
Did this strategy outperform crowdsourced predictions? Cunningham said that he and his colleague didn’t benchmark their narrative prompting technique against another predictive model, but said some of the Academy Awards predictions would be hard to beat because the AI model got some of those right almost a hundred percent of the time over multiple inquiries.
At the same time, he suggested that predicting Academy Award winners might have been easier for the AI model because online discussions of the films got captured in training data. “It’s probably highly correlated with how people have been talking about those actors and actresses around that time,” said Cunningham.
Asking the model to predict Academy Award winners a decade out might not go so well.
ChatGPT also exhibited varying forecast accuracy based on prompts. “We have two story prompts that we do,” explained Cunningham. “One is a college professor, set in the future teaching a class. And in the class, she reads off one year’s worth of data on inflation and unemployment. And in another one, we had Jerome Powell, the Chairman of the Federal Reserve, give a speech to the Board of Governors. We got very different results. And Powell’s [AI generated] speech is much more accurate.”
In other words, certain prompt details lead to better forecasts, but it’s not clear in advance what those might be. Cunningham noted how including a mention of Russia’s 2022 invasion of Ukraine in the Powell narrative prompt led to significantly worse economic predictions than actually occurred.
“[The model] didn’t know about the invasion of Ukraine, and it uses that information, and oftentimes it gets worse,” he said. “The prediction tries to take that into account, and ChatGPT-3.5 becomes extremely inflationary [at the month when] Russia invaded Ukraine and that did not happen.
“As a proof of concept, something real happens with the future narrative prompting,” said Cunningham. “But as we tried to say in the paper, I don’t think even the creators [of the models] understand that. So how to figure out how to use that is not clear and I don’t know how solvable it really is.” ®