If you’ve ever tried to use ChatGPT as a calculator, you’ve almost certainly noticed its dyscalculia: The chatbot is bad at math. And it’s not unique among AI in this regard.
Anthropic’s Claude can’t solve basic word problems. Gemini fails to understand quadratic equations. And Meta’s Llama struggles with straightforward addition.
So how is it that these bots can write soliloquies, yet get tripped up by grade-school-level arithmetic?
Tokenization has something to do with it. Tokenization, the process of dividing data up into chunks (e.g., breaking the word “fantastic” into the syllables “fan,” “tas,” and “tic”), helps AI densely encode information. But because tokenizers, the components that do the tokenizing, don’t really know what numbers are, they frequently end up destroying the relationships between digits. For example, a tokenizer might treat the number “380” as a single token but split “381” into a pair of tokens (“38” and “1”).
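You can see this kind of splitting for yourself with an open tokenizer library such as OpenAI’s tiktoken. A minimal sketch follows; the exact splits depend on which vocabulary you load, so treat the specific token boundaries as illustrative rather than a claim about any one model.

```python
# pip install tiktoken
import tiktoken

# The vocabulary used by GPT-4-era OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

for number in ["380", "381", "57897", "12832"]:
    tokens = enc.encode(number)
    pieces = [enc.decode([t]) for t in tokens]
    # Nearby numbers can end up split into different numbers of tokens,
    # which blurs the digit-level relationships between them.
    print(f"{number!r} -> {pieces}")
```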
But tokenization isn’t the only reason math’s a weak spot for AI.
AI systems are statistical machines. Trained on a lot of examples, they learn the patterns in those examples to make predictions (like that the phrase “to whom” in an email often precedes the phrase “it may concern”). For instance, given the multiplication problem 57,897 x 12,832, ChatGPT, having seen a lot of multiplication problems, will likely infer that the product of a number ending in “7” and a number ending in “2” will end in “4.” But it’ll struggle with the middle part. ChatGPT gave me the answer 742,021,104; the correct one is 742,934,304.
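A quick sanity check in Python, using the numbers from this example, shows how that pattern matching plays out: the guess nails the leading digits and the final digit, but the middle of the number drifts.

```python
# Checking the example above (57,897 x 12,832).
a, b = 57_897, 12_832

correct = a * b              # 742,934,304
chatgpt_guess = 742_021_104  # the answer ChatGPT gave in this anecdote

print(f"{a:,} x {b:,} = {correct:,}")
print("last digit matches the 7 x 2 pattern:", str(correct)[-1] == "4")
print("the guess shares the first three digits and the last digit,",
      "but the middle is wrong:", f"{chatgpt_guess:,}")
```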
Yuntian Deng, an assistant professor at the University of Waterloo specializing in AI, thoroughly benchmarked ChatGPT’s multiplication abilities in a study earlier this year. He and his co-authors found that the default model, GPT-4o, struggled to multiply numbers containing more than four digits each (i.e., problems beyond the likes of 3,459 x 5,284).
“GPT-4o struggles with multi-digit multiplication, achieving less than 30% accuracy beyond four-digit by four-digit problems,” Deng told TechCrunch. “Multi-digit multiplication is challenging for language models because a mistake in any intermediate step can compound, leading to incorrect final results.”
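To make that compounding concrete, here’s a small Python sketch of schoolbook long multiplication. It isn’t how a language model works internally; it just illustrates why a slip in any one partial product or in the final sum corrupts the answer.

```python
def long_multiply(a: int, b: int) -> int:
    """Schoolbook multiplication: one partial product per digit of b, then a sum.
    Botch any single intermediate step and the error flows straight into the result."""
    total = 0
    for place, digit in enumerate(reversed(str(b))):
        partial = a * int(digit) * 10 ** place  # one intermediate step
        total += partial                        # errors here compound, too
    return total

assert long_multiply(3_459, 5_284) == 3_459 * 5_284
```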
So, will math skills forever elude ChatGPT? Or is there reason to believe the bot might someday become as proficient with numbers as humans (or a TI-84, for that matter)?
Deng is hopeful. In the study, he and his colleagues also tested o1, OpenAI’s “reasoning” model that recently came to ChatGPT. o1, which “thinks” through problems step by step before answering them, performed much better than GPT-4o, getting up to nine-digit by nine-digit multiplication problems right about half the time.
“The model might be solving the problem in ways that differ from how we solve it manually,” Deng said. “It makes us curious about the model’s internal approach and how it differs from human reasoning.”
Deng thinks that the progress indicates that at least some types of math problems — multiplication problems being one of them — will eventually be “fully solved” by ChatGPT-like systems. “This is a well-defined task with known algorithms,” Deng said. “We’re already seeing significant improvements from GPT-4o to o1, so it’s clear that enhancements in reasoning capabilities are happening.”
Just don’t get rid of your calculator anytime soon.