To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning
Today's paper examines the effectiveness of chain-of-thought (CoT) prompting for large language models across different types of reasoning tasks. Through a meta-analysis of over 100 papers and new experiments on 20 datasets, the authors find that CoT primarily benefits mathematical and symbolic reasoning tasks, with limited gains on other types of problems. They analyze why CoT helps on these specific tasks and compare it to tool-augmented approaches that hand computation to an external symbolic solver.
Overview
The authors use two main approaches to evaluate CoT prompting. First, they conduct a meta-analysis of over 100 papers that report results comparing CoT to direct answering across various tasks. They group the tasks into categories and analyze the performance difference between CoT and direct answering within each category.
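As a rough illustration of this kind of aggregation, the sketch below groups reported (category, CoT accuracy, direct accuracy) triples and averages the gap per category. The function name, the record layout, and every number are assumptions made for illustration; they are not taken from the paper.

```python
from collections import defaultdict
from statistics import mean


def mean_cot_delta(records):
    """Average (CoT - direct) accuracy gap per task category."""
    by_category = defaultdict(list)
    for category, cot_acc, direct_acc in records:
        by_category[category].append(cot_acc - direct_acc)
    return {cat: mean(deltas) for cat, deltas in by_category.items()}


# Placeholder numbers for illustration only; NOT results from the paper.
records = [
    ("math", 70.0, 50.0),
    ("math", 60.0, 50.0),
    ("commonsense", 71.0, 70.0),
]
deltas = mean_cot_delta(records)
print(deltas)
```

A per-category mean is only one possible summary; medians or full distributions would be more robust when a category contains a few outlier results.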
Second, they run new experiments on 20 diverse datasets using 14 different language models. These experiments compare zero-shot and few-shot CoT prompting to direct answering across different types of reasoning tasks. The datasets span categories like commonsense reasoning, knowledge-based questions, symbolic reasoning, and mathematics.
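To make the comparison concrete, here is a minimal sketch of how direct-answer and CoT prompts might be built in zero-shot and few-shot form. The templates are assumptions, not the paper's actual prompts; "Let's think step by step" is a widely used zero-shot CoT cue, and `build_prompt` is a hypothetical helper.

```python
def build_prompt(question, mode, examples=()):
    """Build a direct-answer or CoT prompt, optionally with few-shot examples.

    `examples` is a sequence of (question, reasoning, answer) tuples.
    Passing no examples yields a zero-shot prompt.
    """
    if mode not in ("direct", "cot"):
        raise ValueError(f"unknown mode: {mode!r}")
    parts = []
    for ex_q, ex_reasoning, ex_answer in examples:
        if mode == "cot":
            # CoT demonstrations show the reasoning before the answer.
            parts.append(f"Q: {ex_q}\nA: {ex_reasoning} The answer is {ex_answer}.")
        else:
            # Direct demonstrations show only the answer.
            parts.append(f"Q: {ex_q}\nA: The answer is {ex_answer}.")
    cue = "A: Let's think step by step." if mode == "cot" else "A: The answer is"
    parts.append(f"Q: {question}\n{cue}")
    return "\n\n".join(parts)


zero_shot_cot = build_prompt("What is 2 + 3?", "cot")
few_shot_direct = build_prompt(
    "What is 2 + 3?",
    "direct",
    examples=[("What is 1 + 3?", "1 plus 3 equals 4.", "4")],
)
print(zero_shot_cot)
print(few_shot_direct)
```

Keeping the question and demonstrations identical across the two modes means any accuracy difference can be attributed to the presence of reasoning text rather than to differences in wording.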
To dig deeper into why CoT helps on certain tasks, they break down symbolic reasoning problems into planning and execution stages. They compare different prompting strategies that separate these stages, including using external symbolic solvers. This allows them to isolate where CoT provides the most benefit in the reasoning process.
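The planning/execution split can be sketched with a toy example: the model is assumed to have produced a plan in the form of an arithmetic expression, and a small deterministic evaluator plays the role of the external solver. The plan string, the word problem, and `execute_plan` are all hypothetical; the paper's actual solvers and formats are not reproduced here.

```python
import ast
import operator

# Arithmetic operators the toy solver is allowed to execute.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def execute_plan(plan):
    """Deterministically evaluate an arithmetic plan such as '(3 + 5) * 12'."""

    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        raise ValueError("unsupported expression in plan")

    return _eval(ast.parse(plan, mode="eval").body)


# Hypothetical plan a model might emit for:
# "A crate holds 3 red and 5 green apples. How many apples are in 12 crates?"
plan = "(3 + 5) * 12"
answer = execute_plan(plan)
print(answer)
```

In this setup the model is responsible only for the plan; any arithmetic slip that a CoT trace might make during execution is ruled out because the evaluator computes the result.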
Results
The key findings of the paper include:
CoT prompting provides substantial performance gains primarily on mathematical and symbolic reasoning tasks, with much smaller or no improvements on other types of problems.
On the MMLU benchmark, as much as 95% of CoT's total performance gain comes from questions where the question or the model's response contains an equals sign ("="), a marker of symbolic or mathematical work.
CoT's main benefit comes from improving the execution of symbolic computations, rather than the planning stage of breaking down problems.
However, CoT still falls short of the performance achieved by using external symbolic solvers for math and logic problems.
There is little evidence that CoT consistently helps on non-symbolic reasoning tasks like commonsense inference or reading comprehension.
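The MMLU finding above is an attribution of total gain to a subset of questions. A minimal sketch of that calculation follows, using the presence of "=" in the question or the model's output as a simple proxy for "contains an equation". The record fields and the toy data are assumptions for illustration, not the paper's data or code.

```python
def gain_share_with_equals(results):
    """Fraction of the total CoT-over-direct gain from items containing '='.

    `results` is a list of dicts with keys: question, cot_output,
    cot_correct, direct_correct. Returns None when there is no net gain.
    """
    total_gain = 0
    equals_gain = 0
    for r in results:
        # +1 if only CoT is right, -1 if only direct is right, 0 otherwise.
        gain = int(r["cot_correct"]) - int(r["direct_correct"])
        total_gain += gain
        if "=" in r["question"] or "=" in r["cot_output"]:
            equals_gain += gain
    if total_gain == 0:
        return None
    return equals_gain / total_gain


# Toy records for illustration only; NOT data from the paper.
results = [
    {"question": "Solve x + 2 = 5.", "cot_output": "x is 3.",
     "cot_correct": True, "direct_correct": False},
    {"question": "What is half of 10?", "cot_output": "10 / 2 = 5.",
     "cot_correct": True, "direct_correct": False},
    {"question": "Which planet is largest?", "cot_output": "Jupiter.",
     "cot_correct": True, "direct_correct": True},
    {"question": "Who wrote Hamlet?", "cot_output": "Shakespeare.",
     "cot_correct": True, "direct_correct": False},
]
share = gain_share_with_equals(results)
print(share)
```

Here two of the three questions that CoT newly gets right involve an equals sign, so the share is two thirds; on real data the same ratio would be computed over the full benchmark.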
Conclusion
The paper concludes that the benefits of chain-of-thought prompting are more limited than commonly assumed. CoT primarily helps on mathematical and symbolic reasoning tasks by improving intermediate computation, but even there it underperforms external symbolic solvers. For more information, please consult the full paper.
Congrats to the authors for their work!
Sprague, Zayne, et al. "To CoT or not to CoT? Chain-of-Thought Helps Mainly on Math and Symbolic Reasoning." arXiv preprint arXiv:2409.12183 (2024).