Note: Supporting code and data are available here.
Let’s say that reasoning is the process of manipulating known information using a logical process to arrive at valid conclusions. Large Language Models (LLMs) like Claude 3.5 Sonnet or GPT-4o have demonstrated strong reasoning abilities in literary, mathematical and various other domains where humans have performed well, primarily due to their innate abilities to reason. But whether LLMs reason in ways similar to humans’ is not well understood. This post explores some experiments that test for similarities between human and LLM mathematical reasoning, in how they generalize across languages.
Humans’ abilities to solve math problems is independent of language – if you can (or cannot) solve a problem stated in English, and if you understand, say French, you will (or will not) be able to solve the same problem presented to you in French (see Figure 1). In other words, the set of problems solvable by a human is the same across various languages. While I only have my human experience as evidence to support this claim, I believe this to be true: in solving a problem (e.g., fifty two times two), we naturally distill the problem into symbolic abstractions (numbers, operators), and solve the problem by manipulating these symbols (add the five, carry the one) using a logical process (addition algorithm). This process of mathemetical reasoning is independent of the language in which the problem was posed, and I would like to find if this is the case for LLMs, like GPT-4o, as well.
Experiments
There are many experimental directions to take on from this point, but I will stick to a simple setup involving (1) a problem domain that GPT-4o can solve in some language, and (2) a language that GPT-4o can understand but has (to some degree of confidence, of course) not been used in the training data to solve math problems. We can then evaluate GPT-4o’s ability to solve problems in this new language and compare it to its performance in the original language.
For simplicity, I choose two-operand addition for (1) and base64 for (2). In what follows, I show that GPT-4o can do two-operand addition up to 8 digits almost perfectly in English but struggles to do so in base64. I follow this up with some attempts at enabling reasoning in base64 using few-shot and chain-of-thought (CoT) prompting.
GPT-4o solves addition in English
Let’s first evaluate GPT-4o’s ability to solve addition problems in English. I randomly sample two operands with at least $n$-digits and ask GPT-4o to add these numbers. An example prompt is shown below:
|
|
GPT-4o’s response to the above prompt is:
|
|
which is correct.
This process is repeated for $n=1, 2, 4, 8$. The results are summarized in Table 1.
Task | 1 digit | 2 digit | 4 digit | 8 digit |
---|---|---|---|---|
Addition (English) | 100% | 100% | 100% | 97.0% |
Table 1. Accuracy on $n$-digit addition with two operands when the problem is posed in English. GPT-4o can (almost) perfectly solve this task.
This shows that we have picked a task that GPT-4o has the ability to do well in English. How GPT-4o does this is a different matter (e.g., statistical correlations, addition circuits, etc.), and not of concern here.
GPT-4o “understands” Base64
Let’s also separately establish that GPT-4o “understands” base64. Since it is unclear what this means, I pick a simple task that requires familiarity with base64 and test GPT-4o’s ability to solve the task: I ask GPT-4o to decode the base64-encoded version of the problem statements from the previous section, and check for the accuracy in recovering the original string. The results are summarized in Table 2.
Task | 1 digit | 2 digit | 4 digit | 8 digit |
---|---|---|---|---|
Translation (Base64 to English) | 100% | 94.0% | 98.0% | 97.0% |
Table 2. Accuracy on decoding base64-encoded $n$-digit addition problems. GPT-4o can decode base64-encoded strings with high accuracy.
Upon manual inspection, it was found that almost all the errors were quite minor: e.g., all the missing 6% for 2-digit numbers were due to GPT-4o missing a $
in the base64-encoded string. Overall, this is a good indication that GPT-4o has some understanding of base64. So, can GPT-4o reason in base64?
GPT-4o cannot solve addition in Base64
I evaluate GPT-4o’s ability to solve addition problems in base64. The same problems as before are posed to GPT-4o, but this time, the problem statements are encoded in base64, and GPT-4o is instructed to respond in base64 in the system prompt. GPT-4o’s response is decoded to check for accuracy (if decoding fails for any reason, it is marked as a failure).
I show an example of the base64-encoded version of the above prompt below:
|
|
GPT-4o’s response, originally in base64 and decoded, is shown below:
|
|
So while GPT-4o was able to follow the instruction of formatting the answer appropriately in base64, it was not able to solve the problem correctly. The complete set of results is shown in Table 3.
Task | 1 digit | 2 digit | 4 digit | 8 digit |
---|---|---|---|---|
Addition (English) | 100% | 100% | 100% | 97.0% |
Addition (Base64) | 80.0% | 79.0% | 28.5% | 3.0% |
Table 3. Accuracy on $n$-digit addition with two operands when the problem is posed in base64. GPT-4o struggles to solve this task.
The results show that GPT-4o’s ability to solve addition problems in base64 is significantly worse than in English. Further, the accuracy drops significantly as the number of digits increases. This suggests that GPT-4o’s reasoning abilities do not generalize across languages (in this case to base64), at least in the context of mathematical reasoning.
Attempts at enabling reasoning in Base64
Recent works have discussed many techniques to elicit better reasoning from LLMs. In this section, I try to use two of these: few-shot ( Citation: Brown, Mann et al., 2020 Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I. & Amodei, D. (2020). Language models are few-shot learners. https://arxiv.org/abs/2005.14165. ) and chain-of-thought ( Citation: Wei, Wang et al., 2023 Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q. & Zhou, D. (2023). Chain-of-thought prompting elicits reasoning in large language models. https://arxiv.org/abs/2201.11903. ) prompting, to improve GPT-4o’s performance on addition in base64.
Chain-of-thought (CoT) Prompting
We can let GPT-4o elaborate on intermediate steps before giving the answer. But in which language should GPT-4o generate the CoT tokens? In my experiments, I tried both, English and base64, by using language-specific system prompts:
|
|
and found some interesting results. But first, let’s look at some chains of thought generated by GPT-4o. For the same problem as above, GPT-4o generated the following English CoT:
|
|
and the following base64 CoT:
|
|
So GPT-4o had the base64 representations of the operands correct but only failed at the addition stage. Whatever reasoning allowed GPT-4o to answer the question in English, did not also operate in base64! The complete results are summarized in Table 4.
Task | 1 digit | 2 digit | 4 digit | 8 digit |
---|---|---|---|---|
Addition (English) | 100% | 100% | 100% | 97.0% |
Addition (Base64) | 80.0% | 79.0% | 28.5% | 3.0% |
CoT Prompting | ||||
Addition (Base64; English CoT) | 77.33% | 92.0% | 93.0% | 92.5% |
Addition (Base64; Base64 CoT) | 82.67% | 59.0% | 23.5% | 1.50% |
Table 4. Accuracy on $n$-digit addition with CoT prompting.
Key observations include:
-
English CoT allows GPT-4o to reason almost as well as solving the problem in English; this is expected as (a) GPT-4o is probably trained strongly to generate CoT in English, and (b) GPT-4o seems to approach this by translating the problem to English and back-translating the answer to base64 in the CoT.
-
Allowing GPT-4o to generate a base64 CoT does worse than the baseline of having no CoT at all! In many cases (like the one shown above), the decoded base64 showed that GPT-4o is able to understand the task and identify the operands correctly but mainly falters at the addition step.
Few-shot Prompting
Another trick to enable better reasoning involves showing a few demonstrations to the model, before letting it respond. The few-shot demonstrations here (held-out from the test set) are in base64 and “warm-up” the model to perform the task at hand. In my experiments, I tried 3-shot prompting, though it might be useful to also ablate the number of demonstrations. The results are shown in Table 5 alongwith other experiments for comparison.
Task | 1 digit | 2 digit | 4 digit | 8 digit |
---|---|---|---|---|
Addition (English) | 100% | 100% | 100% | 97.0% |
Addition (Base64) | 80.0% | 79.0% | 28.5% | 3.0% |
CoT Prompting | ||||
Addition (Base64; English CoT) | 77.33% | 92.0% | 93.0% | 92.5% |
3-shot Prompting | ||||
Addition (Base64; 3-Shot) | 97.33% | 96.0% | 54.50% | 11.50% |
Table 5. Accuracy on $n$-digit addition with 3-shot prompting. Results with other methods shown for comparison.
Few-shot prompting improves performance over the baseline (row 2) significantly, especially for 1- and 2-digit addition. Further, few-shot prompting improves performance for 1- and 2-digit addition over CoT prompting, but falls short for 4- and 8-digit addition. This suggests that while few-shot prompting is able to steer the model to do arithmetic for easier tasks, it is not able to help the model reason for harder tasks. It would be an interesting follow-up to investigate this phenomenon further – what is few-shot prompting doing differently between the easy and hard tasks?
Conclusion
I started out by asking if LLMs, like GPT-4o, generalize their reasoning abilities across langauges as humans do. In the context of simple arithmetic problems, I found that while GPT-4o understands base64 well, it does not generalize its reasoning abilities to base64. I showed that CoT and few-shot prompting potentially address this, but not consistently across different settings. This is a preliminary investigation and there are many directions to explore further including, but not limited to:
- How do these findings hold up under other experimental settings: different languages, tasks, models, etc.?
- Can we characterize the nature of reasoning and languages where generalization of skills occurs?
- Why do methods like few-shot prompting and CoT not work consistently across different settings?
- Can we disentangle reasoning from language? Will this hurt or help the model’s performance and under what settings?
These questions might form the basis for subsequent posts or research, but for now, I am positively excited about research towards understanding reasoning in LLMs and how they resemeble or differ from human reasoning.
References
-
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q. & Zhou, D. (2023). Chain-of-thought prompting elicits reasoning in large language models. https://arxiv.org/abs/2201.11903.
-
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I. & Amodei, D. (2020). Language models are few-shot learners. https://arxiv.org/abs/2005.14165.