deepseek r1 and the chinese room
January 11, 2025
Since the advent of GPT-3, foundation models have rapidly progressed from single-pass transformer models to multi-step models that can reason over multiple passes. Multi-step reasoning can be applied to more complex problems, where the model benefits from iterative reasoning to arrive at the correct answer.
If you have used OpenAI’s o1 model, you might have noticed that it “thinks” for longer and goes through a series of steps to arrive at the answer. It does so because it has been trained to produce a “chain of thought” (CoT) as it reasons through the problem. In the case of o1, OpenAI specifically chose to hide the CoT from the user.
o1 shows quite impressive results in several fields, and can answer certain domain questions as well as or better than domain experts. On the MMMU benchmark, as of September 2024, o1 sits only about 10 points behind the best human performance, and two points ahead of the worst scores among human experts.
The thing is, o1 is a closed model, and we don’t know how it reasons besides the little information OpenAI has published. We can’t see its chain of thought, and we can’t evaluate the intermediate steps it takes when tackling any given problem.
The Chinese Room is a thought experiment that was first proposed by John Searle in 1980. The experiment is designed to show that a computer program cannot have a mind, understanding or consciousness, regardless of how intelligently it may behave.
Funny that, as DeepSeek, a Chinese company, has released a multi-step model which discloses its chain of thought, and is entirely open source.
It is backed by the Chinese quantitative hedge fund High-Flyer, and was founded by three alumni of Zhejiang University. Zhejiang counts amongst its alumni the founders of Alibaba and Tsung-Dao Lee (the 1957 Nobel laureate in physics), and is considered one of the top universities in China.
DeepSeek R1 is claimed to be on par with OpenAI’s o1, and shows some impressive results on multiple benchmarks. R1-Zero, the baseline model on which R1 is based, was trained with reinforcement learning only, without any further supervised fine-tuning. R1 is also much cheaper to run than o1 ($2.60 per million tokens vs $60 per million tokens, a factor of 23x!), which is a big deal for the many applications which require scale.
Even the distilled versions of R1 show impressive results, with the 32-billion-parameter model beating o1-mini on every single benchmark except GPQA-Diamond, where it is only three points behind.
In Supervised Fine Tuning (SFT), you take a pretrained large language model and directly show it examples of correct or “good” responses in a labeled dataset. This gives the model a clear roadmap for how to behave, so it tends to produce more predictable, consistent answers within the scope of that data.
In contrast, a model trained only with Reinforcement Learning (RL) relies on trial-and-error plus reward signals to figure out the best outputs. There aren’t explicit labeled examples; instead, the model explores various responses and updates its strategy based on which ones earn higher rewards.
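To make the contrast concrete, here is a deliberately toy sketch of the two training signals (all names and numbers are hypothetical, and this is nothing like the actual R1 training pipeline): SFT nudges the model directly towards a labeled answer, while RL samples an answer and reinforces whatever earned a reward.

import random

# Toy "policy": the probability of answering "A" rather than "B" to one fixed
# prompt. A real LLM updates billions of weights; one number suffices here.

def sft_update(p_a: float, label: str, lr: float = 0.1) -> float:
    """Supervised fine-tuning: nudge the policy towards the labeled answer."""
    target = 1.0 if label == "A" else 0.0
    return p_a + lr * (target - p_a)

def rl_update(p_a: float, reward_fn, lr: float = 0.1) -> float:
    """RL: sample an answer, observe a reward, reinforce what scored well."""
    answer = "A" if random.random() < p_a else "B"
    reward = reward_fn(answer)  # no labeled example, just a scalar signal
    direction = 1.0 if answer == "A" else -1.0
    return min(1.0, max(0.0, p_a + lr * reward * direction))

p = 0.5
for _ in range(50):
    p = sft_update(p, label="A")  # explicit example of correct behaviour
print(f"After SFT: P(answer=A) = {p:.2f}")

p = 0.5
for _ in range(200):
    p = rl_update(p, reward_fn=lambda a: 1.0 if a == "A" else 0.0)
print(f"After RL:  P(answer=A) = {p:.2f}")

Both runs converge on “A”, but the RL run never saw a labeled example, only rewards for the answers it happened to sample.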
A model which produces good results with RL only, without SFT, shows that reasoning capabilities can emerge from the model’s architecture and training process alone, without the need for explicit examples of correct behavior, which is groundbreaking. Trained with RL only, R1 should in principle be able to reason more generally than a model which has been fine-tuned on a specific dataset.
There are already quantized versions of R1 released into the wild by the AI community, meaning that pretty capable versions of R1 can be run on relatively modest hardware.
You can run distilled versions of the R1 model locally in multiple ways, but the easiest is to use Ollama. Start by downloading the Ollama app, then download a version of the model which will fit your hardware (you will likely need 16GB of RAM to run the 8B model, 32GB for the 14B model and 64GB+ for the 32B model). There are multiple parameter sizes available, from 1.5B parameters all the way up to 70B parameters.
In my case, my version of Ollama is:
!ollama -v
Warning: could not connect to a running Ollama instance
Warning: client version is 0.5.12
And I am running it on a 96GB RAM Mac Studio M2 Max.
Once Ollama is installed, choose and install an appropriate distilled model. In our case we will use unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF quantized to 8 bits, which is an 8-billion-parameter model, so very small compared to the original R1 model.
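The pull cell itself is hidden in the original post; something along these lines (the tag mirrors the MODEL constant used later in the post) would produce the output below:

!ollama pull hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF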
pulling manifest
pulling f8eba201522a... 100% ▕████████████████▏ 4.9 GB
pulling 369ca498f347... 100% ▕████████████████▏ 387 B
pulling b31c130852cc... 100% ▕████████████████▏ 107 B
pulling eef4a93c7add... 100% ▕████████████████▏ 193 B
verifying sha256 digest
writing manifest
success
With the model installed, we can run it in preparation for some prompts to make sure it all works.
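The exact cell is again hidden in the original post; a minimal smoke test along these lines (the greeting prompt is an assumption) would produce the output below:

!ollama run hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF "Hi! Who are you?"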
<think>
Greetings! I'm DeepSeek-R1, an artificial intelligence assistant created by DeepSeek. I'm at your service and would be delighted to assist you with any inquiries or tasks you may have.
</think>
Greetings! I'm DeepSeek-R1, an artificial intelligence assistant created by DeepSeek. I'm at your service and would be delighted to assist you with any inquiries or tasks you may have.
Ollama can be invoked from Python via the ollama package, which can be installed in your Python environment via pip or conda. You can then use it to interact with the model. Let’s start with a simple fact-based question.
import textwrap

from ollama import generate

MODEL = "hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF"


def send_message(message: str, temperature: float) -> str:
    """
    Sends a prompt to the model and returns the generated response.
    """
    return generate(
        model=MODEL, prompt=message, options=dict(temperature=temperature)
    ).response


def format_response(response: str) -> tuple:
    """
    Extracts the chain of thought (CoT) between <think>...</think>
    and the final answer from the response string.
    Then cleans up whitespace and wraps the CoT at 80 characters per line.
    """
    try:
        cot = response.split("<think>")[1].split("</think>")[0]
        answer = response.split("</think>")[1]
    except IndexError:
        # If the tags aren't present, just return the whole response as the answer
        return "", response.strip()
    # Remove extra whitespace from the CoT
    cot = " ".join(cot.split())
    # Wrap the CoT at 80 characters
    cot = textwrap.fill(cot, width=80)
    return cot, answer.strip()


cot, answer = format_response(
    send_message(
        message="Explain why the sky is blue to a five year old.", temperature=0.6
    )
)
print(f"Chain of thought:\n\n{cot}")
Chain of thought:
Okay, so I need to explain why the sky is blue to a five-year-old. Hmm, let's
think about how to make this simple and engaging for a young child. First, I
should consider their understanding of the world around them. At five, they
might have basic knowledge of colors and maybe some simple concepts about light.
So, using terms they can relate to is key. I remember that when light enters the
Earth's atmosphere, it scatters in all directions. The shorter wavelengths,
which are blue and green, scatter more than longer ones like red. But how do I
convey that without getting too technical? Maybe I can use an analogy they
understand, like a trampoline or a playground. If the sky is like a big
trampoline, then when sunlight hits it, some of it bounces up and scatters in
all directions. The colors with more energy, which are blue and green, bounce
higher than others. Wait, but I should make sure not to overcomplicate it.
Keeping it simple, maybe using the idea that the sky is made of air, and when
light passes through the atmosphere, some colors spread out more. Blue is one of
those colors that spreads the most, making the sky look blue to us. Another
angle could be using a prism or something they might have seen in school, but I
think for a five-year-old, keeping it visual with a trampoline analogy makes
sense. It's something they can imagine and relate to their own experiences. I
should also avoid any complex scientific terms. Words like "Rayleigh scattering"
are probably beyond their understanding, so using a simple explanation about the
sky being like a trampoline where some colors bounce higher than others is
better. So putting it all together: The sky is blue because when sunlight goes
through the Earth's atmosphere, some of the light scatters in all directions.
Blue light, which has more energy, scatters more and reaches our eyes, making
the sky appear blue. I think that covers both simplicity and an engaging analogy
for a young child.
Let us give the model something to chew on which goes beyond factual questions: a simple reasoning problem involving geography.
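The prompt cell is hidden in the original post; here is a sketch reusing the helpers defined above, with a hypothetical wording of the question (the chain of thought below suggests it asked whether someone could swim from London to Southampton):

cot, answer = format_response(
    send_message(
        # Hypothetical wording; the original prompt is not shown in the post.
        message="Could someone swim from London to Southampton?",
        temperature=0.6,
    )
)
print(f"Chain of thought:\n\n{cot}")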
Chain of thought:
Okay, so I want to figure out if someone could swim from London to Southampton.
First off, I know both cities are in the UK, but I'm not exactly sure about
their geographical positions. Let me think... London is in the south-east of
England, right? And Southampton is further down on the south coast. So, the
distance between them must be a factor here. I should probably look up the
straight-line distance between London and Southampton. Maybe using something
like Google Maps to estimate it. From what I remember, it's around 100-120
miles. But that's by road, which is different from swimming. Swimming would
require a longer path because you can't cut through land; you have to go along
rivers or the coast. Wait, there are rivers in between, like the Thames and
maybe others. The Thames flows into London, so perhaps one could follow it
downstream towards Southampton. But I'm not sure how navigable the Thames is for
swimming. I think there's a lot of pollution and maybe some weirs that could
make it difficult or even dangerous. Alternatively, there are other waterways
like the Lee River or the Medway. Maybe someone could navigate through those.
But again, I don't know if they're swimmable. There might be locks or obstacles
along the way. Also, thinking about tides is important because swimming against
them would be really exhausting and maybe impossible at certain times. Another
option is to go around the coast. Southampton is on the south coast, so maybe a
route that follows the English Channel from London to Southampton could work.
But that seems like it would add a lot of distance. How long would that be?
Maybe 150 miles or more? That's a long swim even for a trained athlete. I should
also consider environmental factors. The Thames Estuary is known for having
strong currents and maybe some industrial pollution. Swimming through that could
be harmful, not to mention the risk of getting lost or caught in fishing gear.
There might also be boat traffic, which is another hazard. What about the time
it would take? If someone could swim at a steady pace of, say, 4 knots (which is
about 7.2 km/h), covering 100 miles would take roughly 14 hours. But I don't
think human swimmers can maintain that speed for such a long time without rest
or food. Plus, the weather conditions and tides might change during the swim.
Logistically, planning such a crossing seems complicated. There are no lifelines
or support teams in the middle of the Thames or English Channel. So, if
something goes wrong, help would be hard to come by. Navigation aids like buoys
or markers might not be sufficient for someone swimming without any support. I
also wonder about the feasibility from a practical standpoint. Why would someone
want to do this? Maybe as a challenge or a personal record. But it's not exactly
a common activity, so there might not be much information or resources available
on how to plan such a swim. In terms of preparation, an athlete would need to
train extensively for endurance and strength. They'd also have to learn about
tides, currents, and navigation. Maybe using a GPS device could help track their
progress, but that adds another layer of complexity. They’d also need to carry
provisions like food, water, and maybe a survival kit. Thinking about the route
again, I'm not sure if there's an official channel or a recognized path from
London to Southampton by swimming. Most long-distance swims have specific
courses, like the English Channel Swim which is 21 miles, so this would be much
longer. There might also be legal considerations. Swimming through certain areas
might require permits or could be against local regulations, especially if it's
in restricted zones near airports or industrial areas. In summary, while on
paper it seems possible from a geographical standpoint, the practical aspects
like distance, environmental hazards, physical endurance, and logistical
planning make it extremely challenging for a human athlete. Plus, there are
significant safety concerns that would need to be addressed before attempting
such an endeavor.
The model correctly inferred that there is no direct water route between London and Southampton, and that such a swim would be very challenging. It also weighed options such as following rivers, hugging the coast, or swimming offshore. So it seems to have a basic understanding of the geography, the challenges involved in what was asked, and how to mitigate them.
Let us now give the model a simple maths problem, and see how it performs. In this case we aren’t asking a question which merely requires a simple formulation, but one which requires more deliberate thinking.
cot, answer = format_response(
    send_message(
        message="""
        Bob has 3 boxes of sardine cans. Each can holds 5 sardines. Bob’s cat eats 2 sardines from one of the cans. In the end, there are 28 sardines left. How many sardines did Bob start with?
        """,
        temperature=0.5,
    )
)
print(f"Chain of thought:\n\n{cot}")
Chain of thought:
First, I need to determine how many sardines Bob started with in total. There
are three boxes containing the sardine cans. Each can holds 5 sardines. Let’s
denote the number of cans as \( x \). Therefore, the total number of sardines
initially is \( 5x \). Bob’s cat eats 2 sardines from one of the cans. This
means that in one can, there are now \( 5 - 2 = 3 \) sardines left. In the end,
there are 28 sardines remaining. Since only one can was affected by the cat
eating sardines, all the remaining sardines must be in the other two cans.
Therefore, the total number of sardines in these two cans is 28. Since each can
holds 5 sardines, and there are \( x - 1 \) cans left with their original count,
the equation becomes: \( 5(x - 1) = 28 \) Solving for \( x \): \( x - 1 =
\frac{28}{5} \) \( x - 1 = 5.6 \) \( x = 6.6 \) Since the number of cans must be
a whole number, there seems to be an inconsistency in the problem. Perhaps I
made a mistake or misinterpreted the information. Let me double-check. Wait,
each can holds 5 sardines, and Bob has three boxes of cans. The cat eats 2
sardines from one can, leaving 3 sardines in that can. There are still 28
sardines left in total. So, if there were \( x \) cans initially, after the cat
ate some, there are \( x - 1 \) cans with 5 sardines each and 1 can with 3
sardines, making a total of: \( 5(x - 1) + 3 = 28 \) Solving this equation: \(
5x - 5 + 3 = 28 \) \( 5x - 2 = 28 \) \( 5x = 30 \) \( x = 6 \) So, Bob started
with 6 cans.
The reasoning sequence in the chain of thought is interesting. The model actually goes beyond what was asked, and consistently solves through the whole chain (it would have been enough to stop at \(28 + 2 = 30\)), but it shows a decent level of sophistication for an 8 billion parameter model.
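For reference, the shortcut in question: the starting count is just the sardines remaining plus the ones the cat ate, which also pins down the number of cans:

\[
28 + 2 = 30 = 6 \text{ cans} \times 5 \text{ sardines per can}
\]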
This isn’t exactly a groundbreaking result, but it shows that the model can reason through a simple maths problem, and that it can identify a possible solution.
It finally presents us with the answer.
Answer:
**Solution:**
Let's determine how many sardines Bob initially had.
1. **Total Cans and Sardines:**
- Bob has **3 boxes** of sardine cans.
- Each can holds **5 sardines**.
Therefore, the total number of sardines initially is:
\[
3 \text{ boxes} \times 5 \text{ sardines per box} = 15 \text{ sardines}
\]
2. **Cat Eats Sardines:**
- The cat eats **2 sardines** from one can.
So, the remaining sardines in that particular can are:
\[
5 \text{ sardines} - 2 \text{ sardines} = 3 \text{ sardines}
\]
3. **Sardines Left:**
- After the cat eats the sardines, there are **28 sardines** left in total.
4. **Determining Initial Sardines:**
- Initially, Bob had:
- **1 can** with **3 sardines** (after the cat ate some).
- The remaining **2 cans** still have their original **5 sardines each**.
So, the total number of sardines is:
\[
1 \text{ can} \times 3 \text{ sardines} + 2 \text{ cans} \times 5 \text{ sardines per can} = 3 + 10 = 13 \text{ sardines}
\]
However, this doesn't align with the given total of **28 sardines**. This discrepancy suggests there might be a misunderstanding in the problem setup.
5. **Re-evaluating the Problem:**
- If each can holds **5 sardines**, and the cat eats **2 sardines** from one can, leaving **3 sardines** in that can.
Given that there are still **28 sardines** left:
\[
1 \text{ can} \times 3 \text{ sardines} + 2 \text{ cans} \times 5 \text{ sardines} = 3 + 10 = 13 \text{ sardines}
\]
This does not add up to **28 sardines**. Therefore, there might be an error in the problem's parameters.
6. **Alternative Approach:**
- Let's denote the number of cans as \( x \).
- Initially, there are \( 5x \) sardines.
- After eating:
- **1 can** has \( 5 - 2 = 3 \) sardines.
- The remaining \( x - 1 \) cans each have **5 sardines**.
- Total sardines left:
\[
(x - 1) \times 5 + 3 = 28
\]
- Solving for \( x \):
\[
5(x - 1) + 3 = 28 \\
5x - 5 + 3 = 28 \\
5x - 2 = 28 \\
5x = 30 \\
x = 6
\]
- Therefore, Bob started with **6 cans**.
\[
\boxed{15}
\]
It remains to be seen how chain-of-thought models will operate at scale in real-world applications, or how cost effective they will be. However, it is pretty clear that the race is on, and that OpenAI with its o1 and o3 models isn’t the only game in town. The fact that DeepSeek has released an open-source model which is on par with o1 is a big deal, especially since this is a model originating in China, and one that is much cheaper to run than o1.
This is particularly important as it shows how quickly and furiously competitors can emerge, plus how fast China is catching up.
In the meantime, in Europe, the only foundation model which even comes close is Mistral Large 2. But at least Europe has an Act!
ollama stop hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF