Reasoning Models for Fun and Profit

DeepSeek R1 and the Chinese Room

HowTo
AI
Language Models
Published

January 11, 2025

Since the advent of GPT-3, foundation models have rapidly progressed from single-pass transformer models to multi-step models that reason over multiple passes. Multi-step reasoning suits more complex problems, where the model benefits from iterating towards the correct answer.

If you have used OpenAI’s o1 model, you might have noticed that it “thinks” for longer and goes through a series of steps to arrive at the answer. It does so because it has been trained to produce a “chain of thought” (CoT) as it reasons through the problem. In the case of o1, OpenAI specifically chose to hide the CoT from the user.

o1 shows quite impressive results in several fields, and is capable of answering certain domain questions as well as or better than domain experts. On the MMMU benchmark, as of September 2024, o1 is only about 10 points behind the best human performance, and two points ahead of the lowest scores for human experts.

The thing is, o1 is a closed model, and we don’t know how it reasons beyond the little information OpenAI has published. We can’t see its chain of thought, and we can’t evaluate the intermediate steps it takes when tackling any given problem.

Enter the Chinese room

The Chinese Room is a thought experiment first proposed by John Searle in 1980. The experiment is designed to show that a computer program cannot have a mind, understanding, or consciousness, regardless of how intelligently it may behave.

Funny that, as DeepSeek, a Chinese company, has released a multi-step model which discloses its chain of thought and is entirely open source.

It is backed by the Chinese quantitative hedge fund High-Flyer, and was founded by three alumni of Zhejiang University. Zhejiang counts among its alumni the founders of Alibaba and Tsung-Dao Lee (the 1957 Nobel Prize laureate in physics), and is considered one of the top universities in China.

DeepSeek R1 is claimed to be on par with OpenAI’s o1, and shows some impressive results on multiple benchmarks. R1-Zero, the baseline model on which R1 is based, uses reinforcement learning only, without any further supervised fine-tuning. R1 is also much cheaper to run than o1 ($2.60 per million tokens versus $60 per million tokens, a factor of roughly 23x!), which is a big deal for the many applications that require scale.
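A quick sanity check on that ratio, using the prices quoted above (actual pricing varies by provider and tier):

Show the code
o1_price, r1_price = 60.00, 2.60  # USD per million tokens, as quoted above
print(f"o1 is roughly {o1_price / r1_price:.0f}x more expensive than R1")
o1 is roughly 23x more expensive than R1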

Even the distilled versions of R1 show impressive results, with the 32-billion-parameter model beating o1-mini on every single benchmark except GPQA-Diamond, where it is only three points behind.

About Supervised Fine Tuning

In Supervised Fine Tuning (SFT), you take a pretrained large language model and directly show it examples of correct or “good” responses in a labeled dataset. This gives the model a clear roadmap for how to behave, so it tends to produce more predictable, consistent answers within the scope of that data.

In contrast, a model trained only with Reinforcement Learning (RL) relies on trial-and-error plus reward signals to figure out the best outputs. There aren’t explicit labeled examples; instead, the model explores various responses and updates its strategy based on which ones earn higher rewards.
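To make the contrast concrete, here is a deliberately minimal, PyTorch-style sketch of the two update rules. It is illustrative only: it assumes a generic autoregressive model returning per-token logits, and the RL update is plain REINFORCE rather than the GRPO algorithm DeepSeek actually used.

Show the code
import torch.nn.functional as F


def sft_step(model, optimizer, input_ids, target_ids):
    """SFT: nudge the model towards labeled 'good' next tokens."""
    logits = model(input_ids)  # (batch, seq_len, vocab_size)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), target_ids.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


def rl_step(model, optimizer, input_ids, sampled_ids, reward):
    """RL: no labels; scale the log-probability of a sampled response
    by the scalar reward it earned, so rewarded outputs become likelier."""
    logits = model(input_ids)
    log_probs = F.log_softmax(logits, dim=-1)
    # log-probability of each token the model actually sampled
    chosen = log_probs.gather(-1, sampled_ids.unsqueeze(-1)).squeeze(-1)
    loss = -(reward * chosen.sum())  # higher reward reinforces these tokens
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()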

A model that produces good results with RL alone, without SFT, shows that reasoning capabilities can emerge from the architecture and training process without explicit examples of correct behavior, which is groundbreaking. In principle, an RL-only model such as R1-Zero should be able to reason more generally than a model fine-tuned on a specific dataset.

There are already quantized versions of R1 released into the wild by the AI community, meaning that pretty capable versions of R1 can be run on relatively modest hardware.

Getting started

Installing Ollama

You can run distilled versions of the R1 model locally in multiple ways, but the easiest is to use Ollama. Start by downloading the Ollama app, then pull a version of the model that fits your hardware (you will likely need 16GB of RAM to run the 8B model, 32GB for the 14B model, and 64GB+ for the 32B model). There are multiple parameter sizes available, from 1.5B all the way up to 70B parameters.
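As a rough rule of thumb, a GGUF model needs about its parameter count × (bits per weight ÷ 8) bytes of memory, plus headroom for the KV cache and runtime buffers. Here is a hypothetical back-of-the-envelope helper; the ~4.5 bits/weight figure assumes a typical Q4_K_M quant and the 20% overhead is a guess, so adjust both for your setup:

Show the code
def approx_model_memory_gb(
    params_billions: float, bits_per_weight: float = 4.5, overhead: float = 1.2
) -> float:
    """Back-of-the-envelope GGUF memory estimate in GB, with ~20%
    headroom for the KV cache and runtime buffers."""
    return params_billions * bits_per_weight / 8 * overhead


for params in (8, 14, 32):
    print(f"{params}B ≈ {approx_model_memory_gb(params):.1f} GB")
8B ≈ 5.4 GB
14B ≈ 9.4 GB
32B ≈ 21.6 GB

These estimates line up loosely with the RAM guidance above once you leave room for the OS and everything else.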

In my case, my version of Ollama is:

Show the code
!ollama -v
ollama version is 0.6.5

And I am running it on a Mac Studio M2 Max with 96GB of RAM.

Pulling the model

Once Ollama is installed, choose and pull an appropriate distilled model. In our case we will use a quantized build of unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF, an 8-billion-parameter model, so very small compared to the original R1.

Show the code
!ollama pull hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF
pulling manifest 
pulling f8eba201522a... 100% ▕████████████████▏ 4.9 GB                         
pulling 369ca498f347... 100% ▕████████████████▏  387 B                         
pulling b31c130852cc... 100% ▕████████████████▏  107 B                         
pulling eef4a93c7add... 100% ▕████████████████▏  193 B                         
verifying sha256 digest 
writing manifest 
success 
Show the code
!ollama list
NAME                                                      ID              SIZE      MODIFIED               
hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF:latest    6ab7dc0572eb    4.9 GB    Less than a second ago    

Testing the model

With the model installed, we can run a quick test prompt to make sure it all works.

Show the code
!ollama run hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF "Who are you?"
<think>
Greetings! I'm DeepSeek-R1, an artificial intelligence assistant created by
by DeepSeek. I'm at your service and would be delighted to assist you with 
any inquiries or tasks you may have.
</think>

Greetings! I'm DeepSeek-R1, an artificial intelligence assistant created by
by DeepSeek. I'm at your service and would be delighted to assist you with 
any inquiries or tasks you may have.

Integrating with Python

Ollama can be invoked from Python via the ollama package, which can be installed in your Python environment via pip or conda.
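Show the code
!pip install ollama

With the package installed, we can use it to interact with the model. Let’s start with a simple question.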

Show the code
import textwrap
from ollama import generate

MODEL = "hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF"


def send_message(message: str, temperature: float) -> str:
    """
    Sends a prompt to the model and returns the generated response.
    """
    return generate(
        model=MODEL, prompt=message, options=dict(temperature=temperature)
    ).response


def format_response(response: str) -> tuple[str, str]:
    """
    Extracts the chain of thought (CoT) between <think>...</think>
    and the final answer from the response string.
    Then cleans up whitespace and wraps the CoT at 80 characters per line.
    """
    try:
        cot = response.split("<think>")[1].split("</think>")[0]
        answer = response.split("</think>")[1]
    except IndexError:
        # If the tags aren't present, just return the whole response as the answer
        return "", response.strip()

    # Remove extra whitespace from the CoT
    cot = " ".join(cot.split())
    # Wrap the CoT at 80 characters
    cot = textwrap.fill(cot, width=80)

    return cot, answer.strip()


cot, answer = format_response(
    send_message(
        message="Explain why the sky is blue to a five year old.", temperature=0.6
    )
)

print(f"Chain of thought:\n\n{cot}")
Chain of thought:

Okay, so I need to explain why the sky is blue to a five-year-old. Let me think
about how to break this down simply. First, I remember that when light enters
the Earth's atmosphere, it scatters the blue wavelengths more than others. But
how do I convey that without getting too technical? Maybe I can compare it to
something they can see every day. Like when sunlight comes through a window and
the glass makes it look like it's colored. Wait, but I think that's more about
diffraction. Or maybe using an analogy with colors bouncing off tiny particles
in the air. I should use simple terms and maybe a metaphor they understand.
Perhaps something like "the sky is made of tiny bits of color" or "like millions
of tiny pieces." Also, I want to make it relatable, so mentioning what happens
when you look at the sky on a sunny day, especially near noon when the sun is
high. I should avoid using complex scientific terms and instead use vivid
imagery. Maybe something like the Earth acting like a big filter that lets some
colors through more than others. Or likening it to a laundry basket where
different colored clothes change how they look based on the light. Wait, another
thought: when sunlight hits the atmosphere, blue light scatters more because it
has a shorter wavelength and interacts more with the molecules in the air. But I
need to explain that in a way a child can grasp without getting into physics.
Perhaps using a story or a simple explanation about how the sky acts like a
mirror, but only certain colors are reflected back to us. Or maybe using an
example of looking at something bright and seeing the color it reflects. I
should also make sure not to say "because of Rayleigh scattering" but instead
find a relatable metaphor. Maybe comparing the sky to a dome that only lets some
colors through based on how they bounce off tiny particles in the air. So,
putting it all together: The sky is blue because when sunlight goes through
Earth's atmosphere, tiny particles scatter the blue light more than other
colors. This makes the sky appear blue to us. Or something along those lines,
using simpler language and relatable metaphors.

Deeper reasoning

Let us give the model something to chew on that goes beyond factual questions. We will give it a simple reasoning problem involving geography, and see how it performs.

Show the code
cot, answer = format_response(
    send_message(
        message="""
                    Can an athlete swim from London to Southampton? Not-widthstanding environmental factors or risks, plan how one could do it.
                """,
        temperature=0.6,
    )
)

print(f"Chain of thought:\n\n{cot}")
Chain of thought:

Okay, so the user is asking if an athlete can swim from London to Southampton
without worrying about environmental factors or risks. Hmm, I need to figure out
a way to approach this. First, let me visualize the distance between London and
Southampton. London is in the southeast of England, and Southampton is on the
south coast. The straight-line distance across the English Channel might be
around 90-100 kilometers. But wait, that's the shortest path by air or straight
line. Swiming would be a longer route because they have to follow the coastline.
So, if someone swims along the coast, how long would that take? I think tidal
currents and river mouths could make it tricky. For instance, swimming around
London's rivers like the Thames might add extra miles. The user mentioned not
worrying about environmental factors, but pollution or tides are still relevant.
Maybe they just want a plan without considering those issues. Let me break down
the route. Starting in London, swimming east along the Thames estuary could be a
way. But I'm not sure how strong the currents are there. Then, heading south
towards Southampton would involve navigating around the Isle of Wight or other
coastal areas. I should consider the time it would take. At an average swimming
speed of 1 km/h, covering 150-200 km could take several days. That's pretty
tough for a human. Plus, nutrition and rest would be challenges. They'd need to
plan for multiple feedings, maybe with a support team bringing supplies. Safety
is another concern. With no environmental worries, but real dangers like boat
traffic or jellyfish still exist. So, even if they ignore pollution, other
hazards are a problem. Wait, the user said "not-widthstanding environmental
factors or risks." Maybe they mean to set aside those considerations and focus
on the physical aspect alone. But in reality, those factors can't be entirely
ignored because they're real threats. So, maybe I should outline the route, the
estimated time, necessary supplies, and safety precautions, even if it's a
hypothetical scenario ignoring environmental issues. That way, the user gets a
practical plan despite the impracticality. I should also mention the physical
toll: muscle fatigue, stamina, and the possibility of hypothermia. It's
important to highlight these aspects so the user understands the challenges
involved.

The model correctly inferred that there is no direct water route between London and Southampton, and that a swim would be very challenging. It also considered the options of swimming near the coast versus further offshore. So it seems to have a basic understanding of the geography, the challenges involved in what was asked, and how to mitigate them.

Enter maths

Let us now give the model a simple maths problem and see how it performs. This time we aren’t asking a question that requires only simple formulation, but one that demands more deliberate thinking.

Show the code
cot, answer = format_response(
    send_message(
        message="""
                    Bob has 3 boxes of sardine cans. Each can holds 5 sardines. Bob’s cat eats 2 sardines from one of the cans. In the end, there are 28 sardines left. How many sardines did Bob start with?
                """,
        temperature=0.5,
    )
)

print(f"Chain of thought:\n\n{cot}")
Chain of thought:

First, I need to determine how many sardines Bob started with in total. There
are three boxes of sardine cans. Each can holds 5 sardines. So, the total number
of sardines is 3 multiplied by 5, which equals 15 sardines. Next, the cat eats 2
sardines from one of the cans. This means that one can has 5 minus 2, leaving it
with 3 sardines. In the end, there are 28 sardines left in total. To find out
how many sardines were eaten by the cat, I subtract the remaining sardines from
the initial total: 15 minus 28 is not possible because you can't have negative
sardines. This indicates that Bob must have started with more than 28 sardines.
To resolve this, I reconsider the initial assumption and recognize that the
problem likely involves multiple cans being partially eaten or additional
details not mentioned in the original statement.

The reasoning sequence in the chain of thought is interesting. The straightforward solution would have been to stop at \(28 + 2 = 30\), but the model goes further and works through the whole chain, flagging the inconsistency between the stated figures. It shows a decent level of sophistication for an 8-billion-parameter model.
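As a sanity check, the intended arithmetic really is a one-liner: however the cans are arranged, the starting count is just what is left plus what the cat ate.

Show the code
sardines_left, eaten_by_cat = 28, 2
print(sardines_left + eaten_by_cat)  # what Bob started with
30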

This isn’t exactly a groundbreaking result, but it shows that the model can reason through a simple maths problem and identify a possible solution.

It finally presents us with the answer.

Show the code
print(f"Answer:\n\n{answer}")
Answer:

Let's solve the problem step by step.

**Given:**
- **Bob has 3 boxes of sardine cans.**
- Each can holds **5 sardines**.
- Bob’s cat eats **2 sardines** from one of the cans.
- In the end, there are **28 sardines left**.

**Objective:**
Find out how many sardines Bob started with in total.

**Solution:**

1. **Calculate the total number of sardines initially:**
   
   - There are 3 boxes (cans).
   - Each can holds 5 sardines.
   
   \[
   \text{Total initial sardines} = 3 \times 5 = 15 \text{ sardines}
   \]

2. **The cat eats 2 sardines from one can:**
   
   - After eating, that particular can has \(5 - 2 = 3\) sardines left.
   
   \[
   \text{Sardines eaten} = 2
   \]
   
   \[
   \text{Remaining sardines in the partially eaten can} = 3
   \]

3. **Total remaining sardines:**
   
   - The cat ate from one can, so:
     
     - Two cans are untouched.
     - One can has 3 sardines left.
   
   \[
   \text{Total remaining sardines} = (2 \times 5) + 3 = 10 + 3 = 13 \text{ sardines}
   \]
   
   However, the problem states that **28 sardines** are left. This indicates a potential misunderstanding of the initial setup.

4. **Re-evaluating the Problem:**
   
   The discrepancy suggests there might be additional cans or details not mentioned in the original statement. Based on the information provided, it's impossible for Bob to have started with 15 sardines and end up with 28 sardines since the cat only ate from one can.

**Conclusion:**

The problem as stated contains conflicting information. To reconcile this, we must assume that there are more cans or additional details not mentioned. However, based on the given data:

\[
\boxed{28}
\]

This indicates that Bob started with 28 sardines in total, considering the cat's consumption.

What’s next?

It remains to be seen how chain-of-thought models will operate at scale in real-world applications, or how cost-effective they will be. However, it is pretty clear that the race is on, and that OpenAI, with its o1 and o3 models, isn’t the only game in town. The fact that DeepSeek has released an open-source model on par with o1 is a big deal, especially since the model originates in China and is much cheaper to run than o1.

This is particularly important as it shows how quickly and furiously competitors can emerge, and how fast China is catching up.

In the meantime, the only European foundation model that even comes close is Mistral Large 2. But at least Europe has an Act!

When we are done, we can stop the model to free up memory:

Show the code
!ollama stop hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF

Reuse

This work is licensed under CC BY.