Why I think DeepSeek-R1 just revealed the path to AGI.
Here’s a visual explanation of exactly what makes DeepSeek-R1 so good.
DeepSeek, a Chinese AI company, just came up with a model that has reasoning capabilities comparable to OpenAI-o1, despite having a fraction of the parameters and costing much less to train.
Earlier, I wrote an article on Medium that showed instances where GPT-4o failed to reason effectively. One such instance was the “dead Schrödinger cat” problem.
When I asked GPT-4o, here’s what its response was. Clearly, it failed to recognise that the cat was already dead, meaning the probability that it was alive was 0.

When I tried the same problem on DeepSeek, I noticed the LLM thinking like a smart human, trying to consider all possibilities and avoiding silly mistakes before finalising the answer. Eventually, it not only gave the correct answer, but also explained it really well.
That’s when I decided I needed to investigate further. What is the secret behind DeepSeek’s improved reasoning?

Base Models
All training approaches, including DeepSeek’s, start with a pre-training phase that produces the “base model”.
Base models are LLMs that arise immediately after pre-training but before any form of supervised fine-tuning.
Pre-training (in LLMs) involves exposing the LLM to huge corpora of internet text, improving its ability to predict the next word. The base models that arise will not necessarily give helpful answers, but they are reasonably fluent in the structure of the language and know which words are likely to follow a given sequence of text.
The final base model that arises has the following features:
It understands the structure of the language. It can predict a grammatically fluent set of next tokens given an input question.
It may fail to provide helpful responses, for example producing a grammatically fluent sentence that is inaccurate or irrelevant.
It may result in harmful outputs, including objectionable answers or an inability to reject harmful requests (e.g., “how to hack into someone’s email”).
Most base models are available on Hugging Face, and you can try them out. These models are usually not meant for production use cases, since they skip the later training steps that make them suitable for production.
DeepSeek’s approach
The main difference between DeepSeek-R1 and other approaches was the special “autonomous” RL step that they introduced during training. Note that this is very different from the RLHF step that already exists in LLMs.
What exactly is RL? I’ll explain the idea briefly here.
A Brief Intro to RL
Let’s say a mouse needs to learn to find food.
Initially, the mouse “explores” a lot, trying random actions and seeing what works.
Over time, specific sequences of actions lead the mouse to a reward (such as the food), and the mouse learns to prioritise those actions. At this stage, the mouse “exploits” what it already knows to maximise its reward.

There are three variables in an RL problem — state, action and reward.
Given a specific state, the mouse should “learn a policy”, which allows it to determine the action it should take to get the maximum reward.
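The mouse example maps neatly onto code. Below is a minimal tabular Q-learning sketch on a 3×3 grid; the grid size, reward values and hyperparameters are my own illustrative choices, not anything from DeepSeek’s pipeline:

```python
import random

random.seed(0)  # for reproducibility

ACTIONS = ["up", "down", "left", "right"]
GOAL = (2, 2)            # the cell containing the food
EPSILON = 0.2            # probability of exploring a random action
ALPHA, GAMMA = 0.5, 0.9  # learning rate and discount factor

Q = {}                   # Q[(state, action)] -> estimated future reward

def step(state, action):
    """Move the mouse one cell, clamped to the grid; reward 1 at the food."""
    x, y = state
    dx, dy = {"up": (0, 1), "down": (0, -1),
              "left": (-1, 0), "right": (1, 0)}[action]
    nxt = (min(max(x + dx, 0), 2), min(max(y + dy, 0), 2))
    return nxt, (1.0 if nxt == GOAL else 0.0)

def choose(state):
    """Explore with probability EPSILON, otherwise exploit the best action."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q.get((state, a), 0.0))

for _ in range(500):         # each episode: start in a corner, walk to the food
    state = (0, 0)
    while state != GOAL:
        action = choose(state)
        nxt, reward = step(state, action)
        best_next = max(Q.get((nxt, a), 0.0) for a in ACTIONS)
        old = Q.get((state, action), 0.0)
        Q[(state, action)] = old + ALPHA * (reward + GAMMA * best_next - old)
        state = nxt
```

After training, the purely exploiting (greedy) policy walks from (0, 0) to the food in a handful of steps — the “learned policy” from the description above.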
So what exactly did DeepSeek do with RL?
DeepSeek-R1-Zero’s RL strategy
We’ll start with DeepSeek-R1-Zero, a simplified version of DeepSeek-R1.
Earlier, DeepSeek released a paper called DeepSeekMath, where they first introduced this RL strategy. DeepSeek-R1-Zero marks the second time they used it.
The goal of the RL strategy was to allow the LLM to reason better while generating outputs. The action corresponds to the next token generated by the LLM, the state corresponds to the tokens generated so far, and the reward is determined by a special reward function that rewards “good outputs”.
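To make that mapping concrete, here’s a toy sketch of one RL “episode” for an LLM. The vocabulary, policy and reward below are trivial stand-ins for the real model and reward function, chosen purely for illustration:

```python
import random

random.seed(0)  # deterministic for this illustration

# Stand-in vocabulary; a real system uses the LLM's full token vocabulary.
VOCAB = ["the", "answer", "is", "4", "5", "<eos>"]

def policy(state):
    """Stand-in for the LLM: pick the next token given the tokens so far."""
    return random.choice(VOCAB)

def reward(tokens):
    """Stand-in reward function: 1 if the output contains the right answer."""
    return 1.0 if "4" in tokens else 0.0

state = []                           # state: the tokens generated so far
while not state or state[-1] != "<eos>":
    state.append(policy(state))      # action: generate one more token
    if len(state) > 20:              # cap the episode length
        break

final_reward = reward(state)         # reward arrives once the output is scored
```

Note that the reward only arrives at the end of the episode, once the full output can be judged — this is what the reward functions below are scoring.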

You’ve probably heard of RLHF (Reinforcement Learning from Human Feedback), one of the typical LLM training phases.
The main difference I’d like to touch upon in this blog is the way the rewards were defined.
The reward functions are no longer based on “human feedback”.
Rewards are determined automatically based on:
Correctness of the answer. The model is given some math questions where it typically boxes the final answer it gets, and the model is rewarded for “correct final answers”. Thus, through RL, the model tries to optimise its process to give more correct answers.
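A rule-based correctness reward can be sketched in a few lines. The regex and reward values below are my assumptions — the idea is simply to extract the boxed final answer and compare it against the ground truth:

```python
import re

def answer_reward(model_output, ground_truth):
    """Reward 1 if the model's boxed final answer matches the ground truth."""
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)
    if match is None:
        return 0.0                        # no final answer produced
    return 1.0 if match.group(1).strip() == ground_truth else 0.0

print(answer_reward(r"So the total is \boxed{42}.", "42"))   # -> 1.0
print(answer_reward(r"So the total is \boxed{41}.", "42"))   # -> 0.0
```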

Correctness of the code output. The training dataset also consists of coding questions, and the code outputs of the LLM can simply be passed into a compiler and evaluated on a set of predetermined test cases, similar to competitive programming sites like LeetCode.
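A test-case-based code reward might look like the sketch below. The function name and reward scheme are my assumptions, and executing model-generated code with exec() is purely illustrative — a real pipeline would compile and sandbox submissions:

```python
def code_reward(model_code, test_cases, func_name="solve"):
    """Run a model-written function against test cases; reward = pass rate."""
    namespace = {}
    try:
        exec(model_code, namespace)          # "compile and run" the submission
        fn = namespace[func_name]
        passed = sum(1 for args, expected in test_cases
                     if fn(*args) == expected)
    except Exception:
        return 0.0                           # crash or missing function: no reward
    return passed / len(test_cases)

# A hypothetical model submission and its hidden test cases:
submission = "def solve(a, b):\n    return a + b"
print(code_reward(submission, [((1, 2), 3), ((2, 3), 5)]))   # -> 1.0
```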

Rewarding the thinking process. Some say this is the secret to why DeepSeek works so well. The LLM is specifically rewarded for putting its reasoning tokens within <think></think> tags. This forces the LLM to think, which encourages it to work out the answer before answering.
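The format reward can be sketched as a simple template check — the exact pattern and reward values here are assumptions:

```python
import re

def format_reward(model_output):
    """Reward 1 if the output reasons inside <think> tags before answering."""
    pattern = r"^<think>.+?</think>.+$"
    return 1.0 if re.match(pattern, model_output, flags=re.DOTALL) else 0.0

print(format_reward("<think>2 + 2 = 4</think>The answer is 4."))  # -> 1.0
print(format_reward("The answer is 4."))                          # -> 0.0
```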
This process helps a lot because it lets RL train the LLM on a huge number of examples with good-quality supervision. Human feedback data can be noisy, but an objective reward like the one used here gives a cleaner signal, allowing the LLM to learn how to optimise for getting the correct answer.
The result?

The plot here represents the accuracy of the LLM on the American Invitational Mathematics Examination (AIME) benchmark.
The blue plot shows the accuracy when a single sampled answer from the model is scored for each question.
The red plot represents the consensus prediction across 16 samples. That means the model answers each question 16 times, and the “majority vote” across those 16 responses is taken as the final answer while computing accuracy.
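Majority voting itself is simple to sketch — the sample answers below are made up for illustration:

```python
from collections import Counter

def consensus_answer(samples):
    """Take the most common answer among several sampled responses."""
    return Counter(samples).most_common(1)[0][0]

# Sampled answers to one (made-up) question, shortened here to 5 samples:
samples = ["72", "72", "68", "72", "90"]
print(consensus_answer(samples))   # -> 72
```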
The RL process evidently improves the LLM’s reasoning ability with time, showing how powerful this technique is. In fact, the model ended up performing similarly to OpenAI-o1 on several reasoning benchmarks!
DeepSeek-R1
The issue with DeepSeek-R1-Zero was that, while it excelled at reasoning, it didn’t produce readable outputs.
To fix this issue, they implemented a few additional steps, such as supervised fine-tuning and an RL step that prioritised not just reasoning but also other tasks.
For example, since we typically want LLMs to be harmless, we might train it to refuse to answer any harmful questions, like “how to hack into someone’s email”.
But one core difference for DeepSeek-R1 is the cold-start.
Cold-start
Think of it this way:
DeepSeek-R1-Zero figured out how to reason the hard way, through trial and error.
DeepSeek-R1-Zero then passes on what it learnt to DeepSeek-R1 through many high-quality examples. This allows DeepSeek-R1 to make much more progress with fewer training iterations.
That is the cold start. It’s the phase where Chain-of-Thought reasoning examples generated by DeepSeek-R1-Zero are cleaned up, made readable, and used to fine-tune DeepSeek-R1.
After the cold start, DeepSeek-R1 is subjected to the same RL process — it has already learnt a lot from R1-Zero, but now continues to figure stuff out on its own.
Concluding thoughts
This new RL pipeline boosted reasoning to a great extent. But is this the end?
I think not.
A clean and objective reward function allowed for mountains of data that the LLM could use to learn how to come to the correct answer.
Encouraging the LLM to “think” a lot before answering ensured the LLM would verify its answer and give more accurate answers.
Certain “smart” behaviours evolved out of this learning process. For example, the LLM learnt behaviours such as re-reading the question, considering all possibilities, and revisiting/re-evaluating its previous steps. These “smart” behaviours weren’t explicitly programmed into the LLM, but evolved just by providing the model with the right incentives.
Just by providing the right incentives, AI systems can develop new rules of reasoning that even humans aren’t aware of.
That means they could, one day, be smarter than the humans who built them.
Will RL be the secret to reaching AGI?
Acknowledgements
The contents of this blog are inspired by the DeepSeek-R1 paper. All images were made by me on Canva or using GPT-4o.
Follow me: LinkedIn | X (Twitter) | Website | Medium