Add Understanding DeepSeek R1
parent fa31b1a078
commit 86b0f1c43e
1 changed file with 92 additions and 0 deletions
Understanding-DeepSeek-R1.md (new file, 92 additions)
@@ -0,0 +1,92 @@
DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that has been making waves in the AI community. Not only does it match, or even surpass, OpenAI's o1 model on many benchmarks, but it also comes with fully MIT-licensed weights. This makes it the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible way.

What makes DeepSeek-R1 particularly exciting is its transparency. Unlike the less open approaches of some industry leaders, DeepSeek has published a detailed training methodology in their paper.

The model is also remarkably cheap to run, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens at $2.19 per million (vs o1's $60).

Until roughly GPT-4, the common wisdom was that better models required more data and compute. While that still holds, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.
## The Essentials

The DeepSeek-R1 paper presented several models, but the main ones are R1 and R1-Zero. These are followed by a series of distilled models that, while interesting, I won't discuss here.

DeepSeek-R1 relies on two major ideas:

1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL.
2. Group Relative Policy Optimization (GRPO), a reinforcement learning technique that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.

R1 and R1-Zero are both reasoning models. This essentially means they do Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking within a `<think>` tag before answering with a final summary.
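To make the format concrete, here is a small illustration (my own, not from the paper) of what such an output looks like and how the reasoning can be separated from the final answer:

```python
import re

# An R1-style completion: the chain-of-thought sits inside <think> tags,
# followed by the final answer (the example text is made up).
completion = (
    "<think>The user asks for 17 * 23. 17 * 20 = 340 and 17 * 3 = 51, "
    "so the product is 391.</think>"
    "17 * 23 = 391."
)

def split_reasoning(text: str) -> tuple[str, str]:
    """Separate the chain-of-thought from the final summary."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()

reasoning, answer = split_reasoning(completion)
print("reasoning:", reasoning)
print("answer:", answer)
```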
## R1-Zero vs R1

R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base with no supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward.

R1-Zero achieves excellent accuracy but sometimes produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and multiple RL passes, which improves both correctness and readability.

It is interesting how some languages may express certain concepts better, which leads the model to pick the most expressive language for the task.
## Training Pipeline

The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It showcases how they created such strong reasoning models, and what you can expect from each stage. This includes the problems that the resulting models from each stage have, and how they solved them in the next stage.

It's interesting that their training pipeline differs from the usual one:

- The usual training approach: pretraining on a large dataset (training to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF
- R1-Zero: Pretrained → RL
- R1: Pretrained → Multistage training pipeline with multiple SFT and RL stages
1. Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to ensure the RL process has a decent starting point. This gives a good model to start RL from.
2. First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing chain-of-thought into thinking tags). Once they were near convergence in the RL process, they moved to the next step. The result of this step is a strong reasoning model but with weak general capabilities, e.g., poor formatting and language mixing.
3. Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model; the sketch after this list illustrates the rejection-sampling idea. They collected around 600k high-quality reasoning samples.
4. Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general capabilities.
5. Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1.
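As a rough illustration of the rejection-sampling step referenced above (my own sketch with toy stand-ins, not DeepSeek's code), the idea is simply to sample several completions per prompt and keep only the ones that pass automatic checks, which then become SFT data:

```python
import random

def generate_candidates(prompt: str, n: int) -> list[str]:
    # Toy stand-in for sampling from the RL checkpoint: sometimes right, sometimes not.
    return [
        f"<think>working on: {prompt}</think>{random.choice(['4', '5'])}"
        for _ in range(n)
    ]

def passes_checks(completion: str, reference: str) -> bool:
    # Keep only well-formed completions whose final answer matches the reference.
    answer = completion.split("</think>")[-1].strip()
    return "</think>" in completion and answer == reference

def rejection_sample(prompts_with_refs, n_per_prompt: int = 8) -> list[dict]:
    sft_data = []
    for prompt, reference in prompts_with_refs:
        for completion in generate_candidates(prompt, n_per_prompt):
            if passes_checks(completion, reference):
                sft_data.append({"prompt": prompt, "completion": completion})
    return sft_data

kept = rejection_sample([("What is 2 + 2?", "4")])
print(f"kept {len(kept)} of 8 sampled completions")
```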
They also performed model distillation for several Qwen and Llama models on the reasoning traces to obtain the distilled-R1 models.

Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student model.

The teacher is typically a larger model than the student.
## Group Relative Policy Optimization (GRPO)

The basic idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful answers.

They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.

In this paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO.

Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.

What makes their approach particularly interesting is its reliance on straightforward, rule-based reward functions.

Instead of depending on expensive external models or human-graded examples as in traditional RLHF, the RL used for R1 relies on simple criteria: it might give a higher reward if the answer is correct, if it follows the expected `<think>`/`<answer>` format, and if the language of the answer matches that of the prompt.

Not relying on a reward model also means you don't have to spend time and effort training it, and it doesn't take memory and compute away from your main model.
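Here is a minimal sketch (my own illustration, not DeepSeek's code) of what such rule-based reward functions could look like, checking correctness, format, and language consistency with simple deterministic rules:

```python
import re

# Expect the reasoning inside <think> tags, followed by a non-empty answer.
THINK_PATTERN = re.compile(r"^<think>.*?</think>.+$", re.DOTALL)

def format_reward(completion: str) -> float:
    return 1.0 if THINK_PATTERN.match(completion.strip()) else 0.0

def correctness_reward(completion: str, reference: str) -> float:
    # Exact match against a reference answer; real verifiers can parse math,
    # run unit tests on generated code, etc.
    answer = completion.split("</think>")[-1].strip()
    return 1.0 if answer == reference else 0.0

def language_consistency_reward(completion: str) -> float:
    # Crude proxy for an English prompt: penalize mostly non-ASCII output.
    non_ascii = sum(ord(ch) > 127 for ch in completion)
    return 1.0 if non_ascii / max(len(completion), 1) < 0.05 else 0.0

def total_reward(completion: str, reference: str) -> float:
    return (
        correctness_reward(completion, reference)
        + format_reward(completion)
        + language_consistency_reward(completion)
    )

print(total_reward("<think>2 + 2 = 4</think>4", "4"))  # 3.0
```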
GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:

1. For each input prompt, the model generates a group of different responses.
2. Each response receives a scalar reward based on factors like accuracy, formatting, and language consistency.
3. Rewards are normalized relative to the group's performance, essentially measuring how much better each response is compared to the others.
4. The model updates its policy slightly to favor responses with higher relative rewards. It only makes small adjustments, using techniques like clipping and a KL penalty, to ensure the policy doesn't stray too far from its original behavior.
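Put together, the update for one prompt looks roughly like the following sketch (my own simplification, not DeepSeek's code): rewards are standardized within the group to get advantages, and a PPO-style clipped objective with a KL penalty against a frozen reference model keeps the policy from drifting:

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards,
              clip_eps: float = 0.2, kl_coef: float = 0.04):
    # 1. Group-relative advantage: standardize rewards across the responses
    #    sampled for the same prompt (no learned critic needed).
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # 2. Clipped policy-gradient term, as in PPO.
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_term = torch.min(ratio * advantages, clipped * advantages)

    # 3. KL penalty keeps the policy close to the reference model.
    kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1.0

    return -(policy_term - kl_coef * kl).mean()

# Four sampled responses for one prompt; log-probabilities are per response
# (the real objective works per token, which is omitted here for brevity).
rewards = torch.tensor([2.0, 0.0, 1.0, 3.0])
logp_old = torch.tensor([-40.0, -55.0, -48.0, -42.0])
logp_new = logp_old + 0.1 * torch.randn(4)
logp_ref = logp_old.clone()
print(grpo_loss(logp_new, logp_old, logp_ref, rewards))
```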
A neat aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance, awarding a reward when the model correctly uses the `<think>` syntax, to guide the training.

While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).

For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource; a rough sketch of what that looks like follows below.

Finally, Yannic Kilcher has a great video explaining GRPO by going through the DeepSeekMath paper.
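For reference, a GRPO fine-tune with TRL looks roughly like the sketch below. The argument names follow the TRL documentation at the time of writing and may shift between versions; the toy length-based reward and the small Qwen model are purely illustrative, and a real setup would plug in rewards like the ones sketched earlier:

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

def reward_len(completions, **kwargs):
    # Toy reward that prefers short completions.
    return [-float(len(c)) for c in completions]

train_dataset = Dataset.from_dict(
    {"prompt": ["What is 17 * 23?", "Name a prime number greater than 10."]}
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",   # small model purely for illustration
    reward_funcs=reward_len,
    args=GRPOConfig(output_dir="grpo-sketch"),
    train_dataset=train_dataset,
)
trainer.train()
```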
## Is RL on LLMs the path to AGI?

As a final note on describing DeepSeek-R1 and the methodologies they've presented in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video.

> These findings indicate that RL enhances the model's overall performance by rendering the output distribution more robust; in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities.

In other words, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the diversity of correct answers) is largely already present in the pretrained model.

This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses than about endowing the model with entirely new capabilities.

Consequently, while RL techniques such as PPO and GRPO can produce substantial performance gains, there appears to be an inherent ceiling determined by the underlying model's pretrained knowledge.
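To make "capability as measured by sampling many answers" concrete: this kind of comparison is usually made with metrics like maj@K or pass@K, and pass@K is typically computed with the standard unbiased estimator below (my own illustration with made-up numbers). RL mainly pushes pass@1 up, while pass@K for large K is largely set by the base model:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k estimator given n samples per problem, c of them correct.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# A hypothetical problem the model solves in 30 of 100 samples:
print(round(pass_at_k(n=100, c=30, k=1), 3))   # 0.3   -> what RL mainly improves
print(round(pass_at_k(n=100, c=30, k=16), 3))  # ~1.0  -> mostly set by the base model
```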
It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!

## Running DeepSeek-R1

I've used DeepSeek-R1 via the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.

Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.

I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments.

The main goal was to see how the model would perform when deployed on a single H100 GPU, not to extensively test the model's capabilities.
### 671B via llama.cpp

DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized model by Unsloth, with a 4-bit quantized KV cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp.

29 layers seemed to be the sweet spot given this configuration.
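For anyone who wants to reproduce a similar partial-offload setup from Python, here is a rough sketch using the llama-cpp-python bindings (the original run used llama.cpp directly; the GGUF file name is a placeholder for the first shard of the Unsloth quant, and KV-cache quantization is configured separately):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf",  # placeholder shard name
    n_gpu_layers=29,  # the partial offload discussed above; 0 = CPU only
    n_ctx=4096,       # context window; larger values cost more memory
)

output = llm("Explain GRPO in one paragraph.", max_tokens=512)
print(output["choices"][0]["text"])
```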
Performance:

A r/localllama user reported that they were able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU, on their local gaming setup.

Digital Spaceport wrote a full guide on how to run DeepSeek R1 671B fully locally on a $2000 EPYC server, on which you can get about 3.5 to 4.25 tokens per second.

As you can see, the tokens/s isn't quite bearable for any serious work, but it's fun to run these large models on accessible hardware.

What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than that of other models, but their usefulness is also usually higher.

We need to both maximize usefulness and minimize time-to-usefulness.

### 70B via Ollama

70.6B params, 4-bit KM quantized DeepSeek-R1, running via Ollama.

GPU utilization shoots up here, as expected when compared to the mostly CPU-powered run of 671B that I showcased above.
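If you want to query such an Ollama-served model programmatically, a minimal sketch against Ollama's local HTTP API looks like this (assuming the model has been pulled as `deepseek-r1:70b` and the Ollama server is running on its default port):

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",  # Ollama's default local endpoint
    json={
        "model": "deepseek-r1:70b",
        "prompt": "What is 17 * 23? Think step by step.",
        "stream": False,  # return a single JSON object instead of a stream
    },
    timeout=600,
)
print(resp.json()["response"])
```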
## Resources

- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
- [2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
- DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube).
- DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs.
- The Illustrated DeepSeek-R1 - by Jay Alammar.
- Explainer: What's R1 & Everything Else? - Tim Kellogg.
- DeepSeek R1 Explained to your grandma - YouTube

## DeepSeek

- Try R1 at chat.deepseek.com.
- GitHub - deepseek-ai/DeepSeek-R1.
- deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images.
- DeepSeek-R1: Incentivizing Reasoning Capability in Large Language Models via Reinforcement Learning (January 2025): This paper introduces DeepSeek-R1, an open-source reasoning model that rivals the performance of OpenAI's o1. It provides a detailed methodology for training such models using large-scale reinforcement learning techniques.
- DeepSeek-V3 Technical Report (December 2024): This report discusses the implementation of an FP8 mixed-precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage.
- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024): This paper delves into scaling laws and presents findings that facilitate the scaling of large-scale models in open-source configurations. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective.
- DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024): This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task to enhance code generation and infilling.
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024): This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
- DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024): This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.
## Interesting events

- Hong Kong University reproduces R1 results (Jan 25, '25).
- Hugging Face announces huggingface/open-r1: a fully open reproduction of DeepSeek-R1, completely open source (Jan 25, '25).
- An OpenAI researcher confirms the DeepSeek team independently discovered and used some core ideas the OpenAI team used on the way to o1.

Liked this post? Join the newsletter.