# Understanding DeepSeek-R1

DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that's been making waves in the AI community. Not only does it match, or even surpass, OpenAI's o1 model on many benchmarks, but it also comes with fully MIT-licensed weights. This makes it the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible manner.

What makes DeepSeek-R1 especially exciting is its openness. Unlike the less-open approaches from some industry leaders, DeepSeek has published a detailed training methodology in their paper. The model is also remarkably cost-effective, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens at $2.19 per million (vs o1's $60).

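To put that in perspective: a workload of one million input and one million output tokens comes to roughly $0.55 + $2.19 ≈ $2.74 on R1 (taking the upper input price), versus $15 + $60 = $75 on o1, about 27x cheaper.
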
Until ~GPT-4, the common wisdom was that better models required more data and compute. While that's still true, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.

## The Essentials

The DeepSeek-R1 paper presented multiple models, but chief among them are R1 and R1-Zero. Following these are a series of distilled models that, while interesting, I won't discuss here.

DeepSeek-R1 uses two major ideas:

1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL.
2. Group Relative Policy Optimization (GRPO), a reinforcement learning method that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.

R1 and R1-Zero are both reasoning models. This essentially means they do Chain-of-Thought before answering. For the R1 series of models, this takes the form of reasoning inside a `<think>` tag, followed by a final answer summary.

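For illustration, here is a minimal sketch of splitting such a completion into its reasoning and its final answer. The parsing helper is my own, not something DeepSeek ships; only the `<think>` delimiter comes from the model's output format.

```python
import re

def split_reasoning(completion: str) -> tuple[str, str]:
    """Split an R1-style completion into (chain-of-thought, final answer)."""
    match = re.search(r"<think>(.*?)</think>", completion, flags=re.DOTALL)
    if match is None:
        # No thinking block found: treat the whole completion as the answer.
        return "", completion.strip()
    reasoning = match.group(1).strip()
    answer = completion[match.end():].strip()
    return reasoning, answer

example = "<think>2 + 2 is 4, minus 1 that's 3.</think>The answer is 3."
cot, answer = split_reasoning(example)
print(cot)     # 2 + 2 is 4, minus 1 that's 3.
print(answer)  # The answer is 3.
```
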
## R1-Zero vs R1

R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base with no supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward. R1-Zero attains impressive accuracy but often produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and multiple RL passes, which improves both correctness and readability.

It is interesting that some languages may express certain concepts better, which leads the model to choose the most expressive language for the task.

## Training Pipeline

The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It showcases how they developed such strong reasoning models, and what you can expect from each stage. This includes the problems that the resulting models from each stage have, and how they solved them in the next stage.

It's interesting that their training pipeline differs from the usual:

- The usual training approach: pretraining on a large dataset (training to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF
- R1-Zero: Pretrained → RL
- R1: Pretrained → multistage training pipeline with multiple SFT and RL stages

In more detail, the R1 pipeline consists of the following stages:

1. Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to ensure the RL process has a decent starting point. This gives a good model to start RL from.
2. First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing the chain-of-thought into thinking tags). When they were near convergence in the RL process, they moved to the next step. The result of this step is a strong reasoning model, but with weak general capabilities, e.g., poor formatting and language mixing.
3. Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples (the sampling loop is sketched after this list).
4. Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general capabilities.
5. Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1.

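The rejection-sampling step is simple enough to sketch. The `generate` and `is_correct` functions below are toy stand-ins of my own (in the real pipeline they would be the RL checkpoint and a rule-based/verifier check); only the keep-if-correct loop is the point.

```python
import random

# Toy stand-ins, purely for illustration: `generate` pretends to be the RL
# checkpoint and `is_correct` pretends to be an automatic correctness check.
def generate(prompt: str) -> str:
    return f"<think>guessing...</think>{random.randint(3, 5)}"

def is_correct(completion: str, reference: str) -> bool:
    return completion.endswith(reference)

def rejection_sample(prompts_with_refs, k: int = 4):
    """Sample k completions per prompt and keep only the ones that pass the
    check; the accepted samples become supervised fine-tuning data."""
    kept = []
    for prompt, reference in prompts_with_refs:
        for _ in range(k):
            completion = generate(prompt)
            if is_correct(completion, reference):
                kept.append({"prompt": prompt, "completion": completion})
    return kept

sft_data = rejection_sample([("What is 2 + 2?", "4")])
print(f"{len(sft_data)} of 4 samples accepted")
```
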
They also did model distillation for several Qwen and Llama models on the reasoning traces to get distilled-R1 models.

Model distillation is a method where you use a teacher model to improve a student model by generating training data for the student. The teacher is typically a larger model than the student.

## Group Relative Policy Optimization (GRPO)

The basic idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful responses. They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.

In this paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO. Instead of adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.

What makes their approach particularly interesting is its reliance on straightforward, rule-based reward functions. Instead of depending on costly external models or human-graded examples as in traditional RLHF, the RL used for R1 uses simple criteria: it might give a higher reward if the answer is correct, if it follows the expected thinking/answer format, and if the language of the answer matches that of the prompt. Not relying on a reward model also means you don't have to spend time and effort training it, and it doesn't take memory and compute away from your main model.

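To make "rule-based reward" concrete, here is a minimal sketch of what such a check could look like. The specific rules and weights are my own assumptions for illustration; the paper describes its reward functions only at a high level.

```python
import re

def rule_based_reward(prompt: str, completion: str, reference_answer: str) -> float:
    """Toy rule-based reward: correctness + format + language consistency."""
    reward = 0.0

    # Format: the reasoning must sit in a single leading <think>...</think> block.
    if re.fullmatch(r"<think>.*?</think>.*", completion.strip(), flags=re.DOTALL):
        reward += 0.5

    # Correctness: the text after the think block must match the reference answer.
    answer = re.sub(r".*?</think>", "", completion, count=1, flags=re.DOTALL).strip()
    if answer == reference_answer:
        reward += 1.0

    # Language consistency (crude proxy): penalize non-ASCII answers to ASCII prompts.
    if prompt.isascii() and not answer.isascii():
        reward -= 0.5

    return reward

print(rule_based_reward("What is 2+2?", "<think>2 plus 2 is 4.</think>4", "4"))  # 1.5
```
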
GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works (a minimal sketch follows the list):

1. For each input prompt, the model generates multiple responses.
2. Each response receives a scalar reward based on factors like accuracy, formatting, and language consistency.
3. Rewards are adjusted relative to the group's performance, essentially measuring how much better each response is compared to the others.
4. The model updates its policy slightly to favor responses with higher relative rewards. It only makes small adjustments, using techniques like clipping and a KL penalty, to ensure the policy doesn't drift too far from its original behavior.

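The group-relative part (steps 2-3) and the clipped update (step 4) are compact enough to sketch. This is a simplified, sequence-level version with arbitrarily chosen hyperparameter values, not the exact per-token objective from the paper:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """rewards: shape (group_size,), one scalar reward per sampled response.
    Each response's advantage is its reward normalized against the group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_objective(logp_new, logp_old, logp_ref, advantages, clip_eps=0.2, kl_coef=0.04):
    """Simplified sequence-level GRPO objective (to be maximized).
    All inputs are (group_size,) tensors; logp_* are summed log-probs of each
    response under the current, sampling-time, and frozen reference policies."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_term = torch.minimum(ratio * advantages, clipped * advantages)
    # Unbiased KL estimator used in the GRPO paper: exp(ref - new) - (ref - new) - 1.
    kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1.0
    return (policy_term - kl_coef * kl).mean()

rewards = torch.tensor([1.5, 0.5, 0.0, 1.0])  # e.g. from rule-based checks like the one above
print(group_relative_advantages(rewards))      # positive above the group mean, negative below
```
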
A cool aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance, awarding a reward when the model correctly uses the thinking-tag syntax, to guide the training.

While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).

For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another excellent resource. Finally, Yannic Kilcher has a great video explaining GRPO by going through the DeepSeekMath paper.

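If you want to try this without writing the RL loop yourself, TRL's `GRPOTrainer` accepts plain Python reward functions. A minimal sketch along the lines of TRL's own quick-start; the model, dataset, and toy reward here are placeholders, and the API may have changed since writing, so check the current TRL docs:

```python
# pip install trl datasets
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")  # any dataset with a "prompt" column

def reward_len(completions, **kwargs):
    # Toy reward: prefer completions close to 20 characters long.
    return [-abs(20 - len(completion)) for completion in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",   # a small model, just for experimentation
    reward_funcs=reward_len,
    args=GRPOConfig(output_dir="grpo-test"),
    train_dataset=dataset,
)
trainer.train()
```
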
## Is RL on LLMs the path to AGI?

As a final note on explaining DeepSeek-R1 and the methods they have presented in their paper, I want to highlight a passage from the paper, based on a point Yannic Kilcher made in his video.

"These findings indicate that RL boosts the model's overall performance by rendering the output distribution more robust; in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than to an enhancement of fundamental capabilities."

In other words, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the diversity of correct answers) is largely already present in the pretrained model.

This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses rather than endowing the model with entirely new capabilities. Consequently, while RL techniques such as PPO and GRPO can produce considerable performance gains, there seems to be an inherent ceiling determined by the underlying model's pretrained knowledge.

It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!

## Running DeepSeek-R1

I have used DeepSeek-R1 via the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.

Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.

I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments. The main goal was to see how the model would perform when deployed on a single H100 GPU, not to extensively test the model's capabilities.

### 671B via Llama.cpp

DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized model by Unsloth, with a 4-bit quantized KV-cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp:

29 layers seemed to be the sweet spot given this configuration.

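For reference, here is a minimal sketch of a comparable setup using the llama-cpp-python bindings rather than the llama.cpp CLI used for the run above. The GGUF path is a placeholder (the actual Unsloth download is split into several files), and you would need the bindings built with CUDA support for the GPU offloading to take effect:

```python
# pip install llama-cpp-python   (built with CUDA support for GPU offloading)
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-R1-UD-IQ1_S.gguf",  # placeholder path to the Unsloth 1.58-bit GGUF
    n_gpu_layers=29,                          # partial offloading: 29 layers on the GPU
    n_ctx=8192,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "How many R's are in 'strawberry'?"}],
    max_tokens=2048,
)
print(out["choices"][0]["message"]["content"])
```
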
Performance:

A r/localllama user reported that they were able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU, on their local gaming setup.

Digital Spaceport wrote a full guide on how to run DeepSeek R1 671B fully locally on a $2000 EPYC server, on which you can get ~4.25 to 3.5 tokens per second.

As you can see, the tokens/s isn't quite bearable for any serious work, but it's fun to run these big models on accessible hardware.

What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than for other models, but their usefulness is also often higher. We need to both maximize usefulness and minimize time-to-usefulness.

### 70B via Ollama

70.6b params, 4-bit KM quantized DeepSeek-R1 running via Ollama:

GPU utilization shoots up here, as expected when compared to the mostly CPU-powered run of the 671B that I showcased above.

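For completeness, a sketch of driving the same run from Python via the ollama client library, assuming the `deepseek-r1:70b` tag (which is what the 4-bit 70B distill is published under on Ollama):

```python
# pip install ollama
# Requires a running Ollama server with the model pulled: `ollama pull deepseek-r1:70b`
import ollama

response = ollama.chat(
    model="deepseek-r1:70b",
    messages=[{"role": "user", "content": "How many R's are in 'strawberry'?"}],
)
print(response["message"]["content"])  # includes the <think>...</think> block
```
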
## Resources

- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
- [2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
- DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube)
- DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs
- The Illustrated DeepSeek-R1 - by Jay Alammar
- Explainer: What's R1 & Everything Else? - Tim Kellogg
- DeepSeek R1 Explained to your grandmother - YouTube

## DeepSeek

- Try R1 at chat.deepseek.com.
- GitHub - deepseek-ai/DeepSeek-R1.
- deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images.
- DeepSeek-R1: Incentivizing Reasoning Capability in Large Language Models via Reinforcement Learning (January 2025): This paper introduces DeepSeek-R1, an open-source reasoning model that rivals the performance of OpenAI's o1. It provides a detailed methodology for training such models using large-scale reinforcement learning techniques.
- DeepSeek-V3 Technical Report (December 2024): This report discusses the implementation of an FP8 mixed-precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage.
- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024): This paper delves into scaling laws and presents findings that facilitate the scaling of large-scale models in open-source configurations. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective.
- DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024): This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task to enhance code generation and infilling.
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024): This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
- DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024): This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.

## Interesting events

- Hong Kong University replicates R1 results (Jan 25, '25).
- Huggingface announces huggingface/open-r1: a fully open reproduction of DeepSeek-R1, completely open source (Jan 25, '25).
- OpenAI researcher confirms the DeepSeek team independently discovered and used some of the core ideas the OpenAI team used on the way to o1.

Liked this post? Join the newsletter.