DeepSeek-V3 New Paper is coming! Unveiling the Secrets of Low-Cost Large Model Training through Hardware-Aware Co-design | Synced
A newly released 14-page technical paper from the team behind DeepSeek-V3, with DeepSeek CEO Wenfeng Liang as a co-author, sheds light on the “Scaling Challenges and Reflections on Hardware for AI Architectures.” This follow-up to their initial technical report delves into the intricate relationship between large language model (LLM) development, training, and the underlying hardware infrastructure. The paper moves beyond the architectural specifics of DeepSeek-V3 to explore how hardware-aware model co-design can effectively address the limitations of current hardware, ultimately enabling cost-efficient large-scale training and inference.
The rapid scaling of LLMs has exposed critical bottlenecks in current hardware architectures, particularly concerning memory capacity, computational efficiency, and interconnect bandwidth. DeepSeek-V3, trained on a cluster of 2048 NVIDIA H800 GPUs, serves as a compelling case study demonstrating how a synergistic approach between model design and hardware considerations can overcome these limitations. This research focuses on the interplay between hardware architecture and model design in achieving economical large-scale training and inference, aiming to provide actionable insights for efficiently scaling LLMs without compromising performance or accessibility.
DeepSeek-V3 incorporates several key architectural innovations, as illustrated in Figure 1 of the paper, including the DeepSeekMoE architecture and Multi-head Latent Attention (MLA). These designs directly tackle the core challenges of scaling LLMs: memory efficiency, cost-effectiveness, and inference speed.
LLMs exhibit exponential growth in memory demands, outpacing the slower growth of high-speed memory like HBM. While multi-node parallelism offers a solution, optimizing memory usage at the source remains crucial. DeepSeek addresses this bottleneck with Multi-head Latent Attention (MLA), which employs projection matrices to compress the key-value (KV) representations of all attention heads into a smaller latent vector, trained jointly with the model. During inference, only this compressed latent vector needs to be cached, significantly reducing memory consumption compared to storing full KV caches for each head.
Beyond MLA, DeepSeek highlights other valuable techniques for KV cache size reduction, providing inspiration for future advancements in memory-efficient attention mechanisms:
Table 1 in the paper compares the per-token KV cache memory footprint of DeepSeek-V3, Qwen-2.5 72B, and LLaMA-3.1 405B. DeepSeek-V3 achieves a remarkable reduction, requiring only 70 KB per token, significantly lower than LLaMA-3.1 405B’s 516 KB and Qwen-2.5 72B’s 327 KB.
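The figures above can be reproduced with simple arithmetic. The sketch below is illustrative only: the layer counts, head counts, and latent dimensions are assumptions taken from the publicly released model configurations, and a BF16 cache (2 bytes per element) is assumed throughout.

```python
# Per-token KV cache size in bytes, assuming a BF16 cache (2 bytes per element).
# All model hyperparameters below are assumptions from public model configs.
BYTES_BF16 = 2

def gqa_kv_bytes(num_layers, num_kv_heads, head_dim):
    """Grouped-query attention caches one K and one V vector per KV head per layer."""
    return 2 * num_kv_heads * head_dim * num_layers * BYTES_BF16

def mla_kv_bytes(num_layers, latent_dim, rope_dim):
    """MLA caches one compressed latent plus a small decoupled RoPE key per layer."""
    return (latent_dim + rope_dim) * num_layers * BYTES_BF16

sizes = {
    "DeepSeek-V3 (MLA)": mla_kv_bytes(num_layers=61, latent_dim=512, rope_dim=64),
    "Qwen-2.5 72B (GQA)": gqa_kv_bytes(num_layers=80, num_kv_heads=8, head_dim=128),
    "LLaMA-3.1 405B (GQA)": gqa_kv_bytes(num_layers=126, num_kv_heads=8, head_dim=128),
}
for name, size in sizes.items():
    print(f"{name}: {size / 1000:.3f} KB per token")
```

Under these assumptions the three results come out to 70.272 KB, 327.680 KB, and 516.096 KB, matching the rounded values quoted from Table 1.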
For sparse computation, DeepSeek developed DeepSeekMoE, an advanced Mixture-of-Experts (MoE) architecture (Figure 1, bottom right). MoE models offer two key advantages in terms of cost-effectiveness:
DeepSeek prioritizes both system-level maximum throughput and single-request latency for inference speed. To maximize throughput, the model employs a dual micro-batch overlapping architecture from the outset, intentionally overlapping communication latency with computation.
Furthermore, DeepSeek decouples the computation of MLA and MoE into distinct stages. While one micro-batch performs part of the MLA or MoE computation, the other concurrently executes the corresponding dispatch communication. Conversely, during the second micro-batch’s computation phase, the first micro-batch undertakes the combine communication step. This pipelined approach overlaps all-to-all communication with continuous computation, keeping the GPUs fully utilized. In production, DeepSeek uses a prefill-decode disaggregated architecture, assigning large-batch prefill and latency-sensitive decode requests to different-sized expert-parallel groups, maximizing system throughput under real-world serving conditions.
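A toy timeline can make the overlap pattern easier to picture. This is a minimal sketch, not DeepSeek's scheduler: the four stage names, the equal stage durations, and the one-stage offset between the two micro-batches are all illustrative assumptions.

```python
# Toy timeline for dual micro-batch overlap. Stage names and equal stage
# durations are illustrative assumptions, not measurements from the paper.
STAGES = ["attn_compute", "dispatch_comm", "moe_compute", "combine_comm"]

def timeline(num_layers):
    """Return (slot, stage_of_A, stage_of_B) rows; micro-batch B lags A by one stage."""
    total = num_layers * len(STAGES)
    rows = []
    for slot in range(total + 1):
        a = STAGES[slot % len(STAGES)] if slot < total else "idle"
        b = STAGES[(slot - 1) % len(STAGES)] if slot >= 1 else "idle"
        rows.append((slot, a, b))
    return rows

rows = timeline(num_layers=2)
for slot, a, b in rows:
    print(f"slot {slot}: A={a:<14} B={b}")
```

Apart from the first and last slots, every slot pairs a compute stage on one micro-batch with a communication stage on the other, which is the property the overlap relies on.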
The paper also touches upon the importance of test-time scaling for reasoning models and highlights the critical role of high token output speed in reinforcement learning workflows and for reducing user-perceived latency in long inference sequences. Optimizing inference speed through hardware-software co-innovation is therefore paramount for the efficiency of reasoning models.
While quantization techniques like GPTQ and AWQ have significantly reduced memory requirements, they apply primarily to inference. DeepSeek instead adopted FP8 mixed-precision training for a large-scale MoE model. Although NVIDIA’s Transformer Engine already supported FP8, the paper notes that no open-source large model was known to use FP8 for training before DeepSeek-V3. This achievement, resulting from close collaboration between infrastructure and algorithm teams along with extensive experimentation, significantly reduces computational costs while maintaining model quality, making large-scale training more feasible. Figure 1 illustrates the FP8 precision used in the forward and backward passes during training.
DeepSeek also employs low-precision compression for network communication within the DeepSeek-V3 architecture. During expert parallelism (EP), tokens are dispatched using fine-grained FP8 quantization, reducing communication volume by 50% compared to BF16 and thereby significantly shortening communication time.
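The 50% figure follows from FP8 using one byte per element where BF16 uses two. The sketch below works through the per-token byte count. The hidden size, the 128-element quantization tile, and the FP32 scaling factor per tile are assumptions chosen for illustration; the article does not state them.

```python
# Bytes sent per token during EP dispatch: BF16 vs. fine-grained FP8.
# HIDDEN, TILE, and SCALE_BYTES are assumptions for illustration only.
HIDDEN = 7168      # assumed hidden size
TILE = 128         # assumed quantization tile width (one scale per tile)
SCALE_BYTES = 4    # assumed FP32 scaling factor per tile

bf16_bytes = HIDDEN * 2                       # 2 bytes per element
fp8_payload = HIDDEN * 1                      # 1 byte per element
fp8_scales = (HIDDEN // TILE) * SCALE_BYTES   # per-tile scaling factors
fp8_bytes = fp8_payload + fp8_scales

print("BF16 bytes per token:", bf16_bytes)
print("FP8 bytes per token (payload + scales):", fp8_bytes)
print("ratio:", round(fp8_bytes / bf16_bytes, 3))
```

Under these assumptions the payload alone is exactly half the BF16 volume, and the per-tile scales add only a few percent on top.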
Beyond traditional floating-point formats, DeepSeek experimented with a novel data type called LogFMT-nBit (Logarithmic Floating-Point Formats).
DeepSeek currently uses the NVIDIA H800 SXM GPU (Figure 2), which is based on the same Hopper architecture as the H100 but has reduced FP64 compute performance and NVLink bandwidth (400 GB/s, down from 900 GB/s in the H100) due to regulatory requirements. This significant reduction in intra-node scale-up bandwidth poses challenges for high-performance workloads. To compensate, each node is equipped with eight 400G InfiniBand (IB) CX7 network interface cards (NICs) to enhance inter-node scale-out capabilities.
To navigate the limitations of the H800 architecture, the DeepSeek-V3 model incorporates hardware-aware design considerations for parallelization, including: avoiding Tensor Parallelism (TP), enhancing Pipeline Parallelism (PP), and accelerating Expert Parallelism (EP). Specific details of these strategies are available in the original paper.
A key aspect of model co-design is “node-aware routing” for the TopK expert selection strategy in the MoE architecture. Given the approximately 4:1 bandwidth difference between intra-node (NVLink, ~160 GB/s effective) and inter-node (IB, ~40 GB/s effective per NIC) communication, DeepSeek designed the routing to leverage the higher intra-node bandwidth. The 256 routed experts (4 per GPU in an 8-node, 64-GPU setup) are grouped into 8 groups of 32 experts, each group residing on a single node, and the algorithm ensures that each token is routed to at most 4 nodes. This mitigates the IB communication bottleneck and improves effective communication bandwidth during training. A token destined for several experts on the same node can be sent over IB once and then forwarded via NVLink, reducing redundant IB traffic.
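The selection rule can be sketched in a few lines of plain Python. This is a simplified illustration, not DeepSeek's implementation: the expert count, node count, and 4-node limit come from the text above, while `TOP_K = 8` and the node-scoring rule (sum of each node's best two affinities) are assumptions, and `node_limited_topk` is a hypothetical helper name.

```python
import random

NUM_EXPERTS = 256   # routed experts, from the text
NUM_NODES = 8       # one expert group per node, from the text
MAX_NODES = 4       # each token reaches at most 4 nodes, from the text
TOP_K = 8           # assumed number of routed experts chosen per token
PER_NODE = NUM_EXPERTS // NUM_NODES  # 32 experts per node

def node_limited_topk(scores):
    """Pick TOP_K experts for one token while touching at most MAX_NODES nodes."""
    # Assumed rule: score each node by the sum of its best TOP_K // MAX_NODES affinities.
    per_node_best = TOP_K // MAX_NODES
    node_scores = []
    for n in range(NUM_NODES):
        group = scores[n * PER_NODE:(n + 1) * PER_NODE]
        node_scores.append(sum(sorted(group, reverse=True)[:per_node_best]))
    # Keep only the MAX_NODES highest-scoring nodes.
    kept = sorted(range(NUM_NODES), key=lambda n: node_scores[n], reverse=True)[:MAX_NODES]
    # Choose the TOP_K best experts among the kept nodes.
    candidates = [e for n in kept for e in range(n * PER_NODE, (n + 1) * PER_NODE)]
    return sorted(candidates, key=lambda e: scores[e], reverse=True)[:TOP_K]

random.seed(0)
token_scores = [random.random() for _ in range(NUM_EXPERTS)]  # stand-in affinities
chosen = node_limited_topk(token_scores)
nodes_used = {e // PER_NODE for e in chosen}
print("experts:", chosen)
print("nodes touched:", sorted(nodes_used))
```

Because candidates are drawn only from the kept nodes, the token's IB fan-out is bounded by `MAX_NODES` regardless of how the raw affinities are distributed.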
While node-aware routing reduces bandwidth demands, the bandwidth disparity between NVLink and IB complicates the implementation of communication-intensive kernels. Currently, GPU Streaming Multiprocessors (SMs) handle both network message processing and data forwarding via NVLink, consuming significant compute resources. DeepSeek advocates for integrating intra-node (scale-up) and inter-node (scale-out) communication into a unified framework.
Integrating dedicated co-processors for network traffic management and seamless forwarding between NVLink and IB domains could reduce software complexity and maximize bandwidth utilization. Hardware support for dynamic traffic deduplication could further optimize strategies like DeepSeek-V3’s node-aware routing. DeepSeek also discusses emerging interconnect efforts such as the Ultra Ethernet Consortium (UEC) and Ultra Accelerator Link (UALink), noting the Unified Bus (UB) as a recent approach to converging scale-up and scale-out. The paper details methods for achieving this convergence at the programming framework level, including unified network adapters, dedicated communication co-processors, flexible forwarding and broadcast/reduce mechanisms, and hardware synchronization primitives.
Another limitation of current hardware is the lack of flexibility in dynamically allocating bandwidth between different traffic types on NVLink and PCIe. For instance, transferring KV cache data from CPU memory to GPUs during inference can saturate PCIe bandwidth, leading to contention with inter-GPU EP communication via IB, potentially degrading overall performance and causing latency spikes. DeepSeek suggests solutions including dynamic NVLink/PCIe traffic prioritization, I/O chiplet integration, and CPU-GPU interconnect within the scale-up domain.
For DeepSeek-V3 training, a Multi-Plane Fat-Tree (MPFT) scale-out network was deployed (Figure 3). Each node, equipped with 8 GPUs and 8 IB NICs, assigns each GPU-NIC pair to a different network plane. Additionally, each node has a 400 Gbps Ethernet RoCE NIC connected to a separate storage network plane for accessing the 3FS distributed file system. The scale-out network utilizes 64-port 400G IB switches, theoretically supporting up to 16,384 GPUs while retaining the cost and latency advantages of a two-layer network. However, due to policy and regulatory constraints, the actual deployment involved over two thousand GPUs.
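The 16,384-GPU ceiling follows from standard fat-tree arithmetic. The sketch below assumes a non-blocking two-layer topology in which each leaf switch splits its ports evenly between endpoints and uplinks; the function name is hypothetical.

```python
# Endpoint capacity of a two-layer fat-tree built from k-port switches,
# replicated across independent planes (one plane per GPU-NIC pair).
def two_layer_fat_tree_endpoints(ports):
    # Each leaf uses half its ports for endpoints and half for uplinks.
    # A spine with `ports` ports can reach at most `ports` leaves,
    # giving ports * (ports / 2) endpoints per plane.
    return ports * (ports // 2)

PORTS = 64   # switch radix, from the text
PLANES = 8   # one plane per GPU-NIC pair in a node, from the text

per_plane = two_layer_fat_tree_endpoints(PORTS)
total_gpus = per_plane * PLANES
print("endpoints per plane:", per_plane)
print("GPUs across all planes:", total_gpus)
```

Each plane supports 2,048 endpoints, and eight planes together give the 16,384 GPUs quoted above.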
The deployed MPFT network did not fully realize its intended architecture due to current limitations of the IB ConnectX-7. Ideally (Figure 4), each NIC would have multiple physical ports, each connected to a separate network plane but presented to the user as a single logical interface via port bonding. This would allow a single Queue Pair (QP) to seamlessly send and receive messages across all available ports, similar to packet spraying. Because packets from the same QP might traverse different network paths and arrive out of order, native out-of-order placement support within the NIC would be necessary to ensure message consistency and correct ordering semantics. InfiniBand ConnectX-8 natively supports four planes, and future NICs with full support for advanced multi-plane capabilities will significantly benefit the scalability of two-layer fat-tree networks for large AI clusters. Overall, multi-plane architectures offer significant advantages in fault isolation, robustness, load balancing, and scalability for large systems.
DeepSeek highlights several advantages of MPFT, including its composition as a subset of Multi-Rail Fat-Tree (MRFT) allowing seamless integration of existing NVIDIA and NCCL optimizations for MRFT networks, cost-effectiveness, traffic isolation, reduced latency, and robustness. Performance analysis comparing MPFT and MRFT (Figures 5 and 6, Table 4) revealed that the all-to-all performance of multi-plane networks is very similar to single-plane multi-rail networks, and the performance of MPFT and MRFT was nearly identical when training the V3 model on 2048 GPUs.
In DeepSeek’s model inference, large-scale EP heavily relies on all-to-all communication, which is sensitive to both bandwidth and latency. Even microsecond-level inherent network latency can significantly impact system performance.
DeepSeek analyzes the latency characteristics of IB and RoCE (Table 5), noting IB’s consistently lower latency, making it preferable for latency-sensitive workloads like distributed training and inference. While RoCE offers a potentially cost-effective alternative, its current latency and scalability limitations prevent it from fully meeting the demands of large-scale AI systems. DeepSeek proposes specific improvements for RoCE, including dedicated low-latency RoCE switches, optimized routing policies, and enhanced traffic isolation or congestion control mechanisms.
To further reduce network communication latency, DeepSeek utilizes InfiniBand GPUDirect Async (IBGDA). Traditionally, network communication involves CPU proxy threads, introducing additional overhead. IBGDA allows GPUs to directly populate Work Request (WR) content and write to RDMA doorbell MMIO addresses, eliminating the significant latency associated with GPU-CPU communication. By managing the entire control plane within the GPU, IBGDA avoids CPU bottlenecks, especially when sending numerous small packets, as the GPU’s parallel threads can distribute the workload. DeepSeek’s DeepEP and other works have demonstrated significant performance gains using IBGDA, leading DeepSeek to advocate for broad support of such features across various accelerator devices.
Building upon the identified hardware limitations and proposed solutions in specific application contexts, the paper broadens the discussion to offer forward-looking directions for future hardware architecture design:
The paper delves into each of these areas with specific insights and recommendations, highlighting the need for a holistic co-design approach between hardware and software to enable the continued advancement and accessibility of large-scale AI.
In conclusion, this technical report provides valuable insights into the challenges and solutions encountered during the development and training of DeepSeek-V3. By meticulously analyzing the interplay between model architecture and hardware limitations, DeepSeek offers a compelling vision for the future of AI infrastructure, emphasizing the critical role of hardware-aware co-design in achieving cost-efficient and scalable large language models. The paper’s detailed exploration of techniques like MLA, DeepSeekMoE, FP8 training, LogFMT, and the MPFT network, coupled with its forward-looking recommendations for hardware development, serves as a significant contribution to the field of large-scale AI research and engineering.
The paper “Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures” is on arXiv.
What are the key reflections or insights from the DeepSeek team’s experience with scaling DeepSeek-V3 that could inform future AI architecture and hardware development?
Jacana Life’s commitment to innovation and quality reminds me of the importance of integrating expertise across fields—just like how DeepSeek-V3’s team is bridging AI model design with hardware capabilities. Whether in wellness or technology, thoughtful co-design and addressing real-world limitations lead to more effective and sustainable solutions. It’s inspiring to see such a holistic approach driving progress forward!
https://jacana.life/
This is fascinating! It highlights a crucial aspect often overlooked: the hardware limitations themselves. It’s not just about bigger models; it’s about smarter code design to work with existing infrastructure. I’m curious to see if the hardware-aware approach they’re taking might lead to breakthroughs usable on less-powerful, even mobile, devices. This makes me think about fun lightweight games, like Eggy Car, and how ingenuity can trump raw power. Optimizing like this is important.
# DeepSeek V3: New Paper is Coming – Unveiling the Secrets of Low-Cost Large Model Training Through Hardware-Aware Co-Design
DeepSeek, a pioneering AI research lab, is set to release a groundbreaking paper that delves into the intricacies of low-cost large model training. The upcoming publication, titled “Hardware-Aware Co-Design for Efficient Large Model Training,” promises to revolutionize the way AI models are developed and deployed, making advanced AI more accessible and affordable.
The paper focuses on the concept of hardware-aware co-design, an innovative approach that optimizes both the hardware and software components of AI training processes. By carefully integrating hardware capabilities with algorithmic design, DeepSeek aims to significantly reduce the computational and financial costs associated with training large-scale models.
“Our research demonstrates that by aligning hardware and software design, we can achieve unprecedented efficiency in large model training,” said Dr. Alex Lee, lead author of the paper. “This approach not only lowers the barrier to entry for AI development but also paves the way for more sustainable and scalable AI solutions.”
1. **Hardware-Aware Algorithms**: The development of algorithms that are specifically tailored to leverage the unique capabilities of modern hardware, such as GPUs and TPUs.
2. **Efficient Resource Utilization**: Techniques for maximizing the use of available computational resources, reducing waste, and improving overall training speed.
3. **Cost-Effective Solutions**: Strategies for minimizing the financial costs of training large models, making AI more accessible to a broader range of organizations and researchers.
DeepSeek’s findings are expected to have far-reaching implications for the AI industry. By making large model training more efficient and cost-effective, the research could accelerate innovation in various fields, from natural language processing to computer vision and beyond.
The paper is scheduled to be presented at the upcoming International Conference on Machine Learning (ICML) 2025. The research community is eagerly anticipating the insights and methodologies that DeepSeek will share, as they hold the potential to reshape the landscape of AI development.
For more information and to stay updated on the release of the paper, visit the DeepSeek official website.
That’s fascinating! It’s encouraging to see such a deep dive into the hardware side of LLM development. Thinking about cost-efficient training, it reminds me of the early days of game development. People were always finding clever ways to optimize for limited hardware. I wonder if the team considered physics-based solutions as a baseline? Like optimizing for collision instead of full simulation. Or heck, even something silly but resource-light like an Eggy Car game engine could provide insights into efficient resource management.
Wow, DeepSeek-V3 sounds super interesting! Low-cost large model training? That’s a game changer! Gotta check out this hardware-aware co-design stuff. Maybe AI won’t be so resource-intensive after all!
I appreciate the insight and useful information in your article, thank you so much. We hope you will share more with us!
I get what you mean about DeepSeek, but honestly I’ve found that andelska cisla is better. It feels more intuitive, gives warmer, more human-like responses, and somehow just “gets” the vibe of what I’m asking. For me, it’s less about raw data and more about that personal touch, and andelska cisla nails it.
Really interesting read—DeepSeek’s new paper hits home on how crucial hardware-aware co-design is becoming for scaling LLMs efficiently. It’s refreshing to see a team openly reflecting on the real-world bottlenecks we face with current infrastructure. I had Sprunki Retake running in the background while skimming through it—surprisingly helped me focus on the technical depth!
Thanks for sharing these valuable insights on AI model training! It’s exciting to see the progress in making advanced AI more accessible. On a related note, I’d like to share AINanoBanana (https://ai-nano-banana.com/), a free online AI image generator and editor that helps users create stunning visuals with features like smart image enhancement and creative remixing. It’s designed for ease of use and high-quality output.
Thank you for the valuable insights in your article. We truly appreciate it and look forward to more informative content from you!
Great review! Synced Review always delivers insightful and well-researched content about the latest tech trends and AI developments. I really appreciate how clearly complex topics are explained. Looking forward to more in-depth analyses and updates!
This is a fascinating look at how hardware and AI models work together. The idea of co-design to lower training costs is really smart and could make big models more accessible.
Thanks for sharing this update! The focus on hardware-aware co-design to cut training costs is really interesting. It’s cool to see how they’re tackling scaling challenges beyond just model architecture.
It’s interesting how the paper emphasizes the limitations of current hardware architectures, especially concerning memory capacity. Given the focus on hardware-aware co-design, what specific hardware innovations do they foresee as being most impactful in the next generation?
It’s interesting how the paper emphasizes the limitations of current hardware as a key factor in LLM development. The point about DeepSeek-V3 using 2048 NVIDIA H800 GPUs really highlights the scale we’re talking about when trying to overcome those bottlenecks. I wonder how future hardware iterations will specifically address these issues.
The article discusses how hardware limitations impact LLM training and inference costs. It’s interesting to see DeepSeek-V3 using 2048 NVIDIA H800 GPUs; that kind of scale probably requires a lot of fine-tuning to get right. I wonder how much the hardware-aware co-design approach improved their overall experience compared to more standard methods.
Reading about how DeepSeek-V3 cuts memory footprint to just 70KB per token while using 2048 H800 GPUs is mind-blowing, almost makes me wish I had a Video to Text tool to transcribe these dense technical breakdowns while I grab my coffee. It is wild how they scaled up to 671B parameters without breaking the bank.
Thanks for sharing these detailed technical specs — really appreciate the transparency around model architecture and performance.
The ReST^EM choice over PPO/GRPO is interesting — DeepMind’s EM-style approach has been quietly outperforming online RL in stability-critical scenarios. The 72.5% adaptation rate on Llama-3.2-1B is a strong result for that model size.
One practical question this raises: if self-adapting models become mainstream, benchmarking becomes much harder because each deployed model diverges from its base. We’ve been tracking this while comparing DeepSeek, Qwen, Claude, and GPT families on consistent benchmarks, and SEAL-style updates would make snapshot comparisons almost meaningless over time.
Curious whether the authors considered version-locked evaluation protocols for this.
This article about DeepSeek-V3 really highlights the intricate balance between hardware and model design, which is crucial for advancing LLM training. The emphasis on memory efficiency and sparse computation through innovative architectures like DeepSeekMoE is particularly interesting, as these aspects significantly contribute to more sustainable and scalable AI models. The approach of co-designing hardware and software to overcome current technological barriers is something that resonates deeply with the challenges faced by many in the AI community.
On a related note, for developers seeking efficient ways to leverage various AI models, platforms like API Models provide a unified solution to access multiple AI APIs, including the gpt-image-2-OpenAi image generation model, offering cost-effective alternatives.
It’s truly fascinating to see how the DeepSeek-V3 paper explores the complex relationship between hardware architecture and large language model design. The detailed analysis of memory efficiency and cost-effectiveness, especially with innovations like Multi-head Latent Attention and DeepSeekMoE, highlights the critical role of hardware-aware co-design in overcoming scaling challenges. This approach not only optimizes performance but also makes LLMs more accessible, which is crucial as we continue to push the boundaries of AI.
Incidentally, when it comes to optimizing images by removing unwanted elements, I’ve found remove unwanted objects from photos to be quite effective in handling tasks like erasing objects or people seamlessly from photos. It’s a nice complement to the kind of efficiency and precision discussed in the DeepSeek paper.
Great deep dive into how hardware and AI models must evolve together. Super insightful.
I think this post will be especially helpful for beginners trying to understand the basics while still offering value for more experienced readers.
Excellent analysis of DeepSeek V3! The hardware-aware co-design approach for reducing training costs is truly innovative. It’s fascinating to see how Chinese AI labs are pushing the boundaries of efficient model training. For anyone working with AI video generation tools, VidGlory provides some interesting options for creating content around these topics. Great read!
Hardware-aware co-design is such an underexplored area in large model training. The fact that DeepSeek managed to reduce training costs so significantly by optimizing the interaction between model architecture and hardware is impressive. This kind of systems-level thinking is what the field needs more of rather than just scaling up blindly. Looking forward to reading the full paper when it drops.
