FYI, I ran the math through O1 (no code execution), Sonnet 3.5 (JS code execution) and Gemini 2.0 Pro (Python code execution) w/ the config JSON and Python to try to get a good sense of the architecture and some more exact stats. Hopefully, this is broadly right (but corrections welcomed):

  • 28.81B activated parameters per forward pass / 452.82B total parameters
  • Hybrid architecture: 3 dense layers + 58 MoE layers (256 routed experts + 1 shared expert, 8 routed per token)
  • Uses YaRN RoPE extension to reach a 160K (163,840) token context
  • FP16 weights: 905.65 GB, FP8 weights: 452.82 GB
  • FP16 KV cache: 286.55 GB, FP8 KV cache: 143.28 GB (arithmetic sketched below)
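
For the curious, here is roughly how those memory figures fall out. This is a minimal sketch that takes the 452.82B total-parameter estimate above as given and assumes a plain per-head KV cache (ignoring any MLA compression) for a single sequence at the full 163,840-token context:

```python
GB = 1e9  # decimal gigabytes

total_params = 452.82e9   # total-parameter estimate from above
n_layers     = 61
n_kv_heads   = 128
head_dim     = 56
context_len  = 163_840    # 160K tokens

for name, bytes_per_elem in [("FP16", 2), ("FP8", 1)]:
    weights_gb = total_params * bytes_per_elem / GB
    # K and V, per layer, per KV head, per head-dim element, per position
    kv_gb = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / GB
    print(f"{name}: weights ~{weights_gb:.1f} GB, KV cache ~{kv_gb:.1f} GB")
```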

With everything at FP8, it might just fit into a single H100 node (8x80 GB); otherwise you’d need two, or an H200 or MI300X node (quick sanity check below)…
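
A rough check on that node sizing, assuming 8 GPUs per node and the standard per-GPU HBM capacities (80 GB H100, 141 GB H200, 192 GB MI300X):

```python
# Rough fit check: FP8 weights + FP8 KV cache vs. per-node HBM (8 GPUs per node).
node_hbm_gb = {"8xH100": 8 * 80, "8xH200": 8 * 141, "8xMI300X": 8 * 192}

fp8_total_gb    = 452.82 + 143.28   # FP8 weights + FP8 KV cache from above
fp16_weights_gb = 905.65            # FP16 weights alone

for node, hbm in node_hbm_gb.items():
    print(f"{node}: {hbm} GB HBM -> "
          f"FP8 weights+KV fit: {fp8_total_gb < hbm}, "
          f"FP16 weights fit: {fp16_weights_gb < hbm}")
```

At FP8, roughly 596 GB of the 640 GB on an 8xH100 node is spoken for before activations, hence 'might just fit'.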

vs Llama 3

Here is a comparison to Llama 3:

Parameter        DeepSeek-V3    Llama3-70B    Llama3-405B
Hidden Size      7168           8192          16384
Num Layers       61             80            126
Attn Heads       128            64            128
KV Heads         128            8             8
GQA Ratio        1:1            8:1           16:1
Head Dim         56             128           128
Interm Size      18432          28672         53248
Context Len      163840         8192          131072
Vocab Size       129280         128256        128256

FFN Expansion Ratios:

  • DeepSeek-V3 Dense Layers: 2.57x
  • DeepSeek-V3 MoE Experts: 0.29x (but with 257 experts)
  • Llama3-70B: 3.50x
  • Llama3-405B: 3.25x

Effective FFN Dimensions per Token:

  • DeepSeek-V3 Dense Layers: 18432
  • DeepSeek-V3 MoE Layers: 16384 (2048 Ă— 8 experts)
  • Llama3-70B: 28672
  • Llama3-405B: 53248
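
The ratios and per-token FFN widths above fall straight out of the table values; a quick sketch (expansion = intermediate size / hidden size, and for MoE layers the effective width is the per-expert intermediate size times the experts activated per token):

```python
# (hidden_size, intermediate_size, experts activated per token)
ffn_configs = {
    "DeepSeek-V3 dense layer": (7168, 18432, 1),
    "DeepSeek-V3 MoE layer":   (7168, 2048,  8),   # per-expert intermediate size
    "Llama3-70B":              (8192, 28672, 1),
    "Llama3-405B":             (16384, 53248, 1),
}

for name, (hidden, interm, k) in ffn_configs.items():
    print(f"{name:24s} expansion {interm / hidden:.2f}x, "
          f"effective FFN dim/token {interm * k}")
```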

vs Snowflake Arctic

The dense+MoE hybrid is maybe best compared to Snowflake Arctic (128 experts). Snowflake runs with parallel routing (more like Switch Transformer?) while DeepSeek-V3 is sequential (GLaM?), but they arrive at similar effective intermediate sizes; in most other ways DeepSeek-V3 is bigger and badder, but that's to be expected. There's a toy sketch of the two layouts after the MoE table; first, the side-by-side numbers:

Parameter         DeepSeek-V3    Arctic
Hidden Size       7168           7168
Num Layers        61             35
Attention Heads   128            56
KV Heads          128            8
GQA Ratio         1:1            7:1
Head Dimension    56             128
Context Length    163840         4096
Vocab Size        129280         32000

MoE Architecture:

Parameter        DeepSeek-V3                Arctic
Architecture     3 dense + 58 MoE layers    Dense-MoE hybrid (parallel)
Num Experts      257                        128
Experts/Token    8                          2
Base Params      ~10B                       10B
Expert Size      ~1.7B                      3.66B
Total Params     ~452B                      ~480B
Active Params    ~29B                       ~17B
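
To make the parallel vs sequential distinction concrete, here is that toy sketch. It is an illustration only: tiny made-up dimensions, simplified softmax top-k gating, no attention or normalization, and DeepSeek-V3's shared expert is omitted, so it is not either model's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 16            # toy sizes, nothing like the real 7168/2048
n_experts, top_k = 4, 2

def make_ffn():
    return rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model))

def ffn(x, w_in, w_out):
    return np.maximum(x @ w_in, 0.0) @ w_out      # simple ReLU MLP

dense_ffn = make_ffn()
experts   = [make_ffn() for _ in range(n_experts)]
router    = rng.normal(size=(d_model, n_experts))

def moe(x):
    # pick the top-k experts for this token and mix their outputs with softmax gates
    logits = x @ router
    top = np.argsort(logits)[-top_k:]
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()
    return sum(g * ffn(x, *experts[i]) for g, i in zip(gates, top))

def arctic_style_block(x):
    # "parallel" hybrid: dense FFN and MoE outputs are summed inside the same layer
    return x + ffn(x, *dense_ffn) + moe(x)

def deepseek_style_stack(x, n_dense=1, n_moe=2):
    # "sequential" hybrid: the first layers are plain dense FFN blocks, the rest are MoE blocks
    for _ in range(n_dense):
        x = x + ffn(x, *dense_ffn)
    for _ in range(n_moe):
        x = x + moe(x)
    return x

x = rng.normal(size=d_model)
print("parallel block out:   ", arctic_style_block(x))
print("sequential stack out: ", deepseek_style_stack(x))
```

The point is just the layer layout: Arctic adds a dense FFN output to the MoE output inside each layer, while DeepSeek-V3 keeps a few fully dense layers up front and makes every later FFN an MoE.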

FFN Expansion Ratios (DeepSeek-V3):

  • Dense Layers: 2.57x
  • MoE Layers (per expert): 0.29x
  • MoE effective expansion: 2.29x

Effective FFN Dimensions per Token (DeepSeek-V3):

  • Dense Layers: 18432
  • MoE Layers: 16384 (2048 Ă— 8 experts)

FFN Expansion Ratios (Arctic):

  • Dense (Residual) Path: 1.00x
  • MoE Path (per expert): 0.68x
  • Combined effective expansion: 2.36x

Effective FFN Dimensions per Token (Arctic):

  • Dense Path: 7168
  • MoE Path: 9728 (4864 Ă— 2 experts)
  • Total: 16896
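
And the effective-expansion arithmetic above for the two MoE layer types, as a quick sketch (Arctic's per-expert intermediate size of 4864 is inferred from the 9728 = 4864 x 2 figure above):

```python
# (hidden_size, parallel dense-path intermediate, per-expert intermediate, experts/token)
moe_layers = {
    "DeepSeek-V3 MoE layer": (7168, 0,    2048, 8),   # no parallel dense path
    "Arctic MoE layer":      (7168, 7168, 4864, 2),   # dense residual path runs in parallel
}

for name, (hidden, dense_interm, expert_interm, k) in moe_layers.items():
    effective = dense_interm + expert_interm * k
    print(f"{name}: effective FFN dim {effective} -> "
          f"{effective / hidden:.2f}x expansion")
```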

References

Other resources: https://simonwillison.net/2024/Dec/25/deepseek-v3/

LiveBench Results: https://www.reddit.com/r/LocalLLaMA/comments/1hm4959/benchmark_results_deepseek_v3_on_livebench/