vLLM TP vs PP
PP=2 TP=8
$ PP=2 sbatch vllm-server.slurm
...
[2025-01-05 11:32:46] Starting vLLM server with tensor parallel size: 8
[2025-01-05 11:32:46] Full vLLM command:
[2025-01-05 11:32:46] vllm serve deepseek-ai/DeepSeek-V3 --tensor-parallel-size 8 --gpu-memory-utilization 0.95 --num-scheduler-steps 20 --host 0.0.0.0 --port 8000
[2025-01-05 11:32:46] vLLM logs will be written to: /fsx/ubuntu/meti/experiments/00-enja-translation-eval/logs/278-20250105_113126/vllm_server.log
[2025-01-05 11:32:46] Full vLLM command:
[2025-01-05 11:32:46] vllm serve deepseek-ai/DeepSeek-V3 --tensor-parallel-size 8 --pipeline-parallel-size 2 --gpu-memory-utilization 0.95 --num-scheduler-steps 20 --max-model-len 8192 --host 0.0.0.0 --port 8000 --trust-remote-code
[2025-01-05 11:32:46] vLLM server started with PID: 3427707
[2025-01-05 11:32:46] Waiting for vLLM server to become ready...
INFO 01-05 11:33:24 api_server.py:712] vLLM API server version 0.6.6.post2.dev5+g5ce4627a
INFO 01-05 11:33:24 api_server.py:713] args: Namespace(subparser='serve', model_tag='deepseek-ai/DeepSeek-V3', config='', host='0.0.0.0', port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='deepseek-ai/DeepSeek-V3', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=8192, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=2, tensor_parallel_size=8, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.95, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=20, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, dispatch_function=<function serve at 0x7f3cdb0c6520>)
...
INFO 01-05 12:04:53 distributed_gpu_executor.py:57] # GPU blocks: 2889, # CPU blocks: 528
INFO 01-05 12:04:53 distributed_gpu_executor.py:61] Maximum concurrency for 8192 tokens per request: 5.64x
- ~34 minutes from job start until the server finished loading the model and became ready
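The vllm-server.slurm script itself isn't pasted here; roughly, it has to bring up a 2-node Ray cluster and then run the vllm serve command from the log above. A minimal sketch of what it presumably does (sbatch directives, paths, and sleeps are assumptions):

#!/bin/bash
#SBATCH --job-name=vllm-server
#SBATCH --nodes=2
#SBATCH --gpus-per-node=8
#SBATCH --exclusive

# Parallelism picked up from the environment, e.g. `PP=2 sbatch vllm-server.slurm`
PP=${PP:-1}
TP=${TP:-8}

nodes=($(scontrol show hostnames "$SLURM_JOB_NODELIST"))
head_node=${nodes[0]}

# Ray head on node 0, worker on node 1, so vLLM can span both nodes
srun --nodes=1 --ntasks=1 -w "$head_node" ray start --head --port=6379 --block &
sleep 10
srun --nodes=1 --ntasks=1 -w "${nodes[1]}" ray start --address="$head_node:6379" --block &
sleep 10

# Same serve command as in the log above
srun --nodes=1 --ntasks=1 -w "$head_node" \
  vllm serve deepseek-ai/DeepSeek-V3 \
    --tensor-parallel-size $TP \
    --pipeline-parallel-size $PP \
    --gpu-memory-utilization 0.95 \
    --num-scheduler-steps 20 \
    --max-model-len 8192 \
    --host 0.0.0.0 --port 8000 \
    --trust-remote-code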
❯ python ~/vllm/benchmarks/benchmark_serving.py --backend openai-chat --host ip-10-1-21-143 --port 8000 --endpoint='/v1/chat/completions' --model "deepseek-ai/DeepSeek-V3" --dataset-name sharegpt --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1024 --max-concurrency 64 --seed 42
Namespace(backend='openai-chat', base_url=None, host='ip-10-1-21-143', port=8000, endpoint='/v1/chat/completions', dataset=None, dataset_name='sharegpt', dataset_path='./ShareGPT_V3_unfiltered_cleaned_split.json', max_concurrency=64, model='deepseek-ai/DeepSeek-V3', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=1024, logprobs=None, request_rate=inf, burstiness=1.0, seed=42, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto')
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 64
100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [14:09<00:00, 1.21it/s]
============ Serving Benchmark Result ============
Successful requests: 1024
Benchmark duration (s): 849.46
Total input tokens: 229783
Total generated tokens: 196323
Request throughput (req/s): 1.21
Output token throughput (tok/s): 231.11
Total Token throughput (tok/s): 501.62
---------------Time to First Token----------------
Mean TTFT (ms): 2639.01
Median TTFT (ms): 4076.09
P99 TTFT (ms): 5234.55
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 319.29
Median TPOT (ms): 236.82
P99 TPOT (ms): 1464.94
---------------Inter-token Latency----------------
Mean ITL (ms): 4421.57
Median ITL (ms): 4402.53
P99 ITL (ms): 5306.83
==================================================
❯ python ~/vllm/benchmarks/benchmark_serving.py --backend openai-chat --host ip-10-1-21-143 --port 8000 --endpoint='/v1/chat/completions' --model "deepseek-ai/DeepSeek-V3" --dataset-name sharegpt --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1024 --max-concurrency 128 --seed 42
Namespace(backend='openai-chat', base_url=None, host='ip-10-1-21-143', port=8000, endpoint='/v1/chat/completions', dataset=None, dataset_name='sharegpt', dataset_path='./ShareGPT_V3_unfiltered_cleaned_split.json', max_concurrency=128, model='deepseek-ai/DeepSeek-V3', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=1024, logprobs=None, request_rate=inf, burstiness=1.0, seed=42, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto')
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 128
100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [09:32<00:00, 1.79it/s]
============ Serving Benchmark Result ============
Successful requests: 1024
Benchmark duration (s): 572.44
Total input tokens: 229783
Total generated tokens: 195623
Request throughput (req/s): 1.79
Output token throughput (tok/s): 341.74
Total Token throughput (tok/s): 743.15
---------------Time to First Token----------------
Mean TTFT (ms): 4602.03
Median TTFT (ms): 4990.24
P99 TTFT (ms): 15123.49
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 435.80
Median TPOT (ms): 279.78
P99 TPOT (ms): 2854.01
---------------Inter-token Latency----------------
Mean ITL (ms): 5147.83
Median ITL (ms): 4947.29
P99 ITL (ms): 13139.99
==================================================
❯ python ~/vllm/benchmarks/benchmark_serving.py --backend openai-chat --host ip-10-1-21-143 --port 8000 --endpoint='/v1/chat/completions' --model "deepseek-ai/DeepSeek-V3" --dataset-name sharegpt --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1024 --max-concurrency 256 --seed 42
Namespace(backend='openai-chat', base_url=None, host='ip-10-1-21-143', port=8000, endpoint='/v1/chat/completions', dataset=None, dataset_name='sharegpt', dataset_path='./ShareGPT_V3_unfiltered_cleaned_split.json', max_concurrency=256, model='deepseek-ai/DeepSeek-V3', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=1024, logprobs=None, request_rate=inf, burstiness=1.0, seed=42, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto')
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 256
100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [08:51<00:00, 1.93it/s]
============ Serving Benchmark Result ============
Successful requests: 1024
Benchmark duration (s): 531.65
Total input tokens: 229783
Total generated tokens: 196498
Request throughput (req/s): 1.93
Output token throughput (tok/s): 369.60
Total Token throughput (tok/s): 801.81
---------------Time to First Token----------------
Mean TTFT (ms): 36067.72
Median TTFT (ms): 36935.92
P99 TTFT (ms): 78867.00
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 476.54
Median TPOT (ms): 291.63
P99 TPOT (ms): 3272.67
---------------Inter-token Latency----------------
Mean ITL (ms): 4960.69
Median ITL (ms): 4915.31
P99 ITL (ms): 6029.69
==================================================
❯ python ~/vllm/benchmarks/benchmark_serving.py --backend openai-chat --host ip-10-1-21-143 --port 8000 --endpoint='/v1/chat/completions' --model "deepseek-ai/DeepSeek-V3" --dataset-name sharegpt --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1024 --max-concurrency 512 --seed 42
Namespace(backend='openai-chat', base_url=None, host='ip-10-1-21-143', port=8000, endpoint='/v1/chat/completions', dataset=None, dataset_name='sharegpt', dataset_path='./ShareGPT_V3_unfiltered_cleaned_split.json', max_concurrency=512, model='deepseek-ai/DeepSeek-V3', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=1024, logprobs=None, request_rate=inf, burstiness=1.0, seed=42, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto')
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 512
100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [09:05<00:00, 1.88it/s]
============ Serving Benchmark Result ============
Successful requests: 1024
Benchmark duration (s): 545.97
Total input tokens: 229783
Total generated tokens: 196351
Request throughput (req/s): 1.88
Output token throughput (tok/s): 359.64
Total Token throughput (tok/s): 780.51
---------------Time to First Token----------------
Mean TTFT (ms): 101638.17
Median TTFT (ms): 136694.69
P99 TTFT (ms): 182805.42
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 488.89
Median TPOT (ms): 299.54
P99 TPOT (ms): 3528.80
---------------Inter-token Latency----------------
Mean ITL (ms): 5150.19
Median ITL (ms): 5100.70
P99 ITL (ms): 6421.38
==================================================
TP=16
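The TP=16 launch isn't captured in the logs above; presumably it's the same job with tensor parallelism spanning both nodes instead of pipeline stages, roughly:

# assumed launch, mirroring the PP=2 run
PP=1 TP=16 sbatch vllm-server.slurm
# which boils down to:
vllm serve deepseek-ai/DeepSeek-V3 --tensor-parallel-size 16 --gpu-memory-utilization 0.95 --num-scheduler-steps 20 --max-model-len 8192 --host 0.0.0.0 --port 8000 --trust-remote-code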
❯ for c in 64 128 256 512; do echo "Running with concurrency $c..." && python ~/vllm/benchmarks/benchmark_serving.py --backend openai-chat --host ip-10-1-21-143 --port 8000 --endpoint='/v1/chat/completions' --model "deepseek-ai/DeepSeek-V3" --dataset-name sharegpt --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1024 --max-concurrency $c --seed 42; done
Running with concurrency 64...
Namespace(backend='openai-chat', base_url=None, host='ip-10-1-21-143', port=8000, endpoint='/v1/chat/completions', dataset=None, dataset_name='sharegpt', dataset_path='./ShareGPT_V3_unfiltered_cleaned_split.json', max_concurrency=64, model='deepseek-ai/DeepSeek-V3', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=1024, logprobs=None, request_rate=inf, burstiness=1.0, seed=42, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto')
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 64
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [11:39<00:00, 1.46it/s]
============ Serving Benchmark Result ============
Successful requests: 1024
Benchmark duration (s): 699.36
Total input tokens: 229783
Total generated tokens: 196723
Request throughput (req/s): 1.46
Output token throughput (tok/s): 281.29
Total Token throughput (tok/s): 609.85
---------------Time to First Token----------------
Mean TTFT (ms): 2223.85
Median TTFT (ms): 3500.61
P99 TTFT (ms): 3836.83
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 250.66
Median TPOT (ms): 188.68
P99 TPOT (ms): 1163.99
---------------Inter-token Latency----------------
Mean ITL (ms): 3539.84
Median ITL (ms): 3373.03
P99 ITL (ms): 4265.99
==================================================
Running with concurrency 128...
Namespace(backend='openai-chat', base_url=None, host='ip-10-1-21-143', port=8000, endpoint='/v1/chat/completions', dataset=None, dataset_name='sharegpt', dataset_path='./ShareGPT_V3_unfiltered_cleaned_split.json', max_concurrency=128, model='deepseek-ai/DeepSeek-V3', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=1024, logprobs=None, request_rate=inf, burstiness=1.0, seed=42, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto')
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 128
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [06:46<00:00, 2.52it/s]
============ Serving Benchmark Result ============
Successful requests: 1024
Benchmark duration (s): 406.81
Total input tokens: 229783
Total generated tokens: 196577
Request throughput (req/s): 2.52
Output token throughput (tok/s): 483.22
Total Token throughput (tok/s): 1048.06
---------------Time to First Token----------------
Mean TTFT (ms): 2340.63
Median TTFT (ms): 3499.54
P99 TTFT (ms): 3817.32
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 257.49
Median TPOT (ms): 192.60
P99 TPOT (ms): 1206.03
---------------Inter-token Latency----------------
Mean ITL (ms): 3592.16
Median ITL (ms): 3349.91
P99 ITL (ms): 4366.81
==================================================
Running with concurrency 256...
Namespace(backend='openai-chat', base_url=None, host='ip-10-1-21-143', port=8000, endpoint='/v1/chat/completions', dataset=None, dataset_name='sharegpt', dataset_path='./ShareGPT_V3_unfiltered_cleaned_split.json', max_concurrency=256, model='deepseek-ai/DeepSeek-V3', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=1024, logprobs=None, request_rate=inf, burstiness=1.0, seed=42, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto')
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 256
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [05:32<00:00, 3.08it/s]
============ Serving Benchmark Result ============
Successful requests: 1024
Benchmark duration (s): 332.79
Total input tokens: 229783
Total generated tokens: 196895
Request throughput (req/s): 3.08
Output token throughput (tok/s): 591.65
Total Token throughput (tok/s): 1282.12
---------------Time to First Token----------------
Mean TTFT (ms): 13584.20
Median TTFT (ms): 16816.48
P99 TTFT (ms): 38278.11
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 359.57
Median TPOT (ms): 223.84
P99 TPOT (ms): 2436.09
---------------Inter-token Latency----------------
Mean ITL (ms): 3916.60
Median ITL (ms): 3904.36
P99 ITL (ms): 6149.72
==================================================
Running with concurrency 512...
Namespace(backend='openai-chat', base_url=None, host='ip-10-1-21-143', port=8000, endpoint='/v1/chat/completions', dataset=None, dataset_name='sharegpt', dataset_path='./ShareGPT_V3_unfiltered_cleaned_split.json', max_concurrency=512, model='deepseek-ai/DeepSeek-V3', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=1024, logprobs=None, request_rate=inf, burstiness=1.0, seed=42, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto')
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 512
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [05:24<00:00, 3.15it/s]
============ Serving Benchmark Result ============
Successful requests: 1024
Benchmark duration (s): 324.76
Total input tokens: 229783
Total generated tokens: 196326
Request throughput (req/s): 3.15
Output token throughput (tok/s): 604.53
Total Token throughput (tok/s): 1312.09
---------------Time to First Token----------------
Mean TTFT (ms): 51245.90
Median TTFT (ms): 76532.73
P99 TTFT (ms): 93584.25
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 355.42
Median TPOT (ms): 219.52
P99 TPOT (ms): 2423.14
---------------Inter-token Latency----------------
Mean ITL (ms): 3868.20
Median ITL (ms): 3855.88
P99 ITL (ms): 6106.05
==================================================
llama.cpp
llama-bench
(base) ubuntu@ip-10-1-1-135:~/llama.cpp/DeepSeek-V3-Q5_K_M$ time ~/llama.cpp/llama.cpp/build/bin/llama-gguf-split --merge DeepSeek-V3-Q5_K_M-00001-of-00011.gguf DeepSeek-V3-Q5_K_M.gguf
(base) ubuntu@ip-10-1-1-135:~/llama.cpp/DeepSeek-V3-Q5_K_M$ time ../llama.cpp/build/bin/llama-bench -m DeepSeek-V3-Q5_K_M.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 CUDA devices:
Device 0: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
Device 1: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
Device 2: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
Device 3: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
Device 4: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
Device 5: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
Device 6: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
Device 7: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| deepseek2 671B Q5_K - Medium | 442.74 GiB | 671.03 B | CUDA | 99 | pp512 | 290.28 ± 1.25 |
| deepseek2 671B Q5_K - Medium | 442.74 GiB | 671.03 B | CUDA | 99 | tg128 | 23.63 ± 0.04 |
build: b56f079e (4418)
real 9m18.083s
user 1m18.287s
sys 7m58.478s
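The serving numbers below were taken against llama-server on port 8080. The launch command wasn't saved, but with the merged GGUF it would look roughly like this (flags are assumptions; offloading all layers, same as the llama-bench run):

~/llama.cpp/llama.cpp/build/bin/llama-server \
  -m DeepSeek-V3-Q5_K_M.gguf \
  --n-gpu-layers 99 \
  --ctx-size 8192 \
  --host 0.0.0.0 --port 8080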
benchmark_serving.py
# llama.cpp
(vllm) ubuntu@ip-10-1-1-135:~/meti/benchmarks$ python ~/vllm/benchmarks/benchmark_serving.py --backend openai-chat --host localhost --port 8080 --endpoint='/v1/chat/completions' --model "deepseek-ai/DeepSeek-V3" --dataset-name sharegpt --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 50 --max-concurrency 1 --seed 42
Namespace(backend='openai-chat', base_url=None, host='localhost', port=8080, endpoint='/v1/chat/completions', dataset=None, dataset_name='sharegpt', dataset_path='./ShareGPT_V3_unfiltered_cleaned_split.json', max_concurrency=1, model='deepseek-ai/DeepSeek-V3', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=50, logprobs=None, request_rate=inf, burstiness=1.0, seed=42, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto')
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [26:52<00:00, 32.24s/it]
============ Serving Benchmark Result ============
Successful requests: 50
Benchmark duration (s): 1612.14
Total input tokens: 12211
Total generated tokens: 35857
Request throughput (req/s): 0.03
Output token throughput (tok/s): 22.24
Total Token throughput (tok/s): 29.82
---------------Time to First Token----------------
Mean TTFT (ms): 1353.39
Median TTFT (ms): 1121.37
P99 TTFT (ms): 3898.91
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 43.01
Median TPOT (ms): 42.97
P99 TPOT (ms): 44.10
---------------Inter-token Latency----------------
Mean ITL (ms): 43.08
Median ITL (ms): 42.90
P99 ITL (ms): 46.61
==================================================
# vLLM PP=2 TP=8
❯ python ~/vllm/benchmarks/benchmark_serving.py --backend openai-chat --host ip-10-1-21-143 --port 8000 --endpoint='/v1/chat/completions' --model "deepseek-ai/DeepSeek-V3" --dataset-name sharegpt --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 50 --max-concurrency 1 --seed 42
Namespace(backend='openai-chat', base_url=None, host='ip-10-1-21-143', port=8000, endpoint='/v1/chat/completions', dataset=None, dataset_name='sharegpt', dataset_path='./ShareGPT_V3_unfiltered_cleaned_split.json', max_concurrency=1, model='deepseek-ai/DeepSeek-V3', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=50, logprobs=None, request_rate=inf, burstiness=1.0, seed=42, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto')
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [58:56<00:00, 70.73s/it]
============ Serving Benchmark Result ============
Successful requests: 50
Benchmark duration (s): 3536.56
Total input tokens: 12211
Total generated tokens: 10683
Request throughput (req/s): 0.01
Output token throughput (tok/s): 3.02
Total Token throughput (tok/s): 6.47
---------------Time to First Token----------------
Mean TTFT (ms): 347.96
Median TTFT (ms): 341.99
P99 TTFT (ms): 427.86
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 408.90
Median TPOT (ms): 339.68
P99 TPOT (ms): 1127.59
---------------Inter-token Latency----------------
Mean ITL (ms): 6317.84
Median ITL (ms): 6349.42
P99 ITL (ms): 6846.15
==================================================
# vLLM TP=16
============ Serving Benchmark Result ============
Successful requests: 50
Benchmark duration (s): 1826.67
Total input tokens: 12211
Total generated tokens: 10742
Request throughput (req/s): 0.03
Output token throughput (tok/s): 5.88
Total Token throughput (tok/s): 12.57
---------------Time to First Token----------------
Mean TTFT (ms): 394.63
Median TTFT (ms): 176.61
P99 TTFT (ms): 3931.75
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 207.92
Median TPOT (ms): 172.19
P99 TPOT (ms): 597.99
---------------Inter-token Latency----------------
Mean ITL (ms): 3226.57
Median ITL (ms): 3219.82
P99 ITL (ms): 3330.43
==================================================
sglang
# Very easy to run; server arguments: https://sgl-project.github.io/backend/server_arguments.html#server-arguments
# node 0 (rank 0)
mamba activate sglang
python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --nccl-init ip-10-1-21-143:50000 --nnodes 2 --node-rank 0 --trust-remote-code
# node 1 (rank 1)
mamba activate sglang
python -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 16 --nccl-init ip-10-1-21-143:50000 --nnodes 2 --node-rank 1 --trust-remote-code
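Quick sanity check before the sweep, against sglang's native /generate endpoint on the default port 30000 (assuming the server is reachable locally from the benchmark node):

curl -s http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "The capital of Japan is", "sampling_params": {"max_new_tokens": 8, "temperature": 0}}'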
(sglang) ubuntu@ip-10-1-1-135:~$ for c in 64 128 256 512; do echo "Running with concurrency $c..." && python3 -m sglang.bench_serving --backend sglang --num-prompts 1024 --max-concurrency $c --seed 42; done
Running with concurrency 64...
Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=30000, dataset_name='sharegpt', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=1024, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=0.0, request_rate=inf, max_concurrency=64, seed=42, multi=False, request_rate_range='2,34,2', output_file=None, disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None)
#Input tokens: 229783
#Output tokens: 207591
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [06:38<00:00, 2.57it/s]
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max reqeuest concurrency: 64
Successful requests: 1024
Benchmark duration (s): 398.99
Total input tokens: 229783
Total generated tokens: 207591
Total generated tokens (retokenized): 206702
Request throughput (req/s): 2.57
Input token throughput (tok/s): 575.91
Output token throughput (tok/s): 520.29
Total token throughput (tok/s): 1096.20
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 23386.87
Median E2E Latency (ms): 15854.39
---------------Time to First Token----------------
Mean TTFT (ms): 494.63
Median TTFT (ms): 409.65
P99 TTFT (ms): 1562.19
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 112.12
Median TPOT (ms): 116.39
P99 TPOT (ms): 156.07
---------------Inter-token Latency----------------
Mean ITL (ms): 113.76
Median ITL (ms): 65.04
P99 ITL (ms): 489.22
==================================================
Running with concurrency 128...
Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=30000, dataset_name='sharegpt', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=1024, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=0.0, request_rate=inf, max_concurrency=128, seed=42, multi=False, request_rate_range='2,34,2', output_file=None, disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None)
#Input tokens: 229783
#Output tokens: 207591
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [04:58<00:00, 3.43it/s]
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max reqeuest concurrency: 128
Successful requests: 1024
Benchmark duration (s): 298.50
Total input tokens: 229783
Total generated tokens: 207591
Total generated tokens (retokenized): 206704
Request throughput (req/s): 3.43
Input token throughput (tok/s): 769.79
Output token throughput (tok/s): 695.44
Total token throughput (tok/s): 1465.23
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 31174.26
Median E2E Latency (ms): 22265.58
---------------Time to First Token----------------
Mean TTFT (ms): 729.85
Median TTFT (ms): 499.59
P99 TTFT (ms): 2940.07
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 153.84
Median TPOT (ms): 163.12
P99 TPOT (ms): 213.69
---------------Inter-token Latency----------------
Mean ITL (ms): 151.28
Median ITL (ms): 73.89
P99 ITL (ms): 552.57
==================================================
Running with concurrency 256...
Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=30000, dataset_name='sharegpt', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=1024, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=0.0, request_rate=inf, max_concurrency=256, seed=42, multi=False, request_rate_range='2,34,2', output_file=None, disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None)
#Input tokens: 229783
#Output tokens: 207591
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [04:53<00:00, 3.49it/s]
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max reqeuest concurrency: 256
Successful requests: 1024
Benchmark duration (s): 293.24
Total input tokens: 229783
Total generated tokens: 207591
Total generated tokens (retokenized): 206719
Request throughput (req/s): 3.49
Input token throughput (tok/s): 783.59
Output token throughput (tok/s): 707.91
Total token throughput (tok/s): 1491.51
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 56784.72
Median E2E Latency (ms): 43139.58
---------------Time to First Token----------------
Mean TTFT (ms): 1532.73
Median TTFT (ms): 768.49
P99 TTFT (ms): 5498.26
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 301.15
Median TPOT (ms): 319.49
P99 TPOT (ms): 473.52
---------------Inter-token Latency----------------
Mean ITL (ms): 274.56
Median ITL (ms): 214.96
P99 ITL (ms): 808.56
==================================================
Running with concurrency 512...
Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=30000, dataset_name='sharegpt', dataset_path='', model='deepseek-ai/DeepSeek-V3', tokenizer=None, num_prompts=1024, sharegpt_output_len=None, random_input_len=1024, random_output_len=1024, random_range_ratio=0.0, request_rate=inf, max_concurrency=512, seed=42, multi=False, request_rate_range='2,34,2', output_file=None, disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, return_logprob=False, extra_request_body=None, gen_num_groups=64, gen_prompts_per_group=16, gen_system_prompt_len=2048, gen_question_len=128, gen_output_len=256, profile=False, lora_name=None)
#Input tokens: 229783
#Output tokens: 207591
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1024/1024 [03:21<00:00, 5.08it/s]
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max reqeuest concurrency: 512
Successful requests: 1024
Benchmark duration (s): 201.72
Total input tokens: 229783
Total generated tokens: 207591
Total generated tokens (retokenized): 206731
Request throughput (req/s): 5.08
Input token throughput (tok/s): 1139.13
Output token throughput (tok/s): 1029.11
Total token throughput (tok/s): 2168.24
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 59024.01
Median E2E Latency (ms): 52111.56
---------------Time to First Token----------------
Mean TTFT (ms): 3862.67
Median TTFT (ms): 1779.22
P99 TTFT (ms): 10721.69
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 352.81
Median TPOT (ms): 336.45
P99 TPOT (ms): 997.52
---------------Inter-token Latency----------------
Mean ITL (ms): 274.12
Median ITL (ms): 199.53
P99 ITL (ms): 867.67
==================================================