Using vLLM with DeepSeek on Karolina¶
This guide walks through how to set up and serve a DeepSeek model with vLLM on the Karolina HPC cluster. It covers requesting GPU resources, loading the necessary modules, setting environment variables, and launching the model server with tensor parallelism across multiple GPUs, both on a single node and across multiple nodes.
Multiple GPUs - Single Node Setup¶
1. Request Compute Resources via SLURM¶
Use salloc to allocate an interactive job session on the GPU partition.
salloc -A PROJECT-ID --partition=qgpu --gpus=4 --time=02:00:00
Replace PROJECT-ID with your actual project ID.
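Before continuing, it can be useful to confirm the allocation and inspect the assigned GPUs (an optional check; nvidia-smi must be run on the allocated compute node itself):
squeue --me     # shows the job state and the assigned node
nvidia-smi      # on the compute node: should list the 4 allocated GPUs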
2. Load Required Modules on Karolina¶
Load the necessary software modules including Python and CUDA.
ml Python/3.12.3-GCCcore-13.3.0
ml CUDA/12.4.0
Verify that CUDA is loaded correctly:
nvcc --version
3. Create and Activate Virtual Environment¶
python -m venv vllm
source vllm/bin/activate
pip install "vllm==0.7.3" "ray==2.40.0"
Check that the necessary environment variables, such as LD_LIBRARY_PATH, are set correctly:
echo $LD_LIBRARY_PATH
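As a further check, confirm that the CUDA-enabled PyTorch pulled in by vLLM can see the GPUs (an optional sanity check, assuming the virtual environment is active and the CUDA module is loaded):
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"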
4. Set Environment Variables for Cache Directories¶
These directories will be used by HuggingFace and vLLM to store model weights.
export HF_HUB_CACHE=/scratch/project/fta-25-9
export VLLM_CACHE_ROOT=/scratch/project/fta-25-9
Adjust paths to your project's scratch directory.
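To avoid retyping these exports in every new session, they can be kept in a small file and sourced when needed (a sketch; the file name ~/vllm_env.sh is only an example):
cat > ~/vllm_env.sh <<'EOF'
export HF_HUB_CACHE=/scratch/project/fta-25-9
export VLLM_CACHE_ROOT=/scratch/project/fta-25-9
EOF
source ~/vllm_env.sh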
5. Serve Model¶
Launch the DeepSeek model using vLLM.
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --trust-remote-code --tensor-parallel-size 4 --download-dir /scratch/project/fta-25-9
Note that --trust-remote-code is required for some models.
--tensor-parallel-size should match the number of GPUs allocated (4 in this case).
--download-dir should point to a high-performance, writable scratch directory.
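vllm serve exposes an OpenAI-compatible API, by default on port 8000. Once the model has finished loading, a quick check from the same node is to list the served models (adjust host and port if you changed them):
curl http://127.0.0.1:8000/v1/models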
Multiple GPUs - Multi-Node Setup¶
This section describes how to launch a distributed vLLM model server across multiple nodes, using Ray for orchestration.
1. Request Compute Resources via SLURM¶
Request multiple nodes with GPUs using SLURM. Replace the account ID as needed.
salloc -A FTA-25-9 --partition=qgpu --nodes=2 --gpus=9 --time=12:00:00
--nodes=2
: Requests 2 compute nodes.
--gpus=9
: Requests 9 GPUs in total, across all allocated nodes.
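To see which hostnames were actually assigned (useful later when choosing the head node), you can expand the node list from inside the allocation:
scontrol show hostnames "$SLURM_JOB_NODELIST"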
2. Load Required Modules on Karolina¶
Load the required modules on each node:
ml Python/3.12.3-GCCcore-13.3.0
ml CUDA/12.4.0
Verify the CUDA installation:
nvcc --version
3. Create and Activate Virtual Environment¶
python -m venv vllm
source vllm/bin/activate
pip install "vllm==0.7.3" "ray==2.40.0"
Ensure you do this on all nodes.
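If the virtual environment was created on a shared filesystem (for example your home or project scratch directory), it only needs to be created once; otherwise repeat the steps on every node. A quick way to confirm it is usable from each allocated node (a sketch, assuming the venv lives at ~/vllm and SSH between the nodes works):
for node in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do
    ssh "$node" '~/vllm/bin/python -c "import ray; print(ray.__version__)"'
done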
4. Set Up Ray Cluster¶
Choose a free port (e.g., 6379) and start the Ray cluster:
ray start --head --port=6379
The head node's IP address and the port used will be needed by the worker nodes.
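To find the head node's IP address, hostname -I can be used on the head node (a simple approach; pick the address that the other nodes can reach over the cluster network):
hostname -I     # prints the node's IP addresses; note the one reachable from the worker nodes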
On Worker Nodes:
Identify the node IDs of the allocated nodes with squeue --me, then SSH into each of the other nodes (excluding the head node):
squeue --me
ssh acn-node_id
Connect to the Ray head from the worker node using its IP and port:
ray start --address='head-node-ip:port-number'
Repeat this for all worker nodes to join the cluster.
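Instead of SSHing into each worker by hand, the join step can also be scripted from the head node (a sketch, assuming a shared venv at ~/vllm, passwordless SSH between the allocated nodes, and that the head node is the first entry in the node list):
HEAD_IP=head-node-ip       # replace with the head node's address noted above
for node in $(scontrol show hostnames "$SLURM_JOB_NODELIST" | tail -n +2); do   # skip the head node
    ssh "$node" "source ~/vllm/bin/activate && ray start --address=${HEAD_IP}:6379"
done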
Check Cluster Status:
On any node, confirm that all nodes have joined:
ray status
5. Serve Model with vLLM¶
Once the Ray cluster is ready, launch the vLLM OpenAI-compatible API server on one of the worker nodes.
python -m vllm.entrypoints.openai.api_server --model deepseek-ai/DeepSeek-R1-Distill-Llama-70B --tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-70B --served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-70B --host 0.0.0.0 --port 8000 --tensor-parallel-size 4 --pipeline-parallel-size 2 --device cuda
This command uses both forms of model parallelism:
--tensor-parallel-size
: how many GPUs each layer is split across.
--pipeline-parallel-size
: how many pipeline stages (groups of layers) the model is split into across nodes/devices.
With --tensor-parallel-size 4 and --pipeline-parallel-size 2, the server uses 4 × 2 = 8 GPUs in total.
6. Test Model Server¶
You can test the endpoint using curl. Run this from the same node (adjust the IP if needed).
curl http://127.0.0.1:8000/v1/completions -H "Content-Type: application/json" -d '{
"model": "deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
"prompt": "Hello, world!",
"max_tokens": 50
}'
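The same server also exposes the chat completions endpoint, which applies the model's chat template (a variant of the test above):
curl http://127.0.0.1:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
  "messages": [{"role": "user", "content": "Hello, world!"}],
  "max_tokens": 50
}'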