Using vLLM With DeepSeek on Karolina
This guide walks through how to set up and serve the DeepSeek model using vLLM on the Karolina HPC cluster. It covers requesting GPU resources, loading the necessary modules, setting environment variables, and launching the model server with tensor parallelism across multiple GPUs, both on a single node and across multiple nodes.
Multi GPUs - Single Node Setup
Request Compute Resources via SLURM
Use salloc to allocate an interactive job session on the GPU partition.
salloc -A PROJECT-ID --partition=qgpu --gpus=4 --time=02:00:00
Replace PROJECT-ID with your actual project ID.
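As a quick sanity check, you can confirm the allocation before continuing. This is a minimal sketch; it assumes the interactive shell from salloc runs on the allocated GPU node, otherwise prefix the commands with srun.
# Job ID and the node(s) granted by SLURM
echo $SLURM_JOB_ID $SLURM_JOB_NODELIST
# Should list the 4 requested GPUs
nvidia-smi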
Load Required Modules on Karolina
Load the necessary software modules including Python and CUDA.
ml Python/3.12.3-GCCcore-13.3.0
ml CUDA/12.4.0
Verify CUDA is loaded correctly:
nvcc --version
Create and Activate Virtual Environment
python -m venv vllm
source vllm/bin/activate
pip install "vllm==0.7.3" "ray==2.40.0"Check if the necessary environment variables like LD_LIBRARY_PATH are correctly set.
Check that the necessary environment variables, like LD_LIBRARY_PATH, are correctly set:
echo $LD_LIBRARY_PATH
Set Environment Variables for Cache Directories
These directories will be used by HuggingFace and vLLM to store model weights.
export HF_HUB_CACHE=/scratch/project/fta-25-9
export VLLM_CACHE_ROOT=/scratch/project/fta-25-9
Adjust the paths to your project's scratch directory.
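Optionally, confirm that the variables are set and that the directory is writable (the path below is the example project path used above):
echo $HF_HUB_CACHE $VLLM_CACHE_ROOT
test -w /scratch/project/fta-25-9 && echo "scratch directory is writable"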
Serve Model
Launch the DeepSeek model using vLLM.
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --trust-remote-code --tensor-parallel-size 4 --download-dir /scratch/project/fta-25-9
Note that --trust-remote-code is required for some models.
--tensor-parallel-size should match the number of GPUs allocated (4 in this case).
The --download-dir should point to a high-performance, writable scratch directory.
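Once the server reports it is ready, you can confirm from another shell on the same node that the model is being served; vllm serve listens on port 8000 by default.
# Lists the models the OpenAI-compatible server is exposing
curl http://127.0.0.1:8000/v1/models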
Multi GPUs - Multi Nodes Setup
This section describes how to launch a distributed vLLM model server across multiple nodes, using Ray for orchestration.
Request Compute Resources via SLURM
Request multiple nodes with GPUs using SLURM. Replace the account ID as needed.
salloc -A FTA-25-9 --partition=qgpu --nodes=2 --gpus=9 --time=12:00:00
--nodes=2: Requests 2 compute nodes.
--gpus=9: Requests 9 GPUs in total across all nodes.
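After the allocation is granted, you can list the hostnames of the allocated nodes with standard SLURM tools; this sketch uses the environment variable set by salloc.
# One hostname per allocated node
scontrol show hostnames "$SLURM_JOB_NODELIST"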
Load Modules on Karolina
Load the required modules on each node.
ml Python/3.12.3-GCCcore-13.3.0
ml CUDA/12.4.0
Verify the CUDA installation:
nvcc --version
Create and Activate Virtual Environment
python -m venv vllm
source vllm/bin/activate
pip install "vllm==0.7.3" "ray==2.40.0"Ensure you do this on all nodes.
Set Up Ray Cluster
Choose a free port (e.g., 6379) and start the Ray cluster:
ray start --head --port=6379
Specify the port you will start your cluster on.
The IP address and port used will be needed for the worker nodes.
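One way to capture the head node's IP address for later use on the workers (a sketch; it assumes hostname -I is available and takes the first reported address):
# First IP address of the head node
HEAD_IP=$(hostname -I | awk '{print $1}')
echo $HEAD_IP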
Switch from the head node to each of the worker nodes as follows.
On Worker Nodes:
- Identify the node IDs of the allocated nodes (get the list of assigned nodes):
squeue --me
- SSH into each of the other nodes (excluding the head node):
ssh acn-node_id
- Connect to the Ray head from the worker node using its IP and port:
ray start --address='head-node-ip:port-number'
Repeat this for all worker nodes to join the cluster.
Check Cluster Status:
- On any node, confirm that all nodes have joined:
ray status
Serve Model With vLLM:
Once the Ray cluster is ready, launch the vLLM OpenAI-compatible API server on one of the worker nodes.
python -m vllm.entrypoints.openai.api_server --model deepseek-ai/DeepSeek-R1-Distill-Llama-70B --tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-70B --served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-70B --host 0.0.0.0 --port 8000 --tensor-parallel-size 4 --pipeline-parallel-size 2 --device cuda
This command assumes model parallelism:
--tensor-parallel-size: how many GPUs each layer is split across.
--pipeline-parallel-size: how many pipeline stages the layers are split into across different nodes/devices.
In total the server uses tensor-parallel-size × pipeline-parallel-size GPUs (4 × 2 = 8 in this example).
Test Model Server
You can test the endpoint using curl. Run this from the same node (adjust IP if needed).
curl http://127.0.0.1:8000/v1/completions -H "Content-Type: application/json" -d '{
"model": "deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
"prompt": "Hello, world!",
"max_tokens": 50
}'
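The same server also exposes the OpenAI-compatible chat endpoint (provided the model ships a chat template, which the DeepSeek distill models do), so an equivalent test against /v1/chat/completions looks like this:
curl http://127.0.0.1:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
"messages": [{"role": "user", "content": "Hello, world!"}],
"max_tokens": 50
}'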
