
Using vLLM with DeepSeek on Karolina

This guide walks through how to set up and serve a DeepSeek model with vLLM on the Karolina HPC cluster. It covers requesting GPU resources, loading the necessary modules, setting environment variables, and launching the model server with tensor parallelism across multiple GPUs, both on a single node and across multiple nodes.

Multi-GPU - Single Node Setup

1. Request Compute Resources via SLURM

Use salloc to allocate an interactive job session on the GPU partition.

salloc -A PROJECT-ID --partition=qgpu --gpus=4 --time=02:00:00

Replace PROJECT-ID with your actual project ID.
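
Once the allocation is granted, it is worth confirming that the job is running and that the GPUs are visible. Depending on how salloc is configured you may first need to srun or ssh into the allocated node; the following is a quick sanity check:

squeue --me      # confirm the job is running and see the assigned node
nvidia-smi -L    # list the allocated GPUs (should show 4)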

2. Load Required Modules on Karolina

Load the necessary software modules including Python and CUDA.

ml Python/3.12.3-GCCcore-13.3.0
ml CUDA/12.4.0

Verify that CUDA is loaded correctly:

nvcc --version
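
The exact build number will vary, but with CUDA/12.4.0 loaded the output should end with a line like:

Cuda compilation tools, release 12.4, V12.4.xx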

3. Create and Activate Virtual Environment

python -m venv vllm
source vllm/bin/activate
pip install "vllm==0.7.3" "ray==2.40.0"

Check that the necessary environment variables, such as LD_LIBRARY_PATH, are set correctly.

echo $LD_LIBRARY_PATH
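
If the CUDA module is loaded, its library directories should appear in the path. A quick way to filter for just the CUDA entries (a sketch; the exact paths depend on the module installation):

echo "$LD_LIBRARY_PATH" | tr ':' '\n' | grep -i cuda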

4. Set Environment Variables for Cache Directories

These directories will be used by HuggingFace and vLLM to store model weights.

export HF_HUB_CACHE=/scratch/project/fta-25-9
export VLLM_CACHE_ROOT=/scratch/project/fta-25-9

Adjust paths to your project's scratch directory.
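
Before downloading tens of gigabytes of weights, make sure the cache directory exists and is writable (a minimal check, assuming the variables set above):

mkdir -p "$HF_HUB_CACHE" "$VLLM_CACHE_ROOT"
touch "$HF_HUB_CACHE/.write_test" && rm "$HF_HUB_CACHE/.write_test" && echo "cache directory is writable"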

5. Serve Model

Launch the DeepSeek model using vLLM.

vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --trust-remote-code --tensor-parallel-size 4 --download-dir /scratch/project/fta-25-9

Note that --trust-remote-code is required for some models.

--tensor-parallel-size should match the number of GPUs allocated (4 in this case).

The --download-dir should point to a high-performance, writable scratch directory.
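
Once the server reports it is ready, you can smoke-test it from the same node. By default vllm serve listens on port 8000 and exposes the OpenAI-compatible completions endpoint (the same endpoint used to test the multi-node setup below):

curl http://127.0.0.1:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
      "prompt": "Hello, world!",
      "max_tokens": 50
    }'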

Multi-GPU - Multi-Node Setup

This section describes how to launch a distributed vLLM model server across multiple nodes, using Ray for orchestration.

1. Request Compute Resources via SLURM

Request multiple nodes with GPUs using SLURM. Replace the account ID as needed.

salloc -A FTA-25-9 --partition=qgpu --nodes=2 --gpus=8 --time=12:00:00

--nodes=2: Requests 2 compute nodes.

--gpus=8: Requests 8 GPUs in total across the two nodes, matching the 4 × 2 = 8 GPUs consumed by the parallelism settings used below.

2. Load Required Modules on Karolina

Load the required modules on each node.

ml Python/3.12.3-GCCcore-13.3.0
ml CUDA/12.4.0

Verify the CUDA installation:

nvcc --version

3. Create and Activate Virtual Environment

python -m venv vllm
source vllm/bin/activate
pip install "vllm==0.7.3" "ray==2.40.0"

Ensure the environment is available on every node; if the virtual environment lives on a shared filesystem (such as your project's /scratch or /home), creating it once is enough, otherwise repeat these steps on each node.

4. Set Up Ray Cluster

Choose a free port (e.g., 6379) and start the Ray cluster:

ray start --head --port=6379

The head node's IP address and the port you chose will be needed by the worker nodes.
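
One way to obtain the head node's IP address (run this on the head node; hostname -I can list several addresses, so verify the first one is reachable from the other nodes):

HEAD_IP=$(hostname -I | awk '{print $1}')
echo "$HEAD_IP"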

On Worker Nodes:

  • Identify the nodes allocated to your job: squeue --me lists the assigned nodes.
  • SSH into each of the other nodes (excluding the head node), replacing acn-node_id with the actual node name.

squeue --me

ssh acn-node_id

Connect to the Ray head from the worker node using its IP and port:

ray start --address='head-node-ip:port-number'

Repeat this for all worker nodes to join the cluster.
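
Instead of joining each worker by hand, a loop like the following can be run from the head node. This is a sketch: it assumes the virtual environment lives at ~/vllm on a shared filesystem and that HEAD_IP holds the head node's IP address as above:

for node in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do
  [ "$node" = "$(hostname -s)" ] && continue   # skip the head node itself
  ssh "$node" "source ~/vllm/bin/activate && ray start --address=${HEAD_IP}:6379"
done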

Check Cluster Status:

  • On any node, confirm that all nodes have joined:
ray status
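
The exact layout of the output varies across Ray versions, but it should list both nodes as active and report a GPU total matching the allocation, e.g. a usage line like:

0.0/8.0 GPU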

Serve Model with vLLM:

Once the Ray cluster is ready, launch the vLLM OpenAI-compatible API server on one of the nodes in the cluster.

python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
    --tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
    --served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
    --host 0.0.0.0 --port 8000 \
    --tensor-parallel-size 4 --pipeline-parallel-size 2 \
    --device cuda

This command combines two forms of model parallelism:

--tensor-parallel-size: how many GPUs each layer is split across (4 here).

--pipeline-parallel-size: how many pipeline stages the layers are split into across nodes/devices (2 here).

Together these occupy 4 × 2 = 8 GPUs, matching the allocation above.

Test Model Server

You can test the endpoint using curl. Run this from the node serving the model (adjust the IP if calling from elsewhere).

curl http://127.0.0.1:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
      "prompt": "Hello, world!",
      "max_tokens": 50
    }'
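
As an additional quick health check, the OpenAI-compatible server also exposes a model listing endpoint; the served model name should appear in the response:

curl http://127.0.0.1:8000/v1/models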