Skip to main content

Deploy LLM models on Kubernetes using OpenLLM

Pavan Shiraguppi

Pavan Shiraguppi


Natural Language Processing (NLP) has evolved significantly, with Large Language Models (LLMs) at the forefront of cutting-edge applications. Their ability to understand and generate human-like text has revolutionized various industries. Deploying and testing these LLMs effectively is crucial for harnessing their capabilities.

OpenLLM is an open-source platform for operating large language models (LLMs) in production. It allows you to run inference on any open-source LLMs, fine-tune them, deploy, and build powerful AI apps with ease.

This blog post explores the deployment of LLM models using the OpenLLM framework on a Kubernetes infrastructure. For the purpose of the demo, I am using a hardware setup consisting of an RTX 3060 GPU and an Intel i7 12700K processor, we delve into the technical aspects of achieving optimal performance.

Environment Setup and Kubernetes Configuration

Before diving into LLM deployment on Kubernetes, we need to ensure the environment is set up correctly and the Kubernetes cluster is ready for action.

Preparing the Kubernetes Cluster

Setting up a Kubernetes cluster requires defining worker nodes, networking, and orchestrators. Ensure you have Kubernetes installed and a cluster configured. This can be achieved through tools like kubeadm, minikube, kind or managed services such as Google Kubernetes Engine (GKE) and Amazon EKS.

If you are using kind cluster, you can create cluster as following:

kind create cluster

Installing Dependencies and Resources

Within the cluster, install essential dependencies such as NVIDIA GPU drivers, CUDA libraries, and Kubernetes GPU support. These components are crucial for enabling GPU acceleration and maximizing LLM performance.

To use CUDA on your system, you will need the following installed:

Using OpenLLM to Containerize and Load Models


OpenLLM supports a wide range of state-of-the-art LLMs, including Llama 2, StableLM, Falcon, Dolly, Flan-T5, ChatGLM, and StarCoder. It also provides flexible APIs that allow you to serve LLMs over RESTful API or gRPC with one command, or query via WebUI, CLI, our Python/Javascript client, or any HTTP client.

Some of the key features of OpenLLM:

  • Support for a wide range of state-of-the-art LLMs
  • Flexible APIs for serving LLMs
  • Integration with other powerful tools
  • Easy to use
  • Open-source

To use OpenLLM, you need to have Python 3.8 (or newer) and pip installed on your system. We highly recommend using a Virtual Environment (like conda) to prevent package conflicts.

You can install OpenLLM using pip as follows:

pip install openllm

To verify if it's installed correctly, run:

openllm -h

To start an LLM server, for example, to start an Open Pre-trained transformer model aka OPT server, do the following:

openllm start opt

Selecting the LLM Model

OpenLLM framework supports various pre-trained LLM models like GPT-3, GPT-2, and BERT. When selecting a large language model (LLM) for your application, the main factors to consider are:

  • Model size - Larger models like GPT-3 have more parameters and can handle more complex tasks, while smaller ones like GPT-2 are better for simpler usecases.
  • Architecture - Models optimized for generative AI like GPT-3 or understanding (e.g. BERT) align with different use cases.
  • Training data - More high-quality, diverse data leads to better generalization capabilities.
  • Fine-tuning - Pre-trained models can be further trained on domain-specific data to improve performance.
  • Alignment with usecase- Validate potential models on your specific application and data to ensure the right balance of complexity and capability.

The ideal LLM matches your needs in terms of complexity, data requirements, compute resources, and overall capability. Thoroughly evaluate options to select the best fit. For this demo, we will be using the Dolly-2 model with 3B parameters.

Loading the Chosen Model within a Container

Containerization enhances reproducibility and portability. Package your LLM model, OpenLLM dependencies, and other relevant libraries within a Docker container. This ensures a consistent runtime environment across different deployments.

With OpenLLM, you can easily build a Bento for a specific model, like dolly-v2-3b, using the build command.

openllm build dolly-v2 --model-id databricks/dolly-v2-3b

In this demo, we are using BentoML, an MLOps platform and also the parent organization behind OpenLLM project. A Bento, in BentoML, is the unit of distribution. It packages your program's source code, models, files, artifacts, and dependencies.

To Containerize your Bento, run the following command:

bentoml containerize <name:version> -t dolly-v2-3b:latest --opt progress=plain

This generates an OCI-compatible docker image that can be deployed anywhere docker runs.

You will be able to locate the docker image in $BENTO_HOME\bentos\stabilityai-stablelm-tuned-alpha-3b-service\$id\env\docker.

Need help for your organization? Share your email below to discuss with the experts.

Model Inference and High Scalability using Kubernetes

Executing model inference efficiently and scaling up when needed are key factors in a Kubernetes-based LLM deployment. The reliability and scalability features of Kubernetes can help efficiently scale the model for the production usecase.

Running LLM Model Inference

  1. Pod Communication: Set up communication protocols within pods to manage model input and output. This can involve RESTful APIs or gRPC-based communication.

OpenLLM has a gRPC server running by default on port 3000. We can have a deployment file as follows:

apiVersion: apps/v1 kind: Deployment metadata: name: dolly-v2-deployment spec: replicas: 3 selector: matchLabels: app: dolly-v2 template: metadata: labels: app: dolly-v2 spec: containers: - name: dolly-v2 image: dolly-v2-3b:latest imagePullPolicy: Never ports: - containerPort: 3000

Note: We will be assuming that the image is available locally with the name dolly-v2-3b with the latest tag. If the image is pushed to the repository, then make sure to remove the imagePullPolicy line and provide the credentials to the repository as secrets if it is a private repository.

  1. Service: Expose the deployment using services to distribute incoming inference requests evenly among multiple pods.

We set up a LoadBalancer type service in our Kubernetes cluster that gets exposed on port 80. If you are using Ingress then it will be ClusterIP instead of LoadBalancer.

apiVersion: v1 kind: Service metadata: name: dolly-v2-service spec: type: LoadBalancer selector: app: dolly-v2 ports: - name: http port: 80 targetPort: 3000

Horizontal Scaling and Autoscaling

  1. Horizontal Pod Autoscaling (HPA): Configure HPAs to automatically adjust the number of pods based on CPU or custom metrics. This ensures optimal resource utilization.

We can declare an HPA yaml for CPU configuration as below:

apiVersion: autoscaling/v1 kind: HorizontalPodAutoscaler metadata: name: dolly-v2-hpa spec: scaleTargetRef: apiVersion: apps/v1 kind: Deployment name: dolly-v2-deployment minReplicas: 1 maxReplicas: 10 targetCPUUtilizationPercentage: 60

For GPU configuration, To gather GPU metrics in Kubernetes, follow this blog to install the DCGM server: Kubernetes HPA using GPU metrics.

After installation of the DCGM server, we can use the following to create HPA for GPU memory:

apiVersion: autoscaling/v2beta1 kind: HorizontalPodAutoscaler metadata: name: dolly-v2-hpa spec: scaleTargetRef: apiVersion: apps/v1beta1 kind: Deployment name: dolly-v2-deployment minReplicas: 1 maxReplicas: 10 metrics: - type: Object object: target: kind: Service name: dolly-v2-deployment # kubectl get svc | grep dcgm metricName: DCGM_FI_DEV_MEM_COPY_UTIL targetValue: 80
  1. Cluster Autoscaling: Enable cluster-level autoscaling to manage resource availability across multiple nodes, accommodating varying workloads. Here are the key steps to configure cluster autoscaling in Kubernetes:
  • Install the Cluster Autoscaler plugin:
kubectl apply -f
  • Configure auto scaling by setting min/max nodes in your cluster config.
  • Annotate node groups you want to scale automatically:
kubectl annotate node POOL_NAME
  • Deploy an auto scaling-enabled application, like an HPA-based deployment. The autoscaler will scale the node pool when pods are unschedulable.

  • Configure auto scaling parameters as needed:

    • Adjust scale-up/down delays with --scale-down-delay
    • Set scale-down unneeded time with --scale-down-unneeded-time
    • Limit scale speed with --max-node-provision-time
  • Monitor your cluster autoscaling events:

kubectl get events | grep ClusterAutoscaler

Performance Analysis of LLMs in a Kubernetes Environment

Evaluating the performance of LLM deployment within a Kubernetes environment involves latency measurement and resource utilization assessment.

Latency Evaluation

  1. Measuring Latency: Use tools like kubectl exec or custom scripts to measure the time it takes for a pod to process an input prompt and generate a response. Refer the below python script to determine latency metrics of the GPU.

Python Program to test Latency and Tokens/sec.

import torch from transformers import AutoModelForCausalLM model_name = "databricks/dolly-v2-3b" model = AutoModelForCausalLM.from_pretrained(model_name).cuda() text = "Sample text for benchmarking" input_ids = model.tokenizer(text, return_tensors="pt").input_ids.cuda() reps =100 times = [] for i in range(reps): start = torch.cuda.Event(enable_timing=True) end = torch.cuda.Event(enable_timing=True) # Start timer start.record() # Model inference outputs = model(input_ids).logits # End timer end.record() # Sync and get time torch.cuda.synchronize() times.append(start.elapsed_time(end)) # Calculate TPS tokens = len(text.split()) tps = (tokens * reps) / sum(times) # Calculate latency latency = sum(times) / reps * 1000 # in ms print(f"Avg TPS: {tps:.2f}") print(f"Avg Latency: {latency:.2f} ms")
  1. Comparing Latency using Aviary: Aviary is a valuable tool for developers who want to get started with LLMs, or who want to improve the performance and scalability of their LLM-based applications. It is easy to use and provides a number of features that make it a great choice for both beginners and experienced developers.

Resource Utilization and Scalability Insights

  1. Monitoring Resource Consumption: Utilize Kubernetes dashboard or monitoring tools like Prometheus and Grafana to observe resource usage patterns across pods.
  2. Scalability Analysis: Analyze how Kubernetes dynamically adjusts resources based on demand, ensuring resource efficiency and application responsiveness.


We have tried to put up an in-depth technical analysis that demonstrates the immense value of leveraging Kubernetes for LLM deployments. By combining GPU acceleration, specialized libraries, and Kubernetes orchestration capabilities, LLMs can be deployed with significantly improved performance and for a large scale. In particular, GPU-enabled pods achieved over 2x lower latency and nearly double the inference throughput compared to CPU-only variants. Kubernetes autoscaling also allowed pods to be scaled horizontally on demand, so query volumes could increase without compromising responsiveness.

Overall, the results of this analysis validate that Kubernetes is the best choice for deploying LLMs at scale. The synergy between software and hardware optimization on Kubernetes unlocks the true potential of LLMs for real-world NLP use cases.

If you are looking for help implementing LLMs on Kubernetes, we would love to hear how you are scaling LLMs. Please contact us to discuss your specific problem statement.

Unlock the full potential of your tech ecosystem!

Contact us today to discuss your requirements and learn how our consulting services can drive efficiency, reliability, and scalability for your organization.

Enjoying this post?

Get our posts directly in your inbox.