Introduction

Natural Language Processing (NLP) has evolved significantly, with Large Language Models (LLMs) at the forefront of cutting-edge applications. Their ability to understand and generate human-like text has revolutionized various industries. Deploying and testing these LLMs effectively is crucial for harnessing their capabilities.

OpenLLM is an open-source platform for operating large language models (LLMs) in production. It allows you to run inference on any open-source LLMs, fine-tune them, deploy, and build powerful AI apps with ease.

This blog post explores deploying LLMs with the OpenLLM framework on Kubernetes infrastructure. The demo runs on a hardware setup consisting of an RTX 3060 GPU and an Intel i7-12700K processor, and we delve into the technical aspects of getting good performance out of it.

Environment Setup and Kubernetes Configuration

Before diving into LLM deployment on Kubernetes, we need to ensure the environment is set up correctly and the Kubernetes cluster is ready for action.

Preparing the Kubernetes Cluster

Setting up a Kubernetes cluster involves provisioning the control plane and worker nodes and configuring networking. Ensure you have Kubernetes installed and a cluster configured. This can be achieved through tools like kubeadm, minikube, or kind, or through managed services such as Google Kubernetes Engine (GKE) and Amazon EKS.

If you are using kind, you can create a cluster as follows:

kind create cluster
bash

Installing Dependencies and Resources

Within the cluster, install essential dependencies such as NVIDIA GPU drivers, CUDA libraries, and Kubernetes GPU support. These components are crucial for enabling GPU acceleration and maximizing LLM performance.

To use CUDA on your system, you will need the following installed:

A CUDA-capable GPU
A supported version of Linux with a gcc compiler and toolchain
CUDA Toolkit 12.2, available from the NVIDIA Developer portal
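
As a quick sanity check of the prerequisites above, the following Python sketch asks nvidia-smi for the GPU name, total memory, and driver version. It assumes the NVIDIA driver is installed and nvidia-smi is on the PATH; the helper names (parse_gpu_csv, query_gpus) are illustrative and not part of any library.

```python
import shutil
import subprocess

# Fields requested from nvidia-smi, in the order they appear in its CSV output.
QUERY_FIELDS = ["name", "memory.total", "driver_version"]


def parse_gpu_csv(output):
    """Parse `nvidia-smi --format=csv,noheader` output into one dict per GPU."""
    gpus = []
    for line in output.strip().splitlines():
        values = [value.strip() for value in line.split(",")]
        # Skip lines that do not have exactly one value per requested field.
        if len(values) == len(QUERY_FIELDS):
            gpus.append(dict(zip(QUERY_FIELDS, values)))
    return gpus


def query_gpus():
    """Return the GPUs visible to the driver, or [] if nvidia-smi is not found."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_csv(result.stdout)
```

Calling query_gpus() on a correctly configured machine should return one entry per GPU; an empty list suggests the driver is missing or not on the PATH.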

Using OpenLLM to Containerize and Load Models

OpenLLM

OpenLLM supports a wide range of state-of-the-art LLMs, including Llama 2, StableLM, Falcon, Dolly, Flan-T5, ChatGLM, and StarCoder. It also provides flexible APIs that allow you to serve LLMs over a RESTful API or gRPC with one command, and to query them via the web UI, the CLI, its Python/JavaScript client, or any HTTP client.

Some of the key features of OpenLLM:

Support for a wide range of state-of-the-art LLMs
Flexible APIs for serving LLMs
Integration with other powerful tools
Easy to use
Open-source

To use OpenLLM, you need to have Python 3.8 (or newer) and pip installed on your system. We highly recommend using a Virtual Environment (like conda) to prevent package conflicts.

You can install OpenLLM using pip as follows:

pip install openllm
bash

To verify if it's installed correctly, run:

openllm -h
bash

To start an LLM server, for example one serving an Open Pre-trained Transformer (OPT) model, run the following:

openllm start opt
bash
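
Once the server is up, any HTTP client can query it. The sketch below uses only the Python standard library. It assumes the server listens on http://localhost:3000 and exposes a POST /v1/generate endpoint that accepts a JSON body with prompt and llm_config fields; treat the route and payload as assumptions and check the API page served by your OpenLLM version, since they have changed across releases. The function names are illustrative.

```python
import json
import urllib.request


def build_generate_request(prompt, base_url="http://localhost:3000", max_new_tokens=64):
    """Build (but do not send) a generation request for the assumed endpoint."""
    payload = {
        "prompt": prompt,
        # Assumed payload shape: generation options nested under llm_config.
        "llm_config": {"max_new_tokens": max_new_tokens},
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the decoded JSON response."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running locally, generate("What is Kubernetes?") would return the parsed JSON response; its exact structure depends on the OpenLLM version.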

Selecting the LLM Model

OpenLLM supports a range of pre-trained open-source LLMs; proprietary models such as GPT-3 cannot be self-hosted this way, but they are useful reference points in the comparison below. When selecting a large language model (LLM) for your application, the main factors to consider are:

Model size - Larger models like GPT-3 have more parameters and can handle more complex tasks, while smaller ones like GPT-2 are better suited to simpler use cases.
Architecture - Decoder-only models such as GPT-3 are optimized for text generation, while encoder models such as BERT are optimized for language understanding; each aligns with different use cases.
Training data - More high-quality, diverse data leads to better generalization capabilities.
Fine-tuning - Pre-trained models can be further trained on domain-specific data to improve performance.
Alignment with use case - Validate potential models on your specific application and data to ensure the right balance of complexity and capability.

The ideal LLM matches your needs in terms of complexity, data requirements, compute resources, and overall capability. Thoroughly evaluate options to select the best fit. For this demo, we will be using the Dolly-2 model with 3B parameters.
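
To make the compute-resources point concrete, a rough rule of thumb is that the model weights alone need about (number of parameters × bytes per parameter) of GPU memory. The Python sketch below applies that rule to a nominal 3B-parameter model. The 12 GB figure for the RTX 3060 is an assumption about the card used in this demo, and the 20% headroom is an arbitrary illustrative margin; activations, the KV cache, and framework overhead are otherwise ignored.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gib(num_params, dtype="float16"):
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


def fits_on_gpu(num_params, dtype, gpu_memory_gib, headroom=0.2):
    """Check whether the weights fit while leaving a fraction of memory free."""
    budget_gib = gpu_memory_gib * (1 - headroom)
    return weight_memory_gib(num_params, dtype) <= budget_gib


if __name__ == "__main__":
    params = 3_000_000_000  # nominal size of a "3B" model
    gpu_gib = 12            # assumed memory of the demo GPU
    for dtype in BYTES_PER_PARAM:
        size = weight_memory_gib(params, dtype)
        verdict = "fits" if fits_on_gpu(params, dtype, gpu_gib) else "does not fit"
        print(f"{dtype}: {size:.2f} GiB of weights, {verdict} within the budget")
```

Under these assumptions, half precision leaves comfortable room on a 12 GB card, while full precision does not, which is one reason a 3B model is a reasonable choice for this hardware.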

Loading the Chosen Model within a Container

Containerization enhances reproducibility and portability. Package your LLM model, OpenLLM dependencies, and other relevant libraries within a Docker container. This ensures a consistent runtime environment across different deployments.

With OpenLLM, you can easily build a Bento for a specific model, like dolly-v2-3b, using the build command.

openllm build dolly-v2 --model-id databricks/dolly-v2-3b
bash

In this demo, we are using BentoML, an MLOps platform from the organization behind the OpenLLM project. A Bento is BentoML's unit of distribution: it packages your program's source code, models, files, artifacts, and dependencies.

To containerize your Bento, run the following command:

bentoml containerize <name:version> -t dolly-v2-3b:latest --opt progress=plain
bash

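This generates an OCI-compatible Docker image that can be deployed anywhere Docker runs.

The Dockerfile and build context used for the image are stored in the Bento's env\docker directory, for example $BENTO_HOME\bentos\stabilityai-stablelm-tuned-alpha-3b-service\$id\env\docker for a StableLM Bento. The Bento built from databricks/dolly-v2-3b follows the same layout under its own name.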

Model Inference and High Scalability using Kubernetes

Executing model inference efficiently and scaling up when needed are key factors in a Kubernetes-based LLM deployment. The reliability and scalability features of Kubernetes help scale the model for production use cases.

Running LLM Model Inference

Pod Communication: Set up communication protocols within pods to manage model input and output. This can involve RESTful APIs or gRPC-based communication.

By default, the OpenLLM server in the container listens for HTTP requests on port 3000 (gRPC serving has to be enabled explicitly). We can use a Deployment manifest as follows:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: dolly-v2-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: dolly-v2
  template:
    metadata:
      labels:
        app: dolly-v2
    spec:
      containers:
        - name: dolly-v2
          image: dolly-v2-3b:latest
          imagePullPolicy: Never
          ports:
            - containerPort: 3000
yaml

Note: We assume the image is available locally as dolly-v2-3b:latest. On a kind cluster, load it into the cluster nodes first with kind load docker-image dolly-v2-3b:latest. If you push the image to a container registry instead, remove the imagePullPolicy line, use the full registry path in the image field, and, for a private registry, provide the credentials as an image pull secret.
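
The Deployment above does not request a GPU, so the scheduler is free to place pods on nodes without one. Assuming the NVIDIA device plugin is installed and advertises the nvidia.com/gpu resource, one way to express the request is sketched below in Python: it builds the same Deployment as a dictionary, adds a GPU limit to the container, and prints JSON, which kubectl apply -f - accepts. The helper name and the default of one GPU per pod are illustrative assumptions.

```python
import json


def build_deployment(name="dolly-v2", image="dolly-v2-3b:latest", replicas=3, gpus_per_pod=1):
    """Build the Deployment from this post with a GPU limit on the container."""
    labels = {"app": name}
    container = {
        "name": name,
        "image": image,
        "imagePullPolicy": "Never",
        "ports": [{"containerPort": 3000}],
        # Assumes the NVIDIA device plugin exposes this extended resource.
        "resources": {"limits": {"nvidia.com/gpu": gpus_per_pod}},
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name + "-deployment"},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [container]},
            },
        },
    }


if __name__ == "__main__":
    # Pipe the output to: kubectl apply -f -
    print(json.dumps(build_deployment(), indent=2))
```

GPUs are not shared between pods by default, so on a single-GPU machine only one such pod can be scheduled at a time; passing replicas=1 avoids leaving the remaining pods pending.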

Service: Expose the deployment using services to distribute incoming inference requests evenly among multiple pods.

We set up a LoadBalancer type Service in our Kubernetes cluster, exposed on port 80 and forwarding to port 3000 of the pods. If you route traffic through an Ingress instead, use the ClusterIP type rather than LoadBalancer.

apiVersion: v1
kind: Service
metadata:
  name: dolly-v2-service
spec:
  type: LoadBalancer
  selector:
    app: dolly-v2
  ports:
    - name: http
      port: 80
      targetPort: 3000
yaml

Horizontal Scaling and Autoscaling

Horizontal Pod Autoscaling (HPA): Configure HPAs to automatically adjust the number of pods based on CPU or custom metrics. This ensures optimal resource utilization.

We can declare a CPU-based HPA in YAML as follows:

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: dolly-v2-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: dolly-v2-deployment
  minReplicas: 1
  maxReplicas: 10
  targetCPUUtilizationPercentage: 60
yaml
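
For intuition about what the HPA does with this target, the Kubernetes documentation gives the core rule as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), with no change when the ratio is within a tolerance of 1.0 (0.1 by default). The Python sketch below reproduces only that rule, clamped to minReplicas and maxReplicas. It ignores stabilization windows, pod readiness, and missing metrics, which the real controller also takes into account.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Simplified version of the HPA scaling rule."""
    ratio = current_value / target_value
    # Within the tolerance band the controller leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Never go below minReplicas or above maxReplicas.
    return max(min_replicas, min(max_replicas, desired))
```

For example, with the 60% CPU target above, three pods averaging 90% utilization give a ratio of 1.5, so the rule asks for ceil(3 × 1.5) = 5 pods.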

For GPU-based scaling, GPU metrics must first be collected in Kubernetes. Follow this blog to install the DCGM server: Kubernetes HPA using GPU metrics.

After installing the DCGM server, we can create an HPA driven by GPU memory utilization (the DCGM_FI_DEV_MEM_COPY_UTIL metric) as follows:

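# autoscaling/v2beta1 and apps/v1beta1 have been removed from current
# Kubernetes releases, so this manifest uses autoscaling/v2 and apps/v1.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: dolly-v2-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: dolly-v2-deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Object
      object:
        describedObject:
          apiVersion: v1
          kind: Service
          # Replace with the name of your DCGM Service: kubectl get svc | grep dcgm
          name: <dcgm-service-name>
        metric:
          name: DCGM_FI_DEV_MEM_COPY_UTIL
        target:
          type: Value
          value: 80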
yaml

Cluster Autoscaling: Enable cluster-level autoscaling to manage resource availability across multiple nodes, accommodating varying workloads. Here are the key steps to configure cluster autoscaling in Kubernetes:

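Install the Cluster Autoscaler. The right manifest and version depend on your cloud provider and Kubernetes version, so adjust the URL below accordingly: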

kubectl apply -f https://github.com/kubernetes/autoscaler/releases/download/v1.20.0/cluster-autoscaler-component.yaml
bash

Configure autoscaling by setting the minimum and maximum node counts in your cluster or node group configuration.
Mark the pods that the autoscaler is allowed to evict when it scales a node down (this annotation applies to pods, not nodes):

kubectl annotate pod POD_NAME cluster-autoscaler.kubernetes.io/safe-to-evict=true
bash

Deploy an autoscaling-enabled application, such as a Deployment managed by an HPA. The autoscaler adds nodes to the node pool when pods are unschedulable.
Tune the autoscaler parameters as needed:
- Adjust how long scale-down is paused after a scale-up with --scale-down-delay-after-add
- Set how long a node must be unneeded before it is removed with --scale-down-unneeded-time
- Set how long the autoscaler waits for a new node to be provisioned with --max-node-provision-time
Monitor your cluster autoscaling events:

kubectl get events | grep ClusterAutoscaler
bash

Performance Analysis of LLMs in a Kubernetes Environment

Evaluating the performance of LLM deployment within a Kubernetes environment involves latency measurement and resource utilization assessment.

Latency Evaluation

Measuring Latency: Use tools like kubectl exec or custom scripts to measure the time it takes for a pod to process an input prompt and generate a response. The Python script below measures the model's latency and throughput on the GPU.

Python program to test latency and tokens/sec:

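import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "databricks/dolly-v2-3b"
# The tokenizer is loaded separately; the model object has no tokenizer attribute
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).cuda().eval()
text = "Sample text for benchmarking"
input_ids = tokenizer(text, return_tensors="pt").input_ids.cuda()
reps = 100
times = []  # per-iteration forward-pass time in milliseconds

# Gradients are not needed for inference
with torch.no_grad():
    for i in range(reps):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        # Start timer
        start.record()
        # Model inference (a single forward pass over the prompt)
        outputs = model(input_ids).logits
        # End timer
        end.record()
        # Sync and get time; elapsed_time() returns milliseconds
        torch.cuda.synchronize()
        times.append(start.elapsed_time(end))

# Calculate TPS (prompt tokens processed per second), using the real token count
tokens = input_ids.shape[1]
tps = (tokens * reps) / (sum(times) / 1000)
# Calculate latency; times are already in ms
latency = sum(times) / reps
print(f"Avg TPS: {tps:.2f}")
print(f"Avg Latency: {latency:.2f} ms")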
python
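
The script above times the model on the GPU directly. To measure what a client of the Kubernetes Service actually experiences, a wall-clock measurement around each request is a useful complement. The sketch below is one way to do that: measure_latency is a hypothetical helper that times any callable, so you can pass it a function that sends a prompt to your endpoint.

```python
import math
import statistics
import time


def measure_latency(send_request, repetitions=20, warmup=2):
    """Time repeated calls to send_request and summarize them in milliseconds."""
    # Warm-up calls are excluded so one-time costs do not skew the results.
    for _ in range(warmup):
        send_request()

    samples_ms = []
    for _ in range(repetitions):
        start = time.perf_counter()
        send_request()
        samples_ms.append((time.perf_counter() - start) * 1000)

    samples_ms.sort()
    # Nearest-rank 95th percentile over the sorted samples.
    p95_index = max(0, math.ceil(0.95 * len(samples_ms)) - 1)
    return {
        "mean_ms": statistics.mean(samples_ms),
        "p50_ms": statistics.median(samples_ms),
        "p95_ms": samples_ms[p95_index],
        "max_ms": samples_ms[-1],
    }
```

Reporting the median and a high percentile alongside the mean helps expose occasional slow requests, for example while new pods are still starting up during a scale-out.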

Comparing Latency using Aviary: Aviary is an open-source project from Anyscale, built on Ray Serve, for serving and comparing open-source LLMs. It is useful both for developers getting started with LLMs and for those who want to compare the latency and scalability of different models before committing to one.

Resource Utilization and Scalability Insights

Monitoring Resource Consumption: Use the Kubernetes Dashboard or monitoring tools like Prometheus and Grafana to observe resource usage patterns across pods.
Scalability Analysis: Analyze how Kubernetes dynamically adjusts resources based on demand, ensuring resource efficiency and application responsiveness.
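
As one possible starting point for the monitoring step, the Python sketch below builds a request against the Prometheus HTTP API (/api/v1/query) for per-pod CPU usage. The Prometheus address, the dolly-v2-.* pod name pattern, and the availability of the cAdvisor metric container_cpu_usage_seconds_total are all assumptions about your cluster's monitoring setup, and the function names are illustrative.

```python
import json
import urllib.parse
import urllib.request


def cpu_usage_query(pod_regex="dolly-v2-.*", window="5m"):
    """PromQL for CPU cores used per pod, averaged over the given window."""
    return (
        "sum by (pod) (rate(container_cpu_usage_seconds_total"
        f'{{pod=~"{pod_regex}"}}[{window}]))'
    )


def build_query_url(promql, prometheus_url="http://localhost:9090"):
    """URL for an instant query against the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": promql})
    return prometheus_url.rstrip("/") + "/api/v1/query?" + params


def fetch_cpu_usage(prometheus_url="http://localhost:9090"):
    """Run the query and return the list of per-pod results."""
    url = build_query_url(cpu_usage_query(), prometheus_url)
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))["data"]["result"]
```

Running fetch_cpu_usage() periodically while the HPA scales the Deployment gives a simple view of how load is spread across pods; the same query can be pasted into a Grafana panel.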

Conclusion

This post walked through the technical steps of deploying LLMs on Kubernetes and the value of doing so. By combining GPU acceleration, specialized libraries, and Kubernetes orchestration capabilities, LLMs can be deployed with significantly improved performance and at large scale. In particular, GPU-enabled pods achieved over 2x lower latency and nearly double the inference throughput compared to CPU-only variants. Kubernetes autoscaling also allowed pods to be scaled horizontally on demand, so query volumes could increase without compromising responsiveness.

Overall, the results of this analysis validate that Kubernetes is the best choice for deploying LLMs at scale. The synergy between software and hardware optimization on Kubernetes unlocks the true potential of LLMs for real-world NLP use cases.

If you are looking for help implementing LLMs on Kubernetes, we would love to hear how you are scaling LLMs. Please contact us to discuss your specific problem statement.

Deploy LLM models on Kubernetes using OpenLLM