Now that we’ve gotten over the “buzz lifecycle” of AI and LLMs, it’s time to start thinking about how to run these workloads in our environments. Wiping away all of that “buzz”, there’s a solid use case for running Model training and other LLM/AI workloads on Kubernetes. One of the biggest reasons is the decoupling of memory, CPU, and GPUs.
In this blog post, you’ll learn how to get started with running an LLM on a local Kubernetes cluster.
💡
This will work on a production cluster and/or a cloud-based cluster (AKS, EKS, GKE, etc.) as well.
Prerequisites
To follow along from a hands-on perspective with this blog post, you should have the following:
- A code editor like VS Code.
- Minikube installed. You can find the installation here.
Why Ollama?
When using Ollama, it comes down to:
- Privacy
- Local-style LLM workloads
The biggest piece of the puzzle is that Ollama allows you to run whatever Models you want, bring your own, and augment those Models with your own data through Retrieval Augmented Generation (RAG). One of the largest advantages for a lot of engineers and organizations when using Ollama is that because it’s installed “locally” (could be your local machine, but could also be a Kubernetes cluster or a standard VM), you control what data gets fed to it. There are also pre-existing Models that you can start from, but you don’t have to.
Meta made Llama and OpenAI made GPT-4. As you’ve probably seen, there are other chat-bot/LLM-based tools out there as well.
Google Gemini, Microsoft Bing AI, ChatGPT, Grok, and the other chat-based AIs/LLMs are all essentially SaaS-based. They’re hosted for you and you can call upon them/use them (for free or for a price). Ollama is local and you fully control it (aside from the general Llama Model that you can bring down to get started).
Setting Up Kubernetes Locally
Ollama runs LLMs, and although the Model used here isn’t huge (I don’t believe it falls under the Small Language Model (SLM) category), it still requires a solid amount of resources to run, as all AI/LLM tools do. Because of that, you’ll need a Minikube environment with 3 nodes, as the extra CPU/memory is necessary.
To run a local Kubernetes cluster using Minikube with 3 nodes, run the following command:
minikube start --nodes 3
You’ll be able to see the three nodes with kubectl get nodes.
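If the default node size turns out to be too small for the Model, you can size the nodes explicitly. The following is a minimal sketch, not part of the original setup: the CPU and memory values are assumptions to adjust for your machine, and the function names are made up for illustration.

```shell
# Assumed per-node sizing; the post itself only fixes the node count at 3.
NODES=3
CPUS_PER_NODE=2
MEMORY_MB_PER_NODE=4096

# Total memory Minikube will claim from the host, in MiB.
total_memory_mb() {
  echo $((NODES * MEMORY_MB_PER_NODE))
}

# Create the cluster, then block until every node reports Ready.
start_cluster() {
  minikube start --nodes "$NODES" --cpus "$CPUS_PER_NODE" --memory "$MEMORY_MB_PER_NODE" &&
    kubectl wait --for=condition=Ready nodes --all --timeout=180s
}
```

Check total_memory_mb against the free memory on your host, then call start_cluster.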
Ollama Kubernetes Manifest
Now that the cluster is created, you can deploy the Kubernetes Pod that runs Ollama.
Luckily, there’s already a container image for Ollama that exists, so you don’t have to worry about building out a Dockerfile yourself.
💡
As with all pre-built container images, you want to ensure that it’s secure. You can use a container image scanner like docker scout for confirmation.
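As a rough sketch of what that scan could look like, the function below wraps docker scout cves and filters for the most serious findings. The function name is hypothetical, and the severity filter is just one reasonable choice.

```shell
# Image reference used by the Deployment in this post.
IMAGE="ollama/ollama:latest"

# List only critical and high CVEs for the given image. --exit-code makes
# the command return non-zero when any are found, which is handy in CI.
scan_image() {
  docker scout cves --only-severity critical,high --exit-code "$1"
}
```

Run it with scan_image "$IMAGE" before deploying.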
Use the following Kubernetes Manifest, which deploys Ollama using:
- A Deployment object/resource.
- One Pod.
- The latest container image version of Ollama.
- Port 11434.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
spec:
  selector:
    matchLabels:
      name: ollama
  template:
    metadata:
      labels:
        name: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - name: http
          containerPort: 11434
          protocol: TCP
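Save the Manifest in a location of your choosing with the name ollama.yaml. Because the Manifest places the Deployment in the ollama Namespace, create that Namespace first and then apply the Manifest:
kubectl create namespace ollama
kubectl apply -f ollama.yaml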
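The image can take a while to pull, so the Pod may not be ready right away. Here’s a small sketch for waiting on the rollout. The function name and the 300-second default are assumptions; the Namespace and Deployment names come from the Manifest above.

```shell
NAMESPACE="ollama"   # from metadata.namespace in the Manifest
DEPLOYMENT="ollama"  # from metadata.name in the Manifest

# Block until the Deployment has rolled out, or fail after the timeout.
# An optional first argument overrides the default timeout.
wait_for_ollama() {
  kubectl -n "$NAMESPACE" rollout status "deployment/$DEPLOYMENT" --timeout="${1:-300s}"
}
```

Call wait_for_ollama (or wait_for_ollama 600s on a slow connection) before moving on.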
Confirm Ollama Works
The Kubernetes Deployment with one replica is now deployed, so you can start testing to see if it works.
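First, exec (which is like SSH) into the Pod. The Deployment lives in the ollama Namespace, so pass -n ollama, and be sure to swap out "pod name" with the name of your Pod (kubectl -n ollama get pods lists it).
kubectl -n ollama exec -ti "pod name" -- /bin/bash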
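If you’d rather not copy the Pod name by hand, you can look it up by label. This is a sketch with hypothetical function names. It assumes the Pod carries the name=ollama label and runs in the ollama Namespace, as defined in the Manifest.

```shell
# Print the name of the first Pod carrying the Deployment's label.
ollama_pod() {
  kubectl -n ollama get pods -l name=ollama -o jsonpath='{.items[0].metadata.name}'
}

# Open an interactive shell in that Pod.
ollama_shell() {
  kubectl -n ollama exec -ti "$(ollama_pod)" -- /bin/bash
}
```

Calling ollama_shell drops you into the container without needing the Pod name.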
You should now be able to run ollama commands. You can confirm with the --version flag.
ollama --version
Once you’ve confirmed Ollama works, pull the latest Llama Model.
ollama pull llama3.2
Run the Model.
ollama run llama3.2
Last but not least, you can confirm it’s working by asking it a question. Here’s an example (I’m on my way to re:Invent, so I figured this was appropriate).
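You can also ask the question from your own machine over Ollama’s HTTP API on port 11434 instead of from inside the Pod. The sketch below assumes you’ve forwarded the port in another terminal and already pulled llama3.2. The function names are made up for illustration, and the payload builder does no JSON escaping.

```shell
# Build the JSON body for Ollama's /api/generate endpoint.
# "stream": false asks for one complete JSON reply instead of chunks.
# The prompt is inserted as-is, so it must not contain double quotes.
generate_payload() {
  printf '{"model":"%s","prompt":"%s","stream":false}' "$1" "$2"
}

# Send a prompt to the forwarded port. Run this first in another terminal:
#   kubectl -n ollama port-forward deployment/ollama 11434:11434
ask_ollama() {
  curl -s http://localhost:11434/api/generate -d "$(generate_payload llama3.2 "$1")"
}
```

For example, ask_ollama "Why is the sky blue?" returns a JSON document whose response field holds the answer.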
Congrats! You’ve successfully deployed an LLM to Kubernetes.