Ollama makes it easy to run large language models locally, and combining it with Kubernetes can give you a flexible, containerized environment for AI development. In this guide, I’ll walk you through setting up Ollama on a local Kubernetes cluster that runs right on your laptop.
Prerequisites
- A laptop with at least 16GB RAM (more is better for larger models)
- Docker Desktop installed
- Basic familiarity with Kubernetes concepts
- kubectl command-line tool
- minikube (or kind) installed
Step 1: Set Up a Local Kubernetes Cluster
First, let’s check if we have any existing Kubernetes clusters running:
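# List the contexts kubectl knows about
kubectl config get-contexts
# Check whether a cluster is currently reachable
kubectl cluster-info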
If no cluster is running, we need to set one up first. For a local laptop setup, I recommend minikube or kind; this guide uses minikube. Start the cluster and create a namespace for our Ollama deployment:
# Start minikube (if not already running)
minikube start --driver=docker --cpus=4 --memory=8g
# Create a namespace for our Ollama deployment
kubectl create namespace ollama
Step 2: Create a Deployment for Ollama
Let’s create a deployment file for Ollama. We’ll call it ollama-deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
        resources:
          requests:
            memory: "4Gi"
            cpu: "2"
          limits:
            memory: "8Gi"
            cpu: "4"
        volumeMounts:
        - name: ollama-data
          mountPath: /root/.ollama
      volumes:
      - name: ollama-data
        persistentVolumeClaim:
          claimName: ollama-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-pvc
  namespace: ollama
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
  namespace: ollama
spec:
  selector:
    app: ollama
  ports:
  - port: 11434
    targetPort: 11434
  type: LoadBalancer
Apply this configuration with:
kubectl apply -f ollama-deployment.yaml
Step 3: Wait for the Deployment to Complete
Let’s check the status of our deployment:
kubectl -n ollama get pods
kubectl -n ollama get services
Wait until the pod status is “Running”. Note that on minikube a LoadBalancer service will show <pending> for its external IP unless you run minikube tunnel in a separate terminal; the minikube service --url command used in the next step works either way.
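If you’d rather block until the rollout finishes instead of polling, kubectl can wait for you; the tunnel command below is only needed if you want the LoadBalancer to receive an external IP:
# Block until the Deployment's pods are ready
kubectl -n ollama rollout status deployment/ollama
# Optional: run in a separate terminal so the LoadBalancer gets an external IP
minikube tunnel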
Step 4: Pull a Language Model
Now that Ollama is running in your Kubernetes cluster, you can pull a language model. For a laptop environment, I recommend starting with a smaller model like Llama 2 7B:
# Get the service IP address
export OLLAMA_HOST=$(minikube service -n ollama ollama-service --url)
# Pull the model
curl -X POST $OLLAMA_HOST/api/pull -d '{"name": "llama2:7b"}'
This will download and prepare the model, which may take a few minutes depending on your internet connection and laptop specs.
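Once the pull finishes, you can confirm the model is available by asking the server to list its local models:
# List the models stored on the Ollama server
curl $OLLAMA_HOST/api/tags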
Step 5: Test Your Deployment
Let’s test that everything is working correctly:
# Generate a simple response
curl -X POST $OLLAMA_HOST/api/generate -d '{
  "model": "llama2:7b",
  "prompt": "Write a haiku about Kubernetes",
  "stream": false
}'
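You should get back a JSON response containing the generated text. Recent Ollama versions also expose a chat-style endpoint that accepts a message history instead of a single prompt, which is useful if you plan to build a conversational client on top of this setup:
# Chat-style request (message list instead of a single prompt)
curl -X POST $OLLAMA_HOST/api/chat -d '{
  "model": "llama2:7b",
  "messages": [
    {"role": "user", "content": "Explain what a Kubernetes Service does in one sentence."}
  ],
  "stream": false
}'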
Port Forwarding for Easy Access
To make it easier to access your Ollama instance, you can set up port forwarding:
kubectl -n ollama port-forward svc/ollama-service 11434:11434
This will make Ollama available at http://localhost:11434.
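With the port-forward running, anything that speaks the Ollama API can simply point at localhost, for example:
# Quick check that the forwarded port answers
curl http://localhost:11434/api/tags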
Advanced Configuration: Resource Management
For laptop environments, you’ll want to be careful with resource allocation. Modify the resources section in your deployment YAML to match your laptop’s capabilities:
resources:
  requests:
    memory: "2Gi"  # Lower for less powerful laptops
    cpu: "1"
  limits:
    memory: "6Gi"  # Adjust based on your available RAM
    cpu: "2"       # Adjust based on your CPU
Cleaning Up
When you’re done working with Ollama, you can clean up your resources:
kubectl delete namespace ollama
# Or if using minikube
minikube stop
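If you want to reclaim everything minikube created, including the cluster itself and its disk, you can delete the cluster entirely:
# Remove the minikube cluster completely
minikube delete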
Conclusion
Running Ollama on Kubernetes locally gives you a flexible environment for AI development that’s portable and reproducible. This setup allows you to experiment with different models and configurations while keeping everything neatly containerized.
By following this guide, you’ve set up a local AI environment that can run various language models right on your laptop. This approach is perfect for development, testing, and learning about both Kubernetes and large language models.
Remember that the performance will depend on your laptop’s specifications, so start with smaller models and adjust resource allocations accordingly.
Happy coding!