Scaling and Horizontal Pod Autoscaling (HPA) in Kubernetes
Kubernetes provides robust mechanisms to scale applications and workloads efficiently to meet demand while optimizing resource utilization. Scaling can be done manually or automatically through Horizontal Pod Autoscaling (HPA). In this article, we’ll explore the concept of scaling in Kubernetes, how HPA works, and how to implement it in your cluster.
Types of Scaling in Kubernetes
- Manual Scaling
  - Developers or operators manually adjust the number of pods in a Deployment or ReplicaSet using commands like kubectl scale, or by editing the Deployment manifest.
  - Example:
    kubectl scale deployment <deployment-name> --replicas=5
- Automatic Scaling
  - Kubernetes can scale workloads automatically using features like:
    - Horizontal Pod Autoscaling (HPA): Adjusts the number of pods in a workload based on CPU, memory, or custom metrics. HPA is built into Kubernetes.
    - Vertical Pod Autoscaling (VPA): Adjusts the CPU and memory requests and limits of pods dynamically. VPA is an add-on that must be installed separately.
    - Cluster Autoscaler: Adds or removes nodes to/from the cluster based on workload demands. It also runs as a separate component, often provided by managed Kubernetes services.
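To make the two manual approaches concrete, here is a minimal sketch against the example-app Deployment defined later in this article. The replica counts and the deployment.yaml filename are illustrative assumptions. The --current-replicas precondition makes kubectl scale refuse to act if the current count is not what you expect.

```bash
# Imperative: scale example-app to 5 replicas.
kubectl scale deployment/example-app --replicas=5

# Same, but only if the Deployment currently runs exactly 2 replicas;
# otherwise kubectl reports an error and leaves it unchanged.
kubectl scale deployment/example-app --current-replicas=2 --replicas=5

# Declarative: edit spec.replicas in the manifest (hypothetical file name),
# then re-apply it.
kubectl apply -f deployment.yaml

# Check the result (READY shows ready/desired replicas).
kubectl get deployment example-app
```

If an HPA targets the same Deployment, it will adjust the replica count again on its next evaluation, so manual changes are best reserved for workloads without an autoscaler.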
Horizontal Pod Autoscaling (HPA)
HPA dynamically adjusts the number of pods in a Deployment, ReplicaSet, or StatefulSet based on observed metrics (e.g., CPU or memory utilization). It helps applications get the capacity they need during peak demand and scales them back down during low usage to save costs.
How HPA Works
- Metrics Collection: HPA reads CPU and memory usage from the Kubernetes Metrics Server through the resource metrics API. Custom application metrics come from a separate metrics adapter (for example, Prometheus Adapter) through the custom or external metrics APIs.
- Target Threshold: You specify a target value (e.g., 70% average CPU utilization), and HPA tries to keep the average across all pods near that target.
- Adjustment: If utilization is above the target, HPA increases the number of pods; if it is below the target, HPA reduces it. Small deviations (within 10% of the target by default) are ignored to avoid constant churn.
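The adjustment step follows a proportional rule described in the Kubernetes documentation: desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue). The bash sketch below reproduces that rule with integer arithmetic, skips scaling when the ratio is within the default 10% tolerance, and clamps the result to the minReplicas/maxReplicas range. It is a simplification for illustration, not the controller's code: the real controller also accounts for missing metrics, pods that are not yet ready, and scaling behavior policies. The function name desired_replicas is hypothetical.

```bash
# Simplified model of the HPA replica calculation (bash, integer math only).
#   desired = ceil(current_replicas * current_metric / target_metric)
# Metrics are whole percentages, e.g. 90 for 90% average CPU utilization.
# The 10% tolerance mirrors the controller's default; the result is
# clamped to [min, max] like minReplicas/maxReplicas.
desired_replicas() {
  local current=$1 metric=$2 target=$3 min=$4 max=$5
  local diff=$(( metric - target ))
  if (( diff < 0 )); then
    diff=$(( -diff ))
  fi

  local desired
  if (( diff * 10 <= target )); then
    # Ratio metric/target is within 0.9..1.1: keep the current count.
    desired=$current
  else
    # Ceiling division for positive integers: (a + b - 1) / b
    desired=$(( (current * metric + target - 1) / target ))
  fi

  if (( desired < min )); then desired=$min; fi
  if (( desired > max )); then desired=$max; fi
  echo "$desired"
}

# Example: 3 pods averaging 90% CPU against a 70% target, bounded to 2..10.
desired_replicas 3 90 70 2 10   # prints 4
```

Applied to the sample kubectl get hpa output later in this article (3 replicas at 60% against a 70% target), the same rule yields ceil(3 × 60 / 70) = 3, so the replica count stays put.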
Setting Up Horizontal Pod Autoscaling
To implement HPA in Kubernetes, follow these steps:
1. Ensure Metrics Server is Running
HPA depends on the Metrics Server to collect resource utilization data. Verify that the Metrics Server is installed:
kubectl get deployment metrics-server -n kube-system
If not installed, deploy it using the official manifest:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
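Before relying on HPA, a few read-only kubectl commands can confirm that resource metrics are actually available. This is a sketch; it may take a minute after installation before kubectl top returns data.

```bash
# Wait until the metrics-server Deployment has finished rolling out.
kubectl -n kube-system rollout status deployment/metrics-server

# The Metrics API should be registered and report AVAILABLE=True.
kubectl get apiservice v1beta1.metrics.k8s.io

# If metrics are flowing, these print current CPU/memory usage.
kubectl top nodes
kubectl top pods -n kube-system
```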
2. Define Resource Requests and Limits
HPA requires pods to have resource requests defined for the resource it scales on (CPU or memory). Utilization is calculated as a percentage of the requested amount, so without requests HPA cannot compute utilization.
Example Deployment Manifest with Resource Requests and Limits:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
      - name: app-container
        image: nginx
        resources:
          requests:
            cpu: "100m"
            memory: "128Mi"
          limits:
            cpu: "200m"
            memory: "256Mi"
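If the Deployment already exists, the same requests and limits can also be set from the command line with kubectl set resources, and jsonpath output confirms they are in place. Note that CPU utilization is measured against the request, not the limit: a pod using 70m of its 100m request counts as 70%. This sketch assumes the Deployment and container names from the manifest above.

```bash
# Set requests and limits on app-container in example-app
# (this triggers a rolling update of the Deployment).
kubectl set resources deployment/example-app -c app-container \
  --requests=cpu=100m,memory=128Mi --limits=cpu=200m,memory=256Mi

# Print the requests HPA uses as the utilization baseline.
kubectl get deployment example-app \
  -o jsonpath='{.spec.template.spec.containers[*].resources.requests}'

# Compare live usage against those requests.
kubectl top pods -l app=example-app
```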
3. Create an HPA Resource
Use the kubectl autoscale command or define an HPA manifest to create an autoscaler.
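The imperative route is a single command. This sketch creates an autoscaler roughly equivalent to the example manifest that follows (CPU target 70%, 2 to 10 replicas); without --name, kubectl would name the HPA after the Deployment. Flag names can differ slightly between kubectl versions, so check kubectl autoscale --help on your version. Use one approach or the other.

```bash
# Create an HPA for example-app: keep average CPU near 70% of requests,
# with between 2 and 10 replicas.
kubectl autoscale deployment example-app \
  --name=example-app-hpa --cpu-percent=70 --min=2 --max=10
```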
Example HPA Manifest:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
In this example:
- scaleTargetRef specifies the deployment to scale.
- minReplicas and maxReplicas define the scaling range.
- averageUtilization is the target average CPU utilization across all pods, as a percentage of the requested CPU (70% here).
4. Apply the HPA Manifest
Apply the HPA configuration (saved here as hpa.yaml) using kubectl:
kubectl apply -f hpa.yaml
5. Monitor HPA Behavior
Use the following command to monitor the HPA’s status:
kubectl get hpa
Output:
NAME              REFERENCE                TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
example-app-hpa   Deployment/example-app   60%/70%   2         10        3          10m
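In this output, TARGETS reads as current/target: the pods average 60% of their requested CPU against a 70% target. Since ceil(3 × 60 / 70) = 3, HPA holds at 3 replicas. For more detail than the summary line, the following sketch streams changes and shows the autoscaler's conditions and recent events.

```bash
# Stream updates as HPA changes the replica count (Ctrl+C to stop).
kubectl get hpa example-app-hpa --watch

# Show current metrics, conditions (AbleToScale, ScalingActive,
# ScalingLimited) and recent scaling events.
kubectl describe hpa example-app-hpa
```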
Key Features of HPA
- Metrics Support: HPA can use CPU, memory, or custom metrics (e.g., requests per second).
- Scaling Range: Define a range for scaling using minReplicas and maxReplicas.
- Dynamic Scaling: Automatically adjusts the number of pods based on observed metrics.
- Custom Metrics: HPA can integrate with custom metrics (via Prometheus or other systems) to scale workloads based on application-specific metrics like HTTP request rates.
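As a sketch of combining resource metrics, the following applies a variant of example-app-hpa that targets both CPU and memory. When several metrics are listed, HPA computes a replica count for each and uses the largest. The 80% memory target is an illustrative value, not a recommendation.

```bash
# Apply an HPA that scales example-app on both CPU and memory utilization.
kubectl apply -f - <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80   # illustrative value
EOF
```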
Custom Metrics with HPA
In addition to CPU and memory metrics, HPA supports custom metrics via the Custom Metrics API. For example, you can scale pods based on HTTP requests or queue length.
Example Custom Metric HPA:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: custom-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "10"
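This configuration targets an average of 10 HTTP requests per second per pod. For example, if 4 pods together receive 100 requests per second (25 each), HPA scales toward ceil(100 / 10) = 10 pods, which is also the maxReplicas cap here.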
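Before applying custom-hpa, it helps to confirm that the metric is actually served by the Custom Metrics API. This sketch assumes an adapter (for example, Prometheus Adapter) is installed, that it exposes http_requests_per_second, and that example-app runs in the default namespace; some adapters also serve a v1beta2 version of the API.

```bash
# List the custom metrics an adapter has registered
# (fails if no custom metrics adapter is installed).
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1"

# Read per-pod values of the metric custom-hpa uses, in the default namespace.
kubectl get --raw \
  "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/http_requests_per_second"
```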
Best Practices for HPA
- Define Resource Requests and Limits: Ensure all pods have CPU and memory requests defined to enable effective scaling.
- Set Realistic Thresholds: Use appropriate thresholds for CPU, memory, or custom metrics based on your application’s performance benchmarks.
- Monitor Metrics Server: Ensure the Metrics Server is healthy and operational to avoid scaling issues.
- Combine with Cluster Autoscaler: Use HPA in conjunction with the Cluster Autoscaler to ensure the cluster can provision enough nodes during peak demand.
- Test Scaling Behavior: Simulate high traffic or load scenarios to verify that the HPA behaves as expected.
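One way to test scaling behavior, adapted from the load-generation approach in the Kubernetes HPA walkthrough, is to expose example-app as a Service and request it in a loop from a temporary pod. The Service name, port, and busybox image tag are assumptions for this sketch, and a single loop against nginx may not push CPU past the 70% target, so you may need to run several generators.

```bash
# Expose example-app inside the cluster as a Service named example-app on port 80.
kubectl expose deployment example-app --port=80

# Generate continuous HTTP load from a throwaway busybox pod
# (Ctrl+C stops it; --rm deletes the pod afterwards).
kubectl run -i --tty load-generator --rm --image=busybox:1.28 --restart=Never \
  -- /bin/sh -c "while sleep 0.01; do wget -q -O- http://example-app; done"

# In a second terminal, watch the HPA react.
kubectl get hpa example-app-hpa --watch
```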
Scaling Limits and Considerations
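- Cool Down Periods: HPA evaluates metrics periodically (every 15 seconds by default), and scale-down uses a stabilization window (5 minutes by default), so pod counts may take a few minutes to change.
- Minimum and Maximum Limits: Define minReplicas and maxReplicas to avoid over-scaling or under-scaling.
- Cluster Capacity: Ensure the cluster has enough nodes to schedule the maximum number of pods HPA may create. For the example above, 10 replicas × 100m CPU request = 1000m (one full CPU) and 10 × 128Mi = 1280Mi of memory must be schedulable.
- Custom Metrics: Expose custom metrics through an adapter (for example, Prometheus Adapter, which serves data collected by Prometheus) for advanced scaling use cases.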
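The autoscaling/v2 API lets you tune these delays with the behavior field. The sketch below keeps the default 300-second scale-down stabilization window but limits scale-down to 50% of current pods per minute, and lets scale-up react immediately while adding at most 4 pods per minute. All values are illustrative.

```bash
# Re-apply example-app-hpa with explicit scale-up/scale-down behavior.
kubectl apply -f - <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # use the highest recommendation from the last 5 minutes
      policies:
      - type: Percent
        value: 50          # remove at most 50% of current pods...
        periodSeconds: 60  # ...per 60-second period
    scaleUp:
      stabilizationWindowSeconds: 0     # react to load increases immediately
      policies:
      - type: Pods
        value: 4           # add at most 4 pods...
        periodSeconds: 60  # ...per 60-second period
EOF
```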
Conclusion
Horizontal Pod Autoscaling (HPA) in Kubernetes is a powerful feature for maintaining application performance and optimizing resource utilization. By automatically adjusting the number of pods based on workload demands, HPA helps your applications stay responsive under varying loads while avoiding unnecessary costs during idle periods.
When combined with best practices, custom metrics, and tools like the Cluster Autoscaler, HPA enables dynamic, efficient scaling for modern cloud-native applications.