Scaling Applications with Kubernetes: A Guide to Horizontal Pod Autoscaling (HPA)

Scaling and Horizontal Pod Autoscaling (HPA) in Kubernetes

Kubernetes provides robust mechanisms to scale applications and workloads efficiently to meet demand while optimizing resource utilization. Scaling can be done manually or automatically through Horizontal Pod Autoscaling (HPA). In this article, we’ll explore the concept of scaling in Kubernetes, how HPA works, and how to implement it in your cluster.

Types of Scaling in Kubernetes

  1. Manual Scaling

    • Developers or operators manually adjust the number of pods in a deployment or replica set using commands like kubectl scale or by editing the deployment manifest.
    • Example:
     kubectl scale deployment <deployment-name> --replicas=5
    
  2. Automatic Scaling

    • Kubernetes workloads can also be scaled automatically. HPA is built into Kubernetes, while VPA and the Cluster Autoscaler are separate components that must be installed in the cluster:
      • Horizontal Pod Autoscaling (HPA): Adjusts the number of pods in a deployment based on CPU, memory, or custom metrics.
      • Vertical Pod Autoscaling (VPA): Adjusts resource requests and limits (CPU and memory) for pods based on observed usage.
      • Cluster Autoscaler: Adds nodes when pods cannot be scheduled for lack of capacity and removes nodes that are underutilized.

Horizontal Pod Autoscaling (HPA)

HPA dynamically adjusts the number of pods in a deployment, replica set, or stateful set based on observed metrics (e.g., CPU or memory utilization). It ensures that applications have the necessary resources during peak demand while scaling down during low usage to save costs.

How HPA Works

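  1. Metrics Collection: For CPU and memory, HPA reads usage data from the Kubernetes Metrics Server. Custom application metrics come from a separate metrics adapter that implements the Custom Metrics API.
  2. Target Threshold: You specify a target value (e.g., average CPU utilization of 70%), and HPA adds or removes pods to keep the average across pods close to this target.
  3. Adjustment: The HPA controller re-evaluates the metrics periodically (every 15 seconds by default). If utilization is above the target, it increases the number of pods; if utilization is below the target, it reduces the number of pods. Small deviations (within 10% of the target by default) are ignored to avoid constant rescaling.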
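
The adjustment step follows the formula given in the Kubernetes documentation: desired replicas = ceil(current replicas × current metric / target metric). The sketch below reproduces that arithmetic in plain shell so the numbers can be checked by hand. The function name hpa_desired_replicas is made up for illustration, the inputs are whole numbers, and min/max clamping and stabilization are left out.

```shell
# Sketch of the HPA replica calculation (not the controller's actual code).
# Arguments: <current-replicas> <current-metric> <target-metric>
hpa_desired_replicas() {
  replicas=$1
  current=$2
  target=$3

  # Absolute difference between the observed metric and the target.
  diff=$(( current - target ))
  if [ "$diff" -lt 0 ]; then
    diff=$(( -diff ))
  fi

  # Within the default 10% tolerance: keep the current replica count.
  if [ $(( diff * 10 )) -le "$target" ]; then
    echo "$replicas"
    return
  fi

  # Integer ceiling of replicas * current / target.
  echo $(( (replicas * current + target - 1) / target ))
}

hpa_desired_replicas 3 90 70   # prints 4: scale up
hpa_desired_replicas 3 60 70   # prints 3: ceil(2.57) is still 3
hpa_desired_replicas 3 72 70   # prints 3: within tolerance, no change
```

Because the result is rounded up, a workload running slightly below target often keeps its current replica count rather than shrinking.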

Setting Up Horizontal Pod Autoscaling

To implement HPA in Kubernetes, follow these steps:

1. Ensure Metrics Server is Running

For CPU- and memory-based scaling, HPA depends on the Metrics Server to supply resource usage data. Verify that the Metrics Server is installed:

kubectl get deployment metrics-server -n kube-system

If not installed, deploy it using the official manifest:

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

2. Define Resource Requests and Limits

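HPA requires every container in the target pods to define a resource request for the metric it scales on (CPU or memory). Utilization is calculated as a percentage of the requested amount, not of the limit, so without requests HPA cannot compute it.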

Example Deployment Manifest with Resource Requests and Limits:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: app-container
          image: nginx
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"
            limits:
              cpu: "200m"
              memory: "256Mi"
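
To make the role of the request concrete, the following sketch computes utilization the way HPA interprets it: usage divided by the request, expressed as a percentage. The usage figures are invented for illustration, the helper name cpu_utilization_pct is hypothetical, and integer division truncates any fraction.

```shell
# Utilization relative to the CPU request, both given in millicores.
# Arguments: <usage-millicores> <request-millicores>
cpu_utilization_pct() {
  usage_m=$1
  request_m=$2
  echo $(( usage_m * 100 / request_m ))
}

# With the 100m request from the manifest above:
cpu_utilization_pct 60 100    # prints 60
cpu_utilization_pct 150 100   # prints 150
```

The second case shows that utilization can exceed 100%: the container is allowed to use up to its 200m limit, but the percentage is measured against the 100m request.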

3. Create an HPA Resource

Use the kubectl autoscale command or define an HPA manifest to create an autoscaler.

Example HPA Manifest:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

In this example:

  • scaleTargetRef specifies the deployment to scale.
  • minReplicas and maxReplicas define the scaling range.
  • averageUtilization is the CPU utilization target (70%), measured as the average across all pods relative to their CPU requests.
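
For comparison, the kubectl autoscale route mentioned earlier can be sketched as a small wrapper. The function name create_cpu_hpa is hypothetical. The command inside is the standard kubectl autoscale invocation, though flag names can differ between kubectl releases, so check kubectl autoscale --help for your version.

```shell
# Hypothetical wrapper around kubectl autoscale.
# Usage: create_cpu_hpa <deployment> <cpu-percent> <min-replicas> <max-replicas>
create_cpu_hpa() {
  kubectl autoscale deployment "$1" --cpu-percent="$2" --min="$3" --max="$4"
}
```

Calling create_cpu_hpa example-app 70 2 10 runs kubectl autoscale deployment example-app --cpu-percent=70 --min=2 --max=10, which requests the same target and range as the manifest. Unless a name is given explicitly, the HPA created this way is named after the deployment (example-app) rather than example-app-hpa.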

4. Apply the HPA Manifest

Save the manifest above to a file (assumed here to be named hpa.yaml) and apply it using kubectl:

kubectl apply -f hpa.yaml

5. Monitor HPA Behavior

Use the following command to monitor the HPA’s status:

kubectl get hpa

Output:

NAME              REFERENCE                TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
example-app-hpa   Deployment/example-app   60%/70%   2         10        3          10m
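
The TARGETS column reads as current/target. As a rough illustration, the sketch below splits such a value and reports which side of the target the workload is on. The function name describe_targets is made up, and it assumes the plain percentage format shown in the output above; newer kubectl versions may format this column differently.

```shell
# Interpret a TARGETS value such as "60%/70%".
describe_targets() {
  targets=$1
  current=${targets%%/*}    # text before the slash, e.g. "60%"
  target=${targets##*/}     # text after the slash, e.g. "70%"
  current=${current%"%"}    # drop the trailing percent sign
  target=${target%"%"}

  # A non-numeric current value (e.g. "<unknown>") means no metrics were read.
  case $current in
    ''|*[!0-9]*) echo "metrics unavailable"; return ;;
  esac

  if [ "$current" -gt "$target" ]; then
    echo "above target"
  elif [ "$current" -lt "$target" ]; then
    echo "below target"
  else
    echo "at target"
  fi
}

describe_targets "60%/70%"         # prints: below target
describe_targets "<unknown>/70%"   # prints: metrics unavailable
```

A current value of <unknown> usually points at a missing Metrics Server or missing resource requests, both covered in the earlier steps.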

Key Features of HPA

  1. Metrics Support: HPA can use CPU, memory, or custom metrics (e.g., requests per second).
  2. Scaling Range: Define a range for scaling using minReplicas and maxReplicas.
  3. Dynamic Scaling: Automatically adjusts the number of pods based on observed metrics.
  4. Custom Metrics: HPA can integrate with custom metrics (via a metrics adapter backed by Prometheus or another monitoring system) to scale workloads based on application-specific metrics like HTTP request rates.

Custom Metrics with HPA

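In addition to CPU and memory metrics, HPA supports custom metrics via the Custom Metrics API. The Metrics Server does not serve these; a metrics adapter (for example, Prometheus Adapter) must be installed to expose them. With one in place, you can scale pods based on, for example, HTTP request rate or queue length.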

Example Custom Metric HPA:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: custom-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "10"

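This configuration targets an average of 10 HTTP requests per second per pod: HPA adds pods when the per-pod average rises above 10 and removes them when it falls below. The metric name http_requests_per_second is an example; it must match a metric that your metrics adapter actually exposes.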
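
A short worked example may help with the AverageValue target. Since the desired count is the current replica count times the ratio of the per-pod average to the target, it reduces to the summed metric divided by the per-pod target, rounded up. The sketch below assumes whole-number request rates and ignores tolerance and the min/max range; the function name is hypothetical.

```shell
# Desired replicas for an AverageValue target.
# Arguments: <target-per-pod> <pod1-value> <pod2-value> ...
desired_for_average_value() {
  target=$1
  shift
  total=0
  for value in "$@"; do
    total=$(( total + value ))
  done
  # Integer ceiling of total / target.
  echo $(( (total + target - 1) / target ))
}

# Three pods serving 12, 18 and 15 requests per second, target of 10 per pod.
desired_for_average_value 10 12 18 15   # prints 5
```

Here the pods handle 45 requests per second in total, so five pods are needed to bring the average down to 10 or less.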

Best Practices for HPA

  1. Define Resource Requests and Limits: Ensure every container has CPU and memory requests defined; utilization-based scaling does not work without them.
  2. Set Realistic Thresholds: Use appropriate thresholds for CPU, memory, or custom metrics based on your application’s performance benchmarks.
  3. Monitor Metrics Server: Ensure the Metrics Server is healthy and operational to avoid scaling issues.
  4. Combine with Cluster Autoscaler: Use HPA in conjunction with the Cluster Autoscaler to ensure the cluster can provision enough nodes during peak demand.
  5. Test Scaling Behavior: Simulate high traffic or load scenarios to verify that the HPA behaves as expected.
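
One way to simulate load, adapted from the walkthrough in the Kubernetes documentation, is a temporary busybox pod that requests the application in a tight loop. The wrapper name generate_load is hypothetical, and the sketch assumes the deployment is exposed inside the cluster by a Service (for example one named example-app), which this article does not define.

```shell
# Hypothetical helper: start a throwaway pod that hits the given URL until
# interrupted with Ctrl+C. The pod is deleted on exit because of --rm.
# Usage: generate_load http://example-app
generate_load() {
  kubectl run -i --tty load-generator --rm --image=busybox:1.28 --restart=Never -- \
    /bin/sh -c "while sleep 0.01; do wget -q -O- $1 >/dev/null; done"
}
```

While the load generator runs, kubectl get hpa --watch in a second terminal shows the TARGETS and REPLICAS columns changing as the autoscaler reacts.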

Scaling Limits and Considerations

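  • Cool Down Periods: HPA does not react instantly. Metrics are collected at intervals, and scale-down decisions are by default held back by a 5-minute stabilization window to avoid flapping, so pod counts may take a few minutes to drop after load subsides. These delays can be tuned with the behavior field of the autoscaling/v2 API.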
  • Minimum and Maximum Limits: Define minReplicas and maxReplicas to avoid over-scaling or under-scaling.
  • Cluster Capacity: Ensure the cluster has sufficient resources (nodes) to accommodate the maximum number of pods defined by HPA.
  • Custom Metrics: Use a metrics adapter (for example, Prometheus Adapter) to provide custom metrics for advanced scaling use cases.
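
The first two considerations interact. During scale-down, the controller acts on the highest replica recommendation computed within the stabilization window and then clamps the result to the configured range. The sketch below models only that selection; the function name stabilized_replicas is invented, and the recommendations are passed in by hand rather than derived from real metrics.

```shell
# Pick the highest recent recommendation, then clamp to [min, max].
# Arguments: <min-replicas> <max-replicas> <recommendation>...
stabilized_replicas() {
  min=$1
  max=$2
  shift 2
  chosen=0
  for rec in "$@"; do
    if [ "$rec" -gt "$chosen" ]; then
      chosen=$rec   # keep the highest recommendation seen in the window
    fi
  done
  if [ "$chosen" -lt "$min" ]; then chosen=$min; fi
  if [ "$chosen" -gt "$max" ]; then chosen=$max; fi
  echo "$chosen"
}

stabilized_replicas 2 10 6 4 3   # prints 6: the earlier, higher value holds
stabilized_replicas 2 10 1 1 1   # prints 2: clamped to minReplicas
stabilized_replicas 2 10 14      # prints 10: clamped to maxReplicas
```

In the first call, load has been falling (6, then 4, then 3), but the replica count stays at 6 until the higher recommendation ages out of the window.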

Conclusion

Horizontal Pod Autoscaling (HPA) in Kubernetes is a powerful feature for maintaining application performance and optimizing resource utilization. By automatically adjusting the number of pods based on workload demands, HPA ensures that your applications remain responsive under varying loads while avoiding unnecessary costs during idle periods.

When combined with best practices, custom metrics, and tools like the Cluster Autoscaler, HPA enables dynamic, efficient scaling for modern cloud-native applications.