
Empowering Developers to Achieve Microservices Observability on Kubernetes with Tracestore, OPA, Flagger & Custom Metrics

Introduction

In modern microservices architectures, achieving comprehensive observability is not just an option—it’s a necessity. As applications scale dynamically within Kubernetes environments, tracking performance issues, enforcing security policies, and ensuring smooth deployments become complex challenges. Traditional monitoring solutions alone cannot fully address these challenges.

This guide explores four powerful tools that significantly improve observability and control in microservices environments:

  • Tracestore: Provides deep insights into distributed tracing, enabling developers to track request flows, identify latency issues, and diagnose bottlenecks across microservices.
  • OPA (Open Policy Agent): Ensures security and governance by enforcing dynamic policy controls directly within Kubernetes environments.
  • Flagger: Enables automated progressive delivery, minimizing deployment risks through intelligent traffic shifting and rollback strategies.
  • Custom Metrics: Captures application-specific metrics, offering enhanced insights that generic monitoring tools may overlook.

Developers often struggle with diagnosing latency issues, securing services, and ensuring stable deployments in dynamic Kubernetes environments. By combining Tracestore, OPA, Flagger, and Custom Metrics, you can unlock enhanced visibility, improve security enforcement, and streamline progressive delivery processes.
Observability Tools integrate with a Kubernetes Cluster and Microservices
This diagram illustrates how Observability Tools integrate with a Kubernetes Cluster and Microservices (Java, Node.js, etc.). Key tools like TraceStore (Distributed Tracing), Custom Metrics (Performance Insights), Flagger (Deployment Control), and OPA (Policy Enforcement) enhance system visibility, security, and stability.

Why These Tools Are Essential for Microservices Observability

The combination of these tools addresses crucial pain points that traditional observability approaches fail to resolve:

  • Tracestore vs. Jaeger: While Jaeger is a well-known tracing tool, Tracestore integrates seamlessly with OpenTelemetry, providing greater flexibility with streamlined configurations, ideal for modern cloud-native applications.
  • OPA vs. Kyverno: OPA excels in complex policy logic and dynamic rule enforcement, offering advanced flexibility that Kyverno’s simpler syntax may not provide in complex security scenarios.
  • Flagger vs. Argo Rollouts: Flagger’s automated progressive delivery mechanisms, especially with Istio and Linkerd integration, offer developers a streamlined way to deploy changes safely with minimal manual intervention.

The Unique Value of These Tools

  • Improved Developer Insights: Tracestore enhances visibility by tracking transactions across microservices, ensuring better root-cause analysis for latency issues.
  • Enhanced Security Posture: OPA dynamically enforces security policies, reducing vulnerabilities without frequent manual updates to application logic.
  • Faster and Safer Deployments: Flagger’s canary deployment automation allows developers to deploy features faster, with automatic rollback for failing releases.
  • Business-Centric Observability: Custom Metrics empower developers to align performance data with critical business KPIs, ensuring that engineering efforts focus on what matters most.

By integrating these tools, developers gain a comprehensive, proactive observability strategy that improves application performance, strengthens security enforcement, and simplifies deployment processes. This guide focuses on code snippets, best practices, and integration strategies tailored to help developers implement these solutions directly in their applications.

Step 1: Tracestore Implementation for Developers

Why Prioritize Tracestore?

In modern microservices architectures, tracking how requests flow across services is essential to diagnose performance issues, identify latency bottlenecks, and maintain application reliability. Traditional debugging methods often struggle in distributed environments, where failures may occur across multiple interconnected services.

Tracestore addresses these challenges by enabling distributed tracing, allowing developers to visualize request paths, track dependencies, and pinpoint slow or failing services in real-time. By integrating Tracestore, developers gain valuable insights into their application’s behavior, enhancing troubleshooting efficiency and improving system reliability.

Without Distributed Tracing: Identifying performance bottlenecks and tracing errors in microservices without context propagation is extremely challenging. Developers are forced to rely on fragmented logs, delaying issue resolution.

With Distributed Tracing: By propagating trace context headers across services, developers can achieve complete request visibility, improving latency analysis and fault isolation.

Without Distributed Tracing: No visibility across services

Without distributed tracing, requests across services lack trace context, making it difficult to track the flow of requests. This leads to fragmented logs, limited visibility, and complex debugging when issues arise. The diagram below illustrates how requests are processed without trace context, resulting in no clear insight into service interactions.

Service Communication Without Distributed Tracing
Service Communication Without Distributed Tracing — This diagram shows a microservices environment where requests are processed without trace context. As a result, developers face no visibility across services, making it difficult to diagnose issues, track failures, or identify performance bottlenecks.

With Distributed Tracing: Visibility across services

This diagram illustrates how trace context (e.g., traceparent header) is injected and forwarded across multiple services. Each service propagates the trace context through outgoing requests to ensure continuity in the trace flow. The database call includes the trace context, ensuring full visibility across all service interactions, which helps developers trace issues, measure latency, and diagnose bottlenecks effectively.
Trace Context Propagation in a Microservices Architecture
Trace Context Propagation in a Microservices Architecture — Demonstrates how trace context flows across services via traceparent headers, enabling end-to-end request tracking for improved observability.

Java Application – Tracestore Integration (Spring Boot)

This code snippet demonstrates how to integrate OpenTelemetry for distributed tracing in a Spring Boot application using Java. Let’s break down each part for better understanding:

Dependencies:

<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-sdk</artifactId>
    <version>1.20.0</version>
</dependency>
<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-exporter-otlp</artifactId>
    <version>1.20.0</version>
</dependency>

Explanation:

  • opentelemetry-sdk — This is the core OpenTelemetry SDK required to create traces and manage spans in Java applications. It includes the key components like TracerProvider, context propagation, and sampling strategies.
  • opentelemetry-exporter-otlp — This exporter sends trace data to an OpenTelemetry Collector or directly to an observability backend (e.g., Jaeger, Tempo) using the OTLP (OpenTelemetry Protocol).

Both dependencies are crucial for enabling trace generation and exporting the data to your monitoring platform.

Configuration in Code:

@Configuration
public class OpenTelemetryConfig {
    @Bean
    public OpenTelemetry openTelemetry() {
        return OpenTelemetrySdk.builder()
            .setTracerProvider(SdkTracerProvider.builder().build()) // add an OTLP span processor/exporter here so spans are actually shipped to the collector
            .build();
    }
    @Bean
    public Tracer tracer(OpenTelemetry openTelemetry) {
        return openTelemetry.getTracer("my-application");
    }
}

Explanation:

  1. @Configuration Annotation:

    • Marks this class as a Spring Boot configuration class where beans are defined.
  2. @Bean public OpenTelemetry openTelemetry()

    • This method creates and configures an instance of OpenTelemetrySdk, which is the core entry point for instrumenting code.
    • The TracerProvider is initialized using SdkTracerProvider.builder() to create and manage tracer instances, ensuring each service instance has a dedicated tracer.
    • The .build() method finalizes the configuration.
  3. @Bean public Tracer tracer()

    • This method defines a Tracer bean that will be injected into application components requiring tracing.
    • getTracer("my-application") assigns an instrumentation name (my-application) that identifies this application's spans in the observability backend.

Instrumenting REST Template with Tracing

@Configuration
public class RestTemplateConfig {

    @Bean
    public RestTemplate restTemplate() {
        return new RestTemplateBuilder()
            .interceptors(new RestTemplateInterceptor())
            .build();
    }
}

Explanation:

  • The RestTemplateInterceptor intercepts outbound HTTP calls and starts a client span for each request (a minimal sketch of this interceptor follows).
  • The span ensures the trace context is propagated to downstream services.
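
The interceptor itself is not shown above, so here is a minimal sketch of what it might look like using the OpenTelemetry Java API; the use of GlobalOpenTelemetry (rather than injecting the Tracer bean) and the span naming are assumptions, not part of the original example:

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.Scope;
import org.springframework.http.HttpRequest;
import org.springframework.http.client.ClientHttpRequestExecution;
import org.springframework.http.client.ClientHttpRequestInterceptor;
import org.springframework.http.client.ClientHttpResponse;

import java.io.IOException;

public class RestTemplateInterceptor implements ClientHttpRequestInterceptor {

    private final Tracer tracer = GlobalOpenTelemetry.getTracer("my-application");

    @Override
    public ClientHttpResponse intercept(HttpRequest request, byte[] body,
                                        ClientHttpRequestExecution execution) throws IOException {
        // Start a client span for the outbound HTTP call
        Span span = tracer.spanBuilder("HTTP " + request.getMethod())
                .setSpanKind(SpanKind.CLIENT)
                .startSpan();
        try (Scope scope = span.makeCurrent()) {
            // Inject the active trace context (traceparent header) into the outgoing request
            GlobalOpenTelemetry.getPropagators().getTextMapPropagator()
                    .inject(Context.current(), request.getHeaders(),
                            (headers, key, value) -> headers.set(key, value));
            return execution.execute(request, body);
        } finally {
            span.end();
        }
    }
}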

Cron Job Example with Tracestore

@Component
public class ScheduledTask {
    private final Tracer tracer;
    public ScheduledTask(Tracer tracer) {
        this.tracer = tracer;
    }
    @Scheduled(fixedRate = 5000)
    public void performTask() {
        Span span = tracer.spanBuilder("cronjob-task").startSpan();
        try (Scope scope = span.makeCurrent()) {
            System.out.println("Executing scheduled task");
        } finally {
            span.end();
        }
    }
}

Node.js Application – Tracestore Integration

This code snippet demonstrates how to integrate OpenTelemetry for distributed tracing in a Node.js application. Let’s break down the dependencies, configuration, and their significance for effective observability.

Dependencies Installation:
npm install @opentelemetry/api @opentelemetry/sdk-trace-node @opentelemetry/exporter-trace-otlp-http

Explanation:

  • @opentelemetry/api — Provides the core API interfaces for tracing. This ensures the application follows OpenTelemetry standards for tracing APIs.
  • @opentelemetry/sdk-trace-node — The Node.js SDK implementation that integrates directly with Node’s ecosystem to create and manage spans.
  • @opentelemetry/exporter-trace-otlp-http — Exports trace data to an OpenTelemetry Collector or directly to an observability backend (e.g., Jaeger, Tempo) using the OTLP (OpenTelemetry Protocol).

These dependencies form the foundation for trace instrumentation and data export in Node.js applications.

Configuration in tracer.js

const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { SimpleSpanProcessor } = require('@opentelemetry/sdk-trace-base');

const provider = new NodeTracerProvider();
// OTLP over HTTP uses the collector's 4318 endpoint (4317 is the gRPC port)
const exporter = new OTLPTraceExporter({ url: 'http://otel-collector:4318/v1/traces' });

provider.addSpanProcessor(new SimpleSpanProcessor(exporter));
provider.register();

Explanation:

  1. NodeTracerProvider Initialization:

    • The NodeTracerProvider is the primary tracing provider for Node.js applications, responsible for creating and managing tracers.
    • This provider handles lifecycle management, sampling, and context propagation.
  2. OTLPTraceExporter Configuration:

    • The OTLPTraceExporter sends trace data to the OpenTelemetry Collector or observability backend.
    • The URL http://otel-collector:4318/v1/traces points to the OTLP/HTTP endpoint of the OpenTelemetry Collector, which processes and forwards trace data.
  3. SimpleSpanProcessor Setup:

    • The SimpleSpanProcessor is a lightweight span processor that exports spans immediately as they finish.
    • For production environments, consider switching to BatchSpanProcessor for improved performance via batch data exports.
  4. provider.register() Registration:

    • Registers the tracer provider globally in the Node.js application.
    • This step ensures that any instrumented modules, middleware, or libraries automatically utilize the defined tracer.

Key OpenTelemetry Configuration Properties and Their Impact on Scalability and Performance

The following exporter and sampler settings directly affect trace volume and overhead; a sketch of how they are typically supplied as environment variables follows this list.

  • otel.exporter.otlp.endpoint Considerations: For scalable architectures, ensure the endpoint points to a load-balanced OpenTelemetry Collector service to handle increased trace data volume efficiently.
  • otel.exporter.otlp.protocol Choices:

    • Use http/protobuf for lightweight, high-performance transmission in high-traffic environments.
    • Consider grpc for improved reliability with built-in retries and flow control.
  • otel.traces.sampler Strategies:

    • Use parentbased_always_on for detailed tracing in development.
    • Switch to parentbased_traceidratio with a ratio (e.g., 0.1) in production to reduce overhead while still capturing meaningful insights.
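
For reference, here is a minimal sketch of how these settings are typically supplied as standard OpenTelemetry environment variables to a containerized workload; the collector hostname and sampling ratio are illustrative assumptions:

OTEL_SERVICE_NAME=my-application
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1

In production, the sampler ratio is the main knob for trading trace completeness against overhead.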

Adding Custom Attributes to Spans

Example:

// assumes: const tracer = require('@opentelemetry/api').trace.getTracer('my-service');
app.get('/payment/:id', (req, res) => {
    const span = tracer.startSpan('payment-processing');
    span.setAttribute('payment_id', req.params.id);
    span.setAttribute('user_role', req.user.role);
    try {
        processPayment(req.params.id);
        res.send('Payment Processed');
    } catch (error) {
        span.recordException(error);
        res.status(500).send('Payment Failed');
    } finally {
        span.end();
    }
});

Explanation:

  • setAttribute() attaches useful data to the span for better trace visibility.
  • recordException() captures errors for deeper analysis.

Trace ID Propagation in Microservices

Outgoing Request (Client Side):

const { context, trace, propagation } = require('@opentelemetry/api');
const axios = require('axios');
app.get('/trigger-service', async (req, res) => {
    const span = tracer.startSpan('trigger-service-call');
    try {
        // activate the new span so its context (not just the parent's) is injected into the headers
        const headers = {};
        propagation.inject(trace.setSpan(context.active(), span), headers);
        const response = await axios.get('http://other-service/api', { headers });
        res.json(response.data);
    } finally {
        span.end();
    }
});

Incoming Request (Server Side):

const { context, propagation, trace } = require('@opentelemetry/api');
app.get('/api', (req, res) => {
    // restore the caller's trace context from the incoming traceparent header
    const extractedContext = propagation.extract(context.active(), req.headers);
    const span = tracer.startSpan('incoming-request', {}, extractedContext);
    try {
        res.send('Data Retrieved');
    } finally {
        span.end();
    }
});

OpenTelemetry Data Flow in a Microservices Architecture
OpenTelemetry Data Flow in a Microservices Architecture — This diagram illustrates the flow of trace data from the application code to the observability backend. The OpenTelemetry SDK generates trace data, which is exported via OTLP to the OpenTelemetry Collector. The collector processes and forwards the data to observability backends like Jaeger or Tempo for visualization and analysis.

Trace Context Propagation Pitfalls

While propagating trace context, developers should watch out for common issues like:

  • Missing Headers in Async Flows:

    • In environments using async processing (e.g., message queues or event-driven systems), headers containing the traceparent value may be lost. Solutions include:
      • Injecting the trace context as part of the message payload (see the sketch after this list).
      • Using middleware or interceptors to capture and propagate trace context efficiently.
  • Service Boundary Drops:

    • If services use different frameworks or libraries that don’t standardize trace propagation, you may experience gaps in traces. Using OpenTelemetry’s Context Propagation API helps maintain trace continuity across such environments.
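
To illustrate the first mitigation, here is a minimal sketch using the OpenTelemetry Java propagation API that injects the active trace context into a header map carried with the message and restores it on the consumer side; the class and method names are hypothetical placeholders:

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.propagation.TextMapGetter;

import java.util.HashMap;
import java.util.Map;

public class AsyncTraceContextSupport {

    // Producer side: copy the active trace context into headers that travel with the message payload
    public static Map<String, String> buildTraceHeaders() {
        Map<String, String> carrier = new HashMap<>();
        GlobalOpenTelemetry.getPropagators().getTextMapPropagator()
                .inject(Context.current(), carrier, Map::put);
        return carrier; // contains the traceparent (and tracestate) entries
    }

    // Consumer side: restore the producer's context so spans created here join the same trace
    public static Context extractTraceContext(Map<String, String> carrier) {
        return GlobalOpenTelemetry.getPropagators().getTextMapPropagator()
                .extract(Context.current(), carrier, new TextMapGetter<Map<String, String>>() {
                    @Override
                    public Iterable<String> keys(Map<String, String> c) {
                        return c.keySet();
                    }

                    @Override
                    public String get(Map<String, String> c, String key) {
                        return c == null ? null : c.get(key);
                    }
                });
    }
}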

Step 2: OPA (Open Policy Agent) for Developers

Why Use OPA for Security and Policy Enforcement?

Open Policy Agent (OPA) is a powerful tool for enforcing security policies and ensuring consistent access management in Kubernetes environments. By leveraging Rego logic, OPA dynamically validates requests, prevents unauthorized access, and strengthens compliance measures. Below are the key benefits of OPA for security and policy enforcement:

  • Admission Control: Prevents unauthorized deployments by validating manifests before they’re applied to the cluster.
  • Access Control: Ensures only authorized users and services can access specific endpoints or resources.
  • Data Filtering: Limits sensitive data exposure by enforcing filtering rules at the API layer.

Example Use Case: Consider a multi-tenant SaaS application where customers have isolated data and permissions. Using OPA, developers can:

  • Deny requests that attempt to access resources outside the user’s assigned tenant.
  • Enforce RBAC rules dynamically based on request parameters without modifying the application code.

OPA’s flexible Rego policies enable developers to define complex logic that adapts to evolving security and operational requirements.

Understanding OPA Webhook

OPA Webhooks are designed to enforce policy decisions before resources are created or modified in Kubernetes. When a webhook is triggered, OPA evaluates the incoming request against defined policy rules and returns an allow or deny decision.
OPA webhook evaluation process during Kubernetes admission control
This diagram showcases the OPA webhook evaluation process during Kubernetes admission control, ensuring secure policy enforcement before resource creation.

OPA Webhook Configuration Example

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: opa-webhook
webhooks:
  - name: "example-opa-webhook.k8s.io"
    clientConfig:
      url: "https://opa-service.opa.svc.cluster.local:443/v1/data/authz"
    rules:
      - operations: ["CREATE", "UPDATE"]
        apiGroups: [""]
        apiVersions: ["v1"]
        resources: ["pods"]
    failurePolicy: Fail

Where Rego Policies are Configured

Rego policies are stored in designated policy repositories or inside Kubernetes ConfigMaps. For example:

Example Policy ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: opa-policy-config
  namespace: opa
  labels:
    openpolicyagent.org/policy: rego
  annotations:
    openpolicyagent.org/policy-status: "active"
data:
  authz.rego: |
    package authz
    default allow = false
    allow {
        input.user == "admin"
        input.action == "read"
    }

    allow {
        input.user == "developer"
        input.action == "view"
    }

Deployment YAML with OPA as a Sidecar

To integrate OPA as a sidecar, modify your deployment YAML as shown below:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sample-app
  template:
    metadata:
      labels:
        app: sample-app
    spec:
      containers:
      - name: sample-app
        image: sample-app:latest
        ports:
        - containerPort: 8080
      - name: opa-sidecar
        image: openpolicyagent/opa:latest
        args:
        - "run"
        - "--server"
        - "--config-file=/config/opa-config.yaml"
        - "/policies"   # load the Rego policies mounted from the opa-policy-config ConfigMap
        volumeMounts:
        - mountPath: /config
          name: opa-config-volume
        - mountPath: /policies
          name: opa-policy-volume
      volumes:
      - name: opa-config-volume
        configMap:
          name: opa-config
      - name: opa-policy-volume
        configMap:
          name: opa-policy-config

OPA sidecar's role in intercepting application requests
This diagram showcases the OPA sidecar's role in intercepting application requests and evaluating them against Rego policies before they reach the application logic.

Sample OPA Policy (Rego) for Access Control

OPA policies are written in Rego language. Below are example policies for controlling API endpoint access.

authz.rego

package authz
default allow = false
allow {
    input.user == "admin"
    input.action == "read"
}
allow {
    input.user == "developer"
    input.action == "view"
}
allow {
    input.role == "finance"
    input.action == "approve"
}
allow {
    input.ip == "192.168.1.1"
    input.method == "GET"
}
allow {
    input.role == "editor"
    startswith(input.path, "/editor-area/")
}
allow {
    input.role == "viewer"
    startswith(input.path, "/public/")
}

Explanation of Rules

  • Admin Rule: Grants read access to users with the admin role.
  • Developer Rule: Allows view actions for users with the developer role.
  • Finance Role Rule: Grants approve permissions to users in the finance role.
  • IP-Based Restriction Rule: Allows GET requests from IP 192.168.1.1. Useful for internal-only API endpoints.
  • Editor Access Rule: Grants access to endpoints starting with /editor-area/ for users with the editor role.
  • Viewer Access Rule: Permits access to /public/ endpoints for users with the viewer role.

Each rule ensures clear conditions to improve security, role management, and resource control.

Java Integration – OPA Policy Enforcement

OPA rules can be integrated into Java applications using HTTP requests to communicate with the OPA sidecar.

Sample Java Code for Access Control

import org.springframework.web.bind.annotation.*;
import org.springframework.http.ResponseEntity;
import org.springframework.http.HttpStatus;
import org.springframework.web.client.RestTemplate;
import java.util.Map;

@RestController
@RequestMapping("/secure")
public class SecureController {

    @PostMapping("/access")
    public ResponseEntity checkAccess(@RequestBody Map request) {
        RestTemplate restTemplate = new RestTemplate();
        String opaEndpoint = "http://localhost:8181/v1/data/authz";

        ResponseEntity response = restTemplate.postForEntity(opaEndpoint, request, Map.class);
        boolean allowed = (Boolean) response.getBody().get("result");

        if (allowed) {
            return ResponseEntity.ok("Access Granted");
        }
        return ResponseEntity.status(HttpStatus.FORBIDDEN).body("Access Denied");
    }
}

Node.js Integration – OPA Policy Enforcement

OPA can also be integrated into Node.js applications using HTTP requests to query the OPA sidecar.

Sample Node.js Code for Access Control

const express = require('express');
const axios = require('axios');
const app = express();
app.use(express.json());
app.post('/access', async (req, res) => {
    // query the allow rule directly so response.data.result is a boolean
    const opaEndpoint = 'http://localhost:8181/v1/data/authz/allow';
    try {
        const response = await axios.post(opaEndpoint, { input: req.body });
        if (response.data.result) {
            res.status(200).send('Access Granted');
        } else {
            res.status(403).send('Access Denied');
        }
    } catch (error) {
        res.status(500).send('OPA Evaluation Failed');
    }
});
app.listen(3000, () => console.log('Server running on port 3000'));

Explanation:

  • The /access endpoint forwards user actions and roles to the OPA sidecar.
  • The OPA response defines whether the request is accepted or rejected.

Best Practices for OPA Integration

  1. Minimize Complex Logic in Policies: Keep your Rego policies simple, with clear rules to avoid performance bottlenecks.
  2. Utilize Versioning for Policies: To prevent compatibility issues, version your policy files and bundles.
  3. Leverage OPA’s Decision Logging: Enable OPA’s decision logs for better observability and debugging.
  4. Cache OPA Responses Where Possible: For repeated evaluations, short-lived caching of decisions improves performance (a minimal caching sketch follows this list).
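
As a minimal sketch of practice 4, the class below caches OPA decisions in memory with a short TTL before falling back to the sidecar. The class name, endpoint, and TTL are assumptions, and caching only suits decisions that can tolerate brief staleness:

import org.springframework.web.client.RestTemplate;

import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CachedOpaClient {

    private record CachedDecision(boolean allowed, Instant expiresAt) {}

    private final RestTemplate restTemplate = new RestTemplate();
    private final Map<String, CachedDecision> cache = new ConcurrentHashMap<>();
    private final Duration ttl = Duration.ofSeconds(30);                              // short TTL keeps decisions fresh
    private final String opaEndpoint = "http://localhost:8181/v1/data/authz/allow";   // assumed sidecar endpoint

    public boolean isAllowed(String user, String action) {
        String key = user + ":" + action;
        CachedDecision cached = cache.get(key);
        if (cached != null && Instant.now().isBefore(cached.expiresAt())) {
            return cached.allowed();                                                  // reuse a recent decision
        }
        Map<String, Object> payload = Map.of("input", Map.of("user", user, "action", action));
        Map<?, ?> response = restTemplate.postForObject(opaEndpoint, payload, Map.class);
        boolean allowed = response != null && Boolean.TRUE.equals(response.get("result"));
        cache.put(key, new CachedDecision(allowed, Instant.now().plus(ttl)));
        return allowed;
    }
}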

Hierarchical Policy Enforcement Example (Admin, User, Guest Roles)

OPA effectively enforces role-based permissions by defining clear security boundaries for different user roles such as:

  • Admin: Full control with unrestricted access.
  • User: Limited permissions based on defined criteria.
  • Guest: Restricted to read-only access.

By integrating OPA, developers can achieve robust security, improved compliance, and dynamic policy enforcement — all without modifying application code directly.

Example Rego Policy for Role-Based Access Control

package authz

import future.keywords.in

default allow = false
allow {
    input.user.role == "admin"
    input.action in ["create", "read", "update", "delete"]
}
allow {
    input.user.role == "user"
    input.action in ["read", "update"]
}
allow {
    input.user.role == "guest"
    input.action == "read"
}

Visualizes how different roles receive distinct permissions
This decision tree visualizes how different roles such as Admin, User, and Guest receive distinct permissions via Rego policies.

Sidecar Scaling Concerns in High-Traffic Environments

  • CPU/Memory Overhead: Each OPA sidecar requires its own resources, which can increase overhead when scaling pods.
  • Latency Impact: OPA evaluations introduce latency, especially with complex policies.
  • Cluster-Wide Policy Management: Scaling sidecars across hundreds of pods can create maintenance overhead.

Solutions:

  • Enable OPA bundle caching to reduce frequent policy fetches.
  • Optimize Rego policies by limiting nested conditions and leveraging partial evaluation to pre-compute logic.
  • For large-scale environments, consider deploying a centralized OPA instance or using OPA Gatekeeper for improved scalability.

Policy Versioning Best Practices

  1. Use Git for Version Control
  2. Implement CI/CD Pipelines for Policies
  3. Leverage OPA’s Bundle API for consistent policy distribution (see the configuration sketch after this list).
  4. Tag Stable Policy Versions
  5. Automate Rollbacks for Broken Policies
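
The sidecar deployment shown earlier mounts an opa-config ConfigMap whose contents are not listed. Below is a minimal sketch of what that configuration might look like when policies are distributed through the Bundle API; the bundle service URL and polling intervals are assumptions:

# opa-config.yaml (mounted via --config-file in the sidecar)
services:
  - name: policy-registry
    url: https://policies.example.com   # assumed bundle server
bundles:
  authz:
    service: policy-registry
    resource: bundles/authz.tar.gz      # versioned bundle produced by the CI/CD pipeline
    polling:
      min_delay_seconds: 60
      max_delay_seconds: 120
decision_logs:
  console: true                         # decision logging for observability and debugging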

Step 3: Flagger Implementation for Developers

Flagger’s Role in CI/CD Pipelines

Flagger automates progressive delivery in Kubernetes by gradually shifting traffic to the canary deployment while measuring success rates, latency, and custom metrics.

Flagger plays a crucial role in ensuring safer and automated releases in CI/CD pipelines. By integrating Flagger, developers can:

  • Automate progressive rollouts, reducing deployment risks.
  • Continuously validate new releases by analyzing real-time metrics.
  • Trigger webhooks for automated testing or data validation before fully shifting traffic.

This automated approach empowers developers to deploy changes confidently while minimizing service disruptions.
Flagger's automated canary deployment process
This diagram shows Flagger’s automated canary deployment process, where Flagger triggers a load test, evaluates results, and either promotes the canary to stable or rolls it back on failure.

Flagger Canary Deployment Configuration

Sample Flagger Canary Configuration

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
  namespace: test
spec:
  provider: istio
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  progressDeadlineSeconds: 60
  autoscalerRef:
    apiVersion: autoscaling/v2beta2
    kind: HorizontalPodAutoscaler
    name: podinfo
  service:
    gateways:
    - monitor/monitor-gw
    hosts:
    - monitor.dev.scus.cld.samsclub.com
    name: podinfo
    port: 9898
    targetPort: 9898
    portName: http
    portDiscovery: true
    match:
      - uri:
          prefix: /
    rewrite:
      uri: /
    timeout: 5s
  skipAnalysis: false
  analysis:
    interval: 1m
    threshold: 10
    maxWeight: 50
    stepWeight: 5
    metrics:
    - name: checkout-failure-rate
      templateRef:
        name: checkout-failure-rate
        namespace: istio-system
      thresholdRange:
        max: 1
      interval: 1m
    webhooks:
      - name: "load test"
        type: rollout
        url: http://flagger-loadtester.test/
        metadata:
          cmd: "hey -z 1m -q 10 -c 2 http://podinfo.test:9898/"
    alerts:
      - name: "dev team Slack"
        severity: error
        providerRef:
          name: dev-slack
          namespace: flagger
      - name: "qa team Discord"
        severity: warn
        providerRef:
          name: qa-discord

Explanation for Key Fields

  • provider: Specifies the service mesh provider like istio, linkerd, etc.
  • targetRef: Refers to the primary deployment.
  • autoscalerRef: Associates the canary with an HPA for automated scaling.
  • analysis: Defines the testing strategy:

    • interval: Time between each traffic increment.
    • threshold: Number of failed checks before rollback.
    • maxWeight: Maximum traffic percentage shifted to the canary.
    • stepWeight: Traffic increment step size.
  • metrics: Specifies the Prometheus metrics template used for success criteria.
  • webhooks: Executes external tests (e.g., load tests) before promotion.
  • alerts: Defines alert triggers for services like Slack, Discord, or Teams.

Use Case: Feature Rollout for a Shopping Cart System

Imagine a shopping cart application where new checkout logic needs to be tested. Using Flagger’s canary strategy, you can gradually introduce the new checkout flow while ensuring stability by monitoring metrics like order success rates and latency.

Progressive Traffic Shifting Diagram

Flow of Progressive Traffic Shifting in Flagger

Progressive traffic shifting strategy
This diagram visualizes the progressive traffic shifting strategy where traffic gradually shifts from the stable version to the canary version, ensuring safe rollouts.
Explanation:

  • Flagger gradually shifts traffic from the stable version to the canary version.
  • If the canary deployment meets performance goals (e.g., latency, success rate), traffic continues to increase until full promotion.
  • If metrics exceed failure thresholds, Flagger automatically rolls back the canary deployment.

Best Practices for Webhook Failure Handling

To ensure resilience during webhook failures, follow these practices:

  1. Implement Retries with Backoff:

    • Configure webhooks to retry failed requests with exponential backoff to reduce unnecessary load during transient failures.
  2. Introduce Timeout Limits:

    • Add timeouts for webhook responses to avoid delays in canary promotions (see the snippet after this list).
  3. Implement Fallback Alerts:

    • If a webhook fails after multiple retries, configure an alert system to notify developers immediately (e.g., Slack, PagerDuty).
  4. Add Webhook Health Checks:

    • Periodically test webhook endpoints to proactively detect and fix issues before deployment failures occur.
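
As one way to apply practice 2 directly in the Canary resource, Flagger webhooks accept a timeout field. The snippet below is an illustrative sketch; the webhook name, type, and command are assumptions:

  analysis:
    webhooks:
      - name: "smoke test"
        type: pre-rollout
        url: http://flagger-loadtester.test/
        timeout: 15s   # fail fast instead of stalling the canary analysis
        metadata:
          cmd: "curl -s http://podinfo-canary.test:9898/healthz"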

Metric Template Configuration

Flagger can integrate custom metrics to enhance decision-making for progressive delivery.
Prometheus metrics are evaluated by Flagger
This diagram shows how Prometheus metrics are evaluated by Flagger to determine the success or failure of a canary rollout.
Example Custom Metric Configuration for Flagger

apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: checkout-failure-rate
  namespace: istio-system
spec:
  provider:
    type: prometheus
    address: http://prometheus.istio-system:9090
  query: |
    100 - sum(
        rate(
            istio_requests_total{
              reporter="destination",
              destination_workload_namespace="{{ namespace }}",
              destination_workload="{{ target }}",
              response_code!~"5.*"
            }[{{ interval }}]
        )
    )
    /
    sum(
        rate(
            istio_requests_total{
              reporter="destination",
              destination_workload_namespace="{{ namespace }}",
              destination_workload="{{ target }}",
            }[{{ interval }}]
        )
    ) * 100

Explanation:

  • Calculates the checkout failure rate: 100 minus the percentage of requests that completed without a 5xx response code.
  • Uses Prometheus as the backend to fetch metric data.

Enhancing Metric Templates with Custom Prometheus Queries

To improve Flagger’s decision-making capabilities, consider creating advanced Prometheus queries for custom metrics.

Example Custom Prometheus Query for API Latency Analysis:

apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: api-latency-threshold
  namespace: istio-system
spec:
  provider:
    type: prometheus
    address: http://prometheus.istio-system:9090
  query: |
    histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="api-service"}[5m])) by (le))

Explanation:

  • This query measures 95th percentile latency for the api-service application.
  • By tracking latency distribution instead of simple averages, developers can detect spikes in performance degradation early.
  • Use these insights to tune your Flagger analysis steps and improve deployment safety (the snippet below shows how to reference this template from a Canary analysis).
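
To act on this metric during analysis, reference the template from the Canary resource in the same way as the checkout-failure-rate example shown earlier; the 0.5-second threshold is illustrative:

  analysis:
    metrics:
      - name: api-latency-threshold
        templateRef:
          name: api-latency-threshold
          namespace: istio-system
        thresholdRange:
          max: 0.5   # roll back if p95 latency exceeds 500ms
        interval: 1m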

Best Practices for Flagger Integration

  1. Design Small Increments for Safer Rollouts: Gradual traffic shifting minimizes risk.
  2. Leverage Webhooks for Automated Testing: Webhooks allow for extensive testing before promoting changes.
  3. Use Custom Metrics for Better Insights: Track business-critical metrics that directly impact performance.
  4. Ensure Clear Alerting Channels: Slack, Discord, or Teams notifications help teams act quickly during failures.
  5. Integrate Load Testing: Automated load tests during canary releases validate stability before promotion.

Step 4: Custom Metrics for Developers

Why Use Custom Metrics?

Custom metrics provide actionable insights by tracking application-specific behaviors such as checkout success rates, queue sizes, or memory usage. By aligning metrics with business objectives, developers gain deeper insights into their system’s performance.

  • Monitor User Experience: Track latency, response times, or page load speeds.
  • Measure Application Health: Observe error rates, service availability, or queue backlogs.
  • Track Business Outcomes: Monitor KPIs like orders, logins, or transaction success rates.

By incorporating these insights into metrics, developers can improve troubleshooting, identify performance bottlenecks, and correlate application issues with user experience impacts.

Custom Metrics Configuration

Developers can integrate custom metrics into their applications using libraries like Micrometer (Java) or Prometheus Client (Node.js).

Java Example – Custom Metrics with Micrometer

Dependencies in pom.xml


<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
    <version>1.9.0</version>
</dependency>

Configuration in Code

@Configuration
public class MetricsConfig {

    @Bean
    public MeterRegistry meterRegistry() {
        return new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);
    }

    @Bean
    public RestTemplate restTemplate() {
        return new RestTemplate();
    }
}

Custom Metric Example

@RestController
@RequestMapping("/api")
public class OrderController {

    private final Counter orderCounter;

    public OrderController(MeterRegistry meterRegistry) {
        this.orderCounter = Counter.builder("orders_total")
                .description("Total number of orders processed")
                .register(meterRegistry);
    }

    @PostMapping("/order")
    public ResponseEntity createOrder(@RequestBody Map request) {
        orderCounter.increment();
        return ResponseEntity.ok("Order Created");
    }
}

Custom metrics in a Java application using Micrometer
This diagram illustrates the flow of custom metrics in a Java application using Micrometer, where data is defined in code, registered with MeterRegistry, and visualized through Grafana.

Node.js Example – Custom Metrics with Prometheus Client

Dependencies
npm install prom-client

Configuration in Code

const express = require('express');
const client = require('prom-client');

const app = express();
const collectDefaultMetrics = client.collectDefaultMetrics;
collectDefaultMetrics();

const orderCounter = new client.Counter({
    name: 'orders_total',
    help: 'Total number of orders processed'
});

app.post('/order', (req, res) => {
    orderCounter.inc();
    res.send('Order Created');
});

app.get('/metrics', async (req, res) => {
    res.set('Content-Type', client.register.contentType);
    res.end(await client.register.metrics());
});

app.listen(3000, () => console.log('Server running on port 3000'));

Node.js application using the Prometheus Client library
This diagram demonstrates how custom metrics flow in a Node.js application using the Prometheus Client library, exposing data via /metrics endpoints for visualization in Grafana.

Enhancing Java Micrometer Example

1. Adding Histogram for Latency Tracking

import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;

@RestController
@RequestMapping("/api")
public class LatencyController {

    private final Timer requestTimer;

    public LatencyController(MeterRegistry meterRegistry) {
        this.requestTimer = Timer.builder("http_request_latency")
            .description("Tracks HTTP request latency in milliseconds")
            .publishPercentileHistogram()
            .register(meterRegistry);
    }

    @GetMapping("/process")
    public ResponseEntity processRequest() {
        return requestTimer.record(() -> {
            try { Thread.sleep(200); } catch (InterruptedException e) {}
            return ResponseEntity.ok("Request Processed");
        });
    }
}

2. Adding Gauge for System-Level Metrics

import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.stereotype.Component;
import java.util.concurrent.atomic.AtomicInteger;

@Component
public class QueueSizeMetric {

    private final AtomicInteger queueSize = new AtomicInteger(0);

    public QueueSizeMetric(MeterRegistry meterRegistry) {
        Gauge.builder("queue_size", queueSize::get)
            .description("Tracks the current size of the task queue")
            .register(meterRegistry);
    }

    public void addToQueue() {
        queueSize.incrementAndGet();
    }

    public void removeFromQueue() {
        queueSize.decrementAndGet();
    }
}

Enhancing Node.js Example with Labeling Best Practices

Recommended Labeling Practices:

  • Use Meaningful Labels: Focus on key factors like status_code, endpoint, or region.
  • Minimize High-Cardinality Labels: Avoid labels with unique values like user_id or transaction_id.
  • Use Consistent Naming Conventions: Maintain uniform patterns across your metrics.

Improved Node.js Metric Example:

const client = require('prom-client');

const requestCounter = new client.Counter({
    name: 'http_requests_total',
    help: 'Total HTTP requests processed',
    labelNames: ['method', 'endpoint', 'status_code']
});

app.get('/checkout', (req, res) => {
    requestCounter.inc({ method: 'GET', endpoint: '/checkout', status_code: 200 });
    res.send('Checkout Complete');
});

Integration with Flagger – Business-Critical Metrics Example

Example Prometheus Query for Checkout Failure Tracking:

apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: checkout-failure-rate
  namespace: istio-system
spec:
  provider:
    type: prometheus
    address: http://prometheus.istio-system:9090
  query: |
    sum(rate(http_requests_total{job="checkout-service", status_code!="200"}[5m])) /
    sum(rate(http_requests_total{job="checkout-service"}[5m])) * 100

Explanation:

  • This metric tracks the percentage of failed checkout attempts, a key indicator for e-commerce stability.
  • Tracking these business-critical metrics can provide developers with actionable insights to improve customer experience.

Flagger monitors Prometheus metrics for the checkout service
This diagram illustrates how Flagger monitors Prometheus metrics for the checkout service, triggering rollbacks via Alert Manager and notifying the DevOps team in case of failures.

Alerting Best Practices for Custom Metrics

  • Define meaningful alert thresholds that align with business impact.
  • Suppress excessive alerts by fine-tuning alert duration windows.
  • Use Prometheus AlertManager to send proactive alerts for degraded service performance (a minimal alerting-rule sketch follows this list).
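
As a minimal sketch of these practices, the Prometheus alerting rule below reuses the checkout failure-rate query from the metric template above; the 5% threshold and 10-minute window are illustrative values:

groups:
  - name: checkout-alerts
    rules:
      - alert: HighCheckoutFailureRate
        expr: |
          sum(rate(http_requests_total{job="checkout-service", status_code!="200"}[5m]))
            / sum(rate(http_requests_total{job="checkout-service"}[5m])) * 100 > 5
        for: 10m   # require sustained degradation to suppress transient spikes
        labels:
          severity: warning
        annotations:
          summary: "Checkout failure rate above 5% for more than 10 minutes"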

Conclusion

Achieving comprehensive observability in Kubernetes environments is challenging, yet essential for ensuring application performance, security, and stability. By adopting the right tools and best practices, developers can significantly enhance visibility across their microservices landscape.

  • Tracestore enables developers to trace requests across services, improving root cause analysis and identifying performance bottlenecks.
  • OPA enforces dynamic policy controls, enhancing security by ensuring consistent access management and protecting data integrity.
  • Flagger automates progressive delivery, reducing deployment risks with controlled traffic shifting, metric-based evaluations, and proactive rollbacks.
  • Custom Metrics provide actionable insights by tracking key application behaviors, aligning performance monitoring with business objectives.

By combining these tools, developers can build resilient, scalable, and secure Kubernetes workloads. Following best practices such as efficient trace propagation, thoughtful Rego policy design, strategic Flagger configurations, and well-defined custom metrics ensures your Kubernetes environment can meet performance demands and evolving business goals.

Embracing these observability solutions allows developers to move from reactive troubleshooting to proactive optimization, fostering a culture of reliability and improved user experience.
