CockroachDB: live certificate rotation

certificate rotation in a live CRDB environment

On Kubernetes we have a CockroachDB deployment and associated secret resources that are mapped as volumes in the CRDB pods. These secrets represent the certificates that are required by the database to operate, and include CA certs, Node certs, and User certs.

CockroachDB allows you to rotate these certificates in a non-disruptive way that keeps existing client/SQL connections alive, and no rolling restarts are required.

Because we’re working in a containerized environment, there is a specific sequence of tasks required to accomplish this process.

This blog covers these basics, and with the help of the linked GitHub repo you can automate this workflow using the NodeJS app in a reliable, repeatable, reusable way.

GitHub: crdb-cert-rotation

…to the cert rotation tasks

This section details the sequence of steps required to complete the cert rotation. As with any integrated software system, expect caveats and variances based on target platforms, deployment customizations, and security/role restrictions.

step 1: update your actual secrets (manual)

This first step requires you to generate fresh certificates to replace your current ones.
These can be grouped into a single common secret object or broken down into separate secret objects based on specific usage. For example, the CA can be its own secret, the Node certs might reside in a separate secret, and the same goes for user/client certs.

This is the only step that requires manual intervention. This is by design, because each organization typically has its own workflow for refreshing certificates. Some use HashiCorp Vault, some rely on the cockroach cert commands, while other admins generate their own certs manually.

The new certificates must be written into the secrets so that any future pod restart loads them as well.

step 2: read all the new secrets

This step collects all the secrets tied to CockroachDB. These are specified as a list of tuples, each pairing a secret name with its mount path as seen from the CRDB pod. My example uses 2 secrets with mounts, but you might have only 1 secret or possibly more; the tool lets you specify any number of mounts and their respective secret objects.
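The tuple list described above might be parsed along these lines. This is a minimal sketch, not the repo's code: the comma-separated `name=path` format, the function name `parseSecretMounts`, and the secret names and mount paths in the example are all assumptions made for illustration.

```javascript
// Hypothetical input format: "secretName=/mount/path" pairs separated by commas.
// The real app may use a different variable name and encoding.
function parseSecretMounts(spec) {
  return spec
    .split(',')
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0)
    .map((entry) => {
      const idx = entry.indexOf('=');
      // Reject entries with no "=", an empty name, or an empty path.
      if (idx <= 0 || idx === entry.length - 1) {
        throw new Error(`Invalid secret/mount pair: "${entry}"`);
      }
      return {
        secretName: entry.slice(0, idx).trim(),
        mountPath: entry.slice(idx + 1).trim(),
      };
    });
}

// Example with two secrets (names and paths are made up for this sketch).
const mounts = parseSecretMounts(
  'crdb-node-certs=/cockroach/cockroach-certs, crdb-client-certs=/cockroach/client-certs'
);
console.log(mounts);
```

Keeping the result as a list of `{ secretName, mountPath }` objects means the later per-pod steps can loop over it regardless of how many secrets are configured.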

step 3: identify the target CRDB pods to refresh (auto)

CockroachDB pods should be labelled using common tags as well as node-specific tags. In the case here, we have a common tag that matches all the pods in the cluster, defined by the environment variable MZ_CRDB_POD_LABEL_KV.

This tag is actually a tuple (KV), where my example indicates a common tag defined as zlamal:demo-2025 that can be found in all the pods of my cluster. It’s functionally equal to a label listed as “zlamal: demo-2025”.

The NodeJS app reads this from the environment (supplied by you in the job-definition YAML), and uses it to capture the running pods so that the tool can perform actions on them in subsequent steps.
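As a rough illustration of this step, the KV tuple can be turned into a standard Kubernetes equality-based label selector (`key=value`), which is the form accepted by `kubectl get pods -l`. The sketch assumes the variable holds a single `key:value` string such as `zlamal:demo-2025`; the helper name `toLabelSelector` is hypothetical and not taken from the repo.

```javascript
// Convert "key:value" (assumed format of MZ_CRDB_POD_LABEL_KV) into "key=value".
function toLabelSelector(labelKv) {
  const idx = labelKv.indexOf(':');
  if (idx <= 0) {
    throw new Error(`Expected "key:value", got "${labelKv}"`);
  }
  const key = labelKv.slice(0, idx).trim();
  const value = labelKv.slice(idx + 1).trim();
  if (key.length === 0 || value.length === 0) {
    throw new Error(`Expected "key:value", got "${labelKv}"`);
  }
  return `${key}=${value}`;
}

// Falls back to the article's example label when the variable is not set.
const labelKv = process.env.MZ_CRDB_POD_LABEL_KV || 'zlamal:demo-2025';
console.log(toLabelSelector(labelKv));
```

With the example label, the resulting selector is the same one you would pass on the command line as `kubectl get pods -l zlamal=demo-2025`.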

step 4: we iterate through each pod and perform the certificate updates

This next step is a loop that performs a few sub-steps to accomplish the rotation for each node. These tasks ensure that the pod remains running while bringing in the latest secrets; you can verify this by opening a shell directly in the pods and inspecting the certs folders for changes.

step 4.1: delete the old certs

This step deletes the certificates for each secret object. There can be many secrets, and each secret can contain many certs and keys.
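A minimal sketch of the delete step follows. It assumes the certs directory is reachable as a local path and that the file names match the keys of the secret; in the actual job these operations would run against each pod, and the mechanism for that is not shown here. The function name `deleteOldCerts` and the example file names are hypothetical.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Remove every file in certsDir whose name matches a key of the secret.
// `secretKeys` is a hypothetical list such as ['ca.crt', 'node.crt', 'node.key'].
function deleteOldCerts(certsDir, secretKeys) {
  const targets = [];
  for (const key of secretKeys) {
    // basename() keeps the operation inside certsDir even if a key contains slashes.
    const target = path.join(certsDir, path.basename(key));
    // force: true turns a missing file into a no-op instead of an error.
    fs.rmSync(target, { force: true });
    targets.push(target);
  }
  return targets;
}
```

Only files named by the secret are touched, so anything else living in the same directory is left alone.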

step 4.2: save the new certs

This step saves the new certificates for each secret object. There can be many secrets, and each secret can contain many certs and keys, so the app uses nested loops to save every item.
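The save step could look roughly like the sketch below. It relies on one known fact, that values in a Kubernetes Secret's `data` field are base64-encoded, and otherwise makes assumptions: the certs directory is treated as a local path, and the function name `saveNewCerts` is hypothetical rather than taken from the repo.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Write each entry of a Secret's `data` map (file name -> base64 string)
// into certsDir as a decoded file.
function saveNewCerts(certsDir, secretData) {
  const written = [];
  for (const [fileName, base64Value] of Object.entries(secretData)) {
    const target = path.join(certsDir, path.basename(fileName));
    // Decode the base64 payload back into the original PEM bytes.
    fs.writeFileSync(target, Buffer.from(base64Value, 'base64'));
    written.push(target);
  }
  return written;
}
```

Calling this once per secret, inside a loop over all configured secrets, covers the "many secrets, many certs and keys" case described above.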

step 4.3: adjust the permissions of the keys

CockroachDB will NOT accept keys that have open permissions. The app restricts each key file so that only the current user can read it, with no group access.
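One way to tighten the permissions is sketched here. The mode `0o600` (owner read/write, nothing for group or others) is an assumption consistent with the description above; the exact mode the app applies is not stated in this post. The sketch also assumes key files end in `.key`, as files produced by the cockroach cert commands do, and that the directory is a local path. The name `restrictKeyPermissions` is hypothetical.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Restrict every *.key file in certsDir to owner-only access.
function restrictKeyPermissions(certsDir) {
  const changed = [];
  for (const name of fs.readdirSync(certsDir)) {
    if (!name.endsWith('.key')) continue; // leave certificates untouched
    fs.chmodSync(path.join(certsDir, name), 0o600); // assumed mode, see lead-in
    changed.push(name);
  }
  return changed;
}
```

Running it after the save step matters because freshly written files usually inherit broader default permissions from the process umask.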

step 4.4: the magic of sighup

This is the special task that tells CRDB the certificates have changed. SIGHUP is an OS-level signal, sent with the kill command, that doesn’t actually kill the process; it merely notifies the process of a hangup event. CockroachDB responds to it by reloading all certificates, without dropping any connections.

You can also see the effects of a SIGHUP in the CockroachDB logging (both on disk and in the console).
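For illustration, the signal could be delivered from outside the pod with `kubectl exec`. The sketch below only builds the argument list and does not run anything. It assumes the cockroach process is PID 1 in the container and that a `kill` binary exists in the image, neither of which this post confirms; the app in the repo may well use a Kubernetes client library instead. The function name, pod name, and namespace are hypothetical.

```javascript
// Build kubectl arguments that send SIGHUP to a process inside a pod.
// pid defaults to 1 on the ASSUMPTION that cockroach is the container's main process.
function buildSighupArgs(podName, namespace, pid = 1) {
  if (!Number.isInteger(pid) || pid <= 0) {
    throw new Error(`Invalid pid: ${pid}`);
  }
  return ['exec', '-n', namespace, podName, '--', 'kill', '-HUP', String(pid)];
}

// Prints the equivalent command line for a made-up pod and namespace.
console.log(['kubectl', ...buildSighupArgs('cockroachdb-0', 'default')].join(' '));
```

The returned array can be handed to `child_process.execFileSync('kubectl', args)`, which avoids shell quoting issues because no shell is involved.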

step 5: …verify the certificates?

In your CockroachDB admin console, you can find the actual certificates under the Advanced Debug tab, and in the Even More Advanced Debugging section.

Here you will find “Certificates on this node”, which lets you validate the expiry of your certs. They should now reflect the fresh dates from the certificates created in step 1 and saved as secrets.

Example certificates in Advanced Debug
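If you prefer a scripted check alongside the console, the expiry can also be read straight from a certificate file with NodeJS's built-in `crypto.X509Certificate` class. This is a hedged sketch and not part of the described tool: the helper names `certExpiry` and `daysUntil` are hypothetical, and the path in the usage comment is only an example.

```javascript
const fs = require('node:fs');
const { X509Certificate } = require('node:crypto');

// Whole days between `now` and `expiry` (negative once the cert has expired).
function daysUntil(expiry, now = new Date()) {
  const msPerDay = 24 * 60 * 60 * 1000;
  return Math.floor((expiry.getTime() - now.getTime()) / msPerDay);
}

// Read a PEM certificate and return its "not after" date.
function certExpiry(pemPath) {
  const cert = new X509Certificate(fs.readFileSync(pemPath));
  return new Date(cert.validTo);
}

// Usage (example path only):
//   console.log(daysUntil(certExpiry('/cockroach/cockroach-certs/node.crt')));
```

Comparing the value before and after the rotation gives a quick confirmation that the node is now serving the refreshed certificate files.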

Conclusion & References

All YAML is available in the GitHub repo for this project. I did not want to clutter this write-up with YAML blocks, so please review them in the repository.
Have fun, but please work with the Cockroach Enterprise Architects when you run through this process for the first time.
Please review the code. It’s an implementation of these steps, written in JavaScript running under NodeJS.
I have a dockerized image available (currently for ARM64, but it can be built for all architectures); pulling it requires a pull-secret.