This section discusses Kubernetes features and settings that you need to pay attention to in order to ensure the high availability of applications running in K8s.

Number of replicas

You need at least two replicas for the application to be considered minimally available. But why, you may ask, is one replica not enough? The problem is that many Kubernetes entities (Node, Pod, ReplicaSet, etc.) are ephemeral, i.e., they may be automatically deleted/recreated under certain conditions. Obviously, a Kubernetes cluster and its applications must take this into account.

For example, when the autoscaler scales down the number of nodes, some of them will be deleted along with the Pods running on them. If the sole instance of your application happens to be running on one of the nodes being deleted, you may find that your application is entirely unavailable, although usually only briefly. In general, if you have only one application replica, its abnormal termination will result in downtime. In other words, you must have at least two running replicas of the application.

The more replicas you have, the less your application’s processing power is reduced if some replica fails. Suppose you have two replicas, and one fails due to network problems on its node. The load that your application can handle is then cut in half (with only one of the two replicas running). Of course, a new replica will be immediately scheduled to a new node, and the load capacity of the application will soon be fully restored. But until then, an increase in load can cause service disruptions, so you must keep some replicas in reserve.
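
For reference, the replica count is set right in the Deployment spec. Here is a minimal sketch (the testapp name and image are placeholders) that keeps three replicas running:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: testapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: testapp
  template:
    metadata:
      labels:
        app: testapp
    spec:
      containers:
      - name: testapp
        image: registry.example.com/testapp:1.0  # placeholder image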

The above recommendations are relevant to cases in which no HorizontalPodAutoscaler is used. The best alternative for applications with more than a few replicas is to configure HorizontalPodAutoscaler and let it manage the number of replicas. We will focus on HorizontalPodAutoscaler in the following article.
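
As a quick preview of what that looks like (the details will follow in the next article), a basic HorizontalPodAutoscaler targeting the hypothetical testapp Deployment might be sketched as follows; the thresholds are purely illustrative:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: testapp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: testapp
  minReplicas: 2           # keep the “at least two replicas” guarantee
  maxReplicas: 10          # illustrative upper bound
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # illustrative target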

The update strategy

The default update strategy for a Deployment allows the number of Ready Pods (across the old and new ReplicaSets) to drop to 75% of the pre-update count. Thus, during an update, the computing capacity of the application may fall to 75% of its regular level, which may lead to a partial failure (degradation of the application’s performance). The strategy.rollingUpdate.maxUnavailable parameter lets you configure the maximum number or percentage of Pods that may become unavailable during an update. Therefore, either make sure that your application runs smoothly even when 25% of its Pods are unavailable, or lower the maxUnavailable parameter. Note that maxUnavailable is rounded down.
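
For instance, if the application cannot tolerate losing any capacity during a rollout, you can forbid unavailable Pods altogether and let the update proceed via surge Pods instead (the maxSurge value here is just an illustration):

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # no Ready Pods may be taken down during the update
      maxSurge: 1         # update by adding one extra Pod at a time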

There’s a little trick to the default update strategy (RollingUpdate): during an update, the application temporarily runs not only with a reduced number of replicas but also with two different versions (the old one and the new one) side by side. Therefore, if running two different versions of the application concurrently is unacceptable for some reason, you can use strategy.type: Recreate. Under the Recreate strategy, all the existing Pods are killed before the new Pods are created, which results in a short period of downtime.
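
In the Deployment spec, that looks as simple as this:

spec:
  strategy:
    type: Recreate   # kill all old Pods first, then create the new ones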

Other deployment strategies (blue-green, canary, etc.) can often provide a much better alternative to RollingUpdate. However, their implementation depends on the software used to deploy the application, so they are beyond the scope of this article.

Here is a great article on the topic that we recommend; it is well worth the read.

Uniform replica distribution across nodes

It is very important to distribute the application’s Pods across different nodes if you have multiple replicas. To do so, you can instruct the scheduler to avoid running multiple Pods of the same Deployment on the same node:

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100   # required for preferred rules; any value from 1 to 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: testapp
        topologyKey: kubernetes.io/hostname

It is better to use the preferredDuringScheduling affinity instead of requiredDuringScheduling. The latter may make it impossible to start new Pods if the number of nodes required for the new Pods exceeds the number of nodes available. Still, the requiredDuringScheduling affinity can come in handy when the number of nodes and application replicas is known in advance and you need to be sure that two Pods will not end up on the same node.
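
For completeness, here is what the stricter requiredDuringScheduling variant looks like, assuming the same app: testapp label as above:

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnorededDuringExecution:
    - labelSelector:
        matchLabels:
          app: testapp
      topologyKey: kubernetes.io/hostname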

PodDisruptionBudget

The PodDisruptionBudget (PDB) mechanism is a must-have for applications running in production. It provides you with a means to specify the maximum number of application Pods that may be unavailable simultaneously. Earlier, we discussed some methods that help avoid potentially risky situations: running several application replicas, specifying podAntiAffinity (to prevent several Pods from being assigned to the same node), and so on.

However, you may encounter a situation in which more than one K8s node becomes unavailable at the same time. For example, suppose you decide to switch your instances to more powerful ones. There may be other reasons as well, but they are beyond the scope of this article; what matters is that several nodes get removed at the same time. “But that’s Kubernetes!” you might say. “Everything is ephemeral here! The Pods will get moved to other nodes, so what’s the deal?” Well, let’s have a look.

Suppose the application has three replicas. The load is evenly distributed between them, and the Pods are spread across the nodes. In this case, the application will continue to run even if one of the replicas fails. However, the failure of two replicas will result in service degradation: a single Pod simply cannot handle the entire load on its own, and clients will start getting 5XX errors. (Of course, you can set a rate limit in the nginx container; in that case, the errors will be 429 Too Many Requests, but the service will still degrade.)

And that’s where PodDisruptionBudget comes to the rescue. Let’s take a look at its manifest:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: app

The manifest is pretty straightforward; you are probably familiar with most of its fields, maxUnavailable being the most interesting among them. This field sets the maximum number of Pods that can be simultaneously unavailable. This can be either an absolute number or a percentage.

Suppose the PDB is configured for the application. What will happen in the event that, for some reason, two or more nodes start evicting application Pods? The above PDB only allows for one Pod to be evicted at a time. Therefore, the second node will wait until the number of replicas reverts to the pre-eviction level, and only then will the second replica be evicted.

As an alternative, you can also set a minAvailable parameter. For example:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: 80%
  selector:
    matchLabels:
      app: app

This parameter ensures that at least 80% of replicas are available in the cluster at all times. Thus, only 20% of replicas can be evicted if necessary. minAvailable can be either an absolute number or a percentage.

But there is a catch: there have to be enough nodes in the cluster to satisfy the podAntiAffinity criteria. Otherwise, you may encounter a situation in which a replica is evicted, but the scheduler cannot re-deploy it for lack of suitable nodes. As a result, draining a node will take forever to complete, and you will end up with two application replicas instead of three. Granted, you can invoke kubectl describe on the Pending Pod to see what’s going on and fix the problem, but it is still better to prevent this kind of situation from happening in the first place.

To summarize, always configure the PDB for critical components of your system.
