This section discusses application resource management in Kubernetes.

Priority

The priorityClassName field defines your Pod’s priority. The scheduler uses it to decide which Pods to schedule first and which Pods to preempt (evict) first when there is no room left for Pods on the nodes.

You will need to add several PriorityClass type resources and map Pods to them using priorityClassName. Here is an example of how PriorityClasses may vary:

  • Cluster. Priority > 10000. Cluster-critical components, such as kube-apiserver.
  • DaemonSets. Priority: 10000. It is usually not advisable for DaemonSet Pods to be evicted from cluster nodes and replaced by ordinary applications.
  • Production-high. Priority: 9000. Stateful applications.
  • Production-medium. Priority: 8000. Stateless applications.
  • Production-low. Priority: 7000. Less critical applications.
  • Default. Priority: 0. Non-production applications.

Setting priorities will help you avoid sudden evictions of critical components. It also means that, when node resources run short, critical applications will preempt (evict) less important ones.
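For illustration, here is a hedged sketch of one such tier and a Pod mapped to it. The production-high name and value mirror the list above, while the Pod itself (my-db, the postgres image) is made up:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production-high
value: 9000                     # the "Production-high" tier from the list above
globalDefault: false
description: "Stateful production applications"
---
apiVersion: v1
kind: Pod
metadata:
  name: my-db                   # hypothetical Pod
spec:
  priorityClassName: production-high
  containers:
  - name: db
    image: postgres:13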

Reserving resources

The scheduler uses a Pod’s resources.requests to decide which node to place the Pod on. For instance, a Pod cannot be scheduled on a node that does not have enough free (i.e., non-requested) resources to cover that Pod’s requests. resources.limits, on the other hand, cap a Pod’s resource consumption so that it cannot heavily exceed the corresponding requests. A good tip is to set limits equal to requests. Setting limits much higher than requests may lead to a situation in which some of a node’s Pods do not get the resources they requested. This may cause other applications on the node (or even the node itself) to fail. Kubernetes assigns a QoS class to each Pod based on its resource scheme and then uses these QoS classes to decide which Pods should be evicted from the nodes first.

Therefore, you have to set both requests and limits, for both CPU and memory. The only thing you can (and should) omit is the CPU limit if the Linux kernel is older than 5.4 (for EL7/CentOS 7, older than 3.10.0-1062.8.1.el7).

Refer to the Kubernetes documentation to learn more about QoS classes.
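As a quick illustration of these recommendations, here is a hedged snippet of a container resource block (the container name, image, and values are hypothetical):

containers:
- name: my-app                  # hypothetical container
  image: registry.example.com/my-app:1.0
  resources:
    requests:
      cpu: 200m
      memory: 256Mi
    limits:
      cpu: 200m                 # may be omitted on kernels older than 5.4 (see above)
      memory: 256Mi             # equal to the request

With limits equal to requests for both resources, the Pod gets the Guaranteed QoS class; if you drop the CPU limit, it becomes Burstable.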

Furthermore, the memory consumption of some applications tends to grow without bound. A good example is Redis used for caching, or an application that basically runs “on its own”. To limit their impact on other applications on the same node, you can (and should) cap the amount of memory they may consume. The only problem is that the application will be OOM-killed (sent SIGKILL) when this limit is reached. Applications cannot catch or handle this signal, so they will most likely fail to shut down correctly. That is why, in addition to Kubernetes limits, we highly recommend using application-specific mechanisms for limiting memory consumption so that it does not exceed (or even approach) the amount set in the Pod’s limits.memory parameter.

Here is a Redis configuration that can help you with this:

maxmemory 500mb   # if the amount of data exceeds 500 MB...
maxmemory-policy allkeys-lru   # ...Redis would delete rarely used keys

As for Sidekiq, you can use the Sidekiq worker killer:

require 'sidekiq/worker_killer'
Sidekiq.configure_server do |config|
  config.server_middleware do |chain|
    # Terminate Sidekiq correctly when it consumes 500 MB
    chain.add Sidekiq::WorkerKiller, max_rss: 500
  end
end

Clearly, in all these cases, limits.memory needs to be higher than the thresholds that trigger the mechanisms described above.
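For example, with the 500 MB application-level thresholds shown above, the Pod’s memory settings could look roughly like this (the amount of headroom here is an arbitrary choice for illustration):

resources:
  requests:
    memory: 640Mi
  limits:
    memory: 640Mi               # comfortably above the 500 MB application-level threshold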

Next we’ll discuss using VerticalPodAutoscaler to allocate resources automatically.

VerticalPodAutoscaler

VPA analyzes the resource requirements of the containers and sets (if the corresponding mode is enabled) their limits and requests.

Suppose you have deployed a new app version with some new functions and it turns out that, say, the imported library is a huge resource eater, or the code isn’t very well optimized. In other words, the application resource requirements have increased. You failed to notice this during testing (since it is hard to load the application in the same way as in production).

And, of course, the relevant requests and limits were set for the app before the update. Now the application reaches its memory limit, and its Pod gets killed due to OOM. VPA can prevent this! At first glance, VPA looks like a great tool that should be used whenever and wherever possible. But in real life, that isn’t always the case, and you have to bear in mind the finer details involved.

The main problem (it hasn’t been solved yet) is that the Pod needs to be restarted for resource changes to take effect. In the future, VPA will modify resources without restarting the Pod, but for now, it simply isn’t capable of that. But no need to worry: that isn’t a big deal if you have a “well-written” application that is always ready for redeployment (say, it has a large number of replicas, and its PodAntiAffinity, PodDisruptionBudget, and HorizontalPodAutoscaler are carefully configured). In that case, you (probably) won’t even notice the VPA activity.
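As a hedged sketch of the “ready for redeployment” part, a PodDisruptionBudget like the one below (the my-app labels are made up) ensures that VPA-triggered evictions never take out more than one replica at a time; use policy/v1beta1 on clusters older than 1.21:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  maxUnavailable: 1             # never evict more than one replica at once
  selector:
    matchLabels:
      app: my-app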

Sadly, there are other, less pleasant scenarios: the application does not take redeployment well, the number of replicas is limited due to a lack of nodes, the application runs as a StatefulSet, and so on. In the worst case, the Pods’ resource consumption grows due to an increased load, HPA starts to scale the application up, and then, suddenly, VPA modifies the resource parameters and restarts the Pods. As a result, the high load gets distributed across the remaining Pods. Some of them may crash, making things even worse and resulting in a chain reaction of failures.

That is why having a profound understanding of various VPA operating modes is important. Let’s start with the simplest one — “Off”.

Off mode

All this mode does is calculate the resource consumption of Pods and make recommendations. Looking ahead, I would like to note that at Flant we use this mode in the majority of cases (and we recommend it). But first, let’s look at a few examples.

Some basic manifests follow below:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Recreate"
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 100m
          memory: 250Mi
        maxAllowed:
          cpu: 1
          memory: 500Mi
        controlledResources: ["cpu", "memory"]
        controlledValues: RequestsAndLimits

We will not go into detail about this manifest’s parameters: this article provides a detailed description of the features and aspects of VPA. In short, we specify the VPA target (targetRef) and select the update policy. Additionally, we specify the lower and upper bounds for the resources VPA may assign. The primary focus is on the updateMode field. In the “Recreate” or “Auto” mode, VPA will recreate Pods, with all the consequences that entails (at least until the above-mentioned in-place update of Pod resources becomes available). Since we don’t want that, we use the “Off” mode:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"   # !!!
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        controlledResources: ["cpu", "memory"]

VPA starts collecting metrics. You can use the kubectl describe vpa command to see the recommendations (just let VPA run for a few minutes):

Recommendation:
Container Recommendations:
  Container Name:  nginx
  Lower Bound:
    Cpu:     25m
    Memory:  52428800
  Target:
    Cpu:     25m
    Memory:  52428800
  Uncapped Target:
    Cpu:     25m
    Memory:  52428800
  Upper Bound:
    Cpu:     25m
    Memory:  52428800

The VPA recommendations will become more accurate after a couple of days (a week, a month, etc.) of running. That is the perfect time to adjust the requests and limits in the application manifest. This way, you can avoid OOM kills caused by a lack of resources and save on infrastructure (if the initial requests/limits were set too high).
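For instance, based on the Target values shown above (25m of CPU and 52428800 bytes, i.e. 50Mi, of memory), the container’s resources could be updated roughly as follows; whether to add headroom on top of the Target is up to you:

resources:
  requests:
    cpu: 25m
    memory: 50Mi
  limits:
    cpu: 25m                    # or omit on older kernels, as discussed earlier
    memory: 50Mi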

Now, let’s talk about some of the details of using VPA.

Other VPA modes

Note that in “Initial” mode, VPA assigns resources when Pods are started and never changes them afterwards. Thus, VPA will set low requests/limits for newly created Pods if the load was relatively low over the past week. This may lead to problems if the load suddenly increases, because the requests/limits will be much lower than what such a load requires. This mode may come in handy if your load is uniformly distributed and grows linearly.
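Enabling it only takes changing the update policy in the VPA manifest shown earlier:

updatePolicy:
  updateMode: "Initial"         # resources are assigned only when the Pod is created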

In “Auto” mode, VPA recreates the Pods, so the application must handle restarts properly. If it cannot shut down gracefully (i.e., by correctly closing existing connections and so on), you will most likely catch some avoidable 5XX errors. Using Auto mode with a StatefulSet is rarely advisable: just imagine VPA trying to add resources to PostgreSQL in production…
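To soften such restarts, you can give the application time to finish in-flight requests. Here is a rough sketch, where the sleep duration, container name, and image are arbitrary assumptions:

spec:
  terminationGracePeriodSeconds: 60
  containers:
  - name: my-app
    image: registry.example.com/my-app:1.0
    lifecycle:
      preStop:
        exec:
          # give load balancers time to stop routing traffic to this Pod
          command: ["sh", "-c", "sleep 5"]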

As for the dev environment, you can freely experiment there to find the resource levels that will be acceptable to use (later) in production. Suppose you want to use VPA in the “Initial” mode and you have Redis running in the cluster with its maxmemory parameter set. You will most likely need to change that parameter to adjust it to your needs. The problem is that Redis doesn’t care about the limits at the cgroups level. In other words, you are risking a lot if maxmemory is, say, 2GB while your Pod’s memory is capped at 1GB. But how can you make maxmemory match the limit? Well, there is a way! You can use the VPA-recommended values:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
  labels:
    app: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
      - name: redis
        image: redis:6.2.1
        ports:
        - containerPort: 6379
        resources:
          requests:
            memory: "100Mi"
            cpu: "256m"
          limits:
            memory: "100Mi"
            cpu: "256m"
        env:
          - name: MY_MEM_REQUEST
            valueFrom:
              resourceFieldRef:
                containerName: redis
                resource: requests.memory
          - name: MY_MEM_LIMIT
            valueFrom:
              resourceFieldRef:
                containerName: redis
                resource: limits.memory

You can use these environment variables to obtain the memory limit (and subtract, say, 10% from it for the application’s own needs) and set the resulting value as maxmemory. You will probably have to fiddle with an init container that uses sed to patch the Redis config, since the default Redis container image does not support passing maxmemory via an environment variable. Nevertheless, this solution is quite functional.
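Alternatively, here is a rough sketch that skips sed entirely (this is our own shortcut, not part of the setup above): the official image lets you pass configuration directives as command-line arguments to redis-server, so you can compute maxmemory from the limit at startup; the 90% factor is an arbitrary choice:

containers:
- name: redis
  image: redis:6.2.1
  command: ["sh", "-c"]
  args:
  # start Redis with maxmemory set to ~90% of the container memory limit (in bytes)
  - exec redis-server --maxmemory $((MY_MEM_LIMIT * 9 / 10))
  env:
  - name: MY_MEM_LIMIT
    valueFrom:
      resourceFieldRef:
        containerName: redis
        resource: limits.memory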

Finally, I would like to turn your attention to the fact that VPA evicts the DaemonSet Pods all at once, en masse. We are currently working on a patch that fixes this.

Final VPA recommendations

“Off” mode is suitable for the majority of cases.

You can experiment with “Auto” and “Initial” modes in the dev environment.

Only use VPA in production if you have already accumulated recommendations and tested them thoroughly. In addition, you have to clearly understand what you are doing and why you are doing it.

In the meantime, we are eagerly anticipating in-place (restart-free) updates for Pod resources.

Note that there are some limitations associated with the joint use of HPA and VPA. For instance, VPA should not be used together with HPA if a CPU- or memory-based metric is used as a trigger. The reason is that when the threshold is reached, VPA increases resource requests/limits while HPA adds new replicas. As a result, the load drops off sharply, and the process goes into reverse, resulting in “flapping”. The official documentation sheds more light on the existing limitations.

HorizontalPodAutoscaler

Let’s consider another situation: what happens if the application faces an unexpected load that is significantly higher than usual? Yes, you can scale it up manually by adding replicas, but that is not the method we use.

That is where HorizontalPodAutoscaler (HPA) comes in. With HPA, you can choose a metric and use it as a trigger for scaling the application up or down automatically, depending on the metric’s value. Imagine that on one quiet night your cluster suddenly gets blasted with a massive uptick in traffic, say, Reddit users have found out about your service. The CPU load (or some other Pod metric) increases, hits the threshold, and then HPA comes into play. It scales the application up, thus distributing the load between a larger number of Pods.

Thanks to that, all the incoming requests are processed successfully. Just as important, after the load returns to its average level, HPA scales the application back down to reduce infrastructure costs and save money. Sounds great, doesn’t it?

Let’s see how exactly HPA calculates the number of replicas to be added. Here is the formula from the documentation:

desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]

Now suppose that:

  • the current number of replicas is 3;
  • the current metric value is 100;
  • the metric threshold value is 60;

In this case, the desired number is ceil[3 * ( 100 / 60 )] = 5 replicas, so the application will gain two more replicas. But that is not the end of the story: HPA will keep recalculating the required number of replicas (using the formula above) and will scale the application back down if the load decreases.

And that brings us to the most exciting part. What metric should you use? The first thing that comes to mind is one of the primary metrics, such as CPU or Memory utilization. And that will work if your CPU and Memory consumption is directly proportional to the incoming load. But what if the Pods are handling different requests? Some requests require many CPU cycles, others may consume a lot of memory, and still others only demand minimum resources.

Let’s take a look, for example, at the RabbitMQ queue and the instances processing it. Suppose there are ten messages in the queue. Monitoring shows that messages are being dequeued (as per RabbitMQ’s terminology) steadily and regularly. That is, we feel that ten messages in the queue on average is okay. But then the load suddenly increases, and the queue grows to 100 messages. However, the workers’ CPU and Memory consumption stays the same: they are steadily processing the queue, leaving about 80-90 messages in it.

But what if we use a custom metric that describes the number of messages in the queue? Let’s configure our custom metric as follows:

  • the current number of replicas is 3;
  • the current metric value is 80;
  • the metric threshold value is 15.

Thus, 3 * ( 80 / 15 ) = 16. In this case, HPA can increase the number of workers to 16, and they quickly process all the messages in the queue (at which point HPA will decrease their number again). However, all the required infrastructure must be ready to accommodate this number of Pods. That is, they must fit on the existing nodes, or new nodes must be provisioned by the infrastructure provider (cloud provider) in the case that Cluster Autoscaler is used. In other words, we are back to planning cluster resources.

Now let’s take a look at some manifests:

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  minReplicas: 1
  maxReplicas: 10
  targetCPUUtilizationPercentage: 50

This one is simple: as soon as the average CPU utilization of the Pods reaches 50% of their requested CPU, HPA starts scaling the number of replicas up, to a maximum of 10.

Here is a more interesting one:

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metric:
        name: queue_messages
      target:
        type: AverageValue
        averageValue: 15

Note that in this example, HPA uses a custom (external) metric: it bases its scaling decisions on the size of the queue (the queue_messages metric). Given that the average number of messages in the queue is 10, we set the threshold to 15. This way, you can manage the number of replicas more accurately. As you can see, such a metric enables more accurate autoscaling than, say, a CPU-based one.

Additional features

The HPA configuration options are pretty diverse. For example, you can combine different metrics. In the manifest below, CPU utilization and queue size are used to trigger scaling decisions.

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  - type: External
    external:
      metric:
        name: queue_messages
      target:
        type: AverageValue
        averageValue: 15

What calculation algorithm does HPA apply in this case? It computes the desired number of replicas for each metric and uses the highest result. For example, if the CPU-based calculation calls for 5 replicas while the queue-size-based one calls for only 3, HPA will use the larger value and scale to 5 replicas.

With the release of Kubernetes 1.18, you now have the ability to define scaleUp and scaleDown policies. For example:

behavior:
  scaleDown:
    stabilizationWindowSeconds: 60
    policies:
    - type: Percent
      value: 5
      periodSeconds: 20
    - type: Pods
      value: 5
      periodSeconds: 60
    selectPolicy: Min
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
    - type: Percent
      value: 100
      periodSeconds: 10

As you can see, the manifest above features two sections: the first one (scaleDown) defines the scaling-down parameters, while the second (scaleUp) is used for scaling up. Each section has a stabilizationWindowSeconds parameter. It helps prevent what is referred to as “flapping” (unnecessary scaling while the number of replicas keeps oscillating) by essentially acting as a cooldown period after the number of replicas changes.

Now let’s talk about the policies. The scaleDown policy allows you to specify the percentage of Pods (type: Percent) that may be removed over a given period of time. If the load follows a cyclical pattern, you should decrease the percentage and increase the period. In that case, as the load decreases, HPA will not kill a large number of Pods at once (according to its formula) but will do so gradually. In addition, you can set the maximum number of Pods (type: Pods) that HPA is allowed to kill over the specified period.

Note the selectPolicy: Min parameter. What that means is HPA uses the policy that affects the minimum number of Pods. Thus, HPA will choose a percent value if it (5% in the example above) is less than the numeric alternative (5 Pods in the example above). Conversely, the selectPolicy: Max policy will have the opposite effect.

Similar parameters are used in the scaleUp section. Note that in most situations, the cluster must scale up (almost) instantly since even a slight delay can affect users and their experience. For that reason, stabilizationWindowSeconds is set to 0 in this section. If the load has a cyclical pattern, HPA can increase the replica count to maxReplicas (as defined in the HPA manifest) if necessary. Our policy allows HPA to add up to 100% to the currently running replicas every 10 seconds (periodSeconds: 10).

Finally, you can set the selectPolicy parameter to Disabled to turn off scaling in the given direction:

behavior:
  scaleDown:
    selectPolicy: Disabled

Policies are mostly used when HPA does not behave as expected out of the box. They provide flexibility but make the manifest harder to grasp.

More recently, HPA became capable of tracking the resource usage of individual containers across a set of Pods (introduced as an alpha feature in Kubernetes 1.20).
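Here is a brief sketch of what such a metric looks like; the container name application is hypothetical, and while the feature is in alpha it sits behind the HPAContainerMetrics feature gate:

metrics:
- type: ContainerResource
  containerResource:
    name: cpu
    container: application      # track only this container, not the whole Pod
    target:
      type: Utilization
      averageUtilization: 60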

HPA: summary

Let us conclude this section with an example of the complete HPA manifest:

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metric:
        name: queue_messages
      target:
        type: AverageValue
        averageValue: 15
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 5
        periodSeconds: 20
      - type: Pods
        value: 5
        periodSeconds: 60
      selectPolicy: Min
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 10

Please note this example is provided for informational purposes only. You will need to adapt it to suit the specifics of your own operation.

To sum up: HPA is perfect for production environments, but you have to be careful and forward-thinking when choosing metrics for it. A mistaken metric or an incorrect threshold will result in either wasted resources (from unnecessary replicas) or degraded service (if there are not enough replicas). Closely monitor the behavior of the application and keep tuning until you’ve achieved the right balance.
