Colocate your application using Katalyst

This quick start guide gives you a taste of what you can achieve with Katalyst colocation capabilities. We strongly suggest reading through the core concepts before you begin, so that the terminology used in this guide is not confusing.

Prerequisites

As of v0.4.0, Katalyst colocation still depends on Kubewharf enhanced Kubernetes to work properly. If you don’t already have a Kubewharf enhanced Kubernetes cluster up and running, please follow this guide to install one.

Installation

Add helm repo

helm repo add kubewharf https://kubewharf.github.io/charts
helm repo update

Install Malachite

helm install malachite -n malachite-system --create-namespace kubewharf/malachite

Install Katalyst colocation

helm install katalyst-colocation -n katalyst-system --create-namespace kubewharf/katalyst-colocation

A colocation example

Before going to the next step, let’s assume that we have the following setup:

  • Total resources are set as 48 cores and 195924424Ki per node;
  • Reserved resources for pods with shared_cores are set as 4 cores and 5Gi. This means that we’ll always keep at least this amount of resources for those pods for bursting requirements.

Based on the assumptions above, you can follow the steps below to dive into the colocation workflow.

Reporting reclaimed resource

After installation, the resource reporting module reports reclaimed resources. Since no pods are running yet, reclaimed resources are calculated as:

reclaimed_resources = total_resources - reserve_resources

Katalyst defines a CRD, CustomNodeResource, to hold the extended per-node resources, including reclaimed resources. If we check CustomNodeResource with kubectl get kcnr, reclaimed resources are shown as follows. This is the amount of reclaimed resources the node offers to pods with the reclaimed_cores QoS class.

status:
    resourceAllocatable:
        katalyst.kubewharf.io/reclaimed_memory: "195257901056"
        katalyst.kubewharf.io/reclaimed_millicpu: 44k
    resourceCapacity:
        katalyst.kubewharf.io/reclaimed_memory: "195257901056"
        katalyst.kubewharf.io/reclaimed_millicpu: 44k
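
The figures in this status can be reproduced from the setup assumed earlier (48 cores and 195924424Ki in total, 4 cores and 5Gi reserved). The sketch below is only a worked example of the subtraction; the function and variable names are illustrative and are not part of Katalyst.

```python
# Worked example of: reclaimed_resources = total_resources - reserve_resources
# All names here are illustrative; none of them come from Katalyst itself.

KI = 1024        # bytes per Ki
GI = 1024 ** 3   # bytes per Gi


def reclaimed(total, reserved):
    """Reclaimed amount when no pods are running on the node."""
    return total - reserved


# CPU: 48 cores in total, 4 cores reserved, expressed in millicores.
reclaimed_millicpu = reclaimed(48 * 1000, 4 * 1000)

# Memory: 195924424Ki in total, 5Gi reserved, expressed in bytes.
reclaimed_memory = reclaimed(195924424 * KI, 5 * GI)

print(reclaimed_millicpu)  # 44000, reported as 44k
print(reclaimed_memory)    # 195257901056
```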

Create a shared_cores pod with the following yaml, and generate some CPU load so that reclaimed resources fluctuate with the running state of the workload. We can think of it as an online service such as a web server.

apiVersion: v1
kind: Pod
metadata:
  annotations:
    "katalyst.kubewharf.io/qos_level": shared_cores
  name: shared-normal-pod
  namespace: default
spec:
  containers:
    - name: stress
      image: joedval/stress:latest
      command:
        - stress
        - -c
        - "1"
      imagePullPolicy: IfNotPresent
      resources:
        requests:
          cpu: "2"
          memory: 1Gi
        limits:
          cpu: "2"
          memory: 1Gi
  schedulerName: katalyst-scheduler

After being successfully scheduled, the pod starts running with cpu-usage ~= 1 core and cpu-load ~= 1, and the reclaimed resources change according to the formula below. We skip memory here since accurate values are harder to reproduce than for CPU, but the principle is similar.

reclaim cpu = allocatable - round(ceil(reserve + max(usage, load.1min, load.5min)))

status:
    resourceAllocatable:
        katalyst.kubewharf.io/reclaimed_millicpu: 42k
    resourceCapacity:
        katalyst.kubewharf.io/reclaimed_millicpu: 42k

Now we put pressure on the pod with stress to simulate the peak hours of an online service. The cpu-load rises to approximately 3, which makes the reclaimed cpu shrink to 40k.

kubectl exec shared-normal-pod -it -- stress -c 2

If we check the CustomNodeResource again, we can verify that the allocatable reclaimed CPU reduces to 40k.

status:
    resourceAllocatable:
        katalyst.kubewharf.io/reclaimed_millicpu: 40k
    resourceCapacity:
        katalyst.kubewharf.io/reclaimed_millicpu: 40k
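
The reclaimed CPU figures seen so far (44k, 42k and 40k) can be reproduced with a small sketch of the formula. It rests on one loud assumption: that round() rounds up to the next even number of cores. This is inferred only from the figures shown in this guide and is not confirmed by it; the function name and parameters are illustrative as well.

```python
import math


def reclaimed_cpu_cores(allocatable, reserve, usage, load_1min, load_5min):
    """reclaim cpu = allocatable - round(ceil(reserve + max(usage, load.1min, load.5min)))"""
    demand = math.ceil(reserve + max(usage, load_1min, load_5min))
    # ASSUMPTION: round() rounds up to the next even number of cores.
    # Inferred from the 42k and 40k figures in this guide, not from Katalyst docs.
    rounded = demand + (demand % 2)
    return allocatable - rounded


# 48 allocatable cores and 4 reserved cores, as in the setup of this guide.
for load in (0, 1, 3):
    cores = reclaimed_cpu_cores(48, 4, usage=load, load_1min=load, load_5min=load)
    print(f"load={load}: reclaimed_millicpu={cores}k")
```

Under that assumption, a load of 0, 1 and 3 yields 44k, 42k and 40k respectively, matching the three status snippets.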

Enhanced QoS capabilities

After pods are successfully scheduled, Katalyst applies resource isolation strategies across various resource dimensions (e.g. CPU, memory, network, disk io) and dynamically adjusts resource allocation for pods with different QoS classes. These strategies ensure that the QoS of shared_cores workloads is met while reclaiming as many resources as possible. In this guide we will see how Katalyst manages CPU for shared_cores and reclaimed_cores workloads.

Before going to the next step, remember to remove the previous pods to get a clean environment.

In a nutshell, Katalyst uses cpuset to isolate pods of different QoS classes from interfering with each other. It allocates separate cpuset pools for shared_cores and reclaimed_cores workloads, so that no shared_cores pod is scheduled to run on the same cpu core as a reclaimed_cores pod at the same time. Let’s verify this with an example:

Create a pod with shared_cores QoS class with the yaml below.

apiVersion: v1
kind: Pod
metadata:
  annotations:
    "katalyst.kubewharf.io/qos_level": shared_cores
  name: shared-normal-pod
  namespace: default
spec:
  containers:
    - name: stress
      image: joedval/stress:latest
      command:
        - stress
        - -c
        - "1"
      imagePullPolicy: IfNotPresent
      resources:
        requests:
          cpu: "2"
          memory: 1Gi
        limits:
          cpu: "2"
          memory: 1Gi
  schedulerName: katalyst-scheduler

After a short ramp-up period, the cpuset for shared-pool will be 6 cores in total:

  • 4 cores are reserved as buffer for bursting requests.
  • 2 cores for regular requirements. This number is derived from an algorithm which takes the current CPU usage/load of shared_cores pods into consideration.

The rest of the cores are considered suitable for reclaimed pods.

Let’s log in to the node to check the cpuset allocation result:

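cat<<'EOF' > /tmp/get_cpuset.sh
#!/bin/bash
# Print the cpuset assigned to the container whose `crictl ps` entry matches $1.
id=$1
# Resolve the container ID from the name passed as first argument.
cid=$(crictl ps | grep "$id" | awk '{print $1}')
if [ -z "$cid" ]; then
    echo "no running container matches '$id'" >&2
    exit 1
fi

# Extract the cgroup path from the inspect output, dropping quotes and commas.
cp=$(crictl inspect "$cid" | grep cgroupsPath | awk '{print $2}' | tr -d '",')
if [ -z "$cp" ]; then
    echo "no cgroupsPath found for container $cid" >&2
    exit 1
fi

date
# Read the cpuset from the cgroup v1 cpuset hierarchy.
cat /sys/fs/cgroup/cpuset/"$cp"/cpuset.cpus
EOF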

chmod 700 /tmp/get_cpuset.sh

The script above gets the cpuset allocated to a specific pod. Let’s check the cpuset allocated to shared-normal-pod.

/tmp/get_cpuset.sh shared-normal-pod

Tue Jan  3 16:18:31 CST 2023
11,22-23,35,46-47

Now create a pod with the reclaimed_cores QoS class using the yaml below. The cpuset for reclaimed_cores pods will be 42 cores in total (48 cores minus the 6 cores of the shared pool).

apiVersion: v1
kind: Pod
metadata:
  annotations:
    "katalyst.kubewharf.io/qos_level": reclaimed_cores
  name: reclaimed-normal-pod
  namespace: default
spec:
  containers:
    - name: stress
      image: joedval/stress:latest
      command:
        - stress
        - -c
        - "1"
      imagePullPolicy: IfNotPresent
      resources:
        requests:
          resource.katalyst.kubewharf.io/reclaimed_millicpu: 2k
          resource.katalyst.kubewharf.io/reclaimed_memory: 1Gi
        limits:
          resource.katalyst.kubewharf.io/reclaimed_millicpu: 2k
          resource.katalyst.kubewharf.io/reclaimed_memory: 1Gi
  schedulerName: katalyst-scheduler

Check the cpuset allocated to reclaimed_cores pods:

/tmp/get_cpuset.sh reclaimed-normal-pod

Tue Jan  3 16:23:20 CST 2023
0-10,12-21,24-34,36-45

Now if the cpu load of shared-normal-pod rises to 3, the cpuset for all shared_cores pods will grow to 8 cores in total (i.e. 4 cores reserved as buffer for bursting requests, plus 4 cores for regular requirements), and the cpuset allocated to reclaimed_cores pods will shrink to 40 cores.

kubectl exec shared-normal-pod -it -- stress -c 2

/tmp/get_cpuset.sh shared-normal-pod

Tue Jan  3 16:25:23 CST 2023
10-11,22-23,34-35,46-47

/tmp/get_cpuset.sh reclaimed-normal-pod

Tue Jan  3 16:28:32 CST 2023
0-9,12-21,24-33,36-45
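
To read these cpuset.cpus values more easily, the list format can be expanded into individual CPU ids. The helper below is a minimal sketch written for this guide (parse_cpuset is a hypothetical name, not a Katalyst or kernel API); it counts the cores in each pool and checks that the shared and reclaimed pools never overlap.

```python
def parse_cpuset(spec):
    """Expand a cpuset.cpus list such as "11,22-23" into a set of CPU ids."""
    cpus = set()
    for part in spec.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


# cpuset.cpus values printed by get_cpuset.sh in this walkthrough:
# (shared-normal-pod, reclaimed-normal-pod)
samples = {
    "before extra load": ("11,22-23,35,46-47", "0-10,12-21,24-34,36-45"),
    "after extra load": ("10-11,22-23,34-35,46-47", "0-9,12-21,24-33,36-45"),
}

for label, (shared_spec, reclaimed_spec) in samples.items():
    shared = parse_cpuset(shared_spec)
    reclaimed = parse_cpuset(reclaimed_spec)
    print(label, len(shared), len(reclaimed), shared.isdisjoint(reclaimed))
```

For the outputs above this gives 6 shared and 42 reclaimed cores before the extra load, and 8 shared and 40 reclaimed cores afterwards, with no core in both pools.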

Pod Eviction

Eviction is usually used as a fallback measure in case QoS fails to be satisfied. We should always make sure that the QoS of pods with higher priority (i.e. shared_cores pods) is met by evicting pods with lower priority (i.e. reclaimed_cores pods). Katalyst provides both per-node and centralized eviction to meet different requirements.

Before going to the next step, remember to remove the previous pods to get a clean environment.

Per-node Eviction

Per-node eviction is performed by a daemonset agent. Currently, Katalyst provides several in-tree agent eviction implementations.

Resource Overcommit

Since allocatable reclaimed resources always fluctuate with the running state of shared_cores pods, they can become very tight when the resource usage of shared_cores pods rises. In this case, Katalyst evicts reclaimed_cores pods so that they can be rescheduled to a node with more resources. The trigger for this kind of eviction can be described as:

sum(requested_reclaimed_resource) > allocatable_reclaimed_resource * threshold
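
As a rough illustration of this trigger, the sketch below evaluates the inequality for one node. The function name and the pod requests are hypothetical; the threshold of 5 is the default mentioned later in this guide.

```python
def should_evict_reclaimed(requested_millicpu, allocatable_millicpu, threshold):
    """True when sum(requested_reclaimed_resource) > allocatable_reclaimed_resource * threshold."""
    return sum(requested_millicpu) > allocatable_millicpu * threshold


# HYPOTHETICAL reclaimed_millicpu requests of three reclaimed_cores pods.
requests = [8000, 8000, 8000]

# Plenty of reclaimed CPU: 24000 > 40000 * 5 does not hold, so nothing is evicted.
print(should_evict_reclaimed(requests, allocatable_millicpu=40000, threshold=5))

# Reclaimed CPU shrinks to 4k: 24000 > 4000 * 5 holds, so eviction is triggered.
print(should_evict_reclaimed(requests, allocatable_millicpu=4000, threshold=5))
```

Because the threshold multiplies the allocatable amount, a larger threshold tolerates more overcommit before eviction starts.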

Create several pods (including shared_cores and reclaimed_cores)

curl https://raw.githubusercontent.com/kubewharf/katalyst-core/main/examples/shared-large-pod.yaml | kubectl create -f -
curl https://raw.githubusercontent.com/kubewharf/katalyst-core/main/examples/reclaimed-large-pod.yaml | kubectl create -f -

Now add some load to a shared_cores pod to reduce allocatable reclaimed resources until the eviction condition is met. Eventually, this triggers the eviction of pod reclaimed-large-pod-2.

kubectl exec shared-large-pod-2 -it -- stress -c 40

Check the CustomNodeResource of the node. We can see that allocatable reclaimed cpu has dropped to 4k:

status:
    resourceAllocatable:
        katalyst.kubewharf.io/reclaimed_millicpu: 4k
    resourceCapacity:
        katalyst.kubewharf.io/reclaimed_millicpu: 4k

Check the eviction events:

kubectl get event -A | grep evict

default     43s         Normal   EvictCreated     pod/reclaimed-large-pod-2   Successfully create eviction; reason: met threshold in scope: katalyst.kubewharf.io/reclaimed_millicpu from plugin: reclaimed-resource-pressure-eviction-plugin
default     8s          Normal   EvictSucceeded   pod/reclaimed-large-pod-2   Evicted pod has been deleted physically; reason: met threshold in scope: katalyst.kubewharf.io/reclaimed_millicpu from plugin: reclaimed-resource-pressure-eviction-plugin

The default threshold for reclaimed resources is 5. We can dynamically configure the threshold with KCC (KatalystCustomConfig):

kubectl create -f - <<EOF
apiVersion: config.katalyst.kubewharf.io/v1alpha1
kind: KatalystCustomConfig
metadata:
  name: eviction-configuration
spec:
  targetType:
    group: config.katalyst.kubewharf.io
    resource: evictionconfigurations
    version: v1alpha1

---
apiVersion: config.katalyst.kubewharf.io/v1alpha1
kind: EvictionConfiguration
metadata:
  name: default
spec:
  config:
    evictionPluginsConfig:
      reclaimedResourcesEvictionPluginConfig:
        evictionThreshold:
          "katalyst.kubewharf.io/reclaimed_millicpu": 10
          "katalyst.kubewharf.io/reclaimed_memory": 10
EOF

Memory

Memory eviction is implemented in two parts: numa-level eviction and system-level eviction. The former is used together with the numa-binding enhancement, while the latter is used for more general cases. In this tutorial, we mainly demonstrate the latter. At each level, Katalyst triggers memory eviction based on memory usage and Kswapd active rate, to avoid the slow path for memory allocation in the kernel.

Create several pods (including shared_cores and reclaimed_cores)

curl https://raw.githubusercontent.com/kubewharf/katalyst-core/main/examples/shared-large-pod.yaml | kubectl create -f -
curl https://raw.githubusercontent.com/kubewharf/katalyst-core/main/examples/reclaimed-large-pod.yaml | kubectl create -f -

Create a KCC to change the default free memory and Kswapd rate thresholds.

kubectl create -f - <<EOF
apiVersion: config.katalyst.kubewharf.io/v1alpha1
kind: KatalystCustomConfig
metadata:
  name: eviction-configuration
spec:
  targetType:
    group: config.katalyst.kubewharf.io
    resource: evictionconfigurations
    version: v1alpha1

---
apiVersion: config.katalyst.kubewharf.io/v1alpha1
kind: EvictionConfiguration
metadata:
  name: default
spec:
  config:
    evictionPluginsConfig:
      memoryEvictionPluginConfig:
        enableNumaLevelDetection: false
        systemKswapdRateExceedTimesThreshold: 1
        systemKswapdRateThreshold: 2000
EOF

Memory Usage Eviction

Exec into reclaimed-large-pod-2 and request enough memory. When free memory drops below the threshold, Katalyst will try to evict reclaimed_cores pods, and it will choose the pod that uses the most memory.
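
The selection rule described here can be sketched as follows. This is only an illustration of "evict the reclaimed_cores pod that uses the most memory once free memory is below the threshold"; the function, the data layout, the threshold and all usage figures are hypothetical and do not reflect Katalyst internals.

```python
GI = 1024 ** 3


def pick_memory_victim(pods, free_bytes, free_threshold_bytes):
    """Return the name of the reclaimed_cores pod using the most memory, or None."""
    if free_bytes >= free_threshold_bytes:
        return None  # enough free memory, nothing to evict
    candidates = [pod for pod in pods if pod["qos"] == "reclaimed_cores"]
    if not candidates:
        return None
    return max(candidates, key=lambda pod: pod["memory_bytes"])["name"]


# HYPOTHETICAL pods and memory usage figures.
pods = [
    {"name": "shared-large-pod-2", "qos": "shared_cores", "memory_bytes": 2 * GI},
    {"name": "reclaimed-large-pod-1", "qos": "reclaimed_cores", "memory_bytes": 1 * GI},
    {"name": "reclaimed-large-pod-2", "qos": "reclaimed_cores", "memory_bytes": 170 * GI},
]

# HYPOTHETICAL free memory (1Gi) and threshold (5Gi).
print(pick_memory_victim(pods, free_bytes=1 * GI, free_threshold_bytes=5 * GI))
```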

kubectl exec -it reclaimed-large-pod-2 -- bash

stress --vm 1 --vm-bytes 175G --vm-hang 1000 --verbose

Check the eviction event:

kubectl get event -A | grep evict

default     2m40s       Normal   EvictCreated     pod/reclaimed-large-pod-2   Successfully create eviction; reason: met threshold in scope: memory from plugin: memory-pressure-eviction-plugin
default     2m5s        Normal   EvictSucceeded   pod/reclaimed-large-pod-2   Evicted pod has been deleted physically; reason: met threshold in scope: memory from plugin: memory-pressure-eviction-plugin

A MemoryPressure taint is added as well:

taints:
- effect: NoSchedule
  key: node.katalyst.kubewharf.io/MemoryPressure
  timeAdded: "2023-01-09T06:32:08Z"

Kswapd Eviction

Log in to the worker node and put some pressure on system memory:

stress --vm 1 --vm-bytes 180G --vm-hang 1000 --verbose

When the Kswapd active rate exceeds the target threshold (default = 1), Katalyst will try to evict both reclaimed_cores and shared_cores pods, and reclaimed_cores pods will be picked for eviction first.
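
The ordering between the two QoS classes can be pictured with a small sketch. It only encodes "reclaimed_cores before shared_cores"; how Katalyst ranks pods within one class is not covered here, and the function name and the pod list are hypothetical.

```python
# Lower value = evicted earlier. The ordering follows the description in this guide.
EVICTION_PRIORITY = {"reclaimed_cores": 0, "shared_cores": 1}


def kswapd_eviction_order(pods):
    """Order candidate pods so that reclaimed_cores pods come first.

    sorted() is stable, so pods of the same QoS class keep their input order.
    """
    return sorted(pods, key=lambda pod: EVICTION_PRIORITY[pod["qos"]])


# HYPOTHETICAL set of pods running on the node.
pods = [
    {"name": "shared-large-pod-1", "qos": "shared_cores"},
    {"name": "reclaimed-large-pod-1", "qos": "reclaimed_cores"},
    {"name": "shared-large-pod-2", "qos": "shared_cores"},
    {"name": "reclaimed-large-pod-2", "qos": "reclaimed_cores"},
]

print([pod["name"] for pod in kswapd_eviction_order(pods)])
```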

Check the eviction event:

kubectl get event -A | grep evict

default           2m2s        Normal    EvictCreated              pod/reclaimed-large-pod-2          Successfully create eviction; reason: met threshold in scope: memory from plugin: memory-pressure-eviction-plugin
default           92s         Normal    EvictSucceeded            pod/reclaimed-large-pod-2          Evicted pod has been deleted physically; reason: met threshold in scope: memory from plugin: memory-pressure-eviction-plugin

A MemoryPressure taint is added as well:

taints:
- effect: NoSchedule
  key: node.katalyst.kubewharf.io/MemoryPressure
  timeAdded: "2023-01-09T06:32:08Z"

Load Eviction

For shared_cores pods, if any pod spawns too many threads, the scheduling period of the Linux CFS scheduler may be split into small slices, which results in more throttling and hurts workload performance. To solve this, Katalyst implements load eviction, which monitors the load and triggers taint and eviction actions based on thresholds. The comparison formulas are as follows:

soft: load > resource_pool_cpu_amount
hard: load > resource_pool_cpu_amount * threshold
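
The soft and hard comparisons can be sketched as a single decision function. The pool size of 44 CPUs and the threshold of 2.0 below are hypothetical values chosen for illustration, as is the function name; the actual defaults are not stated in this guide.

```python
def load_eviction_action(load, pool_cpus, threshold):
    """Map a load value to the action described by the soft and hard formulas."""
    if load > pool_cpus * threshold:
        return "evict"  # hard: load > resource_pool_cpu_amount * threshold
    if load > pool_cpus:
        return "taint"  # soft: load > resource_pool_cpu_amount
    return "none"


# HYPOTHETICAL pool of 44 CPUs and threshold of 2.0.
for load in (40, 50, 100):
    print(load, load_eviction_action(load, pool_cpus=44, threshold=2.0))
```

With these values a load of 40 triggers nothing, a load of 50 only taints the node, and a load of 100 triggers eviction.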

Create several pods (including shared_cores and reclaimed_cores).

curl https://raw.githubusercontent.com/kubewharf/katalyst-core/main/examples/shared-large-pod.yaml | kubectl create -f -
curl https://raw.githubusercontent.com/kubewharf/katalyst-core/main/examples/reclaimed-large-pod.yaml | kubectl create -f -

Put some pressure on the pod to reduce allocatable reclaimed resources until the load exceeds the soft threshold. In this case, a taint is added to the CNR to prevent new pods from being scheduled, but the existing pods keep running.

kubectl exec shared-large-pod-2 -it -- stress -c 50

taints:
- effect: NoSchedule
  key: node.katalyst.kubewharf.io/CPUPressure
  timeAdded: "2023-01-05T05:26:51Z"

Put more pressure to reduce allocatable reclaimed resources until the load exceeds the hard threshold. In this case, Katalyst evicts the pods that create the most threads.

kubectl exec shared-large-pod-2 -it -- stress -c 100

kubectl get event -A | grep evict

67s         Normal   EvictCreated     pod/shared-large-pod-2      Successfully create eviction; reason: met threshold in scope: cpu.load.1min.container from plugin: cpu-pressure-eviction-plugin
68s         Normal   Killing          pod/shared-large-pod-2      Stopping container stress
32s         Normal   EvictSucceeded   pod/shared-large-pod-2      Evicted pod has been deleted physically; reason: met threshold in scope: cpu.load.1min.container from plugin: cpu-pressure-eviction-plugin

Centralized Eviction

In some cases, the agents may become a single point of failure: in a large cluster, the daemon may stop working for many kinds of abnormal reasons, and the pods running on that node may go out of control. Katalyst introduces a centralized eviction mechanism that evicts all reclaimed pods to mitigate this problem. By default, if the readiness state keeps failing for 10 minutes, Katalyst taints the CNR as unschedulable to make sure no more reclaimed_cores pods can be scheduled to this node. If the readiness state keeps failing for 20 minutes, it tries to evict all reclaimed_cores pods.

taints:
- effect: NoScheduleForReclaimedTasks
  key: node.kubernetes.io/unschedulable
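
The two default timeouts form a simple escalation, sketched below. The function name and the returned strings are illustrative only; the 10 and 20 minute values are the defaults stated in this section.

```python
from datetime import timedelta

TAINT_AFTER = timedelta(minutes=10)  # default: taint the CNR as unschedulable
EVICT_AFTER = timedelta(minutes=20)  # default: evict all reclaimed_cores pods


def centralized_action(not_ready_for):
    """Action taken once the readiness state has kept failing for `not_ready_for`."""
    if not_ready_for >= EVICT_AFTER:
        return "evict reclaimed_cores pods"
    if not_ready_for >= TAINT_AFTER:
        return "taint CNR"
    return "none"


for minutes in (5, 10, 25):
    print(minutes, centralized_action(timedelta(minutes=minutes)))
```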

Further More

We will try to provide more tutorials alongside future feature releases.