Enhanced Platform Awareness (EPA) in OpenShift — Part II, CPU pinning
This is the second part about how to configure an “EPA ready” OpenShift worker node (OpenShift 4.4). As already mentioned in Part I, I’m using the standard way of configuring each feature instead of the Performance Addon Operator, which is coming in upcoming releases and which I will show in another post.
If you want to take a look at the next parts:
Let’s review how to configure CPU pinning and CPU isolation!
CPU pinning
When a process needs to be allocated to a CPU, the kernel scheduler treats all CPUs as “available” for scheduling. This makes the usage of CPU resources more efficient but, at the same time, makes the system less deterministic, since a process can share a CPU with another process that could be using it at the same time.
Kubernetes measures CPU resources in milliCPUs (1000 milliCPUs = 1 CPU), which normally means CPUs are shared between PODs, but we can configure the kubelet to dedicate complete CPUs to a POD. With the static CPU Manager policy, a POD with Guaranteed QoS (requests equal to limits) that asks for a whole number of CPUs (for example 2000 milliCPUs, i.e. 2 CPUs) will get 2 “full” CPUs locked and reserved for that POD, and only for that POD. No other POD will be able to use those CPUs, so the overall system will be less efficient, but you are assured that no other POD will be running on the CPUs that your POD is using. You can configure this CPU pinning in OpenShift using CPU Manager.
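As a quick reference, this is a minimal sketch (values are illustrative) of a resources section that qualifies for exclusive CPUs under the static policy: requests equal to limits and a whole number of CPUs:

resources:
  requests:
    cpu: 2
    memory: "1G"
  limits:
    cpu: 2
    memory: "1G"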
Apart from the workloads running on PODs, we also need to take into account the CPU cycles of all processes not related to Kubernetes PODs (i.e. Operating System processes) as part of the CPU isolation tasks. We can reserve CPUs for those processes, isolating them from the CPUs intended to run the PODs. Why do this? Because although we have configured CPU pinning for the PODs, the rest of the processes on that node (the non-Kubernetes workloads) are free to run on any CPU of the system, potentially impacting the CPU response latency in our POD. This configuration can be done by including some kernel boot arguments on the nodes.
CPU pinning configuration
1. Label the nodes
This is optional but recommended:
$ oc label node <node> cpumanager=true
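If you want to confirm which nodes carry the label, a quick check is:

$ oc get nodes -l cpumanager=true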
2. Edit the machineConfigPool of those nodes
Edit the MachineConfigPool of the worker nodes and add the label “custom-kubelet: cpumanager-enabled” (remember to use “worker-epa” if you created the new role for the EPA worker nodes):
$ oc edit machineconfigpool worker
And include that label:
...
metadata:
  ...
  labels:
    custom-kubelet: cpumanager-enabled
...
You can use a patch instead of ‘oc edit’ if you want:
oc patch machineconfigpool worker --type='json' -p='[{"op": "add", "path": "/metadata/labels/custom-kubelet", "value": "cpumanager-enabled" }]'
You can also use the Web Console to make that change:
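Whichever method you choose, you can verify that the label is now on the pool with something like:

$ oc get machineconfigpool worker --show-labels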
3. Create the customized kubeletConfig
After creating that label, you just need to create a custom KubeletConfig object that enables the CPU Manager feature for the nodes of the pools matching that label (be aware that the Machine Config Operator (MCO) will reboot those nodes):
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: cpumanager-enabled
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: cpumanager-enabled
  kubeletConfig:
    cpuManagerPolicy: static
    cpuManagerReconcilePeriod: 5s
Create the object with the ‘oc’ client:
$ oc create -f cpumanager-enabled.yaml
Or using the Web Console (this time we are using the “+” button, which lets you create any supported object in the cluster instead of looking for the object type):
After creating that object you will find that new MachineConfigs have been created. This time, instead of opening an SSH session to the node, we use the ‘oc debug’ command, which runs a pod on that node and opens a remote shell into it, so we can review the node configuration without using SSH:
$ oc debug node/<node>
sh-4.2# cat /host/etc/kubernetes/kubelet.conf | grep cpuManager
You should see this in the command output:
..."cpuManagerPolicy":"static","cpuManagerReconcilePeriod":"5s"...
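Since the MCO rolls the change out node by node (rebooting them), if you don’t see the parameter yet you can check whether the pool has finished updating, for example with:

$ oc get machineconfigpool worker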
4. Split CPUs for Workloads from CPUs for Operating System
As mentioned in the introduction, if you want a deterministic system it makes sense not to mix CPU cycles of Operating System processes and workloads on the same CPUs, so your PODs are not affected by other node system tasks.
We have two approaches to split OS and workload resources: configuring the ‘isolcpus’ kernel argument on the nodes, or configuring resource reservations so the CPU Manager handles CPU scheduling at that level (actually, with the first approach the CPU Manager is the one using the cores isolated by isolcpus, because in the latest versions it is aware of such configuration).
The ‘isolcpus’ approach works if all PODs on the node use guaranteed resources (requests equal to limits), because those are the ones that the CPU Manager will schedule on the isolated CPUs. PODs with non-guaranteed resources won’t use the reserved CPUs and will land on the “non-isolated” CPUs (the ones intended to be used by the Operating System).
The second approach is to configure the regular CPU Manager resource reservation. It is valid for a host running both guaranteed and non-guaranteed workloads, but it does not dedicate CPUs; it only assures that the CPU cycles of each type won’t go beyond the configured limit, so one side cannot eat into the resources destined for the other. It does not, however, solve the latency/jitter problem implied by sharing CPUs.
In our EPA node all workloads should use guaranteed resources, so let’s modify the kernel arguments by editing the MachineConfig that we created before for the HugePages configuration. In this case we need to add the isolcpus argument, listing the cores that will be used for workloads.
You can use the ‘oc’ client:
oc edit machineconfig 52-kargs-hugepages-1g
My nodes have 4 cores (remember that if you have multithreading enabled you should group threads of the same physical core, so do not split a physical core between isolated and non-isolated ranges). I reserve the first core for the OS and for the PODs with no guaranteed resources. Remember to save resources for such workloads, since by default some PODs without guaranteed resources will be running even on an “empty” node (think about daemonSets, for example).
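If you are not sure how threads map to physical cores on your nodes, one way to check the topology (the exact output depends on your hardware) is from a debug pod:

$ oc debug node/<node>
sh-4.2# lscpu -e=CPU,CORE,SOCKET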
The MachineConfig should look like this (change the “worker” role to “worker-epa” if you decided to create that new role):
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 52-kargs-hugepages-1g
spec:
  kernelArguments:
    - default_hugepagesz=1G
    - hugepagesz=1G
    - isolcpus=1,2,3
Remember that ranges can be configured instead of specifying cores one-by-one.
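For example, with the configuration above the same isolation could be expressed as:

- isolcpus=1-3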
It also can be done using the Web Console:
That covers the CPU isolation, but if we also want to be sure that resource consumption from the workloads doesn’t affect the Operating System, we can additionally configure resource reservations in the kubelet (although that only covers this risk, not CPU isolation). You can edit the already created KubeletConfig and add the systemReserved (for the OS) and kubeReserved (for the kubelet) parameters; what is left over is what remains available to run PODs (minus the eviction threshold, which I haven’t configured here). You can find this diagram in the Kubernetes documentation:
Node Capacity
---------------------------
| kube-reserved |
|-------------------------|
| system-reserved |
|-------------------------|
| eviction-threshold |
|-------------------------|
| |
| allocatable |
| (available for pods) |
| |
| |
---------------------------
The result should look like this in my case:
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: cpumanager-enabled
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: cpumanager-enabled
  kubeletConfig:
    cpuManagerPolicy: static
    cpuManagerReconcilePeriod: 5s
    systemReserved:
      cpu: 500m
      memory: 1Gi
    kubeReserved:
      cpu: 500m
      memory: 14Gi
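Once the kubelet picks up these reservations, the node’s allocatable resources should shrink accordingly; you can verify it, for example, with:

$ oc describe node <node> | grep -A 6 Allocatable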
CPU pinning testing
Let’s create a POD using CPU pinning. In this case it requests a single CPU, and we include a nodeSelector to be sure that the POD will be hosted by our configured node:
apiVersion: v1
kind: Pod
metadata:
  name: example
  labels:
    app: hello-openshift
  namespace: test
spec:
  containers:
  - name: hello-openshift
    image: openshift/origin-cli
    command: ["/bin/sh"]
    args: ["-c", "while true; do echo hello; sleep 10;done"]
    resources:
      requests:
        cpu: 1
        memory: "1G"
      limits:
        cpu: 1
        memory: "1G"
  nodeSelector:
    cpumanager: "true"
Create the test pod:
$ oc create -f pod-cpupinning.yaml
Jump into the node where the POD is running and check the cgroups:
$ ssh core@<node>
[root@<node> ~]# systemd-cgls
You will see a big output, but you can look for the ‘kubepods.slice’ entries (PODs with Guaranteed QoS are placed directly under kubepods.slice instead of under the burstable/besteffort sub-slices), and more specifically for the one running the command of the POD that we created.
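If scrolling through the systemd-cgls output is tedious, another option on a CRI-O node is to ask the container runtime for the PID directly, with something like:

[root@<node> ~]# crictl ps --name hello-openshift -q
[root@<node> ~]# crictl inspect <container-id> | grep pid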
In my case, the process number is “3825715”. Using that process number you can review the CPU list that the POD can use (POD is using core 1 in this case):
[root@<node> ~]# grep ^Cpus_allowed_list /proc/3825715/status
Cpus_allowed_list: 1
You can also check this value without having to jump into the node (it only works if the container image has a shell available):
$ oc exec example -- cat /sys/fs/cgroup/cpuset/cpuset.cpus
1
Or using the Web Console:
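Another way to review the assignments without inspecting cgroups is the CPU Manager state file that the kubelet keeps on the node; it should show the default shared CPU set and the CPUs exclusively assigned to containers:

[root@<node> ~]# cat /var/lib/kubelet/cpu_manager_state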
If you check any other POD that has no guaranteed resources, you will see that it does not have CPU pinning enforced (this node has 4 CPUs):
[root@<node> ~]# grep ^Cpus_allowed_list /proc/3941/status
Cpus_allowed_list: 0-3
Although it seems that the non-guaranteed QoS POD is allowed to use any CPU, in practice it won’t be scheduled on the isolated CPUs because of the isolcpus kernel argument that we configured before, and CPU 1 is in any case exclusively assigned to the example POD created above, so it effectively ends up running on CPU 0.
If you want to double-check that it’s working properly, you can create new PODs on the same node and check how many of them can be scheduled. For example, my node has 4 cores, but one of them is kept for the Operating System, so in theory I can only create PODs consuming 3 full CPUs. I will reuse the previous POD description that requests 1 CPU, so I will be able to run only three of those PODs on that node.
Please note that I’ve included the ‘nodeName’ key in the POD description, so I’m sure that all PODs will be forced onto the same node. The POD description is the following (remember to configure requests equal to limits for guaranteed resources):
apiVersion: v1
kind: Pod
metadata:
  generateName: cpumanager-
spec:
  containers:
  - name: cpumanager
    image: gcr.io/google_containers/pause-amd64:3.0
    resources:
      requests:
        cpu: 1
        memory: "1G"
      limits:
        cpu: 1
        memory: "1G"
  nodeSelector:
    cpumanager: "true"
  nodeName: ocp-qkdnr-worker-0-j4cg7
Let’s try it out!:
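A possible way to run the test (the file name is just an example) is to create the POD several times, which works thanks to generateName, and then check where the PODs landed; with only three full CPUs available for PODs on this node, the fourth one should stay in Pending state:

$ for i in 1 2 3 4; do oc create -f pod-cpumanager.yaml; done
$ oc get pods -o wide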
In the next Post we will be reviewing how to deal with NUMA topologies in OpenShift:
You can also take a look again to the Part I where the HugePages configuration is explained:
.. or check other Post of the series: