Enhanced Platform Awareness (EPA) in OpenShift — Part III, NUMA Topology Awareness
All good things come in threes, so here you have the third part of the EPA on OpenShift series. If you want to take a look at the Hugepages or CPU pinning configuration, you can review the past two parts:
Or if you want to directly jump to any of the next Posts:
This time we are going to focus on how we can ensure that CPU scheduling takes into account the NUMA topology of the processor. As we will see, running our PODs on the “right” cores will improve the latency and jitter of memory access.
EPA aims to make the system more deterministic, sometimes at the cost of performance (remember that when we discussed HugePages we saw that loading large memory pages could reduce overall performance) or efficiency (i.e. when configuring CPU pinning we reserved CPUs for single PODs, making the usage of the overall node less efficient). NUMA Awareness sits in the “being deterministic at the cost of efficiency” group, but it can actually provide not only a deterministic system but also a more performant one (the same sometimes happens with CPU pinning).
This can be useful for some use cases. For example, imagine that you want to run an AI application that makes use of a GPU: it would be great if the CPUs that the application uses are “close” to that GPU (PCI address), because its requests don’t need to go through a lot of interconnecting links inside your processor. That’s exactly what NUMA Topology Awareness does.
NUMA Topology Awareness
Modern hardware divides the memory into “NUMA nodes” (NUMA = Non-Uniform Memory Access) and binds some CPUs to each of those nodes. This makes it possible to scale system performance by adding more CPUs or cores. The concept of NUMA nodes extends not only to memory but also to PCI I/O buses.
This division does not mean that a process running on a CPU bound to one NUMA node cannot access the memory or PCI devices (GPUs, SR-IOV NICs, …) bound to a different node. That access is possible, but it must go through an interconnect bus that impacts its latency. The jitter introduced by this interconnect means that the system cannot be deterministic.
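To get a feel for that cost on a real node, the Linux kernel exposes a relative “distance” between NUMA nodes in sysfs. This is just an illustrative check run from a node debug shell; the actual values depend on your hardware, with the diagonal representing local access and higher numbers meaning the interconnect has to be crossed:
$ oc debug node/<node>
sh-4.2# cat /sys/devices/system/node/node*/distance
10 21
21 10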
By default, Kubernetes schedules PODs based on available CPUs but does not pay attention to the NUMA topology. It’s possible to change this behavior in OpenShift by using Topology Manager, which makes NUMA topology awareness possible.
As a prerequisite, the CPU Manager policy needs to be set to “static” (we already did this in the “CPU pinning” section), so we can jump directly to the Topology Manager configuration.
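If you want to double-check that prerequisite before continuing (assuming you kept the same ‘cpumanager-enabled’ KubeletConfig from the CPU pinning post), a quick way is to read the cpuManagerPolicy field directly:
$ oc get kubeletconfig cpumanager-enabled -o jsonpath='{.spec.kubeletConfig.cpuManagerPolicy}{"\n"}'
static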
NUMA Topology Awareness configuration
1. Add the featureSet LatencySensitive specification
Add the ‘featureSet: LatencySensitive’ key:value under the featuregate/cluster object. As per the Kubernetes documentation, a feature gate is “a set of key=value pairs that describe Kubernetes features”. We will use it as a way to “turn on“ the NUMA Topology Awareness feature.
$ oc edit featuregate/cluster
or using ‘oc patch’ instead of ‘edit’:
$ oc patch featuregate cluster --type='json' -p='[{"op": "add", "path": "/spec/featureSet", "value": "LatencySensitive" }]'
It should look like this:
apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  annotations:
    release.openshift.io/create-only: "true"
  name: cluster
spec:
  featureSet: LatencySensitive
Or using the Web Console:
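Whichever method you use, you can run a quick sanity check to confirm that the featureSet was stored (the field queried below is the same one we just edited):
$ oc get featuregate cluster -o jsonpath='{.spec.featureSet}{"\n"}'
LatencySensitive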
2. Configure Topology Manager with a custom KubeletConfig
Now you have to configure the Topology Manager policy in the custom KubeletConfig. There are several Topology Manager policies that can be applied; from the documentation:
- none (default): This is the default policy and does not perform any topology alignment.
- best-effort: For each container in a Guaranteed Pod with the best-effort topology management policy, kubelet calls each Hint Provider to discover their resource availability. Using this information, the Topology Manager stores the preferred NUMA node affinity for that container. If the affinity is not preferred, Topology Manager will store this and admit the pod to the node anyway.
- restricted: For each container in a Guaranteed Pod with the restricted topology management policy, kubelet calls each Hint Provider to discover their resource availability. Using this information, the Topology Manager stores the preferred NUMA node affinity for that container. If the affinity is not preferred, Topology Manager will reject this pod from the node. This will result in a pod in a Terminated state with a pod admission failure.
- single-numa-node: For each container in a Guaranteed Pod with the single-numa-node topology management policy, kubelet calls each Hint Provider to discover their resource availability. Using this information, the Topology Manager determines if a single NUMA node affinity is possible. If it is, the pod will be admitted to the node. If this is not possible, then the Topology Manager will reject the pod from the node. This will result in a pod in a Terminated state with a pod admission failure (see the quick check right after this list to spot such rejections).
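With the ‘restricted’ and ‘single-numa-node’ policies, a rejection shows up as a failed pod admission. A simple way to spot it (in recent Kubernetes versions the reported reason is usually TopologyAffinityError, although the exact string may vary between releases) is to check the pod status and events:
$ oc get pod <pod name> -o jsonpath='{.status.phase}{" "}{.status.reason}{"\n"}'
$ oc describe pod <pod name> | grep -i topology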
In this case we will configure the ‘single-numa-node’ policy. We already have a KubeletConfig (named ‘cpumanager-enabled’) because we created it when configuring CPU pinning (we could have added the NUMA awareness at that time), so we just need to edit it:
$ oc edit kubeletconfig/cpumanager-enabled
After editing, it should look like this:
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: cpumanager-enabled
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: cpumanager-enabled
  kubeletConfig:
    cpuManagerPolicy: static
    cpuManagerReconcilePeriod: 5s
    topologyManagerPolicy: single-numa-node
If you want to use the Web Console:
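Whichever method you use, keep in mind that changing the KubeletConfig triggers a rolling update of the nodes in the selected MachineConfigPool (the ‘worker’ pool here, assuming you labeled it as in the CPU pinning post). Once the pool reports as updated, you can check on one of the nodes that the kubelet picked up the policy (the path below is where OpenShift 4 renders the kubelet configuration; adjust it if your version differs):
$ oc get machineconfigpool worker
$ oc debug node/<node>
sh-4.2# grep -i topologymanager /host/etc/kubernetes/kubelet.conf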
NUMA Topology Awareness testing
You can create a Guaranteed QoS POD (requests equal to limits) to test the NUMA awareness:
apiVersion: v1
kind: Pod
metadata:
  name: example-numa
  labels:
    app: example-numa
spec:
  containers:
  - name: hello-openshift
    image: openshift/origin-cli
    command: ["/bin/sh"]
    args: ["-c", "while true; do echo hello; sleep 10; done"]
    resources:
      requests:
        cpu: 4
        memory: "2G"
      limits:
        cpu: 4
        memory: "2G"
  nodeSelector:
    cpumanager: "true"
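Assuming you save that manifest as ‘example-numa.yaml’ (the file name is just for illustration), create the POD and confirm that it was classified as Guaranteed before checking the CPU placement:
$ oc create -f example-numa.yaml
$ oc get pod example-numa -o jsonpath='{.status.qosClass}{"\n"}'
Guaranteed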
Let’s review the CPUs that are allowed for this POD:
$ oc exec example-numa -- cat /sys/fs/cgroup/cpuset/cpuset.cpus
2-3,42-43
And now let’s check the NUMA topology of the node that is running that POD (you can easily find it by running ‘oc get pod <pod name> -o wide’):
$ oc debug node/<node>
sh-4.2# lscpu | grep -i numa
We can see the topology:
NUMA node(s): 2
NUMA node0 CPU(s): 0-19,40-59
NUMA node1 CPU(s): 20-39,60-79
As you can see, all CPUs used by the POD are associated with the same NUMA node (node0), so we avoid possible delays while going through the NUMA interconnect. Actually, if you pay attention, it selected the sibling threads of two physical cores.
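If you want to confirm that 2/42 and 3/43 are indeed hyperthread siblings, still from the debug shell on the node you can read the thread sibling lists that the kernel exposes in sysfs (the output shown matches this example topology; yours will depend on the hardware):
sh-4.2# cat /sys/devices/system/cpu/cpu2/topology/thread_siblings_list
2,42
sh-4.2# cat /sys/devices/system/cpu/cpu3/topology/thread_siblings_list
3,43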
So the scheduler selected that NUMA node based on the resources that we want to commit. But do you remember that we talked about locating the CPUs “close” to the GPU/SR-IOV NIC? Well, Hint Providers can be used for that, so this information is also taken into account when making the resource allocation decision. We will see this when explaining SR-IOV usage in the next Post, but the POD request would look like the one below; pay attention to the “limits” and “requests” sections, where we include a hint by asking for an additional resource:
apiVersion: v1
kind: Pod
metadata:
  name: dpdk-app
  namespace: <target_namespace>
  annotations:
    k8s.v1.cni.cncf.io/networks: intel-dpdk-network
spec:
  containers:
  - name: testpmd
    image: <DPDK_image>
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]
    volumeMounts:
    - mountPath: /dev/hugepages
      name: hugepage
    resources:
      limits:
        openshift.io/intelnics: "1"
        memory: "1Gi"
        cpu: "4"
        hugepages-1Gi: "4Gi"
      requests:
        openshift.io/intelnics: "1"
        memory: "1Gi"
        cpu: "4"
        hugepages-1Gi: "4Gi"
    command: ["sleep", "infinity"]
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
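By the way, if you ever want to verify to which NUMA node a given PCI device (an SR-IOV VF, a GPU, …) is attached, the kernel also exposes that in sysfs on the node; a value of -1 means the platform does not report the locality (the PCI address below is a placeholder):
sh-4.2# cat /sys/bus/pci/devices/<pci address>/numa_node
0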
Next episodes:
Past episodes: