Enhanced Platform Awareness (EPA) in OpenShift — Part III, NUMA Topology Awareness
All good things come in threes, so here you have the third part of the EPA on OpenShift series. If you want to take a look at the Hugepages or CPU pinning configuration, you can review the past two parts:
Or if you want to directly jump to any of the next Posts:
This time we are going to focus on how we can ensure that CPU scheduling takes into account the NUMA topology of the processor. As we will see, running our PODs on the “right” cores will improve the latency and jitter of memory access.
EPA aims to make the system more deterministic, sometimes at the cost of performance (remember that when we discussed HugePages we saw that loading large memory pages could reduce overall performance) or efficiency (i.e. when configuring CPU pinning we reserved CPUs for single PODs, making the usage of the overall node less efficient). NUMA Awareness sits in the “being deterministic at the cost of efficiency” group, but it can actually provide not only a deterministic system but also a more performant one (the same sometimes happens with CPU pinning).
This can be useful for some use cases. For example, imagine that you want to run an AI application that makes use of a GPU: it would be great if the CPUs that the application uses are “close” to that GPU (PCI address), because its requests don’t need to go through a lot of interconnecting links inside your processor. That’s exactly what NUMA Topology Awareness does.
NUMA Topology Awareness
Modern hardware divides the memory into “NUMA nodes” (NUMA = Non-Uniform Memory Access) and binds some CPUs to each of those nodes. This makes it possible to scale system performance by adding more CPUs or cores. The concept of NUMA nodes extends not only to memory but also to PCI I/O buses.
This division does not mean that a process running on a CPU bound to one NUMA node cannot access the memory or PCI devices (GPUs, SR-IOV NICs, …) bound to a different node. That access is possible, but it must go through an interconnect bus that impacts its latency. The jitter introduced by this interconnect means that the system cannot be deterministic.
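To get a feel for that cost on a real node, the Linux kernel exposes a relative “distance” between NUMA nodes in sysfs. This is just an illustrative check run from a node debug shell; the actual values depend on your hardware, with the diagonal representing local access and higher numbers meaning the interconnect has to be crossed:
$ oc debug node/<node>
sh-4.2# cat /sys/devices/system/node/node*/distance
10 21
21 10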
By default, Kubernetes schedules PODs based on available CPUs but does not pay attention to the NUMA topology. It’s possible to change this behavior in OpenShift by using Topology Manager, which makes NUMA topology awareness possible.
As a prerequisite, the CPU Manager policy needs to be set to “static” (we already did this in the “CPU pinning” section), so we can jump directly to the Topology Manager configuration.
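If you want to double-check that prerequisite before continuing (assuming you kept the same ‘cpumanager-enabled’ KubeletConfig from the CPU pinning post), a quick way is to read the cpuManagerPolicy field directly:
$ oc get kubeletconfig cpumanager-enabled -o jsonpath='{.spec.kubeletConfig.cpuManagerPolicy}{"\n"}'
static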
NUMA Topology Awareness configuration
1. Add the featureSet LatencySensitive specification
Add the ‘featureSet: LatencySensitive’ key:value under the featuregate/cluster object. As per the Kubernetes documentation, a feature gate is “a set of key=value pairs that describe Kubernetes features”. We will use it as a way to “turn on“ the NUMA Topology Awareness feature.
$ oc edit featuregate/cluster
or using ‘oc patch’ instead of ‘edit’:
$ oc patch featuregate cluster --type='json' -p='[{"op": "add", "path": "/spec/featureSet", "value": "LatencySensitive" }]'
It should look like this:
apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  annotations:
    release.openshift.io/create-only: "true"
  name: cluster
spec:
  featureSet: LatencySensitive
Or using the Web Console:
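Whichever method you use, you can run a quick sanity check to confirm that the featureSet was stored (the field queried below is the same one we just edited):
$ oc get featuregate cluster -o jsonpath='{.spec.featureSet}{"\n"}'
LatencySensitive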
2. Configure Topology Manager with a custom KubeletConfig
Now you have to configure the Topology Manager policy in the custom KubeletConfig. There are several Topology Manager policies that can be applied; from the documentation:
- none (default): This is the default policy and does not perform any topology alignment.
- best-effort: For each container in a Guaranteed Pod with the best-effort topology management policy, kubelet calls each Hint Provider to discover their resource availability. Using this information, the Topology Manager stores the preferred NUMA node affinity for that container. If the affinity is not preferred, Topology Manager will store this and admit the pod to the node anyway.
- restricted: For each container in a Guaranteed Pod with the restricted topology management policy, kubelet calls each Hint Provider to discover their resource availability. Using this information, the Topology Manager stores the preferred NUMA node affinity for that container. If the affinity is not preferred, Topology Manager will reject this pod from the node. This will result in a pod in a Terminated state with a pod admission failure.
- single-numa-node: For each container in a Guaranteed Pod with the single-numa-node topology management policy, kubelet calls each Hint Provider to discover their resource availability. Using this information, the Topology Manager determines if a single NUMA node affinity is possible. If it is, the pod will be admitted to the node. If this is not possible, then the Topology Manager will reject the pod from the node. This will result in a pod in a Terminated state with a pod admission failure (see the quick check right after this list to spot such rejections).
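With the ‘restricted’ and ‘single-numa-node’ policies, a rejection shows up as a failed pod admission. A simple way to spot it (in recent Kubernetes versions the reported reason is usually TopologyAffinityError, although the exact string may vary between releases) is to check the pod status and events:
$ oc get pod <pod name> -o jsonpath='{.status.phase}{" "}{.status.reason}{"\n"}'
$ oc describe pod <pod name> | grep -i topology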
In this case we will configure the ‘single-numa-node’ policy. We already have a KubeletConfig (named ‘cpumanager-enabled’) because we created it when configuring CPU pinning (we could have added the NUMA awareness at that time), so we just need to edit it:
$ oc edit kubeletconfig/cpumanager-enabled
After editing, it should look like this:
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: cpumanager-enabled
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: cpumanager-enabled
  kubeletConfig:
    cpuManagerPolicy: static
    cpuManagerReconcilePeriod: 5s
    topologyManagerPolicy: single-numa-node
If you want to use the Web Console:
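Whichever method you use, keep in mind that changing the KubeletConfig triggers a rolling update of the nodes in the selected MachineConfigPool (the ‘worker’ pool here, assuming you labeled it as in the CPU pinning post). Once the pool reports as updated, you can check on one of the nodes that the kubelet picked up the policy (the path below is where OpenShift 4 renders the kubelet configuration; adjust it if your version differs):
$ oc get machineconfigpool worker
$ oc debug node/<node>
sh-4.2# grep -i topologymanager /host/etc/kubernetes/kubelet.conf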
NUMA Topology Awareness testing
You can create a Guaranteed QoS POD (requests equal to limits) to test the NUMA awareness:
apiVersion: v1
kind: Pod
metadata:
  name: example-numa
  labels:
    app: example-numa
spec:
  containers:
  - name: hello-openshift
    image: openshift/origin-cli
    command: ["/bin/sh"]
    args: ["-c", "while true; do echo hello; sleep 10; done"]
    resources:
      requests:
        cpu: 4
        memory: "2G"
      limits:
        cpu: 4
        memory: "2G"
  nodeSelector:
    cpumanager: "true"
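Assuming you save that manifest as ‘example-numa.yaml’ (the file name is just for illustration), create the POD and confirm that it was classified as Guaranteed before checking the CPU placement:
$ oc create -f example-numa.yaml
$ oc get pod example-numa -o jsonpath='{.status.qosClass}{"\n"}'
Guaranteed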
Let’s review the CPUs that are allowed for this POD:
$ oc exec example-numa -- cat /sys/fs/cgroup/cpuset/cpuset.cpus
2-3,42-43
And now let’s check the NUMA topology of the node that is running that POD (you can easily find it by running ‘oc get pod <pod name> -o wide’):
$ oc debug node/<node>
sh-4.2# lscpu | grep -i numa
We can see the topology:
NUMA node(s): 2
NUMA node0 CPU(s): 0-19,40-59
NUMA node1 CPU(s): 20-39,60-79
As you can see, all CPUs used by the POD are associated with the same NUMA node (node0), so we avoid possible delays while going through the NUMA interconnect. Actually, if you pay attention, it selected the sibling threads of two physical cores.
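If you want to confirm that 2/42 and 3/43 are indeed hyperthread siblings, still from the debug shell on the node you can read the thread sibling lists that the kernel exposes in sysfs (the output shown matches this example topology; yours will depend on the hardware):
sh-4.2# cat /sys/devices/system/cpu/cpu2/topology/thread_siblings_list
2,42
sh-4.2# cat /sys/devices/system/cpu/cpu3/topology/thread_siblings_list
3,43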
So the scheduler selected that NUMA node based on the resources that we want to commit. But do you remember that we talked about locating the CPUs “close” to the GPU/SR-IOV NIC? Well, Hint Providers can be used for that, so this information is also taken into account when making the resource allocation decision. We will see this when explaining SR-IOV usage in the next Post, but the POD request would look like the one below; pay attention to the “limits” and “requests” sections, where we include a hint by asking for an additional resource:
apiVersion: v1
kind: Pod
metadata:
  name: dpdk-app
  namespace: <target_namespace>
  annotations:
    k8s.v1.cni.cncf.io/networks: intel-dpdk-network
spec:
  containers:
  - name: testpmd
    image: <DPDK_image>
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]
    volumeMounts:
    - mountPath: /dev/hugepages
      name: hugepage
    resources:
      limits:
        openshift.io/intelnics: "1"
        memory: "1Gi"
        cpu: "4"
        hugepages-1Gi: "4Gi"
      requests:
        openshift.io/intelnics: "1"
        memory: "1Gi"
        cpu: "4"
        hugepages-1Gi: "4Gi"
    command: ["sleep", "infinity"]
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
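By the way, if you ever want to verify to which NUMA node a given PCI device (an SR-IOV VF, a GPU, …) is attached, the kernel also exposes that in sysfs on the node; a value of -1 means the platform does not report the locality (the PCI address below is a placeholder):
sh-4.2# cat /sys/bus/pci/devices/<pci address>/numa_node
0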
Next episodes:
Past episodes: