Enhanced Platform Awareness (EPA) in OpenShift — Bonus Track: Performance Addon Operator

Luis Javier Arizmendi Alonso
16 min read · Jul 2, 2020

This Post extends the “Enhanced Platform Awareness (EPA) in OpenShift” series by presenting the Performance Addon Operator, which configures many of the “EPA features” that we reviewed in the previous Posts without having to configure them one by one, and which also applies additional tuning to the Worker node Operating System that makes it better suited to accommodate performant, deterministic workloads.

Note: This Post is slightly different from the previous ones, where we were using General Availability or Technology Preview OpenShift 4.4 features. In this case we show capabilities that will be available in the near future with OpenShift and that cover additional gaps to get a node in your Cluster prepared for deterministic workloads. In OpenShift 4.5 (I will be using an OpenShift 4.5 release candidate version for these tests, not the GA OCP 4.4 used in previous Posts), the Performance Addon Operator is delivered as a “Developer Preview” feature.

If you want to check each feature configuration or review the concepts, please read the previous Posts:

Performance Addon Operator

The Performance Addon Operator optimizes OpenShift clusters for latency-sensitive applications by configuring features such as HugePages, CPU pinning, and NUMA Awareness scheduling, but also by installing a Real-Time Kernel and configuring the Operating System to run as a Real-Time System.

Note that the Performance Addon Operator does not configure network-related features such as SR-IOV, which was explained in another Post.

We already described HugePages, CPU pinning, and NUMA Awareness scheduling, but we didn’t cover Real-Time Systems, so let’s begin by explaining this concept.

Real-Time Kernel

In the previous Posts we have seen how, in order to obtain a more deterministic and performant system, we should bypass the Kernel space (with DPDK or RDMA), because by doing so we avoid as many Kernel interrupts as possible. This works well, but it is a “patch” to the problem, since other subsystems of the Operating System won’t be able to skip those interrupts.

By using a Real-Time System, which embeds a Real-Time Kernel, we can work in an even more deterministic way. A Real-Time Kernel is software that manages the microprocessor’s time to ensure that time-critical events are processed as efficiently as possible, always trying to execute the highest-priority task that is ready to run.

Linux was designed as a general-purpose Operating System, which means the architecture is meant to achieve good performance for multiple applications running at the same time, without being optimized for any particular one of them. That makes it difficult to run a time-critical process on time.

That does not mean that Linux cannot be used for real-time, deterministic, low-latency applications. One option is to use Linux with a hypervisor that has priority over the Linux kernel and is where the real-time tasks are performed, which would be valid for systems running VMs.

For systems that do not need a hypervisor, the most common approach is to apply a patch known as PREEMPT_RT to the kernel, which adds preemption points where the OS can stop the execution of one process and give time to a critical one. The drawback of this approach is that it decreases the overall performance.

There is another solution to the problem: decoupling the real-time part of the OS from the general-purpose kernel. This makes it possible to optimize the real-time part separately to meet timing deadlines while allowing the rest of the system to deliver the best possible performance, so the best of both worlds. This is the technique used by RTLinux: “RTLinux was designed to share a computing device between a real-time and non-real-time operating system so that the real-time operating system could never be blocked from execution by the non-real-time operating system and components running in the two different environments could easily share data.” You can find here a nice overview of how the RTLinux approach makes it possible to differentiate between processes using the “regular” Linux Kernel and the Real-Time Kernel; it adds to the common Linux architecture a new layer that owns the preemption of Real-Time tasks:

As you know, OpenShift 4 uses Red Hat CoreOS (RHCOS) as the underlying Operating System (Worker nodes could run Red Hat Enterprise Linux, but RHCOS is encouraged because of the great integration you get between the OS and OpenShift). Recently it became possible to install “kernel-rt”, which is a Kernel with the PREEMPT_RT patch (ok, it’s not a true hard real-time system like RTLinux, but it’s a step ahead). This means that we can make RHCOS more deterministic at the cost of overall System performance, which is good enough.

Besides the Real-Time Kernel, more settings are required to assure a Real-Time System (BIOS, OS settings, etc). You can check some of the baseline and real-time specific best practices while configuring your systems (it’s RHEL 8 documentation, but remember that RHCOS is based on RHEL 8):

The Performance Addon Operator helps with that: it does not just set up the RHCOS Real-Time Kernel, it also makes some tweaks in the Operating System to provide a truly Real-Time RHCOS System (we will see some of these tweaks during the testing section).

Performance Addon Operator configuration

The steps to configure the Performance Addon Operator are:

  1. Prepare the Operator’s namespace
  2. Install the Performance Addon Operator
  3. Include the required labels
  4. Create the PerformanceProfile CRD

1. Prepare the Operator’s namespace

The first step is to create the namespace where the Operator will be installed. We need to follow the same steps as when we installed the SR-IOV Operator: first, we create the namespace with the label openshift.io/run-level: "1". We also have to create an OperatorGroup if we plan to install the Operator from the CLI, since we are installing it in a namespace other than openshift-operators.

You can do it using the following object descriptors in the CLI, for the Namespace:

apiVersion: v1
kind: Namespace
metadata:
  name: performance-addon-operator
  labels:
    openshift.io/run-level: "1"

and for the OperatorGroup (only needed if you plan to install the Operator using the CLI in the next step)

apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: performance-addon-operator
  namespace: performance-addon-operator
spec:
  targetNamespaces:
    - performance-addon-operator

If you prefer to do it in the Web Console:

2. Install the Performance Addon Operator

The next step is to install the Performance Addon Operator. If you use the CLI you need to create a “Subscription” object. In that object, you need the name of the operator along with the channel. If you want to double-check the exact channel name, you can run this command:

oc get packagemanifest performance-addon-operator -n openshift-marketplace -o jsonpath='{.status.defaultChannel}'

Once you know the channel (“4.4” in this case, since 4.5 is still not available at this time… and remember this point because we will come back to it later) you can prepare the object descriptor:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: performance-addon-operator-subscription
  namespace: performance-addon-operator
spec:
  channel: "4.4"
  name: performance-addon-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace

In the Web Console it is much easier, since you just need to search for the Operator, click install, and make sure that you are installing it only in the already created namespace (this operator does not create the namespace for you):
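Whichever method you use, a quick way to verify the install from the CLI is to check the Operator’s ClusterServiceVersion in the namespace (the CSV should eventually reach the “Succeeded” phase):

$ oc get csv -n performance-addon-operator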

3. Include required labels

The Performance Addon Operator needs two labels:

  • NodeSelector, which is usually the role label (remember this if you created a custom role for your EPA nodes)
  • MachineConfigPoolSelector, which selects the MachineConfigPool to be changed by the Operator (for example, to modify the kubelet configuration)

You should already have the role label, but you will need to configure the MachineConfigPoolSelector label. By default the label will be this one:

machineconfiguration.openshift.io/role=<same role in NodeSelector>

…but you can configure whatever you want, just remember to configure the same label in the PerformanceProfile definition (check the next section).

You can apply the label from the CLI:

$ oc label machineconfigpools.machineconfiguration.openshift.io worker machineconfiguration.openshift.io/role=worker

or on the Web Console:

If you created other roles following the steps mentioned in the first Post, bear in mind that this operator won’t use the node labels that we included as part of those roles, only the ones mentioned above.
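If you want to double-check which labels your MachineConfigPool currently has before moving on, you can list them from the CLI:

$ oc get machineconfigpool worker --show-labels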

4. Create the PerformanceProfile CRD

Here is where the actual configuration resides. You can check the spec options that can be configured in the Performance Addon Operator GitHub repo.
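If the CRD schema is published in your cluster (it may not be, this is just a convenience worth trying), you can also browse the available spec fields directly from the CLI instead of the repo:

$ oc explain performanceprofile.spec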

Let’s start with the HugePages part:

...
...
hugepages:
  defaultHugepagesSize: "1G"
  pages:
    - size: "1G"
      count: 12
      node: 1
...
...

As you can see, along with the HugePages count and size, there is also a “node” parameter that you can use to assign the HugePages only on the NUMA node indicated by that parameter (by default the HugePages are split across all NUMA nodes). We will see below when this can be useful.

In the CPU section, we have two parameters:

  • isolated: selects the CPUs assigned to application threads, giving them the most execution time possible and the lowest latency. Processes running on these CPUs are not interrupted
  • reserved: selects the CPUs that won’t be used by any workload, so CPUs reserved for the Operating System (also known as housekeeping CPUs)

You can reserve cores, or threads, for operating system housekeeping tasks from a single NUMA node and put your workloads on another NUMA node. The reason for this is that the housekeeping processes might be using the CPUs in a way that would impact latency-sensitive processes running on those same CPUs. Keeping your workloads on a separate NUMA node prevents the processes from interfering with each other. Additionally, each NUMA node has its own memory bus that is not shared.

If you want to do such a thing, you will first need to know which CPUs are attached to which NUMA node. You can do it by running the following command:

$ oc debug node/<node>
sh-4.2# lscpu | grep -i numa

The result will show the CPU distribution among the NUMA Nodes, in my case:

NUMA node(s): 2
NUMA node0 CPU(s): 0-2
NUMA node1 CPU(s): 3-5

In OpenShift 4.5 this can also be done using the Web Console, via the new “Terminal” tab that you can find on the node, which makes it possible to access the nodes as if you were using “oc debug node/<node name>”:

Since I will configure the Real-Time Kernel too, I will follow this advice of running my workloads on a different NUMA node than the housekeeping CPUs, so I select CPUs 3-5 as isolated (remember that I also reserved HugePages only on NUMA node 1). I’ve found that even though I just want one core for housekeeping tasks, if I define reserved: 0 to use only CPU 0, some workloads will use the remaining CPUs that are not in the isolated list, so in the end I had to set reserved: 0-2.

With this setup I will use fewer CPUs for my real-time app (CPUs 0-2 won’t be used), but I will assure better determinism since I’m using a “dedicated” NUMA node, as mentioned earlier (if you don’t need this level of determinism it would be better to maximize the number of CPUs, in this case configuring reserved: 0 and isolated: 1-5).

...
...
cpu:
  isolated: "3-5"
  reserved: "0-2"
...
...

NUMA configuration is straightforward, we just need to include the NUMA policy (check the NUMA Awareness Post if you want to know more about these NUMA policies):

...
...
numa:
  topologyPolicy: "single-numa-node"
...
...

Enabling the Real-Time Kernel is also easy:

...
...
realTimeKernel:
  enabled: true
...
...

There are more configuration parameters; for example, you might find it useful to configure some additional Kernel Parameters:

...
...
additionalKernelArgs:
  - "nmi_watchdog=0"
  - "audit=0"
  - "mce=off"
  - "processor.max_cstate=1"
  - "idle=poll"
  - "intel_idle.max_cstate=0"
...
...

Finally, this is the complete descriptor. Remember to pay attention to the “nodeSelector” and “machineConfigPoolSelector”. In this case, I will configure all my nodes with the role “worker”, but if you configured a role for the EPA workers (as shown in the first Post) you should include the right label here. Also bear in mind that, despite being the default value, I specify the machineConfigPoolSelector here (to let you know that you can configure it right here if you want to use your custom label).

apiVersion: performance.openshift.io/v1alpha1
kind: PerformanceProfile
metadata:
  name: rt-performanceprofile
spec:
  additionalKernelArgs:
    - nmi_watchdog=0
    - audit=0
    - mce=off
    - processor.max_cstate=1
    - idle=poll
    - intel_idle.max_cstate=0
  cpu:
    isolated: 3-5
    reserved: 0-2
  hugepages:
    defaultHugepagesSize: 1G
    pages:
      - count: 12
        node: 1
        size: 1G
  machineConfigPoolSelector:
    machineconfiguration.openshift.io/role: worker
  nodeSelector:
    node-role.kubernetes.io/worker: ''
  numa:
    topologyPolicy: "single-numa-node"
  realTimeKernel:
    enabled: true

If you want to create it using the Web Console, with OpenShift 4.5 you now have the option to do it by filling in the form, or using the traditional YAML paste:

The node configuration will take some time, including reboots, so be patient…
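As a rough way to follow the progress from the CLI, you can watch the MachineConfigPool status and the nodes while they update and reboot:

$ oc get machineconfigpool worker
$ oc get nodes -w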

Performance Addon Operator testing

The Performance Addon Operator creates a new MachineConfig (“performance-rt-performanceprofile”) applied to the nodes, which contains all the configuration that we did in the previous step.
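You can retrieve that object yourself from the CLI (its name is the PerformanceProfile name prefixed with “performance-”):

$ oc get machineconfig performance-rt-performanceprofile -o yaml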

If you want to check the detail, this is the content of the file (take a look and try to find all the parameters that we configured above).

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  selfLink: >-
    /apis/machineconfiguration.openshift.io/v1/machineconfigs/performance-rt-performanceprofile
  resourceVersion: '82887'
  name: performance-rt-performanceprofile
  uid: 4c99bbb2-01f4-46b4-b644-10bcbe693e54
  creationTimestamp: '2020-07-01T10:52:16Z'
  generation: 1
  managedFields:
    - apiVersion: machineconfiguration.openshift.io/v1
      fieldsType: FieldsV1
      fieldsV1:
        'f:metadata':
          'f:labels':
            .: {}
            'f:machineconfiguration.openshift.io/role': {}
          'f:ownerReferences':
            .: {}
            'k:{"uid":"80082826-df12-45d6-8304-8cd4d46d15db"}':
              .: {}
              'f:apiVersion': {}
              'f:blockOwnerDeletion': {}
              'f:controller': {}
              'f:kind': {}
              'f:name': {}
              'f:uid': {}
        'f:spec':
          .: {}
          'f:config':
            .: {}
            'f:ignition':
              .: {}
              'f:config': {}
              'f:security':
                .: {}
                'f:tls': {}
              'f:timeouts': {}
              'f:version': {}
            'f:networkd': {}
            'f:passwd': {}
            'f:storage':
              .: {}
              'f:files': {}
            'f:systemd':
              .: {}
              'f:units': {}
          'f:fips': {}
          'f:kernelArguments': {}
          'f:kernelType': {}
          'f:osImageURL': {}
      manager: performance-operator
      operation: Update
      time: '2020-07-01T10:52:16Z'
  ownerReferences:
    - apiVersion: performance.openshift.io/v1alpha1
      blockOwnerDeletion: true
      controller: true
      kind: PerformanceProfile
      name: rt-performanceprofile
      uid: 80082826-df12-45d6-8304-8cd4d46d15db
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      config: {}
      security:
        tls: {}
      timeouts: {}
      version: 2.2.0
    networkd: {}
    passwd: {}
    storage:
      files:
        - contents:
            source: >-
              data:text/plain;charset=utf-8;base64,IyEvdXNyL2Jpbi9lbnYgYmFzaAoKc2V0IC1ldW8gcGlwZWZhaWwKClNZU1RFTV9DT05GSUdfRklMRT0iL2V0Yy9zeXN0ZW1kL3N5c3RlbS5jb25mIgpTWVNURU1fQ09ORklHX0NVU1RPTV9GSUxFPSIvZXRjL3N5c3RlbWQvc3lzdGVtLmNvbmYuZC9zZXRBZmZpbml0eS5jb25mIgoKaWYgWyAtZiAvZXRjL3N5c2NvbmZpZy9pcnFiYWxhbmNlIF0gJiYgWyAtZiAke1NZU1RFTV9DT05GSUdfQ1VTVE9NX0ZJTEV9IF0gICYmIHJwbS1vc3RyZWUgc3RhdHVzIC1iIHwgZ3JlcCAtcSAtZSAiJHtTWVNURU1fQ09ORklHX0ZJTEV9ICR7U1lTVEVNX0NPTkZJR19DVVNUT01fRklMRX0iICYmIGVncmVwIC13cSAiXklSUUJBTEFOQ0VfQkFOTkVEX0NQVVM9JHtSRVNFUlZFRF9DUFVfTUFTS19JTlZFUlR9IiAvZXRjL3N5c2NvbmZpZy9pcnFiYWxhbmNlOyB0aGVuCiAgICBlY2hvICJQcmUgYm9vdCB0dW5pbmcgY29uZmlndXJhdGlvbiBhbHJlYWR5IGFwcGxpZWQiCmVsc2UKICAgICNTZXQgSVJRIGJhbGFuY2UgYmFubmVkIGNwdXMKICAgIGlmIFsgISAtZiAvZXRjL3N5c2NvbmZpZy9pcnFiYWxhbmNlIF07IHRoZW4KICAgICAgICB0b3VjaCAvZXRjL3N5c2NvbmZpZy9pcnFiYWxhbmNlCiAgICBmaQoKICAgIGlmIGdyZXAgLWxzICJJUlFCQUxBTkNFX0JBTk5FRF9DUFVTPSIgL2V0Yy9zeXNjb25maWcvaXJxYmFsYW5jZTsgdGhlbgogICAgICAgIHNlZCAtaSAicy9eLipJUlFCQUxBTkNFX0JBTk5FRF9DUFVTPS4qJC9JUlFCQUxBTkNFX0JBTk5FRF9DUFVTPSR7UkVTRVJWRURfQ1BVX01BU0tfSU5WRVJUfS8iIC9ldGMvc3lzY29uZmlnL2lycWJhbGFuY2UKICAgIGVsc2UKICAgICAgICBlY2hvICJJUlFCQUxBTkNFX0JBTk5FRF9DUFVTPSR7UkVTRVJWRURfQ1BVX01BU0tfSU5WRVJUfSIgPj4vZXRjL3N5c2NvbmZpZy9pcnFiYWxhbmNlCiAgICBmaQoKICAgIHJwbS1vc3RyZWUgaW5pdHJhbWZzIC0tZW5hYmxlIC0tYXJnPS1JIC0tYXJnPSIke1NZU1RFTV9DT05GSUdfRklMRX0gJHtTWVNURU1fQ09ORklHX0NVU1RPTV9GSUxFfSIgCgogICAgdG91Y2ggL3Zhci9yZWJvb3QKZmkK
            verification: {}
          filesystem: root
          mode: 448
          path: /usr/local/bin/pre-boot-tuning.sh
        - contents:
            source: >-
              data:text/plain;charset=utf-8;base64,IyEvdXNyL2Jpbi9lbnYgYmFzaAoKc2V0IC1ldW8gcGlwZWZhaWwKCm5vZGVzX3BhdGg9Ii9zeXMvZGV2aWNlcy9zeXN0ZW0vbm9kZSIKaHVnZXBhZ2VzX2ZpbGU9IiR7bm9kZXNfcGF0aH0vbm9kZSR7TlVNQV9OT0RFfS9odWdlcGFnZXMvaHVnZXBhZ2VzLSR7SFVHRVBBR0VTX1NJWkV9a0IvbnJfaHVnZXBhZ2VzIgoKaWYgWyAhIC1mICAke2h1Z2VwYWdlc19maWxlfSBdOyB0aGVuCiAgICBlY2hvICJFUlJPUjogJHtodWdlcGFnZXNfZmlsZX0gZG9lcyBub3QgZXhpc3QiCiAgICBleGl0IDEKZmkKCmVjaG8gJHtIVUdFUEFHRVNfQ09VTlR9ID4gJHtodWdlcGFnZXNfZmlsZX0K
            verification: {}
          filesystem: root
          mode: 448
          path: /usr/local/bin/hugepages-allocation.sh
        - contents:
            source: >-
              data:text/plain;charset=utf-8;base64,IyEvdXNyL2Jpbi9lbnYgYmFzaAoKc2V0IC1ldW8gcGlwZWZhaWwKCmlmIFtbIC1mIC92YXIvcmVib290IF1dOyB0aGVuIAogICAgcm0gLWYgL3Zhci9yZWJvb3QKICAgIGVjaG8gIkZpbGUgL3Zhci9yZWJvb3QgZXhpc3RzLCBpbml0aWF0ZSByZWJvb3QiCiAgICBzeXN0ZW1jdGwgcmVib290CmZpCg==
            verification: {}
          filesystem: root
          mode: 448
          path: /usr/local/bin/reboot.sh
        - contents:
            source: >-
              data:text/plain;charset=utf-8;base64,W01hbmFnZXJdCkNQVUFmZmluaXR5PTA=
            verification: {}
          filesystem: root
          mode: 448
          path: /etc/systemd/system.conf.d/setAffinity.conf
    systemd:
      units:
        - contents: |
            [Unit]
            Description=Preboot tuning patch
            Before=kubelet.service
            Before=reboot.service
            [Service]
            Environment=RESERVED_CPUS=0
            Environment=RESERVED_CPU_MASK_INVERT=ffffffff,fffffffe
            Type=oneshot
            RemainAfterExit=true
            ExecStart=/usr/local/bin/pre-boot-tuning.sh
            [Install]
            WantedBy=multi-user.target
          enabled: true
          name: pre-boot-tuning.service
        - contents: |
            [Unit]
            Description=Reboot initiated by pre-boot-tuning
            Wants=network-online.target
            After=network-online.target
            Before=kubelet.service
            [Service]
            Type=oneshot
            RemainAfterExit=true
            ExecStart=/usr/local/bin/reboot.sh
            [Install]
            WantedBy=multi-user.target
          enabled: true
          name: reboot.service
        - contents: |
            [Unit]
            Description=Hugepages-1048576kB allocation on the node 1
            Before=kubelet.service
            [Service]
            Environment=HUGEPAGES_COUNT=12
            Environment=HUGEPAGES_SIZE=1048576
            Environment=NUMA_NODE=1
            Type=oneshot
            RemainAfterExit=true
            ExecStart=/usr/local/bin/hugepages-allocation.sh
            [Install]
            WantedBy=multi-user.target
          enabled: true
          name: hugepages-allocation-1048576kB-NUMA1.service
  fips: false
  kernelArguments:
    - nohz=on
    - nosoftlockup
    - skew_tick=1
    - intel_pstate=disable
    - intel_iommu=on
    - iommu=pt
    - rcu_nocbs=3-5
    - tuned.non_isolcpus=00000001
    - default_hugepagesz=1G
    - nmi_watchdog=0
    - audit=0
    - mce=off
    - processor.max_cstate=1
    - idle=poll
    - intel_idle.max_cstate=0
  kernelType: realtime
  osImageURL: ''
1. HugePages

As you can see, some scripts are created to configure tuned and HugePages (the Performance Addon Operator takes a different approach than we did in the first Post to configure HugePages: it uses systemd to call a script that reserves the pages).
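If you want to confirm on the node that those systemd units actually ran (using “oc debug” or the node’s “Terminal” tab), something like this should do, the unit names being the ones defined in the MachineConfig above:

sh-4.4# systemctl status pre-boot-tuning.service hugepages-allocation-1048576kB-NUMA1.service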

Let’s check the HugePages configuration in the node:

cat /proc/meminfo | grep -i "Total\|Huge"

That gives us the expected result:

MemTotal:       37060016 kB
SwapTotal: 0 kB
VmallocTotal: 34359738367 kB
HugePages_Total: 12
HugePages_Free: 12
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 1048576 kB
Hugetlb: 12582912 kB
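Since we requested the pages only on NUMA node 1, you can also check the per-NUMA allocation directly in sysfs (this is the same path used by the hugepages-allocation.sh script shown above, and it should return 12 in this setup):

sh-4.4# cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages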

In the Web Console:

2. KernelArgs

Now let’s review the kernel arguments:

sh-4.2# cat /proc/cmdline 
BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-cd503dcacae13e148b67adcfd912a898c45e10031af676d44463c03a8dfa662e/vmlinuz-4.18.0-193.9.1.rt13.60.el8_2.x86_64 rhcos.root=crypt_rootfs console=tty0 console=ttyS0,115200n8 ignition.platform.id=metal rd.luks.options=discard ostree=/ostree/boot.0/rhcos/cd503dcacae13e148b67adcfd912a898c45e10031af676d44463c03a8dfa662e/0 nohz=on nosoftlockup skew_tick=1 intel_pstate=disable intel_iommu=on iommu=pt rcu_nocbs=3-5 tuned.non_isolcpus=00000001 default_hugepagesz=1G nmi_watchdog=0 audit=0 mce=off processor.max_cstate=1 idle=poll intel_idle.max_cstate=0

You can find there all the arguments that we configured in the “PerformanceProfile”, plus others that the Operator included to configure HugePages and CPU isolation.

3. CPU pinning

Let’s double-check that the operator changed the values in the kubelet config in the node (using “oc debug” or the node’s “Terminal” tab in the Web console):

sh-4.4# cat /etc/kubernetes/kubelet.conf | grep -i cpumanager
"cpuManagerPolicy": "static",
"cpuManagerReconcilePeriod": "5s",

4. NUMA Awareness

If you remember from the NUMA Awareness Post, we have to enable the “LatencySensitive” feature set. You can check that it has been added by the operator.
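Assuming the operator enables it through the cluster FeatureGate object (as we did manually in the NUMA Awareness Post), this command should return “LatencySensitive”:

$ oc get featuregate cluster -o jsonpath='{.spec.featureSet}'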

We can also check if the topologyManagerPolicy has been set in the kubelet config in the node (using “oc debug” or the node’s “Terminal” tab in the Web console):

sh-4.4# cat /etc/kubernetes/kubelet.conf | grep -i topology
"topologyManagerPolicy": "best-effort",
"TopologyManager": true

We got a bad surprise here: the topologyManagerPolicy is set to “best-effort” instead of what we configured, “single-numa-node”. Why? Well, if we compare the options of the Performance Addon Operator release 4.4 against the options of release 4.5, we can see the difference: the “numa” parameter is not ready in 4.4, so the operator won’t make this configuration effective. And if you review it, when we installed the operator we had to select the release 4.4, since version 4.5 is not yet available in the Operator Hub for the OpenShift 4.5 release candidate 4 version… remember that we are testing a Developer Preview on an OpenShift release candidate version, these things happen :-).

We could manually install release 4.5 of the Performance Addon Operator to test the NUMA configuration… but this time we’ll stay confident that this will be ready once the 4.5 channel is available for the Performance Addon Operator.

5. Real-Time Kernel

We can compare the Kernel running on the Master nodes (which have not been changed) with the one running on the Workers.

This is the Kernel on Masters:

$ oc debug node/<node>
sh-4.2# uname -a
Linux master0.ocp.136.243.40.222.nip.io 4.18.0-147.8.1.el8_1.x86_64 #1 SMP Wed Feb 26 03:08:15 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

and now check the differences with the Kernel running on the Workers:

$ oc debug node/<node>
sh-4.2# uname -a
Linux worker1.ocp.136.243.40.222.nip.io 4.18.0-147.8.1.rt24.101.el8_1.x86_64 #1 SMP PREEMPT RT Wed Feb 26 16:43:18

You can see the “PREEMPT RT” flag and the “.rt” in the name, which denotes that this is a Real-Time Kernel.

6. Operating System tuning

We can also check some of the Operating System tuning that has been done by the Operator, for example (these are not all customizations done by the Operator, only a subset of them):

sh-4.4#  cat /etc/systemd/system.conf.d/setAffinity.conf
[Manager]
CPUAffinity=0
sh-4.4# cat /etc/sysconfig/irqbalance
# IRQBALANCE_BANNED_CPUS
IRQBALANCE_BANNED_CPUS=ffffffff,fffffffe
sh-4.4# cat /sys/devices/virtual/workqueue/cpumask
1
sh-4.4# cat /sys/bus/workqueue/devices/writeback/cpumask
1
sh-4.4# cat /etc/kubernetes/kubelet.conf
...
...
"systemReserved": {
"cpu": "1000m",
"ephemeral-storage": "1Gi",
"memory": "500Mi"
},
"kubeReserved": {
"cpu": "1000m",
"memory": "500Mi"
},
"reservedSystemCPUs": "0"

Now let’s create a test POD:

apiVersion: v1
kind: Pod
metadata:
  name: perf-test-app
spec:
  containers:
  - name: perf-test
    image: openshift/origin-cli
    command: ["/bin/sh"]
    args: ["-c", "while true; do echo hello; sleep 10;done"]
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]
    volumeMounts:
    - mountPath: /dev/hugepages
      name: hugepage
    resources:
      limits:
        memory: "1Gi"
        cpu: "2"
        hugepages-1Gi: "4Gi"
      requests:
        memory: "1Gi"
        cpu: "2"
        hugepages-1Gi: "4Gi"
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages

We can check the allowed CPUs list as we did in the CPU pinning Post:

First, you can be sure that the POD is not running in the burstable slice by running the “systemd-cgls” command on the node:
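Another quick check from the CLI: since the Pod defines equal requests and limits, it should get the “Guaranteed” QoS class, which is why it does not end up in the burstable slice:

$ oc get pod perf-test-app -o jsonpath='{.status.qosClass}'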

Now let’s check which CPUs are allowed for this POD reviewing the cpuset.cpus file from inside the POD:

$ oc exec perf-test-app -- cat /sys/fs/cgroup/cpuset/cpuset.cpus
3-4

Those are the CPUs that we selected as “isolated” and that have the highest preference while running their threads.

If you want to measure latencies (to check how these tunings, along with the Real-Time Kernel, improve the results), you can use this image, which will show the minimum, maximum, and average latencies; just remember to adjust the time that it will be running.

apiVersion: v1
kind: Pod
metadata:
  name: cyclictest
spec:
  restartPolicy: Never
  containers:
  - name: container-perf-tools
    image: quay.io/jianzzha/perf-tools
    resources:
      limits:
        memory: "200Mi"
        cpu: "2"
    env:
    - name: tool
      value: "cyclictest"
    - name: DURATION
      value: "60s"
    - name: DISABLE_CPU_BALANCE
      value: "y"
    # DISABLE_CPU_BALANCE requires privileged=true
    securityContext:
      privileged: true
    volumeMounts:
    - mountPath: /dev/cpu_dma_latency
      name: cstate
  volumes:
  - name: cstate
    hostPath:
      path: /dev/cpu_dma_latency
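Once the Pod finishes (after the configured DURATION), the cyclictest summary with the minimum, average, and maximum latencies can be read from the Pod logs:

$ oc logs -f cyclictest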

And that’s it, I hope that you find this new Performance Addon Operator as interesting as I do. I’m looking forward to having it GA (or even Tech Preview) and starting to use it in my deployments.

Check out the whole EPA worker node in OpenShift series:
